Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Massachusetts Institute of Technology, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Moshe Y. Vardi Rice University, Houston, TX, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
3971
Jun Wang Zhang Yi Jacek M. Zurada Bao-Liang Lu Hujun Yin (Eds.)
Advances in Neural Networks – ISNN 2006 Third International Symposium on Neural Networks Chengdu, China, May 28 – June 1, 2006 Proceedings, Part I
Volume Editors

Jun Wang
The Chinese University of Hong Kong
Dept. of Automation and Computer-Aided Engineering
Shatin, New Territories, Hong Kong
E-mail: [email protected]

Zhang Yi
University of Electronic Science and Technology of China
School of Computer Science and Engineering
Chengdu, Sichuan, China
E-mail: [email protected]

Jacek M. Zurada
University of Louisville, Dept. of Electrical and Computer Engineering
Louisville, Kentucky, USA
E-mail: [email protected]

Bao-Liang Lu
Shanghai Jiao Tong University, Dept. of Computer Science and Engineering
Shanghai, China
E-mail: [email protected]

Hujun Yin
University of Manchester, School of Electrical and Electronic Engineering
Manchester M60 1QD, UK
E-mail: [email protected]

Library of Congress Control Number: 2006925897
CR Subject Classification (1998): F.1, F.2, D.1, G.2, I.2, C.2, I.4-5, J.1-4
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-540-34439-X Springer Berlin Heidelberg New York
ISBN-13 978-3-540-34439-1 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2006 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 11759966 06/3142 543210
Preface
This book and its sister volumes constitute the Proceedings of the Third International Symposium on Neural Networks (ISNN 2006), held in Chengdu in southwestern China during May 28 – June 1, 2006. Following the successful ISNN 2004 in Dalian and ISNN 2005 in Chongqing, ISNN has become a well-established series of conferences on neural computation in the region, with growing popularity and improving quality. ISNN 2006 received 2472 submissions from authors in 43 countries and regions (mainland China, Hong Kong, Macao, Taiwan, South Korea, Japan, Singapore, Thailand, Malaysia, India, Pakistan, Iran, Qatar, Turkey, Greece, Romania, Lithuania, Slovakia, Poland, Finland, Norway, Sweden, Denmark, Germany, France, Spain, Portugal, Belgium, Netherlands, UK, Ireland, Canada, USA, Mexico, Cuba, Venezuela, Brazil, Chile, Australia, New Zealand, South Africa, Nigeria, and Tunisia) across six continents (Asia, Europe, North America, South America, Africa, and Oceania). Based on rigorous reviews, 616 high-quality papers were selected for publication in the proceedings, an acceptance rate of less than 25%. The papers are organized in 27 cohesive sections covering all major topics of neural network research and development. In addition to the numerous contributed papers, ten distinguished scholars gave plenary speeches (Robert J. Marks II, Erkki Oja, Marios M. Polycarpou, Donald C. Wunsch II, Zongben Xu, and Bo Zhang) and tutorials (Walter J. Freeman, Derong Liu, Paul J. Werbos, and Jacek M. Zurada). ISNN 2006 provided an academic forum for the participants to disseminate their new research findings and discuss emerging areas of research. It also created a stimulating environment for the participants to interact and exchange information on future challenges and opportunities of neural network research. Many volunteers and organizations made great contributions to ISNN 2006.
The organizers are grateful to the University of Electronic Science and Technology of China and the Chinese University of Hong Kong for their sponsorship; to the National Natural Science Foundation of China and the K.C. Wong Education Foundation of Hong Kong for their financial support; and to the Asia Pacific Neural Network Assembly, the European Neural Network Society, the IEEE Computational Intelligence Society, the IEEE Circuits and Systems Society, and the International Neural Network Society for their technical cosponsorship. The organizers would like to thank the members of the Advisory Committee for their support, the members of the International Program Committee for reviewing the papers, and the members of the Publications Committee for checking the accepted papers in a short period of time. In particular, the organizers would like to thank the publisher, Springer, for publishing the proceedings in the prestigious series of
Lecture Notes in Computer Science. Last but not least, the organizers would like to thank all the speakers and authors for their active participation at ISNN 2006, which was essential to the success of the symposium. May 2006
Jun Wang Zhang Yi Jacek M. Zurada Bao-Liang Lu Hujun Yin
ISNN 2006 Organization
ISNN 2006 was organized and sponsored by the University of Electronic Science and Technology of China and the Chinese University of Hong Kong. It was technically cosponsored by the Asia Pacific Neural Network Assembly, European Neural Network Society, IEEE Circuits and Systems Society, IEEE Computational Intelligence Society, and International Neural Network Society. It was financially supported by the National Natural Science Foundation of China and K.C. Wong Education Foundation of Hong Kong.

General Chair: Jun Wang, Hong Kong
General Co-chairs: Zhang Yi, Chengdu, China; Jacek M. Zurada, Louisville, USA

Advisory Committee
Shun-ichi Amari, Tokyo, Japan (Chair)
Hojjat Adeli, Columbus, USA
Guoliang Chen, Hefei, China
Chunbo Feng, Nanjing, China
Kunihiko Fukushima, Tokyo, Japan
Okyay Kaynak, Istanbul, Turkey
Yanda Li, Beijing, China
Erkki Oja, Helsinki, Finland
Marios M. Polycarpou, Nicosia, Cyprus
Shoujue Wang, Beijing, China
Youlun Xiong, Wuhan, China
Shuzi Yang, Wuhan, China
Siying Zhang, Qingdao, China
Shoubin Zou, Chengdu, China
Walter J. Freeman, Berkeley, USA (Co-chair)
Zheng Bao, Xi'an, China
Ruwei Dai, Beijing, China
Toshio Fukuda, Nagoya, Japan
Zhenya He, Nanjing, China
Frank L. Lewis, Fort Worth, USA
Ruqian Lu, Beijing, China
Nikhil R. Pal, Calcutta, India
Tzyh-Jong Tarn, St. Louis, USA
Paul J. Werbos, Washington, USA
Lei Xu, Hong Kong
Bo Zhang, Beijing, China
Nanning Zheng, Xi'an, China
Steering Committee
Zongben Xu, Xi'an, China (Chair)
Tianping Chen, Shanghai, China
Wlodzislaw Duch, Torun, Poland
Xiaoxin Liao, Wuhan, China
Zhiyong Liu, Beijing, China
Zhengqi Sun, Beijing, China
Donald C. Wunsch II, Rolla, USA
Fuliang Yin, Dalian, China
Liming Zhang, Shanghai, China
Mingtian Zhou, Chengdu, China
Houjun Wang, Chengdu, China (Co-chair)
Andrzej Cichocki, Tokyo, Japan
Anthony Kuh, Honolulu, USA
Derong Liu, Chicago, USA
Leszek Rutkowski, Czestochowa, Poland
DeLiang Wang, Columbus, USA
Gary G. Yen, Stillwater, USA
Juebang Yu, Chengdu, China
Chunguang Zhou, Changchun, China
Program Committee
Bao-Liang Lu, Shanghai, China (Chair)
Shigeo Abe, Kobe, Japan
Khurshid Ahmad, Surrey, UK
A. Bouzerdoum, Wollongong, Australia
Jinde Cao, Nanjing, China
Matthew Casey, Surrey, UK
Luonan Chen, Osaka, Japan
Yen-Wei Chen, Kyoto, Japan
Yuehui Chen, Jinan, China
Yiu Ming Cheung, Hong Kong
Sungzoon Cho, Seoul, Korea
Emilio Corchado, Burgos, Spain
Shuxue Ding, Fukushima, Japan
Meng Joo Er, Singapore
Mauro Forti, Siena, Italy
Marcus Gallagher, Brisbane, Australia
Chengling Gou, Beijing, China
Lei Guo, Nanjing, China
Min Han, Dalian, China
Zhifeng Hao, Guangzhou, China
Zengguang Hou, Beijing, China
Jinglu Hu, Fukuoka, Japan
Guangbin Huang, Singapore
Marc van Hulle, Leuven, Belgium
Danchi Jiang, Hobart, Australia
Hoon Kang, Seoul, Korea
Samuel Kaski, Helsinki, Finland
Tai-hoon Kim, Seoul, Korea
Yean-Der Kuan, Taipei, Taiwan
James Lam, Hong Kong
Xiaoli Li, Birmingham, UK
Yuanqing Li, Singapore
Xun Liang, Beijing, China
Lizhi Liao, Hong Kong
Fei Liu, Wuxi, China
Ju Liu, Jinan, China
Hongtao Lu, Shanghai, China
Fa-Long Luo, San Jose, USA
Jinwen Ma, Beijing, China
Stanislaw Osowski, Warsaw, Poland
Ikuko Nishikawa, Kyoto, Japan
Paul S. Pang, Auckland, New Zealand
Yi Shen, Wuhan, China
Michael Small, Hong Kong
Ponnuthurai N. Suganthan, Singapore
Fuchun Sun, Beijing, China
Hujun Yin, Manchester, UK (Co-chair)
Ajith Abraham, Seoul, South Korea
Sabri Arik, Istanbul, Turkey
Jianting Cao, Saitama, Japan
Wenming Cao, Hangzhou, China
Liang Chen, Prince George, Canada
Songcan Chen, Nanjing, China
Xue-wen Chen, Kansas, USA
Xiaochun Cheng, Berkshire, UK
Zheru Chi, Hong Kong
Jin-Young Choi, Seoul, Korea
Chuanyin Dang, Hong Kong
Tom Downs, Brisbane, Australia
Shumin Fei, Nanjing, China
Wai Keung Fung, Winnipeg, Canada
John Qiang Gan, Essex, UK
Chengan Guo, Dalian, China
Ping Guo, Beijing, China
Qing-Long Han, Rockhampton, Australia
Daniel W.C. Ho, Hong Kong
Dewen Hu, Changsha, China
Sanqing Hu, Chicago, USA
Shunan Huang, Singapore
Malik Magdon Ismail, Troy, USA
Joarder Kamruzzaman, Melbourne, Australia
Nikola Kasabov, Auckland, New Zealand
Tae Seon Kim, Seoul, Korea
Hon Keung Kwan, Windsor, Canada
James Kwok, Hong Kong
Shaowen Li, Chengdu, China
Yangmin Li, Macao
Hualou Liang, Houston, USA
Yanchun Liang, Changchun, China
Meng-Hiot Lim, Singapore
Guoping Liu, Treforest, UK
Meiqin Liu, Hangzhou, China
Wenlian Lu, Leipzig, Germany
Zhiwei Luo, Nagoya, Japan
Qing Ma, Kyoto, Japan
Zhiqing Meng, Hangzhou, China
Seiichi Ozawa, Kobe, Japan
Jagath C. Rajapakse, Singapore
Daming Shi, Singapore
Jochen J. Steil, Bielefeld, Germany
Changyin Sun, Nanjing, China
Norikazu Takahashi, Fukuoka, Japan
Yu Tang, Mexico City, Mexico
Christos Tjortjis, Manchester, UK
Michel Verleysen, Louvain, Belgium
Dan Wang, Singapore
Si Wu, Brighton, UK
Cheng Xiang, Singapore
Simon X. Yang, Guelph, Canada
Yingjie Yang, Leicester, UK
Dingli Yu, Liverpool, UK
Gerson Zaverucha, Rio de Janeiro, Brazil
Huaguang Zhang, Shenyang, China
Liqing Zhang, Shanghai, China
Tao Zhang, Tianjin, China
Yanqing Zhang, Atlanta, USA
Jin Zhou, Shanghai, China

Organizing Committee
Yue Wu (Chair), Chengdu, China
Pu Sun, Ann Arbor, USA
Ying Tan, Hefei, China
Peter Tino, Birmingham, UK
Dan Ventura, Provo, USA
Bing Wang, Hull, UK
Kesheng Wang, Trondheim, Norway
Wei Wu, Dalian, China
Daoyi Xu, Chengdu, China
Xiaosong Yang, Wuhan, China
Zhengrong Yang, Exeter, UK
Wen Yu, Mexico City, Mexico
Zhigang Zeng, Hefei, China
Jie Zhang, Newcastle, UK
Qingfu Zhang, Essex, UK
Ya Zhang, Kansas, USA
Yunong Zhang, Maynooth, Ireland
Xiaofeng Liao (Co-chair), Chongqing, China
Publications Committee
Chuandong Li (Chair), Chongqing, China
Mao Ye (Co-chair), Chengdu, China
Jianwei Zhang (Co-chair), Hamburg, Germany

Publicity Committee
Bin Jiang (Chair), Chengdu, China
Zeng-Guang Hou (Co-chair), Beijing, China
Jennie Si (Co-chair), Tempe, USA
Registration Committee
Xiaorong Pu (Chair), Chengdu, China

Local Arrangements Committee
Hongli Zhang (Chair), Chengdu, China

Secretariats
Jiancheng Lv, Chengdu, China
Tao Xiang, Chongqing, China
Table of Contents – Part I
Neurobiological Analysis

The Ideal Noisy Environment for Fast Neural Computation
Si Wu, Jianfeng Feng, Shun-ichi Amari . . . . . 1

How Does a Neuron Perform Subtraction? – Arithmetic Rules of Synaptic Integration of Excitation and Inhibition
Xu-Dong Wang, Jiang Hao, Mu-Ming Poo, Xiao-Hui Zhang . . . . . 7

Stochastic Resonance Enhancing Detectability of Weak Signal by Neuronal Networks Model for Receiver
Jun Liu, Jian Wu, Zhengguo Lou, Guang Li . . . . . 15

A Gaussian Dynamic Convolution Models of the FMRI BOLD Response
Huafu Chen, Ling Zeng, Dezhong Yao, Qing Gao . . . . . 21

Cooperative Motor Learning Model for Cerebellar Control of Balance and Locomotion
Mingxiao Ding, Naigong Yu, Xiaogang Ruan . . . . . 27

A Model of Category Learning with Attention Augmented Simplistic Prototype Representation
Toshihiko Matsuka . . . . . 34

On the Learning Algorithms of Descriptive Models of High-Order Human Cognition
Toshihiko Matsuka, Arieta Chouchourelou . . . . . 41

A Neural Model on Cognitive Process
Ru-bin Wang, Jing Yu, Zhi-kang Zhang . . . . . 50

Theoretical Analysis

Approximation Bound of Mixture Networks in Lpω Spaces
Zongben Xu, Jianjun Wang, Deyu Meng . . . . . 60

Integral Transform and Its Application to Neural Network Approximation
Feng-jun Li, Zongben Xu . . . . . 66
The Essential Approximation Order for Neural Networks with Trigonometric Hidden Layer Units
Chunmei Ding, Feilong Cao, Zongben Xu . . . . . 72

Wavelets Based Neural Network for Function Approximation
Yong Fang, Tommy W.S. Chow . . . . . 80

Passivity Analysis of Dynamic Neural Networks with Different Time-Scales
Alejandro Cruz Sandoval, Wen Yu . . . . . 86

Exponential Dissipativity of Non-autonomous Neural Networks with Distributed Delays and Reaction-Diffusion Terms
Zhiguo Yang, Daoyi Xu, Yumei Huang . . . . . 93

Convergence Analysis of Continuous-Time Neural Networks
Min-Jae Kang, Ho-Chan Kim, Farrukh A. Khan, Wang-Cheol Song, Jacek M. Zurada . . . . . 100

Global Convergence of Continuous-Time Recurrent Neural Networks with Delays
Weirui Zhao, Huanshui Zhang . . . . . 109

Global Exponential Stability in Lagrange Sense of Continuous-Time Recurrent Neural Networks
Xiaoxin Liao, Zhigang Zeng . . . . . 115

Global Exponential Stability of Recurrent Neural Networks with Time-Varying Delay
Yi Shen, Meiqin Liu, Xiaodong Xu . . . . . 122

New Criteria of Global Exponential Stability for a Class of Generalized Neural Networks with Time-Varying Delays
Gang Wang, Hua-Guang Zhang, Chong-Hui Song . . . . . 129

Dynamics of General Neural Networks with Distributed Delays
Changyin Sun, Linfeng Li . . . . . 135

On Equilibrium and Stability of a Class of Neural Networks with Mixed Delays
Shuyong Li, Yumei Huang, Daoyi Xu . . . . . 141

Stability Analysis of Neutral Neural Networks with Time Delay
Hanlin He, Xiaoxin Liao . . . . . 147
Global Asymptotical Stability in Neutral-Type Delayed Neural Networks with Reaction-Diffusion Terms
Jianlong Qiu, Jinde Cao . . . . . 153

Almost Sure Exponential Stability on Interval Stochastic Neural Networks with Time-Varying Delays
Wudai Liao, Zhongsheng Wang, Xiaoxin Liao . . . . . 159

Stochastic Robust Stability of Markovian Jump Nonlinear Uncertain Neural Networks with Wiener Process
Xuyang Lou, Baotong Cui . . . . . 165

Stochastic Robust Stability Analysis for Markovian Jump Discrete-Time Delayed Neural Networks with Multiplicative Nonlinear Perturbations
Li Xie, Tianming Liu, Guodong Lu, Jilin Liu, Stephen T.C. Wong . . . . . 172

Global Robust Stability of General Recurrent Neural Networks with Time-Varying Delays
Jun Xu, Daoying Pi, Yong-Yan Cao . . . . . 179

Robust Periodicity in Recurrent Neural Network with Time Delays and Impulses
Yongqing Yang . . . . . 185

Global Asymptotical Stability of Cohen-Grossberg Neural Networks with Time-Varying and Distributed Delays
Tianping Chen, Wenlian Lu . . . . . 192

LMI Approach to Robust Stability Analysis of Cohen-Grossberg Neural Networks with Multiple Delays
Ce Ji, Hua-Guang Zhang, Chong-Hui Song . . . . . 198

Existence and Global Stability Analysis of Almost Periodic Solutions for Cohen-Grossberg Neural Networks
Tianping Chen, Lili Wang, Changlei Ren . . . . . 204

A New Sufficient Condition on the Complete Stability of a Class Cellular Neural Networks
Li-qun Zhou, Guang-da Hu . . . . . 211

Stability Analysis of Reaction-Diffusion Recurrent Cellular Neural Networks with Variable Time Delays
Weifan Zheng, Jiye Zhang, Weihua Zhang . . . . . 217
Exponential Stability of Delayed Stochastic Cellular Neural Networks
Wudai Liao, Yulin Xu, Xiaoxin Liao . . . . . 224

Global Exponential Stability of Cellular Neural Networks with Time-Varying Delays and Impulses
Chaojin Fu, Boshan Chen . . . . . 230

Global Exponential Stability of Fuzzy Cellular Neural Networks with Variable Delays
Jiye Zhang, Dianbo Ren, Weihua Zhang . . . . . 236

Stability of Fuzzy Cellular Neural Networks with Impulses
Tingwen Huang, Marco Roque-Sol . . . . . 243

Absolute Stability of Hopfield Neural Network
Xiaoxin Liao, Fei Xu, Pei Yu . . . . . 249

Robust Stability Analysis of Uncertain Hopfield Neural Networks with Markov Switching
Bingji Xu, Qun Wang . . . . . 255

Asymptotic Stability of Second-Order Discrete-Time Hopfield Neural Networks with Variable Delays
Wei Zhu, Daoyi Xu . . . . . 261

Convergence Analysis of Discrete Delayed Hopfield Neural Networks
Sheng-Rui Zhang, Run-Nian Ma . . . . . 267

An LMI-Based Approach to the Global Stability of Bidirectional Associative Memory Neural Networks with Variable Delay
Minghui Jiang, Yi Shen, Xiaoxin Liao . . . . . 273

Existence of Periodic Solution of BAM Neural Network with Delay and Impulse
Hui Wang, Xiaofeng Liao, Chuandong Li, Degang Yang . . . . . 279

On Control of Hopf Bifurcation in BAM Neural Network with Delayed Self-feedback
Min Xiao, Jinde Cao . . . . . 285

Convergence and Periodicity of Solutions for a Class of Discrete-Time Recurrent Neural Network with Two Neurons
Hong Qu, Zhang Yi . . . . . 291
Existence and Global Attractability of Almost Periodic Solution for Competitive Neural Networks with Time-Varying Delays and Different Time Scales
Wentong Liao, Linshan Wang . . . . . 297

Global Synchronization of Impulsive Coupled Delayed Neural Networks
Jin Zhou, Tianping Chen, Lan Xiang, Meichun Liu . . . . . 303

Synchronization of a Class of Coupled Discrete Recurrent Neural Networks with Time Delay
Ping Li . . . . . 309

Chaos and Bifurcation in a New Class of Simple Hopfield Neural Network
Yan Huang, Xiao-Song Yang . . . . . 316

Synchronization of Chaotic System with the Perturbation Via Orthogonal Function Neural Network
Hongwei Wang, Hong Gu . . . . . 322

Numerical Analysis of a Chaotic Delay Recurrent Neural Network with Four Neurons
Haigeng Luo, Xiaodong Xu, Xiaoxin Liao . . . . . 328

Autapse Modulated Bursting
Guang-Hong Wang, Ping Jiang . . . . . 334

Neurodynamic Optimization

A Neural Network Model for Non-smooth Optimization over a Compact Convex Subset
Guocheng Li, Shiji Song, Cheng Wu, Zifang Du . . . . . 344

Differential Inclusions-Based Neural Networks for Nonsmooth Convex Optimization on a Closed Convex Subset
Shiji Song, Guocheng Li, Xiaohong Guan . . . . . 350

A Recurrent Neural Network for Linear Fractional Programming with Bound Constraints
Fuye Feng, Yong Xia, Quanju Zhang . . . . . 359

A Delayed Lagrangian Network for Solving Quadratic Programming Problems with Equality Constraints
Qingshan Liu, Jun Wang, Jinde Cao . . . . . 369
Wavelet Chaotic Neural Networks and Their Application to Optimization Problems
Yao-qun Xu, Ming Sun, Guangren Duan . . . . . 379

A New Optimization Algorithm Based on Ant Colony System with Density Control Strategy
Ling Qin, Yixin Chen, Ling Chen, Yuan Yao . . . . . 385

A New Neural Network Approach to the Traveling Salesman Problem
Paulo Henrique Siqueira, Sérgio Scheer, Maria Teresinha Arns Steiner . . . . . 391

Dynamical System for Computing Largest Generalized Eigenvalue
Lijun Liu, Wei Wu . . . . . 399

A Concise Functional Neural Network for Computing the Extremum Eigenpairs of Real Symmetric Matrices
Yiguang Liu, Zhisheng You . . . . . 405

Learning Algorithms

A Novel Stochastic Learning Rule for Neural Networks
Frank Emmert-Streib . . . . . 414

Learning with Single Quadratic Integrate-and-Fire Neuron
Deepak Mishra, Abhishek Yadav, Prem K. Kalra . . . . . 424

Manifold Learning of Vector Fields
Hongyu Li, I-Fan Shen . . . . . 430

Similarity Measure for Vector Field Learning
Hongyu Li, I-Fan Shen . . . . . 436

The Mahalanobis Distance Based Rival Penalized Competitive Learning Algorithm
Jinwen Ma, Bin Cao . . . . . 442

Dynamic Competitive Learning
Seongwon Cho, Jaemin Kim, Sun-Tae Chung . . . . . 448

Hyperbolic Quotient Feature Map for Competitive Learning Neural Networks
Jinwuk Seok, Seongwon Cho, Jaemin Kim . . . . . 456
A Gradient Entropy Regularized Likelihood Learning Algorithm on Gaussian Mixture with Automatic Model Selection
Zhiwu Lu, Jinwen Ma . . . . . 464

Self-organizing Neural Architecture for Reinforcement Learning
Ah-Hwee Tan . . . . . 470

On the Efficient Implementation Biologic Reinforcement Learning Using Eligibility Traces
SeungGwan Lee . . . . . 476

Combining Label Information and Neighborhood Graph for Semi-supervised Learning
Lianwei Zhao, Siwei Luo, Mei Tian, Chao Shao, Hongliang Ma . . . . . 482

A Cerebellar Feedback Error Learning Scheme Based on Kalman Estimator for Tracing in Dynamic System
Liang Liu, Naigong Yu, Mingxiao Ding, Xiaogang Ruan . . . . . 489

An Optimal Iterative Learning Scheme for Dynamic Neural Network Modelling
Lei Guo, Hong Wang . . . . . 496

Delayed Learning on Internal Memory Network and Organizing Internal States
Toshinori Deguchi, Naohiro Ishii . . . . . 502

A Novel Learning Algorithm for Feedforward Neural Networks
Huawei Chen, Fan Jin . . . . . 509

On H∞ Filtering in Feedforward Neural Networks Training and Pruning
He-Sheng Tang, Song-Tao Xue, Rong Chen . . . . . 515

A Node Pruning Algorithm Based on Optimal Brain Surgeon for Feedforward Neural Networks
Jinhua Xu, Daniel W.C. Ho . . . . . 524

A Fast Learning Algorithm Based on Layered Hessian Approximations and the Pseudoinverse
E.J. Teoh, C. Xiang, K. Cheng Tan . . . . . 530
A Modular Reduction Method for k-NN Algorithm with Self-recombination Learning
Hai Zhao, Bao-Liang Lu . . . . . 537

Selective Neural Network Ensemble Based on Clustering
Haixia Chen, Senmiao Yuan, Kai Jiang . . . . . 545

An Individual Adaptive Gain Parameter Backpropagation Algorithm for Complex-Valued Neural Networks
Songsong Li, Toshimi Okada, Xiaoming Chen, Zheng Tang . . . . . 551

Training Cellular Neural Networks with Stable Learning Algorithm
Marco A. Moreno-Armendariz, Giovanni Egidio Pazienza, Wen Yu . . . . . 558

A New Stochastic PSO Technique for Neural Network Training
Yangmin Li, Xin Chen . . . . . 564

A Multi-population Cooperative Particle Swarm Optimizer for Neural Network Training
Ben Niu, Yun-Long Zhu, Xiao-Xian He . . . . . 570

Training RBF Neural Network with Hybrid Particle Swarm Optimization
Haichang Gao, Boqin Feng, Yun Hou, Li Zhu . . . . . 577

Robust Learning by Self-organization of Nonlinear Lines of Attractions
Ming-Jung Seow, Vijayan K. Asari . . . . . 584

Improved Learning Algorithm Based on Generalized SOM for Dynamic Non-linear System
Kai Zhang, Gen-Zhi Guan, Fang-Fang Chen, Lin Zhang, Zhi-Ye Du . . . . . 590

Q-Learning with FCMAC in Multi-agent Cooperation
Kao-Shing Hwang, Yu-Jen Chen, Tzung-Feng Lin . . . . . 599

Q Learning Based on Self-organizing Fuzzy Radial Basis Function Network
Xuesong Wang, Yuhu Cheng, Wei Sun . . . . . 607

A Fuzzy Neural Networks with Structure Learning
Haisheng Lin, Xiao Zhi Gao, Xianlin Huang, Zhuoyue Song . . . . . 616
Reinforcement Learning-Based Tuning Algorithm Applied to Fuzzy Identification
Mariela Cerrada, Jose Aguilar, André Titli . . . . . 623

A New Learning Algorithm for Function Approximation Incorporating A Priori Information into Extreme Learning Machine
Fei Han, Tat-Ming Lok, Michael R. Lyu . . . . . 631

Robust Recursive Complex Extreme Learning Machine Algorithm for Finite Numerical Precision
Junseok Lim, Koeng Mo Sung, Joonil Song . . . . . 637

Evolutionary Extreme Learning Machine – Based on Particle Swarm Optimization
You Xu, Yang Shu . . . . . 644

A Gradient-Based ELM Algorithm in Regressing Multi-variable Functions
You Xu . . . . . 653

A New Genetic Approach to Structure Learning of Bayesian Networks
Jaehun Lee, Wooyong Chung, Euntai Kim . . . . . 659
Model Design

Research on Multi-Degree-of-Freedom Neurons with Weighted Graphs
Shoujue Wang, Singsing Liu, Wenming Cao . . . . . 669

Output PDF Shaping of Singular Weights System: Monotonical Performance Design
Hong Yue, Aurelie J.A. Leprand, Hong Wang . . . . . 676

Stochastic Time-Varying Competitive Neural Network Systems
Yi Shen, Meiqin Liu, Xiaodong Xu . . . . . 683

Heterogeneous Centroid Neural Networks
Dong-Chul Park, Duc-Hoai Nguyen, Song-Jae Lee, Yunsik Lee . . . . . 689

Building Multi-layer Small World Neural Network
Shuzhong Yang, Siwei Luo, Jianyu Li . . . . . 695

Growing Hierarchical Principal Components Analysis Self-Organizing Map
Stones Lei Zhang, Zhang Yi, Jian Cheng Lv . . . . . 701
Hybrid Neural Network Model Based on Multi-layer Perceptron and Adaptive Resonance Theory
Andrey Gavrilov, Young-Koo Lee, Sungyoung Lee . . . . . 707

Evolving Neural Networks Using the Hybrid of Ant Colony Optimization and BP Algorithms
Yan-Peng Liu, Ming-Guang Wu, Ji-Xin Qian . . . . . 714

A Genetic Algorithm with Modified Tournament Selection and Efficient Deterministic Mutation for Evolving Neural Network
Dong-Sun Kim, Hyun-Sik Kim, Duck-Jin Chung . . . . . 723

A Neural Network Structure Evolution Algorithm Based on e, m Projections and Model Selection Criterion
Yunhui Liu, Siwei Luo, Ziang Lv, Hua Huang . . . . . 732

A Parallel Coevolutionary Immune Neural Network and Its Application to Signal Simulation
Zhu-Hong Zhang, Xin Tu, Chang-Gen Peng . . . . . 739

A Novel Elliptical Basis Function Neural Networks Optimized by Particle Swarm Optimization
Ji-Xiang Du, Chuan-Min Zhai, Zeng-Fu Wang, Guo-Jun Zhang . . . . . 747

Fuzzy Neural Network Optimization by a Particle Swarm Optimization Algorithm
Ming Ma, Li-Biao Zhang, Jie Ma, Chun-Guang Zhou . . . . . 752

Fuzzy Rule Extraction Using Robust Particle Swarm Optimization
Sumitra Mukhopadhyay, Ajit K. Mandal . . . . . 762

A New Design Methodology of Fuzzy Set-Based Polynomial Neural Networks with Symbolic Gene Type Genetic Algorithms
Seok-Beom Roh, Sung-Kwun Oh, Tae-Chon Ahn . . . . . 768

Design of Fuzzy Polynomial Neural Networks with the Aid of Genetic Fuzzy Granulation and Its Application to Multi-variable Process System
Sung-Kwun Oh, In-Tae Lee, Jeoung-Nae Choi . . . . . 774

A Novel Self-Organizing Fuzzy Polynomial Neural Networks with Evolutionary FPNs: Design and Analysis
Ho-Sung Park, Sung-Kwun Oh, Tae-Chon Ahn . . . . . 780

Design of Fuzzy Neural Networks Based on Genetic Fuzzy Granulation and Regression Polynomial Fuzzy Inference
Sung-Kwun Oh, Byoung-Jun Park, Witold Pedrycz . . . . . 786
Table of Contents – Part I
XXI
A New Fuzzy ART Neural Network Based on Dual Competition and Resonance Technique Lei Zhang, Guoyou Wang, Wentao Wang . . . . . . . . . . . . . . . . . . . . . . . . . 792 Simulated Annealing Based Learning Approach for the Design of Cascade Architectures of Fuzzy Neural Networks Chang-Wook Han, Jung-Il Park . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 798 A New Fuzzy Identification Method Based on Adaptive Critic Designs Huaguang Zhang, Yanhong Luo, Derong Liu . . . . . . . . . . . . . . . . . . . . . . 804 Impacts of Perturbations of Training Patterns on Two Fuzzy Associative Memories Based on T-Norms Wei-Hong Xu, Guo-Ping Chen, Zhong-Ke Xie . . . . . . . . . . . . . . . . . . . . . 810 Alpha-Beta Associative Memories for Gray Level Patterns Cornelio Yáñez-Márquez, Luis P. Sánchez-Fernández, Itzamá López-Yáñez . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818 Associative Memories Based on Discrete-Time Cellular Neural Networks with One-Dimensional Space-Invariant Templates Zhigang Zeng, Jun Wang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 824 Autonomous and Deterministic Probabilistic Neural Network Using Global k-Means Roy Kwang Yang Chang, Chu Kiong Loo, Machavaram V.C. Rao . . . . 830 Selecting Variables for Neural Network Committees Marija Bacauskiene, Vladas Cibulskis, Antanas Verikas . . . . . . . . . . . . . 837 An Adaptive Network Topology for Classification Qingyu Xiong, Jian Huang, Xiaodong Xian, Qian Xiao . . . . . . . . . . . . . 843 A Quantitative Comparison of Different MLP Activation Functions in Classification Emad A.M. Andrews Shenouda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 849 Estimating the Number of Hidden Neurons in a Feedforward Network Using the Singular Value Decomposition Eu Jin Teoh, Cheng Xiang, Kay Chen Tan . . . . . . . . . . . . . . . . . . . . . . . .
858 Neuron Selection for RBF Neural Network Classifier Based on Multiple Granularities Immune Network Jiang Zhong, Chun Xiao Ye, Yong Feng, Ying Zhou, Zhong Fu Wu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 866
Hierarchical Radial Basis Function Neural Networks for Classification Problems Yuehui Chen, Lizhi Peng, Ajith Abraham . . . . . . . . . . . . . . . . . . . . . . . . . 873 Biased Wavelet Neural Network and Its Application to Streamflow Forecast Fang Liu, Jian-Zhong Zhou, Fang-Peng Qiu, Jun-Jie Yang . . . . . . . . . . 880 A Goal Programming Based Approach for Hidden Targets in Layer-by-Layer Algorithm of Multilayer Perceptron Classifiers Yanlai Li, Kuanquan Wang, Tao Li . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 889 SLIT: Designing Complexity Penalty for Classification and Regression Trees Using the SRM Principle Zhou Yang, Wenjie Zhu, Liang Ji . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 895 Flexible Neural Tree for Pattern Recognition Hai-Jun Li, Zheng-Xuan Wang, Li-Min Wang, Sen-Miao Yuan . . . . . . 903 A Novel Model of Artificial Immune Network and Simulations on Its Dynamics Lei Wang, Yinling Nie, Weike Nie, Licheng Jiao . . . . . . . . . . . . . . . . . . 909
Kernel Methods A Kernel Optimization Method Based on the Localized Kernel Fisher Criterion Bo Chen, Hongwei Liu, Zheng Bao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 915 Genetic Granular Kernel Methods for Cyclooxygenase-2 Inhibitor Activity Comparison Bo Jin, Yan-Qing Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 922 Support Vector Machines with Beta-Mixing Input Sequences Luoqing Li, Chenggao Wan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 928 Least Squares Support Vector Machine on Gaussian Wavelet Kernel Function Set Fangfang Wu, Yinliang Zhao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 936 A Smoothing Multiple Support Vector Machine Model Huihong Jin, Zhiqing Meng, Xuanxi Ning . . . . . . . . . . . . . . . . . . . . . . . . . 942 Fuzzy Support Vector Machines Based on Spherical Regions Hong-Bing Liu, Sheng-Wu Xiong, Xiao-Xiao Niu . . . . . . . . . . . . . . . . . . 949
Building Support Vector Machine Alternative Using Algorithms of Computational Geometry Marek Bundzel, Tomáš Kasanický, Baltazár Frankovič . . . . . . . . . . . . . . 955 Cooperative Clustering for Training SVMs Shengfeng Tian, Shaomin Mu, Chuanhuan Yin . . . . . . . . . . . . . . . . . . . . 962 SVMV - A Novel Algorithm for the Visualization of SVM Classification Results Xiaohong Wang, Sitao Wu, Xiaoru Wang, Qunzhan Li . . . . . . . . . . . . . 968 Support Vector Machines Ensemble Based on Fuzzy Integral for Classification Genting Yan, Guangfu Ma, Liangkuan Zhu . . . . . . . . . . . . . . . . . . . . . . . . 974 An Adaptive Support Vector Machine Learning Algorithm for Large Classification Problem Shu Yu, Xiaowei Yang, Zhifeng Hao, Yanchun Liang . . . . . . . . . . . . . . . 981 SVDD-Based Method for Fast Training of Multi-class Support Vector Classifier Woo-Sung Kang, Ki Hong Im, Jin Young Choi . . . . . . . . . . . . . . . . . . . . 991 Binary Tree Support Vector Machine Based on Kernel Fisher Discriminant for Multi-classification Bo Liu, Xiaowei Yang, Zhifeng Hao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 997 A Fast and Sparse Implementation of Multiclass Kernel Perceptron Algorithm Jianhua Xu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1004 Mutual Conversion of Regression and Classification Based on Least Squares Support Vector Machines Jing-Qing Jiang, Chu-Yi Song, Chun-Guo Wu, Yang-Chun Liang, Xiao-Wei Yang, Zhi-Feng Hao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1010 Sparse Least Squares Support Vector Machine for Function Estimation Liang-zhi Gan, Hai-kuan Liu, You-xian Sun . . . . . . . . . . . . . . . . . . . . . . . 1016 A Multiresolution Wavelet Kernel for Support Vector Regression Feng-Qing Han, Da-Cheng Wang, Chuan-Dong Li, Xiao-Feng Liao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1022
Multi-scale Support Vector Machine for Regression Estimation Zhen Yang, Jun Guo, Weiran Xu, Xiangfei Nie, Jian Wang, Jianjun Lei . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1030 Gradient Based Fuzzy C-Means Algorithm with a Mercer Kernel Dong-Chul Park, Chung Nguyen Tran, Sancho Park . . . . . . . . . . . . . . . . 1038 An Efficient Similarity-Based Validity Index for Kernel Clustering Algorithm Yun-Wei Pu, Ming Zhu, Wei-Dong Jin, Lai-Zhao Hu . . . . . . . . . . . . . . . 1044 Fuzzy Support Vector Clustering En-Hui Zheng, Min Yang, Ping Li, Zhi-Huan Song . . . . . . . . . . . . . . . . . 1050 An SVM Classification Algorithm with Error Correction Ability Applied to Face Recognition Chengbo Wang, Chengan Guo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1057 A Boosting SVM Chain Learning for Visual Information Retrieval Zejian Yuan, Lei Yang, Yanyun Qu, Yuehu Liu, Xinchun Jia . . . . . . . . 1063 Nonlinear Estimation of Hyperspectral Mixture Pixel Proportion Based on Kernel Orthogonal Subspace Projection Bo Wu, Liangpei Zhang, Pingxiang Li, Jinmu Zhang . . . . . . . . . . . . . . . 1070 A New Proximal Support Vector Machine for Semi-supervised Classification Li Sun, Ling Jing, Xiaodong Xia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1076 Sparse Gaussian Processes Using Backward Elimination Liefeng Bo, Ling Wang, Licheng Jiao . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1083 Comparative Study of Extreme Learning Machine and Support Vector Machine Xun-Kai Wei, Ying-Hong Li, Yue Feng . . . . . . . . . . . . . . . . . . . . . . . . . . . 1089
ICA and BSS Multi-level Independent Component Analysis Woong Myung Kim, Chan Ho Park, Hyon Soo Lee . . . . . . . . . . . . . . . . . 1096 An ICA Learning Algorithm Utilizing Geodesic Approach Tao Yu, Huai-Zong Shao, Qi-Cong Peng . . . . . . . . . . . . . . . . . . . . . . . . . . 1103
An Extended Online Fast-ICA Algorithm Gang Wang, Ni-ni Rao, Zhi-lin Zhang, Quanyi Mo, Pu Wang . . . . . . . 1109 Gradient Algorithm for Nonnegative Independent Component Analysis Shangming Yang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1115 Unified Parametric and Non-parametric ICA Algorithm for Arbitrary Sources Fasong Wang, Hongwei Li, Rui Li, Shaoquan Yu . . . . . . . . . . . . . . . . . . 1121 A Novel Kurtosis-Dependent Parameterized Independent Component Analysis Algorithm Xiao-fei Shi, Ji-dong Suo, Chang Liu, Li Li . . . . . . . . . . . . . . . . . . . . . . . 1127 Local Stability Analysis of Maximum Nongaussianity Estimation in Independent Component Analysis Gang Wang, Xin Xu, Dewen Hu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1133 Convergence Analysis of a Discrete-Time Single-Unit Gradient ICA Algorithm Mao Ye, Xue Li, Chengfu Yang, Zengan Gao . . . . . . . . . . . . . . . . . . . . . . 1140 A Novel Algorithm for Blind Source Separation with Unknown Sources Number Ji-Min Ye, Shun-Tian Lou, Hai-Hong Jin, Xian-Da Zhang . . . . . . . . . . 1147 Blind Source Separation Based on Generalized Variance Gaoming Huang, Luxi Yang, Zhenya He . . . . . . . . . . . . . . . . . . . . . . . . . . 1153 Blind Source Separation with Pattern Expression NMF Junying Zhang, Zhang Hongyi, Le Wei, Yue Joseph Wang . . . . . . . . . . 1159 Nonlinear Blind Source Separation Using Hybrid Neural Networks Chun-Hou Zheng, Zhi-Kai Huang, Michael R. Lyu, Tat-Ming Lok . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1165 Identification of Mixing Matrix in Blind Source Separation Xiaolu Li, Zhaoshui He . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1171 Identification of Independent Components Based on Borel Measure for Under-Determined Mixtures Wenqiang Guo, Tianshuang Qiu, Yuzhang Zhao, Daifeng Zha . . . . . . . 1177
Estimation of Delays and Attenuations for Underdetermined BSS in Frequency Domain Ronghua Li, Ming Xiao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1183 Application of Blind Source Separation to Five-Element Cross Array Passive Location Gaoming Huang, Yang Gao, Luxi Yang, Zhenya He . . . . . . . . . . . . . . . . 1189 Convolutive Blind Separation of Non-white Broadband Signals Based on a Double-Iteration Method Hua Zhang, Dazhang Feng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1195 Multichannel Blind Deconvolution Using a Novel Filter Decomposition Method Bin Xia, Liqing Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1202 Two-Stage Blind Deconvolution for V-BLAST OFDM System Feng Jiang, Liqing Zhang, Bin Xia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1208
Data Preprocessing A Comparative Study on Selection of Cluster Number and Local Subspace Dimension in the Mixture PCA Models Xuelei Hu, Lei Xu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1214 Adaptive Support Vector Clustering for Multi-relational Data Mining Ping Ling, Chun-Guang Zhou . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1222 Robust Data Clustering in Mercer Kernel-Induced Feature Space Xulei Yang, Qing Song, Meng-Joo Er . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1231 Pseudo-density Estimation for Clustering with Gaussian Processes Hyun-Chul Kim, Jaewook Lee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1238 Clustering Analysis of Competitive Learning Network for Molecular Data Lin Wang, Minghu Jiang, Yinghua Lu, Frank Noe, Jeremy C. Smith . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1244 Self-Organizing Map Clustering Analysis for Molecular Data Lin Wang, Minghu Jiang, Yinghua Lu, Frank Noe, Jeremy C. Smith . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1250
A Conscientious Rival Penalized Competitive Learning Text Clustering Algorithm Mao-ting Gao, Zheng-ou Wang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1256 Self-Organizing-Map-Based Metamodeling for Massive Text Data Exploration Kin Keung Lai, Lean Yu, Ligang Zhou, Shouyang Wang . . . . . . . . . . . . 1261 Ensemble Learning for Keyphrases Extraction from Scientific Document Jiabing Wang, Hong Peng, Jing-song Hu, Jun Zhang . . . . . . . . . . . . . . . 1267 Grid-Based Fuzzy Support Vector Data Description Yugang Fan, Ping Li, Zhihuan Song . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1273 Development of the Hopfield Neural Scheme for Data Association in Multi-target Tracking Yang Weon Lee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1280 Determine Discounting Coefficient in Data Fusion Based on Fuzzy ART Neural Network Dong Sun, Yong Deng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1286 Scientific Data Lossless Compression Using Fast Neural Network Jun-Lin Zhou, Yan Fu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1293 HyperSurface Classifiers Ensemble for High Dimensional Data Sets Xiu-Rong Zhao, Qing He, Zhong-Zhi Shi . . . . . . . . . . . . . . . . . . . . . . . . . . 1299 Designing a Decompositional Rule Extraction Algorithm for Neural Networks Jen-Cheng Chen, Jia-Sheng Heh, Maiga Chang . . . . . . . . . . . . . . . . . . . . 1305 Estimating Fractal Intrinsic Dimension from the Neighborhood Qutang Cai, Changshui Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1312 Dimensionality Reduction for Evolving RBF Networks with Particle Swarms Junying Chen, Zheng Qin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1319 Improved Locally Linear Embedding Through New Distance Computing Heyong Wang, Jie Zheng, Zhengan Yao, Lei Li . . . . . . . . . . . . 
. . . . . . . . 1326 An Incremental Linear Discriminant Analysis Using Fixed Point Method Dongyue Chen, Liming Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1334
A Prewhitening RLS Projection Alternated Subspace Tracking (PAST) Algorithm Junseok Lim, Joonil Song, Yonggook Pyeon . . . . . . . . . . . . . . . . . . . . . . . 1340 Classification with the Hybrid of Manifold Learning and Gabor Wavelet Junping Zhang, Chao Shen, Jufu Feng . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1346 A Novel Input Stochastic Sensitivity Definition of Radial Basis Function Neural Networks and Its Application to Feature Selection Xi-Zhao Wang, Hui Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1352 Using Ensemble Feature Selection Approach in Selecting Subset with Relevant Features Mohammed Attik . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1359 A New Method for Feature Selection Yan Wu, Yang Yang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1367 Improved Feature Selection Algorithm Based on SVM and Correlation Zong-Xia Xie, Qing-Hua Hu, Da-Ren Yu . . . . . . . . . . . . . . . . . . . . . . . . . 1373 Feature Selection in Text Classification Via SVM and LSI Ziqiang Wang, Dexian Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1381 Parsimonious Feature Extraction Based on Genetic Algorithms and Support Vector Machines Qijun Zhao, Hongtao Lu, David Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . 1387 Feature Extraction for Time Series Classification Using Discriminating Wavelet Coefficients Hui Zhang, Tu Bao Ho, Mao-Song Lin, Xuefeng Liang . . . . . . . . . . . . . 1394 Feature Extraction of Underground Nuclear Explosions Based on NMF and KNMF Gang Liu, Xi-Hai Li, Dai-Zhi Liu, Wei-Gang Zhai . . . . . . . . . . . . . . . . . 1400 Hidden Markov Model Networks for Multiaspect Discriminative Features Extraction from Radar Targets Feng Zhu, Yafeng Hu, Xianda Zhang, Deguang Xie . . . . . . . . . . . . . . . . . 
1406 Application of Self-organizing Feature Neural Network for Target Feature Extraction Dong-hong Liu, Zhi-jie Chen, Wen-long Hu, Yong-shun Zhang . . . . . . . 1412
Divergence-Based Supervised Information Feature Compression Algorithm Shi-Fei Ding, Zhong-Zhi Shi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1421 Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1427
Table of Contents – Part II
Pattern Classification Design an Effective Pattern Classification Model Do-Hyeon Kim, Eui-Young Cha, Kwang-Baek Kim . . . . . . . . . . . . . . . . . 1
Classifying Unbalanced Pattern Groups by Training Neural Network Bo-Yu Li, Jing Peng, Yan-Qiu Chen, Ya-Qiu Jin . . . . . . . . . . . . . . . . . . 8
A Modified Constructive Fuzzy Neural Networks for Classification of Large-Scale and Complicated Data Lunwen Wang, Yanhua Wu, Ying Tan, Ling Zhang . . . . . . . . . . . . . . . . 14
A Hierarchical FloatBoost and MLP Classifier for Mobile Phone Embedded Eye Location System Dan Chen, Xusheng Tang, Zongying Ou, Ning Xi . . . . . . . . . . . . . . . . . . 20
Iris Recognition Using LVQ Neural Network Seongwon Cho, Jaemin Kim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Minimax Probability Machine for Iris Recognition Yong Wang, Jiu-qiang Han . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Detecting Facial Features by Heteroassociative Memory Neural Network Utilizing Facial Statistics Kyeong-Seop Kim, Tae-Ho Yoon, Seung-Won Shin . . . . . . . . . . . . . . . . . 40
Recognizing Partially Damaged Facial Images by Subspace Auto-associative Memories Xiaorong Pu, Zhang Yi, Yue Wu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
A Facial Expression Classification Algorithm Based on Principle Component Analysis Qingzhang Chen, Weiyi Zhang, Xiaoying Chen, Jianghong Han . . . . . . 55
Automatic Facial Expression Recognition Huchuan Lu, Pei Wu, Hui Lin, Deli Yang . . . . . . . . . . . . . . . . . . . . . . . . 63
Facial Expression Recognition Using Active Appearance Model Taehwa Hong, Yang-Bok Lee, Yong-Guk Kim, Hagbae Kim . . . . . . . . . . 69
Facial Expression Recognition Based on BoostingTree Ning Sun, Wenming Zheng, Changyin Sun, Cairong Zou, Li Zhao . . . . 77
KDA Plus KPCA for Face Recognition Wenming Zheng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Face Recognition Using a Neural Network Simulating Olfactory Systems Guang Li, Jin Zhang, You Wang, Walter J. Freeman . . . . . . . . . . . . . . . 93
Face Recognition Using Neural Networks and Pattern Averaging Adnan Khashman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Semi-supervised Support Vector Learning for Face Recognition Ke Lu, Xiaofei He, Jidong Zhao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 Parts-Based Holistic Face Recognition with RBF Neural Networks Wei Zhou, Xiaorong Pu, Ziming Zheng . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 Combining Classifiers for Robust Face Detection Lin-Lin Huang, Akinobu Shimizu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 Face Detection Method Based on Kernel Independent Component Analysis and Boosting Chain Algorithm Yan Wu, Yin-Fang Zhuang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 Recognition from a Single Sample per Person with Multiple SOM Fusion Xiaoyang Tan, Jun Liu, Songcan Chen . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 Investigating LLE Eigenface on Pose and Face Identification Shaoning Pang, Nikola Kasabov . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 Multimodal Priority Verification of Face and Speech Using Momentum Back-Propagation Neural Network Changhan Park, Myungseok Ki, Jaechan Namkung, Joonki Paik . . . . . 140 The Clustering Solution of Speech Recognition Models with SOM Xiu-Ping Du, Pi-Lian He . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 Study on Text-Dependent Speaker Recognition Based on Biomimetic Pattern Recognition Shoujue Wang, Yi Huang, Yu Cao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 A New Text-Independent Speaker Identification Using Vector Quantization and Multi-layer Perceptron Ji-Soo Keum, Chan-Ho Park, Hyon-Soo Lee . . . . . . . . . . . . . . . . . . . . . . . 165
Neural Net Pattern Recognition Equations with Self-organization for Phoneme Recognition Sung-Ill Kim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 Music Genre Classification Using a Time-Delay Neural Network Jae-Won Lee, Soo-Beom Park, Sang-Kyoon Kim . . . . . . . . . . . . . . . . . . . 178 Audio Signal Classification Using Support Vector Machines Lei-Ting Chen, Ming-Jen Wang, Chia-Jiu Wang, Heng-Ming Tai . . . . 188 Gender Classification Based on Boosting Local Binary Pattern Ning Sun, Wenming Zheng, Changyin Sun, Cairong Zou, Li Zhao . . . . 194 Multi-view Gender Classification Using Local Binary Patterns and Support Vector Machines Hui-Cheng Lian, Bao-Liang Lu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 Gender Recognition Using a Min-Max Modular Support Vector Machine with Equal Clustering Jun Luo, Bao-Liang Lu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 Palmprint Recognition Using ICA Based on Winner-Take-All Network and Radial Basis Probabilistic Neural Network Li Shang, De-Shuang Huang, Ji-Xiang Du, Zhi-Kai Huang . . . . . . . . . . 216 An Implementation of the Korean Sign Language Recognizer Using Neural Network Based on the Post PC Jung-Hyun Kim, Kwang-Seok Hong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 Gait Recognition Using Wavelet Descriptors and Independent Component Analysis Jiwen Lu, Erhu Zhang, Cuining Jing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 Gait Recognition Using Principal Curves and Neural Networks Han Su, Fenggang Huang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238 An Adjacent Multiple Pedestrians Detection Based on ART2 Neural Network Jong-Seok Lim, Woo-Beom Lee, Wook-Hyun Kim . . . . . . . . . . . . . . . . . . 
244 Recognition Method of Throwing Force of Athlete Based on Multi-class SVM Jinghua Ma, Yunjian Ge, Jianhe Lei, Quanjun Song, Yu Ge, Yong Yu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
A Constructive Learning Algorithm for Text Categorization Weijun Chen, Bo Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259 Short-Text Classification Based on ICA and LSA Qiang Pu, Guo-Wei Yang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265 Writer Identification Using Modular MLP Classifier and Genetic Algorithm for Optimal Features Selection Sami Gazzah, Najoua Essoukri Ben Amara . . . . . . . . . . . . . . . . . . . . . . . 271 Self-generation ART Neural Network for Character Recognition Taekyung Kim, Seongwon Lee, Joonki Paik . . . . . . . . . . . . . . . . . . . . . . . . 277 Handwritten Digit Recognition Using Low Rank Approximation Based Competitive Neural Network Yafeng Hu, Feng Zhu, Hairong Lv, Xianda Zhang . . . . . . . . . . . . . . . . . . 287 Multifont Arabic Characters Recognition Using HoughTransform and Neural Networks Nadia Ben Amor, Najoua Essoukri Ben Amara . . . . . . . . . . . . . . . . . . . . 293 Recognition of English Calling Card by Using Multiresolution Images and Enhanced ART1-Based RBF Neural Networks Kwang-Baek Kim, Sungshin Kim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299 A Method of Chinese Fax Recipient’s Name Recognition Based on Hybrid Neural Networks Zhou-Jing Wang, Kai-Biao Lin, Wen-Lei Sun . . . . . . . . . . . . . . . . . . . . . 306 Fast Photo Time-Stamp Recognition Based on SGNN Aiguo Li . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316 Hierarchical Classification of Object Images Using Neural Networks Jong-Ho Kim, Jae-Won Lee, Byoung-Doo Kang, O-Hwa Kwon, Chi-Young Seong, Sang-Kyoon Kim, Se-Myung Park . . . . . . . . . . . . . . . 322 Structured-Based Neural Network Classification of Images Using Wavelet Coefficients Weibao Zou, King Chuen Lo, Zheru Chi . . . . . . . . . . . . . . . . . . . . . . . . . . 
331 Remote Sensing Image Classification Algorithm Based on Hopfield Neural Network Guang-jun Dong, Yong-sheng Zhang, Chao-jie Zhu . . . . . . . . . . . . . . . . . 337
Tea Classification Based on Artificial Olfaction Using Bionic Olfactory Neural Network Xinling Yang, Jun Fu, Zhengguo Lou, Liyu Wang, Guang Li, Walter J. Freeman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343 Distinguishing Onion Leaves from Weed Leaves Based on Segmentation of Color Images and a BP Neural Network Jun-Wei Lu, Pierre Gouton, Yun-An Hu . . . . . . . . . . . . . . . . . . . . . . . . . . 349 Bark Classification Based on Textural Features Using Artificial Neural Networks Zhi-Kai Huang, Chun-Hou Zheng, Ji-Xiang Du, Yuan-yuan Wan . . . . 355 Automated Spectral Classification of QSOs and Galaxies by Radial Basis Function Network with Dynamic Decay Adjustment Mei-fang Zhao, Jin-fu Yang, Yue Wu, Fu-chao Wu, Ali Luo . . . . . . . . . 361 Feed-Forward Neural Network Using SARPROP Algorithm and Its Application in Radar Target Recognition Zun-Hua Guo, Shao-Hong Li . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
Computer Vision Camera Calibration and 3D Reconstruction Using RBF Network in Stereovision System Hai-feng Hu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375 A Versatile Method for Omnidirectional Stereo Camera Calibration Based on BP Algorithm Chuanjiang Luo, Liancheng Su, Feng Zhu, Zelin Shi . . . . . . . . . . . . . . . 383 Evolutionary Cellular Automata Based Neural Systems for Visual Servoing Dong-Wook Lee, Chang-Hyun Park, Kwee-Bo Sim . . . . . . . . . . . . . . . . . 390 Robust Visual Tracking Via Incremental Maximum Margin Criterion Lu Wang, Ming Wen, Chong Wang, Wenyuan Wang . . . . . . . . . . . . . . . 398 An Attention Selection System Based on Neural Network and Its Application in Tracking Objects Chenlei Guo, Liming Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404 Human Motion Tracking Based on Markov Random Field and Hopfield Neural Network Zhihui Li, Fenggang Huang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
Skin-Color Based Human Tracking Using a Probabilistic Noise Model Combined with Neural Network Jin Young Kim, Min-Gyu Song, Seung You Na, Seong-Joon Baek, Seung Ho Choi, Joohun Lee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419 Object Detection Via Fusion of Global Classifier and Part-Based Classifier Zhi Zeng, Shengjin Wang, Xiaoqing Ding . . . . . . . . . . . . . . . . . . . . . . . . . 429 A Cartoon Video Detection Method Based on Active Relevance Feedback and SVM Xinbo Gao, Jie Li, Na Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436 Morphological Neural Networks of Background Clutter Adaptive Prediction for Detection of Small Targets in Image Data Honggang Wu, Xiaofeng Li, Zaiming Li, Yuebin Chen . . . . . . . . . . . . . . 442 Two Important Action Scenes Detection Based on Probability Neural Networks Yu-Liang Geng, De Xu, Jia-Zheng Yuan, Song-He Feng . . . . . . . . . . . . 448 Local Independent Factorization of Natural Scenes Libo Ma, Liqing Zhang, Wenlu Yang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454 Search Region Prediction for Motion Estimation Based on Neural Network Vector Quantization DaeHyun Ryu, HyungJun Kim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460 Hierarchical Extraction of Remote Sensing Data Based on Support Vector Machines and Knowledge Processing Chao-feng Li, Lei Xu, Shi-tong Wang . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468 Eyes Location Using a Neural Network Xiao-yi Feng, Li-ping Yang, Zhi Dang, Matti Pietikäinen . . . . . . . . . . . 474
Image Processing Gabor Neural Network for Endoscopic Image Registration Vladimir Spinko, Daming Shi, Wan Sing Ng, Jern-Lin Leong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480 Isomap and Neural Networks Based Image Registration Scheme Anbang Xu, Ping Guo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
Unsupervised Image Segmentation Using an Iterative Entropy Regularized Likelihood Learning Algorithm Zhiwu Lu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492 An Improvement on Competitive Neural Networks Applied to Image Segmentation Rui Yan, Meng Joo Er, Huajin Tang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498 Image Segmentation by Deterministic Annealing Algorithm with Adaptive Spatial Constraints Xulei Yang, Aize Cao, Qing Song . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504 A Multi-scale Scheme for Image Segmentation Using Neuro-fuzzy Classification and Curve Evolution Da Yuan, Hui Fan, Fu-guo Dong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511 A Robust MR Image Segmentation Technique Using Spatial Information and Principle Component Analysis Yen-Wei Chen, Yuuta Iwasaki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517 Adaptive Segmentation of Color Image for Vision Navigation of Mobile Robots Zeng-Shun Zhao, Zeng-Guang Hou, Min Tan, Yong-Qian Zhang . . . . . 523 Image Filtering Using Support Vector Machine Huaping Liu, Fuchun Sun, Zengqi Sun . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533 The Application of Wavelet Neural Network with Orthonormal Bases in Digital Image Denoising Deng-Chao Feng, Zhao-Xuan Yang, Xiao-Jun Qiao . . . . . . . . . . . . . . . . 539 A Region-Based Image Enhancement Algorithm with the Grossberg Network Bo Mi, Pengcheng Wei, Yong Chen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545 Contrast Enhancement for Image Based on Wavelet Neural Network and Stationary Wavelet Transform Changjiang Zhang, Xiaodong Wang, Haoran Zhang . . . . . . . . . . . . . . . . 551 Learning Image Distortion Using a GMDH Network Yongtae Do, Myounghwan Kim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
557 An Edge Preserving Regularization Model for Image Restoration Based on Hopfield Neural Network Jian Sun, Zongben Xu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
Table of Contents – Part II
High-Dimensional Space Geometrical Informatics and Its Applications to Image Restoration Shoujue Wang, Yu Cao, Yi Huang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569 Improved Variance-Based Fractal Image Compression Using Neural Networks Yiming Zhou, Chao Zhang, Zengke Zhang . . . . . . . . . . . . . . . . . . . . . . . . . 575 Associative Cubes in Unsupervised Learning for Robust Gray-Scale Image Recognition Hoon Kang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581 A Novel Graph Kernel Based SVM Algorithm for Image Semantic Retrieval Songhe Feng, De Xu, Xu Yang, Yuliang Geng . . . . . . . . . . . . . . . . . . . . . 589 Content Based Image Retrieval Using a Bootstrapped SOM Network Apostolos Georgakis, Haibo Li . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595 Unsupervised Approach for Extracting the Textural Region of Interest from Real Image Woo-Beom Lee, Jong-Seok Lim, Wook-Hyun Kim . . . . . . . . . . . . . . . . . . 602 Image Fakery and Neural Network Based Detection Wei Lu, Fu-Lai Chung, Hongtao Lu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 610 Object Detection Using Unit-Linking PCNN Image Icons Xiaodong Gu, Yuanyuan Wang, Liming Zhang . . . . . . . . . . . . . . . . . . . . 616 Robust Image Watermarking Using RBF Neural Network Wei Lu, Hongtao Lu, Fu-Lai Chung . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623 An Interactive Image Inpainting Method Based on RBF Networks Peizhi Wen, Xiaojun Wu, Chengke Wu . . . . . . . . . . . . . . . . . . . . . . . . . . . 629 No-Reference Perceptual Quality Assessment of JPEG Images Using General Regression Neural Network Yanwei Yu, Zhengding Lu, Hefei Ling, Fuhao Zou . . . . . . . . . . . . . . . . . 638 Minimum Description Length Shape Model Based on Elliptic Fourier Descriptors Shaoyu Wang, Feihu Qi, Huaqing Li . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
646 Neural Network Based Texture Segmentation Using a Markov Random Field Model Tae Hyung Kim, Hyun Min Kang, Il Kyu Eom, Yoo Shin Kim . . . . . . 652
Texture Segmentation Using SOM and Multi-scale Bayesian Estimation Tae Hyung Kim, Il Kyu Eom, Yoo Shin Kim . . . . . . . . . . . . . . . . . . . . . . 661 Recognition of Concrete Surface Cracks Using the ART1-Based RBF Network Kwang-Baek Kim, Kwee-Bo Sim, Sang-Ho Ahn . . . . . . . . . . . . . . . . . . . . 669
Signal Processing

SVM-Enabled Voice Activity Detection Javier Ramírez, Pablo Yélamos, Juan Manuel Górriz, Carlos G. Puntonet, José C. Segura . . . 676 A Robust VAD Method for Array Signals Xiaohong Ma, Jin Liu, Fuliang Yin . . . 682 A Flexible Algorithm for Extracting Periodic Signals Zhi-Lin Zhang, Haitao Meng . . . 688 A Neural Network Method for Blind Signature Waveform Estimation of Synchronous CDMA Signals Tianqi Zhang, Zengshan Tian, Zhengzhong Zhou, Yujun Kuang . . . 694 A Signal-Dependent Quadratic Time Frequency Distribution for Neural Source Estimation Pu Wang, Jianyu Yang, Zhi-Lin Zhang, Gang Wang, Quanyi Mo . . . 700 Neural Network Channel Estimation Based on Least Mean Error Algorithm in the OFDM Systems Jun Sun, Dong-Feng Yuan . . . 706 Higher-Order Feature Extraction of Non-Gaussian Acoustic Signals Using GGM-Based ICA Wei Kong, Bin Yang . . . 712 Automatic Removal of Artifacts from EEG Data Using ICA and Exponential Analysis Ning-Yan Bian, Bin Wang, Yang Cao, Liming Zhang . . . 719 Identification of Vibrating Noise Signals of Electromotor Using Adaptive Wavelet Neural Network Xue-Zhi Zhao, Bang-Yan Ye . . . 727
Fractional Order Digital Differentiators Design Using Exponential Basis Function Neural Network Ke Liao, Xiao Yuan, Yi-Fei Pu, Ji-Liu Zhou . . . . . . . . . . . . . . . . . . . . . . 735 Multivariate Chaotic Time Series Prediction Based on Radial Basis Function Neural Network Min Han, Wei Guo, Mingming Fan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 741 Time Series Prediction Using LS-SVM with Particle Swarm Optimization Xiaodong Wang, Haoran Zhang, Changjiang Zhang, Xiushan Cai, Jinshan Wang, Meiying Ye . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 747 A Regularized Minimum Cross-Entropy Algorithm on Mixtures of Experts for Time Series Prediction Zhiwu Lu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 753 Prediction for Chaotic Time Series Based on Discrete Volterra Neural Networks Li-Sheng Yin, Xi-Yue Huang, Zu-Yuan Yang, Chang-Cheng Xiang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 759
System Modeling

A New Pre-processing Method for Regression Wen-Feng Jing, De-Yu Meng, Ming-Wei Dai, Zongben Xu . . . 765 A New On-Line Modeling Approach to Nonlinear Dynamic Systems Shirong Liu, Qijiang Yu, Jinshou Yu . . . 771 Online Modeling of Nonlinear Systems Using Improved Adaptive Kernel Methods Xiaodong Wang, Haoran Zhang, Changjiang Zhang, Xiushan Cai, Jinshan Wang, Meiying Ye . . . 777 A Novel Multiple Neural Networks Modeling Method Based on FCM Jian Cheng, Yi-Nan Guo, Jian-Sheng Qian . . . 783 Nonlinear System Identification Using Multi-resolution Reproducing Kernel Based Support Vector Regression Hong Peng, Jun Wang, Min Tang, Lichun Wan . . . 790 A New Recurrent Neurofuzzy Network for Identification of Dynamic Systems Marcos A. Gonzalez-Olvera, Yu Tang . . . 796
Identification of Dynamic Systems Using Recurrent Fuzzy Wavelet Network Jun Wang, Hong Peng, Jian Xiao . . . 802 Simulation Studies of On-Line Identification of Complex Processes with Neural Networks Francisco Cubillos, Gonzalo Acuña . . . 808 Consecutive Identification of ANFIS-Based Fuzzy Systems with the Aid of Genetic Data Granulation Sung-Kwun Oh, Keon-Jun Park, Witold Pedrycz . . . 815 Two-Phase Identification of ANFIS-Based Fuzzy Systems with Fuzzy Set by Means of Information Granulation and Genetic Optimization Sung-Kwun Oh, Keon-Jun Park, Hyun-Ki Kim . . . 821 A New Modeling Approach of STLF with Integrated Dynamics Mechanism and Based on the Fusion of Dynamic Optimal Neighbor Phase Points and ICNN Zhisheng Zhang, Yaming Sun, Shiying Zhang . . . 827
Control Systems

Adaptive Neural Network Control for Nonlinear Systems Based on Approximation Errors Yan-Jun Liu, Wei Wang . . . 836 Adaptive Neural Network Control for Switched System with Unknown Nonlinear Part by Using Backstepping Approach: SISO Case Fei Long, Shumin Fei, Zhumu Fu, Shiyou Zheng . . . 842 Adaptive Neural Control for a Class of MIMO Non-linear Systems with Guaranteed Transient Performance Tingliang Hu, Jihong Zhu, Zengqi Sun . . . 849 Adaptive Neural Compensation Control for Input-Delay Nonlinear Systems by Passive Approach Zhandong Yu, Xiren Zhao, Xiuyan Peng . . . 859 Nonlinear System Adaptive Control by Using Multiple Neural Network Models Xiao-Li Li, Yun-Feng Kang, Wei Wang . . . 867
Implementable Adaptive Backstepping Neural Control of Uncertain Strict-Feedback Nonlinear Systems Dingguo Chen, Jiaben Yang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 875 A Discrete-Time System Adaptive Control Using Multiple Models and RBF Neural Networks Jun-Yong Zhai, Shu-Min Fei, Kan-Jian Zhang . . . . . . . . . . . . . . . . . . . . . 881 Robust Adaptive Neural Network Control for Strict-Feedback Nonlinear Systems Via Small-Gain Approaches Yansheng Yang, Tieshan Li, Xiaofeng Wang . . . . . . . . . . . . . . . . . . . . . . 888 Neural Network Based Robust Adaptive Control for a Class of Nonlinear Systems Dan Wang, Jin Wang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 898 Robust H∞ Control for Delayed Nonlinear Systems Based on Standard Neural Network Models Mei-Qin Liu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 904 SVM Based Nonlinear Self-tuning Control Weimin Zhong, Daoying Pi, Chi Xu, Sizhen Chu . . . . . . . . . . . . . . . . . . 911 SVM Based Internal Model Control for Nonlinear Systems Weimin Zhong, Daoying Pi, Youxian Sun, Chi Xu, Sizhen Chu . . . . . . 916 Fast Online SVR Algorithm Based Adaptive Internal Model Control Hui Wang, Daoying Pi, Youxian Sun, Chi Xu, Sizhen Chu . . . . . . . . . . 922 A VSC Method for MIMO Systems Based on SVM Yi-Bo Zhang, Dao-Ying Pi, Youxian Sun, Chi Xu, Si-Zhen Chu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 928 Identification and Control of Dynamic Systems Based on Least Squares Wavelet Vector Machines Jun Li, Jun-Hua Liu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 934 A Nonlinear Model Predictive Control Strategy Using Multiple Neural Network Models Zainal Ahmad, Jie Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
943 Predictive Control Method of Improved Double-Controller Scheme Based on Neural Networks Bing Han, Min Han . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 949
Discrete-Time Sliding-Mode Control Based on Neural Networks José de Jesús Rubio, Wen Yu . . . 956 Statistic Tracking Control: A Multi-objective Optimization Algorithm Lei Guo . . . 962 Minimum Entropy Control for Stochastic Systems Based on the Wavelet Neural Networks Chengzhi Yang . . . 968 Stochastic Optimal Control of Nonlinear Jump Systems Using Neural Networks Fei Liu, Xiao-Li Luan . . . 975 Performance Estimation of a Neural Network-Based Controller Johann Schumann, Yan Liu . . . 981 Some Key Issues in the Design of Self-Organizing Fuzzy Control Systems Xue-Feng Dai, Shu-Dong Liu, Deng-Zhi Cui . . . 991 Nonlinear System Stabilisation by an Evolutionary Neural Network Wasan Srikasam, Nachol Chaiyaratana, Suwat Kuntanapreeda . . . 998 Neural Network Control Design for Large-Scale Systems with Higher-Order Interconnections Cong Ming, Sunan Huang . . . 1007 Adaptive Pseudo Linear RBF Model for Process Control Ding-Wen Yu, Ding-Li Yu . . . 1013 An Improved BP Algorithm Based on Global Revision Factor and Its Application to PID Control Lin Lei, Houjun Wang, Yufang Cheng . . . 1019 Neuro-fuzzy Generalized Predictive Control of Boiler Steam Temperature Xiang-Jie Liu, Ji-Zhen Liu . . . 1027 Model-Free Control of a Nonlinear ANC System with a SPSA-Based Neural Network Controller Yali Zhou, Qizhi Zhang, Xiaodong Li, Woonseng Gan . . . 1033 Robust Control for AC-Excited Hydrogenators System Using Adaptive Fuzzy-Neural Network Hui Li, Li Han, Bei He . . . 1039
Adaptive Fuzzy Neural Network Control for Transient Dynamics of Magneto-rheological Suspension with Time-Delay Xiaomin Dong, Miao Yu, Changrong Liao, Weimin Chen, Honghui Zhang, Shanglian Huang . . . 1046 Adaptive Fuzzy Basis Function Network Based Fault-Tolerant Stable Control of Multi-machine Power Systems Youping Fan, Yunping Chen, Shangsheng Li, Qingwu Gong, Yi Chai . . . 1052 Simulation Research on Applying Fault Tolerant Control to Marine Diesel Engine in Abnormal Operation Xiao-Yan Xu, Min He, Wan-Neng Yu, Hua-Yao Zheng . . . 1062 Hybrid Neural Network and Genetic Algorithms for Self-tuning of PI Controller in DSPM Motor Drive System Rui-Ming Fang, Qian Sun . . . 1068 An Efficient DC Servo Motor Control Based on Neural Noncausal Inverse Modeling of the Plant H. Rıza Özçalık . . . 1075 A Dynamic Time Delay Neural Network for Ultrasonic Motor Identification and Control Yanchun Liang, Jie Zhang, Xu Xu, Xiaowei Yang, Zhifeng Hao . . . 1084 Application of PSO-Optimized Generalized CMAC Control on Linear Motor Qiang Zhao, Shaoze Yan . . . 1090 PID Control of Nonlinear Motor-Mechanism Coupling System Using Artificial Neural Network Yi Zhang, Chun Feng, Bailin Li . . . 1096 Design and Simulation of a Neural-PD Controller for Automatic Balancing of Rotor Yuan Kang, Tsu-Wei Lin, Ming-Hui Chu, Yeon-Pun Chang, Yea-Ping Wang . . . 1104 PD Control of Overhead Crane Systems with Neural Compensation Rigoberto Toxqui Toxqui, Wen Yu, Xiaoou Li . . . 1110 A Study on Intelligent Control for Hybrid Actuator Ke Zhang . . . 1116
Double Inverted Pendulum Control Based on Support Vector Machines and Fuzzy Inference Han Liu, Haiyan Wu, Fucai Qian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1124 Adaptive Wavelet Neural Network Friction Compensation of Mechanical Systems Shen-min Song, Zhuo-yi Song, Xing-lin Chen, Guangren Duan . . . . . . 1131
Robotic Systems

Application of Collective Robotic Search Using Neural Network Based Dual Heuristic Programming (DHP) Nian Zhang, Donald C. Wunsch II . . . 1140 RBF Neural Network Based Shape Control of Hyper-redundant Manipulator with Constrained End-Effector Jinguo Liu, Yuechao Wang, Shugen Ma, Bin Li . . . 1146 Robust Adaptive Neural Networks with an Online Learning Technique for Robot Control Zhi-gang Yu, Shen-min Song, Guang-ren Duan, Run Pei . . . 1153 A Particle Swarm Optimized Fuzzy Neural Network Control for Acrobot Dong-bin Zhao, Jian-qiang Yi . . . 1160 Adaptive Control Based on Recurrent Fuzzy Wavelet Neural Network and Its Application on Robotic Tracking Control Wei Sun, Yaonan Wang, Xiaohua Zhai . . . 1166 Dynamic Tracking Control of Mobile Robots Using an Improved Radial Basis Function Neural Network Shirong Liu, Qijiang Yu, Jinshou Yu . . . 1172 Grasping Control of Robot Hand Using Fuzzy Neural Network Peng Chen, Yoshizo Hasegawa, Mitushi Yamashita . . . 1178 Position Control Based on Static Neural Networks of Anthropomorphic Robotic Fingers Juan Ignacio Mulero-Martínez, Francisco García-Córdova, Juan López-Coronado . . . 1188 Control of Voluntary Movements in an Anthropomorphic Robot Finger by Using a Cortical Level Neural Controller Francisco García-Córdova, Juan Ignacio Mulero-Martínez, Juan López-Coronado . . . 1198
Learning Control for Space Robotic Operation Using Support Vector Machines Panfeng Huang, Wenfu Xu, Yangsheng Xu, Bin Liang . . . . . . . . . . . . . . 1208 Neural Networks for Mobile Robot Navigation: A Survey An-Min Zou, Zeng-Guang Hou, Si-Yao Fu, Min Tan . . . . . . . . . . . . . . . 1218 Fault Diagnosis for Mobile Robots with Imperfect Models Based on Particle Filter and Neural Network Zhuohua Duan, Zixing Cai, Jinxia Yu . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1227 Adaptive Neural Network Path Tracking of Unmanned Ground Vehicle Xiaohong Liao, Zhao Sun, Liguo Weng, Bin Li, Yongduan Song, Yao Li . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1233
Power Systems

A Nuclear Power Plant Expert System Using Artificial Neural Networks Malrey Lee, Hye-Jin Jeong, Young Joon Choi, Thomas M. Gatton . . . 1239 Short-Term Load Forecasting Based on Mutual Information and Artificial Neural Network Zhiyong Wang, Yijia Cao . . . 1246 Short Term Load Forecasting by Using Neural Networks with Variable Activation Functions and Embedded Chaos Algorithm Qiyun Cheng, Xuelian Liu . . . 1252 Short Term Load Forecasting Using Neural Network with Rough Set Zhi Xiao, Shi-Jie Ye, Bo Zhong, Cai-Xin Sun . . . 1259 Application of Neural Network Based on Particle Swarm Optimization in Short-Term Load Forecasting Dong-Xiao Niu, Bo Zhang, Mian Xing . . . 1269 Study of Neural Networks for Electric Power Load Forecasting Hui Wang, Bao-Sen Li, Xin-Yang Han, Dan-Li Wang, Hong Jin . . . 1277 A Neural Network Approach to m-Daily-Ahead Electricity Price Prediction Hsiao-Tien Pao . . . 1284
Next-Day Power Market Clearing Price Forecasting Using Artificial Fish-Swarm Based Neural Network Chuan Li, Shilong Wang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1290 Application of Evolutionary Neural Network to Power System Unit Commitment Po-Hung Chen, Hung-Cheng Chen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1296 Application of BP Neural Network in Power Load Simulator Bing-Da Zhang, Ke Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1304 Feeder Load Balancing Using Neural Network Abhisek Ukil, Willy Siti, Jaco Jordaan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1311 A Neural Network Based Particle Swarm Optimization for the Transformers Connections of a Primary Feeder Considering Multi-objective Programming Cheng Chien Kuo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1317 3-D Partial Discharge Patterns Recognition of Power Transformers Using Neural Networks Hung-Cheng Chen, Po-Hung Chen, Chien-Ming Chou . . . . . . . . . . . . . . 1324 Design of Self-adaptive Single Neuron Facts Controllers Based on Genetic Algorithm Quan-Yuan Jiang, Chuang-Xin Guo, Yi-Jia Cao . . . . . . . . . . . . . . . . . . . 1332 Generalized Minimum Variance Neuro Controller for Power System Stabilization Hee-Sang Ko, Kwang Y. Lee, Min-Jae Kang, Ho-Chan Kim . . . . . . . . . 1338 Adaptive Control for Synchronous Generator Based on Pseudolinear Neural Networks Youping Fan, Yunping Chen, Shangsheng Li, Dong Liu, Yi Chai . . . . . 1348 A Research and Application of Chaotic Neural Network for Marine Generator Modeling Wei-Feng Shi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1354 Ship Synchronous Generator Modeling Based on RST and RBF Neural Networks Xihuai Wang, Tengfei Zhang, Jianmei Xiao . . . . . . . . . . . . . . . . . . . . . . . 
1363 A New Control Strategy of a Wind Power Generation and Flywheel Energy Storage Combined System Jian Wang, Long-yun Kang, Bing-gang Cao . . . . . . . . . . . . . . . . . . . . . . . 1370
Wavelet-Based Intelligent System for Recognition of Power Quality Disturbance Signals Suriya Kaewarsa, Kitti Attakitmongcol, Wichai Krongkitsiri . . . . . . . . . 1378 Recognition and Classification of Power Quality Disturbances Based on Self-adaptive Wavelet Neural Network Wei-Ming Tong, Xue-Lei Song, Dong-Zhong Zhang . . . . . . . . . . . . . . . . 1386 Vibration Fault Diagnosis of Large Generator Sets Using Extension Neural Network-Type 1 Meng-hui Wang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1395 Fault Data Compression of Power System with Wavelet Neural Network Based on Wavelet Entropy Zhigang Liu, Dabo Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1402 Intelligent Built-in Test (BIT) for More-Electric Aircraft Power System Based on Hybrid Generalized LVQ Neural Network Zhen Liu, Hui Lin, Xin Luo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1409 Low Voltage Risk Assessment in Power System Using Neural Network Ensemble Wei-Hua Chen, Quan-Yuan Jiang, Yi-Jia Cao . . . . . . . . . . . . . . . . . . . . . 1416 Risk Assessment of Cascading Outages in Power Systems Using Fuzzy Neural Network Wei-Hua Chen, Quan-Yuan Jiang, Zhi-Yong Wang, Yi-Jia Cao . . . . . 1422 Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1429
Table of Contents – Part III
Transportation Systems

Traffic Volume Forecasting Based on Wavelet Transform and Neural Networks Shuyan Chen, Wei Wang . . . 1
Prediction of Railway Passenger Traffic Volume by Means of LS-SVM Zhen-Rui Peng, Fu Wu, Zhao-Yuan Jiang . . . 8
Traffic Flow Modeling of Urban Expressway Using Artificial Neural Networks Guo-Jiang Shen . . . 15
Radial Basis Function Network for Traffic Scene Classification in Single Image Mode Qiao Huang, Jianming Hu, Jingyan Song, Tianliang Gao . . . 23
A New Method for Traffic Signs Classification Using Probabilistic Neural Networks Hang Zhang, Dayong Luo . . . 33
Detection for Triangle Traffic Sign Based on Neural Network Shuang-dong Zhu, Yi Zhang, Xiao-feng Lu . . . 40
Vanishing Point and Gabor Feature Based Multi-resolution On-Road Vehicle Detection Hong Cheng, Nanning Zheng, Chong Sun, Huub van de Wetering . . . 46
Recognition of Road Signs with Mixture of Neural Networks and Arbitration Modules Boguslaw Cyganek . . . 52
Vehicle Classification in Wireless Sensor Networks Based on Rough Neural Network Qi Huang, Tao Xing, Hai Tao Liu . . . 58
Neural Network Approach to Identify Model of Vehicles Hyo Jong Lee . . . 66
An Intelligent Vehicle Security System Based on Modeling Human Driving Behaviors Xiaoning Meng, Yongsheng Ou, Ka Keung Lee, Yangsheng Xu . . . 73
Adaptive Neural Network Control of Helicopters Shuzhi Sam Ge, Keng-Peng Tee . . . 82
Communication Networks

Modified Hopfield Neural Network for CDMA Multiuser Detection Xiangdong Liu, Xuexia Wang, Zhilu Wu, Xuemai Gu . . . 88
Blind Multiuser Detection Based on Kernel Approximation Tao Yang, Bo Hu . . . 94
A Novel Blind Multiuser Detection Model over Flat Fast Fading Channels Hongbo Tian, Qinye Yin, Ke Deng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 Robust Multiuser Detection Method Based on Neural-net Preprocessing in Impulsive Noise Environment Ying Guo, Tianshuang Qiu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 Channel Equalization Using Complex Extreme Learning Machine with RBF Kernels Ming-Bin Li, Guang-Bin Huang, Paramasivan Saratchandran, Narasimhan Sundararajan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 Nonlinear Channel Equalization Using Concurrent Support Vector Machine Processor Jae Woo Wee, Tae Seon Kim, Sung Soo Dong, Chong Ho Lee . . . . . . . 120 Recursive Complex Extreme Learning Machine with Widely Linear Processing for Nonlinear Channel Equalizer Junseok Lim, Jaejin Jeon, Sangwook Lee . . . . . . . . . . . . . . . . . . . . . . . . . . 128 A Study on the Detection Algorithm of QPSK Signal Using TDNN Sun-Kuk Noh, Jae-Young Pyun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 The BP Network for Carrier Frequency Offset Estimation in OFDM-Based WLANs Feng Zhu, Yafeng Hu, Saiyi Wang, Peng Wei . . . . . . . . . . . . . . . . . . . . . 144 The LD-CELP Gain Filter Based on BP Neural Network Gang Zhang, Keming Xie, Zhefeng Zhao, Chunyu Xue . . . . . . . . . . . . . . 150
A Neural Network Based Application Layer Multicast Routing Protocol Peng Cheng, Qiong-Hai Dai, Qiu-Feng Wu . . . . . . . . . . . . . . . . . . . . . . . 156 A Neural Network Decision-Making Mechanism for Robust Video Transmission over 3G Wireless Network Jianwei Wen, Qionghai Dai, Yihui Jin . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 A Multilevel Quantifying Spread Spectrum PN Sequence Based on Chaos of Cellular Neural Network Yaqin Zhao, Nan Zhao, Zhilu Wu, Guanghui Ren . . . . . . . . . . . . . . . . . . 171 An Experimental Hyper-Chaos Spread Spectrum Communication System Based on CNN Jianye Zhao, Quansheng Ren, Daoheng Yu . . . . . . . . . . . . . . . . . . . . . . . . 178 A Resource Allocating Neural Network Based Approach for Detecting End-to-End Network Performance Anomaly Wenwei Li, Dafang Zhang, Jinmin Yang, Gaogang Xie, Lei Wang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184 Recurrent Neural Network Inference of Internal Delays in Nonstationary Data Network Feng Qian, Guang-min Hu, Xing-miao Yao, Le-min Li . . . . . . . . . . . . . 190 Multiscale BiLinear Recurrent Neural Networks and Their Application to the Long-Term Prediction of Network Traffic Dong-Chul Park, Chung Nguyen Tran, Yunsik Lee . . . . . . . . . . . . . . . . . 196 Bandwidth Prediction and Congestion Control for ABR Traffic Based on Neural Networks Zhixin Liu, Xinping Guan, Huihua Wu . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Information Security

Application of Neural Networks in Network Control and Information Security Ángel Grediaga, Francisco Ibarra, Federico García, Bernardo Ledesma, Francisco Brotóns . . . 208 Enhancing the Transmission Security of Content-Based Hidden Biometric Data Muhammad Khurram Khan, Jiashu Zhang . . . 214
Building Lightweight Intrusion Detection System Based on Random Forest Dong Seong Kim, Sang Min Lee, Jong Sou Park . . . 224 Intrusion Detection Based on Fuzzy Neural Networks Ji-yao An, Guangxue Yue, Fei Yu, Ren-fa Li . . . 231 Intrusion Detection Using PCASOM Neural Networks Guisong Liu, Zhang Yi . . . 240 A Mutated Intrusion Detection System Using Principal Component Analysis and Time Delay Neural Network Byoung-Doo Kang, Jae-Won Lee, Jong-Ho Kim, O-Hwa Kwon, Chi-Young Seong, Se-Myung Park, Sang-Kyoon Kim . . . 246 A Novel Intrusion Detection Model Based on Multi-layer Self-Organizing Maps and Principal Component Analysis Jie Bai, Yu Wu, Guoyin Wang, Simon X. Yang, Wenbin Qiu . . . 255 A Modified RBF Neural Network for Network Anomaly Detection Xiaotao Wei, Houkuan Huang, Shengfeng Tian . . . 261 Anti-worm Immunization of Web System Based on Normal Model and BP Neural Network Tao Gong, Zixing Cai . . . 267 Data Hiding in Neural Network Prediction Errors Guangjie Liu, Jinwei Wang, Shiguo Lian, Yuewei Dai, Zhiquan Wang . . . 273 The Minimum Detectable Capacity of Digital Image Information Hiding Fan Zhang, Ruixin Liu, Xinhong Zhang . . . 279 Robust Digital Image Watermarking Algorithm Using BPN Neural Networks Cheng-Ri Piao, Wei-zhong Fan, Dong-Min Woo, Seung-Soo Han . . . 285 A Novel Watermarking Method with Image Signature Xiao-Li Niu, Ju Liu, Jian-De Sun, Jian-Ping Qiao . . . 293 Robust Halftone Image Watermarking Scheme Based on Neural Networks Xiang-yang Wang, Jun Wu . . . 299
A Blind Source Separation Based Multi-bit Digital Audio Watermarking Scheme Xiaohong Ma, Xiaoyan Ding, Chong Wang, Fuliang Yin . . . . . . . . . . . . 306 A 2DPCA-Based Video Watermarking Scheme for Resistance to Temporal Desynchronization Jiande Sun, Ju Liu, Hua Yan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312 A Fast Decryption Algorithm for BSS-Based Image Encryption Qiu-Hua Lin, Fu-Liang Yin, Hua-Lou Liang . . . . . . . . . . . . . . . . . . . . . . 318 A Novel Cryptographic Scheme Based on Wavelet Neural Networks Guo Chen, Feng Tan, Degang Yang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326 Combining RBF Neural Network and Chaotic Map to Construct Hash Function Pengcheng Wei, Wei Zhang, Huaqian Yang, Jun Chen . . . . . . . . . . . . . 332 Multiple-Point Bit Mutation Method of Detector Generation for SNSD Model Ying Tan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340 An Erotic Image Recognition Algorithm Based on Trunk Model and SVM Classification Qindong Sun, Xinbo Huang, Xiaohong Guan, Peng Gao . . . . . . . . . . . . 346
Fault Detection Sensor Validation Using Nonlinear Minor Component Analysis Roger Xu, Guangfan Zhang, Xiaodong Zhang, Leonard Haynes, Chiman Kwan, Kenneth Semega . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352 Fault Diagnosis with Enhanced Neural Network Modelling Ding-Li Yu, Thoon-Khin Chang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358 Fault Detection and Diagnosis Using Neural Network Design Kok Kiong Tan, Sunan Huang, Tong Heng Lee . . . . . . . . . . . . . . . . . . . . 364 Certainty Improvement in Diagnosis of Multiple Faults by Using Versatile Membership Functions for Fuzzy Neural Networks Yuan Kang, Chun-Chieh Wang, Yeon-Pun Chang, Chien-Ching Hsueh, Ming-Chang Chang . . . . . . . . . . . . . . . . . . . . . . . . . . 370
Fault Detection of Reactive Ion Etching Using Time Series Neural Networks Kyung-Han Ryu, Song-Jae Lee, Jaehyun Park, Dong-Chul Park, Sang J. Hong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376 Intelligent Diagnostics for Sound Reproduction System by the Use of PEAQ Byung Doo Jun, Nakjin Choi, Hyun-Woo Ko, KoengMo Sung . . . . . . . 382 Active Learning of Support Vector Machine for Fault Diagnosis of Bearings Zhousuo Zhang, Wenzhi Lv, Minghui Shen . . . . . . . . . . . . . . . . . . . . . . . . 390 Growing Structure Multiple Model System Based Anomaly Detection for Crankshaft Monitoring Jianbo Liu, Pu Sun, Dragan Djurdjanovic, Kenneth Marko, Jun Ni . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396 Fault Diagnosis for Induction Machines Using Kernel Principal Component Analysis Jang-Hwan Park, Dae-Jong Lee, Myung-Geun Chun . . . . . . . . . . . . . . . . 406 Application of RBF and SOFM Neural Networks on Vibration Fault Diagnosis for Aero-engines Kai Li, Dongxiang Jiang, Kai Xiong, Yongshan Ding . . . . . . . . . . . . . . . 414 Predictive Fault Detection and Diagnosis of Nuclear Power Plant Using the Two-Step Neural Network Models Hyeon Bae, Seung-Pyo Chun, Sungshin Kim . . . . . . . . . . . . . . . . . . . . . . 420 Kernel PCA Based Faults Diagnosis for Wastewater Treatment System Byong-Hee Jun, Jang-Hwan Park, Sang-Ill Lee, Myung-Geun Chun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
Financial Analysis On the Symbolic Analysis of Market Indicators with the Dynamic Programming Approach Lukáš Pichl, Takuya Yamano, Taisei Kaizoji . . . . . . . . . . . . . . . . . . . . . . 432 Neural Network Method to Predict Stock Price Movement Based on Stock Information Entropy Xun Liang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
Stock Time Series Forecasting Using Support Vector Machines Employing Analyst Recommendations Zhi-yong Zhang, Chuan Shi, Su-lan Zhang, Zhong-zhi Shi . . . . . . . . . . . 452 Stock Index Prediction Based on the Analytical Center of Version Space Fanzi Zeng, Yonghua Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458 Comparison of Forecasting Performance of AR, STAR and ANN Models on the Chinese Stock Market Index Qi-an Chen, Chuan-Dong Li . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464 Index Prediction of KOSPI 200 Based on Data Models and Knowledge Rules for Qualitative and Quantitative Approach Hyeon Bae, Sungshin Kim, Joing-Il Bae . . . . . . . . . . . . . . . . . . . . . . . . . . 471 Modular Neural Network Rule Extraction Technique in Application to Country Stock Cooperate Governance Structure Dang-Yong Du, Hai-Lin Lan, Wei-Xin Ling . . . . . . . . . . . . . . . . . . . . . . . 477 A Hybrid Support Vector Machines and Discrete Wavelet Transform Model in Futures Price Forecasting Fan-yong Liu, Min Fan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485 A Novel Learning Network for Option Pricing with Confidence Interval Information Kyu-Hwan Jung, Hyun-Chul Kim, Jaewook Lee . . . . . . . . . . . . . . . . . . . . 491 An Adaptive BP Algorithm with Optimal Learning Rates and Directional Error Correction for Foreign Exchange Market Trend Prediction Lean Yu, Shouyang Wang, Kin Keung Lai . . . . . . . . . . . . . . . . . . . . . . . . 498 Recurrent Self-Organising Maps and Local Support Vector Machine Models for Exchange Rate Prediction He Ni, Hujun Yin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504 Selection of the Appropriate Lag Structure of Foreign Exchange Rates Forecasting Based on Autocorrelation Coefficient Wei Huang, Shouyang Wang, Hui Zhang, Renbin Xiao . . . . . . . . . . . . . 
512 Exchange Rate Forecasting Using Flexible Neural Trees Yuehui Chen, Lizhi Peng, Ajith Abraham . . . . . . . . . . . . . . . . . . . . . . . . . 518 Local Volatility Function Approximation Using Reconstructed Radial Basis Function Networks Bo-Hyun Kim, Daewon Lee, Jaewook Lee . . . . . . . . . . . . . . . . . . . . . . . . . 524
Neuroinformatics Visualization of Dynamic Brain Activities Based on the Single-Trial MEG and EEG Data Analysis Jianting Cao, Liangyu Zhao, Andrzej Cichocki . . . . . . . . . . . . . . . . . . . . 531 Multichannel Classification of Single EEG Trials with Independent Component Analysis Dik Kin Wong, Marcos Perreau Guimaraes, E. Timothy Uy, Logan Grosenick, Patrick Suppes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541 Application of SVM Framework for Classification of Single Trial EEG Xiang Liao, Yu Yin, Chaoyi Li, Dezhong Yao . . . . . . . . . . . . . . . . . . . . . 548 Normal and Hypoxia EEG Recognition Based on a Chaotic Olfactory Model Meng Hu, Jiaojie Li, Guang Li, Xiaowei Tang, Walter J. Freeman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554 Nonlinear Dynamics of EEG Signal Based on Coupled Network Lattice Model Minfen Shen, Guoliang Chang, Shuwang Wang, Patch J. Beadle . . . . . 560 ICA-Based EEG Spatio-temporal Dipole Source Localization: A Model Study Ling Zou, Shan-An Zhu, Bin He . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566 Networking Property During Epileptic Seizure with Multi-channel EEG Recordings Huihua Wu, Xiaoli Li, Xinping Guan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573 Time-Frequency Analysis of EEG Based on Event Related Cognitive Task Xiao-Tong Wen, Xiao-Jie Zhao, Li Yao . . . . . . . . . . . . . . . . . . . . . . . . . . 579 Adaptable Noise Reduction of ECG Signals for Feature Extraction Hyun Dong Kim, Chul Hong Min, Tae Seon Kim . . . . . . . . . . . . . . . . . . 586 Mining the Independent Source of ERP Components with ICA Decomposition Jia-Cai Zhang, Xiao-Jie Zhao, Yi-Jun Liu, Li Yao . . . . . . . . . . . . . . . . . 592 Multiple Signal Classification Based on Chaos Optimization Algorithm for MEG Sources Localization Jie-Ming Ma, Bin Wang, Yang Cao, Li-Ming Zhang . . . . . . . . . . . . . . . 600
Automatic Segmentation of Putamen from Brain MRI Yihui Liu, Bai Li, Dave Elliman, Paul Simon Morgan, Dorothee Auer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606 A Neural Network Model for the Estimation of Time-to-Collision Ling Wang, Hongjin Sun, Dezhong Yao . . . . . . . . . . . . . . . . . . . . . . . . . . . 614 Classification of Movement-Related Potentials for Brain-Computer Interface: A Reinforcement Training Approach Zongtan Zhou, Yang Liu, Dewen Hu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
Bioinformatics Two-Class SVM Trees (2-SVMT) for Biomarker Data Analysis Shaoning Pang, Ilkka Havukkala, Nikola Kasabov . . . . . . . . . . . . . . . . . . 629 Interpreting Gene Profiles from Biomedical Literature Mining with Self Organizing Maps Shi Yu, Steven Van Vooren, Bert Coessens, Bart De Moor . . . . . . . . . . 635 Mining Protein Interaction from Biomedical Literature with Relation Kernel Method Jae-Hong Eom, Byoung Tak Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642 A Study of Particle Swarm Optimization in Gene Regulatory Networks Inference Rui Xu, Ganesh Venayagamoorthy, Donald C. Wunsch II . . . . . . . . . . . 648 Support Vector Machine Approach for Retained Introns Prediction Using Sequence Features Huiyu Xia, Jianning Bi, Yanda Li . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654 Prediction of Protein-Protein Interface Residues Using Sequence Neighborhood and Surface Properties Yasir Arafat, Joarder Kamruzzaman, Gour Karmakar . . . . . . . . . . . . . . 660 Prediction of Protein Subcellular Multi-locations with a Min-Max Modular Support Vector Machine Yang Yang, Bao-Liang Lu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667 Prediction of Protein Domains from Sequence Information Using Support Vector Machines Shuxue Zou, Yanxin Huang, Yan Wang, Chunguang Zhou . . . . . . . . . . . 674
Using a Neural Networking Method to Predict the Protein Phosphorylation Sites with Specific Kinase Kunpeng Zhang, Yun Xu, Yifei Shen, Guoliang Chen . . . . . . . . . . . . . . . 682 Neural Feature Association Rule Mining for Protein Interaction Prediction Jae-Hong Eom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690 Prediction of Contact Maps Using Modified Transiently Chaotic Neural Network Guixia Liu, Yuanxian Zhu, Wengang Zhou, Chunguang Zhou, Rongxing Wang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 696 Identification of Cell-Cycle Phases Using Neural Network and Steerable Filter Features Xiaodong Yang, Houqiang Li, Xiaobo Zhou, Stephen T.C. Wong . . . . . 702 Prediction of the Human Papillomavirus Risk Types Using Gap-Spectrum Kernels Sun Kim, Jae-Hong Eom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 710 Extreme Learning Machine for Predicting HLA-Peptide Binding Stephanus Daniel Handoko, Kwoh Chee Keong, Ong Yew Soon, Guang Lan Zhang, Vladimir Brusic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 716 Identifying Transcription Factor Binding Sites Based on a Neural Network Xianhua Dai, Zhiming Dai, Jiang Wang . . . . . . . . . . . . . . . . . . . . . . . . . . 722 TSFSOM: Transmembrane Segments Prediction by Fuzzy Self-Organizing Map Yong Deng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 728
Biomedical Applications Analysis of Multifibre Renal Sympathetic Nerve Recordings Dong Li, Yingxiong Jin, Zhuo Yang, Tao Zhang . . . . . . . . . . . . . . . . . . . 734 A New Color Blindness Cure Model Based on BP Neural Network Yu Ma, Xiao-Dong Gu, Yuan-Yuan Wang . . . . . . . . . . . . . . . . . . . . . . . . 740 Design of RBF Network Based on Fuzzy Clustering Method for Modeling of Respiratory System Kouji Maeda, Shunshoku Kanae, Zi-Jiang Yang, Kiyoshi Wada . . . . . . 746
Recognition of Fatty Liver Using Hybrid Neural Network Jiangli Lin, XianHua Shen, Tianfu Wang, Deyu Li, Yan Luo, Ling Wang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754 A Novel Fast Fuzzy Neural Network Backpropagation Algorithm for Colon Cancer Cell Image Discrimination Ephram Nwoye, Li C. Khor, Satnam S. Dlay, Wai L. Woo . . . . . . . . . . 760 Poultry Skin Tumor Detection in Hyperspectral Images Using Radial Basis Probabilistic Neural Network Intaek Kim, Chengzhe Xu, Moon S. Kim . . . . . . . . . . . . . . . . . . . . . . . . . 770 Combination of Network Construction and Cluster Analysis and Its Application to Traditional Chinese Medicine Mingfeng Wang, Zhi Geng, Miqu Wang, Feng Chen, Weijun Ding, Ming Liu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 777 Differentiation of Syndromes with SVM Zhanquan Sun, Guangcheng Xi, Jianqiang Yi . . . . . . . . . . . . . . . . . . . . . 786 Neural Network Based Posture Control of a Human Arm Model in the Sagittal Plane Shan Liu, Yongji Wang, Jian Huang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 792
Industrial Applications Neural Network Models for Transforming Consumer Perception into Product Form Design Chung-Hsing Yeh, Yang-Cheng Lin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 799 2D Pattern Design of Upper Outer from 3D Specification Based on Neural Networks Dongyun Wang, Yulin Xu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 805 Design of a Broadband Microwave Amplifier Using Neural Performance Data Sheets and Very Fast Simulated Reannealing Yavuz Cengiz, Hüseyin Göksu, Filiz Güneş . . . . . . . . . . . . . . . . . . . . . . . . 815 An Intelligent System for the Heatsink Design Yao-Wen Hsueh, Hsin-Chung Lien, Ming-Hsien Hsueh . . . . . . . . . . . . . 821 Learning Shape for Jet Engine Novelty Detection David A. Clifton, Peter R. Bannister, Lionel Tarassenko . . . . . . . . . . . . 828
Support Vector Machine in Novelty Detection for Multi-channel Combustion Data Lei A. Clifton, Hujun Yin, Yang Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . 836 A Levinson Predictor Based Compensatory Fuzzy Neural Network and Its Application in Crude Oil Distillation Process Modeling Yongfeng He, Quanyi Fan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 844 Laminar Cooling Process Model Development Using RBF Networks Minghao Tan, Xuejun Zong, Heng Yue, Jinxiang Pian, Tianyou Chai . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 852 Hybrid Intelligent Control Strategy of the Laminar Cooling Process Minghao Tan, Shujiang Li, Tianyou Chai . . . . . . . . . . . . . . . . . . . . . . . . . 858 Application of Adaptable Neural Networks for Rolling Force Set-Up in Optimization of Rolling Schedules Jingming Yang, Haijun Che, Yajie Xu, Fuping Dou . . . . . . . . . . . . . . . . 864 Multiple Neural Network Modeling Method for Carbon and Temperature Estimation in Basic Oxygen Furnace Xin Wang, Zhong-Jie Wang, Jun Tao . . . . . . . . . . . . . . . . . . . . . . . . . . . . 870 Air-Fuel-Ratio Optimal Control of a Gas Heating Furnace Based on Fuzzy Neural Networks Heng Cao, Ding Du, Yunhua Peng, Yuhai Yin . . . . . . . . . . . . . . . . . . . . . 876 An Evolutionary Hybrid Model for the Prediction of Flow Stress of Steel Ai-ling Chen, Gen-ke Yang, Zhi-ming Wu . . . . . . . . . . . . . . . . . . . . . . . . . 885 Meta-Learning Evolutionary Artificial Neural Network for Selecting Flexible Manufacturing Systems Arijit Bhattacharya, Ajith Abraham, Crina Grosan, Pandian Vasant, Sangyong Han . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 891 A Multi-Criteria Decision Making Procedure Based on Neural Networks for Kanban Allocation Özlem Uzun Araz, Özgür Eski, Ceyhun Araz . . . . . . . . . . . . . . . . . . . . . . 898 On-Line Measurement of Production Plan Track Based on Extension Matter-Element Theory Zhi-Lin Sheng, Song-Zheng Zhao, Xin-Zheng Qi, Chen-Xi Wang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 906
Modeling and Optimization of High-Technology Manufacturing Productivity Sheng Xu, Hui-Fang Zhao, Zhao-Hua Sun, Xiao-Hua Bao . . . . . . . . . . . 914 Scheduling of Re-entrant Lines with Neuro-Dynamic Programming Based on a New Evaluating Criterion Ying Wang, Huiyu Jin, Shunzhi Zhu, Maoqing Li . . . . . . . . . . . . . . . . . . 921 A Constraint Satisfaction Adaptive Neural Network with Dynamic Model for Job-Shop Scheduling Problem Li-Ning Xing, Ying-Wu Chen, Xue-Shi Shen . . . . . . . . . . . . . . . . . . . . . . 927 Neural Network Based Industrial Processes Monitoring Luis P. Sánchez-Fernández, Cornelio Yáñez-Márquez, Oleksiy Pogrebnyak . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 933 A New Method for Process Monitoring Based on Mixture Probabilistic Principal Component Analysis Models Zhong-Gai Zhao, Fei Liu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 939 On-Line Nonlinear Process Monitoring Using Kernel Principal Component Analysis and Neural Network Zhong-Gai Zhao, Fei Liu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 945 On-Line Batch Process Monitoring Using Multiway Kernel Independent Component Analysis Fei Liu, Zhong-Gai Zhao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 951 Tool Wear Monitoring Using FNN with Compact Support Gaussian Function Hongli Gao, Mingheng Xu, Jun Li, Chunjun Chen . . . . . . . . . . . . . . . . . 957 Intelligent Classification of Cutting Tool Wear States Pan Fu, Anthony D. Hope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 964 Neural Networks-Based In-Process Surface Roughness Adaptive Control System in Turning Operations Julie Z. Zhang, Joseph C. Chen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 970 Modeling of Micro Spring Tension Force for Vertical Type Probe Card Fabrication Chul Hong Min, Tae Seon Kim . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . 976 Identification of Crack Location and Depth in Rotating Machinery Based on Artificial Neural Network Tao Yu, Qing-Kai Han, Zhao-Ye Qin, Bang-Chun Wen . . . . . . . . . . . . . 982
Natural Color Recognition Using Fuzzification and a Neural Network for Industrial Applications Yountae Kim, Hyeon Bae, Sungshin Kim, Kwang-Baek Kim, Hoon Kang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 991 Design of a High Precision Temperature Measurement System Kenan Danisman, Ilker Dalkiran, Fatih V. Celebi . . . . . . . . . . . . . . . . . . 997 Integrating Computational Fluid Dynamics and Neural Networks to Predict Temperature Distribution of the Semiconductor Chip with Multi-heat Sources Yean-Der Kuan, Yao-Wen Hsueh, Hsin-Chung Lien, Wen-Ping Chen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1005 Modeling and Characterization of Plasma Processes Using Modular Neural Network Seung Soo Han, Dong Sun Seo, Sang Jeen Hong . . . . . . . . . . . . . . . . . . . 1014 Prediction of Plasma Enhanced Deposition Process Using GA-Optimized GRNN Byungwhan Kim, Dukwoo Lee, Seung Soo Han . . . . . . . . . . . . . . . . . . . . . 1020 Prediction of Radio Frequency Impedance Matching in Plasma Equipment Using Neural Network Byungwhan Kim, Donghwan Kim, Seung Soo Han . . . . . . . . . . . . . . . . . 1028 Recognition of Plasma-Induced X-Ray Photoelectron Spectroscopy Fault Pattern Using Wavelet and Neural Network Byungwhan Kim, Sooyoun Kim, Sang Jeen Hong . . . . . . . . . . . . . . . . . . 1036 Polynomial Neural Network Modeling of Reactive Ion Etching Process Using GMDH Method Seung-Soo Han, Sang Jeen Hong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1043 Wafer Yield Estimation Using Support Vector Machines Lei-Ting Chen, David Lin, Dan Muuniz, Chia-Jiu Wang . . . . . . . . . . . . 1053 Dynamic Soft-Sensing Model by Combining Diagonal Recurrent Neural Network with Levinson Predictor Hui Geng, Zhihua Xiong, Shuai Mao, Yongmao Xu . . . . . . . . . . . . . . . . 
1059 Thermal Properties Reduced Models by ANN in Process Simulation Xia Yang, Rongshan Bi, Yugang Li, Shiqing Zheng . . . . . . . . . . . . . . . . . 1065
Nonlinear Identification of a PEM Fuel Cell Humidifier Using Wavelet Networks Xian-Rui Deng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1071 Application of RBF Neural Networks Based on a New Hybrid Optimization Algorithm in Flotation Process Yong Zhang, Jie-Sheng Wang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1078 Estimation of Some Crucial Variables in Erythromycin Fermentation Process Based on ANN Left-Inversion Xianzhong Dai, Wancheng Wang, Yuhan Ding . . . . . . . . . . . . . . . . . . . . 1085 The Control of Membrane Thickness in PECVD Process Utilizing a Rule Extraction Technique of Neural Networks Ming Chang, Jen-Cheng Chen, Jia-Sheng Heh . . . . . . . . . . . . . . . . . . . . . 1091 PCA-Based Neural Network Modeling Using the Photoluminescence Data for Growth Rate of ZnO Thin Films Fabricated by Pulsed Laser Deposition Jung Hwan Lee, Young-Don Ko, Min-Chang Jeong, Jae-Min Myoung, Ilgu Yun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1099 Wood Defects Classification Using a SOM/FFP Approach with Minimum Dimension Feature Vector Mario I. Chacon, Graciela Ramirez Alonso . . . . . . . . . . . . . . . . . . . . . . . . 1105 A Kernel Based Multi-resolution Time Series Analysis for Screening Deficiencies in Paper Production Marcus Ejnarsson, Carl Magnus Nilsson, Antanas Verikas . . . . . . . . . . 1111 Using Directed Acyclic Graph Support Vector Machines with Tabu Search for Classifying Faulty Product Types Ping-Feng Pai, Yu-Ying Huang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1117 Product Quality Prediction with Support Vector Machines Xinggao Liu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1126 Hierarchical Neural Network Based Product Quality Prediction of Industrial Ethylene Pyrolysis Process Qiang Zhou, Zhihua Xiong, Jie Zhang, Yongmao Xu . . . . . . . . . . . . . . 
. 1132 A Sub-stage Moving Window GRNN Quality Prediction Method for Injection Molding Processes Xiao-Ping Guo, Fu-Li Wang, Ming-Xing Jia . . . . . . . . . . . . . . . . . . . . . . 1138
Joint Time-Frequency and Kernel Principal Component Based SOM for Machine Maintenance Qianjin Guo, Haibin Yu, Yiyong Nie, Aidong Xu . . . . . . . . . . . . . . . . . . 1144
Other Applications Automatic Recognition and Evaluation of Natural Language Commands Maciej Majewski, Wojciech Kacalak . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1155 Natural Language Human-Machine Interface Using Artificial Neural Networks Maciej Majewski, Wojciech Kacalak . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1161 Implementing a Chinese Character Browser Using a Topography-Preserving Map James S. Kirk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1167 A Soft Computing Method of Economic Contribution Rate of Education: A Case of China Hai-xiang Guo, Ke-jun Zhu, Jin-ling Li, Yan-min Xing . . . . . . . . . . . . . 1173 Improving Population Estimation with Neural Network Models Zaiyong Tang, Caroline W. Leung, Kallol Bagchi . . . . . . . . . . . . . . . . . . 1181 Application of Fuzzy Neural Network for Real Estate Prediction Jian-Guo Liu, Xiao-Li Zhang, Wei-Ping Wu . . . . . . . . . . . . . . . . . . . . . . 1187 Local Neural Networks of Space-Time Predicting Modeling for Lattice Data in GIS Haiqi Wang, Jinfeng Wang, Xuhua Liu . . . . . . . . . . . . . . . . . . . . . . . . . . . 1192 Modeling Meteorological Prediction Using Particle Swarm Optimization and Neural Network Ensemble Jiansheng Wu, Long Jin, Mingzhe Liu . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1202 A Fast Cloud Detection Approach by Integration of Image Segmentation and Support Vector Machine Bo Han, Lishan Kang, Huazhu Song . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1210 Application of Support Vector Machines to Vapor Detection and Classification for Environmental Monitoring of Spacecraft Tao Qian, Xiaokun Li, Bulent Ayhan, Roger Xu, Chiman Kwan, Tim Griffin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1216
Artificial Neural Network Methodology for Three-Dimensional Seismic Parameters Attenuation Analysis Ben-yu Liu, Liao-yuan Ye, Mei-ling Xiao, Sheng Miao, Jing-yu Su . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1223 Estimation of the Future Earthquake Situation by Using Neural Networks Ensemble Tian-Yu Liu, Guo-Zheng Li, Yue Liu, Geng-Feng Wu, Wei Wang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1231 An Expert System Based on BP Neural Networks for Pre-splitting Blasting Design Xiaohong Li, Xinfei Wang, Yongkang Dong, Qiang Ge, Li Qian . . . . . 1237 Surface Reconstruction Based on Radial Basis Functions Network Han-bo Liu, Xin Wang, Xiao-jun Wu, Wen-yi Qiang . . . . . . . . . . . . . . . 1242 Determination of Gas and Water Volume Fraction in Oil Water Gas Pipe Flow Using Neural Networks Based on Dual Modality Densitometry Chunguo Jing, Guangzhong Xing, Bin Liu, Qiuguo Bai . . . . . . . . . . . . . 1248 Application of Neural Network in Metal Loss Evaluation for Gas Conducting Pipelines Wei Zhang, Jing-Tao Guo, Song-Ling Huang . . . . . . . . . . . . . . . . . . . . . . 1254 Soft Sensor Using PNN Model and Rule Base for Wastewater Treatment Plant Yejin Kim, Hyeon Bae, Kyungmin Poo, Jongrack Kim, Taesup Moon, Sungshin Kim, Changwon Kim . . . . . . . . . . . . . . . . . . . . . 1261 Knowledge Acquisition Based on Neural Networks for Performance Evaluation of Sugarcane Harvester Fang-Lan Ma, Shang-Ping Li, Yu-Lin He, Shi Liang, Shan-Shan Hu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1270 Application of Artificial Neural Network in Countercurrent Spray Saturator Yixing Li, Yuzhang Wang, Shilie Weng, Yonghong Wang . . . . . . . . . . . 1277 Wavelet Neural Networks Approach for Dynamic Measuring Error Decomposition Yan Shen, Bing Guo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . 1283
Maneuvering Target Tracking Based on Unscented Particle Filter Aided by Neutral Network Feng Xue, Zhong Liu, Zhang-Song Shi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1290 Application of Principal Component-Artificial Neural Networks in Near Infrared Spectroscopy Quantitative Analysis Hai-Yan Ji, Zhen-Hong Rao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1296 Application of Neural Networks for Integrated Circuit Modeling Xi Chen, Gao-Feng Wang, Wei Zhou, Qing-Lin Zhang, Jiang-Feng Xu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1304 Power Estimation of CMOS Circuits by Neural Network Macromodel Wei Qiang, Yang Cao, Yuan-yuan Yan, Xun Gao . . . . . . . . . . . . . . . . . . 1313
Hardware Implementation An Efficient Hardware Architecture for a Neural Network Activation Function Generator Daniel Larkin, Andrew Kinane, Valentin Muresan, Noel O’Connor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1319 A Design and Implementation of Reconfigurable Architecture for Neural Networks Based on Systolic Arrays Qin Wang, Ang Li, Zhancai Li, Yong Wan . . . . . . . . . . . . . . . . . . . . . . . . 1328 Hardware In-the-Loop Training of Analogue Neural Network Chip Liang Zhang, Joaquin Sitte . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1334 Implementation of a Neural Network Processor Based on RISC Architecture for Various Signal Processing Applications Dong-Sun Kim, Hyun-Sik Kim, Duck-Jin Chung . . . . . . . . . . . . . . . . . . . 1340 Fully-Pipelining Hardware Implementation of Neural Network for Text-Based Images Retrieval Dongwuk Kyoung, Keechul Jung . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1350 FPGA Implementation of a Neural Network for Character Recognition Farrukh A. Khan, Momin Uppal, Wang-Cheol Song, Min-Jae Kang, Anwar M. Mirza . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1357 A Silicon Synapse Based on a Charge Transfer Device for Spiking Neural Network Application Yajie Chen, Steve Hall, Liam McDaid, Octavian Buiu, Peter Kelly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1366
Effect of Steady and Relaxation Oscillations in Brillouin-Active Fiber Structural Sensor Based Neural Network in Smart Structures Yong-Kab Kim, Soonja Lim, ChangKug Kim . . . . . . . . . . . . . . . . . . . . . . 1374 A Novel All-Optical Neural Network Based on Coupled Ring Lasers Ying Chen, Qi-guang Zhu, Zhi-quan Li . . . . . . . . . . . . . . . . . . . . . . . . . . . 1380 Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1387
The Ideal Noisy Environment for Fast Neural Computation

Si Wu¹, Jianfeng Feng², and Shun-ichi Amari³

¹ Department of Informatics, University of Sussex, UK
² Department of Computer Science, Warwick University, UK
³ Lab. for Mathematical Neuroscience, RIKEN Brain Science Institute, Japan
Abstract. A central issue in computational neuroscience is to explain how neural systems can process information extremely fast. Here we investigate the effect of noise and of the collaborative activity of a neural population on speeding up computation. We find that when 1) the input noise is Poissonian, i.e., its variance is proportional to the mean, and 2) the neural ensemble is initially at its stochastic equilibrium state, noise has the 'best' effect of accelerating computation, in the sense that the input strength is linearly encoded by the number of neurons firing in a short time window, so that the neural system can use a simple strategy to read out the stimulus rapidly and accurately.
1 Introduction
Neural systems can process information extremely fast. Taking the primate visual system as an example, event-related potential studies have revealed that human subjects are able to carry out complex scene analysis in less than 150 ms [1]. Recordings also show that the latency of neural responses in V1 can be as short as 40 ms [2], and in the temporal cortex 80-110 ms [3]. Understanding how neural systems carry out information processing at such a rapid speed is of critical importance for understanding the computational mechanisms underlying brain functions.

Recent studies of neural dynamics have suggested that stochastic noise, which is ubiquitously observed in biological systems and is often thought to degrade information processing, may actually play a critical role in speeding up neural computation [4-5]. To see this point clearly, let us first look at the limit on the speed of neural response in a noiseless environment. Without loss of generality, we consider a neural dynamics simply written as τ dv/dt = I, with v representing the membrane potential of the neuron, I the strength of the external input, and τ the membrane time constant. Suppose I is constant and the initial state of the neuron is fixed at v₀. The firing moment of the neuron is then T = τ(θ − v₀)/I, where θ is the firing threshold. Thus, in general, the response speed of a neuron in a noiseless environment is on the order of τ, which has a value of 10-20 ms. This cannot account for the fast computational behaviours observed in the data [1-3].

To increase the speed of neural computation, one idea is to take into account the effect of noise and the collaborative activity of a neural ensemble (see, e.g., [4, 5]). The underlying picture is intuitively understandable. Noise randomizes the initial state of the neural population, i.e., the distribution of the membrane potentials of the neurons. As a result, those neurons whose potentials are close to the threshold will fire rapidly after the onset of a stimulus. We can therefore expect that if the noisy environment (which includes both the form of the input noise and the initial state of the neural ensemble) is appropriate, then even for small inputs a non-zero fraction of the neural population will fire quickly to report the presence of the stimulus. Ideally, if the joint activity of the neural population exhibits a simple statistical feature that encodes the input strength faithfully, then the neural system can utilize a simple strategy to read out the stimulus rapidly and accurately.

Although the idea seems straightforward, the detailed mechanism by which noise accelerates computation, and what it can actually achieve, remain largely unclear. In particular, two issues stand out:

The encoding scheme. While noise speeds up the reaction time of the neural system, it also complicates the way external stimuli are encoded. Roughly speaking, because of noise the spike timing of individual neurons can no longer encode inputs reliably in a short time window, and the activities of the neural ensemble must be considered instead. Furthermore, to simplify the decoding process (presuming that this carries a low computational cost and hence also affects the speed of computation), it is critical that the external information be encoded by a simple feature of the neural population activity.

The suitable noisy environment. Fast computation relies on neural activity in the very first moments after a stimulus is presented. Its performance is therefore sensitive to the structure of the input noise and the initial state of the system. A critical question is: what kind of noisy environment is most suitable for fast computation, so that the neural system can use a simple strategy to read out stimuli rapidly and accurately?

The goal of this study is to explore these two issues.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1-6, 2006.
© Springer-Verlag Berlin Heidelberg 2006
2 The Model
We consider a simple population coding paradigm, in which a one-dimensional continuous variable x is encoded by N neurons [8]. Neurons are clustered according to their preferred stimuli, denoted as c. The dynamics of a single neuron is specified by the integrate-and-fire (I&F) operation, i.e.,

τ dv_ci/dt = −v_ci + I_ci(t),   (1)
The Ideal Noisy Environment for Fast Neural Computation
where v_ci denotes the potential of the ith neuron in cluster c and I_ci(t) the input. The fluctuations of I_ci(t) are assumed to be Gaussian white noise, that is,

I_ci(t) = μ_c + σ_c ξ_ci(t),   (2)

μ_c = A e^{−(c−x)²/(2a²)},   (3)

⟨ξ_ci(t)⟩ = 0,   (4)

⟨ξ_ci(t₁) ξ_ci(t₂)⟩ = δ(t₁ − t₂),   (5)
where μ_c denotes the mean drift of the external inputs, which is a Gaussian function of the difference between the true stimulus x and the neuronal preferred stimulus c; A is a positive constant controlling the magnitude of the input; ξ_ci(t) represents Gaussian white noise of zero mean and unit variance; and σ_c is the noise strength. We assume that the fluctuations of the external inputs to different neurons are independent of each other. Notably, neurons in the same cluster have inputs with the same mean drift. The form of the noise is the key factor that determines the performance of fast computation. It turns out that the following noise structure,

σ_c² = αμ_c,   α > 0,   (6)

has the best effect of accelerating neural computation. This noise structure is stimulus-dependent. For convenience, we call it Poissonian, since for α = 1 the variance equals the mean, as for Poisson noise.
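As a sketch of how the model of Eqs. (1)-(6) can be simulated, the following Euler-Maruyama loop integrates one cluster with Poissonian input noise. All concrete values (dimensionless τ = θ = 1, cluster size, time step) are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def simulate_cluster(mu_c, alpha=1.0, tau=1.0, theta=1.0, v_r=0.0,
                     n_neurons=1000, dt=1e-3, t_max=5.0, seed=0):
    """Euler-Maruyama simulation of one cluster of I&F neurons,
    tau*dv/dt = -v + I_c(t), with Poissonian input noise
    sigma_c^2 = alpha*mu_c (Eqs. 1 and 6).  Dimensionless units."""
    rng = np.random.default_rng(seed)
    sigma_c = np.sqrt(alpha * mu_c)                   # Eq. (6)
    v = np.full(n_neurons, v_r)                       # start at rest
    counts = np.zeros(n_neurons, dtype=int)
    for _ in range(int(t_max / dt)):
        dW = np.sqrt(dt) * rng.standard_normal(n_neurons)
        v += (dt * (mu_c - v) + sigma_c * dW) / tau   # Eqs. (1)-(2)
        fired = v >= theta
        counts[fired] += 1
        v[fired] = v_r                                # reset after a spike
    return counts

counts = simulate_cluster(mu_c=1.5)
print("mean spike count per neuron:", counts.mean())
```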
3 The Poissonian Equilibrium Theory
To achieve fast computation, the key is that the neural system can use a simple statistical feature of its population activity, measured in a short time window, to encode external stimuli. For the model we consider, since neural activities within a cluster are homogeneous and independent of each other, a natural choice for such a feature is the number of neurons firing in a short time window, denoted N_c(t), i.e., the number of neurons in cluster c that have fired within the time interval t. The detailed firing pattern of the cluster within the time window is irrelevant. But can N_c represent the input information faithfully in a short time window? Clearly, this depends on the noisy environment. Ideally, we may expect that there exists a noisy environment in which the mean drift of the external inputs, which is correlated with the stimulus in our case, is linearly and instantly encoded by N_c(t) (more exactly, by the mean value of N_c), that is,

⟨N_c(t)⟩ ∼ μ_c t.   (7)

If this encoding scheme holds, then the neural estimator (NE) can use a simple strategy to read out the stimulus, i.e.,

x̂_NE(t) = Σ_c N_c(t) c / Σ_c N_c(t).   (8)
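The readout of Eq. (8) is a spike-count-weighted centre of mass over the preferred stimuli. A minimal sketch, with a hypothetical array of clusters and Poisson spike counts whose means follow the Gaussian tuning of Eq. (3):

```python
import numpy as np

def ne_estimate(preferred, spike_counts):
    """Population-vector readout of Eq. (8): the spike-count-weighted
    mean of the clusters' preferred stimuli c."""
    spike_counts = np.asarray(spike_counts, dtype=float)
    return float(np.dot(spike_counts, preferred) / spike_counts.sum())

# Hypothetical setting: 41 clusters tuned from -2 to 2, Poisson spike
# counts whose means follow the Gaussian tuning of Eq. (3) around x = 0.4.
rng = np.random.default_rng(1)
c = np.linspace(-2.0, 2.0, 41)
x_true, a, peak = 0.4, 0.5, 20.0
counts = rng.poisson(peak * np.exp(-(c - x_true) ** 2 / (2 * a ** 2)))
est = ne_estimate(c, counts)
print("estimate:", est, " true:", x_true)
```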
3.1 The Ideal Noisy Environment
Our finding of the ideal noisy environment is inspired by the following observations. We compute the equilibrium state of the neural ensemble given a fixed stimulus. Since the neural clusters are independent of each other, it is adequate to consider only one of them. For simplicity, let us first consider the case without the decay term (−v_ci) in the neural dynamics of Eq. (1). Denote by p_c(v, t) the probability of finding the potential of a randomly chosen neuron in cluster c at the value v at time t. The Fokker-Planck equation for p_c(v, t) is given by [6, 7]

τ ∂p_c(v, t)/∂t = −μ_c ∂p_c(v, t)/∂v + (σ_c²/2τ) ∂²p_c(v, t)/∂v².   (9)

The equilibrium distribution p_c(v) of the cluster is determined by ∂p_c(v, t)/∂t = 0, i.e.,

−μ_c ∂p_c(v)/∂v + (σ_c²/2τ) ∂²p_c(v)/∂v² = 0.   (10)

Assuming the noise is Poissonian, i.e., σ_c² = αμ_c, we have

−∂p_c(v)/∂v + (α/2τ) ∂²p_c(v)/∂v² = 0.   (11)

Thus, from the above equation, we see that for Poissonian noise p_c(v) is independent of μ_c, i.e., independent of the value of the stimulus. At the equilibrium state, p_c(v) satisfies a set of boundary conditions [7]:

p_c(θ) = 0,   (12)

∂p_c(θ)/∂v = −2τ²γ_c/σ_c²,   (13)

∂p_c(v_r⁺)/∂v − ∂p_c(v_r⁻)/∂v = −2τ²γ_c/σ_c²,   (14)

where v_r denotes the resting potential and γ_c the instantaneous firing rate of cluster c. By further using the normalization condition ∫_{−∞}^{θ} p_c(v) dv = 1, p_c(v) can be solved exactly:

p_c(v) = (1/θ)(1 − e^{−2τθ/α}) e^{2τv/α},   v < 0,
p_c(v) = (1/θ)(1 − e^{2τ(v−θ)/α}),   0 ≤ v ≤ θ,   (15)
p_c(v) = 0,   v > θ,

where we have set v_r = 0. The profile of p_c(v) is shown in Fig. 1A. The instantaneous firing rate γ_c in the equilibrium state is calculated to be

γ_c = −(σ_c²/2τ²) ∂p_c(v)/∂v |_{v=θ} = μ_c/(τθ).   (16)
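The closed-form rate of Eq. (16) lends itself to a direct Monte Carlo check of the no-decay dynamics: after a burn-in toward the equilibrium density of Eq. (15), the measured population rate should approach μ_c/(τθ). The parameters below are assumed dimensionless values (τ = θ = 1), not the paper's.

```python
import numpy as np

def equilibrium_rate(mu_c, alpha=1.0, tau=1.0, theta=1.0,
                     n_neurons=1000, dt=1e-3, t_burn=4.0, t_meas=4.0, seed=1):
    """Monte Carlo check of Eq. (16) for the no-decay dynamics
    tau*dv/dt = mu_c + sigma_c*xi(t): the equilibrium population
    firing rate should be close to mu_c/(tau*theta)."""
    rng = np.random.default_rng(seed)
    sigma_c = np.sqrt(alpha * mu_c)          # Poissonian noise, Eq. (6)
    v = np.zeros(n_neurons)
    spikes = 0
    n_burn, n_meas = int(t_burn / dt), int(t_meas / dt)
    for step in range(n_burn + n_meas):
        dW = np.sqrt(dt) * rng.standard_normal(n_neurons)
        v += (dt * mu_c + sigma_c * dW) / tau
        fired = v >= theta
        if step >= n_burn:                   # count only after burn-in
            spikes += int(fired.sum())
        v[fired] = 0.0                       # re-inject at v_r = 0
    return spikes / (n_neurons * t_meas)

mu_c = 0.8
rate = equilibrium_rate(mu_c)
print("simulated rate:", rate, " theory mu_c/(tau*theta):", mu_c)
```

The small downward bias of such a scheme (threshold crossings missed within a time step) shrinks as the step size decreases.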
Thus, the mean of N_c(t), namely ⟨N_c(t)⟩ = M μ_c t/(τθ), where M is the number of neurons in the cluster, is proportional to μ_c t. The above study reveals two important features of the network dynamics when the noise structure is Poissonian:

1. In the equilibrium state, the linear encoding scheme of Eq. (7) holds.
2. The equilibrium state of a neural cluster is insensitive to the stimulus value.

It may not be straightforward to see the importance of the second property, which guarantees that the equilibrium state of the neural system serves as the optimal initial condition for fast computation. To understand this point, imagine the opposite situation, in which p_c(v) is sensitive to μ_c. We would then have difficulty identifying a unique optimal initial condition, even though Eq. (7) holds for each p_c(v). Interestingly, it turns out that when the decay term (−v_ci) is added to the neural dynamics, the above properties still hold approximately (see Fig. 1B).
Fig. 1. The equilibrium state of a neural cluster under Poissonian noise, shown as p_c(v) versus v for μ_c = 0.5, 1, and 1.5. (A) Without the decay term. (B) With the decay term.
Drawing the above observations together, we conclude that the ideal noisy environment for fast computation consists of two elements:

– The structure of the input noise satisfies σ_c² = αμ_c;
– The neural ensemble is initially at its equilibrium state.

We call this encoding scheme the Poissonian equilibrium theory. In the ideal noisy environment, provided that Nt is sufficiently large, the error of the neural estimator (NE) is calculated to be

⟨(x̂_NE − x)²⟩ ≈ κτa / (√(2π) Aρt),   (17)

where κ is the variance-to-mean ratio of N_c and ρ is the density of neurons. Fig. 2 shows the simulation results, which confirm that the error of the NE decreases with the length of the decoding time t (Fig. 2A) and with the number of neurons (Fig. 2B). The error of the NE can be made arbitrarily small at any decoding time, provided the number of neurons is sufficiently large.
Fig. 2. Illustrating the performance of the NE, for α = 0.6, 0.8, and 1. (A) Decoding error versus decoding time. (B) Decoding error versus the number of neurons M.
4 Conclusions
The present study investigates the ideal noisy environment for fast neural computation. We observe that stimulus-dependent Poissonian noise, rather than stimulus-independent noise, has the best effect of accelerating neural information processing. This property is intuitively understandable. For strong diffusive noise, in a short time window it is the fluctuations, rather than the mean drift, that dominate the value of the external inputs (that is, W(t) ≈ μt + ση√t ≈ ση√t for t ≪ 1, where η is a Gaussian random variable of zero mean and unit variance). The signal-noise correlation is the key that enables the stimulus information to be propagated to the neural system adequately and quickly. As elucidated in this study, with a proper structure, stimulus-dependent Poissonian noise not only increases the response speed of the neural ensemble, but also provides a simple linear scheme for encoding the stimulus.
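The order-of-magnitude claim W(t) ≈ ση√t for t ≪ 1 can be checked with a short numeric experiment; μ and σ are set to arbitrary equal magnitudes here.

```python
import numpy as np

# Monte Carlo check: samples of W(t) = mu*t + sigma*sqrt(t)*eta.  For
# t << 1 the spread sigma*sqrt(t) dwarfs the drift mu*t, so the short
# window is fluctuation-dominated.
mu, sigma = 1.0, 1.0
rng = np.random.default_rng(5)
for t in [1e-3, 1e-2, 1e-1]:
    w = mu * t + sigma * np.sqrt(t) * rng.standard_normal(100_000)
    print(f"t = {t:g}: drift term {mu * t:g}, empirical spread {w.std():.4f}")
```

The ratio of fluctuation to drift scales as 1/√t, so it grows without bound as the window shrinks.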
References

1. Thorpe, S., Fize, D. & Marlot, C.: Speed of Processing in the Human Visual System. Nature 381 (1996) 520-522
2. Celebrini, S., Thorpe, S., Trotter, Y. & Imbert, M.: Dynamics of Orientation Coding in Area V1 of the Awake Primate. Vis. Neurosci. 10 (1993) 811-825
3. Sugase, Y., Yamane, S., Ueno, S. & Kawano, K.: Global and Fine Information Coded by Single Neurons in the Temporal Visual Cortex. Nature 400 (1999) 869-873
4. Gerstner, W.: Population Dynamics of Spiking Neurons: Fast Transients, Asynchronous State and Locking. Neural Computation 12 (1999) 43-89
5. van Rossum, M., Turrigiano, G. & Nelson, S.: Fast Propagation of Firing Rates through Layered Networks of Neurons. J. Neurosci. 22 (2002) 1956-1966
6. Tuckwell, H.: Introduction to Theoretical Neurobiology. Cambridge University Press, Cambridge (1996)
7. Brunel, N. & Hakim, V.: Fast Global Oscillations in Networks of Integrate-and-Fire Neurons with Low Firing Rates. Neural Computation 11 (1999) 1621-1671
8. Wu, S., Amari, S. & Nakahara, H.: Population Coding and Decoding in a Neural Field: A Computational Study. Neural Computation 14 (2002) 999-1026
How Does a Neuron Perform Subtraction? Arithmetic Rules of Synaptic Integration of Excitation and Inhibition

Xu-Dong Wang1,2, Jiang Hao1,2, Mu-Ming Poo1,3, and Xiao-Hui Zhang1

1 Institute of Neuroscience, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China
2 Graduate School of Chinese Academy of Sciences, Shanghai, China
3 Division of Neurobiology, Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
{wxd, acchj, xhzhang}@ion.ac.cn, [email protected]
Abstract. Numerous learning rules have been devised to carry out computational tasks in various neural network models. However, the rules determining how a neuron integrates thousands of synaptic inputs on the dendritic arbors of a realistic neuronal model are still largely unknown. In this study, we investigated the integration of excitatory and inhibitory postsynaptic potentials in a reconstructed pyramidal neuron of the hippocampal CA1 region. We found that the integration follows a nonlinear subtraction rule (the Cross-Shunting rule, or CS rule). Furthermore, the shunting effect depends on the spatial location of inhibitory synapses, but not on that of excitatory synapses. The shunting effect of inhibitory inputs was also found to promote the synchronization of neuronal firing when the CS rule was applied to a small-scale neural network.
1 Introduction

One of the fundamental issues in neuroscience is to understand how a neuron integrates excitatory and inhibitory synaptic inputs with a variety of spatial and temporal characteristics at its dendrites [1]. Recent neurophysiological studies have revealed a linear [2] or supra-linear [3, 4] summation of excitatory postsynaptic potentials (EPSPs) at the dendrite under different experimental conditions. Furthermore, although the neuron-like unit usually used in artificial neural networks is the classical "point" neuron [5, 6, 7], it has been reported that a single pyramidal neuron may function as a two-layer neural network in response to excitatory synaptic inputs [8, 9, 10, 11, 12]. In the present study, we examined the computational rules for the summation of EPSPs and inhibitory postsynaptic potentials (IPSPs) in a realistic neuron model. In this model, the neuronal morphology is derived from a reconstructed pyramidal neuron of the rat hippocampal CA1 region, and the distribution and density of ion channels are largely based on physiological results reported in the literature.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 7 – 14, 2006. © Springer-Verlag Berlin Heidelberg 2006

We also examined the functional
significance of the computational rule obtained in this study on the firing properties of a population of connected neurons.
2 Methods

The morphology (see Fig. 1A) of the model neuron was obtained from the Duke-Southampton archive of neuronal morphology [13]. The model was run within the NEURON simulation environment [14]. The ion channel properties in this model cell include: (1) Hodgkin-Huxley-type sodium and potassium channels, distributed uniformly along the soma and dendrites; (2) A-type potassium channels (KA) and hyperpolarization-activated channels (Ih), with a progressively increasing channel density from the soma to the distal dendrites [15]; (3) four types of synaptic conductances (AMPA, NMDA, GABA-A, and GABA-B) at the sites of synaptic input.
Fig. 1. (A) The reconstructed morphology of the rat hippocampus CA1 pyramidal neuron. (B) Typical traces of the EPSP (dotted trace), IPSP (dashed trace) and the integrated response (solid trace, the membrane potential change when both excitatory and inhibitory synapses are activated simultaneously).
The parameters in our simulation were adjusted based on previous studies [10, 16]. In our simulation paradigm, we placed two synaptic inputs along the main trunk of the apical dendrite, with the inhibitory input more proximal to the soma than the excitatory input. We recorded the EPSP, IPSP, and integrated responses at the soma (see Fig. 1B). We defined the peak amplitude of the EPSP as x, and the amplitudes of the IPSP and the integrated potential, measured at the time point when the EPSP reached its peak, as y and z, respectively.
The neural network consisted of fifteen artificial neurons fully connected with one another. Each neuron possessed an intrinsic inter-spike interval, which varied among different neurons [17].
3 Results

3.1 Cross-Shunting Rule (CS Rule)

First, we systematically studied the model cell with the following two synaptic sites: the inhibitory and excitatory synapses were located at 46.68 μm and 118.05 μm from the soma, respectively. We found that the simulation results can be best fitted by function (1), as illustrated in Fig. 2.
Fig. 2. 3-D map of the EPSP (x), IPSP (y) and sum S from formula (1). A prominent feature of this map is that a straight line is obtained when a section parallel to the x-z or y-z plane intersects the function S. The magnitude of the synaptic potentials is also coded by the graded grey intensity.
S(x, y) = 1.00595x − 1.0165y − 0.07891xy + 0.00698   (1)
In the second simulation, we used a different set of locations, 46.68 μm and 268.27 μm, for the inhibitory and excitatory inputs, respectively. The best-fitting function showed a similar form:

S(x, y) = 1.00403x − 0.99415y − 0.08212xy + 0.01295   (2)
The above two sample results strongly suggest that the relationship among the three quantities (S, x, and y) takes the following form, with the constant term dropped (due to the constraint that S should be zero when both x and y equal zero):

S(x, y) = x − y − kxy   (3)
We refer to this nonlinear subtraction equation as the Cross-Shunting rule (CS rule) and to k as the shunting factor.
Fig. 3. Comparison between the results predicted by the CS rule and those obtained by simulation. (A) The comparison of the summation values. (B) The relative error between the CS rule and the simulation.
We further tested the CS rule by randomly taking 11 sets of EPSP and IPSP amplitudes, with the inhibitory input placed at 46.68 μm and the excitatory input at 268.27 μm from the soma, and compared the summation results from the simulation with those calculated from the CS rule. As shown in Fig. 3, the two are in excellent agreement.
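A sketch of the kind of least-squares fit that produces expressions like Eqs. (1)-(2): the samples below are synthetic stand-ins generated under the CS rule of Eq. (3) with an assumed k, not the paper's simulated EPSP/IPSP data.

```python
import numpy as np

# Least-squares recovery of the CS-rule coefficients from (x, y, S)
# samples generated under S = x - y - k*x*y (Eq. 3) plus small noise.
rng = np.random.default_rng(2)
k_true = 0.08                              # assumed shunting factor
x = rng.uniform(0.0, 10.0, 200)            # stand-in EPSP peak amplitudes
y = rng.uniform(0.0, 5.0, 200)             # stand-in IPSP amplitudes
S = x - y - k_true * x * y + rng.normal(0.0, 0.05, 200)

# Regress S on [x, y, x*y]; the x*y coefficient estimates -k.
A = np.column_stack([x, y, x * y])
coef, *_ = np.linalg.lstsq(A, S, rcond=None)
print("fitted coefficients for [x, y, xy]:", coef)
```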
3.2 Dependence of the Shunting Factor k on Synapse Location

The CS rule indicates that the shunting effect (kxy) depends on both excitatory and inhibitory inputs. We thus examined the dependence of k on the location of the excitatory synapse while that of the inhibitory synapse was kept constant, and vice versa. The simulation results are shown in Fig. 4. They indicate that the value of k is independent of the spatial location of the excitatory synapse, but increases with the distance of the inhibitory synapse from the soma. This suggests that the inhibitory effect of GABAergic synapses on more distal afferent excitatory input is independent of the relative spatial location of the excitatory synapse along the dendrite, and that more distal GABAergic synapses exhibit a relatively larger shunting effect.
Fig. 4. Dependence of the shunting factor k on the synapse location. (A) & (B) The inhibitory synapse was located at 46.68 μm and 158.62 μm from the soma, respectively, and the location of the excitatory synapse was varied. The shunting factor k was found to remain constant. (C) The excitatory synapse was at 268.27 μm, while the location of the inhibitory synapse was varied. The value of k increased with the distance between the inhibitory synapse and the soma. The relation between the two quantities can be fitted by a first-order exponential function (black line): k = 0.069 + 0.005 exp(x/83).
3.3 CS Rule in Neural Networks

How does the CS rule work in neural networks? To address this question, we further investigated the effect of this nonlinear subtraction rule in the framework of the neural network described by Carnevale and Hines (2005) [17]. In this network, each neuron fires action potentials spontaneously with a particular inter-spike interval and receives excitatory or inhibitory inputs from all other neurons. We additionally added the shunting term kxy whenever a cell in the network received inhibitory input. The network consisted of 15 neurons, whose activities were described by a membrane state variable m satisfying the following equation:
τ dm/dt + m = m_∞,   m_∞ > 1.   (4)
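Eq. (4) integrates in closed form to m(t) = m_∞(1 − e^{−t/τ}) from rest, so a cell whose firing threshold is taken at m = 1 (consistent with the condition m_∞ > 1, though the threshold value itself is our assumption) fires after t* = τ ln[m_∞/(m_∞ − 1)]. A quick numerical cross-check:

```python
import math

def isi_analytic(tau, m_inf, threshold=1.0):
    """Threshold-crossing time of m(t) = m_inf*(1 - exp(-t/tau)),
    the solution of Eq. (4) from m(0) = 0; requires m_inf > threshold."""
    return tau * math.log(m_inf / (m_inf - threshold))

def isi_numeric(tau, m_inf, threshold=1.0, dt=1e-6):
    """Euler integration of tau*dm/dt + m = m_inf until threshold."""
    m, t = 0.0, 0.0
    while m < threshold:
        m += dt * (m_inf - m) / tau
        t += dt
    return t

# Illustrative values: tau = 10 ms, m_inf = 2.5
t_analytic = isi_analytic(0.010, 2.5)
t_numeric = isi_numeric(0.010, 2.5)
print("analytic:", t_analytic, " numeric:", t_numeric)
```

Different intrinsic inter-spike intervals then correspond to different m_∞ values across the cells of the network.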
Fig. 5. Synchronous activity of the 15 artificial cells in the neural network. (A) No inhibition was introduced and each cell fired action potentials with its intrinsic inter-spike interval. (B) Only inhibition of the form S = x − y was introduced. (C) Strong synchrony in the presence of both the inhibition and the shunting term (S = x − y − kxy).
The values of the parameters used in this simulation were: synaptic weight, -0.12; synaptic delay, 2 ms; cell time constant, 10 ms; shunting factor k, -0.08; inter-spike interval, 10-11 ms. In this neural network, the neurons showed a progressive desynchronization of activity over time in the absence of inhibition, due to the accumulating differences in inter-spike intervals (Fig. 5A). When the inhibition y was introduced to the network without the shunting term kxy, the neurons exhibited only partial synchrony (Fig. 5B). However, when both the inhibition and the shunting effect were added, the synchronization of neuronal activity was greatly enhanced (Fig. 5C). An intuitive explanation of this finding is as follows: neurons that begin with shorter inter-spike intervals have a higher probability of staying at a higher membrane state (x), close to the firing threshold, where the shunting effect is larger for the same inhibitory input (y). Hence, their firing tends to slow down and becomes synchronized with the neurons that begin with longer inter-spike intervals. This result suggests that the shunting term contributes to the synchronization of neuronal activity in the network.
4 Conclusion

Taken together, our results have revealed a simple computational rule, the Cross-Shunting rule, that underlies the integration of EPSPs and IPSPs in a realistic neuron model. The form of the nonlinear shunting term (kxy) suggests that the shunting effect results from the interaction between excitation and inhibition; this provides a means for a neuron to implement multiplication, a very important nonlinear operation. The shunting factor k increases with increasing distance of the inhibitory synapse from the soma, but is independent of the location of the excitatory synapse. Our preliminary physiological data, in which the dendritic integration of IPSPs and EPSPs was studied in single CA1 pyramidal neurons in rat hippocampal slices, support the above conclusions (Hao et al., unpublished data). Moreover, our results further suggest that the shunting term may contribute to the synchronization of neuronal activity in a neural network.
Acknowledgments

We thank the members of our lab, Ning-long Xu, Chang-quan Ye, Dong Wang, Jiang-teng Lv, and Hai-tian Zhang, for useful comments and discussions. This project was supported by a grant from the Major State Basic Research Program of China.
References

1. Yuste, R., Tank, D.W.: Dendritic Integration in Mammalian Neurons, A Century After Cajal. Neuron 16 (1996) 701-716
2. Cash, S., Yuste, R.: Linear Summation of Excitatory Inputs by CA1 Pyramidal Neurons. Neuron 22 (1999) 383-394
3. Margulis, M., Tang, C.M.: Temporal Integration Can Readily Switch Between Sublinear and Supralinear Summation. J. Neurophysiol. 79 (1998) 2809-2813
4. Nettleton, J.S., Spain, W.J.: Linear to Supralinear Summation of AMPA-mediated EPSPs in Neocortical Pyramidal Neurons. J. Neurophysiol. 83 (2000) 3310-3322
5. McCulloch, W.S., Pitts, W.: A Logical Calculus of the Ideas Immanent in Nervous Activity. Bull. Math. Biophys. 5 (1943) 115-133
6. Rosenblatt, F.: Principles of Neurodynamics. Spartan, New York (1962)
7. Rumelhart, D., Hinton, G., McClelland, J.: A General Framework for Parallel Distributed Processing. In: Rumelhart, D., McClelland, J. (eds.): Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. Bradford, Cambridge, MA (1986) 45-76
8. Mel, B.W., Ruderman, D.L., Archie, K.A.: Translation-invariant Orientation Tuning in Visual 'Complex' Cells Could Derive from Intradendritic Computations. J. Neurosci. 17 (1998) 4325-4334
9. Archie, K.A., Mel, B.W.: An Intradendritic Model for Computation of Binocular Disparity. Nat. Neurosci. 3 (2000) 54-63
10. Poirazi, P., Brannon, T.M., Mel, B.W.: Arithmetic of Subthreshold Synaptic Summation in a Model CA1 Pyramidal Cell. Neuron 37 (2003) 977-987
11. Poirazi, P., Brannon, T.M., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37 (2003) 989-999
12. Polsky, A., Mel, B.W., Schiller, J.: Computational Subunits in Thin Dendrites of Pyramidal Cells. Nat. Neurosci. 7 (2004) 621-627
13. Cannon, R., Turner, D., Pyapali, G.K., Wheal, H.: An On-line Archive of Reconstructed Hippocampal Neurons. J. Neurosci. Methods 84 (1998) 49-54
14. Hines, M.L., Carnevale, N.T.: The NEURON Simulation Environment. Neural Comput. 9 (1997) 1179-1209
15. Migliore, M., Shepherd, G.M.: Emerging Rules for the Distributions of Active Dendritic Conductances. Nature Rev. Neurosci. 3 (2002) 362-370
16. Migliore, M.: On the Integration of Subthreshold Inputs from Perforant Path and Schaffer Collaterals in Hippocampal CA1 Pyramidal Neurons. J. Comput. Neurosci. 14 (2003) 185-192
17. Carnevale, N.T., Hines, M.L.: The NEURON Book. Cambridge Univ. Press, Cambridge (2005)
Stochastic Resonance Enhancing Detectability of Weak Signal by Neuronal Networks Model for Receiver Jun Liu, Jian Wu, Zhengguo Lou, and Guang Li Department of Biomedical Engineering, Key Laboratory of Biomedical Engineering of Ministry of Education, Zhejiang University, 310027 Hangzhou, P.R. China
[email protected]
Abstract. Stochastic resonance phenomena in biological sensory systems have been studied through signal detection theory and psychophysical experiments. There is a conflict between real experiments and the traditional signal detection theory of stochastic resonance, because the latter treats the receiver as a linear model. This paper models the receiver with a two-layer summing network of Hodgkin-Huxley (HH) neurons and with a summing network of threshold devices, respectively. The simulation results indicate that the relevant index of signal detectability exhibits the characteristics of stochastic resonance.
1 Introduction

Stochastic resonance (SR) is a phenomenon in which a nonzero noise level optimizes the performance of a nonlinear system, and it especially contributes to the detection and transfer of weak signals in an intense noisy background [1, 2]. Some psychophysical experiments have indicated that the capacity of sensory systems to detect a weak stimulus can be enhanced by SR [3-9]. Accordingly, signal detection theory has been used to analyze and model psychophysical experiments with a detector model that separates the detection process into reception and classification [10-13]. The detector model based on signal detection theory generally consists of two parts, defined as a receiver and a classifier, respectively. The former transforms the observed signal into decision information and is evaluated by the index of detectability (d'). The latter discriminates the decision information according to a binary classification criterion and is evaluated by the index of percent-correct classification. This model is consistent with the two basic functions of sensory systems, namely reception and classification, so it may be an appropriate model for simulating sensory systems, including peripheral receptors and cortical cells. Some research results suggest that SR is closely linked to the nonlinear classifier with the binary classification criterion [10, 11], but the receiver of a detector (sensory system) is considered a linear system according to the initial formulation of signal detection theory, and this part does not exhibit any SR phenomenon. It is worth noting that the detectability d' is closely related to the signal-to-noise ratio (SNR) and to the two-interval forced choice (2IFC) rule of psychophysical experiments, through which SR phenomena can be observed [7, 15].

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 15 – 20, 2006. © Springer-Verlag Berlin Heidelberg 2006

To explain this conflict, some new models and general indices of detectability have been used to describe the nonlinear
characteristic of a receiver [13]. The simulation results indicate that SR can occur in either part of the detector as long as it is nonlinear. In this paper, we further discuss this issue using neuronal networks: a two-layer summing network of Hodgkin-Huxley (HH) neurons [16] and a summing network of threshold devices are used to model the receiver. Based on these two network models and signal detection theory, the detectability is used to investigate SR phenomena in the receiver.
2 The Signal Detection Theory and Detectability

In terms of signal detection theory [18], the corresponding decision model is written as two hypotheses, H1 and H2, representing the input conditions of noise alone and signal plus noise, respectively. After the receiver transforms the inputs into spikes and preprocesses the spikes into the information used for classification, the classifier discriminates whether the inputs belong to H1 or H2. A straightforward and useful measure is the index of detectability, which reflects the characteristics of the receiver. The detectability of a signal depends both on the separation and on the spread of the noise-alone and signal-plus-noise responses. The most widely used measure is called d-prime (d'), and its formula is simply:

d' = separation / spread,   (1)

where the separation corresponds to the difference between the responses to noise alone and to signal plus noise, and the spread corresponds to the deviation induced by the noise.
3 Detectability Based on the Two-Layer HH Neuronal Network

The left panel of Fig. 1 shows the structure of the receiver based on the two-layer HH neuronal network model. The details of the HH neuron model are given in [16]. The first layer of the network is composed of N parallel neurons, represented by n11, n12, ..., n1N. The second layer has one neuron, represented by n2, which acts as the output of the network. The network is an analogue of the spike generation system, in which the first layer can be considered the part that receives and transmits the external stimulus and converges on the neuron of the second layer. The input signal S(t) of the first layer is taken to be the periodic stimulation I1 sin(2πfs t) plus a constant stimulation I0, in order to study the spectral characteristics of the system. The term I1 sin(2πfs t) denotes the external stimulation information, and I0 is regarded as the average effect of the internal environment of the sensory system. Each neuron in the first layer is subject to external input noise, represented by η1, η2, ..., ηN, assumed to be independent (uncorrelated) Gaussian white noise satisfying ⟨η(t)⟩ = 0 and ⟨η(ti)η(tj)⟩ = 2Dδ(ti − tj), where D is the noise intensity. The neuron of the second layer receives all the outputs of the neurons in the first layer and the same internal environment stimulation I0. The neurons in the first layer are connected in parallel to the second layer through synapses. The synaptic current of neuron n1i is described in reference [19].
Fig. 1. (Left) The structure of the receiver based on the two-layer network. The inputs (the periodic signal s(t) plus noise, or noise alone) are transformed into spikes. The spikes are converted into a power spectrum by a spectral analyzer. (Right) The d' varying with the intensity of noise D for N = 1, 50, and 100, respectively. The remaining parameters: I1 = 1 μA/cm², I0 = 1 μA/cm², fs = 50 Hz.
The corresponding spike sequences occur when the inputs are applied to the receiver. It is widely acknowledged that each of the auditory, visual, and tactile sensory systems can be modeled as a spectral frequency analyzer [17]; thus, a spectrum analysis is used to preprocess the information for classification. We first simplify the spike sequences into standard rectangular pulse sequences (the amplitude of each pulse is 1 mV and the width is 2 ms), and then obtain the power spectral density (PSD) by summing the results of the Fast Fourier Transform (FFT) of the pulse sequences. The PSD in a small bandwidth Δf centered on the signal frequency fs is used to define the detectability of the detection process. Referring to [17], the corresponding expression can be written as:

d' = [P_{s+n}^{Δf}(fs) − P_n^{Δf}(fs)] / P_n^{Δf}(fs),   (2)
where P_{s+n}^{Δf}(fs) − P_n^{Δf}(fs) represents the power separation between the signal-plus-noise and noise-alone conditions, and P_n^{Δf}(fs) is the power from the noise. The result represents the total power contained in the signal feature of the spectrum compared with the total noise power in the same bandwidth centered on the signal frequency. Obviously, d' has a monotonic relationship with the SNR, which usually represents the ratio of the signal peak to the mean amplitude of the background noise at the input signal frequency fs in the power spectrum. The right panel of Fig. 1 shows d' versus the noise intensity D for N = 1, N = 50, and N = 100. The amplitude of the input signal is taken at a subthreshold value (I1 = 1 μA/cm²). All three curves exhibit the typical characteristic of SR: first a rise and then a drop. In the cases N = 50 and N = 100, however, the optimal noise intensity varies from 1 to 10 and has a wider range than in the case N = 1. This means that the range and the optimal d' increase with the number of neurons in the first layer. The reason may be that the collective behavior of many neurons can suppress the noise and decrease the noise power, so as to improve d'.
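The detectability of Eq. (2) can be sketched numerically. A full HH network is beyond a short example, so the spike generator below is a plain threshold crossing of signal plus Gaussian noise (an assumed stand-in), the "PSD" is a single periodogram bin rather than a band Δf, and all parameter values are illustrative:

```python
import numpy as np

def psd_at(freq, x, fs_samp):
    """Periodogram value of sequence x at the FFT bin nearest `freq`
    (a one-bin stand-in for the band-limited PSD of Eq. (2))."""
    p = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    f = np.fft.rfftfreq(len(x), d=1.0 / fs_samp)
    return p[np.argmin(np.abs(f - freq))]

def pulse_train(drive, theta, noise_sd, rng):
    """Binary pulse sequence from threshold crossings of drive + noise;
    a crude stand-in for the HH spike generation and the rectangular
    pulse conversion described in the text."""
    return (drive + noise_sd * rng.standard_normal(drive.size) > theta).astype(float)

rng = np.random.default_rng(4)
fs_samp, f_sig, dur = 1000.0, 50.0, 2.0              # Hz, Hz, s (assumed)
t = np.arange(0.0, dur, 1.0 / fs_samp)
signal = 0.5 * np.sin(2 * np.pi * f_sig * t)         # subthreshold amplitude
theta, noise_sd, trials = 1.0, 0.6, 50

p_sn = np.mean([psd_at(f_sig, pulse_train(signal, theta, noise_sd, rng), fs_samp)
                for _ in range(trials)])
p_n = np.mean([psd_at(f_sig, pulse_train(np.zeros_like(t), theta, noise_sd, rng), fs_samp)
               for _ in range(trials)])
d_prime = (p_sn - p_n) / p_n                         # Eq. (2)
print("d' =", d_prime)
```

With the signal present, the threshold-crossing probability is modulated at fs, producing a spectral peak that the noise-alone condition lacks; sweeping `noise_sd` would trace out an SR-like curve.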
4 Detectability Based on the Summing Network of Threshold Devices

Because the HH model can only be analyzed numerically, a network of threshold devices is introduced to mimic the threshold characteristic of reception. In view of its simple form, the corresponding probability model can be used to obtain analytic results.
Fig. 2. (Left) The structure of the receiver based on the summing network of threshold devices. The inputs (the signal s(t) plus noise, or noise alone) are transformed into spikes. The spikes are summed and analyzed statistically. (Right) d' versus the noise intensity D for N = 1, 50 and 100, respectively. The subthreshold signal s = 5 and the threshold value θ = 10.
As shown in the left panel of Fig. 2, a receiver model can also consist of a network of threshold devices. The basic elements of the receiver are McCulloch-Pitts neurons, considered as simple threshold devices described by Heaviside functions. Each output y_i(t) equals 1 if s_i(t) + η_i(t) > θ_i, and 0 otherwise. Each threshold device is subject to the same input signal s(t), and all noise variables η_1(t), …, η_i(t), …, η_N(t) are assumed to be independent and identically distributed (IID) with zero mean and distribution function F. Here θ_i is the threshold of device i, where i = 1, 2, …, N. Let the thresholds of all devices equal θ; then at each time t, independent Bernoulli random variables are observed with probability:
p(t) = p_{y_i}(t) = P[y_i(t) = 1] = P_{y_i(t)}(θ, ∞) = 1 − F[θ − s(t)]  (3)
The receiver is considered as a summing network of N threshold devices. The output y(t) represents the number of devices whose input crosses the threshold at any time. According to Eq. (3), when N is large enough, the probability density of the output y(t) approximates a Gaussian distribution with mean μ(t) = Np(t) and variance σ²(t) = Np(t)[1 − p(t)] at time t. The corresponding decision model is written as two hypotheses:

H_0: y(t) = Σ_{i=1}^{N} g[η_i(t)]   and   H_1: y(t) = Σ_{i=1}^{N} g[s(t) + η_i(t)]

The function g(·) is the Heaviside function. H_0 represents the condition with noise-only input and H_1 the condition with signal-plus-noise input. For a detector, y(t) corresponds to the decision variable, i.e., the output of the receiver. The corresponding detectability [20] at time t is defined as:
Stochastic Resonance Enhancing Detectability of Weak Signal

d' = √2 [μ_1(t) − μ_0(t)] / √(σ_0²(t) + σ_1²(t))  (4)
Here this d’ is a complete characterization of the detectability of the signal assuming that the noise follows a normal (Gaussian) distribution with a fixed variance, independent of the signal strength, satisfying <η(t)> = 0, <η(ti) η(tj)> = 2Dδ(ti-tj), D is intensity of noises. For a stationary signal input, the output of receiver, y(t), is stationary random variable, so the population distribution can be replaced by the distribution at time, t. The mean and the deviation become constant values: μi=N [1-F (θ-si)] and σi2=N F (θ-si) [1-F (θ-si)], where, i=0 or 1, s0=0, s1=s (amplitude of the input signal). Numeric simulation results are shown in the right figure of fig.2. It can be found that the detectability has the characteristics of SR when the signal amplitude is lower than the threshold. Similarly, we investigate that how the number of threshold devices affects the d’. The results indicate that the d’ increase with the number of devices, but the range of optimal d’ does not exhibit wider and wider with the number increase like the two-layer HH network. The possible reason is that the neurons in the first layer are parallel connected with the second layer through a synapse with the nonlinear summing function.
5 Conclusion

Many human psychophysical and animal behavioral experiments have shown SR in a d'-like measure of performance, but traditional signal detection theory cannot explain these phenomena because it treats the receiver of a detector (sensory system) as linear. In this paper we present two models that replace the linear receiver part of a detector from different angles. It is worth noting that as a stimulus is transformed and processed from the peripheral nervous system to the cortex, many neurons or neuron groups converge, and these convergence processes may be considered as different levels of classification. The models in this paper are constructed according to this idea. The HH network is biologically realistic, but it can only be analyzed numerically. The network of threshold devices is simple in construction, and the corresponding probability model can be derived analytically. For both network models, the two definitions of d' exhibit SR. This may be a complementary explanation of the corresponding experiments.
Acknowledgment

This work was supported by the National Natural Science Foundation of China (project No. 30470470) and the Project of the Zhejiang Province Education Department of China (20050907).
References

1. Benzi, R., Parisi, G., Sutera, A., Vulpiani, A.: Stochastic Resonance in Climatic Change. Tellus 34 (1982) 10-16
2. Gammaitoni, L., Hänggi, P., Jung, P., Marchesoni, F.: Stochastic Resonance. Rev. Mod. Phys. 70 (1998) 223-287
3. Gingl, Z., Kiss, L.B., Moss, F.: Non-dynamical Stochastic Resonance—Theory and Experiment with White and Arbitrarily Colored Noise. Europhys. Lett. 29 (1995) 191-196
4. Pei, X., Bachmann, K., Moss, F.: The Detection Threshold, Noise and Stochastic Resonance in the Fitzhugh-Nagumo Neuron Model. Phys. Lett. (1995) 61-65
5. Collins, J.J., Imhoff, T.T., Grigg, P.: Noise-enhanced Tactile Sensation. Nature 383 (1996) 770-772
6. Simonotto, E., Riani, M., Seife, C.: Visual Perception of Stochastic Resonance. Phys. Rev. Lett. 78 (1997) 1186-1189
7. Zeng, F.G., Fu, Q.J., Morse, R.: Human Hearing Enhanced by Noise. Brain Research 869 (2000) 251-255
8. Russell, D.F., Tucker, A., Wettring, B.: Noise Effects on the Electrosense-mediated Feeding Behavior of Small Paddlefish. Fluct. Noise Lett. 1 (2001) L71-L86
9. Freund, J., Schimansky-Geier, L., Beisner, B.: Behavioral Stochastic Resonance: How the Noise from a Daphnia Swarm Enhances Individual Prey Capture by Juvenile Paddlefish. J. Theor. Biol. 214 (2002) 71-83
10. Tougaard, J.: Stochastic Resonance and Signal Detection in an Energy Detector—Implications for Biological Receptor Systems. Biol. Cybern. 83 (2000) 471-480
11. Gong, Y.F., Matthews, N., Qian, N.: Model for Stochastic-resonance-type Behavior in Sensory Perception. Phys. Rev. E 65 (2002) 031904
12. Tougaard, J.: Signal Detection Theory, Detectability and Stochastic Resonance Effects. Biol. Cybern. 87 (2002) 79-90
13. Ward, L.M., Neiman, A., Moss, F.: Stochastic Resonance in Psychophysics and in Animal Behavior. Biol. Cybern. 87 (2002) 91-101
14. Macmillan, N.A., Creelman, C.D.: Detection Theory: A User's Guide. Cambridge University Press, Cambridge (1991)
15. Manjarrez, E., Rojas-Piloni, G., Méndez, I.: Stochastic Resonance within the Somatosensory System: Effects of Noise on Evoked Field Potentials Elicited by Tactile Stimuli. Journal of Neuroscience 23 (2003) 1997-2001
16. Hodgkin, A., Huxley, A.: A Quantitative Description of Membrane Current and its Application to Conduction and Excitation in Nerve. J. Physiol. 117 (1952) 500-544
17. Coren, S., Ward, L.M., Enns, J.: Sensation and Perception, 5th edn. Harcourt, San Diego (1999)
18. Green, D.M., Swets, J.A.: Signal Detection Theory and Psychophysics. Krieger, Huntington, N.Y. (1966)
19. Koch, C., Segev, I.: Methods in Neuronal Modeling: From Ions to Networks, 2nd edn. MIT Press, Cambridge, MA (1998) 98-100
20. Simpson, A.J., Fitter, M.J.: What is the Best Index of Detectability? Psychol. Bull. 80 (1973) 481-488
A Gaussian Dynamic Convolution Models of the FMRI BOLD Response Huafu Chen, Ling Zeng, Dezhong Yao, and Qing Gao School of Applied Mathematics, School of Life Science & Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
[email protected]
Abstract. Blood oxygenation level dependent (BOLD) contrast based functional magnetic resonance imaging (fMRI) has been widely utilized to detect brain neural activities, and great effort is now focused on the hemodynamic processes of different brain regions activated by a stimulus. The focus of this paper is Gaussian dynamic convolution models of the fMRI BOLD response. The convolutions are between the perfusion function of the neural response to a stimulus and a Gaussian function. The parameters of the models are estimated by a nonlinear least-squares optimization algorithm for the fMRI data of eight subjects collected in a visual stimulus experiment. The results show that the Gaussian model fits the data well.
1 Introduction

A primary goal of functional MRI (fMRI) is to accurately characterize the underlying neural activities via the changes of the measured signals. The most common technique in fMRI is gradient-echo imaging with blood oxygenation level dependent (BOLD) contrast, which measures the hemodynamic changes accompanying neural activation [1-5]. In block design experiments the subject is instructed to perform a task that activates a certain region of the brain during a specific time interval. The resultant hemodynamic response (HDR) of the cortex under investigation changes with the stimulus. Early fMRI experiments [6-7] suggested a linear relationship between the parameters of the stimulus and the resultant HDR, but later experiments [8-12] showed an increasing amount of nonlinearity in the response. To analyze this nonlinear relationship, Miller et al. proposed an fMRI dynamic model of convolution between the perfusion function of the neural response to a stimulus and a Gamma response function of the cerebral blood flow (CBF) hemodynamics of neurons [8,4]. Recently, we proposed an extended Gamma convolution model in which a baseline is introduced into the original Gamma convolution model to represent the background activities [13]. In general, it remains a frontier problem to construct a better BOLD dynamic model to reveal the neural mechanisms of cerebral activities and to investigate the dynamic patterns of various cerebral regions in brain function exploration and cognitive neuroscience [14-16]. J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 21 – 26, 2006. © Springer-Verlag Berlin Heidelberg 2006
22
H. Chen et al.
In this paper, the effectiveness of the new Gaussian model is evaluated. As a preliminary application, the new model is applied to fMRI data from a visual stimulus experiment in which we are concerned with the phenomenon of hemispheric lateralization. Understanding how spatial abilities are lateralized in the brain is important for many reasons and is still a topic without a consensus view [17-18]. In this paper, the parameters of the two fMRI models were estimated by a nonlinear least-squares optimization algorithm, and according to the parameter changes in the left and right cerebral visual regions, the differences in the hemodynamic processes are discussed.
2 Gaussian Dynamic Convolution Models

2.1 The Perfusion Response

Many experiments indicate that there is neural adaptation between neural responses and stimuli: the general tendency is for the neural firing rate to decay from a high initial value to a lower steady-state value during a sustained stimulus [6,8,16]. If the parameterization of the neural response to a block stimulus begins at t = 0 and ends at t = T, then we have:

n(t) = 1 + a·e^(−t/t_n) for 0 ≤ t ≤ T, and n(t) = 0 otherwise  (1)
where the parameter t_n is a decay time constant, a is the amount by which the initial response overshoots the steady-state response, and n(t) is assumed to be the perfusion function of the neural response [8].

2.2 Gaussian Response of a Neural Cluster

It has been shown that the regional response of a collectively activated neural cluster to a stimulus follows a normal distribution, consistent with the law of large numbers in probability [18], so the response of a neural cluster may be expressed by a Gaussian function:

g(t) = (θ_0/θ_1)·exp(−(t − θ_2)²/(2θ_1²))  (2)
where θ_0 and θ_1 are two constants, and θ_2 is the time interval from the presentation of a stimulus to the maximum neural response, i.e., a response delay parameter.

2.3 Gaussian Convolution Model

The BOLD signal of cerebral activation is a collective response of an activated region, and it can be explained as a mutual interaction between the perfusion of the neural response to a stimulus and the hemodynamic change due to the activation of a neural cluster. Therefore, the BOLD signal x(t) of a cerebral activation can be expressed as the convolution of this perfusion response of the neural response with a Gaussian function of the CBF hemodynamic response of a neural cluster [17]:
Model I:

x(t) = n(t) ⊗ g(t) + (1 + α×t)×η + e(t)  (3)
where (1 + α×t)×η is introduced to represent the background fMRI signal. Here η is a constant, the time dependence is assumed to be a linear function α×t×η with a baseline shift factor α, and e(t) represents various additive noises. Fig. 1 shows the simulation result of Model I without e(t).

2.4 Nonlinear Least-Squares Estimation
Gaussian convolution Model I is composed of two nonlinear functions with parameters a, t_n, θ_0, θ_1, θ_2, η, e(t) and α. These parameters are estimated with a nonlinear least-squares algorithm as follows.

1) Equations (3) and (6) are reformulated as

x(t) = f(t, β) + e(t)  (4)

where x(t) is the observed signal value at time t, β is an unknown parameter vector including all the above unknown parameters except e(t), and f is a nonlinear function of t and β.

2) Using a nonlinear least-squares estimator, the parameter vector β is estimated as

β* = arg max_β { −(1/n)[x(t) − f(t, β)]^T [x(t) − f(t, β)] }  (5)
In our model, the detailed cost function to be minimized is

Σ_{t=1}^{n} [x(t) − (1 + a·e^(−t/t_n)) ⊗ φ(t) − (1 + α×t)×η]²  (6)

where x(t) is the measured BOLD signal at time t, t = 1, 2, …, n, and φ(t) is the Gaussian function.
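A minimal numerical sketch of Model I and the cost of Eq. (6) can be written in pure Python. All parameter values, the sampling grid, and the crude 1-D grid search below are illustrative assumptions; the paper uses MATLAB's nonlinear least-squares routines rather than a grid search:

```python
import math

DT = 1.0       # sampling interval (arbitrary units, assumed)
N_SAMP = 60    # number of samples (assumed)
T_STIM = 20.0  # stimulus duration T (assumed)

def perfusion(t, a, tn, T=T_STIM):
    """Eq. (1): neural perfusion response to a block stimulus."""
    return 1.0 + a * math.exp(-t / tn) if 0.0 <= t <= T else 0.0

def gauss_kernel(t, th0, th1, th2):
    """Eq. (2): Gaussian response of a neural cluster."""
    return (th0 / th1) * math.exp(-((t - th2) ** 2) / (2.0 * th1 ** 2))

def bold(a, tn, th0, th1, th2, alpha, eta):
    """Eq. (3): x(t) = n(t) convolved with g(t), plus the linear baseline
    (1 + alpha*t)*eta; the additive noise e(t) is omitted here."""
    xs = []
    for i in range(N_SAMP):
        t = i * DT
        conv = sum(perfusion(k * DT, a, tn) * gauss_kernel(t - k * DT, th0, th1, th2) * DT
                   for k in range(N_SAMP))
        xs.append(conv + (1.0 + alpha * t) * eta)
    return xs

def cost(params, observed):
    """Eq. (6): sum of squared residuals between measured and modeled signal."""
    return sum((o - m) ** 2 for o, m in zip(observed, bold(*params)))

# Synthetic "measured" data from known parameters, then a grid search over the
# delay parameter theta_2 alone, with the other parameters held at their true values.
true = (0.8, 4.0, 1.0, 3.0, 6.0, 0.01, 0.5)
data = bold(*true)
best_th2 = min((cost(true[:4] + (th2,) + true[5:], data), th2)
               for th2 in [2.0, 4.0, 6.0, 8.0])[1]
```

Since the synthetic data are noise-free, the cost is exactly zero at the true parameter vector, and the grid search recovers the true delay θ_2 = 6.0.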
3 Application in fMRI Data

3.1 In Vivo fMRI Experiment

To test the Gaussian model on in vivo fMRI data, normal subjects performed a visual task in a block paradigm. The experiment was conducted at the Beijing MRI Center, Chinese Academy of Sciences, with a SIEMENS 3 Tesla MAGNETOM TRIO scanner. EPI parameters: repetition time TR = 1.0 s, matrix size = 64×64, voxel size = 3.00×3.00×2 mm, TE = 30 ms, flip angle = 90°. The first 4 scans of each run were discarded to allow for magnetic saturation effects; the remaining 320 volume images comprising the time series were analyzed.
3.2 fMRI Image Result
According to an F-test (P < 0.05) and prior response information, the data were analyzed following the SPM principle [www.fil.ion.ucl.ac.uk/spm]. Fig. 1 shows the fMRI activations of the subjects in Talairach space corresponding to the left and right visual fields. The black points in Fig. 1 represent the activated voxels in a brain slice, where the regions of interest (ROIs) comprise those in the left and right occipital lobes of the primary visual cortex, respectively.
Fig. 1. fMRI result
3.3 Modelling Results of fMRI Signals
First, we chose two measured activated fMRI signals from the selected ROIs in the left and right primary visual cortices, marked A and B (Fig. 1), and minimized the cost function in Eq. (6) for the Gaussian model with the optimization routines offered by MATLAB to obtain the model parameter estimates.
Fig. 2. BOLD signal in the Gaussian model: (a) voxel A; (b) voxel B
Fig. 2 shows the curves of the measured BOLD signals and the BOLD signals estimated by the model. The horizontal axis is the sample point and the vertical axis is the signal value. Fig. 2(a) is the result for the signal at voxel A in the left occipital lobe, where the solid lines denote the measured BOLD signal and the dotted lines denote the signals estimated by the Gaussian model. These results show that the new Gaussian model follows the trend of the measured BOLD signals well. The model parameters differ between the left and right occipital regions, which indicates that the dynamic processes may differ across cerebral functional regions.
4 Conclusion

In this paper, a Gaussian convolution model is presented, and its parameters are estimated from in vivo fMRI data evoked by visual stimulation. The results show that the Gaussian model, which is based on the law of large numbers in probability and may thus reflect the statistical characteristics of neural cluster activation, can simulate the BOLD response. fMRI is a relatively new functional imaging technology, and the BOLD signal is the key to investigating the relation between neuronal activation and fMRI signals [4,8,10]. It remains a challenging topic to combine the characteristics of various experimental paradigms to develop and improve BOLD models [14].
Acknowledgment This work is supported by the 973 Project No. 2003CB716100, NSFC#90208003, 30525030, 30570507, Fok Ying Tong Education Foundation (91041), New Century Excellent Talents in University.
References

1. Kwong, K., Belliveau, J., Chesler, D.: Dynamic Magnetic Resonance Imaging of Human Brain Activity During Primary Sensory Stimulation. Proc. Natl. Acad. Sci. USA 89 (1992) 5675-5679
2. Ogawa, S., Tank, D.W., Menon, R., Ellermann, J.M., Kim, S.G., Merkle, H., Ugurbil, K.: Intrinsic Signal Changes Accompanying Sensory Stimulation: Functional Brain Mapping with Magnetic Resonance Imaging. Proc. Natl. Acad. Sci. USA 89 (1992) 5951-5955
3. Bandettini, P.A., Jesmanowicz, A., Wong, E.C., Hyde, J.S.: Processing Strategies for Time-Course Data Sets in Functional MRI of the Human Brain. Magnetic Resonance in Medicine 30(2) (1993) 161-173
4. Rasmus, M.B., Ziad, S.S., Peter, A.N.: Spatial Heterogeneity of the Nonlinear Dynamics in the fMRI BOLD Response. NeuroImage 14(5) (2001) 817-826
5. Yao, D.: High-resolution EEG Mappings: A Spherical Harmonic Spectra Theory and Simulation Results. Clinical Neurophysiology 111(1) (2000) 81-92
6. Boynton, G.M., Engel, S.A., Glover, G.H., Heeger, D.J.: Linear Systems Analysis of Functional Magnetic Resonance Imaging in Human V1. J. of Neuroscience 16(13) (1996) 4207-4221
7. Rao, S.M., Bandettini, P.A., Binder, J.R., Bobholz, J.A., Hammeke, T.A., Stein, E.A., Hyde, J.S.: Relationship Between Finger Movement Rate and Functional Magnetic Resonance Signal Change in Human Primary Motor Cortex. J. of Cerebral Blood Flow and Metabolism 16(6) (1996) 1250-1254
8. Miller, K.L., Luh, W.M., Liu, T.T., Martinez, A., Obata, T., Wong, E.C., Frank, L.R., Buxton, R.B.: Nonlinear Temporal Dynamics of the Cerebral Blood Flow Response. Human Brain Mapping 13(1) (2001) 1-12
9. Buxton, R.B., Liu, T.T., Wong, E.C.: Nonlinearity of the Hemodynamic Response: Modeling the Neural and BOLD Contributions. Proc. 9th ISMRM (2001) 1164
10. Liu, H.L., Gao, J.H.: An Investigation of the Impulse Functions for the Nonlinear BOLD Response in Functional MRI. Magnetic Resonance Imaging 18(8) (2000) 931-938
11. Vazquez, A.L., Noll, D.C.: Nonlinear Aspects of the BOLD Response in Functional MRI. NeuroImage 7(2) (1998) 108-118
12. Aguirre, G.K., Zarahn, E., Esposito, M.D.: The Variability of Human BOLD Hemodynamic Responses. NeuroImage 8(4) (1998) 360-369
13. Chen, H., Yao, D.: An Extended Gamma Dynamic Model of fMRI BOLD Response. Neurocomputing 61 (2004) 395-400
14. Logothetis, N.K., Pauls, J., Augath, M., Trinath, T., Oeltermann, A.: Neurophysiological Investigation of the Basis of the fMRI Signal. Nature 412 (2001) 150-157
15. Rajapakse, J.C., Kruggel, F., Maisog, J.M., VonCramon, D.Y.: Modelling Hemodynamic Response for Analysis of Functional MRI Time-series. Human Brain Mapping 6(4) (1998) 283-300
16. Rosen, B.R., Buckner, R.L., Dale, A.: Event-related Functional MRI: Past, Present, and Future. Proc. Natl. Acad. Sci. USA 95 (1998) 773-780
17. Chen, H., Yao, D., Liu, Z.: A Study on Asymmetry of Spatial Visual Field by Analysis of the fMRI BOLD Response. Brain Topography 17(1) (2004) 39-46
18. Chen, H., Yao, D., Zhuo, Y., Chen, L.: Analysis of fMRI Data by Blind Separation into Independent Temporal Components. Brain Topography 15(4) (2003) 223-232
Cooperative Motor Learning Model for Cerebellar Control of Balance and Locomotion* Mingxiao Ding, Naigong Yu, and Xiaogang Ruan** Beijing University of Technology, Beijing 100022, China
[email protected] {yunaigong, adrxg}@bjut.edu.cn
Abstract. A computational model in the framework of reinforcement learning, called the Cooperative Motor Learning (CML) model, is proposed to realize cerebellar control of balance and locomotion. In the CML model, the cerebellum is a parallel pathway in which the vermis and the flocculonodular lobe act as an executor of reflex actions, while the intermediate zone of the cerebellum participates in initiating voluntary actions. During the training phase, the cerebral cortex provides the predictive error through the climbing fibers to modulate the concurrently activated synapses between parallel fibers and Purkinje cells. Meanwhile, the intermediate zone of the cerebellum computes the temporal difference (TD) error as a training signal for the cerebral cortex. In a simulation experiment on balancing a double inverted pendulum on a cart, a well-trained CML model can smoothly push the pendulum into the equilibrium position.
1 Introduction

The cerebellum has long been seen as a learning machine for motor control and motor skills [1], [2], [3], [4]. In all of these classical theories, an instructive signal via the inferior olive-climbing fiber system is the critical basis of successful motor learning. Therefore, how to obtain the instructive signal for the neural network of the cerebellum is key to understanding the physiological functions of the cerebellum. To address this problem, the idea that anticipations influence and guide motor control has been increasingly appreciated over the last decades. Learning by prediction and reward has been used to understand the learning and memory functions of the basal ganglia [5], hippocampus [6], and cerebral cortex [7]. Evidence shows that the anticipatory mechanism in the framework of reinforcement learning is suitable for identifying the cerebellum with one aspect of basic associative learning and memory: classical conditioning [8]. Inspired by recent progress in neurophysiology, we propose the hypothesis that the cerebral cortex is responsible for predicting the future instructive signal for cerebellar control, and that the cerebellum and the cerebral cortex compose an overall actor-critic reinforcement learning agent. Our current
* This work is supported by National Natural Science Foundation of China (60375017).
** Corresponding author.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 27 – 33, 2006. © Springer-Verlag Berlin Heidelberg 2006
28
M. Ding, N. Yu, and X. Ruan
work is focused on the cooperative learning scheme of the cerebellum and the cerebral cortex during the control and adaptation of dynamic motor balance, which is a fundamental issue for humanoid robots. In this paper, we attempt to model this Cooperative Motor Learning (CML) scheme in the framework of reinforcement learning. Because recurrent neural activity exists across both cortical modules [9], [10] and subcortical areas [11], both the cerebellum and the cerebral cortex are realized using recurrent networks and the real-time recurrent learning (RTRL) algorithm [12]. The effectiveness of the CML model is demonstrated by the task of balancing a double inverted pendulum on a cart.
2 The Cooperative Motor Learning Model

The CML model is in the framework of temporal difference (TD) learning, which aims to deal with reinforcement learning problems with delayed reward. Since the neural circuits of the cerebellum and the cerebral cortex are taken as an integral agent in this paper, everything outside the integral agent is labeled as the "environment". Sensory inputs from the environment stimulate both the cerebellum and the cerebral cortex, and the action output to the environment is the response of the integral agent. The cerebellum is responsible for action decisions, and the cerebral cortex is responsible for predicting the future instructive signal for the action network based on the current sensory inputs and the current action output. The learning aim of the cerebellum is to minimize the square of the predictive error derived from the cerebral cortex:

E_cerebellum = (ê(t))²  (1)
The reward from the environment is computed according to:

r(t) = 0 if the action is successful, and r(t) = −1 if the action has failed  (2)
Thus the expected temporal difference (TD) error for the cerebral cortex is computed [6] according to:

δ(t) = r(t) + γ·ê(t) − ê(t − 1)  (3)
The learning aim of the cerebral cortex is to minimize the goal:

E_cerebrum = (δ(t))²  (4)
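The training signals of Eqs. (1)-(4) can be traced on a toy episode. In the sketch below the critic's predictions ê(t) are simply a made-up sequence (in the CML model they come from the cerebral-cortex network), the discount factor γ is an assumed value, and the final step is a failure, so r = −1 there and 0 before, per Eq. (2):

```python
GAMMA = 0.95  # discount factor (assumed value; not specified in the excerpt)

def td_error(r, e_hat_now, e_hat_prev, gamma=GAMMA):
    """Eq. (3): delta(t) = r(t) + gamma * e_hat(t) - e_hat(t-1)."""
    return r + gamma * e_hat_now - e_hat_prev

# A hypothetical 4-step episode: predictions e_hat at steps 0..4,
# with a failure (reward -1) at the last step and reward 0 before it.
e_hat = [0.0, -0.1, -0.3, -0.6, -0.9]
rewards = [0.0, 0.0, 0.0, -1.0]
deltas = [td_error(rewards[t], e_hat[t + 1], e_hat[t]) for t in range(4)]

cerebellum_losses = [e * e for e in e_hat[1:]]  # Eq. (1): squared predictive error
cerebrum_losses = [d * d for d in deltas]       # Eq. (4): squared TD error
```

The TD error is small while the predictions track each other, and most negative at the failure step, which is exactly the surprise the critic is trained to predict away.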
In the CML model, the reward signal guarantees that the behavior of an animal is generally goal directed; an interesting question is thus where the reward signal comes from.

2.1 Cerebellum
Based on afferent and efferent connectivity, the cerebellum is functionally divided into the medial zone (vermis), the intermediate zone, the lateral zone and the flocculonodular lobe. It is believed that the vermis and the flocculonodular lobe may play a
strong role in the control of balance and locomotion by modulating limb flexor and extensor muscle activity [13]. The intermediate zone receives abundant inputs not only from receptors in muscles, joints and skin, but also from the cerebral cortex via the corticopontocerebellar system. The intermediate zone projects to the red nucleus and to the cerebral cortex via the thalamus. Therefore, the intermediate zone has been proposed to serve as a "comparator" which indicates the mismatch between the expected and actual outcome of an action [14], [15]. These facts lead us to speculate that the cerebellum functions both as an action executor and as a sensory comparator, as depicted in Fig. 1. In the CML model, the cerebellum outputs an action response corresponding to the sensory inputs, and the cerebral cortex provides the predictive error through the climbing fibers to modulate the concurrently activated synapses between parallel fibers and Purkinje cells. Meanwhile, the intermediate zone of the cerebellum computes the TD error as a training signal for the cerebral cortex. When the cerebral cortex has built up a forward predictive model of the outer environment, the cerebellum can be trained as an inverse model of the environment. If the inverse model is accurate, the cerebellum can produce reflex action responses that shift the body into the equilibrium position. This whole process is a "trial and error" learning scheme; the resulting cooperative motor learning scheme is depicted in Fig. 2. Based on sound anatomical knowledge, the formula description of the vermis is given by equations (5)-(7):

I_action(i)(t) = X_sensor(i)(t),  i = 1, 2, …, N  (5)
Fig. 1. Subdivided function zones of the cerebellum and its signal relations with the cerebral cortex. In the intermediate zone, it is assumed that there is built-in knowledge about the equilibrium position in the cerebellum, such as a small region around the vertical position. Movement with a variable expected position may be related to motion planning and is outside the scope of this paper.
Fig. 2. The architecture of the Cooperative Motor Learning model. The instructive signals for the cerebral cortex and the cerebellum are shown in dashed lines.
O_action(t) = F(s(t))  (6)

s(t) = Σ_i (W_input(i) · I_action(i)(t)) + W_recurrent · O_action(t − 1)  (7)
where X_sensor(t) is the sensory input from the eyes, the muscles and joints of the body, and the vestibular system. This afferent information is delivered to the granule cells via mossy fibers, and parallel fibers carry the outputs of the granule cells to the output neurons of the cerebellar cortex, the Purkinje cells. The motor impulses, denoted by O_action(t), are sent to the muscles of the body to maintain balance during standing, walking and running. During the learning phase, the climbing fibers derived from the inferior olive carry an instructive signal to adjust the synaptic strength between the Purkinje cells and the parallel fibers.

2.2 Cerebral Cortex
Admittedly, the real neural circuits in the cerebral cortex are remarkably complex. On the other hand, the excitation of the neurons in the cerebral cortex is problem dependent, so a four-layer recurrent network is employed here to simulate the anticipative function in the humanoid motor balance task. This anticipative network is active when the action output of the cerebellum is unfavorable to the motor balance of the body. When the cerebellum can balance the body precisely, the activity in the anticipative network should be inhibited. The formula description of the anticipative network in the cerebral cortex is given by equations (8)-(15):

I_critic(i)(t) = X_sensor(i)(t),  i = 1, 2, …, M − 1  (8)

I_critic(M)(t) = O_action(t)  (9)

H^1_j(t) = F_j(r^1_j(t)),  j = 1, 2, …, P1  (10)

r^1_j(t) = Σ_i (V_input(ji) · I_critic(i)(t)) + Σ_k (V^1_recurrent(jk) · H^1_k(t − 1)),  k = 1, 2, …, P1  (11)

H^2_s(t) = F_s(r^2_s(t)),  s = 1, 2, …, P2  (12)

r^2_s(t) = Σ_j (V_hidden(sj) · H^1_j(t)) + Σ_u (V^2_recurrent(su) · H^2_u(t − 1)),  u = 1, 2, …, P2  (13)

O_critic(t) = Σ_s (V_output(s) · H^2_s(t))  (14)

ê(t) = O_critic(t)  (15)
where X_sensor(t) is a copy of the sensory inputs into the cerebellum, and ê(t) is the predictive instructive signal for the cerebellum. The distributed recurrent loop in every hidden layer is essential to capture the nonlinear dynamics by forming a short-term memory. During the learning phase, the weights in the cerebral cortex are updated using the real-time recurrent learning (RTRL) algorithm [12].
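A self-contained forward pass of the actor of Eqs. (5)-(7) and the critic of Eqs. (8)-(15) can be sketched in pure Python. The random small weights, the tanh activations, and the layer sizes are all illustrative assumptions (the paper does not specify F, and trains the weights with RTRL rather than leaving them fixed):

```python
import math
import random

rng = random.Random(42)
N, M, P1, P2 = 4, 5, 6, 6  # sensory inputs; critic inputs (M = N + 1); hidden sizes

def w():
    return rng.uniform(-0.5, 0.5)

# Actor (vermis) weights, Eqs. (5)-(7): one output unit with a self-recurrent loop.
W_input = [w() for _ in range(N)]
W_recurrent = w()

# Critic (cerebral cortex) weights, Eqs. (8)-(15): two recurrent hidden layers.
V_input = [[w() for _ in range(M)] for _ in range(P1)]
V_rec1 = [[w() for _ in range(P1)] for _ in range(P1)]
V_hidden = [[w() for _ in range(P1)] for _ in range(P2)]
V_rec2 = [[w() for _ in range(P2)] for _ in range(P2)]
V_output = [w() for _ in range(P2)]

def actor_step(x_sensor, o_prev):
    """Eqs. (5)-(7): O_action(t) = F(sum_i W_input(i)*x_i + W_recurrent*O_action(t-1))."""
    s = sum(wi * xi for wi, xi in zip(W_input, x_sensor)) + W_recurrent * o_prev
    return math.tanh(s)

def critic_step(x_sensor, o_action, h1_prev, h2_prev):
    """Eqs. (8)-(15): predict the instructive error e_hat(t) = O_critic(t)."""
    i_critic = list(x_sensor) + [o_action]  # Eqs. (8)-(9)
    h1 = [math.tanh(sum(V_input[j][i] * i_critic[i] for i in range(M))
                    + sum(V_rec1[j][k] * h1_prev[k] for k in range(P1)))
          for j in range(P1)]               # Eqs. (10)-(11)
    h2 = [math.tanh(sum(V_hidden[s][j] * h1[j] for j in range(P1))
                    + sum(V_rec2[s][u] * h2_prev[u] for u in range(P2)))
          for s in range(P2)]               # Eqs. (12)-(13)
    e_hat = sum(V_output[s] * h2[s] for s in range(P2))  # Eqs. (14)-(15)
    return e_hat, h1, h2

# Roll the coupled actor-critic forward for a few steps on a constant input.
o_action, h1, h2 = 0.0, [0.0] * P1, [0.0] * P2
x = [0.1, -0.2, 0.05, 0.3]
trace = []
for _ in range(5):
    o_action = actor_step(x, o_action)
    e_hat, h1, h2 = critic_step(x, o_action, h1, h2)
    trace.append(e_hat)
```

Because every activation is bounded by tanh and the output weights lie in [−0.5, 0.5], the predicted error stays bounded; the recurrent states h1 and h2 carry the short-term memory mentioned above.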
3 Simulation Experiments and Results

The double inverted pendulum on a cart (DIPC) is adopted as the control object because the nonlinear dynamics of human legs can be well simulated by a double inverted pendulum model. Here the dynamic equation of the DIPC system is obtained from Lagrange's equation of motion as used in [16]. The training phase consisted of a number of trials, setting the DIPC system to an initial state chosen at random between ±5° for both the bottom and the top pendulum. Each trial ran until a failure signal occurred or the plant reached 12,000 steps successfully. The sample time was chosen as 0.02 seconds. The learning performance is shown in Fig. 3, where the number of steps per trial is plotted versus the number of trials.
Fig. 3. Learning curve for the CML model
In Fig. 3, the stand-up duration rises sharply at around 220 trials. This result is plausible because similar individual learning curves can be seen in [17]. Once the action network in the cerebellum has been trained, the model is
tested using the initial state θ_1 = 9°, θ_2 = −9°, where θ_1 is the angle of the bottom pendulum and θ_2 the angle of the top pendulum. The control performance is shown in Fig. 4, where the time courses of the bottom and top pendulums are depicted.
Fig. 4. Controlling performance of the CML model
A 3-D phase plane of the bottom pendulum is shown in Fig. 5 to illustrate the nonlinear dynamics during the control process. The figure suggests that the CML model has learned to control the balancing task. In the presence of a steady-state input, the weights in the CML model successfully approach the fixed-point attractor.
Fig. 5. 3-D phase plane of the bottom pendulum
These experimental results show that the CML model can acquire the skill of keeping the DIPC upright, finally converging to the equilibrium position, without any prior knowledge or additional instructive signal. Although the control performance of the trained CML model is less efficient than that of the fuzzy controller proposed in [16], these preliminary results reveal the CML model's potential to explain or represent the neural dynamics in the cerebellum. In future work, more refined learning principles can be developed to improve the control performance of the trained CML model.
CML Model for Cerebellar Control of Balance and Locomotion

4 Conclusions
Both anticipatory and feedback mechanisms are used in the cerebellar control of balance and locomotion. The CML model proposed here represents the parallel pathway in the cerebellum and the correlation between the cerebellum and the cerebral cortex. The cerebral cortex plays a crucial role during the adaptation of the cerebellar control pattern. Once a new control pattern is learned by the cerebellum, the cerebellum can respond to sensory stimuli automatically, without the involvement of the cerebral cortex. The experimental results showed that the CML model can learn the new balancing task from scratch, and that a well-trained CML model can smoothly drive the pendulum to the equilibrium position. It remains to test the model on further tasks, and to compare it with other cerebellar learning and control models. To cope with the varying complexities of controlled objects, an evolving neural architecture for the model is also to be considered.
References
1. Albus, J.S.: A Theory of Cerebellar Function. Mathematical Biosciences 10 (1971) 25-61
2. Marr, D.: A Theory of Cerebellar Cortex. Journal of Physiology 202 (1969) 437-470
3. Ito, M.: Neural Design of the Cerebellar Motor Control System. Brain Research 40 (1972) 81-84
4. Kawato, M. and Gomi, H.: A Computational Model of Four Regions of the Cerebellum Based on Feedback Error Learning. Biological Cybernetics 69 (1992) 95-103
5. Schultz, W., Dayan, P., and Montague, P.R.: A Neural Substrate of Prediction and Reward. Science 275 (1997) 1593-1599
6. Schmajuk, N.A. and Dicarlo, J.J.: A Neural Network Approach to Hippocampal Function in Classical Conditioning. Behavioral Neuroscience 105 (1991) 82-110
7. Bao, S., Chan, V.T., and Merzenich, M.M.: Cortical Remodelling Induced by Activity of Ventral Tegmental Dopamine Neurons. Nature 412 (2001) 79-83
8. Thompson, R.F., Thompson, J.K., Kim, J.J., Krupa, D.J., and Shinkman, P.G.: The Nature of Reinforcement in Cerebellar Learning. Neurobiology of Learning and Memory 70 (1998) 150-176
9. Fuster, J.M.: Executive Frontal Functions. Experimental Brain Research 133 (2000) 66-70
10. Suri, R.E. and Sejnowski, T.J.: Spike Propagation Synchronized by Temporally Asymmetric Hebbian Learning. Biological Cybernetics 87 (2002) 440-445
11. Houk, J.C., Buckingham, J.T., and Barto, A.G.: Models of the Cerebellum and Motor Learning. Behavioral and Brain Sciences 19 (1996) 368-383
12. Williams, R.J. and Zipser, D.: A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation 1 (1989) 270-280
13. Morton, S.M. and Bastian, A.J.: Cerebellar Control of Balance and Locomotion. The Neuroscientist 10 (2004) 247-259
14. Stein, J.F. and Glickstein, M.: Role of the Cerebellum in Visual Guidance of Movement. Physiological Reviews 74 (1992) 967-1017
15. Matsumura, M., Sadato, N., Kochiyama, T., Nakamura, S., Naito, E., Matsunami, K., Kawashima, R., Fukuda, H., and Yonekura, Y.: Role of the Cerebellum in Implicit Motor Skill Learning: A PET Study. Brain Research Bulletin 63 (2004) 471-483
16. Yi, J.Q., Yubazaki, N., and Hirota, K.: Stabilization Control of Series-type Double Inverted Pendulum Systems Using the SIRMs Dynamically Connected Fuzzy Inference Model. Artificial Intelligence in Engineering 15 (2001) 297-308
17. Barto, A.G., Sutton, R.S., and Anderson, C.: Neuron-like Adaptive Elements That Can Solve Difficult Learning Control Problems. IEEE Transactions on Systems, Man, and Cybernetics 13 (1983) 834-846
A Model of Category Learning with Attention Augmented Simplistic Prototype Representation

Toshihiko Matsuka

Center for Decision Technologies, Stevens Institute of Technology, Hoboken, NJ 07030, USA
[email protected]
Abstract. Cognitive models have been a main tool for quantitatively testing theories of human cognition. The results of previous cognitive modeling research collectively suggest a comparative advantage of exemplar over prototype accounts of human cognition. However, we hypothesized that the unsuccessful outcomes of traditional prototype models may be unforeseen consequences of the algorithmic constraints imposed on the models, not of the implausibility of the theory itself. To test this hypothesis, a new cognitive model based on prototype theory with a more complex and realistic attention system is introduced and evaluated in the present study. A simulation study shows that the new model, termed CASPRE, yields a substantial improvement over the traditional prototype model in replicating empirical findings, and that it performs marginally better than an exemplar model, thus confirming our hypothesis.
1 Introduction
There are two main competing theories of how humans internally represent categorical knowledge. Exemplar theory suggests that humans categorize an object by utilizing its similarities to many, if not all, memorized exemplars they previously encountered, while prototype theory assumes that humans utilize similarities between the input object and a smaller number of summarized reference points, i.e., prototypes. The results of previous cognitive modeling research collectively suggest a comparative advantage of exemplar over prototype accounts of internal representation in human cognition. For example, for one of the most extensively tested stimulus sets, namely Medin and Schaffer's 5-4 stimulus set shown in Table 1 (referred to as MS78 hereafter), exemplar models have been more successful than prototype models in replicating empirical observations [1]. Moreover, although it is not always observed, a psychological phenomenon called the A2 advantage, a tendency for people to categorize the less "prototypical" stimulus A2 (see Table 1) more accurately than the more "prototypical" stimulus A1, has been pointed to as supporting evidence for exemplar theory and challenging evidence against prototype theory. Another simple yet robust piece of evidence against traditional prototype models is a category structure with exclusive-or (XOR) logic. Consider a two-dimensional XOR stimulus set (i.e., [0 0, 1 1] for Category A, [0 1, 1 0] for Category B), whose prototypes for Categories A and B would be theoretically identical, i.e., [0.5, 0.5]. By internally representing categorical knowledge with these identical prototypes, previous prototype models with traditional selective attention mechanisms have not been able to categorize, or learn to categorize, these stimuli, because the models' mathematical formulations yield identical psychological similarity measures between any input stimulus and both prototypes, providing no constructive information regarding the category to which the input stimulus should belong. These previous prototype models embody two key assumptions about their selective attention mechanisms that result in the invariant similarities for the simple XOR stimulus set, and possibly in other unsuccessful replications of empirical findings. The first key assumption is that, like many other similarity-based models of categorization, the models assume that the selective attention mechanism operates on a dimension-by-dimension basis (i.e., selective attention is applied orthogonally). Secondly, the models also assume that the dimensional selective attention process is applied universally across the entire psychological stimulus space. These two assumptions have been widely accepted and incorporated in virtually all computational models of human categorization and category learning, and many of these models have been shown to be successful in replicating several empirical phenomena (e.g., [2]). However, in spite of their prevalent acceptance and recognized successes in some applications, the present research casts doubt on these two key assumptions about the selective attention mechanisms incorporated in the models. If attention processes were uniquely applied to individual prototypes, and if humans were capable of attending to correlations among feature dimensions, would prototype theory still be psychologically less plausible than exemplar theory?

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 34-40, 2006. © Springer-Verlag Berlin Heidelberg 2006
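The XOR tie described here, and the way per-prototype, correlation-sensitive attention breaks it, can be checked numerically. The attention matrices below are illustrative values of our own choosing, not parameters from the paper.

```python
import numpy as np

# XOR stimulus set: both category prototypes collapse to [0.5, 0.5].
stimuli = {(0, 0): 'A', (1, 1): 'A', (0, 1): 'B', (1, 0): 'B'}
proto = np.array([0.5, 0.5])

def dist(x, alpha):
    # Quadratic-form distance: sum_im alpha_im (pi_i - x_i)(pi_m - x_m)
    delta = proto - np.asarray(x, dtype=float)
    return float(delta @ alpha @ delta)

# Traditional models: one shared, diagonal attention matrix. Since the two
# prototypes coincide, every stimulus gets the same distance, so similarity
# carries no category information at all.
shared = np.eye(2)
assert len({round(dist(x, shared), 12) for x in stimuli}) == 1  # all four tie

# Per-prototype, correlation-sensitive attention: off-diagonal entries orient
# each attention field along its category's feature correlation.
alpha_A = np.array([[1.0, -0.9], [-0.9, 1.0]])  # A items (00, 11): deviation products positive
alpha_B = np.array([[1.0,  0.9], [ 0.9, 1.0]])  # B items (01, 10): deviation products negative
for x, cat in stimuli.items():
    closer_to_A = dist(x, alpha_A) < dist(x, alpha_B)
    assert closer_to_A == (cat == 'A')   # each stimulus now favors its own category
```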
A recent simulation study suggested possible interdependencies between models' internal representation systems and selective attention mechanisms with respect to their effectiveness in replicating human category learning, and that the traditional selective attention mechanisms described above, incorporated into prototype models, might not be the most sensible ones [3]. The main objective of the present work is not to reiterate the debate over the validity of prototype and exemplar accounts of human cognition per se, but to raise the concern that unsuccessful outcomes of some previous prototype models may be unforeseen consequences of the algorithmic constraints imposed on the models, not of the implausibility of the theory itself. In particular, the present study introduces a new model of category learning based on prototype theory, with a new feedforward algorithm consisting of more complex yet plausible selective attention mechanisms and a backward (learning) algorithm that emphasizes human tendencies in knowledge acquisition.
2 A New Model of Category Learning: CASPRE
Our new model is termed CASPRE, for Category learning with Attention augmented Simplistic Prototype REpresentation. It is a cognitive model based on prototype theory, and thus it assumes that categorical knowledge is organized around a small number of prototypes and that humans utilize the psychological similarity between an input stimulus and the prototypes to categorize the input. As its name suggests, CASPRE comprises two important components. First, it assumes a somewhat complex attention process, namely a local attention process (i.e., each prototype having its own selective attention process) and sensitivity to correlations among features (paying attention to correlations among feature dimensions), to form attention-augmented prototype representations. The selective attention may be interpreted as a process of mental rotation and psychological scaling of the proximity or similarity between input stimuli and prototypes. Thus, in CASPRE each prototype is augmented with a unique selective attention process to form a uniquely shaped and oriented prototype conceptual field. Therefore, unlike in traditional prototype models, the characteristics of prototypes cannot be explained by centroids alone, but by both centroids and within-prototype psychological scaling processes. Some empirical studies provide supporting evidence for CASPRE's attention mechanism (e.g., local attention processes [4] and correlation sensitivity [5]). Another important component is the principle of simplicity in high-order human cognition. One plausible theoretical justification of prototype theory is its compact representation of knowledge, allowing the limited-processing-capacity human brain to handle rich information. CASPRE follows the simplicity principle, but the incorporation of complex attention mechanisms can confound the principle. To avoid this possible interference, CASPRE incorporates a simple multi-objective learning algorithm, trying to acquire manageably simple yet sufficiently accurate concepts.

2.1 Forward Algorithm
In CASPRE, the psychological similarity or distance, d_j(x), between an input stimulus x and prototype j, π_j, is defined by the Mahalanobis distance (in quadratic form) between them, allowing it to be sensitive to correlations among feature dimensions. Thus,

d_j(x) = Σ_i Σ_m α^j_im (π_ji − x_i)(π_jm − x_m)    (1)

where α^j defines the directions and strengths of the attention field for π_j, the subscripts i and m indicate feature dimensions, and I is the number of feature dimensions. Note that it is assumed that α^j_im = α^j_mi, (α^j_im)² ≤ |α^j_ii · α^j_mm|, and α^j_ii ≥ 0 for all i. For off-diagonal entries (i.e., i ≠ m), an attention weight can be negative; its sign indicates the direction of the attention field, while its magnitude indicates the strength of attention. The psychological distance measures activate the prototype units via

h_j = exp(−c · d_j(x))    (2)

where c controls overall sensitivity. The prototype-unit activations are then fed forward to the category output nodes, or

O_k(x) = Σ_j w_kj h_j    (3)

where w_kj is the association weight between π_j and category node k. The output activations are used to obtain response probabilities by the following function: P(k) = exp(φ·O_k) / Σ_l exp(φ·O_l), where φ scales the decisiveness of responses. In short, CASPRE assumes that humans utilize the psychological similarity between the input object and the prototypes, psychologically scaled by correlation-sensitive local selective attention processes, as evidence for categorizing the input instance into the most probable category.
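The forward pass of Eqs. 1-3 plus the response-probability rule can be sketched compactly. The array shapes and the two-prototype example at the bottom are our own illustrative assumptions, not values from the paper.

```python
import numpy as np

def caspre_forward(x, protos, alphas, W, c=1.0, phi=1.0):
    """Forward pass of Eqs. 1-3 plus the response-probability rule.
    protos: (J, I) prototype centroids; alphas: (J, I, I) per-prototype
    attention matrices; W: (K, J) association weights."""
    deltas = protos - x                                    # (J, I)
    d = np.einsum('ji,jim,jm->j', deltas, alphas, deltas)  # Eq. 1, one distance per prototype
    h = np.exp(-c * d)                                     # Eq. 2
    O = W @ h                                              # Eq. 3
    expO = np.exp(phi * O)
    return expO / expO.sum()                               # P(k)

# Hypothetical two-prototype, two-category setup on four binary dimensions.
protos = np.array([[1., 1., 1., 1.], [0., 0., 0., 0.]])
alphas = np.stack([np.eye(4), np.eye(4)])                  # diagonal attention here
W = np.eye(2)
p = caspre_forward(np.array([1., 1., 1., 0.]), protos, alphas, W)
```

With the diagonal attention matrices above this reduces to a traditional prototype model; the local, correlation-sensitive behavior comes from giving each `alphas[j]` its own off-diagonal entries.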
2.2 Learning Algorithm
CASPRE incorporates the gradient-descent version of stochastic context-dependent (multi-objective) learning algorithms [6] [7]. In particular, CASPRE assumes that in ordinary situations humans prefer and try to acquire compact yet accurate knowledge. Thus, its objective functional consists of at least two sub-objectives, accuracy and simplicity, or

Υ(w_kj, α^j_im) = (1/2) Σ_k (t_k − O_k)² + Σ_k Σ_j (w_kj)² + Σ_j Σ_{i=1..I−1} Σ_{m=i+1..I} [(α^j_im)² / Z] / [1 + (α^j_im)² / Z]    (4)

where t_k is the target output for category node k, and Z = Σ_j Σ_{i=1..I−1} Σ_{m=i+1..I} (α^j_im)². The first term defines categorization accuracy; the second term is a weight-decay function regularizing the association weights; and the third term is an attention-elimination function [7], reducing the number of attended dimensions and inter-dimensional correlations on the basis of their relative attention strengths. A framework [6] [7] that allows more general multi-objective, context-dependent learning can also be applied to CASPRE. The learnable coefficients w and α are updated by the following functions:

Δw_kj = λ^E_w e_k h_j − 2 λ^Ω_w w_kj + ν_kj    (5)

Δα^j_im = −λ^E_α c Σ_k e_k w_kj h_j δ_ji δ_jm − λ^A_α [2 α^j_im Σ_{l∉{i,m}} (α^j_l)²] / [2(α^j_im)² + Σ_{l∉{i,m}} (α^j_l)²]² + ν_im    (6)

where e_k = (t_k − O_k), δ_ji = (π_ji − x_i), h_j is as given in Eq. 2, the λ's are learning rates, and the ν's are independent Gaussian noise terms with means equal to 0 and time-decreasing standard deviations. Noise is introduced into CASPRE's learning because recent cognitive modeling studies indicate the importance of stochasticity, or noise, in human learning for qualitative fits and quantitative interpretations [6] [7]. The centroid of prototype j is updated with a simple competitive learning algorithm [8] [9], thus

π_j(t+1) = π_j(t) + η(t) (x − π_j(t))  if x and π_j belong to the same category,
π_j(t+1) = π_j(t)  otherwise,    (7)

where η is a time-decreasing learning rate.
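As a minimal sketch, the weight update of Eq. 5 and the competitive centroid update of Eq. 7 might be written as follows. The attention update of Eq. 6 is omitted here, and the noise standard deviation is held constant for brevity, whereas the paper lets it decrease over time; the rates are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)

def update_weights(W, e, h, lam_E=0.3, lam_O=1e-4, noise_sd=0.01):
    """Eq. 5: error-driven term, weight decay, and Gaussian learning noise.
    W: (K, J) association weights; e: (K,) output errors; h: (J,) prototype activations."""
    nu = rng.normal(0.0, noise_sd, size=W.shape)
    return W + lam_E * np.outer(e, h) - 2 * lam_O * W + nu

def update_prototype(pi, x, same_category, eta=0.05):
    """Eq. 7: simple competitive update, moving the matching centroid toward x."""
    return pi + eta * (x - pi) if same_category else pi
```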
3 Simulations
In order to investigate the capability of CASPRE to replicate observed empirical data, a simulation study was conducted. In particular, we simulated a classical study in human category learning [1] for which prototype models previously had trouble replicating the observed classification profile. Table 1 shows a schematic representation of the stimulus set used in the present simulation study. In the empirical study, subjects were asked to learn to classify nine unique training exemplars (A1-B4) into either Category A or B with corrective feedback.

Table 1. Schematic representation of stimulus set used in simulations (MS78)

        Training                    Transfer
ID  Cat  D1  D2  D3  D4        ID  D1  D2  D3  D4
A1   A    1   1   1   0        T1   1   0   0   1
A2   A    1   0   1   0        T2   1   1   1   1
A3   A    1   0   1   1        T3   0   1   0   1
A4   A    1   1   0   1        T4   0   0   1   1
A5   A    0   1   1   1        T5   1   0   0   0
B1   B    1   1   0   0        T6   0   0   1   0
B2   B    0   1   1   0        T7   0   1   0   0
B3   B    0   0   0   1
B4   B    0   0   0   0

The training session was followed by a transfer session in which subjects were asked to categorize the nine training exemplars and seven novel exemplars. Two empirically observed criterion response profiles are involved in the present study, namely the original MS78 [1] and the average classification response profile from 30 independent replications (including the original study) of MS78 summarized in Smith and Minda [11] (referred to as SM00). The two criterion profiles were employed to facilitate fair comparisons, as Smith and Minda [11] claim that the original MS78 data dubiously favor the exemplar models. The first and fourth rows of Table 2 show the observed classification profiles in the transfer session in MS78 and SM00, respectively. Three models were involved in the present simulation study, namely CASPRE, a prototype model (PRO) with the traditional selective attention mechanism (i.e., α^j_im = α^l_im, ∀j, l and α^j_im = 0, ∀i ≠ m), and a typical successful exemplar model (EXE) with the same selective attention mechanism as PRO (e.g., [10], see also [3]). The only difference between EXE and PRO was their reference points: in EXE, all nine unique training exemplars were used as the reference points, and the locations of these nine reference exemplars were static [10]. PRO and EXE were included to evaluate the descriptive validity of attention-augmented prototype representation.

Methods: The models were run in a simulated training procedure with 50 trial blocks, where each block consisted of a random presentation of the nine unique training exemplars exactly once, in order to learn the correct classification responses for the stimulus set. The model parameters were optimized so that the sum of squared errors (SSE) between the observed and predicted classification profiles was minimized. The selected parameters were as follows. For the MS78 replication: c = 1.17, φ = 1.96, λ^E_w = 0.30, λ^Ω_w = 0.0001, λ^E_α = 0.36, λ^A_α = 0.002 for CASPRE; c = 1.38, φ = 2.11, λ^E_w = 0.43, λ^E_α = 0.11 for PRO; and c = 2.01, φ = 1.82, λ^E_w = 0.12, λ^E_α = 0.006 for EXE. For SM00: c = 4.13, φ = 2.48, λ^E_w = 0.30, λ^Ω_w = 0.0003, λ^E_α = 0.10, λ^A_α = 0.002 for CASPRE; c = 1.379, φ = 2.113, λ^E_w = 0.434, λ^E_α = 0.106 for PRO; and c = 1.01, φ = 1.56, λ^E_w = 0.17, λ^E_α = 0.04 for EXE. λ^Ω_w and λ^A_α were held at 0 for PRO and EXE to simulate traditional prototype and exemplar models. Each model was replicated 100 times.
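The training protocol of the Methods (50 blocks, each presenting the nine training exemplars once in random order, scored by SSE against an observed profile) can be sketched as a harness. `model_update` and `model_predict` are hypothetical stand-ins for whichever of the three models is being fit.

```python
import random

def train_and_score(model_update, model_predict, training_items, observed, n_blocks=50):
    """Run the simulated training procedure, then score against an observed profile.
    training_items: list of (stimulus, category) pairs; observed: list of observed
    response probabilities, one per item, in the same order."""
    for _ in range(n_blocks):
        block = list(training_items)
        random.shuffle(block)                 # each exemplar exactly once per block
        for x, cat in block:
            model_update(x, cat)
    preds = [model_predict(x) for x, _ in training_items]
    return sum((o - p) ** 2 for o, p in zip(observed, preds))   # SSE criterion
```

In the paper the free parameters (c, φ, and the λ's) are then chosen to minimize this SSE, and the whole procedure is replicated 100 times per model.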
Table 2. Results of the simulation study (MS78 & SM00: observed human data [1] [11]; PRO: prototype model; CAS: CASPRE; EXE: exemplar model)

                                    Probability of 'A' Response
Model  SSE    A1   A2   A3   A4   A5   B1   B2   B3   B4   T1   T2   T3   T4   T5   T6   T7
MS78   N/A   0.78 0.88 0.81 0.88 0.81 0.16 0.16 0.12 0.03 0.59 0.94 0.50 0.62 0.31 0.34 0.16
PRO    0.197 0.84 0.75 0.94 0.76 0.76 0.32 0.32 0.25 0.03 0.67 0.97 0.30 0.67 0.24 0.24 0.04
CAS    0.023 0.83 0.89 0.87 0.83 0.83 0.19 0.14 0.14 0.10 0.58 0.90 0.53 0.60 0.29 0.32 0.10
EXE    0.020 0.80 0.88 0.87 0.85 0.85 0.17 0.17 0.16 0.11 0.62 0.93 0.51 0.62 0.33 0.33 0.09
SM00   N/A   0.83 0.82 0.89 0.79 0.74 0.30 0.28 0.15 0.11 0.62 0.88 0.40 0.55 0.40 0.34 0.17
PRO    0.086 0.81 0.73 0.91 0.71 0.69 0.36 0.34 0.30 0.06 0.65 0.95 0.34 0.63 0.32 0.30 0.07
CAS    0.015 0.84 0.84 0.86 0.74 0.71 0.25 0.30 0.19 0.11 0.60 0.84 0.41 0.58 0.35 0.35 0.18
EXE    0.034 0.86 0.79 0.93 0.70 0.66 0.39 0.34 0.28 0.04 0.64 0.97 0.30 0.60 0.34 0.30 0.05
Results and Discussion: Table 2 shows the results of the simulation study. There were dramatic decreases in SSE for CASPRE as compared with the traditional prototype model (PRO): the SSEs decreased by approximately 90% for MS78 and 85% for SM00. CASPRE performed almost as well as EXE for MS78, and its SSE was less than half that of EXE for SM00. Overall, CASPRE was the best-performing of the three models tested. To the best of our knowledge, CASPRE is the first prototype model to perform better than EXE on this particular stimulus set. Furthermore, although CASPRE is in fact a prototype model, it successfully replicated the A2 advantage in the MS78 replication (Table 2, third row), categorizing the less "prototypical" stimulus A2 more accurately than the more "prototypical" stimulus A1. This is because CASPRE perceived A2 to be more similar to the prototype of Category A (Prototype A) than A1. This was caused by CASPRE's attention process, which elongated the concept coverage area of Category A (or Prototype A) in the direction of the least prototypical stimulus characteristics, namely 0s in both Dimensions 2 and 4, in order to reduce classification error. This in turn increased the perceived similarity between Prototype A and stimulus A2, above and beyond the fact that the two most diagnostic dimensions (i.e., Dimensions 1 and 3) provide supporting evidence that A2 may belong to Category A. We also conducted a simulation study to test whether CASPRE is capable of learning to categorize the simple XOR logic (i.e., [0 0; 1 1] for Category A and [0 1; 1 0] for Category B), which has also been pointed to as evidence against previous prototype models. CASPRE successfully learned the XOR logic. In sum, CASPRE's performance in replicating the observed data was on average better than that of a typical successful exemplar model. In addition, the present simulation studies showed that CASPRE's attention-augmented representation system significantly improved a prototype model of category learning in replicating empirical psychological phenomena.
4 Conclusion
Cognitive modeling research has been the main vehicle for quantitatively testing theories of human cognition. Prototype theory has been criticized, or has received less favorable reviews, because the models based on prototype theory used in these studies often resulted in less accurate replications of empirical data. Simulation studies on MS78 were among those most frequently utilized for criticizing the theory. However, with a proper attention system, a prototype model, CASPRE, performed as well as or better than an exemplar model, showing no comparative advantage of exemplar theory over prototype theory. This in turn casts doubt on the validity of previous criticisms of prototype theory based on simulation studies using MS78, and possibly other category structures as well. Rather, those criticisms or unsuccessful simulation results should be attributed to the mismatch between the theory and the attention system integrated into the models, not to the theory itself. Incorporating more realistic attention mechanisms in prototype models may facilitate fairer comparisons between prototype and exemplar theories, in order to better understand the nature of human categorization processes.
Acknowledgements This research was supported in part by the Office of Naval Research, Grant # N0001405-1-00632.
References
1. Medin, D. L., Schaffer, M. M.: Context Theory of Classification Learning. Psychological Review 85 (1978) 207-238
2. Nosofsky, R. M.: Attention, Similarity and the Identification-Categorization Relationship. Journal of Experimental Psychology: General 115 (1986) 39-57
3. Matsuka, T.: Generalized Exploratory Model of Human Category Learning. International Journal of Computational Intelligence 1 (2004) 7-15
4. Aha, D. W., Goldstone, R. L.: Concept Learning and Flexible Weighting. In: Proceedings of the 14th Annual Meeting of the Cognitive Science Society. Lawrence Erlbaum Associates, Hillsdale, NJ (1992) 534-539
5. Anderson, J. R., Fincham, J. M.: Categorization and Sensitivity to Correlation. Journal of Experimental Psychology: Learning, Memory & Cognition 22 (1996) 119-128
6. Matsuka, T.: Simple, Individually Unique, and Context-Dependent Learning Method for Models of Human Category Learning. Behavior Research Methods 37 (2005) 240-255
7. Matsuka, T.: Modeling Human Learning as Context Dependent Knowledge Utility Optimization. In: Wang, L., Chen, K., Ong, Y. S. (eds.): Advances in Natural Computation. Lecture Notes in Computer Science, Vol. 3610. Springer-Verlag, Berlin (2005) 933-946
8. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer, Berlin (2001)
9. Love, B. C., Medin, D. L., Gureckis, T. M.: SUSTAIN: A Network Model of Category Learning. Psychological Review 111 (2004) 309-332
10. Kruschke, J. K.: ALCOVE: An Exemplar-Based Connectionist Model of Category Learning. Psychological Review 99 (1992) 22-44
11. Smith, J. D., Minda, J. P.: Thirty Categorization Results in Search of a Model. Journal of Experimental Psychology: Learning, Memory, and Cognition 26 (2000) 3-27
On the Learning Algorithms of Descriptive Models of High-Order Human Cognition

Toshihiko Matsuka1 and Arieta Chouchourelou2

1 Center for Decision Technologies, Stevens Institute of Technology, Hoboken, NJ 07030, USA
2 Rutgers University, Newark, NJ 07102, USA
Abstract. The primary focus of computational modeling research on high-order human cognition is to compare how realistically the embedded algorithms describe human cognitive processes. However, several current models incorporate learning algorithms that apparently have questionable descriptive validity or qualitative plausibility. The present research attempts to bridge this gap by identifying five critical issues overlooked by previous modeling research and then introducing a modeling framework that addresses these issues and offers better qualitative plausibility. A simulation study utilizing the present framework with two distinctive implementation approaches shows their descriptive validity.
1 Introduction
Artificial intelligence is by its nature an interdisciplinary science, and thus there are many sub-fields emphasizing different, sometimes competing, issues. While a majority of AI research has emphasized solving complex problems with algorithms that qualitatively have no implication with regard to real human intelligence, there is a line of research that strictly focuses on how algorithms or models mimic human behaviors of interest in order to test theories of human cognitive processes. Several simulation studies have suggested the predictive validity of some cognitive models (e.g., [1]). However, in spite of their success, some simulated cognitive processes incorporated in many cognitive models appear to have a normative justification (i.e., how humans should think), or their algorithms do not necessarily give qualitative interpretations that resemble real human cognitive processes. The present paper is an attempt to raise concerns regarding explicit and implicit theories and assumptions integrated in cognitive models of human learning that appear to have debatable descriptive validity (i.e., not describing real human cognitive processes). In particular, the present work focuses on the descriptive validity of the learning algorithms embedded in traditional models of human category learning or concept formation, and then introduces a new framework that offers more realistic qualitative interpretations and better descriptive validity. Note that the terms "category learning" and "concept formation" are used interchangeably throughout this paper, because conceptual knowledge is considered to be categorically organized.

1.1 Model of Category Learning
Forward Algorithm. Several models of category learning have been introduced, but the majority of recent successful models (e.g., [1]) can be considered variants of radial basis functions. A generic model may be formulated as follows:

O_k = Σ_j w_kj R(d(x, ψ_j, α))    (1)

where d is a function characterizing the perceived psychological distance (scaled by the dimensional selective attention coefficients α) between the input stimulus x and the j-th mentally represented reference point ψ_j (e.g., a memorized exemplar or a category prototype), both of which are characterized by I feature dimensions. R is an activation function for reference points, transforming perceived distance into activation (e.g., an exponential function). Each reference point is connected to all category output nodes, and the activations of the output nodes (O_k) are determined by the weighted sum, weighted by the association coefficients w, of the activations of the reference points. The output activations are used to obtain response probabilities by the following function: P(k) = exp(φ·O_k) / Σ_m exp(φ·O_m), where φ scales the decisiveness of responses. In short, the models assume that people utilize the psychological similarity between the input object (x) and some reference points (ψ) as evidence for categorizing the input instance into the most probable category.

Traditional Learning Algorithm. The majority of cognitive models of human category learning incorporate an online version of gradient descent, or Δθ_i = −λ ∂E/∂θ_i, where θ indicates the "learnable" concepts or model coefficients (e.g., w_kj and α_i in Eq. 1). The almost exclusive choice of error function is the sum of squared errors, E(θ) = (1/2) Σ_k (d_k − O_k)², where d_k is the desired output for category node k.
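A single online gradient-descent step for this generic formulation can be sketched as follows, using an exponential activation R and the analytic gradient ∂E/∂w_kj = −(d_k − O_k)·R_j. All sizes and rates are illustrative assumptions.

```python
import numpy as np

def online_gd_step(w, x, refs, alpha, target, c=1.0, lam=0.1):
    """One online gradient-descent step on the association weights of the
    generic model of Eq. 1. refs: (J, I) reference points; alpha: (I,)
    dimensional attention; w: (K, J) association weights."""
    d = np.sum(alpha * (refs - x) ** 2, axis=1)   # attention-weighted distances
    R = np.exp(-c * d)                            # reference-point activations
    O = w @ R                                     # category outputs
    e = target - O                                # errors for E = 1/2 sum_k e_k^2
    return w + lam * np.outer(e, R)               # Delta w = -lam * dE/dw
```

Each presentation of a stimulus thus nudges all coefficients in the locally error-reducing direction, which is exactly the implicit assumption the next section questions.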
2 Descriptive Validity of the Traditional Learning Algorithm
Probability of "Optimal" Learning: Although it is not explicitly assumed, the gradient-type learning algorithm can be qualitatively interpreted as meaning that humans try to learn concepts and categories by adjusting all possibly relevant knowledge (i.e., model coefficients) in a locally optimal manner, in order to minimize misclassification, whenever they receive feedback on the accuracy of their concepts or knowledge. Thus, the probability of finding a locally better concept in any given learning trial, and throughout the entire learning phase, is always 1, or

P( E(θ^t + Δθ) < E(θ^t + Δϑ), ∀Δϑ ≠ Δθ | x^t, θ^t, {|Δθ|, |Δϑ|} < ε ) = 1    (2)

where t indicates time and ε is a small number, implying that the relation holds only locally. This implicit assumption does not appear to hold in real human concept formation. In reality, it is most likely true that some knowledge is updated correctly while other knowledge is not.

Objectives of Learning: Human cognitive competence is often characterized by adaptability, which manifests itself prominently in context-dependent thinking skills. But this type of cognitive ability has been largely overlooked in high-order cognitive modeling research. In particular, as described in Section 1.1, virtually all models of human category learning assume that humans are motivated and capable of minimizing categorization error on every training instance. In other words, the models assume that everyone
undertakes exactly the same computational operation for updating coefficients irrespective of a learner’s personal contextual factors (e.g., motivation). Recent empirical and simulation studies, however, do not support this assumption. Inadequacy of error minimization as a sole mechanism of real human concept formation was empirically suggested [3]. Real-world human learning is more likely to be explained as a process of multi-objective optimization of utility of knowledge, where objectives are not only defined by categorization errors, but also by each learner’s intra-, inter-, and extra contextual factors. For example, one laboratory experiment showed that in ordinary situation humans have a preference for simpler, yet sufficiently accurate concepts over complex but marginally more accurate concepts, suggesting their objectives in learning the concept were defined by both accuracy and simplicity [3]. Matsuka [2] showed that only models that incorporated contextual characteristics in their learning algorithms were capable of reliably replicating some empirical phenomena. Arbitrary Decision in Learning: Another important finding of the aforementioned empirical study [3] that showed humans being biased toward simple sufficient knowledge1 is that each participant’s preference to pay attention to one diagnostic feature dimension over the other equally diagnostic yet highly collinear or redundant dimension seemed to be the result of arbitrary decision or random choice. The traditional gradient descent algorithm, again, does not allow cognitive models to have such arbitrary decision in learning process, because the algorithm assumes and actually does ’correctly’ update all coefficients, resulting in paying attention to the two diagnostic dimensions. Degrees of Locality in Learning: A most commonly used learning algorithm in models of category learning is an online version of gradient descent. 
Although the traditional learning models do not necessarily assume strictly local learning, because of their stochastic approximation, their learning strategy is still highly locally oriented. More precisely, there is no mechanism that explicitly influences the degree of locality (or globality) of the search areas. This suggests that the capability of finding the global minimum is highly influenced by the initial coefficient configuration (i.e., the initial mental state) and by the stimulus presentation order. Much of real-life, ordinary learning is probably based on heuristic and locally-oriented learning. But there are rarer instances where humans attempt globally-oriented learning, probably in non-ordinary situations where maximizing the utility of knowledge is crucial (e.g., preparing for exams). In other words, humans seem to have the capability of adjusting their search areas in the conceptual space. This mechanism has not been explicitly incorporated (e.g., parameterized) in previous cognitive models. The Nature of the Error Hypersurface: In order to find the gradient of the error hypersurface, the objective function for traditional gradient descent learning needs to be differentiable, or its first derivative needs to be defined. Thus, these learning models assume the error hypersurface to be smooth and continuous. Some previous studies, however, suggest a possibly discontinuous property of the hypothetical error surface in human cognition. Bower and Trabasso [4] showed evidence that individuals could
¹ This bias, however, is probably limited to certain circumstances, and thus need not always be present. Rather, contextual factors determine the presence or absence of the bias.
T. Matsuka and A. Chouchourelou
learn a category in an all-or-none manner (i.e., accuracy suddenly increased from near base-rate accuracy), which they referred to as 'insightful' learning. This type of learning may be explained either by the error hypersurface having a discontinuous or all-or-none property (i.e., step functions), or by learning that utilizes the Hessian matrix with model coefficients θ sufficiently close to a minimum. However, it appears impossible for an ordinary human to consciously obtain the inverse of a Hessian matrix while performing a categorization task. Thus, those empirical studies might suggest that, at least in some situations, the error hypersurface for multi-objective category learning is discontinuous. Although insightful learning is mainly supported by very simple learning tasks, these empirical findings raise concerns about the descriptive validity of standard gradient descent as a model of human learning in some cases.
3 Toward Descriptive Models of Category Learning

We have raised five issues concerning the descriptive validity of the traditional learning algorithm applied to cognitive models of category learning. The following sections introduce a new framework that can, at least partly, overcome these issues, followed by two possible implementation approaches. The first approach is a simple extension of gradient descent; the other is based on a derivative-free stochastic optimization method. In spite of the differences in their optimization processes, the two new learning models share the same fundamental assumption: human learning is a multi-objective optimization process with stochastic characteristics.

3.1 Fundamental Principles

Multi-objective in Learning: The present framework assumes that human learning is driven not only by error minimization, but by optimization of a subjectively and contextually defined multidimensional knowledge utility. Operationally, it is framed as a minimization problem, so a smaller utility value implies better utility. The generic utility function can be formulated as

\Upsilon(\theta) = \lambda_E E(\theta) + \sum_{m=1}^{M} \lambda_m Q_m(\theta) \quad (3)
where the first term defines the accuracy of concepts (e.g., SSE), each Q_m function in the second term can be interpreted as a function defining a particular contextual factor m, λ_m is a scalar weighting the importance of contextual factor m, and M is the number of contextual factors. There are virtually infinitely many contextual functions that could appropriately be defined for Eq. 3. Concept abstraction, domain expertise, conservation, and knowledge commonality are a few examples of contextual, mainly motivational, factors possibly influencing learning processes and outcomes [2]. Stochasticity in Learning: One criticism regarding the descriptive validity of the traditional gradient descent methods is their absolute certainty in finding locally optimal knowledge in any given single learning trial (i.e., Eq. 2). The present framework overcomes this issue by assuming that a concept update consists of two elements, a structured process, G, and a stochastic process, ν, where the latter can be considered as noise, 'random
thoughts,' or random hypotheses. Thus, the models' coefficients (i.e., concepts) are updated as follows:

\theta^{(t+1)} = \theta^{(t)} + G(\theta^{(t)}) + \nu \quad (4)

The following two sections introduce two approaches implementing these two characteristics (Eqs. 3 & 4) in learning.

3.2 Noisy Gradient Descent with Multi-objective Functions

The present learning model is a very simple, yet possibly effective, extension of traditional gradient descent. Here the structured learning process, G, is a function of the context-dependent multidimensional concept utility, and the stochastic process, ν, is a time-decreasing random noise perturbation added to each coefficient update. Specifically,

G(\theta_i^{(t)}) = -\lambda \nabla \Upsilon(\theta) = -\lambda \frac{\partial \Upsilon}{\partial \theta_i} = -\lambda_E \frac{\partial E}{\partial \theta_i} - \sum_m \lambda_{Q_m} \frac{\partial Q_m(\theta_i)}{\partial \theta_i} \quad (5)

where the λs are step sizes weighting the different contextual factors (including categorization accuracy). In the present model, ν is random noise in learning, i.e., ν_i ∼ Φ(0, ζ(·)), where ζ is a decreasing function; ζ needs to be decreasing for convergence. In addition, this function should return sufficiently large values in the early learning phases so that a wider 'conceptual' space is explored. The present model will be referred to as NGD. Remarks on Descriptive Validity: The added random noise in coefficient updates can be interpreted as follows: people try to revise their relevant knowledge (i.e., coefficients) about categories, but the revision of each coefficient succeeds probabilistically, due to the somewhat coarse and imperfect nature of human high-order cognitive processes. Furthermore, the added noise provides arbitrary decisions or random choices in learning. In addition, the degree of locality in learning can be controlled by the level of noise and its decay speed (e.g., learning becomes globally oriented when the noise level is high and the noise decrement is sufficiently slow).
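A single NGD update of this kind can be sketched in a few lines. The step size, the exponential noise schedule for ζ(t), and the toy quadratic utility below are illustrative assumptions, not the parameterization used here:

```python
import numpy as np

rng = np.random.default_rng(0)

# One NGD coefficient update (Eqs. 4-5): a gradient step on the multi-objective
# utility plus zero-mean Gaussian noise whose width zeta(t) decays over time.
def ngd_step(theta, grad_utility, t, lam=0.1, zeta0=0.5, tau=50.0):
    G = -lam * grad_utility(theta)                 # structured process (Eq. 5)
    zeta = zeta0 * np.exp(-t / tau)                # decreasing noise width zeta(t)
    nu = rng.normal(0.0, zeta, size=theta.shape)   # stochastic process
    return theta + G + nu                          # Eq. 4

# Toy run on a quadratic utility Upsilon = sum(theta^2); its gradient is 2*theta.
theta = np.array([2.0, -3.0])
for t in range(500):
    theta = ngd_step(theta, lambda th: 2.0 * th, t)
```

With a high initial noise level and a slow decay the search is globally oriented early on, then settles into local refinement as ζ(t) shrinks.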
Because the present model utilizes the utility gradient, the error hypersurface for the multi-objective utility needs to be continuous, or the contextual factors need to be defined by differentiable functions; thus it probably would not sufficiently replicate 'insightful learning'.

3.3 Stochastic Hypothesis-Testing

The present model is a re-formulation of Matsuka's [2] stochastic context-dependent learning model termed SCODEL, and will be referred to as SCO-R. The fundamental concepts underlying the present and the previous models are very similar; but, compared with NGD, SCO-R has a more heuristic-oriented, hypothesis-testing-like interpretation. In SCO-R, the G process is very simple: it maintains the status quo. Instead, SCO-R contains an elaborate, heuristically controlled ν process. In particular, it assumes that people form a hypothesis set, η, about category knowledge (i.e., model coefficients) in a random fashion. Operationally, this is accomplished by randomly drawing hypothetical coefficient update values, each an independently sampled term η_i from a prespecified, in general zero-mean symmetric, distribution. The hypothetical update terms may be accepted to become ν terms (i.e., ν_i = η_i) or discarded (i.e.,
ν_i = 0) depending on their utility (i.e., Eq. 3) relative to that of the currently accepted concept coefficients θ^(t). Note that in SCO-R a smaller value of Υ indicates better utility, or less futility. Therefore,

\nu_i = \begin{cases} \eta_i \sim \Phi(0, \zeta(T^{(t)})), & \text{if } P(\eta) = P(\eta \mid \theta^{(t)}, T^{(t)}) \ge U(0,1) \\ 0, & \text{otherwise} \end{cases} \quad (6)

P(\eta) = \begin{cases} \left[ 1 + \exp\left( \dfrac{\Upsilon(\theta^{(t)} + \eta) - \Upsilon(\theta^{(t)})}{T^{(t)}} \right) \right]^{-1}, & \text{if } \Upsilon(\theta^{(t)} + \eta) > \Upsilon(\theta^{(t)}) \\ 1, & \text{otherwise} \end{cases} \quad (7)

where the superscript t indicates time, T is a parameter called 'temperature' (see [5]) that decreases over time, U(0,1) is the uniform distribution with minimum 0 and maximum 1, and Φ(0, ζ(T^(t))) is a symmetric distribution function. The random distribution Φ takes a time-decreasing parameter T that controls the width of the distribution (it also affects the probability of accepting a new hypothesis set, see Eq. 7). The temperature decreases across training blocks according to an annealing schedule T^(t) = δ(υ, t), where δ is the temperature-decreasing function that takes the decrement rate, υ, and time t as inputs. The choice of annealing function may depend on the particular choice of random number generation function [6]. Note that the decrease in temperature narrows the distribution Φ. The effect of this transition in distribution widths can be interpreted as follows: in the early stages of learning, SCO-R is quite likely to produce 'radical' hypotheses (i.e., new hypothesis sets, and thus coefficients, that are very different from the currently valid and accepted concepts). But as learning progresses, the widths of the random distribution decrease, so the model increasingly stabilizes its hypotheses and establishes more concrete and stable knowledge about the category.
In sum, SCO-R does not assume that learning involves computation-intensive (back)propagation of error in the multi-faceted coefficient space, or the calculation of a partial derivative of the error hypersurface for each coefficient. Rather, a very simple operation (e.g., the comparison of two utility values), together with stochastic processes, is assumed to be the key mechanism in human learning. Remarks on Descriptive Validity: As in NGD, SCO-R assumes that the probability of 'optimal' learning in the concept space for a given trial (and for the entire learning phase) can be, and often is, less than 1. By its algorithmic nature, it also incorporates arbitrary decisions in learning, and it can be global-optimization-oriented, depending on the parameter configuration. Unlike the gradient descent method, the utility functions for SCO-R need not be differentiable or continuous, which allows more flexibility in defining the contextual factors. Furthermore, this type of learning model has been shown to be capable of replicating rapid changes in both categorization accuracy and attentional shifts in a simple categorization task [7].
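The SCO-R trial described by Eqs. 6-7 amounts to a simulated-annealing-style accept/reject test. A minimal sketch follows; the Cauchy hypothesis distribution and the geometric annealing schedule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# One SCO-R trial: draw a random hypothesis set eta from a zero-mean symmetric
# (here Cauchy) distribution whose scale is the current temperature, then accept
# it with the sigmoidal probability of Eq. 7 (improvements are always accepted).
def sco_r_step(theta, utility, T):
    eta = rng.standard_cauchy(size=theta.shape) * T
    dU = utility(theta + eta) - utility(theta)
    # Eq. 7; the exponent is clipped only to avoid numerical overflow.
    p = 1.0 if dU <= 0.0 else 1.0 / (1.0 + np.exp(min(dU / T, 700.0)))
    return theta + eta if rng.uniform() < p else theta        # Eq. 6

# Toy run: minimize a quadratic utility under a geometric annealing schedule.
theta, T = np.array([3.0, -2.0]), 1.0
for _ in range(2000):
    theta = sco_r_step(theta, lambda th: float(np.sum(th**2)), T)
    T *= 0.997
```

Early on, the wide distribution yields 'radical' hypotheses; as T anneals, the accepted coefficient set stabilizes near a utility minimum.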
4 Simulations

In order to investigate the capabilities of the present framework operating as different types of hypothetical yet plausible learners, a simulation study was conducted. The
Table 1. Schematic representation of the stimulus set used in simulations

Category Dim1 Dim2 Dim3 Dim4 Dim5
A 0 0 0 1 1
A 0 0 1 2 2
A 0 0 2 3 3
A 0 0 0 4 4
A 0 0 1 5 5
A 0 0 2 6 6
B 1 1 0 7 7
B 1 1 1 8 8
B 1 1 2 9 9
B 1 1 0 10 10
B 1 1 1 11 11
B 1 1 2 12 12
Table 2. Functions used in the simulation study

Output (Eq. 8): \; O_k = \sum_j w_{kj} \exp\big( -\sum_i \alpha_i |\psi_{ji} - x_i| \big)

Accuracy (Eq. 9): \; U^E(w, \alpha) = \lambda_E \tfrac{1}{2} \sum_k (d_k - O_k)^2

Weight Decay (Eq. 10): \; U^W(w) = \lambda_W \sum_k \sum_j w_{kj}^2

Attn. Elim. (Eq. 11): \; U^S(\alpha) = \lambda_S \sum_{i=1}^{I} \frac{\alpha_i^2 \cdot \big( \sum_{l=1}^{I} \alpha_l^2 \big)^{-1}}{1 + \alpha_i^2 \cdot \big( \sum_{l=1}^{I} \alpha_l^2 \big)^{-1}}

Expertise (Eq. 12): \; U^\Xi(\alpha) = \lambda_\Xi \frac{\exp\big( -c \sum_i \alpha_i |\psi_i^{Max-1} - x_i| \big)}{\exp\big( -c \sum_i \alpha_i |\psi_i^{Max} - x_i| \big)}

where \psi^{Max} and \psi^{Max-1} indicate the maximally and 2nd-maximally activated exemplars.
present simulation study does not emphasize the present framework's accuracy in learning per se. Instead, greater emphasis is placed on its capability of showing different, expectable, context-dependent learning by hypothetical people engaged in hypothetical category learning tasks, with the aim of evaluating its potential as an alternative, more descriptive, model of high-order human learning processes. In particular, we test the two implementation approaches for the stochastic multi-objective learning framework using an extended version of a problem (Table 1) suggested by Matsuka [2]. Note that all feature values are treated as nominal values differentiating the elements within each dimension; their numeric values and differences do not have any meaning. Methods: Five types of hypothetical human subjects with different kinds of hypothetical yet realistic learning objectives, or combinations of objectives, were involved in the present study. The five types of learners are: EMN, an error minimizer whose objective in learning is to minimize categorization error, thus Υ = U^E (i.e., Eq. 9 in Table 2); ABS, an abstractive learner who tries to acquire simple yet accurate concepts, or Υ = U^E + U^W + U^S; ATT, an attentive learner who pays attention to many feature dimensions (e.g., being obsessive-compulsive, or trying to understand the relationships among feature dimensions), thus Υ = U^E − U^S; DEX, a domain expert whose objective includes acquiring accurate concepts in addition to the ability to distinguish different exemplars, thus Υ = U^E + U^Ξ; and finally EDX, an efficient domain expert who tries to acquire simple, accurate knowledge and the ability to distinguish between different exemplars, thus Υ = U^E + U^S + U^Ξ. These five types of learners have been empirically observed or suggested [2] [3] [8] [9].
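The five learner types are just different weighted sums of the Table 2 components inside the generic utility of Eq. 3. A sketch with simplified stand-in component functions (the forms and weights below are assumptions for illustration only, not the functions of Table 2):

```python
# Compose learner objectives as sums of utility components (Eq. 3 / Table 2).
def compose(*components):
    return lambda w, a: sum(U(w, a) for U in components)

U_E  = lambda w, a: 0.5 * sum(x * x for x in w)    # accuracy stand-in (cf. Eq. 9)
U_W  = lambda w, a: 0.01 * sum(x * x for x in w)   # weight decay (cf. Eq. 10)
U_S  = lambda w, a: 0.1 * sum(abs(x) for x in a)   # attention elimination (cf. Eq. 11)
U_Xi = lambda w, a: 0.1 * sum(x * x for x in a)    # expertise stand-in (cf. Eq. 12)

learners = {
    "EMN": compose(U_E),
    "ABS": compose(U_E, U_W, U_S),
    "ATT": lambda w, a: U_E(w, a) - U_S(w, a),     # note the minus sign for ATT
    "DEX": compose(U_E, U_Xi),
    "EDX": compose(U_E, U_S, U_Xi),
}
```

Either NGD or SCO-R can then minimize any entry of this dictionary, which is how one framework yields five qualitatively different learners.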
Table 3. Results of the simulation study. Cells show mean predicted attention (standard deviation).

Learner Model Accuracy Dim1         Dim2         Dim3         Dim4         Dim5
EMN     NGD   0.94     0.34 (0.20)  0.36 (0.21)  0.17 (0.16)  0.07 (0.11)  0.06 (0.10)
EMN     SCO   0.93     0.34 (0.18)  0.32 (0.19)  0.14 (0.15)  0.10 (0.11)  0.10 (0.11)
ABS     NGD   0.92     0.48 (0.48)  0.43 (0.48)  0.02 (0.11)  0.04 (0.17)  0.03 (0.15)
ABS     SCO   0.93     0.43 (0.38)  0.42 (0.39)  0.07 (0.12)  0.04 (0.08)  0.04 (0.10)
ATT     NGD   0.94     0.21 (0.03)  0.21 (0.03)  0.20 (0.04)  0.19 (0.04)  0.19 (0.04)
ATT     SCO   0.92     0.22 (0.04)  0.22 (0.04)  0.20 (0.05)  0.18 (0.05)  0.18 (0.06)
DEX     NGD   0.94     0.19 (0.08)  0.19 (0.08)  0.02 (0.03)  0.30 (0.07)  0.31 (0.07)
DEX     SCO   0.93     0.20 (0.08)  0.19 (0.08)  0.05 (0.06)  0.28 (0.06)  0.28 (0.06)
EDX     NGD   0.94     0.02 (0.13)  0.02 (0.12)  0.01 (0.05)  0.48 (0.47)  0.46 (0.47)
EDX     SCO   0.91     0.11 (0.11)  0.11 (0.11)  0.04 (0.05)  0.37 (0.18)  0.37 (0.18)
Eq. 8 describes the forward categorization process (i.e., Eq. 1). The five types of learners were implemented with both NGD and SCO-R, so a total of 10 models were simulated. All 10 models were trained with corrective feedback to learn to categorize 12 instances into either category A or B. Note that we do not test the models' abilities to learn (although all models successfully learned to categorize the inputs); instead, we test their abilities to show different, expectable, context-dependent learning outcomes, particularly in learned attention allocations. For NGD, a Gaussian distribution with mean 0 and standard deviation equal to the median change in the previous training trial was used for the ν process, i.e., ν_w ∼ N(0, MEDIAN(Δw_kj^(t−1))) and ν_α ∼ N(0, MEDIAN(Δα_i^(t−1))). For SCO-R, a Cauchy distribution and an exponential decay function were used for random hypothesis generation and temperature decrement, respectively. There were 500 simulated subjects for each model. The parameters were selected arbitrarily. Results: Table 3 shows the results of the simulation study. All models learned the category structure accurately, at approximately the same precision. Note that predicted classification accuracies can easily be adjusted by manipulating the model's classification decisiveness parameter φ (see Section 1.1), and thus are not of the greatest importance, other than the fact that all 10 models reached similar classification accuracies with the same φ value. For each type of learner, NGD's and SCO-R's predictions of relative attention allocations (both means and standard deviations) were very similar, except for EDX, for which NGD predicted more variability in the amounts of attention allocated to Dimensions 4 and 5. Otherwise, NGD and SCO-R showed similar patterns of learned attention distributions.
These predictions are basically consistent with empirical observations [3] and cognitive theory [8] [9], indicating that the two implementation approaches of the present framework successfully capture the essence of human learning.
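For reference, the forward pass of Eq. 8, an exemplar model in the ALCOVE family [1], can be sketched as follows; the array shapes and example values are assumptions for illustration:

```python
import numpy as np

# Eq. 8: output O_k sums exemplar activations, each decaying exponentially with
# the attention-weighted city-block distance between exemplar psi_j and input x.
def forward(x, exemplars, w, alpha):
    # exemplars: (J, I) stored feature vectors; w: (K, J) association weights;
    # alpha: (I,) attention weights; x: (I,) input stimulus.
    act = np.exp(-np.sum(alpha * np.abs(exemplars - x), axis=1))   # (J,)
    return w @ act                                                 # (K,) outputs O_k

x = np.array([0.0, 0.0])
exemplars = np.array([[0.0, 0.0], [1.0, 1.0]])
O = forward(x, exemplars, np.eye(2), np.array([1.0, 1.0]))
# O[0] = 1 (perfect match); O[1] = exp(-2) (distance 2 under full attention)
```

The learned attention weights α reported in Table 3 enter exactly here, by rescaling each dimension's contribution to the exemplar distances.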
5 Conclusion

The present research raised five critical issues associated with the descriptive validity of the most widely used learning algorithm for high-order human cognitive processes,
namely, the online version of gradient descent. These issues are the probability of 'optimal' learning, the objectives of learning, arbitrary decisions in learning, the degree of locality in learning, and the nature of the error hypersurface. To overcome these limitations, a framework with two implementation approaches for models of human category learning or concept formation was introduced. The present framework assumes that human concept formation is driven not only by error minimization, but also by optimization of a subjectively and contextually defined concept utility. It also assumes that stochasticity plays a key role in the optimization process. Qualitative interpretations of the two learning models significantly expand the descriptive validity of currently available cognitive models of human category learning. Our simulation study quantitatively demonstrated both the predictive and descriptive validity of both approaches.
Acknowledgements

This work was supported by the Office of Naval Research (# N00014-05-1-00632).
References

1. Kruschke, J. K.: ALCOVE: An Exemplar-Based Connectionist Model of Category Learning. Psychological Review 99 (1992) 22-44
2. Matsuka, T.: Simple, Individually Unique, and Context-Dependent Learning Method for Models of Human Category Learning. Behavior Research Methods 37 (2005) 240-255
3. Matsuka, T., Corter, J. E.: Process Tracing of Attention Allocation in Category Learning. (2006) Under review
4. Bower, G. H., Trabasso, T. R.: Reversals Prior to Solution in Concept Identification. Journal of Experimental Psychology 66 (1963) 409-418
5. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., Teller, E.: Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics 21 (1953) 1087-1092
6. Ingber, L.: Very Fast Simulated Re-Annealing. Mathematical and Computer Modelling 12 (1989) 967-973
7. Matsuka, T., Corter, J. E.: Stochastic Learning Algorithm for Modeling Human Category Learning. International Journal of Computational Intelligence 1 (2004) 40-48
8. Tanaka, J., Taylor, M.: Object Categories and Expertise: Is the Basic Level in the Eye of the Beholder? Cognitive Psychology 23 (1991) 457-482
9. Tanaka, J., Gauthier, I.: Expertise in Object and Face Recognition. In: Medin, Goldstone, Schyns (eds.): The Psychology of Learning and Motivation, Vol. 36: Perceptual Learning. Academic Press, San Diego, CA (1997) 83-125
A Neural Model on Cognitive Process Rubin Wang1,2, Jing Yu2, and Zhi-kang Zhang1 1
Institute for Brain Information Processing and Cognitive Neurodynamics, School of Information Science and Engineering, East China University of Science and Technology, Meilong Road 130, Shanghai 200237, P.R. China {rbwang, zhikangb}@163.com 2 School of Science, Donghua University, West Yan-An Road 1882, Shanghai 200051, P.R. China {rbwang, seychelle}@163.com
Abstract. In this paper we study a new dynamic evolution model of phase encoding in a population of neuronal oscillators with different phases, and investigate neural information processing in the cerebral cortex and its dynamic evolution under the action of different stimulation signals. The evolution of the averaging number density over time in three-dimensional space is obtained by numerical analysis for different clusters of neuronal oscillators firing action potentials at different phases. The results of the numerical analysis show that the dynamic model proposed in this paper can be used to describe the neurodynamic mechanisms of attention and memory.
1 Introduction

In early published papers, we studied dynamic models of phase resetting in neural populations, and these models were generalized to cognitive neural dynamic systems. They extended earlier analyses in which the interacting neurons of a phase-resetting dynamic model are limited to one neural population [1-3]. In that work, the amplitude of the neural population was extended from the original limit-cycle amplitude to a changeable amplitude, and a nonlinear dynamic model of two neural populations was obtained for the case of changeable amplitude. We further studied nonlinear stochastic dynamic models of phase resetting in which the coupling intensity of synapses among neurons changes over time. Models of this type can reveal the dynamic mechanisms linking evolution models of learning and memory with neuronal plasticity [4,5]. In particular, we studied the self-synchronization evolution model in the presence of noise and the desynchronization evolution model under periodic stimulation for a neural population with stochastic amplitude, and many conclusions of biological significance were obtained [6-8]. Because the eigenfrequency of each neuronal oscillator in a neural population differs, the phase and amplitude of each neuronal oscillator also differ in neuronal activities [9]. Why does our attention focus on a certain event while we simultaneously and unconsciously ignore other events? Why do we easily memorize events to which we have paid attention, while events we did not attend to are easily forgotten? In order to describe more deeply the neural information processing under the action of stimulation, and to answer the questions above, in this paper we present a neural dynamic model with different phases. The results of numerical computation show that the dynamic model can satisfactorily describe the neurodynamic mechanisms of attention and memory.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 50-59, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 The Dynamic Model

Assume that a population of neural oscillators with periodic activity is composed of N oscillators, as in Fig. 1. There are two clusters in the population: the number of oscillators and the phase in the first cluster are denoted by N1 and θ1, respectively, and those in the second cluster by N − N1 and θ2. The interaction between neurons of the two clusters is denoted by M, which is given in the next section.
Fig. 1. Two clusters of neural oscillators with different phases in one population
Here Ω1 is the eigenfrequency of the N1 neural oscillators, their amplitude is R1(t), and the external stimulation is S1(ψj, R1); Ω2 is the eigenfrequency of the N − N1 neural oscillators, with amplitude R2(t) and external stimulation S2(ψj, R2). The dynamic equations in the presence of both noise and periodic stimulation are

\dot\psi_j = \Omega_1 + S_1(\psi_j, r_j) + \frac{1}{N}\sum_{k=1}^{N} M(\psi_j - \psi_k, r_j, r_k) + F_{j,1}(t), \quad (j = 1, 2, \dots, N_1)
\dot\psi_j = \Omega_2 + S_2(\psi_j, r_j) + \frac{1}{N}\sum_{k=1}^{N} M(\psi_j - \psi_k, r_j, r_k) + F_{j,2}(t), \quad (j = N_1 + 1, \dots, N)
\dot r_j = \alpha_1 r_j - \beta_1 r_j^3, \quad (j = 1, 2, \dots, N_1)
\dot r_j = \alpha_2 r_j - \beta_2 r_j^3, \quad (j = N_1 + 1, \dots, N) \quad (1)

where F_{j,i}(t) is the white-noise term. The stimulation term is

S_k(\psi_j, r_j) = \sum_{m=1}^{L} I_{km}\, r_j^{m+2} \cos(m\psi_j + \gamma_{km}), \quad k = 1, 2 \quad (2)

and the coupling term is

M(\psi_j - \psi_k, r_j, r_k) = -\sum_{m=1}^{L} r_j^m r_k^m \big( K_m \sin m(\psi_j - \psi_k) + C_m \cos m(\psi_j - \psi_k) \big) \quad (3)

The Fokker-Planck equation for the probability density f of equation (1) is

\frac{\partial f}{\partial t} = -\sum_{k=1}^{2} \frac{\partial}{\partial R_k}\big[ g_k(R_k) f \big] + \frac{Q}{2} \sum_{j=1}^{N} \frac{\partial^2 f}{\partial \psi_j^2} - \sum_{j=1}^{N_1} \frac{\partial}{\partial \psi_j}\big[ T_1 f \big] - \sum_{j=N_1+1}^{N} \frac{\partial}{\partial \psi_j}\big[ T_2 f \big] \quad (4)
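A minimal Euler-Maruyama simulation of system (1) illustrates the synchronization it produces. The simplifications here (no stimulation, unit amplitudes, first-order sine coupling with K_1 = 1, and the later Fig. 3 values Q = 0.4, Ω_1 = Ω_2 = 2π) are assumptions for this sketch only:

```python
import numpy as np

rng = np.random.default_rng(3)

# Euler-Maruyama integration of the phase equations in (1) with S1 = S2 = 0,
# r_j = 1, and the coupling of Eq. (3) truncated to its first sine term.
N, N1 = 110, 10
Q, K1, dt, steps = 0.4, 1.0, 0.01, 2000
omega = np.full(N, 2.0 * np.pi)          # equal eigenfrequencies, as in Fig. 3

psi = rng.uniform(0.0, 2.0 * np.pi, N)
for _ in range(steps):
    # (1/N) sum_k M(psi_j - psi_k) = -(K1/N) sum_k sin(psi_j - psi_k)
    coupling = -K1 * np.mean(np.sin(psi[:, None] - psi[None, :]), axis=1)
    psi += (omega + coupling) * dt + np.sqrt(Q * dt) * rng.normal(size=N)

# Order parameter |<exp(i psi)>|: near 1 for synchronization, near 0 otherwise.
r = float(np.abs(np.mean(np.exp(1j * psi))))
```

With this coupling strength the population settles into a noisy synchronized state, with r well above the incoherent level of roughly 1/sqrt(N).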
The probability density and the coefficients on the phase vector {ψ_l} and the amplitude vector {R_l} in the Fokker-Planck equation (4) can be written in the form

f = f(\psi_1, \dots, \psi_N, R_1, \dots, R_N, t)

T_1 = \frac{1}{N}\sum_{k=1}^{N} T_{11}, \qquad T_2 = \frac{1}{N}\sum_{k=1}^{N} T_{22}

T_{11} = \Omega_1 + S_1(\psi_j, R_1) + \begin{cases} M(\psi_j - \psi_k, R_1, R_1), & k = 1, \dots, N_1 \\ M(\psi_j - \psi_k, R_1, R_2), & k = N_1 + 1, \dots, N \end{cases}

T_{22} = \Omega_2 + S_2(\psi_j, R_2) + \begin{cases} M(\psi_j - \psi_k, R_2, R_1), & k = 1, \dots, N_1 \\ M(\psi_j - \psi_k, R_2, R_2), & k = N_1 + 1, \dots, N \end{cases}
One obtains an expression for the amplitude as a function of time t from the amplitude equations in (1):

R_i(t) = \frac{R_{i,0}\, e^{\alpha_i t}}{\sqrt{1 - \frac{\beta_i}{\alpha_i} R_{i,0}^2 + \frac{\beta_i}{\alpha_i} R_{i,0}^2\, e^{2\alpha_i t}}} \quad (5)
In Fig. 2 one observes that the amplitude reaches the limit cycle R = 1 about 5 s after the neuron is activated.
Fig. 2. Amplitude of limit cycles: R0 = 0.5 , α = β = 1
Therefore, the amplitude becomes constant after a period of time.
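The closed-form amplitude (5) can be checked against direct integration of the amplitude equation r' = αr − βr³ with the Fig. 2 parameters (R_0 = 0.5, α = β = 1); both should approach the limit-cycle radius sqrt(α/β) = 1 within about 5 s:

```python
import math

alpha, beta, r0 = 1.0, 1.0, 0.5

def R_closed(t):
    # Eq. (5): R(t) = R0 e^{at} / sqrt(1 - (b/a) R0^2 + (b/a) R0^2 e^{2at})
    c = (beta / alpha) * r0 * r0
    return r0 * math.exp(alpha * t) / math.sqrt(1.0 - c + c * math.exp(2.0 * alpha * t))

# Forward-Euler integration of r' = alpha*r - beta*r**3 up to t = 5 s.
r, dt = r0, 1e-3
for _ in range(5000):
    r += (alpha * r - beta * r**3) * dt
```

At t = 0 the formula returns R_0 exactly, and by t = 5 both the formula and the integrated trajectory sit on the limit cycle, matching Fig. 2.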
A global number density is defined as

n(\theta_1, \theta_2, R_1, R_2, t) = n_1(\theta_1, R_1, R_2, t) + n_2(\theta_2, R_1, R_2, t) \quad (6)
According to equation (5), the global number density is a function of phase and time, and it satisfies the normalization condition

\int_0^{2\pi} n(\theta', \theta', R_1, R_2, t)\, d\theta' = 1 \quad (7)
Inserting (6) into (4) yields the equation for the averaging number density:

\frac{\partial n}{\partial t} = \int_0^{2\pi}\!\int_0^{2\pi} d\psi_l \left( \tilde n_1 \frac{\partial f}{\partial t} + \tilde n_2 \frac{\partial f}{\partial t} \right)
= -\sum_{k=1}^{2} (\alpha R_k - \beta R_k^3) \frac{\partial n}{\partial R_k} + \frac{Q}{2}\left( \frac{\partial^2 n_1}{\partial \theta_1^2} + \frac{\partial^2 n_2}{\partial \theta_2^2} \right) - \left( \Omega_1 \frac{\partial n_1}{\partial \theta_1} + \Omega_2 \frac{\partial n_2}{\partial \theta_2} \right)
\quad - \frac{\partial}{\partial \theta_1}\big( S_1(\theta_1, R_1)\, n_1 \big) - \frac{\partial}{\partial \theta_2}\big( S_2(\theta_2, R_2)\, n_2 \big)
\quad - \frac{\partial}{\partial \theta_1}\left[ n_1(\theta_1) \int_0^{2\pi} \big( M(\theta_1 - \psi', R_1, R_1)\, n_1(\psi') + M(\theta_1 - \psi', R_1, R_2)\, n_2(\psi') \big)\, d\psi' \right]
\quad - \frac{\partial}{\partial \theta_2}\left[ n_2(\theta_2) \int_0^{2\pi} \big( M(\theta_2 - \psi', R_2, R_1)\, n_1(\psi') + M(\theta_2 - \psi', R_2, R_2)\, n_2(\psi') \big)\, d\psi' \right] \quad (8)
Using a Fourier transformation, the stimulation and coupling terms can be written as

S_j(\psi, R_j) = \sum_k S_{jk}(R)\, e^{ik\psi}, \quad (j = 1, 2) \quad (9)

S_{jk}(R) = \begin{cases} \frac{I_{jk}}{2} R_j^{k+2} e^{i\gamma_{jk}}, & k = 1, \dots, L \\ \frac{I_{j,-k}}{2} R_j^{-k+2} e^{-i\gamma_{j,-k}}, & k = -1, \dots, -L \\ 0, & \text{otherwise} \end{cases} \quad (10)

M(\psi - \psi', R_j, R_n) = -\sum_k M_k(R_j, R_n)\, e^{ik(\psi - \psi')} \quad (11)

M_k(R_j, R_n) = \begin{cases} \pi R_j^k R_n^k (-C_k + iK_k), & k = 1, \dots, L \\ -\pi R_j^{-k} R_n^{-k} (C_{-k} + iK_{-k}), & k = -1, \dots, -L \\ 0, & \text{otherwise} \end{cases} \quad (12)
The averaging number density is expanded in a Fourier series as follows:

n_j(\theta_j, R_1, R_2, t) = \sum_k \hat n_j(k, R_1, R_2, t)\, e^{ik\theta_j}, \quad j = 1, 2 \quad (13)
One obtains

\hat n_1(0) = \frac{N_1}{2\pi N}, \qquad \hat n_2(0) = \frac{N - N_1}{2\pi N}

Under this condition the averaging number density is rewritten in the following form:

n(\theta_1, \theta_2, R_1, R_2, t) = \sum_{j=1}^{2} n_j(\theta_j, R_1, R_2, t) = \sum_k \left( \hat n_1(k, R_1, R_2, t)\, e^{ik\theta_1} + \hat n_2(k, R_1, R_2, t)\, e^{ik\theta_2} \right) \quad (14)
Inserting (14) into (8) yields the Fokker-Planck equation expressed in terms of the averaging number density:

\frac{\partial \hat n_1(k,t)}{\partial t} = -\sum_{m=1}^{2} g_m(R_m) \frac{\partial \hat n_1}{\partial R_m} - \frac{k^2}{2} Q \hat n_1 - ik\Omega_1 \hat n_1 - ik \sum_m S_{1m}(R_1)\, \hat n_1(k-m) - ik \sum_{m=\pm 1}^{\pm L} \hat n_1(k-m) \big[ M_m(R_1, R_1)\, \hat n_1(m) + M_m(R_1, R_2)\, \hat n_2(m) \big]

\frac{\partial \hat n_2(k,t)}{\partial t} = -\sum_{m=1}^{2} g_m(R_m) \frac{\partial \hat n_2}{\partial R_m} - \frac{k^2}{2} Q \hat n_2 - ik\Omega_2 \hat n_2 - ik \sum_m S_{2m}(R_2)\, \hat n_2(k-m) - ik \sum_{m=\pm 1}^{\pm L} \hat n_2(k-m) \big[ M_m(R_1, R_2)\, \hat n_1(m) + M_m(R_2, R_2)\, \hat n_2(m) \big] \quad (15)
Taking into account equation (5), one obtains

\frac{d\hat n_1}{dt} = \frac{\partial \hat n_1}{\partial t} + \frac{\partial \hat n_1}{\partial R_1} \dot R_1 + \frac{\partial \hat n_1}{\partial R_2} \dot R_2 = \frac{\partial \hat n_1}{\partial t} + g_1(R_1) \frac{\partial \hat n_1}{\partial R_1} + g_2(R_2) \frac{\partial \hat n_1}{\partial R_2}
Then equation (15) can be rewritten as

\frac{d\hat n_1(k,t)}{dt} = -\frac{k^2}{2} Q \hat n_1 - ik\Omega_1 \hat n_1 - ik \sum_m S_{1m}(R_1)\, \hat n_1(k-m) - ik \sum_{m=\pm 1}^{\pm L} \hat n_1(k-m) \big[ M_m(R_1, R_1)\, \hat n_1(m) + M_m(R_1, R_2)\, \hat n_2(m) \big]

\frac{d\hat n_2(k,t)}{dt} = -\frac{k^2}{2} Q \hat n_2 - ik\Omega_2 \hat n_2 - ik \sum_m S_{2m}(R_2)\, \hat n_2(k-m) - ik \sum_{m=\pm 1}^{\pm L} \hat n_2(k-m) \big[ M_m(R_1, R_2)\, \hat n_1(m) + M_m(R_2, R_2)\, \hat n_2(m) \big] \quad (16)
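Equations (16) form a closed set of ordinary differential equations for the Fourier modes that can be integrated directly. The following sketch assumes constant unit amplitudes, no stimulation, first-order coupling only (L = 1, with illustrative values C_1 = 0, K_1 = 1), a mode cutoff of 8, and a small initial perturbation; a useful invariant to check is that n̂_1(0) and n̂_2(0) stay fixed, since every term in (16) carries a factor of k or k²:

```python
import numpy as np

# Euler integration of the mode equations (16) under simplifying assumptions:
# R1 = R2 = 1 (so the amplitude-derivative terms vanish), S = 0, L = 1.
K_MODES = 8                                      # Fourier-mode cutoff (assumption)
Q, omega1, omega2 = 0.4, 2.0 * np.pi, 2.0 * np.pi
N1, N = 1, 11                                    # cluster ratio 1:10
M = {+1: 1j * np.pi, -1: -1j * np.pi}            # Eq. (12) with R = 1, C1 = 0, K1 = 1

idx = lambda k: k + K_MODES                      # mode number -> array index
n1 = np.zeros(2 * K_MODES + 1, dtype=complex)
n2 = np.zeros(2 * K_MODES + 1, dtype=complex)
n1[idx(0)] = N1 / (2.0 * np.pi * N)              # fixed by the normalization (Eq. 14)
n2[idx(0)] = (N - N1) / (2.0 * np.pi * N)
n1[idx(1)] = n2[idx(1)] = 0.012 / (2.0 * np.pi)  # small perturbation (assumption)
n1[idx(-1)], n2[idx(-1)] = np.conj(n1[idx(1)]), np.conj(n2[idx(1)])

def deriv(n1, n2):
    d1, d2 = np.zeros_like(n1), np.zeros_like(n2)
    for k in range(-K_MODES, K_MODES + 1):
        c1 = c2 = 0.0 + 0.0j
        for m in (+1, -1):                       # coupling convolution of Eq. (16)
            if abs(k - m) <= K_MODES:
                common = M[m] * n1[idx(m)] + M[m] * n2[idx(m)]
                c1 += n1[idx(k - m)] * common
                c2 += n2[idx(k - m)] * common
        d1[idx(k)] = -(k * k / 2.0) * Q * n1[idx(k)] - 1j * k * omega1 * n1[idx(k)] - 1j * k * c1
        d2[idx(k)] = -(k * k / 2.0) * Q * n2[idx(k)] - 1j * k * omega2 * n2[idx(k)] - 1j * k * c2
    return d1, d2

dt = 1e-3
for _ in range(2000):                            # Euler steps up to t = 2
    d1, d2 = deriv(n1, n2)
    n1, n2 = n1 + dt * d1, n2 + dt * d2
```

Reconstructing n_j(θ_j, t) from the modes via Eq. (13) then yields surfaces of the kind plotted in Figs. 3-5.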
3 Evolution Results in the Case of No Stimulation

In the case of no stimulation, if all parameters of the two clusters are identical, the neural model (16) reduces to the result obtained in P. A. Tass's monograph [9]; therefore, we do not analyze this case. In order to describe clearly the interaction between the two clusters of neural oscillators, we give the evolution process of the averaging number density in three-dimensional space below. The averaging number density of the first cluster's neurons is described in the (ψ1, n) plane, and that of the second cluster's neurons in the (ψ2, n) plane. The neural oscillators in the first and second clusters satisfy the normalization condition in their respective planes.
Fig. 3. Evolution of the averaging number density with different phases, N1:N2 = 10:1. Initial condition: n(ψ1, 0) = 1/(2π) + 0.012 cos ψ1 + 0.012 cos(2ψ1), n(ψ2, 0) = 1/(2π) + 0.012 cos ψ2 + 0.012 cos(2ψ2); Q1 = Q2 = 0.4, Ω1 = Ω2 = 2π
Fig. 4. Evolution of the averaging number density with different phases, N1:N2 = 1:10 (initial condition the same as in Fig. 3)
In Figs. 3 and 4 we clearly observe that, with the same parameters, the larger the difference in the numbers of neurons, the more the evolution tends toward the cluster containing the majority of neural oscillators. The proportion of neurons in the two clusters has a very large impact on the evolution result, and the cluster with the most neurons dominates the evolution of the system (the global averaging number density along the diagonal line). It is necessary to emphasize that we first give the dynamic evolution of neural encoding in a two-dimensional phase space by means of numerical analysis, and numerically show that earlier results on phase coding of neuronal populations in a one-dimensional phase space lose abundant useful neural information [10]. Hence, the theory of phase coding in a one-dimensional phase space cannot be applied to the study of cognitive processes. Only the theory of phase coding in the high-dimensional case can adequately describe neural encoding and evolution in neural populations in the cerebral cortex.
4 Attention and Memory in the Presence of Stimulation

In this section, we study the impact of stimulation on neural encoding in a three-dimensional phase space. The two questions of interest in this section are the desynchronization process caused by instantaneous stimulation and the resynchronization process after the end of stimulation. The stimulation is composed of three steps in the numerical analysis.

1. In the first step there is no stimulation of neuronal activities, i.e., S1 = S2 = 0. We studied the synchronization process and the corresponding firing pattern of the spontaneous behavior of a large-scale neuronal population in reference [7], so it is not described in detail here.

2. In the second step stimulation is present, i.e., S1 ≠ 0 or S2 ≠ 0, or both S1 ≠ 0 and S2 ≠ 0. The point of emphasis is that different stimulations act on different neural clusters, so that we can further study the response of the population of neural oscillators to intrinsically different stimulations.

3. In the third step the stimulation stops during a synchronization process, i.e., S1 = S2 = 0. In this step we study how a cluster of neuronal oscillators transits from the unstable state of the desynchronization process to a stable state, i.e., how it returns to the stable synchronized state through a resynchronization process. In the procedure of this third step, the stimulation result is the initial condition of the resynchronization process; for this case, the impact of different initial conditions on the result of neural encoding was discussed in reference [8].

Figure 5 describes how two clusters of neural oscillators with a 1:10 quantitative ratio of neurons evolve from the stable synchronized state to an unstable desynchronized state under impulse stimulations of different orders. Because the neural cluster receiving the first-order harmonic stimulation dominates the system, a periodic single-peak distribution of the averaging number density appears.
After an instantaneous stimulation of 0.3 s, a new distribution of the averaging number density appears, as shown in Fig. 5(c); this new distribution is completely different from the initial distribution in Fig. 5(a). This stimulation result reflects that the cluster containing the majority of the neural oscillators dominates the dynamic behavior of the system in the cognitive process. Figure 5(d) shows that, after an 18 s resynchronization process driven by the coupling among neurons, the distribution of neuronal activities tends to the configuration of Fig. 5(b). The desynchronized distribution of Fig. 5(c) thus restores, under the coupling action, the stable synchronization state in both the first and the second phase, as in Fig. 5(b). This phenomenon shows that the memory state produced as a stimulation effect is very stable after the stimulation ends.
Fig. 5. Neuron-number ratio N1:N2 = 1:10 (initial condition and parameters are the same as in Fig. 3); S_i(ψ, R_i) = I_{i1} R_i^2 cos ψ + I_{i2} R_i^2 cos(2ψ), with I_{11} = 0, I_{12} = 7, I_{21} = 7, I_{22} = 0. (a) Initial distribution of the averaging number density of the oscillators; (b) distribution after 9 s of spontaneous behavior; (c) distribution at the end of the 0.3 s instantaneous stimulation; (d) distribution after a further 18 s of spontaneous behavior following the stimulation stop.
Figure 5 presents a complete evolution process of neural encoding in phase space. These processes illustrate several biological phenomena, for example why we can identify a familiar sound in a noisy environment while noise of various frequencies from that environment does not distract our attention. The reason is that the effect of a stimulation associated with cognition is closely related to the number of neurons acting in the cognitive process: in the N2 cluster, the neural oscillators can produce a coupling resonance associated with the cognitive process, whereas only a small number of neurons react to the environmental noise, and this reaction does not reach the level of coupling resonance, as for the neural activities of the N1 cluster. Because the encoding of the external stimulation in the N1 cluster does not reach the conscious level, these neural activities leave only a faint response in the cerebral cortex, and the dynamic mechanism of attention does not deposit this faint response in the cortex. The memory produced as the effect of attention, however, is preserved, as observed in the stimulation result of Fig. 5(d). Comparing with Fig. 5(c), we observe that in Fig. 5(d) the amplitude of the wave in the first phase space becomes very small while that in the second phase space becomes very large; this phenomenon shows the memory effect of cognition.
5 Concluding Remarks

In this paper we proposed a new dynamic model of phase resetting for a population of neural oscillators with different phases, and performed stability and numerical analyses of the spontaneous behavior of a large-scale neural population. We carried out numerical analyses of the synchronization and desynchronization activities of the neurons in both the symmetric case (the two clusters contain exactly the same number of neural oscillators) and the asymmetric case (the two clusters contain different numbers of neural oscillators). The numerical results first showed that earlier results on phase encoding of neuronal populations in a one-dimensional phase space lose abundant useful neural information; hence, the theory of phase encoding in a one-dimensional phase space cannot be applied to the study of cognitive processes. The results further showed that the proposed neural dynamic model in a high-dimensional phase space can describe in depth the dynamic mechanisms of attention and memory. We also showed that, in the evolution of the distribution of the averaging number density, the cluster containing the majority of the neural oscillators dominates the dynamic behavior of the system in the cognitive process. These conclusions can explain biological phenomena such as why attended events are easily memorized and why memory contents produced as stimulation effects are very stable, and they can help us to understand the neurodynamic mechanisms of attention and memory.
Acknowledgement The work reported in this paper was supported by the National Natural Science Foundation of China under Grant No. 30270339.
References 1. Wang, R.B., Zhang, Z.K.: Nonlinear stochastic models of neurons activities. Neurocomputing 51 (2003) 401-411 2. Wang, R.B., Zhang, Z.K.: A Dynamic Evolution Model for the Set of Populations of Neurons. Int. J. Nonlinear Sci. and Num. Simul. 4(3) (2003) 203-208 3. Wang, R.B., Yu, W., Jiao, X.F.: A Review of Application on Stochastic Dynamics in Brain Information Processing. Fudan Lecture in Neurobiology 307 (2004) 193-206 4. Wang, R.B., Jiao, X.F.: Stochastic Model and Neural Coding of Large-scale Neural Population with Variable Coupling Strength. Neurocomputing (2005) (in press)
5. Jiao, X.F., Wang, R.B.: Nonlinear Evolution Model and Neuronal Coding of Neuronal Population with the Variable Coupling Strength in the Presence of External Stimuli. Appl. Phys. Lett. 87 (2005) 083901
6. Wang, R.B.: A New Nonlinear Phase Setting Models in Neurons Activities. In: Wang, L.P., Rajapakse, J.C. (eds.): Proceedings of 9th International Conference on Neural Information Processing, Vol. 5. Fudan University Press, Shanghai (2002) 2497-2501
7. Wang, R.B., Yu, W.: A Stochastic Nonlinear Evolution Model and Dynamic Neural Coding on Spontaneous Behavior of Large-scale Neuronal Population. In: Wang, L.P., Chen, K. (eds.): Advances in Natural Computation, Part 1. Lecture Notes in Computer Science, Vol. 3610. Springer-Verlag, Berlin Heidelberg (2005) 490-497
8. Wang, R.B., Yu, W.: Stochastic Nonlinear Evolutional Model of the Large-Scaled Neuronal Population and Dynamic Neural Coding Subject to Stimulation. J. Biol. Med. Eng. (2006) (in press)
9. Tass, P.A.: Phase Resetting in Medicine and Biology. Springer-Verlag, Berlin Heidelberg (1999)
10. Wang, R.B.: Some Advances in Nonlinear Stochastic Evolution Models of Neuron Population. In: Zhu, W.Q. (eds.): Advances in Stochastic Structural Dynamics. CRC Press, USA (2003) 453-461
Approximation Bound of Mixture Networks in Lpω Spaces Zongben Xu, Jianjun Wang, and Deyu Meng Institute for Information and System Science, Xi’an Jiaotong University, Xi’an, Shaan’xi, 710049, P.R. China {zbxu, wangjianjun}@mail.xjtu.edu.cn
Abstract. The approximation order estimation problem for multidimensional functions by the mixture-of-experts neural networks is studied. It is shown that, under a very mild condition on the activation functions, the mixture neural networks have the same approximation order as the normal feedforward sigmoid neural networks. The obtained result sharpens the estimation developed by Maiorov and Meir in IEEE Trans. on Neural Networks (9 (1998) 969-978) over the compact region in L^p_ω spaces and underlies the applicability of the mixture neural networks.
1 Introduction
Universal approximation capabilities of a broad range of neural networks have been established by many researchers (see, e.g., Cybenko [1], Mhaskar [5] and Chen [3]). These works mainly concentrate on the problem of denseness, i.e., the feasibility of using feedforward neural networks (FNNs) as universal approximators. From the application point of view, however, a more noteworthy problem is the degree of approximation by FNNs for specific types of functions. Recently, several authors have derived approximation bounds of three-layered FNNs for many different classes of functions (see, e.g., [4]-[9]); the obtained results characterize quantitatively the approximation accuracy of FNNs. In [10], Jordan and Jacobs introduced a new type of neural network called the mixture network, or the mixture of experts. Such a network, endowed with a certain probabilistic interpretation, consists of a finite number of sub-networks, each accomplishing a specific subtask, and requires the activations of the hidden-layer units to sum to unity. It is shown in [10] that the mixture networks have several distinct advantages over normal FNNs; for instance, they are easier and more efficient to train and have much stronger generalization capability. It is, however, unclear whether such networks can attain the same approximation accuracy as normal FNNs. Maiorov and Meir ([6]) studied this problem in a specific setting. Using the theory of weighted polynomial approximation, they developed an approximation order estimation of the form O(n^{-αr/d}) with 0 < α = 1 - 1/β < 1 in the measure of L^p_ω, where the weight is ω(x) = exp(-a|x|^β) with β ≥ 2 and a > 0.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 60–65, 2006. © Springer-Verlag Berlin Heidelberg 2006

The obtained approximation order is the same as that of FNNs in the setting of L^p_ω. This result, however, does not suggest that O(n^{-αr/d}) is optimal for approximation by the mixture networks. In this paper, our aim is to provide a sharpened, nearly optimal approximation order estimation for the mixture networks through the use of a new weight in the measure of L^p_ω. The obtained result provides not only an affirmative answer to the approximation accuracy estimation problem, but also underlies the applicability of the mixture networks.
We first briefly state the mixture network structure to be studied (for more details we refer to [6], [10]). Consider the problem of modeling a regression function E(y|x) in a probabilistic setting. A mixture network, also called a mixture-of-experts model, is composed of n expert networks, each of which solves a function approximation problem over a local region of the input space. The probability model of each expert is p(y|x; θ_j), j = 1, 2, ..., n, where x ∈ R^d is the input vector, y ∈ R is the output, and θ_j ∈ Θ is the parameter vector associated with each expert. Thus the overall probabilistic model assumes the form of the mixture density
p(y|x; Θ) = Σ_{j=1}^{n} g_j(x; θ_g) p(y|x; θ_j),  (1)
where g_j(x; θ_g) ≥ 0 and Σ_{j=1}^{n} g_j(x; θ_g) = 1.  (2)
The regression function E(y|x; Θ) is obtained by taking the expectation with respect to the random variable y in (1), that is,
E(y|x; Θ) = Σ_{j=1}^{n} g_j(x; θ_g) μ(x; θ_j),  (3)
where μ(x; θ_j) = E_{p(y|x;θ_j)}(y). In applications, p(y|x; θ_j), or equivalently μ(x; θ_j) and g_j(x; θ_g), assume specific forms. For the purpose of convergence or approximation order estimation, however, one can assume μ(x; θ_j) to be constant (since g_j is a general nonlinear function of x, and μ(x; θ_j) can be absorbed into it whenever necessary). With this understanding, the regression function defined by (3) can be equivalently written as
E(y|x; Θ) = Σ_{j=1}^{n} c_j g_j(x; θ_g).  (4)
In [6], with the sigmoid function φ(x) = (1 + e^{-x})^{-1}, the gates g_j are taken in the form
g_j(x; θ_g) = φ(a_j^T x + b_j) / Σ_{i=1}^{n} φ(a_i^T x + b_i),  j = 1, 2, ..., n.
Thus the regression functions defined by the mixture networks constitute the following family of functions:
F_n = { f_n(x) : f_n(x) = Σ_{k=1}^{n} c_k φ(a_k · x + b_k) / Σ_{j=1}^{n} φ(a_j · x + b_j),  a_k ∈ R^d, b_k, c_k ∈ R }.  (5)
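The family (5) is straightforward to evaluate numerically; in particular, the normalized gates satisfy the sum-to-unity constraint (2) by construction. A small sketch with arbitrary illustrative weights:

```python
import numpy as np

# Evaluating a member of the family F_n in Eq. (5): a mixture network whose
# normalized sigmoid gates automatically satisfy constraint (2).
# The weights a_k, b_k, c_k below are arbitrary illustrative values.
def phi(t):
    return 1.0 / (1.0 + np.exp(-t))  # sigmoid activation

def mixture_net(x, a, b, c):
    """f_n(x) = sum_k c_k g_k(x), gates g_k = phi(a_k.x + b_k) / sum_j phi(a_j.x + b_j)."""
    acts = phi(x @ a.T + b)                          # shape (num_points, n)
    gates = acts / acts.sum(axis=1, keepdims=True)   # rows sum to one
    return gates @ c, gates

rng = np.random.default_rng(1)
n, d = 6, 2
a = rng.normal(size=(n, d))
b = rng.normal(size=n)
c = rng.normal(size=n)
x = rng.uniform(-1, 1, size=(5, d))

f, gates = mixture_net(x, a, b, c)
print(gates.sum(axis=1))  # each row sums to 1, as required by Eq. (2)
```

Since the sigmoid is strictly positive, the denominator never vanishes and the gates are well defined everywhere.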
In the subsequent study we do not limit the function φ to be the sigmoid function; instead, we assume that φ is a monotonic activation function for which there exists a constant b satisfying |φ^{(k)}(b)| > c > 0 for k = 1, 2, .... By [2], any monotonic sigmoid function has this property.
2 Notation and Main Results
Following [11], we take the Chebyshev orthogonal weight in the measure of L^p_ω. That is, for any x = (x_1, x_2, ..., x_d) ∈ [-1, 1]^d, we take ω(x) = Π_{i=1}^{d} ω(x_i) with ω(x_i) = (1 - x_i^2)^{-1/2}. With this weight function, we define the weighted norm of a function f by
|| f ||_{p,ω} = ( ∫_{[-1,1]^d} | ω(x) f(x) |^p dx )^{1/p},  1 ≤ p < ∞,  (6)
where m = (m_1, m_2, ..., m_d) ∈ Z^d and dx = dx_1 dx_2 ··· dx_d. We denote the class of functions for which || f ||_{p,ω} is finite by L^p_ω; this is a Banach space. The function class we wish to approximate by the mixture neural networks is
Ψ^{r,d}_{p,ω} = { f : || f^{(λ)} ||_{p,ω} ≤ M, |λ| ≤ r },  (7)
where λ = (λ_1, λ_2, ..., λ_d), |λ| = λ_1 + λ_2 + ··· + λ_d, f^{(λ)} = ∂^{|λ|} f / (∂x_1^{λ_1} ··· ∂x_d^{λ_d}), r is any natural number, and M < ∞. For any two subsets F and G of L^p_ω, we define the distance from F to G by
dist(F, G, L^p_ω) = sup_{f ∈ F} inf_{g ∈ G} || f - g ||_{p,ω},
and we define a class of multivariate polynomials by
P_m = { P : P(x) = Σ_{0 ≤ |i| ≤ |m|} b_{i_1,i_2,...,i_d} x_1^{i_1} ··· x_d^{i_d},  b_{i_1,i_2,...,i_d} ∈ R, ∀ i_1, ..., i_d }.
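The Chebyshev weight chosen in this section makes weighted integrals directly computable by Gauss–Chebyshev quadrature, which is convenient for checking norms of the form (6) numerically. A minimal one-dimensional sketch (the test integrand is illustrative):

```python
import math

# Gauss-Chebyshev quadrature for the weight w(x) = (1 - x^2)^(-1/2):
#   integral_{-1}^{1} f(x) (1 - x^2)^(-1/2) dx  ~  (pi/n) * sum_k f(x_k),
# with nodes x_k = cos((2k-1) pi / (2n)); exact for polynomials of degree < 2n.
def gauss_chebyshev(f, n):
    nodes = [math.cos((2 * k - 1) * math.pi / (2 * n)) for k in range(1, n + 1)]
    return (math.pi / n) * sum(f(x) for x in nodes)

# The integral of x^2 (1 - x^2)^(-1/2) over [-1, 1] equals pi/2.
val = gauss_chebyshev(lambda x: x * x, 5)
print(val)  # -> 1.5707963... (pi/2)
```

Because the rule is exact on polynomials of degree below 2n, five nodes already reproduce the value π/2 to machine precision.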
The main result of this work is the following theorem.
Theorem. For any 1 ≤ p < ∞, there holds the estimation
dist( Ψ^{r,d}_{p,ω}, F_n, L^p_ω ) < C n^{-r/d}.
Here and hereafter C is a positive constant independent of n and x (its value may differ in different contexts).
This theorem reveals two things: (i) for any multivariate function f ∈ Ψ^{r,d}_{p,ω}, there is a mixture neural network of the form (4) that approximates f arbitrarily well in L^p_ω; that is, the mixture neural networks can be used as universal approximators of functions in Ψ^{r,d}_{p,ω}; (ii) quantitatively, the approximation accuracy of a mixture network of the form (4) can attain the order O(n^{-r/d}), where d is the dimension of the input space and r is the smoothness of the function to be approximated. This estimation is exactly the same as that for normal FNNs proved in [11].
Clearly, the approximation order estimation O(n^{-r/d}) developed here sharpens that of [6], in the sense that the parameter α in the estimation O(n^{-αr/d}) of [6] has been dismissed. This shows that our estimation is more fundamental. Furthermore, we can show that in L^2_ω the approximation order estimation developed here is optimal, so it can be conjectured that the estimation presented in this work cannot be improved further.
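The n-dependence of the approximation error can also be probed empirically. The sketch below fits only the outer coefficients of a network of the form (5) by least squares over a pool of random hidden units; it illustrates the qualitative behavior, not the constructive proof of the Theorem, and all sampling choices (target function, weight distributions) are assumptions:

```python
import numpy as np

# Empirical look at how the fitting error of networks of the form (5) behaves
# as the number n of hidden units grows: fix random inner weights, solve for
# the outer coefficients c_k by least squares. Illustrative only.
rng = np.random.default_rng(2)
pool_a = rng.normal(scale=3.0, size=64)
pool_b = rng.uniform(-2.0, 2.0, size=64)
x = np.linspace(-1.0, 1.0, 400)
target = np.sin(np.pi * x)

def fit_error(n):
    acts = 1.0 / (1.0 + np.exp(-(np.outer(x, pool_a[:n]) + pool_b[:n])))
    basis = acts / acts.sum(axis=1, keepdims=True)   # the normalized gates of Eq. (5)
    c, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return float(np.sqrt(np.mean((basis @ c - target) ** 2)))

errors = {n: fit_error(n) for n in (4, 16, 64)}
print(errors)   # larger unit pools typically fit markedly better
```

The observed decay with n is consistent with, though of course no substitute for, the rate stated in the Theorem.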
3 Proof of the Main Theorem
The proof of the Theorem requires several lemmas, which we present first.
Lemma 1 (see [11]). For any 1 ≤ p < ∞, f ∈ Ψ^{r,d}_{p,ω}, and m = (m_1, m_2, ..., m_d) with m_i ≤ m, there holds
inf_{P ∈ P_m} || f - P ||_{p,ω} ≤ C m^{-r}.
Lemma 2. The function ϕ(x) defined by
ϕ(x) = (2ρ)^{-d} ∫_{[-ρ,ρ]^d} φ(t · x) dt,  0 < ρ < 1,
has the following properties:
(i) |ϕ(x)| ≥ | Π_{i=1}^{d} φ(-ρ|x_i|) | or |ϕ(x)| ≥ | Π_{i=1}^{d} φ(ρ|x_i|) |;
(ii) let s = (s_1, ..., s_d); then for all f ∈ Ψ^{r,d}_{p,ω},
|| f ϕ ||_{p,ω} = Σ_{|s| ≤ r} || D^s (f ϕ) ||_{p,ω} ≤ C_1 < ∞,
where C_1 = C_1(r, ρ, d) < ∞, i.e., f ϕ ∈ Ψ^{r,d}_{p,ω}.
Lemma 3. For the following class of functions
Γ_n = { u_n(x) : u_n(x) = Σ_{0 ≤ k ≤ N} d_k p_k(x) / ϕ(x),  d_k ∈ R, k ∈ Z_+^d },
where the polynomials p_k(x) ∈ P_k and N = (N, N, ..., N) with N = n^{1/d}, we have
dist{ Ψ^{r,d}_{p,ω}, Γ_n, L^p_ω } ≤ C n^{-r/d}.  (8)
Lemma 2 and Lemma 3 are straightforward, so we omit the proofs.
Lemma 4. Let s = (s_1, s_2, ..., s_d) and s̄ = Π_{i=1}^{d} (1 + s_i). Then there exist coefficients a_{r,k} such that, for h sufficiently small,
|| p_k(x)/ϕ(x) - [ Σ_{0 ≤ r ≤ s} a_{r,k} φ(h(2r - s) · x) ] / [ s̄^{-1} Σ_{0 ≤ r ≤ s} φ(h(2r - s) · x) ] ||_{p,ω} ≤ M_k C Ω(φ, h),  (9)
where M_k is a constant depending only on k, p, d, and Ω(f, h) = sup_{|t| ≤ h} |f(x + t) - f(x)| is the modulus of continuity of f.
Proof. Let
A_h(x, k) = Σ_{0 ≤ r ≤ s} a_{r,k} φ(h(2r - s)^T x),  B_h(x) = s̄^{-1} Σ_{0 ≤ r ≤ s} φ(h(2r - s)^T x).
Since
|| p_k(x)/ϕ(x) - A_h(x, k)/B_h(x) ||_{p,ω} ≤ || (ϕ(x) - B_h(x)) p_k(x) / (ϕ(x) B_h(x)) ||_{p,ω} + || (p_k(x) - A_h(x, k)) / B_h(x) ||_{p,ω} = I_1 + I_2,
we estimate the bounds of I_1 and I_2 respectively. First,
|ϕ(x) - B_h(x)| = | (2ρ)^{-d} ∫_{[-ρ,ρ]^d} φ(t · x) dt - s̄^{-1} Σ_{0 ≤ r ≤ s} φ(h(2r - s) · x) |
= | s̄^{-1} (2h)^{-d} Σ_{0 ≤ r ≤ s} ∫_{[0,2h]^d} { φ((t + h(2r - s)) · x) - φ(h(2r - s) · x) } dt |
≤ Ω(φ, ||t · x||_∞) ≤ 2d(1 + ||x||_∞) Ω(φ, h).  (10)
Using the approach in [6], we can express the polynomials p_k(x) as p_k(x) = Σ_{0 ≤ j ≤ k} b_j x^j, x ∈ R^d, and by Minkowski's inequality we derive
|| p_k(x) / (ϕ(x) B_h(x)) ||_{p,ω} ≤ || Σ_{0 ≤ j ≤ k} |b_j| |x^j| Π_{i=1}^{d} (1 + e^{x_i})^2 ||_{p,ω} ≤ M_k.  (11)
From (10) and (11) we thus obtain I_1 ≤ M_k C Ω(φ, h). For I_2, we can follow the proof of Theorem 2 in [11] to show that for any ε > 0 we have I_2 ≤ ε. Since ε is arbitrary, combining the estimates of I_1 and I_2 gives Lemma 4.
Now we prove the Theorem. Consider the function
f_{N,s,h}(x) = Σ_{k ≤ N} d_k A_h(x, k) / B_h(x),
where N = (n^{1/d}, n^{1/d}, ..., n^{1/d}) and k, r, s ∈ Z_+^d. Clearly, f_{N,s,h}(x) ∈ F_n. Using the triangle inequality, Lemma 3, Minkowski's inequality and Lemma 4, we have
|| f(x) - f_{N,s,h}(x) ||_{p,ω} ≤ || f(x) - Σ_{k ≤ N} d_k p_k(x)/ϕ(x) ||_{p,ω} + || Σ_{k ≤ N} d_k ( p_k(x)/ϕ(x) - A_h(x, k)/B_h(x) ) ||_{p,ω}
≤ C n^{-r/d} + C Ω(φ, h) Σ_{k ≤ N} M_k |d_k|.
Since the above inequality holds for any p_k(x), we may take the minimum over p_k ∈ P_N, set Ω(φ, h) ≤ n^{-r/d} / ( C Σ_{k ≤ N} M_k |d_k| ), and then obtain the Theorem.
Acknowledgement This work was supported by the National Natural Science Foundation of China under Grants Nos. 10371097 and 70531030.
References
1. Cybenko, G.: Approximation by Superpositions of a Sigmoidal Function. Math. Contr. Signals Syst. 2 (1989) 303-314
2. Xu, Z.B., Cao, F.L.: Simultaneous Lp-Approximation Order for Neural Networks. Neural Networks 18(7) (2005) 914-923
3. Chen, T.P., Chen, H.: Approximation Capability to Functions of Several Variables, Nonlinear Functions, and Operators by Radial Function Neural Networks. IEEE Trans. Neural Networks 6 (1995) 904-910
4. Barron, A.R.: Universal Approximation Bound for Superpositions of a Sigmoidal Function. IEEE Trans. Inform. Theory 39 (1993) 930-945
5. Mhaskar, H.N.: Neural Networks for Optimal Approximation for Smooth and Analytic Functions. Neural Comput. 8 (1996) 164-177
6. Maiorov, V., Meir, R.S.: Approximation Bounds for Smooth Functions in C(R^d) by Neural and Mixture Networks. IEEE Trans. Neural Networks 9 (1998) 969-978
7. Burger, M., Neubauer, A.: Error Bounds for Approximation with Neural Networks. J. Approx. Theory 112 (2001) 235-250
8. Kurkova, V., Sanguineti, M.: Comparison of Worst Case Errors in Linear and Neural Network Approximation. IEEE Trans. Inform. Theory 48 (2002) 264-275
9. Wang, J.L., Sheng, B.H., Zhou, S.P.: On Approximation by Non-periodic Neural and Translation Networks in L^p_ω Spaces. ACTA Mathematica Sinica (in Chinese) 46 (2003) 65-74
10. Jordan, M.I., Jacobs, R.A.: Hierarchical Mixtures of Experts and the EM Algorithm. Neural Comput. 6 (1994) 181-214
11. Wang, J.J., Xu, Z.B., Xu, W.J.: Approximation Bounds by Neural Networks in L^p_ω. Lecture Notes in Computer Science, Vol. 3173. Springer-Verlag, Berlin Heidelberg New York (2004) 1-6
Integral Transform and Its Application to Neural Network Approximation Feng-jun Li and Zongben Xu Faculty of Science, Institute for Information and System Science, Xi’an Jiaotong University, Xi’an,710049, P.R. China
[email protected]
Abstract. Neural networks are widely used to approximate nonlinear functions. In order to study their approximation capability, a theorem on the integral representation of functions is developed by using an integral transform. Using the developed representation, an approximation order estimation for the bell-shaped neural networks is obtained. The obtained result reveals that the approximation accuracy of the bell-shaped neural networks depends not only on the number of hidden neurons, but also on the smoothness of the target functions.
1 Introduction
Many authors (see, e.g., [1-5]) have concluded that a three-layered feed-forward neural network (FNN) can approximate an arbitrary function with the desired accuracy whenever the FNN has sufficiently many hidden neurons. The approximation can be realized in many different ways, for example by using the Fourier transform and convolution. Recently, integral transformations have been applied to this question. Murata ([6]) introduced an integral transformation of sigmoidal functions, proved the universal approximation capability of FNNs, and gave an approximation theorem (Theorem 2) for p = 2. In this work, we study the approximation bound estimation problem of neural networks. The aim is to reveal the quantitative relation between the approximation accuracy of an FNN and the number of its hidden neurons. We also prove an approximation theorem for 2 ≤ p < ∞. As a first step towards this goal, we focus in the present work on the case when complete information about the target function is known (thus, analytic tools like integral transformation can be applied). Throughout the paper we use boldface characters for vectors and regular characters for scalars. In Section 2 we introduce the integral transform and inverse transform of functions by using ridge functions. In Section 3 we apply the obtained approximate representation theorem to derive an approximation bound estimation of the bell-shaped three-layered FNNs. Finally, in Section 4 we summarize our current research.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 66–71, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 An Integral Transformation and Its Approximate Inverse
A ridge function (Martin and Allan [7]) is a multivariate function h : R^n → R of the form
h(x_1, x_2, ..., x_n) = g(a_1 x_1 + a_2 x_2 + ··· + a_n x_n - b) = g(a · x - b),  (1)
where g : R → R, b ∈ R, a = (a_1, a_2, ..., a_n) ∈ R^n \ {0}, and a · x is the inner product of a and x. That is to say, a ridge function is a multivariate function that takes constant values on the parallel hyperplanes a · x = c, c ∈ R. The vector a ∈ R^n \ {0} in Eq. (1) is generally called the direction.
In the following, we discuss the possibility and the method of approximating a given function f : R^n → R by a linear combination of ridge functions. We assume that f ∈ L^1(R^n) ∩ L^p(R^n) (2 ≤ p < ∞). Two bounded functions ϕ, ψ ∈ L^1(R) ∩ L^2(R) are said to be a permitted pair whenever they satisfy
ϕ̂(-y) ψ̂(-y) = ϕ̂(y) ψ̂(y),
∫_0^∞ | ϕ̂(y) ψ̂(y) | / y^n dy < ∞,  (2)
∫_0^∞ ϕ̂(y) conj(ψ̂(y)) / y^n dy ≠ 0,
where "ˆ" denotes the Fourier transform and conj(·) denotes the complex conjugate. Let
C = ∫_R ϕ̂(y) conj(ψ̂(y)) / |y^n| dy = 2 ∫_0^∞ ϕ̂(y) conj(ψ̂(y)) / y^n dy.  (3)
Using the ridge function ϕ, an integral transform W_ϕ of the function f is introduced by Murata ([6]) and Widder ([8]) as follows:
(W_ϕ f)(a, b) = (1 / ((2π)^n C)) ∫_{R^n} ϕ(a · x - b) f(x) dx.  (4)
For the transformation W_ϕ, we can prove the following estimation in terms of the L^p norm.
Theorem 1. Let f ∈ L^1(R^n) ∩ L^p(R^n) (2 ≤ p < ∞). Then for any ε > 0 there holds, for δ > 0 sufficiently small, the inequality
|| f(x) - ∫_{R^{n+1}} (W_ϕ f)(a, b) ψ(a · x - b) e^{-δ|a|^2} da db ||_p ≤ ε.  (5)
Proof. Let us define
F(x) = (1 / ((2π)^n C)) ∫_{R^{2n+1}} ψ(a · x - b) ϕ(a · z - b) f(z) e^{-δ|a|^2} da db dz.  (6)
Since f(z) ∈ L^1(R^n), ϕ is bounded, and ψ ∈ L^1(R), the right-hand side of Eq. (6) is absolutely integrable. Let ρ(x, δ) = (4πδ)^{-n/2} e^{-|x|^2/(4δ)}; according to the Fubini theorem, Parseval's formula and the properties of the convolution operation, we obtain
F(x) = (2/C) ∫_0^∞ ( ϕ̂(w) conj(ψ̂(w)) / w^n ) ( ρ(·, δ/w^2) ∗ f )(x) dw,  (7)
where "∗" denotes the convolution. Using the Hölder inequality, we then get
|| F(x) - f(x) ||_p ≤ (2/|C|) ∫_0^∞ ( | ϕ̂(w) conj(ψ̂(w)) | / w^n ) || ρ(x, δ/w^2) ∗ f(x) - f(x) ||_p dw
= (2/|C|) ∫_0^r ( | ϕ̂(w) conj(ψ̂(w)) | / w^n ) || ρ(x, δ/w^2) ∗ f(x) - f(x) ||_p dw
+ (2/|C|) ∫_r^∞ ( | ϕ̂(w) conj(ψ̂(w)) | / w^n ) || ρ(x, δ/w^2) ∗ f(x) - f(x) ||_p dw
=: A_1 + A_2.  (8)
Since
lim_{δ→0} || ρ(x, δ) ∗ f(x) - f(x) ||_p = 0  (9)
and
|| ρ(x, δ) ∗ f(x) ||_p ≤ || f(x) ||_p  (10)
hold in any case (Lu and Wang [9]), we have || ρ(x, δ) ∗ f(x) - f(x) ||_p ≤ 2 || f(x) ||_p. In addition,
∫_0^∞ | ϕ̂(y) conj(ψ̂(y)) | / y^n dy < ∞.  (11)
From Eqs. (10) and (11), we conclude that for any ε > 0 we may choose r so that |A_1| < ε/2. From Eqs. (9) and (11), we may then choose δ_0 so that δ/r^2 < δ_0 implies |A_2| < ε/2. This finishes the proof of Theorem 1.
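The mollifier fact (9) used above — that Gaussian smoothing ρ(·, δ) ∗ f converges to f as δ → 0 — can be checked numerically. A one-dimensional discrete-convolution sketch (the grid and the test function are illustrative):

```python
import numpy as np

# Numerical check of Eq. (9): for the Gaussian kernel
# rho(x, delta) = (4*pi*delta)^(-1/2) * exp(-x^2 / (4*delta))  (n = 1),
# the smoothed function rho(., delta) * f approaches f as delta -> 0.
x = np.linspace(-10, 10, 2001)
h = x[1] - x[0]
f = np.exp(-x**2)  # a smooth integrable test function

def mollify(f, delta):
    kernel = np.exp(-x**2 / (4 * delta)) / np.sqrt(4 * np.pi * delta)
    return h * np.convolve(f, kernel, mode="same")   # h * discrete convolution

for delta in (1.0, 0.1, 0.01):
    err = float(np.max(np.abs(mollify(f, delta) - f)))
    print(delta, err)  # the error shrinks as delta decreases
```

The sup-norm error decreases monotonically here; the L^p statement (9) behaves the same way on this example.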
3 Application to Neural Networks

By using the above approximate representation theorem, we can estimate the approximation bound of some special types of feed-forward neural networks (FNNs).

3.1 Three-Layered FNNs with Bell-Shaped Activation Functions
A sigmoidal function σ with the properties
σ(x) → 1 as x → ∞,  σ(x) → 0 as x → -∞,
σ′(x) > 0 for x ∈ R,  σ′(x) → 0 as |x| → ∞,  (12)
has been extensively used as the activation function of the hidden neurons of an FNN, but it does not belong to L^1(R), so the previous approximate integral representation theorem cannot be directly applied to sigmoidal FNNs. Thus we
introduce a so-called bell-shaped activation function. A bell-shaped function ψ has the following properties: ∫_R ψ(x) dx < ∞, max_x ψ(x) = 1, ψ(x) ≥ 0 for x ∈ R, and ψ(x) → 0 as |x| → ∞. Clearly, any bell-shaped function is unimodal. In the following, we consider FNNs with bell-shaped functions as the activation functions of the hidden neurons. Such a general FNN can be expressed as
N(x) = Σ_{i=1}^{m} c_i ψ(a_i · x - b_i),  (13)
for any x ∈ R^n, where a_i ∈ R^n, b_i, c_i ∈ R, and m denotes the number of hidden neurons. According to Katsuyuki, Taichi and Naohiro ([10]), a bell-shaped function can be constructed from two sigmoidal functions. For example, ψ(x) = c(σ(x + d) - σ(x - d)) is a bell-shaped function, where c is a constant normalizing the maximum value and d is a positive constant. In this case, a bell-shaped hidden neuron can be constructed from two sigmoidal hidden neurons, so in this sense a bell-shaped FNN can be regarded as a sigmoidal FNN with twice as many sigmoidal hidden neurons.

3.2 Function Approximation by Bell-Shaped FNNs
In this subsection, by using Theorem 1, we prove a theorem on the approximation bound estimation of bell-shaped FNNs. The obtained result quantitatively characterizes the approximation capability of bell-shaped FNNs for integrable functions. We assume that the input x is generated according to a probability density μ(x), and the approximation bound of a three-layered bell-shaped FNN N(x) with m hidden neurons will be estimated in terms of the L^p_μ norm. That is, we estimate the error function defined by
|| N(x) - f(x) ||_{p,μ} = ( ∫_{R^n} (N(x) - f(x))^p μ(x) dx )^{1/p},  2 ≤ p < ∞,  (14)
where f is the target function to be approximated. Let
C_W = ∫_{R^{n+1}} | Re (W_ϕ f)(a, b) | da db < ∞.  (15)
Our main result is the following theorem.
Theorem 2. Assume that the absolute integral C_W defined by (15) is bounded, that ψ is a bell-shaped function, and that ϕ is a function such that (ψ, ϕ) is a permitted pair. If N(x) is a three-layered FNN with ψ as the activation function of its hidden neurons, then, for any input distribution μ(x), there holds the following approximation order estimation:
|| N(x) - f(x) ||_{p,μ} = O( m^{-1/p} ),  2 ≤ p < ∞,  (16)
where "O" is independent of x, m, n, f, N.
Proof. Define
C̃(a, b) = Sign( Re (W_ϕ f)(a, b) ) C_W  (17)
and
μ(a, b) = (1 / C_W) | Re (W_ϕ f)(a, b) |.  (18)
By the assumption, we obtain
∫_{R^{n+1}} | (W_ϕ f)(a, b) | da db < ∞.  (19)
Without loss of generality, we assume that f and ψ are both real functions. From [6] we get
f(x) = ∫_{R^{n+1}} C̃(a, b) ψ(a · x - b) μ(a, b) da db.  (20)
Since μ(a, b) is positive and its integral over R^{n+1} is equal to 1, μ(a, b) can be viewed as a probability density in a and b. Let us consider the function defined by
Ñ(x) = (1/m) Σ_{i=1}^{m} C̃(a_i, b_i) ψ(a_i · x - b_i),
where the (a_i, b_i), i = 1, 2, ..., m, are independently chosen according to the probability density μ(a, b). From Eqs. (15)–(20), we know that Ñ(x) - f(x) is a bounded function, so there exists a positive constant M such that |Ñ(x) - f(x)| ≤ M. The expectation and variance of Ñ(x) are given by (Murata [6])
E(Ñ(x)) = f(x),  (21)
V(Ñ(x)) ≤ (1/m) ( C_W^2 - f(x)^2 ).  (22)
Combining Eqs. (21) and (22) and making use of the properties of expectation and variance, we thus have
E( ∫ (Ñ(x) - f(x))^p μ(x) dx ) = ∫ E[ ( (Ñ(x) - f(x))^2 )^{p/2} ] μ(x) dx
= M^p ∫ E[ ( ( (Ñ(x) - f(x)) / M )^2 )^{p/2} ] μ(x) dx
≤ M^p ∫ E( ( (Ñ(x) - f(x)) / M )^2 ) μ(x) dx
≤ (C_1 / m) ( C_W^2 - ||f||_{2,μ}^2 )
≤ C_2 / m,  2 ≤ p < ∞,  (23)
where C1 and C2 are different positive constants. This finishes the proof of Theorem 2.
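Two ingredients of Theorem 2 can be illustrated numerically: the bell-shaped activation of Sect. 3.1 built from two sigmoids, and the 1/m variance decay behind (21)–(22) that produces the O(m^{-1/p}) rate. The concrete numbers below are illustrative assumptions:

```python
import numpy as np

# (1) A bell-shaped psi built from two sigmoids: psi(x) = c*(sigma(x+d) - sigma(x-d)).
# (2) The Monte Carlo mechanism behind (21)-(22): the variance of an average of
#     m i.i.d. unbiased terms decays like 1/m. All constants are illustrative.
def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 2.0
c = 1.0 / (sigma(d) - sigma(-d))          # normalize so that the peak psi(0) = 1
psi = lambda x: c * (sigma(x + d) - sigma(x - d))
print(psi(0.0), psi(20.0))                # peak ~1 at 0, tails decay to 0

rng = np.random.default_rng(3)
var_of_mean = {}
for m in (10, 1000):
    averages = rng.normal(0.0, 1.0, size=(2000, m)).mean(axis=1)
    var_of_mean[m] = float(averages.var())
print(var_of_mean)                        # variance shrinks roughly like 1/m
```

The empirical 1/m decay of the variance of the m-term average is exactly what (22), combined with (23), converts into the L^p_μ rate (16).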
4 Conclusion
We have developed an approximate representation theorem for integrable functions by using ridge functions. Using this representation theorem, we have established an approximation order estimation for the bell-shaped FNNs. The developed estimation reveals that the approximation error of the bell-shaped neural networks decreases as the number of hidden neurons increases, at the rate O(m^{-1/p}). Since the bell-shaped FNNs have especially important applications in dynamical system simulation, the obtained result is of great significance in understanding and clarifying the approximation capability of the bell-shaped FNNs.
Acknowledgement This work was supported by NSFC projects under contract Nos. 10371097 and 70531030.
References
1. Lewicki, G., Marino, D.: Approximation of Functions of Finite Variation by Superpositions of a Sigmoidal Function. Applied Mathematics Letters 17(12) (2004) 1147-1152
2. Barron, A.R.: Universal Approximation Bounds for Superpositions of a Sigmoidal Function. IEEE Trans. Inform. Theory 39(6) (1993) 930-945
3. Funahashi, K.: On the Approximate Realization of Continuous Mappings by Neural Networks. Neural Networks 2(1) (1989) 183-192
4. Xin, L.: Simultaneous Approximations of Multivariate Functions and Their Derivatives by Neural Networks with One Hidden Layer. Neurocomputing 12(8) (1996) 327-343
5. Makovoz, Y.: Uniform Approximation by Neural Networks. Journal of Approximation Theory 95(11) (1998) 215-228
6. Murata, N.: An Integral Representation of Functions Using Three-layered Networks and Their Approximation Bounds. Neural Networks 9(6) (1996) 947-956
7. Martin, D.B., Allan, P.: Identifying Linear Combinations of Ridge Functions. Advances in Applied Mathematics 22(1) (1999) 103-118
8. Widder, D.V.: An Introduction to Transform Theory. Academic Press, New York (1971)
9. Lu, S.Z., Wang, K.Y.: Real Analysis. Beijing Normal University Press, Beijing (in Chinese) (1997)
10. Katsuyuki, H., Taichi, H., Naohiro, T.: Upper Bound of the Expected Training Error of Neural Network Regression for a Gaussian Noise Sequence. Neural Networks 14(10) (2001) 1419-1429
The Essential Approximation Order for Neural Networks with Trigonometric Hidden Layer Units Chunmei Ding1 , Feilong Cao1,2 , and Zongben Xu2 1
Department of Information and Mathematics Sciences, College of Science, China Jiliang University, Hangzhou, Zhejiang, 310018, P.R. China
[email protected] 2 Institute for Information and System Sciences, Faculty of Science, Xi’an Jiaotong University, Xi’an, Shaanxi, 710049, P.R. China
[email protected]
Abstract. There have been various studies on approximation ability of feedforward neural networks. The existing studies are, however, only concerned with the density or upper bound estimation on how a multivariate function can be approximated by the networks, and consequently, the essential approximation ability of networks cannot be revealed. In this paper, by establishing both upper and lower bound estimations on approximation order, the essential approximation ability of a class of feedforward neural networks with trigonometric hidden layer units is clarified in terms of the second order modulus of smoothness of approximated function.
1 Introduction
Function approximation is the most fundamental capability of feedforward neural networks (FNNs). Various studies on the density or feasibility of FNNs approximating continuous or integrable functions have been made in past years; typical results can be found in [7], [9], [10], [17], [4] and [6]. Among those studies, it is shown in particular that any continuous multivariate function defined on a compact set of $\mathbb{R}^d$ can be approximated arbitrarily well by a one-hidden-layer FNN of the form
$$N(x)=\sum_{i=1}^m c_i\,\sigma_i\Bigl(\sum_{j=1}^d w_{ij}x_j+\theta_i\Bigr),\qquad x\in\mathbb{R}^d,\ d\ge1, \tag{1}$$
where for any $1\le i\le m$, $\theta_i\in\mathbb{R}$ is the threshold, $w_i=(w_{i1},w_{i2},\ldots,w_{id})^T\in\mathbb{R}^d$ is the connection weight of neuron $i$ in the hidden layer with the input neurons, $c_i\in\mathbb{R}$ is its connection weight with the output neuron, and $\sigma_i(\cdot)$ is the sigmoidal activation function. In (1), the number $m$ has special importance: it determines the number of hidden units used and correspondingly specifies the topology of the sigmoidal one-hidden-layer FNN. In most of the existing studies, such an $m$ is justified to exist and to be finite, though sufficiently large; it has, however, hardly been explicitly specified.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 72-79, 2006. © Springer-Verlag Berlin Heidelberg 2006
In recent years, quantitative study of the approximation order or approximation speed of the neural networks (1) has attracted much attention (see [1], [12], [15], [13], [5], [3] and [16]). In all these studies, upper bound estimations on the approximation order of the neural networks were established. Such upper bound estimations can, on the one hand, imply the convergence of the neural networks to the approximated function and, on the other hand, provide quantitative estimations of how accurately the neural networks approximate the function. These estimations cannot, however, completely characterize the approximation capability of the neural networks in general, because an established upper bound estimation might be too loose to reflect their inherent approximation capability. Given an FNN, let us refer to the highest approximation accuracy the FNN may achieve as the essential approximation order of the FNN. Clearly, the inherent approximation capability of an FNN is uniquely characterized by its essential approximation order, so studying the essential approximation order of an FNN is fundamental and important for clarifying its inherent approximation capability.

In the present paper, we derive such an essential approximation order estimation for a special class of neural networks with trigonometric hidden layer units. Our approach is to establish upper and lower bound estimations on the approximation order of the FNN simultaneously, and then derive the essential approximation order of the network when the upper and lower bounds become identical. We take multivariate function approximation theory as the general tool, and the second order modulus of smoothness as the measure of approximation order in particular. Thus, for these FNNs, both the essential approximation order and the relationship between the approximation speed and the number of hidden units are clarified.

The paper is organized as follows.
Some notations are introduced in Section 2 below. Section 3 states the main results and their significance. In Section 4, the proofs of the main results are given by using techniques of approximation theory.
2 Some Notations
Let $\mathbb{N}$ be the set of positive integers and $\mathbb{R}$ the set of real numbers. Assume $\mathbb{N}_0=\mathbb{N}\cup\{0\}$, $\mathbf{0}=(0,0,\ldots,0)$, and $1_i=(0,\ldots,0,1,0,\ldots,0)\in\mathbb{N}_0^d$ (with the 1 in the $i$-th position). Let $|r|=\sum_{i=1}^d|r_i|$ for $r=(r_1,r_2,\ldots,r_d)\in\mathbb{N}_0^d$, $\|t\|=\bigl(\sum_{i=1}^d t_i^2\bigr)^{1/2}$ for $t=(t_1,t_2,\ldots,t_d)\in\mathbb{R}^d$, and $rt=\sum_{i=1}^d r_i t_i$.

Let $L_{2\pi}^p$, $1\le p<\infty$, be the Banach space consisting of all $p$-th power Lebesgue integrable functions on $\mathbb{R}^d$ that are $2\pi$-periodic in each variable. We identify $L_{2\pi}^\infty=C_{2\pi}$ with the space of continuous functions on $\mathbb{R}^d$ that are $2\pi$-periodic in each variable. For $f\in L_{2\pi}^p$, $1\le p\le\infty$, its norm is defined by
$$\|f\|_p=\begin{cases}\left(\int_{[-\pi,\pi]^d}|f(u)|^p\,du\right)^{1/p}, & 1\le p<\infty,\\[4pt] \max_{x\in[-\pi,\pi]^d}|f(x)|, & p=\infty,\end{cases}$$
and the $r$-th symmetric difference of $f$ along direction $h$ is given by
$$\Delta_h^{(r)}f(x)=\sum_{i=0}^r(-1)^i\binom{r}{i}f\Bigl(x+\Bigl(\frac r2-i\Bigr)h\Bigr).$$
In terms of $\Delta_h^{(r)}f(\cdot)$, the $r$-th modulus of smoothness of $f$ is then defined by
$$\omega_r(f,t)_p=\sup_{0<\|h\|\le t}\bigl\|\Delta_h^{(r)}f(\cdot)\bigr\|_p. \tag{2}$$
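As a concrete illustration of definition (2) (this sketch is ours, not part of the paper), the second order modulus $\omega_2(f,t)_\infty$ of a univariate $2\pi$-periodic function can be estimated numerically by discretizing the supremum over $0<|h|\le t$; the function and parameter names below are our own.

```python
import numpy as np

def omega2_sup(f, t, n_x=400, n_h=100):
    """Estimate omega_2(f, t)_infty = sup_{0 < |h| <= t} max_x
    |f(x + h) - 2 f(x) + f(x - h)| for a 2*pi-periodic f (d = 1 case)."""
    x = np.linspace(-np.pi, np.pi, n_x)
    best = 0.0
    for h in np.linspace(t / n_h, t, n_h):
        best = max(best, np.max(np.abs(f(x + h) - 2.0 * f(x) + f(x - h))))
    return best

# For f = cos, the second symmetric difference is -4 sin^2(h/2) cos(x),
# so omega_2(cos, t)_infty = 4 sin^2(t/2); the estimate should match.
est = omega2_sup(np.cos, 0.5)
exact = 4.0 * np.sin(0.25) ** 2
```

The estimate agrees with the closed form up to grid resolution and inherits the monotonicity property (1) listed below.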
A function $f$ is said to belong to the Lipschitz class of order up to $r$, $r\in\mathbb{N}$, denoted $f\in\mathrm{Lip}(\alpha)_r$, if its $r$-th modulus of smoothness satisfies $\omega_r(f,t)_p=O(t^\alpha)$ for some $\alpha\in(0,r]$.

The modulus of smoothness defined by (2) can be regarded as a direct multivariate extension of the univariate modulus of smoothness, and it has similar properties. For example:
(1) $\omega_r(f,t)_p$ monotonically decreases to zero as $t$ goes to zero;
(2) $\omega_r(f,t+s)_p\le\omega_r(f,t)_p+\omega_r(f,s)_p$ for any $t,s\in(0,\infty)$;
(3) $\omega_r(f,t)_p\le 2^{r-s}\,\omega_s(f,t)_p$ for any $s\le r$;
(4) $\omega_r(f,\lambda t)_p\le(\lambda+1)^r\,\omega_r(f,t)_p$ for $\lambda>0$.

We denote by $f*g(x)=\frac{1}{(2\pi)^d}\int_{[-\pi,\pi]^d}f(t)g(x-t)\,dt$ the convolution of $f$ and $g$, and by $f^{\wedge}(r)=\langle f,e^{-irt}\rangle$ the Fourier transform of $f$, where $\langle f,g\rangle=\frac{1}{(2\pi)^d}\int_{[-\pi,\pi]^d}f(t)g(t)\,dt$ is the inner product of $f$ and $g$. Furthermore, we introduce the $r$-th K-functional of a function $f\in L_{2\pi}^p$, $1\le p\le\infty$, for $\delta>0$ and $r\in\mathbb{N}$:
$$K_r(f,\delta^r)_p=\inf_{D^\beta g\in L_{2\pi}^p}\left\{\|f-g\|_p+\delta^r\sup_{|\beta|=r}\|D^\beta g\|_p\right\}, \tag{3}$$
where $\beta=(\beta_1,\beta_2,\ldots,\beta_d)\in\mathbb{N}_0^d$, $|\beta|=\beta_1+\beta_2+\cdots+\beta_d$, and $D^\beta=\frac{\partial^{|\beta|}}{\partial x_1^{\beta_1}\cdots\partial x_d^{\beta_d}}$
is the differential operator. The theory of the K-functional was first introduced by Peetre [14] and further developed by Johnen and Scherer [11] and Ditzian and Totik [8]. It is usually used to measure the distance from a normed space to a dense subspace. It was shown in [11] that the K-functional given by (3) is equivalent to the modulus of smoothness defined in (2), i.e., there are constants $C_1$ and $C_2$ such that
$$C_1\,\omega_r(f,\delta)_p\le K_r(f,\delta^r)_p\le C_2\,\omega_r(f,\delta)_p. \tag{4}$$

In the following, we recall some notations used in [15]. For $\lambda\in\mathbb{N}$, $p=(p_1,p_2,\ldots,p_d)\in\mathbb{N}_0^d$, $q=(q_1,q_2,\ldots,q_d)\in\mathbb{N}_0^d$, and each $f_i\in L_{2\pi}^p$, $1\le p\le\infty$, we write
$$B_\lambda=\left(\frac{2}{\lambda+2}\right)^d,\qquad b_{\lambda,r}=\prod_{i=1}^d\sin\frac{(r_i+1)\pi}{\lambda+2},$$
and
$$\alpha_{\lambda,p,q}[f_i]=2B_\lambda b_{\lambda,p}b_{\lambda,q}\langle f_i,\cos(p-q)t\rangle,\qquad
\beta_{\lambda,p,q}[f_i]=2B_\lambda b_{\lambda,p}b_{\lambda,q}\langle f_i,\sin(p-q)t\rangle,\qquad
\theta_\lambda[f_i]=\langle f_i,1\rangle.$$
With these notations, Suzuki [15] constructed the following three-layer networks with $(2\lambda+1)^d-1$ trigonometric hidden layer units:
$$TN_\lambda[f_i](x)=\theta_\lambda[f_i]+\sum_{\substack{p\ne q\\ 0\le p_u,q_v\le\lambda}}\bigl\{\alpha_{\lambda,p,q}[f_i]\cos(p-q)x+\beta_{\lambda,p,q}[f_i]\sin(p-q)x\bigr\},$$
where the summation is over combinations of $p=(p_u)_{u=1}^d$ and $q=(q_v)_{v=1}^d\in\mathbb{N}_0^d$ such that $p\ne q$ and $0\le p_u,q_v\le\lambda$; that is, if the pair $(p,q)$ is included, the pair $(q,p)$ is not.
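As an illustrative sketch (ours, not part of the paper), the $d=1$ case of $TN_\lambda$ can be realized through the Fejér-Korovkin kernel convolution $TN_\lambda[f]=K_\lambda*f$ that the paper uses in its proofs; here quadrature replaces the exact inner products, and all function names are our own.

```python
import numpy as np

def fk_kernel(lam, t):
    """1-D Fejer-Korovkin kernel K_lam(t) = B_lam |sum_r b_r e^{i r t}|^2
    with b_r = sin((r+1) pi / (lam+2)) and B_lam = 2 / (lam+2)."""
    r = np.arange(lam + 1)
    b = np.sin((r + 1) * np.pi / (lam + 2))
    s = np.exp(1j * np.outer(t, r)) @ b
    return (2.0 / (lam + 2)) * np.abs(s) ** 2

def tn_lambda(f, lam, x, n_quad=2048):
    """TN_lam[f](x) = (1/2pi) int K_lam(t) f(x - t) dt, computed by a
    Riemann sum on an equispaced periodic grid (exact for trig polys)."""
    t = -np.pi + 2.0 * np.pi * np.arange(n_quad) / n_quad
    k = fk_kernel(lam, t)
    dt = 2.0 * np.pi / n_quad
    return np.array([np.sum(k * f(xi - t)) * dt / (2.0 * np.pi)
                     for xi in np.atleast_1d(x)])

f = lambda x: np.abs(np.sin(x))        # 2*pi-periodic Lipschitz test function
x = np.linspace(-np.pi, np.pi, 101)
err = {lam: np.max(np.abs(tn_lambda(f, lam, x) - f(x))) for lam in (8, 32)}
```

The uniform error shrinks as $\lambda$ grows, in accordance with the upper bound estimations discussed in the next section.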
3 The Main Results
In [15], Suzuki obtained an upper bound estimation on the approximation error in terms of the first order modulus of smoothness, namely
$$\|f_i-TN_\lambda[f_i]\|_p\le\Bigl(1+\frac{\pi^2}{2}\sqrt d\Bigr)\,\omega_1\Bigl(f_i,\frac{1}{\lambda+2}\Bigr)_p,\qquad f_i\in L_{2\pi}^p,\ 1\le p\le\infty. \tag{5}$$
This result is important since it offers not only an upper bound estimation on the approximation order of the constructed networks $TN_\lambda[f_i]$, but also bridges the approximation speed of the networks and the number of hidden units $(2\lambda+1)^d-1$. The estimation cannot, however, completely characterize the capability of the constructed neural networks in general, because such an upper bound estimation might be too loose to reflect their inherent approximation capability. Furthermore, such an upper bound estimation cannot answer whether or not the first order modulus of smoothness estimated in (5) yields the essential approximation order of the networks $TN_\lambda[f_i]$.

In this paper, we first use the second order modulus of smoothness of a function as a metric, and prove that this modulus can be used not only to estimate the upper bound of the approximation error, but also to characterize the lower bound of the approximation speed. In fact, we will prove the following results.

Theorem 3.1. For $f_i\in L_{2\pi}^p$, $1\le p\le\infty$, we have
$$\|f_i-TN_\lambda[f_i]\|_p\le\frac12\Bigl(1+\pi^2\sqrt{\frac d2}\Bigr)^2\omega_2\Bigl(f_i,\frac{1}{\lambda+2}\Bigr)_p. \tag{6}$$
The above theorem offers an upper bound estimation on the approximation order of $TN_\lambda[f_i]$, which describes clearly how the approximation speed of the network is affected by the number of hidden units $m$, or equivalently, by the parameter $\lambda$ (for $m=(2\lambda+1)^d-1$ is positively proportional to $\lambda$). In particular, it shows that the approximation speed of the networks is upper controlled by the second order modulus of smoothness and the number of hidden units used. Whenever $f_i\in L_{2\pi}^p$, it further shows that $\|TN_\lambda[f_i]-f_i\|_p\to0$ as $\lambda$ (equivalently, $m=(2\lambda+1)^d-1$) goes to infinity. That is, any continuous or integrable function with period $2\pi$ can be approximated arbitrarily well by $TN_\lambda[f_i]$.

Now we give a lower bound estimation on the approximation speed of the networks.

Theorem 3.2. If $f_i\in L_{2\pi}^p$, $1\le p\le\infty$, then
$$\omega_2\Bigl(f_i,\frac{1}{\lambda+2}\Bigr)_p\le M\lambda^{-2}\sum_{k=1}^{\lambda}k\,\|TN_k[f_i]-f_i\|_p, \tag{7}$$
here and in the following $M$ is a positive constant independent of $f$ and $\lambda$.

Theorem 3.2 provides a lower bound estimation of the approximation accuracy of $TN_\lambda[f_i]$, which shows that the weighted arithmetic average $\frac{1}{\lambda^2}\sum_{k=1}^{\lambda}k\,\|TN_k[f_i]-f_i\|_p$ is lower controlled by the second order modulus of smoothness of $f_i$. It is noted that the following identity always holds:
$$\lim_{\lambda\to\infty}\|TN_\lambda[f_i]-f_i\|_p=\lim_{\lambda\to\infty}\frac{1}{\lambda^2}\sum_{k=1}^{\lambda}k\,\|TN_k[f_i]-f_i\|_p.$$
So, whenever $\lambda$ (or, equivalently, the number of hidden units $m=(2\lambda+1)^d-1$) is sufficiently large, $\|TN_\lambda[f_i]-f_i\|_p$ and $\frac{1}{\lambda^2}\sum_{k=1}^{\lambda}k\,\|TN_k[f_i]-f_i\|_p$ can be viewed approximately as the same. In that case, statements (6) and (7) imply
$$C_2\,\omega_2\Bigl(f_i,\frac{1}{\lambda+2}\Bigr)_p\le\|TN_\lambda[f_i]-f_i\|_p\le C_1\,\omega_2\Bigl(f_i,\frac{1}{\lambda+2}\Bigr)_p,$$
where the positive constants $C_1$ and $C_2$ are independent of $f_i$ and $\lambda$.

The following theorem reflects the relation between the approximation speed of the networks and the constructive properties of the approximated function.

Theorem 3.3. Suppose $f_i\in L_{2\pi}^p$, $1\le p\le\infty$. Then $\|TN_\lambda[f_i]-f_i\|_p=O(\lambda^{-\alpha})$, $0<\alpha<2$, if and only if $f_i\in\mathrm{Lip}(\alpha)_2$.

This theorem shows in particular that when the functions to be approximated are Lipschitzian with order up to 2 (i.e., $f_i\in\mathrm{Lip}(\alpha)_2$, $0<\alpha<2$), the approximation speed of $TN_\lambda[f_i]$ is determined by the second order modulus of smoothness of the function $f_i$. So, for the given activation function and number of hidden units of the networks $TN_\lambda[f_i]$, the smoother the approximated function is, the faster the networks approximate. Conversely,
in order to obtain a faster approximation speed when the networks approximate a function, the approximated function must possess better smoothness. This reveals that the smoothness of the approximated function essentially determines the approximation speed of the networks.
4 The Proof of Main Results
Now we are in a position to prove the main results stated in Section 3. We first prove Theorem 3.1.

Let $r=(r_1,r_2,\ldots,r_d)\in\mathbb{N}_0^d$, $\lambda\in\mathbb{N}$. The $d$-dimensional Fejér-Korovkin kernel $K_\lambda$ is defined by
$$K_\lambda(t)=B_\lambda\Bigl|\sum_{0\le r_i\le\lambda}b_{\lambda,r}e^{irt}\Bigr|^2,\quad\text{where}\quad b_{\lambda,r}=\prod_{i=1}^d\sin\frac{(r_i+1)\pi}{\lambda+2},\qquad B_\lambda=\Bigl(\sum_{0\le r_i\le\lambda}(b_{\lambda,r})^2\Bigr)^{-1}.$$
Then (see [15])
$$B_\lambda=\left(\sum_{0\le r_i\le\lambda}\Bigl(\prod_{i=1}^d\sin\frac{(r_i+1)\pi}{\lambda+2}\Bigr)^2\right)^{-1}=\left(\frac{2}{\lambda+2}\right)^d$$
and
$$K_\lambda(t)=B_\lambda\Bigl|\sum_{0\le r_i\le\lambda}b_{\lambda,r}e^{irt}\Bigr|^2=1+2B_\lambda\sum_{\substack{0\le p_u,q_v\le\lambda\\ p\ne q\in\mathbb{N}_0^d}}b_{\lambda,p}b_{\lambda,q}\cos(p-q)t.$$
Recalling the convolution $K_\lambda*f_i(x)=\frac{1}{(2\pi)^d}\int_{[-\pi,\pi]^d}K_\lambda(t)f_i(x-t)\,dt$ gives (see [15])
$$TN_\lambda[F]=(TN_\lambda[f_1],TN_\lambda[f_2],\ldots,TN_\lambda[f_d])=(K_\lambda*f_1,K_\lambda*f_2,\ldots,K_\lambda*f_d).$$
From the expression of $K_\lambda(t)$ and the fact that $K_\lambda(t)$ is an even function of $t$, it follows that
$$K_\lambda*f_i(x)=\frac{1}{(2\pi)^d}\int_{[-\pi,\pi]^d}\frac12\bigl(f_i(x+t)+f_i(x-t)\bigr)K_\lambda(t)\,dt.$$
On the other hand, by calculation we obtain (see also [15])
$$K_\lambda^{\wedge}(0)=1,\qquad K_\lambda^{\wedge}(r)=B_\lambda\sum_{\substack{0\le p_u,q_v\le\lambda\\ p-q=r,\ p,q\in\mathbb{N}_0^d}}b_{\lambda,p}b_{\lambda,q},$$
and $K_\lambda^{\wedge}(1_i)=\cos\frac{\pi}{\lambda+2}$. Therefore, by the properties of the modulus of smoothness, we find
$$\begin{aligned}
\|TN_\lambda[f_i]-f_i\|_p&=\left\|\frac{1}{(2\pi)^d}\int_{[-\pi,\pi]^d}\frac12\bigl(f_i(\cdot+t)+f_i(\cdot-t)-2f_i(\cdot)\bigr)K_\lambda(t)\,dt\right\|_p\\
&\le\frac{1}{(2\pi)^d}\int_{[-\pi,\pi]^d}\frac12 K_\lambda(t)\,\omega_2(f_i,\|t\|)_p\,dt
\le\frac12\,\omega_2(f_i,\delta)_p\,\frac{1}{(2\pi)^d}\int_{[-\pi,\pi]^d}K_\lambda(t)\bigl(1+\delta^{-1}\|t\|\bigr)^2dt\\
&\le\frac12\,\omega_2(f_i,\delta)_p\left(1+\frac2\delta\Bigl(\frac{1}{(2\pi)^d}\sum_{k=1}^d\int_{[-\pi,\pi]^d}t_k^2K_\lambda(t)\,dt\Bigr)^{1/2}+\frac{1}{\delta^2}\,\frac{1}{(2\pi)^d}\sum_{k=1}^d\int_{[-\pi,\pi]^d}t_k^2K_\lambda(t)\,dt\right).
\end{aligned}$$
Since $|t|\le\pi\bigl|\sin\frac t2\bigr|$ for $-\pi\le t\le\pi$, and hence $t^2\le\pi^2\sin^2\frac t2$ for $-\pi\le t\le\pi$, we see that
$$\frac{1}{(2\pi)^d}\int_{[-\pi,\pi]^d}t_k^2K_\lambda(t)\,dt\le\pi^2\,\frac{1}{(2\pi)^d}\int_{[-\pi,\pi]^d}\sin^2\frac{t_k}{2}\,K_\lambda(t)\,dt\le\pi^2\Bigl(1-\cos\frac{\pi}{\lambda+2}\Bigr).$$
Finally, taking $\delta=\frac{1}{\lambda+2}$ and recalling that $1-\cos\frac{\pi}{\lambda+2}\le\frac12\bigl(\frac{\pi}{\lambda+2}\bigr)^2$, it is easy to find
$$\|TN_\lambda[f_i]-f_i\|_p\le\frac12\Bigl(1+\pi^2\sqrt{\frac d2}\Bigr)^2\omega_2\Bigl(f_i,\frac{1}{\lambda+2}\Bigr)_p.$$
This completes the proof of Theorem 3.1.

The proof of Theorem 3.2 is similar to that in [2]; we omit the details.

Now we prove Theorem 3.3. If $f_i\in\mathrm{Lip}(\alpha)_2$, then from Theorem 3.1 we obtain $\|TN_\lambda[f_i]-f_i\|_p\le M\bigl(\frac1\lambda\bigr)^\alpha=O(\lambda^{-\alpha})$. Conversely, whenever this estimation holds, we obtain from Theorem 3.2
$$\omega_2\Bigl(f_i,\frac{1}{\lambda+2}\Bigr)_p\le M\lambda^{-2}\sum_{k=1}^{\lambda}k^{1-\alpha}\le M(\lambda+2)^{-\alpha}.$$
Since for any $t\in(0,1)$ there is always an integer $\lambda$ such that $\frac{1}{\lambda+2}\le t\le\frac{2}{\lambda+2}$, the above estimation implies $f_i\in\mathrm{Lip}(\alpha)_2$. This shows that the necessary and sufficient condition for $\|TN_\lambda[f_i]-f_i\|_p=O(\lambda^{-\alpha})$ is $f_i\in\mathrm{Lip}(\alpha)_2$, which completes the proof of Theorem 3.3.
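The kernel identities used in the proofs, $K_\lambda^{\wedge}(0)=1$, $K_\lambda^{\wedge}(1_i)=\cos\frac{\pi}{\lambda+2}$, and $K_\lambda\ge0$, can be checked numerically in the 1-D case; the following sanity check is our own addition, not part of the paper.

```python
import numpy as np

lam, n = 16, 4096
t = 2.0 * np.pi * np.arange(n) / n            # equispaced periodic grid
r = np.arange(lam + 1)
b = np.sin((r + 1) * np.pi / (lam + 2))
# 1-D Fejer-Korovkin kernel B_lam |sum_r b_r e^{i r t}|^2
K = (2.0 / (lam + 2)) * np.abs(np.exp(1j * np.outer(t, r)) @ b) ** 2

# Discrete means over the grid give exact Fourier coefficients here,
# since K is a trigonometric polynomial of degree lam << n.
K0 = np.mean(K)                                # should equal K^(0) = 1
K1 = np.mean(K * np.exp(-1j * t)).real         # should equal cos(pi/(lam+2))
```

Both coefficients and the nonnegativity of the kernel come out exactly as the proof requires.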
Acknowledgement. This work was supported by the Natural Science Foundation of China (60473034) and the China Postdoctoral Science Foundation (2004035225).
References
1. Barron, A.R.: Universal Approximation Bounds for Superpositions of a Sigmoidal Function. IEEE Trans. Inform. Theory 39 (1993) 930-945
2. Cao, F.L., Xiong, J.Y.: Steckin-Marchaud-Type Inequality in Connection with Lp Approximation for Multivariate Bernstein-Durrmeyer Operators. Chinese Contemporary Mathematics 22(2) (2001) 137-142
3. Cao, F.L., Li, Y.M., Xu, Z.B.: Pointwise Approximation for Neural Networks. In: Wang, J., Liao, X.F., Zhang, Y. (eds.): Advances in Neural Networks - ISNN 2005. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag, Berlin Heidelberg New York (2005) 39-44
4. Chen, T.P., Chen, H.: Universal Approximation to Nonlinear Operators by Neural Networks with Arbitrary Activation Functions and Its Application to Dynamical Systems. IEEE Trans. Neural Networks 6 (1995) 911-917
5. Chen, X.H., White, H.: Improved Rates and Asymptotic Normality for Nonparametric Neural Network Estimators. IEEE Trans. Inform. Theory 45 (1999) 682-691
6. Chui, C.K., Li, X.: Approximation by Ridge Functions and Neural Networks with One Hidden Layer. J. Approx. Theory 70 (1992) 131-141
7. Cybenko, G.: Approximation by Superpositions of a Sigmoidal Function. Math. of Control, Signals, and Systems 2 (1989) 303-314
8. Ditzian, Z., Totik, V.: Moduli of Smoothness. Springer-Verlag, Berlin Heidelberg New York (1987)
9. Funahashi, K.I.: On the Approximate Realization of Continuous Mappings by Neural Networks. Neural Networks 2 (1989) 183-192
10. Hornik, K., Stinchcombe, M., White, H.: Multilayer Feedforward Networks are Universal Approximators. Neural Networks 2 (1989) 359-366
11. Johnen, H., Scherer, K.: On the Equivalence of the K-Functional and the Moduli of Continuity and Some Applications. In: Schempp, W., Zeller, K. (eds.): Constructive Theory of Functions of Several Variables. Lecture Notes in Mathematics, Vol. 571. Springer-Verlag, Berlin Heidelberg New York (1977) 119-140
12. Kůrková, V., Kainen, P.C., Kreinovich, V.: Estimates of the Number of Hidden Units and Variation with Respect to Half-Spaces. Neural Networks 10 (1997) 1068-1078
13. Maiorov, V., Meir, R.S.: Approximation Bounds for Smooth Functions in C(R^d) by Neural and Mixture Networks. IEEE Trans. Neural Networks 9 (1998) 969-978
14. Peetre, J.: On the Connection Between the Theory of Interpolation Spaces and Approximation Theory. In: Alexits, G., Stechkin, S.B. (eds.): Proc. Conf. Constructive Theory of Functions. Budapest (1969) 351-363
15. Suzuki, S.: Constructive Function Approximation by Three-Layer Artificial Neural Networks. Neural Networks 11 (1998) 1049-1058
16. Xu, Z.B., Cao, F.L.: Simultaneous Lp-Approximation Order for Neural Networks. Neural Networks 18 (2005) 914-923
17. Ito, Y.: Approximation of Functions on a Compact Set by Finite Sums of a Sigmoid Function without Scaling. Neural Networks 4 (1991) 817-826
Wavelets Based Neural Network for Function Approximation

Yong Fang¹ and Tommy W.S. Chow²

¹ School of Communication and Information Engineering, Shanghai University, Shanghai, 200072, China
[email protected]
² Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China
[email protected]
Abstract. In this paper, a new type of wavelet neural network (WNN) is proposed to enhance the function approximation capability. In the proposed WNN, the nonlinear activation function is a linear combination of wavelets that can be updated during the network's training process. As a result, the approximation error is significantly decreased. The BP algorithm and the QR-decomposition-based training method for the proposed WNN are derived. The obtained results indicate that this new type of WNN exhibits excellent learning ability compared to conventional ones.
1 Introduction

The approximation of a general continuous function by a neural network (NN) has been widely studied because of its outstanding capability of fitting nonlinear models to input/output data. A three-layer NN is usually represented by finite sums of the form
$$g(x)=\sum_{i=1}^N w_i\,\sigma(a_i^T x+b_i), \tag{1}$$
where $w_i,b_i\in\mathbb{R}$, $a_i\in\mathbb{R}^n$, $\sigma(\cdot)$ is a given function from $\mathbb{R}$ to $\mathbb{R}$, and $x\in\mathbb{R}^n$ is the input vector. It has been proved that the output $g(x)$ is dense in the space of continuous functions defined on $[0,1]^n$ if $\sigma(\cdot)$ is a continuous, discriminatory function. Generally, $\sigma(\cdot)$ is adopted as a sigmoid function, which is discriminatory. Because wavelet decomposition has emerged as a new powerful tool for representing nonlinearity, a class of networks combining wavelets and neural networks has recently been investigated [1-10]. It has been shown that this class of wavelet networks can provide better function approximation ability than ordinary basis function networks. It is noticed that the development of the above WNNs is theoretically based upon the wavelet frame theory given by Daubechies [4] in the one-dimensional (1-D) case and generalized by Kugarajah and Zhang [7] to the multi-dimensional (M-D) case.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 80-85, 2006. © Springer-Verlag Berlin Heidelberg 2006
A rather complex network architecture inevitably results when the construction of an MWNN follows the wavelet frame theory alone. For practical applications, a relatively large network size is often required. In this paper, we study the possibility of using a 1-D wavelet function for the development of an MWNN. Based upon the theories of approximation of nonlinear functions using neural networks [8], our WNN theorems are derived from the result that the sigmoid function can be replaced by any continuous or discontinuous function satisfying certain conditions [8] (the wavelet function satisfies these conditions too). Following the development of this new MWNN, we use the discrete wavelet transform to further develop another new type of MWNN, called MWNN-DWT. The activation function of the MWNN-DWT is a linear combination of wavelet bases rather than the sigmoid function or the wavelet function.
2 Wavelets and Wavelet Neural Networks (WNN)

Wavelets are functions whose dilations and translations form a frame of $L^2(\mathbb{R})$. That is, for some $a>0$, $b>0$, the family
$$\psi_{lk}(x)=a^{-l/2}\psi(a^{-l}x-bk),\qquad l,k\in\mathbb{Z}, \tag{2}$$
satisfies the frame property [4]. Sufficient conditions for wavelet frames were given in [4]. For instance, the "Morlet" wavelet with $a=2$, $b=1$ can build a frame of $L^2(\mathbb{R})$. Hence, the collection of all linear combinations $g(x)=\sum_{(l,k)\in I}w_{lk}\psi_{lk}(x)$ of elements of the frame (where $I\subset\mathbb{Z}^2$) is dense in $L^2(\mathbb{R})$. This implies that for any $f\in L^2(\mathbb{R})$ and $\varepsilon>0$, there exist a positive integer $N$ and constants $w_i$, $a_i$, $b_i$ such that
$$\Bigl\|f(x)-\sum_{i=1}^N w_i\,\psi(a_i x+b_i)\Bigr\|<\varepsilon. \tag{3}$$
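As a hedged illustration of how finite spans of the family (2) are used (this sketch and all names in it are ours, not from the paper), the following code builds a finite dictionary of dilated and translated atoms from a real Morlet-type wavelet and recovers a frame combination by least squares.

```python
import numpy as np

def morlet(x):
    """A real Morlet-type wavelet (our choice of mother wavelet)."""
    return np.cos(5.0 * x) * np.exp(-0.5 * x ** 2)

def frame_matrix(x, levels, shifts, a=2.0, b=1.0):
    """Columns psi_{lk}(x) = a^{-l/2} psi(a^{-l} x - b k), as in (2)."""
    cols = [a ** (-l / 2.0) * morlet(a ** (-l) * x - b * k)
            for l in levels for k in shifts]
    return np.stack(cols, axis=1)

x = np.linspace(-4.0, 4.0, 400)
Psi = frame_matrix(x, levels=range(-1, 3), shifts=range(-8, 9))

# A function that *is* a finite combination of frame elements is
# recovered essentially exactly by least squares over the dictionary,
# illustrating the richness of the linear spans behind (3).
target = 0.8 * Psi[:, 5] - 0.4 * Psi[:, 30]
w, *_ = np.linalg.lstsq(Psi, target, rcond=None)
resid = np.max(np.abs(Psi @ w - target))
```

Fitting a generic $f\in L^2(\mathbb{R})$ works the same way, with the residual decreasing as the index set grows, in the spirit of (3).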
Pati and Krishnaprasad [3] connected wavelets with NNs by applying Daubechies' results [4]. They proposed the network form
$$g(x)=\sum_{(l,k)\in I}w_{lk}\,\psi_{lk}(x), \tag{4}$$
where the index set $I$ contains the integer translations and integer dilations, determined by using the time-frequency localization properties under a given accuracy [3-4], [6]. Kugarajah and Zhang [7] first built M-D wavelet frames from a single mother wavelet in the form
$$\psi_{l,k}(x)=a^{-nl/2}\psi(a^{-l}x-bk),\qquad l\in\mathbb{Z},\ k\in\mathbb{Z}^n, \tag{5}$$
where $\psi(x)\in L^2(\mathbb{R}^n)$, $x\in\mathbb{R}^n$, $a,b\in\mathbb{R}$ and $a>1$. They also gave sufficient conditions for wavelet frames by generalizing Daubechies' theorem [4]. Here the dilation index $l$ is a scalar and the scalar dilation parameter $a^l$ is shared by all dimensions of a wavelet. They proposed a methodology to construct an M-D wavelet function leading to frames [7], but not all 1-D wavelets can be extended to an M-D
wavelet. The specific conditions can be found in [7]. Zhang and Benveniste [1], and Kugarajah and Zhang [7], respectively, gave M-D wavelet frames in the following multi-scaling forms:
$$\psi_{j,k}(x)=(\det D_j)^{1/2}\psi(D_j x-bk),\qquad j,k\in\mathbb{Z}^n, \tag{6}$$
$$\psi_{j,k}(x)=(\det D_j)^{1/2}\psi(D_j x-Tk),\qquad j,k\in\mathbb{Z}^n, \tag{7}$$
where $D_j=\mathrm{diag}(a^{j_1},\ldots,a^{j_n})$, $j=(j_1,\ldots,j_n)^T\in\mathbb{Z}^n$, $T=\mathrm{diag}(b_1,\ldots,b_n)$, $a>1$, $b_i>0$, $i=1,\ldots,n$, and $b>0$. They proved that if a 1-D wavelet function $\psi(x)$ can constitute frames, the tensor product of the 1-D wavelet function can also constitute frames. Zhang and Benveniste [1] presented a network structure of the form
$$g(x)=\sum_{i=1}^N w_i\,\psi[D_i R_i(x-t_i)]+\bar g, \tag{8}$$
where the $R_i$ are rotation matrices, and $\bar g$ is introduced to deal with nonzero-mean functions on finite domains.
3 MWNN Based on Discrete Wavelet Transform (MWNN-DWT)

In the MWNN described in (12), the activation function is a 1-D mother wavelet function. As it is well known that the dilations and translations of a wavelet can provide a good representation of nonlinearity [4], [6], 1-D dilations and translations of the wavelet in the MWNN are then introduced to enhance the approximation capability of the network. In this section we derive a new type of MWNN, called MWNN-DWT,

Fig. 1. The architecture of the MWNN
which is based upon the discrete wavelet transform. The output $g(x)$ of the MWNN-DWT depicted in Fig. 1 is represented by
$$g(x)=\sum_{i=1}^N w_i\,\hat\psi(a_i^T x+b_i)+\bar g, \tag{9}$$
where $\hat\psi(x)=\sum_{(l,k)\in I}w_{lk}\,\psi(a^l x-kb)$, $\{a^{l/2}\psi(a^l x-kb):l,k\in\mathbb{Z}\}$ is a frame for $L^2(\mathbb{R})$, and $I$ is the index set of pairs $(l,k)$ of integer translations and integer dilations. The parameter $\bar g$ is introduced so that the approximation of functions with nonzero average is possible [1]. It is clear that the activation function in the MWNN-DWT (9) is expressed as a linear combination of the dilated and translated wavelets. Through adaptively adjusting the 1-D wavelet frame coefficients, the MWNN-DWT is capable of providing an excellent approximation capability. Equation (9) can be rewritten in the following two equivalent forms:
$$g(x)=\sum_{i=1}^N w_i\sum_{(l,k)\in I}w_{lk}\,\psi[a^l(a_i^T x+b_i)-kb]+\bar g, \tag{10}$$
$$g(x)=\sum_{(l,k)\in I}w_{lk}\sum_{i=1}^N w_i\,\psi[a^l(a_i^T x+b_i)-kb]+\bar g. \tag{11}$$
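A minimal sketch of the forward pass (9)-(10) follows; this is our own code, with a Mexican-hat wavelet standing in for $\psi$ and a hypothetical sparse index set $I$, so all names and values are illustrative assumptions.

```python
import numpy as np

def mexican_hat(z):
    """Stand-in mother wavelet psi (our choice)."""
    return (1.0 - z ** 2) * np.exp(-0.5 * z ** 2)

def psi_hat(z, wlk, a=2.0, b=1.0):
    """Adaptive activation psi_hat(z) = sum_{(l,k) in I} w_lk psi(a^l z - k b)."""
    out = np.zeros_like(z, dtype=float)
    for (l, k), w in wlk.items():
        out += w * mexican_hat(a ** l * z - k * b)
    return out

def mwnn_dwt(x, wi, ai, bi, wlk, g_bar):
    """Network output (9): g(x) = sum_i w_i psi_hat(a_i^T x + b_i) + g_bar."""
    z = x @ ai.T + bi                  # shape (P, N): hidden pre-activations
    return psi_hat(z, wlk) @ wi + g_bar

rng = np.random.default_rng(0)
N, n = 5, 2
wi = rng.uniform(-0.5, 0.5, N)
ai = rng.uniform(-0.5, 0.5, (N, n))
bi = rng.uniform(-0.5, 0.5, N)
wlk = {(0, 0): 1.0, (1, -1): 0.2, (-1, 2): -0.3}   # hypothetical index set I
x = rng.uniform(-1.0, 1.0, (20, n))
y = mwnn_dwt(x, wi, ai, bi, wlk, g_bar=0.1)
```

With $w_{00}=1$ and all other $w_{lk}=0$, $\hat\psi$ reduces to $\psi$ and the network reduces to a plain WNN, which is exactly the first stage of the training scheme described in the text.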
To demonstrate the approximation capability of the MWNN-DWT, it is applied to approximate the function $f(x)=0.5e^{-x}\sin(6x)$ over the domain $[-1,1]$. The parameter $\bar g$ was initialized by the mean of the function observations, and the other parameters were simply randomized between $-0.5$ and $0.5$. It should also be noted that a rather special procedure for initializing the parameters is used in [1], while the parameters of the proposed MWNN-DWT are simply randomized in a general fashion. In order to compare the approximation capability of the networks described in (8) and (11), the initial parameters of these networks are all similarly randomized. Note that no rotation parameter is required for the wavelet network (8) in the 1-D case [1]. The training set consists of 100 points uniformly sampled from $f(x)$. The function approximation performance of an MWNN-DWT in (11) with 5 neurons and 9 wavelet coefficients is compared to a WNN in (8) with 8 wavelons. In both networks, 25 parameters are used and trained by the standard BP algorithm. Fig. 2 shows the total squared errors over 2000 iterations, where the solid line shows the error of the MWNN-DWT and the dashed line represents that of the WNN. Clearly, the proposed MWNN-DWT provides much better results than the WNN.

It is noted that the parameters of the MWNN-DWT consist of two parts: the weights of the network and the coefficients of the wavelets. In order to determine the optimal index set $I$ and to speed up the convergence of the MWNN-DWT, we divide the training process into two stages. First, let $w_{00}=1$ and $w_{lk}=0$ for $(l,k)\ne(0,0)$, and apply the BP algorithm for the network training. In this case $\hat\psi(x)=\psi(x)$, so the network is the same as the MWNN but fewer parameters are required. In this training stage, the activation function is fixed and the standard BP algorithm is used. After a number of iterations, when the network converges to a specified level of error, all the parameters are fixed.
Therefore, (11) can be expressed as
$$g(x)=\sum_{(l,k)\in I}w_{lk}\,G_{lk}(x)+\bar g, \tag{12}$$
where $G_{lk}(x)=\sum_{i=1}^N w_i\,\psi[a^l(a_i^T x+b_i)-kb]$. The second training stage is to determine the coefficients of the wavelets, $w_{lk}$. As the $w_{lk}$ appear linearly in the output equation (12) of the network, most optimization techniques can be used to minimize the measure function once the index set is given. In this paper, QR decomposition is used to adjust the activation function according to the form in (16). This process is equivalent to solving a least squares minimization problem for a given training set $T_P=\{(x_i,f(x_i))\}_{i=1}^P$. By choosing an order of the index set, equation (12) can be written as
$$F=GW, \tag{13}$$
where $F=(f(x_1),\ldots,f(x_P))^T$ is a column vector, $G=(\ldots,\hat G_{lk},\ldots,\mathbf 1)$ is a $P\times\#(I)$ matrix, $\#(I)$ denotes the number of elements of the index set $I$, $\hat G_{lk}=(G_{lk}(x_1),\ldots,G_{lk}(x_P))^T$, and $W$ is the coefficient vector to be determined. To obtain the matrix $G$, the candidate columns $\hat G_{lk}$ are searched in the order $l,k=0,\pm1,\pm2,\ldots$: starting from $G=(\mathbf 1)$, if $\mathrm{rank}(G,\hat G_{lk})>\mathrm{rank}\,G$ and $\max_j|G_{lk}(x_j)|>\varepsilon$, then $G\Leftarrow(G,\hat G_{lk})$, where $\varepsilon$ is a given very small number. Following the above procedure, QR decomposition is used to determine the coefficient vector $W$. The weights can then be evaluated. Obviously, the overall convergence rate is substantially speeded up.

In the above example, the network required 5000 iterations to converge to a total squared error of 0.0112. In order to speed up the rate of convergence and to enhance the approximation capability of our network, the proposed training scheme was used for the function $f(x)=0.5e^{-x}\sin(6x)$. The proposed network with 5 hidden units was first trained by the BP algorithm. After 100 iterations, we obtain the matrix $G$ with 12 columns, which implies that 12 wavelet coefficients are to be determined by QR decomposition. Our results show that a total squared error of $7.9954\times10^{-5}$ was obtained by only 100 training iterations together with the QR decomposition. The computational time required by this training scheme is substantially reduced compared to other MWNN training schemes. For the case of the 2-D function $f(x,y)=\frac{\sin x}{x}\cdot\frac{\sin y}{y}$, an MWNN-DWT with 50 hidden units was trained by the proposed training scheme. In this example, only 400 training iterations (with the BP algorithm) together with the QR decomposition were required to reach a total squared error of 0.1688. Fig. 3 and Fig. 4 show the original function and the approximation results, respectively.
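The second training stage is a linear least squares problem in $W$, so it can be sketched as below. This is our own code: the first-stage weights are simply randomized instead of coming from a BP run, and the wavelet and index set are our illustrative choices.

```python
import numpy as np

def mexican_hat(z):
    # Stand-in mother wavelet psi (our choice)
    return (1.0 - z ** 2) * np.exp(-0.5 * z ** 2)

def build_G(x, wi, ai, bi, index_set, a=2.0, b=1.0):
    """Columns G_lk(x_j) = sum_i w_i psi(a^l (a_i x_j + b_i) - k b),
    plus a constant column for g_bar, as in (12)-(13)."""
    z = np.outer(x, ai) + bi                          # (P, N) hidden inputs
    cols = [mexican_hat(a ** l * z - k * b) @ wi for (l, k) in index_set]
    cols.append(np.ones_like(x))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(1)
N = 5
wi = rng.uniform(-0.5, 0.5, N)                        # first-stage parameters,
ai = rng.uniform(-0.5, 0.5, N)                        # randomized here instead
bi = rng.uniform(-0.5, 0.5, N)                        # of BP-trained
x = np.linspace(-1.0, 1.0, 100)
F = 0.5 * np.exp(-x) * np.sin(6.0 * x)                # target from the text

index_set = [(l, k) for l in (-1, 0, 1) for k in range(-3, 4)]
G = build_G(x, wi, ai, bi, index_set)
Q, R = np.linalg.qr(G)                                # QR step of stage two
W = np.linalg.lstsq(R, Q.T @ F, rcond=None)[0]        # solve R W = Q^T F
sse = float(np.sum((G @ W - F) ** 2))
```

Because the constant column is included, the least squares solve is always at least as good as fitting the mean, and the wavelet columns reduce the error further.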
Fig. 2. The total squared errors of the WNN (8) and the proposed network for the first function
Wavelets Based Neural Network for Function Approximation
1 .0 0
1 .0 0
0 .8 0
0 .8 0
0 .6 0
85
0 .6 0
0 .4 0
10 6
0 .2 0 2 0 .0 0
-2 -6
-0 .2 0 10
6
2
-2
-6
-10 -1 0
Fig. 3. Original 2-D function
0 .4 0
10 6
0 .2 0 2 0 .0 0
-2 -6
-0 .2 0 10
6
2
-2
-6
-1 0 -1 0
Fig. 4. Resulting approximation
4 Conclusion

In this paper, we use the discrete wavelet transform to extend the MWNN to a new type of network called MWNN-DWT. This enables us to minimize the measure function by adjusting the wavelet-basis activation function. Our results indicate that the proposed MWNN-DWT can deliver an enhanced function approximation capability. The proposed training algorithm, which is based on the standard BP algorithm and QR decomposition, has an outstanding convergence rate compared to other MWNN training algorithms.
Acknowledgement. This work is supported by the National Natural Science Foundation of China (60472103), the Shanghai Excellent Academic Leader Project (05XP14027), and the Shanghai Leading Academic Discipline Project (T0102).
References
1. Zhang, Q., Benveniste, A.: Wavelet Networks. IEEE Trans. on Neural Networks 3 (1992) 889-898
2. Yamakawa, T., Uchino, E., Samatsu, T.: Wavelet Neural Networks Employing Over-complete Number of Compactly Supported Non-orthogonal Wavelets and Their Applications. IEEE Int. Conf. on Neural Networks 1 (1994) 1391-1396
3. Pati, Y.C., Krishnaprasad, P.S.: Analysis and Synthesis of Feedforward Neural Networks Using Discrete Affine Wavelet Transformations. IEEE Trans. on Neural Networks 4 (1993) 73-85
4. Daubechies, I.: The Wavelet Transform, Time-frequency Localization and Signal Analysis. IEEE Trans. on Information Theory 36 (1990) 961-1005
5. Zhang, J., Walter, G.G., Miao, Y., Wayne, W.N.: Wavelet Neural Networks for Function Learning. IEEE Trans. on Signal Processing 43 (1995) 1485-1496
6. Daubechies, I.: Ten Lectures on Wavelets. SIAM Press, Philadelphia, PA (1992)
7. Kugarajah, T., Zhang, Q.: Multidimensional Wavelet Frames. IEEE Trans. on Neural Networks 6 (1995) 1552-1556
8. Chen, T., Chen, H.: Approximations of Continuous Functionals by Neural Networks with Application to Dynamic Systems. IEEE Trans. on Neural Networks 4 (1993) 910-918
9. Chow, T.W.S., Fang, Y.: Two-dimensional Learning Strategy for Multilayer Feedforward Neural Network. Neurocomputing 34 (2000) 195-206
Passivity Analysis of Dynamic Neural Networks with Different Time-Scales

Alejandro Cruz Sandoval and Wen Yu

Departamento de Control Automático, CINVESTAV-IPN, A.P. 14-740, Av. IPN 2508, México D.F., 07360, México
[email protected]
Abstract. Dynamic neural networks with different time-scales include both fast and slow phenomena. Some applications require that the equilibrium points of the designed network be stable. In this paper, the passivity-based approach is used to derive stability conditions for dynamic neural networks with different time-scales. Several stability properties, such as passivity, asymptotic stability, input-to-state stability and bounded-input bounded-output stability, are guaranteed in certain senses. Numerical examples are given to demonstrate the effectiveness of the theoretical results.
1 Introduction

A wide class of physical systems in engineering contains both slow and fast dynamic phenomena occurring on separate time-scales. Recent results show that neural network techniques are very effective for modeling a wide class of complex nonlinear systems with different time-scales when we have no complete model information, or even when we consider the plant as a "black box" [4]. This model-free approach uses the nice features of neural networks, but the lack of a model for the controlled plant makes it hard to obtain theoretical results on stability. Some applications of neural networks, such as pattern storage and solving optimization problems, require that the equilibrium points of the designed network be stable [5]. So, it is important to study the stability of neural networks.

Dynamic neural networks with different time-scales can model the dynamics of the short-term memory (neural activity levels) and the long-term memory (dynamics of unsupervised synaptic modifications). Their capability of storing patterns as stable equilibrium points requires stability criteria that include the mutual interference between neuron and learning dynamics. The dynamics of dynamic neural networks with different time-scales are extremely complex, exhibiting convergence to point attractors and periodic attractors [1]. Networks where both short-term and long-term memory are dynamic variables cannot be placed in the form of the Cohen-Grossberg equations [3]. However, a large class of competitive systems has been identified as being "generally" convergent to point attractors even though no Lyapunov functions have been found for their flows.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 86-92, 2006. © Springer-Verlag Berlin Heidelberg 2006
Passivity Analysis of Dynamic Neural Networks with Different Time-Scales
There are not many results on the stability analysis of neural networks in spite of their successful applications. The global asymptotic stability (GAS) of dynamic neural networks has been developed during the last decade. Negative semi-definiteness of the interconnection matrix may make the Hopfield-Tank neuro circuit GAS [2]; the stability of neuro circuits was also established via the concept of diagonal stability [6]. Within the framework of Lur'e systems, the absolute stability of multilayer perceptrons (MLP) and recurrent neural networks was studied in [13] and [7]. Input-to-state stability (ISS) analysis is an effective tool for dynamic neural networks; in [14] it is stated that if the weights are small enough, neural networks are ISS and GAS with zero input. The stability of identification and tracking errors with neural networks has also been investigated: [4] and [9] studied the stability conditions when multilayer perceptrons are used to identify and control a nonlinear system. Lyapunov-like analysis is a popular tool for proving stability: [10] and [14] discussed the stability of single-layer dynamic neural networks, and for high-order and multilayer networks stability results may be found in [7] and [10]. Passivity theory is another effective tool for analyzing the stability of a nonlinear system. It can deal with nonlinear systems using only the general characteristics of the input-output dynamics, and offers elegant solutions for proving absolute stability. The passivity framework is a promising approach to the stability analysis of neural networks, because it can lead to general conclusions on stability using only input-output characteristics. Passivity properties of MLP were examined in [9]; by analyzing the interconnection of error models, the authors derived the relationship between passivity and closed-loop stability.
To the best of our knowledge, an open-loop analysis based on the passivity method for dynamic neural networks with different time-scales has not yet been established in the literature. In this paper we apply passivity techniques to two-time-scale neural networks, which have two types of state variables (short-term and long-term memory) describing the slow and fast dynamics of the system. We prove by the passivity method that a gradient-like learning law makes the dynamic neural network with different time-scales stable. Under additional conditions, the neural networks are globally asymptotically stable (GAS) and input-to-state stable (ISS).
2 Stability Properties of Dynamic Neural Networks with Different Time-Scales
A general dynamic neural network with two time-scales can be expressed as

\dot{x} = Ax + W_1 \sigma_1(V_1 [x, z]^T) + W_3 \phi_1(V_3 [x, z]^T) u
\varepsilon \dot{z} = Bz + W_2 \sigma_2(V_2 [x, z]^T) + W_4 \phi_2(V_4 [x, z]^T) u     (1)

where x \in R^n and z \in R^n are the slow and fast states, W_i \in R^{n \times 2n} (i = 1 \cdots 4) are the weights in the output layers, V_i \in R^{2n \times 2n} (i = 1 \cdots 4) are the weights in the hidden layers, \sigma_k = [\sigma_k(x_1) \cdots \sigma_k(x_n), \sigma_k(z_1) \cdots \sigma_k(z_n)]^T \in R^{2n} (k = 1, 2), \phi_k(x, z) = \mathrm{diag}[\phi_k(x_1) \cdots \phi_k(x_n), \phi_k(z_1) \cdots \phi_k(z_n)] \in R^{2n \times 2n} is a diagonal matrix (k = 1, 2), u = [u_1, u_2 \cdots u_m, 0, \cdots 0]^T \in R^{2n}, A \in R^{n \times n} and B \in R^{n \times n} are stable (Hurwitz) matrices, and \varepsilon is a small positive constant. The structure of the dynamic neural network (1) is shown in Fig. 1.

A.C. Sandoval and W. Yu

Fig. 1. Dynamic neural network with two time-scales

When \varepsilon = 1, the dynamic neural network (1) has been discussed by many authors, for example [7], [10] and [14]. One may see that the Hopfield model is a special case of this kind of neural network with A = \mathrm{diag}\{a_i\}, a_i := -1/(R_i C_i), R_i > 0 and C_i > 0, where R_i and C_i are the resistance and capacitance at the i-th node of the network, respectively. The substructure W_1 \sigma_1(V_1 [x, z]^T) + W_3 \phi_1(V_3 [x, z]^T) u is a multilayer perceptron structure. In order to simplify the analysis, we let the hidden layers V_i = I and discuss the single-layer neural network
\dot{x} = Ax + W_1 \sigma_1(x, z) + W_3 \phi_1(x, z) u
\varepsilon \dot{z} = Bz + W_2 \sigma_2(x, z) + W_4 \phi_2(x, z) u     (2)
Because A and B are Hurwitz matrices, for any positive definite matrices Q_1, Q_2 \in R^{n \times n} the following Lyapunov equations have positive definite solutions P_1 and P_2:

P_1 A + A^T P_1 = -Q_1
P_2 B + B^T P_2 = -Q_2     (3)
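The Lyapunov equations (3) can be solved numerically. The sketch below uses illustrative Hurwitz matrices (not from the paper) and Q_1 = Q_2 = I; note that SciPy's solver uses the convention AX + XA^H = Q, so the transposed form of (3) is passed:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Illustrative Hurwitz matrices (assumptions, not values from the paper).
A = np.array([[-2.0, 0.5], [0.0, -3.0]])
B = np.array([[-4.0, 0.0], [1.0, -5.0]])
Q1 = np.eye(2)
Q2 = np.eye(2)

# (3) reads P1 A + A^T P1 = -Q1; SciPy solves a x + x a^H = q,
# so pass A^T (the solution is symmetric, hence equals P1).
P1 = solve_continuous_lyapunov(A.T, -Q1)
P2 = solve_continuous_lyapunov(B.T, -Q2)

print(np.allclose(P1 @ A + A.T @ P1, -Q1))  # residual of (3) vanishes
print(np.all(np.linalg.eigvalsh(P1) > 0))   # P1 is positive definite
```

Because A is Hurwitz and Q_1 is positive definite, the solution P_1 is guaranteed positive definite, which is exactly what Theorem 1 requires.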
The following theorem gives a basic passivity condition for the dynamic neural network with two time-scales.

Theorem 1. If the weights of the dynamic neural network defined by (2) are updated as

\dot{W}_1 = -P_1 x \sigma_1^T + \frac{1}{2} x u^T
\dot{W}_2 = -\frac{1}{\varepsilon} P_2 z \sigma_2^T + \frac{1}{2} z u^T
\dot{W}_3 = -P_1 x (\phi_1 u)^T + \frac{1}{2} x u^T
\dot{W}_4 = -\frac{1}{\varepsilon} P_2 z (\phi_2 u)^T + \frac{1}{2} z u^T     (4)
where P_1 and P_2 are the solutions of (3), then the dynamic neural network (2) is strictly passive from the input u to the output y \in R^{2n},

y = x^T (W_1 + W_3) + z^T (W_2 + W_4)     (5)
Proof. Select a Lyapunov function S_t (also called a storage function) as

S_t(x, z) = x^T P_1 x + z^T P_2 z + \mathrm{tr}\{W_1^T W_1\} + \mathrm{tr}\{W_2^T W_2\} + \mathrm{tr}\{W_3^T W_3\} + \mathrm{tr}\{W_4^T W_4\}     (6)
where P_1, P_2 \in R^{n \times n} are positive definite matrices and \mathrm{tr}\{\cdot\} denotes the trace, i.e., the sum of the diagonal elements of a matrix. According to (2), the derivative of S_t is

\dot{S}_t(x, z) = x^T (P_1 A + A^T P_1) x + \frac{1}{\varepsilon} z^T (P_2 B + B^T P_2) z + 2 x^T P_1 W_1 \sigma_1 + \frac{2}{\varepsilon} z^T P_2 W_2 \sigma_2 + 2 x^T P_1 W_3 \phi_1 u + \frac{2}{\varepsilon} z^T P_2 W_4 \phi_2 u + 2\,\mathrm{tr}\{\dot{W}_1^T W_1\} + 2\,\mathrm{tr}\{\dot{W}_2^T W_2\} + 2\,\mathrm{tr}\{\dot{W}_3^T W_3\} + 2\,\mathrm{tr}\{\dot{W}_4^T W_4\}
Adding and subtracting x^T (W_1 + W_3) u + z^T (W_2 + W_4) u, and using (3) and the updating law (4), we have

\dot{S}_t(x, z) = -x^T Q_1 x - \frac{1}{\varepsilon} z^T Q_2 z + [x^T (W_1 + W_3) + z^T (W_2 + W_4)] u     (7)

From Definition 1 we see that if the input is defined as u and the output as x^T (W_1 + W_3) + z^T (W_2 + W_4), then the dynamic neural network given by (2) is strictly passive with

V_t = x^T Q_1 x + \frac{1}{\varepsilon} z^T Q_2 z \geq 0

Remark 1. When the hidden layers V_i are present, since \sigma(\cdot) and \phi(\cdot) are bounded, the passivity property does not depend on V_i. The weights of the hidden layers may be fixed. Furthermore, we can also conclude that the stability properties of the dynamic neural network (2) are not influenced by the hidden layers.

Corollary 1. If the dynamic neural network (2) is unforced (u = 0), the updating law (4) makes the equilibrium x = z = 0 stable. If the control input of the dynamic neural network (2) is selected as

u = -\mu y = -\mu [x^T (W_1 + W_3) + z^T (W_2 + W_4)]^T, \quad \mu > 0     (8)

the updating law (4) makes the equilibrium x_t = 0 asymptotically stable.

Theorem 2. If the upper bounds of the weights W_i (i = 1 \cdots 4) satisfy

\lambda_{\min}(Q_1) \geq \lambda_{\max}(\bar{W}_1 + \bar{W}_3)
\lambda_{\min}(Q_2) \geq \lambda_{\max}(\bar{W}_2 + \bar{W}_4)     (9)
where \bar{W}_i = \sup \|W_i\|^2_{\Lambda_i^{-1}}, the weighted matrix norm is defined as \|A\|^2_B := A^T B A, \Lambda_i is a positive definite matrix, and \lambda_{\max}(\cdot) and \lambda_{\min}(\cdot) are the maximum and minimum eigenvalues of a matrix, then the updating law (4) makes the dynamic neural network (2) input-to-state stable (ISS).

Remark 2. Consider Property 2: ISS means that the behavior of the dynamic neural network remains bounded when its inputs are bounded. If bounded disturbances are also regarded as inputs, the neural network is also bounded with respect to the disturbances. So the dynamic neural network (2) is bounded-input bounded-output (BIBO) stable. If P_1 and P_2 are selected large enough, then condition (9) is not difficult to satisfy. From (3) we have

T_1 P_1 A + A^T T_1 P_1 = -T_1 Q_1
T_2 P_2 B + B^T T_2 P_2 = -T_2 Q_2     (10)

where T_i = T_i^T > 0. If the positive definite matrices T_i are large enough, (9) is satisfied, i.e. (\varepsilon is a small positive constant)

\lambda_{\max}(\bar{W}_1 + \bar{W}_3) \leq \lambda_{\min}(T_1 Q_1)
\lambda_{\max}(\bar{W}_2 + \bar{W}_4) \leq \frac{1}{\varepsilon} \lambda_{\min}(T_2 Q_2)
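In practice, checking an eigenvalue condition like (9) amounts to comparing two extreme eigenvalues. The sketch below is hypothetical: it assumes \Lambda_i = I (so the weight bound reduces to W_i^T W_i) and uses small random weight matrices, merely to show the mechanics of the test:

```python
import numpy as np

# Hypothetical weight bounds: Lambda_i = I and random weights are assumptions.
rng = np.random.default_rng(0)
n = 3
Q1 = 10.0 * np.eye(n)                     # lambda_min(Q1) = 10
W1 = 0.5 * rng.standard_normal((n, n))
W3 = 0.5 * rng.standard_normal((n, n))

W1_bar = W1.T @ W1                        # ||W1||^2 with Lambda = I
W3_bar = W3.T @ W3

lam_min_Q1 = np.linalg.eigvalsh(Q1).min()
lam_max_W = np.linalg.eigvalsh(W1_bar + W3_bar).max()
print(lam_min_Q1 >= lam_max_W)            # first inequality of (9)
```

As Remark 2 notes, scaling Q_1 (equivalently choosing a large T_1) enlarges the left-hand side, so small enough weights always satisfy the condition.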
3 Simulation
To illustrate the theoretical results, we give the following simulation. The dynamic neural network is

\dot{x} = Ax + W_1 \sigma_1(x, z) + W_3 \phi_1(x, z) u
\varepsilon \dot{z} = Bz + W_2 \sigma_2(x, z) + W_4 \phi_2(x, z) u     (11)

where x \in R^2, z \in R^2, A = B = \begin{pmatrix} -5 & 0 \\ 0 & -5 \end{pmatrix}, and \varepsilon = 0.05. The activation functions \sigma_k and \phi_k are selected as \tanh(\cdot). First, let us check the passivity of the neural network (11).
Fig. 2. Dynamic neural network with two time-scales
Fig. 3. Zero-input responses
If we select Q_1 = Q_2 = \begin{pmatrix} 10 & 0 \\ 0 & 10 \end{pmatrix}, the solution of the Lyapunov equation (3) is P_1 = P_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}. The updating law (4) is used for the weights W_i. Fig. 2 and Fig. 3 show the bounded-input responses (u = [3 \sin(0.5t), 0, 0, 0]^T), the zero-input responses, and the responses with the additional output feedback (8), u = -\mu y (\mu = 1).
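A minimal sketch of how this simulation could be reproduced, assuming forward-Euler integration; the step size and initial states/weights are arbitrary choices not stated in the paper, while A = B = -5I, \varepsilon = 0.05, P_1 = P_2 = I and the bounded input follow the values above:

```python
import numpy as np

n, eps, dt, T = 2, 0.05, 1e-3, 5.0
A = -5.0 * np.eye(n)
B = -5.0 * np.eye(n)
P1, P2 = np.eye(n), np.eye(n)              # solutions of (3) for Q1 = Q2 = 10 I
rng = np.random.default_rng(1)
x = np.array([5.0, 3.0])                   # arbitrary initial slow state
z = np.array([2.0, 1.0])                   # arbitrary initial fast state
W1, W2, W3, W4 = (0.1 * rng.standard_normal((n, 2 * n)) for _ in range(4))

for k in range(int(T / dt)):
    t = k * dt
    u = np.array([3.0 * np.sin(0.5 * t), 0.0, 0.0, 0.0])
    s = np.tanh(np.concatenate([x, z]))    # sigma_k(x, z)
    phi_u = np.tanh(np.concatenate([x, z])) * u  # diagonal phi_k(x, z) times u
    dx = A @ x + W1 @ s + W3 @ phi_u
    dz = (B @ z + W2 @ s + W4 @ phi_u) / eps
    # learning law (4)
    dW1 = -np.outer(P1 @ x, s) + 0.5 * np.outer(x, u)
    dW2 = -np.outer(P2 @ z, s) / eps + 0.5 * np.outer(z, u)
    dW3 = -np.outer(P1 @ x, phi_u) + 0.5 * np.outer(x, u)
    dW4 = -np.outer(P2 @ z, phi_u) / eps + 0.5 * np.outer(z, u)
    x, z = x + dt * dx, z + dt * dz
    W1, W2, W3, W4 = W1 + dt * dW1, W2 + dt * dW2, W3 + dt * dW3, W4 + dt * dW4

print(np.linalg.norm(x), np.linalg.norm(z))  # trajectories remain bounded
```

With the bounded input, the fast state z settles on the \varepsilon time-scale while x relaxes slowly, consistent with the two-time-scale structure of (11).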
4 Conclusion
By means of the passivity technique, we have given some new results on the stability of dynamic neural networks with different time-scales. A simple gradient learning algorithm can make the neural networks passive, stable, asymptotically stable, input-to-state stable and bounded-input bounded-output stable in certain senses. Although it is not easy to construct special passive circuits with the passivity technique used in this paper, it is possible to extend these results to neuro-identification and neuro-control.
References
1. Amari, S.: Field Theory of Self-Organizing Neural Nets. IEEE Trans. Syst., Man, Cybern. 13(1) (1983) 741-748
2. Forti, M., Manetti, S., Marini, M.: Necessary and Sufficient Condition for Absolute Stability of Neural Networks. IEEE Trans. on Circuits and Systems-I 41(4) (1994) 491-494
3. Grossberg, S.: Competition, Decision and Consensus. J. Math. Anal. Applicat. 66(5) (1978) 470-493
4. Jagannathan, S., Lewis, F.L.: Identification of Nonlinear Dynamical Systems Using Multilayered Neural Networks. Automatica 32(12) (1996) 1707-1712
5. Jin, L., Gupta, M.: Stable Dynamic Backpropagation Learning in Recurrent Neural Networks. IEEE Trans. Neural Networks 10(9) (1999) 1321-1334
6. Kaszkurewicz, E., Bhaya, A.: On a Class of Globally Stable Neural Circuits. IEEE Trans. on Circuits and Systems-I 41(1) (1994) 171-174
7. Kosmatopoulos, E.B., Polycarpou, M.M., Christodoulou, M.A., Ioannou, P.A.: High-Order Neural Network Structures for Identification of Dynamical Systems. IEEE Trans. on Neural Networks 6(2) (1995) 422-431
8. Meyer-Bäse, A., Ohl, F., Scheich, H.: Singular Perturbation Analysis of Competitive Neural Networks with Different Time-Scales. Neural Comput. 8(3) (1996) 545-563
9. Lewis, F.L., Liu, K., Yesildirek, A.: Multilayer Neural-Net Robot Controller with Guaranteed Tracking Performance. IEEE Trans. on Neural Networks 7(2) (1996) 388-398
10. Rovithakis, G.A., Christodoulou, M.A.: Adaptive Control of Unknown Plants Using Dynamical Neural Networks. IEEE Trans. on Syst., Man and Cybern. 24(2) (1994) 400-412
11. Sontag, E.D., Wang, Y.: On Characterization of the Input-to-State Stability Property. Systems & Control Letters 24(3) (1995) 351-359
12. Shao, Z.H.: Robust Stability of Two-Time-Scale Systems with Nonlinear Uncertainties. IEEE Transactions on Automatic Control 49(1) (2004) 135-140
13. Suykens, J., Moor, B., Vandewalle, J.: Robust Local Stability of Multilayer Recurrent Neural Networks. IEEE Trans. Neural Networks 11(1) (2000) 222-229
14. Yu, W., Poznyak, A.S.: Indirect Adaptive Control via Parallel Dynamic Neural Networks. IEE Proceedings - Control Theory and Applications 146(1) (1999) 25-30
Exponential Dissipativity of Non-autonomous Neural Networks with Distributed Delays and Reaction-Diffusion Terms

Zhiguo Yang^{1,2}, Daoyi Xu^1, and Yumei Huang^1

^1 Institute of Mathematics, Sichuan University, Chengdu 610064, China
[email protected]
^2 College of Mathematics and Software Science, Sichuan Normal University, Chengdu 610066, China
Abstract. In this paper, a class of non-autonomous neural networks with distributed delays and reaction-diffusion terms is considered. Employing the properties of the diffusion operator and inequality techniques, we investigate the positive invariant set and global exponential stability, and then obtain the exponential dissipativity of the neural networks under consideration. Our results extend and improve earlier ones. An example is given to demonstrate their effectiveness.
1 Introduction
Dynamics of autonomous neural networks based on the Hopfield architecture have been intensively studied in recent years, and many important results have been obtained ([1]-[3]). However, a non-autonomous neural network model can more accurately depict the evolutionary processes of the networks ([4]-[5]). Therefore, it is important to study the dynamics of non-autonomous neural networks. On the other hand, delay and diffusion effects [6] cannot be avoided in neural networks when electrons are moving in asymmetric electromagnetic fields. Thus, the study of neural networks with delays and reaction-diffusion terms is extremely important for manufacturing high-quality neural networks. As is well known, the stability of neural networks with or without delay has received much attention in the literature ([1]-[6]). As pointed out in [7], dissipativity is also an important concept in dynamical neural networks, with applications in areas such as stability theory, chaos and synchronization theory, system norm estimation, and robust control. In [7], some new sufficient conditions for the global dissipativity of autonomous neural networks with time delay are obtained. However, to the best of our knowledge, there is no investigation of the dissipativity of non-autonomous neural networks with delays and reaction-diffusion terms. In this article, we consider a class of non-autonomous neural networks with distributed delays and reaction-diffusion terms. By using the properties of the diffusion operator and inequality techniques, we first investigate the positive invariant set and the global exponential stability, and then obtain the exponential
dissipativity of the networks. Our results extend and improve earlier ones. An example is given to demonstrate the effectiveness of these results.
2 Model Description and Preliminaries
In this article, we consider the following non-autonomous neural networks with distributed delays and reaction-diffusion terms:

\frac{\partial u_i}{\partial t} = \sum_{k=1}^{m} \frac{\partial}{\partial x_k}\Big(D_{ik} \frac{\partial u_i}{\partial x_k}\Big) - a_i(t) u_i + \sum_{j=1}^{n} b_{ij}(t)\, g_j(u_j(t - \tau_{ij}(t), x)) + \sum_{j=1}^{n} c_{ij}(t) \int_{-\infty}^{t} k_{ij}(t-s)\, f_j(u_j(s, x))\, ds + J_i(t),
\frac{\partial u_i}{\partial n} := \Big(\frac{\partial u_i}{\partial x_1}, \cdots, \frac{\partial u_i}{\partial x_m}\Big) = 0, \quad t \geq t_0, \ x \in \partial\Omega,
u_i(t_0 + s, x) = \phi_i(s, x), \quad -\infty < s \leq 0, \ x \in \Omega, \ i \in \mathcal{N} = \{1, 2, \cdots, n\},     (1)

where 0 \leq \tau_{ij}(t) \leq \tau, \tau is a constant; \Omega is a bounded domain in R^m with smooth boundary \partial\Omega and measure \mu = \mathrm{mes}\,\Omega > 0; the smooth function D_{ik} = D_{ik}(t, x) \geq 0 corresponds to the transmission diffusion operator; and the k_{ij} are continuous delay kernel functions. We assume that the functions a_i(t), b_{ij}(t), c_{ij}(t) and J_i(t) are continuous and that the solution u(t, x) = u(t; t_0, \phi) of system (1) is uniquely determined by the initial value \phi_i(s, x).

Let C = C((-\infty, 0] \times \Omega, R^n). For \phi(s, x) \in C, we define [\phi]^+_s = (\|\phi_1\|_{2s}, \|\phi_2\|_{2s}, \cdots, \|\phi_n\|_{2s})^T, where \|\phi_i(s, x)\|_{2s} = \max_{-\infty < s \leq 0} \|\phi_i(s, x)\|_2 and \|\phi_i(s, x)\|_2 = (\int_\Omega \phi_i^2(s, x)\, dx)^{1/2}. For any real matrices A = (a_{ij})_{n \times m} and B = (b_{ij})_{n \times m}, we write A \geq B if a_{ij} \geq b_{ij}, \forall i \in \mathcal{N}, j \in \{1, 2, \cdots, m\}.

Definition 1. A set S \subset C is called a positive invariant set of Eq. (1) if for any initial value \phi \in S, u_t(t_0, \phi) \in S, \forall t \geq t_0, where u_t(t_0, \phi) = u(t + s; t_0, \phi) for s \in (-\infty, 0].

Definition 2. Eq. (1) is said to be globally exponentially stable if there are constants \lambda > 0 and M \geq 1 such that for any two solutions u(t; t_0, \phi) and u(t; t_0, \psi) with initial functions \phi, \psi \in C, respectively, one has

\|u(t; t_0, \phi) - u(t; t_0, \psi)\| \leq M \|\phi - \psi\|_s\, e^{-\lambda(t - t_0)}, \quad \forall t \geq t_0,
where \|u\| = (\sum_{i=1}^{n} \|u_i(t, x)\|_2^2)^{1/2}, \|\phi\|_s = (\sum_{i=1}^{n} \|\phi_i(s, x)\|_{2s}^2)^{1/2} = |[\phi]^+_s|, and |\cdot| is the Euclidean norm of R^n.
Definition 3. The neural network defined by (1) is called an exponentially dissipative system if there are three positive constants M_1, M_2 and \lambda such that

\|u(t; t_0, \phi)\| \leq M_1 \|\phi\|_s\, e^{-\lambda(t - t_0)} + M_2, \quad \forall \phi \in C, \ t \geq t_0.

For convenience, we introduce the following assumptions.
(A1) There exist \sigma_j > 0, l_j > 0 such that |g_j(u) - g_j(v)| \leq \sigma_j |u - v| and |f_j(u) - f_j(v)| \leq l_j |u - v|, \forall j \in \mathcal{N}, u, v \in R.

(A2) There exists a positive constant \beta such that the continuous delay kernel functions k_{ij} : [0, \infty) \to [0, \infty) satisfy

\hat{k}_{ij} = \int_0^\infty e^{\beta\xi} k_{ij}(\xi)\, d\xi < \infty, \quad i, j \in \mathcal{N}.

(A3) There exist continuous functions h_i(t) > 0 and constants \hat{a}_i > 0, \hat{b}_{ij} \geq 0, \hat{c}_{ij} \geq 0, \hat{J}_i \geq 0, i, j \in \mathcal{N}, such that a_i(t) \geq \hat{a}_i h_i(t), |b_{ij}(t)| \leq \hat{b}_{ij} h_i(t), |c_{ij}(t)| \leq \hat{c}_{ij} h_i(t), |J_i(t)| \leq \hat{J}_i h_i(t).

(A4) \hat{A} - \hat{B}\sigma - \hat{C}L is an M-matrix [8], where \hat{C} = (\hat{c}^*_{ij})_{n \times n}, \hat{c}^*_{ij} = \hat{c}_{ij} \hat{k}_{ij}, \hat{B} = (\hat{b}_{ij})_{n \times n}, \sigma = \mathrm{diag}\{\sigma_1, \cdots, \sigma_n\}, L = \mathrm{diag}\{l_1, \cdots, l_n\}, \hat{A} = \mathrm{diag}\{\hat{a}_1, \cdots, \hat{a}_n\}.
3 Main Results

In this section, we investigate the positive invariant set and the global exponential stability of the non-autonomous system (1), and then obtain its exponential dissipativity.

Theorem 1. Assume that conditions (A1)-(A4) are satisfied. Let \mu = \mathrm{mes}\,\Omega, \hat{J} = (\hat{J}_1, \cdots, \hat{J}_n)^T, \hat{g} = (|g_1(0)|, \cdots, |g_n(0)|)^T, \hat{f} = (|f_1(0)|, \cdots, |f_n(0)|)^T, \hat{I} = \hat{B}\hat{g} + \hat{C}\hat{f} + \hat{J}, and S = \{\phi \in C \mid [\phi]^+_s \leq N = (\hat{A} - \hat{B}\sigma - \hat{C}L)^{-1} \hat{I}\mu\}. Then S is a positive invariant set of system (1).

Proof. Without loss of generality, let \hat{J} > 0. Since \hat{A} - \hat{B}\sigma - \hat{C}L is an M-matrix, we have (\hat{A} - \hat{B}\sigma - \hat{C}L)^{-1} \geq 0 and N = (N_1, \cdots, N_n)^T = (\hat{A} - \hat{B}\sigma - \hat{C}L)^{-1} \hat{I}\mu > 0. We now prove that for \phi \in C with [\phi]^+_s \leq N,
[u(t, x)]^+ = (\|u_1(t, x)\|_2, \|u_2(t, x)\|_2, \cdots, \|u_n(t, x)\|_2)^T \leq N, \quad \forall t \geq t_0,     (2)

where u(t, x) = u(t; t_0, \phi) is the solution of (1) with initial function \phi \in C. First, we shall prove that for p > 1, [\phi]^+_s < pN implies

[u(t, x)]^+ < pN, \quad t \geq t_0.     (3)

If not, there must exist l and t_1 > t_0 such that

\|u_l(t_1, x)\|_2 = pN_l, \quad \|u_l(t, x)\|_2 < pN_l, \ -\infty < t < t_1,     (4)
\|u_i(t, x)\|_2 \leq pN_i, \quad \forall i \in \mathcal{N}, \ -\infty < t \leq t_1.     (5)
Multiplying both sides of the equation in (1) by u_i(t, x) and integrating over \Omega gives

\frac{1}{2}\frac{d}{dt}\int_\Omega u_i^2\, dx = \int_\Omega u_i \sum_{k=1}^m \frac{\partial}{\partial x_k}\Big(D_{ik}\frac{\partial u_i}{\partial x_k}\Big) dx - a_i(t)\int_\Omega u_i^2\, dx + \sum_{j=1}^n b_{ij}(t) \int_\Omega u_i\, [g_j(u_j(t-\tau_{ij}(t), x)) - g_j(0)]\, dx + \sum_{j=1}^n c_{ij}(t) \int_\Omega u_i \int_{-\infty}^t k_{ij}(t-s)[f_j(u_j(s, x)) - f_j(0)]\, ds\, dx + \Big[\sum_{j=1}^n b_{ij}(t)\, g_j(0) + \sum_{j=1}^n c_{ij}(t)\, f_j(0) \int_{-\infty}^t k_{ij}(t-s)\, ds + J_i(t)\Big] \int_\Omega u_i\, dx, \quad i \in \mathcal{N}.     (6)
From the boundary condition, we get

\sum_{k=1}^m \int_\Omega u_i \frac{\partial}{\partial x_k}\Big(D_{ik}\frac{\partial u_i}{\partial x_k}\Big) dx = \int_{\partial\Omega} \Big(u_i D_{ik} \frac{\partial u_i}{\partial x_k}\Big)_{k=1}^m \cdot ds - \sum_{k=1}^m \int_\Omega D_{ik}\Big(\frac{\partial u_i}{\partial x_k}\Big)^2 dx = -\sum_{k=1}^m \int_\Omega D_{ik}\Big(\frac{\partial u_i}{\partial x_k}\Big)^2 dx, \quad \forall i \in \mathcal{N},     (7)

where \nabla is the gradient operator and (D_{ik}\frac{\partial u_i}{\partial x_k})_{k=1}^m = (D_{i1}\frac{\partial u_i}{\partial x_1}, \cdots, D_{im}\frac{\partial u_i}{\partial x_m})^T.
∂ui m ∂ui ∂ui T where ∇ is the gradient operator, (Dik ∂x ) = (Di1 ∂x , · · · , Dim ∂x ) . 1 m k k=1 By the conditions(A1 )-(A2 ), equation (6)-(7) and Holder inequality, we have n d ˆbij σj ||uj (t − τij (t), x)||2 ||ui ||2 ≤ −ˆ ai hi (t)||ui ||2 + hi (t)[ dt j=1 t n + cˆij kij (t − s)lj uj (s, x)2 ds + Ii μ], i ∈ N , t ≥ t0 . (8) j=1
where Ii =
−∞
n n ˆbij |gj (0)| + cˆij kˆij |fj (0)| + Jˆi . From [φ]+ < pN , (4) and (5), s
j=1
j=1
\|u_l(t_1, x)\|_2 \leq e^{-\int_{t_0}^{t_1}\hat{a}_l h_l(s)ds}\|\phi_l\|_{2s} + \int_{t_0}^{t_1} e^{-\int_s^{t_1}\hat{a}_l h_l(\xi)d\xi}\, h_l(s)\Big[\sum_{j=1}^n \hat{b}_{lj}\sigma_j\|u_j(s-\tau_{lj}(s), x)\|_2 + \sum_{j=1}^n \hat{c}_{lj}\int_{-\infty}^s k_{lj}(s-\xi)\, l_j \|u_j(\xi, x)\|_2\, d\xi + I_l\mu\Big] ds
< e^{-\int_{t_0}^{t_1}\hat{a}_l h_l(s)ds}\Big[pN_l - \frac{1}{\hat{a}_l}\Big(\sum_{j=1}^n \hat{b}_{lj}\sigma_j\, pN_j + \sum_{j=1}^n \hat{c}_{lj}\hat{k}_{lj} l_j\, pN_j + I_l\mu\Big)\Big] + \frac{1}{\hat{a}_l}\Big(\sum_{j=1}^n \hat{b}_{lj}\sigma_j\, pN_j + \sum_{j=1}^n \hat{c}_{lj}\hat{k}_{lj} l_j\, pN_j + I_l\mu\Big).     (9)
Since \hat{A} - \hat{B}\sigma - \hat{C}L is an M-matrix and N = (\hat{A} - \hat{B}\sigma - \hat{C}L)^{-1}\hat{I}\mu, one can get

\hat{a}_i\, pN_i \geq \sum_{j=1}^n \hat{b}_{ij}\sigma_j\, pN_j + \sum_{j=1}^n \hat{c}_{ij}\hat{k}_{ij} l_j\, pN_j + I_i\mu, \quad \forall i \in \mathcal{N}, \ p > 1.     (10)

Noting that e^{-\int_{t_0}^{t_1}\hat{a}_l h_l(s)ds} \leq 1, from (9) and (10) we obtain \|u_l(t_1, x)\|_2 < pN_l,
which contradicts the equality in (4). Letting p \to 1 in (3) completes the proof.

Theorem 2. Suppose that conditions (A1)-(A4) hold and that there exists a positive constant \lambda such that the continuous functions h_i(t) satisfy

\int_s^t \hat{a}_i h_i(\xi)\, d\xi \geq \lambda(t - s), \quad t \geq s \geq t_0, \ i \in \mathcal{N}.     (11)

Then system (1) is globally exponentially stable, and the exponential convergence rate \delta \leq \min\{\lambda, \beta\} is determined by (15).

Proof. For any \phi, \psi \in C, we denote

y(t, x) = u(t, x) - v(t, x),     (12)
(12)
where u(t, x) = u(t; t0 , φ) and v(t, x) = v(t; t0 , ψ) are the solutions of (1) with the initial functions φ, ψ ∈ C, respectively. Then from (1), y(t, x) must satisfy ⎧ m n ∂yi ∂ i ⎪ = (D ) − a (t)y + bij (t)¯ gj (yj (t − τij (t), x)) ⎪ ∂y ik i i ⎪ ∂t ∂xk ∂xk ⎪ j=1 ⎪ k=1 ⎪ ⎨ n t + cij (t) −∞ kij (t − s)f¯j (yj (s, x))ds, (13) ⎪ j=1 ⎪ ⎪ ∂yi ∂yi ∂yi ⎪ ⎪ := ( ∂x , · · · , ∂x ) = 0, t ≥ t0 , x ∈ ∂Ω, ⎪ 1 m ⎩ ∂n yi (t0 + s, x) = φi (s, x) − ψi (s, x), −∞ < s ≤ 0, x ∈ Ω, i ∈ N . where, g¯j (yj (t − τij (t), x)) = gj (uj (t − τij (t), x)) − gj (vj (t − τij (t), x)) and f¯j (yj (s, x)) = gj (uj (s, x)) − gj (vj (s, x)). Similar to the proof of the inequality (8), we obtain n d ˆbij σj ||yj (t − τij (t), x)||2 ||yi ||2 ≤ −ˆ ai hi (t)||yi ||2 + hi (t)[ dt j=1 t n + cˆij kij (t − s)lj yj (s, x)2 ds], ∀i ∈ N , t ≥ t0 . j=1
(14)
−∞
ˆ − CL ˆ is an M -matrix [8], there exist r = (r1 , · · · , rn )T > 0 and Since Aˆ − Bσ a δ > 0 such that ˆ − CL)r ˆ (Aˆ − Bσ > 0, ˆ − (Bσe ˆ τ δ + CL)r(1 ˆ Ar + Next, we shall prove that for any ε > 0,
δ ) > 0. λ−δ
(15)
98
Z. Yang, D. Xu, and Y. Huang
\|y_i(t, x)\|_2 < r_i R(\|\phi - \psi\|_s + \varepsilon)\, e^{-\delta(t - t_0)} = z_i(t), \quad t \in (-\infty, +\infty), \ i \in \mathcal{N},     (16)

where R is a positive constant satisfying R r_i \geq 1, i \in \mathcal{N}. Obviously, when t \in (-\infty, t_0], we have \|y_i(t, x)\|_2 < r_i R(\|\phi - \psi\|_s + \varepsilon)\, e^{-\delta(t - t_0)} = z_i(t), i \in \mathcal{N}. So, if (16) does not hold, there must exist l and t_1 > t_0 such that

\|y_l(t_1, x)\|_2 = z_l(t_1), \quad \|y_l(t, x)\|_2 < z_l(t), \ -\infty < t < t_1,     (17)
\|y_i(t, x)\|_2 \leq z_i(t), \quad \forall i \in \mathcal{N}, \ -\infty < t \leq t_1.     (18)
From (A2), (11), (14), (15), (17) and (18) we obtain

\|y_l(t_1, x)\|_2 \leq e^{-\int_{t_0}^{t_1}\hat{a}_l h_l(s)ds}\|y_l(t_0, x)\|_2 + \int_{t_0}^{t_1} e^{-\int_s^{t_1}\hat{a}_l h_l(\xi)d\xi}\, h_l(s)\Big[\sum_{j=1}^n \hat{b}_{lj}\sigma_j \|y_j(s - \tau_{lj}(s), x)\|_2 + \sum_{j=1}^n \hat{c}_{lj} l_j \int_{-\infty}^s k_{lj}(s - \xi)\|y_j(\xi, x)\|_2\, d\xi\Big] ds
< e^{-\int_{t_0}^{t_1}\hat{a}_l h_l(s)ds}\, r_l R(\|\phi - \psi\|_s + \varepsilon) + \frac{1}{\hat{a}_l}\int_{t_0}^{t_1} \frac{d\big(e^{-\int_s^{t_1}\hat{a}_l h_l(\xi)d\xi}\big)}{ds} \Big[\sum_{j=1}^n \hat{b}_{lj}\sigma_j r_j e^{\delta\tau} + \sum_{j=1}^n \hat{c}_{lj}\hat{k}_{lj} l_j r_j\Big] R(\|\phi - \psi\|_s + \varepsilon)\, e^{-\delta(s - t_0)}\, ds
\leq R(\|\phi - \psi\|_s + \varepsilon)\Big\{ e^{-\int_{t_0}^{t_1}\hat{a}_l h_l(s)ds}\Big[r_l - \frac{1}{\hat{a}_l}\sum_{j=1}^n(\hat{b}_{lj}\sigma_j r_j e^{\delta\tau} + \hat{c}_{lj}\hat{k}_{lj} l_j r_j)\Big] + \frac{1}{\hat{a}_l}\sum_{j=1}^n(\hat{b}_{lj}\sigma_j r_j e^{\delta\tau} + \hat{c}_{lj}\hat{k}_{lj} l_j r_j)\Big[e^{-\delta(t_1 - t_0)} + \frac{\delta}{\lambda - \delta}\big(e^{-\delta(t_1 - t_0)} - e^{-\lambda(t_1 - t_0)}\big)\Big]\Big\}
< R(\|\phi - \psi\|_s + \varepsilon)\Big\{ e^{-\lambda(t_1 - t_0)}\Big[r_l - \frac{1}{\hat{a}_l}\sum_{j=1}^n(\hat{b}_{lj}\sigma_j r_j e^{\delta\tau} + \hat{c}_{lj}\hat{k}_{lj} l_j r_j)\Big(1 + \frac{\delta}{\lambda - \delta}\Big)\Big] + \frac{1}{\hat{a}_l}\sum_{j=1}^n(\hat{b}_{lj}\sigma_j r_j e^{\delta\tau} + \hat{c}_{lj}\hat{k}_{lj} l_j r_j)\Big(1 + \frac{\delta}{\lambda - \delta}\Big)\, e^{-\delta(t_1 - t_0)}\Big\}
\leq R(\|\phi - \psi\|_s + \varepsilon)\, e^{-\delta(t_1 - t_0)}\, r_l = z_l(t_1),

which contradicts the equality in (17). Hence (16) is true. Letting \varepsilon \to 0, we have

[y(t, x)]^+ \leq r R \|\phi - \psi\|_s\, e^{-\delta(t - t_0)}, \quad t \in (-\infty, +\infty),     (19)
\|u(t; t_0, \phi) - u(t; t_0, \psi)\| \leq |r| R \|\phi - \psi\|_s\, e^{-\delta(t - t_0)}, \quad t \in (-\infty, +\infty).     (20)
Therefore, system (1) is globally exponentially stable. The proof is complete. Using Theorem 1 and Theorem 2, we easily obtain the following theorem.

Theorem 3. Suppose all conditions in Theorem 2 are satisfied. Then the neural network (1) is exponentially dissipative.
4 Example
Example. Consider the non-autonomous reaction-diffusion neural network

\frac{\partial u_1(t,x)}{\partial t} = \Delta u_1(t, x) - (4 + e^{-t})\, u_1(t, x) - \sin(t^2)\, g_1(u_1(t - \tfrac{1}{2}, x)) + g_2(u_2(t - |\cos t|, x)) + \int_{-\infty}^t e^{-(t-s)} u_1(s, x)\, ds + 2\int_{-\infty}^t e^{-(t-s)} u_2(s, x)\, ds + \frac{1}{4\pi^2}\arctan(t),
\frac{\partial u_2(t,x)}{\partial t} = \Delta u_2(t, x) - (5 + e^{-t})\, u_2(t, x) + \frac{1}{2}\cos(t^2)\, g_1(u_1(t - \tfrac{3}{4}, x)) - 2\, g_2(u_2(t - 1, x)) + \frac{1}{2}\int_{-\infty}^t e^{-(t-s)} u_1(s, x)\, ds + \int_{-\infty}^t e^{-(t-s)} u_2(s, x)\, ds + \frac{1}{4\pi^2}\arctan(t),
u_i(t_0 + s, x) = \phi_i(s, x), \quad -\infty < s \leq 0, \ x \in \Omega,
\frac{\partial u_i}{\partial n} := \Big(\frac{\partial u_i}{\partial x_1}, \frac{\partial u_i}{\partial x_2}, \frac{\partial u_i}{\partial x_3}\Big) = 0, \quad t \geq t_0, \ x \in \partial\Omega, \ i = 1, 2,     (21)

where \Omega = \{(x_1, x_2, x_3) \in R^3 \mid x_1^2 + x_2^2 + x_3^2 \leq 1\} and the sigmoid functions are g_1(s) = g_2(s) = \tanh(s). It is easy to check that conditions (A1)-(A4) are satisfied, and we may take \tau = 1, h_1(t) = h_2(t) \equiv 1, \hat{k}_{11} = \hat{k}_{12} = \hat{k}_{21} = \hat{k}_{22} = 1,

\hat{A} = \begin{pmatrix} 4 & 0 \\ 0 & 5 \end{pmatrix}, \quad \hat{B} = \begin{pmatrix} 1 & 1 \\ \frac{1}{2} & 2 \end{pmatrix}, \quad \sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \hat{C} = \begin{pmatrix} 1 & 2 \\ \frac{1}{2} & 1 \end{pmatrix}, \quad L = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},

\hat{I} = \hat{B}\hat{g} + \hat{C}\hat{f} + \hat{J} = \hat{J} = \Big(\frac{1}{8\pi}, \frac{1}{8\pi}\Big)^T, \quad \mu = \mathrm{mes}\,\Omega = \frac{4\pi}{3},

\hat{A} - \hat{B}\sigma - \hat{C}L = \begin{pmatrix} 2 & -3 \\ -1 & 2 \end{pmatrix} \text{ is an M-matrix}, \quad N = (\hat{A} - \hat{B}\sigma - \hat{C}L)^{-1}\hat{I}\mu = \Big(\frac{5}{6}, \frac{1}{2}\Big)^T,

and S_1 = \{\phi \in C \mid \|\phi_1\|_{2s} \leq \tfrac{5}{6}, \|\phi_2\|_{2s} \leq \tfrac{1}{2}\}. It follows from Theorems 1-3 that system (21) is exponentially dissipative.
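The constants of this example can be verified numerically. The sketch below checks the M-matrix property of \hat{A} - \hat{B}\sigma - \hat{C}L (nonpositive off-diagonal entries and positive leading principal minors for the 2x2 case) and recomputes the invariant-set bound N:

```python
import numpy as np

A_hat = np.diag([4.0, 5.0])
B_hat = np.array([[1.0, 1.0], [0.5, 2.0]])
C_hat = np.array([[1.0, 2.0], [0.5, 1.0]])
sigma = np.eye(2)
L = np.eye(2)

M = A_hat - B_hat @ sigma - C_hat @ L              # expected [[2, -3], [-1, 2]]
offdiag_ok = M[0, 1] <= 0 and M[1, 0] <= 0         # nonpositive off-diagonal
minors_ok = M[0, 0] > 0 and np.linalg.det(M) > 0   # positive leading minors

I_hat = np.full(2, 1.0 / (8.0 * np.pi))            # I_hat = J_hat
mu = 4.0 * np.pi / 3.0                             # measure of the unit ball
N = np.linalg.solve(M, I_hat * mu)                 # N should equal (5/6, 1/2)
print(offdiag_ok and minors_ok, N)
```

The computed N = (5/6, 1/2)^T matches the bound defining the invariant set S_1 above.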
References
1. Zhang, Y.: Global Exponential Convergence of Recurrent Neural Networks with Variable Delays. Theoretical Computer Science 312(1) (2004) 281-293
2. Wang, L.: Stability of Cohen-Grossberg Neural Networks with Distributed Delays. Applied Mathematics and Computation 160(1) (2005) 93-110
3. Zhang, Q., Wei, X., Xu, J.: Delay-dependent Exponential Stability of Cellular Neural Networks with Time-varying Delays. Chaos Solitons & Fractals 23(4) (2005) 1363-1369
4. Xu, D., Zhao, H.: Invariant and Attracting Sets of Hopfield Neural Networks with Delay. International Journal of Systems Science 32(7) (2001) 863-866
5. Jiang, H., Li, Z., Teng, Z.: Boundedness and Stability for Nonautonomous Cellular Neural Networks with Delay. Physics Letters A 306(1) (2003) 313-325
6. Cao, J.: An Estimation of the Domain of Attraction and Convergence Rate for Hopfield Continuous Feedback Neural Networks. Physics Letters A 325(4) (2004) 370-374
7. Arik, S.: On the Global Dissipativity of Dynamical Neural Networks with Time Delays. Physics Letters A 326(4) (2004) 126-132
8. Liao, X.: Theory and Applications of Stability for Dynamical Systems. Defence Industry Publishing House, Beijing (2000)
Convergence Analysis of Continuous-Time Neural Networks

Min-Jae Kang^1, Ho-Chan Kim^1, Farrukh A. Khan^2, Wang-Cheol Song^2, and Jacek M. Zurada^3

^1 Faculty of Electrical & Electronic Engineering, Cheju National University, Jeju 690-756, South Korea {minjk, hckim}@cheju.ac.kr
^2 Department of Computer Engineering, Cheju National University, Jeju 690-756, South Korea {farrukh, philo}@cheju.ac.kr
^3 Department of Electrical Engineering, University of Louisville, Louisville, KY, 40292 USA
[email protected]
Abstract. The energy function of a continuous-time neural network is analyzed to establish the existence of stationary points and the global convergence of the network. The energy function always has only one stationary point, which is a saddle point in the unconstrained space, when the total conductance of each neuron's input is zero (G_i = 0). However, the stationary points exist only inside the hypercube [0, 1]^n when the total conductance of the neuron's input is not zero (G_i \neq 0). The Hessian matrix of the energy function is used to test the global convergence of the network.
1 Introduction

Single-layer feedback neural networks, also called gradient-type networks, have received widespread attention in the technical literature [1-4]. Most of the published results, however, deal with discrete-time, discrete-output networks. Discrete-time networks represent a limiting case of continuous-time networks, which involve a continuous activation function and outputs that change continuously rather than discretely. Some of the postulates regarding both the convergence and performance of discrete-time networks remain valid for their continuous-time counterparts, but there are also numerous differing aspects between the two types of networks. In this paper, the energy function of continuous-time neural networks is analyzed to prove the existence of stationary points and the convergence of the networks.
2 Description of Continuous-Time Actual Neural Networks

The Hopfield neural network consists of n neurons, each mapping the input voltage u_i of the i-th neuron into the output voltage v_i through the activation function f(u_i). A conductance W_ij connects the output of the j-th neuron to the input of the i-th neuron. The
postulated network is symmetric, i.e. w_ij = w_ji. Also, since it has been assumed that w_ii = 0, the outputs of neurons are not connected to their own inputs. The input conductance and capacitance of the i-th neuron are of nonzero value and equal to g_i and C_i, respectively. Applying KCL (Kirchhoff's Current Law) at the input node having potential u_i gives the equilibrium equation
C_i \frac{du_i}{dt} = \sum_{j=1}^n W_{ij} v_j + i_i - G_i u_i, \quad i = 1, \ldots, n.     (1)
Denoting the total conductance connected to input node i as G_i, we have

G_i = g_i + \sum_{j=1}^n W_{ij}.     (2)
The most common choice is the so-called sigmoid function with a finite value of gain \lambda:

v_i = f(u_i) = \frac{1}{1 + \exp(-\lambda u_i)}.     (3)
Hopfield (1984) [3] introduced the postulated energy function for his neural network as follows:

E(v) = -\frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n W_{ij} v_i v_j - \sum_{i=1}^n i_i v_i + \sum_{i=1}^n G_i \int_{0.5}^{v_i} f_i^{-1}(z)\, dz.     (4)
The negative derivative of the energy function for symmetric W can be computed as

-\frac{\partial E}{\partial v_i} = \sum_{j=1}^n W_{ij} v_j + i_i - G_i u_i.     (5)
By comparing equation (5) with the right-hand side of equation (1), it can be shown that

-\frac{\partial E}{\partial v_i} = C_i \frac{du_i}{dt}.     (6)
Since W is symmetric and f(u) is a monotonically increasing function, it can easily be shown that the energy function E(v) is a Lyapunov-like function [1]:

\frac{dE}{dt} = \sum_{i=1}^n \frac{\partial E}{\partial v_i}\frac{dv_i}{dt} = -\sum_{i=1}^n C_i \frac{du_i}{dt}\frac{dv_i}{dt} = -\sum_{i=1}^n C_i \Big(\frac{du_i}{dt}\Big)^2 f'(u_i) \leq 0.     (7)
3 Energy Function Analysis of Continuous-Time Neural Networks

The Hopfield-type neural network converges by following a path that decreases the Lyapunov-like energy function. By analyzing this energy function, the existence of stationary points is proved and the global convergence of the system is tested.

3.1 Existence of Stationary Points
The continuous-time neural network has a neuron transfer sigmoid gain \lambda of finite value, whereas the sigmoid gain is infinite for the discrete-time neural network. The energy function is analyzed here to prove the existence of stationary points; it is shown that the total conductance of a neuron's input node is the major factor determining their existence.

Proposition 1. If the neuron transfer sigmoid gain \lambda is finite, then the following statements are true.
(1) If the total conductance of the i-th neuron's input is zero (G_i = g_i + \sum_{j=1}^n W_{ij} = 0), then there exists one stationary point v^{st} of the energy function in the space (-\infty, \infty)^n.

(2) If the total conductance of the i-th neuron's input is not zero (G_i \neq 0), then there can exist more than one stationary point v^{st}, and all stationary points are inside the space (0, 1)^n.

Proof (1). The stationary points of the postulated energy function E(v) in equation (4) satisfy
\nabla E(v) = Wv + i - Gu = 0,     (8)

where

u_i = f^{-1}(v_i) = \frac{1}{\lambda}\ln\Big(\frac{v_i}{1 - v_i}\Big).     (9)
Using equation (9), the third term of equation (8) can be written as follows:

\frac{G}{\lambda}\ln\Big(\frac{v}{1 - v}\Big) = Wv + i.     (10)
If the i-th neuron's total conductance is zero (G_i = 0), then equation (10) reduces to

Wv = -i.     (11)
It is well known that if W is symmetric and w_{ii} = 0, the inverse of W exists [10]. Therefore, a single stationary point exists in the space (-\infty, \infty)^n and is given as follows:

v^{st} = -W^{-1} i.     (12)
Proof (2). Let us define the left-hand side of equation (10) as

S(v_i) = \frac{G_i}{\lambda}\ln\Big(\frac{v_i}{1 - v_i}\Big), \quad i = 1, 2, \ldots, n.     (13)
If Gi ≠ 0 and λ > 0 , then S (vi ) is differentiable in vi ∈ (0,1) and if Gi > 0 , S (vi ) is strictly increasing, therefore,
∀vix1 , vix 2 ∈ (0,1), vix1 > vix 2 ⇒ S (vix1 ) ≥ S (vix 2 ), and, lim + S (vi ) = −∞, v j →0
lim S (vi ) = ∞ .
v j →1−
If Gi < 0 , S (vi ) is strictly decreasing, therefore,
∀vix1 , vix 2 ∈ (0,1), vix1 > vix 2 and
lim S (vi ) = ∞,
V j → 0+
⇒ S (vix1 ) ≤ S (vix 2 )
lim S (vi ) = −∞
V j →1−
In other words, S (vi ) exists only in the space R n ∈ (0,1) and its value changes to positive infinite or negative infinite near the boundary of the hypercube R n ∈ (0,1) . However, the right hand side of equation (10) is a finite value in the space Rn ∈ (0,1) .Therefore, the solution of equation (10) always exists in the space R n ∈ (0,1) for all i. This means that the stationary points of the energy function with n Gi ≠ 0 always exist inside the hypercube R ∈ (0,1) .
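The per-component existence argument can be made concrete: since S sweeps all of (−∞, ∞) as v runs over (0, 1), the scalar equation S(v) = c has a solution for any finite right-hand side c. The following sketch (our own illustration; parameter values are ours) locates it by bisection.

```python
import math

def S(v, G=0.5, lam=2.0):
    # Left-hand side of equation (10)/(13) for one component.
    return (G / lam) * math.log(v / (1.0 - v))

def solve_S(c, G=0.5, lam=2.0, tol=1e-12):
    """Bisection for S(v) = c on (0, 1).  A root always exists for G > 0
    because S is strictly increasing from -inf to +inf on (0, 1)."""
    lo, hi = 1e-15, 1.0 - 1e-15
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if S(mid, G, lam) < c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Whatever finite value the right-hand side of (10) takes, the solution stays strictly inside (0, 1), which is exactly the claim of Proposition 1(2).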
Case Study

The 2-bit A/D converter is selected as a case study. The energy function of the 2-bit A/D converter, including weights and bias currents, can be expressed as follows [3], where x is the analog input:
E(v) = −(1/2) [v₁ v₂] [[0, −2], [−2, 0]] [v₁, v₂]ᵀ − [v₁ v₂] [x − 1/2, 2x − 2]ᵀ + G₁ ∫_{0.5}^{v₁} f₁⁻¹(z) dz + G₂ ∫_{0.5}^{v₂} f₂⁻¹(z) dz    (14)
M.-J. Kang et al.
If G₁ = G₂ = 0, the stationary points of E can be obtained from equation (12) as:

[v₁, v₂]ᵀ = − [[0, −0.5], [−0.5, 0]] [x − 1/2, 2x − 2]ᵀ    (15)
Also, in the case G_i ≠ 0, the stationary points of E are obtained from equation (10) as:

[ (G₁/λ) ln( v₁/(1 − v₁) ),  (G₂/λ) ln( v₂/(1 − v₂) ) ]ᵀ = [[0, −2], [−2, 0]] [v₁, v₂]ᵀ + [x − 1/2, 2x − 2]ᵀ    (16)
Fig. 1a shows the stationary points of E for the analog inputs x = 0, 1, 2, 3 in the case G₁ = G₂ = 0. It shows that the stationary points can lie outside the space R² ∈ (0,1). In the case G_i ≠ 0, however, all stationary points lie inside the two-dimensional space R² ∈ (0,1), as shown in Fig. 1b. The stationary points for the analog inputs x = 0, 1, 2, 3 with G_i ≠ 0 were obtained graphically from equation (16), as shown in Fig. 1b.
Fig. 1. The locations of the stationary points of E for the analog inputs x = 0, 1, 2, 3 and λ = 2, in the cases of (a) G₁ = G₂ = 0, (b) G₁ = G₂ = 0.5
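The interior stationary points of equation (16) can also be found by simulating the network dynamics themselves. The following sketch is our own illustration (standard Hopfield-type dynamics du/dt = −Gu + Wv + i with v = 1/(1 + e^{−λu}) are assumed; the bias vector [x − 1/2, 2x − 2], the step size, and the initial state are our choices), for x = 1.6, λ = 2, G = 0.5.

```python
import numpy as np

# Hedged sketch: Euler-integrate assumed Hopfield dynamics whose equilibria
# satisfy equation (16); the trajectory settles inside (0,1)^2.
W = np.array([[0.0, -2.0], [-2.0, 0.0]])
x, lam, G = 1.6, 2.0, 0.5
i = np.array([x - 0.5, 2.0 * x - 2.0])

def sigma(u):
    return 1.0 / (1.0 + np.exp(-lam * u))

u = np.array([-0.4, 0.4])      # start near v = (0.31, 0.69)
h = 0.05
for _ in range(4000):
    v = sigma(u)
    u = u + h * (-G * u + W @ v + i)

v = sigma(u)
# Residual of equation (16): zero exactly at a stationary point.
residual = (G / lam) * np.log(v / (1.0 - v)) - (W @ v + i)
```

From this initial state the trajectory converges to the corner minimum with v₁ small and v₂ large, one of the three stationary points of the case study, and the residual of (16) vanishes there.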
3.2 Global Convergence
The continuous-time neural network has a sigmoid gain λ of finite value, whereas the discrete-time network has an infinite sigmoid gain. The energy function with a finite sigmoid gain has been analyzed to prove the system's global convergence. The Hessian matrix of the energy function is tested to classify the stationary points: the Hessian matrix is positive definite, negative definite, or indefinite according as the stationary point is a minimum, a maximum, or a saddle point, respectively.

Proposition 2. If the sigmoid gain λ is finite, then the following statements are true.
(1) If the total conductance is positive ( G_i > 0 ), then the stationary points near the boundary of the hypercube R^n ∈ (0,1) are minima.
(2) If the total conductance is zero ( G_i = 0 ), then the stationary point is a saddle point.
(3) If the total conductance is negative ( G_i < 0 ), then the stationary points near the boundary of the hypercube R^n ∈ (0,1) are maxima.

Proof (1). The Hessian matrix of the energy function must be positive definite if a stationary point near the boundary of the hypercube R^n ∈ (0,1) is a minimum. Using the chain rule, the Hessian matrix of the energy function can be derived as:
H = ∇²E(v) = −W + G ∂u/∂v    (17)

Using the derivative of the neuron's activation function, equation (17) becomes:

H = ∇²E(v) = −W + G · 1/( λ(v − v²) )    (18)
The Hessian matrix (18) can be written out in matrix form as:

∇²E(v) = [ G₁/(λ(v₁ − v₁²))   −w₁₂   ⋯   −w₁ₙ ;
           −w₂₁   G₂/(λ(v₂ − v₂²))   ⋯   −w₂ₙ ;
           ⋮        ⋮        ⋱       ⋮ ;
           −wₙ₁   −wₙ₂   ⋯   Gₙ/(λ(vₙ − vₙ²)) ]    (19)

By Gerschgorin's circle theorem [9], every eigenvalue λ_i of the Hessian satisfies, for some i ∈ {1, ..., n},

G_i/(λ(v_i − v_i²)) − Σ_{j=1,j≠i}^n |w_ij|  ≤  λ_i  ≤  G_i/(λ(v_i − v_i²)) + Σ_{j=1,j≠i}^n |w_ij|    (20)
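The bound (20) is easy to check numerically. The following sketch (our own check, not from the paper; the test point v is an arbitrary choice of ours) builds the Hessian (19) for the 2-bit A/D example and verifies that all eigenvalues lie between the smallest and largest Gerschgorin endpoints.

```python
import numpy as np

# Hedged sketch: Gerschgorin endpoints vs. exact eigenvalues of the Hessian (19).
W = np.array([[0.0, -2.0], [-2.0, 0.0]])
lam = 2.0
G = np.array([0.5, 0.5])

def hessian(v):
    return -W + np.diag(G / (lam * (v - v**2)))

v = np.array([0.2, 0.7])                # arbitrary interior test point (ours)
H = hessian(v)
eigs = np.linalg.eigvalsh(H)            # ascending eigenvalues
diag = np.diag(H)
radii = np.sum(np.abs(H), axis=1) - np.abs(diag)
lower = np.min(diag - radii)            # leftmost Gerschgorin endpoint
upper = np.max(diag + radii)            # rightmost Gerschgorin endpoint
```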
Let us now define the diagonal element of the Hessian matrix as:

D(v_i) = G_i / ( λ(v_i − v_i²) )    (21)
If the total conductance is positive ( G_i > 0 ), then D(v_i) tends to positive infinity near the boundary of the hypercube R^n ∈ (0,1); that is, for all i, lim_{v_i→0⁺} D(v_i) = ∞ and lim_{v_i→1⁻} D(v_i) = ∞. Therefore, by (20), all eigenvalues of the Hessian matrix become positive near the boundary of the hypercube R^n ∈ (0,1).

Proof (2). Since the second term on the right-hand side of equation (18) vanishes if G_i = 0, the Hessian matrix of the energy function reduces to

H = ∇²E(v) = −W    (22)

where W = Wᵀ and w_ii = 0, i = 1, 2, ..., n.
Since the sum of all eigenvalues of a matrix equals its trace [9], the trace of the Hessian matrix can be expressed as:

Tr(H) = − Σ_{i=1}^n w_ii = Σ_{i=1}^n λ_i    (23)

where λ_i are the eigenvalues of the Hessian matrix. Since all diagonal elements of the Hessian matrix are zero ( w_ii = 0, i = 1, 2, ..., n ), equation (23) reduces to

Tr(H) = Σ_{i=1}^n λ_i = 0    (24)
This means that either all eigenvalues are zero, or some eigenvalues are negative and the others positive so that the sum vanishes. If all eigenvalues of the symmetric matrix H = −W were zero, then H would be the zero matrix, i.e., W = 0, a trivial network with no connections; this case is excluded. Therefore, since the eigenvalues sum to zero, some are negative and the others positive, which proves that the Hessian matrix is indefinite. Hence the stationary point of the energy function is a saddle point when the total conductance is zero ( G_i = 0 ).

Proof (3). If the total conductance is negative ( G_i < 0 ), then D(v_i) in equation (21) tends to negative infinity near the boundary of the hypercube R^n ∈ (0,1); that is, for all i, lim_{v_i→0⁺} D(v_i) = −∞ and lim_{v_i→1⁻} D(v_i) = −∞. Therefore, all eigenvalues of the Hessian matrix become negative near the boundary of the hypercube R^n ∈ (0,1). This means that the Hessian matrix becomes negative definite near the boundary of the hypercube.
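The indefiniteness argument of Proof (2) can be verified directly for the 2-bit A/D weights. The following sketch is our own check (the weight matrix follows the case study):

```python
import numpy as np

# For G_i = 0 the Hessian is H = -W (equation (22)); for the A/D weights
# its trace is zero and its eigenvalues have mixed signs, so H is
# indefinite and the stationary point is a saddle.
W = np.array([[0.0, -2.0], [-2.0, 0.0]])
H = -W
eigs = np.linalg.eigvalsh(H)    # ascending order
```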
Case Study

Fig. 2a shows the energy map of the 2-bit A/D converter for x = 1.6, λ = 2, G₁ = G₂ = 0.5. There are three stationary points, at (0.0288, 0.9897), (0.6005, 0.499), and (0.9827, 0.0447). The stationary points near the corners are minima and indeed lie inside the two-dimensional space R² ∈ (0,1). The eigenvalues at the minimum (0.0288, 0.9897) are 8.6854 and 24.777; the eigenvalues at the other minimum (0.9827, 0.0447) are 15.1362 and 5.4236. Fig. 2b shows the energy map for x = 1.6, λ = 2, G₁ = G₂ = 0. There is only one stationary point, at (0.6, 0.55), and the eigenvalues at this point are −2 and 2. The Hessian matrix in this case is therefore indefinite, as expected, and Fig. 2b confirms that this stationary point is a saddle point. The energy function for x = 1.6, λ = 2, G₁ = G₂ = −0.5 has three stationary points, as shown in Fig. 2c. As expected, two maxima occur at (0.0130, 0.0091) and (0.9599, 0.9468), near the boundary of the two-dimensional space R² ∈ (0,1). The eigenvalues are −19.0412 and −28.1855 at (0.0130, 0.0091), and −7.8707 and −3.5875 at (0.9599, 0.9468).
Fig. 2. The energy map for the analog input x = 1.6 and λ = 2, in the cases of (a) G₁ = G₂ = 0.5, (b) G₁ = G₂ = 0, (c) G₁ = G₂ = −0.5
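The eigenvalues quoted in the case study can be reproduced from the Hessian (18). The following sketch is our own check (variable names ours; the stationary point and parameters are taken from the text above):

```python
import numpy as np

# Hessian (18) at the reported minimum (0.0288, 0.9897), x = 1.6,
# lam = 2, G1 = G2 = 0.5; its eigenvalues should match 8.6854 and 24.777.
W = np.array([[0.0, -2.0], [-2.0, 0.0]])
lam = 2.0
G = np.array([0.5, 0.5])
v = np.array([0.0288, 0.9897])
H = -W + np.diag(G / (lam * (v - v**2)))
eigs = np.linalg.eigvalsh(H)    # ascending order
```

Both eigenvalues are positive, confirming that this stationary point is a minimum.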
4 Conclusion

The continuous-time neural network has a sigmoid gain of finite value, and the neuron outputs change continuously. The discrete-time network is the limiting case of the continuous-time network, because the sigmoid gain of the discrete-time network is infinite. In this paper, using the energy function of the continuous-time neural network, the existence of stationary points has been proved and the global convergence of the network has been analyzed. The stationary points can exist anywhere in the space when the total conductance of a neuron's input is zero ( G_i = 0 ); however, the stationary points can exist only inside the hypercube R^n ∈ (0,1) when the total conductance is not zero ( G_i ≠ 0 ). The Hessian matrix of the energy function is used for testing the global convergence of the
network. The Hessian matrix of the energy function is indefinite when the total conductance of the neuron's input is zero ( G_i = 0 ); the stationary point is therefore a saddle point, and the continuous-time neural network converges to the constrained minimum. When the total conductance is positive ( G_i > 0 ), the Hessian matrix is positive definite near the boundary of the hypercube R^n ∈ (0,1), so that the stationary points there are minima. However, no minima exist near the boundary of the hypercube R^n ∈ (0,1) when G_i < 0, because the Hessian matrix is negative definite there.
References
1. Hopfield, J.J., Tank, D.W.: Neural Computation of Decisions in Optimization Problems. Biological Cybernetics 52 (1985) 141-152
2. Hopfield, J.J., Tank, D.W.: Computing with Neural Circuits: A Model. Science 233 (1986) 625-633
3. Friedberg, S., Insel, A.: Introduction to Linear Algebra with Application. 1st edn. Prentice-Hall, Englewood Cliffs, N.J. (1986)
4. Michel, A.N., Farrel, J.A., Porod, W.: Qualitative Analysis of Neural Networks. IEEE Trans. Circuits Systems 36 (1989) 229-243
5. Zurada, J.M.: Introduction to Artificial Neural Systems. 1st edn. West, St. Paul, MN (1992)
6. Peng, M., Gupta, N., Armitage, A.F.: An Investigation into the Improvement of Local Minima of the Hopfield Network. Neural Networks 9(7) (1996) 1241-1253
7. Golub, G.H., Van Loan, C.F.: Matrix Computations. 3rd edn. Johns Hopkins Univ. Press (1996)
8. Chen, T., Amari, S.I.: Stability of Asymmetric Hopfield Networks. IEEE Trans. Neural Networks 12(1) (2001) 159-163
Global Convergence of Continuous-Time Recurrent Neural Networks with Delays

Weirui Zhao¹,² and Huanshui Zhang¹

¹ Shenzhen Graduate School, Harbin Institute of Technology, D424, HIT Campus of Shenzhen University Town, Xili, Shenzhen, 518055, P.R. China
² Department of Mathematics, Wuhan University of Technology, 122 Luoshi Road, Wuhan, Hubei, 430070, P.R. China
[email protected]
Abstract. We present new global convergence results for neural networks with delays, obtained via the theory of monotone dynamical systems, and show that these results partially generalize recently published convergence results. We also show that, under certain conditions, reversing the directions of the coupling between neurons preserves the global convergence of the neural networks.
1 Introduction
Recently, there has been much activity in studying the global stability of neural networks with delays, and many stability criteria have been proposed [1]-[9]. In this paper, we present a new criterion for the global convergence of neural networks with delays. We show that this criterion is more explicit and more easily verified than those presented in the past. Our criterion is proved using a different strategy from the recent literature: in existing papers, the most general case is proved by incorporating all the generalities into a Lyapunov functional, resulting in a complicated functional. Instead, we deduce the global convergence of neural networks with delays from the theory of monotone dynamical systems.
2 Global Convergence of Neural Networks with Delays
Consider a neural network with delays described by the state equations:

dx_i(t)/dt = −d_i x_i(t) + Σ_{j=1}^n a_ij g_j(x_j(t)) + Σ_{j=1}^n b_ij g_j(x_j(t − τ_j)) + u_i,    (1)

where n denotes the number of neurons in the network; x_i(t) is the state of the i-th neuron at time t; g_j(x_j(t)) denotes the activation function of the j-th neuron at time t; the constant weight coefficient a_ij denotes the connection of the j-th neuron to the i-th neuron at time t; the constant weight coefficient b_ij denotes the connection of the j-th neuron to the i-th neuron at time t − τ_j; τ_j

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 109–114, 2006. © Springer-Verlag Berlin Heidelberg 2006
is the transmission delay along the axon of the j-th unit and is a nonnegative constant; u_i is the external bias on the i-th neuron; d_i is the positive rate with which the i-th unit resets its potential to the resting state in isolation when disconnected from the network and the external inputs u_i.

The activation function g_j(·) is continuously differentiable and monotonically nondecreasing with g_j(0) = 0. In our analysis, we employ the following class of activation functions:

|g_j(x)| < k_j,   g_j′(x) > 0,   for all j,

where 0 ≤ k_j < +∞. The sigmoid activation functions used in the Hopfield networks [10] are typical representatives of this class of bounded functions.

An initial condition of system (1) is constituted by the history of the activation of each neuron i during a time interval corresponding to the delay τ_i. Initial conditions for (1) are of the form φ = (φ₁, φ₂, ..., φₙ), where each φ_i is a continuous function defined on the interval [−τ_i, 0] of length equal to the delay τ_i. Thus the phase space for (1) is the product space C_τ = Π_{i=1}^n C([−τ_i, 0], R), where C(I, R) denotes the space of continuous functions from the interval I into the reals R. C_τ is a Banach space with the norm |φ| = Σ_{i=1}^n |φ_i|. For φ = (φ₁, φ₂, ..., φₙ) in C_τ, there exists a unique solution of system (1), denoted x(t, φ) = (x₁(t, φ), x₂(t, φ), ..., xₙ(t, φ)), such that x_i(t, φ) = φ_i(t) for −τ_i ≤ t ≤ 0 and x(t, φ) satisfies system (1) for t ≥ 0. We denote by x_t the (semi)flow associated with system (1); that is, x_t ∈ C_τ with x_t = (x_{1t}, x_{2t}, ..., x_{nt}), where x_{it} = x_i(t + θ) for θ ∈ [−τ_i, 0]. To simplify the notation, the dependence on the initial condition φ will not be indicated unless necessary.

Definition 1. A square matrix A = (a_ij) is a quasipositive matrix if it is nonnegative except perhaps on its main diagonal, i.e., a_ij ≥ 0 for i ≠ j. A square matrix A = (a_ij) is a nonnegative matrix if a_ij ≥ 0 for all i, j.

Definition 2. An n × n square matrix A is an irreducible matrix if for every nonempty, proper subset I of the set N = {1, 2, ..., n}, there is an i ∈ I and a j ∈ J ≡ N \ I such that a_ij ≠ 0.

Thanks to the monotonicity of the system, it is also possible to draw a picture of the global behavior of the trajectories in the phase space. We first verify that system (1) satisfies a boundedness condition.

Lemma 1. (i) For φ = (φ₁, φ₂, ..., φₙ), define F(φ) = (F₁(φ), F₂(φ), ..., Fₙ(φ))ᵀ, where

F_i(φ) = −d_i φ_i(0) + Σ_{j=1}^n a_ij g_j(φ_j(0)) + Σ_{j=1}^n b_ij g_j(φ_j(−τ_j)) + u_i,   i = 1, 2, ..., n;

then F maps bounded subsets of C_τ to bounded subsets of R^n.
(ii) There is a bounded subset D of C_τ such that for all φ in C_τ, there is T > 0 such that x_t(φ) ∈ D for all t > T.
Proof. (i) Let

f_i(x₁, ..., xₙ, y₁, ..., yₙ) = −d_i x_i + Σ_{j=1}^n a_ij g_j(x_j) + Σ_{j=1}^n b_ij g_j(y_j).

The first result of this lemma stems from the fact that each f_i maps bounded subsets of R^{2n} to bounded subsets of R.

(ii) For φ = (φ₁, ..., φₙ) in C_τ, let x(t) = (x₁(t), ..., xₙ(t)) be the solution of system (1) satisfying x_{t₀} = φ. First, we prove that there exists t_i > t₀ with x_i(t_i) < γ_i ≡ (1/d_i)[ Σ_{j=1}^n (|a_ij| + |b_ij|) k_j + |u_i| + 1 ]. Otherwise x_i(t) ≥ γ_i for all t ≥ t₀. Therefore,

ẋ_i(t) = −d_i x_i(t) + Σ_{j=1}^n a_ij g_j(x_j(t)) + Σ_{j=1}^n b_ij g_j(x_j(t − τ_j)) + u_i
       ≤ −d_i γ_i + Σ_{j=1}^n |a_ij| |g_j(x_j(t))| + Σ_{j=1}^n |b_ij| |g_j(x_j(t − τ_j))| + |u_i|
       < −d_i γ_i + [ Σ_{j=1}^n (|a_ij| + |b_ij|) k_j + |u_i| ] = −1.

Then x_i(t) = x_i(t₀) + ∫_{t₀}^t ẋ_i(s) ds < x_i(t₀) − (t − t₀) for t > t₀. Thus x_i(t) < γ_i for t > t₀ + x_i(t₀) − γ_i, which contradicts x_i(t) ≥ γ_i for all t ≥ t₀. Thus there exists t_i > t₀ with x_i(t_i) < γ_i.

It follows that x_i(t) < γ_i for all t > t_i. Otherwise there exists s_i > t_i with x_i(s_i) > γ_i; then there exists r_i ∈ (t_i, s_i) with x_i(r_i) = γ_i and ẋ_i(r_i) ≥ 0. On the other hand, ẋ_i(r_i) = −d_i γ_i + Σ_{j=1}^n a_ij g_j(x_j(r_i)) + Σ_{j=1}^n b_ij g_j(x_j(r_i − τ_j)) + u_i < 0. This is a contradiction. The case that there exists T_i > t_i such that x_i(t) > −γ_i for all t > T_i is analogous. Hence |x_i(t)| < γ_i for all t > T = max{T₁, T₂, ..., Tₙ}. Let D ≡ { φ ∈ C_τ : |φ| ≤ Σ_{i=1}^n γ_i }; then x_t(φ) ∈ D for all t > T + max_{1≤i≤n} τ_i.
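The ultimate bound γ_i of Lemma 1 can be observed numerically. The following sketch is our own illustration (the scalar parameter values, step size, and constant initial history are our choices, not from the paper): a delayed scalar network with g = tanh (so k = 1) is integrated by Euler's method with a delay buffer, and the trajectory enters |x| < γ.

```python
import math

# Scalar instance of system (1): dx/dt = -d x + a g(x(t)) + b g(x(t - tau)) + u.
d, a, b, u, tau = 1.0, 0.5, 0.5, 1.0, 1.0
k = 1.0                                   # |tanh| < 1
gamma = ((abs(a) + abs(b)) * k + abs(u) + 1.0) / d   # Lemma 1 bound: 3.0

h = 0.01
delay_steps = int(tau / h)
hist = [10.0] * (delay_steps + 1)         # constant initial history phi = 10
for _ in range(int(50.0 / h)):
    x_now = hist[-1]
    x_del = hist[-1 - delay_steps]
    hist.append(x_now + h * (-d * x_now + a * math.tanh(x_now)
                             + b * math.tanh(x_del) + u))

x_final = hist[-1]
```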
Using the above lemma, we can establish the global convergence of the solutions of (1).

Theorem 1. Let A = (a_ij)_{n×n} be irreducible and quasipositive, and let B = (b_ij)_{n×n} be nonnegative such that for every j for which τ_j > 0, there exists i with b_ij ≠ 0. Then the set of convergent points in C_τ contains an open and dense subset. If system (1) has a unique equilibrium point, then all solutions of (1) converge to it.

Proof. For all (x₁, ..., xₙ, y₁, ..., yₙ), under the theorem's assumptions, we have

∂f_i/∂x_j (x₁, ..., xₙ, y₁, ..., yₙ) = a_ij g_j′(x_j) ≥ 0,   ∀ i ≠ j,
and
∂f_i/∂y_j (x₁, ..., xₙ, y₁, ..., yₙ) = b_ij g_j′(y_j) ≥ 0,   ∀ i, j.
Thus system (1) is a cooperative system [13]. Obviously,

∂f_i/∂x_j + ∂f_i/∂y_j = −d_i + a_ii g_i′(x_i) + b_ii g_i′(y_i)   for i = j;
∂f_i/∂x_j + ∂f_i/∂y_j = a_ij g_j′(x_j) + b_ij g_j′(y_j)   for i ≠ j.

Because A is irreducible and quasipositive, for every nonempty, proper subset I of the set N = {1, 2, ..., n}, there is an i ∈ I and a j ∈ J ≡ N \ I such that a_ij > 0. Thus a_ij g_j′(x_j) + b_ij g_j′(y_j) > 0, i.e., the Jacobian matrix ∂f/∂x + ∂f/∂y is irreducible for every φ ∈ C_τ. For each j with τ_j > 0 and each (x₁, ..., xₙ, y₁, ..., yₙ), there exists i such that ∂f_i/∂y_j = b_ij g_j′(y_j) > 0, because b_ij > 0 and g_j′(y_j) > 0. Therefore system (1) is irreducible. The fact that system (1) is cooperative and irreducible, together with the previous lemma, implies that the set of convergent points in C_τ contains an open and dense subset [Theorem 4.1 (p. 90) of [13]]. If system (1) has a unique equilibrium point, then all solutions converge to it [Theorem 3.1 (p. 18) of [13]]. This completes the proof of the theorem.

Note that Aᵀ is irreducible and quasipositive if A is irreducible and quasipositive, and Bᵀ is nonnegative if B is nonnegative. This implies the following proposition:

Proposition 1. Let A = (a_ij)_{n×n} be irreducible and quasipositive, and let B = (b_ij)_{n×n} be nonnegative such that for every j for which τ_j > 0, there exists i with b_ij ≠ 0. Then, for each of the following four state equations, the set of convergent points in C_τ contains an open and dense subset, and if the equilibrium of the system is unique, then all solutions converge to it:
= −di xi (t) +
dxi (t) dt
= −di xi (t) +
dxi (t) dt
= −di xi (t) +
dxi (t) dt
= −di xi (t) +
n j=1 n j=1 n j=1 n j=1
aij gj (xj (t)) + aji gj (xj (t)) + aij gj (xj (t)) + aji gj (xj (t)) +
n j=1 n j=1 n j=1 n
bij gj (xj (t − τj )) + ui ; bij gj (xj (t − τj )) + ui ; bji gj (xj (t − τj )) + ui ; bji gj (xj (t − τj )) + ui .
j=1
The above result gives conditions under which such reversals do not change the global convergence of the neural network. As a special case, we consider the following one-dimensional system [1]:

ẋ(t) = −d x(t) + a g(x(t)) + b g(x(t − τ)) + I,    (2)

where g(x) = tanh(x) and d > 0. Since all conditions of the theorem are trivially satisfied for a scalar equation except the cooperativity of system (2), and system (2) is obviously cooperative when b > 0, we have
Proposition 2. Suppose b > 0 and the equilibrium of system (2) is unique; then all solutions of system (2) converge to it.

We also have the following corollary:

Corollary 1. Suppose b > 0 and one of the following conditions holds:
(i) a + b ≤ d;
(ii) a + b > d and ( −d ln( √((a+b)/d) − √((a+b−d)/d) ) − (a+b) √((a+b−d)/(a+b)) + I ) I > 0;
(iii) a + b > d and ( −d ln( √((a+b)/d) + √((a+b−d)/d) ) + (a+b) √((a+b−d)/(a+b)) + I ) I > 0.
Then all solutions of system (2) converge to its unique equilibrium.

Proof. Let H(u) = −du + (a+b)g(u) + I. If a + b ≤ d, then H′(u) = −d + (a+b)g′(u) < −d + a + b ≤ 0 (since g′(u) < 1), so the equilibrium point is unique. Otherwise, let

u₁ = ln( √((a+b)/d) − √((a+b−d)/d) )   and   u₂ = ln( √((a+b)/d) + √((a+b−d)/d) );

then H′(u) < 0 for u ∈ (−∞, u₁) ∪ (u₂, +∞), H′(u) > 0 for u ∈ (u₁, u₂), and condition (ii) or (iii) gives ( −d u₁ + (a+b)g(u₁) + I ) I > 0 or ( −d u₂ + (a+b)g(u₂) + I ) I > 0, respectively, so the equilibrium point is again unique. The result of this corollary therefore follows from the above proposition.
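The uniqueness check of Corollary 1 can be carried out numerically for the example H(u) = −u + 2 tanh(u) + 5 used in the next section (a = b = d = 1, I = 5). The following sketch is our own illustration; the bisection bracket is our choice.

```python
import math

def H(u):
    # H(u) = -d*u + (a+b)*g(u) + I with a = b = d = 1, g = tanh, I = 5.
    return -u + 2.0 * math.tanh(u) + 5.0

# Condition (ii): H(u1)*I > 0 with u1 = ln(sqrt(2) - 1), I = 5 > 0.
H_u1 = -math.log(math.sqrt(2.0) - 1.0) - math.sqrt(2.0) + 5.0

# Locate the unique equilibrium by bisection (H(0) = 5 > 0, H(10) < 0).
lo, hi = 0.0, 10.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if H(mid) > 0:
        lo = mid
    else:
        hi = mid
u_star = 0.5 * (lo + hi)
```

The root sits just below u = 7 (since u* = 5 + 2 tanh(u*) < 7), and H(u₁) > 0 confirms that it is the only equilibrium.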
3 Comparison with Previous Results
Several results on the global asymptotic stability of neural networks with delays have appeared in recent years [1]-[9], each one improving upon and generalizing previous results. The results in [6] and [7] are the most general among them, but there are two main differences between theirs and ours. First, the proofs in [6] and [7] require that F(x₁, ..., xₙ, x₁, ..., xₙ) be a homeomorphism of R^n onto itself; in our paper, we only require that the equilibrium of the system be unique. Second, it is not easy to select P, Q in [6] and P, K in [7] that directly satisfy the corresponding inequalities, whereas our conditions are easily verified except for the uniqueness of the equilibrium. Here we give an example for which Chen's method fails and ours succeeds. Consider the following one-dimensional system:

ẋ(t) = −x(t) + g(x(t)) + g(x(t − τ)) + 5,    (3)

where g(x) = tanh(x); thus A = 1, B = 1, D = 1, G = 1. First, for all P > 0, Q > 0, we have 2PDG⁻¹ − (PA + AᵀP) − PBQ⁻¹(PB)ᵀ − Q = −P²Q⁻¹ − Q < 0, so Chen's method fails to show the convergence of this equation. Second, for any factorization B = B₁B₂ and any positive numbers K and P, R = 2PD − PA − AᵀP − PB₁KB₁ᵀ − B₂ᵀK⁻¹B₂ = −PB₁²K − B₂²K⁻¹ < 0, so Huang's method also fails to show the convergence of this equation. In our setting, since A + B = 2 and B > 0, system (3) is cooperative and irreducible. Let H(u) = −u + 2g(u) + 5; then H(u₁) = −ln(√2 − 1) − √2 + 5 > 0. Thus system (3) has a unique equilibrium, and all solutions of system (3) converge to it.
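The convergence claimed for system (3) can be observed directly. The following sketch is our own illustration (Euler integration with a delay buffer; delay, step size, and the two constant initial histories are our choices): trajectories started from very different histories approach the same equilibrium, as Proposition 2 predicts.

```python
import math

# Euler simulation of system (3): dx/dt = -x + tanh(x(t)) + tanh(x(t - tau)) + 5.
tau, h, T = 1.0, 0.01, 60.0
delay_steps = int(tau / h)

def simulate(phi):
    """Integrate from the constant initial history x(s) = phi on [-tau, 0]."""
    hist = [phi] * (delay_steps + 1)
    for _ in range(int(T / h)):
        x_now, x_del = hist[-1], hist[-1 - delay_steps]
        hist.append(x_now + h * (-x_now + math.tanh(x_now)
                                 + math.tanh(x_del) + 5.0))
    return hist[-1]

x_a, x_b = simulate(-8.0), simulate(12.0)
```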
4 Conclusion
This letter has studied the global convergence of continuous-time recurrent neural networks with time delays using monotone dynamical systems theory. Several results were presented to characterize the global convergence. These criteria are more explicit and more easily verified than those presented in the past. We also showed that, under certain conditions, reversing the directions of the coupling between neurons preserves the global convergence of the neural networks. Some examples were given to demonstrate the difference between the new results and existing ones.
Acknowledgements This work was supported by the National Natural Science Foundation of China under Grant 60574016. The authors would like to thank the anonymous reviewers and the editor for their constructive comments.
References
[1] Gopalsamy, K., Leung, I.: Convergence under Dynamical Thresholds with Delays. IEEE Trans. Neural Networks 8(2) (1997) 341-348
[2] Pakdaman, K., Malta, C.P.: A Note on Convergence under Dynamical Thresholds with Delays. IEEE Trans. Neural Networks 9(1) (1998) 231-234
[3] Chen, T.: Convergence of Delayed Dynamical Systems. Neural Processing Letters 10(2) (1999) 267-271
[4] Chen, T.: Global Convergence of Delayed Dynamical Systems. IEEE Trans. Neural Networks 12(6) (2001) 1532-1536
[5] Chen, T.: Global Exponential Stability of Delayed Dynamical Systems. Neural Networks 14(8) (2001) 977-980
[6] Lu, W., Rong, L., Chen, T.: Global Convergence of Delayed Dynamical Systems. International Journal of Neural Systems 13(3) (2003) 1-12
[7] Huang, Y., Wu, C.: A Unifying Proof of Global Asymptotical Stability of Neural Networks with Delay. IEEE Trans. Circuits and Systems-II 52(4) (2005) 181-184
[8] Arik, S.: An Improved Global Stability Result for Delayed Cellular Neural Networks. IEEE Trans. Circuits and Systems-I 49(8) (2002) 1211-1214
[9] Arik, S.: Global Asymptotical Stability of a Larger Class of Delayed Neural Networks. In: Proc. of IEEE International Symposium on Circuits and Systems, Vol. 5, Thailand (2003) 721-724
[10] Hopfield, J.: Neurons with Graded Response Have Collective Computational Properties like Those of Two-State Neurons. In: Proc. National Academy of Sciences, Vol. 81, USA (1984) 3088-3092
[11] Hale, J. (ed.): Asymptotic Behavior of Dissipative Systems. Amer. Math. Soc., Providence, RI (1988)
[12] Hale, J., Verduyn Lunel, S.: Introduction to Functional Differential Equations. Springer, New York (1993)
[13] Smith, H.: Monotone Dynamical Systems: An Introduction to the Theory of Competitive and Cooperative Systems. Mathematical Surveys and Monographs, Vol. 41. American Mathematical Society, Providence (1995)
Global Exponential Stability in Lagrange Sense of Continuous-Time Recurrent Neural Networks

Xiaoxin Liao¹ and Zhigang Zeng²

¹ Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
xiaoxin [email protected]
² School of Automation, Wuhan University of Technology, Wuhan, Hubei, 430070, China
[email protected]
Abstract. In this paper, global exponential stability in the Lagrange sense is further studied for continuous-time recurrent neural networks with three different types of activation functions. Based on the parameters of the system itself, detailed estimates of the globally exponentially attractive set and the positive invariant set are presented without any hypothesis on their existence. It is also verified that outside the globally exponentially attractive set (i.e., within the global attraction domain) there is no equilibrium point, periodic solution, almost-periodic solution, or chaos attractor of the neural network. This theoretical analysis narrows the search field for optimization computation and associative memories, which is convenient for applications.
1 Introduction
Lyapunov stability is one of the important properties of dynamic systems. From a systems-theoretic point of view, the global stability of recurrent neural networks is a very interesting research issue because of the special nonlinear structure of recurrent neural networks. From a practical point of view, the global stability of recurrent neural networks is also very important because it is a prerequisite in many neural network applications such as optimization, control, and signal processing. In recent years, the Lyapunov stability of continuous-time recurrent neural networks has received much attention in many papers [1]-[9], [12]-[14] and monographs [10], [11]. However, stability in the Lagrange sense has hardly been studied for neural networks. In [15], we first studied the global dissipativity and global exponential dissipativity of neural networks, and we also provided estimates of the globally attractive set and the globally exponentially attractive set. On the basis of [15], global exponential stability in the Lagrange sense is further studied here for continuous-time recurrent neural networks with three different types of activation functions. According to the parameters of the system itself, detailed estimates of the globally exponentially attractive set and the positive invariant set are presented.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 115–121, 2006. © Springer-Verlag Berlin Heidelberg 2006
We consider that the study of global exponential stability in the Lagrange sense has at least the following meanings. First, one of the core applications of neural networks is optimization computation by finding the equilibrium points of the network. Once the globally exponentially attractive set is found, a rough boundary for the optimization solution is set, thus providing prior knowledge. Second, the study of periodic and almost-periodic solutions has always been an important topic of dynamical systems as well as an important research topic for neural networks. A necessary condition for the existence of periodic and almost-periodic solutions is that the system be ultimately bounded. Periodic and almost-periodic solutions must lie within the globally attractive set, which provides a narrower boundary for their existence. Third, it is generally considered that a continuous dynamical system can exhibit chaos when the system is dissipative (or ultimately bounded) on a large scale and has a positive Lyapunov exponent on a small scale. Recently, chaos has been found in neural networks. Therefore, it is very necessary to study stability in the Lagrange sense for neural networks. Fourth, global stability in the Lyapunov sense of a unique equilibrium point can be regarded as a special case of stability in the Lagrange sense, by regarding the equilibrium point as the attractive set. Hence, from both theory and application, it is significant to study the global stability of neural networks in the Lagrange sense.
2 Preliminaries
Consider a general recurrent neural network model with multiple time delays:

c_i dx_i(t)/dt = −d_i x_i(t) + Σ_{j=1}^n ( ã_ij g_j(x_j(t)) + b̃_ij g_j(x_j(t − τ_ij)) ) + I_i,    (1)

where i = 1, ..., n; c_i > 0 and d_i > 0 are positive parameters; x_i is the state variable of the i-th neuron; I_i is an input (bias); ã_ij, b̃_ij are connection weights from neuron j to neuron i; and g_i(·) is an activation function with g_i(0) = 0, continuous and monotonically nondecreasing. When b̃_ij = 0, we have

c_i dx_i(t)/dt = −d_i x_i(t) + Σ_{j=1}^n ã_ij g_j(x_j(t)) + I_i,   i = 1, 2, ..., n.    (2)
Next, we define three classes of activation functions.

(1) The set of sigmoidal activation functions is defined as

S := { g(x) | g(0) = 0, |g_i(x)| ≤ k_i, D⁺g_i(x) ≥ 0, i = 1, 2, ..., n }.    (3)

(2) The set of Lurie-type activation functions is defined as

L := { g(x) | 0 < x_i g_i(x_i) ≤ k_i x_i², x_i ≠ 0, ∀ x_i ∈ R, i = 1, 2, ..., n }.    (4)

It is obvious that activation functions satisfying a Lipschitz condition or a partial Lipschitz condition are typical representatives of the L-type; however, there also exist L-type functions that satisfy neither a Lipschitz condition nor a partial Lipschitz condition.

(3) The set of monotone nondecreasing activation functions is denoted as

G := { g(x) | g(x) ∈ C[R, R], D⁺g_i(x_i) ≥ 0, i = 1, 2, ..., n }.    (5)
Evidently, S ⊂ G. However, S is not a subset of L, and L is not a subset of S; likewise, G is not a subset of L, and L is not a subset of G. Let Ω ⊆ R^n be a compact set in R^n, and let Ω_ε ⊆ R^n be the ε-neighborhood of Ω. Denote by R^n\Ω the complement of Ω in R^n, and let ρ(x, Ω) := inf_{y∈Ω} ||x − y|| be the distance between x and Ω. Let C_H(t₀) be the set of continuous functions ψ : [t₀ − τ, t₀] → R^n with ||ψ||_{t₀} = sup_{s∈[t₀−τ,t₀]} |ψ(s)| ≤ H, where τ = max_{1≤i,j≤n}{τ_ij}. Denote by x(t; t₀, ψ) the solution of (1); that is, x(t; t₀, ψ) is continuous, satisfies (1), and x(s; t₀, ψ) = ψ(s) for s ∈ [t₀ − τ, t₀].

Definition 1. If there exists a compact set Ω ⊆ R^n such that ∀ x₀ ∈ R^n\Ω, lim_{t→+∞} ρ(x(t), Ω) = 0, then Ω is said to be a globally attractive set. A set Ω is called positive invariant if x₀ ∈ Ω implies x(t) ⊆ Ω for t ≥ t₀.

An equivalent definition of a globally attractive set is as follows:

Definition 2. If there exist a compact set Ω ⊆ R^n and T > 0 such that ∀ x₀ ∈ R^n\Ω, when t ≥ t₀ + T, x(t) ⊆ Ω_ε, then Ω is said to be a globally attractive set.

Definition 3. If there exist a compact set Ω ⊆ R^n and α > 0 such that ∀ x₀ ∈ R^n\Ω with x(t) ∈ R^n\Ω, ρ(x(t), Ω) ≤ exp{−α(t − t₀)} ρ(x(0), Ω), then Ω is said to be a globally exponentially attractive set of (1).

An equivalent definition of a globally exponentially attractive set is as follows:

Definition 4. If there exist a radially unbounded, positive definite Lyapunov function V and ℓ > 0, α > 0 such that when V(x₀) > ℓ and V(x(t)) > ℓ, (V(x(t)) − ℓ) ≤ exp{−α(t − t₀)}(V(x₀) − ℓ), then {x | V(x) ≤ ℓ} is said to be a globally exponentially attractive set of (1).

Definition 5. System (1) with a globally exponentially attractive set is said to be globally exponentially stable in the Lagrange sense, or globally exponentially dissipative, or ultimately bounded.
3
g(x) ∈ S (1)
Let |g_i(x_i)| ≤ k_i and define

M_i^(1) = Σ_{j=1}^n (|ã_ij| + |b̃_ij|) k_j + |I_i|,    M_i^(2) = Σ_{j=1}^n |a_ij| k_j + |I_i|,

Ω_1 = { x | Σ_{i=1}^n c_i x_i^2 / 2 ≤ ( Σ_{i=1}^n (M_i^(1))^2 / ε_i ) / ( 2 min_{1≤i≤n} (d_i − ε_i)/c_i ) },  where 0 < ε_i < d_i,
118
X. Liao and Z. Zeng
Ω_2 = { x | |x_i| ≤ 2 M_i^(1) / d_i, i = 1, 2, ..., n },

Ω_3 = { x | Σ_{i=1}^n c_i |x_i| ≤ ( Σ_{i=1}^n 2 M_i^(1) ) / min_{1≤j≤n} { d_j / c_j } },

Ω_4 = { x | Σ_{i=1}^n c_i x_i^2 / 2 ≤ ( Σ_{i=1}^n (M_i^(2))^2 / ε_i ) / ( 2 min_{1≤i≤n} (d_i − ε_i)/c_i ) },  where 0 < ε_i < d_i,

Ω_5 = { x | |x_i| ≤ 2 M_i^(2) / d_i, i = 1, 2, ..., n },

Ω_6 = { x | Σ_{i=1}^n c_i |x_i| ≤ ( Σ_{i=1}^n 2 M_i^(2) ) / min_{1≤j≤n} { 2 d_j / c_j } },

where

a_ij = ã_ii if i = j and ã_ii > 0;  a_ij = 0 if i = j and ã_ii ≤ 0;  a_ij = ã_ij if i ≠ j.
Theorem 1. If g(x) ∈ S, then Ω_i (i = 1, 2, 3) are globally exponentially attractive sets and positive invariant sets of (1). Hence Ω_123 = Ω_1 ∩ Ω_2 ∩ Ω_3 is a globally exponentially attractive set and positive invariant set of (1).

Theorem 2. If g(x) ∈ S, then Ω_i (i = 4, 5, 6) are globally exponentially attractive sets and positive invariant sets of (2). Hence Ω_456 = Ω_4 ∩ Ω_5 ∩ Ω_6 is a globally exponentially attractive set and positive invariant set of (2).

Remark 1. The globally attractive sets obtained from Theorems 1 and 2 may not be the tightest; better estimates may be obtained by other methods. However, what matters is not sharpening the size of the attractive set but deciding whether neural network (1) or (2) is globally exponentially stable in the Lagrange sense.
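To make the estimate concrete, the box-type bound of Ω_2 (and of Ω_5, by passing B = 0) can be evaluated directly from the network data. The following sketch is our own illustration with hypothetical parameters, assuming the definition M_i = Σ_j (|a_ij| + |b_ij|) k_j + |I_i| used above:

```python
import numpy as np

def box_radii(A, B, k, I, d):
    """Per-coordinate radii 2*M_i/d_i of the box-type attractive set,
    with M_i = sum_j (|a_ij| + |b_ij|) * k_j + |I_i| (pass B = 0 for the
    delay-free network)."""
    M = (np.abs(A) + np.abs(B)) @ k + np.abs(I)
    return 2 * M / d

# hypothetical 2-neuron data
A = np.array([[0.5, -0.3], [0.2, 0.4]])
B = np.array([[0.1, 0.0], [0.0, 0.1]])
k = np.array([1.0, 1.0])          # bounds |g_j| <= k_j
I = np.array([0.2, -0.1])
d = np.array([1.0, 2.0])
print(box_radii(A, B, k, I, d))   # per-coordinate bounds on |x_i|
```

Every trajectory eventually enters (and then remains in) the box |x_i| ≤ 2M_i/d_i computed this way.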
4 g(x) ∈ L
Theorem 3. If g(x) ∈ L and −d_j + Σ_{i=1}^n |ã_ij| k_j < 0 for j = 1, 2, ..., n, then Ω_7 is a globally exponentially attractive set and positive invariant set of (2), where

Ω_7 = { x | Σ_{i=1}^n c_i |x_i| ≤ ( Σ_{i=1}^n |I_i| ) / ( ξ_1 min_{1≤i≤n} d_i / c_i ) },

ξ_1 = sup{ ξ | −d_j (1 − ξ) + ( ã_jj + Σ_{i=1, i≠j}^n |ã_ij| ) k_j ≤ 0, j = 1, 2, ..., n, 0 < ξ < 1 }.
Theorem 4. If g(x) ∈ L and g_1(·) = g_2(·) = ... = g_n(·), then k = k_1 = k_2 = ... = k_n. If −d_j + ã_jj + Σ_{i=1, i≠j}^n |ã_ij| k < 0 and D^+ g(x) ≥ 0, then Ω_8 is a globally exponentially attractive set and positive invariant set of (2), where

Ω_8 = { x | Σ_{i=1}^n c_i |x_i| ≤ n max_{1≤i≤n} |I_i| max_{1≤i≤n} c_i / min_{1≤i≤n} { d_i − Σ_{j=1}^n |ã_ij| k } }.

5 g(x) ∈ G
Theorem 5. If g(x) ∈ G and Ã + Ã^T is negative definite, then Ω_9 is a globally exponentially attractive set and positive invariant set of (2), where

Ω_9 = { x | Σ_{i=1}^n c_i ∫_0^{x_i} g_i(s) ds ≤ ( Σ_{i=1}^n I_i^2 ) / ( 4 ε min_{1≤i≤n} d_i / c_i ) },

ε = sup{ ℓ | Ã + Ã^T + ℓ E_n is negative semidefinite, 0 < ℓ < 1 }.

Theorem 6. If g(x) ∈ G and Ã + Ã^T is negative semidefinite, then Ω_10 is a globally exponentially attractive set and positive invariant set of (2), where

Ω_10 = { x | Σ_{i=1}^n c_i ∫_0^{x_i} g_i(s) ds ≤ ℓ / min_{1≤i≤n} d_i / c_i },

ℓ = max_{|x_i| = 2|I_i|} Σ_{i=1}^n c_i ∫_0^{x_i} g_i(s) ds.

Theorem 7. If g(x) ∈ G and ã_jj + Σ_{i=1, i≠j}^n |ã_ij| ≤ 0, then Ω_11 is a globally exponentially attractive set and positive invariant set of (2), where

Ω_11 = { x | Σ_{i=1}^n c_i |x_i| ≤ ( Σ_{i=1}^n |I_i| ) / min_{1≤i≤n} d_i / c_i }.

Remark 2. By Theorems 1-7, outside the globally exponentially attractive set, i.e., within the global attraction domain, the neural network has no equilibrium point, periodic solution, almost-periodic solution, or chaotic attractor.
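Theorem 7's hypothesis and the resulting Ω_11 bound are straightforward to check numerically. This is an illustrative sketch with made-up parameters (`A_tilde`, `c`, `d`, `I` are our own, not an example from the paper):

```python
import numpy as np

def omega11_bound(A_tilde, c, d, I):
    """Check the column condition of Theorem 7 and, if it holds, return
    the bound sum_i |I_i| / min_i(d_i / c_i) on sum_i c_i |x_i|."""
    n = A_tilde.shape[0]
    for j in range(n):
        off = sum(abs(A_tilde[i, j]) for i in range(n) if i != j)
        if A_tilde[j, j] + off > 0:
            return None  # condition fails for column j
    return np.sum(np.abs(I)) / np.min(d / c)

# hypothetical 2-neuron parameters
A_tilde = np.array([[-1.0, 0.4], [0.3, -0.8]])
c = np.array([1.0, 1.0])
d = np.array([2.0, 1.5])
I = np.array([0.5, -0.5])
print(omega11_bound(A_tilde, c, d, I))
```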
6 Example
Consider the neural network studied by Q. Li and X. Yang in [16]:

ẋ(t) = −x(t) + W F(x(t)),   (6)

where F(x(t)) = (f(x_1(t)), f(x_2(t)), f(x_3(t)))^T, f(x) = (|x + 1| − |x − 1|)/2, and

W = ( w_11 w_12 0 ; w_21 w_22 w_23 ; 0 w_32 w_33 ) = ( 1.2 −1.6 0 ; 1.2 1+p 0.9 ; 0 2.2 1.5 ).
When p = −0.12, 2M_1^(2) = 2(|w_11| + |w_12|) = 5.6, 2M_2^(2) = 2(|w_21| + |w_22| + |w_23|) = 5.96, and 2M_3^(2) = 2(|w_32| + |w_33|) = 7.4. Hence Ω_12 = { (x_1, x_2, x_3) | |x_1| ≤ 5.6, |x_2| ≤ 5.96, |x_3| ≤ 7.4 } is a globally exponentially attractive set and positive invariant set of (6). Similarly, for p = −0.03, p = 0, and p = 0.12,

Ω_13 = { (x_1, x_2, x_3) | |x_1| ≤ 5.6, |x_2| ≤ 6.14, |x_3| ≤ 7.4 },
Ω_14 = { (x_1, x_2, x_3) | |x_1| ≤ 5.6, |x_2| ≤ 6.2, |x_3| ≤ 7.4 },
Ω_15 = { (x_1, x_2, x_3) | |x_1| ≤ 5.6, |x_2| ≤ 6.44, |x_3| ≤ 7.4 }

are globally exponentially attractive sets and positive invariant sets of (6), respectively.
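These radii can be reproduced mechanically. The helper below is our own sketch (for this network d_i = 1, I_i = 0, and k_j = 1 since |f| ≤ 1, so the radius reduces to 2 Σ_j |w_ij|):

```python
import numpy as np

def box_radius(W, p):
    """2*M_i^(2) for network (6): with d_i = 1, I_i = 0, k_j = 1 for
    f(x) = (|x+1| - |x-1|)/2, the radius is 2 * sum_j |w_ij|."""
    Wp = W.copy()
    Wp[1, 1] = 1 + p          # the (2,2) entry of W is 1 + p
    return 2 * np.abs(Wp).sum(axis=1)

# entry (1,1) is a placeholder, overwritten with 1 + p above
W = np.array([[1.2, -1.6, 0.0], [1.2, 0.0, 0.9], [0.0, 2.2, 1.5]])
for p in (-0.12, -0.03, 0.0, 0.12):
    print(p, box_radius(W, p))
```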
Acknowledgement. This work was supported by the National Natural Science Foundation of China under Grants 60274007 and 60474011.
References 1. Hopfield, J.J.: Neurons with Graded Response Have Collective Computational Properties Like Those of Two-state Neurons. Proc. Natl. Academy Sci. 81 (1984) 3088-3092 2. Chua, L.O., Yang, L.: Cellular Neural Networks: Theory. IEEE Trans. Circuits Syst. 35 (1988) 1257-1272 3. Liao, X.X.: Stability of Hopfield-type Neural Networks (I). Science in China, Series A 38 (1995) 407-418 4. Liao, X.X.: Mathematical Theory of Cellular Neural Networks (I). Science in China, Series A 24 (1994) 902-910 5. Liao, X.X.: Mathematical Theory of Cellular Neural Networks (II). Science in China, Series A 24 (1994) 1037-1046 6. Liao, X.X., Wang, J.: Algebraic Criteria for Global Exponential Stability of Cellular Neural Networks with Multiple Time Delays. IEEE Trans. Circuits and Systems I. 50 (2003) 268-275 7. Liao, X.X., Wang, J., Zeng, Z.G.: Global Asymptotic Stability and Global Exponential Stability of Delayed Cellular Neural Networks. IEEE Trans. on Circuits and Systems II, Express Briefs 52 (2005) 403-409 8. Shen, Y., Jiang, M.H., Liao, X.X.: Global Exponential Stability of Cohen-Grossberg Neural Networks with Time-varying Delays and Continuously Distributed Delays. In: Wang, J., Liao, X.F., Zhang, Y. (eds.): Advances in Neural Networks. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag, Berlin Heidelberg New York (2005) 156-161 9. Zeng, Z.G., Wang, J., Liao, X.X.: Stability Analysis of Delayed Cellular Neural Networks Described Using Cloning Templates. IEEE Trans. Circuits and Syst. I 51 (2004) 2313-2324 10. Yi, Z., Tan, K.K.: Convergence Analysis of Recurrent Neural Networks. Kluwer Academic Publishers (2004) 11. Michel, A.N., Liu, D.: Qualitative Analysis and Synthesis of Recurrent Neural Networks. Marcel Dekker, New York (2002) 12. Xu, D.Y., Yang, Z.C.: Impulsive Delay Differential Inequality and Stability of Neural Networks. J. Math. Anal. and Appl. 305 (2005) 107-120
Global Exponential Stability in Lagrange Sense
121
13. Liao, X.F., Yu, J.B.: Robust Stability for Interval Hopfield Neural Networks with Time Delay. IEEE Transactions on Neural Networks 9 (1998) 1042-1045 14. Forti, M., Manetti, S., Marini, M.: Necessary and Sufficient Condition for Absolute Stability of Neural Networks. IEEE Trans. Circuits and Syst. I 41 (1994) 491-494 15. Liao, X.X., Wang, J.: Global Dissipativity of Continuous-time Recurrent Neural Networks with Time Delay. Physical Review E 68 (2003) 016118 16. Li, Q.D., Yang, X.S.: Complex Dynamics in A Simple Hopfield-type Neural Network. In: Wang, J., Liao, X.F., Zhang, Y. (eds.): Advances in Neural Networks. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag, Berlin Heidelberg New York (2005) 357-362
Global Exponential Stability of Recurrent Neural Networks with Time-Varying Delay Yi Shen1 , Meiqin Liu2 , and Xiaodong Xu3 1
Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
[email protected] 2 College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China
[email protected] 3 College of Public Administration, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
[email protected]
Abstract. A new theoretical result on the global exponential stability of recurrent neural networks with time-varying delay is presented. Notably, the activation functions of the recurrent neural network are not required to be bounded. The presented criterion, which has the attractive feature of a linear matrix inequality structure, is a generalization and improvement of some previous criteria.
1 Introduction
In recent years, recurrent neural networks have been widely studied [1-2] because of their immense application potential. Hopfield neural networks and cellular neural networks are typical representatives of recurrent neural networks and have been successfully applied to signal processing, especially image processing, and to solving nonlinear algebraic and transcendental equations. Such applications rely heavily on the stability properties of equilibria; therefore, the stability analysis of recurrent neural networks is important from both theoretical and applied points of view [1-8]. In this paper, a new theoretical result on the global exponential stability of recurrent neural networks with time-varying delay is presented. Notably, the activation functions are not required to be bounded. The presented criterion, which has the attractive feature of a linear matrix inequality structure, is a generalization and improvement of the previous criteria in Refs. [3-8]. The model of recurrent neural network considered herein is described by the state equation

ẋ(t) = −D x(t) + A f(x(t)) + B f(x(t − δ(t))) + u,
(1)
where x(t) = [x_1(t), ..., x_n(t)]^T denotes the state vector, f(x(t)) = [f_1(x_1(t)), ..., f_n(x_n(t))]^T is the output vector, and D = diag(d_1, ..., d_n) > 0 is the self-feedback

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 122-128, 2006. © Springer-Verlag Berlin Heidelberg 2006
Global Exponential Stability of Recurrent Neural Networks
123
matrix, A = (a_ij)_{n×n} is the feedback matrix, B = (b_ij)_{n×n} is the delayed feedback matrix, u = [u_1, ..., u_n]^T is a constant vector, and δ : R^+ → [0, τ] is the delay, with τ > 0 and δ'(t) ≤ 0. The activation function f_i(·) satisfies the following condition:

(H) There exist k_i > 0 such that for any θ, ρ ∈ R,

0 ≤ (f_i(θ) − f_i(ρ))(θ − ρ)^{−1} ≤ k_i.
(2)
It is easy to see that activation functions satisfying (H) are not necessarily bounded, whereas the usual sigmoidal functions and piecewise linear functions employed in [3-8] are bounded on R.
2 Main Results
Theorem 2.1. If there exist positive definite symmetric matrices P = [p_ij] ∈ R^{n×n} and G = [g_ij] ∈ R^{n×n} and a positive definite diagonal matrix Q = diag(q_1, ..., q_n) such that

M = ( P D + D^T P    −P A                             −P B
      −A^T P         2 Q D K^{−1} − Q A − A^T Q − G   −Q B
      −B^T P         −B^T Q                           G   )  > 0,   (3)

where K = diag(k_1, ..., k_n) (see (2)), then network (1) has a unique equilibrium point x* and it is globally exponentially stable.

Proof. By (3), it follows that

( 2 Q D K^{−1} − Q A − A^T Q − G   −Q B
  −B^T Q                           G    )  > 0.   (4)
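Condition (3) is a standard LMI; for a full design one would use an SDP solver, but for fixed candidate matrices a plain eigenvalue test suffices. The following sketch uses hypothetical network data of our own choosing:

```python
import numpy as np

def lmi_holds(P, Q, G, A, B, D, K):
    """Assemble the block matrix M of (3) and test positive definiteness
    via its smallest eigenvalue (M is symmetric by construction)."""
    S = 2 * Q @ D @ np.linalg.inv(K) - Q @ A - A.T @ Q - G
    M = np.block([
        [P @ D + D.T @ P, -P @ A, -P @ B],
        [-A.T @ P, S, -Q @ B],
        [-B.T @ P, -B.T @ Q, G],
    ])
    return bool(np.linalg.eigvalsh((M + M.T) / 2).min() > 0)

n = 2
I2 = np.eye(n)
# hypothetical data: strong self-feedback, weak coupling
D = 2 * I2
K = I2
A = np.array([[-0.2, 0.1], [0.0, -0.3]])
B = 0.1 * I2
print(lmi_holds(P=I2, Q=I2, G=I2, A=A, B=B, D=D, K=K))
```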
From the Schur Complement (see [1]) and (4), we obtain 2QDK −1 − QA − AT Q − G − QBG−1 B T Q > 0.
(5)
By (5) and G + QBG−1 B T Q − B T Q − QB ≥ 0, we get 2QDK −1 − Q(A + B) − (A + B)T Q > 0.
(6)
Set J(x) = −D x + (A + B) f(x) + u; then J is injective on R^n. Otherwise, there exist x ≠ y, x, y ∈ R^n, such that J(x) = J(y), that is,

D(x − y) = (A + B)(f(x) − f(y)).
(7)
It is obvious that f(x) ≠ f(y); otherwise x = y by (7). Multiplying both sides of (7) by 2(f(x) − f(y))^T Q, we obtain

2(f(x) − f(y))^T Q D (x − y) = 2(f(x) − f(y))^T Q (A + B)(f(x) − f(y)).
(8)
By (2), we have 2(f (x) − f (y))T QD(x − y) ≥ 2(f (x) − f (y))T QDK −1 (f (x) − f (y)).
(9)
124
Y. Shen, M. Liu, and X. Xu
Therefore, by (8) and (9), we get

(f(x) − f(y))^T (2 Q D K^{−1} − Q(A + B) − (A + B)^T Q)(f(x) − f(y)) ≤ 0.   (10)

Since f(x) − f(y) ≠ 0, by (6) we have

(f(x) − f(y))^T (2 Q D K^{−1} − Q(A + B) − (A + B)^T Q)(f(x) − f(y)) > 0.   (11)

Obviously (10) contradicts (11); therefore J is injective. Now one can prove lim_{|x|→+∞} |J(x)| = +∞. Because u and f(0) are constants, it suffices to prove lim_{|x|→+∞} |J̃(x)| = +∞, where J̃(x) = −D x + (A + B)(f(x) − f(0)). First, we show that lim_{x→+∞} |J̃(x)| = +∞. If |f(x)| is bounded on [0, +∞), this is immediate. Assume instead that |f(x)| is unbounded on [0, +∞); then there exists a sequence x_n ∈ [0, +∞) such that lim_{n→+∞} |f(x_n)| = +∞. By (2), f(x) is continuous and monotone nondecreasing, and we may take {x_n} monotone increasing, so lim_{n→+∞} x_n = +∞ and lim_{n→+∞} f(x_n) = +∞. Hence, for any sufficiently large x ∈ [0, +∞) there is an m with x_m ≤ x and f(x_m) ≤ f(x); since lim_{m→+∞} x_m = +∞ and lim_{m→+∞} f(x_m) = +∞, we get lim_{x→+∞} f(x) = +∞, that is, lim_{x→+∞} |f(x)| = +∞ and therefore lim_{x→+∞} |f(x) − f(0)| = +∞. And by (2), we have

2(f(x) − f(0))^T Q J̃(x) ≤ −μ |f(x) − f(0)|^2,
(12)
where μ = λ_min(2 Q D K^{−1} − Q(A + B) − (A + B)^T Q); it follows from (6) that μ > 0. So, by (12), we get

μ |f(x) − f(0)|^2 ≤ 2 ‖Q‖ |f(x) − f(0)| |J̃(x)|.
(13)
lim_{x→+∞} |f(x) − f(0)| = +∞ together with (13) implies lim_{x→+∞} |J̃(x)| = +∞. Similarly, we can prove lim_{x→−∞} |J̃(x)| = +∞. Hence we obtain lim_{|x|→+∞} |J(x)| = +∞. By Ref. [2], it follows that J(x) is a homeomorphism of R^n onto itself. Therefore, the equation J(x) = 0 has a unique solution; i.e., system (1) has a unique equilibrium x*. We now prove that the unique equilibrium x* = (x_1*, ..., x_n*)^T of system (1) is globally exponentially stable. Since x* is the equilibrium of (1), −D x* + (A + B) f(x*) + u = 0, and by (1) we have

ẏ(t) = −D y(t) + A g(y(t)) + B g(y(t − δ(t))),
(14)
where y(·) = [y_1(·), ..., y_n(·)]^T = x(·) − x* is the new state vector, g(y(·)) = [g_1(y_1(·)), ..., g_n(y_n(·))]^T represents the output vector of the transformed system, and g_i(y_i(·)) = f_i(y_i(·) + x_i*) − f_i(x_i*), i = 1, 2, ..., n. Obviously, g(0) = 0 and y* = 0 is the unique equilibrium of system (14), so the stability of the equilibrium x* of system (1) is equivalent to that of the trivial solution of system (14). Choose the Lyapunov functional

V(y(t)) = y^T(t) P y(t) + 2 Σ_{i=1}^n q_i ∫_0^{y_i(t)} g_i(s) ds + ∫_{t−δ(t)}^{t} g(y(s))^T G g(y(s)) ds.
The time derivative of V(y(t)) along the trajectories of (14) turns out to be

V̇(y(t)) ≤ −λ |y(t)|^2,   (15)

where λ = λ_min(M) > 0; (15) follows from (2) and δ'(t) ≤ 0. Set h(ε) = −λ + λ_max(P) ε + ‖Q‖ ‖K‖ ε + ‖G‖ ‖K‖^2 e^{ετ} τ ε. Owing to h'(ε) > 0, h(0) = −λ < 0, and h(+∞) = +∞, there exists a unique ε > 0 such that h(ε) = 0; i.e.,

−λ + λ_max(P) ε + ‖Q‖ ‖K‖ ε + ‖G‖ ‖K‖^2 e^{ετ} τ ε = 0.   (16)

From the definition of V and (2), it follows that

V(y(t)) ≤ (λ_max(P) + ‖Q‖ ‖K‖) |y(t)|^2 + ‖G‖ ‖K‖^2 ∫_{t−δ(t)}^{t} |y(s)|^2 ds.   (17)

For the ε > 0 satisfying (16), by (15) and (17) we get

(e^{εt} V(y(t)))' = ε e^{εt} V(y(t)) + e^{εt} V̇(y(t))
 ≤ e^{εt} (−λ + λ_max(P) ε + ‖Q‖ ‖K‖ ε) |y(t)|^2 + ε e^{εt} ‖G‖ ‖K‖^2 ∫_{t−δ(t)}^{t} |y(s)|^2 ds.   (18)

Integrating both sides of (18) from 0 to an arbitrary t ≥ 0, we obtain

e^{εt} V(y(t)) − V(y(0)) ≤ ∫_0^t e^{εs} (−λ + λ_max(P) ε + ‖Q‖ ‖K‖ ε) |y(s)|^2 ds + ∫_0^t ε e^{εs} ‖G‖ ‖K‖^2 ( ∫_{s−δ(s)}^{s} |y(r)|^2 dr ) ds.   (19)

Moreover,

∫_0^t ε e^{εs} ‖G‖ ‖K‖^2 ( ∫_{s−δ(s)}^{s} |y(r)|^2 dr ) ds ≤ ∫_{−τ}^{t} ε ‖G‖ ‖K‖^2 τ e^{ε(r+τ)} |y(r)|^2 dr.   (20)

Substituting (20) into (19) and applying (16), we have

e^{εt} V(y(t)) ≤ C,  t ≥ 0,   (21)

where C = V(y(0)) + ∫_{−τ}^{0} ε ‖G‖ ‖K‖^2 τ e^{ετ} |y(r)|^2 dr. By the definition of V and (21), we get

|y(t)| ≤ √(C / λ_min(P)) e^{−εt/2},  t ≥ 0.   (22)
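Since h is strictly increasing with h(0) < 0, the rate ε in (16) can be found by simple bisection. A small sketch of our own (the inputs `lam`, `lmax_P` and the norm values `nQ`, `nK`, `nG` are hypothetical placeholders for λ, λ_max(P), ‖Q‖, ‖K‖, ‖G‖):

```python
import math

def exponent_rate(lam, lmax_P, nQ, nK, nG, tau):
    """Solve h(eps) = -lam + lmax_P*eps + nQ*nK*eps
    + nG*nK**2*exp(eps*tau)*tau*eps = 0 by bisection
    (h is increasing and h(0) = -lam < 0)."""
    h = lambda x: (-lam + lmax_P * x + nQ * nK * x
                   + nG * nK ** 2 * math.exp(x * tau) * tau * x)
    lo, hi = 0.0, 1.0
    while h(hi) < 0:        # bracket the unique root
        hi *= 2
    for _ in range(80):     # bisection
        mid = (lo + hi) / 2
        if h(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

eps = exponent_rate(lam=1.0, lmax_P=2.0, nQ=1.0, nK=1.0, nG=1.0, tau=0.5)
print(eps)
```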
This implies that the trivial solution of (14) is globally exponentially stable, with exponential convergence rate determined by (16) via (22).

Remark 1. In Theorem 2.1 we assume neither that the activation function is bounded nor that the delay is constant. In [3], several asymptotic stability results are derived for bounded piecewise linear activation functions and constant delay, while our result ensures global exponential stability of system (1) under similar conditions. In addition, network (1) is a more general system than the one in [3].

Corollary 2.2. Assume that K = D = I (identity matrix) in (1). Then the neural network (1) has a unique equilibrium point and it is globally exponentially stable if there exist matrices Q, T and a nonnegative constant β such that one of the following conditions is satisfied:

(A1) Q A + A^T Q + T < 0, 2Q + T − I − Q B B^T Q ≥ 0, where Q = diag(q_1, ..., q_n) is a positive definite diagonal matrix and T = (t_ij)_{n×n} is a symmetric positive definite matrix.
(A2) A + A^T + βI < 0, ‖B‖ ≤ √(2β), β > 1.
(A3) A + A^T + βI < 0, ‖B‖ ≤ √(1 + β), 0 ≤ β ≤ 1.
(A4) A + A^T + βI < 0, ‖B‖ ≤ √(1 + γ + β), β > 0, where γ is the minimum eigenvalue of −(A + A^T + βI).
(A5) A + A^T < 0, ‖B‖ ≤ 1.
(A6) A + A^T + βI < 0, ‖B‖ ≤ √(1 + β), β ≥ 0.
(A7) A + A^T + βI < 0, ‖B‖ ≤ √(2β), β > 0.

Proof. It is obvious that the cases (A3), (A5), (A6), and (A7) reduce to (A4) and (A2). Therefore, it only needs to be proved that the conclusion holds when one of (A1), (A2), and (A4) is satisfied. Let K = D = I, P = αI, G = α B^T B + I, α > 0. By Lemma A (see the Appendix), M > 0 in (3) is equivalent to

( S        −Q B        )    1           ( α A^T )
( −B^T Q   α B^T B + I )  − ── α^{−1}   ( α B^T ) ( α A   α B )
                            2

  = U + (1/2) α^{−1} ( α A^T ; −α B^T ) ( α A   −α B )  > 0,   (23)

where

U = ( S − α A^T A   −Q B
      −B^T Q        I    ),   S = 2Q − Q A − A^T Q − α B^T B − I.

Obviously, if U > 0, then (23) holds. By Lemma A, U > 0 is equivalent to

2Q − Q A − A^T Q − α B^T B − I − α A^T A − Q B B^T Q > 0.   (24)

Case 1: If α is sufficiently small, then by (A1) there exists a matrix Q such that (24) holds.
Case 2: Take Q = (1/β) I, β > 1. If α is sufficiently small, there is a matrix Q satisfying (24) by (A2).
Case 3: Set Q = (1/(1+β)) I, β > 0; then

2Q − Q A − A^T Q − α B^T B − I − α A^T A − Q B B^T Q
 = I/(1+β) − (A + A^T + βI)/(1+β)^2 − β(A + A^T + βI)/(1+β)^2 − B B^T/(1+β)^2 − α B^T B − α A^T A.

Taking α sufficiently small, by the above formula and condition (A4) it is easy to see that there exists a matrix Q such that (24) holds. The proof is completed.

Remark 2. Assume that D = I, f_i(x_i) = (1/2)(|x_i + 1| − |x_i − 1|), and δ(t) = τ (constant) in (1); then Corollary 2.2 reduces to the main results in [3-8]. However, Refs. [3-8] only prove asymptotic stability of network (1). Furthermore, the following Example 3.1 satisfies the conditions of Theorem 2.1 but not those of Corollary 2.2. Therefore, Theorem 2.1 is an improvement and generalization of the results of Refs. [3-8].
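Conditions (A1)-(A7) are easy to test for concrete matrices; for example, (A5) reduces to an eigenvalue test plus a spectral-norm bound. A sketch with made-up matrices of our own:

```python
import numpy as np

def condition_A5(A, B):
    """Condition (A5) of Corollary 2.2: A + A^T negative definite and
    the spectral norm of B at most 1."""
    neg_def = np.linalg.eigvalsh(A + A.T).max() < 0
    return bool(neg_def and np.linalg.norm(B, 2) <= 1)

A = np.array([[-1.0, 0.5], [-0.5, -1.0]])   # A + A^T = -2I < 0
B = np.array([[0.6, 0.0], [0.0, 0.8]])      # spectral norm 0.8
print(condition_A5(A, B))
```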
3 Conclusion
In this paper, we present a theoretical result on the global exponential stability of a general class of recurrent neural networks with time-varying delay. The conditions obtained here are easy to check in practice and are of prime importance and great interest in the design and application of such networks. The criteria are more general than the respective criteria reported in [3-8].
Acknowledgments. The work was supported by the Natural Science Foundation of Hubei (2004ABA055) and the National Natural Science Foundation of China (60574025, 60074008).
References 1. Cao, J., Wang, J.: Global Asymptotic and Robust Stability of Recurrent Neural Networks with Time Delays. IEEE Trans. Circuits Syst. I 52 (2005) 417-426 2. Zeng, Z., Wang, J., Liao, X.: Global Exponential Stability of a General Class of Recurrent Neural Networks with Time-Varying Delays. IEEE Trans. Circuits Syst. 50 (2003) 1353-1359 3. Singh, V.: A Generalized LMI-Based Approach to the Global Asymptotic Stability of Delayed Cellular Neural Networks. IEEE Trans. Neural Networks 15 (2004) 223225 4. Arik, S.: An Analysis of Global Asymptotic Stability of Delayed Cellular Neural Networks. IEEE Trans. Neural Networks 13 (2002) 1239-1242 5. Liao, T., Wang, F.: Global Stability for Cellular Neural Networks with Time Delay. IEEE Trans. Neural Networks 11 (2000) 1481-1484
6. Arik, S., Tavsanoglu, V.: On the Global Asymptotic Stability of Delayed Cellular Neural Networks. IEEE Trans. Circuits Syst. I, Fundam. Theory Appl. 47 (2000) 571-574 7. Cao, J.: Global Stability Conditions for Delayed CNNs. IEEE Trans. Circuits Syst. I, Fundam. Theory Appl. 48 (2001) 1330-1333 8. Arik, S.: An Improved Global Stability Result for Delayed Cellular Neural Networks. IEEE Trans. Circuits Syst. I, Fundam. Theory Appl. 49 (2002) 1211-1214
New Criteria of Global Exponential Stability for a Class of Generalized Neural Networks with Time-Varying Delays Gang Wang, Hua-Guang Zhang, and Chong-Hui Song School of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, 110004, People’s Republic of China wanggang
[email protected],
[email protected]
Abstract. In this paper, we essentially drop the requirement of Lipschitz condition on the activation functions. By employing Lyapunov functional and several new inequalities, some new criteria concerning global exponential stability for a class of generalized neural networks with time-varying delays are obtained, which only depend on physical parameters of neural networks. Since these new criteria do not require the activation functions to be differentiable, bounded or monotone nondecreasing and the connection weight matrices to be symmetric, they are mild and more general than previously known criteria.
1 Introduction

In recent years, the stability problem of time-delayed neural networks has been deeply investigated for the sake of theoretical interest as well as applications (see, e.g., [1]-[4], [7]). Due to the finite switching speed of amplifiers, time delays are likely to be present in models of electronic networks; thus a delay parameter must be introduced into the system model. All of the results above assume that the activation functions satisfy the Lipschitz condition. However, in many evolutionary processes, as well as in optimal control models and flying object motion, there are many bounded monotone-nondecreasing signal functions that do not satisfy the Lipschitz condition. For instance, for the pulse-coded signal function, which has received much attention in many fields of applied science and engineering, an exponentially weighted time average of sampled pulses is often used, and it does not satisfy the Lipschitz condition. Therefore, it is very important and, in fact, necessary to study the stability of dynamical neural systems whose activation functions do not satisfy the Lipschitz condition. Feng and Plamondon [5] first proposed the hypothesis (H1) (see Sec. 2), which differs from the Lipschitz condition. The proposed hypothesis (H1) is significant and allows us to include a variety of activation functions, whether they satisfy the Lipschitz condition or not. In [5], some global asymptotic stability criteria are obtained which only depend on the physical parameters of the neural network. Compared with criteria involving many unknown parameters, such criteria have no degrees of freedom to be tuned and can be checked easily and quickly; criteria depending only on the physical parameters of neural networks are therefore the more worth improving. In the spirit of [5],

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 129-134, 2006. © Springer-Verlag Berlin Heidelberg 2006
130
G. Wang, H.-G. Zhang, and C.-H. Song
Zhang et al. [6] proposed some new global asymptotic stability criteria which are less restrictive than those in [5], but, owing to the many unknown parameters involved, tuning the degrees of freedom is difficult and time-consuming. On the other hand, [5] and [6] only obtained global asymptotic stability criteria for constant delays. To the best of our knowledge, global exponential stability with time-varying delays under hypothesis (H1) has seldom been considered. The purpose of this paper is to improve the results in [5]. We present some new criteria for the global exponential stability of a class of generalized neural networks with time-varying delays, which only depend on the physical parameters of the neural network under hypothesis (H1). The main advantages of the obtained exponential stability conditions are: (i) we essentially drop the requirement of a Lipschitz condition on the activation functions, and do not require the activation functions to be differentiable, bounded, or monotone nondecreasing, nor the connection weight matrices to be symmetric; (ii) the stability conditions only depend on the physical parameters of the neural network and can be checked easily and quickly; (iii) the conditions obtained here are less restrictive and generalize some previous works.
2 Preliminary

We consider a class of generalized neural networks with time-varying delays described by the nonlinear differential equations

dx_i(t)/dt = −d_i(x_i(t)) + Σ_{j=1}^n a_ij f_j(x_j(t)) + Σ_{j=1}^n b_ij g_j(x_j(t − τ_ij(t))) + I_i,
x_i(s) = φ_i(s),  s ∈ [−τ, 0],  i = 1, 2, ..., n,   (1)
where x_i(t) is the state, f_j(·) and g_j(·) denote the activation functions, a_ij and b_ij denote the constant connection weight and the constant delayed connection weight, respectively, τ_ij(t) is the time-varying transmission delay, which is nonnegative, bounded, and differentiable, τ = max τ_ij(t), and d_i(·) > 0 represents the rate with which the ith neuron resets its potential to the resting state when disconnected from the network and external inputs, with d_i(0) = 0. Suppose system (1) has at least one equilibrium. Let z_i(t) = x_i(t) − x_i*, where x* = (x_1*, x_2*, ..., x_n*) is an equilibrium of system (1); then system (1) reduces to

dz_i(t)/dt = −D_i(z_i(t)) + Σ_{j=1}^n a_ij α_j(z_j(t)) + Σ_{j=1}^n b_ij β_j(z_j(t − τ_ij(t))),   (2)
where Di (zi (t)) = di (zi (t) + x∗i ) − di (x∗i ),αj (zj (t)) = fj (zj (t) + x∗j ) − fj (x∗j ), βj (zj (t − τij (t))) = gj (zj (t − τij (t)) + x∗j ) − gj (x∗j ). Remark 1. Obviously, (0, 0, · · · , 0)T is an equilibrium point of (2). Therefore, the global exponential stability analysis of the equilibrium point x∗ of (1) can now be transformed to the global exponential stability analysis of the trivial solution z = 0 of (2).
New Criteria of Global Exponential Stability
131
Throughout this paper, we assume that the activation functions α_j, β_j (j = 1, 2, ..., n) satisfy the following property:

(H1) The functions α_j, β_j (j = 1, 2, ..., n) satisfy z_j α_j(z_j) > 0 (z_j ≠ 0) and z_j β_j(z_j) > 0 (z_j ≠ 0), and there exist real numbers m_j and n_j such that

m_j = sup_{z_j ≠ 0} α_j(z_j) / z_j,    n_j = sup_{z_j ≠ 0} β_j(z_j) / z_j.   (3)

Remark 2. Note that hypothesis (H1) is weaker than the Lipschitz condition mostly used in the literature. Further, if f_j, g_j (j = 1, 2, ..., n) are Lipschitz functions, then m_j and n_j can be taken as the respective Lipschitz constants.

(H2) There exist positive constants D̲_i and D̄_i, i = 1, 2, ..., n, such that

0 < D̲_i ≤ (d_i(x_i) − d_i(y_i)) / (x_i − y_i) ≤ D̄_i  for all x_i ≠ y_i.

We denote

‖ψ − φ*‖^2 = sup_{−τ ≤ s ≤ 0} Σ_{i=1}^n |ψ_i(s) − φ_i*|^2,   (4)
where φ* = (φ_1*, φ_2*, ..., φ_n*) is the initial value of (1).

Definition 1. The equilibrium x* = (x_1*, x_2*, ..., x_n*) of (1) is said to be globally exponentially stable if there exist ε > 0 and M(ε) ≥ 1 such that Σ_{i=1}^n (x_i(t) − x_i*)^2 ≤ M(ε) ‖ψ − φ*‖^2 e^{−εt} for all t > 0; ε is called the degree of exponential stability.

Definition 2. Let f(t) : R → R be a continuous function. The upper right Dini derivative D^+ f(t) is defined as

D^+ f(t) = lim_{h→0^+} (1/h)(f(t + h) − f(t)).

Lemma 1. Suppose that a, b, and c are constants with a > 0, b ≥ 0, and 0 ≤ c < ∞; then

−a c^2 + b c ≤ −(1/2) a c^2 + b^2 / (2a).   (5)

Proof. According to

−(a c − b)^2 ≤ 0,   (6)

we obtain

−a c^2 + b c ≤ −(1/2) a c^2 + b^2 / (2a).   (7)

The proof is completed.

Lemma 2 (Cauchy inequality). Suppose that a_i and b_i (i = 1, 2, ..., n) are positive real numbers; then

( Σ_{i=1}^n a_i b_i )^2 ≤ ( Σ_{i=1}^n a_i^2 )( Σ_{i=1}^n b_i^2 ).   (8)
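To see that (H1) is genuinely weaker than a Lipschitz condition, consider α(z) = z(1 + sin²(1/z))/2 with α(0) = 0 — our own illustrative example, not one from the paper. The ratio α(z)/z = (1 + sin²(1/z))/2 is bounded by m = 1, yet the difference quotients blow up near 0, so no Lipschitz constant exists:

```python
import numpy as np

def alpha(z):
    """alpha(z) = z*(1 + sin^2(1/z))/2 for z != 0 (alpha(0) = 0):
    satisfies (H1) with m = sup alpha(z)/z = 1, but its derivative
    contains -sin(2/z)/(2z), which is unbounded near 0."""
    return z * (1 + np.sin(1 / z) ** 2) / 2

z = np.linspace(1e-4, 10.0, 200001)
print(np.max(alpha(z) / z) <= 1.0)      # the ratio never exceeds m = 1

z0 = np.linspace(1e-4, 2e-4, 10001)     # difference quotients blow up near 0
e = 1e-10
print(np.max(np.abs(alpha(z0 + e) - alpha(z0)) / e))
```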
3 Main Results

Theorem 1. Assume that f_j, g_j (j = 1, 2, ..., n) satisfy hypothesis (H1) above, τ'_ij(t) ≤ 0, and suppose further that

Σ_{j=1}^n (1 / D̲_j) Σ_{k=1}^n ( a_jk^2 m_k^2 + b_jk^2 n_k^2 ) < (1/2) D̲_i   (9)

for i = 1, 2, ..., n. Then the equilibrium point x* of system (1) is globally exponentially stable.

Proof. (9) implies that we can always choose a small constant ε > 0 such that

−(1/2)(D̲_i − ε/2) + Σ_{j=1}^n Σ_{k=1}^n ( a_jk^2 m_k^2 + b_jk^2 n_k^2 e^{ετ} ) / (D̲_j − ε/2) < 0,   (10)
where τ = max τ_ij(t), i, j = 1, 2, ..., n. We choose a Lyapunov functional for system (2) as

V(z, t) = Σ_{i=1}^n ( V_i1(z, t) + V_i2(z, t) ),   (11)

where

V_i1(z, t) = z_i^2(t) e^{εt},   (12)

and

V_i2(z, t) = 2 ( Σ_{j=1}^n b_ij^2 n_j^2 ∫_{t−τ_ij(t)}^{t} z_j^2(s) e^{ε(s+τ_ij(t))} ds ) / (D̲_i − ε/2).   (13)
Calculating the upper right derivative of V_i1(z, t) along the solutions of system (2), from (H1) and (H2) one can derive that

D^+ V_i1(z, t)|_(2) ≤ 2 e^{εt} [ −2((1/2)D̲_i − ε/4) z_i^2(t) + |z_i(t)| Σ_{j=1}^n |a_ij| m_j |z_j(t)| + |z_i(t)| Σ_{j=1}^n |b_ij| n_j |z_j(t − τ_ij(t))| ].

According to Lemma 1, i.e., −a c^2 + b c ≤ −(1/2) a c^2 + b^2/(2a), we obtain

D^+ V_i1(z, t)|_(2) ≤ 2 e^{εt} [ −((1/2)D̲_i − ε/4) z_i^2(t) + ( Σ_{j=1}^n |a_ij| m_j |z_j(t)| )^2 / (D̲_i − ε/2) + ( Σ_{j=1}^n |b_ij| n_j |z_j(t − τ_ij(t))| )^2 / (D̲_i − ε/2) ].   (14)

By Lemma 2, we obtain

D^+ V_i1(z, t)|_(2) ≤ 2 e^{εt} [ −((1/2)D̲_i − ε/4) z_i^2(t) + ( Σ_{j=1}^n a_ij^2 m_j^2 )( Σ_{j=1}^n z_j^2(t) ) / (D̲_i − ε/2) + ( Σ_{j=1}^n b_ij^2 n_j^2 )( Σ_{j=1}^n z_j^2(t − τ_ij(t)) ) / (D̲_i − ε/2) ].   (15)

Calculating the upper right derivative of V_i2(z, t) along the solutions of system (2), one can derive that

D^+ V_i2(z, t)|_(2) ≤ 2 e^{εt} Σ_{j=1}^n b_ij^2 n_j^2 ( z_j^2(t) e^{ετ} − z_j^2(t − τ_ij(t)) ) / (D̲_i − ε/2).   (16)

It follows from (11), (15), and (16) that

D^+ V(z, t)|_(2) ≤ 2 e^{εt} Σ_{i=1}^n [ −(1/2)(D̲_i − ε/2) + Σ_{j=1}^n Σ_{k=1}^n ( a_jk^2 m_k^2 + b_jk^2 n_k^2 e^{ετ} ) / (D̲_j − ε/2) ] z_i^2(t) < 0,

i.e.,

V(z, t) ≤ V(z, 0),  t ≥ 0.   (17)

Note that

V(z, t) ≥ Σ_{i=1}^n z_i^2(t) e^{εt},   (18)

and

V(z, 0) ≤ [ 1 + 2 e^{ετ} Σ_{i=1}^n Σ_{j=1}^n b_ij^2 n_j^2 / ( ε (D̲_i − ε/2) ) ] ‖ψ − φ*‖^2.   (19)

From (17), (18), and (19), we have

Σ_{i=1}^n z_i^2(t) ≤ M(ε) ‖ψ − φ*‖^2 e^{−εt},   M(ε) = 1 + 2 e^{ετ} Σ_{i=1}^n Σ_{j=1}^n b_ij^2 n_j^2 / ( ε (D̲_i − ε/2) ) ≥ 1.

Because z_i = x_i − x_i*, we obtain

Σ_{i=1}^n (x_i(t) − x_i*)^2 ≤ M(ε) ‖ψ − φ*‖^2 e^{−εt}.   (20)

So, for all t > 0, the equilibrium point x* of system (1) is globally exponentially stable. The proof is completed.
If we take d_i(x_i) = d_i x_i, d_i > 0, system (1) can be written as

dx_i(t)/dt = −d_i x_i(t) + Σ_{j=1}^n a_ij f_j(x_j(t)) + Σ_{j=1}^n b_ij g_j(x_j(t − τ_ij(t))) + I_i.   (21)

Then we have the following result.

Corollary 1. Assume that f_j, g_j (j = 1, 2, ..., n) satisfy hypothesis (H1) above, τ'_ij(t) ≤ 0, and suppose further that

Σ_{j=1}^n (1 / d_j) Σ_{k=1}^n ( a_jk^2 m_k^2 + b_jk^2 n_k^2 ) < (1/2) d_i

for i = 1, 2, ..., n. Then the equilibrium point x* of system (21) is globally exponentially stable.
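The criterion of Corollary 1 involves only the network parameters, so checking it takes a few array operations. A sketch with hypothetical data of our own:

```python
import numpy as np

def corollary1_holds(A, B, m, n_, d):
    """Check sum_j (1/d_j) * sum_k (a_jk^2 m_k^2 + b_jk^2 n_k^2) < d_i / 2
    for every i (criterion of Corollary 1); the left-hand side does not
    depend on i, so it suffices to compare against min_i d_i / 2."""
    lhs = np.sum(((A ** 2) @ (m ** 2) + (B ** 2) @ (n_ ** 2)) / d)
    return bool(lhs < d.min() / 2)

# hypothetical weakly coupled 2-neuron network with strong decay
A = np.array([[0.3, -0.2], [0.1, 0.2]])
B = 0.1 * np.eye(2)
m = np.array([1.0, 1.0])
n_ = np.array([1.0, 1.0])
d = np.array([2.0, 2.5])
print(corollary1_holds(A, B, m, n_, d))
```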
4 Conclusion

In this paper, some new criteria for the global exponential stability of a class of generalized neural networks with time-varying delays are obtained. All these new criteria depend only on the physical parameters of the neural network, so they can be checked easily. Furthermore, the obtained results improve upon, and are less conservative than, some existing results.
Acknowledgment This work was supported by the National Natural Science Foundation of China (60325311, 60534010, 60572070, 60504006) and the Program for Changjiang Scholars and Innovative Research Team in University.
References 1. Arik, S., Tavanoglu, V.: Equilibrium Analysis of Delayed CNNs. IEEE Trans. Circuits syst. I. 45(2) (1998) 168-171 2. Liang, J.L., Cao, J.D.: Global Asymptotic Stability of Bi-Directional Associative Memory Networks with Distributed Delays. Applied Mathematics and Computation. 152(2) (2004) 415-424 3. Zeng, Z.G., Wang, J., Liao, X.X.: Global Asymptotic Stability and Global Exponential Stability of Neural Networks with Unbounded Time-varying Delays. IEEE Trans. on Circuits and System II. 52(3) (2005) 168-173 4. Ensari, T., Arik, S.: Global Stability of a Class of Neural Networks with Time-varying Delay. IEEE Trans. on Circuits And System Ii. 52(3) (2005) 126-130 5. Feng, C.H., Plamondon, R.: On the Stability of Delay Neural Networks System. Neural Networks. 14 (2001) 1181-1118 6. Zhang, Q., Ma, R.N., Wang, C., Xu, J.: On the Global Stability of Delayed Neural Networks. IEEE Trans. On Automatic Control. 48(5) (2003) 794-797 7. Zhang, H.G., Wang, G.: Global Exponential Stability of a Class of Generalized Neural Networks with Variable Coefficients and Distributed Delays. In: Huang, D.S., Zhang, X.P, Huang, G.B. (eds.): Advances in Intelligent Computing. Lecture Notes in Computer Science, Vol. 3644. Springer-Verlag, Berlin Heidelberg He Fei (2005) 807-817
Dynamics of General Neural Networks with Distributed Delays

Changyin Sun^{1,2} and Linfeng Li^1

1 College of Electrical Engineering, Hohai University, Nanjing 210098, P.R. China
2 Research Institute of Automation, Southeast University, Nanjing 210096, P.R. China
[email protected]
Abstract. This paper introduces a general class of neural networks with distributed delays and periodic inputs. By constructing a Lyapunov functional and by applying a Halanay-type inequality, respectively, we obtain easily verifiable sufficient conditions ensuring that every solution of the delayed neural network converges exponentially to the unique periodic solution. The results obtained can be regarded as generalizations of previous results for the discrete-delay case.
1 Introduction

The dynamical properties of Hopfield neural networks play important roles in their applications [1-8]. Therefore, one of the most actively investigated issues has been to obtain conditions ensuring that a neural network possesses a unique, globally convergent equilibrium point [9-21]. Since an equilibrium point can be viewed as a special periodic solution with arbitrary period [12-14], the analysis of periodic solutions of neural networks can be considered more general than that of equilibrium points.

In practice, time delays are unavoidably encountered in the implementation of neural networks, and they are often a source of oscillation and instability. Hence, exploring the convergence of delayed neural networks is an important topic; see, e.g., [18-20] and the references therein. The property of global convergence prevents a delayed neural network from getting stuck at a local minimum of the energy function, which has been demonstrated in a number of practical applications reported in the literature [20].

In most previous results on the dynamical analysis of neural systems, a frequent assumption is that the time delays are discrete. However, neural networks usually have a spatial extent due to the presence of a multitude of parallel pathways with a variety of axon sizes and lengths, so there will be a distribution of propagation delays. This means that signal propagation is no longer instantaneous and is better represented by a model with continuously distributed delays.

In this paper, our focus is on the periodicity of general continuous-time neural networks with continuously distributed delays; as we point out below, the discrete-delay case can be included in our models by choosing suitable kernel functions. By using a Lyapunov functional and a Halanay-type inequality separately, we will
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 135-140, 2006. © Springer-Verlag Berlin Heidelberg 2006
136
C. Sun and L. Li
derive some easily checkable conditions that guarantee exponential periodicity of neural networks with delays. The rest of the paper is organized as follows. The Hopfield neural network model with distributed delays and some preliminaries are given in Section 2. The main results on periodicity of neural networks are presented and discussed in Section 3. In Section 4, two examples are given to illustrate our results; these examples also show that the conditions of the two theorems are independent of each other. Finally, Section 5 concludes the paper.
2 Neural Network Model and Preliminaries

Consider the model of continuous-time neural networks described by integrodifferential equations of the form

\dot{x}_i(t) = -d_i x_i(t) + \sum_{j=1}^{n} a_{ij} f_j\Big( \int_0^{\infty} K_{ij}(s) x_j(t-s)\,ds \Big) + I_i(t), \quad t \ge 0,   (1)

where x_i(t) is the state of the i-th unit at time t and the a_{ij} are constants. For simplicity, let D be an n x n constant diagonal matrix with diagonal elements d_i > 0, i = 1, 2, ..., n, let A = (a_{ij}) be the n x n constant interconnection matrix, and let f(x) = (f_1(x_1), f_2(x_2), ..., f_n(x_n))^T : R^n -> R^n be a nonlinear vector-valued activation function, where x = (x_1, x_2, ..., x_n)^T in R^n. Further, I(t) = (I_1(t), I_2(t), ..., I_n(t))^T in R^n is a periodic input vector function with period omega, i.e., there exists a constant omega > 0 such that I_i(t + omega) = I_i(t) (i = 1, 2, ..., n) for all t >= 0. Suppose also that f is GLC in R^n, i.e., for each j in {1, 2, ..., n}, f_j : R -> R is globally Lipschitz continuous with Lipschitz constant L_j, so that |f_j(u) - f_j(v)| <= L_j |u - v| for all u, v in R. The K_{ij}, i, j = 1, 2, ..., n, are the delay kernels, which are assumed to satisfy the following conditions simultaneously:

(i) K_{ij} : [0, infinity) -> [0, infinity);
(ii) K_{ij} are bounded and continuous on [0, infinity);
(iii) \int_0^{\infty} K_{ij}(s)\,ds = 1;
(iv) there exists a positive number mu such that \int_0^{\infty} K_{ij}(s) e^{\mu s}\,ds < \infty.
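Conditions (iii) and (iv) are easy to verify numerically for a concrete kernel. The sketch below is our own illustration (the exponential kernel K(s) = e^{-s} and the choice mu = 0.5 are assumptions, not taken from the paper); it checks that the kernel has unit mass and that its mu-weighted integral is finite:

```python
import math

# Numerically check kernel conditions (iii) and (iv) for K(s) = exp(-s),
# a common kernel choice (our assumption, not from the paper).
def trapezoid(f, a, b, n):
    """Composite trapezoid rule for integrating f over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h

K = lambda s: math.exp(-s)
mu = 0.5  # any mu < 1 works for this kernel

# Condition (iii): the kernel integrates to 1 (tail truncated at s = 50).
i3 = trapezoid(K, 0.0, 50.0, 50000)
# Condition (iv): the weighted integral is finite; here it equals
# 1/(1 - mu) = 2 exactly.
i4 = trapezoid(lambda s: K(s) * math.exp(mu * s), 0.0, 50.0, 50000)

print(i3, i4)  # i3 close to 1, i4 close to 2
```

For a kernel decaying like e^{-cs}, any mu < c satisfies (iv); heavier-tailed kernels may satisfy (iii) but fail (iv).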
The literature [22] gives some examples of kernels that meet the above conditions. The initial conditions associated with system (1) are given by x_i(s) = gamma_i(s), s <= 0, i = 1, 2, ..., n, where gamma_i is bounded and continuous on (-infinity, 0]. As a special case of the neural system (1), the delayed neural networks with constant input vector I = (I_1, I_2, ..., I_n)^T in R^n have been studied widely in [22]. This system is described by the following functional differential equations:
\dot{x}_i(t) = -d_i x_i(t) + \sum_{j=1}^{n} a_{ij} f_j\Big( \int_0^{\infty} K_{ij}(s) x_j(t-s)\,ds \Big) + I_i, \quad t \ge 0.   (2)
Let x* = (x_1*, x_2*, ..., x_n*) be the equilibrium point of system (2), i.e.,

d_i x_i^* = \sum_{j=1}^{n} a_{ij} f_j(x_j^*) + I_i.   (3)
If the delay kernel functions K_{ij}(s) are of the form K_{ij}(s) = delta(s - tau_{ij}), i, j = 1, 2, ..., n, where the tau_{ij} > 0 are discrete delays, then system (2) reduces to a Hopfield neural network with discrete delays [15]. Therefore the discrete-delay case can be included in our models by choosing suitable kernel functions. For the proof of the main results in this paper, the following Halanay-type inequality is needed. In comparison with Lyapunov functionals, Halanay-type inequalities are rarely used in the literature on stability of dissipative dynamical systems, in spite of their possible generalizations and applications (see [22] and the references therein for details). This inequality is restated from [22] as follows. Lemma 1. Let v(t) > 0 for t in R, tau in [0, infinity) and t_0 in R. Assume
\frac{d}{dt} v(t) \le -a v(t) + b \Big( \sup_{s \in [t-\tau,\, t]} v(s) \Big) \quad \text{for } t > t_0.

If a > b > 0, then there exist constants gamma > 0 and k > 0 such that v(t) <= k e^{-gamma (t - t_0)} for t > t_0.
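Lemma 1 can be illustrated numerically. The following sketch is our own (the constants a = 2, b = 1, tau = 1 are arbitrary choices satisfying a > b > 0); it integrates the worst case v'(t) = -a v(t) + b sup_{s in [t-tau, t]} v(s) by forward Euler and observes the exponential decay the lemma predicts:

```python
# Forward-Euler simulation illustrating the Halanay inequality: with
# a > b > 0, a function obeying v'(t) = -a*v(t) + b*sup over [t-tau, t]
# decays exponentially. Constants below are illustrative choices.
a, b, tau = 2.0, 1.0, 1.0
dt = 0.001
hist = int(tau / dt)

v = [1.0] * (hist + 1)  # constant initial history on [-tau, 0]
steps = int(10.0 / dt)
for _ in range(steps):
    window_sup = max(v[-(hist + 1):])  # sup of v over [t - tau, t]
    v.append(v[-1] + dt * (-a * v[-1] + b * window_sup))

print(v[-1])  # small positive value: exponential decay over t in [0, 10]
```

The observed decay rate matches the gamma solving gamma = a - b e^{gamma tau}, which for these constants is roughly 0.44.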
3 Main Results

In this section, let C = C([-tau, 0], R^n) be the Banach space of all continuous functions from [-tau, 0] to R^n with the topology of uniform convergence. Given any phi, varphi in C, let x(t, phi) = (x_1(t, phi), ..., x_n(t, phi))^T and x(t, varphi) = (x_1(t, varphi), ..., x_n(t, varphi))^T be the solutions of (1) starting from phi and varphi, respectively. Using the Lyapunov functional method, similarly to the proof of Theorem 1 in [9], we can easily obtain the following theorem.

Theorem 1. Suppose f is GLC. If the system parameters d_i, a_{ij} (i, j = 1, 2, ..., n) satisfy

d_i - L_i \sum_{j=1}^{n} |a_{ji}| > 0, \quad i = 1, \ldots, n,   (4)
then for every periodic input I(t), the delayed neural system (1) is exponentially periodic.
Using Lemma 1, we employ a Halanay-type inequality method to establish another set of sufficient conditions for the network (1) to converge exponentially towards its periodic solution.

Theorem 2. Suppose f is GLC. If the system parameters d_i, a_{ij} (i, j = 1, 2, ..., n) satisfy

d_i - \sum_{j=1}^{n} |a_{ij}| L_j > 0, \quad i = 1, \ldots, n,   (5)
then for every periodic input I(t), the delayed neural system (1) is exponentially periodic.

Remark 1. When I = (I_1, I_2, ..., I_n) is a constant vector, then for any constant T >= 0 we have I = I(t + T) = I(t) for t >= 0. Thus, by the results obtained, when the sufficient conditions of Theorem 1 or Theorem 2 are satisfied, the unique periodic solution is a periodic solution with any positive constant as its period. The periodic solution therefore reduces to a constant solution, that is, an equilibrium point, and all other solutions globally exponentially converge to this equilibrium point as t -> +infinity. In other words, the delayed neural system (2) has a unique, globally exponentially stable equilibrium point. Obviously, the results obtained are consistent with the exponential stability results recently reported in [11,15,16,20].
4 Examples

In this section, we give two examples to illustrate our results; they also show that the conditions of the two theorems are independent of each other.

Example 1. Consider the dynamical system (1) with d_1 = 3, d_2 = 4, a_11 = 1, a_22 = 2, a_12 = 4, a_21 = 2, L_1 = 1/2, L_2 = 3/4. For Theorem 2, we can check that

d_1 = 3 < \sum_{j=1}^{2} |a_{1j}| L_j = 3.5.
Hence the sufficient conditions of Theorem 2 are not satisfied. However, it is easy to verify that all the hypotheses of Theorem 1 are satisfied. Therefore, the criteria of Theorem 1 apply.

Example 2. Consider the dynamical system (1) with d_1 = 4, d_2 = 4, a_11 = 1/2, a_22 = 1/5, a_12 = 1/3, a_21 = 1/3, L_1 = 8, L_2 = 12. For Theorem 1, we can check that

d_1 = 4 < L_1 \sum_{j=1}^{2} |a_{j1}| = 20/3.
Hence the sufficient conditions in Theorem 1 are not satisfied. However, it is easy to verify that all the hypotheses of Theorem 2 are satisfied. Therefore, the criteria in Theorem 2 are applicable.
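The two displayed inequalities in Examples 1 and 2 can be reproduced with a few lines of code. The helper names below are ours; the margin functions evaluate the left-hand sides of the conditions of Theorems 1 and 2 for a given index i:

```python
# Evaluate the theorem conditions from Examples 1 and 2. Theorem 2's
# condition uses row sums, d_i - sum_j |a_ij| L_j; Theorem 1's uses
# column sums, d_i - L_i sum_j |a_ji|, as printed in the text.
def thm1_margin(d, a, L, i):
    return d[i] - L[i] * sum(abs(a[j][i]) for j in range(len(d)))

def thm2_margin(d, a, L, i):
    return d[i] - sum(abs(a[i][j]) * L[j] for j in range(len(d)))

# Example 1 data
d1, a1, L1 = [3.0, 4.0], [[1.0, 4.0], [2.0, 2.0]], [0.5, 0.75]
# Example 2 data
d2, a2, L2 = [4.0, 4.0], [[0.5, 1/3], [1/3, 0.2]], [8.0, 12.0]

# Example 1: sum_j |a_1j| L_j = 3.5 > d_1 = 3, so Theorem 2 fails at i = 1.
print(thm2_margin(d1, a1, L1, 0))  # -0.5
# Example 2: L_1 sum_j |a_j1| = 20/3 > d_1 = 4, so Theorem 1 fails at i = 1.
print(thm1_margin(d2, a2, L2, 0))
```

Both margins are negative, confirming the two violated inequalities quoted in the text.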
5 Conclusions

In this paper, a general class of neural networks with distributed delays and periodic inputs is investigated. Exponential periodicity of the neural systems has been established using a Lyapunov functional and a Halanay-type inequality, respectively, and easily checked conditions ensuring exponential periodicity are obtained.
References
1. Bouzerdoum, A., Pattison, T.: Neural Network for Quadratic Optimization with Bound Constraints. IEEE Trans. Neural Networks 4 (1993) 293-303
2. Forti, M., Nistri, P.: Global Convergence of Neural Networks with Discontinuous Neuron Activations. IEEE Trans. Circuits and Systems I 50(11) (2003) 1421-1435
3. Forti, M., Tesi, A.: New Conditions for Global Stability of Neural Networks with Application to Linear and Quadratic Programming Problems. IEEE Trans. Circuits and Systems I 42 (1995) 354-366
4. Venetianer, P., Roska, T.: Image Compression by Delayed CNNs. IEEE Trans. Circuits Syst. I 45 (1998) 205-215
5. Sun, C., Feng, C.: Neural Networks for Nonconvex Nonlinear Programming Problems: A Switching Control Approach. In: Wang, J., Liao, X., Yi, Z. (eds.): Advances in Neural Networks. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag, Berlin Heidelberg (2005) 694-699
6. Morita, M.: Associative Memory with Nonmonotone Dynamics. Neural Networks 6 (1993) 115-126
7. Tank, D., Hopfield, J.: Simple "Neural" Optimization Networks: An A/D Converter, Signal Decision Circuit, and a Linear Programming Circuit. IEEE Trans. Circuits and Systems 33 (1986) 533-541
8. Kennedy, M., Chua, L.: Neural Networks for Linear and Nonlinear Programming. IEEE Trans. Circuits and Systems 35 (1988) 554-562
9. Liao, X., Wong, K., Yang, S.: Convergence Dynamics of Hybrid Bidirectional Associative Memory Neural Networks with Distributed Delays. Physics Letters A 316 (2003) 55-64
10. Cao, J., Wang, J.: Global Asymptotic Stability of Recurrent Neural Networks with Lipschitz-continuous Activation Functions and Time-Varying Delays. IEEE Trans. Circuits Syst. I 50 (2003) 34-44
11. Liao, X., Wang, J.: Algebraic Criteria for Global Exponential Stability of Cellular Neural Networks with Multiple Time Delays. IEEE Trans. Circuits Systems I 50(2) (2003) 268-275
12. Sun, C., Feng, C.: Exponential Periodicity of Continuous-time and Discrete-time Neural Networks with Delays. Neural Processing Letters 19(2) (2004) 131-146
13. Sun, C., Feng, C.: On Robust Exponential Periodicity of Interval Neural Networks with Delays. Neural Processing Letters 20(1) (2004) 53-61
14. Sun, C., Feng, C.: Exponential Periodicity and Stability of Delayed Neural Networks. Mathematics and Computers in Simulation 66 (2004) 469-478
15. Sun, C., Zhang, K., Fei, S., Feng, C.: On Exponential Stability of Delayed Neural Networks with a General Class of Activation Functions. Physics Letters A 298 (2002) 122-132
16. Sun, C., Feng, C.: Global Robust Exponential Stability of Interval Neural Networks with Delays. Neural Processing Letters 17(1) (2003) 107-115
17. Zhang, J., Jin, X.: Global Stability Analysis in Delayed Hopfield Neural Network Models. Neural Networks 13 (2000) 745-753
18. Wang, L.: Stability of Cohen-Grossberg Neural Networks with Distributed Delays. Applied Mathematics and Computation 160 (2005) 93-110
19. Zeng, Z., Wang, J., Liao, X.: Global Exponential Stability of a General Class of Recurrent Neural Networks with Time-Varying Delays. IEEE Trans. Circuits System I 50 (2003) 1353-1359
20. Yi, Z., Tan, K.: Convergence Analysis of Recurrent Neural Networks. Kluwer Academic Publishers, Dordrecht (2003)
21. Mohamad, S.: Global Exponential Stability in Continuous-time and Discrete-time Delay Bi-directional Neural Networks. Physica D 159 (2001) 233-251
22. Mohamad, S., Gopalsamy, K.: Exponential Stability of Continuous-time and Discrete-time Cellular Neural Networks with Delays. Applied Mathematics and Computation 135 (2003) 17-38
On Equilibrium and Stability of a Class of Neural Networks with Mixed Delays

Shuyong Li^1, Yumei Huang^{2,3}, and Daoyi Xu^2

1 College of Mathematics and Software Science, Sichuan Normal University, Chengdu 610066, China
[email protected]
2 Institute of Mathematics, Sichuan University, Chengdu 610064, China
{huangyumei181, daoyixu}@163.com
3 School of Mathematics and Computer Engineering, Xihua University, Pixian 610039, China
Abstract. In this paper the authors analyze the existence and global asymptotic stability of the equilibrium for neural networks with mixed delays. Some new sufficient conditions ensuring the existence, uniqueness, and global asymptotic stability of the equilibrium are established by means of the Leray-Schauder principle, the arithmetic-mean-geometric-mean inequality, and a vector delay differential inequality technique. These conditions are less restrictive than previously known criteria.
1 Introduction

In the design of delayed neural networks (DNNs), stability is a main concern. In many applications, such as linear programming or pattern recognition, it is desired to design a DNN with a unique equilibrium that is globally asymptotically stable (GAS). Therefore, the GAS of the equilibrium of DNNs has become a subject of intense research activity [1],[2]. Among existing research works, some consider discrete delays [1] and others continuous delays [2]. To the best of our knowledge, neural networks with mixed delays (both continuous and discrete) are seldom considered. However, mixed delays are more common in practice [3]. Therefore, the study of neural networks with mixed delays is more important than that of networks with only continuous or only discrete delays.

Consider a class of neural networks with mixed delays described by the following nonlinear integro-differential equations [3],[7]:

\dot{u}_i(t) = -\mu_i u_i(t) + \sum_{j=1}^{n} \Big[ a_{ij} f_j(u_j(t - \tau_j(t))) + b_{ij} g_j\Big( \int_{-\infty}^{t} k_j(t-s) u_j(s)\,ds \Big) \Big] + I_i, \quad t \ge t_0,   (1)

u_i(t) = \phi_i(t), \quad t \in (-\infty, t_0], \ t_0 \ge 0, \ i = 1, \cdots, n,

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 141-146, 2006. © Springer-Verlag Berlin Heidelberg 2006
142
S. Li, Y. Huang, and D. Xu
where u_i(t) is the state, f_j and g_j denote the activation functions, a_{ij} and b_{ij} represent the constant connection weights, I_i is the external bias, mu_i > 0 represents the decay rate, k_j is the refractoriness, and the delay tau_j(t) >= 0 is continuous with t - tau_j(t) >= r(t) and r(t) -> infinity as t -> infinity. The initial function phi_i (i = 1, ..., n) is bounded and continuous on (-infinity, 0]. In this paper, the f_j's, g_j's, and k_j's satisfy the following conditions.

(A1) There exist K_j > 0 and L_j > 0 such that for any u, v in R, |f_j(u) - f_j(v)| <= K_j |u - v| and |g_j(u) - g_j(v)| <= L_j |u - v|, j = 1, ..., n.

(A2) Each k_j satisfies |k_j| in L^1[0, infinity), i.e., there is a number c_j such that \int_0^{\infty} |k_j(s)|\,ds = c_j > 0, j = 1, ..., n.
2 Existence of Equilibrium

Lemma 1. Let U be a nonempty bounded open set in a Banach space X, with 0 in U and boundary ∂U. Assume that the operator h : Ū ⊂ X -> R^n is compact. If there is an i such that

h_i(u) sgn u_i > 0, \quad \forall u \in \partial U,   (2)

then the equation h(u) = 0 has a solution u on Ū.

Proof. Define the mapping T(u) = u - h(u). Then

u - \lambda T(u) = \lambda h(u) + (1 - \lambda) u   (3)

is a homotopic mapping (or homotopy [4]), and for all u in ∂U,

|u_i - \lambda T_i(u)| = (1 - \lambda) |u_i| + \lambda h_i(u)\,\text{sgn}\, u_i.   (4)

Condition (2) implies that for all u in ∂U we have u_i ≠ 0 and

|u_i - \lambda T_i(u)| \ge (1 - \lambda)|u_i| > 0 \ \text{if } \lambda \in [0, 1); \qquad |u_i - \lambda T_i(u)| \ge h_i(u)\,\text{sgn}\, u_i > 0 \ \text{if } \lambda = 1.   (5)

That is, lambda T(u) ≠ u for all (u, lambda) in ∂U x [0, 1]. By the Leray-Schauder principle [4, pp. 556], T has a fixed point u* on Ū. This implies that

T(u*) = u* - h(u*) = u*, \ \text{i.e., } h(u*) = 0, \ u* \in Ū,

and the proof is completed.

Lemma 2 (Arithmetic-mean-geometric-mean inequality [5, p. 13]). For x_i >= 0, alpha_i > 0 and \sum_{i=1}^{n} \alpha_i = 1,

\prod_{i=1}^{n} x_i^{\alpha_i} \le \sum_{i=1}^{n} \alpha_i x_i,   (6)
the sign of equality holding if and only if x_i = x_j for all i, j = 1, ..., n.

By M we denote the class of M-matrices [6]. For any real matrices A = (a_{ij}) and B = (b_{ij}), we write A >= B if a_{ij} >= b_{ij} for all i, j = 1, ..., n.

Theorem 1. Suppose that conditions (A1) and (A2) are satisfied and that the following condition (A3) holds:

(A3) There exist positive constants r_{kij}, k = 1, 2, 3, 4, i, j = 1, ..., n, and alpha >= 1 such that Delta - M in M, where Delta = diag{delta_1, ..., delta_n}, M = (m_{ij}), and

\delta_i = \mu_i \ \text{for } \alpha = 1; \qquad \delta_i = \alpha \mu_i - (\alpha - 1) \sum_{j=1}^{n} \big( |a_{ij}| K_j r_{1ij}^{-1} r_{2ij} + |b_{ij}| L_j c_j r_{3ij}^{-1} r_{4ij} \big) \ \text{for } \alpha > 1,   (7)

m_{ij} = |a_{ij}| K_j + |b_{ij}| L_j c_j \ \text{for } \alpha = 1; \qquad m_{ij} = |a_{ij}| K_j r_{1ij}^{\alpha-1} r_{2ij}^{1-\alpha} + |b_{ij}| L_j c_j r_{3ij}^{\alpha-1} r_{4ij}^{1-\alpha} \ \text{for } \alpha > 1.   (8)

Then system (1) has at least one equilibrium u*.

Proof. We know that u* is an equilibrium of Eq. (1) if and only if u* = (u_1*, ..., u_n*)^T is a solution of the equations
\mu_i u_i = \sum_{j=1}^{n} \Big[ a_{ij} f_j(u_j) + b_{ij} g_j\Big( \int_{-\infty}^{t} k_j(t-s) u_j\,ds \Big) \Big] + I_i \triangleq F_i(u), \quad i = 1, \cdots, n.   (9)

For i = 1, ..., n, we let

h_i(u) = (\alpha |u_i|^{\alpha-1} + 1)(\mu_i u_i - F_i(u)),   (10)

and h(u) = [h_1(u), ..., h_n(u)]^T. Then equations (9) and h(u) = 0 have exactly the same solutions. From (A1) and (9), we have

|F_i(u)| \le \sum_{j=1}^{n} (|a_{ij}| K_j |u_j| + |b_{ij}| L_j c_j |u_j|) + \xi_i \quad \Big( = \sum_{j=1}^{n} \eta_{ij} |u_j| + \xi_i \Big),   (11)

where

\xi_i = \sum_{j=1}^{n} (|a_{ij}||f_j(0)| + |b_{ij}||g_j(0)|) + |I_i|, \qquad \eta_{ij} = |a_{ij}| K_j + |b_{ij}| L_j c_j.   (12)

For alpha > 1, by using (7), (8), (11) and Lemma 2, we obtain

\alpha |u_i|^{\alpha-1} |F_i(u)| \le \sum_{j=1}^{n} \Big[ |a_{ij}| K_j r_{1ij}^{-1} r_{2ij}^{1-\alpha} \big( r_{1ij}^{\alpha} |u_j|^{\alpha} + (\alpha-1) r_{2ij}^{\alpha} |u_i|^{\alpha} \big) + |b_{ij}| L_j c_j r_{3ij}^{-1} r_{4ij}^{1-\alpha} \big( r_{3ij}^{\alpha} |u_j|^{\alpha} + (\alpha-1) r_{4ij}^{\alpha} |u_i|^{\alpha} \big) \Big] + \alpha \xi_i |u_i|^{\alpha-1}
= (\alpha \mu_i - \delta_i) |u_i|^{\alpha} + \sum_{j=1}^{n} m_{ij} |u_j|^{\alpha} + \alpha \xi_i |u_i|^{\alpha-1}.   (13)
When alpha = 1, from (7), (8) and (12) we have delta_i = mu_i and m_{ij} = eta_{ij}, which implies that (13) also holds for alpha = 1. Combining (10), (11) and (13), we have

h_i(u)\,\text{sgn}\, u_i \ge \delta_i |u_i|^{\alpha} + \mu_i |u_i| - \sum_{j=1}^{n} [m_{ij} |u_j|^{\alpha} + \eta_{ij} |u_j|] - \alpha \xi_i |u_i|^{\alpha-1} - \xi_i
\ge \delta_i |u_i|^{\alpha} - \sum_{j=1}^{n} [m_{ij} |u_j|^{\alpha} + \eta_{ij} |u_j|] - \alpha \xi_i |u_i|^{\alpha-1} - \xi_i \ \text{for } \alpha > 1; \qquad \ge 2\delta_i |u_i| - 2 \sum_{j=1}^{n} m_{ij} |u_j| - 2\xi_i \ \text{for } \alpha = 1.   (14)

Let h̄_i(u) = h_i(u) sgn u_i, h̄(u) = [h̄_1(u), ..., h̄_n(u)]^T, H = (eta_{ij}), Xi = diag{xi_1, ..., xi_n} and xi = [xi_1, ..., xi_n]^T. Thus, we can rewrite the above inequality in matrix-vector form:

h̄(u) \ge (\Delta - M)[u^{\alpha}]^+ - H[u]^+ - \alpha \Xi [u^{\alpha-1}]^+ - [\xi]^+ \ \text{for } \alpha > 1; \qquad h̄(u) \ge 2(\Delta - M)[u]^+ - 2[\xi]^+ \ \text{for } \alpha = 1,   (15)

where [u^{alpha}]^+ = [|u_1|^{alpha}, ..., |u_n|^{alpha}]^T for alpha > 0. From condition (A3) and the properties of M-matrices, there exists a positive n x n diagonal matrix D = diag{d_1, d_2, ..., d_n} such that the matrix A ≜ D(Delta - M) + (Delta - M)^T D is positive definite, i.e., the minimal eigenvalue lambda(A) of A is positive. By means of inequality (15), together with [u^{alpha}]^+ >= 0 and D >= 0, we can easily obtain

2 \sum_{i=1}^{n} d_i |u_i|^{\alpha} h̄_i(u) = ([u^{\alpha}]^+)^T D h̄(u) + h̄^T(u) D [u^{\alpha}]^+
\ge \Gamma(u) \triangleq \lambda(A) \|u^{\alpha}\|_E^2 - 2 \|u^{\alpha}\|_E \|D\|_E \big( \|H\|_E \|u\|_E + \alpha \|\Xi\|_E \|u^{\alpha-1}\|_E + \|\xi\|_E \big) \ \text{for } \alpha > 1; \qquad \Gamma(u) = 2\lambda(A)\|u\|_E^2 - 4\|u\|_E \|D\|_E \|\xi\|_E \ \text{for } \alpha = 1,   (16)

where ||.||_E is the Euclidean norm. It follows from (16) that Gamma(u) -> infinity as ||u||_E -> infinity. Then there is an R_0 such that for all ||u||_E >= R_0,

2 \sum_{i=1}^{n} d_i |u_i|^{\alpha} h̄_i(u) \ge \Gamma(u) > 0.   (17)

Consider the set U(R_0) = {u : ||u||_E < R_0}. It is easy to verify that U(R_0) is a nonempty bounded open set with 0 in U(R_0), and that h defined by (10) is compact on Ū(R_0). From (17), there is at least an i such that

h̄_i(u) = h_i(u)\,\text{sgn}\, u_i > 0 \ \text{for all } u \in \partial U(R_0).   (18)

By Lemma 1, we conclude that the equation h(u) = 0, i.e., (9), has at least a solution in Ū(R_0). Hence, the neural network model (1) has at least one equilibrium.

Remark 1. When alpha > 1, the existing results on the existence of an equilibrium always require the boundedness of the activation functions [1],[2]. This boundedness condition has been dropped in our results.
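Condition (A3) is straightforward to check numerically. The sketch below is our own minimal 2x2 illustration for the case alpha = 1 (all parameter values are invented for the example, not taken from the paper); it forms W = Delta - M and tests the M-matrix property via positive leading principal minors:

```python
# Check condition (A3) for alpha = 1 on a small sample network: the
# matrix W = Delta - M, with m_ij = |a_ij|*K_j + |b_ij|*L_j*c_j, must be
# an M-matrix (nonpositive off-diagonals, positive leading minors).
# All parameter values below are our own illustration.
n = 2
mu = [3.0, 4.0]                       # decay rates mu_i
a = [[0.5, 0.4], [0.3, 0.6]]          # connection weights a_ij
b = [[0.2, 0.1], [0.1, 0.2]]          # connection weights b_ij
K = [1.0, 1.0]                        # Lipschitz constants of f_j
L = [1.0, 1.0]                        # Lipschitz constants of g_j
c = [1.0, 1.0]                        # kernel masses c_j

m = [[abs(a[i][j]) * K[j] + abs(b[i][j]) * L[j] * c[j] for j in range(n)]
     for i in range(n)]
W = [[(mu[i] if i == j else 0.0) - m[i][j] for j in range(n)]
     for i in range(n)]

minor1 = W[0][0]
minor2 = W[0][0] * W[1][1] - W[0][1] * W[1][0]
is_m_matrix = minor1 > 0 and minor2 > 0
print(W, is_m_matrix)  # W = [[2.3, -0.5], [-0.4, 3.2]], True
```

For this choice of parameters W = [[2.3, -0.5], [-0.4, 3.2]], whose leading minors 2.3 and 7.16 are positive, so Theorem 1 guarantees an equilibrium.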
3 Uniqueness of Equilibrium and Its Stability
Let K∞ be the class of continuous functions phi : R^+ -> R^+ with phi(0) = 0, phi(s) > 0 for s > 0, phi strictly increasing, and phi(s) -> +infinity as s -> +infinity. C = C[(-infinity, 0], R^n] is the Banach space of continuous bounded functions phi : (-infinity, 0] -> R^n with ||phi|| = sup_{-infinity < s <= 0} |phi(s)|, where |.| is any norm on R^n. Equations (1) can be written as retarded functional differential equations

\dot{x}_i(t) = f_i(t, x_t), \quad i = 1, \cdots, n,   (19)

where x_t = [x_{1t}, ..., x_{nt}]^T in C is defined by x_{it}(s) = x_i(t + s) for s in (-infinity, 0] and i = 1, ..., n, and f_i : R x C -> R is continuous with f_i(t, 0) ≡ 0. From Theorem 1 in [7], we can easily get the following lemma.

Lemma 3. Suppose there exist functions v_i : R^+ x R -> R^+ with the following properties: v_i(t) ≡ v_i(t, x_i) is continuous and locally Lipschitzian in x_i, w_{1i}(|x_i|) <= v_i(t, x_i) <= w_{2i}(|x_i|), and the Dini derivative of v_i(t) along equations (19) satisfies

D^+ v_i(t, x_i) \le -\delta_i v_i + \sum_{j=1}^{n} p_{ij} \|v_{jt}\|_r + \sum_{j=1}^{n} q_{ij} \int_{-\infty}^{t} \psi_i(t-s) \|v_{js}\|_r\,ds,   (20)

where delta_i, p_{ij}, q_{ij} in R^+ are constants, w_{1i}, w_{2i} in K∞, psi_i >= 0 and psi_i in L^1[0, infinity), ||v_{jt}||_r = sup_{r <= u <= t} |v_j(u)|, with r = r(t) <= t and r(t) -> infinity as t -> infinity. If

\Delta - M \in \mathcal{M},   (21)

where M = (m_{ij}) and m_{ij} = p_{ij} + q_{ij} \int_0^{\infty} \psi_i(s)\,ds, then the equilibrium x_i* = 0 of system (19) is globally asymptotically stable.

Theorem 2. Assume that assumptions (A1), (A2) and (A3) hold. Then system (1) has a unique equilibrium u*, which is globally asymptotically stable.

Proof. Since assumptions (A1), (A2) and (A3) hold, system (1) has an equilibrium u*. Let y(t) = u(t) - u*; then system (1) can be written as

\dot{y}_i(t) = -\mu_i y_i(t) + \sum_{j=1}^{n} \Big[ a_{ij} \bar{f}_j(y_j(t - \tau_j(t))) + b_{ij} \bar{g}_j\Big( \int_{-\infty}^{t} k_j(t-s) y_j(s)\,ds \Big) \Big], \quad i = 1, \cdots, n,   (22)

where \bar{f}_j(y_j) = f_j(y_j + u_j*) - f_j(u_j*) and

\bar{g}_j\Big( \int_{-\infty}^{t} k_j(t-s) y_j(s)\,ds \Big) = g_j\Big( \int_{-\infty}^{t} k_j(t-s)(y_j(s) + u_j^*)\,ds \Big) - g_j(c_j u_j^*).

Then \bar{f}_j(0) = \bar{g}_j(0) = 0 and system (22) has an equilibrium at y = 0. To apply Lemma 3, we let v_i(t) = |y_i(t)|^{alpha} for alpha >= 1 and

p_{ij} = |a_{ij}| K_j r_{1ij}^{\alpha-1} r_{2ij}^{1-\alpha} \ \text{for } \alpha > 1, \ p_{ij} = |a_{ij}| K_j \ \text{for } \alpha = 1; \qquad q_{ij} = |b_{ij}| L_j r_{3ij}^{\alpha-1} r_{4ij}^{1-\alpha} \ \text{for } \alpha > 1, \ q_{ij} = |b_{ij}| L_j \ \text{for } \alpha = 1.   (23)
Calculating the Dini derivative of v_i(t) along the solution of equation (22), from condition (A1) we have

D^+ v_i(t) \le \alpha \Big[ -\mu_i |y_i(t)|^{\alpha} + |y_i(t)|^{\alpha-1} \sum_{j=1}^{n} \Big( |a_{ij}| K_j |y_j(t - \tau_j(t))| + |b_{ij}| L_j \int_{-\infty}^{t} |k_j(t-s)| |y_j(s)|\,ds \Big) \Big].   (24)

For alpha > 1, using (24) and Lemma 2, we have

D^+ v_i(t) \le -\alpha \mu_i v_i(t) + \sum_{j=1}^{n} \Big[ |a_{ij}| K_j r_{1ij}^{-1} r_{2ij}^{1-\alpha} \big( r_{1ij}^{\alpha} \|v_{jt}\|_r + (\alpha-1) r_{2ij}^{\alpha} v_i(t) \big) + |b_{ij}| L_j r_{3ij}^{-1} r_{4ij}^{1-\alpha} \int_{-\infty}^{t} |k_j(t-s)| \big( r_{3ij}^{\alpha} \|v_{js}\|_r + (\alpha-1) r_{4ij}^{\alpha} v_i(t) \big) ds \Big]
= -\delta_i v_i(t) + \sum_{j=1}^{n} \Big[ p_{ij} \|v_{jt}\|_r + q_{ij} \int_{-\infty}^{t} |k_j(t-s)| \|v_{js}\|_r\,ds \Big].   (25)

This implies that v_i(t) satisfies condition (20) with psi_i(s) = |k_i(s)|. It is obvious that (25) also holds for alpha = 1. On the other hand, condition (21) can be obtained by using assumptions (A2) and (A3) together with (23). Now, by Lemma 3, the trivial solution of equation (22) is globally asymptotically stable, and therefore the equilibrium u* is globally asymptotically stable. Since all solutions of (1) tend toward u* as t -> infinity, system (1) has a unique equilibrium.
Acknowledgments The work is supported by the National Natural Science Foundation of China under Grant 10371083.
References
1. Cao, J.D.: A Set of Stability Criteria for Delayed Cellular Neural Networks. IEEE Trans. Circuits Syst. I 48 (2001) 494-498
2. Guo, S., Huang, L.: Exponential Stability and Periodic Solutions of Neural Networks with Continuously Distributed Delays. Phys. Rev. E 67 (2003) 011902
3. Kolmanovskii, V., Myshkis, A.: Introduction to the Theory and Applications of Functional Differential Equations. Kluwer Academic Publishers, Dordrecht (1999)
4. Zeidler, E.: Nonlinear Functional Analysis and its Applications. I: Fixed-Point Theorems. Springer-Verlag, New York (1986)
5. Beckenbach, E.F., Bellman, R.: Inequalities. Springer-Verlag, New York (1961)
6. Berman, A., Plemmons, R.J.: Nonnegative Matrices in the Mathematical Sciences. Academic Press, New York (1979)
7. Xu, D.Y.: Integro-Differential Equations and Delay Integral Inequalities. Tohoku Math. J. 44 (1992) 365-378
Stability Analysis of Neutral Neural Networks with Time Delay

Hanlin He^1 and Xiaoxin Liao^2

1 College of Sciences, Naval University of Engineering, Wuhan, Hubei 430033, China
[email protected]
2 Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
Abstract. A method is proposed for asymptotic stability analysis of neutral neural networks with delayed state. The delay is assumed unknown, but constant. Sufficient conditions on delay-independent asymptotic stability are given in terms of the existence of symmetric and positive definite solutions of a constant Riccati algebraic matrix equation coupled with a discrete Lyapunov equation.
1 Introduction

The stability of time-delay neural networks is of practical and theoretical interest, since the existence of a delay in a neural network may induce instability, and various results have been reported [1, 3, 5-15]. However, due to the complicated dynamic properties of neural cells, in many cases the existing neural network models cannot characterize the properties of a neural reaction process precisely. A natural and important problem is how to further describe and model the dynamic behavior of such complex neural reactions. Such problems naturally introduce neutral neural networks. For example, in biochemistry experiments, neural information may transfer across chemical reactivity, which results in a neutral-type process [2]. A different example is proposed in [8], where neutral phenomena exist in large-scale integrated circuits. Some criteria for mean-square exponential stability and asymptotic stability of stochastic neutral neural networks are provided in [13-14].

This paper considers the stability of a particular class of neutral neural networks with time delay. We are interested in giving delay-independent stability conditions. By solving an appropriate algebraic Riccati equation, based on Lyapunov-Razumikhin type theory and an appropriate Lyapunov functional, this paper obtains a sufficient criterion for global asymptotic stability of neutral neural networks with time delays.

The organization of this paper is as follows: Section 2 gives a stability result for a functional differential equation of neutral type. Section 3 gives the description of the neutral neural networks considered in this paper and derives the stability result.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 147-152, 2006. © Springer-Verlag Berlin Heidelberg 2006
148
H. He and X. Liao
Sections 4 and 5 give the singular value test and the LMI formulation for the stability result derived in Section 3. Section 6 gives an example illustrating the stability result. The conclusions in Section 7 end this paper.
2 Preliminaries: Stability Theory

Throughout this paper, the following notation is used. lambda_min(P) and lambda_max(P) denote the minimal and maximal eigenvalues of a matrix P. C_tau = C([-tau, 0], R^n) denotes the Banach space of continuous vector functions mapping the interval [-tau, 0] into R^n with the topology of uniform convergence. ||.|| refers to either the Euclidean vector norm or the induced matrix 2-norm. Consider the following functional differential equation of neutral type:

\frac{d}{dt}[D(x_t)] = f(x_t)   (1)

with an appropriate initial condition

x(t_0 + \theta) = \phi(\theta), \quad \forall \theta \in [-\tau, 0], \ (t_0, \phi) \in R^+ \times C_\tau,   (2)

where D : D(x_t) = x(t) - Cx(t - tau). We say that the operator D is stable if the zero solution of the corresponding homogeneous difference equation is uniformly asymptotically stable. Then we have the following lemma.

Lemma 1 [4]. Suppose D is stable, f : R x C_tau -> R^n takes bounded sets of C_tau into bounded sets of R^n, and suppose u(s), v(s) and w(s) are continuous, nonnegative and nondecreasing functions with u(s), v(s) > 0 for s ≠ 0 and u(0) = v(0) = 0. If there is a continuous differentiable functional V(x_t) : R x C_tau -> R such that

1) u(||D(x_t)||) <= V(x_t) <= v(||x_t||),
2) \dot{V}(x_t) <= -w(||x_t||),

where x_t = x(t + theta), then the solution x = 0 of the neutral equation (1)-(2) is uniformly asymptotically stable.
3 Main Result

We consider the neutral neural networks with linear and nonlinear time delay described by the differential-difference equation

\dot{x}(t) - C \dot{x}(t - \tau) = A x(t) + A_d x(t - \tau) + B f(x(t - \tau)),   (3)
x(t) = \phi(t), \quad t \in [-\tau, 0],

where x(t) in R^n is the state of the neuron; A, A_d, B and C are real constant matrices with appropriate dimensions; tau is a constant time delay; phi(t) is a continuous vector-valued initial function defined on [-tau, 0]; and f in C(R^n, R^n) satisfies the norm bound ||f(x(t))|| <= l ||x(t)|| with f(0) = 0. Then we have the following theorem.

Theorem 1. The neutral neural network system (3) is delay-independent asymptotically stable if
(i) C is a Schur-Cohn stable matrix;
(ii) there exist two symmetric positive definite matrices R and Q such that the following Riccati equation has a symmetric positive definite solution P > 0:

A^T P + P A + P B B^T P + S + Q + [P(AC + A_d) + SC] R^{-1} [C^T S + (A_d^T + C^T A^T) P] = 0,   (4)

where S > 0 is the symmetric positive definite solution of the discrete Lyapunov equation

C^T S C - S + l^2 I + R = 0.   (5)
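Because C is Schur-Cohn stable, the discrete Lyapunov equation (5) can be solved by the contractive fixed-point iteration S <- C^T S C + l^2 I + R. The sketch below is our own 2x2 illustration (the particular C, l and R are invented for the example):

```python
# Solve the discrete Lyapunov equation C^T S C - S + l^2 I + R = 0 by
# the fixed-point iteration S <- C^T S C + l^2 I + R, which converges
# whenever C is Schur-Cohn stable. The 2x2 data are our own example.
def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(X):
    return [[X[j][i] for j in range(2)] for i in range(2)]

C = [[0.1, 0.0], [0.0, 0.2]]   # Schur-Cohn stable (eigenvalues 0.1, 0.2)
l = 1.0
R = [[1.0, 0.0], [0.0, 1.0]]
const = [[l * l + R[0][0], R[0][1]],
         [R[1][0], l * l + R[1][1]]]  # the constant term l^2 I + R

S = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(200):
    CSC = mat_mul(transpose(C), mat_mul(S, C))
    S = [[CSC[i][j] + const[i][j] for j in range(2)] for i in range(2)]

# The residual of the equation should now be essentially zero.
CSC = mat_mul(transpose(C), mat_mul(S, C))
residual = max(abs(CSC[i][j] - S[i][j] + const[i][j])
               for i in range(2) for j in range(2))
print(S, residual)
```

For diagonal C the limit is explicit, e.g. S_11 = (l^2 + R_11)/(1 - C_11^2) = 2/0.99 here, which the iteration reproduces.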
Proof. Take the Lyapunov functional candidate

V(x_t) = (x(t) - Cx(t-\tau))^T P (x(t) - Cx(t-\tau)) + \int_{t-\tau}^{t} x(\theta)^T S x(\theta)\,d\theta.   (6)

Introduce the operator D : C_tau -> R^n:

D(x_t) = x(t) - Cx(t - \tau).   (7)

Clearly V(x_t) satisfies the inequalities u(||D(x_t)||) <= V(x_t) <= v(||x_t||_C), where ||x_t||_C = max_{-tau <= theta <= 0} ||x_t(theta)||, u(s) = lambda_min(P) s^2 and v(s) = (lambda_max(P) + tau lambda_max(S)) s^2. The time derivative of V(x_t) along system (3) is given by

\dot{V}(x_t) = (x(t) - Cx(t-\tau))^T [PA + A^T P + S](x(t) - Cx(t-\tau))
+ x^T(t-\tau) A_d^T P (x(t) - Cx(t-\tau)) + (x(t) - Cx(t-\tau))^T P A_d x(t-\tau)
+ 2(x(t) - Cx(t-\tau))^T P B f(x(t-\tau))
+ x(t-\tau)^T C^T S C x(t-\tau) - x(t-\tau)^T S x(t-\tau)
+ (x(t) - Cx(t-\tau))^T P A C x(t-\tau) + x(t-\tau)^T C^T A^T P (x(t) - Cx(t-\tau))
+ (x(t) - Cx(t-\tau))^T S C x(t-\tau) + x(t-\tau)^T C^T S (x(t) - Cx(t-\tau)).   (8)
By the condition of f and Schwarz inequality 2(x(t) − Cx(t − τ ))T P Bf (x(t − τ )) ≤ 2l(x(t) − Cx(t − τ ))T P Bx(t − τ ) ≤ (x(t) − Cx(t − τ ))T P BB T P (x(t) − Cx(t − τ )) + l2 x(t − τ )T x(t − τ ) Since S is the positive solution of the discrete Lyapunov equation (5) and using the operator form (7), we have V˙ (xt ) ≤ D(xt )T [P A + AT P + S + P BB T P ]D(xt ) +D(xt )T (P (AC + Ad ) + SC)x(t − τ )) +x(t − τ )T ((C T AT + ATd )P + C T S)D(xt ) −x(t − τ )T Rx(t − τ )
150
H. He and X. Liao
Since P is the symmetric positive definite solution of the Riccati equation (4), we have

V̇(x_t) ≤ -D(x_t)^T Q D(x_t)
  - [((C^T A^T + A_d^T) P + C^T S) D(x_t) - R x(t-τ)]^T R^{-1} [((C^T A^T + A_d^T) P + C^T S) D(x_t) - R x(t-τ)]
  ≤ -D(x_t)^T Q D(x_t).  (9)
By Lemma 1, the solution x = 0 of the system (3) is globally uniformly asymptotically stable.

Remark 1. Since P B B^T P + S + Q + [P(AC + A_d) + SC] R^{-1} [C^T S + (A_d^T + C^T A^T) P] > 0, standard results on Lyapunov equations imply that A is stable if and only if P ≥ 0. Hence a necessary condition for equation (4) to have a solution is that A is stable.

Remark 2. The Schur-Cohn stability of the matrix C ensures the stability of the operator D : C_τ → R^n, D(x_t) = x(t) - Cx(t-τ).
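Schur-Cohn stability of C (all eigenvalues strictly inside the unit disk) is straightforward to test numerically; a minimal sketch assuming numpy is available (the function name is ours):

```python
import numpy as np

def is_schur_cohn_stable(C):
    """True iff every eigenvalue of C lies strictly inside the unit disk."""
    return bool(np.max(np.abs(np.linalg.eigvals(C))) < 1.0)

# C from the illustrative example in Section 6: spectral radius 0.2 < 1
C = np.array([[0.1, 0.0], [0.0, 0.2]])
print(is_schur_cohn_stable(C))  # True
```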
4
Singular Value Test
Equation (4) can be rewritten as

(A + (AC + A_d) R^{-1} C^T S)^T P + P (A + (AC + A_d) R^{-1} C^T S)
  + P [B B^T + (AC + A_d)(AC + A_d)^T] P + S + Q + S C R^{-1} C^T S = 0.  (10)

Hence the associated Hamiltonian matrix is given by

[ A + (AC + A_d) R^{-1} C^T S        B B^T + (AC + A_d)(AC + A_d)^T   ]
[ -(S + Q + S C R^{-1} C^T S)        -(A + (AC + A_d) R^{-1} C^T S)^T ]  (11)
For the Riccati equation (4) to have a symmetric positive definite solution, the associated Hamiltonian matrix (11) must have no eigenvalues on the imaginary axis.
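This eigenvalue test is easy to automate: assemble the 2n × 2n matrix (11) and verify that no eigenvalue has (numerically) zero real part. A sketch assuming numpy; the helper names are ours, and the Q, R values below are illustrative choices, not prescribed by the paper:

```python
import numpy as np

def hamiltonian(A, Ad, B, C, S, Q, R):
    """Assemble the Hamiltonian matrix (11) associated with the Riccati equation (4)."""
    Rinv = np.linalg.inv(R)
    M = A + (A @ C + Ad) @ Rinv @ C.T @ S
    return np.block([
        [M,                                  B @ B.T + (A @ C + Ad) @ (A @ C + Ad).T],
        [-(S + Q + S @ C @ Rinv @ C.T @ S), -M.T],
    ])

def has_no_imaginary_axis_eigenvalues(H, tol=1e-9):
    """Necessary condition for (4) to admit a symmetric positive definite solution."""
    return bool(np.all(np.abs(np.linalg.eigvals(H).real) > tol))

# Data from the illustrative example (Section 6); Q, R are illustrative here.
A  = np.array([[-2.0, 0.0], [0.0, -2.0]])
Ad = np.array([[0.0, -1.0], [0.0, 0.0]])
B  = np.array([[0.1, 0.2], [0.3, 0.1]])
C  = np.array([[0.1, 0.0], [0.0, 0.2]])
S  = np.diag([3.0303, 3.1250])   # solution of (5) for l = 1, R = 2I
H = hamiltonian(A, Ad, B, C, S, np.eye(2), 2.0 * np.eye(2))
print(has_no_imaginary_axis_eigenvalues(H))
```

By construction J·H is symmetric for J = [[0, I], [-I, 0]], the defining property of a Hamiltonian matrix, so the eigenvalues come in pairs mirrored across the imaginary axis.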
5
LMI Formulation
Note that R is determined from S by (5); via the Schur complement, Theorem 1 can then be converted into an LMI feasibility problem.

Corollary 1. The neutral neural network (3) is delay-independently asymptotically stable if there exist two symmetric positive definite matrices P > 0 and S > 0 such that the following LMIs hold:

[ A^T P + P A + P B B^T P + S     P(AC + A_d) + SC    ]
[ C^T S + (A_d^T + C^T A^T) P     C^T S C - S + l^2 I ] < 0,  (12)

C^T S C - S + l^2 I < 0.  (13)
Stability Analysis of Neutral Neural Networks with Time Delay
6
Illustrative Example
Example. Consider the system (3) with the following data:

A = [ -2   0 ]     A_d = [ 0  -1 ]     B = [ 0.1  0.2 ]
    [  0  -2 ],          [ 0   0 ],        [ 0.3  0.1 ],

C = [ 0.1   0  ]     f(x(t-τ)) = [ sin x_2(t-τ)     ]
    [  0   0.2 ],                [ 1 - cos x_1(t-τ) ]
Hence l = 1, and the inequalities (12)-(13) are satisfied for P = I and R = 2I (with S the corresponding solution of (5)). By Theorem 1, the zero solution of the neural network (3) is globally uniformly asymptotically stable for the given data.
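The feasibility claim can be checked numerically (a sketch assuming numpy). Because C is diagonal, S can be recovered from (5) in closed form; for a general C, scipy.linalg.solve_discrete_lyapunov would do the same job:

```python
import numpy as np

A  = np.array([[-2.0, 0.0], [0.0, -2.0]])
Ad = np.array([[0.0, -1.0], [0.0, 0.0]])
B  = np.array([[0.1, 0.2], [0.3, 0.1]])
C  = np.array([[0.1, 0.0], [0.0, 0.2]])
l = 1.0
P = np.eye(2)
R = 2.0 * np.eye(2)

# Solve (5), C^T S C - S + l^2 I + R = 0; since C is diagonal, S is diagonal
S = np.diag((l**2 + np.diag(R)) / (1.0 - np.diag(C)**2))

# LMI (13): C^T S C - S + l^2 I < 0  (equals -R by construction)
lmi13 = C.T @ S @ C - S + l**2 * np.eye(2)

# LMI (12), assembled as a 2x2 block matrix
top_left  = A.T @ P + P @ A + P @ B @ B.T @ P + S
top_right = P @ (A @ C + Ad) + S @ C
lmi12 = np.block([[top_left, top_right], [top_right.T, lmi13]])

print(np.max(np.linalg.eigvalsh(lmi13)) < 0)  # True
print(np.max(np.linalg.eigvalsh(lmi12)) < 0)  # True
```

Both block matrices are symmetric, so eigvalsh (the symmetric eigensolver) is the appropriate negativity test.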
7
Conclusions
A novel type of neutral neural network is studied in this paper. Building on recent stability results for systems of neutral type and on algorithms and software for LMIs, some new stability criteria for neutral neural networks are provided. Future research includes less conservative stability analysis approaches and control strategies for neutral differential equations.
Acknowledgments This work is supported by the Natural Science Foundation of China (60474011), and Academic Foundation of Naval University of Engineering, China.
Global Asymptotical Stability in Neutral-Type Delayed Neural Networks with Reaction-Diffusion Terms

Jianlong Qiu^{1,2} and Jinde Cao^1

1 Department of Mathematics, Southeast University, Nanjing 210096, China
2 Department of Mathematics, Linyi Normal University, Linyi 276005, China
{qjl9916, jdcao}@seu.edu.cn
Abstract. In this paper, the global uniform asymptotical stability of delayed neutral-type neural networks is studied by constructing an appropriate Lyapunov functional and using the linear matrix inequality (LMI) approach. The main condition given in this paper depends on the size of the measure of the space, and is usually less conservative than space-independent conditions. Finally, a numerical example is provided to demonstrate the effectiveness and applicability of the proposed criteria.
1
Introduction
It is well known that the dynamics of delayed recurrent neural networks, such as cellular neural networks (CNNs) and Hopfield neural networks (HNNs), have been deeply investigated. In recent years, many researchers have studied the global asymptotical stability and global exponential stability of delayed neural networks, and a great number of results on this topic have been reported in the literature; see [1, 2, 3] and the references therein. However, strictly speaking, diffusion effects cannot be avoided in neural networks when electrons move in asymmetric electromagnetic fields. In [5, 6, 7], the authors considered the stability of neural networks with reaction-diffusion terms. In [8], the authors investigated the exponential stability of a class of delayed neural networks described by nonlinear delay differential equations of neutral type without reaction-diffusion terms. Motivated by the above discussions, the aim of the present paper is to consider a class of neutral-type neural networks with time delays and reaction-diffusion terms described by a nonlinear delay differential equation of neutral type:

∂u_i(t,x)/∂t = Σ_{k=1}^{m} ∂/∂x_k (D_ik ∂u_i(t,x)/∂x_k) - a_i u_i(t,x) + Σ_{j=1}^{n} b_ij f_j(u_j(t,x))
  + Σ_{j=1}^{n} c_ij g_j(u_j(t-τ,x)) + Σ_{j=1}^{n} h_ij ∂u_j(t-τ,x)/∂t + I_i,  x ∈ Ω,  (1)
where x = (x_1, x_2, ..., x_m)^T ∈ Ω ⊂ R^m, Ω is a compact set with smooth boundary ∂Ω and mes Ω > 0 in the space R^m; u = (u_1, u_2, ..., u_n)^T ∈ R^n, u_i(t,x)

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 153-158, 2006. © Springer-Verlag Berlin Heidelberg 2006
154
J. Qiu and J. Cao
denotes the state of the ith neuron at time t and in space x; f_j, g_j are activation functions of the jth neuron; the scalar a_i > 0 is the rate with which the ith unit resets its potential to the resting state in isolation when disconnected from the network and external inputs; b_ij, c_ij, h_ij (i, j = 1, 2, ..., n) are known constants denoting the strength of the jth neuron on the ith neuron; the smooth functions D_ik = D_ik(t, x, u) ≥ 0 are the transmission diffusion operators of the ith neuron; τ is a non-negative constant corresponding to the finite speed of axonal signal transmission; I_i denotes the ith component of an external input source. The initial and boundary conditions of system (1) are given by

∂u_i/∂n = (∂u_i/∂x_1, ∂u_i/∂x_2, ..., ∂u_i/∂x_m)^T = 0,  ∀t ≥ 0, x ∈ ∂Ω, i = 1, 2, ..., n,  (2)

u_i(s,x) = φ_i(s,x),  ∂u_i(s,x)/∂t = ∂φ_i(s,x)/∂t,  -τ ≤ s ≤ 0, i = 1, 2, ..., n,  (3)

in which φ_i(s,x) are bounded and first-order continuously differentiable on [-τ, 0].

Notation. Throughout this paper, the norm of a vector X ∈ R^n (n-dimensional Euclidean space) is defined as ‖X‖ = sqrt(X^T X). Ω ⊆ R^m is a simply connected compact subset. Y ∈ C^{1,2}(R_+ × Ω; R^n) is a real Lebesgue measurable function on Ω, with the norm ‖Y(t)‖_2 = [∫_Ω ‖Y(t,x)‖^2 dx]^{1/2}. E denotes the identity matrix. The notation A > B means that the matrix A - B is symmetric positive definite. For an arbitrary real matrix H = (h_ij)_{n×n}, H^+ = (h^+_ij)_{n×n}, where h^+_ij = max(h_ij, 0), i, j = 1, 2, ..., n.
2
Preliminaries
Throughout the paper we need the following assumption and lemmas.

(H) The neuron activation functions f_j, g_j are continuous, and there exist constants l_j > 0, k_j > 0 such that

0 ≤ (f_j(ξ_1) - f_j(ξ_2)) / (ξ_1 - ξ_2) ≤ l_j,  0 ≤ (g_j(ξ_1) - g_j(ξ_2)) / (ξ_1 - ξ_2) ≤ k_j

for any ξ_1, ξ_2 ∈ R, ξ_1 ≠ ξ_2, j = 1, 2, ..., n.

Definition. A vector U* = (u*_1, u*_2, ..., u*_n)^T ∈ R^n is said to be an equilibrium point of system (1) if it satisfies

-a_i u*_i + Σ_{j=1}^{n} b_ij f_j(u*_j) + Σ_{j=1}^{n} c_ij g_j(u*_j) + I_i = 0,  i = 1, 2, ..., n.
Lemma 1 [8]. Suppose W, U are any matrices, ε is a positive number, and the matrix H = H^T > 0; then the following inequality holds:

W^T U + U^T W ≤ ε W^T H W + ε^{-1} U^T H^{-1} U.
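Lemma 1 is the standard ε-inequality; a quick randomized check (a sketch assuming numpy; ε and the dimensions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W = rng.standard_normal((n, n))
U = rng.standard_normal((n, n))
M = rng.standard_normal((n, n))
H = M @ M.T + np.eye(n)          # symmetric positive definite
eps = 0.7                        # any positive number works

lhs = W.T @ U + U.T @ W
rhs = eps * W.T @ H @ W + (1.0 / eps) * U.T @ np.linalg.inv(H) @ U

# rhs - lhs must be positive semidefinite
print(np.min(np.linalg.eigvalsh(rhs - lhs)) >= -1e-9)  # True
```

The inequality follows from expanding the square of sqrt(ε)H^{1/2}W - (1/sqrt(ε))H^{-1/2}U, which is why the difference rhs - lhs is always positive semidefinite.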
Lemma 2. For any constant matrix Σ ∈ R^{n×n}, Σ = Σ^T > 0, any Ω ⊂ R^n with mes Ω > 0, and any vector function ω : Ω → R^n such that the integrations below are well defined,

(∫_Ω ω(s) ds)^T Σ (∫_Ω ω(s) ds) ≤ |Ω| ∫_Ω ω^T(s) Σ ω(s) ds.  (4)
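Lemma 2 is an integral Cauchy-Schwarz-type inequality; discretizing Ω = [0, 1] by a Riemann sum gives a quick numerical check (a sketch assuming numpy; grid size and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 3, 200                         # state dimension, grid points on Ω = [0, 1]
h = 1.0 / N                           # so |Ω| = N * h = 1
M = rng.standard_normal((n, n))
Sigma = M @ M.T + np.eye(n)           # Σ = Σ^T > 0
omega = rng.standard_normal((N, n))   # samples of ω on the grid

integral = omega.sum(axis=0) * h                                   # ∫_Ω ω(s) ds
lhs = integral @ Sigma @ integral
rhs = (N * h) * np.einsum('ij,jk,ik->', omega, Sigma, omega) * h   # |Ω| ∫ ω^T Σ ω

print(lhs <= rhs + 1e-12)  # True
```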
Remark. In the proof we may assume that the matrix Σ is the identity. If not, let v(s) = P ω(s), where P is an invertible orthogonal matrix such that P^T Σ P = diag(E_r, 0); the inequality then follows easily from the Hölder inequality.

Lemma 3 [9]. The equilibrium of the system is globally uniformly asymptotically stable if there exists a C^1 function V : R_+ × R^n → R such that (i) V is positive definite, decrescent and radially unbounded, and (ii) -V̇ is positive definite.

Now let y_i(t,x) = u_i(t,x) - u*_i, i = 1, 2, ..., n. It is easy to see that system (1) can be transformed into

∂y_i(t,x)/∂t = Σ_{k=1}^{m} ∂/∂x_k (D_ik ∂y_i(t,x)/∂x_k) - a_i y_i(t,x) + Σ_{j=1}^{n} b_ij f_j(y_j(t,x))
  + Σ_{j=1}^{n} c_ij g_j(y_j(t-τ,x)) + Σ_{j=1}^{n} h_ij ∂y_j(t-τ,x)/∂t,  x ∈ Ω,  (5)
where fj (yj (t, x)) = fj (yj (t, x) + u∗j ) − fj (u∗j ), j = 1, 2, · · · , n, gj (yj (t − τ, x)) = gj (yj (t − τ, x) + u∗j ) − gj (u∗j ), j = 1, 2, · · · , n. Then it is easy to see that fj (0) = 0, gj (0) = 0, and fj (·), gj (·) satisfy assumption (H).
3
Main Results
Theorem. Under the assumption (H), the equilibrium point U* of system (1) is globally uniformly asymptotically stable if there exist a positive definite matrix Q_1 and diagonal matrices P > 0 and Q_2 > 0 such that

Ψ = [ Γ_1   Γ_2^T   Γ_3^T ]
    [ Γ_2   Σ_1     Γ_4^T ] < 0,  (6)
    [ Γ_3   Γ_4     Σ_2   ]

where

Γ_1 = -2PA + K Q_1 K + 2PBL + A Q_2 A - 2A Q_2 B^+ L + |Ω| L B^{+T} Q_2 B^+ L,
Γ_2 = PC - A Q_2 C^+ + |Ω| L B^{+T} Q_2 C^+,
Γ_3 = PH - A Q_2 H^+ + |Ω| L B^{+T} Q_2 H^+,
Γ_4 = |Ω| C^{+T} Q_2 H^+,
Σ_1 = -Q_1 + |Ω| C^{+T} Q_2 C^+,
Σ_2 = -Q_2 + |Ω| H^{+T} Q_2 H^+.
Proof: Consider the Lyapunov functional

V(t) = ∫_Ω [ Y^T(t,x) P Y(t,x) + ∫_{t-τ}^{t} G^T(Y(α,x)) Q_1 G(Y(α,x)) dα
  + ∫_{t-τ}^{t} (∂Y(α,x)/∂α)^T Q_2 (∂Y(α,x)/∂α) dα ] dx,  (7)
where Y(t,x) = (y_1(t,x), ..., y_n(t,x))^T, F(Y(t,x)) = (f_1(y_1(t,x)), ..., f_n(y_n(t,x)))^T, and G(Y(t,x)) = (g_1(y_1(t,x)), ..., g_n(y_n(t,x)))^T. Computing the time derivative of V(t) along the solution of system (5), we have

V̇(t) = 2 Σ_{i=1}^{n} p_i ∫_Ω y_i(t,x) Σ_{k=1}^{m} ∂/∂x_k (D_ik ∂y_i(t,x)/∂x_k) dx - 2 ∫_Ω Y^T(t,x) P A Y(t,x) dx
  + 2 ∫_Ω Y^T(t,x) [P B F(Y(t,x)) + P C G(Y(t,x)) + P H ∂Y(t-τ,x)/∂t] dx
  + ∫_Ω [ G^T(Y(t,x)) Q_1 G(Y(t,x)) - G^T(Y(t-τ,x)) Q_1 G(Y(t-τ,x))
  + (∂Y(t,x)/∂t)^T Q_2 ∂Y(t,x)/∂t - (∂Y(t-τ,x)/∂t)^T Q_2 ∂Y(t-τ,x)/∂t ] dx.  (8)
From the boundary condition (2) and the Green formula, we have

∫_Ω y_i(t,x) Σ_{k=1}^{m} ∂/∂x_k (D_ik ∂y_i(t,x)/∂x_k) dx = - ∫_Ω Σ_{k=1}^{m} D_ik (∂y_i(t,x)/∂x_k)^2 dx.  (9)
Using (9) and assumption (H), we have

V̇(t) ≤ ∫_Ω { Y^T(t,x) [ (-2PA + K Q_1 K + 2PBL) Y(t,x) + 2 P C G(Y(t-τ,x))
  + 2 P H ∂Y(t-τ,x)/∂t ] - G^T(Y(t-τ,x)) Q_1 G(Y(t-τ,x))
  - (∂Y(t-τ,x)/∂t)^T Q_2 ∂Y(t-τ,x)/∂t + (∂Y(t,x)/∂t)^T Q_2 ∂Y(t,x)/∂t } dx.  (10)
Multiplying system (5) by y_i(t,x) and integrating over x ∈ Ω gives

(1/2) d/dt ∫_Ω y_i^2(t,x) dx = ∫_Ω y_i(t,x) Σ_{k=1}^{m} ∂/∂x_k (D_ik ∂y_i(t,x)/∂x_k) dx - a_i ∫_Ω y_i^2(t,x) dx
  + Σ_{j=1}^{n} b_ij ∫_Ω y_i(t,x) f_j(y_j(t,x)) dx + Σ_{j=1}^{n} c_ij ∫_Ω y_i(t,x) g_j(y_j(t-τ,x)) dx
  + Σ_{j=1}^{n} h_ij ∫_Ω y_i(t,x) ∂y_j(t-τ,x)/∂t dx,  i = 1, 2, ..., n.  (11)
Taking into account (H), (9) and the Schwarz inequality, it is easy to see that equality (11) implies

d/dt ‖y_i(t,x)‖_2 ≤ -a_i ‖y_i(t,x)‖_2 + Σ_{j=1}^{n} b^+_ij l_j ‖y_j(t,x)‖_2 + Σ_{j=1}^{n} c^+_ij k_j ‖y_j(t-τ,x)‖_2
  + Σ_{j=1}^{n} h^+_ij ‖ẏ_j(t-τ,x)‖_2,  i = 1, 2, ..., n.  (12)
By Lemma 2 and some classical inequalities, we have

∫_Ω (∂Y(t,x)/∂t)^T Q_2 (∂Y(t,x)/∂t) dx ≤ Σ_{i=1}^{n} q_{2i} ( d/dt ‖y_i(t,x)‖_2 )^2
  ≤ ∫_Ω { Y^T(t,x) [ (A Q_2 A - 2A Q_2 B^+ L + |Ω| L B^{+T} Q_2 B^+ L) Y(t,x)
  + 2(-A Q_2 C^+ + |Ω| L B^{+T} Q_2 C^+) G(Y(t-τ,x))
  + 2(-A Q_2 H^+ + |Ω| L B^{+T} Q_2 H^+) ∂Y(t-τ,x)/∂t ]
  + G^T(Y(t-τ,x)) [ |Ω| C^{+T} Q_2 C^+ G(Y(t-τ,x)) + 2|Ω| C^{+T} Q_2 H^+ ∂Y(t-τ,x)/∂t ]
  + |Ω| (∂Y(t-τ,x)/∂t)^T H^{+T} Q_2 H^+ (∂Y(t-τ,x)/∂t) } dx.  (13)

Substituting (13) into (10) and using condition (6), we have

V̇(t) ≤ ∫_Ω Λ^T(t,x) Ψ Λ(t,x) dx,  (14)

where Λ(t,x) = ( Y^T(t,x), G^T(Y(t-τ,x)), (∂Y(t-τ,x)/∂t)^T )^T. From Lemma 3, it follows that the equilibrium of system (1) is globally uniformly asymptotically stable.
4
Simulation Example
Example. Let m = n = 2, D_ik = 1 (i, k = 1, 2), |Ω| = 1/16, τ = 1, and f_j = g_j, j = 1, 2, with Lipschitz constants l_1 = k_1 = 0.5, l_2 = k_2 = 0.7. Consider the delayed neural network (1) with parameters

A = [ 2.8346   0      ]     B = [ 0.3346  -1.8921 ]
    [ 0        9.7352 ],        [ 1.7243   0.5769 ],

C = [ -0.8765   3.5647 ]     H = [  0.3346  0.1745 ]
    [ -0.7324  -2.0125 ],        [ -0.0765  0.4217 ]

By the Theorem, we can conclude that this delayed neural network is globally asymptotically stable. Using the Matlab LMI Control Toolbox to solve the LMI in (6), we obtain P, Q_1, Q_2 as follows.
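Whatever solver produces them, candidate solution matrices can be sanity-checked for positive definiteness and for the diagonal structure the Theorem requires of P and Q_2; a sketch assuming numpy, with the values transcribed from the display that follows:

```python
import numpy as np

# Solution matrices as reported by the LMI solver (transcribed from the text)
P  = np.array([[0.2312, 0.0],    [0.0,    0.0949]])
Q1 = np.array([[0.9522, 0.0425], [0.0425, 1.0600]])
Q2 = np.array([[0.0700, 0.0],    [0.0,    0.0151]])

for name, M in [("P", P), ("Q1", Q1), ("Q2", Q2)]:
    # Cholesky succeeds exactly when a symmetric matrix is positive definite
    np.linalg.cholesky(M)
    print(name, "> 0")  # prints: P > 0, Q1 > 0, Q2 > 0
```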
P = [ 0.2312  0      ]    Q_1 = [ 0.9522  0.0425 ]    Q_2 = [ 0.0700  0      ]
    [ 0       0.0949 ],         [ 0.0425  1.0600 ],         [ 0       0.0151 ].

5
Conclusion
In this paper, we have investigated the global uniform asymptotical stability of neutral-type neural networks with time delays and reaction-diffusion terms. Using the Lyapunov functional method and the LMI approach, we gave a sufficient criterion ensuring the global asymptotical stability of the equilibrium point. The obtained result improves and extends several earlier publications and is useful for designing high-quality neural networks in applications.
Acknowledgement This work is supported by the National Natural Science Foundation of China under Grants 60574043 and 60373067, and the Natural Science Foundation of Jiangsu Province, China under Grant BK2003053.
References
1. Cao, J., Zhou, D.: Stability Analysis of Delayed Cellular Neural Networks. Neural Networks 11 (1998) 1601-1605
2. Cao, J., Wang, J.: Global Exponential Stability and Periodicity of Recurrent Neural Networks with Time Delays. IEEE Trans. Circuits Syst. I 52(5) (2005) 925-931
3. Cao, J., Ho, D.W.C.: A General Framework for Global Asymptotic Stability Analysis of Delayed Neural Networks Based on LMI Approach. Chaos, Solitons and Fractals 24(5) (2005) 1317-1329
4. Gu, K.: An Integral Inequality in the Stability Problems of Time-delay Systems. Proceedings of 39th IEEE CDC, Sydney, Australia (2000) 2805-2810
5. Liao, X.X., Fu, Y., Gao, J., Zhao, X.: Stability of Hopfield Neural Networks with Reaction-diffusion Terms. Acta Electron. Sinica 28 (2002) 78-80
6. Liang, J., Cao, J.: Global Exponential Stability of Reaction-diffusion Recurrent Neural Networks with Time-varying Delays. Phys. Lett. A 314 (2003) 434-442
7. Wang, L., Xu, D.: Global Exponential Stability of Reaction-diffusion Hopfield Neural Networks with Time-varying Delays. Science in China E 33 (2003) 488-495
8. Xu, S., Lam, J., Ho, D.W.C., Zou, Y.: Delay-dependent Exponential Stability for a Class of Neural Networks with Time Delays. J. Comput. Appl. Math. 183 (2005) 16-28
9. Vidyasagar, M.: Nonlinear Systems Analysis. Englewood Cliffs, New Jersey (1993)
Almost Sure Exponential Stability on Interval Stochastic Neural Networks with Time-Varying Delays

Wudai Liao^1, Zhongsheng Wang^1, and Xiaoxin Liao^2

1 School of Electrical and Information Engineering, Zhongyuan University of Technology, 450007, Zhengzhou, Henan, P.R. China
{wdliao, zswang}@zzti.edu.cn
2 Department of Control Science and Engineering, Huazhong University of Science and Technology, 430074, Wuhan, Hubei, P.R. China
[email protected]
Abstract. Because of the VLSI realization of artificial neural networks and the measurement of circuit elements, noises coming from the circuits and errors in the parameters of the network systems are unavoidable. Making use of the stochastic version of the Razumikhin theorem for stochastic functional differential equations, Lyapunov direct methods and matrix analysis, almost sure exponential stability of interval neural networks with time-varying delays perturbed by white noise is examined, and some sufficient algebraic criteria which depend only on the systems' parameters are given. For well-designed deterministic neural networks, the results obtained in the paper also indicate how much perturbation they can tolerate.
1
Introduction
Exponential stability of deterministic neural networks has been widely studied and many useful results have been obtained, such as [1, 2, 3, 4, 5]. In fact, neural networks work in noisy environments, considering the unavoidable noises coming from their VLSI realization and from the noisy process of synaptic transmission in brain neural networks. So, for reality, we need to introduce noises into the neural networks, which we call stochastic neural networks. The stability of stochastic neural networks has also been discussed by some scholars, and some criteria [6, 7, 8, 9] ensuring them to be exponentially stable have been obtained. In practice, the weights between neurons and the noise intensities must be estimated by confidence intervals, which gives interval stochastic neural networks. Based on the finite presentation of interval matrices, criteria have been obtained [10] which ensure that interval stochastic neural networks are almost surely exponentially stable.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 159-164, 2006. © Springer-Verlag Berlin Heidelberg 2006
160
W. Liao, Z. Wang, and X. Liao
In the present paper, we consider the stability of the interval stochastic neural network with time-varying delays of the form

dx(t) = [-(B + ΔB)x(t) + (A + ΔA)f(x(t - τ(t)))]dt + [(σ + Δσ)x(t - τ(t))]dw(t),  (1)

where B = diag(b_1, b_2, ..., b_n) is a positive diagonal matrix; A, σ ∈ R^{n×n} are the weight matrix and the perturbation matrix, respectively; the error matrices satisfy ΔB ∈ [-B̃, B̃], ΔA ∈ [-Ã, Ã], Δσ ∈ [-σ̃, σ̃], where B̃, Ã, σ̃ ∈ R^{n×n} are the upper-bound matrices of the errors; x = (x_1, x_2, ..., x_n)^T ∈ R^n is the state vector of the neural network; x(t - τ(t)) = (x_1(t - τ_1(t)), x_2(t - τ_2(t)), ..., x_n(t - τ_n(t)))^T, where τ_i(t) ≥ 0 is the delay of neuron i and 0 ≤ τ_i(t) ≤ τ, i = 1, 2, ..., n; f(·) is the vector of the output functions of the neurons, f(x) = (f_1(x_1), f_2(x_2), ..., f_n(x_n))^T, and each f_i(·) satisfies, for some α_i > 0,

|f_i(u)| ≤ 1 ∧ α_i |u|,  -∞ < u < +∞,  i = 1, 2, ..., n,  (2)

and w(·) is a one-dimensional standard Brownian motion. We establish some sufficient algebraic criteria which ensure that the neural network (1) is almost surely exponentially stable; they also show what extent of noise the corresponding deterministic neural network can tolerate.
2
Notation
Let |·| denote the Euclidean norm of a vector x. The matrix norms ‖·‖ and ‖·‖_F are defined respectively by

‖A‖ = sqrt(λ_max(A^T A)),  ‖A‖_F = sqrt(trace(A^T A)),

where A ∈ R^{n×n}. The two matrix norms are consistent with the Euclidean vector norm, and the Frobenius norm is consistent with itself, that is,

|Ax| ≤ ‖A‖ · |x|,  |Ax| ≤ ‖A‖_F · |x|,  ‖AB‖_F ≤ ‖A‖_F · ‖B‖_F.  (3)
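These definitions and the consistency relations (3) are easy to validate numerically (a sketch assuming numpy; the data are random):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

spec = np.sqrt(np.max(np.linalg.eigvalsh(A.T @ A)))  # ‖A‖ = sqrt(λ_max(A^T A))
frob = np.sqrt(np.trace(A.T @ A))                    # ‖A‖_F = sqrt(trace(A^T A))

print(np.isclose(spec, np.linalg.norm(A, 2)))        # True
print(np.isclose(frob, np.linalg.norm(A, 'fro')))    # True
print(np.linalg.norm(A @ x) <= spec * np.linalg.norm(x) + 1e-12)                # True
print(np.linalg.norm(A @ B, 'fro') <= frob * np.linalg.norm(B, 'fro') + 1e-12)  # True
```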
C([-τ, 0]; R^n) denotes the space of all continuous R^n-valued functions φ defined on [-τ, 0]; C^{2,1}(R^n × [t_0 - τ, ∞); R_+) the family of all non-negative real-valued functions V(x, t) defined on R^n × [t_0 - τ, ∞) which are continuously twice differentiable in x ∈ R^n and once differentiable in t; L^p_{F_t}([-τ, 0]; R^n) the family of all F_t-measurable C([-τ, 0]; R^n)-valued random variables φ such that E‖φ‖^p < ∞. For any t ≥ t_0, define x_t ∈ C([-τ, 0]; R^n) by x_t(θ) = x(t + θ), θ ∈ [-τ, 0], and consider the stochastic functional differential equation

dx(t) = f(x_t, t)dt + g(x_t, t)dw(t),  t ≥ t_0,  x_{t_0} = ξ ∈ L^p_{F_{t_0}}([-τ, 0]; R^n).  (4)

For a Lyapunov function V(x, t) ∈ C^{2,1}(R^n × [t_0 - τ, ∞); R_+), define the differential operator from C^{2,1}(R^n × [t_0 - τ, +∞)) to R:

LV(φ, t) = V_t(φ(0), t) + V_x(φ(0), t) f(φ, t) + (1/2) trace[g^T(φ, t) V_xx(φ(0), t) g(φ, t)].
3
Lemmas
In this section, we give two lemmas used later in the paper.

Lemma 1. Assume that there exist a Lyapunov function V(x, t) ∈ C^{2,1}(R^n × [t_0 - τ, ∞); R_+) and positive numbers λ, p, c_1, c_2, q > 1 such that

c_1 |x|^p ≤ V(x, t) ≤ c_2 |x|^p

holds for any (x, t) ∈ R^n × [t_0 - τ, ∞), and

E LV(φ, t) ≤ -λ E V(φ(0), t)

holds for any t ≥ t_0 and φ ∈ L^p_{F_t}([-τ, 0]; R^n) satisfying E V(φ(θ), t + θ) < q E V(φ(0), t), -τ ≤ θ ≤ 0. Then we have:

(1) For all ξ ∈ L^p_{F_{t_0}}([-τ, 0]; R^n), the corresponding solution of system (4) satisfies

E|x(t, ξ)|^p ≤ (c_2 / c_1) E‖ξ‖^p e^{-γ(t - t_0)},  t ≥ t_0,

where γ = min{λ, (log q)/τ}.

(2) For any solution x(t) of system (4),

lim sup_{t→∞} (1/t) log|x(t, ξ)| ≤ -γ/p,  a.s.,

so long as p ≥ 1 and there exists a positive constant k > 0 such that

E(|f(x_t, t)|^p + |g(x_t, t)|^p) ≤ k sup_{-τ≤θ≤0} E|x(t + θ)|^p,  t ≥ t_0,  (5)

holds. That is, the trivial solution of system (4) is almost surely exponentially stable, with Lyapunov exponent not greater than -γ/p.

The proof of Lemma 1 is given in Reference [11], and one can easily deduce that condition (5) holds for system (1).
4
Main Results
In this section we set up some sufficient criteria ensuring almost sure exponential stability of the neural network (1). For a symmetric positive definite matrix Q, select the Lyapunov function V(x) = x^T Q x; for φ ∈ C([-τ, 0]; R^n), the differential operator L along the solution of neural network (1) has the form

LV(φ) = 2φ^T(0) Q [-(B + ΔB)φ(0) + (A + ΔA)f(φ(-τ(t)))]
  + φ^T(-τ(t)) (σ + Δσ)^T Q (σ + Δσ) φ(-τ(t)).  (6)
Theorem 1. Assume that there exist a symmetric positive definite matrix Q and a positive number ε such that

min{b_i} > max{b̃_i} + ε/2 + (‖Q‖ / (2λ_min(Q))) [ (α²/ε) ‖A + Ã‖² + ‖σ + σ̃‖² ]  (7)

holds. Then for any ξ ∈ L²_{F_0}([-τ, 0]; R^n), the corresponding solution of neural network (1) has the following properties:

lim sup_{t→∞} (1/t) log(E|x(t; ξ)|²) ≤ -(log q)/τ,  (8)

lim sup_{t→∞} (1/t) log|x(t; ξ)| ≤ -(log q)/(2τ),  (9)

where α = max{α_i} and q > 1 is the unique root of the equation

-2 min{b_i} + 2 max{b̃_i} + ε + (‖Q‖ / λ_min(Q)) [ (α²/ε) ‖A + Ã‖² + ‖σ + σ̃‖² ] q = -(log q)/τ.

Proof (of Theorem 1). Because the matrix Q is positive definite, Q^{1/2} exists and is invertible, and λ_min(Q^{1/2}(B + ΔB)Q^{-1/2}) = λ_min(B + ΔB), since the two matrices Q^{1/2}(B + ΔB)Q^{-1/2} and B + ΔB are similar. In the following, the three terms in the differential operator (6) are estimated. First,

-2φ^T(0) Q (B + ΔB) φ(0) = -2φ^T(0) Q^{1/2} [Q^{1/2}(B + ΔB)Q^{-1/2}] Q^{1/2} φ(0)
  ≤ -2λ_min(Q^{1/2}(B + ΔB)Q^{-1/2}) φ^T(0) Q φ(0) = -2λ_min(B + ΔB) V(φ(0)).

Using λ_min(B + ΔB) = min{b_i + Δb_i} ≥ min{b_i} + min{Δb_i}, -b̃_i ≤ Δb_i ≤ b̃_i, and min{Δb_i} ≥ min{-b̃_i} = -max{b̃_i}, we obtain the estimate of the first term in (6):

-2φ^T(0) Q (B + ΔB) φ(0) ≤ -2(min{b_i} - max{b̃_i}) V(φ(0)).  (10)
Similarly, the second term in (6) is estimated as follows:

2φ^T(0) Q (A + ΔA) f(φ(-τ(t))) ≤ 2|φ^T(0) Q^{1/2}| · |Q^{1/2}(A + ΔA) f(φ(-τ(t)))|
  ≤ ε |φ^T(0) Q^{1/2}|² + (1/ε) |Q^{1/2}(A + ΔA) f(φ(-τ(t)))|²
  ≤ ε V(φ(0)) + (1/ε) ‖Q‖ · ‖A + ΔA‖² |f(φ(-τ(t)))|²
  ≤ ε V(φ(0)) + (α² ‖Q‖ / (ε λ_min(Q))) ‖A + Ã‖² V(φ(-τ(t))).  (11)

And the third term in (6) can be estimated as follows:

φ^T(-τ(t)) (σ + Δσ)^T Q (σ + Δσ) φ(-τ(t)) ≤ ‖Q‖ · ‖σ + Δσ‖² |φ(-τ(t))|²
  ≤ (‖Q‖ ‖σ + σ̃‖² / λ_min(Q)) V(φ(-τ(t))).  (12)
Substituting (10), (11) and (12) into (6), we have

LV(φ) ≤ (-2 min{b_i} + 2 max{b̃_i} + ε) V(φ(0))
  + (‖Q‖ / λ_min(Q)) [ (α²/ε) ‖A + Ã‖² + ‖σ + σ̃‖² ] V(φ(-τ(t))).  (13)

Therefore, for any t ≥ 0 and φ ∈ L²_{F_t}([-τ, 0]; R^n) satisfying E V(φ(-τ(t))) ≤ q E V(φ(0)), we have

E LV(φ) ≤ { -2 min{b_i} + 2 max{b̃_i} + ε + (q ‖Q‖ / λ_min(Q)) [ (α²/ε) ‖A + Ã‖² + ‖σ + σ̃‖² ] } E V(φ(0)).

According to condition (7) of the theorem, the equation

-2 min{b_i} + 2 max{b̃_i} + ε + (q ‖Q‖ / λ_min(Q)) [ (α²/ε) ‖A + Ã‖² + ‖σ + σ̃‖² ] = -(log q)/τ

has a unique root q > 1, and therefore

E LV(φ) ≤ -((log q)/τ) E V(φ(0)).

By Lemma 1, the desired estimations (8) and (9) hold. The proof is complete.

Corollary 1. Assume that there exists a symmetric positive definite matrix Q such that

min{b_i} > max{b̃_i} + (1/2)(1 + α²) (‖Q‖ / λ_min(Q)) ‖A + Ã‖ + (‖Q‖ / (2λ_min(Q))) ‖σ + σ̃‖²

holds. Then the estimations (8) and (9) hold, where α = max{α_i} and q > 1 is the unique root of the equation

-2 min{b_i} + 2 max{b̃_i} + (1 + α²) (‖Q‖ / λ_min(Q)) ‖A + Ã‖ √q + (‖Q‖ / λ_min(Q)) ‖σ + σ̃‖² q = -(log q)/τ.

Proof (of Corollary 1). As in the proof of Theorem 1, choosing ε = √q (‖Q‖ / λ_min(Q)) ‖A + Ã‖ in (13) gives

LV(φ) ≤ [-2 min{b_i} + 2 max{b̃_i} + √q (‖Q‖ / λ_min(Q)) ‖A + Ã‖] V(φ(0))
  + [ (α² ‖Q‖ / (√q λ_min(Q))) ‖A + Ã‖ + (‖Q‖ / λ_min(Q)) ‖σ + σ̃‖² ] V(φ(-τ(t))).

The remainder of the proof is the same as in Theorem 1. The proof is complete.
By choosing the symmetric matrix Q equal to the identity matrix I in Corollary 1, we have the following corollary.

Corollary 2. For the neural network (1), if

min{b_i} > max{b̃_i} + (1/2)(1 + α²) ‖A + Ã‖ + (1/2) ‖σ + σ̃‖²  (14)

holds, then the properties (8) and (9) hold, where α = max{α_i} and q > 1 is the unique root of the equation

-2 min{b_i} + 2 max{b̃_i} + (1 + α²) ‖A + Ã‖ √q + ‖σ + σ̃‖² q = -(log q)/τ.

Remark 1. Considering the properties (3) of the Frobenius norm, and after carefully examining the proof of Theorem 1, the Euclidean matrix norm can be replaced by the Frobenius norm in all theorems and corollaries of this section.

Remark 2. When the perturbation terms are zero in the neural network (1), all the conclusions obtained in the paper ensure exponential stability of the corresponding deterministic neural network.
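Condition (14) involves only the system data, so it can be checked directly. The helper below (our naming, assuming numpy) uses spectral norms; by Remark 1, Frobenius norms could be substituted. The data are illustrative, not from the paper:

```python
import numpy as np

def corollary2_holds(b, b_tilde, A, A_tilde, sigma, sigma_tilde, alpha):
    """Check the algebraic condition (14) of Corollary 2 with spectral norms."""
    norm = lambda M: np.linalg.norm(M, 2)
    rhs = (max(b_tilde) + 0.5 * (1 + alpha**2) * norm(A + A_tilde)
           + 0.5 * norm(sigma + sigma_tilde)**2)
    return min(b) > rhs

# Illustrative data: strong self-decay b, small weights and noise bounds
b       = np.array([5.0, 6.0])
b_tilde = np.array([0.2, 0.1])
A       = np.array([[0.5, -0.2], [0.1, 0.4]])
A_tilde = 0.1 * np.ones((2, 2))
sigma   = 0.3 * np.eye(2)
sigma_tilde = 0.05 * np.eye(2)
print(corollary2_holds(b, b_tilde, A, A_tilde, sigma, sigma_tilde, alpha=1.0))  # True
```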
Acknowledgement The work is supported by National Natural Science Foundation of China (60274007, 60474001) and Foundation of Young Bone Teacher of Henan Province of China.
References
1. Cao, J., Zhou, D.: Stability Analysis of Delayed Cellular Neural Networks. Neural Networks 11 (1998) 1601-1605
2. Liao, X.: Robust Stability for Interval Hopfield Neural Networks with Time Delay. IEEE Trans. Neural Networks 9(5) (1998) 1042-1045
3. Wang, J., Wu, G.: A Multilayer Recurrent Neural Network for Solving Continuous-Time Algebraic Riccati Equations. Neural Networks 11 (1998) 939-950
4. Lu, H.: On Stability of Nonlinear Continuous-Time Neural Networks with Delays. Neural Networks 13 (2000) 1135-1143
5. Liang, X., Wang, J.: A Proof of Kaszkurewicz and Bhaya's Conjecture on Absolute Stability of Neural Networks in the Two-Neuron Case. IEEE Trans. Circuits and Systems I: Fundamental Theory and Applications 47(4) (2000) 609-611
6. Liao, X., Mao, X.: Stability of Stochastic Neural Networks. Neural, Parallel and Scientific Computations 14(4) (1996) 205-224
7. Blythe, S., Mao, X.: Stability of Stochastic Delay Neural Networks. Journal of the Franklin Institute 338 (2001) 481-495
8. Shen, Y., Liao, X.: Robust Stability of Nonlinear Stochastic Delayed Systems. Acta Automatica Sinica 25(4) (1999) 537-542
9. Liao, X., Mao, X.: Exponential Stability of Stochastic Delay Interval Systems. Systems and Control Letters 40 (2000) 171-181
10. Liao, W., Liao, X.: Robust Stability of Time-Delayed Interval CNN in Noisy Environment. Acta Automatica Sinica 30(2) (2004) 300-305
11. Mao, X.: Stochastic Differential Equations and Their Applications. 1st edn. Horwood Publishing, Chichester (1997)
Stochastic Robust Stability of Markovian Jump Nonlinear Uncertain Neural Networks with Wiener Process

Xuyang Lou and Baotong Cui

Research Center of Control Science and Engineering, Southern Yangtze University, 1800 Lihu Rd., Wuxi, Jiangsu 214122, P.R. China
[email protected],
[email protected]
Abstract. This paper deals with the stochastic robust stability problem for Markovian jump nonlinear uncertain neural networks (MJNUNNs) with Wiener process. Some criteria for the stochastic robust stability of Markovian jump nonlinear uncertain neural networks are derived, even when the system is driven by a Wiener process. All the derived results are presented in terms of linear matrix inequalities.
1
Introduction
In recent years, considerable efforts have been devoted to the analysis of neural networks due to their potential applications in many areas [1]. The stability analysis of various aspects of neural networks has received considerable attention in the past few years (see, for instance, [1-13] and the references cited therein). In ref. [8], the authors introduced a linear matrix inequality (LMI) approach for the exponential stability of BAM neural networks with constant or time-varying delays. In ref. [7], we studied the global asymptotic stability of delayed bi-directional associative memory (BAM) neural networks with distributed delays and reaction-diffusion terms. The absolute exponential stability (AEST) of a class of BAM neural networks is analyzed in ref. [3]. It was pointed out [14, 15] that a neural network can be stabilized or destabilized by certain stochastic inputs. However, the stability analysis of stochastic neural networks is difficult. Although the stability analysis of neural networks has received much attention, the stability of stochastic neural networks has not been widely studied; some results related to this issue have been reported in the literature (see, for instance, [15-20]). In [16, 17], the stochastic stability of Hopfield neural networks was studied. The problem of stochastic robust stability (SRST) for uncertain delayed neural networks with Markovian jumping parameters is investigated via the linear matrix inequality (LMI) technique in [18]. Liao and Mao [19] mainly discussed the mean square exponential stability of stochastic delay neural networks. To the best of our knowledge, only a few works have been done on the SRST analysis of MJNUNNs. However, the SRST analysis for MJNUNNs with Wiener

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 165-171, 2006. © Springer-Verlag Berlin Heidelberg 2006
166
X. Lou and B. Cui
process disturbance has never been tackled. The aim of this paper is to derive sufficient conditions for the SRST of MJNUNNs with Wiener process disturbance. All results obtained in this paper are expressed in terms of LMIs.

Notations and facts. In the sequel, A^T and A^{-1} denote the transpose and the inverse of a square matrix A. We use A > 0 (A < 0) to denote a positive- (negative-) definite matrix A, and I denotes the n × n identity matrix. R denotes the set of real numbers, R^n the n-dimensional Euclidean space, and R^{n×m} the set of all n × m real matrices. diag[·] denotes a block-diagonal matrix. Γ([−d, 0]; R^n) denotes the family of continuous functions φ(·) from [−d, 0] to R^n with the norm ‖φ‖ = sup_{−d ≤ θ ≤ 0} |φ(θ)|. The symbol ⋆ within a matrix represents the symmetric term of the matrix. The mathematical expectation operator with respect to the given probability measure P is denoted by E{·}.

Fact 1 (Schur complement). Given constant matrices Ω1, Ω2, Ω3, where Ω1 = Ω1^T and 0 < Ω2 = Ω2^T, then Ω1 + Ω3^T Ω2^{-1} Ω3 < 0 if and only if

$$\begin{pmatrix} \Omega_1 & \Omega_3^T \\ \Omega_3 & -\Omega_2 \end{pmatrix} < 0, \quad \text{or} \quad \begin{pmatrix} -\Omega_2 & \Omega_3 \\ \Omega_3^T & \Omega_1 \end{pmatrix} < 0.$$

Fact 2. For any real matrices Σ1, Σ2, Σ3 with appropriate dimensions and 0 < Σ3 = Σ3^T, it follows that

$$\Sigma_1^T \Sigma_2 + \Sigma_2^T \Sigma_1 \le \varepsilon \Sigma_1^T \Sigma_3 \Sigma_1 + \varepsilon^{-1} \Sigma_2^T \Sigma_3^{-1} \Sigma_2, \quad \forall \varepsilon > 0.$$
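As an aside not in the original text, Fact 1 is easy to sanity-check numerically: for matrices with Ω2 > 0, negative definiteness of the block matrix coincides with negative definiteness of the Schur complement. A minimal sketch, assuming NumPy is available (all matrix data is randomly generated for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)

def is_neg_def(mat):
    # A symmetric matrix is negative definite iff its largest eigenvalue is negative.
    return np.linalg.eigvalsh(mat).max() < 0

n, m = 4, 3
A = rng.standard_normal((m, m))
Omega2 = A @ A.T + m * np.eye(m)           # Omega2 = Omega2^T > 0
Omega3 = rng.standard_normal((m, n))
schur = Omega3.T @ np.linalg.solve(Omega2, Omega3)
Omega1 = -schur - np.eye(n)                # chosen so Omega1 + schur = -I < 0

lhs = Omega1 + schur                       # the Schur-complement condition
block = np.block([[Omega1, Omega3.T],
                  [Omega3, -Omega2]])      # the block LMI of Fact 1
print(is_neg_def(lhs), is_neg_def(block))  # both tests agree
```

Here `lhs` is negative definite by construction, and the block-matrix test returns the same verdict, as Fact 1 predicts.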
2 Problem Statement
Consider a nonlinear uncertain neural network with time delay, described by the following set of functional differential equations:

$$\dot{x}_m(t) = -a_m x_m(t) - a_{dm} x_m(t-d) + \sum_{n=1}^{L} b_{mn} f_n(x_n(t)) + \sum_{n=1}^{L} b_{dmn} f_n(x_n(t-d)) + c_m g_m(x_m(t), x_m(t-d)), \quad (1)$$

for m = 1, 2, ..., L. System (1) can be represented in vector state-space form as

$$\dot{x}(t) = -Ax(t) - A_d x(t-d) + Bf(x(t)) + B_d f(x(t-d)) + Cg(x(t), x(t-d)), \quad (2)$$

where x(t) = (x_1(t), ..., x_L(t))^T is the state vector of the neural network, d is the transmission delay, x(t−d) = (x_1(t−d), ..., x_L(t−d))^T is the delayed state vector, A = diag(a_1, ..., a_L) and A_d = diag(a_{d1}, ..., a_{dL}) are positive diagonal matrices, B = (b_{mn})_{L×L} and B_d = (b_{dmn})_{L×L} denote the connection weights, C = (c_{mn})_{L×L} denotes the perturbation weights, and g(x(t), x(t − d)) is
Stochastic Robust Stability
167
the uncertain perturbation, with g(x(t), x(t−d)) = (g_1(x_1(t), x_1(t−d)), ..., g_L(x_L(t), x_L(t−d)))^T, and the activation function is f(x(t)) = (f_1(x_1(t)), ..., f_L(x_L(t)))^T.

Throughout this paper, we assume that the activation functions f_m (m = 1, 2, ..., L) satisfy the following condition:

(H1) There exist positive constants k_m, m = 1, ..., L, such that

$$0 < \frac{f_m(\xi_1) - f_m(\xi_2)}{\xi_1 - \xi_2} \le k_m$$

for all ξ1, ξ2 ∈ R with ξ1 ≠ ξ2, m = 1, ..., L.

Dynamical systems whose structure varies in response to random changes, which may result from abrupt phenomena such as parameter shifting, occur frequently in practice, so it is natural to consider the following system (3). In this paper, we consider the class of MJNUNNs with Wiener process defined on the probability space (Ω, Υ, P), with the dynamics

$$dx(t) = \big[-A(\eta_t)x(t) - A_d(\eta_t)x(t-d) + B(\eta_t)f(x(t)) + B_d(\eta_t)f(x(t-d)) + C(\eta_t)g(x(t), x(t-d))\big]dt + W(\eta_t)x(t)\,d\omega(t), \qquad x(t) = \phi_0,\ t \in [-d, 0], \quad (3)$$

where φ0 ∈ Γ([−d, 0]; R^n), ω(t) is a standard Wiener process on the given probability space, and W is the Wiener process matrix, which is supposed to be known; for more details on the Wiener process, see [21, 22, 23].

In the probability space (Ω, Υ, P), Ω is the sample space, Υ is the algebra of events, and P is the probability measure defined on Υ. Let the random process {η_t, t ∈ [0, +∞)} be a homogeneous, finite-state Markov process with right-continuous trajectories, generator Π = (π_ij), and transition probability from mode i at time t to mode j at time t + δ, i, j ∈ S = {1, ..., N}:

$$p_{ij} = \Pr(\eta_{t+\delta} = j \mid \eta_t = i) = \begin{cases} \pi_{ij}\delta + o(\delta), & i \ne j, \\ 1 + \pi_{ii}\delta + o(\delta), & i = j, \end{cases} \quad (4)$$

with transition probability rates π_ij ≥ 0 for i, j ∈ S, i ≠ j, and π_ii = −Σ_{j=1, j≠i}^{N} π_ij, where δ > 0 and lim_{δ→0} o(δ)/δ = 0. Note that the set S comprises the various operational modes of the system under study.

Assumption 1. The mode η_t is available at time t.

Assumption 2. There exist positive constant matrices M and M1 such that

$$\|g(x(t), x(t-d))\| \le \|Mx(t)\| + \|M_1 x(t-d)\|. \quad (5)$$
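To make the switching law (4) concrete, the following illustrative sketch (not from the paper; the two-mode generator values are invented) simulates the mode process by discretizing the transition probabilities p_ij ≈ π_ij·δ for small δ, then compares the empirical occupation of mode 0 with its stationary value:

```python
import random

random.seed(1)

# Hypothetical 2-mode generator: rows sum to zero, off-diagonal rates >= 0.
PI = [[-0.5, 0.5],
      [ 0.8, -0.8]]
delta = 0.001              # small time step, so p_ij ~ pi_ij * delta
steps = 200_000

mode, time_in = 0, [0.0, 0.0]
for _ in range(steps):
    time_in[mode] += delta
    # Jump to the other mode with probability pi_ij * delta + o(delta).
    if random.random() < PI[mode][1 - mode] * delta:
        mode = 1 - mode

occ0 = time_in[0] / sum(time_in)
# The stationary distribution solves pi^T * PI = 0; here it is
# (0.8, 0.5)/1.3, so mode 0 should occupy roughly 61.5% of the time.
print(round(occ0, 2))
```

Over a long horizon the empirical occupation approaches the stationary distribution of the generator, which is the usual sanity check for such a mode process.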
To proceed, we need the following definitions.

Definition 1. The Markovian jump nonlinear neural network (MJNNN) (perturbation-free) is said to be stochastically stable if, for all initial states φ0 and modes η0,

$$\lim_{t\to\infty} E\{\|x(t, \phi_0, \eta_0)\|^2\} = 0. \quad (6)$$

Definition 2. The MJNUNN with Wiener process is said to be SRST if it is stochastically stable for any admissible perturbation and Wiener process disturbance.

Lemma 1. The infinitesimal generator of the Markov process (x(t), η_t) is given by

$$\mathcal{L}V(t, x(t), i) = \frac{\partial V(t, x(t), i)}{\partial x}\Big[-A_ix(t) - A_{di}x(t-d) + B_if(x(t)) + B_{di}f(x(t-d)) + C_ig(x(t), x(t-d))\Big] + \frac{1}{2}\,\mathrm{tr}\Big[x^T(t)W_i^T\frac{\partial^2 V(t, x(t), i)}{\partial x^2}W_ix(t)\Big] + \sum_{j=1}^{N}\pi_{ij}V(t, x(t), j).$$

Itô's formula plays an important role in the analysis of stochastic systems; the details can be found in [24] and are omitted here. In the sequel, for simplicity, when η_t = i the matrices A(η_t), A_d(η_t), B(η_t), B_d(η_t), C(η_t), W(η_t) are written A_i, A_di, B_i, B_di, C_i, W_i. Hence, for η_t = i ∈ S, system (3) yields the following MJNUNN:

$$dx(t) = \big[-A_ix(t) - A_{di}x(t-d) + B_if(x(t)) + B_{di}f(x(t-d)) + C_ig(x(t), x(t-d))\big]dt + W_ix(t)\,d\omega(t), \qquad x(t) = \phi_0,\ t \in [-d, 0]. \quad (7)$$
3 Main Results
In this section, SRST criteria for MJNUNNs with Wiener process are given.

Theorem 1. Suppose the activation function f(·) satisfies (H1). If there exist matrices X_i > 0, Q_i > 0, R_i > 0, S_i > 0 and scalars ε_1, ε_2 > 0 such that the following LMIs hold for all i ∈ S:

$$\begin{pmatrix} X_i & X_iA_{di}^T \\ \star & R_i \end{pmatrix} \ge 0, \qquad \begin{pmatrix} X_i & X_iK^T \\ \star & S_i \end{pmatrix} \ge 0, \qquad \begin{pmatrix} X_i & X_iM_d^T \\ \star & \varepsilon_2 I \end{pmatrix} \ge 0 \quad (8)$$

and

$$\begin{pmatrix} (1,1) & X_iK^T & X_iM^T \\ \star & -Q_i & 0 \\ \star & \star & -\varepsilon_1 I \end{pmatrix} < 0, \quad i = 1, \dots, N, \quad (9)$$

where

$$(1,1) = -(A_iX_i + X_iA_i^T) + R_i + B_iQ_iB_i^T + B_{di}S_iB_{di}^T + C_i(\varepsilon_1 + \varepsilon_2)C_i^T + 3X_i + W_i^TX_iW_i + \sum_{j=1}^{N}\pi_{ij}X_j,$$

then the MJNUNN (7) is SRST.
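Once candidate decision variables are fixed, feasibility of (8)-(9) reduces to eigenvalue tests. The scalar (L = 1), single-mode (N = 1, so the π-sum vanishes) sketch below is an illustration only, with all parameter values invented, assuming NumPy is available:

```python
import numpy as np

# Scalar, single-mode instance; every value below is made up for the demo.
A, Ad, B, Bd, C, W = 3.0, 0.4, 0.5, 0.3, 0.2, 0.1
K, M, Md = 1.0, 0.1, 0.1            # Lipschitz / perturbation bounds
X, Q, R, S = 1.0, 1.0, 0.2, 1.0     # candidate decision variables
e1, e2 = 1.0, 1.0

def psd(mat, tol=1e-9):
    return np.linalg.eigvalsh(mat).min() >= -tol

def nd(mat, tol=1e-9):
    return np.linalg.eigvalsh(mat).max() <= -tol

lmi8 = [np.array([[X, X * Ad], [X * Ad, R]]),
        np.array([[X, X * K], [X * K, S]]),
        np.array([[X, X * Md], [X * Md, e2]])]

one_one = (-(A * X + X * A) + R + B * Q * B + Bd * S * Bd
           + C * (e1 + e2) * C + 3 * X + W * X * W)   # pi-sum is zero here
lmi9 = np.array([[one_one, X * K, X * M],
                 [X * K, -Q, 0.0],
                 [X * M, 0.0, -e1]])

feasible = all(psd(m) for m in lmi8) and nd(lmi9)
print(feasible)  # True: this candidate satisfies (8) and (9)
```

In a real application one would search for the X_i, Q_i, R_i, S_i with a semidefinite programming solver rather than checking a hand-picked candidate.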
Proof. Consider the following generalized Lyapunov functional for mode i ∈ S:

$$V(t, x(t), i) = x^T(t) P_i x(t), \quad (10)$$

where P_i = P_i^T is a symmetric positive-definite matrix. Then, from Lemma 1 and Assumption 2, the infinitesimal generator of the Markov process (t, x(t), i) satisfies

$$\begin{aligned} \mathcal{L}V(t, x(t), i) \le{}& -x^T(t)(P_iA_i + A_i^TP_i)x(t) + x^T(t)P_iR_iP_ix(t) + x^T(t-d)A_{di}^TR_i^{-1}A_{di}x(t-d) \\ &+ x^T(t)P_iB_iQ_iB_i^TP_ix(t) + x^T(t)K^TQ_i^{-1}Kx(t) + x^T(t)P_iB_{di}S_iB_{di}^TP_ix(t) \\ &+ x^T(t-d)K^TS_i^{-1}Kx(t-d) + x^T(t)P_iC_i(\varepsilon_1 + \varepsilon_2)C_i^TP_ix(t) \\ &+ \varepsilon_1^{-1}x^T(t)M^TMx(t) + \varepsilon_2^{-1}x^T(t-d)M_d^TM_dx(t-d) \\ &+ x^T(t)W_i^TP_iW_ix(t) + \sum_{j=1}^{N}\pi_{ij}x^T(t)P_jx(t). \end{aligned} \quad (11)$$
Pre- and post-multiplying (11) by X_i = P_i^{-1} and applying the Schur complement, LV(t, x(t), i) < 0 holds if and only if inequalities (8) and (9) hold; that is, the MJNUNN (7) is SRST. This completes the proof.

Theorem 2. Suppose the activation function f satisfies (H1). If there exist matrices P_i > 0, Q_i > 0, S_i > 0, scalars ε_1, ε_2 > 0, and a matrix G such that the following LMIs hold for all i ∈ S:

$$G = \begin{pmatrix} G_{11} & G_{12} & G_{13} \\ \star & G_{22} & G_{23} \\ \star & \star & G_{33} \end{pmatrix} > 0 \quad (12)$$

and

$$\begin{pmatrix} (1,1) & G_{12} + G_{13} + G_{23}^T & K^T & 0 & K^T \\ \star & (2,2) & 0 & 0 & 0 \\ \star & \star & -S_i & 0 & 0 \\ \star & \star & \star & dG_{33} & 0 \\ \star & \star & \star & \star & -Q_i \end{pmatrix} < 0, \quad i = 1, \dots, N, \quad (13)$$

where

$$(1,1) = -(P_iA_i + A_i^TP_i) + P_iB_iQ_iB_i^TP_i + P_iB_{di}S_iB_{di}^TP_i + C_i(\varepsilon_1 + \varepsilon_2)C_i^T + W_i^TP_iW_i + \sum_{j=1}^{N}\pi_{ij}P_j + \varepsilon_1^{-1}M^TM + \varepsilon_2^{-1}M_d^TM_d + dG_{11} + 2G_{13},$$
$$(2,2) = \varepsilon_2^{-1}M_d^TM_d + dG_{22} + 2G_{23},$$

then the MJNUNN (7) is SRST.
Proof. Consider the following generalized Lyapunov functional for mode i ∈ S:

$$V(t, x(t), i) = V_1 + V_2 + V_3, \quad (14)$$

with

$$V_1 = x^T(t)P_ix(t), \qquad V_2 = \int_0^t \int_{\theta-d}^{\theta} r^TGr\,ds\,d\theta, \qquad V_3 = \int_{-d}^{0}\int_{t+\theta}^{t} \dot{x}^T(s)G_{33}\dot{x}(s)\,ds\,d\theta,$$

where r = [x^T(θ), x^T(θ−d), ẋ^T(s)]^T.
Now, for this functional, the infinitesimal generator of the Markov process (t, x(t), i) becomes

$$\mathcal{L}V(t, x(t), i) = \mathcal{L}V_1 + \mathcal{L}V_2 + \mathcal{L}V_3 \le \eta^T(t)\Omega\eta(t), \quad (15)$$

where η(t) = [x^T(t), x^T(t−d), ẋ^T(t)]^T and

$$\Omega = \begin{pmatrix} (1,1) & G_{12} + G_{13} + G_{23} & 0 \\ \star & (2,2) & 0 \\ \star & \star & dG_{33} \end{pmatrix} < 0, \quad i = 1, \dots, N,$$

with

$$(1,1) = -(P_iA_i + A_i^TP_i) + P_iB_iQ_iB_i^TP_i + K^TQ_i^{-1}K + P_iB_{di}S_iB_{di}^TP_i + C_i(\varepsilon_1 + \varepsilon_2)C_i^T + W_i^TP_iW_i + \sum_{j=1}^{N}\pi_{ij}P_j + \varepsilon_1^{-1}M^TM + \varepsilon_2^{-1}M_d^TM_d + dG_{11} + 2G_{13},$$
$$(2,2) = K^TS_i^{-1}K + \varepsilon_2^{-1}M_d^TM_d + dG_{22} + 2G_{23}.$$

By the Schur complement, Ω < 0 if and only if inequalities (12) and (13) hold; then LV(t, x(t), i) < 0, that is, the MJNUNN (7) is SRST. This completes the proof.
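To illustrate the kind of behavior these theorems certify, one can simulate a scalar instance of (7) with the Euler-Maruyama scheme. The coefficients below are invented for the demonstration, with tanh as the activation and a perturbation respecting Assumption 2; this sketch assumes NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(2)

# Euler-Maruyama simulation of a scalar instance of (7); the coefficient
# values and the tanh activation are illustrative, not taken from the paper.
A, Ad, B, Bd, C, W = 3.0, 0.4, 0.5, 0.3, 0.2, 0.1
d, dt, T = 0.5, 0.001, 20.0
lag = int(d / dt)
n = int(T / dt)

x = np.empty(n + lag + 1)
x[: lag + 1] = 1.0                            # constant initial history phi_0
for k in range(lag, n + lag):
    xt, xd = x[k], x[k - lag]
    g = 0.1 * np.sin(xt) + 0.1 * np.sin(xd)   # perturbation obeying Assumption 2
    drift = (-A * xt - Ad * xd + B * np.tanh(xt)
             + Bd * np.tanh(xd) + C * g)
    x[k + 1] = xt + drift * dt + W * xt * np.sqrt(dt) * rng.standard_normal()

print(abs(x[-1]) < 1e-2)  # the trajectory has decayed toward the origin
```

The strong self-feedback dominates the delayed and stochastic terms here, so sample paths contract toward zero, which is consistent with stochastic stability in the sense of Definition 1.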
Remark 1. In the same way, similar results can be established for related classes of Markovian jump nonlinear uncertain neural networks, such as MJNUNNs with time-varying delays, MJNUNNs with multiple delays, and Markovian jump uncertain BAM neural networks.
4 Conclusions
The problem of SRST for MJNUNNs with Wiener process has been investigated, and criteria for the SRST of MJNUNNs with Wiener process have been derived. The sufficient conditions are expressed in terms of LMIs, which makes them computationally efficient and flexible.
Acknowledgments The authors would like to thank the anonymous reviewers and the Associate Editor for their comments on improving the overall quality of the paper.
References

1. Rong, L.B.: LMI Approach for Global Periodicity of Neural Networks with Time-Varying Delays. IEEE Trans. Circuits Systems I: Regular Papers 52(7) (2005) 1451-1458
2. Cao, J., Liang, J.: Boundedness and Stability for Cohen-Grossberg Neural Network with Time-Varying Delays. J. Math. Anal. Appl. 296(2) (2004) 665-685
3. Lu, H.T.: Absolute Exponential Stability Analysis of Delayed Neural Networks. Physics Letters A 336(2-3) (2005) 133-140
4. Lu, H.T., He, Z.Y.: Global Exponential Stability of Delayed Competitive Neural Networks with Different Time Scales. Neural Networks 18(3) (2005) 243-250
5. Cao, J.D., Huang, D.S., Qu, Y.Z.: Global Robust Stability of Delayed Recurrent Neural Networks. Chaos, Solitons and Fractals 23(1) (2005) 221-229
6. Cao, J.D., Chen, T.P.: Globally Exponentially Robust Stability and Periodicity of Delayed Neural Networks. Chaos, Solitons and Fractals 22(4) (2004) 957-963
7. Lou, X.Y., Cui, B.T.: Global Asymptotic Stability of BAM Neural Networks with Distributed Delays and Reaction-Diffusion Terms. Chaos, Solitons and Fractals 27(5) (2006) 1347-1354
8. Huang, X., Cao, J.D., Huang, D.S.: LMI-Based Approach for Delay-Dependent Exponential Stability Analysis of BAM Neural Networks. Chaos, Solitons and Fractals 24(3) (2005) 885-898
9. Cao, J.D.: On Exponential Stability and Periodic Solutions of CNNs with Delays. Phys. Lett. A 267(5-6) (2000) 312-318
10. Cao, J.D., Zhou, D.M.: Stability Analysis of Delayed Cellular Neural Networks. Neural Networks 11(9) (1998) 1601-1605
11. Cao, J.D., Jiang, Q.H.: An Analysis of Periodic Solutions of Bi-Directional Associative Memory Networks with Time-Varying Delays. Physics Letters A 330(3-4) (2004) 203-213
12. Liang, J.L., Cao, J.D., Ho, D.W.C.: Discrete-Time Bidirectional Associative Memory Neural Networks with Variable Delays. Physics Letters A 335(2-3) (2005) 226-234
13. Li, C.D., Liao, X.F., Zhang, R.: Delay-Dependent Exponential Stability Analysis of Bi-Directional Associative Memory Neural Networks with Time Delay: An LMI Approach. Chaos, Solitons and Fractals 24(4) (2005) 1119-1134
14. Mao, X.R.: Stochastic Differential Equations and Applications. Horwood, Chichester, U.K. (1997)
15. Liao, X.X., Mao, X.R.: Exponential Stability and Instability of Stochastic Neural Networks. Stochast. Anal. Appl. 14(2) (1996) 165-185
16. Wan, L., Sun, J.H.: Mean Square Exponential Stability of Stochastic Delayed Hopfield Neural Networks. Physics Letters A 343(4) (2005) 306-318
17. Huang, H., Ho, D.W.C., Lam, J.: Stochastic Stability Analysis of Fuzzy Hopfield Neural Networks with Time-Varying Delays. IEEE Trans. Circuits Systems II: Express Briefs 52(5) (2005) 251-255
18. Xie, L.: Stochastic Robust Stability Analysis for Markovian Jumping Neural Networks with Time Delays. The 2005 IEEE International Conference on Networking, Sensing and Control (2005) 923-928
19. Liao, X.X., Mao, X.R.: Stability of Stochastic Neural Networks. Neural, Para. Sci. Comput. 4(2) (1996) 205-224
20. Blythe, S., Mao, X.R., Shah, A.: Razumikhin-Type Theorems on Stability of Stochastic Neural Networks with Delays. Stochast. Anal. Appl. 19(1) (2001) 85-101
21. Arnold, L.: Stochastic Differential Equations: Theory and Applications. John Wiley and Sons, New York (1974)
22. Boukas, E.K., Liu, K.: Deterministic and Stochastic Systems with Time-Delay. Birkhauser, Boston (2002)
23. Raouf, J., Boukas, E.K.: Robust Stabilization of Markovian Jump Linear Singular Systems with Wiener Process. Proc. of the 2004 American Control Conference, Boston, Massachusetts (2004) 3170-3175
Stochastic Robust Stability Analysis for Markovian Jump Discrete-Time Delayed Neural Networks with Multiplicative Nonlinear Perturbations

Li Xie¹, Tianming Liu³, Guodong Lu², Jilin Liu¹, and Stephen T.C. Wong³

¹ Department of Information and Electronic Engineering, Zhejiang University
² State Key Lab of CAD&CG, Zhejiang University, 310027 Hangzhou, P.R. China
[email protected]
³ Center for Bioinformatics, HCNR, Harvard Medical School, 02115 Boston, MA, USA
[email protected]
Abstract. The problem of stochastic robust stability for Markovian jump discrete-time delayed neural networks with multiplicative nonlinear perturbations is investigated via Lyapunov stability theory in this paper. Based on the linear matrix inequality (LMI) methodology, a novel analysis approach is developed. Sufficient conditions for stochastic robust stability are given in terms of coupled linear matrix inequalities. The stability criteria, represented in the LMI setting, are less conservative and more computationally efficient than the methods reported in the literature.
1 Introduction

In recent years, considerable research results on various aspects of neural networks have been presented. Most of the literature on this subject focuses on the stability analysis of neural networks, and existing studies of the dynamical properties of neural networks concentrate predominantly on delay-free networks [1-3]. Since transmission delays exist among the neurons, the stability and convergence properties of neural networks depend greatly on their intrinsic time delays. In [4-6], using the Lyapunov functional method, stability criteria for bidirectional associative memory networks and recurrent neural networks with time delays are derived. When the neural networks are subject to the influence of Markovian jump parameters, standard approaches for analyzing the behavior of the neural networks are not applicable. The stochastic Lyapunov-Krasovskii functional approach has been adopted in the study of Markovian jump systems. In [7, 8], the stability, stabilizability, and controllability of jump systems were studied. The results are presented in the form of matrix measures or sets of coupled algebraic Riccati equations, which are either too conservative or difficult to solve. It is well known that the linear matrix inequality (LMI) approach is a less conservative alternative that facilitates the solution process. LMI-based techniques have been successfully employed in a variety of stability analyses for neural networks [9].
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 172-178, 2006. © Springer-Verlag Berlin Heidelberg 2006
In this paper, we report stochastic robust stability criteria for Markovian jump discrete-time delayed neural networks with multiplicative nonlinear perturbations, developed based on the LMI approach. The organization of the remaining part of this paper is as follows. In Section 2, the system model of Markovian jump discrete-time delayed neural networks is provided. In Section 3, the stochastic robust stability of the neural network is investigated. Finally, we present the conclusions in Section 4.
2 Systems Descriptions

Consider the discrete-time delayed neural networks with Markovian jumping parameters and multiplicative nonlinear perturbations, described by the following set of discrete-time functional difference equations:

$$x_n(k+1) = -a_n(r_k)x_n(k) - a_{1n}(r_k)x_n(k-d) + \sum_{m=1}^{L} b_{nm}(r_k)\sigma_m[x_m(k)] + \sum_{m=1}^{L} b_{1nm}(r_k)\sigma_m[x_m(k-d)] + \sum_{m=1}^{L} c_{nm}(r_k)f_m[x_m(k), x_m(k-d)]w(k), \quad n = 1, \dots, L, \quad (1)$$

where x_n(k) denotes the state of the nth neuron at time k; σ_m[x_m(k)] denotes the activation function of the mth neuron on the nth neuron; f_m[x_m(k), x_m(k−d)] denotes the multiplicative nonlinear perturbation of the mth neuron on the nth neuron; L denotes the number of neurons in the network; d is a positive integer denoting the constant time delay of the state of the neurons; and the variable {w(k)} is a standard random scalar sequence satisfying

$$E\{w(k)\} = 0, \qquad E\{w^2(k)\} = 1, \quad (2)$$

where w(0), w(1), ... are independent. Here a_n(r_k) and a_{1n}(r_k) denote the charging time constants or passive decay rates of the nth neuron, b_{nm}(r_k) and b_{1nm}(r_k) denote the connection weights of the mth neuron on the nth neuron, and c_{nm}(r_k) are parameters of the multiplicative nonlinear noise. These coefficients are functions of the random process {r_k, k ∈ Z}, a discrete-time, discrete-state Markov chain taking values in a finite set S = {1, 2, ..., s}, with transition probabilities from mode i at time k to mode j at time k + 1, i.e., r_k = i, r_{k+1} = j ∈ S. For any i, j ∈ S, we have

$$p_{ij} = \Pr\{r_{k+1} = j \mid r_k = i\}, \qquad p_{ij} \ge 0, \quad (3)$$
$$\sum_{j=1}^{s} p_{ij} = 1. \quad (4)$$

We can rewrite the Markovian jump discrete-time delayed neural network with multiplicative nonlinear perturbation (1) in the following vector form:

$$x_{k+1} = -A(r_k)x_k - A_1(r_k)x_{k-d} + B(r_k)\sigma(x_k) + B_1(r_k)\sigma(x_{k-d}) + C(r_k)f(x_k, x_{k-d})w_k, \quad (5)$$
where A(r_k) = diag{a_1(r_k), ..., a_L(r_k)}; A_1(r_k) = diag{a_{11}(r_k), ..., a_{1L}(r_k)}; B(r_k) = {b_{nm}(r_k)}_{L×L}; B_1(r_k) = {b_{1nm}(r_k)}_{L×L}; C(r_k) = {c_{nm}(r_k)}_{L×L}; x_k = x(k) = [x_1(k), ..., x_L(k)]^T; x_{k−d} = x(k−d) = [x_1(k−d), ..., x_L(k−d)]^T; σ(x_k) = {σ_1[x_1(k)], ..., σ_L[x_L(k)]}^T; σ(x_{k−d}) = {σ_1[x_1(k−d)], ..., σ_L[x_L(k−d)]}^T; and f(x_k, x_{k−d}) = {f_1(x_1(k), x_1(k−d)), ..., f_L(x_L(k), x_L(k−d))}^T.

To obtain our results, we state certain assumptions and definitions as follows.

Assumption 1. The activation functions σ_n(·) satisfy globally Lipschitz conditions; i.e., there exist positive constants w_n such that

$$0 < \frac{\sigma_n(u) - \sigma_n(v)}{u - v} \le w_n, \quad n = 1, \dots, L. \quad (6)$$
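As a quick illustration (not part of the paper), Assumption 1 holds for the common choice σ_n = tanh with w_n = 1, since every difference quotient of tanh lies in (0, 1]:

```python
import math

def quotient(u, v):
    # Difference quotient (tanh(u) - tanh(v)) / (u - v), for u != v.
    return (math.tanh(u) - math.tanh(v)) / (u - v)

pairs = [(-3.0, 2.0), (0.5, 0.1), (-0.01, 0.02), (4.0, 5.0)]
ok = all(0.0 < quotient(u, v) <= 1.0 for u, v in pairs)
print(ok)  # True: tanh satisfies the sector condition (6) with w_n = 1
```

The bound follows because tanh is strictly increasing with derivative at most 1, so the mean value theorem pins each quotient inside (0, 1].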
Assumption 2. The perturbation function f(x(k), x(k−d)) satisfies a norm-bounded condition; i.e., there exist positive matrices M and M_1 such that

$$\|f(x(k), x(k-d))\| \le \|Mx(k)\| + \|M_1x(k-d)\|. \quad (7)$$

Definition 1. The Markovian jump discrete-time delayed neural network (5) (perturbation-free) is said to be stochastically stable if, for every initial state (x_0, r_0), there exists a finite number M̃(x_0, r_0) > 0 such that

$$\lim_{K\to\infty} E\Big\{\sum_{k=0}^{K}\|x(k)\|^2 \,\Big|\, x_0, r_0\Big\} \le \tilde{M}(x_0, r_0). \quad (8)$$

Definition 2. The Markovian jump discrete-time delayed neural network (5) is said to be stochastic robust stable if it is stochastically stable for any admissible perturbation.
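Definition 1 can be probed empirically: simulating a scalar two-mode instance of (5) with invented, strongly contracting coefficients shows the cumulative energy Σ‖x(k)‖² staying finite along a sample path. A rough sketch, assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(3)

# Scalar two-mode instance of (5); every coefficient below is illustrative.
a  = {1: 0.2, 2: 0.3}              # a(r_k)
a1 = {1: 0.05, 2: 0.05}            # a1(r_k)
b  = {1: 0.1, 2: 0.1}
b1 = {1: 0.05, 2: 0.05}
c  = {1: 0.05, 2: 0.05}
P  = {1: [0.9, 0.1], 2: [0.2, 0.8]}  # transition probabilities p_ij
d, steps = 3, 5000

x = [1.0] * (d + 1)                # constant initial history
r = 1
energy = 0.0
for k in range(steps):
    xk, xd = x[-1], x[-1 - d]
    w = rng.standard_normal()      # E{w} = 0, E{w^2} = 1
    f = 0.1 * xk + 0.1 * xd        # perturbation consistent with (7)
    x.append(-a[r] * xk - a1[r] * xd + b[r] * np.tanh(xk)
             + b1[r] * np.tanh(xd) + c[r] * f * w)
    energy += x[-1] ** 2
    r = 1 if rng.random() < P[r][0] else 2   # Markov mode switch

print(energy < 10.0 and abs(x[-1]) < 1e-6)
```

With these contracting coefficients the state decays geometrically in every mode, so the accumulated energy converges, in line with the bound M̃(x_0, r_0) in (8).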
3 Main Results

In this section, sufficient conditions for the stochastic robust stability of Markovian jump discrete-time delayed neural networks with multiplicative nonlinear perturbations are presented.

Theorem 1. The Markovian jump discrete-time delayed neural network (5) is stochastic robust stable if there exist matrices P_i > 0 (i = 1, ..., s), S > 0, R > 0, and constants ε_n > 0 (n = 1, ..., 6), ρ_j > 0 (j = 1, ..., s) satisfying the coupled LMIs

$$\Xi_i = \begin{pmatrix} \Xi_{i11} & \Xi_{i12} & \Xi_{i13} \\ \Xi_{i21} & \Xi_{i22} & 0 \\ \Xi_{i31} & 0 & \Xi_{i33} \end{pmatrix} < 0, \qquad \begin{pmatrix} \rho_jI & C_i^TP_j \\ P_jC_i & P_j \end{pmatrix} > 0, \quad i, j = 1, \dots, s, \quad (9)$$

where

$$\begin{aligned}
\Xi_{i11} &= -P_i + (\varepsilon_1 + \varepsilon_2 + \varepsilon_3)A_i^TA_i + (\varepsilon_4 + \varepsilon_5)A_{1i}^TA_{1i} + \varepsilon_6W^TB_i^TB_iW + 2\tilde{\rho}_{ij}(M^TM + M_1^TM_1), \\
\Xi_{i12} &= \Xi_{i21}^T = \big[A_i^T\tilde{P}_{ij},\ A_{1i}^T\tilde{P}_{ij},\ W^TB_i^T\tilde{P}_{ij},\ W^TB_{1i}^T\tilde{P}_{ij}\big], \\
\Xi_{i13} &= \Xi_{i31}^T = \big[A_{1i}^T\tilde{P}_{ij},\ W^TB_i^T\tilde{P}_{ij},\ W^TB_{1i}^T\tilde{P}_{ij},\ W^TB_i^T\tilde{P}_{ij},\ W^TB_{1i}^T\tilde{P}_{ij},\ W^TB_{1i}^T\tilde{P}_{ij}\big], \\
\Xi_{i22} &= \mathrm{diag}\{-\tilde{P}_{ij}, -\tilde{P}_{ij}, -\tilde{P}_{ij}, -\tilde{P}_{ij}\}, \\
\Xi_{i33} &= \mathrm{diag}\{-\varepsilon_1I, -\varepsilon_2I, -\varepsilon_3I, -\varepsilon_4I, -\varepsilon_5I, -\varepsilon_6I\}, \\
W &= \mathrm{diag}\{w_1, \dots, w_L\}, \qquad \tilde{P}_{ij} = \sum_{j=1}^{s}p_{ij}P_j, \qquad \tilde{\rho}_{ij} = \sum_{j=1}^{s}p_{ij}\rho_j.
\end{aligned} \quad (10)$$
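The coupled quantities P̃_ij and ρ̃_ij in (10) are simply one-step expectations of the mode-dependent data under the transition matrix. A small sketch with invented two-mode data, assuming NumPy is available:

```python
import numpy as np

p = np.array([[0.9, 0.1],
              [0.2, 0.8]])                  # transition matrix, rows sum to 1
P = [np.array([[2.0, 0.0], [0.0, 1.0]]),    # P_1 > 0 (made-up)
     np.array([[1.0, 0.5], [0.5, 3.0]])]    # P_2 > 0 (made-up)
rho = np.array([0.4, 0.6])

# P_tilde_i = sum_j p_ij * P_j ; rho_tilde_i = sum_j p_ij * rho_j
P_tilde = [sum(p[i, j] * P[j] for j in range(2)) for i in range(2)]
rho_tilde = p @ rho

print(np.allclose(P_tilde[0], 0.9 * P[0] + 0.1 * P[1]))  # True
print(rho_tilde)  # [0.42 0.56]
```

These expected matrices are what couple the s inequalities in (9) together: mode i's condition depends on every P_j reachable in one step.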
Proof. Introduce the Lyapunov function candidate for the neural network (5):

$$V_k(x_k, r_k) = x_k^TP(r_k)x_k + \sum_{i=k-d}^{k-1}x_i^TSx_i + \sum_{i=k-d}^{k-1}\sigma^T(x_i)R\sigma(x_i). \quad (11)$$

Let P_i denote P(r_k) when r_k = i; P_i is a constant matrix for each i ∈ S. Then

$$\begin{aligned}
\Delta V_k(x_k, r_k) ={}& E\{V_{k+1}(x_{k+1}, r_{k+1}) \mid x_k, r_k = i\} - V_k(x_k, r_k = i) \\
={}& \{-x_k^TA_i^T - x_{k-d}^TA_{1i}^T + \sigma^T(x_k)B_i^T + \sigma^T(x_{k-d})B_{1i}^T\}\tilde{P}_{ij}\{-A_ix_k - A_{1i}x_{k-d} + B_i\sigma(x_k) + B_{1i}\sigma(x_{k-d})\} \\
&+ f^T(x_k, x_{k-d})C_i^T\tilde{P}_{ij}C_if(x_k, x_{k-d}) + x_k^TSx_k - x_{k-d}^TSx_{k-d} + \sigma^T(x_k)R\sigma(x_k) - \sigma^T(x_{k-d})R\sigma(x_{k-d}) - x_k^TP_ix_k \\
\le{}& x_k^TA_i^T\tilde{P}_{ij}A_ix_k + \varepsilon_1x_k^TA_i^TA_ix_k + \varepsilon_1^{-1}x_{k-d}^TA_{1i}^T\tilde{P}_{ij}\tilde{P}_{ij}A_{1i}x_{k-d} \\
&+ \varepsilon_2x_k^TA_i^TA_ix_k + \varepsilon_2^{-1}\sigma^T(x_k)B_i^T\tilde{P}_{ij}\tilde{P}_{ij}B_i\sigma(x_k) + \varepsilon_3x_k^TA_i^TA_ix_k \\
&+ \varepsilon_3^{-1}\sigma^T(x_{k-d})B_{1i}^T\tilde{P}_{ij}\tilde{P}_{ij}B_{1i}\sigma(x_{k-d}) + x_{k-d}^TA_{1i}^T\tilde{P}_{ij}A_{1i}x_{k-d} \\
&+ \varepsilon_4x_{k-d}^TA_{1i}^TA_{1i}x_{k-d} + \varepsilon_4^{-1}\sigma^T(x_k)B_i^T\tilde{P}_{ij}\tilde{P}_{ij}B_i\sigma(x_k) \\
&+ \varepsilon_5x_{k-d}^TA_{1i}^TA_{1i}x_{k-d} + \varepsilon_5^{-1}\sigma^T(x_{k-d})B_{1i}^T\tilde{P}_{ij}\tilde{P}_{ij}B_{1i}\sigma(x_{k-d}) \\
&+ \sigma^T(x_k)B_i^T\tilde{P}_{ij}B_i\sigma(x_k) + \varepsilon_6\sigma^T(x_k)B_i^TB_i\sigma(x_k) \\
&+ \varepsilon_6^{-1}\sigma^T(x_{k-d})B_{1i}^T\tilde{P}_{ij}\tilde{P}_{ij}B_{1i}\sigma(x_{k-d}) + \sigma^T(x_{k-d})B_{1i}^T\tilde{P}_{ij}B_{1i}\sigma(x_{k-d}) \\
&+ 2\tilde{\rho}_{ij}(x_k^TM^TMx_k + x_{k-d}^TM_1^TM_1x_{k-d}) - x_k^TP_ix_k \\
&+ \sigma^T(x_k)R\sigma(x_k) - \sigma^T(x_{k-d})R\sigma(x_{k-d}) + x_k^TSx_k - x_{k-d}^TSx_{k-d}.
\end{aligned} \quad (12)$$

Let

$$S = A_{1i}^T\tilde{P}_{ij}A_{1i} + \varepsilon_1^{-1}A_{1i}^T\tilde{P}_{ij}\tilde{P}_{ij}A_{1i} + (\varepsilon_4 + \varepsilon_5)A_{1i}^TA_{1i} + 2\tilde{\rho}_{ij}M_1^TM_1, \qquad R = B_{1i}^T\tilde{P}_{ij}B_{1i} + (\varepsilon_3^{-1} + \varepsilon_5^{-1} + \varepsilon_6^{-1})B_{1i}^T\tilde{P}_{ij}\tilde{P}_{ij}B_{1i}. \quad (13)$$
We have

$$\Delta V_k(x_k, r_k) \le x_k^T\Theta_ix_k, \quad (14)$$

where

$$\begin{aligned}
\Theta_i ={}& -P_i + A_i^T\tilde{P}_{ij}A_i + A_{1i}^T\tilde{P}_{ij}A_{1i} + (\varepsilon_1 + \varepsilon_2 + \varepsilon_3)A_i^TA_i + (\varepsilon_4 + \varepsilon_5)A_{1i}^TA_{1i} + \varepsilon_1^{-1}A_{1i}^T\tilde{P}_{ij}\tilde{P}_{ij}A_{1i} \\
&+ W^TB_i^T\tilde{P}_{ij}B_iW + W^TB_{1i}^T\tilde{P}_{ij}B_{1i}W + \varepsilon_6W^TB_i^TB_iW + 2\tilde{\rho}_{ij}(M^TM + M_1^TM_1) \\
&+ (\varepsilon_3^{-1} + \varepsilon_5^{-1} + \varepsilon_6^{-1})W^TB_{1i}^T\tilde{P}_{ij}\tilde{P}_{ij}B_{1i}W + (\varepsilon_2^{-1} + \varepsilon_4^{-1})W^TB_i^T\tilde{P}_{ij}\tilde{P}_{ij}B_iW.
\end{aligned} \quad (15)$$

By the Schur complement lemma, if the coupled LMIs (9) hold, then Θ_i < 0. It then follows that

$$\Delta V_k(x_k, r_k) \le x_k^T\Theta_ix_k < 0. \quad (16)$$
Here, for x(k) ≠ 0,

$$\frac{\Delta V_k(x_k, r_k)}{V_k(x_k, r_k)} \le \frac{x_k^T\Theta_ix_k}{x_k^TP(r_k)x_k + \sum_{i=k-d}^{k-1}x_i^TSx_i + \sum_{i=k-d}^{k-1}x_i^TW^TRWx_i} \le \alpha - 1, \quad (17)$$

where

$$\alpha = 1 - \min_{r_k\in S}\left\{\frac{\lambda_{\min}(-\Theta_i)}{\lambda_{\max}(E_P(r_k))}\right\} < 1 \quad (18)$$

and

$$E_P(r_k) = \begin{pmatrix} P(r_k) & 0 \\ 0 & \bar{S} \end{pmatrix}, \qquad \bar{S} = \underbrace{\mathrm{diag}\{S + W^TRW, \dots, S + W^TRW\}}_{d\ \text{blocks}}.$$

From (17), we have

$$0 < \frac{E\{V_{k+1}(x_{k+1}, r_{k+1}) \mid x_k, r_k = i\}}{V_k(x_k, r_k)} \le \alpha; \quad (19)$$

that is, 0 < α < 1, and therefore

$$E\{V_{k+1}(x_{k+1}, r_{k+1}) \mid x_k, r_k = i\} \le \alpha V_k(x_k, r_k), \quad (20)$$

which implies that

$$E\Big\{\sum_{k=0}^{K}V_k(x_k, r_k) \,\Big|\, x_0, r_0\Big\} \le (1 + \alpha + \dots + \alpha^K)V_0(x_0, r_0) \le \frac{1 - \alpha^{K+1}}{1 - \alpha}V_0(x_0, r_0). \quad (21)$$
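The step from (20) to (21) is iterated conditioning plus a finite geometric sum; the closed form and the limiting bound used next can be confirmed directly:

```python
alpha, K = 0.85, 50                                  # illustrative values
partial = sum(alpha ** k for k in range(K + 1))      # 1 + alpha + ... + alpha^K
closed = (1 - alpha ** (K + 1)) / (1 - alpha)        # closed form in (21)
print(abs(partial - closed) < 1e-9, closed <= 1 / (1 - alpha))
```

Since 0 < α < 1, the partial sums increase to 1/(1−α), which is exactly the factor appearing in the limit below.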
Thus, we have

$$\lim_{K\to\infty} E\Big\{\sum_{k=0}^{K}x_k^TP(r_k)x_k \,\Big|\, x_0, r_0\Big\} \le \frac{1}{1-\alpha}V_0(x_0, r_0). \quad (22)$$
Define

$$\tilde{M}(P(r_0)) = \frac{V_0(x_0, r_0)}{(1-\alpha)\max_{i\in S}\{\rho[P(r_i)]\}}; \quad (23)$$

then

$$\lim_{N\to\infty} E\Big\{\sum_{k=0}^{N}V_k(x_k, r_k) \,\Big|\, x_0, r_0\Big\} \le \tilde{M}(P(r_0)), \quad (24)$$

which implies that the Markovian jump discrete-time delayed neural network (5) is stochastic robust stable for all admissible nonlinear perturbations.

Remark. In Theorem 1, a stochastic robust stability criterion for Markovian jump discrete-time delayed neural networks with multiplicative nonlinear perturbation is derived in terms of LMIs. The result can be simplified to the delay-free case.

Corollary 1. The Markovian jump discrete-time neural network (5) without delay is stochastic robust stable if there exist matrices P_i > 0 (i = 1, ..., s), S > 0, R > 0, and
constants ε > 0, ρ_j > 0 (j = 1, ..., s) satisfying the coupled LMIs

$$\Xi_i = \begin{pmatrix} \Xi_{i11} & A_i^T\tilde{P}_{ij} & W^TB_i^T\tilde{P}_{ij} \\ \tilde{P}_{ij}A_i & -\tilde{P}_{ij} & 0 \\ \tilde{P}_{ij}B_iW & 0 & -\varepsilon I \end{pmatrix} < 0, \qquad \begin{pmatrix} \rho_jI & C_i^TP_j \\ P_jC_i & P_j \end{pmatrix} > 0, \quad i, j = 1, \dots, s, \quad (25)$$

where

$$\Xi_{i11} = -P_i + \varepsilon A_i^TA_i + 2\tilde{\rho}_{ij}M^TM, \qquad W = \mathrm{diag}\{w_1, \dots, w_L\}, \qquad \tilde{P}_{ij} = \sum_{j=1}^{s}p_{ij}P_j, \qquad \tilde{\rho}_{ij} = \sum_{j=1}^{s}p_{ij}\rho_j. \quad (26)$$
4 Conclusions

In this paper, we consider the problem of stochastic robust stability analysis for Markovian jump discrete-time delayed neural networks with multiplicative nonlinear perturbations. Sufficient conditions for robust stochastic stability are given in terms of linear matrix inequalities. Our approach is believed to be computationally more flexible and efficient than other existing approaches.

Acknowledgement. This work was supported by the Chinese Natural Science Foundation (60534070, 60473129), and by HCNR, Harvard Medical School, USA.
References

1. Michel, A.N., Farrell, J.A., Porod, W.: Qualitative Analysis of Neural Networks. IEEE Trans. CAS 36(2) (1989) 229-243
2. Kelly, D.G.: Stability in Contractive Nonlinear Neural Networks. IEEE Trans. Bio. Eng. 37(3) (1990) 231-242
3. Liang, X.B., Wang, J.: An Additive Diagonal Stability Condition for Absolute Exponential Stability of a General Class of Neural Networks. IEEE Trans. CAS-I 48 (2001) 1308-1317
4. Cao, J., Wang, J.: Global Asymptotic and Robust Stability of Recurrent Neural Networks with Time Delays. IEEE Trans. CAS-I 52(2) (2005) 417-426
5. Liao, X., Wong, K.W.: Robust Stability of Interval Bidirectional Associative Memory Neural Network with Time Delays. IEEE Trans. SMC-B 34(2) (2004) 1142-1154
6. Feng, C.H., Plamondon, R.: Stability Analysis of Bidirectional Associative Memory Networks with Time Delays. IEEE Trans. NN 14(6) (2003) 1560-1565
7. Ji, Y., Chizeck, H.J.: Controllability, Stabilizability, and Continuous-Time Markovian Jump Linear Quadratic Control. IEEE Trans. AC 35(7) (1990) 777-788
8. Feng, X., Loparo, K.A., Ji, Y., et al.: Stochastic Stability Properties of Jump Linear Systems. IEEE Trans. AC 37(1) (1992) 38-53
9. Liao, X., Chen, G.R., Sanchez, E.N.: LMI-Based Approach for Asymptotically Stability Analysis of Delayed Neural Networks. IEEE Trans. CAS-I 49(7) (2002) 1033-1039
Global Robust Stability of General Recurrent Neural Networks with Time-Varying Delays

Jun Xu¹,², Daoying Pi¹, and Yong-Yan Cao¹

¹ National Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou 310027, P.R. China
² College of Information & Management, Jiangxi University of Finance & Economics, Nanchang 330013, P.R. China
{xujung, yongyancao}@gmail.com, [email protected]
Abstract. This paper is devoted to the global robust stability analysis of general recurrent neural networks with time-varying parametric uncertainty and time-varying delays. To remove the dependence on the size of the time delays, the Lyapunov-Razumikhin stability theorem and the LMI approach are applied to derive global robust stability conditions for the neural networks. Delay-dependent global robust stability criteria are then developed by integrating the Lyapunov-Krasovskii functional method with the LMI approach. These stability criteria are stated in terms of the solvability of linear matrix inequalities.
1 Introduction
Robust stability has been extensively studied by researchers recently (see, for example, [5], [8], [9], [10] and the references therein), because the stability of a neural network may be destroyed by its unavoidable uncertainty due to modeling errors, external disturbances, and parameter fluctuations during implementation on very-large-scale-integration (VLSI) chips [10]. Global robust stability of interval neural networks with time-varying delays was studied in [10], [11]. Robust stability of cellular neural networks with constant parametric uncertainties and single time delays was studied in [8]. Robust stability of cellular neural networks with time-varying parametric uncertainties and time-varying delays was studied in [9]. The LMI approach is an efficient tool that is widely used in the stability analysis of neural networks; see, for example, [5], [6], [8], [9] and the references therein. Motivated by these works, in this paper we study the robust stability of general recurrent neural networks with norm-bounded time-varying parametric uncertainties and time-varying delays by integrating the Lyapunov-Krasovskii functional method (and the Lyapunov-Razumikhin function method) with the LMI approach. The robust stability problem of time-delay systems can be divided into two categories according to the dependence on the size of the delays, namely delay-independent stability and delay-dependent stability. While delay-independent stability criteria guarantee the asymptotic stability of the system irrespective of the size of the
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 179-184, 2006. © Springer-Verlag Berlin Heidelberg 2006
time-delay, delay-dependent stability criteria provide an upper bound on the time-delay size that assures the asymptotic stability of the system. In this work we derive both delay-independent and delay-dependent stability conditions.
2 Model Description and Preliminaries
The recurrent neural network with time-varying delays and time-varying parametric uncertainties is described by the following differential equations:

$$\frac{dx(t)}{dt} = -(C + \Delta C(x,t))x(t) + (A + \Delta A(x,t))f(x(t)) + (B + \Delta B(x,t))l(x(t-\tau(t))) + I, \qquad x(t) = \phi(t),\ t \in [-\bar\tau, 0], \quad (1)$$

where x(t) ∈ R^n is the neuron state vector; x(t−τ(t)) = [x_1(t−τ_1(t)), x_2(t−τ_2(t)), ..., x_n(t−τ_n(t))]^T, with τ_j(t): [0, ∞) → [0, τ̄] (j = 1, 2, ..., n) continuous and differentiable; A = {a_ij}_{n×n} and B = {b_ij}_{n×n} are the feedback matrix and the delayed feedback matrix, respectively; C = diag(c_1, c_2, ..., c_n) > 0 is the self-feedback matrix; ΔC(x,t), ΔA(x,t), and ΔB(x,t) are matrix functions representing the uncertainties in C, A, and B, respectively; I denotes the external bias; and τ(t) denotes the time-varying delay, which is bounded by τ(t) ≤ τ̄ and satisfies τ̇(t) ≤ d < 1.

Throughout this paper, we assume that each neuron activation function in system (1) satisfies the following Assumption (H):

$$0 \le \frac{f_j(x_j) - f_j(y_j)}{x_j - y_j} \le \bar{G}_j \quad\text{and}\quad 0 \le \frac{l_j(x_j) - l_j(y_j)}{x_j - y_j} \le \bar{H}_j$$

for all x_j, y_j ∈ R with x_j ≠ y_j (j = 1, 2, ..., n).

We assume that the uncertainties can be described by ΔC(x,t) = D_0F_0(x,t)E_0, ΔA(x,t) = D_1F_1(x,t)E_1, ΔB(x,t) = D_2F_2(x,t)E_2, where D_i, E_i (i = 0, 1, 2) are known real constant matrices and F_i(x,t) (i = 0, 1, 2) are unknown real time-varying matrices with Lebesgue-measurable elements bounded by F_i^T(x,t)F_i(x,t) ≤ I for all t. For simplicity, in what follows we denote C̃(t) = C + ΔC(x,t), Ã(t) = A + ΔA(x,t), B̃(t) = B + ΔB(x,t).

We assume that system (1) has an equilibrium point x_e = (x_{e1}, x_{e2}, ..., x_{en})^T. To simplify the proofs, we shift the equilibrium point to the origin via the transformation y(t) = x(t) − x_e. Denoting g(y(t)) = f(y(t) + x_e) − f(x_e) and h(y(t−τ(t))) = l(y(t−τ(t)) + x_e) − l(x_e), we can transform system (1) into the form

$$\frac{dy(t)}{dt} = -\tilde{C}(t)y(t) + \tilde{A}(t)g(y(t)) + \tilde{B}(t)h(y(t-\tau(t))). \quad (2)$$
With Assumption (H), we have the following inequalities:

$$g^T(y(t))Qg(y(t)) \le y^T(t)\bar{G}Q\bar{G}y(t), \quad (3)$$
$$h^T(y(t-\tau(t)))h(y(t-\tau(t))) \le y^T(t-\tau(t))\bar{H}\bar{H}y(t-\tau(t)), \quad (4)$$

where Q > 0 is any diagonal matrix.
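For intuition (this example is not in the paper), inequality (3) can be verified directly in the scalar case: with g = tanh, sector bound Ḡ = 1, and diagonal Q = q > 0, we have g(y)·q·g(y) ≤ y·Ḡ·q·Ḡ·y because |tanh(y)| ≤ |y|:

```python
import math
import random

random.seed(4)
G_bar, q = 1.0, 2.5        # sector bound and a scalar diagonal Q > 0
checks = []
for _ in range(1000):
    y = random.uniform(-5.0, 5.0)
    g = math.tanh(y)
    checks.append(g * q * g <= y * G_bar * q * G_bar * y + 1e-12)
ok = all(checks)
print(ok)  # True: the quadratic bound (3) holds at every sample
```

The same squeeze works componentwise in R^n because Q and Ḡ are diagonal, which is exactly why the assumption restricts Q to diagonal matrices.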
3 Main Results

3.1 Delay-Independent Global Robust Stability
Firstly, we study the global robust stability of the general neural networks with Lyapunov-Razumikhin function approach (Theorem 1.2 in [3]). Theorem 1. Consider system (2),if there exist diagonal marix Q > 0, and scalars γ0 , γ1 , γ2 > 0, ν > 1 such that ¯G ¯ − νQ − γ −1 QD0 DT Q − γ1 QD1 DT Q Ξ = CQ + QC − γ0 E0T E0 − G 0 1 0 −γ2 QD2 D2T Q − QAAT Q − QBB T Q − QAE1T (γ1 I1 − E1 E1T )−1 E1 AT Q −QBE2T (γ2 I2 − E2 E2T )−1 E2 B T Q > 0 namely,
Ξ = [ Π_1       QAE_1^T   QBE_2^T   QD_0      QD_1           QD_2           QA     QB
      E_1A^TQ   Π_2       0         0         0              0              0      0
      E_2B^TQ   0         Π_3       0         0              0              0      0
      D_0^TQ    0         0         γ_0 I_3   0              0              0      0
      D_1^TQ    0         0         0         γ_1^{-1} I_4   0              0      0
      D_2^TQ    0         0         0         0              γ_2^{-1} I_5   0      0
      A^TQ      0         0         0         0              0              I_6    0
      B^TQ      0         0         0         0              0              0      I_7 ] > 0   (5)

and

Q − H̄H̄ > 0,   (6)
where Π_1 = CQ + QC − γ_0 E_0^T E_0 − νQ − ḠḠ, Π_2 = γ_1 I_1 − E_1 E_1^T > 0, Π_3 = γ_2 I_2 − E_2 E_2^T > 0, and I_1, I_2, I_3, I_4, I_5, I_6, I_7 are identity matrices, then the trivial solution of system (1) is globally robustly stable; the condition is delay-independent.

Proof: Given a diagonal matrix Q > 0, consider the Lyapunov-Razumikhin function candidate V(y(t)) = y^T(t) Q y(t). First, it is easy to see that

π_1 ‖y(t)‖² ≤ V(y(t)) ≤ π_2 ‖y(t)‖²,   (7)

where π_1 = λ_min(Q) and π_2 = λ_max(Q). The derivative of V(y(t)) along the solution of (2) can be computed as

V̇(y(t)) = 2y^T(t) Q ẏ(t) = −y^T(t)(CQ + QC)y(t) − 2y^T(t) E_0^T F_0^T(x, t) D_0^T Q y(t) + 2y^T(t) Q Ã(t) g(y(t)) + 2y^T(t) Q B̃(t) h(y(t − τ(t))).
J. Xu, D. Pi, and Y.-Y. Cao
Using Lemma 2.2 in [1], we have

−2y^T(t) E_0^T F_0^T(x, t) D_0^T Q y(t) ≤ γ_0 y^T(t) E_0^T E_0 y(t) + γ_0^{-1} y^T(t) Q D_0 D_0^T Q y(t).   (8)

By applying Lemma 1 in [4], Lemma 2.2 in [1] and (3), we have

2y^T(t) Q Ã(t) g(y(t)) ≤ y^T(t)[Q A A^T Q + Q A E_1^T (γ_1 I_1 − E_1 E_1^T)^{-1} E_1 A^T Q + γ_1 Q D_1 D_1^T Q + ḠḠ] y(t).   (9)

Similarly, we can obtain

2y^T(t) Q B̃(t) h(y(t − τ(t))) ≤ y^T(t)[Q B B^T Q + Q B E_2^T (γ_2 I_2 − E_2 E_2^T)^{-1} E_2 B^T Q + γ_2 Q D_2 D_2^T Q] y(t) + y^T(t − τ(t)) H̄ H̄ y(t − τ(t)).   (10)

With inequalities (8)-(10) and condition (6) in Theorem 1, we have

V̇(y(t)) ≤ −y^T(t)[CQ + QC − γ_0 E_0^T E_0 − ḠḠ − Q A E_1^T (γ_1 I_1 − E_1 E_1^T)^{-1} E_1 A^T Q − Q B E_2^T (γ_2 I_2 − E_2 E_2^T)^{-1} E_2 B^T Q − γ_0^{-1} Q D_0 D_0^T Q − γ_1 Q D_1 D_1^T Q − γ_2 Q D_2 D_2^T Q − Q A A^T Q − Q B B^T Q] y(t) + y^T(t − τ(t)) Q y(t − τ(t)).   (11)
By the Lyapunov-Razumikhin stability theorem (Theorem 1.2 in [3]), we assume that there exists a real ν > 1 such that

y^T(t − τ(t)) Q y(t − τ(t)) = V(y(t − τ(t))) < νV(y(t)) = νy^T(t) Q y(t).   (12)

Combining (11) with (12), we obtain V̇(y_t) ≤ −y^T(t) Ξ y(t). So, by conditions (5) and (6) in Theorem 1, inequalities (7), (11) and (12), and the Lyapunov-Razumikhin stability theorem (Theorem 1.2 in [3]), the zero solution of system (2) is globally uniformly asymptotically stable. By Definition 2.1 in [1], the zero solution of system (2) is globally robustly stable, and hence the trivial solution of system (1) is globally robustly stable. The proof is completed.

3.2 Delay-Dependent Global Robust Stability
To reduce conservativeness in the analysis when information on the size of the time delays is available, we now establish delay-dependent robust stability conditions for the general neural networks via the Lyapunov-Krasovskii functional method.

Theorem 2. Consider system (2). If there exist P > 0 and scalars ε_0, ε_1, ε_2 > 0 such that

Γ = CP + PC − H̄H̄ − ḠḠ − ε_0 E_0^T E_0 − ε_0^{-1} P D_0 D_0^T P − ε_1 P D_1 D_1^T P − δ^{-1} ε_2 P D_2 D_2^T P − P A A^T P − δ^{-1} P B B^T P − P A E_1^T (ε_1 I_8 − E_1 E_1^T)^{-1} E_1 A^T P − δ^{-1} P B E_2^T (ε_2 I_9 − E_2 E_2^T)^{-1} E_2 B^T P > 0,
namely,

Γ = [ Λ_1       PAE_1^T   PBE_2^T   PD_0       PD_1           PD_2             PA     PB
      E_1A^TP   Λ_2       0         0          0              0                0      0
      E_2B^TP   0         Λ_3       0          0              0                0      0
      D_0^TP    0         0         ε_0 I_10   0              0                0      0
      D_1^TP    0         0         0          ε_1^{-1} I_11  0                0      0
      D_2^TP    0         0         0          0              δε_2^{-1} I_12   0      0
      A^TP      0         0         0          0              0                I_13   0
      B^TP      0         0         0          0              0                0      δI_14 ] > 0,   (13)
where Λ_1 = CP + PC − ε_0 E_0^T E_0 − H̄H̄ − ḠḠ, Λ_2 = ε_1 I_8 − E_1 E_1^T > 0, Λ_3 = δ(ε_2 I_9 − E_2 E_2^T) > 0, δ = 1 − d > 0, and I_8, I_9, I_10, I_11, I_12, I_13, I_14 are identity matrices, then the trivial solution of system (1) is globally robustly stable for all time delays satisfying τ(t) ≤ τ̄ and τ̇(t) ≤ d < 1; the condition is delay-dependent. The Lyapunov-Krasovskii functional is chosen as
V(y_t) = V_1 + V_2 = y^T(t) P y(t) + ∫_{t−τ(t)}^{t} h^T(y(ξ)) h(y(ξ)) dξ,   (14)
where P > 0 is to be determined.

Proof: Consider the Lyapunov-Krasovskii functional (14). First, we have

π_3 ‖y_t(0)‖² ≤ V(y_t) ≤ π_4 ‖y_t‖_c²,   (15)

where π_3 = λ_min(P) and, noting (4), π_4 = λ_max(P) + τ̄‖H̄‖². Taking the derivative of V_2 leads to V̇_2 = h^T(y(t)) h(y(t)) − (1 − τ̇(t)) h^T(y(t − τ(t))) h(y(t − τ(t))). By (4), we have

V̇_2 ≤ y^T(t) H̄ H̄ y(t) − δ h^T(y(t − τ(t))) h(y(t − τ(t))).   (16)
As in the proof of Theorem 1, the derivative of V_1 along the solution of (2) can be computed as

V̇_1 = −y^T(t)(CP + PC)y(t) − 2y^T(t) E_0^T F_0^T(x, t) D_0^T P y(t) + 2y^T(t) P Ã(t) g(y(t)) + 2y^T(t) P B̃(t) h(y(t − τ(t))).   (17)

Similar to the proof of Theorem 1, we then have V̇(y_t) ≤ −y^T(t) Γ y(t). By the Lyapunov-Krasovskii stability theorem (Theorem 1.1 in [3]), the trivial solution of system (1) is globally robustly stable. The proof is completed.
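In practice, matrix conditions such as (5) and (13) are verified numerically: one fixes candidate scalars, assembles the block matrix, and checks that the symmetric matrix has only positive eigenvalues. The sketch below uses a made-up scalar toy block (not an instance from the paper) to illustrate the mechanics, together with the equivalent Schur-complement test:

```python
import numpy as np

def is_positive_definite(M, tol=1e-10):
    """Check symmetric positive definiteness via eigenvalues."""
    M = np.asarray(M, dtype=float)
    if not np.allclose(M, M.T):
        return False
    return np.min(np.linalg.eigvalsh(M)) > tol

# Toy scalar block (hypothetical values): [[pi1, q*a], [q*a, 1]] > 0
# is the Schur-complement form of pi1 - (q*a)^2 > 0, mirroring how the
# off-diagonal blocks of (5) encode the quadratic terms of Xi.
q, a, pi1 = 0.5, 0.8, 1.0
block = np.array([[pi1, q * a],
                  [q * a, 1.0]])
assert is_positive_definite(block)
assert pi1 - (q * a) ** 2 > 0        # Schur complement agrees
print("toy block LMI is feasible")
```

For full-size problems, one would hand the same matrix to a semidefinite-programming solver instead of fixing the scalars by hand.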
4 Conclusion
In this paper, both delay-independent and delay-dependent global robust stability are discussed, via the Lyapunov-Razumikhin function approach and the Lyapunov-Krasovskii functional method, for general recurrent neural networks with time-varying parametric uncertainties and time-varying delays. By integrating the Lyapunov-Razumikhin function approach with the LMI approach, delay-independent stability conditions which do not involve the time derivative of the time delays are obtained. Delay-dependent stability criteria which do involve the time derivative of the time delays are then obtained by the Lyapunov-Krasovskii functional method and the LMI approach.
Acknowledgments. This work is supported by the 973 Program of China under Grant No. 2002CB312200 and the National Science Foundation of China under Grants No. 60574019 and No. 60474045.
References

1. Cao, Y. Y., Sun, Y. X. and Cheng, C.: Delay-Dependent Robust Stabilization of Uncertain Systems with Multiple State Delays. IEEE Trans. Automatic Control 43 (1998) 1608-1612
2. Hale, J. K. and Lunel, S. M. V.: Introduction to Functional Differential Equations. Applied Math. Sciences 99, New York: Springer-Verlag (1993)
3. Niculescu, S. I.: Delay Effects on Stability: A Robust Control Approach. Lecture Notes in Control and Information Sciences, Springer-Verlag London Limited (2001)
4. Liao, X. F., Chen, G. and Sanchez, E. N.: Delay-dependent Exponential Stability Analysis of Delayed Neural Networks: An LMI Approach. Neural Networks 15 (2002) 855-866
5. Cao, J. and Wang, J.: Global Asymptotic and Robust Stability of Recurrent Neural Networks with Time Delays. IEEE Trans. on Circuits and Systems-I 52 (2005) 417-425
6. Cao, J. and Ho, Daniel W. C.: A General Framework for Global Asymptotic Stability Analysis of Delayed Neural Networks Based on LMI Approach. Chaos, Solitons and Fractals 24 (2005) 1317-1329
7. Huang, H., Ho, D. W. C. and Cao, J.: Analysis of Global Exponential Stability and Periodic Solutions of Neural Networks with Time-varying Delays. Neural Networks 18 (2005) 161-170
8. Singh, V.: Robust Stability of Cellular Neural Networks with Delay: Linear Matrix Inequality Approach. IEE Proc.-Control Theory Appl. 151 (2004) 125-129
9. Zhang, H., Li, C. and Liao, X. F.: A Note on the Robust Stability of Neural Networks with Time Delay. Chaos, Solitons and Fractals 25 (2005) 357-360
10. Li, C., Liao, X. F., Zhang, R. and Prasad, A.: Global Robust Exponential Stability Analysis for Interval Neural Networks with Time-varying Delays. Chaos, Solitons and Fractals 25 (2005) 751-757
11. Chen, A., Cao, J. and Huang, L.: Global Robust Stability of Interval Cellular Neural Networks with Time-varying Delays. Chaos, Solitons and Fractals 23 (2005) 787-799
Robust Periodicity in Recurrent Neural Network with Time Delays and Impulses

Yongqing Yang^{1,2}
^1 School of Science, Southern Yangtze University, Wuxi 214122, China
[email protected]
^2 Department of Mathematics, Southeast University, Nanjing 210096, China

Abstract. In this paper, the robust periodicity of recurrent neural networks with time delays and impulses is investigated. Based on the Lyapunov method and a fixed point theorem, a sufficient condition for the global exponential robust stability of the periodic solution is obtained.
1 Introduction
The subject of neural networks has become one of the important technical tools for solving a variety of problems such as associative memories, optimization, pattern recognition, and signal processing. Such applications rely heavily on the stability of the networks. Refs. [1-2,14] studied the existence and stability of the equilibrium point of neural networks. It is well known that studies of neural networks involve not only stability but also other dynamical behaviors such as periodic solutions, bifurcation and chaos. In much theoretical and applied research, the properties of periodic solutions are of great interest [7,8,13,15]. In hardware implementations of neural networks, time delays and parameter fluctuations are unavoidable, and various sufficient conditions have been presented for global robust stability [4-6,16]. On the other hand, many physical systems undergo abrupt changes at certain moments due to instantaneous perturbations, which lead to impulsive effects. The stability of some neural networks with time delays and impulses was studied in [9-12]. However, the robust stability of periodic solutions of neural networks with delays and impulses has seldom been considered. Motivated by the above discussion, the aim of this paper is to discuss the global exponential robust stability of the periodic solution of recurrent neural networks with time delays and impulses. The organization of the paper is as follows: In Section 2, we introduce some definitions needed in later sections. In Section 3, a sufficient condition for the global exponential robust stability of the periodic solution is derived by using the Lyapunov method and a fixed point theorem. Conclusions are given in Section 4.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 185-191, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Preliminaries
In this paper, we consider the following neural networks with time delays and impulses:

ẋ_i(t) = −c_i x_i(t) + Σ_{j=1}^{n} a_ij f_j(x_j(t)) + Σ_{j=1}^{n} b_ij f_j(x_j(t − τ_ij)) + I_i(t), t ≥ 0, t ≠ t_k,
Δx_i(t_k) = x_i(t_k^+) − x_i(t_k^−) = I_ik(x_i(t_k)), i = 1, 2, ..., n, k ∈ N = {1, 2, ...},   (1)

with the interval parameter classes

C_I := {C = diag(c_i) : 0 < C̲ ≤ C ≤ C̄, i.e., 0 < c̲_i ≤ c_i ≤ c̄_i, i = 1, 2, ..., n},
A_I := {A = (a_ij) : A̲ ≤ A ≤ Ā, i.e., a̲_ij ≤ a_ij ≤ ā_ij, i, j = 1, 2, ..., n},
B_I := {B = (b_ij) : B̲ ≤ B ≤ B̄, i.e., b̲_ij ≤ b_ij ≤ b̄_ij, i, j = 1, 2, ..., n},
where n corresponds to the number of units in the network; x_i(t) is the state of the ith unit at time t; f_j(x_j(t)) is the output of the jth unit at time t; a_ij, b_ij, c_i (c_i > 0) are constants; I_i(t) is a periodic external input with period ω; the transmission delays τ_ij are nonnegative constants with 0 ≤ τ_ij ≤ τ (i, j = 1, 2, ..., n). The impulsive moments {t_k, k ∈ N} satisfy 0 = t_0 < t_1 < t_2 < ..., lim_{k→∞} t_k = ∞; Δx_i(t_k) corresponds to the abrupt change of the state at the fixed impulsive moment t_k; x_i(t_k) = x_i(t_k^−), and x_i(t_k^+) exists. Accompanying system (1) are initial conditions of the form x_i(t) = φ_i(t), t ∈ [−τ, 0], i = 1, 2, ..., n, where φ_i(t) is continuous on [−τ, 0]. Throughout this paper, we assume that the activation functions f_i, i = 1, 2, ..., n, and the impulsive operators satisfy the following conditions:

(H1) f_i(x) (i = 1, 2, ..., n) are bounded on R.
(H2) |f_i(u) − f_i(v)| ≤ u_i|u − v|, ∀u, v ∈ R, u ≠ v.
(H3) I_ik(x_i(t_k)) = −γ_ik x_i(t_k), 0 < γ_ik < 2, i = 1, 2, ..., n, k ∈ N.

For the sake of convenience, we introduce the following notations:

a_ij^+ = max(|a̲_ij|, |ā_ij|), b_ij^+ = max(|b̲_ij|, |b̄_ij|), i, j = 1, 2, ..., n,
‖φ‖ = sup_{−τ≤t≤0} Σ_{i=1}^{n} |φ_i(t)|².

For any φ ∈ C([−τ, 0] → R^n), (H1) and (H2) guarantee the existence and uniqueness of the solution of Eq.(1) [3].
Definition 1. A piecewise continuous function x(t): [0, +∞) → R^n is called an ω-periodic solution of Eq.(1) if
(I) x(t) is continuous at t ≠ t_k, x(t_k) = x(t_k^−), and x(t_k^+) exists for each t_k;
(II) x(t) satisfies Eq.(1) for t ∈ [0, +∞);
(III) x(t) = x(t + ω), ∀t ≥ 0.

Definition 2. The periodic solution x(t, ϕ) of Eq.(1) is said to be globally exponentially stable if there exist constants ε > 0 and M ≥ 1 such
that for every solution x(t, φ) of Eq.(1) with any initial value φ ∈ C([−τ, 0]; R^n), ‖x(t, φ) − x(t, ϕ)‖ ≤ M‖φ − ϕ‖e^{−εt}, ∀t ≥ 0.

Definition 3. The periodic solution of Eq.(1) is said to be globally exponentially robustly stable if Eq.(1) has exactly one globally exponentially stable periodic solution for all C ∈ C_I, A ∈ A_I, B ∈ B_I.
3 Global Robust Periodicity
In this section, we derive a sufficient condition for the global exponential robust stability of the periodic solution of Eq.(1).

Theorem 1. Under assumptions (H1), (H2) and (H3), if there exist positive constants w_i > 0 and constants α_ij, β_ij ∈ R, i, j = 1, 2, ..., n, such that

Σ_{j=1}^{n} u_j^{2(1−α_ij)} [ (a_ij^+)^{2(1−α_ij)} + (b_ij^+)^{2(1−β_ij)} ] + Σ_{j=1}^{n} (w_j/w_i) u_i^{2α_ji} [ (a_ji^+)^{2α_ji} + (b_ji^+)^{2β_ji} ] < 2c̲_i,   (2)

then for every periodic input I(t) of Eq.(1) there exists a globally exponentially robustly stable ω-periodic solution.

Proof. By inequality (2), choose a small ε > 0 such that

2(ε − c̲_i) + Σ_{j=1}^{n} u_j^{2(1−α_ij)} [ (a_ij^+)^{2(1−α_ij)} + (b_ij^+)^{2(1−β_ij)} ] + Σ_{j=1}^{n} (w_j/w_i) u_i^{2α_ji} [ (a_ji^+)^{2α_ji} + (b_ji^+)^{2β_ji} e^{2ετ_ji} ] < 0.   (3)
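Condition (2), and the existence of a suitable ε for (3), are easy to check numerically for a concrete network. The sketch below uses the simplest parameter choice α_ij = β_ij = 1/2, w_i = 1 and entirely hypothetical network data; the placement of the factor e^{2ετ_ji} on the delayed terms follows the reconstruction of (3) above:

```python
import numpy as np

# Hypothetical 2-neuron data
c_low = np.array([5.0, 6.0])                 # lower bounds of c_i
a_plus = np.array([[0.5, 0.3], [0.2, 0.4]])  # a_ij^+
b_plus = np.array([[0.4, 0.1], [0.3, 0.2]])  # b_ij^+
u = np.array([1.0, 1.0])                     # Lipschitz constants u_j
tau = np.full((2, 2), 1.0)                   # delays tau_ij

def lhs2(i):
    # left side of (2) with alpha = beta = 1/2 and w_i = 1
    s1 = np.sum(u * (a_plus[i] + b_plus[i]))
    s2 = np.sum(u[i] * (a_plus[:, i] + b_plus[:, i]))
    return s1 + s2

def lhs3(i, eps):
    # left side of (3): the delayed b-terms carry e^{2 eps tau_ji}
    s1 = np.sum(u * (a_plus[i] + b_plus[i]))
    s2 = np.sum(u[i] * (a_plus[:, i] +
                        b_plus[:, i] * np.exp(2 * eps * tau[:, i])))
    return 2 * (eps - c_low[i]) + s1 + s2

assert all(lhs2(i) < 2 * c_low[i] for i in range(2))  # (2) holds
eps = 0.1
assert all(lhs3(i, eps) < 0 for i in range(2))        # (3) holds for this eps
print("conditions (2) and (3) satisfied with eps =", eps)
```

If (2) holds strictly, continuity guarantees that some small ε > 0 makes (3) hold as well, which is exactly how the proof proceeds.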
Let PC([−τ, 0], R^n) = {x | x: [−τ, 0] → R^n, x is piecewise continuous}; then PC([−τ, 0], R^n) is a Banach space. Let PC([−τ, +∞), R^n) = {x | x: [−τ, +∞) → R^n, x is piecewise continuous, with x(t_k) = x(t_k^−) and x(t_k^+) existing at each t_k}. If x ∈ PC([−τ, +∞), R^n), we define x_t ∈ PC([−τ, 0], R^n) by x_t(s) = x(t + s) for −τ ≤ s ≤ 0; moreover, we define x_{t^+} ∈ PC([−τ, 0], R^n) by x_{t^+}(s) = x(t + s) for −τ ≤ s < 0 and x_{t^+}(0) = x(t^+). For any φ, ψ ∈ PC([−τ, 0], R^n), denote the solutions of (1) through (0, φ) and (0, ψ) by x(t, φ) and x(t, ψ), respectively. It follows from Eq.(1) that

d/dt (x_i(t, φ) − x_i(t, ψ)) = −c_i (x_i(t, φ) − x_i(t, ψ)) + Σ_{j=1}^{n} a_ij [f_j(x_j(t, φ)) − f_j(x_j(t, ψ))] + Σ_{j=1}^{n} b_ij [f_j(x_j(t − τ_ij, φ)) − f_j(x_j(t − τ_ij, ψ))], t ≥ 0, t ≠ t_k,

Δ[x_i(t_k, φ) − x_i(t_k, ψ)] = x_i(t_k^+, φ) − x_i(t_k^−, φ) − x_i(t_k^+, ψ) + x_i(t_k^−, ψ) = I_ik(x_i(t_k, φ)) − I_ik(x_i(t_k, ψ)), i = 1, 2, ..., n, k ∈ N.   (4)
Consider the following Lyapunov function:

V(t) = Σ_{i=1}^{n} w_i { |x_i(t, φ) − x_i(t, ψ)|² e^{2εt} + Σ_{j=1}^{n} u_j^{2α_ij} (b_ij^+)^{2β_ij} ∫_{t−τ_ij}^{t} |x_j(s, φ) − x_j(s, ψ)|² e^{2ε(s+τ_ij)} ds }.
Calculating the upper right Dini derivative D^+V of V along the solution of Eq.(1) at the continuity points t ≠ t_k, we have

D^+V|_(1) = Σ_{i=1}^{n} w_i { 2ε e^{2εt} |x_i(t, φ) − x_i(t, ψ)|² + 2e^{2εt} |x_i(t, φ) − x_i(t, ψ)| D^+|x_i(t, φ) − x_i(t, ψ)|
  + Σ_{j=1}^{n} u_j^{2α_ij} (b_ij^+)^{2β_ij} |x_j(t, φ) − x_j(t, ψ)|² e^{2ε(t+τ_ij)}
  − Σ_{j=1}^{n} u_j^{2α_ij} (b_ij^+)^{2β_ij} |x_j(t − τ_ij, φ) − x_j(t − τ_ij, ψ)|² e^{2εt} }

≤ Σ_{i=1}^{n} w_i e^{2εt} { 2(ε − c_i) |x_i(t, φ) − x_i(t, ψ)|²
  + 2 Σ_{j=1}^{n} u_j a_ij^+ |x_i(t, φ) − x_i(t, ψ)| |x_j(t, φ) − x_j(t, ψ)|
  + 2 Σ_{j=1}^{n} u_j b_ij^+ |x_i(t, φ) − x_i(t, ψ)| |x_j(t − τ_ij, φ) − x_j(t − τ_ij, ψ)|
  + Σ_{j=1}^{n} u_j^{2α_ij} (b_ij^+)^{2β_ij} |x_j(t, φ) − x_j(t, ψ)|² e^{2ετ_ij}
  − Σ_{j=1}^{n} u_j^{2α_ij} (b_ij^+)^{2β_ij} |x_j(t − τ_ij, φ) − x_j(t − τ_ij, ψ)|² }

≤ Σ_{i=1}^{n} w_i { 2(ε − c_i) + Σ_{j=1}^{n} u_j^{2(1−α_ij)} [ (a_ij^+)^{2(1−α_ij)} + (b_ij^+)^{2(1−β_ij)} ]
  + Σ_{j=1}^{n} (w_j/w_i) u_i^{2α_ji} [ (a_ji^+)^{2α_ji} + (b_ji^+)^{2β_ji} e^{2ετ_ji} ] } |x_i(t, φ) − x_i(t, ψ)|² e^{2εt} ≤ 0, t ≠ t_k.   (5)
n
+ 2 2εtk wi |xi (t+ k , φ) − xi (tk , ψ)| e
i=1
+
n
j=1
2α uj ij |bij |2βij
tk
tk −τij
|xj (s, φ) − xj (s, ψ)|2 e2ε(s+τij ) ds
Robust Periodicity in Recurrent Neural Network
=
n
wi |1 − γik |2 |xi (tk , φ) − xi (tk , ψ)|2 e2εtk
i=1
+
189
n
2βij 2α uj ij b+ ij
tk
tk −τij
j=1
|xj (s, φ) − xj (s, ψ)|2 e2ε(s+τij ) ds
≤ V (tk ), k ∈ N,
(6)
which implies that V (t) ≤ V (0) for t ≥ 0 where V (0) =
n
n
2βij 2α wi |xi (0, φ) − xi (0, ψ)|2 + uj ij b+ ij
i=1
−τij
j=1
≤ [ max {wi } + τ e2ετ 1≤i≤n
n
0
2αij + 2βij bij }]
wi max {uj 1≤j≤n
i=1
|xj (s, φ) − xj (s, ψ)|2 e2ε(s+τij ) ds φ − ϕ 2
(7)
Furthermore, we have xt (φ) − xt (ϕ) ≤ M e−εt φ − ϕ , t ≥ 0
where M =
n ij + 2βij max {wi }+τ e2ετ wi max {u2α bij } j 1≤j≤n 1≤i≤n i=1 min {wi }
≥ 1 is a constant.
1≤i≤n
Choose a positive integer m such that M m e−mεω ≤
1 4
Define a Poincare mapping T : P C([−τ, 0], Rn ) → P C([−τ, 0], Rn ) by xω (s) = x(ω + s) Then we can obtain T m φ − T m ψ ≤
1 φ−ψ . 4
This means that T m is a contraction mapping, hence there exists a unique fixed point φ∗ ∈ P C([−τ, 0], Rn ) such that T m φ∗ = φ∗ . Note that T m (T φ∗ ) = T (T m φ∗ ) = T φ∗ . By the uniqueness of fixed point of mapping T m , we have T φ∗ = φ∗ , which means xω (φ∗ ) = φ∗ . Let x(t, φ∗ ) be the solution of Eq.(1) through (0, φ∗ ); obviously, x(t + ω, φ∗ ) be the solution of Eq.(1), and xt+ω (φ∗ ) = xt (xω (φ∗ )) = xt (φ∗ ), Therefore
x(t + ω, φ∗ ) = x(t, φ∗ )
f or
f or
t ≥ 0.
t ≥ 0.
190
Y. Yang
This shows that x(t, φ∗ ) is exactly one unique ω−periodic solution of Eq.(1), and it is easy to see that all other solutions of Eq.(1) converge exponentially to it as t → +∞. This completes the proof. Corollary 1. Under the assumption (H1 ), H2 and (H3 ), if the following inequality holds n n
+ + u j a+ + b + u i a+ (8) ij ij ji + bji < 2ci j=1
j=1
then for every periodic input I(t) of Eq.(1), there exists global exponential robust stable ω−periodic solution.
4
Conclusions
In this paper, we discuss the dynamical behaviors of recurrent neural networks with time delays and impulses. By using Lyapunov method and fixed point theorem, we prove the global exponential robust stability and existence of robust periodic solution of Eq.(1). The criteria obtained are easily verifiable and possess many adjustable parameters, which provides flexibility for the design and analysis of recurrent neural networks with time delays and impulses.
Acknowledgement This work was supported by the Natural Science Foundation of Jiangsu Province, China under Grants BK 2004021.
References 1. Roska, T., Wu, C. W., Balsi, M. and Chua, L. O.: Stability and Dynamics of Delaytype General and Cellular Neural Networks. IEEE Trans. Circuits Syst. -I 39 (6) (1992) 487-490 2. Cao, J.: Global Stability Analysis in Delayed Cellular Neural Networks, Phys. Rev. E 59 (5) (1999) 5940-5944 3. Lakshmikanthan, V., Bainov, D. D., Simeonov P. S., Theory of Impulsive Differential Equations. World Scientific, Singapore (1989) 4. Wang, L. and Lin, Y.: Global Robust Stability for Shunting Inhibitory CNNs with Delays. J. Neural Syst. 14 (4) (2004) 229-235 5. Xu, S., Lam, J., Ho, D. W. C. and Zou, Y.: Global Robust Exponential Stability Analysis for Interval Recurrent Neural Networks. Phys. Lett. A 325 (2) (2004) 124-133 6. Cao, J. and Chen, T.: Globally Exponentially Robust Stability and Periodicity of Delayed Neural Networks. Chaos, Solitons and Fractals 22 (4) (2004) 957-963 7. Zhao, H.: Global Exponential Stability and Periodicity of Cellular Neural Networks with Variable Delays. Phys. Let. A 336 (3) (2005) 331-341 8. Guo, S. and Huang, L.: Periodic Oscillation for a Class of Neural Networks with Variable Coefficients. Nonlinear Analy. 6 (6) (2005) 545-561
Robust Periodicity in Recurrent Neural Network
191
9. Xu, D. and Yang, Z.: Impulsive Delay Differential Inequality and Stability of Neural Networks. J. Math. Anal. Appl. 305 (1) (2005) 107-120 10. Li, Y. and Lu, L.: Global Exponential Stability and Existence of Periodic Solution of Hopfield-type Neural Networks with Impulses. Phys. Lett. A 333 (11) (2004) 62-71 11. Akca, H., Alassar, R., Covachev, V., Covachev, Z. and Zahrani, E. A.: Continuoustime Additive Hopfield-type Neural Networks with Impulses. J. of Math. Anal. Appl. 290 (2) (2004) 436-451 12. Gopalsamy, K.: Stability of Artificial Neural Networks with Impulses. Appl. Math. Comput. 154 (3) (2004) 783-813 13. Cao, J. and Wang, J.: Global Exponential Stability and Periodicity of Recurrent Neural Networks with Time Delays. IEEE Trans. Circuits Syst.-I 52 (5) (2005) 920-931 14. Cao, J. and Wang, J.: Global Asymptotic and Robust Stability of Recurrent Neural Networks with Time Delays. IEEE Trans. Circuits Syst.-I 52 (2) (2005)417-426 15. Cao, J., Chen, A. and Huang, X.: Almost Periodic Attractor of Delayed Neural Networks with Variable Coefficients. Phys. Lett. A 340 (6) (2005) 104-120 16. Cao, J., Huang, D.-S. and Qu, Y.: Global Robust Stability of Delayed Recurrent Neural Networks. Chaos, Solitons and Fractals 23 (1) (2005) 221-229
Global Asymptotical Stability of Cohen-Grossberg Neural Networks with Time-Varying and Distributed Delays Tianping Chen1 and Wenlian Lu1,2 1 Key Laboratory of Nonlinear Science of Chinese Ministry of Education, Institute of Mathematics, Fudan University, Shanghai, 200433, P.R. China
[email protected] 2 Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
[email protected], {wenlian, Wenlian.Lu}@mis.mpg.de
Abstract. In this paper, we discuss delayed Cohen-Grossberg neural networks with time-varying and distributed delays and investigate their global asymptotical stability of the equilibrium point. The model proposed in this paper is universal. A set of sufficient conditions ensuring global convergence and globally exponential convergence for the CohenGrossberg neural networks with time-varying and distributed delays are given. Most of the existing models and global stability results for CohenGrossberg neural networks, Hopfield neural networks and cellular neural networks can be obtained from the theorems given in this paper.
1
Introduction
Cohen-Grossberg neural networks, initially proposed and studied in [7], have attracted increasing interest. The Cohen-Grossberg neural networks can be described by the following differential equations n dui (t) = −ai (ui )[bi (ui ) − aij gj (uj (t)) + Ii ] i = 1, · · · , n dt j=1
(1)
In this paper, we investigate the following generalized Cohen-Grossberg neural networks with time-varying delays and distributed delays n dui (t) = −ai (ui ) bi (ui ) − aij gj (uj (t)) dt j=1 n ∞ − fj (uj (t − τij (t) − s))dKij (s) + Ii j=1
(2)
0
where ∞ dKij (s) are Lebesgue-Stieljies measures for each i, j = 1, · · · , n, and satisfy 0 |dKij (s)| < ∞. ai (x) > 0 are continuous, bi (x) are Lipshitz continuous with bi (x) ≥ γi > 0, i = 1, · · · , n, J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 192–197, 2006. c Springer-Verlag Berlin Heidelberg 2006
Global Asymptotical Stability of Cohen-Grossberg Neural Networks
193
The initial conditions for system (2) are ui (s) = φi (s)
f or
s ∈ [−∞, 0]
(3)
where φi ∈ C([−τ, 0]), i = 1, · · · , n, are bounded continuous functions. τij = supt>0 τij (t) < ∞. This model was first proposed in [1] for delayed neural networks and investigated in [2, 3, 4]. For convenience of description, for any vector v = [v1 , · · · , vn ]T , we denote ||v|| = max |vi |
(4)
i=1,···,n
and for any continuous function f (t) D+ f (t) =
lim
h>0,h→0
sup
|f (t + h)| − |f (t)| h
(5)
denotes its right upper-derivative.
2
Main Results
In this section, we investigate the globally asymptotical stability and globally exponential stability of the generalized Cohen-Grossberg neural networks with time-varying delays and distributed delays (2). We will prove two theorems. Theorem 1 addresses the globally asymptotical stability. Theorem 2 is about the globally exponential stability. Theorem 1. Suppose that gi ∈ Lip(Gi ), fi ∈ Lip(Fi ). If there are positive constants ξ1 , · · · , ξn such that n n min ξi γj − ξj Gj |aij | − ξj Fj i
j=1
j=1
∞
|dKij (s)| = η > 0
(6)
0
Then the dynamical system (2) has a unique equilibrium v ∗ , which is globally asymptotically stable. Remark 1. By a simple linear transform ui (t) = ξi vi (t), without loss of generality, in the proof, we assume that ξi = 1. We first give two lemmas. Lemma 1. (see [2]) Under the assumptions made in Theorem 1, the dynamical system (2) has at least an equilibrium v ∗ = [v1∗ , · · · , v1∗ ]T . Lemma 2. Under the assumptions made in Theorem 1, every solution u(t) of the system (2) is bounded. Proof. Suppose u(t) is a solution of (2), v ∗ is the equilibrium given in Lemma 1. Let w(t) = u(t) − v ∗ , M (t) = sup−∞<s≤t ||w(s)||. It is clear that ||w(t)|| ≤ M (t).
194
T. Chen and W. Lu
If for some t0 , ||w(t0 )|| = M (t0 ), and it0 = it0 (t0 ), which depends on t0 , is an index such that |wit0 (t0 ))| = ||w(t0 )|| Then, we have D+ |wi0 (t)| ≤ (−γi + η)|wi0 (t)| +
ai0 j Gj +
dKi0 j (s)Fj |wj (t)|
0
j
n ≤ (−γi + η) + |ai0 j |Gj +
∞
∞
|dKi0 j (s)|Fj ||w(t0 )|| < 0
(7)
0
j=1
which means that M (t) is non-increasing at all t > 0 and thus M (t) = M (0). ¯ = max{M (0) + |v ∗ |}. Lemma 2 is proved. Thus |ui (t)| ≤ M i ¯ , we can Because ai (ui ) > 0. Therefore, in a bounded closed region |ui | ≤ M find a constant αi > 0 such that 0 < αi ≤ ai (x). Proof of Theorem 1. Suppose u(t) is a solution of (2), v ∗ is the equilibrium given in Lemma 1, w(t) = u(t) − v ∗ . We will prove that for arbitrarily small > 0, there exists a sufficient large T¯ such that ||w(t)|| < for all t > T¯. First, for any > 0, pick a sufficiently large T , such that ∞ n η M ξj Fj |dKij (s)| < (8) 2 T j=1 where M is the constant given in Lemma 2. Then, due to (6), we can pick a small positive α = α(T ), which is dependent of T , such that for all i = 1, · · · , n, (−γi + αα−1 i ) +
n
Gj |aij | +
j=1
+
n
j=1 ∞
ξj Fj
j=1
n
T
T
eα(τij +s) |dKij (s)|
Fj 0
|dKij (s)| > −
η 2
(9)
Let y(t) = eαt w(t) and M1 (t) = supt−T <s≤t ||y(s)||. We claim that if ||w(t0 )|| > , then M1 (t) is non-increasing at t0 . In fact, at any t0 , there are two possible cases: Case 1. ||y(t0 )|| < M1 (t0 ). In this case, by the continuity of w(t), M1 (t) is nonincreasing at t0 . Case 2. ||y(t0 )|| = M1 (t0 ), and it0 = it0 (t0 ), which depends on t0 , is an index such that |yit0 (t)| = ||y(t0 )|| Then, by some algebra, we have n D |yit0 (t0 )| ≤ αit0 (uit0 (t0 )) (−γit0 + αα−1 Gj |ait0 j | it ) + +
0
j=1
Global Asymptotical Stability of Cohen-Grossberg Neural Networks
+
+
n j=1 n
T
Fj 0
eα(τit0 j +s) |dKit0 j (s)| M1 (t0 )
Fj e
αt0
∞ T
j=1
≤ αit0 (uit0 (t0 ))e +
n j=1
Fj 0
T
195
|wj (t0 − τit0 j (t0 ) − s)||dKit0 j (s)|
αt0
n −1 |wit0 (t0 )| (−γit0 + ααit ) + Gj |ait0 j | 0
n eα(τit0 j +s) |dKit0 j (s)| + M Fj j=1
j=1 ∞ T
|dKit0 j (s)|
(10)
By the choice of T and α, we have ∞ n M j=1 Fj T |dKij (s)| max < i (−γi + αα−1 ) + n Gj |aij | + n Fj T eα(τij +s) |dKij (s)| i j=1 j=1 0 Therefore, if |wit0 (t0 )| > , we have D+ yit0 (t0 ) < 0 Thus, M1 (t) is nonincreasing at t0 . In summary, we conclude that if ||w(t)|| > for all t > 0, then M1 (t) is non-increasing for all t > 0, i.e. M1 (t) ≤ M1 (0) for all t > 0, which implies ||y(t)|| ≤ M1 (0) and ||w(t)|| ≤ M1 (0)e−αt (11) It contradicts ||w(t)|| > . Therefore, we can find a T¯, such that ||w(T¯)|| < . Therefore, for all t > T¯, we have ||w(t)|| ≤ . Because is arbitrary, we conclude that lim u(t) = v ∗
(12)
t→∞
Theorem 1 is proved completely. Following theorem gives the result of globally exponential stability. Theorem 2. Suppose that gi ∈ Lip(Gi ), fi ∈ Lip(Fi ). If there are positive constants α, ξ1 , · · · , ξn such that ∞ n n α(τij +s) min ξi (−γi +αα−1 )− ξ G |a |− ξ F e |dK (s)| ≥ 0 (13) j j ij j j ij i i
j=1
j=1
0
Then the dynamical system (2) has a unique equilibrium v ∗ such that ||u(t) − v ∗ || = O(e−αt )
(14)
Proof of Theorem 2. With the same notations and arguments used in the proof of Theorem 1, we can prove that under condition (13), M1 (t) is bounded. Therefore, ||w(t)|| ≤ M1 (0)e−αt (15)
196
T. Chen and W. Lu
Remark 2. It should be mentioned that the existence of the constant αi > 0 such that 0 < αi ≤ ai (x) is a direct consequence of the assumptions made in Theorem 1. It is not a prerequisite requirement. All the assumptions 0 < αi ≤ ai (x) ≤ α ¯ i and τ˙ij (t) ≤ 1 needed in [5] are removed.
3
Comparisons
It is easy to see that if dKij (0) = bij δ(0), where δ(0) = 1 and δ(s) = 0, for s = 0, then (2) reduces to the Cohen-Grossberg neural networks with timevarying delays n n dui (t) = −ai (ui )[bi (ui ) − aij gj (uj (t)) − bij fj (uj (t − τij (t))) + Ii ] (16) dt j=1 j=1
Instead, if dKij (s) = bij kij (s)ds, τij (t) = 0, then (2) reduces to the CohenGrossberg neural networks with distributed delays n n ∞ dui (t) = −ai (ui )[bi (ui ) − aij gj (uj (t)) − fj (uj (t − s))kij (s)ds + Ii ] dt j=1 j=1 0
(17) Then, the stability analysis for the Cohen-Grossberg neural networks with timevarying delays (see [5, 6, 8, 9, 10, 12, 15, 16] and others) or distributed delays (see [13] and others) given in literature can be derived from the Theorems in this paper. Recently, several researchers also investigated stability criteria with Lp (1 ≤ p ≤ ∞) norm. However, in [17], it has been proved that the stability criteria with L1 norm is the best. More precisely, most of the stability criteria with Lp (1 ≤ p ≤ ∞) norm implies (6). It means that the stability analysis given there is a direct consequence of Theorem 1 and Theorem 2. In particular, the results given in [14] can be derived by that in [12] and all the results can be obtained by Theorem 1 and Theorem 2 given in this paper.
4
Conclusions
In this paper, we investigate the stability of a universal model of generalized Cohen-Grossberg neural networks. Asymptotical stability and exponential stability are given. Most of the models can be dealt with by this universal methods easily.
Acknowledgements This work is supported by National Science Foundation of China under Grant 60374018, 60574044.
Global Asymptotical Stability of Cohen-Grossberg Neural Networks
197
References 1. Chen, T., Lu, W.: Stability Analysis of Dynamical Neural Networks. IEEE Conf. Neural Networks and Signal Processing, Nanjing, China (2003) 14-17 2. Chen, T.: Universal Approach to Study Delayed Dynamical Syatems, ICNC2005, Lecture Notes on Computer Sciences, Vol. 3610 (2005) 245-253 3. Lu, W., Chen, T.: On Periodic Dynamical Systems. Chinese Annals of Mathematics 25B(4) (2004) 455-462 4. Chen, T., Lu, W., and Chen, G.: Dynamical Behaviors of a Large Class of General Delayed Neural Networks. Neural Computation 17 (2005) 949-968 5. Chen, T., Rong, L.: Robust Global Exponential Stability of Cohen-Grossberg Neural Networks With Time Delays. IEEE Trans. Neural Networks 15(1) (2004) 203206 6. Chen, Y.: Global Stability of Neural Networks with Distributed Delays. Neural Networks 15 (2002) 867-871 7. Cohen, M. A., Grossberg, S.: Absolute Stability and Global Pattern Formation and Parallel Memory Storage by Competitive Neural Networks. IEEE Trans. Syst. Man Cybern. B. (13) (1983) 815-821 8. Deng, J., Xu, D., and Yang, Z.: Global Exponental Stability Analysis of CohenGrossberg Neural Networks with Multiple Time-Varying Delays. Proceedings of 11th IEEE International Conference on Electronics, Circuits and Systems (2004) 230-233 9. Hwang, C., Cheng, C., Liao, T.: Globally Exponential Stability of Generalized Cohen-Grossberg Neural Netqorks with Delays. Physics Letters A 319 (2003) 157166 10. Liu, J.: Global Exponential Stability of Cohen-Grossberg Neural Netqorks with Time-Varying Delays. Chaos, Solitons and Fractals 26 (2005) 935-945 11. Lu, H.: Global Exponental Stability Analysis of Cohen-Grossberg Neural Networks. IEEE Trans. on Circuits and Systems II 52(9) (2005) 476-479 12. Liao, X., Li, C., Wong, K.: Criteria for Exponential Stability of Cohen-Grossberg Neural Networks. Neural Networks 17 (2004) 1401-1414 13. Sun, J., Wan, L.: Global Exponential Stability of Cohen-Grossberg Neural Networks with Continuous Distributed Delays. 
Physics Letters A 342 (2005) 331-340 14. Tu, F., Liao, X.: Harmless Delays for Global Asymptotic Stability of CohenGrossberg Neural Networks. Chaos, Solitons and Fractals 26 (2005) 927-933 15. Wang, L., Zou, X.: Exponential Stability of Cohen-Grossberg Neural Networks. Neural Networks 15 (2002) 415-422 16. Ye, H., Michel, A. and Wang, K.: Qualitative Analysis of Cohen-Grossberg Neural Networks with Multiple Delays. Physical Review E 51 (1995) 2611-2618 17. Zheng, Y., Chen, T.: Global Exponential Stability of Delayed Periodic Dynamical Systems. Physics Letters A 322(5-6) (2004) 344-355
LMI Approach to Robust Stability Analysis of Cohen-Grossberg Neural Networks with Multiple Delays

Ce Ji, Hua-Guang Zhang, and Chong-Hui Song

Institute of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, 110004, P.R. China
[email protected], [email protected]
Abstract. Robust stability analysis of a class of Cohen-Grossberg neural networks with multiple delays is presented. An approach combining a Lyapunov functional with linear matrix inequalities (LMIs) is used to obtain a sufficient condition for the global robust stability of the equilibrium point. The result is established without assuming differentiability or monotonicity of the activation functions.
1 Introduction

In recent years, considerable effort has been devoted to the stability analysis of Cohen-Grossberg neural networks. Using a Lyapunov functional, [1] established some criteria for the global asymptotic stability of this model. [2] investigated Cohen-Grossberg neural networks with multiple delays and obtained several exponential stability criteria. However, there are few existing results on the robust stability of Cohen-Grossberg neural networks. In [3], we analyzed the global robust stability of Cohen-Grossberg neural networks with multiple delays and perturbations of the interconnection weights, and obtained sufficient conditions for the global robust stability of this model. However, the results in [3] involve the uncertainties $\Delta T_k \in \Re^{n\times n}$, $k = 0,\ldots,K$, directly, which limits the practical applicability of the theorem and its corollary. To address this, based on the assumption that the uncertainties are norm-bounded and satisfy a certain structural condition, in this paper we give a robust stability criterion expressed as a linear matrix inequality (LMI). In contrast to the results in [3], the LMI approach has the advantage that the condition can be solved numerically and very efficiently by interior-point methods. The result is very practical for the design and implementation of Cohen-Grossberg neural networks.

The paper is organized as follows. In Section 2, the network model and preliminaries are described as the basis of later sections. In Section 3, we establish the sufficient condition for the global robust stability of this model.
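The claim that LMI conditions are numerically checkable can be illustrated in the simplest special case. The sketch below assumes a made-up 2x2 Hurwitz matrix A and verifies feasibility of the delay-free Lyapunov inequality A^T P + P A < 0, P > 0 with SciPy; it is not the delayed Cohen-Grossberg LMI of this paper, only an illustration of the numerical workflow:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Hypothetical 2x2 Hurwitz system matrix (illustration only).
A = np.array([[-3.0, 1.0],
              [0.5, -2.0]])

# Solve the Lyapunov equation A^T P + P A = -Q with Q = I.
Q = np.eye(2)
P = solve_continuous_lyapunov(A.T, -Q)

# Feasibility of the LMI pair (P > 0, A^T P + P A < 0) certifies stability.
assert np.all(np.linalg.eigvalsh(P) > 0)      # P positive definite
M = A.T @ P + P @ A
assert np.all(np.linalg.eigvalsh(M) < 0)      # LMI satisfied
```

For the full delayed LMI (9) below, one would instead hand the block matrix to a semidefinite-programming solver.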
2 Model Description and Preliminaries

The Cohen-Grossberg neural network model with multiple delays can be described by the equation

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 198-203, 2006. © Springer-Verlag Berlin Heidelberg 2006
$$\dot{x}(t) = -A(x(t))\Big[B(x(t)) - T_0 S(x(t)) - \sum_{k=1}^{K} T_k S(x(t-\tau_k))\Big], \qquad (1)$$
where $x = (x_1,\ldots,x_n)^T \in \Re^n$, $A(x) = \mathrm{diag}[a_1(x_1),\ldots,a_n(x_n)]$, $B(x) = [b_1(x_1),\ldots,b_n(x_n)]^T$, and $S(x) = [s_1(x_1),\ldots,s_n(x_n)]^T$. The amplification function $a_i(\cdot)$ is continuous. The behaved function $b_i(\cdot)$ is continuous and satisfies $b_i(0) = 0$. $s_i(\cdot)$ is the activation function of the $i$th neuron and satisfies $s_i(0) = 0$. $T_0 = [t_{0ij}] \in \Re^{n\times n}$ denotes the part of the interconnection structure not associated with delay, and $T_k = [t_{kij}] \in \Re^{n\times n}$ denotes the part associated with the delay $\tau_k$, where $\tau_k$ is the $k$th delay, $k = 1,\ldots,K$, and $0 < \tau_1 < \cdots < \tau_K < +\infty$.

Taking parameter perturbations of model (1) into account, we can rewrite (1) as

$$\dot{x}(t) = -A(x(t))\Big[B(x(t)) - (T_0+\Delta T_0) S(x(t)) - \sum_{k=1}^{K} (T_k+\Delta T_k) S(x(t-\tau_k))\Big], \qquad (2)$$
where $\Delta T_k = [\Delta t_{kij}] \in \Re^{n\times n}$, $k = 0,\ldots,K$, are time-invariant matrices representing the norm-bounded uncertainties. Throughout this paper, we make the following assumptions.

Assumption 1. For the uncertainties $\Delta T_k$, $k = 0,\ldots,K$, we assume

$$[\Delta T_0 \ \cdots \ \Delta T_K] = HF[U_0 \ \cdots \ U_K], \qquad (3)$$

where $F$ is an unknown matrix representing parametric uncertainty which satisfies

$$F^T F \le E, \qquad (4)$$

where $E$ is the identity matrix, and $H, U_0,\ldots,U_K$ are known structural matrices of the uncertainty with appropriate dimensions.

Assumption 2. For $i = 1,\ldots,n$, we assume
(H1) $a_i(x_i)$ is positive and bounded, i.e., $0 < \alpha_i^m \le a_i(x_i) \le \alpha_i^M$.

(H2) $b_i(x_i)$ is continuous and differentiable, and satisfies

$$b_i(x_i)/x_i \ge \beta_i^m > 0. \qquad (5)$$

(H3) $s_i(x_i)$ is bounded and satisfies

$$0 \le s_i(x_i)/x_i \le \sigma_i^M. \qquad (6)$$
Definition. The equilibrium point of system (1) is said to be globally robustly stable with respect to the perturbations $\Delta T_0$ and $\Delta T_k$, $k = 1,\ldots,K$, if the equilibrium point of system (2) is globally asymptotically stable.
Obviously, the origin is an equilibrium point of both (1) and (2). In order to study the global robust stability of the equilibrium point $x = 0$ of (1), it suffices to investigate the global asymptotic stability of the equilibrium point $x = 0$ of system (2).

Lemma [4]. If $U$, $V$, and $W$ are real matrices of appropriate dimensions, with $M$ satisfying $M = M^T$, then

$$M + UVW + W^T V^T U^T < 0 \qquad (7)$$

for all $V^T V \le E$, if and only if there exists a positive constant $\varepsilon$ such that

$$M + \varepsilon^{-1} U U^T + \varepsilon W^T W < 0. \qquad (8)$$
In the following section, we will give the sufficient conditions for the globally asymptotic stability of equilibrium point x = 0 of system (2).
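As a quick numerical sanity check of the Lemma, the scalar instance below uses illustrative values M = -3, U = W = 1 (not tied to the networks studied here) and confirms that both sides of the equivalence hold:

```python
import numpy as np

# Scalar instance of the Lemma: M = -3, U = W = 1, so
# M + U V W + W V U = -3 + 2V, which is < 0 for every |V| <= 1.
M, U, W = -3.0, 1.0, 1.0
assert all(M + 2 * U * V * W < 0 for V in np.linspace(-1.0, 1.0, 201))

# The equivalent epsilon-condition M + (1/eps) U^2 + eps W^2 < 0
# holds, e.g., for eps = 1:  -3 + 1 + 1 = -1 < 0.
eps = 1.0
assert M + U**2 / eps + eps * W**2 < 0
```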
3 Robust Stability

Theorem. For any bounded delays $\tau_k$, $k = 1,\ldots,K$, the equilibrium point $x = 0$ of system (2) is globally asymptotically stable if there exist a positive definite matrix $P$, a positive constant $\varepsilon$, and positive diagonal matrices $\Lambda_k = \mathrm{diag}[\lambda_{k1},\ldots,\lambda_{kn}]$ with $\lambda_{ki} > 0$, $i = 1,\ldots,n$, $k = 0,\ldots,K$, such that the following linear matrix inequality (LMI) holds:

$$
\begin{bmatrix}
\gamma_{11} & PA^M T_K E^M & \cdots & PA^M T_0 E^M & PA^M H \\
E^M T_K^T A^M P & -\Lambda_K + \varepsilon E^M U_K^T U_K E^M & \cdots & \varepsilon E^M U_K^T U_0 E^M & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
E^M T_0^T A^M P & \varepsilon E^M U_0^T U_K E^M & \cdots & -\Lambda_0 + \varepsilon E^M U_0^T U_0 E^M & 0 \\
H^T A^M P & 0 & \cdots & 0 & -\varepsilon I
\end{bmatrix} < 0, \qquad (9)
$$

where $\gamma_{11} = -B^m A^m P - P A^m B^m + \sum_{k=0}^{K}\Lambda_k$, $A^M = \mathrm{diag}[\alpha_1^M,\ldots,\alpha_n^M]$, $A^m = \mathrm{diag}[\alpha_1^m,\ldots,\alpha_n^m]$, $B^m = \mathrm{diag}[\beta_1^m,\ldots,\beta_n^m]$, and $E^M = \mathrm{diag}[\sigma_1^M,\ldots,\sigma_n^M]$.
Proof. We can rewrite (2) as

$$\dot{x}(t) = -A(x(t))\Big[B(x(t)) - (T_0+\Delta T_0)E(x(t))x(t) - \sum_{k=1}^{K}(T_k+\Delta T_k)E(x(t-\tau_k))x(t-\tau_k)\Big], \qquad (10)$$

where $E(x) = \mathrm{diag}[\varepsilon_1(x_1),\ldots,\varepsilon_n(x_n)]$ with $\varepsilon_i(x_i) = s_i(x_i)/x_i$, $i = 1,\ldots,n$. Then from (6) we have

$$\varepsilon_i(x_i) \in [0, \sigma_i^M]. \qquad (11)$$
Here, we introduce the following Lyapunov functional:

$$V(x_t) = x^T(t)Px(t) + \sum_{k=1}^{K}\int_{-\tau_k}^{0} x_t^T(\theta)\Lambda_k x_t(\theta)\,d\theta, \qquad (12)$$

where $x_t(\cdot)$ denotes the restriction of $x(\cdot)$ to the interval $[t-\tau_K, t]$ translated to $[-\tau_K, 0]$; that is, for $s \in [-\tau_K, 0]$ we define $x_t(s) = x(t+s)$, $t > 0$.
The derivative of $V(x_t)$ with respect to $t$ along any trajectory of system (10) is given by

$$
\begin{aligned}
\dot{V}(x_t) ={}& \dot{x}^T(t)Px(t) + x^T(t)P\dot{x}(t) + \sum_{k=1}^{K} x_t^T(0)\Lambda_k x_t(0) - \sum_{k=1}^{K} x_t^T(-\tau_k)\Lambda_k x_t(-\tau_k) \\
={}& -x^T(t)B^T A^T P x(t) - x^T(t)PAB x(t) + \sum_{k=1}^{K} x^T(t)\Lambda_k x(t) + x^T(t)\Lambda_0 x(t) \\
&+ x^T(t)PA(T_0+\Delta T_0)E(x(t))\Lambda_0^{-1}E^T(x(t))(T_0+\Delta T_0)^T A^T P x(t) \\
&- \Big[\Lambda_0^{\frac12}x(t) - \Lambda_0^{-\frac12}E^T(x(t))(T_0+\Delta T_0)^T A^T P x(t)\Big]^T \Big[\Lambda_0^{\frac12}x(t) - \Lambda_0^{-\frac12}E^T(x(t))(T_0+\Delta T_0)^T A^T P x(t)\Big] \\
&+ \sum_{k=1}^{K} x^T(t)PA(T_k+\Delta T_k)E(x(t-\tau_k))\Lambda_k^{-1}E^T(x(t-\tau_k))(T_k+\Delta T_k)^T A^T P x(t) \\
&- \sum_{k=1}^{K}\Big[\Lambda_k^{\frac12}x(t-\tau_k) - \Lambda_k^{-\frac12}E^T(x(t-\tau_k))(T_k+\Delta T_k)^T A^T P x(t)\Big]^T \Big[\Lambda_k^{\frac12}x(t-\tau_k) - \Lambda_k^{-\frac12}E^T(x(t-\tau_k))(T_k+\Delta T_k)^T A^T P x(t)\Big] \\
\le{}& -x^T(t)(B^T A^T P + PAB)x(t) + \sum_{k=0}^{K} x^T(t)\Lambda_k x(t) \\
&+ x^T(t)PA(T_0+\Delta T_0)E(x(t))\Lambda_0^{-1}E^T(x(t))(T_0+\Delta T_0)^T A^T P x(t) \\
&+ \sum_{k=1}^{K} x^T(t)PA(T_k+\Delta T_k)E(x(t-\tau_k))\Lambda_k^{-1}E^T(x(t-\tau_k))(T_k+\Delta T_k)^T A^T P x(t), \qquad (13)
\end{aligned}
$$
where $B = \mathrm{diag}[b_1(x_1)/x_1,\ldots,b_n(x_n)/x_n]$ and, for convenience, $A(x(t))$ and $B(x(t))$ are denoted by $A$ and $B$, respectively. Since $A^M$, $B$, $A^m$, $B^m$ are all positive diagonal matrices and $P$ is a positive definite matrix, by Assumption 2 and standard matrix theory we can prove

$$
\dot{V}(x_t) \le x^T(t)\Big\{-(B^m A^m P + P A^m B^m) + \sum_{k=0}^{K}\Lambda_k + \sum_{k=0}^{K} P A^M (T_k+\Delta T_k) E^M \Lambda_k^{-1} E^M (T_k+\Delta T_k)^T A^M P\Big\} x(t) = x^T(t)\, S^M x(t). \qquad (14)
$$
The detailed proof is given in [3]. From (14), we know $\dot{V}(x_t) < 0$ if $S^M < 0$. By the theory of functional differential equations [5], for any bounded delays $\tau_k > 0$ the equilibrium point $x = 0$ of system (2) is then globally asymptotically stable. According to the Schur complement [6], $S^M < 0$ can be expressed as the following LMI:

$$
\begin{bmatrix}
-(B^m A^m P + P A^m B^m) + \sum_{k=0}^{K}\Lambda_k & PA^M(T_K+\Delta T_K)E^M & \cdots & PA^M(T_0+\Delta T_0)E^M \\
E^M(T_K+\Delta T_K)^T A^M P & -\Lambda_K & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
E^M(T_0+\Delta T_0)^T A^M P & 0 & \cdots & -\Lambda_0
\end{bmatrix} < 0. \qquad (15)
$$
Because $[\Delta T_0 \ \cdots \ \Delta T_K] = HF[U_0 \ \cdots \ U_K]$, (15) can be expressed as

$$
\begin{bmatrix}
-B^m A^m P - P A^m B^m + \sum_{k=0}^{K}\Lambda_k & PA^M T_K E^M & \cdots & PA^M T_0 E^M \\
E^M T_K^T A^M P & -\Lambda_K & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
E^M T_0^T A^M P & 0 & \cdots & -\Lambda_0
\end{bmatrix}
+
\begin{bmatrix} PA^M H \\ 0 \\ \vdots \\ 0 \end{bmatrix}
F \begin{bmatrix} 0 & U_K E^M & \cdots & U_0 E^M \end{bmatrix}
+
\begin{bmatrix} 0 \\ E^M U_K^T \\ \vdots \\ E^M U_0^T \end{bmatrix}
F^T \begin{bmatrix} H^T A^M P & 0 & \cdots & 0 \end{bmatrix} < 0. \qquad (16)
$$
Using the Lemma, we know that (16) holds for all $F^T F \le E$ if and only if there exists a constant $\varepsilon > 0$ such that

$$
\begin{bmatrix}
-B^m A^m P - P A^m B^m + \sum_{k=0}^{K}\Lambda_k & PA^M T_K E^M & \cdots & PA^M T_0 E^M \\
E^M T_K^T A^M P & -\Lambda_K & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
E^M T_0^T A^M P & 0 & \cdots & -\Lambda_0
\end{bmatrix}
+ \frac{1}{\varepsilon}
\begin{bmatrix}
PA^M H H^T A^M P & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 0
\end{bmatrix}
+ \varepsilon
\begin{bmatrix}
0 & 0 & \cdots & 0 \\
0 & E^M U_K^T U_K E^M & \cdots & E^M U_K^T U_0 E^M \\
\vdots & \vdots & \ddots & \vdots \\
0 & E^M U_0^T U_K E^M & \cdots & E^M U_0^T U_0 E^M
\end{bmatrix} < 0. \qquad (17)
$$

Rearranging (17), we get

$$
\begin{bmatrix}
\varsigma_{11} & PA^M T_K E^M & \cdots & PA^M T_0 E^M \\
E^M T_K^T A^M P & -\Lambda_K + \varepsilon E^M U_K^T U_K E^M & \cdots & \varepsilon E^M U_K^T U_0 E^M \\
\vdots & \vdots & \ddots & \vdots \\
E^M T_0^T A^M P & \varepsilon E^M U_0^T U_K E^M & \cdots & -\Lambda_0 + \varepsilon E^M U_0^T U_0 E^M
\end{bmatrix} < 0, \qquad (18)
$$

where $\varsigma_{11} = -B^m A^m P - P A^m B^m + \sum_{k=0}^{K}\Lambda_k + \frac{1}{\varepsilon} P A^M H H^T A^M P$.
By the Schur complement, (18) is equivalent to condition (9). This completes the proof.
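The Schur complement step used twice above can be checked numerically on a small, arbitrary symmetric block matrix; the values below are illustrative only:

```python
import numpy as np

# Schur complement criterion for symmetric blocks:
# [[A, B], [B^T, C]] < 0  iff  C < 0 and A - B C^{-1} B^T < 0.
A = np.array([[-4.0, 1.0], [1.0, -3.0]])
B = np.array([[0.5, 0.0], [0.2, 0.1]])
C = np.array([[-2.0, 0.0], [0.0, -1.0]])

M = np.block([[A, B], [B.T, C]])
neg_def = lambda X: np.all(np.linalg.eigvalsh(X) < 0)

assert neg_def(C)
assert neg_def(M) == neg_def(A - B @ np.linalg.inv(C) @ B.T)
```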
Acknowledgements This work was supported by the National Natural Science Foundation of China (60325311, 60534010, 60572070, 60504006) and the Program for Changjiang Scholars and Innovative Research Team in University.
References

1. Chen, T.P., Rong, L.B.: Delay-Independent Stability Analysis of Cohen-Grossberg Neural Networks. Physics Letters A 317 (2003) 436-449
2. Hwang, C.C., Cheng, C.J., Liao, T.L.: Globally Exponential Stability of Generalized Cohen-Grossberg Neural Networks with Delays. Physics Letters A 319 (2003) 157-166
3. Ji, C., Zhang, H.G., Guan, H.X.: Analysis for Global Robust Stability of Cohen-Grossberg Neural Networks with Multiple Delays. Lecture Notes in Computer Science, Vol. 3173. Springer-Verlag, Berlin Heidelberg (2004) 96-101
4. Singh, V.: Robust Stability of Cellular Neural Networks with Delay: Linear Matrix Inequality Approach. IEE Proc.-Control Theory Appl. 151(1) (2004) 125-129
5. Michel, A.N., Wang, K., Hu, B.: Qualitative Theory of Dynamical Systems - The Role of Stability Preserving Mappings. Second Edition. Marcel Dekker, New York (2001)
6. Boyd, S., El Ghaoui, L., Feron, E.: Linear Matrix Inequalities in System and Control Theory. Studies in Applied Mathematics. SIAM, Philadelphia (1994)
Existence and Global Stability Analysis of Almost Periodic Solutions for Cohen-Grossberg Neural Networks

Tianping Chen, Lili Wang, and Changlei Ren

Key Laboratory of Nonlinear Science of the Chinese Ministry of Education, Institute of Mathematics, Fudan University, Shanghai, 200433, P.R. China
{tchen, 042018047, 042018045}@fudan.edu.cn
Abstract. In this paper, we investigate a large class of periodic Cohen-Grossberg neural networks and prove that under some mild conditions there is a unique almost periodic solution, which is globally stable. As special cases, we derive many results given in the literature.
1 Introduction

It is well known that Cohen-Grossberg neural networks, proposed by Cohen and Grossberg in [1, 2], have been extensively studied in both theory and applications. Cohen-Grossberg neural networks can be described by the following ordinary differential equations:

$$\frac{du_i(t)}{dt} = -a_i(u_i(t))\Big[b_i(u_i(t)) - \sum_{j=1}^{n} a_{ij} g_j(u_j(t)) + I_i\Big], \quad i = 1,\ldots,n. \qquad (1)$$

Recently, many papers have discussed Cohen-Grossberg neural networks (see [3-17] and others). In this paper, we investigate almost periodic solutions of the following generalized Cohen-Grossberg neural networks:

$$\frac{du_i(t)}{dt} = -a_i(u_i(t))\Big[b_i(u_i(t)) - \sum_{j=1}^{n} a_{ij}(t) g_j(u_j(t)) - \sum_{j=1}^{n}\int_{0}^{\infty} f_j(u_j(t-\tau_{ij}(t)-s))\,dK_{ij}(t,s) + I_i(t)\Big], \quad i = 1,\ldots,n, \qquad (2)$$

where $\tau_{ij} \ge 0$ and $d_s K_{ij}(t,s)$, for any fixed $t \ge 0$, are Lebesgue-Stieltjes measures. The initial conditions are

$$u_i(s) = \phi_i(s) \quad \text{for } s \in (-\infty, 0], \qquad (3)$$

where $\phi_i \in C(-\infty, 0]$, $i = 1,\ldots,n$.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 204-210, 2006. © Springer-Verlag Berlin Heidelberg 2006
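To make the model concrete, the following minimal Euler simulation of a two-neuron network of the form (1) uses made-up coefficients a_i, b_i, g_j, a_ij, and I_i; it is an illustrative sketch, not the authors' setup:

```python
import numpy as np

# Made-up ingredients for a 2-neuron instance of model (1).
a = lambda u: 1.0 + 0.5 / (1.0 + u**2)      # amplification, in [1, 1.5]
b = lambda u: 2.0 * u                        # behaved function, b(0) = 0
g = np.tanh                                  # activation
A = np.array([[0.2, -0.1], [0.1, 0.2]])      # interconnection weights
I = np.array([0.5, -0.3])                    # constant inputs

u = np.array([2.0, -1.5])                    # initial state
dt = 0.01
for _ in range(20000):                       # forward-Euler integration
    u = u + dt * (-a(u) * (b(u) - A @ g(u) + I))

# With b dominating the weights, the trajectory settles to an equilibrium,
# so the vector field nearly vanishes at the final state.
residual = -a(u) * (b(u) - A @ g(u) + I)
assert np.max(np.abs(residual)) < 1e-8
```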
2 Definitions

In this section, we present some definitions and notation required throughout the paper.

Definition 1. A continuous function $x(t): R \to R^n$ is called almost periodic if for any $\varepsilon > 0$ there exist $l = l(\varepsilon) > 0$ and, in any interval of length $l(\varepsilon)$, an $\omega = \omega(\varepsilon)$ such that for all $t \in R$, $\|x(t+\omega) - x(t)\| < \varepsilon$.

Definition 2 ($\{\xi,\infty\}$-norm). $\|u(t)\| = \|u(t)\|_{\{\xi,\infty\}} = \max_{i=1,\ldots,n} |\xi_i^{-1} u_i(t)|$, where $\xi_i > 0$, $i = 1,\ldots,n$.
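Definition 2 is a weighted maximum norm; a small sketch with arbitrary illustrative weights ξ:

```python
import numpy as np

# The {xi, infinity}-norm of Definition 2: a weighted maximum norm.
def xi_inf_norm(u, xi):
    return np.max(np.abs(u / xi))

u = np.array([3.0, -4.0, 1.0])
xi = np.array([1.0, 2.0, 0.5])     # arbitrary positive weights
assert xi_inf_norm(u, xi) == 3.0   # max(|3/1|, |-4/2|, |1/0.5|) = 3
```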
3 Main Results

Throughout this article, we make the following assumptions.

Assumption 1: $g_i(\cdot) \in \mathrm{Lip}(G_i)$, $f_i(\cdot) \in \mathrm{Lip}(F_i)$, $i = 1,\ldots,n$, where $G_i, F_i > 0$ and $\mathrm{Lip}(G)$ denotes the Lipschitz functions with Lipschitz constant $G > 0$.

Assumption 2: $0 < \alpha_i \le a_i(x) \le \bar\alpha_i$; $b_i(\cdot)$ is continuous and $\frac{b_i(u)-b_i(v)}{u-v} \ge \gamma_i > 0$, $i = 1,\ldots,n$; $\tau_{ij}(t) \ge 0$; and $a_{ij}(t)$, $I_i(t)$ are almost periodic functions, $i, j = 1,\ldots,n$.

Assumption 3: For all $s \in R$, $K_{ij}(t,s): t \to K_{ij}(t,s)$ is continuous. Moreover, for any fixed $t \in R$, $dK_{ij}(t,s): s \to dK_{ij}(t,s)$ are Lebesgue-Stieltjes measures. Suppose that $|d_s K_{ij}(t,s)| \le |dK_{ij}(s)|$, where $dK_{ij}(s)$ is a Lebesgue-Stieltjes measure independent of $t$, and there exists $\epsilon > 0$ such that $\int_0^\infty e^{\epsilon s}|dK_{ij}(s)| < \infty$, $i, j = 1,\ldots,n$.

Assumption 4: For any $\varepsilon > 0$, there exist $l = l(\varepsilon) > 0$ and, in any interval of length $l$, an $\omega = \omega(\varepsilon)$ such that

$$\int_0^\infty |dK_{ij}(t+\omega, s) - dK_{ij}(t, s)|\,ds < \varepsilon, \quad \forall t \in R, \ i, j = 1, 2, \ldots, n. \qquad (4)$$

Remark 1. By the Bochner theorem, Assumption 4 is equivalent to $a_{ij}(t)$, $\tau_{ij}(t)$, $I_i(t)$ being continuous almost periodic functions. Therefore, they are all bounded. Denote

$$|a_{ij}^*| = \sup_{t\in R}|a_{ij}(t)| < \infty, \quad |I_i^*| = \sup_{t\in R}|I_i(t)| < \infty, \quad \tau_{ij}^* = \sup_{t\in R}\tau_{ij}(t) < \infty, \quad i, j = 1,\ldots,n.$$

Lemma 1. Suppose that Assumptions 1-4 are satisfied. If there are positive constants $\xi_1,\ldots,\xi_n$ such that for every $t > 0$ and $i = 1,\ldots,n$,

$$-\alpha_i \gamma_i \xi_i + \sum_{j=1}^{n}|a_{ij}(t)|\bar\alpha_i G_j \xi_j + \sum_{j=1}^{n}\bar\alpha_i F_j \xi_j \int_0^\infty |dK_{ij}(t,s)|\,ds < -\eta < 0, \qquad (5)$$
then any solution $u(t)$ of the dynamical system (2) is bounded.

Proof: Define $M(t) = \max_{s\le t}\|u(s)\|$. For any fixed $t_0$, there are two possibilities:

(i) $\|u(t_0)\| < M(t_0)$. In this case, there exists $\delta > 0$ such that $\|u(t)\| < M(t_0)$ for $t \in (t_0, t_0+\delta)$. Thus $M(t) = M(t_0)$ for $t \in (t_0, t_0+\delta)$.

(ii) $\|u(t_0)\| = M(t_0)$. In this case, let $i_{t_0}$ be an index such that $\xi_{i_{t_0}}^{-1}|u_{i_{t_0}}(t_0)| = \|u(t_0)\|$. Notice that

$$|g_j(s)| \le G_j|s| + |g_j(0)|, \quad |f_j(s)| \le F_j|s| + |f_j(0)|, \quad j = 1,\ldots,n, \ s \in R,$$

and denote

$$\bar I = \max_i \Big\{ \bar\alpha_i |b_i(0)| + \bar\alpha_i \sum_{j=1}^{n}\Big[|a_{ij}^*||g_j(0)| + |f_j(0)|\int_0^\infty |dK_{ij}(s)|\,ds\Big] + |I_i^*| \Big\}.$$

By some algebra, we have

$$\frac{d|u_{i_{t_0}}|(t)}{dt}\Big|_{t=t_0} \le \Big[-\alpha_{i_{t_0}}\gamma_{i_{t_0}}\xi_{i_{t_0}} + \sum_{j=1}^{n}|a_{i_{t_0}j}(t_0)|\bar\alpha_{i_{t_0}}G_j\xi_j + \sum_{j=1}^{n}\bar\alpha_{i_{t_0}}F_j\xi_j\int_0^\infty |dK_{i_{t_0}j}(t_0,s)|\,ds\Big] M(t_0) + \bar I \le -\eta M(t_0) + \bar I.$$

If $M(t_0) \ge 2|\bar I|/\eta$, there exists $\delta_1 > 0$ such that $M(t) = M(t_0)$ for $t \in (t_0, t_0+\delta_1)$. On the other hand, if $M(t_0) < 2|\bar I|/\eta$, then there exists $\delta_2 > 0$ such that $M(t) < 2|\bar I|/\eta$ for $t \in (t_0, t_0+\delta_2)$. Let $\delta = \min\{\delta_1, \delta_2\}$; then $M(t) \le \max\{M(t_0), 2|\bar I|/\eta\}$ holds for every $t \in (t_0, t_0+\delta)$. In summary, $\|u(t)\| \le M(t) \le \max\{M(0), 2|\bar I|/\eta\}$. The lemma is proved.

Lemma 2. Suppose that Assumptions 1-4 are satisfied. If there are positive constants $\xi_1,\ldots,\xi_n$ and $\beta > 0$ such that for every $t > 0$ and $i = 1,\ldots,n$,

$$(-\alpha_i\gamma_i + \beta)\xi_i + \sum_{j=1}^{n}|a_{ij}(t)|\bar\alpha_j G_j \xi_j + \sum_{j=1}^{n}\bar\alpha_j F_j \xi_j e^{\beta\tau_{ij}^*}\int_0^\infty e^{\beta s}|dK_{ij}(t,s)|\,ds < -\eta < 0,$$
then for any $\varepsilon > 0$ there exist $T > 0$, $l = l(\varepsilon) > 0$, and, in any interval of length $l$, an $\omega = \omega(\varepsilon)$ such that for any solution $u(t)$ of the dynamical system (2),

$$\|u(t+\omega) - u(t)\|_{\{\xi,\infty\}} \le \varepsilon, \quad \forall t > T.$$

Proof: Denote

$$
\begin{aligned}
\epsilon_i(\omega, t) ={}& \sum_{j=1}^{n}\big[a_{ij}(t+\omega) - a_{ij}(t)\big] g_j(u_j(t+\omega)) \\
&+ \sum_{j=1}^{n}\int_0^{\infty}\big[f_j(u_j(t-\tau_{ij}(t+\omega)+\omega-s)) - f_j(u_j(t-\tau_{ij}(t)+\omega-s))\big]\,dK_{ij}(t+\omega, s) \\
&+ \sum_{j=1}^{n}\int_0^{\infty} f_j(u_j(t-\tau_{ij}(t)+\omega-s))\,d\big[K_{ij}(t+\omega, s) - K_{ij}(t, s)\big] \\
&+ I_i(t+\omega) - I_i(t). \qquad (6)
\end{aligned}
$$
By Lemma 1, $u(t)$ is bounded. Thus, the right-hand side of (2) is also bounded, which implies that $u(t)$ is uniformly continuous. Therefore, by Assumption 4, for any $\varepsilon > 0$ there exists $l = l(\varepsilon) > 0$ such that every interval $[\alpha, \alpha + l]$, $\alpha \in R$, contains an $\omega$ for which

$$|\epsilon_i(\omega, t)| \le \frac{1}{2\bar\alpha}\eta\varepsilon \quad \text{for all } t \in R, \ i = 1, 2, \cdots, n, \qquad (7)$$

where $\bar\alpha = \max_i \bar\alpha_i$. Define

$$x_i(t) = \int_{u_i(t)}^{u_i(t+\omega)} \frac{1}{a_i(\rho)}\,d\rho.$$
Then we have

$$
\begin{aligned}
\frac{dx_i(t)}{dt} ={}& -\big[b_i(u_i(t+\omega)) - b_i(u_i(t))\big] + \sum_{j=1}^{n} a_{ij}(t)\big[g_j(u_j(t+\omega)) - g_j(u_j(t))\big] \\
&+ \sum_{j=1}^{n}\int_0^{\infty}\big[f_j(u_j(t-\tau_{ij}(t)+\omega-s)) - f_j(u_j(t-\tau_{ij}(t)-s))\big]\,dK_{ij}(t,s) + \epsilon_i(\omega,t).
\end{aligned}
$$

Let $i_t$ be an index such that $\xi_{i_t}^{-1}|x_{i_t}(t)| = \|x(t)\|$. Differentiating $e^{\beta s}|x_{i_t}(s)|$, we have

$$
\begin{aligned}
e^{-\beta t}\frac{d}{dt}\Big(e^{\beta s}|x_{i_t}(s)|\Big)\Big|_{s=t} \le{}& \beta|x_{i_t}(t)| - \gamma_{i_t}|u_{i_t}(t+\omega) - u_{i_t}(t)| + \sum_{j=1}^{n}|a_{i_t j}(t)| G_j |u_j(t+\omega) - u_j(t)| \\
&+ \sum_{j=1}^{n}\int_0^{\infty} F_j |u_j(t-\tau_{i_t j}(t)+\omega-s) - u_j(t-\tau_{i_t j}(t)-s)|\,|dK_{i_t j}(t,s)| + |\epsilon_{i_t}(\omega,t)| \\
\le{}& -(\gamma_{i_t}\alpha_{i_t} - \beta)|x_{i_t}(t)| + \sum_{j=1}^{n}|a_{i_t j}(t)| G_j \bar\alpha_j |x_j(t)| + \eta\varepsilon/2\bar\alpha \\
&+ \sum_{j=1}^{n}\int_0^{\infty} F_j \bar\alpha_j |x_j(t-\tau_{i_t j}(t)-s)|\, e^{-\beta(\tau_{i_t j}(t)+s)} e^{\beta(\tau_{ij}^* + s)}\,|dK_{i_t j}(t,s)|.
\end{aligned}
$$

Denote $\Psi(t) = \max_{s\le t}\{e^{\beta s}\|x(s)\|\}$. If there is a point $t_0 > 0$ such that $\Psi(t_0) = e^{\beta t_0}\|x(t_0)\|_{\{\xi,\infty\}}$, then we have

$$\frac{d}{dt}\Big(e^{\beta t}|x_{i_{t_0}}(t)|\Big)\Big|_{t=t_0} < -\eta\Psi(t_0) + \eta\varepsilon e^{\beta t_0}/2\bar\alpha. \qquad (8)$$

In addition, if $\Psi(t_0) \ge \varepsilon e^{\beta t_0}/2\bar\alpha$, then $\Psi(t)$ is non-increasing in a small neighborhood $(t_0, t_0+\delta)$ of $t_0$. On the other hand, if $\Psi(t_0) < \varepsilon e^{\beta t_0}/2\bar\alpha$, then in a small neighborhood $(t_0, t_0+\delta)$ of $t_0$, $e^{\beta t}\|x(t)\| < \varepsilon e^{\beta t_0}/2\bar\alpha$, and $\Psi(t) \le \max\{\Psi(t_0), \varepsilon e^{\beta t_0}/2\bar\alpha\}$. By the same reasoning used in the proof of Lemma 1, for all $t > t_0$ we have $e^{\beta t}\|x(t)\| \le \max\{\Psi(t_0), \varepsilon e^{\beta t}/2\bar\alpha\}$. Therefore, there exists $t_1 > 0$ such that $\|x(t)\| \le \varepsilon$ for all $t > t_1$. In summary, there must exist $T > 0$ such that $\|x(t)\| \le \varepsilon$ holds for all $t > T$. Lemma 2 is proved.

Theorem 1. Suppose that Assumptions 1-4 are satisfied. If there are positive constants $\xi_1,\ldots,\xi_n$ and $\beta > 0$ such that for every $t > 0$ and $i = 1,\ldots,n$,

$$(-\alpha_i\gamma_i + \beta)\xi_i + \sum_{j=1}^{n}|a_{ij}(t)|\bar\alpha_j G_j \xi_j + \sum_{j=1}^{n}\bar\alpha_j F_j \xi_j e^{\beta\tau_{ij}^*}\int_0^{\infty} e^{\beta s}|dK_{ij}(t,s)|\,ds < 0,$$
then system (2) has a unique almost periodic solution $v(t)$; furthermore, every solution $u(t)$ of system (2) satisfies $\|u(t) - v(t)\| = O(e^{-\beta t})$.

Proof: Define

$$
\begin{aligned}
\epsilon_{i,k}(t) = a_i(u_i(t+t_k))\Big\{ &\sum_{j=1}^{n}\big[a_{ij}(t+t_k) - a_{ij}(t)\big] g_j(u_j(t+t_k)) \\
&+ \sum_{j=1}^{n}\int_0^{\infty}\big[f_j(u_j(t-\tau_{ij}(t+t_k)+t_k-s)) - f_j(u_j(t-\tau_{ij}(t)+t_k-s))\big]\,dK_{ij}(t+t_k, s) \\
&+ \sum_{j=1}^{n}\int_0^{\infty} f_j(u_j(t-\tau_{ij}(t)+t_k-s))\,d\big[K_{ij}(t+t_k, s) - K_{ij}(t, s)\big] \\
&- \big[I_i(t+t_k) - I_i(t)\big]\Big\},
\end{aligned}
$$

where $\{t_k\}$ is a sequence of real numbers. Because $u(t)$ is bounded, we can select a sequence $\{t_k\}$ such that $|\epsilon_{i,k}(t)| \le 1/k$, $\forall i, t$. By the Arzelà-Ascoli lemma and the diagonal selection principle, we can select a subsequence $\{t_{k_j}\}$ of $\{t_k\}$ such that $u(t+t_{k_j})$ (for convenience, still denoted by
$u(t+t_k)$) converges uniformly to a continuous function $v(t) = [v_1(t), v_2(t), \cdots, v_n(t)]^T$ on any compact subset of $R$. Using the Lebesgue dominated convergence theorem, we have

$$v_i(t+h) - v_i(t) = \int_t^{t+h} -a_i(v_i(\sigma))\Big[b_i(v_i(\sigma)) - \sum_{j=1}^{n} a_{ij}(\sigma) g_j(v_j(\sigma)) - \sum_{j=1}^{n}\int_0^{\infty} f_j(v_j(\sigma - \tau_{ij}(\sigma) - s))\,dK_{ij}(\sigma, s) + I_i(\sigma)\Big]\,d\sigma$$

for any $t > 0$ and $h \in R$. Hence, $v(t)$ is a solution of system (2). Meanwhile, by Lemma 2, for any $\varepsilon > 0$ there exist $T > 0$, $l = l(\varepsilon) > 0$, and, in any interval of length $l$, an $\omega = \omega(\varepsilon)$ such that

$$|u_i(t + t_k + \omega) - u_i(t + t_k)| \le \varepsilon, \quad \forall t > 0.$$

Letting $k \to \infty$, it is easy to get

$$|v_i(t+\omega) - v_i(t)| \le \varepsilon, \quad \forall t > 0.$$

Hence, $v(t)$ is almost periodic. Finally, we prove that every solution $u(t) = [u_1(t), \ldots, u_n(t)]^T$ satisfies $\|u(t) - v(t)\| = O(e^{-\beta t})$. Similar to the proofs of the previous lemmas, define

$$z_i(t) = \int_{v_i(t)}^{u_i(t)} \frac{1}{a_i(\rho)}\,d\rho.$$

Then

$$\frac{dz_i(t)}{dt} = -\big[b_i(u_i(t)) - b_i(v_i(t))\big] + \sum_{j=1}^{n} a_{ij}(t)\big[g_j(u_j(t)) - g_j(v_j(t))\big] + \sum_{j=1}^{n}\int_0^{\infty}\big[f_j(u_j(t-\tau_{ij}(t)-s)) - f_j(v_j(t-\tau_{ij}(t)-s))\big]\,dK_{ij}(t,s).$$

Let $\Phi(t) = \max_{s\le t}\{e^{\beta s}\|z(s)\|\}$. By the same arguments used in the proofs of the
Lemma, we can prove Φ(t) = Φ(0) for all t ≥ 0. Then, x(t){ξ,∞} ≤ Φ(0)e−βt which implies u(t) − v(t){ξ,∞} ≤ α ¯ Φ(0)e−βt , Theorem 1 is proved.
α ¯ = max α ¯i i
210
4
T. Chen, L. Wang, and C. Ren
Conclusions
In this paper, a thorough analysis of existence, uniqueness, and global stability of the almost periodic solution for a class of Cohen-Grossberg neural networks has been presented. Under some general conditions, we get a set of sufficient conditions guaranteeing the existence and the global stability of the almost periodic solutions for these systems.
Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grants 60374018 and 60574044.
References

1. Cohen, M.A., Grossberg, S.: Absolute Stability and Global Pattern Formation and Parallel Memory Storage by Competitive Neural Networks. IEEE Trans. Syst. Man Cybern. 13 (1983) 815-821
2. Grossberg, S.: Nonlinear Neural Networks: Principles, Mechanisms, and Architectures. Neural Networks 1 (1988) 17-61
3. Lu, W., Chen, T.: Dynamical Behaviors of Cohen-Grossberg Neural Networks with Discontinuous Activation Functions. Neural Networks 18 (2005) 231-242
4. Cao, J.: New Results Concerning Exponential Stability and Periodic Solutions of Delayed Cellular Neural Networks. Physics Letters A 307 (2003) 136-147
5. Zhou, J., Liu, Z., Chen, G.: Dynamics of Delayed Periodic Neural Networks. Neural Networks 17(1) (2004) 87-101
6. Zheng, Y., Chen, T.: Global Exponential Stability of Delayed Periodic Dynamical Systems. Physics Letters A 322(5-6) (2004) 344-355
7. Lu, W., Chen, T.: On Periodic Dynamical Systems. Chinese Annals of Mathematics 25B(4) (2004) 455-462
8. Chen, T., Lu, W., Chen, G.: Dynamical Behaviors of a Large Class of General Delayed Neural Networks. Neural Computation 17 (2005) 949-968
9. Cao, J., Wang, J.: Global Exponential Stability and Periodicity of Recurrent Neural Networks with Time Delays. IEEE Transactions on Circuits and Systems I: Regular Papers 52(5) (2005) 920-931
10. Li, Y.: Existence and Stability of Periodic Solutions for Cohen-Grossberg Neural Networks with Multiple Delays. Chaos, Solitons and Fractals 20(3) (2004) 459-466
11. Chen, T., Lu, W.: Stability Analysis of Dynamical Neural Networks. IEEE Int. Conf. Neural Networks and Signal Processing, Nanjing, China, December (2003) 14-17
12. Lu, W., Chen, T.: Global Exponential Stability of Almost Periodic Solution for a Large Class of Delayed Dynamical Systems. Science in China Series A: Mathematics 48(8) (2005) 1015-1026
13. Liu, B., Huang, L.: Existence and Exponential Stability of Almost Periodic Solutions for Cellular Neural Networks with Time-Varying Delays. Physics Letters A 341(1-4) (2005) 135-144
A New Sufficient Condition on the Complete Stability of a Class Cellular Neural Networks

Li-qun Zhou(1,2) and Guang-da Hu(1)

(1) Department of Control Science and Engineering, Harbin Institute of Technology, Harbin, 150001, P.R. China
(2) Department of Mathematics, QiQihar University, QiQihar, 161006, P.R. China
{zhouliqun, ghu}@hit.edu.cn
Abstract. In this paper, a sufficient condition is presented to ensure the complete stability of cellular neural networks (CNNs) whose output functions are piecewise sigmoid nonlinear functions. The convergence theorem of the Gauss-Seidel method, an iterative technique for solving linear algebraic equations, plays an important role in our discussion.
1 Introduction

In this paper we consider the standard model of the CNNs:

$$\dot{x} = -x + Ay(x) + b, \quad \text{or} \quad \dot{x}_i = -x_i + \sum_{j=1}^{n} a_{ij} y_j(x_j) + b_i, \ \forall i, \qquad (1)$$

where $x = [x_1, x_2, \ldots, x_n]^T$ is the state vector, $A = [a_{ij}]$ is the feedback matrix of the system, $b = [b_1, b_2, \ldots, b_n]^T$ is a constant input vector, and $y(x) = [y_1(x_1), y_2(x_2), \ldots, y_n(x_n)]^T$ is the output vector, where

$$y_i(x_i) = f(x_i) = \begin{cases} f(r_0), & x_i > r_0, \\ f(x_i), & x_i \in [-r_0, r_0], \\ -f(r_0), & x_i < -r_0. \end{cases} \qquad (2)$$

We assume that the self-feedback coefficients satisfy $a_{ii} > 1/\beta$, $\forall i$, where $\beta$ denotes the maximum slope of $f(x_i)$. It is important to note that under the condition $a_{ii} > 1/\beta$, $\forall i$, a stable equilibrium point can only lie in the total saturation region ($|x_i| \ge r_0$, $\forall i$). The functions $f(x_i)$ are sigmoid and monotonically increasing in $[-r_0, r_0]$; $f(x_i) \in C(R^1)$, and the slope of $f(x_i)$ is bounded, that is, $0 < \frac{df(x_i)}{dx_i} \le \beta$ with $0 < \beta \le 1$, and $|f(x_i)| \le f(r_0) = \alpha$. Here $0 < r_0 < r$, where $r$ represents the positive abscissa of the maximum value point of (1) when the $f(x_i)$ are sigmoid nonlinear functions.

Studies on the complete stability of CNNs have been carried out vigorously and many criteria have been obtained so far [1-5]. In this paper, we give a new sufficient condition for CNNs whose output functions are piecewise sigmoid nonlinear functions to be completely stable. The main result of the present paper is based

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 211-216, 2006. © Springer-Verlag Berlin Heidelberg 2006
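The piecewise output (2) can be realized by clamping the argument before applying the sigmoid; a small sketch with f = tanh and an illustrative choice of r0:

```python
import numpy as np

# Piecewise output function (2), built from tanh as the sigmoid f;
# r0 = 1.0 is an illustrative choice.
r0 = 1.0
def y(x):
    return np.tanh(np.clip(x, -r0, r0))   # clamp first, then apply f

assert y(0.5) == np.tanh(0.5)             # value in [-r0, r0] unchanged
assert y(5.0) == np.tanh(r0)              # saturates at f(r0)
assert y(-5.0) == -np.tanh(r0)            # saturates at -f(r0)
```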
on the stability condition for standard CNNs given in [4]. A convergence theorem of the Gauss-Seidel method [5], an iterative technique for solving linear algebraic equations, plays an important role. We now give some definitions and theorems for CNNs.

Definition 1. Let $P$ be a matrix with positive diagonal elements. The comparison matrix $S$ of $P$ is defined by $s_{ii} = p_{ii}$ and $s_{ij} = -|p_{ij}|$ if $i \ne j$.

Definition 2. An $n\times n$ matrix $P$ with nonpositive off-diagonal elements is called an $M$-matrix if all its principal minors are positive.

Theorem 1 ([4]). A standard CNN is completely stable if the comparison matrix of $A - I$ is an $M$-matrix, where $I$ denotes the identity matrix.

Theorem 2 ([3],[5]). There exists a stable equilibrium point in the total saturation region if the comparison matrix of $A - I$ is an $M$-matrix.
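Definition 2 can be checked directly by enumerating principal minors; a sketch with illustrative matrices (for large n one would use a cheaper equivalent test):

```python
import numpy as np
from itertools import combinations

# Definition 2: a matrix with nonpositive off-diagonal entries is an
# M-matrix iff all its principal minors are positive.
def is_m_matrix(P):
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    if np.any(P - np.diag(np.diag(P)) > 0):
        return False                       # off-diagonal entries must be <= 0
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            if np.linalg.det(P[np.ix_(idx, idx)]) <= 0:
                return False
    return True

# Illustrative comparison matrices of A - I for hypothetical 2-cell CNNs.
assert is_m_matrix([[1.0, -0.5], [-0.5, 1.0]])        # M-matrix
assert not is_m_matrix([[1.0, -2.0], [-2.0, 1.0]])    # determinant negative
```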
2 Complete Stability of CNNs

Theorem 3. Let $W = (w_{ij})$ be the $n\times n$ matrix

$$w_{ij} = \begin{cases} a_{ii}\beta_{\min} - 1, & \text{if } i = j, \\ -|a_{ij}|, & \text{if } i \ne j, \end{cases} \qquad (3)$$

where $\beta_{\min} = \min\{f'(x_i) \mid x_i \in [-r_0, r_0]\}$. If $W$ is an $M$-matrix, then the CNN is completely stable.

Theorem 3 is the main result of this paper. First we give the following lemma.

Lemma 1. The CNN (1) is completely stable if

1) every $(n-1)$-cell CNN defined by $\frac{d\tilde x}{dt} = -\tilde x + \tilde A y(\tilde x) + \tilde b$ is completely stable independent of the value of $\tilde b$, where $\tilde x, y(\tilde x) \in R^{n-1}$, $\tilde b$ is any $(n-1)$-dimensional constant vector, and $\tilde A$ is any principal submatrix of $A$ of order $n-1$; and

2) the sequence $\{d^{(m)}\}_{m=0}^{\infty}$ of $n$-dimensional vectors $d^{(m)} = [d_1^{(m)}, d_2^{(m)}, \ldots, d_n^{(m)}]^T$ generated by Algorithm 1 below converges to the zero vector.

Algorithm 1
1) Set $m = 1$ and initialize the quantities $u^{(0)}$, $l^{(0)}$, $d^{(0)}$, $\beta_{i,\min}^{(0)}$, $\beta_{i,\max}^{(0)}$, $B_{i,\min}^{(0)}$, $B_{i,\max}^{(0)}$, and $\beta_i^{(0)}$ as follows:

$$l_j^{(0)} = -\alpha, \quad u_j^{(0)} = \alpha, \quad I_j^{(0)} = [l_j^{(0)}, u_j^{(0)}], \quad d_j^{(0)} = 2\alpha,$$

$$B_{i,\min}^{(0)} = \min_{f(x_j)\in I_j^{(0)}}\Big\{\sum_{j=1}^{i-1} a_{ij} f(x_j) + \sum_{j=i+1}^{n} a_{ij} f(x_j) + b_i\Big\}, \quad B_{i,\max}^{(0)} = \max_{f(x_j)\in I_j^{(0)}}\Big\{\sum_{j=1}^{i-1} a_{ij} f(x_j) + \sum_{j=i+1}^{n} a_{ij} f(x_j) + b_i\Big\},$$

$$a_{ii}\beta_{i,\min}^{(0)} = \frac{B_{i,\min}^{(0)} - x_{i,\max}^{*(0)}}{-x_{i,\max}^{*(0)}}, \quad a_{ii}\beta_{i,\max}^{(0)} = \frac{B_{i,\max}^{(0)} - x_{i,\min}^{*(0)}}{-x_{i,\min}^{*(0)}}, \quad a_{ii}\beta_i^{(0)} = \min\{a_{ii}\beta_{i,\max}^{(0)},\, a_{ii}\beta_{i,\min}^{(0)}\}.$$
2) For all $i$, compute

$$p_{i0}^{(m)} = \frac{-1}{a_{ii}\beta_{i,\min}^{(m)} - 1}\Big[\min_{y_j\in I_{j0}^{(m)}}\sum_{j=1}^{i-1} a_{ij} y_j + \min_{y_j\in I_{j0}^{(m-1)}}\sum_{j=i+1}^{n} a_{ij} y_j + b_i\Big], \qquad (4)$$

$$q_{i0}^{(m)} = \frac{-1}{a_{ii}\beta_{i,\max}^{(m)} - 1}\Big[\max_{y_j\in I_{j0}^{(m)}}\sum_{j=1}^{i-1} a_{ij} y_j + \max_{y_j\in I_{j0}^{(m-1)}}\sum_{j=i+1}^{n} a_{ij} y_j + b_i\Big], \qquad (5)$$

where $u_{i0}^{(m)} = f(p_{i0}^{(m)})$, $l_{i0}^{(m)} = f(q_{i0}^{(m)})$, $I_{i0}^{(m)} = [l_{i0}^{(m)}, u_{i0}^{(m)}]$, $d_{i0}^{(m)} = u_{i0}^{(m)} - l_{i0}^{(m)}$.

3) Set $a_{ii}\beta_i^{(m)} = \min\{a_{ii}\beta_{i,\min}^{(m)},\, a_{ii}\beta_{i,\max}^{(m)},\, a_{ii}\beta_i^{(m-1)}\}$; for all $i$ compute

$$p_i^{(m)} = \frac{-1}{a_{ii}\beta_i^{(m)} - 1}\Big[\min_{y_j\in I_j^{(m)}}\sum_{j=1}^{i-1} a_{ij} y_j + \min_{y_j\in I_j^{(m-1)}}\sum_{j=i+1}^{n} a_{ij} y_j + b_i\Big], \qquad (6)$$

$$q_i^{(m)} = \frac{-1}{a_{ii}\beta_i^{(m)} - 1}\Big[\max_{y_j\in I_j^{(m)}}\sum_{j=1}^{i-1} a_{ij} y_j + \max_{y_j\in I_j^{(m-1)}}\sum_{j=i+1}^{n} a_{ij} y_j + b_i\Big], \qquad (7)$$

where $u_i^{(m)} = f(p_i^{(m)})$, $l_i^{(m)} = f(q_i^{(m)})$, $I_i^{(m)} = [l_i^{(m)}, u_i^{(m)}]$, $d_i^{(m)} = u_i^{(m)} - l_i^{(m)}$.

4) If $d^{(m)} = [d_1^{(m)}, d_2^{(m)}, \ldots, d_n^{(m)}]^T$ is sufficiently close to the zero vector, stop. Otherwise, add 1 to $m$ and go to Step 2.

Secondly, we show that if $W = (w_{ij})$ is an $M$-matrix, then $\{d^{(m)}\}_{m=0}^{\infty}$ in Algorithm 1 converges to the zero vector. To do that, we consider the Gauss-Seidel method [6] for solving the linear algebraic equation $W\tilde z = 0$ with the initial condition $\tilde z_i^{(0)} = 2\alpha$ for all $i$. The algorithm is given as follows:

Algorithm 2
1) Set $m = 1$ and $\tilde z_j^{(0)} = 2\alpha$, $\forall j$.
2) Set $\beta_{\min} = \min\{f'(x_i) \mid x_i \in [-r_0, r_0]\}$. For all $i$ compute

$$\tilde z_i^{(m)} = \frac{\sum_{j=1}^{i-1}|a_{ij}|\tilde z_j^{(m)} + \sum_{j=i+1}^{n}|a_{ij}|\tilde z_j^{(m-1)}}{a_{ii}\beta_{\min} - 1}. \qquad (8)$$

3) If $\tilde z^{(m)} = [\tilde z_1^{(m)}, \tilde z_2^{(m)}, \ldots, \tilde z_n^{(m)}]^T$ is sufficiently close to the zero vector, stop. Otherwise, add 1 to $m$ and go to Step 2.

Lemma 2. If the matrix $W$ is an $M$-matrix, then the sequence $\{\tilde z^{(m)}\}_{m=0}^{\infty}$ in Algorithm 2 converges to the zero vector.

Proof: Lemma 2 follows immediately from Theorem 4 of [4].

Algorithm 3
1) Set $m = 1$ and $z_j^{(0)} = 2\alpha$, $\forall j$.
2) For all $i$, compute

$$z_i^{(m)} = \frac{\sum_{j=1}^{i-1}|a_{ij}| z_j^{(m)} + \sum_{j=i+1}^{n}|a_{ij}| z_j^{(m-1)}}{a_{ii}\beta_i^{(m)} - 1}.$$

3) If $z^{(m)} = [z_1^{(m)}, z_2^{(m)}, \ldots, z_n^{(m)}]^T$ is sufficiently close to the zero vector, stop. Otherwise, add 1 to $m$ and go to Step 2.

Lemma 3. If the matrix $W$ is an $M$-matrix, then the sequence $\{z^{(m)}\}_{m=0}^{\infty}$ in Algorithm 3 converges to the zero vector.

Proof: We have $z_i^{(0)} = \tilde z_i^{(0)}$ for all $i$. Suppose that $z_j^{(m)} \le \tilde z_j^{(m)}$ for $j = 1, 2, \ldots, i-1$ and $z_j^{(m-1)} \le \tilde z_j^{(m-1)}$ for $j = i+1, \ldots, n$. Since for all $m$ and all $i$ we have $a_{ii}\beta_i^{(m)} - 1 > a_{ii}\beta_{\min} - 1 > 0$, it follows that

$$z_i^{(m)} = \frac{\sum_{j=1}^{i-1}|a_{ij}| z_j^{(m)} + \sum_{j=i+1}^{n}|a_{ij}| z_j^{(m-1)}}{a_{ii}\beta_i^{(m)} - 1} \le \frac{\sum_{j=1}^{i-1}|a_{ij}|\tilde z_j^{(m)} + \sum_{j=i+1}^{n}|a_{ij}|\tilde z_j^{(m-1)}}{a_{ii}\beta_{\min} - 1} = \tilde z_i^{(m)}.$$

Thus $z_i^{(m)} \le \tilde z_i^{(m)}$ holds for all $i$ and all $m \ge 0$, so the sequence $\{z^{(m)}\}_{m=0}^{\infty}$ converges to the zero vector at least as fast as the sequence $\{\tilde z^{(m)}\}_{m=0}^{\infty}$.

Lemma 4. If the matrix $W$ is an $M$-matrix, then the sequence $\{d^{(m)}\}_{m=0}^{\infty}$ in Algorithm 1 converges to the zero vector.

Proof: We have $d_i^{(0)} = u_i^{(0)} - l_i^{(0)} = z_i^{(0)}$ for all $i$. Suppose that $d_j^{(m)} \le z_j^{(m)}$ for $j = 1, 2, \cdots, i-1$ and $d_j^{(m-1)} \le z_j^{(m-1)}$ for $j = i+1, \cdots, n$. Then

$$d_i^{(m)} = u_i^{(m)} - l_i^{(m)} = f(p_i^{(m)}) - f(q_i^{(m)}) \le p_i^{(m)} - q_i^{(m)} \le \frac{\sum_{j=1}^{i-1}|a_{ij}| d_j^{(m)} + \sum_{j=i+1}^{n}|a_{ij}| d_j^{(m-1)}}{a_{ii}\beta_i^{(m)} - 1} \le \frac{\sum_{j=1}^{i-1}|a_{ij}| z_j^{(m)} + \sum_{j=i+1}^{n}|a_{ij}| z_j^{(m-1)}}{a_{ii}\beta_i^{(m)} - 1} = z_i^{(m)}.$$

Thus $d_i^{(m)} \le z_i^{(m)}$ holds for all $i$ and all $m \ge 0$, so the sequence $\{d^{(m)}\}_{m=0}^{\infty}$ converges to the zero vector at least as fast as the sequence $\{z^{(m)}\}_{m=0}^{\infty}$.
Proof of Theorem 3: Since Algorithm 1 terminates if $W = (w_{ij})$ is an $M$-matrix, a sufficient condition for the $n$-cell CNN to be completely stable is that every $(n-1)$-cell CNN whose feedback matrix is an $(n-1)\times(n-1)$ principal submatrix of $A$ is completely stable. Note that for any $(n-1)\times(n-1)$ principal submatrix $\tilde A$ of $A$, the corresponding matrix $\tilde W$ is an $M$-matrix. Thus a sufficient condition for complete stability of an $n$-cell CNN is that every $(n-2)$-cell CNN whose feedback matrix is an $(n-2)\times(n-2)$ principal submatrix of $A$ is completely stable. Continuing this argument, we finally conclude: a sufficient condition for complete stability of an $n$-cell CNN is that every 1-cell CNN whose feedback matrix is a $1\times1$ principal submatrix of $A$ is completely stable. Since every one-cell CNN is completely stable, the $n$-cell CNN is completely stable. In fact, if $f(x_i) = x_i$, $\forall i$, in $[-r_0, r_0]$ for the function (2), then $\beta_{\min} = \beta = 1$, and the matrix $W$ in Theorem 3 is precisely the comparison matrix of $A - I$. Thus the condition of [3,4] that the comparison matrix of $A - I$ be an $M$-matrix is only a special case of our sufficient condition.
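The Gauss-Seidel iteration of Algorithm 2 can be sketched as follows; the weights, beta_min, and alpha below are made-up values chosen so that the resulting W is an M-matrix:

```python
import numpy as np

# Sketch of Algorithm 2: Gauss-Seidel iteration for W z~ = 0 on an
# illustrative 2-cell example (a, beta_min, alpha are made-up values).
a = np.array([[3.0, 0.4],
              [0.3, 2.5]])
beta_min, alpha = 1.0, 1.0     # here W = [[2, -0.4], [-0.3, 1.5]] is an M-matrix

z = np.full(2, 2 * alpha)      # initial condition z_j = 2 * alpha
for _ in range(100):
    for i in range(2):
        # in-place update: indices j < i already hold iterate m
        off = sum(abs(a[i, j]) * z[j] for j in range(2) if j != i)
        z[i] = off / (a[i, i] * beta_min - 1.0)

# W an M-matrix => the iteration contracts to the zero vector (Lemma 2).
assert np.all(z < 1e-12)
```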
A New Sufficient Condition on the Complete Stability
3 Stability of a Class of CNNs and a Numerical Example

3.1 Output Function Is the Piecewise tanh Function
Now we choose r_0. Let f(x_i) = tanh(x_i) = (e^{x_i} − e^{−x_i})/(e^{x_i} + e^{−x_i}) in (1); then we have

ẋ_i = h(x_i) + B_i, where h(x_i) = −x_i + a_ii f(x_i), B_i = Σ_{j=1, j≠i}^{n} a_ij f(x_j) + b_i.

Since df(x_i)/dx_i = 4/(e^{x_i} + e^{−x_i})², we have 0 < df(x_i)/dx_i ≤ 1. We consider the two following cases.

1. If a_ii > 1. i) Let B_i = 0; then ẋ_i = h(x_i). Let r > 0 satisfy dh(x_i)/dx_i |_{x_i = r} = 0, that is, −1 + 4a_ii/(e^r + e^{−r})² = 0. We get

r = ln(√a_ii + √(a_ii − 1)) and f′(x_i)|_{x_i = ±r} = 1/a_ii.  (9)

We choose 0 < r_0 < r and β_min = f′(x_i)|_{x_i = ±r_0}. Then ẋ_i has the characteristics shown in Fig. 1(a). ii) B_i = constant ≠ 0 is valid for the cases shown in Fig. 1(b).
Fig. 1. (a) B_i = 0: there are three equilibrium points; x_i* = 0 is unstable, x_i* = −r_2 and x_i* = r_2 are stable. (b) B_i ≠ 0: all of the stable equilibrium points share the common property |x_i*| > r_0. (c) B_i = 0: there is only one stable equilibrium point x_i* = 0; B_i ≠ 0: there is only one stable equilibrium point.
Theorem 4. Assume that a_ii > 1. Then

lim_{t→∞} |x_i(t)| = |x_i*| > r_0,  (10)

where r_0 < ln(√a_ii + √(a_ii − 1)).

2. If a_ii ≤ 1, then h(x_i) has the characteristic shown in Fig. 1(c). We conclude that when a_ii ≤ 1/β, if W = (w_ij) is an M-matrix, then the CNN is completely stable, where

w_ij = { 1 − a_ii β_min, if i = j;  −|a_ij|, if i ≠ j },  β_min = min{ f′(x_i) : x_i ∈ [−r_0, r_0] }.

The discussion shows that complete stability and global asymptotic stability of a CNN are equivalent when the CNN has only one equilibrium point.
L.-q. Zhou and G.-d. Hu
Fig. 2. The CNN has five equilibrium points: there are four stable equilibrium points (4.0755, 2.9717), (−4.0755, 2.9717), (2.7170, −3.8208), (−2.7170, 3.8208) in the total saturation region, and (0, 0) is an unstable equilibrium.
3.2 Numerical Example and Simulation Result

Let us consider the two-cell CNN state equation

ẋ_1 = −x_1 + 4f(x_1) + 0.8f(x_2),
ẋ_2 = −x_2 − 0.5f(x_1) + 4f(x_2).  (11)

By (9), we have r = ln(2 + √3) and choose r_0 = ln 3.5. Since β_min = 0.2791, |W| = (16β_min² − 1)² − 0.4 = 0.0136 > 0, hence W is an M-matrix. So the CNN is completely stable. The simulation result is shown in Fig. 2.
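The complete-stability claim can also be checked numerically by integrating (11) directly. A forward-Euler sketch (the step size, horizon, and initial state are illustrative choices of ours, and f is approximated here by tanh):

```python
import numpy as np

def simulate_cnn(x0, steps=20000, dt=1e-3):
    """Forward-Euler integration of the two-cell CNN (11); f is
    approximated by tanh, and the biases b_i are 0 as in the example."""
    A = np.array([[4.0, 0.8],
                  [-0.5, 4.0]])
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x + dt * (-x + A @ np.tanh(x))
    return x

# A trajectory settles on a stable equilibrium: the vector field
# -x + A tanh(x) vanishes (numerically) at the final state.
x_pos = simulate_cnn([1.0, 1.0])
A = np.array([[4.0, 0.8], [-0.5, 4.0]])
print(np.linalg.norm(-x_pos + A @ np.tanh(x_pos)) < 1e-6)  # True
```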
Acknowledgement

We would like to thank the referees for their careful comments and helpful suggestions, which improved this paper significantly.
References

1. Wu, C.W., Chua, L.O.: A More Rigorous Proof of Complete Stability of Cellular Neural Networks. IEEE Trans. Circuit Syst. I 44(4) (1997) 370-371
2. Chua, L.O., Roska, T.: Stability of a Class of Nonreciprocal Cellular Neural Networks: Theory. IEEE Trans. Circuit Syst. 37(12) (1990) 1520-1527
3. Arik, S., Tavsanoglu, V.: Equilibrium Analysis of Nonsymmetric CNNs. Int. J. Circuit Theory Appl. 24(2) (1996) 269-274
4. Takahashi, N., Chua, L.O.: On the Complete Stability of Nonsymmetric Cellular Neural Networks. IEEE Trans. Circuit Syst. I 45(7) (1998) 754-758
5. Takahashi, N., Chua, L.O.: A New Sufficient Condition for Nonsymmetric CNNs to Have a Stable Equilibrium Point. IEEE Trans. Circuit Syst. I 44(11) (1997) 1092-1095
6. Young, D.M.: Iterative Solution of Large Linear Systems. Academic Press, New York (1971)
Stability Analysis of Reaction-Diffusion Recurrent Cellular Neural Networks with Variable Time Delays Weifan Zheng, Jiye Zhang, and Weihua Zhang Traction Power State Key Laboratory, Southwest Jiaotong University, Chengdu 610031, China {wfzheng, jyzhang}@home.swjtu.edu.cn
Abstract. In this paper, the global exponential stability of a class of recurrent cellular neural networks with reaction-diffusion and variable time delays is studied. When a neural network contains unbounded activation functions, it may happen that an equilibrium point does not exist at all. In this paper, without assuming boundedness, monotonicity, or differentiability of the activation functions, algebraic criteria ensuring the existence, uniqueness, and global exponential stability of the equilibrium point of such neural networks are obtained.
1 Introduction

In recent years, the dynamic behaviors of neural networks have been deeply investigated due to their applicability in signal processing, image processing, and pattern recognition problems, especially in parallel computation and difficult optimization problems [1-5]. Such applications rely on the qualitative properties of the system. In hardware implementations, time delays are unavoidable due to the finite switching speeds of amplifiers, communication time, etc. Time delays, especially variable and unbounded delays, may lead to oscillation and, furthermore, to instability of networks [6]. In a complex dynamical system, absolutely constant delays are rare, and in some cases, such as the processing of moving images, the introduction of delays in the signals transmitted through the networks is required [7]. Therefore, the stability of neural networks with variable delays is practically important and has been extensively studied in the literature [8-11]. Moreover, diffusion effects cannot be avoided in neural network models when electrons move in asymmetric electromagnetic fields. The asymptotic stability of neural networks with reaction-diffusion was discussed in [12, 17], but many of the systems in these works require monotonically increasing neuron activation functions and nonnegative diffusion functions. In this paper, we relax some of these restrictions on the activation functions and diffusion functions, which are similar to those discussed in [12-17]. By constructing a proper Liapunov function, we analyze the conditions ensuring the existence, uniqueness, and global exponential stability of the equilibrium of the models. J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 217 – 223, 2006. © Springer-Verlag Berlin Heidelberg 2006
W. Zheng, J. Zhang, and W. Zhang
2 Notations and Preliminaries

In this paper, we analyze the stability of reaction-diffusion recurrent cellular neural networks with variable delays, described by the differential equations

∂u_i/∂t = Σ_{k=1}^{m} ∂/∂x_k [ D_ik(t, x, u_i) ∂u_i/∂x_k ] − c_i u_i(t) + Σ_{j=1}^{n} a_ij f_j(u_j(t)) + Σ_{j=1}^{n} b_ij g_j(u_j(t − τ_ij(t))) + J_i,  i = 1, …, n,  (1)

∂u_i/∂ñ = col(∂u_i/∂x_1, …, ∂u_i/∂x_m) = 0,  i = 1, …, n,  t ∈ I,  x ∈ ∂Ω,  (2)
where u_i is the state of neuron i (i = 1, 2, …, n) and n is the number of neurons; D_ik(t, x, u_i) is a smooth reaction-diffusion function; A = (a_ij)_{n×n} and B = (b_ij)_{n×n} are the connection matrices; J = (J_1, J_2, …, J_n)^T is the constant input; f(u) = (f_1(u_1), f_2(u_2), …, f_n(u_n))^T and g(u) = (g_1(u_1), g_2(u_2), …, g_n(u_n))^T are the activation functions of the neurons; and C = diag(c_1, c_2, …, c_n) > 0. The delays τ_ij(t) (i, j = 1, 2, …, n) are bounded functions, i.e. 0 ≤ τ_ij(t) ≤ τ. The initial conditions of equation (1) are of the form u_i(s) = φ_i(s), −τ ≤ s ≤ 0, where φ_i is bounded and continuous on [−τ, 0]. Equation (2) is the boundary condition of equation (1), in which x ∈ Ω ⊂ R^m, Ω is a compact set with smooth boundary and mes Ω > 0, ∂Ω is the boundary of Ω, and t ∈ I = [0, +∞).
For convenience, we introduce some notation. The notation u = (u_1, u_2, …, u_n)^T ∈ R^n represents a column vector (the symbol ( )^T denotes the transpose). For a matrix A = (a_ij)_{n×n}, |A| denotes the absolute-value matrix |A| = (|a_ij|)_{n×n}, i, j = 1, 2, …, n; [A]^S is defined as (A^T + A)/2. For x ∈ R^n, |x| = (|x_1|, …, |x_n|)^T, and ||x|| denotes the vector norm ||x|| = max_{1≤i≤n} |x_i|. Let

D(t, x, u) = col( Σ_{k=1}^{m} ∂/∂x_k (D_1k(t, x, u_1) ∂u_1/∂x_k), …, Σ_{k=1}^{m} ∂/∂x_k (D_nk(t, x, u_n) ∂u_n/∂x_k) );

then model (1) becomes the following system:

u̇(t) = D(t, x, u(t)) − Cu(t) + Af(u(t)) + Bg(u(t − τ_ij(t))) + J.  (3)
Systems (3) and (1) have the same stability properties. If there is a constant u_0 = u_0* = const (const denotes an invariable constant) which is a solution of the equation

Cu(t) = Af(u(t)) + Bg(u(t − τ_ij(t))) + J,  (4)

then ∂u_0*/∂x = 0. From equation (3), we get D_ik(t, x, u_i(t)) ∂u_0*/∂x_k = 0, and

D(t, x, u(t)) − Cu(t) + Af(u(t)) + Bg(u(t − τ_ij(t))) + J = 0.

It implies that equations (4) and (3) have the same equilibria, and system (3) has the same equilibrium as that of system (1). Suppose the activation functions satisfy:

Assumption 1. There exist real numbers p_j > 0, q_j > 0, such that

p_j = sup_{y≠z} | (f_j(y) − f_j(z)) / (y − z) |,  q_j = sup_{y≠z} | (g_j(y) − g_j(z)) / (y − z) |  (j = 1, 2, …, n).

Let P = diag(p_1, …, p_n), Q = diag(q_1, …, q_n).
3 Existence and Uniqueness of the Equilibrium Point

In this section, we give a sufficient condition ensuring the existence and uniqueness of the equilibrium point for the class of unbounded activation functions satisfying Assumption 1.

Definition 1 [18]. A real matrix A = (a_ij)_{n×n} is said to be an M-matrix if a_ij ≤ 0, i, j = 1, 2, …, n, i ≠ j, and all successive principal minors of A are positive.

We first define a map associated with (1):

H(u) = −Cu + Af(u) + Bg(u) + J.  (5)

It is known that the solutions of H(u) = 0 are the equilibria of (1). If the map H(u) is a homeomorphism on R^n, then there exists a unique point u* such that H(u*) = 0, i.e., system (1) has a unique equilibrium u*.

Lemma [8]. If H(u) ∈ C^0 satisfies (i) H(u) is injective on R^n, and (ii) ||H(u)|| → ∞ as ||u|| → ∞, then H(u) is a homeomorphism of R^n.

Theorem 1. If Assumption 1 is satisfied and α = C − (|A|P + |B|Q) is an M-matrix, then, for every input J, system (1) has a unique equilibrium u*.

Proof. In order to prove that for every input J system (1) has a unique equilibrium point u*, it suffices to prove that H(u) is a homeomorphism on R^n. We shall prove this in two steps.
Step 1. We prove that condition (i) of the Lemma is satisfied. Suppose there exist x, y ∈ R^n with x ≠ y such that H(x) = H(y). From (5), we get

C(x − y) = A(f(x) − f(y)) + B(g(x) − g(y)).  (6)

From Assumption 1, there exist matrices K = diag(K_1, …, K_n) (−P ≤ K ≤ P) and L = diag(L_1, …, L_n) (−Q ≤ L ≤ Q) such that f(x) − f(y) = K(x − y) and g(x) − g(y) = L(x − y). So (6) can be written as

[−C + (AK + BL)](x − y) = 0.  (7)
In the following, we prove that det[−C + (AK + BL)] ≠ 0. Consider the system

ż = [−C + (AK + BL)]z.  (8)

Since α is an M-matrix, by the properties of M-matrices [18] there exist ξ_i > 0 (i = 1, 2, …, n) such that

−c_i ξ_i + Σ_{j=1}^{n} ξ_j (|a_ji K_j| + |b_ji L_j|) < 0  (i = 1, 2, …, n).  (9)

Construct the Liapunov function V(z) = Σ_{i=1}^{n} ξ_i |z_i|. Calculating the upper right derivative D+V of V along the solutions of (8), we get

D+V(z) ≤ Σ_{i=1}^{n} [ −c_i ξ_i + Σ_{j=1}^{n} ξ_j (|a_ji K_j| + |b_ji L_j|) ] |z_i| < 0  (||z|| ≠ 0).

By the stability theorem [18], the zero solution of (8) is globally asymptotically stable. Thus the matrix −C + (AK + BL) is a stable matrix and det[−C + (AK + BL)] ≠ 0. From (7), x = y, which is a contradiction. So the map H is injective.

Step 2. We now prove that condition (ii) of the Lemma is satisfied. Let H̄(u) = −Cu + A(f(u) − f(0)) + B(g(u) − g(0)). To show that H is a homeomorphism, it suffices to show that ||H̄(u)|| → ∞. From Assumption 1, we get |f_i(u_i) − f_i(0)| ≤ p_i |u_i| and |g_i(u_i) − g_i(0)| ≤ q_i |u_i| (i = 1, 2, …, n). Since α is an M-matrix, by the properties of M-matrices [18] there exists a positive definite diagonal matrix D such that [D(−C + |A|P + |B|Q)]^S ≤ −ε E_n < 0 for sufficiently small ε > 0, where E_n is the identity matrix. By calculation, we have [Du]^T H̄(u) ≤ −ε ||u||². Using the Schwartz inequality, we get ε ||u||² ≤ ||D|| ||u|| ||H̄(u)||, namely ε ||u|| / ||D|| ≤ ||H̄(u)||. So ||H̄(u)|| → +∞, i.e. ||H(u)|| → +∞ as ||u|| → +∞. From the Lemma, we know that H(u) is a homeomorphism on R^n. So system (1) has a unique equilibrium point u*. The proof is completed.
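Theorem 1's hypothesis — that α = C − (|A|P + |B|Q) is an M-matrix — can be verified numerically via Definition 1. A small sketch with illustrative parameter values (chosen by us, not taken from the paper):

```python
import numpy as np

# Theorem 1 test: alpha = C - (|A| P + |B| Q) must be an M-matrix
# (Definition 1). All numerical values below are illustrative.
C = np.diag([3.0, 3.0])
A = np.array([[0.5, -0.4], [0.3, 0.6]])
B = np.array([[0.2, 0.1], [-0.3, 0.2]])
P = np.eye(2)   # Lipschitz bounds p_j of f_j (e.g. tanh-like activations)
Q = np.eye(2)   # Lipschitz bounds q_j of g_j

alpha = C - (np.abs(A) @ P + np.abs(B) @ Q)
off = alpha.copy()
np.fill_diagonal(off, 0.0)
minors_positive = all(np.linalg.det(alpha[:k, :k]) > 0
                      for k in range(1, alpha.shape[0] + 1))
offdiag_nonpositive = bool(np.all(off <= 0))
print(minors_positive and offdiag_nonpositive)  # True -> unique equilibrium
```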
4 Global Exponential Stability of the Equilibrium Point

In this section, we apply the idea of the vector Liapunov method [18, 19] to analyze the global exponential stability of model (1).

Definition 2. The equilibrium point u* of (1) is said to be globally exponentially stable if there exist constants λ > 0 and β > 0 such that ||u(t) − u*|| ≤ β ||φ − u*|| e^{−λt} for all t ≥ 0, where ||φ − u*|| = max_{1≤i≤n} sup_{s∈[−τ,0]} |φ_i(s) − u_i*|.

Theorem 2. If Assumption 1 is satisfied and α = C − (|A|P + |B|Q) is an M-matrix, then for each input J, system (1) has a unique equilibrium point, which is globally exponentially stable.
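A decay rate λ satisfying inequality (11) in the proof below can be estimated numerically: since α is an M-matrix, ξ = α^{-1}·1 is a valid positive weight vector at λ = 0, and one can then bisect on λ. A sketch (the helper name and all numerical values are illustrative choices of ours):

```python
import numpy as np

def decay_rate(C, A, B, P, Q, tau, lam_hi=5.0, iters=60):
    """Bisect for the largest lambda for which inequality (11) holds
    with the fixed weight vector xi = alpha^{-1} 1 (positive because
    alpha is an M-matrix). Illustrative helper, not from the paper."""
    alpha = C - (np.abs(A) @ P + np.abs(B) @ Q)
    xi = np.linalg.solve(alpha, np.ones(C.shape[0]))
    c = np.diag(C)
    def ok(lam):
        lhs = -xi * (c - lam) + (np.abs(A) @ P
                                 + np.exp(lam * tau) * np.abs(B) @ Q) @ xi
        return bool(np.all(lhs < 0))
    lo, hi = 0.0, lam_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if ok(mid) else (lo, mid)
    return lo

# Illustrative parameters (p_j = q_j = 1, e.g. f = g = tanh):
C = np.diag([3.0, 3.0])
A = np.array([[0.5, -0.4], [0.3, 0.6]])
B = np.array([[0.2, 0.1], [-0.3, 0.2]])
lam = decay_rate(C, A, B, np.eye(2), np.eye(2), tau=1.0)
print(lam > 0)  # True: a positive admissible decay rate exists
```

This only searches over one particular ξ, so it yields a (possibly conservative) lower bound on the exponential rate guaranteed by Theorem 2.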
Proof. Since α is an M-matrix, by Theorem 1 system (1) has a unique equilibrium point u*. Let z(t) = u(t) − u*; Eq. (1) can be written as

∂z_i(t)/∂t = Σ_{k=1}^{m} ∂/∂x_k [ D_ik(t, x, z(t) + u*) ∂z_i/∂x_k ] − c_i z_i(t) + Σ_{j=1}^{n} a_ij f̄_j(z_j(t)) + Σ_{j=1}^{n} b_ij ḡ_j(z_j(t − τ_ij(t))),  (10)

where f̄_j(z_j(t)) = f_j(z_j(t) + u_j*) − f_j(u_j*) and ḡ_j(z_j(t − τ_ij(t))) = g_j(z_j(t − τ_ij(t)) + u_j*) − g_j(u_j*), j = 1, …, n.

The initial condition of equations (10) is ψ(s) = φ(s) − u*, −τ ≤ s ≤ 0, and equations (10) have a unique equilibrium at z = 0. From Assumption 1, we get |f̄_j(z_j(t))| ≤ p_j |z_j(t)| and |ḡ_j(z_j(t − τ_ij(t)))| ≤ q_j |z_j(t − τ_ij(t))|, j = 1, …, n. Since α is an M-matrix [18], there exist ξ_i > 0 (i = 1, 2, …, n) and λ > 0 satisfying

−ξ_i (c_i − λ) + Σ_{j=1}^{n} ξ_j (|a_ij| p_j + e^{λτ} |b_ij| q_j) < 0,  (11)
where τ is a fixed number. Construct V_i(t) = e^{λt} z_i(t) and the Liapunov functional V̄_i(t) = ∫_Ω |V_i(t)| dx (i = 1, 2, …, n). Calculating the upper right derivative D+V̄_i(t) of V̄_i(t) along the solutions of equation (10), and considering Assumption 1, the boundary condition (2) and mes Ω > 0, we get

D+V̄_i(t) = ∫_Ω e^{λt} sgn z_i · Σ_{k=1}^{m} ∂/∂x_k [ D_ik(t, x, z_i(t) + u*) ∂z_i/∂x_k ] dx + ∫_Ω λ e^{λt} |z_i(t)| dx
  + ∫_Ω e^{λt} { sgn z_i(t) [ −c_i z_i(t) + Σ_{j=1}^{n} a_ij f̄_j(z_j(t)) + Σ_{j=1}^{n} b_ij ḡ_j(z_j(t − τ_ij(t))) ] } dx
≤ e^{λt} sgn z_i(t) Σ_{k=1}^{m} [ D_ik(t, x, z_i(t) + u*) ∂z_i/∂x_k ] |_{∂Ω}
  + ∫_Ω e^{λt} [ (−c_i + λ) |z_i(t)| + Σ_{j=1}^{n} |a_ij| |f̄_j(z_j(t))| + Σ_{j=1}^{n} |b_ij| |ḡ_j(z_j(t − τ_ij(t)))| ] dx
≤ ∫_Ω { −(c_i − λ) |V_i(t)| + Σ_{j=1}^{n} [ |a_ij| p_j |V_j(t)| + e^{λτ_ij(t)} |b_ij| q_j e^{λ(t − τ_ij(t))} |z_j(t − τ_ij(t))| ] } dx
≤ ∫_Ω { −(c_i − λ) |V_i(t)| + Σ_{j=1}^{n} [ |a_ij| p_j |V_j(t)| + e^{λτ} |b_ij| q_j sup_{t−τ≤s≤t} |V_j(s)| ] } dx.  (12)
Define the curve γ = { y(l) : y_i = ξ_i l, l > 0, i = 1, 2, …, n } and the set Ω(y) = { u : 0 ≤ u ≤ y, y ∈ γ }. Let ξ_min = min_{1≤i≤n} {ξ_i}, ξ_max = max_{1≤i≤n} {ξ_i}, and take l_0 = (1 + δ) ||ψ|| / ξ_min, where δ > 0 is a constant number. Then

{ |V| : |V| = e^{λs} |ψ(s)|, −τ ≤ s ≤ 0 } ⊂ Ω(z_0(l_0)),
namely, |V_i(s)| = e^{λs} |ψ_i(s)| < ξ_i l_0, −τ ≤ s ≤ 0, i = 1, 2, …, n. We claim that |V_i(t)| < ξ_i l_0 for t ∈ [0, +∞), i = 1, 2, …, n. If this is not true, then there exist some index i and a time t_1 > 0 such that |V_i(t_1)| = ξ_i l_0, D+|V_i(t_1)| ≥ 0, and |V_j(t)| ≤ ξ_j l_0 for −τ < t ≤ t_1, j = 1, 2, …, n. So we get

D+V̄_i(t_1) = ∫_Ω D+|V_i(t_1)| dx ≥ 0  (i = 1, 2, …, n).  (13)

However, from (11) and (12), we get

D+V̄_i(t_1) = ∫_Ω D+|V_i(t_1)| dx ≤ [ −ξ_i (c_i − λ) + Σ_{j=1}^{n} ξ_j (|a_ij| p_j + e^{λτ} |b_ij| q_j) ] l_0 < 0,

which contradicts inequality (13). So |V_i(t)| < ξ_i l_0 for t ∈ [0, +∞); therefore |z_i(t)| < ξ_i l_0 e^{−λt} ≤ β ||ψ|| e^{−λt} (i = 1, 2, …, n), where β = (1 + δ) ξ_max / ξ_min. From Definition 2, the zero solution of system (10) is globally exponentially stable, i.e., the equilibrium point of system (1) is globally exponentially stable. The proof is completed.

Remark: Similar systems are discussed in [12-17], but all of them require nonnegative diffusion functions or increasing activation functions; those conditions are not required in Theorem 2 of this paper.
5 Conclusion

In this paper, a thorough analysis of the existence, uniqueness, and global exponential stability of the equilibrium point for a class of reaction-diffusion recurrent cellular neural networks with variable delays has been presented. By applying the idea of the vector Liapunov function method and M-matrix theory, sufficient conditions for global exponential stability, independent of the delays, are obtained. Moreover, the conditions are easily tested.
Acknowledgments This work is supported by Youth Science Foundation of Sichuan (No. 05ZQ026-015), National Program for New Century Excellent Talents in University (No. NCET-040889), and Natural Science Foundation of China (No.10272091).
References

1. Chua, L.O., Yang, L.: Cellular Neural Networks: Theory. IEEE Trans. Circ. Syst. 35 (1988) 1257-1272
2. Zhang, J., Jin, X.: Global Stability Analysis in Delayed Hopfield Neural Networks Models. Neural Networks 13 (2000) 745-753
3. Zhang, J., Yang, Y.: Global Stability Analysis of Bidirectional Associative Memory Neural Networks with Time Delay. Int. J. Circ. Theor. Appl. 29 (2001) 185-196
4. Zhang, Y., Peng, P.A., Leung, K.S.: Convergence Analysis of Cellular Neural Networks with Unbounded Delay. IEEE Trans. Circ. Syst.-I 48 (2001) 680-687
5. Zhang, J.: Globally Exponential Stability of Neural Networks with Variable Delays. IEEE Trans. Circ. Syst.-I 50 (2003) 288-291
6. Civalleri, P.P., Gilli, M., Pandolfi, L.: On Stability of Cellular Neural Networks with Delay. IEEE Trans. Circ. Syst.-I 40 (1993) 157-164
7. Roska, T., Chua, L.O.: Cellular Neural Networks with Delay Type Template Elements and Nonuniform Grids. Int. J. Circ. Theor. Appl. 20 (1992) 469-481
8. Forti, M., Tesi, A.: New Conditions for Global Stability of Neural Networks with Application to Linear and Quadratic Programming Problems. IEEE Trans. Circ. Syst.-I 42 (1995) 354-366
9. Zheng, W., Zhang, J.: Global Exponential Stability of a Class of Neural Networks with Variable Delays. Computers and Mathematics with Applications 49 (2005) 895-902
10. Zhang, J., Suda, Y., Iwasa, T.: Absolutely Exponential Stability of a Class of Neural Networks with Unbounded Delay. Neural Networks 17 (2004) 391-397
11. Zhang, J.: Absolute Stability Analysis in Cellular Neural Networks with Variable Delays and Unbounded Delay. Computers and Mathematics with Applications 47 (2004) 183-194
12. Liao, X., Fu, Y., et al.: Stability of Hopfield Neural Networks with Reaction-diffusion Terms. Acta Electronica Sinica 28 (2000) 78-80 (in Chinese)
13. Liao, X., Li, J.: Stability in Gilpin-Ayala Competition Models with Diffusion. Nonlinear Anal. 28 (1997) 1751-1758
14. Wang, L., Xu, D.: Global Stability of Reaction-diffusion Hopfield Neural Networks with Variable Time Delay. Science in China (Series E) 33 (2003) 488-495
15. Liang, J., Cao, J.: Global Exponential Stability of Reaction-Diffusion Recurrent Neural Networks with Time-varying Delays. Physics Letters A 314 (2003) 434-442
16. Song, Q., Cao, J.: Global Exponential Stability and Existence of Periodic Solutions in BAM Networks with Delays and Reaction-diffusion Terms. Chaos Solitons and Fractals 23 (2005) 421-430
17. Song, Q., Zhao, Z., Li, Y.: Global Exponential Stability of BAM Neural Networks with Distributed Delays and Reaction-diffusion Terms. Physics Letters A 335 (2005) 213-225
18. Siljak, D.D.: Large-scale Dynamic Systems — Stability and Structure. Elsevier North-Holland, New York (1978)
19. Zhang, J., Yang, Y., Zeng, J.: String Stability of Infinite Interconnected System. Applied Mathematics and Mechanics 21 (2000) 791-796
Exponential Stability of Delayed Stochastic Cellular Neural Networks

Wudai Liao¹, Yulin Xu¹, and Xiaoxin Liao²

¹ School of Electrical and Information Engineering, Zhongyuan University of Technology, 450007, Zhengzhou, Henan, P.R. China
{wdliao, xuyulin}@zzti.edu.cn
² Department of Control Science and Engineering, Huazhong University of Science and Technology, 430074, Wuhan, Hubei, P.R. China
[email protected]
Abstract. In view of the saturation-linearity character of the output functions of the neurons of cellular neural networks, the method of decomposing the state space into sub-regions is adopted to study almost sure exponential stability of delayed cellular neural networks operating in a noisy environment. When the perturbation terms in the neural network model satisfy a Lipschitz condition, some algebraic criteria are obtained. The results show that if an equilibrium of the neural network is an interior point of a sub-region, and an appropriate matrix related to this equilibrium has a stable degree sufficient to dominate the perturbation, then the equilibrium of the delayed cellular neural network retains the property of exponential stability. All criteria in the paper require only computing eigenvalues of matrices.
1 Introduction

The stability problem for cellular neural networks has been widely studied since their introduction, including the original work [1], delayed cases [2], and other studies. For realism, we should consider the case where the cellular neural networks operate in a noisy environment, because network realization is through the VLSI approach and information transmission among real brain neuron cells is a noisy process. We call these neural networks stochastic neural networks. The stability analysis of these networks originates in [3]; after that, many results [4, 5, 6] were obtained. Because of the saturation-linearity characteristic of the output functions of neurons in cellular neural networks, we use this property to study the stability problem for stochastic cellular neural networks of the form

dx(t) = [−Bx(t) + Af(x(t − τ)) + I]dt + σ̄(x(t), x(t − τ))dw(t),  (1)

where x = (x_1, x_2, …, x_n)^T ∈ R^n is the state vector of the neural network, x(t − τ) = (x_1(t − τ_1), x_2(t − τ_2), …, x_n(t − τ_n))^T, and τ_i ≥ 0 is the time delay of the J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 224–229, 2006. © Springer-Verlag Berlin Heidelberg 2006
neuron i, with 0 ≤ τ_i ≤ τ, i = 1, 2, …, n. f(·) is the vector of the output functions of the neurons, f(x) = (f_1(x_1), f_2(x_2), …, f_n(x_n))^T, where f_i(·) has the form

f_i(u) = (1/2)(|u + 1| − |u − 1|),  −∞ < u < ∞,  i = 1, 2, …, n.  (2)

σ̄(·, ·) ∈ R^{n×m} is the perturbation matrix, which satisfies a Lipschitz condition: there exists a positive constant L such that

||σ̄(x, y) − σ̄(x̄, ȳ)||² ≤ L(||x − x̄||² + ||y − ȳ||²).  (3)

||·|| in this paper denotes the Frobenius norm of a matrix. w(·) is an m-dimensional Brownian motion, B = diag(b_1, b_2, …, b_n) is a positive diagonal matrix, A = (a_ij)_{n×n} is the weight matrix between neurons, and I is the bias of the neurons.
2 Main Results

In this section, we establish sufficient criteria ensuring almost sure exponential stability of the neural network (1). Since R = (−∞, ∞) decomposes into the three intervals (−∞, −1), [−1, 1] and (1, ∞), the space R^n can be divided into 3^n sub-regions. Suppose that x* = (x_1*, x_2*, …, x_n*)^T is an equilibrium of system (1) that is an interior point of one of the sub-regions, and let N(x*) be the greatest neighborhood of x* lying in the same sub-region as x*. Take the transformation z = x − x* and the function

φ(u) = { 1, |u| < 1;  0, |u| > 1 }.

From the assumptions above and the form (2) of the output functions,

f(z(t − τ) + x*) − f(x*) = Φ(x*) z(t − τ)

holds for points in N(x*), where Φ(x*) = diag(φ(x_1*), φ(x_2*), …, φ(x_n*)) is a diagonal matrix whose elements are either 0 or 1. Thus, in order to discuss the stability of the equilibrium x* of system (1), we need only study the same property of the trivial equilibrium z = 0 of the system

dz(t) = [−Bz(t) + AΦ(x*) z(t − τ)]dt + σ(z(t), z(t − τ))dw(t),  (4)

where σ(z(t), z(t − τ)) = σ̄(z(t) + x*, z(t − τ) + x*) − σ̄(x*, x*). For any z, y ∈ R^n and a symmetric positive definite matrix Q, condition (3) gives the following estimate on σ(z(t), z(t − τ)):

trace(σ^T(z, y) Q σ(z, y)) ≤ trace(Q) L(||z||² + ||y||²).  (5)

We first give a definition and a lemma [4] which play an important role in this paper.
W. Liao, Y. Xu, and X. Liao
Definition 1. For a Lyapunov function V ∈ C²(R^n; R_+) — that is, continuously twice differentiable with respect to its variables, R_+ = [0, ∞) — and any z, y ∈ R^n, define the operator L generated by system (4) as

LV(z, y) = V_x [−Bz + AΦ(x*)y] + (1/2) trace(σ^T(z, y) V_xx σ(z, y)).

Lemma 1. For system (4), if there exist functions V ∈ C²(R^n; R_+), μ ∈ C(R^n; R_+), μ_i ∈ C(R_+; R_+), i = 1, 2, …, n, and constants λ_1 > λ_2 ≥ 0, such that

(1) LV(z, y) ≤ −λ_1 μ(z) + λ_2 Σ_{i=1}^{n} μ_i(y_i),  (2) V(z) ≤ μ(z),  (3) Σ_{i=1}^{n} μ_i(z_i) ≤ μ(z)

hold for any z, y ∈ R^n, then the trivial equilibrium z = 0 is almost surely exponentially stable.

In the following, we present the main results.

Theorem 1. For a diagonal positive definite matrix Q = diag(q_1, q_2, …, q_n), select a positive diagonal matrix R = diag(r_1, r_2, …, r_n). If the matrix

H = [ −2QB + αU + R    QAΦ(x*) ;
      (QAΦ(x*))^T     −R + αU ]

is negative definite, then the trivial equilibrium z = 0 of system (4) is almost surely exponentially stable, where α = trace(Q)L and U is the unit matrix.

Proof (of Theorem 1). For any z, y ∈ R^n, choose the Lyapunov function V(z) = z^T Q z; then the operator L generated by system (4) has the form

LV(z, y) = 2z^T Q(−Bz + AΦ(x*)y) + trace(σ^T(z, y) Q σ(z, y)).

Using condition (5) and denoting −λ = λ_max(H), λ > 0, we have the estimate

LV(z, y) ≤ −2z^T (QB) z + 2z^T [QAΦ(x*)] y + α(||z||² + ||y||²)
         = (z^T, y^T) H (z; y) − Σ_{i=1}^{n} r_i z_i² + Σ_{i=1}^{n} r_i y_i²
         ≤ −Σ_{i=1}^{n} (λ + r_i) z_i² + Σ_{i=1}^{n} (r_i − λ) y_i².

From the construction of the matrix H, we can easily deduce that r_i − λ ≥ α > 0, i = 1, 2, …, n. Denote

λ_1 = min_{1≤i≤n} { (λ + r_i)/q_i },  λ_2 = max_{1≤i≤n} { (r_i − λ)/(λ + r_i) }.
Obviously, λ_1 > 0, 0 < λ_2 < 1 (using λ > 0), and (λ + r_i)/λ_1 ≥ q_i, i = 1, 2, …, n. Let

μ(z) = (1/λ_1) Σ_{i=1}^{n} (λ + r_i) z_i²,  μ_i(y_i) = (1/λ_1)(λ + r_i) y_i²;

then we have

(1) LV(z, y) ≤ −λ_1 μ(z) + λ_1 λ_2 Σ_{i=1}^{n} μ_i(y_i),  (2) V(z) ≤ μ(z),  (3) Σ_{i=1}^{n} μ_i(y_i) = μ(y).

According to Lemma 1, the trivial equilibrium z = 0 of system (4) is almost surely exponentially stable. The proof is complete.

Corollary 1. If there exist positive diagonal matrices Q = diag(q_1, q_2, …, q_n) and R = diag(r_1, r_2, …, r_n) such that the matrix

H_1 = [ −2QB + R    QAΦ(x*) ;
        (QAΦ(x*))^T    −R ]

has stable degree α, that is, λ_max(H_1) < −α, then the trivial equilibrium z = 0 of system (4) is almost surely exponentially stable.

Proof (of Corollary 1). By Theorem 1, we need only verify that the matrix H in Theorem 1 is negative definite. For any z, y ∈ R^n,

(z^T, y^T) H (z; y) = (z^T, y^T) H_1 (z; y) + α||z||² + α||y||² ≤ (λ_max(H_1) + α)||z||² + (λ_max(H_1) + α)||y||².

Because λ_max(H_1) < −α, the matrix H in Theorem 1 is negative definite. The proof is complete.

In the following, we give an approach to choosing suitable matrices R and Q, which makes the criteria easy to use in system synthesis.

Corollary 2. For an appropriate positive number m > 0, if the matrix

H_2 = [ −(2m/(m+1)) U    B^{−1}AΦ(x*) ;
        (B^{−1}AΦ(x*))^T    −(2/(m+1)) U ]

has stable degree α = L Σ_{i=1}^{n} b_i^{−1}, then the trivial equilibrium z = 0 of system (4) is almost surely exponentially stable.

Proof. Let −2QB + R = −mR and Q = B^{−1} in the matrix H_1; this implies R = [2/(m+1)]U, and hence the matrix H_1 becomes the matrix H_2. The proof is complete.
Corollary 3. Let x* = (x_1*, x_2*, …, x_n*)^T be an equilibrium of system (1) with |x_i*| < 1, i = 1, 2, …, n. If there exists a number m > 0 such that the matrix

H_3 = [ −(2m/(m+1)) U    B^{−1}A ;
        (B^{−1}A)^T    −(2/(m+1)) U ]

has stable degree α = L Σ_{i=1}^{n} b_i^{−1}, then the equilibrium x* is almost surely exponentially stable.

Proof. In this case, Φ(x*) = U; by Corollary 2, Corollary 3 holds. The proof is complete.

Corollary 4. Let x* = (x_1*, x_2*, …, x_n*)^T be an equilibrium of system (1) with |x_i*| > 1, i = 1, 2, …, n. If there exists a number m > 0 such that

0 ≤ L < { (2/(m+1)) (Σ_{i=1}^{n} b_i^{−1})^{−1},  m > 1;
          (2m/(m+1)) (Σ_{i=1}^{n} b_i^{−1})^{−1},  m ≤ 1, }

then the equilibrium x* remains (almost surely) exponentially stable whenever x* of the deterministic neural network corresponding to system (1) is exponentially stable.

Proof. In this case, Φ(x*) = 0, and H_2 has the form

[ −(2m/(m+1)) U    0 ;
  0    −(2/(m+1)) U ];

its biggest eigenvalue is λ_max(H_2) = max{ −2m/(m+1), −2/(m+1) }. So the condition of this corollary implies λ_max(H_2) < −α = −L Σ_{i=1}^{n} b_i^{−1}; by Corollary 2, this corollary holds. The proof is complete.

Corollary 5. Let x* = (x_1*, x_2*, …, x_n*)^T be an equilibrium of system (1) with |x*_{j0}| < 1 and |x_i*| > 1 for i = 1, …, j_0 − 1, j_0 + 1, …, n. If there exists a positive number m such that

c_{j0} := Σ_{i=1}^{n} (a_{i j0}/b_i)² < 4m/(m+1)²

and 0 ≤ L < λ(m) hold, then the equilibrium x* is almost surely exponentially stable, where

λ(m) = min{ 2/(m+1), 2m/(m+1), 1 − √(c_{j0} + ((m−1)/(m+1))²) } · (Σ_{i=1}^{n} b_i^{−1})^{−1}.
Proof. Without loss of generality, let |x_1*| < 1 and |x_i*| > 1, i = 2, 3, …, n. Hence Φ(x*) = diag(1, 0, …, 0), and the characteristic polynomial of the matrix H_2 is

f(λ) = (λ + 2/(m+1))^{n−1} (λ + 2m/(m+1))^{n−1} [ (λ + 2/(m+1))(λ + 2m/(m+1)) − c_1 ].

All its eigenvalues satisfy one of the equations

λ + 2/(m+1) = 0,  λ + 2m/(m+1) = 0,  λ² + 2λ + 4m/(m+1)² − c_1 = 0.

The condition c_1 < 4m/(m+1)² implies that the bigger root of the last equation is negative; it has the form

λ = −1 + √(c_1 + ((m−1)/(m+1))²).

Then, from the condition of this corollary, the condition of Corollary 2 holds. The proof is complete.
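The closed-form root λ = −1 + √(c_1 + ((m−1)/(m+1))²) can be cross-checked against a direct eigenvalue computation of H_2 with Φ(x*) = diag(1, 0, …, 0); all numerical values below are illustrative:

```python
import numpy as np

# Cross-check of the characteristic-root formula in the proof of
# Corollary 5 for Phi(x*) = diag(1, 0, ..., 0).
n, m = 3, 2.0
b = np.array([1.5, 2.0, 2.5])
A = np.array([[0.4, 0.1, -0.2],
              [0.3, 0.5, 0.1],
              [-0.1, 0.2, 0.6]])
Phi = np.diag([1.0, 0.0, 0.0])
M = np.diag(1.0 / b) @ A @ Phi
U = np.eye(n)
H2 = np.block([[-2.0 * m / (m + 1.0) * U, M],
               [M.T, -2.0 / (m + 1.0) * U]])

c1 = np.sum((A[:, 0] / b) ** 2)          # c_{j0} with j0 = 1
assert c1 < 4.0 * m / (m + 1.0) ** 2     # hypothesis of Corollary 5
lam_formula = -1.0 + np.sqrt(c1 + ((m - 1.0) / (m + 1.0)) ** 2)
lam_max = np.max(np.linalg.eigvalsh(H2))
print(np.isclose(lam_max, lam_formula))  # True
```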
3 Conclusions

From the discussion above, we conclude the following: 1. All results obtained in this paper also hold for the deterministic network corresponding to neural network (1). 2. If the perturbation intensity is pre-estimated, then we can choose the parameter matrices B and A to design a deterministic neural network with enough robustness to withstand that perturbation intensity.
Acknowledgements This work is supported by National Natural Science Foundation of China (60274007, 60474001, 70572050).
References

1. Chua, L., Yang, L.: Cellular Neural Networks: Theory. IEEE Trans. Circuits and Systems 35 (1988) 1257-1272
2. Cao, J., Zhou, D.: Stability Analysis of Delayed Cellular Neural Networks. Neural Networks 11 (1998) 1601-1605
3. Liao, X., Mao, X.: Stability of Stochastic Neural Networks. Neural, Parallel and Scientific Computations 14(4) (1996) 205-224
4. Blythe, S., Mao, X.: Stability of Stochastic Delay Neural Networks. Journal of The Franklin Institute 338 (2001) 481-495
5. Shen, Y., Liao, X.: Robust Stability of Nonlinear Stochastic Delayed Systems. Acta Automatica Sinica 25(4) (1999) 537-542
6. Mao, X.: Stochastic Differential Equations and Their Applications. 1st edn. Horwood Pub., Chichester (1997)
Global Exponential Stability of Cellular Neural Networks with Time-Varying Delays and Impulses

Chaojin Fu and Boshan Chen

Department of Mathematics, Hubei Normal University, Huangshi, Hubei, 435002, China
[email protected]
Abstract. In this paper, global exponential stability of cellular neural networks with time-varying delays and impulses is studied. By estimating the delay differential inequality with impulses, some sufficient conditions for global exponential stability of impulsive neural networks are obtained. The obtained results are new and they complement previously known results. Finally, an example is given to illustrate the theory.
1 Introduction
In recent years, cellular neural networks (CNNs) have attracted great attention due to their significant potential in applications. In many engineering applications and hardware implementations of neural networks, time delays, including time-varying delays, in neuron signals are often inevitable because of internal or external uncertainties. In addition, it is well known that the stability of CNNs and of cellular neural networks with time-varying delays (DCNNs) is critical for their applications, especially in signal processing, image processing, solving nonlinear algebraic and transcendental equations, and optimization problems. In the past decade, the stability of CNNs and DCNNs has been widely investigated; e.g., [1]-[10]. Moreover, in order to improve the convergence rate of a neural network toward its equilibrium points and reduce the computation time, networks with a globally stable equilibrium are preferred. As CNNs and DCNNs are large-scale nonlinear dynamical systems, their stability analysis is a nontrivial task. Besides the delay effect, impulsive effects likewise exist in a wide variety of evolutionary processes in which states change abruptly at certain moments of time, in fields such as medicine and biology, economics, mechanics, electronics, and telecommunications. Many interesting results on impulsive effects have been obtained; e.g., [11]-[13]. In this paper, global exponential stability of cellular neural networks with time-varying delays and impulses is studied, and the admissible scope of the impulses is enlarged. By estimating a delay differential inequality with impulses, some sufficient conditions for global exponential stability of impulsive neural networks are obtained. The obtained results are new and complement previously known results. J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 230–235, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Preliminaries
Consider a CNN with time-varying delays and impulses (IDCNN):

dui(t)/dt = −ci ui(t) + Σ_{j=1}^n (aij f̄(uj(t)) + bij f̄(uj(t − τj(t)))) + Ii,  t > t0, t ≠ tk,
|Δui(tk)| ≤ |−ci ui(tk−) + Σ_{j=1}^n (aij + bij) f̄(uj(tk−)) + Ii|,  t = tk,    (1)

where i = 1, 2, ..., n, k = 1, 2, ..., aij and bij are constant connection weights, ci is a positive constant, Ii is an external input or bias, and f̄(·) is the activation function defined by

f̄(uj(t)) = (|uj(t) + 1| − |uj(t) − 1|)/2,    (2)

τj(t) ≤ τ is the time delay, bounded above by a constant τ, i, j ∈ {1, 2, ..., n}, Δui(tk) = ui(tk+) − ui(tk−) is the impulse at moment tk, and t1 < t2 < t3 < ... is a strictly increasing sequence such that lim_{k→+∞} tk = +∞. ui(tk+) denotes the right limit of ui(t) at t = tk; ui(tk−) denotes the left limit of ui(t) at t = tk.

Throughout this paper, we denote C = diag{c1, c2, ..., cn}, A = (aij)_{n×n}, B = (bij)_{n×n}; C([t0 − τ, t0], R^n) is the set of continuous functions φ : [t0 − τ, t0] → R^n. Since f̄ is bounded, it can easily be proved that IDCNN (1) has at least one equilibrium point u∗ = (u∗1, u∗2, ..., u∗n)^T such that for i = 1, 2, ..., n,

Ii = ci u∗i − Σ_{j=1}^n (aij + bij) f̄(u∗j).    (3)

Let x(t) = (x1(t), x2(t), ..., xn(t))^T = (u1(t) − u∗1, u2(t) − u∗2, ..., un(t) − u∗n)^T and f(ri) = f̄(ri + u∗i) − f̄(u∗i); then IDCNN (1) can be rewritten as

dxi(t)/dt = −ci xi(t) + Σ_{j=1}^n (aij f(xj(t)) + bij f(xj(t − τj(t)))),  t > t0, t ≠ tk,
|Δxi(tk)| ≤ |−ci xi(tk−) + Σ_{j=1}^n (aij + bij) f(xj(tk−))|,  t = tk.    (4)

The initial condition of IDCNN (1) (or IDCNN (4)) is assumed to be

φ(ϑ) = (φ1(ϑ), φ2(ϑ), ..., φn(ϑ))^T,    (5)

where φ(ϑ) ∈ C([t0 − τ, t0], R^n). Denote ||φ||_{t0} = max_{i∈{1,...,n}, s∈[t0−τ,t0]} {|φi(s)|}. Denote by u(t; t0, φ) the solution of (1) with initial condition (5); that is, u(t; t0, φ) satisfies (1) and u(s; t0, φ) = φ(s) for s ∈ [t0 − τ, t0]. We also simply write u(t) for the solution of (1).

Lemma 1. [10] Let P be a given n × n matrix. If λ is an eigenvalue of P̃, then −λ is also an eigenvalue of P̃, and λ² is an eigenvalue of PP^T and P^T P, where

P̃ := [ 0  P ; P^T  0 ].
3 Main Results
In the following discussion, let λ_{(A+A^T)/2}, λ_{BB^T} and λ_B be, respectively, the maximum eigenvalues of the matrices (A + A^T)/2, BB^T and

B̃ := [ 0  B ; B^T  0 ].

According to Lemma 1, λ_B² = λ_{BB^T}. Let M = (max_{1≤i≤n} {1 + ci + Σ_{j=1}^n |aij + bij|})². If

−c_min + λ_{(A+A^T)/2} + λ_B < 0,    (6)

where c_min := min_{1≤i≤n} ci, then there exists θ > 0 such that

θ − 2c_min + 2λ_{(A+A^T)/2} + λ_B + λ_B exp{θτ} = 0.    (7)
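Condition (7) defines θ only implicitly. Its left-hand side is negative at θ = 0 whenever (6) holds and is strictly increasing in θ, so the root can be found by bisection. The sketch below is illustrative only; the parameter values c_min, λ_{(A+A^T)/2}, λ_B, and τ are hypothetical:

```python
import math

def solve_theta(c_min, lam_sym, lam_B, tau, tol=1e-12):
    # Left-hand side of (7); increasing in theta, negative at theta = 0 under (6)
    g = lambda th: th - 2 * c_min + 2 * lam_sym + lam_B + lam_B * math.exp(th * tau)
    assert g(0.0) < 0, "condition (6) must hold"
    lo, hi = 0.0, 1.0
    while g(hi) < 0:          # grow the bracket until g changes sign
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

# illustrative values satisfying (6): -1.0 + 0.2 + 0.4 = -0.4 < 0
theta = solve_theta(c_min=1.0, lam_sym=0.2, lam_B=0.4, tau=1.0)
assert theta > 0
```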
Theorem 1. If (6) holds, and there exists β > 0 such that tk − t0 ≥ βk and nM ≤ exp{θβ/2} for k = 1, 2, ..., then IDCNN (1) is globally exponentially stable.

Proof. Denote by u∗ = (u∗1, u∗2, ..., u∗n)^T an equilibrium point of IDCNN (1). Let x(t) = (x1(t), x2(t), ..., xn(t))^T = (u1(t) − u∗1, u2(t) − u∗2, ..., un(t) − u∗n)^T. From (2), for all t ≥ t0, IDCNN (1) can be rewritten as (4). Let V(t) = V(x(t)) = x^T(t)x(t); then for t ≠ tk,

dV(x(t))/dt |_(4) ≤ −2x^T(t)Cx(t) + x^T(t)(A + A^T)x(t) + (x^T(t), x^T(t − τ(t))) B̃ (x^T(t), x^T(t − τ(t)))^T
  ≤ (−2c_min + 2λ_{(A+A^T)/2} + λ_B) V(x(t)) + λ_B V(x(t − τ(t))).    (8)
Let G(t) = V(t) − ||φ||_{t0} exp{−θ(t − t0)}. Then G(t) ≤ 0 for all t ∈ [t0, t1). Otherwise, since G(s) ≤ 0 for all s ∈ [t0 − τ, t0], there would exist t1∗, t2∗ and ε > 0 such that t1 > t2∗ > t1∗ ≥ t0, G(t1∗) = 0, G(t2∗) = ε,

dG(t)/dt |_{t=t1∗} ≥ 0,    (9)
dG(t)/dt |_{t=t2∗} > 0,    (10)

G(s) > 0 for s ∈ (t1∗, t2∗], and G(t) ≤ ε for t ∈ [t0 − τ, t2∗]. On the other hand, by (7),

dG(t)/dt |_{t=t2∗} = dV(t)/dt |_{t=t2∗} + θ ||φ||_{t0} exp{−θ(t2∗ − t0)}
  ≤ (θ − 2c_min + 2λ_{(A+A^T)/2} + λ_B + λ_B exp{θτ}) ||φ||_{t0} exp{−θ(t2∗ − t0)} + (−2c_min + 2λ_{(A+A^T)/2} + λ_B) ε.    (11)
By (6), (8) and (11), dG(t)/dt |_{t=t2∗} ≤ 0. This contradicts (10); thus

V(t) ≤ ||φ||_{t0} exp{−θ(t − t0)} for all t ∈ [t0, t1).    (12)
Hence, for all i ∈ {1, 2, ..., n} and all t ∈ [t0, t1), |xi(t)| ≤ (||φ||_{t0} exp{−θ(t − t0)})^{1/2}. From (4) and (12),

|xi(t1+)| ≤ |xi(t1−)| + |−ci xi(t1−) + Σ_{j=1}^n (aij + bij) f(xj(t1−))| ≤ (M ||φ||_{t0} exp{−θ(t1 − t0)})^{1/2}.

Hence,

V(t1+) ≤ nM ||φ||_{t0} exp{−θ(t1 − t0)}.    (13)
Similar to the proofs of (12) and (13), for all t ∈ [t1, t2),

V(t) ≤ nM ||φ||_{t0} exp{−θ(t − t0)}.    (14)

Hence, for k ≥ 2,

V(tk+) ≤ (nM)^k ||φ||_{t0} exp{−θ(tk − t0)},    (15)

and for all t ∈ [tk, tk+1),

V(t) ≤ (nM)^k ||φ||_{t0} exp{−θ(t − t0)}.    (16)

Since tk − t0 ≥ βk and nM ≤ exp{θβ/2} for k = 1, 2, ..., we have (nM)^k ≤ exp{θβk/2} ≤ exp{θ(tk − t0)/2}, so from (16), for all t ∈ [tk, tk+1),

V(t) ≤ (nM)^k ||φ||_{t0} exp{−θ(t − t0)} ≤ ||φ||_{t0} exp{−(θ/2)(t − t0)}.
Hence, IDCNN (1) is globally exponentially stable.

Corollary 1. If (6) holds and |Δui(tk)| = 0 for all k = 1, 2, ..., then IDCNN (1) is globally exponentially stable.

Proof. Since |Δui(tk)| = 0, the impulses impose no restriction, so for any n and M there exists β > 0 such that tk − t0 ≥ βk and nM ≤ exp{θβ/2} for k = 1, 2, .... According to Theorem 1, IDCNN (1) is globally exponentially stable.

In the following discussion, we always assume that ci = 1 for all i = 1, 2, ..., n.

Remark 1. If |Δui(tk)| = 0, (1) is a cellular neural network without impulses. Arik ([14]) has shown that if −(A + A^T) is positive definite and ||B|| ≤ 1, then DCNN (1) is globally exponentially stable. Liao ([15]) has shown that if −(A + A^T + βE) is positive definite and ||B|| ≤ sqrt(1 + β), then DCNN (1) is globally exponentially stable. Arik ([16]) has shown that if −(A + A^T + βE) is positive definite and ||B|| ≤ sqrt(2β), then DCNN (1) is globally exponentially stable. If −(A + A^T + βE) is positive definite, then λ_{A+A^T+βE} < 0; i.e., λ_{(A+A^T)/2} < −β/2, where E is the identity matrix. Hence, the positive definiteness of −(A + A^T) is necessary in [14], [15], [16].
According to Lemma 1, λ_{BB^T} ≤ ||B||² implies that λ_B ≤ ||B||. So, if the conditions of Theorem 1 in [14] hold, then

λ_B + λ_{(A+A^T)/2} ≤ ||B|| + λ_{(A+A^T)/2} < 1 + 0 = 1.

Hence, when condition (6) holds, the result of Corollary 1 improves upon the existing one in [14]. If the conditions of Theorem 1 in [15] hold, then

λ_B + λ_{(A+A^T)/2} ≤ ||B|| + λ_{(A+A^T)/2} < sqrt(1 + β) − β/2 ≤ 1.

(Since (1 + β/2)² = 1 + β + β²/4 > 1 + β, we have sqrt(1 + β) − β/2 < 1.) Hence, when condition (6) holds, the result of Corollary 1 improves upon the existing one in [15]. If the conditions of Theorem 1 in [16] hold, then

λ_B + λ_{(A+A^T)/2} ≤ ||B|| + λ_{(A+A^T)/2} < sqrt(2β) − β/2 ≤ 1.

(Since (1 − β/2)² = 1 − β + β²/4 ≥ 0, we have 1 + β + β²/4 ≥ 2β, i.e., (1 + β/2)² ≥ 2β; hence sqrt(2β) − β/2 ≤ 1.) Hence, when condition (6) holds, the result of Corollary 1 improves upon the existing one in [16].
4 Simulation Results
In this section, we give one example to illustrate the new results.

Example 1. Consider IDCNN (1) with

A = [ 0.2  0.4 ; −0.4  0.2 ],  B = [ 0.3  −0.4 ; 0.4  0.3 ],

I = (0.1, 0)^T, ci = 1, time delays τi(t) = 1 (i = 1, 2), t0 = 0, and tk = 13.6k. Then λ_{(A+A^T)/2} = 0.2, λ_B = 0.5, M = 6.25, and θ = 0.3736 satisfies

θ − 2c_min + 2λ_{(A+A^T)/2} + λ_B + λ_B exp{θτ} ≤ 0.

Since exp{βθ/2} = exp{13.6 × 0.3736/2} > 12.5 = nM, according to Theorem 1 this cellular neural network is globally exponentially stable. However, since −(A + A^T) is not positive definite, even if Δxi(tk) = 0 for all k = 1, 2, ... and i = 1, 2, the conditions in [14], [15], [16] cannot be used to ascertain the stability of this cellular neural network.
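The quantities claimed in Example 1 can be checked numerically. The following sketch (an illustrative verification using NumPy, with the matrices from the example) confirms λ_{(A+A^T)/2}, λ_B, M, the near-root θ, and the impulse-gap condition:

```python
import numpy as np

A = np.array([[0.2, 0.4], [-0.4, 0.2]])
B = np.array([[0.3, -0.4], [0.4, 0.3]])
c = np.array([1.0, 1.0])
tau, n = 1.0, 2

lam_sym = np.linalg.eigvalsh((A + A.T) / 2).max()        # max eigenvalue of (A+A^T)/2
B_tilde = np.block([[np.zeros((n, n)), B], [B.T, np.zeros((n, n))]])
lam_B = np.linalg.eigvalsh(B_tilde).max()                # equals ||B|| by Lemma 1
M = (1 + c + np.abs(A + B).sum(axis=1)).max() ** 2       # (1 + 1 + 0.5)^2

assert np.isclose(lam_sym, 0.2)
assert np.isclose(lam_B, 0.5)
assert np.isclose(M, 6.25)
# Condition (6): -c_min + lam_sym + lam_B = -0.3 < 0
assert -c.min() + lam_sym + lam_B < 0
# theta = 0.3736 nearly solves (7)
theta = 0.3736
residual = theta - 2 * c.min() + 2 * lam_sym + lam_B + lam_B * np.exp(theta * tau)
assert abs(residual) < 1e-3
# Impulse-gap condition of Theorem 1 with beta = 13.6
assert np.exp(13.6 * theta / 2) > n * M
```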
5 Concluding Remarks

In this paper, global exponential stability of cellular neural networks with time-varying delays and impulses has been studied, and the admissible scope of the impulses has been enlarged. By estimating a delay differential inequality with impulses, some sufficient conditions for global exponential stability of impulsive neural networks were obtained. These sufficient conditions do not depend on the equilibrium point; hence, when the equilibrium point is difficult to obtain, the results given in this paper are all the more useful. Finally, an example was given to illustrate the theory.
Acknowledgement This work was supported by the Natural Science Foundation of China under Grant 60474011 and the Young Foundation of Hubei Provincial Education Department of China under Grant 2003B001.
References 1. Forti, M., Tesi, A.: New Conditions for Global Stability of Neural Networks with Application to Linear and Quadratic Programming Problems. IEEE Trans. Circ. Syst. I 42 (1995) 354-366 2. Liao, X.X., Wang, J.: Algebraic Criteria for Global Exponential Stability of Cellular Neural Networks with Multiple Time Delays. IEEE Trans. Circuits and Systems I. 50 (2003) 268-275 3. Yi, Z., Heng, A., Leung, K.S.: Convergence Analysis of Cellular Neural Networks with Unbounded Delay. IEEE Trans. Circuits Syst. I 48 (2001) 680-687 4. Chen, T.P., Rong, L.B.: Robust Global Exponential Stability of Cohen- Grossberg Neural Networks with Time-Delays. IEEE Transactions on Neural Networks 15 (2004) 203-206 5. Liao, X.F., Li, C.G., Wong, K.W.: Criteria for Exponential Stability of CohenGrossberg Neural Networks. Neural Networks 17 (2004) 1401-1414 6. Zeng, Z.G., Wang, J., Liao, X.X.: Global Asymptotic Stability and Global Exponential Stability of Neural Networks with Unbounded Time-varying Delays. IEEE Trans. on Circuits and Systems II, Express Briefs 52 (2005) 168-173 7. Cao, J.: Results Concerning Exponential Stability and Periodic Solutions of Delayed Cellular Neural Networks. Physics Letters A 307 (136-147) 2003 8. Zhou, J., Liu, Z., Chen, G.R.: Dynamics of Delayed Periodic Neural Networks. Neural Networks 17 (2004) 87-101 9. Zeng, Z.G., Wang, J., Liao, X.X.: Stability Analysis of Delayed Cellular Neural Networks Described Using Cloning Templates. IEEE Trans. Circuits and Syst. I 51 (2004) 2313-2324 10. Zeng, Z.G., Wang, J., Liao, X.X.: Global Exponential Stability of A General Class of Recurrent Neural Networks with Time-varying Delays. IEEE Trans. Circuits and Syst. I 50 (2003) 1353-1358 11. Xu, D., Yang, Z.: Impulsive Delay Differential Inequality and Stability of Neural Networks. J. Math. Anal. Appl. 305 (2005) 107-120 12. Guan, Z., Chen, G.: On Delayed Impulsive Hopfield Neural Networks. Neural Networks 12 (1999) 273-280 13. 
Guan, Z., Lam, J., Chen, G.: On Impulsive Autossociative Neural Networks. Neural Networks 13 (2000) 63-69 14. Arik, S.: On the Global Asymptotic Stability of Delayed Cellular Neural Networks. IEEE Trans. Circuits Syst. I 47 (2000) 571-574 15. Liao, T.L., Wang, F.C.: Global Stability for Cellular Neural Networks with Time Delay. IEEE Trans. Neural Networks 11 (2000) 1481-1484 16. Arik, S.: An Analysis of Global Asymptotic Stability of Delayed Cellular Neural Networks. IEEE Trans. Neural Networks 13 (2002) 1239-1242
Global Exponential Stability of Fuzzy Cellular Neural Networks with Variable Delays Jiye Zhang, Dianbo Ren, and Weihua Zhang Traction Power State Key Laboratory, Southwest Jiaotong University, Chengdu 610031, China
[email protected]
Abstract. In this paper, the global exponential stability of fuzzy cellular neural networks with time-varying delays is studied. Without assuming the boundedness and differentiability of the activation functions, based on the properties of M-matrix, by constructing vector Liapunov functions and applying differential inequalities, the sufficient conditions ensuring existence, uniqueness, and global exponential stability of the equilibrium point of fuzzy cellular neural networks with variable delays are obtained.
1 Introduction

Recently, artificial neural networks have been intensively studied and widely applied to various information processing problems. Cellular neural networks (CNNs), introduced by Chua and Yang in 1988, have been successfully applied to signal processing systems, especially in static image treatment [1], and to solving nonlinear algebraic equations [2-3]. Hopfield-type neural networks (HNNs) and their various generalizations have been applied to tasks of classification, associative memory, parallel computation, and optimization [4-5]. Yang extended CNNs from classical to fuzzy sets, proposed fuzzy cellular neural networks (FCNNs), and applied them to image processing [6,7]. Such applications rely on the existence of equilibrium points and on qualitative properties of the neural networks. The stability of HNNs and CNNs has been widely studied, and many results have been given [8-17]. In hardware implementations, time delays occur due to finite switching speeds of the amplifiers and communication time. Some conditions ensuring the global exponential stability of FCNNs with variable time delays were given in [18], but differentiability of the delay terms was assumed there. In this paper, we study fuzzy neural networks with variable delays. By constructing proper nonlinear integro-differential inequalities involving the variable delays and applying the idea of the vector Liapunov method, we obtain sufficient conditions for global exponential stability.
2 Notation and Preliminaries

For convenience, we introduce some notation. x^T and A^T denote the transpose of a vector x ∈ R^n and a matrix A ∈ R^{n×n}, respectively. [A]^s is defined as [A]^s = [A^T + A]/2. |x| denotes the absolute-value vector given by |x| = (|x1|, |x2|, ..., |xn|)^T, and |A| denotes the absolute-value matrix given by |A| = (|aij|)_{n×n}. ||x|| denotes the vector norm defined by ||x|| = (x1² + ... + xn²)^{1/2}, and ||A|| denotes the matrix norm defined by ||A|| = (max{λ : λ is an eigenvalue of A^T A})^{1/2}. ∧ and ∨ denote the fuzzy AND and fuzzy OR operations, respectively.

The dynamical behavior of FCNNs with variable time delays can be described by the following nonlinear differential equations:
ẋi(t) = −di xi(t) + Σ_{j=1}^n aij fj(xj(t)) + Σ_{j=1}^n bij uj + Ji + ∧_{j=1}^n αij fj(xj(t − τij(t)))
  + ∨_{j=1}^n βij fj(xj(t − τij(t))) + ∧_{j=1}^n Rij uj + ∨_{j=1}^n Sij uj  (i = 1, 2, ..., n),    (1)

where xi is the state of neuron i, i = 1, 2, ..., n, and n is the number of neurons; ui and Ji denote the input and bias of the ith neuron, respectively; fi is the activation function of the ith neuron; di is a damping constant, with di > 0; aij and bij are elements of the feedback template and feedforward template; αij, βij, Rij and Sij are elements of the fuzzy feedback MIN template, fuzzy feedback MAX template, fuzzy feedforward MIN template and fuzzy feedforward MAX template, respectively; τij(t) denote the variable time delays. Assume that the delays τij(t) are bounded and continuous with τij(t) ∈ [0, τ] for all t ≥ 0, where τ is a constant, i, j = 1, 2, ..., n. The initial conditions associated with equation (1) are of the form

xi(s) = φi(s),  s ∈ [−τ, 0],  i = 1, 2, ..., n,

where it is assumed that φi ∈ C([−τ, 0], R), i = 1, 2, ..., n.
Let D = diag(d1, d2, ..., dn), A = (aij)_{n×n}, B = (bij)_{n×n}, α = (αij)_{n×n}, β = (βij)_{n×n}, u = (u1, u2, ..., un)^T, J = (J1, J2, ..., Jn)^T, f(x) = (f1(x1), f2(x2), ..., fn(xn))^T.

Assumption 1. For each j ∈ {1, 2, ..., n}, fj : R → R is globally Lipschitz with Lipschitz constant Lj > 0; i.e., |fj(xj) − fj(yj)| ≤ Lj |xj − yj| for all xj, yj. We let L = diag(L1, L2, ..., Ln) > 0.

Definition 1. The equilibrium point x∗ of (1) is said to be globally exponentially stable if there exist constants λ > 0 and M > 0 such that |xi(t) − xi∗| ≤ M ||φ − x∗|| e^{−λt} for all t ≥ 0, where ||φ − x∗|| = max_{1≤i≤n} { sup_{s∈[−τ,0]} |φi(s) − xi∗| }.
Lemma 1 [19]. Let A = (aij) be a matrix with non-positive off-diagonal elements. Then the following statements are equivalent:
(i) A is an M-matrix;
(ii) the real parts of all eigenvalues of A are positive;
(iii) there exists a vector ξ > 0 such that ξ^T A > 0;
(iv) A is nonsingular and all elements of A^{−1} are nonnegative;
(v) there exists a positive definite n × n diagonal matrix Q such that the matrix AQ + QA^T is positive definite.

Lemma 2 [6]. Suppose x and x' are two states of system (1). Then

| ∧_{j=1}^n αij fj(xj) − ∧_{j=1}^n αij fj(xj') | ≤ Σ_{j=1}^n |αij| |fj(xj) − fj(xj')|,
| ∨_{j=1}^n βij fj(xj) − ∨_{j=1}^n βij fj(xj') | ≤ Σ_{j=1}^n |βij| |fj(xj) − fj(xj')|.

Lemma 3 [5]. If H(x) ∈ C⁰ is injective on R^n, and ||H(x)|| → ∞ as ||x|| → ∞, then H(x) is a homeomorphism of R^n.
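The estimates in Lemma 2 hold because the fuzzy AND (pointwise minimum) is nonexpansive: |min_j uj − min_j vj| ≤ max_j |uj − vj| ≤ Σ_j |uj − vj|. A randomized numerical check (illustrative only; the activation f and the sample ranges are arbitrary choices, not from the paper):

```python
import random

def fuzzy_and(alpha, x, f):
    # fuzzy AND of alpha_j * f(x_j): the minimum over j
    return min(a * f(v) for a, v in zip(alpha, x))

# standard CNN-style activation, Lipschitz with constant 1
f = lambda v: (abs(v + 1) - abs(v - 1)) / 2

random.seed(0)
for _ in range(1000):
    n = 5
    alpha = [random.uniform(-2, 2) for _ in range(n)]
    x = [random.uniform(-3, 3) for _ in range(n)]
    y = [random.uniform(-3, 3) for _ in range(n)]
    lhs = abs(fuzzy_and(alpha, x, f) - fuzzy_and(alpha, y, f))
    rhs = sum(abs(a) * abs(f(u) - f(v)) for a, u, v in zip(alpha, x, y))
    assert lhs <= rhs + 1e-12   # the bound of Lemma 2 never fails
```

The fuzzy OR case is symmetric, since max_j uj = −min_j(−uj).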
3 Existence and Uniqueness of the Equilibrium Point

In this section, we study the existence and uniqueness of the equilibrium point of (1). We first study the nonlinear map associated with (1):

Hi(xi) = −di xi + Σ_{j=1}^n aij fj(xj) + ∧_{j=1}^n αij fj(xj) + ∨_{j=1}^n βij fj(xj) + Ii,  i = 1, 2, ..., n,    (2)

where Ii = ∧_{j=1}^n Rij uj + ∨_{j=1}^n Sij uj + Ji, i = 1, 2, ..., n. Let H(x) = (H1(x1), H2(x2), ..., Hn(xn))^T. It is known that the solutions of H(x) = 0 are the equilibria of (1). If the map H(x) is a homeomorphism on R^n, then there exists a unique point x∗ such that H(x∗) = 0, i.e., system (1) has a unique equilibrium x∗. Based on Lemma 3, we obtain the following conditions for the existence of the equilibrium of system (1).

Theorem 1. If Assumption 1 is satisfied, and D − (|A| + |α| + |β|)L is an M-matrix, then for each u, system (1) has a unique equilibrium point.

Proof. In order to prove that system (1) has a unique equilibrium point x∗, we only need to prove that H(x) is a homeomorphism on R^n. We shall
prove that the map H(x) is a homeomorphism in two steps.
Step 1. We prove that H(x) is injective on R^n. For purposes of contradiction, suppose that there exist x, y ∈ R^n with x ≠ y such that H(x) = H(y). From (2), Lemma 2, and Assumption 1, we get
[ D − (| A | + | α | + | β |) L] | x − y |≤ 0 .
(3)
Because D − (|A| + |α| + |β|)L is an M-matrix, from Lemma 1 we know that all elements of (D − (|A| + |α| + |β|)L)^{−1} are nonnegative. Therefore |x − y| ≤ 0, i.e., x = y, which contradicts the supposition x ≠ y. So the map H(x) is injective.
Step 2. We prove that ||H(x)|| → ∞ as ||x|| → ∞.
Let H̄(x) = H(x) − H(0). From (2), we get

H̄i(xi) = −di xi + Σ_{j=1}^n aij (fj(xj) − fj(0)) + ∧_{j=1}^n αij fj(xj) − ∧_{j=1}^n αij fj(0)
  + ∨_{j=1}^n βij fj(xj) − ∨_{j=1}^n βij fj(0)  (i = 1, 2, ..., n).    (4)

To prove that H(x) is a homeomorphism, it suffices to show that H̄(x) is a homeomorphism. Since D − (|A| + |α| + |β|)L is an M-matrix, from Lemma 1 there exists a positive definite diagonal matrix T = diag{T1, T2, ..., Tn} such that

[T(−D + (|A| + |α| + |β|)L)]^s ≤ −ε E_n < 0,    (5)

where ε is a sufficiently small positive number and E_n is the identity matrix. From (4) and Lemma 2, we get

[Tx]^T H̄(x) = Σ_{i=1}^n xi Ti {−di xi + Σ_{j=1}^n aij [fj(xj) − fj(0)]
    + ∧_{j=1}^n αij fj(xj) − ∧_{j=1}^n αij fj(0) + ∨_{j=1}^n βij fj(xj) − ∨_{j=1}^n βij fj(0)}
  ≤ Σ_{i=1}^n Ti {−di xi² + |xi| Σ_{j=1}^n |aij| |fj(xj) − fj(0)|
    + |xi| Σ_{j=1}^n |αij| |fj(xj) − fj(0)| + |xi| Σ_{j=1}^n |βij| |fj(xj) − fj(0)|}
  ≤ |x|^T [T(−D + (|A| + |α| + |β|)L)]^s |x| ≤ −ε ||x||².    (6)

Using the Schwarz inequality and (6), we get ε||x||² ≤ ||T|| ||x|| ||H̄(x)||, so ||H̄(x)|| ≥ ε||x|| / ||T||. Therefore ||H̄(x)|| → +∞, i.e., ||H(x)|| → +∞ as ||x|| → +∞.

From Steps 1 and 2, according to Lemma 3, we know that for every input u the map H(x) is a homeomorphism on R^n, so system (1) has a unique equilibrium point. The proof is completed.
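The hypothesis of Theorem 1 can be tested mechanically via the characterizations in Lemma 1. The sketch below (with hypothetical template matrices, chosen only for illustration) checks both the eigenvalue criterion (ii) and the nonnegative-inverse criterion (iv):

```python
import numpy as np

def is_M_matrix(S, tol=1e-9):
    """Check Lemma 1 criteria (ii) and (iv) for a matrix with
    non-positive off-diagonal entries."""
    off = S - np.diag(np.diag(S))
    assert (off <= tol).all(), "off-diagonal entries must be non-positive"
    eig_ok = (np.linalg.eigvals(S).real > tol).all()
    inv_ok = (np.linalg.inv(S) >= -tol).all() if eig_ok else False
    return eig_ok and inv_ok

# Hypothetical 2-neuron network data
D = np.diag([3.0, 3.0])
A = np.array([[0.4, -0.3], [0.2, 0.5]])
alpha = np.array([[0.1, 0.2], [-0.2, 0.1]])
beta = np.array([[0.2, -0.1], [0.1, 0.2]])
L = np.diag([1.0, 1.0])

S = D - (np.abs(A) + np.abs(alpha) + np.abs(beta)) @ L
assert is_M_matrix(S)   # Theorem 1 applies for these (hypothetical) templates
```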
4 Global Exponential Stability of the Equilibrium Point

Theorem 2. If Assumption 1 is satisfied and D − (|A| + |α| + |β|)L is an M-matrix, then for each u, system (1) has a unique equilibrium point, which is globally exponentially stable.

Proof. Since D − (|A| + |α| + |β|)L is an M-matrix, from Theorem 1, system (1) has a unique equilibrium x∗. Let y(t) = x(t) − x∗; then system (1) can be written as
ẏi(t) = −di yi(t) + Σ_{j=1}^n aij fj(yj(t) + xj∗) − Σ_{j=1}^n aij fj(xj∗)
  + ∧_{j=1}^n αij fj(yj(t − τij(t)) + xj∗) − ∧_{j=1}^n αij fj(xj∗)
  + ∨_{j=1}^n βij fj(yj(t − τij(t)) + xj∗) − ∨_{j=1}^n βij fj(xj∗)  (i = 1, 2, ..., n).    (7)

The initial conditions of equation (7) are Ψ(s) = φ(s) − x∗, s ∈ [−τ, 0]. System (7) has a unique equilibrium at y = 0. Let

Vi(t) = e^{λt} |yi(t)|,    (8)
where λ is a constant to be determined. Calculating the upper right derivative of Vi(t) along the solutions of (7), we have

D⁺(Vi(t)) = e^{λt} sgn(yi(t)) [ẏi(t) + λ yi(t)]
  ≤ e^{λt} {−di |yi(t)| + Σ_{j=1}^n |aij| |fj(yj(t) + xj∗) − fj(xj∗)|
    + | ∧_{j=1}^n αij fj(yj(t − τij(t)) + xj∗) − ∧_{j=1}^n αij fj(xj∗) |
    + | ∨_{j=1}^n βij fj(yj(t − τij(t)) + xj∗) − ∨_{j=1}^n βij fj(xj∗) | + λ |yi(t)|}  (i = 1, 2, ..., n).

From Assumption 1 and Lemma 2, we get

D⁺(Vi(t)) ≤ (−di + λ) Vi(t) + Σ_{j=1}^n Lj [ |aij| Vj(t) + e^{λτ} (|αij| + |βij|) sup_{t−τ≤s≤t} Vj(s) ],    (9)
where τ is a fixed number. Since D − (|A| + |α| + |β|)L is an M-matrix, from Lemma 1 there exist positive constants ξi, i = 1, 2, ..., n, satisfying

−ξi di + Σ_{j=1}^n ξj (|aij| + |αij| + |βij|) Lj < 0  (i = 1, 2, ..., n).    (10)

Construct the functions

Fi(η) = −ξi (di − η) + Σ_{j=1}^n ξj [ |aij| + e^{ητ} (|αij| + |βij|) ] Lj  (i = 1, 2, ..., n).    (11)

Fi(η) is a continuous function of η, and from (10) we know that

Fi(0) = −ξi di + Σ_{j=1}^n ξj (|aij| + |αij| + |βij|) Lj < 0  (i = 1, 2, ..., n).

So there exists a constant λ > 0 such that

−ξi (di − λ) + Σ_{j=1}^n ξj [ |aij| + e^{λτ} (|αij| + |βij|) ] Lj < 0  (i = 1, 2, ..., n).    (12)
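A constant λ satisfying (12) can be computed by bisection on the functions Fi of (11), exploiting Fi(0) < 0 and the monotone growth of Fi in η. A minimal sketch with hypothetical two-neuron data (all numerical values are illustrative assumptions):

```python
import math

def find_lambda(xi, d, L, absA, ab, tau):
    """Return a lambda > 0 with all F_i(lambda) < 0, where
    F_i(eta) = -xi_i(d_i - eta) + sum_j xi_j(|a_ij| + e^{eta*tau}(|alpha_ij|+|beta_ij|))L_j.
    `ab` holds the entries |alpha_ij| + |beta_ij|."""
    n = len(d)
    def F(eta):
        return [-xi[i] * (d[i] - eta)
                + sum(xi[j] * (absA[i][j] + math.exp(eta * tau) * ab[i][j]) * L[j]
                      for j in range(n))
                for i in range(n)]
    assert all(v < 0 for v in F(0.0)), "condition (10) must hold"
    lo, hi = 0.0, 1.0
    while all(v < 0 for v in F(hi)):   # expand until some F_i turns nonnegative
        lo, hi = hi, 2 * hi
    for _ in range(100):               # bisect; keep lo where all F_i < 0
        mid = 0.5 * (lo + hi)
        if all(v < 0 for v in F(mid)):
            lo = mid
        else:
            hi = mid
    return lo

xi = [1.0, 1.0]; d = [3.0, 3.0]; L = [1.0, 1.0]; tau = 1.0
absA = [[0.4, 0.3], [0.2, 0.5]]
ab = [[0.3, 0.3], [0.3, 0.3]]
lam = find_lambda(xi, d, L, absA, ab, tau)
assert lam > 0
```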
In order to prove our results, we define the curve γ = {z(l) : zi = ξi l, l > 0, i = 1, 2, ..., n} and the set Ω(z) = {u : 0 ≤ u ≤ z, z ∈ γ}. Let ξM = max_{1≤i≤n} ξi, ξm = min_{1≤i≤n} ξi, and take l0 = (1 + δ) e^{λτ} ||Ψ|| / ξm, where δ > 0 is a constant. Define the set O = {V : V = e^{λτ} (||Ψ1(s)||, ||Ψ2(s)||, ..., ||Ψn(s)||)^T, −τ ≤ s ≤ 0}. Then O ⊂ Ω(z0(l0)), namely

Vi(s) ≤ e^{λτ} ||Ψi(s)|| < ξi l0,  −τ ≤ s ≤ 0  (i = 1, 2, ..., n).    (13)

In the following, we shall prove that

Vi(t) < ξi l0,  t > 0  (i = 1, 2, ..., n).    (14)

If (14) is not true, then from (13) there exist t1 > 0 and some index i such that

Vi(t1) = ξi l0,  D⁺(Vi(t1)) ≥ 0,  Vj(t) ≤ ξj l0,  j = 1, 2, ..., n,  t ∈ [−τ, t1].    (15)

However, from (9) and (12), we get

D⁺(Vi(t1)) ≤ {−ξi (di − λ) + Σ_{j=1}^n ξj [ |aij| + e^{λτ} (|αij| + |βij|) ] Lj} l0 < 0.
This is a contradiction. So Vi(t) < ξi l0 for t > 0 (i = 1, 2, ..., n). Furthermore, from (8) and (14), we obtain

|yi(t)| ≤ ξi l0 e^{−λt} ≤ (1 + δ) e^{λτ} (ξM / ξm) ||Ψ|| e^{−λt} ≤ M ||Ψ|| e^{−λt},  t ≥ 0  (i = 1, 2, ..., n),

where M = (1 + δ) e^{λτ} ξM / ξm. So |xi(t) − xi∗| ≤ M ||φ − x∗|| e^{−λt}, and the equilibrium point of (1) is globally exponentially stable. The proof is completed.

In [18], the global exponential stability of system (1) was studied, and the stability conditions were given under the assumption 0 ≤ dτij(t)/dt ≤ 1. In fact, the proof of the results in [18] used the following Lyapunov functional:

V(y(t)) = Σ_{i=1}^n ri ( |yi(t)| e^{lt} + Σ_{j=1}^n ∫_{t−τij(t)}^{t} (|αij| + |βij|) kj |yj(s)| e^{ls} ds ).

It is obvious that the boundedness and differentiability assumptions on dτij(t)/dt are necessary there. In this paper, however, these constraints on the time delays are unnecessary.
5 Conclusions

In this paper, without assuming the boundedness and differentiability of the activation functions, we analyzed the existence, uniqueness, and global exponential stability of the equilibrium point of fuzzy cellular neural networks with variable delays. Applying the idea of the vector Liapunov function method, we obtained sufficient conditions for global exponential stability that are independent of the delays. Yang in [6] derived some results on the stability of FCNNs; however, the FCNN model there does not involve time delays, and the activation function is assumed bounded. Ref. [18] studied the exponential stability of the equilibrium point of FCNNs with variable delays under Assumption 1, but the boundedness and differentiability of the delays are assumed there. Those assumptions on the delays are unnecessary in this paper. So the results in this paper extend those in [6, 18].
Acknowledgments This work is supported by the National Program for New Century Excellent Talents in University (No. NCET-04-0889) and the Natural Science Foundation of China (No. 50375127, 10272091).
References 1. Chua, L. O., Yang, L.: Cellular Neural Networks: Theory. IEEE Trans. Circ. Syst. 35 (1988) 1257-1272 2. Arik, S., Tavsanoglu, V.: On the Global Asymptotic Stability of Delayed Cellular Neural Networks. IEEE Trans. Circ. Syst.- I 47 (2000) 571-574 3. Arik, S.: An Improved Global Stability Result for Delayed Cellular Neural Networks. IEEE Trans. Circ. Syst. 49 (2002) 1211-1214 4. Gopalsamy, K., He, X.: Stability in Asymmetric Hopfield Nets with Transmission Delays. Physica D 76 (1994) 344-358 5. Forti, M., Tesi, A.: New Conditions for Global Stability of Neural Networks with Application to Linear and Quadratic Programming Problems. IEEE Trans. Circ. Syst.-I 42 (1995) 354-366 6. Yang, T., Yang, L. B.: Exponential Stability of Fuzzy Cellular Neural Networks with Constant and Time-Varying Delays. IEEE Trans. Circ. Syst.-I 43 (1996) 880-883 7. Yang, T., Yang, L. B.: Fuzzy Cellular Neural Networks: A New Paradigm for Image Processing. Int. J. Circ. Theor. Appl. 25 (1997) 469-481 8. Roska, T., Wu, C. W., Balsi, M., Chua, L. O.: Stability of Cellular Neural Networks with Dominant Nonlinear and Delay-Type Templates. IEEE Trans. Circ. Syst. 40 (1993) 270272. 9. Zhang, J.: Global Stability Analysis in Delayed Cellular Neural Networks. Computers and Mathematics with Applications 45 (2003) 1707-1720 10. Zhang, J.: Absolutely Exponential Stability in Delayed Cellular Neural Networks. Int. J. Circ. Theor. Appl. 30 (2002) 395-409 11. Zhang, J.: Globally Exponential Stability of Neural Networks with Variable Delays. IEEE Trans. Circ. Syst.-I 50(2003) 288-291 12. Yucel, E., Arik, S.: New Exponential Stability Results for Delayed Neural Networks with Time Varying Delays. Physica D 191 (2004) 314–322 13. Zhang, J.: Absolute Stability Analysis in Cellular Neural Networks with Variable Delays and Unbounded Delay. Computers and Mathematics with Applications 47 (2004) 183-194 14. Van Den Driessche, P., Zou, X.: Global Attractivity in Delayed Hopfield Neural Networks Models. 
SIAM J. Appl. Math. 58 (1998) 1878-1890 15. Chen, T.: Global Exponential Stability of Delayed Hopfield Neural Networks. Neural Networks 14 (2001) 977-980 16. Xu, D., Zhao, H., Zhu, H.: Global Dynamics of Hopfield Neural Networks Involving Variable Delays. Computers and Mathematics with Applications 42 (2001) 39-45 17. Zhang, J., Suda, Y., Iwasa, T.: Absolutely Exponential Stability of a Class of Neural Networks with Unbounded Delay. Neural Networks 17 (2004) 391-397 18. Liu, Y., Tang, W.: Exponential Stability of Fuzzy Cellular Neural Networks with Constant and Time-Varying Delays. Physics Letters A 323 (2004) 224-233 19. Siljak, D. D.: Large-Scale Dynamic Systems-Stability and Structure. Elsevier, New York (1978)
Stability of Fuzzy Cellular Neural Networks with Impulses

Tingwen Huang¹ and Marco Roque-Sol²

¹ Texas A&M University at Qatar, Doha, P.O. Box 5825, Qatar
[email protected]
² Mathematics Department, Texas A&M University, College Station, TX 77843, USA
[email protected]
Abstract. In this paper, we study impulsive fuzzy cellular neural networks. Criteria are obtained for the existence and exponential stability of a unique equilibrium of fuzzy cellular neural networks with impulsive state displacements at fixed instants of time.
1 Introduction
Fuzzy cellular neural networks (FCNNs) are a generalization of cellular neural networks (CNNs) obtained by using fuzzy operations in the synaptic law calculation, allowing us to combine the low-level information processing capability of CNNs with the high-level information processing capability, such as image understanding, of fuzzy systems. Yang et al. in [13]-[15] introduced FCNNs and investigated the existence and stability of the equilibrium point of FCNNs. After the introduction of FCNNs, some researchers studied the stability of FCNNs with constant and time-varying delays (see [9]), others considered the stability of FCNNs with distributed delays (see [4], [6]) and the exponential stability of FCNNs with diffusion effects (see [5]). However, in the real world, many evolutionary processes are characterized by abrupt changes at certain times. These changes are called impulsive phenomena, which are encountered in many fields such as physics, chemistry, population dynamics, and optimal control. In this paper, we will study the FCNN model incorporating impulses:

dxi(t)/dt = −di xi(t) + Σ_{j=1}^n bij μj + Ii + ∧_{j=1}^n αij fj(xj(t)) + ∧_{j=1}^n Tij μj
  + ∨_{j=1}^n βij fj(xj(t)) + ∨_{j=1}^n Hij μj,  t ≠ tk,
x(t0+) = x0 ∈ R^n,
Δxi(tk) = xi(tk+) − xi(tk−) = −γik xi(tk),  i = 1, ..., n,  k = 1, 2, ...,    (1)
where $\alpha_{ij}$, $\beta_{ij}$, $T_{ij}$ and $H_{ij}$ are elements of the fuzzy feedback MIN template, fuzzy feedback MAX template, fuzzy feed-forward MIN template and fuzzy feed-forward MAX template, respectively; $b_{ij}$ are elements of the feed-forward template; $\bigwedge$ and $\bigvee$ denote the fuzzy AND and fuzzy OR operations, respectively; $x_i$, $\mu_i$ and $I_i$ denote the state, input and bias of the $i$th neuron, respectively; and $f_i$ is the activation function. The jumps $\Delta x_i(t_k) = x_i(t_k^+) - x_i(t_k^-)$, $k = 1,2,\dots$, are the impulses at moments $t_k$, where $t_1 < t_2 < \cdots$ is a strictly increasing sequence with $\lim_{k\to\infty} t_k = \infty$. As usual in the theory of impulsive differential equations, at the points of discontinuity $t_k$ of the solution $t \mapsto x_i(t)$ we assume that $x_i(t_k) = x_i(t_k^-)$. According to (1), the limits $\dot x_i(t_k^-)$ and $\dot x_i(t_k^+)$ exist, and we set $\dot x_i(t_k) = \dot x_i(t_k^-)$. In this paper, we make the following assumption:

H: $f_i$ is a bounded function defined on $\mathbb{R}$ and satisfies
$$|f_i(x) - f_i(y)| \le l_i |x - y|, \quad i = 1,\dots,n, \tag{2}$$
for any $x, y \in \mathbb{R}$.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 243–248, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Main Results
In order to obtain the main results on the existence and stability of the equilibrium point of FCNN with impulses, we first cite the following lemma.

Lemma 1 ([14]). For any $a_{ij} \in \mathbb{R}$ and $x_j, y_j \in \mathbb{R}$, $i, j = 1,\dots,n$, we have the estimates
$$\Big|\bigwedge_{j=1}^{n} a_{ij} x_j - \bigwedge_{j=1}^{n} a_{ij} y_j\Big| \le \bigvee_{1\le j\le n} \big(|a_{ij}| \cdot |x_j - y_j|\big) \tag{3}$$
and
$$\Big|\bigvee_{j=1}^{n} a_{ij} x_j - \bigvee_{j=1}^{n} a_{ij} y_j\Big| \le \bigvee_{1\le j\le n} \big(|a_{ij}| \cdot |x_j - y_j|\big). \tag{4}$$
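As a quick numerical sanity check, the estimates of Lemma 1 can be verified on random data. The sketch below (with arbitrary values, not taken from the paper) treats the fuzzy AND as a minimum and the fuzzy OR as a maximum over $j$:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    n = 6
    a = rng.normal(size=n)   # one row (a_ij)_j of the template
    x = rng.normal(size=n)
    y = rng.normal(size=n)
    lhs_and = abs(np.min(a * x) - np.min(a * y))   # fuzzy AND = min over j
    lhs_or = abs(np.max(a * x) - np.max(a * y))    # fuzzy OR  = max over j
    rhs = np.max(np.abs(a) * np.abs(x - y))        # max_j |a_ij| |x_j - y_j|
    assert lhs_and <= rhs + 1e-12
    assert lhs_or <= rhs + 1e-12
```

The bound holds because both min and max are 1-Lipschitz with respect to the sup norm of the vector $(a_{ij}x_j)_j$.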
Now we are ready to state the main results. First, we establish sufficient conditions guaranteeing the existence and uniqueness of the equilibrium point of system (1).

Theorem 1. Suppose that assumption H is satisfied. Suppose further that
(i) $d_i > 0$;
(ii) the following inequalities hold:
$$d_i - l_i \sum_{j=1}^{n} |\alpha_{ji}| - l_i \sum_{j=1}^{n} |\beta_{ji}| > 0, \quad i = 1,\dots,n; \tag{5}$$
(iii) $\gamma_{ik} = 0$, $i = 1,\dots,n$, $k = 1,2,\dots$.
Then the impulsive FCNN (1) has an equilibrium point $x^* = (x_1^*, \dots, x_n^*)$.
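Condition (5) involves column sums of the templates (note the transposed index order $ji$). A small sketch with hypothetical template values, which also extracts a constant $\lambda$ as used later in (15):

```python
import numpy as np

# hypothetical 3-neuron data (not from the paper)
d = np.array([3.0, 2.5, 4.0])
l = np.array([1.0, 0.8, 1.2])          # Lipschitz constants l_i
alpha = np.array([[0.3, -0.2, 0.1],
                  [0.1, 0.4, -0.2],
                  [-0.1, 0.2, 0.3]])
beta = np.array([[0.2, 0.1, -0.1],
                 [-0.2, 0.1, 0.2],
                 [0.1, -0.3, 0.1]])

# condition (5): d_i - l_i * sum_j |alpha_ji| - l_i * sum_j |beta_ji| > 0,
# i.e. sums over the i-th COLUMN of alpha and beta (axis=0)
margin = d - l * np.abs(alpha).sum(axis=0) - l * np.abs(beta).sum(axis=0)
assert (margin > 0).all()

lam = margin.min()   # any 0 < lambda <= min_i margin_i satisfies (15)
assert lam > 0
```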
This theorem can be proved by imitating the proof of Theorem 2.1 in [2]: one constructs a contraction map, and by the contraction mapping principle there exists a unique fixed point of the map; the existence of a unique solution of system (1) follows.

Theorem 2. Assume that H is satisfied and that conditions (i) and (ii) of Theorem 1 hold. Further assume that the impulsive operators $I_i(x_i(t))$ satisfy
$$I_i(x_i(t)) = -\gamma_{ik}\big(x_i(t_k) - x_i^*\big), \quad 0 < \gamma_{ik} < 2, \quad i = 1,\dots,n, \quad k \in Z^+, \tag{6}$$
where $x^*$ is the unique equilibrium point of system (1). The proof of this theorem is similar to that of Theorem 1; the additional ingredient is the impulse effect. From the condition imposed on the impulsive operators, it is clear that the constructed mapping remains a contraction at the discontinuity points, and the result follows. We omit the rigorous proofs of Theorems 1 and 2.

Theorem 3. Assume that all conditions of Theorem 2 hold. Then there exists a positive constant $\lambda$ such that all solutions of system (1) satisfy
$$\sum_{i=1}^{n} |x_i(t) - x_i^*| \le e^{-\lambda t} \sum_{i=1}^{n} |x_i(0) - x_i^*|, \quad t > 0, \tag{7}$$
where $x^* = (x_1^*, \dots, x_n^*)^T$ denotes the equilibrium.
Proof. Since $x^*$ is the equilibrium of (1), we have
$$\frac{d}{dt}\big(x_i(t) - x_i^*\big) = -d_i\big(x_i(t) - x_i^*\big) + \Big[\bigwedge_{j=1}^{n} \alpha_{ij} f_j(x_j(t)) - \bigwedge_{j=1}^{n} \alpha_{ij} f_j(x_j^*)\Big] + \Big[\bigvee_{j=1}^{n} \beta_{ij} f_j(x_j(t)) - \bigvee_{j=1}^{n} \beta_{ij} f_j(x_j^*)\Big] \tag{8}$$
for $i = 1,\dots,n$, $t > 0$, $t \ne t_k$, $k \in Z^+$, and hence, by assumption H and Lemma 1,
$$\frac{d^+}{dt}|x_i(t) - x_i^*| \le -d_i|x_i(t) - x_i^*| + \sum_{j=1}^{n} |\alpha_{ij}|\,l_j\,|x_j(t) - x_j^*| + \sum_{j=1}^{n} |\beta_{ij}|\,l_j\,|x_j(t) - x_j^*| \tag{9}$$
for $i = 1,\dots,n$, $t > 0$, $t \ne t_k$, $k \in Z^+$, where $d^+/dt$ denotes the upper right derivative. Also, from the conditions, we have
$$x_i(t_k^+) - x_i^* = x_i(t_k) + I_i(x(t_k)) - x_i^* = (1 - \gamma_{ik})\big(x_i(t_k) - x_i^*\big), \tag{10}$$
so
$$|x_i(t_k^+) - x_i^*| = |1 - \gamma_{ik}|\,|x_i(t_k) - x_i^*| \le |x_i(t_k) - x_i^*|, \tag{11}$$
for $i = 1,\dots,n$, $k \in Z^+$. Let us define the Lyapunov function $V(\cdot)$ by
$$V(t) = V(x_1, x_2, \dots, x_n)(t) = \sum_{i=1}^{n} |x_i(t) - x_i^*| \tag{12}$$
for $t > 0$. By (9), we obtain
$$\frac{d^+}{dt}V(t) = \sum_{i=1}^{n} \frac{d^+}{dt}|x_i(t) - x_i^*| \le \sum_{i=1}^{n} \Big\{-d_i|x_i(t) - x_i^*| + \sum_{j=1}^{n} |\alpha_{ij}|\,l_j\,|x_j(t) - x_j^*| + \sum_{j=1}^{n} |\beta_{ij}|\,l_j\,|x_j(t) - x_j^*|\Big\}$$
$$= \sum_{i=1}^{n} \Big\{-d_i + l_i\sum_{j=1}^{n} |\alpha_{ji}| + l_i\sum_{j=1}^{n} |\beta_{ji}|\Big\}\,|x_i(t) - x_i^*|. \tag{13}$$
By the condition
$$d_i - l_i\sum_{j=1}^{n} |\alpha_{ji}| - l_i\sum_{j=1}^{n} |\beta_{ji}| > 0, \quad i = 1,\dots,n, \tag{14}$$
there exists a positive number $\lambda$ such that
$$d_i - l_i\sum_{j=1}^{n} |\alpha_{ji}| - l_i\sum_{j=1}^{n} |\beta_{ji}| \ge \lambda, \quad i = 1,\dots,n. \tag{15}$$
Thus, by (13) and (15), we have
$$\frac{d^+}{dt}V(t) \le -\lambda V(t), \quad t > 0,\ t \ne t_k. \tag{16}$$
Also, we have
$$V(t_k^+) = \sum_{i=1}^{n} |x_i(t_k^+) - x_i^*| \le \sum_{i=1}^{n} |x_i(t_k) - x_i^*| = V(t_k), \quad k \in Z^+. \tag{17}$$
Now, using the stability theorem in [11] together with (16) and (17), we obtain
$$\frac{d^+}{dt}V(t) \le -\lambda V(t), \quad t > 0. \tag{18}$$
Therefore,
$$V(t) \le e^{-\lambda t}V(0), \quad t > 0, \tag{19}$$
i.e.,
$$\sum_{i=1}^{n} |x_i(t) - x_i^*| \le e^{-\lambda t}\sum_{i=1}^{n} |x_i(0) - x_i^*|, \quad t > 0. \tag{20}$$
Thus, we have completed the proof of this theorem.
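To illustrate Theorem 3 numerically, the sketch below simulates a small impulsive FCNN with hypothetical template values (zero feed-forward terms, inputs, and bias, so the equilibrium is $x^* = 0$) using a forward-Euler scheme, taking the fuzzy AND/OR as min/max, and checks that $V(t) = \sum_i |x_i(t)|$ shrinks:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
d = np.array([3.0, 3.5, 4.0, 3.0])
alpha = 0.3 * rng.uniform(-1.0, 1.0, (n, n))
beta = 0.3 * rng.uniform(-1.0, 1.0, (n, n))
f = np.tanh  # l_i = 1 and f(0) = 0, so x* = 0 with zero inputs and bias

# condition (5) with l_i = 1: column-sum margins must be positive
lam = (d - np.abs(alpha).sum(axis=0) - np.abs(beta).sum(axis=0)).min()
assert lam > 0

def rhs(x):
    fx = f(x)
    # fuzzy AND (min over j) and fuzzy OR (max over j) of the weighted terms
    return -d * x + np.min(alpha * fx, axis=1) + np.max(beta * fx, axis=1)

dt, T = 1e-3, 10.0
impulse_times = {round(k * 0.5, 3) for k in range(1, 20)}   # t_k = 0.5 k
gamma = 0.5                                                 # 0 < gamma < 2
x = np.array([1.0, -2.0, 0.5, 1.5])
V0 = np.abs(x).sum()
t = 0.0
while t < T:
    x = x + dt * rhs(x)
    t = round(t + dt, 3)
    if t in impulse_times:
        x = (1.0 - gamma) * x   # impulse: x(t_k^+) = (1 - gamma_ik) x(t_k)
V_end = np.abs(x).sum()
assert V_end < 0.05 * V0        # V decays, consistent with (20)
```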
3 Conclusion
In this paper, we have discussed the existence and exponential stability of the equilibrium of FCNN with impulses. The sufficient conditions established here are easily verified.
Acknowledgments The first author is grateful for the support of Texas A&M University at Qatar.
References
1. Cao, J., Wang, J., Liao, X.: Novel Stability Criteria of Delayed Cellular Neural Networks. International Journal of Neural Systems 13 (2003) 365-375
2. Gopalsamy, K.: Stability of Artificial Neural Networks with Impulses. Applied Mathematics and Computation 154 (2004) 783-813
3. Horn, R.A., Johnson, C.R.: Topics in Matrix Analysis. Cambridge University Press, Cambridge (1999)
4. Huang, T., Zhang, L.: Exponential Stability of Fuzzy Cellular Neural Networks. In: Wang, J., Liao, X., Yi, Z. (eds.): Advances in Neural Networks. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag, Berlin Heidelberg New York (2005) 168-173
5. Huang, T.: Exponential Stability of Delayed Fuzzy Cellular Neural Networks with Diffusion. Chaos, Solitons and Fractals (to appear)
6. Huang, T.: Exponential Stability of Fuzzy Cellular Neural Networks with Unbounded Distributed Delay. Physics Letters A (to appear)
7. Li, Y.: Global Exponential Stability of BAM Neural Networks with Delays and Impulses. Chaos, Solitons and Fractals 24 (2005) 279-285
8. Li, C., Liao, X., Zhang, R.: Impulsive Synchronization of Nonlinear Coupled Chaotic Systems. Physics Letters A 328 (2004) 47-50
9. Liu, Y., Tang, W.: Exponential Stability of Fuzzy Cellular Neural Networks with Constant and Time-Varying Delays. Physics Letters A 323 (2004) 224-233
10. Liao, X.F., Wong, K., Li, C.: Global Exponential Stability for a Class of Generalized Neural Networks with Distributed Delays. Nonlinear Analysis: Real World Applications 5 (2004) 527-547
11. Samoilenko, A., Perestyuk, N.: Impulsive Differential Equations. World Scientific Series on Nonlinear Science, Series A: Monographs and Treatises, Vol. 14. World Scientific, Singapore (1995)
12. Yang, T., Yang, L.B., Wu, C.W., Chua, L.O.: Fuzzy Cellular Neural Networks: Theory. In: Proc. of IEEE International Workshop on Cellular Neural Networks and Applications (1996) 181-186
13. Yang, T., Yang, L.B., Wu, C.W., Chua, L.O.: Fuzzy Cellular Neural Networks: Applications. In: Proc. of IEEE International Workshop on Cellular Neural Networks and Applications (1996) 225-230
14. Yang, T., Yang, L.B.: The Global Stability of Fuzzy Cellular Neural Network. IEEE Trans. Circuits and Systems I: Fundamental Theory and Applications 43 (1996) 880-883
15. Yang, X., Liao, X., Evans, D., Tang, Y.: Existence and Stability of Periodic Solution in Impulsive Hopfield Neural Networks with Finite Distributed Delays. Physics Letters A 343 (2005) 108-116
Absolute Stability of Hopfield Neural Network

Xiaoxin Liao¹, Fei Xu², and Pei Yu²

¹ Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
[email protected]
² Department of Applied Mathematics, The University of Western Ontario, London, Ontario, N6A 5B7, Canada
[email protected], [email protected]
Abstract. This paper presents some new results for the absolute stability of Hopfield neural networks with activation functions chosen from sigmoidal functions which have unbounded derivatives. Detailed discussions are also given of the relation and difference of absolute stability between neural networks and Lurie systems with multiple nonlinear controls. Although the basic idea of the absolute stability of neural networks comes from Lurie control systems, neural networks in turn provide a very useful practical model for the study of Lurie control systems.
1 Introduction
As we know, the fundamental reason Hopfield neural networks [1] are widely used is that they solve nonlinear algebraic or transcendental equations by electric circuit simulation, and the process can be automated. This great advantage has attracted many researchers to study neural networks. In general, however, when a neural network has multiple equilibrium points, it is very difficult to solve global optimization problems. To solve a global optimization problem based on neural networks, it is preferable for the system to have a unique equilibrium point that is globally attractive. Based on this idea, Forti et al. [2], [3] first proposed the concept of absolute stability of neural networks: for a given type of activation functions (usually S-type (sigmoidal) functions or L-type functions satisfying a Lipschitz condition), for an arbitrarily chosen activation function g(x) and an arbitrary input current I, the equilibrium point of the network is unique, locally stable in the sense of Lyapunov, and globally attractive. The absolute stability of Hopfield neural networks is defined in [2], where the sufficient and necessary condition for S-type functions was first obtained; it was later generalized to a general class of neural networks [4], [5]. For L-type functions, on the other hand, a sufficient condition for exponentially absolute stability was given in [5], and later a sufficient and necessary condition was obtained [6]. In this paper, we present some new results for the exponentially absolute stability of Hopfield neural networks with activation functions chosen from S-type functions which have unbounded derivatives. The results differ from those reported in the literature so far. We have also improved the results presented
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 249–254, 2006. © Springer-Verlag Berlin Heidelberg 2006
in the above-mentioned publications but, due to the page limit, will not discuss these improvements in this paper. We will discuss the relation and difference of the absolute stabilities between neural networks and Lurie systems with multiple nonlinear controls. It will be shown that, on one hand, the absolute stability problem of neural networks provides a practically useful new model for the further study of Lurie systems with multiple nonlinear feedback controls, while on the other hand, the theory and methodology of absolute stability of Lurie control systems are helpful in the study of absolute stability of neural networks.
2 Relation and Difference Between Hopfield Neural Networks and Lurie Systems with Multiple Nonlinear Controls
Consider the following Hopfield neural network:
$$C_i \dot u_i = -\frac{u_i}{R_i} + \sum_{j=1}^{n} T_{ij}\,g_j(u_j) + I_i, \tag{1}$$
where the dot denotes differentiation with respect to time $t$, $u \in \mathbb{R}^n$, $T = (T_{ij})_{n\times n}$ is a constant matrix, $g(u) = (g_1(u_1), \dots, g_n(u_n))^T : \mathbb{R}^n \to \mathbb{R}^n$ is a nonlinear diagonal mapping, and $I = (I_1, \dots, I_n)^T \in \mathbb{R}^n$ is a constant vector. Each $g_j(u_j) : \mathbb{R} \to \mathbb{R}$ is a $C^1$ function with $g_j'(u_j) > 0$ satisfying $g_i(\mathbb{R}) \equiv (a_i, b_i)$, $a_i, b_i \in \mathbb{R}$, $a_i < b_i$. For convenience in comparing Hopfield neural networks with Lurie systems with multiple nonlinear controls, we first introduce the transformations $d_i = \frac{1}{C_iR_i}$ and $b_{ij} = \frac{T_{ij}}{C_i}$ into Eq. (1) to obtain
$$\dot u_i = -d_i u_i + \sum_{j=1}^{n} b_{ij}\,g_j(u_j) + \frac{I_i}{C_i}. \tag{2}$$
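The substitution $d_i = 1/(C_iR_i)$, $b_{ij} = T_{ij}/C_i$ is a plain rescaling of (1) by $C_i$; the sketch below (hypothetical parameter values) checks that the right-hand sides of (1) and (2) agree:

```python
import numpy as np

# hypothetical parameter values
C = np.array([2.0, 1.0, 4.0])
R = np.array([1.0, 0.5, 2.0])
T = np.array([[0.3, -0.1, 0.2],
              [0.0, 0.4, -0.3],
              [0.1, 0.2, 0.1]])
I = np.array([0.1, -0.2, 0.3])
g = np.tanh   # a sigmoidal activation

d = 1.0 / (C * R)       # d_i = 1 / (C_i R_i)
b = T / C[:, None]      # b_ij = T_ij / C_i

u = np.array([0.5, -1.0, 2.0])
udot_1 = (-u / R + T @ g(u) + I) / C    # (1) solved for du/dt
udot_2 = -d * u + b @ g(u) + I / C      # (2)
assert np.allclose(udot_1, udot_2)
```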
Since the original system (1) assumes that the functions $g_j(u_j)$ are of sigmoidal type, we have
$$g_i \in F_\infty = \{g_i \mid 0 < u_i g_i(u_i) < \infty,\ g_i \text{ is continuous and } g_i(0) = 0\}.$$
Therefore, the nonlinear functions in neural networks and in Lurie control systems are basically the same. More precisely, the nonlinear functions in neural networks are required to be monotone and to have their maximum slope at the origin; strictly speaking, they form a subset of the nonlinear functions admissible in Lurie control systems. Let $u_i = u_i^*$ be an equilibrium point of (2), and let $f_i(x_i) = g_i(u_i) - g_i(u_i^*) = g_i(x_i + u_i^*) - g_i(u_i^*)$. Then (2) can be rewritten as
$$\dot x_i = -d_i x_i + \sum_{j=1}^{n} b_{ij}\,f_j(x_j). \tag{3}$$
Since $f_j(0) = g_j(u_j^*) - g_j(u_j^*) = 0$ and $f_j' = g_j' > 0$, we have
$$f_j \in F_\infty = \{f_j \mid 0 < x_j f_j(x_j) < \infty,\ f_j(0) = 0,\ f_j \text{ is continuous}\}.$$

Remarks:
1) The Hopfield neural network is actually a particular case of the more general Lurie system with multiple nonlinear controls, given by [7], [8]:
$$\dot x_i = \sum_{j=1}^{n} a_{ij}x_j + \sum_{j=1}^{n} b_{ij}f_j(\sigma_j), \qquad \sigma_j = \sum_{i=1}^{n} c_i x_i, \tag{4}$$
in which letting $a_{ij} = 0$ ($i \ne j$), $a_{ii} = -d_i < 0$, and $\sigma_j = x_j$, $c_j = 1$, $c_i = 0$ ($i \ne j$) yields the Hopfield neural network (3).
2) The Lurie control system (4) does not require $a_{ii} < 0$ for all $i$, nor does the nonlinear function $f_j(\sigma_j)$ have to be monotone. It only requires $\sigma_i f_i(\sigma_i) \ge 0$ (i.e., the function lies in the first and third quadrants of the $\sigma$-$f(\sigma)$ plane), and it is thus more general than the neural network (3).
3) The study of Lurie systems with multiple nonlinear controls lags far behind that of Lurie systems with a single control. As pointed out in the SIAM research report (1988) The Future of Control Theory – Mathematical Prospect, although the study of the stability of nonlinear control systems has received considerable attention and many results have been obtained, most achievements are limited to systems with a single control (such as the Popov criterion and the Lyapunov method); the problem for Lurie systems with multiple controls has not been solved satisfactorily. This indicates the difficulty of generalizing from the single-control case to multiple controls.
4) Note that the Hopfield neural network (3) contains an input $I_i$, while the input in the Lurie system (4) is zero. However, this non-zero input can be removed by a simple translation, and thus (3) becomes (4). Since the activation functions $g_i(u_i)$ and $f_i(x_i)$ are the same type of functions, the absolute stability of the equilibrium point $u = u^*$ of (3) and that of $x = 0$ of (4) are the same. The study of the global stability of the neural network (3) has opened a new area for optimal computation. Therefore, it is natural to combine the studies of the stability of neural networks and the absolute stability of Lurie control systems.
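Remark 1) can be checked directly: with $a_{ij} = -d_i\delta_{ij}$, $\sigma_j = x_j$ and the identity choice of $c$, the Lurie form (4) reproduces the Hopfield right-hand side (3). A small sketch with hypothetical values:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
dvec = np.array([1.0, 2.0, 1.5])
b = rng.normal(size=(n, n))
fj = np.tanh   # a sector-bounded nonlinearity with fj(0) = 0

a = -np.diag(dvec)   # a_ij = 0 for i != j, a_ii = -d_i < 0
c = np.eye(n)        # sigma_j = x_j (c_j = 1, c_i = 0 for i != j)

def hopfield_rhs(x):
    # Hopfield network (3)
    return -dvec * x + b @ fj(x)

def lurie_rhs(x):
    # Lurie system (4) specialized as in Remark 1)
    sigma = c.T @ x
    return a @ x + b @ fj(sigma)

x = rng.normal(size=n)
assert np.allclose(hopfield_rhs(x), lurie_rhs(x))
```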
3 Some Results on Hopfield Neural Networks with Infinite Gains to Be Absolutely Exponentially Stable
In this section, we consider sufficient conditions for the absolute stability of Hopfield neural networks with a class of sigmoidal nonlinear activation functions whose gains may be unbounded. Let $S_{UB} := \{g(x) \mid g(x) \in C[\mathbb{R}, \mathbb{R}],\ D^+g_i(u_i) \ge 0,\ i = 1,\dots,n\}$, where the upper-right derivative $D^+g_i(u_i)$ need not be bounded. In the following, we consider (2) with $g_i(u_i) \in S_{UB}$.
Definition 1. For all $g_i \in S_{UB}$ and all $I_i \in \mathbb{R}$, if the equilibrium point of (2) is globally exponentially stable, then system (2) is said to be absolutely exponentially stable with respect to $S_{UB}$.

Theorem 1. If there exist two groups of constants $\xi_i > 0$, $\eta_i > 0$, $i = 1,\dots,n$, such that the matrix $A$ is negative definite, then we have the following conclusions: 1) system (2) is absolutely exponentially stable; and 2) the Lyapunov exponent is $-\lambda/\mu$, where
$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{12}^T & A_{22} \end{pmatrix}_{2n\times 2n},$$
$$A_{11} = \mathrm{diag}\Big(-\frac{\xi_1}{R_1}, \dots, -\frac{\xi_n}{R_n}\Big), \qquad A_{12} = \mathrm{diag}\Big(-\frac{\eta_1}{2R_1}, \dots, -\frac{\eta_n}{2R_n}\Big)_{n\times n},$$
$$A_{22} = \Big(\frac{\eta_iT_{ij} + \eta_jT_{ji}}{2}\Big)_{n\times n} + \Big(\frac{\xi_iT_{ij} + \xi_jT_{ji}}{2}\Big)_{n\times n},$$
$$B = \begin{pmatrix} B_{11} & B_{12} \\ B_{12}^T & B_{22} \end{pmatrix}_{2n\times 2n}, \qquad B_{11} = \mathrm{diag}\Big(\frac{C_1\xi_1}{2}, \dots, \frac{C_n\xi_n}{2}\Big)_{n\times n},$$
$$B_{12} = B_{12}^T = \mathrm{diag}\Big(\frac{C_1\eta_1}{2}, \dots, \frac{C_n\eta_n}{2}\Big)_{n\times n}, \qquad B_{22} = O_{n\times n}.$$
Here $-\lambda$ denotes the maximum eigenvalue of matrix $A$, $\mu$ is the maximum eigenvalue of matrix $B$, and the superscript $T$ denotes the transpose.

Proof. Since $g \in S_{UB}$, the existence of an equilibrium point of system (2) is a well-known result. Let $u = u^*$ be an equilibrium point of (2).
1) Let $x = (x_1,\dots,x_n)^T = (u_1 - u_1^*,\dots,u_n - u_n^*)^T$. Then $f_i(x_i) = g_i(x_i + u_i^*) - g_i(u_i^*)$, and system (2) can be rewritten as
$$C_i\dot x_i = \sum_{j=1}^{n} T_{ij}f_j(x_j) - \frac{x_i}{R_i}, \quad i = 1,\dots,n. \tag{5}$$
The global stability of the equilibrium $u = u^*$ of system (2) is equivalent to that of the equilibrium $x = 0$ of system (5). Construct the radially unbounded Lyapunov function
$$V(x) = \sum_{i=1}^{n} \frac{C_i\xi_i}{2}x_i^2 + \sum_{i=1}^{n} \eta_iC_i\int_0^{x_i} f_i(s)\,ds. \tag{6}$$
2) Set $V^* = e^{\varepsilon t}V$, where $\varepsilon > 0$ is a sufficiently small constant. Then we have
$$\frac{dV^*}{dt}\Big|_{(5)} \le 0 \quad \text{for } 0 < \varepsilon \ll 1. \tag{7}$$
Hence,
$$\sum_{i=1}^{n} \frac{C_i\xi_i}{2}x_i^2 \le V(x(t)) \le e^{-\varepsilon t}V_0^*, \tag{8}$$
which results in
$$\sum_{i=1}^{n} x_i^2 \le \frac{e^{-\varepsilon t}\,V_0^*}{\min_{1\le i\le n} C_i\xi_i/2} \quad (\varepsilon > 0). \tag{9}$$
Inequality (9) implies that the equilibrium $x = 0$ of system (5), i.e., the equilibrium $u = u^*$ of system (2), is globally exponentially stable; that is, system (2) is absolutely exponentially stable. The positive constant $\varepsilon$ can be considered as a Lyapunov exponent.
3) Let $-\lambda$, $\mu$ be the maximum eigenvalues of matrices $A$, $B$, respectively. Then
$$\frac{dV^*}{dt}\Big|_{(5)} \le e^{\varepsilon t}(-\lambda + \varepsilon\mu)\Big(\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} f_i^2(x_i)\Big) \le 0. \tag{10}$$
Thus inequality (8) holds, and $\varepsilon = \lambda/\mu$ is an estimate of the Lyapunov exponent. The proof is complete.

Let
$$A^\circ = \begin{pmatrix} A^\circ_{11} & A^\circ_{12} \\ A^\circ_{21} & A^\circ_{22} \end{pmatrix}, \qquad B^\circ = \begin{pmatrix} B^\circ_{11} & B^\circ_{12} \\ B^\circ_{21} & B^\circ_{22} \end{pmatrix}_{2n\times 2n},$$
where
$$A^\circ_{11} = -\mathrm{diag}\Big(\frac{1}{R_1}, \dots, \frac{1}{R_n}\Big), \qquad A^\circ_{12} = -\mathrm{diag}\Big(\frac{1}{2R_1}, \dots, \frac{1}{2R_n}\Big), \qquad A^\circ_{22} = (T_{ij} + T_{ji})_{n\times n},$$
$$B^\circ_{11} = \mathrm{diag}\Big(\frac{C_1}{2}, \dots, \frac{C_n}{2}\Big), \qquad B^\circ_{12} = B^{\circ\,T}_{12} = \mathrm{diag}\Big(\frac{C_1}{2}, \dots, \frac{C_n}{2}\Big), \qquad B^\circ_{22} = O_{n\times n}.$$

Corollary 1. If matrix $A^\circ$ is negative definite, then the equilibrium $u = u^*$ of system (2) is unique and globally exponentially stable, and the Lyapunov exponent can be chosen as $\varepsilon = \lambda^\circ/\mu$, where $-\lambda^\circ$ and $\mu$ are the maximum eigenvalues of matrices $A^\circ$ and $B^\circ$, respectively.

Proof. Taking $\xi_i = \eta_i = 1$ ($i = 1,\dots,n$) in Theorem 1 leads to the corollary.

Theorem 2. If there exist two groups of constants $\xi_i > 0$, $\eta_i > 0$ ($i = 1,\dots,n$) such that the following matrix $G$ is negative definite, then the conclusion of Theorem 1 holds, where
$$G = \begin{pmatrix} G_{11} & G_{12} \\ G_{12}^T & G_{22} \end{pmatrix}_{2n\times 2n}, \qquad G_{11} = -\mathrm{diag}\Big(\frac{\xi_1}{R_1}, \dots, \frac{\xi_n}{R_n}\Big),$$
$$G_{12} = G_{12}^T = \Big((1 - \delta_{ij})\,\frac{\xi_iT_{ij} + \xi_jT_{ji}}{2}\Big)_{n\times n}, \qquad G_{22} = \Big(\frac{\eta_iT_{ij} + \eta_jT_{ji}}{2}\Big)_{n\times n}.$$
Also the estimate (9) is satisfied, where
$$\varepsilon = \min\Big\{\min_{1\le i\le n}\frac{\eta_i/R_i - \xi_iT_{ii}}{\eta_iC_i},\ \frac{\lambda^*}{\mu^*}\Big\},$$
$-\lambda^*$ is the maximum eigenvalue of matrix $G$, and $\mu^* = \max_{1\le i\le n} C_i\xi_i/2$.
Proof. From the negative definiteness of matrix $G$, we know that $T_{ii} < 0$ ($i = 1,\dots,n$). Using the Lyapunov function (6), the remaining part of the proof is similar to that of Theorem 1.
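The negative-definiteness condition of Corollary 1 reduces to an eigenvalue computation. The sketch below assembles the block matrices $A^\circ$ and $B^\circ$ for hypothetical values $R_i = C_i = 1$ and a small 2-neuron template $T$; the block layout used here (with $A^\circ_{22} = (T_{ij} + T_{ji})$) follows one plausible reading of the garbled original and should be checked against the published paper:

```python
import numpy as np

def is_negative_definite(M, tol=1e-10):
    # symmetrize to guard against round-off, then check all eigenvalues < 0
    S = 0.5 * (M + M.T)
    return np.linalg.eigvalsh(S).max() < -tol

# hypothetical 2-neuron data: R_i = C_i = 1, T with negative diagonal
n = 2
R = np.ones(n)
T = np.array([[-1.0, 0.2],
              [0.2, -1.2]])

A11 = -np.diag(1.0 / R)
A12 = -np.diag(1.0 / (2.0 * R))
A22 = T + T.T                       # (T_ij + T_ji), i.e. xi_i = eta_i = 1
A = np.block([[A11, A12], [A12.T, A22]])
assert is_negative_definite(A)

# Lyapunov-exponent estimate eps = lambda / mu as in Corollary 1
lam = -np.linalg.eigvalsh(0.5 * (A + A.T)).max()
B11 = np.diag(np.ones(n) / 2.0)
B12 = np.diag(np.ones(n) / 2.0)
B = np.block([[B11, B12], [B12.T, np.zeros((n, n))]])
mu = np.linalg.eigvalsh(B).max()
eps = lam / mu
assert eps > 0
```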
4 Conclusion
In this paper, we have presented some new results for the absolutely exponential stability of Hopfield neural networks whose activation functions are chosen from sigmoidal functions with unbounded derivatives. We have also discussed the relation and difference of the absolute stabilities between neural networks and Lurie systems with multiple nonlinear controls. It has been shown that the basic idea of the absolute stability of neural networks is the same as that of Lurie control systems, and thus the theory and methodology of Lurie control systems can promote the study of absolute stability of neural networks.
Acknowledgement This work was supported by the Natural Science Foundation of China (NSFC No. 60274007 and No. 60474001), the Premier’s Research Excellence Award (PREA), and the Natural Science and Engineering Research Council of Canada (NSERC No. R2686A02).
References
1. Tank, D.W., Hopfield, J.J.: Simple 'Neural' Optimization Networks: An A/D Converter, Signal Decision Circuit, and a Linear Programming Circuit. IEEE Trans. Circuits Syst. 33(5) (1986) 533-541
2. Forti, M., Manetti, S., Marini, M.: Necessary and Sufficient Condition for Absolute Stability of Neural Networks. IEEE Trans. Circuits Syst. 41(7) (1994) 491-494
3. Forti, M., Liberatore, S.A., Manetti, S., Marini, M.: On Absolute Stability of Neural Networks. In: Proc. 1994 IEEE Int. Symp. Circuits Syst. (1994) 241-244
4. Kaszkurewicz, E., Bhaya, A.: Comments on 'Necessary and Sufficient Condition for Absolute Stability of Neural Networks'. IEEE Trans. Circuits Syst. I 42 (1995) 497-499
5. Liang, X., Wang, J.: An Additive Diagonal Stability Condition for Absolute Exponential Stability of a General Class of Neural Networks. IEEE Trans. Circuits Syst. I 48(11) (2001) 1308-1317
6. Hu, S., Wang, J.: Absolute Exponential Stability of a Class of Continuous-Time Recurrent Neural Networks. IEEE Trans. Neural Networks 14(1) (2003) 35-45
7. Liao, X.X.: Absolute Stability of Nonlinear Control Systems. Kluwer Academic / China Science Press, Beijing (1993)
8. Liao, X.X., Yu, P.: Sufficient and Necessary Conditions for the Absolute Stability of Time-Delayed Lurie Control Systems. J. Math. Anal. Appl. (2006) (in press)
Robust Stability Analysis of Uncertain Hopfield Neural Networks with Markov Switching

Bingji Xu and Qun Wang

School of Information Engineering, China University of Geosciences, Beijing, 100083, China
[email protected]

Abstract. The robust stability of uncertain Hopfield neural networks with Markov switching is analyzed, where the parametric uncertainty is assumed to be norm-bounded. Sufficient conditions for exponential stability are established by constructing suitable Lyapunov functionals. The stability criteria are expressed in terms of linear matrix inequalities (LMIs) and are computationally efficient.
1 Introduction
In the past two decades, neural networks have received considerable attention, and extensive research results have been presented on the stability analysis of neural networks and their applications (see, e.g., [1-6]). The stability of Hopfield neural networks, in particular, has been extensively studied and developed in recent years (e.g., [7, 8]). As is well known, parametric uncertainties occur due to modelling inaccuracies and/or changes in the environment of the model, and they can affect the stability of a network by creating oscillatory and unstable characteristics. It is therefore important to investigate the robust stability of neural networks with parametric uncertainties. With the development of intelligent control, switched systems have been widely studied. In this paper, we study a class of switched Hopfield neural networks: the individual subsystems are a set of Hopfield neural networks, and one Hopfield neural network switches to the next according to a continuous Markov process with a finite state space. In addition, the parametric uncertainty is considered and assumed to be norm-bounded. Lyapunov functionals are employed to derive sufficient conditions for robust exponential stability. The remainder of this paper is organized as follows. In Section 2, the mathematical model of the neural networks is described and some necessary assumptions are given. Section 3 is dedicated to the analysis of robust stability for the uncertain Hopfield neural networks with Markov switching; some exponential stability criteria for these neural networks are derived in terms of LMIs. Conclusions are given in Section 4.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 255–260, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Systems Descriptions
We consider the uncertain Hopfield neural network with Markov switching described by
$$\frac{du(t)}{dt} = -[D(r(t)) + \Delta D(t, r(t))]u(t) + [A(r(t)) + \Delta A(t, r(t))]g(u(t)) + I(r(t)), \tag{1}$$
where $u(t) = (u_1(t), u_2(t), \dots, u_n(t))^T$ is the neuron state vector and $g(u) = (g_1(u_1), g_2(u_2), \dots, g_n(u_n))^T$ is the neuron activation function vector. $\{r(t), t \ge 0\}$ is a time-homogeneous Markov process with right-continuous trajectories taking values in a finite set $S = \{1, 2, \dots, N\}$, with transition probabilities
$$P\{r(t + \Delta t) = j \mid r(t) = i\} = \begin{cases} \gamma_{ij}\Delta t + o(\Delta t), & i \ne j, \\ 1 + \gamma_{ii}\Delta t + o(\Delta t), & i = j, \end{cases}$$
where $\Delta t > 0$, $\lim_{\Delta t\to 0} o(\Delta t)/\Delta t = 0$, $\gamma_{ij} \ge 0$ for $i \ne j$, and $\gamma_{ii} = -\sum_{j=1, j\ne i}^{N}\gamma_{ij}$ for $i \in S$. For each $r(t) = i \in S$, $D(r(t)) = \mathrm{diag}(d_1(r(t)), d_2(r(t)), \dots, d_n(r(t)))$ is a positive diagonal matrix, $A(r(t))$ is the interconnection weight matrix, $I(r(t))$ is a constant input vector, and $\Delta D(t, r(t))$ and $\Delta A(t, r(t))$ are unknown real-valued functions representing time-varying parametric uncertainties. To simplify notation, $M(t, i)$ and $M(i)$ will be denoted by $M_i(t)$ and $M_i$, respectively. Throughout this paper, we assume that the parametric uncertainties $\Delta D_i(t)$ and $\Delta A_i(t)$ are norm-bounded and can be described as
$$\Delta D_i(t) = B_iF_i(t)C_i, \quad \Delta A_i(t) = E_iG_i(t)H_i, \quad i \in S, \tag{2}$$
where $B_i$, $C_i$, $E_i$, $H_i$, $i \in S$, are known constant matrices of appropriate dimensions, and $F_i(t)$, $G_i(t)$, $i \in S$, are unknown matrices representing the parametric uncertainties, which satisfy
$$F_i^T(t)F_i(t) \le I, \quad G_i^T(t)G_i(t) \le I, \quad \forall t \in \mathbb{R},\ i \in S, \tag{3}$$
in which $I$ is the identity matrix of appropriate dimension. Since the matrix $D_i$ in system (1) is diagonal, it is reasonable to assume that $\Delta D_i$ is also diagonal, but not necessarily positive definite. The neuron activation functions $g_i(u)$, $i = 1, 2, \dots, n$, are continuous and satisfy
$$0 \le \frac{g_i(u_i) - g_i(v_i)}{u_i - v_i} \le K_i, \quad \forall u_i \ne v_i,\ u_i, v_i \in \mathbb{R},\ i = 1, 2, \dots, n. \tag{4}$$
Let $K = \mathrm{diag}(K_1, K_2, \dots, K_n)$, let $u^* = (u_1^*, u_2^*, \dots, u_n^*)^T$ be an equilibrium point of system (1), and set $x(t) = u(t) - u^*$, $f_i(x_i(t)) = g_i(x_i(t) + u_i^*) - g_i(u_i^*)$, $i = 1, 2, \dots, n$. Then $f_i(z)$, $i = 1, 2, \dots, n$, satisfies
$$|f_i(z)| \le K_i|z|, \quad zf_i(z) \ge 0, \quad \forall z \in \mathbb{R},\ i = 1, 2, \dots, n. \tag{5}$$
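The switching signal $r(t)$ described above is a continuous-time Markov chain generated by the rate matrix $\Gamma = (\gamma_{ij})$. A minimal simulation sketch (hypothetical rates, not from the paper; exponential holding times with jump probabilities $\gamma_{ij}/(-\gamma_{ii})$):

```python
import numpy as np

rng = np.random.default_rng(3)

# generator of a 3-state chain: off-diagonal rates >= 0, rows sum to zero
Gamma = np.array([[-0.5, 0.3, 0.2],
                  [0.4, -0.9, 0.5],
                  [0.1, 0.6, -0.7]])
assert np.allclose(Gamma.sum(axis=1), 0.0)

def sample_path(Gamma, T, i0=0):
    """Simulate r(t) on [0, T]: hold state i for Exp(-gamma_ii) time,
    then jump to j with probability gamma_ij / (-gamma_ii)."""
    t, i, path = 0.0, i0, [(0.0, i0)]
    while t < T:
        rate = -Gamma[i, i]
        t += rng.exponential(1.0 / rate)
        probs = Gamma[i].clip(min=0.0)   # zeroes out the diagonal entry
        i = int(rng.choice(len(probs), p=probs / probs.sum()))
        path.append((t, i))
    return path

path = sample_path(Gamma, T=10.0)
assert all(0 <= s <= 2 for _, s in path)
```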
System (1) may be rewritten as
$$\frac{dx(t)}{dt} = -[D(r(t)) + \Delta D(t, r(t))]x(t) + [A(r(t)) + \Delta A(t, r(t))]f(x(t)), \tag{6}$$
or
$$\frac{dx(t)}{dt} = -[D_i + \Delta D_i(t)]x(t) + [A_i + \Delta A_i(t)]f(x(t)), \quad i = 1, 2, \dots, N. \tag{7}$$

3 Exponential Stability
If $u^*$ is an equilibrium point of system (1), then $x = 0$ is an equilibrium point of systems (6) and (7).

Definition 1. The trivial solution of system (6) is said to be exponentially stable in mean square if, for all uncertainties $\Delta D_i(t)$, $\Delta A_i(t)$ satisfying (2) and (3),
$$\limsup_{t\to\infty} \frac{1}{t}\log \mathbb{E}\|x(t)\|^2 < 0.$$

Definition 2. The trivial solution of system (6) is said to be almost surely exponentially stable if, for all uncertainties $\Delta D_i(t)$, $\Delta A_i(t)$ satisfying (2) and (3), there exists $r > 0$ such that
$$\limsup_{t\to\infty} \frac{1}{t}\log \|x(t)\| \le -r \quad \text{a.s.}$$
Theorem 3.1. If there exist matrices $P_i > 0$, diagonal matrices $Q_i = \mathrm{diag}(q_{i1}, q_{i2}, \dots, q_{in}) > 0$, and positive scalars $\varepsilon_i$, $\mu_i$, $i = 1, 2, \dots, N$, such that the following LMIs hold for $i = 1, 2, \dots, N$:
$$\Phi_i = \begin{bmatrix} \Phi_i(1,1) & P_iA_i - Q_iD_i & -P_iB_i & P_iE_i \\ A_i^TP_i - D_iQ_i & Q_iA_i + A_i^TQ_i + \mu_iH_i^TH_i & -Q_iB_i & Q_iE_i \\ -B_i^TP_i & -B_i^TQ_i & -\varepsilon_iI & 0 \\ E_i^TP_i & E_i^TQ_i & 0 & -\mu_iI \end{bmatrix} < 0,$$
in which $\Phi_i(1,1) = \sum_{j=1}^{N}\gamma_{ij}(KQ_j + P_j) - P_iD_i - D_iP_i + \varepsilon_iC_i^TC_i$, then the trivial solution of system (6) is both exponentially stable in mean square and almost surely exponentially stable.

Proof. Take the Lyapunov functional
$$V(x, i) = x^TP_ix + 2\sum_{j=1}^{n} q_{ij}\int_0^{x_j} f_j(s)\,ds.$$
It is easy to verify that
$$\min_{1\le i\le N}\lambda_{\min}(P_i)\,\|x\|^2 \le V(x, i) \le \max_{1\le i\le N}\big(\lambda_{\max}(P_i) + \lambda_{\max}(KQ_i)\big)\,\|x\|^2.$$
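The two-sided bound on $V(x, i)$ can be checked numerically. The sketch below uses hypothetical $P_i$, $Q_i$ and the activation $f_j = \tanh$ (so $K_j = 1$ and $\int_0^{x} \tanh(s)\,ds = \ln\cosh x$):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
P = np.eye(n) + 0.1 * np.ones((n, n))   # a symmetric positive definite P_i
q = np.array([0.5, 1.0, 0.2])           # diagonal entries of Q_i
K = np.ones(n)                          # tanh satisfies (4)-(5) with K_j = 1

for _ in range(100):
    x = 3.0 * rng.normal(size=n)
    # V(x,i) = x^T P x + 2 sum_j q_j * integral of tanh from 0 to x_j
    V = x @ P @ x + 2.0 * np.sum(q * np.log(np.cosh(x)))
    lo = np.linalg.eigvalsh(P).min() * (x @ x)
    hi = (np.linalg.eigvalsh(P).max() + (K * q).max()) * (x @ x)
    assert lo - 1e-9 <= V <= hi + 1e-9
```

The upper bound follows from the sector condition: $\int_0^{x_j} f_j(s)\,ds \le K_j x_j^2/2$, so the integral term is at most $\lambda_{\max}(KQ_i)\|x\|^2$.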
The infinitesimal operator $\mathcal{L}$ of the Lyapunov functional $V(x, i)$ at the point $\{x(t), t, r(t) = i\}$ is given by [9]
$$\mathcal{L}V(x, i) = \frac{\partial V}{\partial t} + \frac{\partial V}{\partial x}\,\dot x(t) + \sum_{j=1}^{N}\gamma_{ij}V(t, j)$$
$$= -x^T(P_iD_i + D_iP_i)x + 2x^TP_iA_if(x) - 2x^TP_iB_iF_i(t)C_ix + 2x^TP_iE_iG_i(t)H_if(x) - 2x^TQ_iD_if(x) + 2f^T(x)Q_iA_if(x) - 2f^TQ_iB_iF_i(t)C_ix + 2f^T(x)Q_iE_iG_i(t)H_if(x) + \sum_{j=1}^{N}\gamma_{ij}x^TP_jx + 2\sum_{j=1}^{N}\gamma_{ij}\sum_{l=1}^{n} q_{jl}\int_0^{x_l} f_l(s)\,ds. \tag{8}$$
It follows from (3) and (5) that
$$2\sum_{l=1}^{n} q_{jl}\int_0^{x_l} f_l(s)\,ds \le x^TKQ_jx, \tag{9}$$
$$[F_i(t)C_ix]^T[F_i(t)C_ix] \le x^TC_i^TC_ix, \tag{10}$$
$$[G_i(t)H_if(x)]^T[G_i(t)H_if(x)] \le f^T(x)H_i^TH_if(x). \tag{11}$$
Substituting (9), (10) and (11) into (8), we obtain
$$\mathcal{L}V(x, i) \le -x^T(P_iD_i + D_iP_i)x + 2x^TP_iA_if(x) - 2x^TP_iB_iF_i(t)C_ix + 2x^TP_iE_iG_i(t)H_if(x) - 2x^TQ_iD_if(x) + 2f^T(x)Q_iA_if(x) - 2f^TQ_iB_iF_i(t)C_ix + 2f^T(x)Q_iE_iG_i(t)H_if(x) + \sum_{j=1}^{N}\gamma_{ij}x^TP_jx + \sum_{j=1}^{N}\gamma_{ij}x^TKQ_jx + \varepsilon_i\big(x^TC_i^TC_ix - [F_i(t)C_ix]^T[F_i(t)C_ix]\big) + \mu_i\big(f^T(x)H_i^TH_if(x) - [G_i(t)H_if(x)]^T[G_i(t)H_if(x)]\big) = y^T\Phi_iy, \tag{12}$$
where $y = \big(x^T, f^T(x), x^TC_i^TF_i^T(t), f^T(x)H_i^TG_i^T(t)\big)^T$. Since $\Phi_i < 0$, $i = 1, 2, \dots, N$, there exists a scalar $\delta > 0$ such that $\mathcal{L}V(x, i) \le -\delta\|x\|^2$. Thus, by [9], the trivial solution of system (6) is both exponentially stable in mean square and almost surely exponentially stable. This completes the proof.

Theorem 3.2. If there exist matrices $P_i > 0$, diagonal matrices $R_i = \mathrm{diag}(r_{i1}, r_{i2}, \dots, r_{in}) > 0$, and positive scalars $\varepsilon_i$, $\mu_i$, $i = 1, 2, \dots, N$, such that the following LMIs hold for $i = 1, 2, \dots, N$:
$$\Psi_i = \begin{bmatrix} \sum_{j=1}^{N}\gamma_{ij}P_j - P_iD_i - D_iP_i + \varepsilon_iC_i^TC_i & P_iA_i + R_iK & -P_iB_i & P_iE_i \\ A_i^TP_i + KR_i & \mu_iH_i^TH_i - 2KR_i & 0 & 0 \\ -B_i^TP_i & 0 & -\varepsilon_iI & 0 \\ E_i^TP_i & 0 & 0 & -\mu_iI \end{bmatrix} < 0.$$
Then the trivial solution of system (6) is both exponentially stable in mean square and almost surely exponentially stable.

Proof. Take the Lyapunov functional $V(x, i) = x^TP_ix$. By (8), one has
$$\mathcal{L}V(x, i) = -x^T(P_iD_i + D_iP_i)x + 2x^TP_iA_if(x) - 2x^TP_iB_iF_i(t)C_ix + 2x^TP_iE_iG_i(t)H_if(x) + \sum_{j=1}^{N}\gamma_{ij}x^TP_jx. \tag{13}$$
It follows from (5) that $f_j(x_j)\big(K_jx_j - f_j(x_j)\big) \ge 0$, $j = 1, 2, \dots, n$. This yields
$$\mathcal{L}V(x, i) \le -x^T(P_iD_i + D_iP_i)x + 2x^TP_iA_if(x) - 2x^TP_iB_iF_i(t)C_ix + 2x^TP_iE_iG_i(t)H_if(x) + 2\sum_{j=1}^{n} r_{ij}f_j(x_j)\big(K_jx_j - f_j(x_j)\big) + \sum_{j=1}^{N}\gamma_{ij}x^TP_jx + \varepsilon_i\big(x^TC_i^TC_ix - [F_i(t)C_ix]^T[F_i(t)C_ix]\big) + \mu_i\big(f^T(x)H_i^TH_if(x) - [G_i(t)H_if(x)]^T[G_i(t)H_if(x)]\big) = y^T\Psi_iy, \tag{14}$$
where $y = \big(x^T, f^T(x), x^TC_i^TF_i^T(t), f^T(x)H_i^TG_i^T(t)\big)^T$. The remainder of the proof is similar to that of Theorem 3.1. The proof is complete.

Remark. If the parametric uncertainties $\Delta D_i(t)$ and $\Delta A_i(t)$ are norm-bounded and can be described as $\|\Delta D_i(t)\| \le \alpha_i$, $\|\Delta A_i(t)\| \le \beta_i$, $i \in S$, we can obtain criteria for the exponential stability of system (6) by letting $B_i = C_i = \sqrt{\alpha_i}\,I$ and $E_i = H_i = \sqrt{\beta_i}\,I$.
4 Conclusions
In this paper, we studied uncertain Hopfield neural networks with Markov switching. Sufficient conditions were derived to ensure exponential stability in mean square and almost sure exponential stability by constructing suitable Lyapunov functionals. The stability criteria are given in terms of LMIs, which can be easily solved by existing software.
Acknowledgments This paper is supported by National Science Foundation of China under Grant 60474011 and 60574025.
References
1. Liao, X.X., Wang, J., Zeng, Z.G.: Global Asymptotic Stability and Global Exponential Stability of Delayed Cellular Neural Networks. IEEE Trans. on Circuits and Systems II 52 (2005) 403-409
2. Zeng, Z.G., Wang, J., Liao, X.X.: Stability Analysis of Delayed Cellular Neural Networks Described Using Cloning Templates. IEEE Trans. on Circuits and Systems I 51 (2004) 2313-2324
3. Zeng, Z.G., Wang, J., Liao, X.X.: Global Asymptotic Stability and Global Exponential Stability of Neural Networks with Unbounded Time-Varying Delays. IEEE Trans. on Circuits and Systems II 52 (2005) 168-173
4. Shen, Y., Zhao, G.Y., Jiang, M.H., Mao, X.R.: Stochastic Lotka-Volterra Competitive Systems with Variable Delay. In: Huang, D.S., Zhang, X.P., Huang, G.B. (eds.): Advances in Intelligent Computing. Lecture Notes in Computer Science, Vol. 3645. Springer-Verlag, Berlin Heidelberg New York (2005) 238-247
5. Shen, Y., Zhao, G.Y., Jiang, M.H., Hu, S.G.: Stochastic High-Order Hopfield Neural Networks. In: Wang, L.P., Chen, K., Ong, Y.S. (eds.): Advances in Natural Computation. Lecture Notes in Computer Science, Vol. 3610. Springer-Verlag, Berlin Heidelberg New York (2005) 740-749
6. Shen, Y., Jiang, M.H., Liao, X.X.: Global Exponential Stability of Cohen-Grossberg Neural Networks with Time-Varying Delays and Continuously Distributed Delays. In: Wang, J., Liao, X.F., Zhang, Y. (eds.): Advances in Neural Networks. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag, Berlin Heidelberg New York (2005) 156-161
7. Xu, B.J., Liu, X.Z., Liao, X.X.: Global Asymptotic Stability of High-Order Hopfield Type Neural Networks with Time Delays. Computers and Mathematics with Applications 45 (2003) 1729-1737
8. Liu, X.Z., Teo, K.L., Xu, B.J.: Exponential Stability of Impulsive High-Order Hopfield-Type Neural Networks with Time-Varying Delays. IEEE Trans. on Neural Networks 16 (2005) 1329-1339
9. Mao, X.R.: Stability of Stochastic Differential Equations with Markovian Switching. Stochastic Processes and Their Applications 79 (1999) 45-67
Asymptotic Stability of Second-Order Discrete-Time Hopfield Neural Networks with Variable Delays Wei Zhu1,2 and Daoyi Xu2 1 Institute of Applied Mathematics, Chongqing University of Posts and Telecommunications, 400065, Chongqing, China
[email protected],
[email protected] 2 Institute of Mathematics, Sichuan University, 610064, Chengdu, China
Abstract. This paper studies the problem of asymptotic stability of second-order discrete-time Hopfield neural networks with variable delays. By utilizing inequality techniques, we obtain sufficient conditions for the existence and asymptotic stability of an equilibrium point and estimate the region of existence and the attraction domain of the equilibrium point. A numerical example is given to illustrate our theoretical results.
1 Introduction
Hopfield neural networks with time delays have been extensively investigated over the years, and various sufficient conditions for the stability of the equilibrium point of this class of neural networks have been presented in [1]-[3] and the references cited therein. However, there are very few results on the stability of the equilibrium point for high-order Hopfield-type neural networks with time delays ([4]-[8]). In particular, for high-order discrete-time Hopfield neural networks, to the best of our knowledge, no related results have appeared in the literature. Since high-order neural networks have stronger approximation properties, faster convergence rates, greater storage capacity, and higher fault tolerance than lower-order neural networks, and motivated by the papers of S. Mohamad et al. on both continuous and discrete neural networks (see [9], [10]), we find it necessary to study the stability of high-order discrete neural networks. So, in this paper, by utilizing inequality techniques, we present sufficient conditions for the existence and asymptotic stability of an equilibrium point and estimate the region of existence and the attraction domain of the equilibrium point for second-order discrete-time Hopfield neural networks. A numerical example is given to illustrate our theoretical results.
2 Preliminaries
Consider the following second-order discrete-time Hopfield neural network

x(k + 1) = A0 x(k) + A g(x(k − τ1(k))) + G(x(k − τ2(k))) B g(x(k)) + J,   (1)

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 261–266, 2006. © Springer-Verlag Berlin Heidelberg 2006

with k = 0, 1, …, where x(k) = col{xi(k)} ∈ Rn, A0 = diag{ai} ∈ Rn×n, A = (aij)n×n, g(x(k)) = col{gj(xj(k))} ∈ Rn, J = col{Ji} ∈ Rn; xi(k) is the output of the ith neuron; the aij are the first-order synaptic weights of the network, which are not necessarily symmetric; Ji is the external input of the ith neuron; τ1(k), τ2(k) are the delays of the neurons, which satisfy 0 ≤ τ1(k), τ2(k) ≤ τ, where τ is a positive integer; and gj : R → R is the activation function, which satisfies gj(0) = 0. G(x(k)) and BT are n × n² matrices of the forms G(x(k)) = [G1(x(k)), …, Gn(x(k))] and BT = [B1, …, Bn], where Gi(x(k)) is the matrix whose ith row is gT(x(k)) = (g1(x1(k)), …, gn(xn(k))) and whose other elements are all zeros, and Bi = (aijl)n×n, in which the aijl (i, j, l = 1, …, n) are the second-order synaptic weights of the network.

In this paper, we will use the following notation. A solution of (1) with initial function φ(k) = col{φi(k)} (k = −τ, …, 0) is denoted by x(k, 0, φ), or simply by x(k) if no confusion can occur. N(a) = {a, a + 1, …}, N(a, b) = {a, a + 1, …, b}, and C = {φ | φ : N(−τ, 0) → Rn}. For a vector x ∈ Rn and a matrix A ∈ Rn×n, we define ||x|| = {Σ_{i=1}^{n} xi²}^{1/2} and ||A|| = {λmax(AAT)}^{1/2}, where λmax(·) is the largest eigenvalue of the relevant matrix. Under this choice of norms, we have ||G(x(k))|| = ||g(x(k))||.

Throughout this paper, we always suppose that

(H1) ||g(u) − g(v)|| ≤ L||u − v|| for all u, v ∈ Rn, with L > 0;
(H2) 1 − ||A0|| − ||A||L > 0 and (1 − ||A0|| − ||A||L)² − 4||B||L²||J|| > 0;
(H3) 0 < h < (1 − ||A0|| − ||A||L − 2||B||L²m) / (||B||L²),

where

m = [1 − ||A0|| − ||A||L − {(1 − ||A0|| − ||A||L)² − 4||B||L²||J||}^{1/2}] / (2||B||L²) > 0.
3 Existence and Uniqueness of Equilibrium Point

In this section, we first show that system (1) has a positive invariant set and then give an existence and uniqueness result for the equilibrium of system (1) in that set.

Definition 1. A set S ⊂ C is called a positive invariant set of system (1) if for any initial value φ ∈ S we have xk(0, φ) ∈ S for all k ∈ N(1), where xk(0, φ) = x(k + s, 0, φ), s ∈ N(−τ, 0).

Theorem 1. If (H1) and (H2) hold, then the set Sm = {φ(k) | ||φ(k)|| ≤ m} is a positive invariant set of (1).

Proof. It suffices to verify that ||x(k)|| ≤ m (k ∈ N(1)) holds whenever ||φ(k)|| ≤ m (k ∈ N(−τ, 0)). If this is not true, then there must be some kl such that

||x(kl + 1)|| > m and ||x(k)|| ≤ m, k ∈ N(−τ, kl).   (2)

From (1) we can derive

x(k + 1) = A0^{k+1} x(0) + Σ_{s=0}^{k} A0^{k−s} A g(x(s − τ1(s))) + Σ_{s=0}^{k} A0^{k−s} G(x(s − τ2(s))) B g(x(s)) + Σ_{s=0}^{k} A0^{k−s} J.

Hence, we have

m < ||x(kl + 1)||
≤ ||A0||^{kl+1} ||x(0)|| + Σ_{s=0}^{kl} ||A0||^{kl−s} ||A|| · ||g(x(s − τ1(s)))|| + Σ_{s=0}^{kl} ||A0||^{kl−s} ||G(x(s − τ2(s)))|| · ||B|| · ||g(x(s))|| + Σ_{s=0}^{kl} ||A0||^{kl−s} · ||J||
≤ ||A0||^{kl+1} ||x(0)|| + Σ_{s=0}^{kl} ||A0||^{kl−s} ||A|| · L||x(s − τ1(s))|| + Σ_{s=0}^{kl} ||A0||^{kl−s} L||x(s − τ2(s))|| · ||B|| · L||x(s)|| + Σ_{s=0}^{kl} ||A0||^{kl−s} · ||J||
≤ {||A0||^{kl+1} + (1 − ||A0||^{kl+1})/(1 − ||A0||) · (||A||L + ||B||L²m + ||J||/m)} m.

Since m = [1 − ||A0|| − ||A||L − {(1 − ||A0|| − ||A||L)² − 4||B||L²||J||}^{1/2}] / (2||B||L²) is a root of the equation ||B||L²m² + (||A||L + ||A0|| − 1)m + ||J|| = 0, we have ||B||L²m + ||A||L + ||J||/m = 1 − ||A0||, so we can derive

m < ||x(kl + 1)|| ≤ m,

which is a contradiction. So ||x(k)|| ≤ m for all k ∈ N(−τ), which implies that Sm is a positive invariant set of (1).

Theorem 2. If (H1) and (H2) hold, then system (1) has a unique equilibrium point x∗ ∈ Rn in the positive invariant set Sm.

Remark 1. The proof is straightforward, so it is omitted here.
4 Asymptotic Stability of Equilibrium Point
In this section, we will study the asymptotic stability and the attraction domain of the equilibrium point x∗. Let y(k) = x(k) − x∗; then (1) becomes

y(k + 1) = A0 y(k) + A f(y(k − τ1(k))) + G(x∗) B f(y(k)) + F(y(k − τ2(k))) B g(y(k) + x∗), k ∈ N(0),
y(k) = ϕ(k), k ∈ N(−τ, 0),   (3)

where ϕ(k) = φ(k) − x∗, f(y) = g(y + x∗) − g(x∗), and F(y) = G(y + x∗) − G(x∗). Clearly, x∗ is asymptotically stable for (1) if and only if the equilibrium point 0 of (3) is asymptotically stable.

Theorem 3. If (H1), (H2) and (H3) hold, then the set Dh = {ϕ(k) | ||ϕ(k)|| ≤ h} is a positive invariant set of (3).

Remark 2. The proof is similar to that of Theorem 1, so it is omitted.

Theorem 4. If (H1), (H2) and (H3) hold, then the equilibrium point 0 of (3) is asymptotically stable, and the attraction domain is Dh.

Proof. From Theorem 3, there must be a nonnegative constant σ such that

lim sup_{k→+∞} ||y(k)|| = σ ≤ h.   (4)

By the definition of the superior limit, for an arbitrarily small constant δ > 0 there is a K > 0 such that ||y(k − τ)|| ≤ σ + δ for any k ≥ K. By an argument similar to the proof of Theorem 1, for k ≥ K we obtain

||y(k + 1)|| ≤ ||A0||^{k+1−K} ||y(K)|| + Σ_{s=K}^{k} ||A0||^{k−s} · ||A|| · ||f(y(s − τ1(s)))|| + Σ_{s=K}^{k} ||A0||^{k−s} · ||G(x∗)|| · ||B|| · ||f(y(s))|| + Σ_{s=K}^{k} ||A0||^{k−s} ||F(y(s − τ2(s)))|| · ||B|| · ||g(y(s) + x∗)||
≤ {||A0||^{k+1−K} + (1 − ||A0||^{k+1−K})/(1 − ||A0||) · (||A||L + 2||B||L²m + ||B||L²h)} (σ + δ).

Combining (4) with the definition of the superior limit again, there are kl ≥ K, l ∈ N(1), such that lim_{kl→+∞} ||y(kl + 1)|| = σ. Letting kl → +∞ and δ → 0, we derive

σ ≤ (1/(1 − ||A0||)) (||A||L + 2||B||L²m + ||B||L²h) σ.   (5)

If σ ≠ 0, then (1/(1 − ||A0||)) (||A||L + 2||B||L²m + ||B||L²h) ≥ 1, i.e.,

h ≥ (1 − ||A0|| − ||A||L − 2||B||L²m) / (||B||L²),

which contradicts (H3). Hence, σ must be zero and the proof is completed.
5 An Illustrative Example
Example. Consider the following second-order discrete-time Hopfield neural network

x(k + 1) = A0 x(k) + A g(x(k − τ1(k))) + G(x(k − τ2(k))) B g(x(k)) + J, k ∈ N(1),   (6)

where

A0 = [1/9 0; 0 1/10], A = [3/10 1/5; 1/15 1/4], BT = [1/8 2/15 2/9 3/10; 1/10 3/20 1/20 1/4], J = (1/3, 1/4)T,

g1(x1) = (1/2) sin(x1), g2(x2) = (1/2) x2, τ1(k) = (k − 1) (mod 5), and τ2(k) = (k − 1) (mod 4). By simple calculation, we derive L = 1/2, ||A0|| = 0.1111, ||A|| = 0.4186, ||B|| = 0.5095, and ||J|| = 0.4167, so 1 − ||A0|| − ||A||L = 0.6796 > 0, (1 − ||A0|| − ||A||L)² − 4||B||L²||J|| = 0.2495 > 0, m = 0.7067, and h < 3.9218. It follows from Theorem 4 that system (6) has an asymptotically stable equilibrium point x∗ in the domain {x(k) | ||x(k)|| ≤ 0.7067}; by numerical calculation, we derive x∗ = (0.5352, 0.3990)T, and the relevant attraction domain is Dh = {x(k) | ||x(k) − x∗|| ≤ h < 3.9218}.
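The constants above can be rechecked numerically. The following sketch is our own illustration, not part of the paper: it recomputes the spectral norms, m, and the upper bound on h in (H3) from the matrices of (6), and verifies that x∗ = (0.5352, 0.3990)T is a fixed point. The reading of BT as two 2×2 blocks B1, B2 is our interpretation of the garbled original; the fixed-point residual below is how we validated it.

```python
import math

# Matrices of example (6); B1, B2 are our assumed 2x2 blocks of B^T = [B1, B2].
A0 = [[1/9, 0.0], [0.0, 1/10]]
A  = [[3/10, 1/5], [1/15, 1/4]]
B1 = [[1/8, 2/15], [1/10, 3/20]]
B2 = [[2/9, 3/10], [1/20, 1/4]]
J  = [1/3, 1/4]
Lg = 0.5  # common Lipschitz constant of g1(x) = sin(x)/2 and g2(x) = x/2

def spec_norm_2rows(M):
    """Spectral norm of a two-row matrix via the 2x2 Gram matrix M M^T."""
    g11 = sum(x * x for x in M[0])
    g22 = sum(x * x for x in M[1])
    g12 = sum(a * b for a, b in zip(M[0], M[1]))
    tr, det = g11 + g22, g11 * g22 - g12 * g12
    return math.sqrt((tr + math.sqrt(tr * tr - 4 * det)) / 2)

nA0 = spec_norm_2rows(A0)                              # = 1/9
nA  = spec_norm_2rows(A)
nB  = spec_norm_2rows([B1[0] + B2[0], B1[1] + B2[1]])  # norm of B^T (2x4)
nJ  = math.hypot(*J)

c = 1 - nA0 - nA * Lg                            # first part of (H2)
disc = c * c - 4 * nB * Lg**2 * nJ               # second part of (H2)
m = (c - math.sqrt(disc)) / (2 * nB * Lg**2)     # bound of Theorem 1
h_max = (c - 2 * nB * Lg**2 * m) / (nB * Lg**2)  # upper limit on h in (H3)

def g(x):
    return [0.5 * math.sin(x[0]), 0.5 * x[1]]

def step(x_now, x_tau1, x_tau2):
    """One update of (6); at a fixed point all three arguments coincide."""
    gn, g1d, g2d = g(x_now), g(x_tau1), g(x_tau2)
    quad = [sum(g2d[j] * Bi[j][l] * gn[l] for j in range(2) for l in range(2))
            for Bi in (B1, B2)]
    return [sum(A0[i][j] * x_now[j] + A[i][j] * g1d[j] for j in range(2))
            + quad[i] + J[i] for i in range(2)]

x_star = [0.5352, 0.3990]
residual = max(abs(a - b) for a, b in zip(step(x_star, x_star, x_star), x_star))
```

The residual is on the order of 1e-5, which agrees with the four-decimal values quoted in the example.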
6 Conclusion
In this article, a second-order discrete-time Hopfield neural network model with variable delays is investigated under the assumption that the activation function gj satisfies gj(0) = 0 (in fact, if gj(0) ≠ 0, the model can be treated by a similar method). For this model, we have derived criteria for the existence and uniqueness of an equilibrium point in some domain, given a delay-independent sufficient condition ensuring that the equilibrium point is asymptotically stable, and estimated the attraction domain of the equilibrium point.
Acknowledgement The work is supported by National Natural Science Foundation of China under Grant 10371083 and the Young Teachers’ Foundation of Chongqing University of Posts and Telecommunications A2004-12.
References
1. Xu, D.Y., Zhao, H.Y., Zhu, H.: Global Dynamics of Hopfield Neural Networks Involving Variable Delays. Computers and Mathematics with Applications 42(12) (2001) 39-45
2. Liao, X.F., Wong, K.W.: Global Exponential Stability for A Class of Retarded Functional Differential Equations with Applications in Neural Networks. J. Math. Anal. Appl. 293(1) (2004) 125-148
3. Guo, S.J., Huang, L.H., Wang, L.: Exponential Stability of Discrete Hopfield Neural Networks. Computers and Mathematics with Applications 47(8-9) (2004) 1249-1256
4. Dembo, A., Farotimi, O., Kailath, T.: High-order Absolutely Stable Neural Networks. IEEE Trans. Circ. Syst. II 38(1) (1991) 57-65
5. Xu, B.J., Liu, X.Z., Liao, X.X.: Global Asymptotic Stability of High-Order Hopfield Type Neural Networks with Time Delays. Computers and Mathematics with Applications 45(10-11) (2003) 1729-1737
6. Zhang, Y.: Robust Stabilization of Bilinear Uncertain Delay Systems. Journal of UEST of China 22(4) (1993) 414-419 (in Chinese)
7. Cao, J., Liang, J., Lam, J.: Exponential Stability of High-order Bidirectional Associative Memory Neural Networks with Time Delays. Physica D 199(3-4) (2004) 425-436
8. Cao, J.: Global Exponential Stability of Hopfield Neural Networks. Int. J. Syst. Sci. 32(2) (2001) 233-236
9. Mohamad, S., Gopalsamy, K.: Exponential Stability of Continuous-time and Discrete-time Cellular Neural Networks with Delays. Applied Mathematics and Computation 135(1) (2003) 17-38
10. Mohamad, S., Gopalsamy, K.: Dynamics of A Class of Discrete-time Neural Networks and Their Continuous-time Counterparts. Mathematics and Computers in Simulation 53(1-2) (2000) 1-39
Convergence Analysis of Discrete Delayed Hopfield Neural Networks Sheng-Rui Zhang1,2 and Run-Nian Ma2,3 1
School of Highway, Chang’an University, Xi’an, 710064 China
[email protected] 2 University Key Lab of Information Sciences and Engineering, Dalian University, 116622, China 3 Telecommunication Engineering Institute, Air Force Engineering University, Xi’an, 710077, China
[email protected]
Abstract. The convergence of recurrent neural networks is known to be the basis of successful applications of such networks. In this paper, the convergence of discrete delayed Hopfield neural networks is investigated. A new sufficient and necessary condition for the delayed network to have no stable state is given. Also, several new sufficient conditions for the delayed network to converge towards a limit cycle with period 2 and with period 4, respectively, are obtained. All results established here partly extend previous results on the convergence of both discrete Hopfield neural networks and discrete delayed Hopfield neural networks in parallel updating mode.
1 Introduction

Discrete Hopfield neural networks (DHNN) are among the best-known neural network models, with a wide range of applications such as content-addressable memory, pattern recognition, and combinatorial optimization [1-6]. Discrete delayed Hopfield neural networks (DDHNN) are an extension of DHNN. The convergence of DDHNN has been investigated and some results are given in references [7-10]. The main purpose of this paper is to obtain some new results on the stability of DDHNN.
2 Basic Model

A discrete delayed Hopfield neural network with n neurons can be determined by two n×n real matrices W0 = (w0ij)n×n and W1 = (w1ij)n×n and an n-dimensional column vector θ = (θ1, …, θn)T, and is denoted by N = (W0⊕W1, θ). There are two possible values for the state of each neuron i: 1 or −1. Denote the state of neuron i at time t ∈ {0, 1, 2, …} as xi(t); the vector X(t) = (x1(t), …, xn(t))T is the state of the whole network at time t. The dynamic behavior of the DDHNN can be described by the following state equations

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 267 – 272, 2006. © Springer-Verlag Berlin Heidelberg 2006
xi(t + 1) = sgn(Σ_{j=1}^{n} w0ij xj(t) + Σ_{j=1}^{n} w1ij xj(t − 1) + θi), i ∈ I = {1, 2, …, n},   (1)

where t ∈ {0, 1, 2, …}, and the sign function is defined as follows:

sgn(u) = 1 if u ≥ 0, and sgn(u) = −1 if u < 0.   (2)

We rewrite equation (1) in the compact form

X(t + 1) = sgn(W0X(t) + W1X(t − 1) + θ).   (3)

If a state X∗ satisfies the condition

X∗ = sgn(W0X∗ + W1X∗ + θ),   (4)

then X∗ is called a stable state (or an equilibrium point). Let N = (W0⊕W1, θ) start from any initial states X(0), X(1). If there exists a time t1 ∈ {0, 1, 2, …} such that the updating sequence X(0), X(1), X(2), X(3), … satisfies X(t + T) = X(t) for all t ≥ t1, where T is the minimum value satisfying this condition, then we say that the initial states X(0), X(1) converge towards a limit cycle with period T. In particular, a limit cycle with period 1 is a stable state. The delayed network is called strict if condition (5) is satisfied for all states X, Y ∈ {−1, 1}n, where X = (x1, …, xn)T and Y = (y1, …, yn)T:

Σ_{j=1}^{n} w0ij xj + Σ_{j=1}^{n} w1ij yj + θi ≠ 0, i ∈ I.   (5)
In this paper, the stability of DDHNN is mainly studied. The main contribution of this paper is expressed by the following results.
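The update rule (1)/(3) and the stable-state condition (4) can be sketched in a few lines. The following is our own illustration; the weight matrices are invented for the demonstration and do not come from the paper. The delayed weights W1 act as a one-step self-memory term here.

```python
def sgn(u):
    # Sign function (2): sgn(u) = 1 if u >= 0, else -1.
    return 1 if u >= 0 else -1

def step(W0, W1, theta, X_now, X_prev):
    # One parallel update of (3): X(t+1) = sgn(W0 X(t) + W1 X(t-1) + theta).
    n = len(theta)
    return tuple(sgn(sum(W0[i][j] * X_now[j] + W1[i][j] * X_prev[j]
                         for j in range(n)) + theta[i]) for i in range(n))

def is_stable(W0, W1, theta, X):
    # Stable-state condition (4): X = sgn((W0 + W1) X + theta).
    return step(W0, W1, theta, X, X) == tuple(X)

# Toy DDHNN: instantaneous cross-coupling plus a delayed self-memory term.
W0 = [[0, 1], [1, 0]]
W1 = [[1, 0], [0, 1]]
theta = [0, 0]

traj = [(1, 1), (1, -1)]          # initial states X(0), X(1)
for _ in range(6):
    traj.append(step(W0, W1, theta, traj[-1], traj[-2]))
```

With this choice, (1, 1) and (−1, −1) are the stable states, and the trajectory above settles on (1, 1); all neurons update simultaneously, i.e., in parallel mode.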
3 Main Results
Theorem 1. If the network N = (W0⊕W1, θ) is strict, then the state X = (x1, …, xn)T is a stable state if and only if there exists a positive diagonal matrix D = diag(d1, …, dn) (di > 0, i = 1, 2, …, n) such that

(W0 + W1 − D)X + θ = 0.   (6)

Proof. If X = (x1, …, xn)T is a stable state of the network N = (W0⊕W1, θ), then (4) is satisfied. Let

di = xi (Σ_{j=1}^{n} w0ij xj + Σ_{j=1}^{n} w1ij xj + θi), i ∈ I.   (7)

Since X is a stable state of the strict network N = (W0⊕W1, θ), we know that di > 0 for each i ∈ I by the definition of the sign function, and

xi di = Σ_{j=1}^{n} w0ij xj + Σ_{j=1}^{n} w1ij xj + θi, i ∈ I.   (8)

The matrix form of (8) is (6), which proves the necessary condition. On the other hand, if there exists a positive diagonal matrix D = diag(d1, …, dn) (di > 0, i = 1, 2, …, n) such that the state X = (x1, …, xn)T satisfies (6), then (8) is true, and so is (7), i.e.,

xi (Σ_{j=1}^{n} w0ij xj + Σ_{j=1}^{n} w1ij xj + θi) > 0, i ∈ I.   (9)

This means that X satisfies (4), so X is a stable state of the delayed network. Hence, the sufficient condition is proved. The proof is completed.

Example 1. Consider N = (W0⊕W1, θ), where the matrices W0, W1 and θ are

W0 = [1 −1; 1 −1], W1 = [0 0; 0 0], θ = (0, 0)T.   (10)

One can check that the network has only one stable state X = (1, 1)T, and the other states converge towards the stable state X = (1, 1)T. Also, there exists a nonnegative diagonal matrix D = diag(0, 0) such that the states X = (1, 1)T and −X = (−1, −1)T both satisfy (6). Obviously, −X is not a stable state of the network, yet it satisfies (6).

From Example 1 we see that, even if there exists a nonnegative diagonal matrix D = diag(d1, …, dn) (di ≥ 0, i = 1, 2, …, n) such that (6) is satisfied, the state X = (x1, …, xn)T is not guaranteed to be a stable state. This means that, if the delayed network is not strict, there is no result corresponding to Theorem 1. However, based on the proof of Theorem 1, we can show that if the state X = (x1, …, xn)T is a stable state, then there exists a nonnegative diagonal matrix D = diag(d1, …, dn) (di ≥ 0, i = 1, 2, …, n) such that (6) is satisfied. In summary, if the delayed network is not strict, then (6) is only a necessary condition for the state X to be a stable state, not a sufficient one. Based on Theorem 1, we can easily give the following theorem.

Theorem 2. If the network N = (W0⊕W1, θ) is strict, then the delayed network has no stable state if and only if for every state X = (x1, …, xn)T there does not exist a positive diagonal matrix D = diag(d1, …, dn) (di > 0, i = 1, 2, …, n) such that (6) is true.
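Example 1 can be checked by brute force. The sketch below is ours, not the authors': it enumerates all states of network (10), confirms that X = (1, 1)T is the only stable state, and confirms that both ±(1, 1)T satisfy (6) with the nonnegative, but not positive, matrix D = diag(0, 0). (Note that the network in (10) is not strict, since the row sums of W0 + W1 vanish, which is exactly why Theorem 1 does not apply.)

```python
from itertools import product

W0 = [[1, -1], [1, -1]]   # network (10)
W1 = [[0, 0], [0, 0]]
theta = [0, 0]
n = 2

def sgn(u):
    return 1 if u >= 0 else -1

def is_stable(X):
    # Stable-state condition (4) with the combined matrix W0 + W1.
    return all(sgn(sum((W0[i][j] + W1[i][j]) * X[j] for j in range(n))
                   + theta[i]) == X[i] for i in range(n))

def satisfies_6(X, D):
    # Condition (6): (W0 + W1 - D) X + theta = 0, with D a diagonal of D[i].
    return all(abs(sum((W0[i][j] + W1[i][j] - (D[i] if i == j else 0)) * X[j]
                       for j in range(n)) + theta[i]) < 1e-12 for i in range(n))

stable = [X for X in product((-1, 1), repeat=n) if is_stable(X)]
```

Here `stable` contains only (1, 1), while both (1, 1) and (−1, −1) pass `satisfies_6` with D = diag(0, 0), matching the discussion above.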
Corollary 1. Let the network N = (W0⊕W1, θ) be strict. If the matrix W0 + W1 and θ satisfy the following condition for each X ∈ {−1, 1}n,

XT(W0 + W1)X ≤ −Σ_{i∈I} |θi|, X ∈ {−1, 1}n,   (11)

then the delayed network has no stable state. In particular, if one of the following conditions (12), (13) and (14) is satisfied, then the delayed network has no stable state:

w0ii + w1ii ≤ −Σ_{j∈I(j≠i)} |w0ij + w1ij| − |θi|, i ∈ I,   (12)

w0ii + w1ii ≤ −Σ_{j∈I(j≠i)} |w0ji + w1ji| − |θi|, i ∈ I,   (13)

w0ii + w1ii ≤ −(1/2) Σ_{j∈I(j≠i)} |w0ij + w1ij + w0ji + w1ji| − |θi|, i ∈ I.   (14)

Proof. Without loss of generality, assume that the network N = (W0⊕W1, θ) has a stable state X = (x1, …, xn)T. Then, by Theorem 1, there exists a positive diagonal matrix D = diag(d1, …, dn) (di > 0, i = 1, 2, …, n) such that (6) is satisfied. This means

XT(W0 + W1)X + XTθ = Σ_{j=1}^{n} dj.   (15)

On the one hand, XT(W0 + W1)X + XTθ ≤ 0 because condition (11) holds. On the other hand, d1 + … + dn > 0. This contradicts (15). Consequently, the network has no stable state. In particular, if one of the conditions (12), (13) and (14) is satisfied, one can easily verify that condition (11) holds as well, so the delayed network has no stable state. The proof is completed.
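Corollary 1 can be observed directly by exhaustive search. In this sketch (our own illustration; the weights are chosen by us so that condition (12) holds with θ = 0 and the network is strict, since each row sum lies in {±0.5, ±1.5}), no state of {−1, 1}² satisfies (4):

```python
from itertools import product

# W = W0 + W1 with W1 = 0, theta = 0; negative row-diagonally dominant:
# w_ii = -1 <= -|w_ij| = -0.5, so condition (12) holds.
W = [[-1.0, 0.5], [0.5, -1.0]]
n = 2

def sgn(u):
    return 1 if u >= 0 else -1

def is_stable(X):
    # Stable-state condition (4).
    return all(sgn(sum(W[i][j] * X[j] for j in range(n))) == X[i]
               for i in range(n))

# Sanity check of condition (12) for this W.
cond12 = all(W[i][i] <= -sum(abs(W[i][j]) for j in range(n) if j != i)
             for i in range(n))

stable_states = [X for X in product((-1, 1), repeat=n) if is_stable(X)]
```

As Corollary 1 predicts, `stable_states` comes out empty for this network.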
Remark: In Corollary 1, let θ = 0. If condition (11) is satisfied, then the matrix W0 + W1 is called nonpositive definite (not necessarily symmetric) on the set {−1, 1}n. If conditions (12) and (13) are respectively satisfied, the matrix W0 + W1 is respectively called negative row-diagonally dominant and negative column-diagonally dominant. Hence, if N = (W0⊕W1, 0) is strict and the matrix W0 + W1 is negative row-diagonally dominant, or negative column-diagonally dominant, or satisfies condition (14), then the delayed network has no stable state.

Corollary 2. Let the network N = (W0⊕W1, θ) be strict and θ = 0. If the matrix W0 + W1 is antisymmetric, then the delayed network has no stable state. In particular, if the matrices W0 and W1 are both antisymmetric, then the delayed network has neither a stable state nor a limit cycle with period 2.

Proof. If the network has a stable state X, then there exists a positive diagonal matrix D = diag(d1, …, dn) (di > 0, i = 1, 2, …, n) such that equation (6) is satisfied, so (15) holds with θ = 0. On the one hand, XT(W0 + W1)X = 0 because the matrix W0 + W1 is antisymmetric. On the other hand, d1 + … + dn > 0. This contradicts (15). So the delayed network has no stable state. In particular, if the matrices W0 and W1 are antisymmetric, then so is W0 + W1, and the above applies. Next, without loss of generality, assume that the network has a limit cycle (X1, X2) with period 2. Then we have

X1 = sgn(W0X2 + W1X1), X2 = sgn(W0X1 + W1X2).   (16)

Let r = (X1)T(W0X2 + W1X1) and s = (X2)T(W0X1 + W1X2). Since the delayed network is strict, we have r > 0 and s > 0 by the definition of the sign function, so r + s > 0. Also, from the expressions for r and s, and since W0 and W1 are antisymmetric,

r + s = (X1)T(W0X2 + W1X1) + (X2)T(W0X1 + W1X2) = (X1)TW0X2 + (X1)TW1X1 + (X2)TW0X1 + (X2)TW1X2 = (X1)TW0X2 + (X2)TW0X1 = 0.   (17)

Obviously, this contradicts r + s > 0. Consequently, the delayed network has no limit cycle with period 2. The proof is completed.

In the following, we give conditions under which the delayed network converges towards a limit cycle with period 2 and with period 4, respectively.

Theorem 3. Let the network N = (W0⊕W1, θ) be strict. We have the following two results.

1) If the condition

max{w0ii + w1ii, w0ii − w1ii} ≤ −Σ_{j∈I(j≠i)} |w0ij| − Σ_{j∈I(j≠i)} |w1ij| − |θi|, i ∈ I,   (18)

is satisfied, then the delayed network converges towards a limit cycle with period 2.

2) If the condition

max{w0ii + w1ii, −w0ii + w1ii} ≤ −Σ_{j∈I(j≠i)} |w0ij| − Σ_{j∈I(j≠i)} |w1ij| − |θi|, i ∈ I,   (19)

is satisfied, then the delayed network converges towards a limit cycle with period 4.

Proof. 1) For any initial states X(0), X(1), if xi(0) = xi(1), then, since the delayed network is strict, (1) and (18) give

xi(2) = sgn(Σ_{j=1}^{n} w0ij xj(1) + Σ_{j=1}^{n} w1ij xj(0) + θi) = sgn((w0ii + w1ii) xi(0)) = −xi(0).   (20)

By the same reasoning, we can calculate xi(3):

xi(3) = sgn(Σ_{j=1}^{n} w0ij xj(2) + Σ_{j=1}^{n} w1ij xj(1) + θi) = sgn(−(w0ii − w1ii) xi(0)) = xi(0).   (21)

Continuing in this way, the state updating sequence of neuron i is

xi(0), xi(0), −xi(0), xi(0), −xi(0), xi(0), −xi(0), ….

If xi(0) ≠ xi(1), the same reasoning shows that the state updating sequence of neuron i is

xi(0), −xi(0), xi(0), −xi(0), xi(0), −xi(0), xi(0), ….

As proved above, N = (W0⊕W1, θ) converges towards a limit cycle with period 2 for any initial states X(0), X(1).

2) For any initial states X(0), X(1), if xi(0) = xi(1), the same reasoning shows that the state updating sequence of neuron i is

xi(0), xi(0), −xi(0), −xi(0), xi(0), xi(0), −xi(0), −xi(0), ….

If xi(0) ≠ xi(1), the state updating sequence of neuron i is

xi(0), −xi(0), −xi(0), xi(0), xi(0), −xi(0), −xi(0), xi(0), ….

As proved above, N = (W0⊕W1, θ) converges towards a limit cycle with period 4 for any initial states X(0), X(1). The proof is completed.
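Both convergence results of Theorem 3 can be observed in simulation. In the sketch below (our own illustration; the diagonal weights are chosen by us so that (18), respectively (19), holds with θ = 0, and both networks are strict), the first network settles into a period-2 cycle and the second into a period-4 cycle:

```python
def sgn(u):
    return 1 if u >= 0 else -1

def simulate(W0, W1, X0, X1, steps=12):
    # Parallel updates of (3) from the initial states X(0), X(1), theta = 0.
    traj = [tuple(X0), tuple(X1)]
    n = len(X0)
    for _ in range(steps):
        now, prev = traj[-1], traj[-2]
        traj.append(tuple(sgn(sum(W0[i][j] * now[j] + W1[i][j] * prev[j]
                                  for j in range(n))) for i in range(n)))
    return traj

I2 = [[1, 0], [0, 1]]
negI2 = [[-1, 0], [0, -1]]

# (18): max{w0ii + w1ii, w0ii - w1ii} = max{-3, -1} = -1 <= 0 holds.
W0_2 = [[-2, 0], [0, -2]]
traj2 = simulate(W0_2, negI2, (1, 1), (1, 1))

# (19): with W0 = I, W1 = -2I, max{1 - 2, -1 - 2} = -1 <= 0 holds.
W1_4 = [[-2, 0], [0, -2]]
traj4 = simulate(I2, W1_4, (1, 1), (1, 1))
```

From X(0) = X(1) = (1, 1)T, the first trajectory alternates with period 2 from t = 1 on, and the second repeats every four steps, exactly the updating patterns in the proof above.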
Acknowledgments. The work is partly supported by the Open Foundation of the University Key Lab of Information Sciences and Engineering, Dalian University. The authors would like to thank the referees for their valuable comments and suggestions for improving this paper.
References
1. Hopfield, J.J.: Neural Networks and Physical Systems with Emergent Collective Computational Abilities. Proc. Nat. Acad. Sci. USA 79 (1982) 2554-2558
2. Bruck, J.: On the Convergence Properties of the Hopfield Model. Proceedings of the IEEE 78 (1990) 1579-1585
3. Xu, Z., Kwong, C.P.: Global Convergence and Asymptotic Stability of Asymmetrical Hopfield Neural Networks. J. Mathematical Analysis and Applications 191 (1995) 405-426
4. Lee, D.: New Stability Conditions for Hopfield Neural Networks in Partial Simultaneous Update Mode. IEEE Trans. Neural Networks 10 (1999) 975-978
5. Goles, E.: Antisymmetrical Neural Networks. Discrete Applied Mathematics 13 (1986) 97-100
6. Ma, R., Zhang, Q., Xu, J.: Convergence of Discrete-time Cellular Neural Networks. Chinese J. Electronics 11 (2002) 352-356
7. Ma, R., Xi, Y., Gao, H.: Stability of Discrete Hopfield Neural Networks with Delay in Serial Mode. In: Yin, F., Wang, J., Guo, C. (eds.): Advances in Neural Networks - ISNN 2004. Lecture Notes in Computer Science, Vol. 3173. Springer-Verlag, Berlin Heidelberg (2004) 126-131
8. Qiu, S., Tang, E.C.C., Yeung, D.S.: Stability of Discrete Hopfield Neural Networks with Time-delay. In: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (2000) 2546-2552
9. Ma, R., Xi, Y., Lei, S.: Dynamic Behavior of Discrete Hopfield Neural Networks with Time-Delay. Chinese J. Electronics 14 (2005) 187-191
10. Ma, R., Lei, S., Zhang, S.: Dynamic Behavior Analysis of Discrete Neural Networks with Delay. In: Wang, J., Liao, X., Zhang, Y. (eds.): Advances in Neural Networks - ISNN 2005. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag, Berlin Heidelberg (2005) 259-264
An LMI-Based Approach to the Global Stability of Bidirectional Associative Memory Neural Networks with Variable Delay Minghui Jiang1 , Yi Shen2 , and Xiaoxin Liao2 1
Institute of Nonlinear Complex Systems, Three Gorges University, Yichang, Hubei, 443000, China
[email protected] 2 Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
[email protected],
[email protected]
Abstract. Based on the linear matrix inequality (LMI), new sufficient conditions for the global exponential stability and asymptotic stability of bidirectional associative memory neural networks with variable delay are presented, and the exponential convergence rate is estimated. Furthermore, the results in this paper are less conservative than those reported so far in the literature. An example is given to illustrate the feasibility of our main results.
1 Introduction
Bidirectional associative memory (BAM) neural networks, first proposed by Kosko [1] for various applications such as designing associative memories, solving optimization problems, and automatic control engineering, are a class of two-layer hetero-associative networks with delays. Such applications rely heavily on the stability properties of the equilibria of neural networks [2-7]. As is well known, in both biological and man-made neural networks, delays arise because of the processing of information. Thus, the study of neural networks with delays is extremely important for building high-quality neural networks. Unfortunately, most of the existing literature on BAM neural networks with delays is restricted to models with constant delays [8-11]. Therefore, the stability analysis of BAM neural networks with variable delays is important from both theoretical and applied points of view. In this paper, based on the linear matrix inequality, sufficient conditions on the global exponential stability and asymptotic stability of BAM neural networks with variable delay are presented. It should be noted that the results in this paper are less conservative than those reported so far in the literature.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 273–278, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Preliminaries

The BAM neural network model is described by the following differential equations:

ẋi(t) = −ai xi(t) + Σ_{j=1}^{q} wij fj(yj(t − τ(t))) + Ii, i = 1, 2, …, p,
ẏj(t) = −bj yj(t) + Σ_{i=1}^{p} vji gi(xi(t − δ(t))) + Jj, j = 1, 2, …, q,   (1)

for t ≥ 0, where x(t) = [x1(t), …, xp(t)]T ∈ Rp, y(t) = [y1(t), …, yq(t)]T ∈ Rq, ai ≥ 0 and bj ≥ 0 are constants, Ii and Jj are external inputs, wij and vji are connection weights of the network, and the variable delays τ(t) and δ(t) are continuous and differentiable functions which satisfy 0 ≤ τ(t), δ(t) ≤ τ and τ′(t), δ′(t) ≤ 0. The activation functions fj(·) and gi(·) are assumed to satisfy the following conditions:

(H1) There exist positive constants Mj, j = 1, …, q, and Li, i = 1, …, p, such that for any θ, ρ ∈ R,
0 ≤ (fj(θ) − fj(ρ))(θ − ρ)−1 ≤ Mj, θ ≠ ρ,
0 ≤ (gi(θ) − gi(ρ))(θ − ρ)−1 ≤ Li, θ ≠ ρ.

(H2) There exist positive constants Nj, j = 1, …, q, and Ri, i = 1, …, p, such that |fj(u)| ≤ Nj and |gi(u)| ≤ Ri for all u ∈ R.

Throughout this paper, unless otherwise specified, we let A ≥ 0 denote a nonnegative definite matrix, A > 0 a positive definite symmetric matrix, and so on. AT, A−1, λmin(A) and λmax(A) denote, respectively, the transpose, the inverse, the minimum eigenvalue and the maximum eigenvalue of a square matrix A. Let |·| and ||A|| denote the Euclidean norm in Rn and the 2-norm in Rn×n, respectively.
3 Main Results

In this section, in order to investigate the stability of the equilibria of the BAM neural network (1), we first establish the existence of an equilibrium point of the network (1).

Theorem 1. The BAM neural network (1) has at least one equilibrium point if the network (1) satisfies Assumptions (H1) and (H2).

Proof. Obviously, under Assumptions (H1) and (H2), the right-hand side of the BAM neural network (1) satisfies a global Lipschitz condition. It is well known (see [14]) that the network (1) with an initial condition has at least one continuous solution on t ≥ −τ. Consider a mapping F : Rp+q → Rp+q defined by

F(x, y) = (F1(x, y), F2(x, y)),   (2)
where

(F1(x, y))i = ai−1 {Σ_{j=1}^{q} wij fj(yj(t − τ(t))) + Ii}, i = 1, …, p,
(F2(x, y))j = bj−1 {Σ_{i=1}^{p} vji gi(xi(t − δ(t))) + Jj}, j = 1, …, q.

The rest of the proof is similar to the one in Ref. [8] and is omitted.

In order to show the uniqueness of the equilibrium point, we use the transformation ui(t) = xi(t) − xi∗, i = 1, …, p, zj(t) = yj(t) − yj∗, j = 1, …, q, to shift the equilibrium point of the BAM neural network (1) to the origin, and obtain the following form of the BAM neural network (1):

u̇i(t) = −ai ui(t) + Σ_{j=1}^{q} wij sj(zj(t − τ(t))), i = 1, 2, …, p,
żj(t) = −bj zj(t) + Σ_{i=1}^{p} vji ei(ui(t − δ(t))), j = 1, 2, …, q,   (3)

where

sj(zj(t)) = fj(zj(t) + yj∗) − fj(yj∗), j = 1, 2, …, q,
ei(ui(t)) = gi(ui(t) + xi∗) − gi(xi∗), i = 1, 2, …, p.

It is obvious that the functions sj and ei satisfy the same Assumptions (H1) and (H2). The BAM neural network (3) can be rewritten in the following vector form:

u̇(t) = −Au(t) + WS(z(t − τ(t))),
ż(t) = −Bz(t) + VE(u(t − δ(t))),   (4)

where u(t) = (u1(t), …, up(t))T, z(t) = (z1(t), …, zq(t))T, A = diag(a1, …, ap), B = diag(b1, …, bq), W = (wij)p×q, V = (vji)q×p, S(z) = (s1(z1), …, sq(zq))T, and E(u) = (e1(u1), …, ep(up))T. We next show that the origin of the BAM neural network (4) is the unique equilibrium and globally asymptotically stable.

Theorem 2. Assume that Assumptions (H1) and (H2) hold. The origin of the BAM neural network (4) is the unique equilibrium point and globally asymptotically stable if there exist positive constants α, β, θ, κ such that the following LMIs hold:

−αL−1AL−1 − 2θAL−1 + βVTB−1V + κI + θWWT < 0   (5)

and

−βM−1BM−1 − 2κBM−1 + αWTA−1W + θI + κVVT < 0,   (6)

where M = diag(M1, …, Mq), L = diag(L1, …, Lp), and I is an identity matrix.

Proof. Let (u∗, z∗) be an equilibrium of the BAM neural network (4); then

−Au∗ + WS(z∗) = 0,
−Bz∗ + VE(u∗) = 0.   (7)
By Assumption (H1), we know that $u^* = z^* = 0$ if $S(z^*) = E(u^*) = 0$, and vice versa. Therefore, by (7), if $S(z^*) = 0$ then $E(u^*) = 0$, and vice versa. That $S(z^*) = E(u^*) = 0$ can be proved by contradiction; the details are omitted. Next we prove that the origin of the BAM neural network (4) is globally asymptotically stable. Choose the following Lyapunov functional:
$$
\begin{aligned}
V(u(t), z(t)) ={}& \alpha u^T(t)u(t) + 2\theta \sum_{i=1}^{p} \int_0^{u_i(t)} e_i(s)\,ds + \kappa \sum_{i=1}^{p} \int_{t-\delta(t)}^{t} e_i^2(u_i(s))\,ds\\
&+ \beta z^T(t)z(t) + 2\kappa \sum_{i=1}^{q} \int_0^{z_i(t)} s_i(s)\,ds + \theta \sum_{i=1}^{q} \int_{t-\tau(t)}^{t} s_i^2(z_i(s))\,ds\\
&+ \alpha \int_{t-\tau(t)}^{t} S(z(s))^T W^T A^{-1} W S(z(s))\,ds\\
&+ \beta \int_{t-\delta(t)}^{t} E(u(s))^T V^T B^{-1} V E(u(s))\,ds. \qquad (8)
\end{aligned}
$$
The time derivative of $V(u(t), z(t))$ along the trajectories of (4) turns out to be
$$
\begin{aligned}
\dot V(u(t), z(t)) ={}& -2\alpha u^T(t)Au(t) + 2\alpha u^T(t)W S(z(t-\tau(t))) - 2\theta E^T(u(t))Au(t)\\
&+ 2\theta E^T(u(t))W S(z(t-\tau(t))) + \kappa E^T(u(t))E(u(t))\\
&- (1-\dot\delta(t))\kappa E^T(u(t-\delta(t)))E(u(t-\delta(t))) - 2\beta z^T(t)Bz(t)\\
&+ 2\beta z^T(t)V E(u(t-\delta(t))) - 2\kappa S^T(z(t))Bz(t)\\
&+ 2\kappa S^T(z(t))V E(u(t-\delta(t))) + \theta S^T(z(t))S(z(t))\\
&- (1-\dot\tau(t))\theta S^T(z(t-\tau(t)))S(z(t-\tau(t)))\\
&+ \alpha S(z(t))^T W^T A^{-1} W S(z(t))\\
&- (1-\dot\tau(t))\alpha S(z(t-\tau(t)))^T W^T A^{-1} W S(z(t-\tau(t)))\\
&+ \beta E(u(t))^T V^T B^{-1} V E(u(t))\\
&- (1-\dot\delta(t))\beta E(u(t-\delta(t)))^T V^T B^{-1} V E(u(t-\delta(t))). \qquad (9)
\end{aligned}
$$
Using the conditions $\dot\delta(t) \le 0$, $\dot\tau(t) \le 0$ and Assumption (H1), we can derive from (5), (6) and (9) that
$$
\begin{aligned}
\dot V(u(t), z(t)) \le{}& E(u(t))^T\big({-\alpha L^{-1}AL^{-1}} - 2\theta AL^{-1} + \beta V^T B^{-1} V + \kappa I + \theta W W^T\big)E(u(t))\\
&+ S(z(t))^T\big({-\beta M^{-1}BM^{-1}} - 2\kappa BM^{-1} + \alpha W^T A^{-1} W + \theta I + \kappa V V^T\big)S(z(t)) < 0. \qquad (10)
\end{aligned}
$$
Therefore, by the standard Lyapunov-type theorem [13], we conclude that the origin of the BAM neural network (4) is globally asymptotically stable. This completes the proof.

Remark 1. Theorem 2 in this paper reduces to Theorem 2 in [8] when $\alpha = \beta = \theta = \kappa = 1$ and $\delta(t) = \tau(t) = \tau$. Therefore, Theorem 2 in this paper generalizes and improves the results in [8].
A LMI-Based Approach to the Global Exponential Stability of BAM
277
By a similar method, we can prove that the origin of the BAM neural network (4) is globally exponentially stable, as stated in the following result.

Theorem 3. Assume that Assumptions (H1) and (H2) hold. The origin of the BAM neural network (4) is the unique equilibrium point and is globally exponentially stable if there exist positive constants $\alpha, \beta, \theta, \kappa$ such that the following LMIs hold:
$$-2\theta A L^{-1} + \beta V^T B^{-1} V + \kappa I + \theta W W^T < 0 \qquad (11)$$
$$-2\kappa B M^{-1} + \alpha W^T A^{-1} W + \theta I + \kappa V V^T < 0 \qquad (12)$$
and $\dot\delta(t) \le 0$, $\dot\tau(t) \le 0$, where $M = \mathrm{diag}(M_1, \dots, M_q)$, $L = \mathrm{diag}(L_1, \dots, L_p)$, and $I$ is an identity matrix.

Remark 2. Theorem 3 provides a sufficient condition for the global exponential stability of the origin of the BAM neural network (4), whereas Refs. [7-11] prove only the asymptotic stability of the BAM neural network (4).
4 Illustrative Example
Example 1. Assume that the network parameters of the BAM neural network (1) are chosen as
$$\tau(t) = \delta(t) = 2 + 1/t, \quad A = B = I, \quad W = V = \begin{pmatrix} 1.41 & 1.41\\ 1.41 & -1.41 \end{pmatrix},$$
and
$$s_i(z_i) = \tfrac{1}{2}\tanh(z_i), \quad e_i(u_i) = \tfrac{1}{2}\tanh(u_i), \quad i = 1, 2,$$
with $M = L = \mathrm{diag}(\tfrac{1}{2}, \tfrac{1}{2})$ and all inputs $I_i = J_j = 0$. In this case, there exists at least one feasible solution $\alpha = \beta = 963.7426$, $\theta = \kappa = 8.0695$ of the LMIs (5) and (6) in Theorem 2. Therefore, the equilibrium of the above system is globally asymptotically stable. However, the result of Theorem 2 in [8] yields
$$-L^{-1}AL^{-1} - 2AL^{-1} + V^T B^{-1} V + I + W W^T = \begin{pmatrix} 0.95240 & 0\\ 0 & 0.95240 \end{pmatrix} > 0.$$
Hence, the criterion in [8] cannot be applied to Example 1.
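The two matrix conditions above can be checked numerically. A minimal NumPy sketch (using exactly the parameter values stated in Example 1) evaluates the LMI (5) of Theorem 2 with the reported feasible scalars and the criterion matrix of [8]:

```python
import numpy as np

# Network parameters of Example 1
A = B = np.eye(2)
L = M = np.diag([0.5, 0.5])
W = V = np.array([[1.41, 1.41],
                  [1.41, -1.41]])

# Feasible scalars reported for the LMIs (5)-(6) of Theorem 2
alpha = beta = 963.7426
theta = kappa = 8.0695

Li = np.linalg.inv(L)
Bi = np.linalg.inv(B)

# LMI (5): negative definite, so Theorem 2 applies
lmi5 = (-alpha * Li @ A @ Li - 2 * theta * A @ Li
        + beta * V.T @ Bi @ V + kappa * np.eye(2) + theta * W @ W.T)

# Criterion matrix of Theorem 2 in [8] (alpha = beta = theta = kappa = 1):
# positive definite here, so that criterion fails on this example
crit8 = -Li @ A @ Li - 2 * A @ Li + V.T @ Bi @ V + np.eye(2) + W @ W.T

print(np.linalg.eigvalsh(lmi5))   # all negative
print(np.linalg.eigvalsh(crit8))  # both approximately 0.9524 > 0
```

By symmetry of the example ($\alpha = \beta$, $\theta = \kappa$, $W = V$, $A = B$, $L = M$), the LMI (6) reduces to the same matrix as (5).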
5 Conclusion
In this paper, we present theoretical results on the global exponential stability and asymptotic stability of BAM neural networks with variable delay. The conditions obtained here are easy to check in practice, and are of prime importance and great interest in the design and application of such networks. The criteria are more general than the respective criteria reported in [8]. Finally, an illustrative example is given to compare the results with existing ones.
Acknowledgments

The work was supported by the National Natural Science Foundation of China (60274007, 60574025) and the Natural Science Foundation of Hubei (2004ABA055).
References

1. Kosko, B.: Bidirectional Associative Memories. IEEE Trans. Systems, Man and Cybernetics 18 (1988) 49-60
2. Cao, J., Wang, J.: Global Asymptotic Stability of Recurrent Neural Networks with Lipschitz-continuous Activation Functions and Time-Varying Delays. IEEE Trans. Circuits Syst. I 50 (2003) 34-44
3. Zeng, Z., Wang, J., Liao, X.: Global Exponential Stability of a General Class of Recurrent Neural Networks with Time-Varying Delays. IEEE Trans. Circuits Syst. 50 (2003) 1353-1359
4. Singh, V.: A Generalized LMI-Based Approach to the Global Asymptotic Stability of Delayed Cellular Neural Networks. IEEE Trans. Neural Networks 15 (2004) 223-225
5. Arik, S.: An Analysis of Global Asymptotic Stability of Delayed Cellular Neural Networks. IEEE Trans. Neural Networks 13 (2002) 1239-1242
6. Liao, T., Wang, F.: Global Stability for Cellular Neural Networks with Time Delay. IEEE Trans. Neural Networks 11 (2000) 1481-1484
7. Zhao, H.: Global Stability of Bidirectional Associative Memory Neural Networks with Distributed Delays. Phys. Lett. A 30 (2002) 519-546
8. Arik, S., Tavsanoglu, V.: Global Asymptotic Stability Analysis of Bidirectional Associative Memory Neural Networks with Constant Time Delays. Neurocomputing 68 (2005) 161-176
9. Liao, X., Yu, J.: Qualitative Analysis of Bi-directional Associative Memory with Time Delay. Int. J. Circuit Theory Appl. 26 (1998) 219-229
10. Liao, X., Yu, J., Chen, G.: Novel Stability Criteria for Bidirectional Associative Memory Neural Networks with Time Delays. Int. J. Circuit Theory Appl. 30 (2002) 519-546
11. Mohamad, S.: Global Exponential Stability in Continuous-time and Discrete-time Delayed Bidirectional Neural Networks. Physica D 159 (2001) 233-251
12. Yi, Z., Tan, K.K.: Convergence Analysis of Recurrent Neural Networks. Kluwer Academic Publishers, Dordrecht (2003)
13. Chen, M., Chen, Z., Chen, G.: Approximate Solutions of Operator Equations. World Scientific, Singapore (1997)
14. Driver, R.D.: Ordinary and Delay Differential Equations. Springer-Verlag, New York (1977)
Existence of Periodic Solution of BAM Neural Network with Delay and Impulse

Hui Wang^{1,2}, Xiaofeng Liao^1, Chuandong Li^1, and Degang Yang^{1,3}

1 College of Computer Science, Chongqing University, Chongqing 400030, China
2 Department of Mathematics, Leshan Normal College, 614003, China
3 College of Mathematics and Computer Science, Chongqing Normal University, Chongqing 400047, China
Abstract. By using the continuation theorem for Mawhin's coincidence degree and some analytical techniques, several sufficient conditions are obtained ensuring the existence of periodic solutions of BAM neural networks with variable coefficients, delays and impulses.
1 Introduction

Since the BAM neural network was first introduced by Kosko [1], it has received much attention in the past decade due to its applicability in many fields. There are many results on periodic solutions and their exponential convergence properties [2-7]. To the best of our knowledge, however, periodic oscillatory solutions are seldom considered for BAM networks with delays and impulses. In this paper, we investigate the existence of positive periodic solutions of impulsive and delayed BAM systems:
$$
\begin{cases}
\dot x_i(t) = -a_i(t)x_i(t) + \sum_{j=1}^{n} \alpha_{ji}(t)\, f_j(y_j(t-\tau_{ji})) + I_i(t),\\
\Delta x_i(t_k) = x_i(t_k^+) - x_i(t_k^-) = I_{ik}(x_i(t_k)),\\
\dot y_j(t) = -b_j(t)y_j(t) + \sum_{i=1}^{p} \beta_{ij}(t)\, g_i(x_i(t-\sigma_{ij})) + J_j(t),\\
\Delta y_j(t_k) = y_j(t_k^+) - y_j(t_k^-) = J_{jk}(y_j(t_k)),
\end{cases}
\qquad (1)
$$
where $x_i(t)$ and $y_j(t)$ are the activations of the $i$th neuron and the $j$th neuron, respectively, $\alpha_{ji}(t)$ and $\beta_{ij}(t)$ are the connection weights at time $t$, $I_i(t)$, $J_j(t)$ denote the external inputs at time $t$, $\tau_{ji}$ and $\sigma_{ij}$ correspond to the finite speed of the axonal signal transmission, $\Delta x_i(t_k)$, $\Delta y_j(t_k)$ are the impulses at moments $t_k$, and $t_1 < t_2 < \cdots$ is a strictly increasing sequence such that $\lim_{k\to\infty} t_k = +\infty$.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 279-284, 2006. © Springer-Verlag Berlin Heidelberg 2006

Throughout this paper, we assume that:

(H1) $a_i(t) > 0$, $b_j(t) > 0$; $a_i(t)$, $\alpha_{ji}(t)$, $I_i(t)$, $b_j(t)$, $\beta_{ij}(t)$, $J_j(t)$ are positive continuous periodic functions of period $\omega$ ($\min_{0\le t\le\omega} a_i(t) > 0$, $\min_{0\le t\le\omega} b_j(t) > 0$);
(H2) $f_j$, $g_i$ are bounded;
(H3) there exists a positive integer $m$ such that $t_{k+m} = t_k + \omega$, $I_{i(k+m)} = I_{ik}$, $J_{j(k+m)} = J_{jk}$, $i = 1, 2, \dots, p$; $j = 1, 2, \dots, n$; $k = 1, 2, \dots$.
2 Preliminaries

We state Mawhin's continuation theorem [8] as follows.

Theorem 1.1. Let $X$ and $Y$ be two Banach spaces and $L : \mathrm{dom}\,L \subset X \to Y$ be a Fredholm mapping of index 0. Suppose that $\Omega$ is open and bounded in $X$ and $N : \bar\Omega \to Y$ is L-compact on $\bar\Omega$. Furthermore, suppose that

(i) for each $\lambda \in (0,1)$, $x \in \partial\Omega \cap \mathrm{dom}\,L$, $Lx \ne \lambda Nx$;
(ii) for each $x \in \partial\Omega \cap \mathrm{Ker}\,L$, $QNx \ne 0$, and $\deg(JQN, \Omega \cap \mathrm{Ker}\,L, 0) \ne 0$,

where $J : \mathrm{Im}\,Q \to \mathrm{Ker}\,L$ is an isomorphism. Then $Lx = Nx$ has at least one solution in $\bar\Omega \cap \mathrm{dom}\,L$.
In the following, we shall use the notation $\bar f = \frac{1}{\omega}\int_0^\omega f(t)\,dt$. For any nonnegative integer $q$, let
$$
\begin{aligned}
C^{(q)}[0,\omega; t_1,\dots,t_m] = \big\{\, z : [0,\omega] \to \mathbb{R}^{p+n} \mid{}& z^{(q)}(t) \text{ exists for } t \ne t_1,\dots,t_m;\\
& z^{(q)}(t_k + 0),\ z^{(q)}(t_k - 0) \text{ exist at } t_1,\dots,t_m;\\
& \text{and } z^{(j)}(t_k) = z^{(j)}(t_k - 0),\ k = 1,\dots,m,\ j = 0,1,\dots,q \,\big\}
\end{aligned}
$$
with the norm $\|z\|_q = \max_{0\le j\le q}\big\{\sup_{t\in[0,\omega]} \|z^{(j)}(t)\|\big\}$, where $\|\cdot\|$ is any norm of $\mathbb{R}^{p+n}$. It is easy to see that $C^{(q)}[0,\omega; t_1,\dots,t_m]$ is a Banach space.
Let $\mathrm{dom}\,L = \big\{ z \in C^{(1)}[0,\omega; t_1,\dots,t_m] \big\}$ and define
$$L : \mathrm{dom}\,L \to Y, \quad z \mapsto \big(z',\ \Delta z(t_1), \Delta z(t_2), \dots, \Delta z(t_m),\ 0\big), \qquad (2)$$
$$N : Z \to Y, \quad Nz = \big(\psi(t),\ \Delta z(t_1), \dots, \Delta z(t_m),\ 0\big), \qquad (3)$$
where $\psi(t)$, the first column of the right-hand side of (3), is
$$\psi(t) = \begin{pmatrix} -a_1(t)x_1(t) + \sum_{j=1}^{n} \alpha_{j1}(t)\, f_j(y_j(t-\tau_{j1})) + I_1(t)\\ \vdots\\ -a_p(t)x_p(t) + \sum_{j=1}^{n} \alpha_{jp}(t)\, f_j(y_j(t-\tau_{jp})) + I_p(t)\\ -b_1(t)y_1(t) + \sum_{i=1}^{p} \beta_{i1}(t)\, g_i(x_i(t-\sigma_{i1})) + J_1(t)\\ \vdots\\ -b_n(t)y_n(t) + \sum_{i=1}^{p} \beta_{in}(t)\, g_i(x_i(t-\sigma_{in})) + J_n(t) \end{pmatrix},$$
and define the projectors
$$Pz = \frac{1}{\omega}\int_0^\omega z(t)\,dt, \qquad (4)$$
$$QNz = Q(\psi, C_1, C_2, \dots, C_m, d) = \Big(\frac{1}{\omega}\int_0^\omega \psi(s)\,ds + \frac{1}{\omega}\sum_{k=1}^{m} C_k + d,\ 0, \dots, 0,\ 0\Big). \qquad (5)$$
Obviously, $\mathrm{Ker}\,L = \big\{ z : z \equiv c \in \mathbb{R}^{p+n},\ t \in [0,\omega] \big\}$ and $\mathrm{Im}\,L = Z \times \mathbb{R}^{(p+n)\times m} \times \{0\}$.
Since $\dim \mathrm{Ker}\,L = p + n = \mathrm{codim}\,\mathrm{Im}\,L$, $P$ and $Q$ are continuous projectors satisfying $\mathrm{Im}\,P = \mathrm{Ker}\,L$ and $\mathrm{Im}\,L = \mathrm{Ker}\,Q = \mathrm{Im}(I - Q)$. It is easy to see that $\mathrm{Im}\,L$ is closed in $Y$; thus we have the following results.

Lemma 2.1. Let $L$ be defined by (2); then $L$ is a Fredholm operator of index 0.

Lemma 2.2. Let $L$ and $N$ be defined by (2) and (3), respectively, and suppose $\Omega$ is an open bounded subset of $\mathrm{dom}\,L$; then $N$ is L-compact on $\bar\Omega$.

Proof. For any $z \in \bar\Omega$, denote the inverse of the map $L|_{\mathrm{dom}\,L \cap \mathrm{Ker}\,P} \to Y$ by $K_P$. Through an easy computation, we can find
$$K_P(\varphi, C_1, \dots, C_m, 0)(t) = \int_0^t \varphi(s)\,ds + \sum_{t_k < t} C_k - \frac{1}{\omega}\int_0^\omega\!\!\int_0^t \varphi(s)\,ds\,dt - \sum_{k=1}^{m} C_k.$$
Thus, $QN(\bar\Omega)$ is bounded and $N$ is a completely continuous mapping; by the Arzelà-Ascoli theorem, $K_P(I - Q)N(\bar\Omega)$ is relatively compact, and hence $N$ is L-compact on $\bar\Omega$. The proof of Lemma 2.2 is complete.
3 Existence of Periodic Solution

In this section, based on Mawhin's continuation theorem, we study the existence of at least one periodic solution of (1).

Theorem 3.1. Assume that (H1), (H2) and (H3) hold; then system (1) has at least one $\omega$-periodic solution.

Proof. Based on Lemmas 2.1 and 2.2, what we need to do is to search for an appropriate open, bounded subset $\Omega$ for the application of the continuation theorem. Corresponding to the operator equation $Lx = \lambda Nx$, $\lambda \in (0,1)$, we have
$$
\begin{cases}
\dfrac{dx_i(t)}{dt} = \lambda\Big[-a_i(t)x_i(t) + \sum_{j=1}^{n} \alpha_{ji}(t)\, f_j(y_j(t-\tau_{ji})) + I_i(t)\Big], & t \ne t_k,\ t \in [0,\omega],\\[2pt]
\Delta x_i(t_k) = x_i(t_k^+) - x_i(t_k^-) = \lambda I_{ik}(x_i(t_k)), & i = 1,\dots,p,\ x_i(0) = x_i(\omega),\\[2pt]
\dfrac{dy_j(t)}{dt} = \lambda\Big[-b_j(t)y_j(t) + \sum_{i=1}^{p} \beta_{ij}(t)\, g_i(x_i(t-\sigma_{ij})) + J_j(t)\Big], & t \ne t_k,\ t \in [0,\omega],\\[2pt]
\Delta y_j(t_k) = y_j(t_k^+) - y_j(t_k^-) = \lambda J_{jk}(y_j(t_k)), & j = 1,\dots,n,\ y_j(0) = y_j(\omega).
\end{cases}
\qquad (6)
$$
Suppose that $Z(t) = (x^T, y^T)^T \in Z$ is a solution of system (6) for some $\lambda \in (0,1)$. Integrating (6) over the interval $[0,\omega]$, we obtain
$$
\begin{aligned}
\int_0^\omega a_i(s)x_i(s)\,ds &= \int_0^\omega \Big[\sum_{j=1}^{n} \alpha_{ji}(s)\, f_j(y_j(s-\tau_{ji})) + I_i(s)\Big]\,ds + \sum_{k=1}^{m} I_{ik}(x_i(t_k)),\\
\int_0^\omega b_j(s)y_j(s)\,ds &= \int_0^\omega \Big[\sum_{i=1}^{p} \beta_{ij}(s)\, g_i(x_i(s-\sigma_{ij})) + J_j(s)\Big]\,ds + \sum_{k=1}^{m} J_{jk}(y_j(t_k)).
\end{aligned}
\qquad (7)
$$
Let $\xi_i^+, \eta_j^+ \in [0,\omega]$ be such that
$$x_i(\xi_i^+) = \sup_{t\in[0,\omega]} x_i(t), \qquad y_j(\eta_j^+) = \sup_{t\in[0,\omega]} y_j(t).$$
Then, by (7), we have
$$x_i(\xi_i^+)\,\bar a_i \ge -\omega n \alpha_i M_i - \sum_{k=1}^{m} I_{ik}(x_i(t_k)), \qquad y_j(\eta_j^+)\,\bar b_j \ge -\omega n \beta_j N_j - \sum_{k=1}^{m} J_{jk}(y_j(t_k)).$$
Hence,
$$x_i(\xi_i^+) \ge \Big(-\omega n \alpha_i M_i + \sum_{k=1}^{m} I_{ik}(x_i(t_k))\Big)\Big/\bar a_i := T_i^+, \qquad (8)$$
and similarly,
$$y_j(\eta_j^+) \ge \Big(-\omega n \beta_j N_j + \sum_{k=1}^{m} J_{jk}(y_j(t_k))\Big)\Big/\bar b_j := T_j^+. \qquad (9)$$
Let $\xi_i^-, \eta_j^- \in [0,\omega]$ be such that
$$x_i(\xi_i^-) = \inf_{t\in[0,\omega]} x_i(t), \qquad y_j(\eta_j^-) = \inf_{t\in[0,\omega]} y_j(t).$$
Then, by (7), we have
$$x_i(\xi_i^-)\,\bar a_i \le \omega n \alpha_i M_i + \sum_{k=1}^{m} I_{ik}(x_i(t_k)), \quad i = 1,\dots,p. \qquad (10)$$
Similar to (10), we have
$$y_j(\eta_j^-)\,\bar b_j \le \omega n \beta_j N_j + \sum_{k=1}^{m} J_{jk}(y_j(t_k)). \qquad (11)$$
Hence,
$$x_i(\xi_i^-) \le \Big(\omega n \alpha_i M_i + \sum_{k=1}^{m} I_{ik}(x_i(t_k))\Big)\Big/\bar a_i := K_i^-, \qquad y_j(\eta_j^-) \le K_j^-. \qquad (12)$$
From (9) and (12), we obtain
$$\sup_{t\in[0,\omega]} |z_i(t)| < \max\{|T_i^+|, |K_i^-|\} := H_i.$$
Denote $A = \sum_{i=1}^{p+n} H_i + E$; clearly, $A$ is independent of $\lambda$, and we select
$$\Omega = \big\{ z(t) = (\phi^T, \psi^T)^T = (x_1, x_2, \dots, x_p, y_1, y_2, \dots, y_n)^T \in Z : \|z(t)\| < H,\ z(t_k + 0) \in \Omega,\ k = 1, 2, \dots, m \big\}.$$
It is clear that $\Omega$ verifies requirement (i) in Theorem 1.1. When $z \in \mathrm{Ker}\,L \cap \partial\Omega$, $z$ is a constant vector in $\mathbb{R}^{p+n}$ with $\|z\| = H$. Then
$$
QNz = \begin{pmatrix}
\frac{1}{\omega}\int_0^\omega \Big[-a_1(t)x_1(t) + \sum_{j=1}^{n} \alpha_{j1}(t)\, f_j(y_j(t-\tau_{j1})) + I_1(t)\Big]\,dt + \frac{1}{\omega}\sum_{k=1}^{m} I_{1k}(x_1(t_k))\\
\vdots\\
\frac{1}{\omega}\int_0^\omega \Big[-a_p(t)x_p(t) + \sum_{j=1}^{n} \alpha_{jp}(t)\, f_j(y_j(t-\tau_{jp})) + I_p(t)\Big]\,dt + \frac{1}{\omega}\sum_{k=1}^{m} I_{pk}(x_p(t_k))\\
\frac{1}{\omega}\int_0^\omega \Big[-b_1(t)y_1(t) + \sum_{i=1}^{p} \beta_{i1}(t)\, g_i(x_i(t-\sigma_{i1})) + J_1(t)\Big]\,dt + \frac{1}{\omega}\sum_{k=1}^{m} J_{1k}(y_1(t_k))\\
\vdots\\
\frac{1}{\omega}\int_0^\omega \Big[-b_n(t)y_n(t) + \sum_{i=1}^{p} \beta_{in}(t)\, g_i(x_i(t-\sigma_{in})) + J_n(t)\Big]\,dt + \frac{1}{\omega}\sum_{k=1}^{m} J_{nk}(y_n(t_k))
\end{pmatrix} \ne 0,
$$
the remaining components being $0, \dots, 0$.
Let $J : \mathrm{Im}\,Q \to \mathrm{Ker}\,L$, $(r, 0, \dots, 0, 0) \mapsto r$. Then, in view of the assumptions, it is easy to prove that $\deg\{JQNz, \Omega \cap \mathrm{Ker}\,L, 0\} \ne 0$. By now, we have proved that $\Omega$ verifies all the requirements in Theorem 1.1. Hence, system (1) has at least one $\omega$-periodic solution in $\bar\Omega$. The proof is complete. □
4 Conclusions

In this paper, we used the continuation theorem of coincidence degree theory to study the existence of periodic solutions for a BAM neural network model with impulses and time delays. A set of easily verifiable sufficient conditions for the existence of a periodic solution was obtained. The global exponential stability of the periodic solution of this system will be discussed in future papers.
Acknowledgements

This work was jointly supported by the National Natural Science Foundation of China (Grant no. 60574024, 60573047), the Scientific Research Fund Projects for Youngsters of Sichuan Provincial Education Administration Office (Grant no. 2005B039), and the Natural Science Foundation of Chongqing (Grant no. 8509).
References

1. Kosko, B.: Adaptive Bi-directional Associative Memories. Appl. Opt. 26(23) (1987) 4947-4960
2. Li, Y., Lu, L.: Global Exponential Stability and Existence of Periodic Solution of Hopfield-type Neural Networks with Impulses. Phys. Lett. A 333 (2004) 62-71
3. Liao, X., Wong, K.W., Yang, S.Z.: Convergence Dynamics of Hybrid Bidirectional Associative Memory Neural Networks with Distributed Delays. Phys. Lett. A 316 (2003) 55-64
4. Li, C., Liao, X.: New Algebraic Conditions for Global Exponential Stability of Delayed Recurrent Neural Networks. Neurocomputing 64 (2005) 319-333
5. Li, C., Liao, X.: Delay-dependent Exponential Stability Analysis of Bi-directional Associative Memory Neural Networks: an LMI Approach. Chaos, Solitons & Fractals 24(4) (2005) 1119-1134
6. Li, C., Liao, X., Zhang, R.: A Unified Approach for Impulsive Lag Synchronization of Chaotic Systems with Time Delay. Chaos, Solitons & Fractals 23 (2005) 1177-1184
7. Song, Q., Cao, J.: Global Exponential Stability and Existence of Periodic Solutions in BAM Networks with Delays and Reaction-diffusion Terms. Chaos, Solitons & Fractals 23 (2005) 421-430
8. Gaines, R.E., Mawhin, J.L.: Coincidence Degree and Nonlinear Differential Equations. Lecture Notes in Math., vol. 586. Springer-Verlag, Berlin (1977)
On Control of Hopf Bifurcation in BAM Neural Network with Delayed Self-feedback

Min Xiao^{1,2} and Jinde Cao^1

1 Department of Mathematics, Southeast University, Nanjing 210096, China
2 Department of Mathematics, Nanjing Xiaozhuang College, Nanjing 210017, China
[email protected]
Abstract. In this paper, the control of Hopf bifurcations in BAM neural network with delayed self-feedback is presented. The asymptotic stability theorem and the relevant corollary for linearized nonlinear dynamical systems are stated. For BAM neural network with delayed self-feedback, a control model based on washout filter is proposed and analyzed. By applying the stability lemma, we investigate the stability of the control system and state the relevant theorem for choosing the parameters of the stabilized control system. Some numerical results are also given to illustrate the correctness of our results.
1 Introduction
A neural network, especially a time-delayed neural network, is usually a large-scale dynamic system that possesses rich dynamical behaviors. Therefore, there has been increasing interest in the study of the dynamics of neural networks (see [5-7, 13, 15, 16]). The bidirectional associative memory (BAM) neural network, with or without axonal signal transmission delays, is a type of network with the neurons arrayed in two layers (see [1, 2]). Networks with such a bidirectional structure have practical applications in storing paired patterns or memories and possess the ability of searching the desired patterns via both directions: forward and backward (see [1, 3, 4]). The delayed BAM neural network can be described by the following system:
$$
\begin{cases}
\dot x_i(t) = -x_i(t) + \sum_{j=1}^{m} c_{ji}\, f_i(y_j(t-\tau_{ji})) + I_i,\\
\dot y_j(t) = -y_j(t) + \sum_{i=1}^{n} d_{ij}\, g_j(x_i(t-\sigma_{ij})) + J_j,
\end{cases}
\qquad (1)
$$
where $c_{ji}$, $d_{ij}$ ($i = 1, 2, \dots, n$, $j = 1, 2, \dots, m$) are the connection weights between the neurons in the two layers: the X-layer and the Y-layer. On the X-layer, the neurons whose states are denoted by $x_i(t)$ receive the input $I_i$ and the inputs output by the neurons in the Y-layer via the activation function $f_i$, while on the Y-layer, the neurons whose associated states are denoted by $y_j(t)$ receive the input $J_j$ and the inputs output by the neurons in the X-layer via the activation function $g_j$.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 285-290, 2006. © Springer-Verlag Berlin Heidelberg 2006
On the other hand, recent work (see [9]) has shown that inhibitory self-connections can play a stabilizing role in a network under some conditions on the delays. This motivates us to incorporate inhibitory self-connection terms into the model system (1) and study the following system:
$$
\begin{cases}
\dot x_i(t) = -x_i(t) + a_{ii}\, f_i(x_i(t-r_{ii})) + \sum_{j=1}^{m} c_{ji}\, f_i(y_j(t-\tau_{ji})) + I_i,\\
\dot y_j(t) = -y_j(t) + b_{jj}\, g_j(y_j(t-m_{jj})) + \sum_{i=1}^{n} d_{ij}\, g_j(x_i(t-\sigma_{ij})) + J_j.
\end{cases}
\qquad (2)
$$
Although system (2) can be mathematically regarded as a Hopfield-type neural network, it exhibits many nice properties due to the special structure of its connection weights and has many practical applications. For the sake of simplicity, we assume that there is only one neuron in each of the X-layer and the Y-layer, that the time delay from the X-layer to the Y-layer is $\tau$, which is the same as the time delay from the Y-layer back to the X-layer, and that the self-feedback delay is also $\tau$. This simplified BAM neural network with delayed self-feedback can be described by the following system:
$$
\begin{cases}
\dot x_1(t) = -x_1(t) + a_1 f(x_1(t-\tau)) - a_1 b_1 f(y_1(t-\tau)),\\
\dot y_1(t) = -y_1(t) + a_2 f(x_1(t-\tau)) - a_2 b_2 f(y_1(t-\tau)),
\end{cases}
\qquad (3)
$$
where $a_i$ ($i = 1, 2$), $b_i$ ($i = 1, 2$) are real constant parameters. Bifurcation phenomena can be found in many time-delayed dynamical systems, which may have richer dynamical behaviors [6, 8, 10-12]. System (3) may undergo a Hopf bifurcation when the activation function $f(\cdot)$ and the system parameters $a_i$, $b_i$ satisfy some specific conditions. Bifurcation control generally refers to the design of a controller that is capable of modifying the bifurcation characteristics of a bifurcating nonlinear system, thereby achieving some specific dynamical behaviors. Recently, Chen et al. studied the problem of "anti-control" of bifurcations [14]. The term "anti-control" means that a certain bifurcation is created with a desired location and desired properties by a control approach. In [14], the authors presented a washout-filter-based anti-control law for Hopf bifurcations in a general control system.
2 Main Results
In this section, we first investigate the stability of the linear system with delay
$$\dot x_i(t) = \sum_{j=1}^{n} \big(a_{ij}\, x_j(t) + b_{ij}\, x_j(t-\tau)\big), \quad i = 1, 2, \dots, n, \quad \tau > 0. \qquad (4)$$
We first introduce the following two lemmas for system (4).
On Control of Hopf Bifurcation in BAM Neural Network
287
Lemma 1. The equilibrium point of system (4) is asymptotically stable if the following conditions hold: (1) there exists $\tau_0$ such that the real parts of all eigenvalues of the characteristic equation $\Delta(\lambda; \tau_0) = 0$ of Eq. (4) are strictly negative when $\tau = \tau_0$; (2) for an arbitrary real number $\xi$ and $\tau \in [\tau_0, \bar\tau]$ ($\tau_0 < \bar\tau$) or $\tau \in [\bar\tau, \tau_0]$ ($\tau_0 \ge \bar\tau$), $\Delta(i\xi; \tau) \ne 0$.

In particular, system (4) degenerates into a system without time delay when $\tau_0 = 0$. As a result, the eigenvalues of the characteristic equation are easy to obtain without the term $e^{-\tau\lambda}$. We describe this special case in the following lemma.

Lemma 2. The equilibrium point of system (4) is asymptotically stable if the following conditions hold: (1) the real parts of all eigenvalues of the characteristic equation $\Delta(\lambda; \tau) = 0$ of Eq. (4) are strictly negative when $\tau = 0$; (2) for an arbitrary real number $\xi$ and $\tau \in [0, \bar\tau]$, $\Delta(i\xi; \tau) \ne 0$.

In the remainder of this paper, Lemmas 1 and 2 will be used frequently. We consider a general form of a dynamical system with parameter $\mu$:
$$\dot y = f(t, y, \mu), \qquad f(t, 0, \mu) = 0. \qquad (5)$$
Adding a control action $u$ to the equation of one component $y_i$, and taking $u$ of the following form [14]:
$$u = \hat f(x, K), \qquad \dot w = y_i - \alpha w = x, \qquad (6)$$
where $K$ is the control parameter and $\hat f$ is a nonlinear function. In order to keep the structure of all equilibria of Eq. (6), the following conditions must be satisfied:
- $\alpha > 0$ for the stability of the washout filter;
- $\hat f(0, K) = 0$ for preservation of the equilibrium points.

By using the above approach, the control system for Eq. (3) is designed as follows:
$$
\begin{cases}
\dot x_1(t) = -x_1(t) + a_1 f(x_1(t-\tau)) - a_1 b_1 f(y_1(t-\tau)) + \hat f(\zeta),\\
\dot y_1(t) = -y_1(t) + a_2 f(x_1(t-\tau)) - a_2 b_2 f(y_1(t-\tau)),\\
\dot u = x_1(t) - \alpha u = \zeta,\\
\hat f(\zeta) = -K_1 \zeta,
\end{cases}
\qquad (7)
$$
where $f(0) = 0$, $\alpha > 0$, and $K_1$ is a constant scaling parameter. The linearized equation of Eq. (7) has the following form:
$$
\begin{cases}
\dot x_1(t) = -x_1(t) + a_1 f'(0)x_1(t-\tau) - a_1 b_1 f'(0)y_1(t-\tau) - K_1(x_1(t) - \alpha u),\\
\dot y_1(t) = -y_1(t) + a_2 f'(0)x_1(t-\tau) - a_2 b_2 f'(0)y_1(t-\tau),\\
\dot u = x_1(t) - \alpha u.
\end{cases}
\qquad (8)
$$
288
M. Xiao and J. Cao
The characteristic equation for Eq. (8) is
$$\det \begin{pmatrix} -1 - K_1 + a_1 f'(0)e^{-\lambda\tau} - \lambda & -a_1 b_1 f'(0)e^{-\lambda\tau} & K_1\alpha\\ a_2 f'(0)e^{-\lambda\tau} & -1 - a_2 b_2 f'(0)e^{-\lambda\tau} - \lambda & 0\\ 1 & 0 & -\alpha - \lambda \end{pmatrix} = 0. \qquad (9)$$
Therefore,
$$
\begin{aligned}
\Delta(\lambda; \tau) ={}& -\lambda^3 - [2 + \alpha + K_1 + (a_2 b_2 - a_1)f'(0)e^{-\lambda\tau}]\lambda^2\\
&- [(1 + \alpha + a_2 b_2 f'(0)e^{-\lambda\tau})(1 + K_1 - a_1 f'(0)e^{-\lambda\tau})\\
&\quad + \alpha(1 + a_2 b_2 f'(0)e^{-\lambda\tau}) - K_1\alpha + a_1 b_1 a_2 f'^2(0)e^{-2\lambda\tau}]\lambda\\
&- \alpha(1 + a_2 b_2 f'(0)e^{-\lambda\tau})(1 - a_1 f'(0)e^{-\lambda\tau}) - \alpha a_1 b_1 a_2 f'^2(0)e^{-2\lambda\tau}.
\end{aligned}
$$
When $\tau = 0$, we have
$$
\begin{aligned}
\Delta(\lambda; 0) ={}& -\lambda^3 - [2 + \alpha + K_1 + (a_2 b_2 - a_1)f'(0)]\lambda^2\\
&- [(1 + \alpha + a_2 b_2 f'(0))(1 + K_1 - a_1 f'(0)) + \alpha(1 + a_2 b_2 f'(0)) - K_1\alpha\\
&\quad + a_1 b_1 a_2 f'^2(0)]\lambda - \alpha(1 + a_2 b_2 f'(0))(1 - a_1 f'(0)) - \alpha a_1 b_1 a_2 f'^2(0).
\end{aligned}
$$
Denote
$$
\begin{aligned}
p &= 2 + \alpha + K_1 + (a_2 b_2 - a_1)f'(0),\\
q &= (1 + \alpha + a_2 b_2 f'(0))(1 + K_1 - a_1 f'(0)) + \alpha(1 + a_2 b_2 f'(0)) - K_1\alpha + a_1 b_1 a_2 f'^2(0),\\
r &= \alpha(1 + a_2 b_2 f'(0))(1 - a_1 f'(0)) + \alpha a_1 b_1 a_2 f'^2(0).
\end{aligned}
\qquad (10)
$$
From $\Delta(\lambda; 0) = 0$, we obtain
$$\lambda^3 + p\lambda^2 + q\lambda + r = 0. \qquad (11)$$
According to the Routh-Hurwitz criterion, if
(H) $p > 0$, $r > 0$, $pq - r > 0$,
then the real parts of all eigenvalues of Eq. (11) are strictly negative. Based on the above discussion and Lemma 2, the following result for the control system (7) is obtained immediately.

Theorem 1. Let $p, q, r$ be defined by (10). Suppose that (H) holds and that, for an arbitrary real number $\xi$ and $\tau \in [0, \bar\tau]$, the characteristic equation satisfies $\Delta(i\xi; \tau) \ne 0$; then the control system (7) is asymptotically stable.
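Condition (H) is straightforward to check for concrete parameter values; a small sketch, using the parameter values of the example in the next section (so that $f'(0) = 1$):

```python
# Routh-Hurwitz check of condition (H) for the cubic (11),
# using the example parameter values with fp = f'(0) = 1.
a1, b1, a2, b2 = 1.0, 2.3, 1.8, 0.6
K1, alpha, fp = 3.0, 2.0, 1.0

p = 2 + alpha + K1 + (a2 * b2 - a1) * fp
q = ((1 + alpha + a2 * b2 * fp) * (1 + K1 - a1 * fp)
     + alpha * (1 + a2 * b2 * fp) - K1 * alpha + a1 * b1 * a2 * fp**2)
r = (alpha * (1 + a2 * b2 * fp) * (1 - a1 * fp)
     + alpha * a1 * b1 * a2 * fp**2)

print(p, q, r, p * q - r)  # condition (H) requires p > 0, r > 0, p*q - r > 0
```

Here $p = 7.08$, $q = 14.54$, $r = 8.28$ and $pq - r > 0$, so (H) holds for these parameters.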
3 Illustrative Example
For the control system (7), if we take $a_1 = 1$, $b_1 = 2.3$, $a_2 = 1.8$, $b_2 = 0.6$, $f(\cdot) = \tanh(\cdot)$, $\tau = 1$, $K_1 = 3$, $\alpha = 2$, then $f(0) = 0$, $f'(0) = 1$, and all the conditions of Theorem 1 are satisfied. The waveforms of $x_1(t)$ and $y_1(t)$ are shown in Fig. 1. The phase portraits of system (7) without and with control are shown in Fig. 2. Fixing $a_1$, $a_2$, $b_2$ and letting $b_1$ vary, we obtain the bifurcation diagrams (see Fig. 3), which cross $x_1(t) = x_1(t-\tau)$ and $y_1(t) = y_1(t-\tau)$, with respect to the parameter $b_1$, of system (7) with and without control. By comparing the two bifurcation plots, we observe that the occurrence of the bifurcation is postponed when the system is controlled.
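The controlled waveforms can be reproduced approximately by forward-Euler integration of the delayed system (7). A minimal sketch (the step size, time horizon, and constant initial history 0.5 are illustrative choices, not taken from the paper):

```python
import numpy as np

# Forward-Euler integration of the controlled system (7)
# with the example parameters a1=1, b1=2.3, a2=1.8, b2=0.6,
# K1=3, alpha=2, tau=1 and f = tanh.
a1, b1, a2, b2 = 1.0, 2.3, 1.8, 0.6
K1, alpha, tau = 3.0, 2.0, 1.0
f = np.tanh

dt = 0.005
lag = int(tau / dt)
steps = int(150.0 / dt)

x = np.full(steps + 1, 0.5)   # x1(t), constant history on [-tau, 0]
y = np.full(steps + 1, 0.5)   # y1(t)
u = np.zeros(steps + 1)       # washout-filter state

for k in range(steps):
    xd = x[max(k - lag, 0)]   # x1(t - tau)
    yd = y[max(k - lag, 0)]   # y1(t - tau)
    zeta = x[k] - alpha * u[k]
    x[k + 1] = x[k] + dt * (-x[k] + a1 * f(xd) - a1 * b1 * f(yd) - K1 * zeta)
    y[k + 1] = y[k] + dt * (-y[k] + a2 * f(xd) - a2 * b2 * f(yd))
    u[k + 1] = u[k] + dt * zeta
```

With the control on, the oscillation dies out and the state settles, matching the "with control" panels of Figs. 1-2; setting $K_1 = 0$ removes the control term and recovers the uncontrolled dynamics of system (3).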
Fig. 1. Waveform of system (7) without and with control

Fig. 2. Phase portrait of system (7) without and with control

Fig. 3. Bifurcation diagram of system (7) without and with control
Acknowledgement

This work is supported by the National Natural Science Foundation of China under Grants 60574043 and 60373067, and by the Natural Science Foundation of Jiangsu Province, China under Grant BK2003053.
References

1. Gopalsamy, K., He, X.: Delay-Independent Stability in Bi-Directional Associative Memory Networks. IEEE Trans. Neural Networks 5 (1994) 998-1002
2. Kosko, B.: Bi-Directional Associative Memories. IEEE Trans. Syst. Man Cybern. 18 (1988) 49-60
3. Kosko, B.: Unsupervised Learning in Noise. IEEE Trans. Neural Networks 1 (1990) 44-57
4. Mohamad, S.: Global Exponential Stability in Continuous-Time and Discrete-Time Delayed Bidirectional Neural Networks. Physica D 159 (2001) 233-251
5. Cao, J., Liang, J., Lam, J.: Exponential Stability of High-Order Bidirectional Associative Memory Neural Networks with Time Delays. Physica D 199 (2004) 425-436
6. Olien, L., Belair, J.: Bifurcation, Stability and Monotonicity Properties of a Delayed Neural Network Model. Physica D 102 (1997) 349-363
7. Cao, J., Wang, L.: Exponential Stability and Periodic Oscillatory Solution in BAM Networks with Delays. IEEE Trans. Neural Networks 13 (2002) 457-463
8. Liao, X., Wong, K.W., Leung, C.S., Wu, Z.F.: Hopf Bifurcation and Chaos in a Single Delayed Neuron Equation with Non-monotonic Activation Function. Chaos, Solitons & Fractals 12 (2001) 1535-1547
9. Driessche, P.V., Wu, J., Zou, X.: Stabilization Role of Inhibitory Self-Connection in a Delayed Neural Network. Physica D 150 (2001) 84-90
10. Zhuang, Z.-R., Huang, J., Gao, J.-Y.: Analysis of the Bifurcation Diagram of a Hybrid Bistable System with Feedback Control of Chaos. Phys. Rev. E 60 (1999) 5422-5425
11. Song, Y., Han, M., Wei, J.: Stability and Hopf Bifurcation Analysis on a Simplified BAM Neural Network with Delays. Physica D 200 (2005) 185-204
12. Lee, S.-H., Park, J.-K., Lee, B.-H.: A Study on the Nonlinear Controller to Prevent Unstable Hopf Bifurcation. Power Engineering Society Summer Meeting 2 (2001) 978-982
13. Gopalsamy, K., Leung, I.: Delay Induced Periodicity in a Neural Network of Excitation and Inhibition. Physica D 89 (1996) 395-426
14. Chen, D., Wang, H.O., Chen, G.: Anti-control of Hopf Bifurcations through Washout Filters. Proceedings of the 37th IEEE Conference on Decision and Control 3 (1998) 3040-3045
15. Zhou, J., Xiang, L., Liu, Z.: Global Dynamics of Delayed Bidirectional Associative Memory (BAM) Neural Networks. Appl. Math. Mech.-English Edition 25 (2005) 327-335
16. Zhou, J., Chen, T., Xiang, L.: Robust Synchronization of Delayed Neural Networks Based on Adaptive Control and Parameters Identification. Chaos, Solitons & Fractals 27 (2006) 905-913
Convergence and Periodicity of Solutions for a Class of Discrete-Time Recurrent Neural Network with Two Neurons

Hong Qu and Zhang Yi

Computational Intelligence Laboratory, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, People's Republic of China
{hongqu, zhangyi}@uestc.edu.cn
Abstract. Multistable neural networks have attracted much interest in recent years, since monostable networks are computationally restricted. This paper studies a class of discrete-time two-neuron networks with unsaturating piecewise linear activation functions. Some interesting results on the convergence and the periodicity of solutions of the system are obtained.
1 Introduction
The global stability of recurrent neural networks is a basic requirement in some practical applications of neural networks. Many researchers have aimed to study the stability of recurrent neural networks [1, 2, 3], but the major results have focused on the behavior of monostable networks. To display the computational capability of neural networks in more depth, multistability analysis of LT [4] networks has been presented by some authors recently. In papers [5] and [6], three basic properties of multistable networks (boundedness, global attractivity and complete convergence) are studied in detail for the continuous-time and discrete-time forms of LT networks, respectively. A qualitative analysis of continuous-time LT networks is presented in paper [7]. It is well known that two-neuron networks sometimes display the same dynamical behavior as larger networks do, and can thus be used as prototypes to improve our understanding of the computational performance of large networks. There has been increasing interest in the study of the dynamics of two-neuron networks (see for example [8]-[10]). This paper focuses on the analysis of the convergence and periodicity of solutions for discrete-time LT networks with two neurons. The rest of this paper is organized as follows. A brief description of the LT network model with two neurons is presented in Section 2. A convergence analysis of the LT network with two neurons is given in Section 3. The periodicity of solutions for the LT network with two neurons is analyzed in Section 4. Finally, conclusions are drawn in Section 5.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 291-296, 2006. © Springer-Verlag Berlin Heidelberg 2006
H. Qu and Z. Yi

2 The Network Model with Two Neurons
The discrete-time LT network with two neurons can be described as follows:

x1(k+1) = a11 σ(x1(k)) + a12 σ(x2(k)) + h1,
x2(k+1) = a21 σ(x1(k)) + a22 σ(x2(k)) + h2,    (1)

or in the equivalent vector form x(k+1) = Aσ(x(k)) + h, where x1, x2 denote the activities of neurons 1 and 2, respectively, A = (aij)_{2×2} is a real 2 × 2 matrix whose element aij ∈ R (i = 1, 2; j = 1, 2) denotes the synaptic weight, i.e., the strength of the synaptic connection from neuron j to neuron i, hi ∈ R (i = 1, 2) denotes the external input of neuron i, and h is the vector form of h1 and h2. The activation function σ is defined by

σ(u) = u, if u > 0;  σ(u) = 0, if u ≤ 0.    (2)

For any k > 0, each LT neuron is either firing or silent. The whole ensemble of neurons can be divided into a partition P+(k) of neurons with positive states (xi(k) > 0 for i ∈ P+(k)) and a partition P−(k) of neurons with nonpositive states (xi(k) ≤ 0 for i ∈ P−(k)). Define the set

S = { diag(e1, e2) | ei = 0 or 1, i = 1, 2 }.

Then, given any initial condition x(0) ∈ R2, the trajectory of (1) starting from x(0) can be represented as

x(k+1) = [∏_{j=0}^{k} A E(k−j)] x(0) + Σ_{l=0}^{k−1} [∏_{j=0}^{l} A E(k−j)] h + h    (3)

for all k ≥ 0, where each E(i) ∈ S satisfies σ(x(i)) = E(i) x(i). Clearly, for any (x1(0), x2(0)) ∈ R2, system (1) has a unique solution {(x1(k), x2(k))} starting from (x1(0), x2(0)), and (x1(0), x2(0)) is called the initial value of {(x1(k), x2(k))}. The aim of this paper is to determine the limiting behavior or periodicity of {(x1(k), x2(k))} as k → ∞ for (x1(0), x2(0)) ∈ D1 ∪ D2 ∪ D3 ∪ D4 = R2, where D1 = X++, D2 = X−+, D3 = X−−, D4 = X+−, and X^{s1 s2} = {(x1(0), x2(0)) | x1(0) ∈ R^{s1}, x2(0) ∈ R^{s2}} for s1, s2 ∈ {+, −}, with R+ = {u | u ∈ R and u > 0} and R− = {u | u ∈ R and u ≤ 0}. Throughout this paper, {(x1(k), x2(k))} denotes the solution of system (1) with initial value (x1(0), x2(0)).
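The iteration (1) is straightforward to simulate numerically. The following sketch is illustrative only: the matrix, inputs and initial value are hypothetical, chosen so that h1, h2 ≤ 0 with the initial value in D3, in which case the trajectory immediately settles on (h1, h2) (the convergence case treated in Section 3).

```python
import numpy as np

def sigma(u):
    # unsaturating piecewise linear (linear threshold) activation (2)
    return np.maximum(u, 0.0)

def simulate_lt(A, h, x0, steps):
    """Iterate the discrete-time LT network x(k+1) = A*sigma(x(k)) + h."""
    x = np.asarray(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        x = A @ sigma(x) + h
        traj.append(x.copy())
    return np.array(traj)

# Hypothetical parameters: h1, h2 <= 0 and x(0) in D3
A = np.array([[0.5, 0.1], [0.2, 0.3]])
h = np.array([-1.0, -2.0])
traj = simulate_lt(A, h, x0=[-0.5, -0.3], steps=50)
```

Since x(0) ∈ D3 gives σ(x(0)) = 0, every subsequent state equals h, so the trajectory is constant at (h1, h2) from the first step on.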
Convergence and Periodicity of Solutions for a Class of Discrete-Time
3 Convergence of Solutions
In this section, we consider the convergence of solutions of (1) with σ satisfying (2). We list only several representative results.

Theorem 1. (x1(k), x2(k)) → (h1, h2) as k → ∞ if h1 ≤ 0, h2 ≤ 0 and (x1(0), x2(0)) ∈ D3.

Proof. If (x1(0), x2(0)) ∈ D3, then x1(0) ≤ 0 and x2(0) ≤ 0, and it follows from (1) that

x1(1) = a11 σ(x1(0)) + a12 σ(x2(0)) + h1 = h1 ≤ 0,
x2(1) = a21 σ(x1(0)) + a22 σ(x2(0)) + h2 = h2 ≤ 0.    (4)

Suppose x1(k−1) ≤ 0, x2(k−1) ≤ 0 and h1 ≤ 0, h2 ≤ 0; then we obtain from (1) that

x1(k) = a11 σ(x1(k−1)) + a12 σ(x2(k−1)) + h1 = h1 ≤ 0,
x2(k) = a21 σ(x1(k−1)) + a22 σ(x2(k−1)) + h2 = h2 ≤ 0.    (5)

By induction, for all k ∈ {1, 2, 3, ...}, if (x1(0), x2(0)) ∈ D3 and h1 ≤ 0, h2 ≤ 0, then the solution starting from (x1(0), x2(0)) converges to (h1, h2).

Theorem 2. Suppose λ1, λ2 are the eigenvalues of A. Define

E1 = diag(1, 1), E2 = diag(0, 1), E4 = diag(1, 0).

If λ1 < 1 and λ2 < 1, then (x1(k), x2(k)) → A(I − A)^{-1} Ei h + h as k → ∞ when one of the following three conditions is satisfied:

1) (x1(0), x2(0)) ∈ D1 and T1 = { a11 h1 > 0, a12 h1 > 0, a21 h2 > 0, a22 h2 > 0 };
2) (x1(0), x2(0)) ∈ D2 and T2 = { a12 = a21 = 0, a22 > 0, h1 < 0, h2 > 0 };
3) (x1(0), x2(0)) ∈ D4 and T4 = { a12 = a21 = 0, a11 > 0, h1 > 0, h2 < 0 }.

Proof. Three cases of i are involved in this proof.

Case 1): When (x1(0), x2(0)) ∈ D1 and T1 holds, it follows from (1) that

x1(1) = a11 x1(0) + a12 x2(0) + h1,
x2(1) = a21 x1(0) + a22 x2(0) + h2.    (6)

Consider the following equation in x1 and x2: a11 x1 + a12 x2 + h1 = 0. Let L1 be the straight line in the plane expressing this equation. L1 meets the x1-axis at the point A(−h1/a11, 0) and the x2-axis at the point B(0, −h1/a12). Clearly, L1 divides the whole plane into two regions, P1 and P2, as shown in Fig. 1.
Fig. 1. L1 divides R2 into two regions
For any P(p1, p2) ∈ R2, if T1 holds, then

a11 p1 + a12 p2 + h1 > 0, if P ∈ P1,
a11 p1 + a12 p2 + h1 < 0, if P ∈ P2.    (7)

Note that (x1(0), x2(0)) ∈ D1 and D1 ⊂ P1, so we obtain x1(1) = a11 x1(0) + a12 x2(0) + h1 > 0, and by the same method x2(1) = a21 x1(0) + a22 x2(0) + h2 > 0. This implies (x1(1), x2(1)) ∈ D1. Now assume that (x1(k−1), x2(k−1)) ∈ D1; then (8) follows easily if T1 holds:

x1(k) = a11 x1(k−1) + a12 x2(k−1) + h1 > 0,
x2(k) = a21 x1(k−1) + a22 x2(k−1) + h2 > 0,    (8)

which implies (x1(k), x2(k)) ∈ D1. Thus, for all k ∈ {1, 2, 3, ...}, if (x1(0), x2(0)) ∈ D1 and T1 holds, then (x1(k), x2(k)) ∈ D1.

Case 2): When (x1(0), x2(0)) ∈ D2, it follows from (1) that

x1(1) = a12 x2(0) + h1,
x2(1) = a22 x2(0) + h2.    (9)

If T2 holds, then x1(1) = h1 < 0 and x2(1) = a22 x2(0) + h2 > 0, which implies (x1(1), x2(1)) ∈ D2. Suppose (x1(k−1), x2(k−1)) ∈ D2 and T2 holds; then x1(k−1) < 0 and x2(k−1) > 0, and it follows from (1) that

x1(k) = a12 x2(k−1) + h1 = h1 < 0,
x2(k) = a22 x2(k−1) + h2 > 0,    (10)

which implies (x1(k), x2(k)) ∈ D2. By induction, for all k ∈ {1, 2, 3, ...}, if (x1(0), x2(0)) ∈ D2 and T2 holds, then (x1(k), x2(k)) ∈ D2.

Case 3): When (x1(0), x2(0)) ∈ D4, by a process similar to Case 2), we obtain: for all k ∈ {1, 2, 3, ...}, if (x1(0), x2(0)) ∈ D4 and T4 holds, then (x1(k), x2(k)) ∈ D4.
In summary, for all k ∈ {1, 2, 3, ...}, if (x1(0), x2(0)) ∈ Di and Ti holds, then (x1(k), x2(k)) ∈ Di, for i = 1, 2, 4. Note that A Ei = Ei A if Ti holds, and Ei^k = Ei for all k > 1, for i = 1, 2, 4. From the above conclusion and (3), the trajectory of (1) starting from x(0) ∈ Di can be represented as

x(k+1) = [∏_{j=0}^{k} A Ei] x(0) + Σ_{l=0}^{k−1} [∏_{j=0}^{l} A Ei] h + h
       = A^{k+1} Ei^{k+1} x(0) + Σ_{l=0}^{k−1} A^{l+1} Ei^{l+1} h + h
       = A^{k+1} Ei x(0) + A Σ_{l=0}^{k−1} A^{l} Ei h + h    (11)

for all k ≥ 0, where i = 1, 2, 4. If λ1 < 1 and λ2 < 1, then we obtain

lim_{k→∞} A^{k+1} Ei x(0) = 0,    (12)

lim_{k→∞} A Σ_{l=0}^{k−1} A^{l} Ei h = A (I − A)^{-1} Ei h.    (13)

Applying (12) and (13) to (11), we obtain

lim_{k→∞} x(k+1) = lim_{k→∞} A^{k+1} Ei x(0) + A [lim_{k→∞} Σ_{l=0}^{k−1} A^{l}] Ei h + h = A(I − A)^{-1} Ei h + h    (14)

for all i = 1, 2, 4. This completes the proof.
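The limit in Theorem 2 can be checked numerically. Below is a small sketch for Case 2) with hypothetical parameters satisfying T2 (a12 = a21 = 0, a22 > 0, h1 < 0, h2 > 0) and both eigenvalues of A below 1; the iteration settles on A(I − A)^{-1} E2 h + h.

```python
import numpy as np

sigma = lambda u: np.maximum(u, 0.0)

# Hypothetical parameters satisfying T2, with eigenvalues 0.5, 0.5 of A
A = np.diag([0.5, 0.5])
h = np.array([-1.0, 1.0])
E2 = np.diag([0.0, 1.0])

x = np.array([-0.2, 0.3])          # initial value in D2 (x1 <= 0 < x2)
for _ in range(200):
    x = A @ sigma(x) + h

# predicted limit A(I - A)^{-1} E2 h + h, here (-1, 2)
limit = A @ np.linalg.inv(np.eye(2) - A) @ (E2 @ h) + h
```

In this diagonal case the limit can also be read off directly: x1(k) = h1 after the first step, and x2(k+1) = a22 x2(k) + h2 converges to h2/(1 − a22), which agrees with the formula.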
4 Periodicity of Solutions
In this section we give some results on the periodicity of solutions of (1) with σ satisfying (2).

Theorem 3. If h1 > 0, h2 > 0 and Ah ≤ −h, then the solution {(x1(k), x2(k))} of (1) starting from (x1(0), x2(0)) oscillates between (h1, h2) and (a11 h1 + a12 h2 + h1, a21 h1 + a22 h2 + h2) when (x1(0), x2(0)) ∈ D3.

Proof. If (x1(0), x2(0)) ∈ D3, then x1(0) ≤ 0 and x2(0) ≤ 0, and it follows from (1) that

x1(1) = a11 σ(x1(0)) + a12 σ(x2(0)) + h1 = h1 > 0,
x2(1) = a21 σ(x1(0)) + a22 σ(x2(0)) + h2 = h2 > 0.    (15)

From (15) and (1), it follows that

x1(2) = a11 h1 + a12 h2 + h1,
x2(2) = a21 h1 + a22 h2 + h2.    (16)
If Ah ≤ −h holds, then from (16) we get x1(2) ≤ 0 and x2(2) ≤ 0. When k = 3, it follows from (1) that

x1(3) = a11 σ(x1(2)) + a12 σ(x2(2)) + h1 = h1,
x2(3) = a21 σ(x1(2)) + a22 σ(x2(2)) + h2 = h2.    (17)

Obviously, the solution of system (1) oscillates between (h1, h2) and (a11 h1 + a12 h2 + h1, a21 h1 + a22 h2 + h2) for all k > 0, provided h1 > 0, h2 > 0, Ah ≤ −h and (x1(0), x2(0)) ∈ D3. That is,

{(x1(k), x2(k))} = {(x1(0), x2(0)), (h1, h2), (a11 h1 + a12 h2 + h1, a21 h1 + a22 h2 + h2), (h1, h2), ...}.

This completes the proof of Theorem 3.
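The period-2 behavior in Theorem 3 is easy to reproduce numerically. In the sketch below the matrix and inputs are hypothetical, chosen only so that h1, h2 > 0 and Ah ≤ −h hold; starting in D3, the trajectory then alternates between (h1, h2) and Ah + h.

```python
import numpy as np

sigma = lambda u: np.maximum(u, 0.0)

# Hypothetical parameters with h1, h2 > 0 and Ah <= -h componentwise
A = np.array([[-2.0, 0.0], [0.0, -2.0]])
h = np.array([1.0, 1.0])
assert np.all(A @ h <= -h)

x = np.array([-0.5, -0.3])         # initial value in D3
traj = [x.copy()]
for _ in range(6):
    x = A @ sigma(x) + h
    traj.append(x.copy())
```

Since σ vanishes on D3, every odd step equals h and every even step (after the first) equals Ah + h, giving the 2-periodic orbit of Theorem 3.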
5 Conclusions
Two-neuron networks are often used as prototypes to improve our understanding of the computational performance of larger networks. This paper presents a method to analyze the behavior of the discrete-time LT network with two neurons. Some interesting sufficient conditions for the convergence and periodicity of solutions of the system are obtained.
References

1. Yi, Z., Wang, P.A., Fu, A.W.C.: Estimate of Exponential Convergence Rate and Exponential Stability for Neural Networks. IEEE Trans. Neural Networks 10(6) (1999) 1487-1493
2. Yi, Z., Wang, P.A., Vadakkepat, P.: Absolute Periodicity and Absolute Stability of Delayed Neural Networks. IEEE Trans. Circuits Syst. I 49(2) (2002) 256-261
3. Yi, Z., Tan, K.K.: Dynamic Stability Conditions for Lotka-Volterra Recurrent Neural Networks with Delays. Phys. Rev. E 66(1) (2002) 011910
4. Hahnloser, R.L.T.: On the Piecewise Analysis of Networks of Linear Threshold Neurons. Neural Networks 11 (1998) 691-697
5. Yi, Z., Tan, K.K., Lee, T.H.: Multistability Analysis for Recurrent Neural Networks with Unsaturating Piecewise Linear Transfer Functions. Neural Comput. 15(3) (2003) 639-662
6. Yi, Z., Tan, K.K.: Multistability of Discrete-Time Recurrent Neural Networks with Unsaturating Piecewise Linear Transfer Functions. IEEE Trans. Neural Networks 15(2) (2004) 329-336
7. Tan, K.C., Tang, H.J., Zhang, W.N.: Qualitative Analysis for Recurrent Neural Networks with Linear Threshold Transfer Functions. IEEE Trans. Circuits Syst. I 52(5) (2005) 1003-1012
8. Wei, J.J., Ruan, S.G.: Stability and Bifurcation in a Neural Network Model with Two Delays. Physica D 130(3-4) (1999) 255-272
9. Huang, L.H., Wu, J.: The Role of Threshold in Preventing Delay-Induced Oscillations of Frustrated Neural Networks with McCulloch-Pitts Nonlinearity. Game Theory and Algebra 11(6) (2001) 71-100
10. Gopalsamy, K., Leung, I.: Delay Induced Periodicity in a Neural Network of Excitation and Inhibition. Physica D 89(3-4) (1996) 395-426
Existence and Global Attractability of Almost Periodic Solution for Competitive Neural Networks with Time-Varying Delays and Different Time Scales Wentong Liao1 and Linshan Wang2 1
Department of Mathematics, Ocean University of China, Qingdao 266071, People’s Republic of China
[email protected] 2 Department of Mathematics, Ocean University of China, Qingdao 266071, People’s Republic of China
Abstract. The dynamics of cortical cognitive maps developed by self-organization must include the aspects of long- and short-term memory. The behavior of such a neural network is characterized by an equation of neural activity as a fast phenomenon and an equation of synaptic modification as a slow part of the neural system; moreover, this model is based on an unsupervised synaptic learning algorithm. Taking the effect of time delays into account, we prove the existence, uniqueness and global attraction of the almost periodic solution by using a fixed point theorem and the variation-of-constants formula.
1 Introduction
From the point of view of biology, neural networks possess synapses whose synaptic weights vary in time. Recently, several papers have discussed neural systems with time-varying weights, such as [2-5], but all these papers employ a supervised learning dynamics. In [1], A. Meyer-Baese et al. investigated the dynamics of cortical cognitive maps and studied a network model which includes both the neural activity level, the short-term memory (STM), and the dynamics of unsupervised synaptic modifications, the long-term memory (LTM). Meyer-Baese put forward the following neural model in [1]; this network can be considered an extension of Grossberg's shunting network in [6] or Amari's model for primitive neuronal competition in [7]. Consider the following competitive neural networks (CNN):

STM: ẋi(t) = −ãi xi(t) + Σ_{j=1}^{N} D̃ij f̃(xj(t)) + B̃i si(t),    (1)

LTM: ṡi(t) = −si(t) + f̃(xi(t)), i = 1, 2, ..., N.    (2)

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 297–302, 2006. © Springer-Verlag Berlin Heidelberg 2006
where xi is the current activity level, ãi > 0 is the time constant of the neuron, B̃i is the contribution of the external stimulus term, f̃(xi) is the neuron's output, si(t) is the external stimulus, and D̃ij represents a synaptic connection parameter between the i-th neuron and the j-th neuron. Time delays always exist in CNN networks and may sometimes strongly affect the systems, so it is necessary and meaningful to take the effect of time delays into account. Based on systems (1) and (2), we consider the following model:

ẋi(t) = −ãi xi(t) + Σ_{j=1}^{N} D̃ij f̃j(xj(t − τ(t))) + B̃i si(t − τ(t)),    (3)

ṡi(t) = −si(t) + f̃i(xi(t − τ(t))),    (4)

xi(t) = φi(t), si(t) = φ_{i+N}(t), 0 ≤ τ(t) ≤ τ, i = 1, 2, ..., N.

Let x_{i+N}(t) = si(t), i = 1, 2, ..., N, and define

Cij = D̃ij, if 1 ≤ i, j ≤ N;  B̃i, if 1 ≤ i ≤ N and j = N + i;  1, if N + 1 ≤ i ≤ 2N and i = j + N;  0, else;
bi = ãi, if 1 ≤ i ≤ N;  1, if N + 1 ≤ i ≤ 2N;
fi(x) = f̃i(x), if 1 ≤ i ≤ N;  x, if N + 1 ≤ i ≤ 2N.

Then we can rewrite systems (3) and (4) as the following model:

ẋi(t) = −bi xi(t) + Σ_{j=1}^{2N} Cij fj(xj(t − τ(t))),    (5)

xi(t) = φi(t), t ∈ [−τ, 0], i = 1, 2, ..., 2N,

where 0 ≤ τ(t) ≤ τ and τ(t) is an almost periodic function [8], fi(x) is a continuous function satisfying fi(0) = 0, i = 1, 2, ..., 2N, and Φ = (φ1, φ2, ..., φ2N) ∈ C([−τ, 0], R^{2N}). We let ‖φi(t)‖ = sup_{−τ≤s≤0} |φi(t + s)|, and we always assume that system (5) has a continuous solution, denoted x(t) = col{xi(t)}. We also assume that the following inequalities hold throughout this paper:

|fi(x) − fi(y)| ≤ Mi |x − y|, for all x, y ∈ R,    (6)

k = max_i { (1/bi) Σ_{j=1}^{2N} d_i^{-1} dj |Cij| Mj } < 1, di > 0, i = 1, 2, ..., 2N.    (7)
2 Main Results and Proof
Theorem 1. Assume (6)-(7) hold; then system (5) has a unique almost periodic solution.

Proof. For any x = col{xi} ∈ R^{2N}, let ‖x(t)‖ = max_i {d_i^{-1} |xi(t)|}, and let

A = { x̄(t) = col{x̄i(t)} | x̄(t): R → R^{2N}, x̄(t) is an almost periodic function },

with the induced module ‖x̄‖ = sup_{t∈R} ‖x̄(t)‖ for x̄(t) ∈ A; then (A, ‖·‖) is a Banach space. For any x̄(t) ∈ A, we consider the following system:

ẋi(t) = −bi xi(t) + Σ_{j=1}^{2N} Cij fj(x̄j(t − τ(t))), i = 1, 2, ..., 2N.    (8)

From [8] we know that system (8) has a unique almost periodic solution X_x̄, where

X_x̄ = col{ ∫_{−∞}^{t} e^{−bi(t−s)} [ Σ_{j=1}^{2N} Cij fj(x̄j(s − τ(s))) ] ds }.    (9)

We define the map F: A → A by F(x̄)(t) = X_x̄(t), where x̄ ∈ A. Let B = { x̄ | x̄ ∈ A, ‖x̄‖ ≤ k }; obviously, B is a closed convex subset of A. For any x̄(t) ∈ B, we get

‖F(x̄)(t)‖ = ‖ col{ ∫_{−∞}^{t} e^{−bi(t−s)} [ Σ_{j=1}^{2N} Cij fj(x̄j(s − τ(s))) ] ds } ‖
≤ sup_{t∈R} max_i { d_i^{-1} ∫_{−∞}^{t} e^{−bi(t−s)} [ Σ_{j=1}^{2N} |Cij| Mj |x̄j(s − τ(s))| ] ds }
= sup_{t∈R} max_i { d_i^{-1} ∫_{−∞}^{t} e^{−bi(t−s)} [ Σ_{j=1}^{2N} |Cij| Mj dj (d_j^{-1} |x̄j(s − τ(s))|) ] ds }    (10)
≤ sup_{t∈R} max_i { d_i^{-1} ∫_{−∞}^{t} e^{−bi(t−s)} [ Σ_{j=1}^{2N} |Cij| Mj dj ‖x̄‖ ] ds }
= max_i { (d_i^{-1}/bi) Σ_{j=1}^{2N} |Cij| Mj dj ‖x̄‖ } ≤ k² < k.    (11)

Then we conclude F(x̄) ∈ B. Now we prove the uniqueness of the almost periodic solution. Let x̃ ∈ B; for any x̄ ∈ B, we obtain

‖F(x̄) − F(x̃)‖ = sup_{t∈R} ‖F(x̄)(t) − F(x̃)(t)‖
≤ sup_{t∈R} max_i { d_i^{-1} ∫_{−∞}^{t} e^{−bi(t−s)} Σ_{j=1}^{2N} |Cij| |fj(x̄j(s − τ(s))) − fj(x̃j(s − τ(s)))| ds }
≤ sup_{t∈R} max_i { d_i^{-1} ∫_{−∞}^{t} e^{−bi(t−s)} Σ_{j=1}^{2N} |Cij| Mj |x̄j(s − τ(s)) − x̃j(s − τ(s))| ds }
≤ sup_{t∈R} max_i { d_i^{-1} ∫_{−∞}^{t} e^{−bi(t−s)} Σ_{j=1}^{2N} |Cij| Mj dj ‖x̄ − x̃‖ ds }
≤ max_i { (1/bi) Σ_{j=1}^{2N} d_i^{-1} |Cij| Mj dj } ‖x̄ − x̃‖ ≤ k ‖x̄ − x̃‖.    (12)
Since k < 1, F is a contraction mapping on B. By the Banach fixed point theorem, F has a unique fixed point x*, which is the unique almost periodic solution of system (5). Suppose x* is the unique almost periodic solution of system (5); for any other solution x(t) = (x1, x2, ..., x2N), we have

ẋi(t) − ẋ*i(t) = −bi (xi(t) − x*i(t)) + Σ_{j=1}^{2N} Cij [fj(xj(t − τ(t))) − fj(x*j(t − τ(t)))].

Let yi(t) = xi(t) − x*i(t), gj(yj(t − τ(t))) = fj(yj(t − τ(t)) + x*j(t − τ(t))) − fj(x*j(t − τ(t))), and ϕi(t) = φi(t) − x*i(t), t ∈ [−τ, 0], i, j = 1, 2, ..., 2N. From (6) we get |gi(y)| ≤ Mi |y| for any y ∈ R, and we can rewrite the above system as

ẏi(t) = −bi yi(t) + Σ_{j=1}^{2N} Cij gj(yj(t − τ(t))),    (13)

yi(t) = ϕi(t), t ∈ [−τ, 0], i = 1, 2, ..., 2N.

Obviously, the global attraction of x* is equivalent to the global attraction of the solution y = 0 of system (13); thus, we only consider the global attraction of y = 0 for system (13). Similarly to [9], we obtain the following conclusion.

Theorem 2. Assume (6)-(7) hold; then for any ϕ(t) = (ϕ1(t), ϕ2(t), ..., ϕ2N(t)), there exists a constant V > 0 such that the solution of system (13) satisfies ‖y(t)‖ < V for any t ≥ 0.

Proof. For any ϕ ∈ C([−τ, 0], R^{2N}), by (7) there exists a constant V > 0 which satisfies ‖ϕ‖ < V and (1 − k)V > 0. If ‖y(t)‖ < V does not hold, then there must be a constant t0 such that ‖y(t0)‖ = V and ‖y(t)‖ < V for 0 ≤ t < t0. By the variation-of-constants formula applied to (13), we get

‖y(t0)‖ ≤ max_i { e^{−bi t0} d_i^{-1} |ϕi(0)| + ∫_0^{t0} e^{−bi(t0−s)} d_i^{-1} Σ_{j=1}^{2N} |Cij| Mj dj (d_j^{-1} |yj(s − τ(s))|) ds }
≤ max_i { e^{−bi t0} V + ∫_0^{t0} e^{−bi(t0−s)} Σ_{j=1}^{2N} d_i^{-1} |Cij| Mj dj V ds }
≤ max_i { e^{−bi t0} V + (1 − e^{−bi t0}) k V } < V.    (14)
This contradicts ‖y(t0)‖ = V; thus the conclusion holds.

Theorem 3. Assume (6)-(7) hold; then the solution y(t) = 0 of system (13) is globally attracting.

Proof. Let lim sup_{t→∞} ‖y(t)‖ = β ≥ 0; we now prove that β = 0. By the variation-of-constants formula together with |gi(y)| ≤ Mi |y|, we have

|yi(t)| ≤ e^{−bi t} |ϕi(0)| + ∫_0^{t} e^{−bi(t−s)} Σ_{j=1}^{2N} |Cij| Mj |yj(s − τ(s))| ds.    (15)

From Theorem 2 we conclude ‖ϕ‖ < V and ‖y(t)‖ < V for any t ≥ 0. Since bi > 0, for any ε > 0 there exists a constant T > 0 such that ∫_T^{∞} e^{−bi s} Σ_{j=1}^{2N} d_i^{-1} dj |Cij| Mj V ds ≤ ε, i = 1, 2, ..., 2N. Since lim sup_{t→∞} ‖y(t)‖ = β, there exists a constant T0 ≥ 0 such that ‖y(t − τ(t))‖ ≤ β for any t ≥ T0. Now assume t ≥ T0 + T; from (15) we have

d_i^{-1} |yi(t)| ≤ e^{−bi t} d_i^{-1} |ϕi(0)| + ∫_0^{t} e^{−bi(t−s)} Σ_{j=1}^{2N} d_i^{-1} |Cij| Mj dj (d_j^{-1} |yj(s − τ(s))|) ds
< e^{−bi t} V + ∫_0^{t−T} e^{−bi(t−s)} Σ_{j=1}^{2N} d_i^{-1} |Cij| Mj dj (d_j^{-1} |yj(s − τ(s))|) ds + ∫_{t−T}^{t} e^{−bi(t−s)} Σ_{j=1}^{2N} d_i^{-1} |Cij| Mj dj (d_j^{-1} |yj(s − τ(s))|) ds
≤ e^{−bi t} V + ∫_T^{t} e^{−bi s} Σ_{j=1}^{2N} d_i^{-1} |Cij| Mj dj V ds + ∫_{t−T}^{t} e^{−bi(t−s)} Σ_{j=1}^{2N} d_i^{-1} |Cij| Mj dj β ds
≤ e^{−bi t} V + ε + (1 − e^{−bi T}) Σ_{j=1}^{2N} b_i^{-1} d_i^{-1} |Cij| Mj dj β.    (16)

From (16) we conclude

‖y(t)‖ = max_i (d_i^{-1} |yi(t)|) ≤ max_i [ e^{−bi t} V + ε + (1 − e^{−bi T}) Σ_{j=1}^{2N} b_i^{-1} d_i^{-1} |Cij| Mj dj β ]
≤ max_i [ e^{−bi t} V + ε + (1 − e^{−bi T}) k β ]
< max_i [ e^{−bi t} V + ε + k β ].    (17)

Letting t → ∞ and ε → 0 gives β ≤ kβ. If β ≠ 0, then k ≥ 1, which contradicts (7); thus β = 0, and hence lim_{t→∞} yi(t) = 0, i = 1, 2, ..., 2N. From this result, it is easy to prove that the almost periodic solution x* of system (5) is globally attracting.
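The attraction result can be observed numerically with a crude explicit Euler scheme for a constant delay. Everything in this sketch is hypothetical: f is taken as tanh (Lipschitz with Mi = 1 and f(0) = 0), the parameters are chosen so that k = 0.3 < 1 in (7), and with zero input the unique almost periodic solution is the trivial one, so two different initial functions are driven together toward zero.

```python
import numpy as np

dt, tau, steps = 0.01, 1.0, 20000
lag = int(tau / dt)

b = np.array([1.0, 1.0])                  # decay rates b_i > 0
C = np.array([[0.2, 0.1], [0.1, 0.2]])    # gives k = 0.3 < 1 in condition (7)
f = np.tanh                               # Lipschitz, f(0) = 0

def run(x0):
    # constant initial function on [-tau, 0], then Euler steps of system (5)
    buf = [np.asarray(x0, dtype=float)] * (lag + 1)
    for _ in range(steps):
        x, x_delayed = buf[-1], buf[-1 - lag]
        buf.append(x + dt * (-b * x + C @ f(x_delayed)))
    return buf[-1]

xa, xb = run([1.5, -0.8]), run([-0.4, 0.9])
```

After integrating to t = 200, both trajectories are numerically indistinguishable from the attracting solution, in line with Theorem 3.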
Acknowledgement This paper was supported by the National Natural Science Foundation of China under Grant 10171072.
References

1. Meyer-Baese, A., Pilyugin, S.S., Chen, Y.: Global Exponential Stability of Competitive Neural Networks with Different Time Scales. IEEE Trans. Neural Networks 14(5) (2003) 716-719
2. Jin, L., Gupta, M.: Stable Dynamic Backpropagation Learning in Recurrent Neural Networks. IEEE Trans. Neural Networks 10(11) (1999) 1321-1334
3. Galicki, M., Leistitz, L., Witte, H.: Learning Continuous Trajectories in Recurrent Neural Networks with Time-Dependent Weights. IEEE Trans. Neural Networks 10(5) (1999) 741-756
4. Suykens, J., Moor, B., Vandewalle, J.: Robust Local Stability of Multilayer Recurrent Neural Networks. IEEE Trans. Neural Networks 11(1) (2000) 222-229
5. Wang, L.S., Xu, D.Y.: Stability Analysis of Hopfield Neural Networks with Time Delay. Applied Mathematics and Mechanics 23 (2002) 250-252
6. Grossberg, S.: Competition, Decision and Consensus. J. Math. Anal. Applicat. 66 (1978) 470-493
7. Amari, S.: Competitive and Cooperative Aspects in Dynamics of Neural Excitation and Self-Organization. Competition Cooperation Neural Networks 20 (1982) 1-28
8. He, C.Y.: Almost Periodic Differential Equations. Higher Education Publishing House, Beijing (1992)
9. Zhao, H.Y., Wang, G.L.: Existence and Global Attractivity of Almost Periodic Solutions for Hopfield Neural Networks with Variable Delay. Acta Mathematica Scientia A 24(6) (2004) 723-726
10. Lu, W., Chen, T.: Global Exponential Stability of Almost Periodic Trajectory of a Large Class of Delayed Dynamical Systems. Science in China A: Mathematics (2005) 1015-1026
11. Cao, J.D., Chen, A.P., Huang, X.: Almost Periodic Attractor of Delayed Neural Networks with Variable Coefficients. Physics Letters A 340 (2005) 104-120
Global Synchronization of Impulsive Coupled Delayed Neural Networks Jin Zhou1 , Tianping Chen2 , Lan Xiang3 , and Meichun Liu3 1
Shanghai Institute of Applied Mathematics and Mechanics, Shanghai University, Shanghai, 200072, P.R. China 2 Laboratory of Nonlinear Science, Institute of Mathematics, Fudan University, Shanghai, 200433, P.R. China {Jinzhou, Tchen}@fudan.edu.cn 3 Department of Applied Mathematics and Physics, Hebei University of Technology, Tianjin, 300130, P.R. China {Xianglanhtu, Mcliu2005}@126.com.cn
Abstract. This paper formulates and studies a model of impulsive coupled delayed neural networks. Based on the stability theory of impulsive dynamical systems, a simple but less conservative criterion is derived for global synchronization of such coupled neural networks. Furthermore, the theoretical result is applied to a typical chaotic delayed Hopfield neural network, and is also illustrated by numerical simulations.
1 Introduction
Recently, there has been increasing interest in the study of the synchronization dynamics of coupled neural networks, because they can exhibit many interesting phenomena such as spatiotemporal chaos, auto waves, spiral waves and others. In addition, it has been found that synchronization of coupled neural networks has potential applications in many fields including secure communication, parallel image processing, biological systems, information science, etc. (see [1, 2, 3, 4, 5, 6, 7]). In many evolutionary systems there are two common phenomena: delay effects and impulsive effects [4, 5, 8, 9]. For example, a signal or influence travelling through a network is often associated with time delays, which is very common in biological and physical systems. On the other hand, many evolutionary processes, particularly some biological systems such as biological neural networks and bursting rhythm models in pathology, as well as frequency-modulated signal processing systems and flying object motions, are characterized by abrupt changes of state at certain time instants. This is the familiar impulsive phenomenon. Therefore, the investigation of synchronization dynamics for impulsive coupled delayed neural networks is an important step for the practical design and application of neural networks. This paper formulates and studies a model of impulsive coupled delayed neural networks. Based on the stability theory of impulsive dynamical systems, a simple but less conservative criterion is derived for global synchronization of the impulsive J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 303–308, 2006. © Springer-Verlag Berlin Heidelberg 2006
coupled neural networks. It is shown that global synchronization of impulsive coupled delayed neural networks is dependent on both the time delays and the impulsive effects in the coupled neural networks.
2 Model and Preliminaries
Consider an isolated delayed neural network described by the following set of differential equations with time delays [3]:

ẋ(t) = −Cx(t) + Af(x(t)) + Aτ g(x(t − τ)) + I(t),    (1)

where x(t) = (x1(t), ..., xn(t))', C = diag(c1, ..., cn) with cr > 0, A = (a0rs)_{n×n}, Aτ = (aτrs)_{n×n}, I(t) = (I1(t), ..., In(t))', f(x(t)) = [f1(x1(t)), ..., fn(xn(t))]' and g(x(t)) = [g1(x1(t)), ..., gn(xn(t))]'. Here we assume that the activation functions fr(x) and gr(x) are globally Lipschitz continuous, i.e.:

(A1) There exist constants kr > 0, lr > 0, r = 1, 2, ..., n, such that for any two different x1, x2 ∈ R,

0 ≤ (fr(x1) − fr(x2)) / (x1 − x2) ≤ kr, |gr(x1) − gr(x2)| ≤ lr |x1 − x2|, r = 1, 2, ..., n.

Based on the structure of the coupled delayed neural networks in [3], we formulate an array of N linearly coupled delayed neural networks with impulsive effects described by the following measure differential equation:

Dxi(t) = −Cxi(t) + Af(xi(t)) + Aτ g(xi(t − τ)) + I(t) + Σ_{j=1}^{N} bij Γ xj(t) Dwj(t), i = 1, 2, ..., N,    (2)

in which xi(t) = (xi1(t), ..., xin(t))' is the state of the ith delayed neural network, the operator D denotes the distributional derivative, the bounded-variation functions wi: [t0, +∞) → R are right-continuous on any compact subinterval of [t0, +∞), and Dwi depicts the impulsive effects of the connecting configuration in the coupled networks. Without loss of generality, we assume that

Dwj = 1 + Σ_{k=1}^{+∞} uk δ(t − tk), j = 1, 2, ..., N,    (3)

where the fixed impulsive moments tk satisfy tk−1 < tk and lim_{k→∞} tk = +∞, δ(t) is the Dirac function, and uk represents the strength of the impulsive effects of the connection between the jth neural network and the ith neural network at time tk. For simplicity, we further assume that the inner connecting matrix Γ = diag(γ1, ..., γn), and the coupling matrix B = (bij)_{N×N} is a Laplacian matrix, i.e., a symmetric irreducible matrix with zero row sums and real spectrum. This implies that zero is an eigenvalue of B with multiplicity 1 and all the other eigenvalues
of B are strictly negative. The initial conditions of Eq. (2) are given by xi(t) = φi(t) ∈ PC([t0 − τ, t0], R^n), where PC([t0 − τ, t0], R^n) denotes the set of all functions of bounded variation that are right-continuous on any compact subinterval of [t0 − τ, t0]. We always assume that Eq. (2) has a unique solution with respect to the initial conditions. Clearly, if uk = 0, then the model (2) becomes the well-known continuous coupled delayed neural networks [3]. Now, some definitions with respect to synchronization of impulsive coupled delayed neural networks and the famous Halanay differential inequality for impulsive delay differential inequalities are listed as follows:

Definition 1. The hyperplane Λ = {(x1', ..., xN')' ∈ R^{n×N} : xi = xj, i, j = 1, 2, ..., N} is said to be the synchronization manifold of the impulsive coupled delayed neural network (2).

Definition 2. The synchronization manifold Λ is said to be globally exponentially stable (equivalently, the impulsive coupled delayed neural network (2) is globally exponentially synchronized) if there exist constants ε > 0 and M > 0 such that, for all φi(t) ∈ PC([t0 − τ, t0], R^n) and i, j = 1, 2, ..., N,

‖xi(t) − xj(t)‖ ≤ M e^{−ε(t−t0)}.    (4)

Lemma 1 [8, 9]. Suppose p > q ≥ 0 and u(t) satisfies the scalar impulsive delay differential inequality

D⁺u(t) ≤ −p u(t) + q (sup_{t−τ≤s≤t} u(s)), t ≠ tk, t ≥ t0,
u(tk) ≤ αk u(tk⁻),
u(t) = φ(t), t ∈ [t0 − τ, t0],    (5)

where u(t) is continuous at t ≠ tk, t ≥ t0, u(tk) = u(tk⁺) = lim_{s→0⁺} u(tk + s), u(tk⁻) = lim_{s→0⁻} u(tk + s) exists, and φ ∈ PC([t0 − τ, t0], R). Then

u(t) ≤ (∏_{t0 < tk ≤ t} θk) e^{−μ(t−t0)} (sup_{t0−τ≤s≤t0} φ(s)),    (6)

where θk = max{1, |αk|} and μ > 0 is a solution of the inequality μ − p + q e^{μτ} ≤ 0.
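The decay exponent μ in Lemma 1 is defined only implicitly by μ − p + q e^{μτ} ≤ 0; since the left-hand side is increasing in μ and negative at μ = 0 (because p > q), the largest admissible μ can be found by bisection. A minimal sketch follows; the function name and parameter values are illustrative, not from the paper.

```python
import math

def impulsive_decay_rate(p, q, tau, iters=200):
    """Find mu > 0 with mu - p + q*exp(mu*tau) = 0 by bisection (cf. Lemma 1).

    Assumes p > q > 0, so h(0) = q - p < 0 and h(p) = q*exp(p*tau) > 0.
    """
    h = lambda mu: mu - p + q * math.exp(mu * tau)
    lo, hi = 0.0, p
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if h(mid) <= 0.0:
            lo = mid        # mid still satisfies the inequality
        else:
            hi = mid
    return lo

mu = impulsive_decay_rate(2.0, 1.0, 1.0)   # illustrative p, q, tau
```

Any μ in (0, lo] satisfies the inequality of Lemma 1, so the returned value is the sharpest exponential rate the lemma certifies for these p, q, τ.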
3 Synchronization Criteria
Theorem 1. Consider the impulsive coupled delayed neural networks (2). Let the eigenvalues of the coupling matrix B be ordered as 0 = λ1 > λ2 ≥ λ3 ≥ ··· ≥ λN. Assume that, in addition to (A1), the following conditions are satisfied for all i = 1, 2, ..., n and k ∈ Z⁺ = {1, 2, ..., ∞}:

(A2) There exist n positive numbers δ1, ..., δn and two numbers

pi = δi + ci − (a0ii)⁺ ki − (1/2) Σ_{j≠i} (|a0ij| kj + |a0ji| ki) − (1/2) Σ_{j=1}^{n} |aτij| lj,
qi = (1/2) Σ_{j=1}^{n} |aτji| li,

such that p = min_{1≤i≤n}{2pi} > q = max_{1≤i≤n}{2qi} and γi λ(γi) + δi ≤ 0, where (a0ii)⁺ = max{a0ii, 0} and

λ(γi) = λ2, if γi > 0;  0, if γi = 0;  λN, if γi < 0.

(A3) Let μ > 0 satisfy μ − p + q e^{μτ} ≤ 0, and let

θk = max{ 1, 1 / (1 − uk γj λ(uk γj)) }, θ = sup_{k∈Z⁺} (ln θk) / (tk − tk−1),

such that uk γj λ(uk γj) < 1 and θ < μ. Then the impulsive coupled delayed neural network (2) is globally exponentially synchronized.

Brief Proof. Let s(t) = (1/N) Σ_{k=1}^{N} xk(t) and vi(t) = xi(t) − s(t) (i = 1, 2, ..., N). Then Eq. (2) can be rewritten as

Dvi(t) = −C vi(t) + A [f(xi(t)) − f(s(t))] + Aτ [g(xi(t − τ)) − g(s(t − τ))] + Σ_{j=1}^{N} bij Γ vj(t) Dwj(t) + J, i = 1, ..., N,    (7)

where J = Af(s(t)) + Aτ g(s(t − τ)) − (1/N) Σ_{k=1}^{N} [Af(xk(t)) + Aτ g(xk(t − τ))]. Let us construct the Lyapunov function

V(t) = (1/2) Σ_{i=1}^{N} vi'(t) vi(t).    (8)

From condition (A1), and noting that Σ_{i=1}^{N} vi(t) = 0, we get for t ≠ tk

D⁺V(t) ≤ Σ_{i=1}^{N} Σ_{r=1}^{n} { [ −δr − cr + (a0rr)⁺ kr + (1/2) Σ_{s≠r} (|a0rs| ks + |a0sr| kr) + (1/2) Σ_{s=1}^{n} |aτrs| ls ] v²ir(t) + (1/2) Σ_{s=1}^{n} |aτsr| lr v²ir(t − τ) } + Σ_{i=1}^{N} vi'(t) [ Σ_{j=1}^{N} bij Γ vj(t) + diag(δ1, ..., δn) vi(t) ]
≤ −p V(t) + q V(t − τ) + Σ_{j=1}^{n} v̄j'(t) (γj B + δj I_N) v̄j(t),    (9)

where v̄j(t) = (v̄1j(t), ..., v̄Nj(t))' ∈ L := { z = (z1, ..., zN)' ∈ R^N | Σ_{i=1}^{N} zi = 0 }, from which it can be concluded that if γj λ(γj) + δj ≤ 0, then Σ_{j=1}^{n} v̄j'(t) (γj B + δj I_N) v̄j(t) ≤ 0. This leads to

D⁺V(t) ≤ −p V(t) + q (sup_{t−τ≤s≤t} V(s)).    (10)
On the other hand, from (7) and (8), and by using the properties of the Dirac measure, we have

V(tk) = V(tk⁻) + Σ_{j=1}^{n} v̄j'(t) uk γj B v̄j(t) ≤ V(tk⁻) + uk γj λ(uk γj) V(tk),    (11)

which implies that if uk γj λ(uk γj) < 1 and k ∈ Z⁺, then

V(tk) ≤ [ 1 / (1 − uk γj λ(uk γj)) ] V(tk⁻).    (12)

It follows from Lemma 1 that if θ < μ, then for all t > t0

V(t) ≤ e^{−(μ−θ)(t−t0)} (sup_{t0−τ≤s≤t0} V(s)).    (13)

This completes the proof of Theorem 1.

Remark 1. It can be seen from (A2) and (A3) that global synchronization of the impulsive coupled delayed neural networks (2) depends not only on the coupling matrix B, the inner connecting matrix Γ and the time delay τ, but also on the strengths θk of the impulsive effects and the impulsive intervals tk − tk−1.

Example 1. Consider a model of impulsive coupled delayed neural networks (2), where xi(t) = (xi1(t), xi2(t))', f(xi(t)) = g(xi(t)) = (tanh(xi1(t)), tanh(xi2(t)))', I(t) = (0, 0)' (i = 1, 2, 3),

C = [1 0; 0 1], A = [2.0 −0.1; −5.0 3.0], Aτ = [−1.5 −0.1; −0.2 −2.5].

It should be noted that the isolated neural network ẋ(t) = −Cx(t) + Af(x(t)) + Aτ g(x(t − 1)) is actually a chaotic delayed Hopfield neural network [4, 5] (see Fig. 1). Let

B = [−8 2 6; 2 −4 2; 6 2 −8]

and Γ = diag(γ1, γ2); then B has eigenvalues 0, −6 and −14. For simplicity, we consider the equidistant impulsive interval tk − tk−1 = 0.01 and uk = −0.0005 (k ∈ Z⁺). Taking kr = lr = 1 and δr = 1/2 (r = 1, 2), it is easy to verify that if γ1 = γ2 = 2, then all the conditions of Theorem 1 are satisfied. Hence, the impulsive coupled delayed neural networks (2) will achieve global synchronization. The corresponding simulation result is shown in Fig. 2.
4 Conclusions

In this paper, a general model of a system consisting of linearly and diffusively impulsive coupled delayed neural networks has been formulated and its synchronization dynamics have been studied. A simple criterion for global synchronization of such networks has been derived analytically. It is shown that the theoretical results can be applied to some typical chaotic neural networks such as delayed Hopfield neural networks and delayed cellular neural networks (CNN).
Fig. 1. A fully developed double-scroll-like chaotic attractor of the isolated delayed Hopfield neural network

Fig. 2. Global synchronization of the impulsive coupled delayed neural networks (2) in the time interval [0, 5]
Acknowledgements This work was supported by the National Science Foundation of China (Grant no. 60474071), the China Postdoctoral Science Foundation (Grant no. 20040350121) and the Science Foundation of Education Commission of Hebei Province (Grant no. 2003013).
References
1. Chen, G., Dong, X.: From Chaos to Order: Methodologies, Perspectives, and Applications. World Scientific Pub. Co, Singapore (1998)
2. Wu, C.W., Chua, L.O.: Synchronization in an Array of Linearly Coupled Dynamical Systems. IEEE Trans. CAS-I 42 (1995) 430-447
3. Chen, G., Zhou, J., Liu, Z.: Global Synchronization of Coupled Delayed Neural Networks and Applications to Chaotic CNN Model. Int. J. Bifur. Chaos 14 (2004) 2229-2240
4. Zhou, J., Chen, T., Xiang, L.: Robust Synchronization of Delayed Neural Networks Based on Adaptive Control and Parameters Identification. Chaos, Solitons, Fractals 27 (2006) 905-913
5. Zhou, J., Chen, T., Xiang, L.: Chaotic Lag Synchronization of Coupled Delayed Neural Networks and Its Applications in Secure Communication. Circuits, Systems and Signal Processing 24 (2005) 599-613
6. Zhou, J., Chen, T., Xiang, L.: Robust Synchronization of Coupled Delayed Recurrent Neural Networks. In: Yin, F., Wang, J., Guo, C. (eds.): Advances in Neural Networks - ISNN 2004. Lecture Notes in Computer Science, Vol. 3173. Springer-Verlag, Berlin Heidelberg New York (2004) 144-149
7. Zhou, J., Chen, T., Xiang, L.: Adaptive Synchronization of Delayed Neural Networks Based on Parameters Identification. In: Wang, J., Liao, X., Yi, Z. (eds.): Advances in Neural Networks - ISNN 2005. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag, Berlin Heidelberg New York (2005) 308-313
8. Yang, Z., Xu, D.: Stability Analysis of Delay Neural Networks with Impulsive Effects. IEEE Trans. CAS-II 52 (2005) 517-521
9. Yang, T.: Impulsive Control Theory. Springer-Verlag, Berlin Heidelberg New York (2001)
Synchronization of a Class of Coupled Discrete Recurrent Neural Networks with Time Delay

Ping Li

Computational Intelligence Laboratory, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, P.R. China
[email protected] http://cilab.uestc.edu.cn
Abstract. This paper studies the synchronization of a class of coupled discrete recurrent neural networks with time delay. Local and global synchronization conditions are obtained by using Lyapunov functionals and linear matrix inequality methods.
1 Introduction
Recently, there has been increasing activity in the study of arrays of coupled systems in many research fields [1, 2, 4]. In fact, arrays of coupled systems not only exhibit many interesting properties, such as synchronization [5, 6, 11] and autowaves [7], but can also be used in many applications, such as image processing [7], secure communication [8], and pattern storage. Therefore, the study of the synchronization of coupled neural networks is significant for both theoretical research and practical use. In [9], the authors investigated synchronization of arrays of continuous delayed neural networks with linear coupling. The model of coupled neural networks studied in this paper differs from that in [9]: it does not contain linear terms and is discrete. The convergence of the continuous counterpart of this kind of neural network has been well discussed in [10]. The model studied in this paper can be written as

xi(k + 1) = Σ_{j=1}^{n} [aij gj(xj(k)) + bij gj(xj(k − h))] + Ii(k), i = 1, 2, ..., n,   (1)

for k ≥ 0, where xi(k) is the state of the ith neuron, aij and bij are constant intra-connection weights, the delay satisfies h ≥ 0, gj(xj(k)) denotes the output of the jth neuron at time k, gj(xj(k − h)) denotes the output of the jth neuron at time k − h, and Ii(k) is the input to the ith neuron. It can also be rewritten in the compact form

x(k + 1) = Ag(x(k)) + Bg(x(k − h)) + I(k).   (2)
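The equivalence of the componentwise form (1) and the compact form (2) is easy to check numerically; the sketch below uses arbitrary test data (the weights and inputs are illustrative, not from the paper).

```python
import numpy as np

# Check that the compact form x(k+1) = A g(x(k)) + B g(x(k-h)) + I(k)
# agrees with the componentwise sum in (1).
rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
g = np.tanh
x_k = rng.standard_normal(n)       # x(k)
x_kh = rng.standard_normal(n)      # x(k - h)
I_k = rng.standard_normal(n)       # input I(k)

# componentwise form (1)
x_next_loop = np.array([
    sum(A[i, j] * g(x_k[j]) + B[i, j] * g(x_kh[j]) for j in range(n)) + I_k[i]
    for i in range(n)
])

# compact form (2)
x_next_mat = A @ g(x_k) + B @ g(x_kh) + I_k
print(np.max(np.abs(x_next_loop - x_next_mat)))
```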
This work was supported by National Science Foundation of China under Grant 60471055 and Specialized Research Fund for the Doctoral Program of Higher Education under Grant 20040614017.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 309–315, 2006. c Springer-Verlag Berlin Heidelberg 2006
where x(k) = (x1(k), x2(k), ..., xn(k))ᵀ ∈ Rⁿ, A = (aij)n×n, B = (bij)n×n, g(x(k)) = [g1(x1(k)), ..., gn(xn(k))]ᵀ, g(x(k − h)) = [g1(x1(k − h)), ..., gn(xn(k − h))]ᵀ, and I(k) = (I1(k), I2(k), ..., In(k))ᵀ. An array of N such discrete recurrent neural networks connected to each other can be described as

xi(k + 1) = Ag(xi(k)) + Bg(xi(k − h)) + I(k) + Σ_{j=1}^{N} cij Γ xj(k), i = 1, ..., N,   (3)
where xi(k) = (xi1(k), ..., xin(k))ᵀ ∈ Rⁿ, xip denotes the pth neuron of the ith neural network, C = (cij)N×N describes the topology of the coupled networks (which can be small-world or scale-free), and Γ = diag(γ1, ..., γn). In this paper, we study the local and global stability of the synchronization manifold for arrays of discrete recurrent neural networks with diffusive coupling and give several sufficient conditions for driving the networks to synchronization by an LMI approach. The rest of this paper is organized as follows. In Section 2, some necessary definitions and lemmas are given; in Section 3, we give local and global synchronization results that can be easily used in practice.
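A one-step iteration of the coupled model (3) can be sketched as follows. The weights, coupling matrix, and sizes are illustrative assumptions; the check at the end exercises a basic property of diffusive coupling: if all subsystems start on the synchronization manifold (identical histories), the zero-row-sum coupling term vanishes and the states stay identical.

```python
import numpy as np

# Iterate the coupled discrete model (3) with illustrative weights and a
# symmetric, zero-row-sum coupling matrix C (as in condition A1 below).
rng = np.random.default_rng(1)
n, N, h = 3, 4, 2
A = 0.3 * rng.standard_normal((n, n))
B = 0.2 * rng.standard_normal((n, n))
Gamma = np.eye(n)
C = np.array([[-2., 1., 1., 0.],
              [1., -2., 0., 1.],
              [1., 0., -2., 1.],
              [0., 1., 1., -2.]])   # symmetric, zero row sums
g = np.tanh
I = np.zeros(n)

def step(history):
    """history[-1] = x(k), history[-1-h] = x(k-h); each entry is (N, n)."""
    x_k, x_kh = history[-1], history[-1 - h]
    new = np.empty_like(x_k)
    for i in range(N):
        new[i] = A @ g(x_k[i]) + B @ g(x_kh[i]) + I \
                 + sum(C[i, j] * (Gamma @ x_k[j]) for j in range(N))
    return new

# identical initial histories for all nodes: a point on the manifold S
x0 = rng.standard_normal(n)
history = [np.tile(x0, (N, 1)) for _ in range(h + 1)]
for _ in range(50):
    history.append(step(history))

final = history[-1]
print(np.max(np.abs(final - final[0])))
```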
2 Preliminaries
In what follows, we present the definitions and lemmas that are needed in this paper.

Definition 1. The set S = {x = (x1(k)ᵀ, x2(k)ᵀ, ..., xN(k)ᵀ)ᵀ : xi(k) = xj(k), i, j = 1, ..., N} is called the synchronization manifold.

To measure the distance between the various cells, we use the definitions introduced in [6].

Definition 2. M1(k) is the set of matrices M with entries in Fk such that each row of M contains zeros and exactly one αIk and one −αIk for some nonzero α, where Fk = {αIk : α ∈ R} is the subfield of n × n matrices and Ik denotes the real identity matrix.

Definition 3. M2(k) is the set of matrices M in M1(k) such that for any pair of indices i and j there exist indices i1, i2, ..., il with i1 = i and il = j such that for all 1 ≤ q < l, Mp,iq ≠ 0 and Mp,iq+1 ≠ 0 for some p.

Definition 4. An irreducible matrix A = (aij)n×n is said to satisfy condition A1 if aij ≥ 0 for i ≠ j and aii = −Σ_{j≠i} aij.

Next, we give some lemmas.

Lemma 1. If A is a matrix satisfying condition A1, then the following hold: 1. A has eigenvalue 0 with multiplicity 1 and corresponding eigenvector [1, 1, ..., 1]ᵀ; 2. all nonzero eigenvalues of A are negative.

Lemma 2 [6]. Let x = (x1, x2, ..., xN)ᵀ, where xi ∈ Rⁿ, i = 1, 2, ..., N. Then x ∈ S if and only if Mx = 0 (4) holds for some M ∈ M2(n).
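Lemma 1 can be checked numerically on a concrete matrix satisfying condition A1; the symmetric matrix below is an illustrative example.

```python
import numpy as np

# Numeric check of Lemma 1: a matrix with nonnegative off-diagonal entries and
# zero row sums (condition A1) has a simple zero eigenvalue with eigenvector
# [1, ..., 1]^T, and its other eigenvalues are negative.
A1 = np.array([[-8., 2., 6.],
               [2., -4., 2.],
               [6., 2., -8.]])
eigvals = np.linalg.eigvalsh(A1)     # ascending order for symmetric input
print(eigvals)
```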
3 Synchronization Analysis
In this section, we mainly use the Lyapunov method and the linear matrix inequality approach to obtain asymptotic synchronization of the above dynamical systems. By Definition 1, we denote by s(k) the synchronization state, such that

x1(k) = x2(k) = · · · = xN(k) = s(k), k → ∞.   (5)

It is clear that the synchronization state satisfies the equation

s(k + 1) = Ag(s(k)) + Bg(s(k − h)) + I(k).   (6)
Firstly, we investigate local stability of the synchronization manifold S for the coupled system (3).

Theorem 1. Suppose that

0 = λ1 > λ2 ≥ λ3 ≥ ... ≥ λN   (7)

are the N eigenvalues of the symmetric matrix C satisfying condition A1. If the systems

w(k + 1) = (A′ + λi Γ) w(k) + B′ w(k − h)   (8)

are asymptotically stable for the N − 1 nonzero eigenvalues λi, then system (3) is asymptotically synchronized, where A′ = AJ(k), B′ = BJ(k − h), and J(k) := g′(s(k)) is the Jacobian of g(x(k)) at s(k).

Proof: Let xi(k) = s(k) + δi(k). Then we have the following variational equation for system (3):

δi(k + 1) = A′ δi(k) + B′ δi(k − h) + Σ_{j=1}^{N} cij Γ δj(k).   (9)
Let δ(k) = (δ1(k), ..., δN(k)) ∈ R^{n×N}. Then Eq. (9) can be rewritten as

δ(k + 1) = A′ δ(k) + B′ δ(k − h) + Γ δ(k) Cᵀ.   (10)

Obviously, the symmetric matrix Cᵀ has the real Schur decomposition Cᵀ = U Λ Uᵀ, where Uᵀ = U⁻¹ and Λ = diag(λ1, ..., λN). Now, letting e(k) = δ(k)U, we have

e(k + 1) = A′ e(k) + B′ e(k − h) + Γ e(k) Λ.   (11)

In component form,

ei(k + 1) = (A′ + λi Γ) ei(k) + B′ ei(k − h), i = 1, ..., N.   (12)

Therefore, if the systems (12) are asymptotically stable, then δ(k) approaches the origin; that is, system (3) is asymptotically synchronized. Theorem 1 is proved. Next, we give sufficient conditions for local and global stability of the system's synchronization, respectively.
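The decoupling step from (10) to (12) can be verified numerically for one iteration: multiplying the coupled variational step on the right by U turns the coupling term into independent, eigenvalue-shifted updates. In the sketch below, A′ and B′ are frozen random matrices (an assumption; in Theorem 1 they depend on the Jacobian along s(k)).

```python
import numpy as np

# One step of (10) versus the column-wise decoupled form (12),
# using e(k) = delta(k) U from the Schur decomposition C^T = U Lambda U^T.
rng = np.random.default_rng(3)
n, N = 2, 4
Ap = 0.5 * rng.standard_normal((n, n))      # stands in for A' = A J(k)
Bp = 0.3 * rng.standard_normal((n, n))      # stands in for B' = B J(k-h)
Gamma = np.diag(rng.uniform(0.5, 1.5, n))
C = np.array([[-2., 1., 1., 0.],
              [1., -2., 0., 1.],
              [1., 0., -2., 1.],
              [0., 1., 1., -2.]])           # symmetric, condition A1
lam_vals, U = np.linalg.eigh(C.T)

delta = rng.standard_normal((n, N))         # delta(k)
delta_h = rng.standard_normal((n, N))       # delta(k - h)

# coupled step (10)
delta_next = Ap @ delta + Bp @ delta_h + Gamma @ delta @ C.T

# decoupled step (12), column by column
e, e_h = delta @ U, delta_h @ U
e_next = np.column_stack([(Ap + lam_vals[i] * Gamma) @ e[:, i] + Bp @ e_h[:, i]
                          for i in range(N)])
print(np.max(np.abs(delta_next @ U - e_next)))
```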
Theorem 2. If there exist two positive definite matrices P, Q > 0, a constant ε > 0, and an integer k0 > 0 such that

( (A′ + λiΓ)ᵀP(A′ + λiΓ) + (Q − P)    (A′ + λiΓ)ᵀPB′ ;
  B′ᵀP(A′ + λiΓ)                      B′ᵀPB′ − Q      )  ≤ −εI   (13)

for all i = 2, ..., N and k ≥ k0 (with A′ and B′ as in Theorem 1), then system (3) is asymptotically synchronized for any h > 0.

Proof: Construct a Lyapunov-Krasovskii functional

V(w(k)) = wᵀ(k) P w(k) + Σ_{ℓ=k−h}^{k−1} wᵀ(ℓ) Q w(ℓ).   (14)

Along system (8) we have

ΔV = V(w(k + 1)) − V(w(k))
  = wᵀ(k)(A′ + λiΓ)ᵀP(A′ + λiΓ)w(k) + wᵀ(k − h)B′ᵀPB′w(k − h)
    − wᵀ(k)Pw(k) + wᵀ(k)Qw(k) − wᵀ(k − h)Qw(k − h)
    + 2wᵀ(k − h)B′ᵀP(A′ + λiΓ)w(k)
  = (w(k)ᵀ, w(k − h)ᵀ) Hi (w(k)ᵀ, w(k − h)ᵀ)ᵀ
  ≤ −ε(‖w(k)‖² + ‖w(k − h)‖²) < 0,   (15)

where Hi denotes the block matrix on the left-hand side of (13),
which implies that system (3) is asymptotically synchronized. The proof is completed.

Next, we investigate another important issue: global synchronization. We will show that the trajectory converges to the synchronization manifold from any initial value under some mild conditions. For simplicity, assume that Γ = diag(γ1, ..., γn) is a diagonal matrix with γi = 1. Then, based on Lyapunov stability theory, the following result can be derived.

Theorem 3. Suppose that C satisfies condition A1. If there exist a positive constant c ≥ √3, a semi-positive definite matrix Q = diag(q1, q2, ..., qn), and an irreducible matrix T satisfying condition A1 such that

( 2AᵀA + Q    AᵀB ;
  BᵀA         2BᵀB − Q )  < 0   (16)

and

CᵀTC − (1/c²) T ≥ 0   (17)

hold, then the synchronization manifold S of system (3) is globally asymptotically stable.
Proof: By Definition 3 and Lemma 6 in [6], there exists a p × N matrix M̃ ∈ M2(1) such that T = −M̃ᵀM̃, and M = M̃ ⊗ In ∈ M2(N). Mi = (Mi1, Mi2, ..., MiN) is the ith row of M, where Mii1 = αi In, Mii2 = −αi In, and Mij = 0 if j ≠ i1, j ≠ i2. Denote x(k) = (x1(k), ..., xN(k))ᵀ ∈ R^{nN}, where xi(k) = (xi1(k), ..., xin(k))ᵀ ∈ Rⁿ. Let

y(k) = Mx(k),  𝒜 = IN ⊗ A,  ℬ = IN ⊗ B,  𝒞 = C ⊗ In,  𝒬 = IN ⊗ Q,
g(x(k)) = (g(x1(k)), g(x2(k)), ..., g(xN(k)))ᵀ.

Then system (3) can be rewritten as

y(k + 1) = M𝒜g(x(k)) + Mℬg(x(k − h)) + M𝒞x(k).   (18)
Choose a Lyapunov functional for (18):

V(y(k)) = yᵀ(k) y(k) + Σ_{δ=k−h}^{k−1} gᵀ(x(δ)) Mᵀ𝒬M g(x(δ)).   (19)

Then we have

ΔV = V(y(k + 1)) − V(y(k))
  = gᵀ(x(k))𝒜ᵀMᵀM𝒜g(x(k)) + gᵀ(x(k − h))ℬᵀMᵀMℬg(x(k − h))
    + xᵀ(k)𝒞ᵀMᵀM𝒞x(k) + 2gᵀ(x(k))𝒜ᵀMᵀMℬg(x(k − h))
    + 2gᵀ(x(k − h))ℬᵀMᵀM𝒞x(k) + 2gᵀ(x(k))𝒜ᵀMᵀM𝒞x(k) − yᵀ(k)y(k)
    + gᵀ(x(k))Mᵀ𝒬Mg(x(k)) − gᵀ(x(k − h))Mᵀ𝒬Mg(x(k − h)).   (20)
By the quadratic inequality, we have

2gᵀ(x(k − h))ℬᵀMᵀM𝒞x(k) ≤ gᵀ(x(k − h))ℬᵀMᵀMℬg(x(k − h)) + xᵀ(k)𝒞ᵀMᵀM𝒞x(k)   (21)

and

2gᵀ(x(k))𝒜ᵀMᵀM𝒞x(k) ≤ gᵀ(x(k))𝒜ᵀMᵀM𝒜g(x(k)) + xᵀ(k)𝒞ᵀMᵀM𝒞x(k).   (22)

By the structure of M, the following identities are easy to verify [9]:

M𝒜 = 𝒜1 M,  Mℬ = ℬ1 M,

where 𝒜1 = Ip ⊗ A, ℬ1 = Ip ⊗ B. Therefore, we have

ΔV ≤ 2gᵀ(x(k))Mᵀ𝒜1ᵀ𝒜1Mg(x(k)) + 2gᵀ(x(k − h))Mᵀℬ1ᵀℬ1Mg(x(k − h))
    + 3xᵀ(k)𝒞ᵀMᵀM𝒞x(k) + 2gᵀ(x(k))Mᵀ𝒜1ᵀℬ1Mg(x(k − h)) − yᵀ(k)y(k)
    + gᵀ(x(k))Mᵀ𝒬Mg(x(k)) − gᵀ(x(k − h))Mᵀ𝒬Mg(x(k − h)).   (23)
By (17) and Lemma 5 in [11],

3xᵀ(k)𝒞ᵀMᵀM𝒞x(k) − yᵀ(k)y(k) = 3xᵀ(k)𝒞ᵀMᵀM𝒞x(k) − xᵀ(k)MᵀMx(k)
  ≤ (3/c² − 1) xᵀ(k)MᵀMx(k) ≤ 0.   (24)
For simplicity, denote by H the block matrix on the left-hand side of (16). Then we have

ΔV ≤ Σ_{j=1}^{p} ( αj(g(xj1(k)) − g(xj2(k))) ; αj(g(xj1(k − h)) − g(xj2(k − h))) )ᵀ
                 H ( αj(g(xj1(k)) − g(xj2(k))) ; αj(g(xj1(k − h)) − g(xj2(k − h))) )
    + (3/c² − 1) xᵀ(k)MᵀMx(k).   (25)
By the conditions given in Theorem 3, H < 0, which implies ΔV < 0. Thus, Theorem 3 is proved.

By the theory of linear matrix inequalities (a Schur complement argument), it suffices that

2BᵀB − Q < 0,  2AᵀA + Q − AᵀB(2BᵀB − Q)⁻¹BᵀA < 0   (26)

hold in order to obtain the LMI condition (16), so our result is convenient to use.
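The remark above is a Schur-complement reduction of the block LMI, and the equivalence can be sanity-checked numerically. The matrices below are illustrative test data, not from the paper.

```python
import numpy as np

# Schur-complement check: for a symmetric block matrix [[M11, M12], [M12^T, M22]],
# negative definiteness is equivalent to M22 < 0 together with
# M11 - M12 M22^{-1} M12^T < 0.
def neg_def(S):
    return np.max(np.linalg.eigvalsh((S + S.T) / 2)) < 0

def schur_equivalent(M11, M12, M22):
    whole = neg_def(np.block([[M11, M12], [M12.T, M22]]))
    reduced = neg_def(M22) and neg_def(M11 - M12 @ np.linalg.inv(M22) @ M12.T)
    return whole, reduced

rng = np.random.default_rng(2)

# Case 1: blocks shaped like (26); here 2 A^T A + Q is positive semidefinite,
# so both tests should agree on "False".
A = 0.2 * rng.standard_normal((3, 3))
B = 0.2 * rng.standard_normal((3, 3))
Q = np.eye(3)
w1, r1 = schur_equivalent(2 * A.T @ A + Q, A.T @ B, 2 * B.T @ B - Q)

# Case 2: a generic negative definite matrix, where both tests give "True".
X = rng.standard_normal((6, 6))
S = -(X @ X.T) - 0.1 * np.eye(6)
w2, r2 = schur_equivalent(S[:3, :3], S[:3, 3:], S[3:, 3:])
print(w1, r1, w2, r2)
```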
4 Conclusion
This paper has studied the synchronization of coupled discrete recurrent neural networks with time delay and given some sufficient conditions to guarantee the synchronization of coupled discrete neural networks. These results could be used in the design of coupled networks.
References
1. Wu, C., Chua, L.O.: A Unified Framework for Synchronization and Control of Dynamical Systems. Int. J. Bifurcation Chaos 4(4) (1994) 979-998
2. Zheleznyak, A., Chua, L.O.: Coexistence of Low- and High-Dimensional Spatio-Temporal Chaos in a Chain of Dissipatively Coupled Chua's Circuits. Int. J. Bifurcation Chaos 4(3) (1994) 639-674
3. Forti, M.: On Global Asymptotic Stability of a Class of Nonlinear Systems Arising in Neural Network Theory. Journal of Differential Equations 113 (1994) 246-264
4. Perez-Munuzuri, A., Perez-Villar, V.: Autowaves for Image Processing on a Two-Dimensional CNN Array of Excitable Nonlinear Circuits: Flat and Wrinkled Labyrinths. IEEE Trans. Circuits Syst.-I 40 (1993) 174-181
5. Louis, M., Thomas, C.: Synchronization in Chaotic Systems. Phys. Rev. Lett. 64 (1990) 821-824
6. Wu, C., Chua, L.O.: Synchronization in an Array of Linearly Coupled Dynamical Systems. IEEE Trans. Circuits Syst.-I 42 (1995) 430-447
7. Krinsky, V., Biktashev, V., Efimov, L.: Autowaves Principle for Parallel Image Processing. Phys. D 49 (1991) 247-253
8. Zhang, Y., He, Z.: A Secure Communication Scheme Based on Cellular Neural Networks. Proc. IEEE Int. Conf. Intelligent Processing Systems 1 (1997) 521-524
9. Lu, W., Chen, T.: Synchronization of Coupled Connected Neural Networks with Delays. IEEE Trans. Circuits Syst.-I 51 (2004) 2491-2503
10. Yi, Z., Lv, J., Zhang, L.: Output Convergence Analysis for a Class of Delayed Recurrent Neural Networks with Time Varying Inputs. IEEE Trans. Systems, Man, and Cybernetics 36(1) (2006) 87-95
11. Lu, W., Chen, T.: Synchronization Analysis of Linearly Coupled Networks of Discrete Time Systems. Physica D 198 (2004) 148-168
Chaos and Bifurcation in a New Class of Simple Hopfield Neural Network

Yan Huang 1,2 and Xiao-Song Yang 1

1 Department of Mathematics, 2 Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, P.R. China
[email protected]
Abstract. A class of simple Hopfield neural networks with a parameter is investigated. Numerical simulations show that these simple Hopfield neural networks can display chaotic attractors and periodic orbits for different parameters. The Lyapunov exponents are calculated, and the bifurcation plot and several important phase portraits are presented as well.
1 Introduction

The Hopfield neural network (HNN) [1], abstracted from brain dynamics, is a significant model in artificial neural networks. It is able to store certain memories or patterns in a manner rather similar to the brain: the full pattern can be recovered if the network is presented with only partial information. The appeal of the Hopfield network is that there exists a rather simple way of setting up the connections between nodes so that any desired set of patterns can be made into stable firing patterns. Substantial evidence has been found in biological studies for the presence of chaos in the dynamics of natural neuronal systems [2-5], and many researchers have suggested that chaos plays a central role in memory storage and retrieval. From this point of view, many artificial neural networks have been proposed in order to realize richer dynamical attractors, such as chaos and hyperchaos, in artificial neural dynamics [2, 6-9].
2 Simple 3-Neuron Hopfield Neural Networks

In this paper we are concerned with a new class of simple 3-neuron Hopfield-type neural networks (3-HNNs) of the form

ẋ = −cx + W · f(x), x = (x1, x2, x3)ᵀ ∈ R³, W = (wij)3×3, f(x) = tanh(x).   (1)
The connection matrix W, describing the strength of connections between neurons, is of the following form:
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 316-321, 2006. © Springer-Verlag Berlin Heidelberg 2006
W = ( w11 w12 w13
      w21 w22 0
      w31 w32 w33 ).   (2)
The topological connection of the weight matrix W is illustrated in Fig.1.
Fig. 1. The topological connection
Concretely, we consider the following connection matrix

W = ( 1.5    u     0.995
      −2.1   1.68  0
      3.977  −21   1.97 )   (3)
with a parameter u. We study periodic orbits and chaotic trajectories for different values of u in the connection matrix (3) by calculating Lyapunov exponents, which are very useful in studying the complex dynamics of dynamical systems. To give a more rigorous analysis of chaotic behavior in the 3-HNNs, one can resort to Poincaré sections and Poincaré map techniques and show the existence of chaos by virtue of topological horseshoe theory [10-12], as in [13, 14]; this more rigorous analysis will be given in our forthcoming work.
3 Periodic Orbit and Chaotic Attractor of the Simple Hopfield Neural Networks

In this section we investigate the dynamics of (1) with the connection matrix (3) and consider the case where the parameter u varies from 1 to 7. As will be shown below, the dynamic behavior is very rich for u ∈ [1, 7]. We compute the Lyapunov exponents for u ∈ [1, 7], which are illustrated in Fig. 2. The Lyapunov exponents are calculated with time step size 0.01 and time length 850, following the approach of [15]. Fig. 2 shows that the system (1) can exhibit rich dynamic behaviors, such as periodic orbits and chaotic attractors.
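A minimal way to estimate the largest Lyapunov exponent of (1) with matrix (3) — simpler than the full-spectrum algorithm of [15] — is Benettin-style renormalization of a nearby trajectory. The sketch below assumes c = 1, which the excerpt does not state explicitly, so the numerical value is illustrative only.

```python
import numpy as np

# Largest Lyapunov exponent of x' = -x + W tanh(x) (c = 1 assumed)
# via RK4 integration and Benettin renormalization of a companion trajectory.
def W_of(u):
    return np.array([[1.5, u, 0.995],
                     [-2.1, 1.68, 0.0],
                     [3.977, -21.0, 1.97]])

def f(x, W):
    return -x + W @ np.tanh(x)

def rk4(x, W, dt):
    k1 = f(x, W); k2 = f(x + 0.5*dt*k1, W)
    k3 = f(x + 0.5*dt*k2, W); k4 = f(x + dt*k3, W)
    return x + dt/6.0 * (k1 + 2*k2 + 2*k3 + k4)

def largest_lyapunov(u, x0, dt=0.01, steps=40000, skip=10000, d0=1e-8):
    W = W_of(u)
    x = np.array(x0, dtype=float)
    y = x + np.array([d0, 0.0, 0.0])
    s = 0.0
    for k in range(steps):
        x, y = rk4(x, W, dt), rk4(y, W, dt)
        d = np.linalg.norm(y - x)
        if k >= skip:                      # discard transient
            s += np.log(d / d0)
        y = x + (y - x) * (d0 / d)         # renormalize separation
    return s / ((steps - skip) * dt)

# u = 2.2 with the initial point listed in Table 1
lam = largest_lyapunov(2.2, (-0.721, -0.208, 11.538))
print(lam)
```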
Phase portraits for certain parameters are shown in Fig. 3; the parameters, corresponding initial points, Lyapunov exponents, and dynamic properties are listed in Table 1. For the meaning of the Lyapunov exponents and their relation to dynamic properties, we refer the reader to [16]. It is clear that there remains much to be done on the dynamics of these simple 3-HNNs from the dynamical-systems point of view, especially from the bifurcation viewpoint. Fig. 4 is the bifurcation plot of the 3-HNNs as we adjust u; Fig. 4(b) and Fig. 4(c) are local magnifications of Fig. 4(a). Computer simulations show that periodic orbits and chaotic behavior appear alternately as the parameter u varies from 1 to 7, and period-doubling bifurcation is observed when u varies from 1.5 to 1.81, which can be seen from Fig. 3(a)-(d) and Fig. 4(b). It can be concluded from Fig. 4(c) that period-doubling bifurcation is also exhibited when u varies from 2.8 to 2.9.
Fig. 2. Lyapunov exponents for u ∈ [1, 7]

Table 1. The parameters, corresponding initial points, Lyapunov exponents, and dynamic properties of Fig. 3
Fig. 3 | u     | Lyapunov exponents      | Dynamic property | Initial point
(a)    | 1.5   | 0.00, -0.433, -0.984    | periodic         | ±(-0.130, 2.424, -23.694)
(b)    | 1.77  | 0.00, -0.013, -0.936    | periodic         | ±(0.906, -1.791, 13.513)
(c)    | 1.777 | 0.00, -0.004, -0.926    | periodic         | ±(-0.266, -1.41, 16.89)
(d)    | 1.8   | 0.00, -0.011, -0.879    | periodic         | ±(0.279, 0.1888, 4.859)
(e)    | 1.9   | 0.091, 0.00, -0.838     | chaotic          | ±(0.256, -0.342, -6.57)
(f)    | 2.2   | 0.123, 0.00, -0.708     | chaotic          | (-0.721, -0.208, 11.538)
(g)    | 2.6   | 0.124, 0.00, -0.630     | chaotic          | ±(-1.00, -0.314, 12.886)
(h)    | 2.8   | 0.00, -0.133, -0.498    | periodic         | ±(0.104, 0.048, -4.755)
(i)    | 3     | 0.060, 0.00, -0.572     | chaotic          | ±(0.297, 0.539, -4.989)
(j)    | 4.5   | 0.00, -0.133, -0.147    | periodic         | ±(-0.47, -0.449, -0.564)
Fig. 3. The phase portraits of system (1) with different parameters u ((a)-(j) as in Table 1)
Fig. 4. The bifurcation plot of the 3-HNNs as u is adjusted (x = 0); (b) and (c) are local magnifications of (a)
For u = 1.5, 1.77, 1.777, 1.8, 2.8, 4.5, one Lyapunov exponent is approximately zero and two are negative, indicating that (1) has one or more periodic orbits. For u = 1.9, 2.2, 2.6, 3, it can be concluded from Table 1 and Fig. 3 that one Lyapunov exponent is positive, one is approximately zero, and one is negative, indicating that (1) is chaotic. For u = 1.5, choose the two different initial points ±(-0.130, 2.424, -23.694). By studying the limit sets of the two trajectories starting from these initial points, we can find two periodic orbits, as illustrated in Fig. 3(a). For u = 1.77, with the two different initial points ±(0.906, -1.791, 13.513), we can also find two periodic orbits, as illustrated in Fig. 3(b). For u = 1.777, 1.8, 2.8, 4.5, two periodic orbits can be found with the two different initial points listed in Table 1, respectively. For u = 1.9, two one-scroll chaotic attractors, as illustrated in Fig. 3(e), are found by studying the two trajectories with initial points ±(0.256, -0.342, -6.57). For u = 2.2, a double-scroll chaotic attractor, as illustrated in Fig. 3(f), is found by studying the trajectory with initial point (-0.721, -0.208, 11.538). For u = 2.6, two one-scroll chaotic attractors, as illustrated in Fig. 3(g), are found with the two different initial points ±(-1.00, -0.314, 12.886). For u = 3, two chaotic attractors, as illustrated in Fig. 3(i), are found with the two different initial points ±(0.297, 0.539, -4.989).
It can be seen from Fig. 3 that the two one-scroll chaotic attractors merge into one double-scroll chaotic attractor as u varies from 1.9 to 2.2, and this double-scroll chaotic attractor then splits back into two one-scroll chaotic attractors as u varies from 2.2 to 2.6.
4 Conclusion

In this paper, we have studied a class of simple Hopfield neural networks with a parameter. Computer simulations show that the 3-HNNs can display chaotic attractors and periodic orbits for different parameters. In order to study the rich dynamical behavior of the system, we have examined the Lyapunov spectrum, the bifurcation plot, and several important phase portraits of the system over a certain range of the parameter.
References
1. Hopfield, J.J.: Neurons with Graded Response Have Collective Computational Properties like Those of Two-State Neurons. Proc. Natl. Acad. Sci. USA 81(10) (1984) 3088-3092
2. Bersini, H., Sener, P.: The Connections between the Frustrated Chaos and the Intermittency Chaos in Small Hopfield Networks. Neural Networks 15(10) (2002) 1197-1204
3. Guevara, M.R., Glass, L.: Chaos in Neurobiology. IEEE Trans. Syst. Man Cyb. 13(9/10) (1983) 790-798
4. Babloyantz, A., Lourenco, C.: Brain Chaos and Computation. Int. J. Neural Syst. 7(4) (1996) 461-471
5. Dror, G., Tsodyks, M.: Chaos in Neural Networks with Dynamic Synapses. Neurocomputing 32-33 (2000) 365-370
6. Wheeler, D.W., Schieve, W.C.: Stability and Chaos in an Inertial Two-Neuron System. Physica D 105 (1997) 267-284
7. Bersini, H.: The Frustrated and Compositional Nature of Chaos in Small Hopfield Networks. Neural Networks 11(6) (1998) 1017-1025
8. Li, Q.D., Yang, X.S., Yang, F.Y.: Hyperchaos in Hopfield-Type Neural Networks. Neurocomputing 67 (2005) 275-280
9. Yang, X.S., Yuan, Q.: Chaos and Transient Chaos in Simple Hopfield Neural Networks. Neurocomputing 69 (2005) 232-241
10. Yang, X.S., Tang, Y.: Horseshoes in Piecewise Continuous Maps. Chaos, Solitons & Fractals 19(3) (2004) 841-845
11. Yang, X.S.: Metric Horseshoes. Chaos, Solitons & Fractals 20 (2004) 1149-1156
12. Yang, X.S., Li, H.-M., Huang, Y.: A Planar Topological Horseshoe Theory with Applications to Computer Verifications of Chaos. J. Phys. A: Math. Gen. 38 (2005) 4175-4185
13. Huang, Y., Yang, X.S.: Horseshoes in Modified Chen's Attractors. Chaos, Solitons & Fractals 26 (2005) 79-85
14. Huang, Y., Yang, X.S.: Chaoticity of Some Chemical Attractors: A Computer Assisted Proof. Journal of Mathematical Chemistry 38 (2005) 107-117
15. Ramasubramanian, K., Sriram, M.S.: A Comparative Study of Computation of Lyapunov Spectra with Different Algorithms. Physica D 139 (2000) 72-86
16. Eckmann, J.-P., Ruelle, D.: Ergodic Theory of Chaos and Strange Attractors. Rev. Mod. Phys. 57 (1985) 617-656
Synchronization of Chaotic System with the Perturbation Via Orthogonal Function Neural Network

Hongwei Wang and Hong Gu

Department of Automation, Dalian University of Technology, Dalian, 116024, China
[email protected]
Abstract. In this paper, an orthogonal function neural network is utilized to realize the synchronization of chaotic systems. The orthogonal function neural network is a Legendre orthogonal neural network. First, the orthogonal function neural network is trained to learn the uncertain information. The parameters of the Legendre orthogonal neural network are then adjusted, via the Lyapunov stability theorem, to accomplish the synchronization of two chaotic systems with perturbation. Finally, a numerical example is given to illustrate the validity of the proposed method.
1 Introduction

The application of chaotic systems to secure communication is a hot research topic in the field of control theory and applications, showing important value and a promising future. In applying chaotic systems to secure communication, a key technology is how to construct the synchronization of chaotic systems: the system must be moved onto the desired orbit according to the design requirements, which is the synchronization problem of chaotic systems. The synchronization of chaotic systems is an important field of chaos research and has been a topic of intensive study over the past decade [1-5]. Two identical systems with the same initial conditions can synchronize in a short time; however, this synchronization is fragile, because small disturbances destroy it. In addition, the synchronization of two different chaotic systems is a complicated problem. In this paper, an orthogonal function neural network, namely a Legendre orthogonal neural network, is utilized to realize the synchronization of chaotic systems. The orthogonal function neural network is trained to learn the uncertain information of the system, and its parameters are adjusted, via the Lyapunov stability theorem, to accomplish the synchronization of two chaotic systems with parameter perturbations. Finally, simulation results for a numerical example show the validity of the proposed method.
2 The Description of the Problem

The chaotic system under consideration has the following form:
ẋ = F(x) + u.   (1)

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 322-327, 2006. © Springer-Verlag Berlin Heidelberg 2006
Here x ∈ Rⁿ, F(x) = (f1(x), f2(x), ..., fn(x))ᵀ is a nonlinear function with components fi(x) (i = 1, 2, ..., n), and u ∈ Rⁿ is the input vector. In the actual system, the chaotic system is perturbed from outside, and then equation (1) can be written as the following equation.
ẋ = F(x) + ΔF(x) + u,   (2)

where ΔF(x) is the external perturbation.

Definition: the reference chaotic system is given by

ẋr = g(xr),   (3)
where xr ∈ Rⁿ is the state vector and g(·) is a vector of smooth nonlinear functions, having either the same structure as F(·) or a different one. Let e = x − xr. If

lim_{t→∞} e(t) = 0,   (4)

then the controlled system (2) is synchronized with the reference system (3). Let A be a matrix whose eigenvalues all have negative real parts. The error dynamics are given by equation (5).
ė = ẋ − ẋr = A(x − xr) + Axr − Ax + F(x) + ΔF(x) + u − g(xr),   (5)

which can also be written as

ė = Ae + Axr − Ax + F(x) + ΔF(x) + u − g(xr).   (6)
Since the eigenvalues of A have negative real parts, there is a positive definite symmetric matrix P satisfying the Lyapunov equation

AᵀP + PA = −Q,   (7)

where Q is a positive definite symmetric matrix. Define G(x) by

G(x) = Axr − Ax + F(x) + ΔF(x) − g(xr).   (8)

Then equation (5) can be represented as

ė = Ae + G(x) + u.   (9)
When the function G(x) is unknown, the estimate Ĝ(x) is substituted for it, and the controller is defined as

u = −Ĝ(x) − σ,   (10)

where σ is a vector with small norm.
3 The Controller of the Orthogonal Function Neural Network

The orthogonal function neural network has a simple structure and fast convergence compared with the common BP neural network, and it can approximate any nonlinear function on a compact set. In this paper, the orthogonal functions of the neural network are the Legendre orthogonal polynomials, defined as follows:
P0(x) = 1,
Pi(x) = (1 / (2ⁱ i!)) dⁱ/dxⁱ [(x² − 1)ⁱ], i = 1, 2, ..., N.   (11)
The global output of the orthogonal function neural network is defined as

y = Σ_{i=1}^{N} Φi Wi,   (12)

where Φi = P1i(x1) × P2i(x2) × ··· × Pni(xn) = Π_{j=1}^{n} Pji(xj), and Pji(xj) is the Legendre polynomial with Pj1(xj) = xj, Pj2(xj) = (3xj² − 1)/2, and Pj(i+1)(xj) = ((2i + 1) xj Pji(xj) − i Pj(i−1)(xj)) / (i + 1), for i = 1, 2, ..., N and j = 1, 2, ..., n.
Lemma 1. For any function g(X) defined on the interval [a, b] and any small positive number ε, there exist an orthogonal function sequence {Φ1(X), Φ2(X), ..., ΦN(X)} and a real number sequence Wi (i = 1, 2, ..., N) such that

| g(X) − Σ_{i=1}^{N} Wi Φi(X) | ≤ ε.   (13)
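The basis (11)-(12) and the approximation property of Lemma 1 can be illustrated by fitting scalar weights by least squares; the target function tanh(2x), the degree, and the grid are illustrative choices, not from the paper.

```python
import numpy as np

# Legendre basis via the three-term recurrence from the text, then a
# least-squares fit of weights W_i so that sum_i W_i P_i(x) approximates
# a target function on [-1, 1], illustrating Lemma 1.
def legendre_basis(x, N):
    """Rows: P_0(x), ..., P_{N-1}(x)."""
    P = np.empty((N, x.size))
    P[0] = 1.0
    if N > 1:
        P[1] = x
    for i in range(1, N - 1):
        P[i + 1] = ((2*i + 1) * x * P[i] - i * P[i - 1]) / (i + 1)
    return P

x = np.linspace(-1.0, 1.0, 401)
Phi = legendre_basis(x, 10)
target = np.tanh(2.0 * x)
W, *_ = np.linalg.lstsq(Phi.T, target, rcond=None)
err = np.max(np.abs(Phi.T @ W - target))
print(err)
```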
On the basis of the lemma 1, the properties are acquired as the following form Property 1. Give a positive constant
ε0
and a continuous function G (x ) : x ∈ R , n
exist an optimization weighted matrix W = W , W = [W1 ,W2 ,...,WN ] to satisfy the following equation. *
*
*
*
*
Synchronization of Chaotic System with the Perturbation Via Orthogonal Function
G ( x) − W * Φ ( x) ≤ ε 0 T
325
(14)
Φ satisfies Φ( x) = [Φ 1 ( x), Φ 2 ( x),..., Φ N ( x)] . Based on the equation (14), G (x ) is shown as the equation (15). T
where
G ( x ) = W * Φ ( x) + η T
(15)
where η is a vector with the smaller norm. The equation (15) is substituted into the equation (9).
e& = Ae + W * Φ ( x) + η + u (t ) T
(16)
P , satisfying T T & W = Φ ( x)e P , A P + PA = −Q , where Q is a positive definite and symmetric
Theorem 1. Exist a positive definite and symmetric matrix matrix, the controller is designed as the following form.
u = u1 + u 2
(17)
u1 is u1 = −W T Φ( x) and u 2 is u 2 = −η 0 sgn( Pe) The state of the equation (9) approaches zero, namely x → x r , the system (1) is
where
synchronization with the system (2).
Proof. Let W̃ = W* − W, where W* is the optimal weight matrix, W is the weight matrix, and W̃ is the weight estimation error matrix. Define the norm of a matrix R as ‖R‖² = tr(RRᵀ) = tr(RᵀR). The Lyapunov function is chosen in the following form:

V = (1/2) eᵀ P e + (1/2) ‖W̃‖²   (18)

Differentiating equation (18) with respect to time gives

V̇ = (1/2)(ėᵀ P e + eᵀ P ė) + tr(W̃ᵀ W̃̇)
  = (1/2) eᵀ (AᵀP + PA) e + Φᵀ(x) W̃ P e + eᵀ P η − eᵀ P η_0 sgn(eᵀP) + tr(W̃ᵀ W̃̇)
  = −(1/2) eᵀ Q e + Φᵀ(x) W̃ P e + eᵀ P η − eᵀ P η_0 sgn(eᵀP) + tr(W̃ᵀ W̃̇)

Because Φᵀ(x) W̃ P e = tr(P e Φᵀ(x) W̃), and eᵀ P η − eᵀ P η_0 sgn(eᵀP) ≤ 0 (which holds when η_0 is chosen no smaller than the bound on η), it follows that

V̇ ≤ −(1/2) eᵀ Q e + tr(W̃ᵀ W̃̇ + P e Φᵀ(x) W̃) = −(1/2) eᵀ Q e ≤ 0

On the basis of the Lyapunov theorem, e and W̃ are bounded, and lim_{t→∞} e = 0.
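To make the control scheme concrete, the sketch below integrates the error dynamics (16) under the controller (17) and the update law Ẇ = Φ(x)eᵀP with forward Euler. The regressor Φ, the "true" weights and the disturbance are illustrative stand-ins (not the paper's Legendre network), and the step size is our choice.

```python
import numpy as np

def simulate(T=30.0, dt=1e-3, eta0=0.05):
    A = np.diag([-1.0, -1.0, -1.0])        # Hurwitz A, as in the simulation section
    P = np.eye(3)                           # solves A^T P + P A = -Q with Q = 2I
    W_star = np.array([[0.4, -0.3, 0.2],    # unknown "true" weights (illustrative)
                       [0.1,  0.5, -0.2]])
    W = np.zeros_like(W_star)               # adaptive estimate, starts at zero
    e = np.array([1.0, -0.5, 0.8])          # initial synchronization error
    t = 0.0
    for _ in range(int(T / dt)):
        Phi = np.array([np.sin(t), np.cos(2.0 * t)])   # stand-in regressor Phi(x)
        eta = 0.02 * np.sin(3.0 * t) * np.ones(3)      # disturbance, |eta_i| <= eta0
        u = -W.T @ Phi - eta0 * np.sign(P @ e)         # controller (17)
        e = e + dt * (A @ e + W_star.T @ Phi + eta + u)  # error dynamics (16)
        W = W + dt * np.outer(Phi, e) @ P              # update law dW/dt = Phi e^T P
        t += dt
    return np.linalg.norm(e)
```

Running `simulate()` drives the error norm close to zero, illustrating the convergence claimed by Theorem 1.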
H. Wang and H. Gu
4 Simulation

The Lorenz chaotic system is given by equation (19):
v̇_1 = a(v_2 − v_1)
v̇_2 = (b − v_3) v_1 − v_2
v̇_3 = −c v_3 + v_1 v_2   (19)
In an actual system, the Lorenz chaotic system is perturbed by the environment, and it is then represented by the following equation:
v̇_1 = (a + δa)(v_2 − v_1) + d_1 + u_1
v̇_2 = (b + δb − v_3) v_1 − v_2 + d_2 + u_2
v̇_3 = −(c + δc) v_3 + v_1 v_2 + d_3 + u_3   (20)
The system parameters are a = 10, b = 30, c = 8/3; the parameter perturbation terms are δa = 0.1, δb = 0.2, δc = 0.2; the state perturbation terms are d_1 = 0.3 sin t, d_2 = 0.1 cos t, d_3 = 0.2 sin(3t).
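As a minimal check of this setup, the unperturbed Lorenz system (19) with a = 10, b = 30, c = 8/3 can be integrated with a classical fourth-order Runge-Kutta step (step size and initial condition below are our choices):

```python
import numpy as np

def lorenz_step(v, dt, a=10.0, b=30.0, c=8.0 / 3.0):
    # One classical RK4 step of the unperturbed Lorenz system (19)
    def f(v):
        v1, v2, v3 = v
        return np.array([a * (v2 - v1), (b - v3) * v1 - v2, -c * v3 + v1 * v2])
    k1 = f(v)
    k2 = f(v + dt / 2.0 * k1)
    k3 = f(v + dt / 2.0 * k2)
    k4 = f(v + dt * k3)
    return v + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

v = np.array([1.0, 1.0, 1.0])
for _ in range(20000):      # 20 s at dt = 1e-3: the trajectory settles onto the attractor
    v = lorenz_step(v, 1e-3)
```

The state remains bounded on the chaotic attractor, which is what the synchronizing controller has to track.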
Fig. 1. The response diagram of e_1

Fig. 2. The response diagram of e_2

Fig. 3. The response diagram of e_3

Synchronization of Chaotic System with the Perturbation Via Orthogonal Function
The parameters of the Legendre orthogonal neural network are adjusted to accomplish the synchronization. The other parameters are A = diag(−1, −1, −1), Q = diag(2, 2, 2), η_0 = 0.02, P = diag(1, 1, 1). The error response diagrams are shown in Fig. 1 to Fig. 3. Based on these response diagrams, the synchronization of the two chaotic systems is achieved by the proposed method. The validity of the proposed method is demonstrated by the result of the numerical simulation.
5 Conclusion

In this paper, an orthogonal function neural network based on Legendre orthogonal polynomials is utilized to realize the synchronization of chaotic systems. The parameters of the orthogonal neural network are adjusted to accomplish the synchronization of two chaotic systems under parameter perturbations, based on the Lyapunov stability theorem. As demonstrated by the simulation results, the proposed method can guarantee the synchronization of two chaotic systems with parameter perturbations.
Acknowledgement

This work is supported by the National Natural Science Foundation of China (60575039). The authors would like to thank Prof. Xiaodong Liu for his support and encouragement.
References

1. Liu, F., Ren, Y., Shan, X.M., Qiu, Z.L.: A Linear Feedback Synchronization Theorem for a Class of Chaotic Systems. Chaos, Solitons and Fractals 13(4) (2002) 723-730
2. Sarasola, C., Torrealdea, F.J.: Cost of Synchronizing Different Chaotic Systems. Mathematics and Computers in Simulation 58(4) (2002) 309-327
3. Shahverdiev, E.M., Sivaprakasam, S., Shore, K.A.: Lag Synchronization in Time-Delayed Systems. Physics Letters A 292(6) (2002) 320-324
4. Tan, W., Wang, Y.N., Liu, Z.R., Zhou, S.W.: Neural Network Control for Nonlinear Chaotic Motion. Acta Physica Sinica 51(11) (2002) 2463-2466
5. Li, Z., Han, C.S.: Adaptive Control for a Class of Chaotic Systems with Uncertain Parameters. Acta Physica Sinica 50(5) (2001) 847-850
Numerical Analysis of a Chaotic Delay Recurrent Neural Network with Four Neurons

Haigeng Luo1, Xiaodong Xu2, and Xiaoxin Liao1

1 Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
[email protected], [email protected]
2 College of Public Administration, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
xiaodong [email protected]
Abstract. Complex dynamical behavior in a four-neuron recurrent neural network with discrete delays is investigated in this paper. The dissipativity of the system and the stability of the equilibrium point are studied by means of Lyapunov theory. Stable fixed points, periodic and quasi-periodic orbits, and chaotic motion are observed in the system via numerical calculation. With the change of the slope and threshold of the activation function, as well as the time delay and synaptic weight, the system passes from stable to periodic and then to chaotic regimes. Interestingly, the system returns to periodic or stable regimes when these parameter values are changed further. Furthermore, some numerical evidence, such as phase portraits, bifurcation diagrams and power spectrum density, is given to confirm chaos.
1 Introduction
An artificial neural network (ANN) is a mathematical model that represents the biological processes of the human brain. These mathematical models can be used to generate computer graphics which illustrate the usefulness of these techniques for solving problems in such areas as system control, combinatorial optimization, data compression, pattern recognition and system identification. So far, most works have focused on the stability of neural networks; see [1, 2, 3, 4, 5] and references therein. In recent years, bifurcation and chaotic phenomena have been considered in neural networks with or without time delays. For example, Das et al. [6, 7] investigated the chaotic behavior in a three-dimensional neural network. Liao et al. [9] analyzed the Hopf bifurcation and chaos in a single delayed neural equation with a non-monotonic activation function. Zhou et al. [10] studied chaos and synchronization in two-neuron systems with discrete delays. However, it seems to be necessary to choose a somewhat more complicated activation function, such as a combination of a sigmoid function and a periodic function, to exhibit chaos in delay neural networks with one or two neurons. Moreover, most
Corresponding author.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 328–333, 2006. © Springer-Verlag Berlin Heidelberg 2006
works place their emphasis on the synaptic weight and decay rate, and omit the influence of the time delay, the self-weight (that is, wij with i = j) and the parameters of the activation function. In this paper, a small recurrent neural network consisting of four coupled neurons with discrete delays and a typical sigmoid activation function is considered; its activity is determined by the four differential equations given in Section 2. The dissipativity of the model and the stability of the equilibrium point are studied in Section 3. In Section 4, details of the numerical simulation are given. An attempt has been made to give the model a general shape by taking all possible synaptic connections of the neurons, including the self-weights, into consideration. In the numerical simulation, we have considered the influence of the time delay and synaptic weight as well as the slope and the threshold of the activation function. The relevance of the findings of this work is discussed in Section 5.
2 Model Formulation
Consider a continuous time four-neuron recurrent neural network model with multiple time delays:

dx_i(t)/dt = −x_i(t) + Σ_{j=1}^{4} w_ij ( f_i(x_j(t)) − f_i(x_j(t − τ_j)) ),   (1)
for i = 1, 2, 3, 4, where x_i(t) is the state variable of the i-th neuron and w_ij represents the synaptic weight from neuron j to neuron i. When the output of neuron j excites neuron i, w_ij > 0; when it inhibits neuron i, w_ij < 0; when it has no influence on neuron i, w_ij = 0. f_i(·) is a sigmoid-type activation function,

f_i(s) = tanh(β_i(s − θ_i)),
(2)
where β_i > 0 and θ_i denote the slope and the threshold of the activation function of neuron i, respectively. In the following, we choose τ = [0.8, 0.5, 0.5, τ_4], β = [β_1, 1.2, 1, 1] and θ = [θ_1, 0.1, 0.1, 0.1], and the connection weight matrix is taken as

W = [ 1.5   w12  −0.9   0.5
      1.8   1.2   0.6   1.3
      1.1   0.9   2.2   1.5
      0.2  −0.4   0.4   1.85 ]   (3)

In other words, τ_4, β_1, θ_1 and w12 are chosen as the bifurcation parameters; the other parameter values are fixed.
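Model (1) with the weight matrix (3) can be explored with a simple fixed-step Euler scheme and a history buffer. The sketch below is our own minimal integrator (the paper's simulations use Matlab's DDE23), with a constant initial history of 0.1 chosen purely for illustration:

```python
import numpy as np

def simulate_dde(T=50.0, dt=0.01, tau4=0.8, beta1=12.0, theta1=0.0, w12=-1.5):
    # Connection weight matrix (3) and parameter vectors from Section 2
    W = np.array([[1.5, w12, -0.9, 0.5],
                  [1.8, 1.2,  0.6, 1.3],
                  [1.1, 0.9,  2.2, 1.5],
                  [0.2, -0.4, 0.4, 1.85]])
    tau = np.array([0.8, 0.5, 0.5, tau4])
    beta = np.array([beta1, 1.2, 1.0, 1.0])
    theta = np.array([theta1, 0.1, 0.1, 0.1])
    lag = np.round(tau / dt).astype(int)            # delays measured in steps
    steps = int(T / dt)
    x = np.zeros((steps + 1, 4))
    x[0] = 0.1                                      # constant history x(t <= 0) = 0.1
    f = lambda i, s: np.tanh(beta[i] * (s - theta[i]))
    for k in range(steps):
        xk = x[k]
        xd = np.array([x[max(k - lag[j], 0), j] for j in range(4)])  # delayed states
        dx = -xk + np.array([W[i] @ (f(i, xk) - f(i, xd)) for i in range(4)])
        x[k + 1] = xk + dt * dx
    return x
```

Because tanh is bounded, every trajectory stays inside the attracting set established by Theorem 1 below.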
3 Analysis of Dissipativity and Stability

In this section, we first show that system (1) is dissipative.
Theorem 1. If we choose the activation function as (2), then the neural network model (1) is a dissipative system, and the set S = S_1 ∩ S_2 is a positive invariant and globally attractive set, where

S_1 ≜ { x | Σ_{i=1}^{4} ( |x_i| − 2 Σ_{j=1}^{4} |w_ij| )² ≤ Σ_{i=1}^{4} ( 2 Σ_{j=1}^{4} |w_ij| )² },   (4)

S_2 ≜ { x | |x_i| ≤ 2 Σ_{j=1}^{4} |w_ij|, i = 1, 2, 3, 4 }.   (5)
Proof. First, we employ the radially unbounded and positive definite Lyapunov function V(x) = Σ_{i=1}^{4} x_i²/2. Computing dV/dt along the positive half trajectory of (1), we have

dV/dt |_(1) = Σ_{i=1}^{4} x_i (dx_i/dt) ≤ Σ_{i=1}^{4} ( 2 Σ_{j=1}^{4} |w_ij| |x_i| − x_i² ) < 0,   (6)

when x ∈ R⁴\S_1, i.e., x ∉ S_1. Equation (6) implies that for every x_0 ∈ S_1, x(t, t_0, x_0) ⊆ S_1 for t ≥ t_0. For x_0 ∉ S_1, there exists T > 0 such that x(t, t_0, x_0) ⊆ S_1 for all t ≥ T + t_0; i.e., the neural network model (1) is a dissipative system and S_1 is a positive invariant and attractive set. Second, we define the radially unbounded and positive definite Lyapunov functions V_i = |x_i|, i = 1, 2, 3, 4. Calculating the right-upper Dini derivative D⁺V_i, one obtains

D⁺V_i |_(1) ≤ −|x_i| + 2 Σ_{j=1}^{4} |w_ij| < 0, i = 1, 2, 3, 4,   (7)

when x ∈ R⁴\S_2. So S_2 is also a positive invariant and globally attractive set. Combining the above, we conclude that S = S_1 ∩ S_2 is a positive invariant and globally attractive set. The proof of Theorem 1 is complete.

Obviously, the origin is the unique equilibrium point of the neural network model (1). Linear stability analysis would be the natural tool for investigating its stability and bifurcation. Unfortunately, the characteristic equation of (1) is a fourth-order transcendental equation, and both analytical and numerical analyses are very difficult. Therefore, a qualitative analysis by Lyapunov theory is carried out.

Theorem 2. The equilibrium point of the neural network model (1) is globally asymptotically stable if, for all connection weights w_ij and slopes β_i, the following inequality holds:

max_{1≤i,j≤4} w_ij² + max_{1≤i≤4} β_i² < 1/4.   (8)
Proof. Construct the radially unbounded Lyapunov functional

V(x(t), x(t − τ)) = (1/2) Σ_{i=1}^{4} x_i² + 2 Σ_{i=1}^{4} ∫_{t−τ_i}^{t} max_{1≤i≤4}{β_i²} x_i²(s) ds.   (9)

Computing the time derivative along the positive half trajectory of equation (1), using the mean-value theorem and f_i′(s) ≤ β_i, one obtains

dV/dt |_(1) = Σ_{i=1}^{4} [ −x_i² + x_i Σ_{j=1}^{4} w_ij ( tanh(β_i(x_j − θ_i)) − tanh(β_i(x_j(t − τ_j) − θ_i)) ) ]
            + 2 Σ_{i=1}^{4} max_{1≤i≤4}{β_i²} ( x_i² − x_i²(t − τ_i) )
          ≤ − Σ_{i=1}^{4} ( 1 − 4 ( max_{1≤i,j≤4} w_ij² + max_{1≤i≤4} β_i² ) ) x_i² ≤ 0,

when max_{1≤i,j≤4} w_ij² + max_{1≤i≤4} β_i² < 1/4, and dV/dt |_(1) = 0 if and only if x_i = 0, i = 1, 2, 3, 4. This implies that the neural network model is globally asymptotically stable. The proof of Theorem 2 is completed.
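It is worth noting that condition (8) is far from satisfied for the parameter values used in the simulations below (w12 = −1.5, β_1 = 12), which is consistent with the chaotic behavior reported there. A quick numerical check:

```python
import numpy as np

W = np.array([[1.5, -1.5, -0.9, 0.5],    # weight matrix (3) with w12 = -1.5
              [1.8,  1.2,  0.6, 1.3],
              [1.1,  0.9,  2.2, 1.5],
              [0.2, -0.4,  0.4, 1.85]])
beta = np.array([12.0, 1.2, 1.0, 1.0])   # slopes with beta1 = 12

lhs = (W ** 2).max() + (beta ** 2).max() # max w_ij^2 + max beta_i^2, about 148.8
stable_guaranteed = lhs < 0.25           # condition (8) requires lhs < 1/4
```

Since the left-hand side exceeds 1/4 by three orders of magnitude, Theorem 2 makes no stability guarantee for these parameters, leaving room for periodic and chaotic regimes.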
4 Numerical Simulation
We present numerical experiments according to the model and parameter values given in Section 2. Based on the numerical simulation software Matlab 7.0.4, the delay differential equation solver DDE23 is used to integrate the delay neural network model (1).

Phase portraits and power spectrum of the system: If the bifurcation parameter values are chosen as β_1 = 12, θ_1 = 0, w12 = −1.5 and τ_4 = 0.8, the delay neural network model (1) possesses an unstructured chaotic attractor. The phase portraits are given in Fig. 1. The power spectrum density of the x_1 component is drawn in Fig. 2. It is a typical continuous spectrum and discloses that system (1) is chaotic.

Bifurcation diagrams: To find out the critical parameter domains where the activity pattern of the system changes, and to correctly establish the critical dependence of the system's stability on parameters such as the synaptic weight w12, the time delay τ_4, the slope β_1 and the threshold θ_1, we have drawn bifurcation diagrams (Figs. 3-6) by computing the local-maximum Poincaré section of the state variable x_1. They clearly show how the system passes through a period-doubling bifurcation to chaos. The sudden changes in the chaotic dynamics of the system, which can be termed crises, are also clear from these figures. From these figures and a great deal of numerical calculation, we can conclude that the system is not sensitive to changes of the slope: only a large value of β_1 can guide the system from the chaotic regime to a periodic orbit, and the system cannot reach a fixed point this way. The time delay can guide the system from periodic motion to chaos, but the system cannot reach a fixed point by changing the time delay either. However, both the change of the connection weight and that of the threshold can guide the system from a fixed point to periodic motion, then to chaos, and finally back to a stationary regime.

Fig. 1. Three-dimensional view of the attractor

Fig. 2. Power spectrum of x_1

Fig. 3. The system reaches the chaotic regime when the slope β_1 decreases. (θ_1 = 0, τ_4 = 0.8, w12 = −1.5)

Fig. 4. The system goes from a fixed point to a periodic orbit, then to the chaotic regime, and finally returns to a fixed point with the change of the threshold θ_1. (β_1 = 12, τ_4 = 0.8, w12 = −1.5)

Fig. 5. Bifurcation diagram of the connection weight w12 versus system state x_1. (β_1 = 12, θ_1 = 0, τ_4 = 0.8)

Fig. 6. Bifurcation diagram of the time delay τ_4 versus system state x_1. (β_1 = 12, θ_1 = 0, w12 = −1.5)
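The local-maximum Poincaré section used for the bifurcation diagrams amounts to extracting the strict local maxima of a sampled trajectory of x_1; a minimal helper (our own naming):

```python
import numpy as np

def local_maxima(x):
    # Indices k with x[k-1] < x[k] > x[k+1]: the local-maximum Poincare section of x
    x = np.asarray(x)
    return np.where((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:]))[0] + 1
```

Sweeping a parameter, simulating, and plotting x_1 at `local_maxima(x1)` against the parameter value produces diagrams of the kind shown in Figs. 3-6.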
5 Conclusion
In this paper, a four-neuron recurrent neural network with discrete delays is investigated. Chaotic motion is observed while changing the slope and the threshold of the sigmoid activation function, as well as the time delay and the synaptic weight of the system. Numerical simulations show that these parameters have different effects on the bifurcation behavior of the system. This work can provide some reference evidence for chaos control in chaotic neural networks.
Acknowledgement

The work was supported by the National Natural Science Foundation of China (60274007, 60474011) and the Science and Research Project of Hubei Provincial Department of Education (D200560001).
References

1. Shen, Y., Zhao, G.Y., Jiang, M.H., Mao, X.R.: Stochastic Lotka-Volterra Competitive Systems with Variable Delay. In: Huang, D.S., Zhang, X.P., Huang, G.B. (eds.): Advances in Intelligent Computing. Lecture Notes in Computer Science, Vol. 3645. Springer-Verlag, Berlin Heidelberg New York (2005) 238-247
2. Zeng, Z.G., Wang, J., Liao, X.X.: Stability Analysis of Delayed Cellular Neural Networks Described Using Cloning Templates. IEEE Trans. Circuits and Systems I 51 (2004) 2313-2324
3. Zeng, Z.G., Huang, D.S., Wang, Z.F.: Global Stability of a General Class of Discrete-Time Recurrent Neural Networks. Neural Processing Letters 22 (2005) 33-47
4. Shen, Y., Zhao, G.Y., Jiang, M.H., Hu, S.G.: Stochastic High-Order Hopfield Neural Networks. In: Wang, L.P., Chen, K., Ong, Y.S. (eds.): Advances in Natural Computation. Lecture Notes in Computer Science, Vol. 3610. Springer-Verlag, Berlin Heidelberg New York (2005) 740-749
5. Shen, Y., Jiang, M.H., Liao, X.X.: Global Exponential Stability of Cohen-Grossberg Neural Networks with Time-Varying Delays and Continuously Distributed Delays. In: Wang, J., Liao, X.F., Zhang, Y. (eds.): Advances in Neural Networks. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag, Berlin Heidelberg New York (2005) 156-161
6. Das, A., Das, P., Roy, A.B.: Chaos in Three Dimensional Neural Network. Applied Mathematical Modelling 24 (2000) 511-522
7. Das, A., Das, P., Roy, A.B.: Chaos in a Three-Dimensional Model of Neural Network. Int. J. of Bifur. and Chaos 12 (2002) 2271-2281
8. Li, C.G., Yu, J.B., Liao, X.F.: Chaos in a Three-Neuron Hysteresis Hopfield-Type Neural Network. Physics Letters A 285 (2001) 368-372
9. Liao, X.F., Wong, K.W., Leung, C.S., Wu, Z.F.: Hopf Bifurcation and Chaos in a Single Delayed Neuron Equation with Non-monotonic Activation Function. Chaos, Solitons and Fractals 12 (2001) 1535-1547
10. Zhou, S.B., Liao, X.F., Yu, J.B., Wong, K.W.: Chaos and Its Synchronization in Two-Neuron Systems with Discrete Delays. Chaos, Solitons and Fractals 21 (2004) 133-142
Autapse Modulated Bursting

Guang-Hong Wang1 and Ping Jiang1,2

1 Department of Information and Control Engineering, Tongji University, Shanghai 200092, China
[email protected]
2 School of Informatics, University of Bradford, Bradford, BD7 1DP, UK
[email protected]
Abstract. In this paper we present a model of autapses, which are synapses connecting axons and dendrites of the same neuron and feeding back axonal action potentials to the neuron's own dendritic tree. The timely physiological self-inhibitory function they may serve, indicated by the spike-timing-dependent plasticity rule, provides a potential negative feedback mechanism to control the dynamical properties of neurons. The model autapse is applied to a neuron with a conductance-based minimal model to construct a fast-slow burster. Three types of bursting that the burster exhibits are analyzed geometrically through phase portraits to illustrate the modulation competence of the autapse model on bursting.
1 Introduction
Information processing depends not only on the anatomical substrates of synaptic circuits and the electrophysiological properties of neurons, but also on their dynamical properties [1]. Electrophysiologically similar neurons may respond to the same synaptic input in very different manners because of each cell's intrinsic bifurcation dynamics. In an extensive series of experiments on the giant axon of the squid, Hodgkin and Huxley succeeded in measuring the currents responsible for action potentials and describing their dynamics in terms of differential equations [2]. The H-H model is not only the starting point for detailed neuron models which account for numerous ion channels, different types of synapse, and the specific spatial geometry of an individual neuron, but also an important reference model for the derivation of simplified neuron models [3]. Further development of the geometrical analysis of neuronal models was done in [1, 4], stressing the integrator and resonator modes of operation and making the connections to other neuro-computational properties. [5, 6] developed a geometrical framework for an averaging method for singularly perturbed systems. Many mathematical models of bursters have the fast-slow form (1) described in the next section. The Hodgkin-Huxley equations, FitzHugh-Nagumo equations and Morris-Lecar equations, as well as many other fast-slow models of this form, exhibit an extremely rich variety of nonlinear dynamical behaviors [7]. The methods of the qualitative theory of slow-fast systems applied to biophysically realistic neuron models can describe basic scenarios of how these regimes of activity can be

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 334–343, 2006. © Springer-Verlag Berlin Heidelberg 2006
generated and how transitions between them can be made [5]. A top-down approach was used to classify all the bursters, considering all possible pairs of co-dimension one bifurcations of resting and spiking states [4]. The resting state of a neuron corresponds to a stable equilibrium, and the tonic spiking state corresponds to a limit cycle attractor. Another classification framework was provided based on the observation that the bifurcations of the fast system that lead to bursting can be collapsed to a single local bifurcation, generally of higher co-dimension [7]. In [8], a neural basket cell was simulated via the Hodgkin-Huxley model with an autapse feeding back onto the soma of the neuron, forced to burst by a constant 200 Hz spike train as the excitatory synaptic input. However, the synaptic-level analysis of the autapse did not sufficiently reveal its dynamical role in the whole system, which in fact was not analyzed dynamically, let alone the different types of bursting. Here, we present an autapse model IK(Aut) which is driven by action potentials, and reveal its role from the perspective of dynamical systems. The model autapse is applied to a neuron with a conductance-based minimal model to construct a fast-slow burster. Three types of bursting that the burster exhibits are analyzed geometrically through phase portraits to illustrate the dynamics and modulation competence of the autapse model on bursting.
2 Models

2.1 Fast-Slow Bursting Model
The fields of mathematical and computational neuroscience focus on modeling electrical activity in nerve cells. Fundamental processes, such as voltage changes and ion transport across cell membranes, occur on disparate time scales [7]. A neuron is a fast-slow burster if its behavior can be described by a fast-slow system of the form

ẋ = f(x, u)    (fast spiking)
u̇ = μ g(x, u)  (slow modulation),   (1)
where vector x describes the fast variables responsible for spiking, including the membrane potential and the activating and inactivating gating variables for fast currents; vector u describes relatively slow variables that modulate fast spiking, e.g., the gating variable of a slow K+ current or the intracellular concentration of Ca2+ ions; the small positive parameter μ ≪ 1 represents the ratio of time scales between spiking and modulation. At μ = 0, the fast system becomes independent of the slow one, hence they can be considered separately. The u variable can then be treated as a bifurcation parameter in the fast subsystem. This constitutes the method of dissection of neural bursting pioneered by Rinzel [9]. One should note that although the fast subsystem goes through bifurcations, this does not mean that the entire system (1) undergoes bifurcations during bursting. As long as the parameters of (1) are fixed, the system as a whole does not undergo any bifurcations, no matter how small μ is.
2.2 Conductance-Based Voltage-Gated INa,p +IK -Model
We select the conductance-based voltage-gated INa,p +IK -model (2) as the fast spiking subsystem, which is one of the most fundamental models in computational neuroscience, consisting of a fast Na+ current and a slower K+ current [1] (the three membrane-current terms below are IL, INa,p and IK, respectively):

C V̇ = I − gL(V − EL) − gNa m (V − ENa) − gK n (V − EK)
ṁ = (m∞(V) − m)/τm
ṅ = (n∞(V) − n)/τn,   (2)
where V is the membrane potential; I is the total current flowing across a patch of a cell membrane; IL is the Ohmic leak current carried mostly by Cl− ions; INa,p is the persistent inward Na+ current; IK is the outward K+ current; C is the membrane capacitance; EL , ENa and EK are the relevant Nernst equilibrium potentials; gL , gNa and gK are maximal conductance of channels for corresponding ions; m and n are the probability of channel activation gate to be open for Na+ and K+ , with time constants τm and τn , respectively; m∞ (V ) and n∞ (V ) are the steady state activation curves approximated by Boltzmann function B∞ (V ) = 1/(1 + exp((V1/2,B − V )/kB )), B = m, n ,
(3)
where the parameter V1/2 satisfies B∞(V1/2) = 1/2, and k is the slope factor. Smaller values of |k| result in steeper B∞(V). A reasonable assumption based on experimental observations is that m is much faster than V, so that m approaches the asymptotic value m∞(V) instantaneously. In this case the three-dimensional system (2) can be reduced to the planar system (4) by substituting m = m∞(V) [1]:

C V̇ = I − gL(V − EL) − gNa m∞(V)(V − ENa) − gK n (V − EK)
ṅ = (n∞(V) − n)/τn   (4)

2.3 Autapse IK(Aut) -Model
The site where the axon of a presynaptic neuron makes contact with the dendrite (or soma) of a postsynaptic cell is the synapse. When an action potential arrives at a synapse, it triggers a complex chain of biochemical processing steps that lead to the release of neurotransmitter from the presynaptic terminal into the synaptic cleft. As soon as transmitter molecules have reached the postsynaptic side, they are detected by specialized receptors in the postsynaptic cell membrane and open specific channels so that ions from the extracellular fluid flow into the cell. The ion influx, in turn, leads to a change of the membrane potential at the postsynaptic site so that, in the end, the chemical signal is translated into an electrical response [3]. As an inhibitory synapse, the autapse is assumed to activate K+ ion channels, hence its equilibrium potential EAut = EK [8], and its current can be described as [3]

IAut(t, V) = gAut Σ_f exp(−(t − t⁽ᶠ⁾)/τa) Θ(t − t⁽ᶠ⁾) (V − EK),   (5)
where gAut and τa are the maximal conductance and time constant of the autapse, respectively; t⁽ᶠ⁾ denotes the arrival time of a presynaptic action potential; Θ(·) is the Heaviside step function. Based on (5), and further assuming that autapses are driven by the potential difference at some higher level during a spiking period, we depict it as

IAut(V) = gAut a (V − EK),
with  a ← min{ 1, a + d (V_max^local − V_min^local) }   (when V = V_max^local and V ≥ Vthr),
      ȧ = −a/τa   (otherwise),   (6)
where a is similar to n, representing the probability of the activation gate being open for K+ driven by the inhibitory autapse; V_max^local and V_min^local denote the local maximum and minimum of the membrane potential during the latest spiking period, respectively; Vthr represents the threshold over which the potential difference can operate on the autapse; d > 0 is a proportional coefficient indicating the effect of the potential difference. If τa ≫ τn, then (6) is competent to act as the slow modulation part in (1). Combining (4) and (6), we get the whole fast-slow burster INa,p +IK +IAut -model as shown in (7).
(7)
C V̇ = I − gL(V − EL) − gNa m∞(V)(V − ENa) − gK n (V − EK) − gM nM (V − EK)
ṅ = (n∞(V) − n)/τn
ṅM = (n∞,M(V) − nM)/τnM   (8)
The intrinsic difference between INa,p +IK +IAut -model and INa,p +IK +IK(M) model (8), which was thoroughly analyzed in [1], is that IK(M) is driven by local local potential V, but IAut is driven by potential difference Vmax − Vmin during a spiking period. From another perspective, as noted in [10], the slow variables either provide the modulation effectively without feedback from the fast system, in which case the slow variables oscillate periodically on their own, irrespective of how the coupling to the fast subsystem influences them. Or, they provide it with feedback from the fast system, in which case the switching between the two states is determined also by the fast variables. In this sense, (7) is closer to the former, while (8) belongs to the latter.
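For illustration, the burster (7) can be integrated with a rough forward-Euler scheme using the Fold/Homo parameter set from Table 1. The step size, the initial conditions and the way local maxima of V are detected below are our own choices, so this is a sketch rather than the authors' implementation:

```python
import numpy as np

def m_inf(V):
    # Boltzmann activation (3) with V1/2,m = -20 mV, km = 15 mV (Fold/Homo column)
    return 1.0 / (1.0 + np.exp((-20.0 - V) / 15.0))

def n_inf(V):
    # Boltzmann activation (3) with V1/2,n = -25 mV, kn = 5 mV
    return 1.0 / (1.0 + np.exp((-25.0 - V) / 5.0))

def burster(T=200.0, dt=0.01):
    # Fold/Homo parameter set from Table 1
    C, I, gL, EL = 1.0, 5.0, 8.0, -80.0
    gNa, ENa, gK, EK = 20.0, 60.0, 9.0, -90.0
    tau_n, tau_a, gAut, Vthr, d = 0.152, 20.0, 2.0, -20.0, 0.0006
    V, n, a = -65.0, 0.0, 0.1                    # initial conditions (our choice)
    Vprev, Vmin_local = V, V
    trace = []
    for _ in range(int(T / dt)):
        dV = (I - gL * (V - EL) - gNa * m_inf(V) * (V - ENa)
              - gK * n * (V - EK) - gAut * a * (V - EK)) / C
        Vnew = V + dt * dV
        n += dt * (n_inf(V) - n) / tau_n
        a -= dt * a / tau_a                       # slow decay of the autapse gate
        if V >= Vthr and V > Vprev and Vnew < V:  # local maximum of V above threshold
            a = min(1.0, a + d * (V - Vmin_local))  # jump prescribed by (7)
            Vmin_local = Vnew
        Vmin_local = min(Vmin_local, Vnew)
        Vprev, V = V, Vnew
        trace.append(V)
    return np.array(trace)
```

Plotting the returned trace of V against time reproduces the alternation of spiking and resting episodes discussed in Section 3.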
3 Numerical Results
Bifurcations that the model exhibits are illustrated in Fig. 1. The reader is referred to [1, 4] for details about the complete set of bifurcations. As shown in Fig. 1, there are two bifurcations through which equilibria transit from resting states to spiking states, and another two bifurcations through which limit cycles transit from
Fig. 1. Bifurcations appearing in the INa,p +IK +IAut -model. Saddle-Node: a node is approached by a saddle; they coalesce and annihilate each other. Subcritical Andronov-Hopf: a small unstable limit cycle shrinks to a stable equilibrium and makes it lose stability. Saddle Homoclinic Orbit: a limit cycle grows into a saddle. Fold Limit Cycle: a stable limit cycle is approached by an unstable one; they coalesce and then disappear. Abbreviated Fold, subHopf, Homo, and Fold Cycle, respectively.
spiking states to resting states. The INa,p +IK +IAut -model exhibits three types of combinations of these two pairs of bifurcations: Fold/Homo, subHopf/Fold Cycle, and Fold/subHopf. Notice that in the last case the fast subsystem has two resting states, an upper and a lower one, which was reported to occur in realistic neurons in [11]. The numerical results of the whole system are presented in Figs. 2, 4 and 6, with detailed geometrical analysis of the fast subsystem through phase portraits as shown in Figs. 3, 5 and 7, respectively. The values of the parameters for the three types of bursting are collected in Table 1. The two nullclines of the fast subsystem are

V-nullcline (V̇ = 0):  n = ( I − gL(V − EL) − gNa m∞(V)(V − ENa) ) / ( gK(V − EK) ) − gAut a / gK   (9)
n-nullcline (ṅ = 0):  n = n∞(V)

Equation (9) depicts the two nullclines. The n-nullcline has the shape of the Boltzmann function (S-shaped), while the V-nullcline is cubic (N-shaped), as shown in Figs. 3, 5 and 7. When the parameters except a are fixed, the n-nullcline is fixed, and the V-nullcline can be shifted vertically by adjusting a to change its intersections with the n-nullcline, and hence the stability of the equilibria. In other words, a acts as the bifurcation parameter of the fast subsystem, but the whole system does not undergo any bifurcation. In the following, we describe the bursters by gluing together the phase portraits of the frozen time points as illustrated in Figs. 3, 5 and 7. Bearing in mind the current state for each phase portrait is very important in the following description, where the term system denotes the fast subsystem.

Table 1. Values of the parameters of the bursters in their order of appearance, with units mV (V, k and E), mS/cm² (g), ms (τ), mV⁻¹ (d), μF/cm² (C) and μA/cm² (I)

Parameter   Fold/Homo   subHopf/Fold Cycle   Fold/subHopf
V1/2,m      -20         -30                  -20
km           15           7                   15
gNa          20           4                   20
ENa          60          60                   60
V1/2,n      -25         -45                  -20
kn            5           5                    5
gK            9           4                    9
EK          -90         -90                  -90
τn            0.152       1                    0.152
τa           20          50                   20
gAut          2           3                   10
Vthr        -20         -20                  -20
d             0.0006      0.0006               0.001
C             1           1                    1
I             5          55                    5
gL            8           1                    8
EL          -80         -78                  -80
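The role of a in equation (9) is a uniform vertical shift of the V-nullcline by gAut·a/gK, which is what moves its intersections with the fixed n-nullcline in the phase-portrait analysis below. A small check (defaults taken from the Fold/Homo column of Table 1):

```python
import numpy as np

def m_inf(V):
    # Boltzmann curve (3) with the Fold/Homo values V1/2,m = -20 mV, km = 15 mV
    return 1.0 / (1.0 + np.exp((-20.0 - V) / 15.0))

def V_nullcline(V, a, I=5.0, gL=8.0, EL=-80.0, gNa=20.0, ENa=60.0,
                gK=9.0, EK=-90.0, gAut=2.0):
    # n as a function of V on the V-nullcline, first line of equation (9)
    return ((I - gL * (V - EL) - gNa * m_inf(V) * (V - ENa))
            / (gK * (V - EK)) - gAut * a / gK)

V = np.linspace(-80.0, 20.0, 201)
shift = V_nullcline(V, 0.2) - V_nullcline(V, 0.0)   # uniform shift of -gAut*a/gK
```

The shift is constant in V, confirming that increasing a lowers the whole cubic curve without changing its shape.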
Fig. 2. Fold/Homo bursting (a)
(b)
0.8
(c)
(d)
n−nullcline
n
V−nullcline
−0.1 −70
(e)
V (f)
0
Fold (g)
stable unstable initial state frozen time point Homo
a
0.2
a 0 30
b
c
d
e
fg
a 113
time (ms)
Fig. 3. Phase portraits of Fold/Homo bursting
Fig. 4. subHopf/Fold Cycle bursting
Fig. 5. Phase portraits of subHopf/Fold Cycle bursting
Fig. 6. Fold/subHopf bursting
Fig. 7. Phase portraits of Fold/subHopf bursting
3.1 Fold/Homo Burster (Fig. 3)
(a) There are three equilibria: a stable node, a saddle and an unstable focus. The stable node represents the resting state, and the stable limit cycle surrounding the unstable focus represents the spiking state. The saddle acts as the threshold. The current state is around the resting state; a is decreasing slowly, hence the V-nullcline is shifting upward. (b) The stable node is approached by the saddle, their distance decreasing. (c) They coalesce into a saddle-node; the system is undergoing the saddle-node bifurcation. (d) The saddle-node disappears and the trajectory evolves to the limit cycle attractor to start spiking. (e) a alternates between fast jumps up and slow decay. Since the overall tendency of a is increasing, the V-nullcline is shifting downward, hence the limit cycle is growing into the saddle. (f) A saddle homoclinic orbit occurs, so the system is undergoing the saddle homoclinic orbit bifurcation. (g) The stable limit cycle disappears, and the system evolves toward the stable node. Then from (a) a new bursting cycle begins.

3.2 subHopf/Fold Cycle Burster (Fig. 5)
(a) There is a unique unstable equilibrium, which is surrounded by a stable limit cycle. The current state is spiking, and the overall tendency of a is increasing. (b) The equilibrium becomes stable and is surrounded by a new unstable limit cycle, itself surrounded by the original stable limit cycle. This does not change the current state, spiking, since the original bigger limit cycle is still stable. However, the new unstable limit cycle is growing to approach the original stable one. (c) The two limit cycles have coalesced and annihilated each other, i.e., the system has undergone the fold limit cycle bifurcation. The trajectory is evolving toward the unique stable focus. (d)-(f) a is decreasing but the focus is still stable, hence the current state is resting. (g) A new unstable limit cycle is generated between the stable focus and the stable limit cycle, which does not affect the current state at present, for the same reason as in case (b). However, a is decreasing, so in this case the newly generated unstable limit cycle is shrinking to the stable focus. (h) The shrinking has made the focus lose its stability, and the system has undergone the subcritical Andronov-Hopf bifurcation. However, since the current state is still nearly on the unstable focus, it takes the system a long time to escape from it. As shown in Fig. 4, this long interval is labeled delayed transition. Then from (a) a new bursting cycle begins.

3.3 Fold/subHopf Burster (Fig. 7)
(a) The system has three equilibria: a stable node, a saddle, and a stable focus. The current state is near the lower resting state, and the V-nullcline is shifting upward as a decreases slowly. (b) The saddle approaches the stable node; they coalesce and annihilate each other through the saddle-node bifurcation. (c) The lower resting state disappears, and the trajectory evolves toward the upper resting state. Since the upper resting state is a focus, it will
take a long time for the system to spiral from the state corresponding to the vanished lower resting state to the upper resting state. (d) Before the system can settle at the upper resting state, an unstable limit cycle appears, surrounding and shrinking toward the upper stable focus. Notice that the lower stable node and the saddle have reappeared. (e) The system undergoes the subcritical Andronov-Hopf bifurcation, and the upper equilibrium becomes unstable. Since the current state is still on its way to the original upper resting state (i.e., there is no delayed transition in this case), the trajectory evolves quickly toward the lower resting state. Then, from (a), a new bursting cycle begins.
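All three bursters above are fast-slow systems: a planar fast subsystem swept back and forth across its bifurcations by the slow autaptic variable a. The paper's conductance-based equations are not reproduced in this section, so as a generic, hypothetical stand-in, the sketch below integrates the classical Hindmarsh-Rose model, a well-known fold/homoclinic ("square-wave") burster of the type analyzed in Sect. 3.1; its slow variable z plays the role of a, and all parameter values are the standard textbook ones, not taken from this paper:

```python
# Hedged sketch: Hindmarsh-Rose square-wave (fold/homoclinic) burster,
# a generic stand-in for the paper's fast-slow burster.
# x, y: fast subsystem; z: slow variable (role of the autaptic variable a).
def simulate(T=2000.0, dt=0.01):
    a, b, c, d = 1.0, 3.0, 1.0, 5.0           # fast-subsystem parameters
    r, s, x_rest, I = 0.006, 4.0, -1.6, 2.0   # slow dynamics and drive
    x, y, z = -1.6, 0.0, 2.0
    xs = []
    for _ in range(int(T / dt)):              # forward Euler integration
        dx = y - a * x ** 3 + b * x ** 2 - z + I
        dy = c - d * x ** 2 - y
        dz = r * (s * (x - x_rest) - z)
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        xs.append(x)
    return xs

xs = simulate()
# upward crossings of x = 0 count the spikes inside the bursts; quiescent
# phases (x near rest) separate them as z slowly drifts between bifurcations
spikes = sum(1 for i in range(1, len(xs)) if xs[i - 1] < 0.0 <= xs[i])
```

Plotting xs shows the alternation between spiking and resting phases that the phase-portrait analysis above describes.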
4 Conclusions
In this paper we present a model of autapses driven by action potentials. The model autapse is applied to a neuron with a conductance-based minimal model to construct a fast-slow burster. Three types of bursting exhibited by the burster are analyzed geometrically through phase portraits, illustrating the capacity of the autapse model to modulate the bursting of the fast subsystem.
A Neural Network Model for Non-smooth Optimization over a Compact Convex Subset

Guocheng Li¹, Shiji Song¹, Cheng Wu¹, and Zifang Du²

¹ Department of Automation, Tsinghua University, Beijing 100084, China
[email protected]
² School of Statistics, Renmin University, 100872, China
Abstract. A neural network model is introduced for solving non-smooth optimization problems on a nonempty compact convex subset of Rⁿ. Using the subgradient, this neural network model is shown to obey a gradient system of differential inclusions. It is proved that the compact convex subset is positively invariant and attractive for the neural network system, and that all network trajectories starting from inside the compact convex subset converge to the set of equilibrium points of the neural network. Moreover, every equilibrium point of the neural network is an optimal solution of the primal problem. A numerical simulation example is given to illustrate the qualitative properties of the proposed neural network model.
1 Introduction

Optimization problems arise in many areas of scientific and engineering applications, and many practical engineering problems must be solved in real time. One possible and very promising approach to real-time optimization is to apply artificial neural networks. Hopfield and Tank [1-2] first proposed a neural network to deal with linear programming problems. Their work has motivated many researchers to investigate alternative neural networks for solving linear and nonlinear programming problems, and many neural network models for optimization have been established in the literature [3-7]. In this paper, we extend the recurrent neural network of Liang [6] for nonlinear continuously differentiable optimization to the case of non-smooth convex optimization. A neural network model is introduced for solving, in real time, a class of non-smooth convex optimization problems over a compact convex subset. Using the subgradient, this network is shown to obey a subgradient system described by a differential inclusion, and its dynamical behavior and optimization capabilities are rigorously analyzed in the framework of convex analysis and the theory of differential inclusions. Finally, a numerical simulation experiment is reported to substantiate the theoretical results.
2 Neural Network Model

We consider the minimization problem

    minimize E(x) subject to x ∈ Ω,   (1)

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 344-349, 2006. © Springer-Verlag Berlin Heidelberg 2006
where E : Rⁿ → R is a convex function and Ω is a compact convex subset of Rⁿ. When the objective function E(x) is smooth, the partial derivative ∂E(x)/∂xᵢ is single-valued. However, for non-smooth convex E(x), the subdifferential ∂E(x) is a set-valued mapping. For example, if E : R → R is given by E(x) = |x|, then

    ∂E(x) = co[sgn(x)] = {1} for x > 0, [−1, 1] for x = 0, {−1} for x < 0.

This discussion makes it clear that a neural network for optimization problem (1) is no longer described by a standard differential equation in which the velocity field dx/dt is a single-valued mapping. Rather, the neural network for (1) obeys a differential inclusion in which dx/dt becomes a set-valued mapping. Writing the network explicitly, it should satisfy the system of differential inclusions

    τ ẋ(t) ∈ F_α(x) = −x + P_Ω(x − α ∂E(x)),   (2)

where P_Ω(·) is the projection from Rⁿ onto Ω,

    P_Ω(x − α ∂E(x)) = {P_Ω(x − αv) : v ∈ ∂E(x)},

and τ > 0 and α > 0 are, respectively, the time constant and the step-size parameter of the network model.

First, we must explain what is meant by a solution of a Cauchy problem associated with the system of differential inclusions (2). Since F_α is a set-valued mapping whose values are nonempty compact convex subsets, we adopt the following definition (cf. Filippov [8]).

Definition 1. A function x(·) on [t₁, t₂] (t₂ > t₁) is a solution of (2) if x(·) is absolutely continuous, x(t₁) = x₀, and τ ẋ(t) ∈ F_α(x(t)) for almost all t ∈ [t₁, t₂].

The importance of Filippov solutions for engineering applications stems from the fact that they are good approximations of solutions of actual systems with very high-slope nonlinearities (cf. [9-10]).

Theorem 1. (i) For any x₀ ∈ Rⁿ, there exists at least one solution of (2) with initial condition x(0) = x₀ on the global time interval [0, ∞). (ii) Ω is a positively invariant and attractive set of (2).

Proof. Since the projection operator P_Ω(·) is continuous and ∂E(·) is an upper semi-continuous set-valued mapping, F_α(·) is also an upper semi-continuous set-valued mapping, and F_α(x) is a compact convex subset for any x ∈ Rⁿ. Therefore, by Theorem 1 of [11], there exists at least one local solution x(·) on [0, t₁] (t₁ > 0) satisfying (2) with x(0) = x₀. Following the argument of
Theorem 2 of [6], adapted to the set-valued case, we argue as follows. Let H(x) = ½ d²(x, Ω). For almost all t ∈ [0, t₁],

    τ dH(x(t))/dt = (x(t) − P_Ω(x(t)))ᵀ (τ dx(t)/dt)
      ≤ sup_{v ∈ ∂E(x(t))} (x(t) − P_Ω(x(t)))ᵀ [−(x(t) − P_Ω(x(t))) + (P_Ω(x(t) − αv) − P_Ω(x(t)))]
      ≤ −‖x(t) − P_Ω(x(t))‖² = −2H(x(t)).   (3)

By the comparison principle, the differential inequality (3) yields

    d(x(t), Ω) ≤ d(x₀, Ω) exp(−t/τ),  x₀ ∈ Rⁿ, ∀t ≥ 0.   (4)

This means that the solution x(t) of (2) with initial condition x(0) = x₀ is bounded and hence well defined on [0, ∞). From (4), if x₀ ∈ Ω then d(x(t), Ω) = 0 and x(t) ∈ Ω; if x₀ ∉ Ω then d(x(t), Ω) → 0 at the exponential rate 1/τ as t → ∞, which means that x(t) converges to the set Ω exponentially. □
An equilibrium point xᵉ ∈ Rⁿ of (2) is a stationary solution. Clearly, xᵉ is an equilibrium point of (2) if and only if 0 ∈ F_α(xᵉ). The set of equilibrium points of (2) is thus

    Ω_α^e = {x ∈ Rⁿ : 0 ∈ F_α(x)} = {x ∈ Rⁿ : x ∈ P_Ω(x − α ∂E(x))}.

Theorem 2. For any α > 0, Ω_α^e ≠ ∅.

Proof. Since F_α is an upper semi-continuous set-valued mapping whose values are nonempty compact convex subsets, and F_α(x) ∩ T_Ω(x) ≠ ∅ for any x ∈ Ω, Theorem 1 in [11, p. 228] implies that (2) has at least one equilibrium point. □
Now consider the optimization problem (1). Let M = arg min_{x∈Ω} E(x) be the set of global minima of problem (1); then the following basic result holds.

Theorem 3. For any α > 0, Ω_α^e ⊆ M.

Proof. If x ∈ Ω_α^e, then there exists v ∈ ∂E(x) such that x = P_Ω(x − αv). For any y ∈ Ω, we have

    [P_Ω(x − αv) − (x − αv)]ᵀ [P_Ω(x − αv) − y] ≤ 0,

that is, vᵀ(y − x) ≥ 0. Since E is convex, by the definition of the subdifferential, for any y ∈ Ω and v ∈ ∂E(x),

    E(y) − E(x) ≥ vᵀ(y − x) ≥ 0.

Therefore x is a global minimum of problem (1), i.e., x ∈ M. □
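The key step in the proof is the variational characterization of the projection: [P_Ω(z) − z]ᵀ[P_Ω(z) − y] ≤ 0 for all y ∈ Ω. As a quick numerical sanity check (our own illustration, with Ω taken to be the closed unit disk in R²):

```python
import math
import random

def proj_ball(z, r=1.0):
    # Euclidean projection P_Omega onto the closed ball of radius r
    n = math.hypot(z[0], z[1])
    return list(z) if n <= r else [r * z[0] / n, r * z[1] / n]

random.seed(0)
for _ in range(1000):
    z = [random.uniform(-3.0, 3.0), random.uniform(-3.0, 3.0)]
    # pick some y in Omega by projecting a random square point into the disk
    y = proj_ball([random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)])
    p = proj_ball(z)
    # variational inequality: [P(z) - z]^T [P(z) - y] <= 0 for all y in Omega
    ip = (p[0] - z[0]) * (p[0] - y[0]) + (p[1] - z[1]) * (p[1] - y[1])
    assert ip <= 1e-12
```

The inequality holds with equality only when z already lies in Ω (so P_Ω(z) = z).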
Next, we recall the concept of a limit point of a function, in order to present the main result on trajectory convergence and the optimization capabilities of network model (2).

Definition 2. We say that x* is a limit point of x(t) as t → ∞ if there exists a sequence {t_n} with t_n → ∞ such that x(t_n) → x*.

Theorem 4. For any α > 0, every limit point of a trajectory of (2) starting from inside Ω belongs to Ω_α^e; that is, lim_{t→∞} d(x(t), Ω_α^e) = 0.
Proof. Let x(·) be any solution of (2) with initial condition x(0) = x₀. By (ii) of Theorem 1, if x₀ ∈ Ω then x(t) ∈ Ω for all t ≥ 0. Note that τ ẋ ∈ F_α(x) = −x + P_Ω(x − α ∂E(x)); hence

    ‖ẋ(t)‖ ≤ (1/τ)[‖x(t)‖ + max_{x∈Ω} ‖x‖] ≤ M_τ for almost all t ≥ 0,

where M_τ = (2/τ) max_{x∈Ω} ‖x‖. Let V(x(t)) = τα E(x(t)) and y(t, ω) = x(t) − αω. By Property 1 in [5] and the projection theorem, differentiating V(x(t)) along the solution x(t) of (2) gives

    dV(x(t))/dt = τα ωᵀ ẋ(t)
      ≤ sup_{ω ∈ ∂E(x(t))} (αω)ᵀ (−x(t) + P_Ω(x(t) − αω))
      = sup_{ω ∈ ∂E(x(t))} [(x(t) − P_Ω(y(t, ω))) + (P_Ω(y(t, ω)) − y(t, ω))]ᵀ [P_Ω(y(t, ω)) − x(t)]
      ≤ sup_{ω ∈ ∂E(x(t))} −‖x(t) − P_Ω(y(t, ω))‖²
      = sup_{ω ∈ ∂E(x(t))} −‖x(t) − P_Ω(x(t) − αω)‖²
      = −dist²(x(t), P_Ω(x(t) − α ∂E(x(t)))) ≤ 0.   (5)

Let W_α(x) = dist²(x, P_Ω(x − α ∂E(x))). Since x → P_Ω(x − α ∂E(x)) is upper semi-continuous, x → W_α(x) is a lower semi-continuous nonnegative function on Rⁿ. Clearly, p ∈ Ω_α^e if and only if W_α(p) = 0.

For any α > 0, let ω_α(x(·)) be the set of limit points of a solution x(·) of (2), that is, ω_α(x(·)) = ∩_{t′∈[0,∞)} cl{x(t) : t ≥ t′}. Since x(t) ∈ Ω for t ≥ 0, it suffices to prove that ω_α(x(·)) ⊆ Ω_α^e. For each p ∈ ω_α(x(·)), there exists an increasing sequence {t_n} such that lim_{n→∞} t_n = ∞ and lim_{n→∞} x(t_n) = p.

By (5), V(x(t_n)) is non-increasing and bounded from below. Therefore there exists a constant V₀ such that lim_{n→∞} V(x(t_n)) = V₀. Since V(x(t)) is non-increasing, V(x(t)) ≥ V₀ and lim_{t→∞} V(x(t)) = V₀. From (5) we further have
    ∫₀ᵗ W_α(x(s)) ds ≤ −∫₀ᵗ (d/ds)[V(x(s))] ds = V(x(0)) − V(x(t)),

and therefore

    ∫₀^∞ W_α(x(s)) ds ≤ V(x₀) − V₀.   (6)

Next we prove that p ∈ Ω_α^e. Suppose not, i.e., W_α(p) > 0. Since x → W_α(x) is lower semi-continuous, there exist a constant η > 0 and a small neighborhood of p, B_{2ε}(p) = {x ∈ Ω : ‖x − p‖ < 2ε}, such that W_α(x) > η for all x ∈ B_{2ε}(p). We claim that x(t) ∈ B_{2ε}(p) for all t ∈ ∪_{n≥n₀} [t_n − ε/M_τ, t_n + ε/M_τ], where n₀ is a positive integer such that ‖x(t_n) − p‖ < ε for all n ≥ n₀. Indeed, for t ∈ [t_n − ε/M_τ, t_n + ε/M_τ] and n ≥ n₀,

    ‖x(t) − p‖ ≤ ‖x(t) − x(t_n)‖ + ‖x(t_n) − p‖ ≤ M_τ |t − t_n| + ε ≤ 2ε.

Therefore W_α(x(t)) > η > 0 for all t ∈ ∪_{n≥n₀} [t_n − ε/M_τ, t_n + ε/M_τ]. Since lim_{n→∞} t_n = ∞, the Lebesgue measure of this union is infinite, and it follows that

    ∫₀^∞ W_α(x(t)) dt = ∞.   (7)

The contradiction between (6) and (7) implies that W_α(p) = 0, i.e., p ∈ Ω_α^e, and hence lim_{t→∞} d(x(t), Ω_α^e) = 0. □
From Theorems 3 and 4, every limit point of a trajectory of (2) starting from inside Ω is an optimal solution of problem (1). In the following, we give a numerical simulation example to illustrate the qualitative behavior of model (2) for an optimization problem over different compact convex subsets.

Example 1. Consider the optimization problem

    min |x₁² − x₂|,  x = (x₁, x₂) ∈ Ω,

with the two cases Ω₁ = {x ∈ R² : |x₁| + |x₂| ≤ 1} and Ω₂ = {x ∈ R² : x₁² + x₂² ≤ 1}. We select 120 random points, uniformly distributed over [−1, 1] × [−1, 1], as initial points for the solution trajectories of model (2). The trajectories for the cases Ω₁ and Ω₂ are shown in Fig. 1 and Fig. 2, respectively, demonstrating that all network trajectories converge to the minimum set of the optimization problem.
Fig. 1. Trajectories for the case Ω₁     Fig. 2. Trajectories for the case Ω₂
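A minimal re-implementation of Example 1 for the case Ω₁ can be sketched as follows. This is our own discretization, not the authors' code: an explicit Euler step of (2) with a single subgradient selection, the standard sort-based ℓ1-ball projection, and illustrative parameter values (α, τ, dt, step counts) not taken from the paper:

```python
import math

def subgrad_E(x1, x2):
    # one selection from the subdifferential of E(x) = |x1^2 - x2|
    s = x1 * x1 - x2
    sgn = 0.0 if s == 0.0 else math.copysign(1.0, s)
    return (sgn * 2.0 * x1, -sgn)

def proj_l1(x, r=1.0):
    # Euclidean projection onto the l1 ball {|x1| + |x2| <= r} (sort-based)
    if abs(x[0]) + abs(x[1]) <= r:
        return list(x)
    u = sorted((abs(c) for c in x), reverse=True)
    theta, cum = 0.0, 0.0
    for i, ui in enumerate(u, 1):
        cum += ui
        if ui > (cum - r) / i:
            theta = (cum - r) / i
    return [math.copysign(max(abs(c) - theta, 0.0), c) for c in x]

def run(x0, alpha=0.05, tau=1.0, dt=0.05, steps=8000):
    # explicit Euler step of tau x' = -x + P_Omega(x - alpha v), v in dE(x)
    x = list(x0)
    for _ in range(steps):
        v = subgrad_E(x[0], x[1])
        p = proj_l1([x[0] - alpha * v[0], x[1] - alpha * v[1]])
        x = [x[j] + (dt / tau) * (p[j] - x[j]) for j in range(2)]
    return x

# final states land inside Omega_1, near the minimum set {x2 = x1^2}
finals = [run(x0) for x0 in [(0.9, -0.9), (-0.8, 0.7), (0.2, 0.95)]]
```

Because the update is a convex combination of the current state and a point of Ω₁, iterates approach Ω₁ geometrically, matching the exponential attractivity proved in Theorem 1.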
Acknowledgment. The authors would like to thank the 973 Project (No. 2002CB312205) and the National Natural Science Foundation of China (No. 60574077) for their support.
References

1. Hopfield, J.J., Tank, D.W.: Neural Computation of Decisions in Optimization Problems. Biol. Cybern. 52 (1985) 141-152
2. Tank, D.W., Hopfield, J.J.: Simple 'Neural' Optimization Networks: an A/D Converter, Signal Decision Circuit, and a Linear Programming Circuit. IEEE Trans. Circuits Syst. 33 (1986) 533-541
3. Kennedy, M.P., Chua, L.O.: Neural Networks for Nonlinear Programming. IEEE Trans. Circuits Syst. 35 (1988) 554-562
4. Rodríguez-Vázquez, A., Domínguez-Castro, R., Rueda, A., Huertas, J.L., Sánchez-Sinencio, E.: Nonlinear Switched-Capacitor Neural Networks for Optimization Problems. IEEE Trans. Circuits Syst. 37 (1990) 384-397
5. Forti, M., Nistri, P., Quincampoix, M.: Generalized Neural Network for Non-smooth Nonlinear Programming Problems. IEEE Trans. Circuits Syst. 51 (2004) 1741-1754
6. Liang, X.B.: A Recurrent Neural Network for Nonlinear Continuously Differentiable Optimization over a Compact Convex Subset. IEEE Trans. Neural Networks 12 (2001) 1487-1490
7. Wang, J.: A Deterministic Annealing Neural Network for Convex Programming. Neural Networks 7 (1994) 629-641
8. Filippov, A.F.: Differential Equations with Discontinuous Right-Hand Side. Kluwer Academic, Dordrecht (1988)
9. Utkin, V.I.: Sliding Modes and Their Application in Variable Structure Systems. MIR, Moscow (1978)
10. Paden, B.E., Sastry, S.S.: A Calculus for Computing Filippov's Differential Inclusion with Application to the Variable Structure Control of Robot Manipulators. IEEE Trans. Circuits Syst. 34 (1987) 73-82
11. Aubin, J.P., Cellina, A.: Differential Inclusions. Springer-Verlag, Berlin (1984)
Differential Inclusions-Based Neural Networks for Nonsmooth Convex Optimization on a Closed Convex Subset

Shiji Song, Guocheng Li, and Xiaohong Guan

Department of Automation, Tsinghua University, Beijing 100084, China
[email protected]
Abstract. Differential inclusions-based dynamic feedback neural network models are introduced to solve, in real time, nonsmooth convex optimization problems restricted to a closed convex subset of Rⁿ. First, a differential inclusion-based dynamic feedback neural network model for solving the unconstrained optimization problem is established, and its stability and convergence are investigated. Then, based on these results and the method of successive approximation, differential inclusions-based dynamic feedback neural network models for solving, in real time, nonsmooth optimization problems on a closed convex subset are successively constructed, and their dynamical behavior and optimization capabilities are analyzed rigorously.
1 Introduction
Most existing feedback neural network models for solving optimization problems in real time are constructed using the concept of a gradient: the objective functions are required to be smooth, and the network models are described by differential equations (cf. [1-7]). Recently, neural networks with discontinuous neuron activation functions have been presented in Forti [8] and Lu [9], where the dynamical behaviors and global convergence in finite time are investigated for the respective networks. In addition, using Clarke's generalized gradient of the involved functions, a generalized nonlinear programming circuit is shown to obey a gradient system of differential inclusions, and its dynamical behavior and optimization capabilities, both for convex and nonconvex problems, are rigorously analyzed in Forti [10]. This paper is devoted to solving, in real time, nonsmooth convex optimization problems restricted to a closed convex subset of Rⁿ. First, using nonsmooth analysis in place of the Lyapunov stability theory and the LaSalle invariance principle, we investigate the stability and convergence of a network constructed from a differential inclusion for solving the unconstrained optimization problem. For the optimization problem on a closed convex subset, we then successively construct an iterative sequence of energy functions and corresponding dynamic subnetworks, each described by a differential inclusion system. We prove that the trajectory of

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 350-358, 2006. © Springer-Verlag Berlin Heidelberg 2006
each subnetwork converges to a minimum point of its energy function, which is also an equilibrium point of the subnetwork, and that the sequence of minimum points of these energy functions converges to a minimum point of the primal optimization problem; the convergence theorem is then obtained. Finally, the conclusions of the paper are summarized.
2 Unconstrained Optimization Problem
In this paper, we first suppose that V is a convex function from Rⁿ to R. Then, from Aubin and Cellina [11], the following properties are well known.

Lemma 1. If V is a convex function from Rⁿ to R, the following propositions hold: (i) V(x) − V(x − v) ≤ DV(x)(v) ≤ V(x + v) − V(x); (ii) ∂V(x) = {p ∈ Rⁿ : V(x) − V(y) ≤ (p, x − y), ∀y ∈ Rⁿ}.

Lemma 2. Let V be a convex function from Rⁿ to R. The following statements are equivalent: (i) x* minimizes V on Rⁿ; (ii) 0 ∈ ∂V(x*); (iii) 0 ≤ DV(x*)(v) for all v ∈ Rⁿ.

Moreover, we always assume that V(x) is a coercive function, that is, V(x) → +∞ as ‖x‖ → +∞. For x₀ ∈ Rⁿ, the set L(x₀) = {x ∈ Rⁿ : V(x) ≤ V(x₀)} is called a level set. Then the following result holds (cf. [12]).

Lemma 3. Let V(x) be a continuous convex function. Then every level set of V(x) is bounded if and only if V(x) is coercive.

It is easy to verify that if V(x) is a convex coercive function, then V(x) attains its minimum at some point. We consider the general unconstrained nonsmooth convex optimization problem

    min V(x), x ∈ Rⁿ   (1)

together with the subgradient differential inclusion

    x′(t) ∈ −∂V(x(t)), x(0) = x₀,   (2)

where V : Rⁿ → R is a convex function. A function x(·) defined on [t₀, ∞) is said to be a solution of (2) if x(·) is an absolutely continuous function satisfying x(t₀) = x₀ and x′(t) ∈ −∂V(x(t)) for almost all t ∈ [t₀, ∞). The concept of solution here is in the sense of Filippov, and the derivative of the solution means the Gâteaux derivative (cf. [11] or [13]). An equilibrium point x* ∈ Rⁿ of (2) means a constant solution of (2), that is, x(t) ≡ x*, t ∈ [0, ∞).
It is clear that x* is an equilibrium point of (2) if and only if 0 ∈ ∂V(x*) and x* = x₀. The following Theorem 1 provides the existence, uniqueness, and several further properties of solutions of the differential inclusion (2) (cf. Aubin and Cellina [11]).

Theorem 1. Let V : Rⁿ → R be a convex function. Then for any x₀ ∈ Rⁿ there exists a unique solution x(·) of (2) defined on [0, ∞). Moreover: (i) t → ‖x′(t)‖ is a non-increasing function; (ii) if x(·), y(·) are two solutions corresponding to the initial points x₀ and y₀, respectively, then ‖x(t) − y(t)‖ ≤ ‖x₀ − y₀‖ for t ≥ 0; (iii) V(x(t)) is a convex non-increasing function satisfying

    (d/dt) V(x(t)) + ‖x′(t)‖² = 0.   (3)
From (ii) of Theorem 1 we know that the solutions of (2) are Lyapunov stable. In order to establish the convergence theorem, we give the following definition and lemma (cf. Aubin and Cellina [11]).

Definition 1. We say that x* is a limiting point of x(t) as t → ∞ if there exists a sequence {t_n} → ∞ such that lim_{n→∞} x(t_n) = x*.

Lemma 4. Let V : Rⁿ → R be a convex coercive function and x(·) a solution of (2). If every limiting point x* of x(t) as t → ∞ attains the minimum of V, then x(t) converges to a minimum point of V as t → ∞.

From Lemma 4 we further have the following convergence theorem.

Theorem 2. Let V : Rⁿ → R be a convex coercive function. Then for any x₀ ∈ Rⁿ, the corresponding trajectory of (2) converges to a minimum point of V.

Proof. For any x₀ ∈ Rⁿ, let x(t) = x(t, x₀) be the unique solution of (2). By Lemma 4, we only need to check that every limiting point x* of x(t) as t → ∞ attains the minimum of V. Integrating equality (3), we deduce that

    V(x(t)) − V(x(s)) + ∫ₛᵗ ‖x′(τ)‖² dτ = 0.

Since V(x(t)) is convex and coercive, V attains its minimum, so V(x(t)) is bounded from below and non-increasing by Theorem 1. Further, we have

    lim_{t,s→∞} ∫ₛᵗ ‖x′(τ)‖² dτ = lim_{t,s→∞} (V(x(s)) − V(x(t))) = 0.
Hence the Cauchy criterion yields

    ∫₀^∞ ‖x′(τ)‖² dτ = lim_{t→∞} ∫₀ᵗ ‖x′(τ)‖² dτ < ∞.   (4)
From (i) of Theorem 1 and equation (4), we have lim_{t→∞} ‖x′(t)‖ = 0. Take a minimum point x̄ of V; then by Lemma 2, x̄ is an equilibrium point of (2), and by (ii) of Theorem 1, ‖x(t) − x̄‖ ≤ ‖x₀ − x̄‖ for all t ≥ 0. Therefore the set {x(t) : t ≥ 0} is bounded. By (ii) of Lemma 1, we deduce from the inclusion x′(t) ∈ −∂V(x(t)) that for all y ∈ Rⁿ,

    inf_{t≥0} V(x(t)) ≤ V(x(t)) ≤ V(y) + (−x′(t), x(t) − y) ≤ V(y) + ‖x′(t)‖ ‖x(t) − y‖ ≤ V(y) + M_y ‖x′(t)‖,

where M_y = sup_{t≥0} ‖x(t) − y‖ is finite because {x(t) : t ≥ 0} is bounded. Therefore, letting t → ∞ and taking the infimum over all y ∈ Rⁿ, we obtain

    inf_{t≥0} V(x(t)) = inf_{y∈Rⁿ} V(y).

Because the set {x(t) : t ≥ 0} is bounded, x(t) has limiting points as t → ∞. Noting that t → V(x(t)) is a non-increasing function, for any limiting point x* it follows that

    V(x*) = lim_{t_n→∞} V(x(t_n)) = inf_{t≥0} V(x(t)) = inf_{y∈Rⁿ} V(y).

Lemma 4 then guarantees that x(t) converges to a minimum point of V, which is also an equilibrium point of (2), as t → ∞. This completes the proof of the theorem.

From the theorems above, two results are obtained immediately: (i) when V(x) has a unique minimum point x*, then x* is globally, uniformly, and asymptotically stable; (ii) when V(x) has infinitely many minimum points, then for an arbitrary initial point x₀ ∈ Rⁿ, the trajectory of neural network (2) converges to one of its minimum points.
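Theorem 2 can be illustrated numerically. The sketch below uses a hypothetical one-dimensional example of our own, V(x) = |x − 1| + x²/2, which is convex, coercive, and nonsmooth at x = 1; its unique minimizer is x* = 1 since 0 ∈ ∂V(1) = [−1, 1] + 1 = [0, 2]. The flow (2) is discretized by an explicit Euler step with one subgradient selection:

```python
# Hypothetical example: V(x) = |x - 1| + x^2 / 2, unique minimizer x* = 1.
def V(x):
    return abs(x - 1.0) + 0.5 * x * x

def g(x):
    # one selection from the subdifferential dV(x)
    s = 0.0 if x == 1.0 else (1.0 if x > 1.0 else -1.0)
    return s + x

def flow(x0, h=1e-3, steps=10000):
    # explicit Euler discretization of x'(t) in -dV(x(t))
    x = x0
    for _ in range(steps):
        x -= h * g(x)
    return x
```

From any initial point the iterate settles near x* = 1, up to a chatter of order h caused by taking a fixed step across the nonsmooth point.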
3 Optimization Restricted on a Closed Convex Subset
In this section, based on the preceding results and a procedure of successive approximation, we construct energy functions and corresponding neural subnetworks, described by differential inclusions, to solve in real time the nonsmooth convex optimization problem on a closed convex subset. Consider the nonsmooth convex optimization problem

    min V(x), x ∈ Ω,   (5)
where x ∈ Rⁿ, V : Rⁿ → R is a convex coercive function, and Ω is a closed convex subset of Rⁿ. By the Weierstrass theorem (cf. the Appendix of [12]), (5) attains at least one optimal solution in Ω.

Take M₁ as a lower bound of the optimal value of (5), i.e., M₁ ≤ V(x*), where x* is an optimal solution of (5). Set V₁(x) = V(x) − M₁ and F(x, M₁) = ¼ V₁(x)[V₁(x) + |V₁(x)|]. It is easy to verify that F(x, M₁) is a nonnegative, continuous, convex, coercive function, and

    ∂F(x, M₁) = (V(x) − M₁) ∂V(x) if V(x) ≥ M₁,  {0} if V(x) < M₁.

We construct the energy function on Rⁿ

    E(x, M₁) = F(x, M₁) + ½ min_{y∈Ω} ‖x − y‖².   (6)
Obviously, E(x, M₁) is also a nonnegative, continuous, convex, coercive function. Its subdifferential is ∂E(x, M₁) = ∂F(x, M₁) + x − P_Ω(x), where P_Ω(x) is the projection of x onto Ω. For the energy function E(x, M₁), it is easy to see that the following two theorems hold; their proofs are omitted here.

Theorem 3. Let x* be an optimal solution of (5) and M₁ ≤ V(x*). Then E(x, M₁) has a minimum point on Rⁿ and min_{x∈Rⁿ} E(x, M₁) ≤ E(x*, M₁).

Theorem 4. 0 ∈ ∂E(x*, M₁) if and only if x* is a minimum point of E(x, M₁); that is, the set of equilibrium points of E(x, M₁) equals the set of its minimum points on Rⁿ.

From the preceding results, a neural network for finding the minimum points of E(x, M₁) on Rⁿ can be constructed as

    x′(t) ∈ −∂E(x(t), M₁).   (7)

The network described by the differential inclusion (7) is a subnetwork for problem (5). From Theorems 1 and 2, we have the following theorem about this subnetwork.

Theorem 5. (1) For any x₀ ∈ Rⁿ, network (7) has a unique trajectory x(t, x₀) on [0, +∞) with x(0) = x₀; (2) any minimum point of E(x, M₁) is Lyapunov stable; (3) for any x₀ ∈ Rⁿ, the corresponding trajectory of network (7) converges to a minimum point of E(x, M₁) on Rⁿ.
Now we construct the energy function sequence and give the convergence theorem. Suppose that x* is an optimal solution of problem (5) and M₁ is an estimate of a lower bound of its optimal value, i.e., M₁ ≤ V(x*). We construct the energy function sequence

    E(x, M_k) = F(x, M_k) + ½ min_{y∈Ω} ‖x − y‖²,  k = 1, 2, ...,   (8)

where F(x, M_k) = ¼ V_k(x)[V_k(x) + |V_k(x)|], V_k(x) = V(x) − M_k, M_{k+1} = M_k + 2E(x_k, M_k), and x_k is a minimum point of E(x, M_k) on Rⁿ. Note that E(x, M_k) is a convex coercive function, so by Theorem 2, x_k is well defined. By the preceding analysis, we now establish the following convergence result.

Theorem 6 (Convergence Theorem). Suppose that x* is an optimal solution of problem (5). Then for the energy function sequence E(x, M_k), k = 1, 2, ..., constructed above, the following results hold: (1) for k = 1, 2, ..., M_k ≤ M_{k+1}, M_k ≤ V(x*), and V(x_k) ≤ V(x*); (2) lim_{k→∞} M_k = M*; (3) the sequence {x_k} has limiting points, and every limiting point x̄ of {x_k} satisfies V(x̄) = V(x*) = M*.

Proof. (1) Since M_{k+1} = M_k + 2E(x_k, M_k) ≥ M_k for k = 1, 2, ..., we have M_k ≤ M_{k+1}. By hypothesis M₁ ≤ V(x*); suppose M_k ≤ V(x*) for some k. Since x_k is a minimum point of E(x, M_k),

    E(x_k, M_k) ≤ E(x*, M_k) = F(x*, M_k) + ½ min_{y∈Ω} ‖x* − y‖² = F(x*, M_k) = ½ [V(x*) − M_k]².

Hence M_{k+1} = M_k + 2E(x_k, M_k) ≤ V(x*). By the induction principle, M_k ≤ V(x*) holds for all k. Moreover, 2E(x_k, M_k) ≥ 2F(x_k, M_k) ≥ V(x_k) − M_k for k = 1, 2, ..., that is, V(x_k) ≤ M_k + 2E(x_k, M_k) = M_{k+1} ≤ V(x*), so V(x_k) ≤ V(x*) for all k.

(2) Since the sequence {M_k} is monotone non-decreasing and bounded above by V(x*), claim (2) follows immediately, with M* ≤ V(x*).

(3) Since V(x) is coercive and V(x_k) ≤ V(x*), by Lemma 3 the sequence {x_k} is bounded. Therefore there exists a subsequence {x_{k_i}} ⊆ {x_k} (k_i < k_{i+1}) such that lim_{i→+∞} x_{k_i} = x̄. By the continuity of V(x),

    lim_{i→+∞} V(x_{k_i}) = V(x̄) ≤ V(x*).

From M_{k_i+1} = M_{k_i} + 2E(x_{k_i}, M_{k_i}) (i = 1, 2, ...) and the continuity of E(x, M_k), taking the limit as i → +∞ gives 2E(x̄, M*) = 0. From this fact it follows that

    F(x̄, M*) = 0 ⟹ V(x̄) ≤ M*,   ½ min_{y∈Ω} ‖x̄ − y‖² = 0 ⟹ x̄ ∈ Ω.
So x̄ is a feasible solution of problem (5), and M* ≤ V(x*) ≤ V(x̄). Combining the above, we have V(x*) = V(x̄) = M*, which completes the proof of the theorem.

Since problem (5) may have many optimal solutions, x̄ need not equal x*; the previous equation shows that every limiting point x̄ of {x_k} is an optimal solution of problem (5), and the corresponding limit value V(x̄) of {V(x_k)} is the optimal value of problem (5).

Based on the above convergence theorem, we can construct a feedback neural network for solving optimization problem (5). By Theorem 5, the trajectory of network (7) converges to a minimum point x₁ of E(x, M₁). Let M₂ = M₁ + 2E(x₁, M₁); then by (1) of Theorem 6, M₂ ≤ V(x*). Construct the energy function on Rⁿ

    E(x, M₂) = F(x, M₂) + ½ min_{y∈Ω} ‖x − y‖²,   (9)

with the corresponding neural network for obtaining its minimum point

    x′(t) ∈ −[∂F(x, M₂) + x − P_Ω(x)].   (10)

Replacing M₁ with M₂ in Theorems 4-5 above, the trajectory of network (10) likewise converges to a minimum point x₂ of E(x, M₂) on Rⁿ. Let M₃ = M₂ + 2E(x₂, M₂); by (1) of Theorem 6, M₃ ≤ V(x*). We can again construct the energy function E(x, M₃) and the corresponding neural network for finding its minimum point on Rⁿ, and so on. In general, let M_k = M_{k-1} + 2E(x_{k-1}, M_{k-1}); then M_k ≤ V(x*). We construct the energy function on Rⁿ

    E(x, M_k) = F(x, M_k) + ½ min_{y∈Ω} ‖x − y‖²,   (11)

with the corresponding neural networks for obtaining its minimum points

    x′(t) ∈ −[∂F(x, M_k) + x − P_Ω(x)].   (12)

Similarly, replacing M₁ with M_k in Theorems 3-5 above, the trajectory of network (12) converges to a minimum point x_k of E(x, M_k). Hence the feedback neural network for solving problem (5) is constructed as

    x′(t) ∈ −[∂F(x, M_k) + x − P_Ω(x)], x(0) = x_k, k = 1, 2, ...,   (13)
Differential Inclusions-Based Neural Networks
where the parameter M1 ≤ V(x∗) and Mk+1 = Mk + 2E(xk, Mk); in this network, each parameter Mk+1 is determined by the equilibrium point xk and the parameter Mk of the preceding subnetwork. By Theorem 6, every limiting point of the sequence {xk} produced by the feedback neural network (13) is an optimal solution of (5), and the limiting values of the sequences {V(xk)} and {Mk} equal its optimal value.
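The update scheme driving the feedback network (13) — run the subnetwork (12) to an equilibrium xk for the current Mk, then set Mk+1 = Mk + 2E(xk, Mk) — can be sketched in a few lines. This is only an illustrative outline, not the authors' implementation (for that, see [14]); `minimize_energy` and `E` stand for the subnetwork solver and the energy function and are hypothetical placeholders.

```python
def energy(x, M, F, dist2):
    # E(x, M) = F(x, M) + (1/2) * min_{y in Omega} ||x - y||^2,
    # where dist2(x) returns the squared distance from x to Omega.
    return F(x, M) + 0.5 * dist2(x)

def feedback_iteration(x0, M1, minimize_energy, E, n_iter=20):
    """Generate the sequences {x_k} and {M_k} of the feedback network (13):
    solve the subnetwork for the current M_k, then update
    M_{k+1} = M_k + 2 E(x_k, M_k)."""
    x, M = x0, M1
    history = []
    for _ in range(n_iter):
        x = minimize_energy(x, M)   # equilibrium x_k of subnetwork (12)
        M = M + 2.0 * E(x, M)       # by Theorem 6, M_{k+1} <= V(x*)
        history.append((x, M))
    return history
```

By the convergence theorem, the limiting points of the generated {x_k} solve problem (5), and {M_k} increases toward the optimal value.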
4
Conclusions
This paper aims to solve, in real time, nondifferentiable convex optimization problems on a closed convex subset of Rn. Dynamic feedback neural network models based on differential inclusions are introduced by successively constructing a sequence of energy functions and the corresponding dynamic subnetworks, which are described by differential inclusion systems. The dynamical behavior and optimization capabilities of the proposed network models are analyzed rigorously, and a convergence theorem for solving the primal problem in real time is obtained. For implementation techniques related to this paper, the reader may refer to the authors' paper [14], where simulation algorithms and numerical simulation experiments are given.
Acknowledgment. The authors gratefully acknowledge the support of the 973 Project (No. 2002cb312205) and the National Science Foundation of China (No. 60574077).
References

1. Hopfield, J.J., Tank, D.W.: Neural Computation of Decisions in Optimization Problems. Biol. Cybern. 52 (1985) 141-152
2. Tank, D.W., Hopfield, J.J.: Simple Neural Optimization Network: An A/D Converter, Signal Decision Circuit, and a Linear Programming Circuit. IEEE Trans. Circuits Syst. 33 (1986) 533-541
3. Kennedy, M.P., Chua, L.O.: Neural Networks for Nonlinear Programming. IEEE Trans. Circuits Syst. 35 (1988) 554-562
4. Rodríguez-Vázquez, A., Rueda, R., Huertas, J.L., Sánchez-Sinencio, E.: Nonlinear Switched-Capacitor Neural Networks for Optimization Problems. IEEE Trans. Circuits Syst. 37 (1990) 384-397
5. Bouzerdoum, A., Pattison, T.R.: Neural Network for Quadratic Optimization with Bound Constraints. IEEE Trans. Neural Networks 4 (1993) 293-304
6. Sudharsanan, S., Sundareshan, M.: Exponential Stability and a Systematic Synthesis of a Neural Network for Quadratic Minimization. Neural Networks 4 (1991) 599-613
7. Leung, Y., Chen, K.Z., Gao, X.B.: A High-Performance Feedback Neural Network for Solving Convex Nonlinear Programming Problems. IEEE Trans. Neural Networks 14 (2003) 1469-1477
8. Forti, M., Nistri, P.: Global Convergence of Neural Networks with Discontinuous Neuron Activations. IEEE Trans. Circuits and Systems-I 50 (2003) 1421-1435
9. Lu, W., Chen, T.: Dynamical Behaviors of Cohen-Grossberg Neural Networks with Discontinuous Activation Functions. Neural Networks 18 (2005) 231-242
10. Forti, M., Nistri, P., Quincampoix, M.: Generalized Neural Network for Nonsmooth Nonlinear Programming Problems. IEEE Trans. Circuits and Systems-I 51 (2004) 1741-1754
11. Aubin, J.P., Cellina, A.: Differential Inclusions. Springer-Verlag, Berlin (1984)
12. Bertsekas, D.P.: Nonlinear Programming, Second Edition. Athena Scientific, Belmont (1999)
13. Filippov, A.F.: Differential Equations with Discontinuous Righthand Sides. Kluwer Academic, Dordrecht (1988)
14. Li, G., Song, S., Wu, C.: Subgradient-Based Feedback Neural Networks for Nondifferentiable Convex Optimization Problems. Science in China, Series F 49 (2006) 91-106
A Recurrent Neural Network for Linear Fractional Programming with Bound Constraints

Fuye Feng, Yong Xia, and Quanju Zhang

Dongguan University of Technology, Dongguan, Guangdong, China
{fengye408, xiay}@dgut.edu.cn,
[email protected]
Abstract. This paper presents a novel recurrent continuous-time neural network model that performs linear fractional optimization subject to bound constraints on each of the optimization variables. The network is proved to be complete in the sense that the set of optima of the objective function to be minimized under bound constraints coincides with the set of equilibria of the neural network. It is also shown that the network is primal and globally convergent in the sense that its trajectory cannot escape from the feasible region and converges to an exact optimal solution from any initial point chosen in the feasible bound region. Simulation results are given to further demonstrate the global convergence and good performance of the proposed neural network for linear fractional programming problems with bound constraints.
1
Introduction
Although linear programming, which arises in various branches of human activity, especially in economics, has become well known, fractional programming has gained increasing exposure recently, and its importance in solving concrete problems is steadily growing. Economic problems described by fractional programming models can be found in Charnes, Cooper and Rhodes [1], Patkar [2], and Mjelde [3]. Besides economic applications, fractional programming problems also appear in other domains, such as physics, information theory, and game theory. Among all kinds of fractional programming problems, linear fractional programming is the most important one because of its wide applications; see Stancu-Minasian [4] for details. Unlike most conventional algorithms, which are time-consuming when solving optimization problems with large-scale variables, the neural network approach can handle the optimization process online in real time, as described in Hopfield's seminal work [5]-[6], and hence is a top choice. As is known, recurrent neural networks (RNNs) governed by a system of differential equations can be implemented physically by designated hardware with integrated circuits, and an optimization process with different specific purposes (see Cichocki and Unbehauen [7]) can be conducted in a truly parallel way. More neural network models were investigated intensively by Wang and other researchers, see [8]-[14], including models for convex programming [8] and linear programming [9]-[10], which were proved to
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 359–368, 2006.
© Springer-Verlag Berlin Heidelberg 2006
be globally convergent to the problems' exact solutions. Kennedy and Chua [11] developed a neural network model for solving nonlinear programming problems in which a penalty parameter needed to be tuned during the optimization process, and hence only approximate solutions were generated. Xia and Wang [12] gave a general neural network design methodology that brings many gradient-based network models under a common framework. Neural networks for quadratic optimization and nonlinear optimization with bound constraints were developed by Bouzerdoum and Pattison [13] and Liang and Wang [14], respectively. All these neural networks can be classified into the following three types: 1) the gradient-based models [8]-[10] and their extension [12]; 2) the penalty-function-based model [11]; 3) the projection-based models [13]-[14]. Among them, the first was proved to have global convergence [12] and the third quasi-convergence [8]-[10], only when the optimization problems are convex programming problems. The second [11] could only be shown to have local convergence; it may fail to find exact solutions and hence has limited applications. As is known, linear fractional programming is not a convex optimization problem [4] and hence requires further investigation. Motivated by this, we propose a promising recurrent continuous-time neural network model with the following features: 1) the set of optima of the linear fractional programming problem coincides with the set of equilibria of the proposed model; 2) the model is invariant with respect to the problem's feasible set, and any trajectory starting in the feasible set converges to an exact solution of the problem. The remainder of the paper is organized as follows. Section 2 formulates the optimization problem and describes the construction of the proposed RNN model. Completeness and global convergence properties are proven in Sections 3 and 4.
Illustrative simulation results on a numerical example are reported in Section 5 to test the performance of the proposed model. Finally, Section 6 concludes and summarizes the whole paper.
2
Problem Formulation and the Neural Network Model
The linear fractional programming problem with bound constraints can be formulated as

min{F(x) : a ≤ x ≤ b},  (1)
where (i) F(x) = (cT x + c0)/(dT x + d0), (ii) c, d are n-dimensional column vectors, (iii) c0, d0 are scalars, (iv) the superscript T denotes the transpose operator, (v) x = (x1, x2, · · · , xn)T ∈ Rn is the decision vector, (vi) a = (a1, a2, · · · , an)T ∈ Rn and b = (b1, b2, · · · , bn)T ∈ Rn are constant vectors with ai ≤ bi (i = 1, 2, · · · , n). It is assumed that the denominator of the objective function F(x) maintains a constant sign on an open set O which contains the bound-constraint set W = {x :
a ≤ x ≤ b}, say positive, i.e., dT x + d0 > 0 for all x ∈ W, and that the function F(x) does not reduce to a linear function, i.e., dT x + d0 is not constant on W and c, d are linearly independent. If x∗ ∈ W and F(x) ≥ F(x∗) for any x ∈ W, then x∗ is called an optimal solution to problem (1). The set of all solutions to problem (1) is denoted by Ω∗, i.e., Ω∗ = {x∗ ∈ W | F(x) ≥ F(x∗), ∀x ∈ W}. Consider the following single-layered recurrent neural network whose state variable x is described by the differential equation

dx/dt = −x + fW(x − ∇F(x)),  (2)
where ∇ is the gradient operator and fW : Rn → W is the projection operator defined by

fW(x) = arg min_{w∈W} ||x − w||.  (3)

For the bound-constrained feasible set W, the operator fW can be expressed explicitly as fW(x) = (fW1(x), · · · , fWn(x)), whose ith component is

fWi(x) ≡ fWi(xi) = ai if xi < ai; xi if ai ≤ xi ≤ bi; bi if xi > bi.

We can also reformulate (2) in the following component form:

dxi/dt = −xi + fWi(xi − ∂F(x)/∂xi), i = 1, 2, · · · , n.  (4)
The block functional diagram of the RNN model (4) is depicted in Fig. 1. Accordingly, the architecture of the proposed model (4) is composed of n integrators, n processors for F(x), 2n piecewise-linear activation functions, and 2n summers. Let the equilibrium set of the RNN model (2) be Ωe, defined by Ωe = {xe ∈ Rn | xe = fW(xe − ∇F(xe))} ⊆ W. Its relationship with problem (1)'s minimizer set Ω∗ is explored in the coming section.
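For box constraints, the projection (3) reduces to the componentwise clamp above, so the right-hand side of model (2)/(4) is cheap to evaluate. A minimal plain-Python sketch (the function names are our own):

```python
def project_box(x, a, b):
    """Componentwise projection f_W onto W = {x : a <= x <= b}."""
    return [min(max(xi, ai), bi) for xi, ai, bi in zip(x, a, b)]

def rnn_rhs(x, grad_F, a, b):
    """Right-hand side of model (2)/(4): dx/dt = -x + f_W(x - grad F(x))."""
    g = grad_F(x)
    proj = project_box([xi - gi for xi, gi in zip(x, g)], a, b)
    return [pi - xi for pi, xi in zip(proj, x)]
```

An equilibrium of the model is exactly a point where `rnn_rhs` vanishes, i.e. a fixed point of the projection map, matching the definition of Ωe.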
3
The Complete Property of the RNN Model (2)
As proposed in [15], a neural network is said to be regular or normal if the set of minimizers of an energy function is a subset or superset, respectively, of the set of the stable states of the neural network. If the two sets are exactly the same, we say the network is complete. Regularity implies the neural network's reliability, normality its effectiveness, and completeness both. Here, model (2) is said to be regular, normal, or complete if Ω∗ ⊆ Ωe, Ωe ⊆ Ω∗, or Ω∗ = Ωe holds, respectively. Theorem 1. The RNN model (2) is complete, i.e., Ω∗ = Ωe. In order to prove Theorem 1, we need the following three lemmas.
Fig. 1. Functional block diagram of the neural network model (2)
Lemma 1. Suppose that x∗ is a solution to problem (1), that is,

F(x∗) = min_{y∈W} F(y),  (5)

then x∗ is a solution to the variational inequality

x ∈ W : (∇F(x), y − x) ≥ 0, ∀y ∈ W.  (6)
Proof: See [16], Proposition 5.1. Lemma 2. The function F(x) = (cT x + c0)/(dT x + d0) defined in (1) is both pseudoconvex and pseudoconcave over W. Proof: See [17], Lemma 11.4.1. Lemma 3. Let F(x) : Rn → R be a differentiable pseudoconvex function on an open set Y ⊆ Rn, and let W ⊆ Y be any given nonempty convex set. Then x∗ is an optimal solution to the problem of minimizing F(x) subject to x ∈ W if and only if (x − x∗)T ∇F(x∗) ≥ 0 for all x ∈ W. Proof: See [4], Theorem 2.3.1 (b). Now we turn to the proof of Theorem 1: Let x∗ = (x∗1, x∗2, · · · , x∗n)T ∈ Ω∗; then F(x) ≥ F(x∗) for any x ∈ W, and hence Lemma 1 means that x∗ solves (6), that is,
x∗ ∈ W : (y − x∗)T ∇F(x∗) ≥ 0, ∀y ∈ W,  (7)
which is equivalent (see [18]) to x∗ = fW(x∗ − ∇F(x∗)), so x∗ ∈ Ωe. Thus, Ω∗ ⊆ Ωe. Conversely, let xe = (xe1, xe2, · · · , xen)T ∈ Ωe, that is, xe = fW(xe − ∇F(xe)), which (also see [18]) means

xe ∈ W : (y − xe)T ∇F(xe) ≥ 0, ∀y ∈ W.  (8)
Since the function F(x) is pseudoconvex over W (see Lemma 2), it follows from Lemma 3 that F(x) ≥ F(xe), ∀x ∈ W, so xe ∈ Ω∗. Thus, Ωe ⊆ Ω∗. Therefore, the two inclusions Ωe ⊆ Ω∗ and Ω∗ ⊆ Ωe together give the result of Theorem 1, Ω∗ = Ωe.
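The completeness result can be sanity-checked numerically on a concrete instance of (1): at an optimum x∗ the fixed-point relation x∗ = fW(x∗ − ∇F(x∗)) must hold. A small sketch using the linear fractional objective of the example in Section 5 (helper names are our own):

```python
def grad_F(x):
    # gradient of F(x) = (x1 + x2 + 1) / (2*x1 - x2 + 3) on W = [0,2]^2
    d = 2.0 * x[0] - x[1] + 3.0
    return [(-3.0 * x[1] + 1.0) / d ** 2, (3.0 * x[0] + 4.0) / d ** 2]

def f_W(x):
    # projection onto the box 0 <= x_i <= 2
    return [min(max(xi, 0.0), 2.0) for xi in x]

x_star = [0.0, 0.0]                         # known optimum of the example
step = [xi - gi for xi, gi in zip(x_star, grad_F(x_star))]
fixed_point = f_W(step)                     # should reproduce x_star
```

Here ∇F(x∗) = (1/9, 4/9), so x∗ − ∇F(x∗) has negative components and projects back onto x∗, confirming x∗ ∈ Ωe.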
4
The Global Convergence Problem
First, it can be shown that the RNN model (2) has a solution trajectory which is global in the sense that the existence interval can be extended to ∞ on the right for any initial point in W. The continuity of the right-hand side of (2) means, by Peano's local existence theorem (see [19]), that there exists a solution x(t; x0) for t ∈ [0, tmax) with any initial point x0 ∈ W, where tmax is the maximal right endpoint of the existence interval. The following lemma states that tmax = ∞. Lemma 4. The solution x(t; x0) of RNN model (2) with any initial point x(0; x0) = x0 ∈ W is bounded, and so it can be extended to ∞. Proof: It is easy to check that the solution x(t) = x(t; x0) for t ∈ [0, tmax) with initial condition x(0; x0) = x0 is given by

x(t) = e^{−t} x0 + e^{−t} ∫_0^t e^s fW[x(s) − ∇F(x(s))] ds.  (9)
Obviously, the mapping fW is bounded, that is, ||fW|| ≤ K for some positive number K > 0, where || · || is the Euclidean 2-norm. It follows from (9) that

||x(t)|| ≤ e^{−t} ||x0|| + e^{−t} K ∫_0^t e^s ds  (10)
≤ e^{−t} ||x0|| + K(1 − e^{−t})  (11)
≤ max{||x0||, K}.  (12)
Thus, the solution x(t) is bounded. By the extension theorem for ODEs (see [19]), it can be concluded that tmax = ∞. Another vital dynamical property of the RNN model (2) is that the set W is positively invariant. That is, any solution x(t) starting from a point in W, e.g. x0 ∈ W, stays in W for all time t. Theorem 2. W is a positive invariant set of the RNN model (2). Proof: Let Wi = {xi ∈ R | ai ≤ xi ≤ bi} and x0i = xi(0; x0) ∈ Wi. We first prove that, for all i = 1, 2, · · · , n, if x0i = xi(0; x0) ∈ Wi, then the ith component xi(t) = xi(t; x0) belongs to Wi, that is, xi(t) ∈ Wi for all t ≥ 0. Let t∗i = sup{t̃ | xi(t) ∈ Wi, ∀t ∈ [0, t̃]} ≥ 0. It can be shown by contradiction that t∗i = +∞. Suppose t∗i < ∞; then xi(t) ∈ Wi for t ∈ [0, t∗i] and xi(t) ∉ Wi for t ∈ (t∗i, t∗i + δ), where δ is a positive number. With no loss of generality, we assume that

xi(t) < ai, ∀t ∈ (t∗i, t∗i + δ).  (13)
The proof for the case xi(t) > bi, ∀t ∈ (t∗i, t∗i + δ) is similar. By the definition of fWi, the RNN model (4), and the assumption (13), it follows that dxi(t)/dt ≥ −xi(t) + ai > 0, ∀t ∈ (t∗i, t∗i + δ). So, xi(t) is strictly increasing on (t∗i, t∗i + δ) and hence

xi(t) > xi(t∗i), ∀t ∈ (t∗i, t∗i + δ).  (14)
Noting that xi(t) ∈ Wi for t ∈ [0, t∗i], assumption (13) means xi(t∗i) = ai. So, by (14), we get xi(t) > ai, ∀t ∈ (t∗i, t∗i + δ). This contradicts assumption (13). Therefore, t∗i = +∞, and so Wi is positively invariant. We can now explore the global convergence of the neural network model (2). To proceed, we need an inequality result about the projection operator fW and the definition of convergence for a neural network. Definition 1. Let x(t) be a solution of the system ẋ = F(x). The system is said to be globally convergent to a set X with respect to a set W if every solution x(t) starting in W satisfies

ρ(x(t), X) → 0 as t → ∞,  (15)

where ρ(x(t), X) = inf_{y∈X} ||x − y|| and x(0) = x0 ∈ W.
Definition 2. The neural network (2) is said to be globally convergent to a set X with respect to a set W if the corresponding dynamical system is. Lemma 5. For all v ∈ Rn and all u ∈ W,

(v − fW(v))T (fW(v) − u) ≥ 0.

Proof: See [16], pp. 9-10. Theorem 3. The neural network (2) is globally convergent to the solution set Ω∗ with respect to the set W. Proof: From Lemma 5, we know that (v − fW(v))T (fW(v) − u) ≥ 0 for v ∈ Rn, u ∈ W. Let v = x − ∇F(x) and u = x; then (x − ∇F(x) − fW(x − ∇F(x)))T (fW(x − ∇F(x)) − x) ≥ 0, i.e.,

(∇F(x))T {fW(x − ∇F(x)) − x} ≤ −||fW(x − ∇F(x)) − x||².  (16)
Take F(x) as an energy function; then differentiating it along the solution x(t) of (2) gives

dF(x(t))/dt = (∇F(x))T dx/dt = (∇F(x))T {fW(x − ∇F(x)) − x}.  (17)

According to (16), it follows that

dF(x(t))/dt ≤ −||fW(x − ∇F(x)) − x||² ≤ 0.  (18)
This means the energy F(x) is decreasing along any trajectory of (2). By Lemma 4, we know the solution x(t) is bounded. So, F(x) is a Liapunov function for system (2). Therefore, by LaSalle's invariance principle [20], all trajectories of (2) starting in W converge to the largest invariant subset Σ of the set

E = {x | dF/dt = 0}.  (19)
However, (18) guarantees that dF/dt = 0 only if fW(x − ∇F(x)) − x = 0, which means that x must be an equilibrium of (2), i.e., x ∈ Ωe. Thus, Ωe is the convergent set for all trajectories of neural network (2) starting in W. Since Theorem 1 tells us that Ω∗ = Ωe, Theorem 3 is proved.
Up to now, we have demonstrated that the proposed neural network (2) is a promising model, both in the sense of implementable construction and in the sense of theoretical convergence, for solving linear fractional programming problems. Certainly, it is also important to test a network's effectiveness in practice by numerical experiment. In the next section, we focus on an illustrative example to reach this goal.
5
Illustrative Example
We give a computational example as a simulation experiment to show the proposed network's good performance. Example. Consider the following linear fractional programming problem:

min F(x) = (x1 + x2 + 1)/(2x1 − x2 + 3),  (20)
s.t. 0 ≤ x1 ≤ 2,  (21)
0 ≤ x2 ≤ 2.  (22)

This problem has an exact solution x∗ = [0, 0]T with the optimal value F(x∗) = 1/3, and the gradient of F(x) can be expressed as

∇F(x) = (1/(2x1 − x2 + 3)²) (−3x2 + 1, 3x1 + 4)T.  (23)
Fig. 2. Transient behaviors of neural trajectories x1, x2
Fig. 3. Neural trajectories x1, x2 with starting point (0.5, 0.5)
We use neural network (2) to solve this problem. The simulation is carried out with the ode23 solver in MATLAB 7.0, and the transient behaviors of the neural trajectories x1, x2 starting at x0 = [0.4, 1]T are shown in Fig. 2. It can be seen from the figure that the proposed neural network converges to the exact solution quickly. Fig. 3 shows, in a more clearly visible way, how the solution is located by the neural trajectories from a different initial point, x0 = [0.5, 0.5]T.
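The reported behavior can be reproduced approximately with a simple forward-Euler discretization of model (2) in plain Python (a sketch, not the authors' MATLAB ode23 setup; the step size and horizon are our own choices):

```python
def grad_F(x):
    # gradient (23) of F(x) = (x1 + x2 + 1) / (2*x1 - x2 + 3)
    d = 2.0 * x[0] - x[1] + 3.0
    return [(-3.0 * x[1] + 1.0) / d ** 2, (3.0 * x[0] + 4.0) / d ** 2]

def f_W(x):
    # projection onto the feasible box 0 <= x1, x2 <= 2
    return [min(max(xi, 0.0), 2.0) for xi in x]

def simulate(x0, h=0.01, steps=3000):
    # forward Euler for dx/dt = -x + f_W(x - grad F(x)); since each step is a
    # convex combination of x and a point of W, the trajectory stays in W
    x = list(x0)
    for _ in range(steps):
        p = f_W([xi - gi for xi, gi in zip(x, grad_F(x))])
        x = [xi + h * (pi - xi) for xi, pi in zip(x, p)]
    return x

x_final = simulate([0.4, 1.0])   # approaches the optimum x* = (0, 0)
```

Consistent with Theorems 2 and 3, the discrete trajectory remains in W and the objective value decreases toward F(x∗) = 1/3.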
6
Conclusion
In this paper, we have proposed a neural network model for solving linear fractional programming problems with bound constraints. The network is governed by a system of differential equations built on a projection method. The proposed neural network has been demonstrated to be globally convergent with respect to the problem's feasible set. As is known, existing neural network models based on penalty functions for solving nonlinear programming problems may fail to find the exact solution. The new model overcomes this stability defect, which appears in all penalty-function-based models. The network presented here also performs well in the sense of real-time computation, in which respect it is superior to classical algorithms. Finally, numerical simulation results further demonstrate that the new model acts both effectively and reliably in locating the involved problem's solutions.
References

1. Charnes, A., Cooper, W.W., Rhodes, E.: Measuring the Efficiency of Decision Making Units. European J. Oper. Res. 2 (2) (1978) 429-444
2. Patkar, V.N.: Fractional Programming Models for Sharing of Urban Development Responsibilities. Nagarlok. 22 (1) (1990) 88-94
3. Mjelde, K.M.: Fractional Resource Allocation with S-shaped Return Functions. J. Oper. Res. Soc. 34 (2) (1983) 627-632
4. Stancu-Minasian, I.M.: Fractional Programming: Theory, Methods and Applications. Kluwer Academic Publishers, Netherlands (1992)
5. Hopfield, J.J.: Neurons with Graded Response Have Collective Computational Properties Like Those of Two-state Neurons. Proc. Natl. Acad. Sci. 81 (10) (1984) 3088-3092
6. Hopfield, J.J., Tank, D.W.: Neural Computation of Decisions in Optimization Problems. Biolog. Cybernetics. 52 (1) (1985) 141-152
7. Cichocki, A., Unbehauen, R.: Neural Networks for Optimization and Signal Processing. John Wiley & Sons, New York (1993)
8. Wang, J.: A Deterministic Annealing Neural Network for Convex Programming. Neural Networks. 7 (2) (1994) 629-641
9. Wang, J., Chankong, V.: Recurrent Neural Networks for Linear Programming: Analysis and Design Principles. Computers and Operations Research. 19 (1) (1992) 297-311
10. Wang, J.: Analysis and Design of a Recurrent Neural Network for Linear Programming. IEEE Transactions on Circuits and Systems. 40 (5) (1993) 613-618
11. Kennedy, M.P., Chua, L.O.: Neural Networks for Nonlinear Programming. IEEE Transactions on Circuits and Systems. 35 (5) (1988) 554-562
12. Xia, Y.S., Wang, J.: A General Methodology for Designing Globally Convergent Optimization Neural Networks. IEEE Transactions on Neural Networks. 9 (12) (1998) 1311-1343
13. Bouzerdoum, A., Pattison, T.R.: Neural Network for Quadratic Optimization with Bound Constraints. IEEE Transactions on Neural Networks. 4 (2) (1993) 293-304
14. Liang, X.B., Wang, J.: A Recurrent Neural Network for Nonlinear Optimization with a Continuously Differentiable Objective Function and Bound Constraints. IEEE Transactions on Neural Networks. 11 (11) (2000) 1251-1262
15. Xu, Z.B., Hu, G.Q., Kwong, C.P.: Asymmetric-Hopfield-Type Networks: Theory and Applications. Neural Networks. 9 (2) (2000) 483-501
16. Kinderlehrer, D., Stampacchia, G.: An Introduction to Variational Inequalities and Their Applications. Academic Press, New York (1980)
17. Bazaraa, M.S., Shetty, C.M.: Nonlinear Programming: Theory and Algorithms. John Wiley and Sons, New York (1979)
18. Eaves, B.C.: On the Basic Theorem of Complementarity. Mathematical Programming. 1 (1) (1970) 68-75
19. Hale, J.K.: Ordinary Differential Equations. Wiley, New York (1993)
20. LaSalle, J.: The Stability Theory for Ordinary Differential Equations. J. Differential Equations 4 (1) (1983) 57-65
A Delayed Lagrangian Network for Solving Quadratic Programming Problems with Equality Constraints

Qingshan Liu1, Jun Wang1, and Jinde Cao2

1 Department of Automation and Computer-Aided Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
{qsliu, jwang}@acae.cuhk.edu.hk
2 Department of Mathematics, Southeast University, Nanjing 210096, China
[email protected]
Abstract. In this paper, a delayed Lagrangian network is presented for solving quadratic programming problems. Based on some known results, the delay interval is determined to guarantee the asymptotic stability of the delayed neural network at the optimal solution. One simulation example is provided to show the effectiveness of the approach.
1
Introduction
Neural networks based on circuit implementation can provide real-time solutions and can be applied to many engineering problems, such as optimal control, structural design, and signal and image processing. In recent years, many recurrent neural networks have been proposed for solving linear and nonlinear optimization problems [1]-[12]. Tank and Hopfield [1] first proposed a neural network for solving linear programming problems in 1986. Based on penalty function and gradient methods, Kennedy and Chua [2] extended and improved the Tank-Hopfield network for solving nonlinear programming problems. Bouzerdoum and Pattison [3] presented a neural network for solving quadratic convex optimization problems with bound constraints. Zhang and Constantinides [4] proposed the Lagrangian network, and Wang et al. [7] and Xia [8] studied the global convergence of the network. Using primal-dual and projection approaches, the primal-dual neural network and the projection neural network were presented for solving linear and nonlinear convex programming problems [9]-[11]. However, these neural networks are based on the assumption that neurons communicate and respond instantaneously without any time delay. In fact, switching delays exist in some hardware implementations, and the time delay needed to guarantee the stability of a neural network is hard to predict theoretically. This is our first motivation for introducing the delayed neural network. Chen
This work was supported by the Hong Kong Research Grants Council under Grant CUHK4165/03E, and the National Natural Science Foundation of China under Grant 60574043.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 369–378, 2006.
© Springer-Verlag Berlin Heidelberg 2006
and Fang [12] proposed a delayed neural network for solving convex quadratic programming problems using the penalty function approach. However, that network cannot converge to an exact optimal solution and presents an implementation problem when the penalty parameter is very large. To avoid using finite penalty parameters, it is necessary to construct a delayed neural network by other approaches. This is our second motivation for studying the delayed neural network. For some quadratic programming problems, the convergence of some neural networks without time delays cannot be guaranteed due to the asymmetry of the Jacobian matrix [8]. However, if we add some delays to the networks without changing the equilibrium points, the topological structure of the solutions of the networks is altered, and the convergence of the network can be obtained by selecting proper delays. This is our third motivation for introducing the delayed neural network and studying its convergence.
2
Problem Formulation and Delayed Neural Network
In this paper, we consider the following quadratic convex programming problem:

(QP) minimize (1/2) xT Qx + cT x, subject to Ax = b,  (1)
where Q ∈ Rn×n is symmetric and positive semidefinite, c ∈ Rn, A ∈ Rm×n, and b ∈ Rm. Throughout this paper, we always assume that the feasible domain F = {x ∈ Rn | Ax − b = 0} is not empty. In order to adopt the neural network method for solving problem (1), it is necessary to convert this problem into equalities. It is well known that the Lagrangian function of (1) is defined as

L(x, y) = (1/2) xT Qx + cT x − yT (Ax − b),
where y ∈ Rm is the Lagrange multiplier. By the Karush-Kuhn-Tucker conditions [15], x∗ is a solution to (1) if and only if there exists y∗ ∈ Rm such that (x∗, y∗) satisfies the following Lagrange condition:

∇Lx(x, y) = Qx + c − AT y = 0, ∇Ly(x, y) = Ax − b = 0,  (2)

where ∇L is the gradient of L. Letting

W = [Q, −AT; A, 0], J = [c; −b], u = [x; y],

then (2) can be rewritten as W u + J = 0. In [4], the following Lagrangian network was proposed to solve problem (1):

du/dt = −(W u + J).  (3)
In [7] and [8], the global convergence of the Lagrangian network (3) has been investigated in the case of the matrix Q being positive definite. However, if Q is positive semidefinite but not positive definite, the convergence of system (3) cannot be guaranteed for some quadratic programming problems, as can be seen from the following example. Example. Consider the following convex quadratic programming problem:

minimize f(x) = (1/2) xT [0.1, 0.1; 0.1, 0.1] x + (−1, 1) x, subject to (0.5, −0.5) x = −0.5,
where x = (x1, x2)T. This problem has a unique solution x∗ = (−0.5, 0.5)T. Then

W = [0.1, 0.1, −0.5; 0.1, 0.1, 0.5; 0.5, −0.5, 0], J = (−1, 1, 0.5)T, u = (x1, x2, y)T.

In the Lagrangian network (3), since the coefficient matrix −W has the three eigenvalues λ1 = −0.2 and λ2,3 = ±0.7071i, this system has periodic solutions. The equilibrium point of this system is not asymptotically stable. From the above example, we know that the Lagrangian network (3) cannot converge to its equilibrium point for some quadratic programming problems. In this case, our method is to change the topological structure of the solutions by adding some delayed state terms. In this paper, we propose the following delayed Lagrangian network for solving problem (1):

du/dt = −(D + W)u(t) + Du(t − τ) − J,  (4)
where τ ≥ 0 denotes the transmission delay, D ∈ R(n+m)×(n+m), and u(t) has the initial value function u(s) = φ(s), s ∈ [−τ, 0]. It is easy to see that the Lagrangian network (3) is a special case of the delayed Lagrangian network (4) when τ = 0. Obviously, the Lagrangian network (3) and the delayed Lagrangian network (4) have the same equilibrium point, and we can get the exact solution of problem (1) by using network (4). From the results in [7], the following lemma holds. Lemma 1. x∗ is an optimal solution to problem (1) if and only if there exists y∗ ∈ Rm such that u∗ = (x∗, y∗) is an equilibrium point of the Lagrangian network (3) or the delayed Lagrangian network (4). Assume that u∗ is an equilibrium point of (4); then −(D + W)u∗ + Du∗ − J = 0. By means of the coordinate translation z = u − u∗, we have

dz/dt = −(D + W)z(t) + Dz(t − τ).  (5)
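The eigenvalues quoted in the example above can be verified without a linear-algebra library: for the 3×3 matrix A = −W the characteristic polynomial is λ³ − tr(A)λ² + m2·λ − det(A), where m2 is the sum of the principal 2×2 minors; here it works out to λ³ + 0.2λ² + 0.5λ + 0.1 = (λ + 0.2)(λ² + 0.5), giving λ1 = −0.2 and λ2,3 = ±i√0.5 ≈ ±0.7071i. A plain-Python check:

```python
W = [[0.1,  0.1, -0.5],
     [0.1,  0.1,  0.5],
     [0.5, -0.5,  0.0]]
A = [[-w for w in row] for row in W]   # A = -W

def det3(m):
    # determinant of a 3x3 matrix by cofactor expansion along the first row
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

tr = A[0][0] + A[1][1] + A[2][2]
m2 = (A[1][1] * A[2][2] - A[1][2] * A[2][1]
    + A[0][0] * A[2][2] - A[0][2] * A[2][0]
    + A[0][0] * A[1][1] - A[0][1] * A[1][0])

def char_poly(lam):
    # p(lam) = lam^3 - tr(A) lam^2 + m2 lam - det(A); zero at each eigenvalue
    return lam ** 3 - tr * lam ** 2 + m2 * lam - det3(A)

eigs = [-0.2, 1j * 0.5 ** 0.5, -1j * 0.5 ** 0.5]   # claimed eigenvalues of -W
```

The pure-imaginary pair ±i√0.5 is what produces the periodic solutions of (3) and motivates adding the delayed term in (4).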
3
Time-Delay Analysis
In this section, we give the stability of an equilibrium of the following system, based on the known results in [14]:

dz/dt = M z(t) + Mτ z(t − τ),  (6)

where z(t) ∈ Rn, M, Mτ ∈ Rn×n, τ ≥ 0. For x ∈ Cn, the vector norms || · ||p are defined as

||x||1 = Σ_{i=1}^n |xi|, ||x||2 = (Σ_{i=1}^n xi xi∗)^{1/2}, ||x||∞ = max_{1≤i≤n} |xi|,
where Cn denotes the n-dimensional complex vector space and xi∗ is the conjugate of xi. The induced matrix norms || · ||p are as follows:

||A||1 = max_j Σ_{i=1}^n |aij|, ||A||2 = (λmax(A∗ A))^{1/2}, ||A||∞ = max_i Σ_{j=1}^n |aij|,
where A = {aij}n×n, A∗ denotes the conjugate transpose of A, and λmax(·) is the maximum eigenvalue of the corresponding matrix. Definition 1. [16] Let A = {aij}n×n be a matrix on Cn×n; then the corresponding matrix measure is the function μp : Cn×n → R defined by

μp(A) = lim_{ε→0+} (||I + εA||p − 1)/ε,

where || · ||p is an induced matrix norm on Cn×n, I is the identity matrix, and p = 1, 2, ∞. The induced matrix measure μp(·) has the following forms:

μ1(A) = max_j {Re(ajj) + Σ_{i≠j} |aij|}, μ2(A) = (1/2) λmax(A∗ + A), μ∞(A) = max_i {Re(aii) + Σ_{j≠i} |aij|},
where Re(·) represents the real part of the complex number. Lemma 2. [16] The matrix measure μp (·) defined in Definition 1 has the following properties: (i) Reλi (A) ≤ μp (A); (ii) μp (A + B) ≤ μp (A) + μp (B), ∀A, B ∈ Cn×n , where λi (·) represents an eigenvalue (i = 1, 2, . . .).
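The measures μ1 and μ∞ follow directly from these formulas (μ2 would additionally need an eigenvalue routine). A plain-Python sketch for real matrices (the function names are our own):

```python
def mu_1(A):
    # mu_1(A) = max_j ( Re(a_jj) + sum_{i != j} |a_ij| ), column-based
    n = len(A)
    return max(A[j][j] + sum(abs(A[i][j]) for i in range(n) if i != j)
               for j in range(n))

def mu_inf(A):
    # mu_inf(A) = max_i ( Re(a_ii) + sum_{j != i} |a_ij| ), row-based
    n = len(A)
    return max(A[i][i] + sum(abs(A[i][j]) for j in range(n) if j != i)
               for i in range(n))
```

For example, for A = [[−2, 1], [0, −3]] (eigenvalues −2 and −3), μ1(A) = −2 and μ∞(A) = −1, consistent with property (i) of Lemma 2: max Reλi(A) = −2 ≤ μp(A).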
Define the characteristic equation of system (6) as

det(λI − M − Mτ exp(−τλ)) = 0.  (7)

Equivalently, the characteristic function can be written as

λ = λi(M + Mτ exp(−τλ)).  (8)

In order to study the stability condition of system (6), let us define

l1 := μp(M) + ||Mτ||p,  (9)

and

l2 := μp(−ιM) + ||Mτ||p,  (10)
and where ι2 = −1, p = 1, 2 and ∞. The first stability condition is given by Mori et al. [13] as the following theorem. Theorem 1. If l1 < 0, then every equilibrium point of the dynamic system defined by (6) is asymptotically stable. This is a delay-independent stability condition, and the other delay-depend stability condition is given by Mori and Kokame [14] in the following theorem. Theorem 2. When l1 ≥ 0, if Reλi (M + Mτ exp(−τ λ)) < 0,
(i = 1, 2, . . .),
(11)
then every equilibrium point of the dynamic system defined by (6) is asymptotically stable, where λ takes the values of ιω, l1 + ιω and r + ιl2 , with 0 ≤ ω ≤ l2 and 0 ≤ r ≤ l1 . In Theorem 2, it is very hard to check the stability conditions, since the eigenvalues in (7) are infinite. However, from Lemma 2 and Theorem 2, we can get several convenient way to guarantee to stability of system (6). We state them as the following corollaries. Corollary 1. When l1 ≥ 0, if μp (M + Mτ exp(−τ λ)) < 0,
(12)
μp (Mτ exp(−τ λ)) < −μp (M ),
(13)
or then every equilibrium point of the dynamic system defined by (6) is asymptotically stable, where λ takes the values of ιω, l1 + ιω and r + ιl2 , with 0 ≤ ω ≤ l2 and 0 ≤ r ≤ l1 . If l1 = 0, the stability conditions are more convenient to be checked.
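Theorem 1's delay-independent condition only requires evaluating $l_1$. A brief numerical sketch with $p = 2$ (the matrices below are our own example, not from the paper):

```python
import numpy as np

def mu2(A):
    """mu_2 from Definition 1: half the largest eigenvalue of A* + A."""
    A = np.asarray(A, dtype=complex)
    return 0.5 * np.linalg.eigvalsh(A.conj().T + A).max().real

def l1_margin(M, Mtau):
    """l_1 of Eq. (9) with p = 2."""
    return mu2(M) + np.linalg.norm(Mtau, 2)

# Theorem 1: l_1 < 0 guarantees asymptotic stability for any delay tau >= 0
M = np.array([[-3.0, 0.5], [0.5, -3.0]])
Mtau = 0.5 * np.eye(2)
assert l1_margin(M, Mtau) < 0
```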
374
Q. Liu, J. Wang, and J. Cao
Corollary 2. When $l_1 = 0$, if
$$\mu_p\big(M + M_\tau \exp(-\tau\lambda)\big) < 0, \qquad (14)$$
or
$$\mu_p\big(M_\tau \exp(-\tau\lambda)\big) < -\mu_p(M), \qquad (15)$$
then every equilibrium point of the dynamic system defined by (6) is asymptotically stable, where $\lambda$ takes the values $\iota\omega$, with $0 \le \omega \le l_2$.

If $M_\tau = kI$, we have $\lambda_i(M + M_\tau\exp(-\tau\lambda)) = k\exp(-\tau\lambda) + \lambda_j(M)$, and then
$$\mathrm{Re}\,\lambda_i\big(M + M_\tau\exp(-\tau\lambda)\big) = k\exp(-\tau\,\mathrm{Re}\,\lambda)\cos(\tau\,\mathrm{Im}\,\lambda) + \mathrm{Re}\,\lambda_j(M),$$
where $I$ is the identity matrix, $k$ is a real scalar, $\mathrm{Im}(\cdot)$ represents the imaginary part of a complex number, $i = 1, 2, \ldots$, and $j = 1, 2, \ldots, n$. From Theorem 2, we can obtain the following useful corollaries.

Corollary 3. When $l_1 \ge 0$, $M_\tau = kI$ and $k \ge 0$, every equilibrium point of the dynamic system defined by (6) is asymptotically stable if the following condition holds:
$$k\exp(-\tau r)\cos(\tau\omega) + \mathrm{Re}\,\lambda_j(M) < 0, \qquad (16)$$
where $r, \omega$ take the values (a) $r = 0, l_1$, $0 \le \omega \le l_2$; (b) $0 \le r \le l_1$, $\omega = l_2$, and $j = 1, 2, \ldots, n$.

By Lemma 2, a more convenient version of Corollary 3 uses
$$k\exp(-\tau r)\cos(\tau\omega) + \mu_p(M) < 0, \qquad (17)$$
where $r, \omega$ take the same values as in Corollary 3.

Corollary 4. When $l_1 \ge 0$, $M_\tau = kI$ and $k < 0$, every equilibrium point of the dynamic system defined by (6) is asymptotically stable if one of the following conditions holds:
$$k\exp(-l_1\tau)\cos(l_2\tau) + \mathrm{Re}\,\lambda_j(M) < 0 \quad \text{for } 0 \le l_2 \le \frac{\pi}{2\tau}, \qquad (18)$$
$$k\cos(l_2\tau) + \mathrm{Re}\,\lambda_j(M) < 0 \quad \text{for } \frac{\pi}{2\tau} < l_2 \le \frac{\pi}{\tau}, \qquad (19)$$
$$-k + \mathrm{Re}\,\lambda_j(M) < 0 \quad \text{for } \frac{\pi}{\tau} < l_2, \qquad (20)$$
for $j = 1, 2, \ldots, n$.
If $l_1 = 0$, $M_\tau = kI$, $k > 0$, and the matrix $M + M_\tau$ is invertible, from (17) we have
$$k\exp(-\tau r)\cos(\tau\omega) + \mu_p(M) = k\cos(\tau\omega) + \mu_p(M) < 0, \qquad (21)$$
where $0 \le \omega \le l_2$. Since $l_1 = 0$ and $M_\tau = kI$, we have
$$\mu_p(M) = -\|M_\tau\|_p = -k. \qquad (22)$$
From (21) and (22), it follows that $\cos(\tau\omega) < 1$. Since $0 \le \omega \le l_2$, we have $0 < \tau\omega < 2\pi$, and hence $0 < \tau < 2\pi/l_2$. Since the matrix $M + M_\tau$ is invertible, equation (8) has no zero solution, so $\lambda$ cannot take the value zero. Therefore, we need only consider $\lambda = \iota\omega$ with $0 < \omega \le l_2$ in Corollary 3. The inverse of the above procedure is then also guaranteed, and we obtain the following result.

Corollary 5. When $l_1 = 0$, $M_\tau = kI$ with $k > 0$, and the matrix $M + M_\tau$ is invertible, every equilibrium point of the dynamic system defined by (6) is asymptotically stable if $0 < \tau < 2\pi/l_2$.

Next, we apply the above results to the quadratic programming problem (1). We state this as the following theorem.

Theorem 3. When the matrix $Q \in R^{n\times n}$ is symmetric and positive semidefinite in problem (1), $D = kI$ with $k > 0$, and $W$ is invertible in system (4), every equilibrium point $u^* = (x^*, y^*)$ of the delayed Lagrangian network (4) is asymptotically stable if $0 < \tau < 2\pi/l_2$, where $x^*$ corresponds to an optimal solution of problem (1).

Proof. From system (4),
$$M = -(D + W) = -\left(kI + \begin{pmatrix} Q & -A^T \\ A & 0 \end{pmatrix}\right),$$
thus
$$M^* + M = -\left(2kI + \begin{pmatrix} 2Q & 0 \\ 0 & 0 \end{pmatrix}\right),$$
then
$$\lambda_i(M^* + M) = -2k + \lambda_i\left(-\begin{pmatrix} 2Q & 0 \\ 0 & 0 \end{pmatrix}\right), \quad (i = 1, 2, \ldots, n+m).$$
Since $Q$ is symmetric and positive semidefinite, we have
$$\max_i \lambda_i\left(-\begin{pmatrix} 2Q & 0 \\ 0 & 0 \end{pmatrix}\right) = 0,$$
then
$$\mu_2(M) = \frac{1}{2}\max_i \lambda_i(M^* + M) = -k.$$
Since $M_\tau = D = kI$, $\|M_\tau\|_2 = k$, it follows that $l_1 = \mu_2(M) + \|M_\tau\|_2 = 0$. Moreover, $M + M_\tau = -W$ is invertible. Therefore, by Corollary 5, every equilibrium point $u^* = (x^*, y^*)$ of the delayed Lagrangian network (4) is asymptotically stable if $0 < \tau < 2\pi/l_2$. From Lemma 1, $x^*$ corresponds to an optimal solution of problem (1).
4 A Simulation Example

In this section, we use the delayed Lagrangian network to solve the problem in the example of Section 2. Let $D = I_{3\times 3}$ (the identity matrix); then
$$M = \begin{pmatrix} -1.1 & -0.1 & 0.5 \\ -0.1 & -1.1 & -0.5 \\ -0.5 & 0.5 & -1 \end{pmatrix}, \quad M_\tau = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad J = \begin{pmatrix} -1 \\ 1 \\ 0.5 \end{pmatrix}, \quad u = \begin{pmatrix} x_1 \\ x_2 \\ y \end{pmatrix}.$$
By simple computation, we have $l_2 = \mu_2(-\iota M) + \|M_\tau\|_2 = 1 + \sqrt{2}/2$. By Theorem 3, the equilibrium point of system (4) is asymptotically stable if
[Fig. 1 here: four panels, (a) τ = 0, (b) τ = 0.5, (c) τ = 2, (d) τ = 3.6, each plotting the states x1, x2 and y against time from 0 to 300.]
Fig. 1. State trajectories of the delayed Lagrangian network for different time delays
0 < τ < 2π/l2 ≈ 3.68. Figure 1 shows the transient behavior for τ = 0, 0.5, 2 and 3.6. We can see that the solutions of system (4) converge to the equilibrium point (−0.5, 0.5, −2) when τ takes the values 0.5, 2 and 3.6, where (x1*, x2*) = (−0.5, 0.5) is the optimal solution of this problem and y* = −2 is the Lagrange multiplier.
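Both the bound $2\pi/l_2 \approx 3.68$ and the convergence are easy to reproduce numerically. The sketch below assumes the network has the linear form $\dot z = M z(t) + M_\tau z(t-\tau) - J$ (a guess consistent with the equilibrium $(-0.5, 0.5, -2)$; the exact form of system (4) is stated earlier in the paper) and integrates it with a forward Euler scheme:

```python
import numpy as np

M = np.array([[-1.1, -0.1,  0.5],
              [-0.1, -1.1, -0.5],
              [-0.5,  0.5, -1.0]])
Mtau = np.eye(3)
J = np.array([-1.0, 1.0, 0.5])

# l2 = mu_2(-iM) + ||Mtau||_2, with mu_2(A) = (1/2) * lambda_max(A* + A)
A = -1j * M
l2 = 0.5 * np.linalg.eigvalsh(A.conj().T + A).max() + np.linalg.norm(Mtau, 2)
# l2 equals 1 + sqrt(2)/2, so the delay bound 2*pi/l2 is about 3.68

# Forward Euler for dz/dt = M z(t) + Mtau z(t - tau) - J, zero initial history;
# the equilibrium solves (M + Mtau) z* = J, i.e. z* = (-0.5, 0.5, -2).
tau, h, T = 0.5, 0.01, 500.0
d, steps = int(round(tau / h)), int(round(T / h))
z = np.zeros((steps + d + 1, 3))
for n in range(d, steps + d):
    z[n + 1] = z[n] + h * (M @ z[n] + Mtau @ z[n - d] - J)
```

With τ = 0.5 (well inside the bound), the trajectory settles at the equilibrium, matching panel (b) of Fig. 1.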
5 Conclusions

In this paper, we have proposed a delayed Lagrangian network for solving convex quadratic programming problems with equality constraints. We give a delay interval over which the delayed Lagrangian network is asymptotically stable at the optimal solution. By choosing proper delays, we can obtain the exact optimal solutions. A simulation example is presented to show the effectiveness of the delayed Lagrangian network for quadratic programming problems.
References

1. Tank, D.W., Hopfield, J.J.: Simple Neural Optimization Networks: An A/D Converter, Signal Decision Circuit, and a Linear Programming Circuit. IEEE Trans. Circuits and Systems 33 (1986) 533-541
2. Kennedy, M.P., Chua, L.O.: Neural Networks for Nonlinear Programming. IEEE Trans. Circuits and Systems 35 (1986) 554-562
3. Bouzerdoum, A., Pattison, T.R.: Neural Network for Quadratic Optimization with Bound Constraints. IEEE Trans. Neural Networks 4 (1993) 293-304
4. Zhang, S., Constantinides, A.G.: Lagrange Programming Neural Networks. IEEE Trans. Circuits and Systems II 39 (1992) 441-452
5. Zhang, S., Zhu, X., Zou, L.H.: Second-order Neural Nets for Constrained Optimization. IEEE Trans. Neural Networks 3 (1992) 1021-1024
6. Wang, J.: A Deterministic Annealing Neural Network for Convex Programming. Neural Networks 7 (1994) 629-641
7. Wang, J., Hu, Q., Jiang, D.: A Lagrangian Network for Kinematic Control of Redundant Robot Manipulators. IEEE Trans. Neural Networks 10 (1999) 1123-1132
8. Xia, Y.: Global Convergence Analysis of Lagrangian Networks. IEEE Trans. Circuits and Systems I 50 (2003) 818-822
9. Wang, J.: Primal and Dual Assignment Networks. IEEE Trans. Neural Networks 8 (1997) 784-790
10. Wang, J., Xia, Y.: Analysis and Design of Primal-dual Assignment Networks. IEEE Trans. Neural Networks 9 (1998) 183-194
11. Xia, Y., Wang, J.: A General Projection Neural Network for Solving Monotone Variational Inequalities and Related Optimization Problems. IEEE Trans. Neural Networks 15 (2004) 318-328
12. Chen, Y.H., Fang, S.C.: Neurocomputing with Time Delay Analysis for Solving Convex Quadratic Programming Problems. IEEE Trans. Neural Networks 11 (2000) 230-240
13. Mori, T., Fukuma, N., Kuwahara, M.: On an Estimate of the Decay Rate for Stable Linear Delay Systems. Int. J. Control 36 (1982) 95-97
14. Mori, T., Kokame, H.: Stability of ẋ(t) = Ax(t) + Bx(t − τ). IEEE Trans. Automat. Control 34 (1989) 460-462
15. Bazaraa, M.S., Sherali, H.D., Shetty, C.M.: Nonlinear Programming: Theory and Algorithms (2nd Ed.). John Wiley, New York (1993)
16. Vidyasagar, M.: Nonlinear System Analysis. Prentice Hall, Englewood Cliffs, NJ (1993)
Wavelet Chaotic Neural Networks and Their Application to Optimization Problems

Yao-qun Xu 1,2, Ming Sun 2, and Guangren Duan 1

1 Center for Control Theory and Guidance Technology, Harbin Institute of Technology, 150001 Harbin, China
[email protected]
2 Institute of Computer and Information Engineering, Harbin Commercial University, 150028 Harbin, China
[email protected]
Abstract. In this paper, we first review Chen's chaotic neural network model and then propose a novel wavelet chaotic neural network. Second, we apply each of them to search for the global minima of a continuous function, and the time evolution figures of the corresponding most positive Lyapunov exponent are given. Third, the 10-city traveling salesman problem (TSP) is used to make a comparison between them. Finally, we conclude that the novel wavelet chaotic neural network is more effective.
1 Introduction

Many combinatorial optimization problems arising from science and technology are often difficult to solve exactly. Hopfield and Tank first applied the continuous-time, continuous-output Hopfield neural network (HNN) to solve the TSP [1], thereby initiating a new approach to optimization problems [2, 3]. However, using the HNN to solve continuous-time nonlinear optimization problems and the TSP suffers from several shortcomings. First, the network is often trapped at a local minimum of the complex energy terrain because of its gradient descent property. Second, the HNN may converge to an infeasible solution. Finally, the HNN sometimes does not converge at all within the prescribed number of iterations. Chaotic neural networks have been proved to be powerful tools for escaping from local minima, and since then there has been considerable research on chaotic neural networks in this field. Chen and Aihara proposed chaotic simulated annealing (CSA) to illustrate the features and effectiveness of a transiently chaotic neural network (TCNN) in solving optimization problems [4], and Wang proposed a stochastic noisy chaotic simulated annealing method (SCSA) [5] by combining stochastic simulated annealing (SSA) and chaotic simulated annealing (CSA). All of the above research is based on simulated annealing methods; in contrast, we focus here on the activation function. In this paper, we first review Chen's chaotic neural network model. Second, we propose a novel chaotic neural network model. Third, we apply both of them to search for the global minima of a continuous nonlinear function, and the time evolution figures of their most positive Lyapunov exponents are given. At last, we apply both of them to the 10-city traveling salesman problem (TSP) in order to make a comparison. Finally, we conclude that the novel chaotic neural network we propose is more effective.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 379 – 384, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Chaotic Neural Network Models

In this section, two chaotic neural network models are given: the first was proposed by Chen and Aihara, and the second is proposed by us.

2.1 Chaotic Simulated Annealing with Decaying Self-coupling

Chen and Aihara's transiently chaotic neural network [4] is described as follows:
$$x_i(t) = f(y_i(t)) = \frac{1}{1 + e^{-y_i(t)/\varepsilon}}, \qquad (1)$$
$$y_i(t+1) = k y_i(t) + \alpha\Big[\sum_j W_{ij} x_j + I_i\Big] - z_i(t)\big(x_i(t) - I_0\big), \qquad (2)$$
$$z_i(t+1) = (1 - \beta) z_i(t), \qquad (3)$$
where $x_i(t)$ is the output of neuron $i$; $y_i(t)$ denotes the internal state of neuron $i$; $W_{ij}$ is the connection weight from neuron $j$ to neuron $i$, with $W_{ij} = W_{ji}$; $I_i$ is the input bias of neuron $i$; $\alpha$ is a positive scaling parameter for the neural inputs; $k$ is the damping factor of the nerve membrane, $0 \le k \le 1$; $z_i(t) \ge 0$ is the self-feedback connection weight (refractory strength); $\beta$ is the damping factor of $z_i(t)$, $0 < \beta < 1$; $I_0$ is a positive parameter; and $\varepsilon > 0$ is the steepness parameter of the output function.
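A minimal sketch of one update of Chen and Aihara's model, Eqs. (1)-(3); the parameter values below are placeholders rather than the settings used later in the paper:

```python
import math

def tcnn_step(y, z, x_all, W, I, i, alpha=0.5, k=1.0, beta=0.004, I0=0.65, eps=0.02):
    """One update of neuron i in the transiently chaotic neural network."""
    x = 1.0 / (1.0 + math.exp(-y / eps))                                # Eq. (1)
    y_next = k * y + alpha * (sum(W[i][j] * x_all[j]
                                  for j in range(len(x_all))) + I[i]) \
             - z * (x - I0)                                             # Eq. (2)
    z_next = (1.0 - beta) * z                                           # Eq. (3)
    return x, y_next, z_next

# the self-feedback weight z decays geometrically, which removes the
# transient chaos over time (the "decaying self-coupling")
z = 0.08
for _ in range(100):
    _, _, z = tcnn_step(0.1, z, [0.5, 0.5], [[0, 1], [1, 0]], [0.2, 0.2], 0)
assert abs(z - 0.08 * (1 - 0.004) ** 100) < 1e-12
```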
2.2 Morlet-Sigmoid Chaotic Neural Network (M-SCNN)

The Morlet-Sigmoid chaotic neural network is a novel model proposed by us, described as follows:
$$y_i(t+1) = k y_i(t) + \alpha\Big[\sum_j W_{ij} x_j + I_i\Big] - z_i(t)\big(x_i(t) - I_0\big), \qquad (4)$$
$$x_i(t) = f(y_i(t)), \qquad (5)$$
$$z_i(t+1) = (1 - \beta) z_i(t), \qquad (6)$$
$$\eta_i(t+1) = \frac{\eta_i(t)}{\ln\big(e + \lambda(1 - \eta_i(t))\big)}, \qquad (7)$$
$$f(y_i(t)) = \gamma e^{-\frac{(u_1 y_i(t)(1+\eta_i(t)))^2}{2}} \cos\big(5 u_1 y_i(t)(1+\eta_i(t))\big) + \frac{1}{1 + e^{-u_0 y_i(t)(1+\eta_i(t))}}, \qquad (8)$$
where $x_i(t)$, $y_i(t)$, $W_{ij}$, $\alpha$, $k$, $I_i$, $\beta$, $z_i(t)$ and $I_0$ are the same as above; $\eta_i(t) > 0$ is the other simulated annealing factor; $\lambda$ is a positive parameter that controls the speed of this annealing process; and $u_0$ and $u_1$ are important parameters of the activation function, which should be varied with the particular optimization problem.
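The activation in Eq. (8) is a Morlet wavelet term plus a sigmoid term, and Eq. (7) slowly shrinks the annealing factor $\eta_i(t)$. A sketch (γ and the other constants below are illustrative values, not the paper's):

```python
import math

def morlet_sigmoid(y, eta, gamma=1.0, u0=0.05, u1=20.0):
    """Eq. (8): Morlet wavelet term plus a sigmoid term."""
    s = u1 * y * (1.0 + eta)
    wavelet = gamma * math.exp(-s * s / 2.0) * math.cos(5.0 * s)
    sigmoid = 1.0 / (1.0 + math.exp(-u0 * y * (1.0 + eta)))
    return wavelet + sigmoid

def anneal(eta, lam=0.002):
    """Eq. (7): the second annealing factor decays slowly toward 0."""
    return eta / math.log(math.e + lam * (1.0 - eta))

eta = 0.8
for _ in range(50):
    new_eta = anneal(eta)
    assert 0.0 < new_eta < eta   # monotone decrease for 0 < eta < 1
    eta = new_eta
```

Since $\ln(e + \lambda(1-\eta)) > 1$ for $\lambda > 0$ and $\eta < 1$, the factor decreases at every step, which is why $\lambda$ controls the annealing speed.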
3 Application to Search Optimization of Continuous Nonlinear Function

In this section, we apply the two chaotic neural networks to search for the minimum points of the well-known Six-Hump Camel-Back function [6], which can be described as follows:
$$f(x_1, x_2) = 4x_1^2 - 2.1x_1^4 + x_1^6/3 + x_1 x_2 - 4x_2^2 + 4x_2^4, \qquad |x_i| \le 5. \qquad (9)$$
Its minimum point is (−0.08983, 0.7126) or (0.08983, −0.7126), and the corresponding minimum value is −1.0316285. Moreover, the time evolution figures of the corresponding most positive Lyapunov exponent are given.

3.1 Chen's Chaotic Neural Network
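The reported minimum is easy to verify directly from Eq. (9):

```python
def camel_back(x1, x2):
    """Six-Hump Camel-Back function of Eq. (9)."""
    return 4 * x1**2 - 2.1 * x1**4 + x1**6 / 3 + x1 * x2 - 4 * x2**2 + 4 * x2**4

# both reported minimizers give the reported minimum value;
# the function is symmetric under (x1, x2) -> (-x1, -x2)
for a, b in [(-0.08983, 0.7126), (0.08983, -0.7126)]:
    assert abs(camel_back(a, b) - (-1.0316285)) < 1e-4
    assert abs(camel_back(a, b) - camel_back(-a, -b)) < 1e-12
```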
The parameters are set as follows: α = 0.5, k = 1, β = 0.004, I0 = 0.8, ε = 1, z(0) = [17.5, 17.5], y(0) = [0, 0]. The time evolution figure of the corresponding most positive Lyapunov exponent is shown in Fig. 1.
Fig. 1. Lyapunov exponent time evolution figure
We found that when the parameters α, k, β, z(0) and y(0) are kept fixed but the parameters I0 and ε are changed, the minimum point computed by Chen's model cannot reach (−0.08983, 0.7126) or (0.08983, −0.7126), nor does the minimum energy reach −1.0316285. With the parameters set as above, its minimum point is (1.3705e-131, 0.70717), and the corresponding minimum value is −1 within 2000 iterations. In what follows, the parameters α, k, β, z(0) and y(0) are kept fixed so as to make a comparison between Chen's model and ours.
3.2 Morlet-Sigmoid Chaotic Neural Network (M-SCNN)
The parameters are set as follows:
α = 0.5, k = 1, β = 0.004, I0 = 0.5, u0 = 0.05, u1 = 20, λ = 0.002, z(0) = [17.5, 17.5], y(0) = [0, 0], η(0) = [0.05, 0.05]. The time evolution figure of the corresponding most positive Lyapunov exponent is shown in Fig. 2.
Fig. 2. Lyapunov exponent time evolution figure
Under these parameters, the minimum point found is (−0.088431, 0.71251), and the corresponding minimum value is −1.0316 within 2000 iterations. As the above analysis shows, this result is more accurate than Chen's, and the convergence speed is much faster than that of Chen's model. In order to verify the availability of our novel model, we next apply it to the traveling salesman problem (TSP).
4 Application to Traveling Salesman Problem

The coordinates of the 10 cities are as follows: (0.4, 0.4439), (0.2439, 0.1463), (0.1707, 0.2293), (0.2293, 0.716), (0.5171, 0.9414), (0.8732, 0.6536), (0.6878, 0.5219), (0.8488, 0.3609), (0.6683, 0.2536), (0.6195, 0.2634). The shortest tour length for the 10 cities is 2.6776. Below are the test results for Chen's model and M-SCNN. The objective function we adopt is the one provided in reference [7], with parameters A = 2.5 and D = 1. The parameters of Chen's model are set as follows: α = 0.5, k = 1, I0 = 0.5, ε = 1/20, z(0) = [0.5, 0.5]. The parameters of M-SCNN are set as follows: α = 0.5, k = 1, u0 = 10, u1 = 0.8, I0 = 0.5, z(0) = [0.5, 0.5], λ = 0.001, η(0) = [0.8, 0.8]. We ran the test 200 times for different values of β, as shown in Table 1. (VN = valid number; GN = global number; VP = valid percent; GP = global percent.)
Table 1. Test results of the two chaotic neural networks

β       Model    VN    GN    VP      GP
0.04    M-SCNN   191   188   95.5%   94%
0.04    Chen's   180   177   90%     88.5%
0.01    M-SCNN   191   188   95.5%   94%
0.01    Chen's   180   177   90%     88.5%
0.008   M-SCNN   195   191   97.5%   95.5%
0.008   Chen's   183   182   91.5%   91%
The time evolution figures of the energy function of M-SCNN and Chen’s in solving TSP are respectively given in Fig.3 and Fig.4 when β =0.008.
Fig. 3. Energy time evolution figure of M-SCNN
Fig. 4. Energy time evolution figure of Chen’s
By comparison, it can be concluded that M-SCNN is superior to Chen's model. From Figs. 3 and 4, one can see that the convergence speed of M-SCNN is much faster than that of Chen's model in solving the TSP.
The superiority of M-SCNN is due to several factors. First, because of the properties of the Morlet wavelet function, the activation function of M-SCNN performs better than Chen's in solving combinatorial optimization problems. Second, chaotic behavior is easier to produce [8] because the activation function is non-monotonic. Third, η_i(t), which determines the steepness of the M-SCNN activation, varies with time.
5 Conclusions

We have introduced two models of chaotic neural networks. To verify their effectiveness, we compared the proposed model with Chen's model on optimization problems. From this comparison, one can conclude that M-SCNN is superior to Chen's model in searching for the global minima of continuous nonlinear functions. Different from Chen's model, the activation function of M-SCNN is composed of a Morlet wavelet and a sigmoid. Thus, besides having the nature of a sigmoid activation, the activation function of M-SCNN has a stronger nonlinearity than the sigmoid alone, which makes chaotic behavior easier to produce [8] because it is non-monotonic. Due to these factors, M-SCNN is also superior to Chen's model in solving the TSP.
Acknowledgement This work is supported by the National Science Foundation of China 70471087.
References

1. Hopfield, J., Tank, W.: Neural Computation of Decisions in Optimization Problems. Biol. Cybern. 52(2) (1985) 141-152
2. Hopfield, J.: Neural Networks and Physical Systems with Emergent Collective Computational Abilities. Proc. Natl. Acad. Sci. 79(4) (1982) 2554-2558
3. Hopfield, J.: Neurons with Graded Response Have Collective Computational Properties Like Those of Two-State Neurons. Proc. Natl. Acad. Sci. 81(4) (1984) 3088-3092
4. Chen, L., Aihara, K.: Chaotic Simulated Annealing by a Neural Network Model with Transient Chaos. Neural Networks 8(6) (1995) 915-930
5. Wang, L., Tian, F.: Noisy Chaotic Neural Networks for Solving Combinatorial Optimization Problems. International Joint Conference on Neural Networks, Italy: IEEE (2000) 37-40
6. Wang, L.: Intelligent Optimization Algorithms with Application. Tsinghua University & Springer Press, Beijing (2001)
7. Sun, S., Zheng, J.: A Kind of Improved Algorithm and Theory Testify of Solving TSP in Hopfield Neural Network. Acta Electronica Sinica 1(23) (1995) 73-78
8. Potapov, A., Ali, M.K.: Robust Chaos in Neural Networks. Physics Letters A 277(6) (2000) 310-322
A New Optimization Algorithm Based on Ant Colony System with Density Control Strategy

Ling Qin 1,3, Yixin Chen 2, Ling Chen 1,3, and Yuan Yao 1

1 Department of Computer Science, Nanjing University of Aeronautics and Astronautics, 210093 Nanjing, China
[email protected]
2 Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO 63130-4899, USA
[email protected]
3 Department of Computer Science, Yangzhou University, 225009 Yangzhou, China
[email protected]

Abstract. A new optimization algorithm based on the ant colony system is presented, adopting a density control strategy to guarantee the performance of the algorithm. In each iteration, solutions are selected for mutation operations according to their quality and distribution. Experimental results on the traveling salesman problem show that our algorithm can not only obtain diversified solutions and higher convergence speed than the Neural Network Model and the traditional ant colony algorithm, but also avoid the stagnation and prematurity problems.
1 Introduction

The Ant Colony algorithm (AC) was introduced by Dorigo, Maniezzo, and Colorni to solve the Traveling Salesman Problem (TSP) [1]. With further study in this area, the ant colony algorithm has been widely applied to complicated combinatorial optimization problems. However, the classical ant colony algorithm also has its defects: excessive positive feedback can cause premature solutions and local convergence. The major cause of these two problems is the lack of a diversity protection mechanism that can keep the balance between the convergence speed and the quality of the solutions. In this paper, we present a new type of ant colony algorithm using the idea of a density control strategy. In this algorithm, the individuals are selected to undergo crossover and mutation operations according to their fitness value and diversity. Experimental results on the traveling salesman problem show that our density controlled ant colony algorithm (DACA) can not only obtain diversified solutions and higher convergence speed than the Neural Network Model [2] (NNM) and the traditional ant colony algorithm [3] (TA), but also avoid the stagnation and prematurity problems.
2 The Classical Ant Colony Algorithm

Here we use the TSP as an example to illustrate AC and its application. We denote the distance between cities i and j as $d_{ij}$ ($i, j = 1, 2, \ldots, n$). Let $\tau_{ij}(t)$ be the intensity of trail
information on edge (i, j) at time t, and use $\tau_{ij}(t)$ to simulate the pheromone of real ants. Suppose m is the total number of ants; at time t the kth ant moves from its current city i to city j according to the following probability distribution:
$$p^k_{ij}(t) = \begin{cases} \dfrac{\tau^{\alpha}_{ij}(t)\,\eta^{\beta}_{ij}(t)}{\sum_{r\in allowed_k} \tau^{\alpha}_{ir}(t)\,\eta^{\beta}_{ir}(t)}, & j \in allowed_k, \\ 0, & \text{otherwise}. \end{cases} \qquad (1)$$
where $allowed_k$ is the set of cities that can be chosen by the kth ant at city i for the next step, and $\eta_{ij}$ is a heuristic function defined as the visibility of the path between cities i and j; for instance, it can be defined as $1/d_{ij}$. The relative influence of the trail information $\tau_{ij}(t)$ and the visibility $\eta_{ij}$ is determined by the parameters α and β. The intensity of the trail information is changed by the updating formula
$$\tau_{ij}(t+1) = \rho\,\tau_{ij}(t) + \Delta\tau_{ij}, \qquad (2)$$
where $0 < \rho < 1$, $(1-\rho)\tau_{ij}(t)$ represents the evaporation of $\tau_{ij}(t)$ between time t and t+1, and $\Delta\tau_{ij}$ is the increment of $\tau_{ij}(t)$:
$$\Delta\tau_{ij} = \sum_{k=1}^{m} \Delta\tau^{k}_{ij}. \qquad (3)$$
Here $\Delta\tau^{k}_{ij}$ is the trail information laid by the kth ant on the path between cities i and j in step t; it takes a different form depending on the model used. In the most popular model, called the "ant circle system", it is given as
$$\Delta\tau^{k}_{ij}(t) = \begin{cases} Q/L_k, & \text{if the kth ant passes edge } (i,j) \text{ in the current tour}, \\ 0, & \text{otherwise}, \end{cases} \qquad (4)$$
where Q is a constant and $L_k$ is the total length of the current tour traveled by the kth ant.
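Eqs. (1)-(4) can be sketched compactly as follows; the function names and the small example are ours, not the paper's:

```python
def transition_probs(i, allowed, tau, eta, alpha=1.0, beta=1.0):
    """Eq. (1): probability of moving from city i to each city j in `allowed`."""
    weights = {j: (tau[i][j] ** alpha) * (eta[i][j] ** beta) for j in allowed}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

def update_pheromone(tau, tours, lengths, rho=0.4, Q=5.0):
    """Eqs. (2)-(4): evaporation plus deposits of Q/L_k on each edge of ant k's tour."""
    n = len(tau)
    delta = [[0.0] * n for _ in range(n)]
    for tour, L in zip(tours, lengths):
        for a, b in zip(tour, tour[1:] + tour[:1]):   # closed tour edges
            delta[a][b] += Q / L
    return [[rho * tau[i][j] + delta[i][j] for j in range(n)] for i in range(n)]

tau = [[1.0] * 4 for _ in range(4)]
eta = [[1.0] * 4 for _ in range(4)]
p = transition_probs(0, [1, 2, 3], tau, eta)
assert abs(sum(p.values()) - 1.0) < 1e-12
```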
3 The Density Controlled Ant Colony Algorithm

Here we present the density controlled ant colony system to improve the optimization capability of the ant colony algorithm, and we use it to solve different kinds of TSP instances to evaluate its performance. In fact, experimental results have shown that our algorithm is suitable for solving not only the TSP but also the QoS routing problem, clustering problems, and continuous optimization problems [5].

3.1 Framework of the Density Controlled Ant Colony Algorithm

The ant colony algorithm is mainly an iterative course of searching for and verifying solutions. Although the evolution of the solutions can be accelerated by positive feedback, it may sometimes inevitably degenerate. Thus, we make full use of the characteristics of the problems to be solved and adopt the density control strategy [4] to restrain this degeneration during the iteration. The framework of the algorithm is as follows:

Begin
1. Initialize
   1.1 Generate Np initial solutions randomly, compute their fitness, and set the initial trail information on each edge (i, j)
2. While (t <= Nc)
   2.1 for k = 1 to N do        // for n cities
         for i = 1 to m do      // for m ants
           2.2.1 Ant i selects the next city j with the probability given by Eq. (5);
           2.2.2 Locally update the trail information on edge (i, j) according to Eq. (4);
         end for i
       end for k
   2.2 Select some solutions in the current colony for the mutation operation according to Eq. (10)
   2.3 Perform the global trail information update: update the pheromone on the edges between the cities according to Eq. (9);
   2.4 t++; if the convergence condition is satisfied then break the while loop;
   end while
End

In line 2.2.1, the next city j is selected according to formula (5):
$$j = \begin{cases} \arg\max_{j\in allowed_i} \tau_{ij}(t), & q \le q_0, \\ \text{select the next city } j_0 \text{ according to Eq. (1)}, & \text{otherwise}. \end{cases} \qquad (5)$$
Here, q is a stochastically generated number in (0, 1), $\arg\max \tau_{ij}(t)$ selects the available adjacent city of city i that has the highest trail information, and $q_0$ is the probability of selecting the optimal individual; for example, if $q_0 = 0.7$, the city with the highest trail information is chosen with a probability of at least 0.7.
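The pseudo-random proportional rule of Eq. (5) can be sketched as below (the names and the greedy test case are ours):

```python
import random

def select_next_city(i, allowed, tau, probs, q0=0.7, rng=random):
    """Eq. (5): exploit the best edge with probability q0, else roulette wheel on Eq. (1)."""
    if rng.random() <= q0:
        return max(allowed, key=lambda j: tau[i][j])
    r, acc = rng.random(), 0.0
    for j in allowed:
        acc += probs[j]
        if r <= acc:
            return j
    return allowed[-1]

tau = [[0.0, 0.1, 0.9, 0.2]] + [[0.0] * 4 for _ in range(3)]
probs = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
# with q0 = 1 the rule is purely greedy and picks the highest-pheromone city
assert select_next_city(0, [1, 2, 3], tau, probs, q0=1.0) == 2
```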
=
3.2 Pheromone Updating Based on the Density Control Strategy The biological density control mechanism uses the individual density in its colony. The higher individual density the ant has, i.e. the more approximate individuals this ant has, the lower diversity the colony has. Suppose ant it =(ai1, ai2 ,…,aiN) is the solution of ith ant at time t, where aih is the hth vertex that the ith ant chooses. We let distance (antit,antjt) be the distance between the ith ant and the jth ant, density(i) denotes the the density factor of ant i at time t. Here dismax is a constant determined by the size of problem N. It was proved by large number of experimental results that N /5 dismax
∈[0, ].
N distance ( ant it , ant tj ) = ∑ ( aih ⊕ a jh ). h =1 m density ( i ) = ∑ ε ( i , j ). j =1 ⎧⎪ 1 distance ( ant it , ant tj ) < dis max here ε (i, j ) = ⎨ ⎪⎩ 0 otherwise
(6) (7) .
(8)
In the traditional ant colony algorithms, there are several methods to update the pheromone: increasing the pheromone on the tours of all the ants [1], or increasing the pheromone on the tours where the best ant passed, or updating the pheromone based on a certain rank [6], or increasing the pheromone on tours passed by the ants whose
388
L. Qin et al.
fitness are above the average fitness and decreasing the pheromone on other tours. All these methods only consider the fitness of solutions, but ignore their distribution. In our algorithm, we present an adaptive pheromone updating strategy based on density factor reflecting the quality and distribution of the solutions. Our strategy can adjust the pheromone on the tours dynamically, and avoid the pheromone being centralized on some individuals. ⎧ Q ⋅ density(i) ⎪ if ant i chooses city k on current iteration Δτi = ⎨ lengthi . lk ⎪⎩ 0 otherwise
(9)
Here lengthi is the length of the solution of the ith ant in the current iteration. The algorithm makes full use of the information of the solutions to update the pheromone. The b density factor reflects the distribution of solution. Compared with other ants, the larger the ant’s density is, the more similar its solution is. By the density factor, the tour of the ant with the high fitness and low-density solution will be greatly increased. This adaptive pheromone updating method not only can strengthen the pheromone on the tour of the best solution, but also can guarantee the diversity of solutions. 3.3 Density Controlled Mutation Operation In line 2.2, the algorithm have mutation operation with the mutation probability of : P mutation
= e
− (1 −
density m
(i )
)⋅
f (i, t ) sumf ( t ) .
(10)
which is adaptively defined according to the density and fitness value of the solutions. Here $f(i,t)$ is the fitness of the ith ant at the tth iteration, $sumf(t)$ is the total fitness value of the whole colony at the tth iteration, and m is the total number of ants in the current iteration. It can be seen that ants with high individual density and low fitness value have a high mutation probability.
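Eq. (10) can be evaluated directly; the sketch below (our own helper) also checks the qualitative behavior just described:

```python
import math

def mutation_prob(density_i, m, fitness_i, sum_fitness):
    """Eq. (10): high-density, low-fitness ants mutate with probability near 1."""
    return math.exp(-(1.0 - density_i / m) * fitness_i / sum_fitness)

# an ant identical to everyone (density = m) mutates with probability exactly 1...
assert mutation_prob(10, 10, 5.0, 50.0) == 1.0
# ...while a unique, high-fitness ant mutates with a lower probability
assert mutation_prob(1, 10, 45.0, 50.0) < mutation_prob(10, 10, 45.0, 50.0)
```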
4 Experimental Results

The algorithm presented in this paper was applied to symmetric and asymmetric TSP benchmarks provided by TSPLIB. We also ran these benchmarks with the classical ant colony algorithm and the neural network model to compare their performance with our algorithm. From experience obtained in a large number of experiments [5], we set the parameters ρ = 0.4, Q = 5, α = 1 and β = 1.

Table 1. The experimental results of symmetric TSP
Instance  Algorithm  Best solution  Average solution  Number  Average time (s)  Allowed time (s)
d198      NNM        15780          15780.4           22      112.4             300
d198      TA         15780          15780.4           20      117.3             300
d198      DACA       15780          15780.2           23      107.5             300
att532    NNM        27686          27686.3           23      521.5             1000
att532    TA         27686          27686.3           22      533.0             1000
att532    DACA       27686          27686.2           24      472.6             1000
Table 2. The experimental results of asymmetric TSP
Instance  Algorithm  Best solution  Average solution  Number  Average time (s)  Allowed time (s)
ry48p     NNM        14422          14424.7           20      76.4              250
ry48p     TA         14422          14425.3           19      71.1              250
ry48p     DACA       14422          14422.0           25      35.9              250
ftv170    NNM        2755           2755.1            25      114.9             500
ftv170    TA         2755           2755.0            25      109.2             500
ftv170    DACA       2755           2755.0            25      103.4             500
Here "number" is the number of trials that obtained the optimal solution, "average solution" is the average route length of the solutions obtained, "average time" is the average time required to find the best solution over the trials, and "allowed time" is the longest time each trial was allowed to execute. The number included in the name of each problem denotes its number of cities.
Fig. 1. The evolutionary process of the best solution for d198 and ry48p using NNM, TA and DACA, respectively. DACA not only reaches solutions of the same fitness value, but also reaches the best solution faster than NNM and TA.
From the tables we can see that the solution quality of our algorithm is much higher than that of NNM and TA, and that it has a much higher processing speed and a stronger capability of finding optimal solutions, since DACA reaches the optimal solution in far fewer iterations. Fig. 1 shows the evolutionary process of the best solution for symmetric and asymmetric TSP problems. In DACA, after reaching the best solution, the curve fluctuates within a very narrow band around the best solution, which confirms that DACA has higher stability and is suitable for solving large-scale TSP problems.
5 Conclusion

In this paper we propose a new density controlled ant colony algorithm, adopting the density control strategy to guarantee the quality and diversity of the solutions. In this algorithm, individuals are selected for the mutation operation according to the diversity and quality of the solutions. Experimental results on TSP benchmarks show that our algorithm has a strong optimization capability.
Nevertheless, the ant colony algorithm is still a newly emerging optimization algorithm; although it has developed rapidly and has demonstrated its superiority in solving complex optimization problems, it still lacks a solid mathematical foundation, and its parameter selection usually depends on experiments and experience. All of this indicates that there are many open problems in ant colony optimization worth deep research, in both practical and theoretical respects.
Acknowledgements Supported in part by the Chinese National Natural Science Foundation under grant No. 60473012, Chinese National Foundation for Science and Technology Development under contract 2003BA614A-14, and Natural Science Foundation of Jiangsu Province under contract BK2005047.
References

1. Dorigo, M., Maniezzo, V., Colorni, A.: Ant System: Optimization by a Colony of Cooperating Agents. IEEE Transactions on Systems, Man, and Cybernetics - Part B 26(1) (1996) 29-41
2. Tan, K.C., Tang, H., Ge, S.S.: On Parameter Settings of Hopfield Networks Applied to Traveling Salesman Problems. IEEE Transactions on Circuits and Systems I - Regular Papers 52(8) (2005) 994-1002
3. Stützle, T., Hoos, H.H.: MAX-MIN Ant System. Future Generation Computer Systems 16(8) (2000) 889-914
4. Xiao, P.L., Wei, W.: A New Optimization Method on Immunogenetics. Acta Electronica Sinica 31(1) (2003) 59-62
5. Ling, C., Ling, Q., Hong, J.C., Xiao, H.X.: An Ant Colony Algorithm with Characteristic of Sensation and Consciousness. Journal of System Simulation 15(10) (2003) 1418-1425
6. Bullnheimer, B., Hartl, R.F., Strauss, C.: A New Rank Based Version of the Ant System - A Computational Study. Central European Journal for Operations Research and Economics 7(1) (1999) 25-38
A New Neural Network Approach to the Traveling Salesman Problem

Paulo Henrique Siqueira (1), Sérgio Scheer (2), and Maria Teresinha Arns Steiner (3)

(1) Federal University of Paraná, Department of Drawing, Postfach 19081, 81531-990, Curitiba, Brazil
[email protected]
(2) Federal University of Paraná, Department of Civil Construction, Postfach 19011, 81531-980, Curitiba, Brazil
[email protected]
(3) Federal University of Paraná, Department of Mathematics, Postfach 19081, 81531-990, Curitiba, Brazil
[email protected]
Abstract. This paper presents a technique that uses the Wang Recurrent Neural Network with the "Winner Takes All" principle to solve the Traveling Salesman Problem (TSP). When the Wang Neural Network presents solutions for the Assignment Problem with all constraints satisfied, the "Winner Takes All" principle is applied to the values in the Neural Network’s decision variables, with the additional constraint that the new solution must form a feasible route for the TSP. The results from this new technique are compared to other heuristics, with data from the TSPLIB (TSP Library). The 2-opt local search technique is applied to the final solutions of the proposed technique and shows a considerable improvement of the results.
1 Introduction

The Traveling Salesman Problem is a classical combinatorial optimization problem in Operations Research. The purpose is to find a Hamiltonian cycle of minimum total cost [1]. The problem has several practical applications, such as Vehicle Routing [2] and Drilling Problems [3]. In recent years it has been used as a comparison basis for improving several optimization techniques, such as Genetic Algorithms [4], Simulated Annealing [5], Tabu Search [6], Local Search [7], Ant Colony [8] and Neural Networks [9], the latter being used in this work. The principal types of Neural Networks used to solve the TSP are Hopfield's Recurrent Networks [10] and Kohonen's Self-Organizing Maps [9]. In a Hopfield Network, the main idea is to automatically find a solution for the TSP by means of a balanced state of the equation system defined for the TSP. When Kohonen Maps are used for the TSP, the final route is determined through the neurons whose weights are closest to the pair of coordinates ascribed to each city in the problem. The Recurrent Neural Network is also able to solve other classical Operations Research problems, such as the Shortest-Path Routing problem [11], [12] and the
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 391-398, 2006. © Springer-Verlag Berlin Heidelberg 2006
392
P.H. Siqueira, S. Scheer, and M.T.A. Steiner
Assignment Problem (AP) [13], [14]. The Wang Recurrent Neural Network (WRNN) with the "Winner Takes All" (WTA) principle [15] can be applied to solve the TSP: the problem is first solved as if it were an AP by means of the WRNN, and the WTA principle is then applied to the solutions found with the WRNN, under the constraint that the solutions must form a feasible route for the TSP. The parameters used for the WRNN are those that yield the best solutions for the AP. The solutions found with the technique proposed in this work (WRNN+WTA) are compared with the solutions from Self-Organizing Maps (SOM) and Simulated Annealing (SA) for the symmetrical TSP, and with other heuristics for the asymmetrical TSP. The 2-opt Local Search technique [7] is used to improve the solutions found with the proposed technique. The data used for the comparisons are from the TSPLIB database [16]. This paper is divided into 5 sections, including this introduction. In Section 2, the TSP is defined. In Section 3, the WRNN and the WTA principle are presented. In Section 4, the results of the proposed technique are presented and compared with other heuristics known in the literature. In Section 5, the conclusions are presented.
2 Formulation of the Problem

The TSP can be mathematically formulated as follows [1]:

Minimize C = Σ_{i=1}^{n} Σ_{j=1}^{n} c_ij x_ij    (1)

Subject to Σ_{i=1}^{n} x_ij = 1,  j = 1, 2, ..., n    (2)

Σ_{j=1}^{n} x_ij = 1,  i = 1, 2, ..., n    (3)

x_ij ∈ {0, 1},  i, j = 1, 2, ..., n    (4)

x̃ forms a Hamiltonian cycle    (5)

where c_ij and x_ij are, respectively, the costs and the decision variables associated with the assignment of element i to position j. When x_ij = 1, element i is assigned to position j, which means that the route contains the stretch from city i to city j. Vector x̃ holds the whole sequence of the route that was found, i.e., the solution for the TSP. The AP has the same formulation, except for (5). The objective function (1) represents the total cost to be minimized. The constraint sets (2) and (3) assure that each city i is assigned to exactly one city j. Set (4) represents the integrality constraints on the zero-one variables x_ij, and can be replaced by constraints of the following type [1]:

x_ij ≥ 0,  i, j = 1, 2, ..., n.    (6)
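For intuition, the formulation (1)-(5) can be checked by brute force on a tiny instance: every Hamiltonian cycle corresponds to a feasible x, and the optimum is the cycle of minimum total cost. A minimal Python sketch, assuming an arbitrary 5-city cost matrix (not one of the paper's benchmarks):

```python
import itertools
import numpy as np

# Hypothetical 5-city cost matrix (illustrative, not from TSPLIB).
c = np.array([
    [0, 3, 4, 2, 7],
    [3, 0, 4, 6, 3],
    [4, 4, 0, 5, 8],
    [2, 6, 5, 0, 6],
    [7, 3, 8, 6, 0],
])
n = len(c)

def tour_cost(tour):
    # Sum c[i][j] over consecutive cities, closing the cycle (constraint (5)).
    return sum(c[tour[k], tour[(k + 1) % n]] for k in range(n))

# Enumerate Hamiltonian cycles starting at city 0; each permutation is a
# feasible x satisfying (2)-(5), with x[i][j] = 1 iff j follows i in the tour.
best = min(((0,) + p for p in itertools.permutations(range(1, n))), key=tour_cost)
print(best, tour_cost(best))
```

Enumeration is only viable for very small n, which is why the paper resorts to neural heuristics for the TSPLIB instances.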
Constraint (5) assures that in the final route each city is visited once and that no sub-routes are formed. The matrix form of the problem described in (1)-(5) is the following [14]:

Minimize C = c^T x    (7)

Subject to Ax = b
x_ij ≥ 0,  i, j = 1, 2, ..., n
x̃ forms a Hamiltonian cycle    (8)

where vectors c^T and x contain all the rows of the cost matrix and of the matrix of decision elements x_ij, that is, c^T = (c_11, c_12, ..., c_1n, c_21, c_22, ..., c_2n, ..., c_n1, c_n2, ..., c_nn) and x = (x_11, x_12, ..., x_1n, x_21, x_22, ..., x_2n, ..., x_n1, x_n2, ..., x_nn). Vector b has the number 1 in all of its positions, and matrix A has the following form:

A = [ I    I    ...  I
      B_1  B_2  ...  B_n ] ∈ ℜ^{2n×n²}

where I is the identity matrix of order n × n, and each matrix B_i, for i = 1, 2, ..., n, has zeros everywhere except in the ith row, which has the number 1 in all of its positions.
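The block structure of A can be assembled directly. A small sketch (n = 4, with an arbitrary permutation as the feasible assignment) verifying that Ax = b holds for any permutation matrix x:

```python
import numpy as np

n = 4
I = np.eye(n)
# Each B_i is zero except for its i-th row, which is all ones.
B_blocks = []
for i in range(n):
    Bi = np.zeros((n, n))
    Bi[i, :] = 1.0
    B_blocks.append(Bi)

# A = [[I, I, ..., I], [B_1, B_2, ..., B_n]], of shape (2n, n^2).
A = np.vstack([np.hstack([I] * n), np.hstack(B_blocks)])
b = np.ones(2 * n)

# Any permutation matrix, flattened row-wise like c^T and x in the text,
# satisfies the assignment constraints (2) and (3).
x = np.eye(n)[[2, 0, 3, 1]].flatten()
print(A.shape)
assert np.allclose(A @ x, b)
```

The top block of A enforces the column sums (2) and the bottom block the row sums (3).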
3 WRNN and WTA to Solve the TSP

The Recurrent Neural Network proposed by Wang, published in [13], [14] and [17], is characterized by the following differential equation:

du_ij(t)/dt = −η Σ_{k=1}^{n} x_ik(t) − η Σ_{l=1}^{n} x_lj(t) + ηθ_ij − λc_ij e^{−t/τ},    (9)

where x_ij = g(u_ij(t)), and this Neural Network's balanced state is a solution for the AP. Function g is the sigmoid function with a β parameter, i.e.,

g(u) = 1 / (1 + e^{−βu}).    (10)

The boundary vector is defined as θ = A^T b = (2, 2, ..., 2)^T ∈ ℜ^{n²×1}. Parameters η, λ and τ are constant and empirically chosen [14], thus affecting the Neural Network's convergence. Parameter η penalizes violations of the constraint set of the problem defined by (1)-(4). Parameters λ and τ control the minimization of the AP's objective function (1). The matrix form of this Neural Network can be written as follows:

du(t)/dt = −η (Wx(t) − θ) − λc e^{−t/τ}    (11)

where x = g(u(t)) and W = A^T A. The technique proposed in this work uses the WTA principle, accelerating the WRNN's convergence and correcting eventual problems that may appear due to multiple optimal solutions or optimal solutions that are very close to each other [15].
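A rough forward-Euler discretization of (9)/(11) illustrates how the constraint terms drive x toward a doubly stochastic matrix while the decaying cost term biases it toward cheap assignments. The cost matrix, λ and τ below are arbitrary illustrative values, not the tuned parameters of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
c = rng.uniform(1.0, 10.0, size=(n, n))   # illustrative AP cost matrix

eta, beta = 1.0, 1.0
lam = 0.05        # small illustrative constant, not the per-row values of (12)
tau = 5.0         # illustrative decay constant, not chosen via (14)
dt, t = 0.01, 0.0

u = np.zeros((n, n))
for _ in range(20000):
    x = 1.0 / (1.0 + np.exp(-beta * u))    # sigmoid activation (10)
    row = x.sum(axis=1, keepdims=True)     # sum_k x_ik
    col = x.sum(axis=0, keepdims=True)     # sum_l x_lj
    # Euler step of (9), with every component of theta equal to 2.
    du = -eta * (row + col - 2.0) - lam * c * np.exp(-t / tau)
    u += dt * du
    t += dt

x = 1.0 / (1.0 + np.exp(-beta * u))
# Row and column sums should now be close to 1 (constraints (2) and (3)).
print(np.round(x.sum(axis=1), 3), np.round(x.sum(axis=0), 3))
```

At equilibrium of the constraint term every row and column sum equals 1, which is the near-feasible state at which the WTA principle of the next paragraphs is applied.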
The parameters chosen for the WRNN are those that determine the best results for the AP [15]. Parameter η is set equal to 1 in all cases tested in this work. Parameter λ is taken as a vector defined by

λ = η (1/d_1, 1/d_2, ..., 1/d_n),    (12)

where d_i, for i = 1, 2, ..., n, is the standard deviation of the ith row of the cost matrix c. Each element of vector λ is used to update the corresponding row of the decision matrix x. The best adjustment of parameter τ uses the fourth term of the WRNN's definition (9). When c_ij = c_max, the term λc exp(−t/τ) = α must be such that g(α) = φ ≅ 0. Isolating the value of α, we have

α = −ln(1/φ − 1) / β.    (13)

Using the value of α = λc exp(−t/τ) and expression (13), the value of parameter τ is obtained:

τ_i = −t / ln(−α / (λ_i c_max)).    (14)
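These settings can be computed directly. A small sketch of (12)-(14) with an arbitrary cost matrix; here φ is taken as 0.1 rather than a value very close to zero, so that the logarithm in (14) stays negative for this tiny example:

```python
import numpy as np

# Illustrative cost matrix; eta, beta and phi are example settings.
c = np.array([[4.0, 1.0, 9.0],
              [2.0, 8.0, 3.0],
              [7.0, 5.0, 6.0]])
eta, beta, phi = 1.0, 1.0, 0.1

# (12): one lambda per row, inversely proportional to that row's std deviation.
d = c.std(axis=1)
lam = eta / d

# (13): alpha such that g(alpha) = phi for the sigmoid (10).
alpha = -np.log(1.0 / phi - 1.0) / beta

# (14): per-row tau_i, given a reference time t at which the cost term
# lambda_i * c_max * exp(-t / tau_i) should have decayed to alpha's scale.
t_ref = 10.0
tau = -t_ref / np.log(-alpha / (lam * c.max()))
print(np.round(lam, 3), round(alpha, 3), np.round(tau, 2))
```

Note that (14) only yields a positive τ_i when −α < λ_i c_max, which constrains how small φ may be chosen for a given cost matrix.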
In equation (11), Wx(t) − θ measures the violation of the AP's constraints. After a certain number of iterations this term undergoes no substantial change in value, evidencing the fact that the problem's constraints are almost satisfied. At this moment, the WTA principle can be applied. When all the elements of x satisfy the condition |Wx(t) − θ| ≤ δ, where δ ∈ [0, 2], the proposed technique can be used; its algorithm is presented below:

Step 1: Determine a maximum number of routes r_max. Find an AP solution x using the WRNN. If |Wx(t) − θ| ≤ δ, then go to Step 2. Otherwise, find another solution x.

Step 2: Given the decision matrix, consider matrix x̄, where x̄ = x; set m = 1 and go to Step 3.

Step 3: Choose a row k in decision matrix x̄. Set p = k, x̃(m) = k and go to Step 4.

Step 4: Find the biggest element of row k, x̄_kl. This element's value is given by half of the sum of all elements of row k and of column l of matrix x, that is,

x̄_kl = (1/2) (Σ_{i=1}^{n} x_il + Σ_{j=1}^{n} x_kj).    (15)

The other elements of row k and column l become null. So that sub-routes are not formed, the other elements of column k must also be null. Set x̃(m + 1) = l; to continue the TSP's route, make k = l and go to Step 5.
Step 5: If m < n, then make m = m + 1 and go to Step 4. Otherwise, set

x̄_kp = (1/2) (Σ_{i=1}^{n} x_ip + Σ_{j=1}^{n} x_kj),    (16)

x̃(n + 1) = p, determine the route's cost C, and go to Step 6.

Step 6: If C < C_min, then set C_min = C and x = x̃. Make r = r + 1. If r < r_max, then run the WRNN again and go to Step 2; otherwise stop.

The technique proposed in this work can be applied to symmetrical or asymmetrical TSPs. The next section shows the results of applying this technique to some of the TSPLIB's problems, for symmetrical as well as asymmetrical TSPs.
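The tour-building core of Steps 3-5 can be sketched as follows. The value updates (15)/(16) and the restart loop of Steps 1 and 6 are omitted, and an arbitrary random matrix stands in for the WRNN's decision matrix:

```python
import numpy as np

def wta_route(x, start=0):
    """Build one TSP route from a decision matrix x (sketch of Steps 3-5).

    From city k, take the largest remaining element of row k; masking
    column k prevents returning to a visited city (no sub-routes)."""
    n = len(x)
    x = x.astype(float).copy()
    route = [start]
    k = start
    for _ in range(n - 1):
        x[:, k] = -np.inf          # city k can no longer be entered
        l = int(np.argmax(x[k]))   # winner of row k
        route.append(l)
        k = l
    return route

rng = np.random.default_rng(1)
x = rng.random((6, 6))             # stand-in for a WRNN near-feasible solution
np.fill_diagonal(x, 0.0)
route = wta_route(x)
print(route)                       # a permutation of the 6 cities
```

Masking the column of every visited city is what enforces constraint (5): the construction cannot close a sub-cycle before all n cities are placed.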
4 Results of Applying the Proposed Technique to Some of the TSPLIB's Problems

The results found with the proposed technique for the TSPLIB's symmetrical TSP cases are compared with SOM and SA results, and the asymmetrical cases are compared with arc removal and insertion heuristics.

For symmetrical TSPs, the following methods were used for comparison with the technique presented in this work: the method that applies statistical methods to a SOM's neuron weights [18], which has a global version (KniesG: Kohonen Network Incorporating Explicit Statistics, Global), where all cities are used in the neuron dispersion process, and a local version (KniesL), where only some represented cities are used in the neuron dispersion step; the SA technique [5], using the 2-opt improvement technique; Budinich's SOM, which consists of a traditional SOM applied to the TSP, presented in [5]; the expanded SOM (ESOM) [9], which, in each iteration, places the neurons close to their corresponding input data (cities) and, at the same time, places them at the convex contour determined by the cities; the efficient and integrated SOM (eISOM) [19], where the ESOM procedures are used and the winning neuron is placed at the mean point among its closest neighboring neurons; the efficient SOM technique (SETSP) [20], which defines updating rules for parameters that use the TSP's number of cities; and Kohonen's co-adaptive network (CAN) [21], which uses the idea of cooperation between neighboring neurons and employs a number of neurons larger than the number of cities in the problem.

Table 1 shows the average errors of the techniques mentioned above. Also shown are the results of the "pure" technique proposed in this work, of the proposed technique with the 2-opt improvement algorithm, and the maximum (max) and minimum (min) errors obtained for each problem considered.
The solutions presented in bold show the best result for each problem, disregarding the results with the 2-opt technique. The results of the proposed technique together with the 2-opt improvement are better than the results of the other Neural Networks, with the sole exception of problem lin105. The average errors range from 0 to 3.31%, as shown in the last column of Table 1. Without the improvement technique, the results for problem eil76 are better than the results of the other Neural Networks.
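The 2-opt improvement applied to these final routes is a simple local search: repeatedly reverse a segment of the tour whenever exchanging two edges shortens it. A generic sketch on an arbitrary random Euclidean instance (not a TSPLIB problem):

```python
import numpy as np

def two_opt(route, dist):
    """Reverse tour segments while any 2-edge exchange shortens the tour."""
    n = len(route)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                a, b = route[i], route[i + 1]
                c, d = route[j], route[(j + 1) % n]
                if a == d:           # the two edges share a node; skip
                    continue
                if dist[a, c] + dist[b, d] < dist[a, b] + dist[c, d]:
                    route[i + 1 : j + 1] = route[i + 1 : j + 1][::-1]
                    improved = True
    return route

rng = np.random.default_rng(2)
pts = rng.random((12, 2))
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

def cost(r):
    return sum(dist[r[k], r[(k + 1) % len(r)]] for k in range(len(r)))

initial = list(range(12))
improved = two_opt(initial.copy(), dist)
print(cost(initial), cost(improved))
```

Since every accepted move strictly decreases the tour length, the loop terminates at a 2-opt local optimum.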
Table 1. Results of the experiments for the symmetrical TSP

TSP      n    optimal   KniesG  KniesL  SA     BSom   ESom   EiSom  Setsp  CAN   | WRNN & WTA: max    min    2opt
eil51    51   430       2.86    2.86    2.33   3.10   2.10   2.56   2.22   0.94  | 1.16   1.16   0
st70     70   678.6     2.33    1.51    2.14   1.70   2.09   NC     1.60   1.33  | 4.04   2.71   0
eil76    76   545.4     5.48    4.98    5.54   5.32   3.89   NC     4.23   2.04  | 2.49   1.03   0
gr96     96   514       NC      NC      4.12   2.09   1.03   NC     NC     NC    | 6.61   4.28   0
rd100    100  7,910     2.62    2.09    3.26   3.16   1.96   NC     2.60   1.23  | 7.17   6.83   0.08
eil101   101  629       5.63    4.66    5.74   5.24   3.43   3.59   NC     1.11  | 7.95   3.02   0.48
lin105   105  14,383    1.29    1.98    1.87   1.71   0.25   NC     1.30   0     | 5.94   4.33   0.20
pr107    107  44,303    0.42    0.73    1.54   1.32   1.48   NC     0.41   0.17  | 3.14   3.14   0
pr124    124  59,030    0.49    0.08    1.26   1.62   0.67   NC     NC     2.36  | 2.63   0.33   0
bier127  127  118,282   3.08    2.76    3.52   3.61   1.70   NC     1.85   0.69  | 5.08   4.22   0.37
pr136    136  96,772    5.15    4.53    4.90   5.20   4.31   NC     4.40   3.94  | 6.86   5.99   1.21
pr152    152  73,682    1.29    0.97    2.64   2.04   0.89   NC     1.17   0.74  | 3.27   3.23   0
rat195   195  2,323     11.92   12.24   13.2   11.48  7.13   NC     11.19  5.27  | 8.82   5.55   3.31
kroa200  200  29,368    6.57    5.72    5.61   6.13   2.91   1.64   3.12   0.92  | 12.25  8.95   0.62
lin318   318  42,029    NC      NC      7.56   8.19   4.11   2.05   NC     2.65  | 8.65   8.35   1.90
pcb442   442  50,784    10.45   11.07   9.15   8.43   7.43   6.11   10.16  5.89  | 13.18  9.16   2.87
att532   532  27,686    6.8     6.74    5.38   5.67   4.95   3.35   NC     3.32  | 15.43  14.58  1.28

(Entries are average errors (%); NC: not computed. The last three columns give the maximum, minimum and 2-opt-improved errors of the proposed WRNN & WTA technique.)
Table 2. Results of the experiments for the asymmetrical TSP

TSP      n    optimal   GR      RI     KSP    GKS    RPC    COP    | WRNN & WTA: max    min    2opt
br17     17   39        102.56  0      0      0      0      0      | 0      0      0
ftv33    33   1,286     31.34   11.82  13.14  8.09   21.62  9.49   | 7.00   0      0
ftv35    35   1,473     24.37   9.37   1.56   1.09   21.18  1.56   | 5.70   3.12   3.12
ftv38    38   1,530     14.84   10.20  1.50   1.05   25.69  3.59   | 3.79   3.73   3.01
pr43     43   5,620     3.59    0.30   0.11   0.32   0.66   0.68   | 0.46   0.29   0.05
ftv44    44   1,613     18.78   14.07  7.69   5.33   22.26  10.66  | 2.60   2.60   2.60
ftv47    47   1,776     11.88   12.16  3.04   1.69   28.72  8.73   | 8.05   3.83   3.83
ry48p    48   14,422    32.55   11.66  7.23   4.52   29.50  7.97   | 6.39   5.59   1.24
ft53     53   6,905     80.84   24.82  12.99  12.31  18.64  15.68  | 3.23   2.65   2.65
ftv55    55   1,608     25.93   15.30  3.05   3.05   33.27  4.79   | 12.19  11.19  6.03
ftv64    64   1,839     25.77   18.49  3.81   2.61   29.09  1.96   | 2.50   2.50   2.50
ft70     70   38,673    14.84   9.32   1.88   2.84   5.89   1.90   | 2.43   1.74   1.74
ftv70    70   1,950     31.85   16.15  3.33   2.87   22.77  1.85   | 8.87   8.77   8.56
kro124p  100  36,230    21.01   12.17  16.95  8.69   23.06  8.79   | 10.52  7.66   7.66
ftv170   170  2,755     32.05   28.97  2.40   1.38   25.66  3.59   | 14.66  12.16  12.16
rbg323   323  1,326     8.52    29.34  0      0.53   0      0      | 16.44  16.14  16.14
rbg358   358  1,163     7.74    42.48  0      2.32   0.26   22.01  | 12.73  8.17   0
rbg403   403  2,465     0.85    9.17   0.69   0.20   4.71   4.71   | 4.71   0      0
rbg443   443  2,720     0.92    10.48  0      8.05   8.05   2.17   | 0      0      0

(Entries are average errors (%). The last three columns give the maximum, minimum and 2-opt-improved errors of the proposed WRNN & WTA technique.)
For the asymmetrical TSP, the techniques used for comparison with the proposed technique were [22]: the Karp-Steele path method (KSP) and the general Karp-Steele method (GKS), which begin with one cycle and, by removing arcs and inserting new ones, transform the initial cycle into a Hamiltonian one; the difference between these two techniques is that GKS uses all of the cycle's vertices for the changes in
the cycle's arcs. Also used were: the recursive path contraction heuristic (RPC), which forms an initial cycle and transforms it into a Hamiltonian cycle by removing arcs from every sub-cycle; the contraction-or-path heuristic (COP), which is a combination of the GKS and RPC techniques; the "greedy" heuristic (GR), which chooses the smallest arc in the graph, contracts this arc creating a new graph, and repeats this procedure up to the last arc, thus creating a route; and the random insertion heuristic (RI), which initially chooses 2 vertices and then repeatedly inserts a vertex that had not yet been chosen into the cycle, until a route including all vertices is created. Table 2 shows the average errors of the techniques described, as well as those of the "pure" technique presented in this work and of the proposed technique with the 2-opt improvement. The results of the "pure" proposed technique are better than or equivalent to those of the other heuristics mentioned above for problems br17, ftv33, ftv44, ft53, ft70 and kro124p, as shown in Table 2. By applying the 2-opt technique to the proposed technique, the best results were found for problems br17, ftv33, pr43, ry48p, ftv44, ft53, ft70 and kro124p, with average errors ranging from 0 to 16.14%.
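As an illustration of the insertion family, a sketch of the RI heuristic just described: pick the vertices in random order and insert each one at the position of minimum cost increase in the current cycle (arbitrary random instance; the cheapest-position rule is one common variant):

```python
import numpy as np

def random_insertion(dist, rng):
    """RI sketch: start from 2 random cities, then insert each remaining
    city at the cheapest position in the current cycle."""
    n = len(dist)
    order = list(rng.permutation(n))
    tour = order[:2]
    for city in order[2:]:
        best_pos, best_inc = 0, np.inf
        for pos in range(len(tour)):
            a, b = tour[pos], tour[(pos + 1) % len(tour)]
            inc = dist[a, city] + dist[city, b] - dist[a, b]
            if inc < best_inc:
                best_pos, best_inc = pos + 1, inc
        tour.insert(best_pos, city)
    return tour

rng = np.random.default_rng(3)
pts = rng.random((10, 2))
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
tour = random_insertion(dist, rng)
print(tour)
```

Each insertion keeps the partial cycle feasible, so a full route over all vertices is guaranteed after n − 2 insertions.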
5 Conclusions

This work presented the WRNN with the WTA principle to solve the TSP. By means of the WRNN, a solution for the AP is found, and the WTA principle is applied to this solution, transforming it into a feasible route for the TSP. The solutions were considerably improved when the 2-opt technique was applied to the solutions produced by the proposed technique. The data used for testing are from the TSPLIB, and the comparisons made with other heuristics show that the proposed technique obtains better results on several of the tested problems, with average errors below 16.14%. A great advantage of the technique presented in this work is the possibility of using the same technique to solve symmetrical as well as asymmetrical TSPs.
References
1. Ahuja, R.K., Magnanti, T.L., Orlin, J.B.: Network Flows. Prentice Hall, New Jersey (1993)
2. Laporte, G.: The Vehicle Routing Problem: An Overview of Exact and Approximate Algorithms. European Journal of Operational Research 59(2) (1992) 345-358
3. Onwubolu, G.C., Clerc, M.: Optimal Path for Automated Drilling Operations by a New Heuristic Approach Using Particle Swarm Optimization. International Journal of Production Research 42(3) (2004) 473-491
4. Affenzeller, M., Wagner, S.: A Self-Adaptive Model for Selective Pressure Handling within the Theory of Genetic Algorithms. In: Moreno-Díaz, R., Pichler, F. (eds.): Computer Aided Systems Theory - EUROCAST 2003. Lecture Notes in Computer Science, Vol. 2809. Springer-Verlag, Berlin Heidelberg New York (2003) 384-393
5. Budinich, M.: A Self-Organizing Neural Network for the Traveling Salesman Problem That Is Competitive with Simulated Annealing. Neural Computation 8 (1996) 416-424
6. Liu, G., He, Y., Fang, Y., Qiu, Y.: A Novel Adaptive Search Strategy of Intensification and Diversification in Tabu Search. In: Proceedings of the IEEE International Conference on Neural Networks and Signal Processing - ICNNSP'03, Vol. 1. IEEE (2003) 428-431
7. Bianchi, L., Knowles, J., Bowler, J.: Local Search for the Probabilistic Traveling Salesman Problem: Correction to the 2-P-Opt and 1-Shift Algorithms. European Journal of Operational Research 162(1) (2005) 206-219
8. Chu, S.C., Roddick, J.F., Pan, J.S.: Ant Colony System with Communication Strategies. Information Sciences 167(1-4) (2004) 63-76
9. Leung, K.S., Jin, H.D., Xu, Z.B.: An Expanding Self-Organizing Neural Network for the Traveling Salesman Problem. Neurocomputing 62 (2004) 267-292
10. Wang, R.L., Tang, Z., Cao, Q.P.: A Learning Method in Hopfield Neural Network for Combinatorial Optimization Problems. Neurocomputing 48(4) (2002) 1021-1024
11. Wang, J.: Primal and Dual Neural Networks for Shortest-Path Routing. IEEE Transactions on Systems, Man and Cybernetics - Part A: Systems and Humans 28(6) (1998) 864-869
12. Xia, Y., Wang, J.: A Discrete-Time Recurrent Neural Network for Shortest-Path Routing. IEEE Transactions on Automatic Control 45(11) (2000) 2129-2135
13. Wang, J.: Analog Neural Network for Solving the Assignment Problem. Electronics Letters 28(11) (1992) 1047-1050
14. Hung, D.L., Wang, J.: Digital Hardware Realization of a Recurrent Neural Network for Solving the Assignment Problem. Neurocomputing 51 (2003) 447-461
15. Siqueira, P.H., Scheer, S., Steiner, M.T.A.: Application of the "Winner Takes All" Principle in Wang's Recurrent Neural Network for the Assignment Problem. In: Wang, J., Liao, X., Yi, Z. (eds.): Advances in Neural Networks - ISNN 2005. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag, Berlin Heidelberg New York (2005) 731-738
16. Reinelt, G.: TSPLIB - A Traveling Salesman Problem Library. ORSA Journal on Computing 3(4) (1991) 376-384
17. Wang, J.: Primal and Dual Assignment Networks. IEEE Transactions on Neural Networks 8(3) (1997) 784-790
18. Aras, N., Oommen, B.J., Altinel, I.K.: The Kohonen Network Incorporating Explicit Statistics and Its Application to the Traveling Salesman Problem. Neural Networks 12(9) (1999) 1273-1284
19. Jin, H.D., Leung, K.S., Wong, M.L., Xu, Z.B.: An Efficient Self-Organizing Map Designed by Genetic Algorithms for the Traveling Salesman Problem. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics 33(6) (2003) 877-887
20. Vieira, F.C., Doria Neto, A.D., Costa, J.A.: An Efficient Approach to the Travelling Salesman Problem Using Self-Organizing Maps. International Journal of Neural Systems 13(2) (2003) 59-66
21. Cochrane, E.M., Beasley, J.E.: The Co-Adaptive Neural Network Approach to the Euclidean Travelling Salesman Problem. Neural Networks 16(10) (2003) 1499-1525
22. Glover, F., Gutin, G., Yeo, A., Zverovich, A.: Construction Heuristics for the Asymmetric TSP. European Journal of Operational Research 129(3) (2001) 555-568
Dynamical System for Computing Largest Generalized Eigenvalue

Lijun Liu and Wei Wu

Department of Applied Mathematics, Dalian University of Technology, Dalian 116024, P.R. China

Abstract. A dynamical system is proposed for the generalized eigen-decomposition problem. The stable points of the dynamical system are proved to be the eigenvectors corresponding to the largest generalized eigenvalue.
1 Introduction

Principal component analysis (PCA) [2, 3, 5, 9, 11] and linear discriminant analysis (LDA) [1] are widely used statistical techniques in various applications such as data compression, feature extraction and signal estimation. PCA provides a set of adaptive algorithms for the eigen-decomposition of a matrix A, which is the asymptotic expectation of a sequence of random matrices. LDA, on the other hand, provides adaptive algorithms for the generalized eigen-decomposition of a matrix pair (A, B), which are the asymptotic expectations of two sequences of random matrices. According to stochastic approximation theory [6], under some mild conditions, the asymptotic behavior of a PCA or LDA learning algorithm can be determined by its corresponding averaging differential equation [1-3, 5, 7, 9-13]. Therefore, to design and analyze PCA or LDA learning algorithms, we can design and analyze certain differential equations for solving the generalized eigenvalue problem. In this paper we aim to find an eigenpair (λ, x) ∈ R × R^n such that

Ax = λBx    (1)

for given matrices A and B. When B is invertible, (1) is equivalent to the ordinary eigenvalue problem

B^{-1}Ax = λx    (2)

Let us have a look at some existing results on the computation of eigenvalues by ordinary differential equations, on which our work is based. Zhang and Yan [13] successfully use a neural network model for computing the ordinary eigenvalue problem Ax = λx for a symmetric matrix A as follows:

dx/dt = x^T x Ax − x^T Ax x,  t ≥ 0    (3)

We observe that in general B^{-1}A in (2) is no longer guaranteed to be symmetric. Thus model (3) cannot be extended directly to the generalized eigenvalue problem (1).

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 399-404, 2006. © Springer-Verlag Berlin Heidelberg 2006
400
L. Liu and W. Wu
Rao and Principe [10] propose an RLS-type algorithm for generalized eigen-decomposition, which can be summarized as the differential equation

dx/dt = (x^T Bx / x^T Ax) B^{-1}Ax − x,  t ≥ 0    (4)

Their research is strongly related to and successfully applied in signal processing theory, but their arguments and results cannot be applied directly to the general generalized eigenvalue problem (1). Zhang and Leung [12] propose a general framework for PCA algorithms based on generalized eigen-decomposition using the differential equation

dx(t)/dt = Ax − f(x)Bx,  t ≥ 0    (5)

This framework succeeds in covering many important algorithms. But in order to obtain a general result for the framework, they require that both A and B have the same eigenvector associated with the largest eigenvalue, which is generally hard to satisfy. Based on and inspired by the above-mentioned works, in this paper we propose the following dynamical system for the generalized eigen-decomposition problem (1):

dx/dt = x^T Bx B^{-1}Ax − x^T Ax x,  t ≥ 0    (6)

We assume A is symmetric and B is symmetric positive definite. We see that, compared with (4), the term x^T Ax has been moved out of the denominator in order to prevent possible instability in case x^T Ax is small. The rest of this paper is organized as follows. In Section 2, some properties of the solutions to the dynamical system are studied as a preparation for the convergence results. In Section 3, the convergence of the system to the largest generalized eigenvalue is established. A numerical result is given in Section 4.
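A minimal sketch of (6) under forward-Euler integration, with a random symmetric pair generated under a fixed seed. Since the random A need not be positive definite, the shift described in Remark 1 of the next section (replacing A by A + cB) is applied first; the step size, iteration count and tolerance are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
M = rng.standard_normal((n, n))
A = M + M.T                          # symmetric, not necessarily positive definite
N = rng.standard_normal((n, n))
B = N @ N.T + n * np.eye(n)          # symmetric positive definite

# Shift so that As = A + c*B is positive definite; generalized eigenvectors
# are unchanged and every eigenvalue shifts by c.
c = 1.0 + max(0.0, -np.linalg.eigvalsh(A).min() / np.linalg.eigvalsh(B).min())
As = A + c * B
C = np.linalg.solve(B, As)           # B^{-1} As, precomputed

x = rng.standard_normal(n)
x /= np.sqrt(x @ B @ x)              # x^T B x = 1, conserved by the exact flow
dt = 1e-3
for _ in range(200_000):
    x = x + dt * ((x @ B @ x) * (C @ x) - (x @ As @ x) * x)   # Euler step of (6)

lam_est = (x @ A @ x) / (x @ B @ x)  # Rayleigh quotient on the unshifted A
lam_ref = np.linalg.eigvals(np.linalg.solve(B, A)).real.max()
print(lam_est, lam_ref)
```

The Rayleigh quotient x^T Ax / x^T Bx is scale-invariant, so the small drift of x^T Bx introduced by the Euler discretization does not affect the eigenvalue estimate.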
2 Preliminary Results

Assumption 1. A and B are real, symmetric and positive definite.

Remark 1. If B is positive definite but A is not, then we can simply replace A by A + cB, such that A + cB is positive definite for a suitable positive scalar constant c. It is easy to show that both A and A + cB have the same generalized eigenvector ξ with respect to B, with generalized eigenvalues λ and λ + c, respectively.

Theorem 1. If x(t) is a solution to (6) for t ≥ 0, then x(t)^T Bx(t) keeps invariant for t ≥ 0, i.e., x(t)^T Bx(t) ≡ x(0)^T Bx(0).

Proof. It is straightforward to check that

d(x^T Bx)/dt = 2 x^T B dx/dt = 2 x^T B (x^T Bx B^{-1}Ax − x^T Ax x) = 0    (7)

So x^T Bx keeps constant for t ≥ 0.
Remark 2. Luo [7] considered the invariant norm property of the principal component algorithm. Here we extend this property to the generalized eigenvalue problem.

Theorem 2. Let ω be any given vector in R^n. Then there exists a unique solution x(t) with x(0) = ω for the system (6).

Proof. Denote the right-hand side of (6) by f(x). Then f(x) is continuously differentiable in x. For a given point ω ∈ R^n, there exist a neighborhood N(ω) of ω and a positive constant K_1 such that the local Lipschitz condition is valid, i.e.,

‖f(u) − f(v)‖ ≤ K_1 ‖u − v‖    (8)

for all u, v ∈ N(ω), where ‖·‖ is the usual Euclidean norm. Furthermore, from Theorem 1 and the positive definiteness of B, we have

‖x(t)‖² ≤ δ^{-1} x(0)^T Bx(0)    (9)

where δ is the smallest positive eigenvalue of B. By the well-known Picard-Lindelöf theorem [8], there exists a unique solution x(t) with x(0) = ω for the system (6), which can be uniquely extended to the infinite time interval [0, ∞).

Theorem 3. If ξ ≠ 0 is an equilibrium point of the system (6), then ξ is an eigenvector associated with the eigenvalue λ = ξ^T Aξ / ξ^T Bξ of (1). Conversely, if ξ is an eigenvector corresponding to an eigenvalue of the generalized eigenvalue problem (1), then ξ is also an equilibrium point of (6).

Proof. The proof is straightforward.

For convenience of reference, we present the following well-known generalized eigen-decomposition theorem (cf. [4]).

Theorem 4. Let Assumption 1 be valid. Then the generalized eigenvalue problem (1) has n linearly independent eigenvectors {S_i}_{i=1}^{n} associated with its real generalized eigenvalues {λ_i}_{i=1}^{n}, and λ_i > 0 (i = 1, 2, ..., n). Furthermore, denoting S = (S_1, S_2, ..., S_n) ∈ R^{n×n}, we have

S^T AS = Λ = diag{λ_1, λ_2, ..., λ_n}    (10)

S^T BS = I = diag{1, 1, ..., 1}    (11)
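Theorem 4 can be verified numerically. A sketch using the standard Cholesky reduction B = LL^T, under which (1) becomes an ordinary symmetric eigenvalue problem (random positive definite pair with a fixed seed):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)            # symmetric positive definite
N = rng.standard_normal((n, n))
B = N @ N.T + np.eye(n)            # symmetric positive definite

# Reduce A x = lambda B x to a standard symmetric problem with B = L L^T:
# (L^{-1} A L^{-T}) y = lambda y,  x = L^{-T} y.
L = np.linalg.cholesky(B)
C = np.linalg.solve(L, np.linalg.solve(L, A).T).T   # L^{-1} A L^{-T}
lams, Y = np.linalg.eigh(C)
S = np.linalg.solve(L.T, Y)        # columns: B-orthonormal generalized eigenvectors

# Theorem 4: S^T A S = diag(lambda_i) and S^T B S = I.
print(np.round(lams, 4))
assert np.allclose(S.T @ B @ S, np.eye(n), atol=1e-8)
assert np.allclose(S.T @ A @ S, np.diag(lams), atol=1e-8)
```

Since C is symmetric, `eigh` returns orthonormal Y, which is exactly what makes S^T BS = Y^T Y = I.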
3 Convergence Analysis
Denote x(t) = Σ_{i=1}^{n} θ_i(t) S_i = SΘ. Then according to Theorem 1 we have

‖Θ(t)‖² = x(0)^T Bx(0) = Θ(0)^T S^T BS Θ(0) = ‖Θ(0)‖²    (12)

So there exists a uniform upper bound M > 0 such that

|θ_i(t)| ≤ M,  i = 1, 2, ..., n    (13)

Now we give the main convergence theorem for the dynamics of (6).
Theorem 5. If θ_1(0) ≠ 0 and the largest generalized eigenvalue λ_1 is unrepeated, then

lim_{t→∞} x(t) = ± sqrt(x(0)^T Bx(0)) S_1    (14)

Proof. Under the transformation x = SΘ, (6) can be rewritten as

S dΘ/dt = ‖Θ(0)‖² B^{-1}ASΘ − (Θ^T S^T ASΘ) SΘ    (15)

Clearly S^{-1} = S^T B by Theorem 4. Then we have

dΘ/dt = ‖Θ(0)‖² S^T B B^{-1}ASΘ − (Θ^T ΛΘ)Θ = ‖Θ(0)‖² S^T ASΘ − (Θ^T ΛΘ)Θ = ‖Θ(0)‖² ΛΘ − (Θ^T ΛΘ)Θ    (16)

Hence, (6) can be expressed in component form as

dθ_i(t)/dt = (‖Θ(0)‖² λ_i − Σ_{j=1}^{n} λ_j θ_j²(t)) θ_i(t),  i = 1, 2, ..., n    (17)

from which we have

θ_i(t) = θ_i(0) exp(‖Θ(0)‖² λ_i t − ∫_0^t Σ_{j=1}^{n} λ_j θ_j²(s) ds)    (18)

Clearly if θ_i(0) = 0, then θ_i(t) = 0 for all t ≥ 0. Since θ_1(0) ≠ 0, we can define θ_i(t)/θ_1(t) for i = 1, 2, ..., n. Then we have

d/dt (θ_i(t)/θ_1(t)) = (1/θ_1(t)) (‖Θ(0)‖² λ_i − Σ_{j=1}^{n} λ_j θ_j²(t)) θ_i(t) − (θ_i(t)/θ_1²(t)) (‖Θ(0)‖² λ_1 − Σ_{j=1}^{n} λ_j θ_j²(t)) θ_1(t) = (θ_i(t)/θ_1(t)) ‖Θ(0)‖² (λ_i − λ_1)    (19)

This means that

θ_i(t)/θ_1(t) = (θ_i(0)/θ_1(0)) exp(‖Θ(0)‖² (λ_i − λ_1) t)    (20)

So by λ_i − λ_1 < 0 (i = 2, 3, ..., n), we have

lim_{t→∞} θ_i(t)/θ_1(t) = 0.    (21)

From (12), (13) and (21), we have

lim_{t→∞} θ_1(t) = ± sqrt(Σ_{i=1}^{n} θ_i²(0)) = ± sqrt(x(0)^T Bx(0))    (22)

and

lim_{t→∞} x(t) = lim_{t→∞} Σ_{i=1}^{n} θ_i(t) S_i = ± sqrt(x(0)^T Bx(0)) S_1    (23)

Remark 3. If λ_1 is repeated, say λ_1 = ... = λ_l > λ_{l+1} ≥ ... ≥ λ_n, then it is easy to show that lim_{t→∞} x(t) = lim_{t→∞} Σ_{i=1}^{l} θ_i(t) S_i, which is still an eigenvector associated with λ_1. In fact, this limit can be expressed explicitly in terms of x(0) through the procedure adopted in [13].
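The decoupled dynamics (17) and the ratio law (20) can be checked numerically in Θ-coordinates (where, by Theorem 4, the B-norm coincides with the Euclidean norm). The spectrum and the initial state below are arbitrary illustrative choices:

```python
import numpy as np

# Illustrative spectrum and initial state for the component system (17).
lams = np.array([3.0, 2.0, 1.0])          # lambda_1 > lambda_2 > lambda_3
theta = np.array([0.5, 0.7, 0.5])         # theta_1(0) != 0
norm0_sq = theta @ theta                   # ||Theta(0)||^2
ratios0 = theta / theta[0]

dt = 1e-4
T = 4.0
for _ in range(int(T / dt)):
    # Euler step of (17): the common term lams @ theta^2 couples the components
    # only through a shared scalar, so the ratios obey (20).
    theta = theta + dt * (norm0_sq * lams - lams @ (theta**2)) * theta

# (20): theta_i/theta_1 = (theta_i(0)/theta_1(0)) exp(||Theta(0)||^2 (lam_i - lam_1) t)
predicted = ratios0 * np.exp(norm0_sq * (lams - lams[0]) * T)
print(theta / theta[0], predicted)
```

The shared scalar Σ_j λ_j θ_j² cancels exactly in the ratio derivative, which is why the decay rates in (20) depend only on the eigenvalue gaps λ_i − λ_1.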
4 Numerical Result
In this section, we randomly select a symmetric matrix A and a symmetric positive definite matrix B as follows:

A = [  0.9660  0.2860  0.1939 −0.0699 −0.1404
       0.2860 −0.2482 −0.0702 −0.3862  0.3025
       0.1939 −0.0702  0.6894  0.2726  0.5524
      −0.0699 −0.3862  0.2726  0.1384 −0.0330
      −0.1404  0.3025  0.5524 −0.0330  0.3111 ]

and

B = [  1.7838  0.2825  0.2648 −0.4169 −0.1014
       0.2825  2.6752  0.0882  0.1259 −0.2907
       0.2648  0.0882  2.0226  0.7591  0.1542
      −0.4169  0.1259  0.7591  2.6133  0.0689
      −0.1014 −0.2907  0.1542  0.0689  1.2801 ]
Fig. 1. Asymptotic stability of x(t): the components x1(t), ..., x5(t) settle to constant values over 0 ≤ t ≤ 50.

Fig. 2. Convergence of x^T Ax / x^T Bx: the estimated λ_max approaches the actual value λ_max = 0.6500 over 0 ≤ t ≤ 10.
The true largest generalized eigenvalue is λ_max = 0.6500. Using the network model, the estimated eigenvector corresponding to the largest eigenvalue is computed as ξ = (0.4276, −0.2191, −0.6859, 0.2629, −0.9024)^T, with the initial value x(0) taken as (0.1249, 0.2332, −0.7733, 0.7965, 0.5091)^T. The result is shown in Figure 1 and Figure 2. The numerical result further confirms that the network model can be used to compute the largest generalized eigenvalue of a real symmetric matrix A with respect to B.
Acknowledgement. This work is partly supported by the National Natural Science Foundation of China (10471017).
References
1. Chatterjee, C., Roychowdhury, V.P., Zoltowski, M.D., Ramos, J.: Self-Organizing and Adaptive Algorithms for Generalized Eigendecomposition. IEEE Trans. on Neural Networks 8 (1997) 1518-1530
2. Chen, T.P.: Modified Oja's Algorithms for Principal Subspace and Minor Subspace Extraction. Neural Processing Letters 5 (1997) 105-110
3. Diamantaras, K.J., Hornik, K., Strintzis, M.G.: Optimal Linear Compression under Unreliable Representation and Robust PCA Neural Models. IEEE Trans. on Neural Networks (1999) 1186-1195
4. Golub, G.H., Van Loan, C.F.: Matrix Computations. Johns Hopkins Univ. Press (1983)
5. Hornik, K., Kuan, C.M.: Convergence Analysis of Local Feature Extraction Algorithms. Neural Networks (1992) 229-240
6. Ljung, L.: Analysis of Recursive Stochastic Algorithms. IEEE Trans. on Automatic Control AC-22 (1977) 551-575
7. Luo, F., Unbehauen, R., Li, Y.D.: A Principal Component Analysis Algorithm with Invariant Norm. Neurocomputing (1995) 213-221
8. Miller, R.K., Michel, A.N.: Ordinary Differential Equations. Academic Press, New York (1982)
9. Oja, E.: Principal Components, Minor Components, and Linear Neural Networks. Neural Networks 5 (1992) 927-935
10. Rao, Y.N., Principe, J.C.: An RLS Type Algorithm for Generalized Eigendecomposition. IEEE Workshop on Neural Networks for Signal Processing XI (2001) 263-272
11. Xu, L., Oja, E., Suen, C.: Modified Hebbian Learning for Curve and Surface Fitting. Neural Networks 5 (1992) 441-457
12. Zhang, Q.F., Leung, Y.W.: A Class of Learning Algorithms for Principal Component Analysis and Minor Component Analysis. IEEE Trans. on Neural Networks 11 (2000) 200-204
13. Zhang, Y., Yan, F., Tang, H.J.: Neural Networks Based Approach for Computing Eigenvectors and Eigenvalues of Symmetric Matrix. Computers and Mathematics with Applications 47 (2004) 1155-1164
A Concise Functional Neural Network for Computing the Extremum Eigenpairs of Real Symmetric Matrices Yiguang Liu and Zhisheng You Institute of Image & Graphics, School of Computer Science and Engineering, Sichuan University, Chengdu, 610064, P.R. China
[email protected]
Abstract. Quick extraction of the extremum eigenpairs of a real symmetric matrix is very important in engineering. Neural networks perform this operation in a parallel manner and can achieve high performance. This paper therefore proposes a very concise functional neural network (FNN) to compute the largest (or smallest) eigenvalue and one corresponding eigenvector. After transforming the FNN into a differential equation and obtaining its analytic solution, the convergence properties are completely analyzed. Based on this FNN, a method is designed that can compute the extremum eigenpairs whether the matrix is non-definite, positive definite, or negative definite. Finally, three examples show its validity. Compared with other networks used in the same field, the proposed FNN is very simple and concise, so it is very easy to realize.
1 Introduction

Using neural networks to compute eigenvalues and eigenvectors of a matrix is parallel and fast, and quick computation of eigenvalues and eigenvectors has many applications, such as principal component analysis (PCA) for real-time image compression and adaptive signal processing. Many works in this field have therefore been reported [1-14]. Recently, a recurrent neural network (RNN) was proposed to compute eigenvalues and eigenvectors of a real symmetric matrix [6]. It is as follows:

ẋ(t) = −x(t) + [xᵀ(t)x(t)A + (1 − xᵀ(t)Ax(t))I]x(t)    (1)
where t ≥ 0, x = (x1, …, xn)ᵀ ∈ Rⁿ represents the states of the neurons, I denotes the identity matrix, and the matrix A ∈ Rⁿˣⁿ whose eigenpairs are to be computed is symmetric. Recently, reference [11] proposed the neural network model dx(t)/dt = Ax(t) − xᵀ(t)x(t)x(t) to perform a similar computation. This paper gives an improved form of that model, described as

ẋ(t) = Ax(t) − sgn[xᵀ(t)]x(t) · x(t)    (2)
where A, t, and x(t) are defined as in eq. (1), and sgn[xi(t)] returns 1 if xi(t) > 0, 0 if xi(t) = 0, and −1 if xi(t) < 0; thus sgn[xᵀ(t)]x(t) = Σ_{i=1}^n |xi(t)| is the l1-norm of x(t). When A − I·sgn[xᵀ(t)]x(t) is viewed as the matrix of synaptic connection weights, which obviously are functions of x(t), and the neuron activation functions are assumed to be purely linear, eq. (2) describes a functional neural network (FNN (2)). Evidently, FNN (2) is simpler than dx(t)/dt = Ax(t) − xᵀ(t)x(t)x(t) and than RNN (1); this feature is important for electronic design. Compared with RNN (1), the connection count is much smaller and the electronic circuitry is more concise, so the realization is easier. Moreover, the performance is close to that of RNN (1), as can be seen in the convergence analysis section of this paper.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 405 – 413, 2006. © Springer-Verlag Berlin Heidelberg 2006
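As an illustration of how FNN (2) can be simulated (the paper's experiments use MATLAB; this pure-Python forward-Euler sketch, its test matrix, initial value, and step-size choices are ours, not the authors'):

```python
# Forward-Euler sketch of FNN (2):  x'(t) = A x(t) - sgn[x^T(t)]x(t) . x(t),
# where sgn[x^T] x = sum_i |x_i| is the l1-norm of x.  The matrix, initial
# value and step size below are illustrative choices, not the paper's examples.

def fnn2_equilibrium(A, x0, dt=1e-3, steps=40000):
    """Integrate FNN (2) and return the (approximate) equilibrium vector."""
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        l1 = sum(abs(v) for v in x)            # sgn[x^T] x = ||x||_1
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] + dt * (Ax[i] - l1 * x[i]) for i in range(n)]
    return x

# Symmetric, non-definite 2x2 test matrix with eigenvalues +sqrt(5) and -sqrt(5).
A = [[1.0, 2.0], [2.0, -1.0]]
xi = fnn2_equilibrium(A, [0.3, 0.2])
lam1 = sum(abs(v) for v in xi)                 # l1-norm of the equilibrium
print(lam1)
```

At the equilibrium, Aξ = ‖ξ‖₁ξ, so the printed l1-norm recovers the largest positive eigenvalue, here √5 ≈ 2.236.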
2 Some Preliminaries

Since matrix A is real symmetric, all its eigenvalues are denoted λ1 ≥ λ2 ≥ ⋯ ≥ λn; the corresponding eigenvectors and eigensubspaces are denoted μ1, …, μn and V1, …, Vn. Assume matrix A has m ≤ n distinct eigenvalues, denoted σ1 > σ2 > ⋯ > σm; let Ki (i = 1, …, m) denote the sum of the algebraic multiplicities of σ1, …, σi, and let K0 = 0, λ0 = 0; obviously Km = n. Let

Ui = V_{K_{i−1}+1} ⊕ ⋯ ⊕ V_{K_i}  (i = 1, …, m)    (3)
where '⊕' denotes the direct sum operator. Obviously Ui is the eigensubspace corresponding to σi.

Lemma 1. λ1 + α ≥ λ2 + α ≥ ⋯ ≥ λn + α are the eigenvalues of A + αI (α ∈ R), and μ1, …, μn are the eigenvectors corresponding to λ1 + α, …, λn + α.

Proof: From Aμi = λiμi, i = 1, …, n, it follows that Aμi + αμi = λiμi + αμi, i.e., (A + αI)μi = (λi + α)μi, i = 1, …, n. Therefore the lemma holds.

Let ξ denote an equilibrium vector of FNN (2). ξ always exists, because ξ = 0 is an equilibrium vector. When the limit exists, ξ = lim_{t→∞} x(t) and

Aξ = (Σ_{i=1}^n |ξi|) ξ,    (4)

so when ξ ≠ 0, ξ is an eigenvector and the corresponding eigenvalue λ is

λ = Σ_{i=1}^n |ξi|.    (5)
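Relations (4) and (5) can be checked directly: any eigenvector scaled so that its l1-norm equals its positive eigenvalue is an equilibrium of FNN (2). A minimal sketch with a hypothetical 2×2 matrix (not one of the paper's examples):

```python
# Direct check of the equilibrium relations (4)-(5): if xi is an eigenvector of A
# scaled so that sum_i |xi_i| equals its (positive) eigenvalue, then
# A xi - (sum_i |xi_i|) xi = 0, i.e. xi is an equilibrium of FNN (2).
# The matrix below is an illustrative choice, not from the paper.
A = [[2.0, 1.0], [1.0, 2.0]]       # eigenpair: lambda = 3 with eigenvector (1, 1)
xi = [1.5, 1.5]                    # (1, 1) scaled so |1.5| + |1.5| = 3 = lambda
lam = sum(abs(v) for v in xi)      # eq. (5): recovered eigenvalue
residual = max(abs(sum(A[i][j] * xi[j] for j in range(2)) - lam * xi[i])
               for i in range(2))  # eq. (4): A xi - (sum_i |xi_i|) xi
print(lam, residual)   # 3.0 0.0
```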
3 Analytic Solution and Convergence Analysis

Like the proving procedure of Theorem 1 in ref. [11] or [12], the following theorem can be similarly proved.

Theorem 1. Denote Si = μiμiᵀ, and let xi(t) denote the projection value of x(t) onto Si; then xi(t) can be presented as

xi(t) = xi(0) exp(λi t) / (1 + Σ_{j=1}^n |xj(0)| ∫_0^t exp(λj τ) dτ).    (6)
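As a consistency check (our own verification, not spelled out in the paper), write x(t) = Σ_i xi(t)μi; then FNN (2) decouples into ẋi = λi xi − (Σ_j |xj|) xi, and differentiating (6) confirms it solves this system:

```latex
% Let s(t) = 1 + \sum_{j=1}^{n} |x_j(0)| \int_0^t e^{\lambda_j \tau}\, d\tau,
% so \dot s(t) = \sum_j |x_j(0)| e^{\lambda_j t} and (6) reads
% x_i(t) = x_i(0)\, e^{\lambda_i t} / s(t).  Then
\dot x_i
  = \frac{\lambda_i x_i(0) e^{\lambda_i t}}{s}
    - \frac{x_i(0) e^{\lambda_i t}}{s} \cdot \frac{\dot s}{s}
  = \lambda_i x_i - \Big( \sum_{j=1}^{n} \frac{|x_j(0)| e^{\lambda_j t}}{s} \Big) x_i
  = \lambda_i x_i - \Big( \sum_{j=1}^{n} |x_j(t)| \Big) x_i .
```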
Using analytic solution (6) and the techniques used in ref. [11], we can prove the following theorems.

Theorem 2. If x(0) ≠ 0 and x(0) ∈ Vi, then ξ = 0 or ξ ∈ Vi. If ξ ∈ Vi, then λi = Σ_{k=1}^n |ξk|.

Theorem 3. If x(0) ≠ 0, x(0) ∉ Vi (i = 1, …, n), and ξ ≠ 0, then ξ ∈ U1 and λ1 = Σ_{k=1}^n |ξk|.

Theorem 4. When A is replaced with −A, if x(0) ≠ 0, x(0) ∉ Vi (i = 1, …, n), and ξ ≠ 0, then ξ ∈ Um and λn = −Σ_{k=1}^n |ξk|.

Theorem 5. Whether or not the sign of A is reversed, FNN (2) cannot directly compute λn of a positive definite matrix or λ1 of a negative definite matrix.

Theorem 6. Let ⌈λ1⌉ denote the minimal integer not less than λ1. For a positive definite matrix A, if x(0) ≠ 0 and x(0) ∉ Vi (i = 1, …, n), replacing A with −(A − ⌈λ1⌉I) makes ξ ≠ 0 and

λn = −Σ_{i=1}^n |ξi| + ⌈λ1⌉.    (7)

Theorem 7. Let ⌊λn⌋ denote the maximal integer not more than λn. For a negative definite matrix A, if x(0) ≠ 0 and x(0) ∉ Vi (i = 1, …, n), replacing A with (A − ⌊λn⌋I) makes ξ ≠ 0 and

λ1 = Σ_{i=1}^n |ξi| + ⌊λn⌋.    (8)

Theorem 8. For FNN (2), either ξ = 0 or |ξ1| + |ξ2| + ⋯ + |ξn| equals one positive eigenvalue of A.
4 Steps to Compute λ1 and λn

In this section, steps are designed to compute the largest and smallest eigenvalues of a real symmetric matrix B ∈ Rⁿˣⁿ with FNN (2):

1. Replace A with B. If ξ ≠ 0, then λ1 = |ξ1| + |ξ2| + ⋯ + |ξn|.
2. Replace A with −B. If ξ ≠ 0, then λn = −(|ξ1| + |ξ2| + ⋯ + |ξn|); if λ1 has already been obtained, go to step 5, otherwise go to step 3. If ξ = 0, go to step 4.
3. Replace A with (B − ⌊λn⌋I); then λ1 = |ξ1| + ⋯ + |ξn| + ⌊λn⌋, and go to step 5.
4. Replace A with −(B − ⌈λ1⌉I); then λn = −(|ξ1| + ⋯ + |ξn|) + ⌈λ1⌉, and go to step 5.
5. End.
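The five steps above can be sketched as follows (a pure-Python illustration, not the paper's MATLAB code; the 2×2 test matrix, tolerances, and the floating-point guards around the ceil/floor shifts are our own assumptions):

```python
import math

# Sketch of the 5-step procedure of Section 4, with FNN (2) integrated
# by forward Euler.  Test matrix and tolerances are illustrative choices.

def fnn2_equilibrium(A, x0, dt=1e-3, steps=40000):
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        l1 = sum(abs(v) for v in x)
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] + dt * (Ax[i] - l1 * x[i]) for i in range(n)]
    return x

def extreme_eigenvalues(B, x0, zero_tol=1e-6):
    n = len(B)
    def shift(M, s):                        # M - s*I
        return [[M[i][j] - (s if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    def neg(M):
        return [[-m for m in row] for row in M]
    def l1(v):
        return sum(abs(vi) for vi in v)

    s1 = l1(fnn2_equilibrium(B, x0))        # step 1: A <- B
    lam_max = s1 if s1 > zero_tol else None
    s2 = l1(fnn2_equilibrium(neg(B), x0))   # step 2: A <- -B
    lam_min = -s2 if s2 > zero_tol else None
    if lam_max is None:                     # step 3: B is negative definite
        f = math.floor(lam_min + 1e-9)      # guard floor against float noise
        lam_max = l1(fnn2_equilibrium(shift(B, f), x0)) + f
    if lam_min is None:                     # step 4: B is positive definite
        c = math.ceil(lam_max - 1e-9)       # guard ceil against float noise
        lam_min = -l1(fnn2_equilibrium(neg(shift(B, c)), x0)) + c
    return lam_max, lam_min

# Positive definite test matrix with eigenvalues 3 and 1: step 2 yields xi = 0,
# so the smallest eigenvalue is recovered via the step-4 shift.
B = [[2.0, 1.0], [1.0, 2.0]]
lam_max, lam_min = extreme_eigenvalues(B, [0.3, 0.2])
print(lam_max, lam_min)
```

For this matrix, step 1 returns λ1 ≈ 3 and step 4 returns λn ≈ 1, matching the true spectrum.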
5 Simulation Results

We give three examples to test the validity of the method for computing the smallest and largest eigenvalues of a real symmetric matrix. The simulation platform is MATLAB.

Example 1. A real 7 × 7 symmetric matrix is randomly generated as

B =
( 0.5972 0.3858 0.8341 0.7884 0.5024 0.3249 0.5769 )
( 0.3858 0.7015 0.2719 0.6883 0.4196 0.2061 0.6175 )
( 0.8341 0.2719 0.9566 0.3936 0.7404 0.6784 0.3759 )
( 0.7884 0.6883 0.3936 0.2473 0.2119 0.3470 0.4511 )
( 0.5024 0.4196 0.7404 0.2119 0.3143 0.5366 0.2305 )
( 0.3249 0.2061 0.6784 0.3470 0.5366 0.5414 0.6682 )
( 0.5769 0.6175 0.3759 0.4511 0.2305 0.6682 0.5030 )

Through step 1, we get ξ = (0.5851, 0.4572, 0.6304, 0.4575, 0.4404, 0.4783, 0.4893)^T, so λ1 = |ξ1| + ⋯ + |ξn| = 3.5382. The trajectories of xi(t) (i = 1, 2, …, 7) and λ1(t) = x1(t) + ⋯ + xn(t) are shown in Figs. 1 and 2; from them we can see that FNN (2) quickly reaches the equilibrium state, and the largest eigenvalue is obtained soon. Through step 2, we get ξ = (−0.1301, −0.0931, 0.0269, 0.1453, 0.0858, −0.0916, 0.0848)^T and λn = −(|ξ1| + ⋯ + |ξn|) = −0.6576. The trajectories of xi(t) (i = 1, 2, …, 7) and λn(t) = −(|x1(t)| + ⋯ + |xn(t)|) are shown in Figs. 3 and 4. Using MATLAB to directly compute the eigenvalues of B gives the true largest eigenvalue λ1' = 3.5379 and smallest eigenvalue λn' = −0.6576.
Fig. 1. The trajectories of the components of x when A is replaced with B
Fig. 2. The trajectories of λ1(t) = x1(t) + ⋯ + xn(t) when A is replaced with B
Fig. 3. The trajectories of the components of x when A is replaced with −B
Fig. 4. The trajectories of λn(t) = −(|x1(t)| + ⋯ + |xn(t)|) when A is replaced with −B
Fig. 5. The trajectories of the components of x when A is replaced with C
Fig. 6. The trajectories of x1(t) + ⋯ + xn(t) when A is replaced with C
Therefore, the results computed by the method are very close to the true values; the absolute differences are |λ1 − λ1'| = |3.5382 − 3.5379| = 0.0003 and |λn − λn'| = |−0.6576 − (−0.6576)| = 0.0000. This verifies that using this method to compute the largest and smallest eigenvalues of B is successful.

Example 2. A positive definite matrix is

C =
( 0.6495 0.5538 0.5519 0.5254 )
( 0.5538 0.8607 0.7991 0.5583 )
( 0.5519 0.7991 0.8863 0.4508 )
( 0.5254 0.5583 0.4508 0.6775 )
Fig. 7. The trajectories of the components of x when A is replaced with −C
Fig. 8. The trajectories of λn(t) = −(|x1(t)| + ⋯ + |xn(t)|) when A is replaced with −C
Fig. 9. The trajectories of the components of x when A is replaced with C′
Fig. 10. The trajectories of −(|x1(t)| + ⋯ + |xn(t)|) when A is replaced with C′
The true eigenvalues of C computed by MATLAB are λ' = (0.0420, 0.1496, 0.3664, 2.5160). From step 1 we get λ1 = 2.5160; the trajectories of the components of x(t) and of x1(t) + ⋯ + xn(t), which approaches λ1, are shown in Figs. 5 and 6. Using step 2, the trajectories of the components of x(t) and of λn(t) = −(|x1(t)| + ⋯ + |xn(t)|) are shown in Figs. 7 and 8. The equilibrium vector is ξ = 1.0e−5 · (0.1229, 0.2827, −0.2183, −0.1342)^T, and λn = −(|ξ1| + ⋯ + |ξn|) = −7.5811e−6. Obviously, λn is very close to zero, so step 4 is used to compute the smallest eigenvalue. Replacing A with

C′ = −(C − ⌈λ1⌉I) =
( −2.3505 0.5538 0.5519 0.5254 )
( 0.5538 −2.1393 0.7991 0.5583 )
( 0.5519 0.7991 −2.1137 0.4508 )
( 0.5254 0.5583 0.4508 −2.3225 )

the trajectories of the components of x(t) and of −(|x1(t)| + ⋯ + |xn(t)|), which approaches the smallest eigenvalue of −C′ = C − ⌈λ1⌉I, are shown in Figs. 9 and 10. In the end, we get the equilibrium vector ξ* = (0.4168, 1.0254, −0.9303, −0.5854)^T and the smallest eigenvalue of C: λn* = −(|ξ1*| + ⋯ + |ξn*|) + ⌈λ1⌉ = −2.9579 + 3 = 0.0421.
λ1 = 2.5160 and λn* = 0.0421 are the results computed by FNN (2). Comparing with the true values λ1' = 2.5160 and λn' = 0.0420, we can see that λ1 matches λ1' (absolute difference 0.0000) and λn* is very near λn' (absolute difference 0.0001). This verifies that using step 4 to compute the smallest eigenvalue of a real positive definite symmetric matrix is valid.

Example 3. The matrix D = −C is used to test the validity of FNN (2). Obviously, D is negative definite, and its true eigenvalues are λ' = (−2.5160, −0.3664, −0.1496, −0.0420). Replacing A with D, step 1 yields the value 7.5811e−6, and we cannot decide whether this is the true largest eigenvalue. Through step 2, the smallest eigenvalue λ4 = −2.5160 is obtained. Using step 3, we get λ1 = −0.0421. Comparing λ4 and λ1 with λ4' and λ1', respectively, it can be found that using step 3 to compute the largest eigenvalue works well.

The initial value x(0) in example 1 is [0.2892, 0.1175, 0.0748, 0.2304, 0.2502, 0.2001, 0.2275]^T, and x(0) in examples 2 and 3 is [0.2892, 0.1175, 0.0748, 0.2304]^T. Repeating the three examples with other initial values, the obtained results are all very close to the corresponding true values.
6 Conclusions

This paper proposes a very concise functional neural network described by a differential dynamical system; the component-wise analytic solution of this system gives its convergence behavior. The network is suited to extracting the largest positive or smallest negative eigenvalue. Detailed steps are designed to compute the largest and smallest eigenvalues of a real symmetric matrix using this network, and the method applies to non-definite, positive definite, and negative definite matrices. Three examples show the validity of the method: B is non-definite, C is positive definite, and D is negative definite, and the results obtained are all very close to the corresponding true values. Compared with other approaches based on neural networks, this functional neural network is very concise and can be realized easily.
Acknowledgement. This work is supported by the Youth Foundation of Sichuan University (No. 200574) and the Youth Foundation of the School of Computer Science & Engineering of Sichuan University.
References
1. Luo, F.-L., Unbehauen, R., Cichocki, A.: A Minor Component Analysis Algorithm. Neural Networks 10(2) (1997) 291-297
2. Luo, F.-L., Unbehauen, R., Li, Y.-D.: A Principal Component Analysis Algorithm with Invariant Norm. Neurocomputing 8(2) (1995) 213-221
3. Song, J., Yam, Y.: Complex Recurrent Neural Network for Computing the Inverse and Pseudo-inverse of the Complex Matrix. Applied Mathematics and Computation 93(2-3) (1998) 195-205
4. Kakeya, H., Kindo, T.: Eigenspace Separation of Autocorrelation Memory Matrices for Capacity Expansion. Neural Networks 10(5) (1997) 833-843
5. Kobayashi, M., Dupret, G., King, O., Samukawa, H.: Estimation of Singular Values of Very Large Matrices Using Random Sampling. Computers and Mathematics with Applications 42(10-11) (2001) 1331-1352
6. Zhang, Y., Nan, F., Hua, J.: A Neural Networks Based Approach Computing Eigenvectors and Eigenvalues of Symmetric Matrix. Computers and Mathematics with Applications 47(8-9) (2004) 1155-1164
7. Luo, F.-L., Li, Y.-D.: Real-time Neural Computation of the Eigenvector Corresponding to the Largest Eigenvalue of Positive Matrix. Neurocomputing 7(2) (1995) 145-157
8. Reddy, V. U., Mathew, G., Paulraj, A.: Some Algorithms for Eigensubspace Estimation. Digital Signal Processing 5(2) (1995) 97-115
9. Perfetti, R., Massarelli, E.: Training Spatially Homogeneous Fully Recurrent Neural Networks in Eigenvalue Space. Neural Networks 10(1) (1997) 125-137
10. Liu, Y., You, Z., Cao, L., Jiang, X.: A Neural Network Algorithm for Computing Matrix Eigenvalues and Eigenvectors. Journal of Software 16(6) (2005) 1064-1072
11. Liu, Y., You, Z., Cao, L.: A Simple Functional Neural Network for Computing the Largest and Smallest Eigenvalues and Corresponding Eigenvectors of a Real Symmetric Matrix. Neurocomputing 67 (2005) 369-383
12. Liu, Y., You, Z., Cao, L.: A Functional Neural Network for Computing the Largest Modulus Eigenvalues and Their Corresponding Eigenvectors of an Anti-symmetric Matrix. Neurocomputing 67 (2005) 384-397
13. Liu, Y., You, Z., Cao, L.: A Functional Neural Network Computing Some Eigenvalues and Eigenvectors of a Special Real Matrix. Neural Networks 18(10) (2005) 1293-1300
14. Wang, J.: Recurrent Neural Networks for Computing Pseudoinverses of Rank-deficient Matrices. SIAM Journal on Scientific Computing 18(5) (1997) 1479-1493
A Novel Stochastic Learning Rule for Neural Networks Frank Emmert-Streib Institut für Theoretische Physik, Universität Bremen, Otto-Hahn-Allee 1, 28359 Bremen, Germany
[email protected]
Abstract. The purpose of this article is the introduction of a novel stochastic Hebb-like learning rule for neural networks which combines features of unsupervised (Hebbian) and supervised (reinforcement) learning. This learning rule is stochastic with respect to the selection of the time points when a synaptic modification is induced by simultaneous activation of the pre- and postsynaptic neuron. Moreover, the learning rule affects not only the synapse between the pre- and postsynaptic neuron, which is called homosynaptic plasticity, but also further remote synapses of the pre- and postsynaptic neuron. This more complex form of plasticity, called heterosynaptic plasticity, has recently come into the focus of experimental investigations in neurobiology. Our learning rule is motivated by these experimental findings and gives a qualitative explanation of this kind of synaptic plasticity. Additionally, we give some numerical results demonstrating that our learning rule works well in training neural networks, even in the presence of noise.
1 Introduction

What are the mechanisms that modulate learning in the brain of an animal or a human? This question is still under debate, but the common picture of a biological learning rule is that the synaptic weights are changed according to a local rule. In the context of neural networks, local means that only the adjacent neurons of a synapse contribute to modifications of the synaptic weight. Such a mechanism for synaptic strengthening was proposed by Hebb [11] in 1949 and experimentally found by T. Bliss and T. Lømo [3]. In biological terms, Hebbian learning is called long-term potentiation (LTP). Experimentally as well as theoretically, there is a great body of investigations aiming to formulate precise conditions under which learning in neural networks takes place. For example, the influence of the precise timing of pre- and postsynaptic neuron firing [14, 17] or the duration of a synaptic change (for a review see [16]) have been studied extensively. All these studies share the locality condition proposed by Hebb [11]. However, there are also experimental findings which
Present address: Stowers Institute for Medical Research, 1000 E. 50th Street, Kansas City, MO 64110, USA.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 414–423, 2006. c Springer-Verlag Berlin Heidelberg 2006
extend the traditional view of synaptic plasticity. In this article we present a novel stochastic Hebb-like learning rule inspired by experimental findings about heterosynaptic plasticity [9]. This form of neural plasticity affects not only the synapse between the pre- and postsynaptic neuron in which a synaptic modification was induced, but also further remote synapses of the pre- and postsynaptic neuron. Additionally, we demonstrate that this learning rule can be successfully applied in training multilayer neural networks. This paper is organized as follows. In Section 2 we review experimental results about heterosynaptic plasticity and learning in general. In Section 3 we introduce our learning rule, and in Section 4 we present numerical results. The article finishes in Section 5 with a discussion and conclusions.
2 Biological Learning

Recently, there has been an increasing number of experimental results investigating a new form of synaptic modification, so-called heterosynaptic plasticity. In contrast to homosynaptic plasticity, where only the synapse between the active pre- and postsynaptic neuron is changed, heterosynaptic plasticity concerns also further remote synapses of the pre- and postsynaptic neuron. This scenario is depicted in the left panel of Fig. 1. Here we suppose that the synapse w68 was changed either by LTP or LTD. Fitzsimonds et al. [9] found in cultured hippocampal neurons that the induction of LTD in w68 is also accompanied by back propagation of depression in the dendritic tree of the presynaptic neuron. Furthermore, depression also propagates laterally in the pre- and postsynaptic neuron. Similar results hold for the propagation of LTP; see [2] for a review. We emphasize all synapses whose weights are changed, and all neurons which enclose these synapses, by drawing full lines respectively black circles. We want to emphasize explicitly that Fitzsimonds et al. found up to now no forward-propagated postsynaptic plasticity; this would correspond to the synapses w811, w812 of neuron n8, which are drawn as dotted lines in the left panel of Fig. 1. A biological explanation for the cellular mechanisms of these findings is currently under investigation. Fitzsimonds et al. [9] suggest the existence of retrograde signaling from the post- to the presynaptic neuron, which could produce a secondary cytoplasmic factor for back-propagation and presynaptic lateral spread of LTD. On the postsynaptic side, lateral spread of LTD could be explained similarly under the assumption that there is a blocking mechanism for the cytoplasmic factor which prevents forward-propagated LTD. They are of the opinion that extracellular diffusible factors are of minor importance. The experiments of Fitzsimonds et al. [9] are certainly an extension of homosynaptic learning, which we denote briefly as Hebbian learning¹, but nevertheless both principles can be characterized as unsupervised learning, because both learning types use exclusively local information available in the neural system. This

¹ We are aware that Hebbian learning is usually only used in the context of LTP, as explained in the text above, and want to emphasize by this parlance that homosynaptic LTP and LTD are more interrelated than homo- and heterosynaptic LTP.
Fig. 1. Left: Visualization of heterosynaptic plasticity experimentally found by Fitzsimonds et al. The neurons (synapses) which are affected by heterosynaptic plasticity induced in w68 are drawn as black circles (full lines). Right: Visualization of the symmetry of our learning rule. The cell character of a neuron is emphasized by a schematic contour around the whole neuron. Each neuron in this network has a new degree of freedom ci we call the neuron counter. Neuron counters can communicate with each other across synapses. The dynamics of the neuron counters is determined by the global reinforcement signal.
is in contrast to the famous back-propagation learning rule [21, 19] for artificial neural networks. The back-propagation algorithm is famous because until the 1980s no systematic method was known to adjust the synaptic weights of an artificial multilayer (feed-forward) network to learn a mapping². Still, the problem with the back-propagation algorithm is that it is not biologically plausible, because it requires back-propagation of an error in the network. We emphasize that the problem is not the back-propagation process itself (heterosynaptic plasticity, e.g., could provide such a mechanism), but the knowledge of the error, which cannot be known explicitly to the neural network [6]. For this reason, learning by back-propagation is classified as supervised learning, or learning by a teacher [12]. However, there is a modified form of supervised learning, namely reinforcement learning, that is biologically plausible. Reinforcement learning reduces the information provided by a teacher to a binary reinforcement signal r that reflects the quality of the network's performance. Interestingly, experimental observations from the hippocampus CA1 region have shown that there is a global signal in the form of dopamine which is fed back to the neurons and thereby causes a modulation of LTD [18]. Schematically, this is depicted in the right panel of Fig. 1, in which each neuron is connected with an additional edge representing the feedback of dopamine in the form of a reinforcement signal r.

² The back-propagation algorithm was independently developed by Werbos [21] and Rumelhart et al. [19], but became known to the broad research community only after the article by Rumelhart et al.
The working mechanism of the learning rule we suggest is based on the explanation of Fitzsimonds et al. [9] for heterosynaptic plasticity given above. To understand what kind of mathematical formulation is capable of describing 'a secondary cytoplasmic factor' in a qualitative way, we start by emphasizing that, from a biological point of view, a neuron is first of all a cell. The subdivision of a neuron into synapses, soma (cell body), and axon is a model, and it already reflects the direction of the information flow within the neuron, namely from the synapses (input) to the soma (information processing) to the axon (output). Here, we do not question this model view with respect to the direction of signal processing, but with respect to learning. We see no biological reason why the model of a neuron for signal processing should be the same as the model of a neuron for learning. In the right panel of Fig. 1 we emphasize the cell character of a neuron by underlying the contour of the whole neuron in gray. Now our reason for drawing the synapses in an unusual way becomes clear, because it automatically emphasizes the cell character of a neuron. Suppose now that we assign to each neuron in the network one additional parameter ci, as shown in the right panel of Fig. 1. We call these parameters ci neuron counters. The neuron counters shall modulate the synaptic modification in a certain way, defined in detail below. According to our cell view of the neuron, we assume further that the neuron counters of adjacent neurons, which are connected by synapses, can communicate with each other in an additive way. For example, in Fig. 1 the neuron counters c6 and c8 form a new value d68 = c6 + c8 in synapse w68, which we call the approximated synapse counter.

By this mechanism we obtain a star-like influence of, e.g., the neuron counters c6 and c8 on all synapses connected with neuron 6 or 8, because either d6k = c6 + ck or dk8 = ck + c8 holds and regulates the update of the corresponding synaptic weight of the synapses w6k and wk8, respectively. This situation corresponds in a qualitative way to the learning behavior of heterosynaptic plasticity, however with the difference that we have a fully symmetric learning rule. An interpretation of the communication between adjacent neuron counters can be given if one views the neuron counters as cytoplasmic factors which are allowed to move freely within the cytoplasm of the corresponding neuron (cell). Because we introduce no blocking mechanism for the forward propagation of the postsynaptic neuron counter, we arrive at a fully symmetric communication between adjacent neuron counters. Unfortunately, no experimental data are available that would allow specifying the influence of dij on the corresponding synapse wij quantitatively. For this reason, we use an ansatz to close this gap and make it plausible [7].
3 Mathematical Definition of the Learning Rule

If one assumes that the neuron counters shall modulate learning, then it is plausible to determine the values of ci in dependence on a reinforcement signal r reflecting the performance of the network qualitatively. In the simplest case, the dynamics of the neuron counters depends linearly on the reinforcement signal:
ci → ci' = { Θ,       if ci − r > Θ
           { ci − r,  if Θ ≥ ci − r ≥ 0
           { 0,       if 0 > ci − r           (1)
Here, Θ ∈ N is a threshold that restricts the neuron counters ci to the Θ + 1 possible values {0, …, Θ}. The value of ci reflects the network's performance, but it has only relative and no absolute meaning with respect to the mean network error. This can be seen by the following example. Suppose c = 0; then it is clear that at least the last output of the network was right, r = 1. However, we know nothing about the outputs which occurred before the last one. E.g., the following two sequences of reinforcement signals can lead to the same value of the neuron counter c = 0: r1 = {1, −1, 1, −1, 1, −1, 1} and r2 = {1, 1, 1, 1, 1, 1, 1}, if the start value is c = 1 for r1 and c = 7 for r2. Obviously, the estimated mean error is different in the two cases if averaged over the last seven time steps. The crucial point is that the start value of the neuron counter is not available to the neuron, and hence the neuron cannot directly calculate the mean error of the network. However, we can introduce a simple assumption which allows an estimate of the mean network error. We claim that if c¹ < c² for one neuron in a network, but trained by two different learning rules³, then the mean error of network one is lower than that of network two. This may not hold in all cases, but it is certainly true on average. By this we couple the value of the neuron counter to the mean error of the network. Because this holds only statistically, we will introduce a stochastic rather than a deterministic update rule for the synapses that depends on the neuron counters. In the previous section we said that adjacent neuron counters can communicate if both neurons are connected by a synapse. This gives a new variable

dij = ci + cj    (2)

we call the approximated synapse counter. We will use the approximated synapse counter as the driving parameter of our stochastic update rule, because its value reflects the performance of the synapse which shall be updated; the synapses are the adaptive part of a neural network. Hence, evaluating the value of the approximated synapse counter of a synapse indirectly gives us a decision for the update of this synapse. It is clear that, roughly speaking, the higher the approximated synapse counter of a synapse, the higher should be the probability that the synapse is updated. This intuitively plausible assumption will now be quantified. Similar to [1, 5, 15], only active synapses wij which were involved in the last signal-processing step can be updated, if the output of the network was wrong. This is plausible, because it prevents already learned mappings in the neural network from possibly being destroyed. If r = −1, the probability that synapse wij is updated is given by

PΔw(wij) = P(pc < p^r_dij)    (3)

³ The superscript indicates here not the neuron in the network, but the learning rule used.
This probability has to be calculated for each synapse wij in the network. We want to emphasize that this needs only local information besides the reinforcement signal; hence, it is a biologically possible mechanism. If the synapse is actually chosen for update, the synaptic weight will be modified by

wij → wij' = wij − δ,    (4)

where δ is a positive constant which determines the amount of the synaptic depression. To evaluate the stochastic update condition Eq. 3, the two auxiliary variables pc and p^r_dij have to be determined. This is done in the following way:

1. Calculate the approximated synapse counter

   dij = ci + cj    (5)

2. Map the value of the approximated synapse counter dij to p^r_dij by

   k = 2Θ + 3 − dij    (6)

   P^r_k ∝ k^(−τ), τ ∈ R+    (7)

   for k ∈ {1, …, 2Θ + 3}. We call P^r_k the rank-ordering probability distribution.

3. The random variable pc is drawn from the continuous coin distribution

   P^c(x) ∝ x^(−α), α ∈ R+    (8)

   for x ∈ (0, 1].

We had three reasons to choose a power law in Eq. 8 for the coin distribution instead of a uniform distribution, which would be the simplest choice. First, we see no evidence that a random number generator occurring in a neural system should favor a uniform distribution. Second, it is highly probable that two different random number generators of the same biological system are not identical; instead, they could have different parameters, in our case different exponents. In this paper we content ourselves with investigating the case of identical random number generators, but our framework can be directly applied to the described scenario. Third, by choosing α = 0 the coin distribution in Eq. 8 becomes the uniform distribution. This allows us to investigate the influence of the distance of the coin distribution from a uniform distribution on the learning behavior of a neural network by studying different values of α. We remark that in this case the update probability Eq. 3 simplifies to

PΔw(wij) = p^r_dij    (9)
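The update machinery of Eqs. (1)-(9) can be sketched in a few lines of Python. The parameter values (Θ, τ, α), the inverse-CDF sampler for the coin distribution, and all function names are our own illustrative choices; the paper prescribes only the formulas.

```python
import random

# Sketch of the stochastic update rule, Eqs. (1)-(9).  THETA, TAU, ALPHA are
# illustrative parameter choices, not values prescribed by the paper.

THETA, TAU, ALPHA = 7, 1.0, 0.75

def update_counter(c, r, theta=THETA):
    """Eq. (1): the counter moves opposite to the reinforcement signal r
    (r = +1 right, r = -1 wrong), clipped to {0, ..., theta}."""
    return min(theta, max(0, c - r))

# Eqs. (6)-(7): rank-ordering probabilities P^r_k ~ k^-tau for k = 1..2*theta+3,
# normalized once (the normalization assumes the module-level THETA and TAU).
_Z = sum(k ** -TAU for k in range(1, 2 * THETA + 4))

def p_rank(d_ij):
    k = 2 * THETA + 3 - d_ij        # eq. (6): d_ij = c_i + c_j maps to rank k
    return k ** -TAU / _Z

def draw_coin(rng=random):
    """Eq. (8): p_c ~ x^-alpha on (0, 1], alpha < 1, via inverse-CDF sampling."""
    return rng.random() ** (1.0 / (1.0 - ALPHA))

def update_synapse(w, c_i, c_j, r, delta=0.05, rng=random):
    """Eqs. (3)-(4): on failure (r = -1), depress w with probability
    P(p_c < p^r_{d_ij})."""
    if r == -1 and draw_coin(rng=rng) < p_rank(c_i + c_j):
        return w - delta
    return w

# Higher synapse counters (worse recent performance) mean a higher update chance.
print(p_rank(2 * THETA) > p_rank(0))   # True
```

Note that only dij, r, and the two random draws enter the decision, so the rule is local apart from the global reinforcement signal, as the text requires.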
Before we present our results in the next section, we want to visualize the stochastic update probability PΔw . Figure 2 shows the update probability PΔw as function of prdij . The different curves correspond to different values of the exponent α of the coin distribution. One can see, that the update probability follows the values of prdij . This holds for each curve in Fig. 2. That means, the higher the values of prdij are the higher is the update probability. This is the behavior one would intuitively expect, because high values of prdij correspond to
F. Emmert-Streib
Fig. 2. Update probability PΔw as function of prdij . The different curves correspond to different values of the exponent α of the coin distribution. The values are: α1 = 0.0, α2 = 0.5, α3 = 0.75, α4 = 1.0 and α5 = 1.75.
high values of the approximated synapse counters dij, indicating high values of the neuron counters, which correspond to a bad network performance. Moreover, one can see in Fig. 2 that the larger α, the higher the update probability for fixed prdij. In the limit α → ∞ the update probability equals one for all values of prdij. Hence, higher values of the exponent α of the coin distribution result in a higher update probability. That means, by α one can control the sensitivity with which the update probability depends on prdij. Another parameter our stochastic update rule depends on is the exponent τ of the rank ordering distribution. We display in Fig. 3 PΔw as a function of τ and prdij to visualize its influence on the update probability. The values of the update probability are color-coded, where blue corresponds to 0 and red to 1. For the left panel of Fig. 3 we used α = 0.7 and for the right α = 1.5 as exponent of the coin distribution. If prdij = 0, no update takes place. For increasing values of the approximated synapse counter and fixed values of τ one obtains increasing values of the update probability. Moreover, higher values of α lead to higher values of PΔw; this can be seen by comparing the left and right panels of Fig. 3. Increasing values of τ result in decreasing values of PΔw for fixed dij. To summarize, the stochastic update condition we introduced for a synaptic update depends on six parameters PΔw (wij ) = PΔw (r, ci , cj , Θ, α, τ ).
(10)
From the visualizations in Figs. 2 and 3 we saw that increasing values of dij and α, as well as decreasing values of τ, lead to an increase in the update probability.
4 Numerical Results

As an example, we demonstrate that our stochastic learning rule is capable of training multilayer neural networks by learning the XOR-function, given in Table 1, in a
A Novel Stochastic Learning Rule for Neural Networks

Fig. 3. Update probability PΔw as function of τ and dij obtained for Θ = 4. PΔw is color-coded, blue indicates low and red high values. The exponent α of the coin distribution was α = 0.7 in the left and α = 1.5 in the right figure.

Table 1. XOR-mapping from three input (I) to two output (O) neurons. xI3 serves as bias to prevent zero activity.

xI1 | xI2 | xI3 | xO1 | xO2
 0  |  0  |  1  |  1  |  0
 0  |  1  |  1  |  0  |  1
 1  |  0  |  1  |  0  |  1
 1  |  1  |  1  |  1  |  0
Fig. 4. Mean learning time in dependence on the number of neurons in the hidden layer. τ = (2.2, 2.3, 1.9) and α = (0.8, 1.7, 3.0) for Θ = (1, 3, 5). The network dynamics was a winner-take-all mechanism (left figure) and a noisy winner-take-all mechanism (right figure). The ensemble size for all simulations was 10000.
three-layer network. In the left panel of Fig. 4 we show the mean learning time as a function of the number of neurons H in the hidden layer. The curves are indexed by different values of the neuron counter Θ. In the right figure we demonstrate the robustness of these results in the presence of noise η by using a noisy winner-take-all mechanism as network dynamics, which adds noise η to the inner fields of the neurons before the neuron with the highest inner field is selected. The noise was uniformly drawn from [0, η0]. From both figures one can see that the mean learning time decreases with an increasing number of neurons in the hidden layer, as expected, whereby the increase from 3 to 4 neurons has the biggest effect. This is due to the fact that destructive path interference, which means that already correctly learned paths in the network are destroyed by a new synaptic modification, is strongly reduced by increasing the number of possible paths as a result of additional neurons in the hidden layer. Increasing the number of neurons beyond 19 has only marginal influence, because an additional increase of redundant paths has no effect. Even in the presence of noise our learning rule is capable of learning the XOR-function. One can nicely see how an increasing number of neurons in the hidden layer can efficiently reduce the amount of noise in the system.
5 Conclusions

In this article we presented a novel stochastic Hebb-like learning rule for neural networks and demonstrated its working capability exemplarily by learning the XOR-function in a multilayer neural network, even in the presence of noise. Our goal was a biologically inspired learning rule based on experimental findings about heterosynaptic plasticity, rather than to learn functions in a multilayer network as fast as possible. Despite this, the results we obtained from numerical simulations are comparable to [1, 15] and indicate that the learning rule itself is useful for practical applications in training neural networks. In further investigations [8] we will demonstrate that our learning rule is not restricted to a feed-forward network topology but also works in recurrent networks. We hope that our work stimulates further investigations in this exciting field and demonstrates how fruitful it is to create mathematical models based on experimental observations of biological phenomena.
Acknowledgments We would like to thank Tom Bielefeld, Rolf D. Henkel, Jens Otterpohl, Klaus Pawelzik, Roland Rothenstein, Peter Ryder, Heinz Georg Schuster and Helmut Schwegler for fruitful discussions.
References
1. Bak, P. and Chialvo, D.R.: Adaptive Learning by Extremal Dynamics and Negative Feedback. Phys. Rev. E 63 (2001) 031912-1–031912-12
2. Bi, G-q. and Poo, M-m.: Synaptic Modification by Correlated Activity: Hebb's Postulate Revisited. Annual Review of Neuroscience 24 (2001) 139–166
3. Bliss, T.V.P. and Lomo, T.: Long-lasting Potentiation of Synaptic Transmission in the Dentate Area of the Anaesthetized Rabbit Following Stimulation of the Perforant Path. J. Physiol. 232 (1973) 331–356
4. Bosman, R.J.C., van Leeuwen, W.A. and Wemmenhove, B.: Combining Hebbian and Reinforcement Learning in a Minibrain Model. Neural Networks 17:1 (2004) 29–39
5. Chialvo, D.R. and Bak, P.: Learning from Mistakes. Neuroscience 90 (1999) 1137–1148
6. Crick, F.: The Recent Excitement about Neural Networks. Nature 337 (1989) 129–132
7. Emmert-Streib, F.: Aktive Computation in Offenen Systemen. Lerndynamiken in biologischen Systemen: Vom Netzwerk zum Organismus. Dissertation, Universität Bremen (2003), Mensch & Buch Verlag
8. Emmert-Streib, F.: Active Learning in Recurrent Neural Networks Facilitated by a Hebb-like Learning Rule with Memory. Neural Information Processing – Letters and Reviews 9(2) (2005) 31–40
9. Fitzsimonds, R.M., Song, H-j. and Poo, M-m.: Propagation of Synaptic Modulation in Small Neural Networks. Nature 388 (1997) 439–448
10. Frey, U. and Morris, R.G.M.: Synaptic Tagging and Long-term Potentiation. Nature 385 (1997) 533–536
11. Hebb, D.O.: The Organization of Behavior. Wiley, New York (1949)
12. Hertz, J., Krogh, A. and Palmer, R.G.: Introduction to the Theory of Neural Computation. Addison-Wesley (1991)
13. Holland, J.: Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor (1975)
14. Kempter, R., Gerstner, W. and van Hemmen, J.L.: Hebbian Learning and Spiking Neurons. Phys. Rev. E 59 (1999) 4498–4514
15. Klemm, K., Bornholdt, S. and Schuster, H.G.: Beyond Hebb: Exclusive-Or and Biological Learning. Phys. Rev. Lett. 84 (2000) 3013–3016
16. Koch, C.: Biophysics of Computation. Oxford Press (1999)
17. Markram, H., Lübke, J., Frotscher, M. and Sakmann, B.: Regulation of Synaptic Efficacy by Coincidence of Postsynaptic APs and EPSPs. Science 275 (1997) 213–215
18. Otmakhova, N.A. and Lisman, J.E.: D1/D5 Dopamine Receptors Inhibit Depotentiation of CA1 Synapses via a cAMP-dependent Mechanism. J. Neuroscience 18 (1998) 1270–1279
19. Rumelhart, D.E., Hinton, G.E. and Williams, R.J.: Learning Representations by Back-propagating Errors. Nature 323 (1986) 533–536
20. Kirkpatrick, S., Gelatt, C.D. and Vecchi, M.P.: Optimization by Simulated Annealing. Science 220 (1983) 671–680
21. Werbos, P.: Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. Thesis, Harvard University (1974)
Learning with Single Quadratic Integrate-and-Fire Neuron

Deepak Mishra1, Abhishek Yadav2, and Prem K. Kalra1
1 Department of Electrical Engineering, IIT Kanpur, India
2 Department of Electrical Engineering, G.B. Pant University of Agri. & Technology, Pant Nagar, India
[email protected], [email protected], [email protected]
Abstract. In this paper, a learning algorithm for a single Quadratic Integrate-and-Fire Neuron (QIFN) is proposed and tested on various applications in which a multilayer perceptron neural network is conventionally used. It is found that a single QIFN is sufficient for applications that require a number of neurons in different layers of a conventional neural network. Several benchmark and real-life problems of classification and function approximation are illustrated.
1 Introduction

Various researchers have proposed many neuron models for artificial neural networks. Although all of these models were primarily inspired by the biological neuron, there is still a gap between the philosophies used in neuron models for neuroscience studies and neuron models used for artificial neural networks. Some of these models exhibit a close correspondence with their biological counterparts while others do not. Freeman [6] has pointed out that while brains and neural networks share certain structural features such as massive parallelism, biological networks solve complex problems easily and creatively, whereas existing neural networks do not. He discussed the issues related to the similarities and dissimilarities between present-day biological and artificial neural systems. The first artificial neuron model was proposed by McCulloch and Pitts [12] in 1943. They developed this neuron model based on the rule that the output of the neuron is 1 if the weighted sum of its inputs is greater than a threshold value and 0 otherwise. In 1949, Hebb [7] proposed a learning rule that became seminal for artificial neural networks. He postulated that the brain learns by changing its connectivity patterns. Widrow and Hoff [2] in 1960 presented the most analyzed and most applied learning rule, called the least mean square learning rule. Later, in 1985, Widrow and Stearns [8] found that this rule converges in the mean square to the solution that corresponds to the least mean square output error if all the input patterns are of the same length. Scholles [3] discussed a few biologically inspired artificial neurons. Feng and Li [16] introduced some neuron models with currents as the inputs. Training of the integrate-and-fire model with the Informax principle was discussed in [14] and [15]. Learning with a single integrate-and-fire neuron has been studied in [5]. In the same vein, learning with a more biologically realistic artificial neural system, i.e., the quadratic integrate-and-fire model, is proposed and discussed in this paper.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 424–429, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Quadratic Integrate-and-Fire Neuron (QIFN) Models

The quadratic integrate-and-fire neuron is also known as the theta-neuron [18], or the Ermentrout–Kopell canonical model [19] (when it is written in a trigonometric form). We represent it here in terms of the following differential equation [10][20]:

dv/dt = I + a(v − V1)(v − V2); if v = V3, then v ← V4
(1)
where V1 and V2 are the resting and threshold values of the membrane potential, and V3 and V4 are the peak and reset values. This model is canonical in the sense that any Class 1 excitable system described by smooth ODEs can be transformed into this form by a continuous change of variables. Unlike its linear analogue, the quadratic integrate-and-fire neuron has spike latencies, an activity-dependent threshold, and bistability of resting and tonic spiking modes. The solution of the first-order differential equation describing the dynamics of this model in the sub-threshold region (Eq. (1)) can be found analytically. With v(0) = V1, the solution is given by

$$v(t) = \frac{V_1 + V_2}{2} + \frac{A}{2a} \tan\left( \frac{A t}{2} + \tan^{-1} \frac{a(V_1 - V_2)}{A} \right) \qquad (2)$$

$$A = \sqrt{4aI + 2a^2 V_1 V_2 - a^2 V_2^2 - a^2 V_1^2} \qquad (3)$$

Let us assume that t1 is the refractory period and that v(t) hits V2 at t = t2. The interspike interval, t0 = t1 + t2, can be expressed as

$$t_0 = t_1 - \frac{4}{\sqrt{4aI - w^2}} \tan^{-1} \frac{w}{\sqrt{4aI - w^2}} \qquad (4)$$

where

$$w = a(V_1 - V_2) \qquad (5)$$

This equation can be written as

$$t_0 = t_1 - 4x \tan^{-1}(wx) \qquad (6)$$

$$\text{where } x = \frac{1}{\sqrt{4aI - w^2}} \qquad (7)$$

It can also be written as

$$y = \frac{2}{\pi} \tan^{-1}(X) \qquad (8)$$

$$\text{where } y = \frac{t_1 - t_0}{2\pi x}, \quad X = wx \qquad (9)$$
Here the relationship between t0 and I is expressed in terms of the relationship between X and y in order to make it bounded. y is always between -1 and 1 for all real values of X. Therefore, besides its biological meaningfulness, this function is mathematically appropriate to serve as an activation function of a neuron.
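As a sanity check, the closed-form crossing time implied by Eqs. (4)–(7) (with t1 = 0, i.e., the time for v to travel from V1 to V2) can be compared against direct numerical integration of Eq. (1). The sketch below is our own illustration; the parameter values are arbitrary choices satisfying 4aI > w², the tonic firing regime.

```python
import math

# Verify the analytic sub-threshold crossing time of the quadratic
# integrate-and-fire neuron against numerical integration of
# dv/dt = I + a*(v - V1)*(v - V2). Parameter values are illustrative.

a, V1, V2, I = 1.0, 0.0, 1.0, 1.0

# Analytic time for v to travel from V1 to V2 (Eqs. (4)-(7) with t1 = 0):
w = a * (V1 - V2)
x = 1.0 / math.sqrt(4 * a * I - w * w)   # requires 4aI > w^2
t_analytic = -4 * x * math.atan(w * x)

# Numerical integration (4th-order Runge-Kutta) until v crosses V2:
def f(v):
    return I + a * (v - V1) * (v - V2)

v, t, dt = V1, 0.0, 1e-5
while v < V2:
    k1 = f(v)
    k2 = f(v + 0.5 * dt * k1)
    k3 = f(v + 0.5 * dt * k2)
    k4 = f(v + dt * k3)
    v += dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    t += dt

print(abs(t - t_analytic))  # agreement up to the integration step size
```

For a = 1, V1 = 0, V2 = 1, I = 1 both routes give the crossing time 2π/(3√3) ≈ 1.209, confirming the reconstruction of Eq. (4).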
3 The Proposed Model

Along with a biologically motivated and mathematically appropriate activation function, a suitable aggregation function is also required. In view of evidence in support of the presence of multiplicative-like operations in the nervous system, multiplication of the net inputs to the activation function is considered [1][9]. In biological neural systems, this operation depends on the timings of various spikes. Aggregation of the exponential waveforms with different time-delays has been approximated by the weight and bias parameters associated with each input. In the proposed model, these parameters are not measures of synaptic strength and resting potential as in conventional neuron models. They are measures of time delay and amplitude decay that represent the temporal summation as described in [11] for spiking neuron models. Equation (9), which is inspired by the relationship between injected current and interspike interval for the quadratic integrate-and-fire neuron, can be modified to incorporate temporal summation in the case of a multi-input neuron:

$$y = \frac{2}{\pi} \tan^{-1}\left( \prod_{i=1}^{n} (w_i x_i + b_i) \right) \qquad (10)$$

where n is the number of inputs. Equation (10) is used as the aggregation function to calculate the output y for input x; wi and bi are the connection weights and biases, respectively. A simple steepest descent method is applied to minimize the following error function:

$$e = \frac{1}{2}(y - t)^2 \qquad (11)$$

where t is the target and y is the actual output of the neuron. e is a function of the parameters bi and wi, i = 1, 2, . . . , n. Therefore, the parameter update rule (weight update rule) is found by calculating the partial derivatives of e with respect to these parameters.
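The model and its steepest descent training can be sketched as follows. This is a minimal illustration, not the authors' implementation: the learning rate, epoch count, initialization, and the use of a single XOR output are our own assumptions, and the aggregation is read as the product form of Eq. (10).

```python
import math, random

# Sketch of a single QIFN-style unit: product aggregation with arctan
# activation (Eq. (10)), trained by steepest descent on the squared
# error (Eq. (11)). Hyperparameters are illustrative assumptions.

def forward(w, b, x):
    p = 1.0
    for wi, bi, xi in zip(w, b, x):
        p *= wi * xi + bi                  # product aggregation
    return (2.0 / math.pi) * math.atan(p), p

def mse(w, b, data):
    return sum((forward(w, b, x)[0] - t) ** 2 for x, t in data) / len(data)

def train(w, b, data, lr=0.05, epochs=3000):
    n = len(w)
    for _ in range(epochs):
        for x, t in data:
            y, p = forward(w, b, x)
            g = (y - t) * (2.0 / math.pi) / (1.0 + p * p)  # de/dp
            for j in range(n):
                rest = 1.0                 # product of all factors except j
                for i in range(n):
                    if i != j:
                        rest *= w[i] * x[i] + b[i]
                w[j] -= lr * g * x[j] * rest
                b[j] -= lr * g * rest
    return w, b

# XOR data in the spirit of Table 1: a constant third input acts as the
# bias input; here a single output carries the XOR value.
data = [((0, 0, 1), 0.0), ((0, 1, 1), 1.0), ((1, 0, 1), 1.0), ((1, 1, 1), 0.0)]
rng = random.Random(3)
w = [rng.uniform(-0.5, 0.5) for _ in range(3)]
b = [rng.uniform(-0.5, 0.5) for _ in range(3)]
before = mse(w, b, data)
train(w, b, data)
print(before, mse(w, b, data))
```

Note that de/dwj = g · xj · (product of the remaining factors), which is the chain rule applied to Eq. (10); the descent step visibly reduces the mean squared error from its random-initialization value.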
4 Illustrative Examples

4.1 Classification Problem: XOR Data

The XOR problem, as compared with other logic operations (NAND, NOR, AND and OR), is probably one of the best-known and most used nonlinearly separable pattern association tasks and consequently provides one of the most common examples of artificial neural systems for input remapping. We compared the performance of the quadratic integrate-and-fire neuron (QIFN) with that of a multilayer perceptron (MLP). For this purpose, we considered an MLP with 3 hidden units. Figure 1(a) shows the MSE vs. number of epochs curve for the proposed QIFN model while training it on the XOR problem. The proposed model takes only 3 iterations while the MLP takes 523 iterations to achieve an MSE of the order of 0.0001. Figure 1(b) exhibits the comparison between MLP and QIFN in terms of the deviation of the actual outputs from the corresponding targets. It can be seen here that the performance of the QIFN is almost the same as that of the MLP. This means that a single QIFN is capable of learning the XOR relationship significantly faster than an MLP
Fig. 1. (a) Mean Square Error vs. Epochs for training for XOR problem, (b) Target vs. Actual Output with MLP and QIFN, for XOR-problem
Table 1. Comparison of testing and training performance for the XOR-problem

S. No. | Parameter                      | MLP    | QIFN
1      | Training goal, in terms of MSE | 0.0001 | 0.0001
2      | Iterations needed              | 523    | 8
3      | Training time in seconds       | 0.54   | 0.10
4      | Testing time in seconds        | 0.01   | 0.01
5      | MSE for testing data           | 0.0001 | 0.0001
6      | Correlation coefficient        | 0.9999 | 1.0000
7      | Percentage misclassification   | 0%     | 0%
8      | Number of parameters           | 11     | 4
with 3 hidden units. Table 1 shows the comparison of training and testing performance of the MLP and the QIFN on the XOR-problem.

4.2 Function Approximation Problem: Electroencephalogram Data

The electroencephalogram (EEG) data used in this work was taken from [21]. In this problem, four measurements y(t − 1), y(t − 2), y(t − 4) and y(t − 8) were used to predict y(t). We considered an MLP with 6 hidden neurons for comparison purposes. Figure 2 shows the comparison between MLP and QIFN in terms of the deviation of the actual outputs from the corresponding targets. Data up to 100 sampling instants was used for training. It can be seen here that the performance of the QIFN on training as well as testing data is much better than that of the MLP. This means that a single QIFN is capable of learning this relationship faster than an MLP with 6 neurons, and its performance on seen
Fig. 2. Target and Actual Output with MLP and QIFN, for EEG Data
Table 2. Comparison of testing and training performance for EEG Data

S. No. | Parameter                      | MLP    | QIFN
1      | Training goal, in terms of MSE | 0.015  | 0.015
2      | Iterations needed              | 5000   | 206
3      | Training time in seconds       | 0.86   | 0.62
4      | Testing time in seconds        | 0.08   | 0.08
5      | MSE for testing data           | 0.0703 | 0.0124
6      | Correlation coefficient        | 0.9942 | 0.9883
7      | Number of parameters           | 32     | 8
as well as unseen data is significantly better. Table 2 shows the comparison of training and testing performance of the MLP and the QIFN when applied to the EEG data.
5 Conclusion

The training and testing results on different benchmark and real-life problems show that the proposed artificial neural system with a single neuron, inspired by the quadratic integrate-and-fire neuron model, is capable of performing classification and function approximation tasks as efficiently as a multilayer perceptron with many neurons, and in some cases its learning is even better. It is also observed that training and testing times in the case of the QIFN are significantly lower than those of the MLP. The future scope of this work includes the incorporation of these neurons into a network and an analytical investigation of their learning capabilities (e.g., as a universal function approximator).
References
1. McKenna, T., Davis, J., Zornetzer, S.F. (eds.): Single Neuron Computation. Academic Press, San Diego, CA, USA (1992)
2. Anderson, J.A., Rosenfeld, E. (eds.): Neurocomputing: Foundations of Research. MIT Press, Cambridge (1960)
3. Scholles, M., Hosticka, B.J., Kesper, M., Richert, P., Schwarz, M.: Biologically Inspired Artificial Neurons: Modeling and Applications. Proceedings of International Joint Conference on Neural Networks, Vol. 3 (1993) 2300–2303
4. Iannella, N., Back, A.: A Spiking Neural Network Architecture for Nonlinear Function Approximation. Proceedings of the 1999 IEEE Signal Processing Society Workshop (1999) 139–146
5. Yadav, A., Mishra, D., Ray, S., Yadav, R.N., Kalra, P.K.: Learning with Single Integrate-and-Fire Neuron. Proceedings of International Joint Conference on Neural Networks (2005)
6. Freeman, W.J.: Why Neural Networks Don't Yet Fly: Inquiry into the Neurodynamics of Biological Intelligence. Proceedings of IEEE International Conference on Neural Networks, Vol. 2 (1988) 1–7
7. Hebb, D.: Organization of Behavior. John Wiley and Sons, New York (1949)
8. Widrow, B., Stearns, S.: Adaptive Signal Processing. Prentice-Hall, Upper Saddle River, NJ, USA (1985)
9. Koch, C.: Biophysics of Computation: Information Processing in Single Neurons. Oxford University Press, New York (1999)
10. Hoppensteadt, F.C., Izhikevich, E.M.: Weakly Connected Neural Networks. Springer-Verlag, New York (1997)
11. Gerstner, W., Kistler, W.M.: Spiking Neuron Models: An Introduction. Cambridge University Press, New York (2002)
12. McCulloch, W.S., Pitts, W.S.: A Logical Calculus of the Ideas Immanent in Nervous Activity. Mathematical Biophysics 5 (1943) 18–27
13. Sinha, M., Chaturvedi, D.K., Kalra, P.K.: Development of Flexible Neural Network. Journal of The Institution of Engineers (India) 83 (2002)
14. Feng, J., Buxton, H., Deng, Y.C.: Training the Integrate-and-Fire Model with the Informax Principle I. Journal of Physics A 35 (2002) 2379–2394
15. Feng, J., Sun, Y., Buxton, H., Wei, G.: Training Integrate-and-Fire Neurons with the Informax Principle II. IEEE Trans. on Neural Networks 14(2) (2003) 326–336
16. Feng, J., Li, G.: Neuronal Models with Current Inputs. Journal of Physics A 34 (2001) 1649–1664
17. Liu, S.C., Douglas, R.: Temporal Coding in a Silicon Network of Integrate-and-Fire Neurons. IEEE Trans. on Neural Networks 15(5) (2004) 1305–1314
18. Ermentrout, G.B.: Type I Membranes, Phase Resetting Curves, and Synchrony. Neural Computation 8(5) (1996) 979–1001
19. Ermentrout, G.B., Kopell, N.: Parabolic Bursting in an Excitable System Coupled with a Slow Oscillation. SIAM Journal of Applied Mathematics 46(2) (1986) 233–253
20. Latham, P.E., Richmond, B.J., Nelson, P.G., Nirenberg, S.: Intrinsic Dynamics in Neuronal Networks I. The Journal of Neurophysiology 83(2) (2000) 808–827
21. www.cs.colostate.edu/eeg/
Manifold Learning of Vector Fields

Hongyu Li and I-Fan Shen
Department of Computer Science and Engineering, Fudan University, Shanghai, China
{hongyuli, yfshen}@fudan.edu.cn

Abstract. In this paper, vector field learning is proposed as a new application of manifold learning to vector fields. We also provide a learning framework to extract significant features from vector data. Vector data, containing position, direction and magnitude information, is different from common point data, which contains only position information. The algorithm of locally linear embedding (LLE) is extended to deal with vector data. The learning ability of the extended version has been tested on synthetic data sets, and experimental results demonstrate that the method is very helpful and promising. Manifold features of vector data obtained by learning methods can be used for subsequent tasks such as classification, clustering, visualization, or segmentation of vectors.
1 Introduction

Whether in optical flow analysis in computer vision, surface reconstruction in computer graphics, or flow simplification in computational fluid dynamics (CFD), vector data is a primary research object. Vector data, unlike common point data, includes direction and magnitude information in addition to position information. Precisely because much more information can be obtained from it, vector data is more worth studying and analyzing than common point data. In recent years, many efficient and feasible methods have been proposed to deal with vector data. Telea and Wijk ([1]) developed a bottom-up approach for clustering vector data. Heckel et al. ([2]) considered a top-down approach to segment vector fields using principal component analysis. Garcke et al. ([3]) proposed a continuous clustering method for vector fields based on diffusion. A modified normalized cut algorithm was extended successfully for hierarchical vector field segmentation by Chen et al. in [4]. Li et al. ([5]) proposed a novel approach to partition 2-D discrete vector fields based on discrete Hodge decomposition. The basic principle of these methods essentially is to learn or mine significant information from a set of vector data, which we call vector field learning (VFL). The problem of vector field learning has become a significant concern for the exploratory study of vector data ([1, 4, 6, 7]). In this paper, we explain in detail the concept of vector field learning and propose a basic learning framework. In particular, we extend the classic nonlinear manifold learning method, locally linear embedding (LLE, [8]), to a modified version for vector data. The remainder of the paper is divided into the following parts. Section 2 introduces the concept of vector field learning and designs the extended version of LLE to deal with vector data. In Section 3, manifold features of vector data are fully analyzed on a synthetic data set, and some experimental results are presented in Section 4. Finally, Section 5 ends with some conclusions.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 430–435, 2006. © Springer-Verlag Berlin Heidelberg 2006
Fig. 1. Three sets of discrete vectors respectively sampled from different circles. The direction of each vector represents the tangent direction of a certain point on circumference.
with synthetic data, and some experimental results are presented in Section 4. Finally, Section 5 ends with some conclusions.
2 Manifold Learning of Vector Fields

In differential geometry, a vector field is defined as a map f : R^n → R^n that assigns to each x in R^n a vector f(x). For application purposes, discrete vector fields are more commonly considered. For instance, the 2-D discrete vector field shown in Fig. 1 corresponds to the map f(x, y) = {−y, x}. Discrete points in the figure are sampled from the boundaries of three circles with a common center, and the vector at each point represents the tangent vector at the current position. A discrete vector field is essentially composed of a set of vectors. Vector data differs from common point data in its representational ability: given a set of common points, the geometrical meaning represented by them is usually indefinite; if a vector is additionally provided at each point, the implicit meaning of the set usually becomes totally definite. Just as in Fig. 1, if only the 2-D Euclidean coordinates of the points are considered, it is difficult to demonstrate that those points are obtained from the boundaries of three circles. But when vector data is given at each point, the meaning represented by the set of data becomes clear and definite. Vector fields can thus provide us more useful information to study. Vector field learning aims to extract the significant information contained implicitly in vector data. In this paper, we use the classical manifold learning method, locally linear embedding (LLE), as the method of feature extraction from vector fields. Moreover, we need a similarity measure ([7]) to judge the affinity among vectors and a vector field space in which to extend the LLE algorithm.

Fig. 2. Two features extracted from the vector field in Fig. 1. (a): the first feature; (b): the second feature.

2.1 Extended Locally Linear Embedding

Traditional linear learning methods such as PCA and MDS ([9, 10]) usually work poorly for data with a nonlinear distribution. In recent years, locally linear embedding (LLE) gradually became popular because of its simplicity and better performance on nonlinear manifolds. The original LLE method, however, is unable to deal with vector data. We need to modify it in the first step. In the modified version, determining the neighborhood of any vector is no longer based on Euclidean distance, but on the similarity measure proposed in [7]. Given n points xi with vectors vi, the similarity between xi and xj can be described as follows,

$$s(x_i, x_j) = \alpha e^{-d(x_i, x_j)} + \beta e^{1 - a(v_i, v_j)} + \gamma e^{-m(v_i, v_j)}$$
(1)
where d(xi, xj) represents the Euclidean distance between the two points, a(vi, vj) the angular distance between the two vectors, m(vi, vj) the difference between the magnitudes of the vectors, and α + β + γ = 1. For the exact definition of the parameters, please refer to the paper [7]. Those vectors with the largest mutual similarity are considered nearest neighbors. Besides, we replace Euclidean space with the vector field space to extract features of vector data. In general, for a vector field constructed on a space R^n, its corresponding vector field space should be of (2n + 1) dimensions, which include the n-dimensional Euclidean coordinates, the n-dimensional cosines of the angles between the vectors and the coordinate axes, and the 1-dimensional magnitude of the vectors. The modified LLE algorithm is briefly introduced as follows:

1. Given N points xi with vectors vi, construct a vector field space τ;
2. Find the k most similar neighbors for each vector datum in terms of the similarity measure (1);
3. Calculate the weights wij on the vector field space τ by minimizing the reconstruction errors,

$$\varepsilon_1(W) = \sum_i \Big| \tau_i - \sum_j w_{ij} \tau_{ij} \Big|^2,$$

where τij is the j-th most similar neighbor. The minimization is performed subject to the constraint $\sum_j w_{ij} = 1$ for all i.
4. Find the coordinates ξi in a low-dimensional feature space by minimizing another error function,

$$\varepsilon_2(\Xi) = \sum_i \Big| \xi_i - \sum_j w_{ij} \xi_{ij} \Big|^2.$$
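The four steps above can be sketched end-to-end. This is a minimal illustration in Python with NumPy, not the authors' implementation: the weighting coefficients α, β, γ, the neighborhood size k, and the regularization are our own assumptions, and the distance, angle, and magnitude terms follow our reading of the similarity measure (1), whose exact parameter definitions are given in [7].

```python
import numpy as np

# Sketch of the extended LLE of Sect. 2.1. Positions P (N x n) and vectors
# V (N x n); the vector field space tau is the (2n+1)-dimensional space
# [position, direction cosines, magnitude].

def similarity(P, V, alpha=0.4, beta=0.4, gamma=0.2):
    # Similarity measure (1): distance, angular, and magnitude terms.
    d = np.linalg.norm(P[:, None] - P[None, :], axis=2)
    mag = np.linalg.norm(V, axis=1)
    U = V / mag[:, None]
    a = 1.0 - U @ U.T                         # angular distance (1 - cosine)
    m = np.abs(mag[:, None] - mag[None, :])
    return alpha * np.exp(-d) + beta * np.exp(1 - a) + gamma * np.exp(-m)

def vector_field_space(P, V):
    mag = np.linalg.norm(V, axis=1)
    return np.hstack([P, V / mag[:, None], mag[:, None]])

def extended_lle(P, V, k=6, d_out=2, reg=1e-3):
    N = len(P)
    tau = vector_field_space(P, V)
    S = similarity(P, V)
    W = np.zeros((N, N))
    for i in range(N):
        idx = np.argsort(-S[i])               # most similar first
        idx = idx[idx != i][:k]               # k neighbors, excluding i
        Z = tau[idx] - tau[i]                 # local differences
        G = Z @ Z.T
        G += reg * np.trace(G) * np.eye(k)    # regularize the Gram matrix
        w = np.linalg.solve(G, np.ones(k))
        W[i, idx] = w / w.sum()               # enforce sum_j w_ij = 1
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:d_out + 1]               # skip the constant eigenvector

# Toy field in the spirit of Fig. 1: tangent vectors on circles of
# different radii.
t = np.linspace(0, 2 * np.pi, 20, endpoint=False)
P = np.vstack([np.c_[r * np.cos(t), r * np.sin(t)] for r in (0.5, 1.0)])
V = np.vstack([np.c_[-r * np.sin(t), r * np.cos(t)] for r in (0.5, 1.0)])
emb = extended_lle(P, V)
print(emb.shape)  # (40, 2)
```

The embedding step is the standard LLE eigenproblem; only the neighbor selection (similarity instead of Euclidean distance) and the reconstruction space (τ instead of the raw coordinates) are changed, mirroring the modifications described above.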
3 Manifold Features of Vector Data

For a 2-D vector field, we usually extend the 2-D space to a 5-D vector field space where significant features of the vector data are easier to discover. Applying the extended LLE to the vector field shown in Fig. 1, we can mine some meaningful information and test the effectiveness and learning ability of our method. The learning results are shown in Fig. 2, where the left and right panels respectively represent the first and second features extracted by our method. As one can see from the figure, the first feature only contains three values, each of which corresponds to a set of vector data obtained from the same circle. In essence, the value of this feature is closely related to the radius of the circle: when the radius of a circle is changed, the feature value corresponding to that circle varies along with it. The second feature explains the periodicity of the data: it actually implies the variation of the central angle.
Fig. 3. Two dominant features shown simultaneously in one panel. Vector data appears in order in this feature space.
If both features are shown simultaneously in one panel, we get the result in Fig. 3, where the three data sets distribute linearly, in order, in this feature space. This result is very helpful for subsequent vector clustering or other applications.
4 Experimental Results

Another example is provided in Fig. 4. The vector data is composed of two sets which are randomly sampled on two circular loops, respectively. In the 2-D Euclidean space, since
Fig. 4. Manifold learning of vector field. (a): Original vector field. (b): Feature space spanned by the first two dominant features of vector data. (c): The first feature extracted by our method. (d): The second feature extracted by our method.
the data distributes on surfaces of nonlinear manifolds, just as in Fig. 4(a), it is difficult to find an appropriate measure to cluster the vectors. But if a feature space in which the data has a linear distribution can be discovered, the process of clustering becomes simple and doable. When the extended LLE algorithm is applied to this set of data, two significant features can be extracted, shown respectively in Figs. 4(c) and 4(d): the first feature still corresponds to a variable related to the radius of the circle, and the second one tells us the periodicity of the variation of the data. These features span a feature space shown in Fig. 4(b), where the vector data scatters in two linear regions. In such a space, we can easily find a decision boundary to separate the two sets of data. Although the vector data in Fig. 4 is affected by noise, which perturbs the direction and magnitude of the vectors, the learning performance of our method is clearly not degraded, which shows that the method is robust to a small amount of noise. What is more, from the figure one can also see that the method does not depend on the regularity of sampling. Even with irregular sampling, the method can still work well and extract significant features.
Manifold Learning of Vector Fields
435
5 Conclusions

In this paper, we propose the concept of vector field learning and provide a learning framework for extracting significant features from vector data. Vector data, unlike common point data, contains direction and magnitude information in addition to position information. We extend the locally linear embedding (LLE) algorithm to make it suitable for vector data. The learning ability of the extended method has been tested on synthetic data sets, and the experimental results demonstrate the merits of the approach. Manifold features of vector data extracted by learning methods can be used for classification, clustering, visualization, and segmentation of vector fields, which will be the topic of our future studies.
Acknowledgments This research work is supported by the National Natural Science Foundation of China under Grant No. 60473104.
Similarity Measure for Vector Field Learning Hongyu Li and I-Fan Shen Department of Computer Science and Engineering, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, China {hongyuli, yfshen}@fudan.edu.cn
Abstract. Vector data, which contains direction and magnitude information in addition to position information, is different from common point data that contains position information only. General similarity measures for point data, such as Euclidean distance, are not suitable for vector data, so a new measure must be proposed to estimate the similarity between vectors. The similarity measure defined in this paper combines Euclidean distance with angle and magnitude differences. Based on this measure, we construct a vector field space on which a modified locally linear embedding (LLE) algorithm is used for vector field learning. Our experimental results show that the proposed similarity measure works better than the traditional Euclidean distance.
1 Introduction

With the increase in the precision of computational simulation and Computational Fluid Dynamics (CFD), very large vector data sets have become readily available. As a result, vector field learning has become a significant concern in the exploratory study of visualizing vector data ([1]). The main task of vector field learning is to extract significant features contained implicitly in vector fields. Once such features are found with learning methods such as principal component analysis (PCA, [2]) and locally linear embedding (LLE, [3]), we can easily cluster or classify vector data ([4, 5, 6, 7]), and segment ([8]) or simplify ([9, 10]) vector fields on the feature space. Vector data, unlike common point data, includes direction and magnitude information in addition to position information, so methods for dealing with vector data should differ from those for common point data. The traditional way to compute the similarity between point data is based on Euclidean distance, which is inappropriate for vector data; it is therefore necessary to define a new similarity measure to study the affinity between vector data. The measure proposed in this paper combines the direction and magnitude information with Euclidean distance in a linear combination, which corresponds to a vector field space (VF space). On this space, significant features of vector data can be easily extracted using a modified LLE algorithm. The remainder of the paper is organized as follows. Section 2 introduces a similarity measure for vector data. In Section 3, a VF space is constructed for vector field learning. Experimental results are presented in Section 4. Finally, Section 5 ends with some conclusions.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 436–441, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Similarity Measure of Vector Data

For common point data (for example, Fig. 1(a)), neighborhood selection is generally based simply on Euclidean distance. This measure can be inappropriate for vector data (for example, Fig. 1(b)) because vector data carries other important information besides spatial position.
[Fig. 1 panels: (a) discrete point data; (b) discrete vector data]
Fig. 1. Discrete point and vector data. Vector data includes direction and magnitude information in addition to position information.
2.1 Euclidean Distance

Given $n$ points $\{p_i\}$ with spatial coordinates $\{x_i, y_i\}$, the similarity between $p_i$ and $p_j$ is traditionally defined in terms of Euclidean distance:

$$s(p_i, p_j) = e^{-d(p_i, p_j)/\sigma_1^2}, \qquad (1)$$

which is essentially a Gaussian kernel. Here $d(p_i, p_j)$ denotes the Euclidean distance between $p_i$ and $p_j$, and the scaling parameter $\sigma_1$ is a positive real constant. Under this measure, nearby points with smaller Euclidean distance are assigned higher similarity, so the neighborhood of each point is determined entirely by Euclidean distance. A problem arises, however, when data points are sampled from several distinct but closely spaced manifolds: points with small mutual distance may come from different manifolds, so the local neighborhood of a point can contain points that actually belong to different sets. Learning with such neighborhoods can give poor results. For example, the data points $(p_1, \ldots, p_7)$ in Fig. 2, which carry only spatial coordinates, are sampled from two 1-D manifolds. Although the spatial distance between $p_1$ and $p_7$ is smaller than that between $p_1$ and $p_2$, it is $p_1$ and $p_2$ that belong to the same class and should therefore have the higher similarity. Clearly, relying purely on Euclidean distance to determine the similarity measure is not beneficial for extracting manifold features.
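As a quick illustration, the point-data similarity of Eq. (1) can be sketched in a few lines of NumPy. This is our own illustrative code, not the authors'; the function name and the default value of σ₁ are assumptions.

```python
import numpy as np

def gaussian_similarity(p_i, p_j, sigma1=1.0):
    # Eq. (1): a Gaussian kernel of the Euclidean distance d(p_i, p_j).
    d = np.linalg.norm(np.asarray(p_i, float) - np.asarray(p_j, float))
    return np.exp(-d / sigma1 ** 2)
```

Identical points receive the maximum similarity of 1, and the similarity decays monotonically with distance, which is exactly why the neighborhood of each point is determined entirely by Euclidean distance under this measure.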
Fig. 2. Angular difference between vectors to compute the similarity
2.2 Angular Difference

If vector information is provided at each point, as in Fig. 2, the point most similar to $p_1$ is clearly $p_2$, not $p_7$, since the vectors at $p_1$ and $p_2$ have closer directions. The angle information should therefore also be included in the similarity measure. Given a set of points with vector information, the similarity between any two points $p_i$ and $p_j$ can be rewritten as a linear weighting,

$$s(p_i, p_j) = \alpha e^{-d(p_i, p_j)} + \beta e^{a(p_i, p_j) - 1}, \qquad (2)$$

where $\alpha$ and $\beta$ are positive real constants with $\alpha + \beta = 1$, and $a(p_i, p_j) = v_i \cdot v_j / (|v_i||v_j|)$ is the cosine of the angle between the vectors at $p_i$ and $p_j$, $v_i$ being the vector at point $p_i$ and $|\cdot|$ the norm of a vector. The smaller the angle between the vectors at two points, the larger the similarity between the two point data. In Fig. 2, the points $(p_1, \cdots, p_7)$ should be divided into two sets: $(p_1, \cdots, p_4)$ belong to set 1 and $(p_5, \cdots, p_7)$ to set 2. Under the measure (2), there is clearly higher intrinsic similarity within the points of either set, and lower similarity between points of different sets.

2.3 Magnitude Difference

Besides the Euclidean and angular distances, magnitude is an important factor influencing the similarity between data and should also be incorporated into the measure. Formula (2) is therefore extended as

$$s(p_i, p_j) = \alpha e^{-d(p_i, p_j)} + \beta e^{a(p_i, p_j) - 1} + \gamma e^{-m(p_i, p_j)}, \qquad (3)$$

where $\gamma$ is a positive real constant, $\alpha + \beta + \gamma = 1$, and $m(p_i, p_j) = ||v_i| - |v_j||$ is the difference between the magnitudes of the vectors at $p_i$ and $p_j$.
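A minimal NumPy sketch of the combined measure (our own illustration, not the authors' code). The weight values are arbitrary choices satisfying α + β + γ = 1, and the angular term is written so that aligned vectors score highest, as the discussion of Fig. 2 requires.

```python
import numpy as np

def vector_similarity(p_i, v_i, p_j, v_j, alpha=0.4, beta=0.3, gamma=0.3):
    # Weighted sum of Euclidean-distance, angular, and magnitude terms.
    # Each exponential is at most 1, so s peaks at 1 when p_i = p_j, v_i = v_j.
    p_i, v_i, p_j, v_j = (np.asarray(x, float) for x in (p_i, v_i, p_j, v_j))
    d = np.linalg.norm(p_i - p_j)                                # position gap
    a = v_i @ v_j / (np.linalg.norm(v_i) * np.linalg.norm(v_j))  # cosine of angle
    m = abs(np.linalg.norm(v_i) - np.linalg.norm(v_j))           # magnitude gap
    return alpha * np.exp(-d) + beta * np.exp(a - 1.0) + gamma * np.exp(-m)

# A point with an aligned vector is more similar than an equally distant
# point whose vector points the opposite way.
same = vector_similarity([0, 0], [1, 0], [0, 0], [1, 0])
aligned = vector_similarity([0, 0], [1, 0], [1, 0], [1, 0])
opposed = vector_similarity([0, 0], [1, 0], [1, 0], [-1, 0])
```

With these weights, `same` evaluates to 1 (up to floating-point rounding) and `aligned` exceeds `opposed`, matching the behavior described for points p₂ and p₇.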
2.4 Analysis

In this paper we use the similarity measure (3). The measure has the property that the similarity attains its maximum value of 1 when $i = j$. Furthermore, as $\alpha$ ($\beta$ or $\gamma$) increases gradually from 0 to 1, the role that Euclidean distance (angle or magnitude) plays in the measure grows, until the measure depends completely on that component when the parameter reaches 1. Note that the similarity measure (3) is specially designed for vector data: it is built on a vector field space, which differs from the traditional Euclidean space on which Euclidean distance is based.
3 Vector Field Learning

3.1 Construction of Vector Field Space

From $n$ points $p_i = \{x_i, y_i\}$ with vector information $v_i = \{v_i^x, v_i^y\}$, we can construct a vector field space $\tau$ where $\tau_i = (\alpha x_i, \alpha y_i, \beta \kappa_i^x, \beta \kappa_i^y, \gamma m_i)$. Here $m_i = |v_i|$ is the magnitude of the vector, $\kappa_i^x = v_i^x / m_i$ and $\kappa_i^y = v_i^y / m_i$ respectively denote the cosines of the angles between $v_i$ and the coordinate axes, and the weights $\alpha$, $\beta$, and $\gamma$ are the same as those in (3). Note that if the measure (3) is used in a learning method, the learning process must work on a vector field space $\tau$.

3.2 Nonlinear Learning

Traditional linear learning methods such as PCA and MDS work poorly for data with a nonlinear distribution. In recent years, locally linear embedding (LLE) has become popular because of its simplicity and better performance on nonlinear manifolds. The original LLE method, however, is not suitable for vector data, so we modify its first step. In the modified version, the criterion for determining the neighborhood of a data point is not Euclidean distance but the similarity measure (3): the points with the largest mutual similarity are taken as nearest neighbors. Besides, we replace the Euclidean space with the vector field space to extract features of vector data. The modified LLE algorithm for vector fields is discussed in detail in [11].
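The VF-space construction can be sketched as follows (illustrative NumPy, assuming 2-D points and vectors stacked row-wise; the function name and the particular weight values are ours):

```python
import numpy as np

def vf_space(points, vectors, alpha=0.4, beta=0.3, gamma=0.3):
    # tau_i = (alpha*x_i, alpha*y_i, beta*kx_i, beta*ky_i, gamma*m_i)
    points = np.asarray(points, float)
    vectors = np.asarray(vectors, float)
    m = np.linalg.norm(vectors, axis=1)      # magnitudes m_i = |v_i|
    k = vectors / m[:, None]                 # direction cosines (kx_i, ky_i)
    return np.hstack([alpha * points, beta * k, gamma * m[:, None]])
```

Neighborhoods computed with ordinary Euclidean distance in this 5-D space jointly reflect position, direction, and magnitude, which is what the modified LLE step needs; a standard nearest-neighbor search can then be reused unchanged.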
4 Experimental Results

The point data shown in Fig. 1(a) is actually sampled from a 1-D manifold. However, a curve reconstructed from this data set does not always have a shape similar to the original manifold if only the position information of the points is provided; Fig. 3(a) shows one of the possible curves reconstructed from the point data. If the tangent vector at each point is provided, as in Fig. 1(b), the geometrical structure of the data set is unique and definite, as in Fig. 3(b). If the proposed similarity measure (3) is applied to the vector data of Fig. 1(b) to estimate the neighborhood of each vector, the significant feature of the data can be effectively found using the modified LLE algorithm, and the original manifold curve can be appropriately unfolded in the feature space. Fig. 4(b) shows the extracted
Fig. 3. Curves reconstructed from point and vector data: (a) one of the possible curves reconstructed from point data; (b) the unique curve determined by vector data. The curve obtained from vector data is unique and definite, whereas many different curves can be reconstructed from the point data.
Fig. 4. Comparison of results of unfolding the vector data shown in Fig. 1(b): (a) the result obtained with the original LLE on Euclidean space, where only spatial position information is utilized and the similarity measure is based entirely on Euclidean distance; (b) the result obtained with the modified LLE on a vector field space, where the vector information at each point is also incorporated into the similarity measure.
feature, along which the vector data on the 1-D manifold is arranged in good order. In contrast, if we use Euclidean distance as the measure and execute the original LLE algorithm on Euclidean space, the result becomes disorderly: Fig. 4(a) shows the result based on Euclidean distance, where the vector data appears orderless.
5 Conclusions

Vector data distributed on a low-dimensional manifold provides more geometric structure information than common point data does. To determine
the neighborhood of each vector datum appropriately, a novel measure different from Euclidean distance is proposed to estimate the similarity of vectors. In addition, we design a vector field space that incorporates the vector information, on which the modified LLE algorithm can effectively extract significant features of vector data. Furthermore, the proposed similarity measure is also applicable to the analysis of optical flow in computer vision and to surface reconstruction in computer graphics.
Acknowledgments This research work is supported by the National Natural Science Foundation of China under Grant No. 60473104.
References
1. Scheuermann, G., Hamann, B., Joy, K.I., Kollmann, W.: Visualizing local vector field topology. SPIE Journal of Electronic Imaging 9 (2000) 356–367
2. Tipping, M.E., Bishop, C.: Mixtures of probabilistic principal component analyzers. Neural Computation 11 (1999) 443–482
3. Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embedding. Science 290 (2000) 2323–2326
4. Li, H., Chen, W., Shen, I.F.: Supervised learning for classification. In: FSKD (2). Volume 3614 of Lecture Notes in Computer Science. (2005) 49–57
5. Chen, J.L., Bai, Z., Hamann, B., Ligocki, T.J.: A normalized-cut algorithm for hierarchical vector field data segmentation. In: Proc. of Visualization and Data Analysis 2003. (2003)
6. Garcke, H., Preusser, T., Rumpf, M., Telea, A., Weikard, U., Wijk, J.J.V.: A continuous clustering method for vector fields. In Ertl, T., Hamann, B., Varshney, A., eds.: Proc. of IEEE Visualization 2000. (2000) 351–358
7. Heckel, B., Uva, A.E., Hamann, B.: Clustering-based generation of hierarchical surface models. In Wittenbrink, C., Varshney, A., eds.: Proc. of IEEE Visualization 1998 (Hot Topics). (1998) 50–55
8. Li, H., Chen, W., Shen, I.F.: Segmentation of discrete vector fields. IEEE Transactions on Visualization and Computer Graphics (2006) (to appear)
9. Tricoche, X., Scheuermann, G., Hagen, H.: A topology simplification method for 2D vector fields. In Ertl, T., Hamann, B., Varshney, A., eds.: Proc. of IEEE Visualization 2000. (2000) 359–366
10. Telea, A.C., Wijk, J.J.V.: Simplified representation of vector fields. In Ebert, D., Gross, M., Hamann, B., eds.: Proc. of IEEE Visualization 1999. (1999) 35–42
11. Li, H., Shen, I.F.: Manifold learning of vector fields. In: ISNN 2006. Lecture Notes in Computer Science (2006) (to appear)
The Mahalanobis Distance Based Rival Penalized Competitive Learning Algorithm Jinwen Ma and Bin Cao Department of Information Science, School of Mathematical Sciences and LMAM, Peking University, Beijing, 100871, China
[email protected]
Abstract. The rival penalized competitive learning (RPCL) algorithm was developed for clustering analysis on a set of sample data in which the number of clusters is unknown, and recent theoretical analysis shows that it can be constructed by minimizing a special kind of cost function on the sample data. In this paper, we use the Mahalanobis distance instead of the Euclidean distance in the cost function and propose the Mahalanobis distance based rival penalized competitive learning (MDRPCL) algorithm. The experiments demonstrate that the MDRPCL algorithm can successfully determine the number of elliptical clusters in a data set and lead to a good classification result.
1 Introduction
Clustering analysis is important not only in statistics but also in many practical applications. The analysis of clusters in a set of (unlabeled) sample data by means of mixture distributions, called mixture-model cluster analysis, is one of the most difficult problems in statistics, and its main task is to select a proper number of clusters for a sample data set. In the terminology of artificial neural networks, this is the problem of unsupervised classification on a sample data set where the number of clusters is unknown. Its importance and difficulty have been noted by many researchers (e.g., [1]-[2]). As an adaptive version of classical k-means clustering, competitive learning (CL) was developed in the field of neural networks and provides a promising tool for unsupervised classification [3]. In the basic form of competitive learning, as each input sample is presented, the winning unit or neuron of the output layer of a CL network is activated and its corresponding weight vector is modified, while the remaining units in the output layer are not activated and their weight vectors are left unchanged. This form of competitive learning is referred to as simple or classical competitive learning (CCL), or winner-take-all (WTA) learning [4]. In the late 1980s, it was found that this simple learning mechanism suffers from the dead-unit (or under-utilization) problem. To solve it, the Frequency Sensitive Competitive Learning (FSCL) algorithm was developed based on the conscience mechanism [5].

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 442–447, 2006. © Springer-Verlag Berlin Heidelberg 2006
However, these conventional competitive learning algorithms cannot solve the unsupervised classification problem on a set of sample data where the number of clusters is unknown. Xu, Krzyzak and Oja discussed this problem and showed that it is equivalent to the selection of an appropriate number of units [6]. To tackle it, the Rival Penalized Competitive Learning (RPCL) algorithm was proposed by adding a new mechanism to FSCL [6]. The basic idea is that for each input, not only is the weight vector of the winner unit modified (rewarded) to adapt to the input, but the weight vector of its rival (the 2nd winner) is also de-learned (penalized) with a smaller learning rate. When the learning and de-learning rates are appropriately selected, RPCL can automatically allocate an appropriate number of weight vectors for a sample data set, with the weight vectors of the extra units being pushed far away from the sample data. The RPCL algorithm has been demonstrated to work well and has been applied in many fields such as clustering, vector quantization, and the training of RBF neural networks. Moreover, it has been generalized into several versions for different types of sample data [7]. Via theoretical analysis, it has recently been proved that the RPCL algorithm can be considered a special case of the distance sensitive RPCL (DSRPCL) algorithms constructed by minimizing a cost function [8]. However, the distance between a sample point and a weight vector in that cost function is mainly analyzed and used in the form of the Euclidean distance, which effectively limits the clusters to spherical shapes. In this paper, we use the Mahalanobis distance instead of the Euclidean distance in the cost function and propose the Mahalanobis distance based rival penalized competitive learning (MDRPCL) algorithm, which can successfully determine the number of elliptical clusters in a data set and lead to a good classification result.
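The reward/penalize step just described can be sketched as follows — a minimal NumPy illustration with Euclidean distances, not the authors' code; the learning-rate values are arbitrary, and the frequency-sensitive term of FSCL is omitted for brevity.

```python
import numpy as np

def rpcl_step(x, W, eta_win=0.05, eta_rival=0.002):
    # Find the winner (1st nearest unit) and its rival (2nd nearest).
    d = np.linalg.norm(W - x, axis=1)
    winner, rival = np.argsort(d)[:2]
    W[winner] += eta_win * (x - W[winner])   # reward: move toward x
    W[rival] -= eta_rival * (x - W[rival])   # penalize: de-learn away from x
    return W
```

Repeated over many samples, extra units keep losing as rivals and drift away from the data, which is how RPCL prunes superfluous units.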
2 The Cost Function and the MDRPCL Algorithm
Given a set of sample data $S = \{X^\mu\}_{\mu=1}^{N}$ with $X^\mu = [x_1^\mu, x_2^\mu, \cdots, x_d^\mu]^T \in R^d$, the simple competitive learning or FSCL algorithm can be regarded as an adaptive version of the well-known k-means algorithm, which essentially minimizes the Mean Square Error (MSE) cost function:

$$E_{MSE}(\mathbf{W}) = \frac{1}{2} \sum_{i,j,\mu} M_i^\mu \, (x_j^\mu - w_{ij})^2 = \frac{1}{2} \sum_\mu \|X^\mu - W_{c(\mu)}\|^2, \qquad (1)$$

where $\mathbf{W} = [W_1, W_2, \cdots, W_k]$ and each $W_i = [w_{i1}, w_{i2}, \cdots, w_{id}]^T \in R^d$ is the $i$-th weight vector under consideration. Moreover,

$$M_i^\mu = \begin{cases} 1, & \text{if } i = c(\mu), \text{ i.e., } \|X^\mu - W_{c(\mu)}\| = \min_j \|X^\mu - W_j\|, \\ 0, & \text{otherwise,} \end{cases}$$

with $c(\mu)$ denoting the index of the winner unit (or its weight vector) for the $\mu$-th sample point.
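The MSE cost with winner-take-all assignment can be sketched in NumPy (our own illustration, with assumed function and variable names):

```python
import numpy as np

def mse_cost(X, W):
    # E_MSE: half the summed squared distance of each sample X^mu
    # to its winning (nearest) weight vector W_{c(mu)}.
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)  # N x k distances
    return 0.5 * d2.min(axis=1).sum()
```

This also makes the model-selection difficulty concrete: appending one more weight vector can only lower each sample's minimum distance, so the cost decreases monotonically as units are added and cannot by itself reveal the right number of units.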
However, $E_{MSE}(\mathbf{W})$ cannot be used for detecting the correct number $k$ because it decreases monotonically as the number of units increases; thus, it does not apply to the RPCL algorithm. Recently, Ma and Wang [8] constructed the following cost function to describe the mechanism of the RPCL algorithm:

$$E(\mathbf{W}) = E_1(\mathbf{W}) + E_2(\mathbf{W}), \quad E_1(\mathbf{W}) = \frac{1}{2}\sum_\mu \|X^\mu - W_{c(\mu)}\|^2, \quad E_2(\mathbf{W}) = \frac{1}{P}\sum_{\mu,\, i \neq c(\mu)} \|X^\mu - W_i\|^{-P}, \qquad (2)$$

where $\mathbf{W} = \mathrm{vec}[W_1, W_2, \cdots, W_n]$ and $P$ is a positive number. This new cost function is composed of the mean square error term $E_1$ and the model selection term $E_2$: the minimization of $E_2$ allocates the proper number of weight vectors to the sample data, while the minimization of $E_1$ makes the classification of the sample data possible. It is shown by theoretical analysis and experiment in [8] that minimizing this cost function leads to a correct number of weight vectors located around the centers of the clusters in the sample data, with the extra weight vectors being driven far away from the sample data. However, the discussion in [8] is mainly based on the Euclidean distance in the cost function, which essentially assumes that the clusters in the sample data are spherical. To relax this constraint, we use the Mahalanobis distance instead of the Euclidean one: the distance between $W_i$ and $X^\mu$ is given by

$$\|X^\mu - W_i\|^2 = (X^\mu - W_i)^T \Sigma_i^{-1} (X^\mu - W_i), \qquad (3)$$

where $\Sigma_i$ is the covariance matrix of cluster $i$, assumed positive definite. The cost function given in Eq. (2) thus becomes a function of the parameters $(W_1, \Sigma_1), \cdots, (W_n, \Sigma_n)$.

With the above preparations, we now construct the Mahalanobis distance based rival penalized competitive learning algorithm as follows. First we take the derivative of $E$ with respect to $W_j$:

$$\frac{\partial E}{\partial W_j} = -\sum_\mu \delta_{c(\mu)}^j \, \Sigma_j^{-1}(X^\mu - W_j) + \sum_\mu \left(1 - \delta_{c(\mu)}^j\right) \|X^\mu - W_j\|^{-P-2}\, \Sigma_j^{-1}(X^\mu - W_j),$$

where $\delta_j^i$ is the Kronecker delta. The MDRPCL algorithm for updating the weight vectors is then constructed as a gradient descent on the cost function:

$$W_j^{(t)} = W_j^{(t-1)} + \Delta W_j, \qquad (4)$$

where $\Delta W_j = -\eta \, \partial E / \partial W_j$ and $\eta > 0$ is the learning rate, generally selected as a small positive constant.
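A batch-mode sketch of this gradient in NumPy (our own illustration, not the authors' code): under the descent update W_j ← W_j − η ∂E/∂W_j, the winner term pulls W_j toward its samples, and each rival term, weighted by ‖X − W_j‖^(−P−2), pushes W_j away.

```python
import numpy as np

def mdrpcl_grad(X, W, Sigmas, P=0.02):
    # dE/dW_j = -sum_{mu: c(mu)=j} Sigma_j^{-1}(X^mu - W_j)
    #           + sum_{mu: c(mu)!=j} ||X^mu - W_j||^(-P-2) Sigma_j^{-1}(X^mu - W_j)
    Sinv = [np.linalg.inv(S) for S in Sigmas]
    N, n = len(X), len(W)
    D = np.empty((N, n))                      # Mahalanobis distances
    for j in range(n):
        diff = X - W[j]
        D[:, j] = np.sqrt(np.einsum('nd,de,ne->n', diff, Sinv[j], diff))
    c = D.argmin(axis=1)                      # winner index c(mu) per sample
    grad = np.zeros_like(W)
    for j in range(n):
        diff = X - W[j]
        pull = (c == j).astype(float)         # Kronecker-delta (winner) term
        push = (1.0 - pull) * D[:, j] ** (-P - 2.0)
        grad[j] = Sinv[j] @ ((push - pull)[:, None] * diff).sum(axis=0)
    return grad
```

With identity covariances this reduces to the Euclidean DSRPCL gradient; the Σ_j⁻¹ factor simply reshapes the pull/push directions for elliptical clusters.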
Instead of the batch mode, we can obtain the following adaptive algorithm for updating the weight vectors: at the current step $t$, we randomly choose a sample $X^\mu$ and compute

$$\frac{\partial E_1}{\partial W_j} = \begin{cases} -\Sigma_j^{-1}(X^\mu - W_j), & \text{if } j = c(\mu); \\ 0, & \text{otherwise,} \end{cases} \qquad (5)$$

$$\frac{\partial E_2}{\partial W_j} = \begin{cases} 0, & \text{if } j = c(\mu); \\ \|X^\mu - W_j\|^{-P-2}\, \Sigma_j^{-1}(X^\mu - W_j), & \text{otherwise.} \end{cases} \qquad (6)$$

Meanwhile, we use the following rule to update the covariance matrix $\Sigma_j$ at each step. In the batch mode,

$$\Sigma_j^{(t)} = \begin{cases} (1 - \eta')\,\Sigma_j^{(t-1)} + \eta' \sum_\mu (X^\mu - W_{c(\mu)})(X^\mu - W_{c(\mu)})^T, & \text{if } j = c(\mu); \\ \Sigma_j^{(t-1)}, & \text{otherwise,} \end{cases} \qquad (7)$$

and in the adaptive mode,

$$\Sigma_j^{(t)} = \begin{cases} (1 - \eta')\,\Sigma_j^{(t-1)} + \eta'\, (X^\mu - W_{c(\mu)})(X^\mu - W_{c(\mu)})^T, & \text{if } j = c(\mu); \\ \Sigma_j^{(t-1)}, & \text{otherwise,} \end{cases} \qquad (8)$$
where $\eta' > 0$ is a small positive constant. This update rule was given by Xu [7] for extending the RPCL algorithm to data sets with elliptical clusters. Although it does not follow the gradient of the cost function exactly, it maintains good stability in the convergence of the covariance matrices; in fact, the exact gradient rule often leads to a degenerate covariance matrix, so that the algorithm fails to converge. Since the MDRPCL algorithm is a type of gradient descent algorithm, it may get trapped in a local minimum and thus produce a wrong clustering result. To overcome this problem, we can apply a simulated annealing mechanism to the MDRPCL algorithm in the same way as in the ASRPCL algorithm given in [8]. We will use this simulated annealing method for the classification of the wine data in the next section.
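The adaptive covariance update amounts to an exponential moving average toward the outer product of the current residual; a short sketch (our own illustration, with an assumed rate η′):

```python
import numpy as np

def update_cov(Sigma, x, w, eta_c=1e-4):
    # Winning unit only: Sigma <- (1 - eta')*Sigma + eta' * (x - w)(x - w)^T
    r = (np.asarray(x, float) - np.asarray(w, float))[:, None]
    return (1.0 - eta_c) * Sigma + eta_c * (r @ r.T)
```

Because the result is a convex combination of a positive definite matrix and a positive semidefinite outer product, the updated Σ stays symmetric positive definite for 0 < η′ < 1, which is exactly the stability that the exact gradient rule lacks.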
3 Experimental Results
In this section, simulated experiments are carried out to demonstrate the performance of the MDRPCL algorithm, and the algorithm is then applied to the classification of the wine data.

3.1 On Simulated Data Sets
We conducted two experiments on sets of samples drawn from mixtures of four bivariate Gaussian densities (i.e., d = 2), which can be observed in the backgrounds of Fig. 1(a) and (b), respectively. All four Gaussian distributions in each case were selected to be elliptical, with a certain degree
Fig. 1. Trajectories of the weight vectors (solid blocks) during the learning process of the MDRPCL algorithm. (a) The batch MDRPCL algorithm with learning rates η = 0.001, η′ = 0.0001, P = 0.02, and 100 iterations. (b) The adaptive MDRPCL algorithm with the same parameter values but 1000 iterations. Here, the contour lines of each Gaussian, based on the pair of final estimated mean vector and covariance matrix obtained by the MDRPCL algorithm, are retained unless the density falls below e^{-3} of the peak.
of overlap. We implemented the batch MDRPCL algorithm on the data set shown in Fig. 1(a) and the adaptive MDRPCL algorithm with the simulated annealing mechanism on the data set shown in Fig. 1(b), which has a higher degree of overlap. In both experiments, the weight vectors Wj were selected randomly and independently from the samples, and we took P = 0.02. The experimental results with n = 8 in the two cases are given in Fig. 1(a) and (b), respectively. We can observe that four Gaussians, based on the estimated weight vectors Wj and covariance matrices Σj, are finally located accurately at the actual Gaussians, while the other weight vectors are driven far away from the sample data. That is, the correct number of elliptical clusters was detected in each of these data sets. Moreover, the classification accuracies on the two data sets of Fig. 1(a) and (b) reach 98% and 93%, respectively.

3.2 On the Wine Data Set
We then applied the adaptive MDRPCL algorithm with the simulated annealing mechanism to the classification (or recognition) of the wine data set, a typical real data set for testing classification or clustering algorithms, which can be downloaded from http://www.ics.uci.edu/∼mlearn/MLRepository.html. It consists of 178 samples of three types of wine; each datum is 13-dimensional and consists of the chemical analysis of a sample of a certain type of wine. We first normalized the wine data into the interval [0, 3] and then applied the simulated annealing MDRPCL algorithm to the unsupervised classification of the wine data, setting k = 6, with the parameters wij selected randomly from the interval [0, 3] and P = 0.2.
The experiments show that the simulated annealing MDRPCL algorithm can detect the three classes in the wine data set with a classification accuracy of 100% or 99.64% (only one error), which is a rather good result for an unsupervised learning method.
4 Conclusions
In this paper, we have investigated the role of the Mahalanobis distance in the special cost function associated with the RPCL algorithm and proposed a Mahalanobis distance based RPCL algorithm. By relaxing the Euclidean distance to the Mahalanobis distance, the proposed RPCL algorithm can be applied to sample data sets with elliptical clusters. The experiments demonstrate that the MDRPCL algorithm can successfully determine the number of elliptical clusters in a simulated or real data set and lead to a good classification result.
Acknowledgement This work was supported by the Natural Science Foundation of China for Project 60471054.
References
1. Caliński, T., Harabasz, J.: A Dendrite Method for Cluster Analysis. Communications in Statistics 3 (1974) 1-27
2. Hartigan, J.A.: Distribution Problems in Clustering. In Van Ryzin, J. (ed.): Classification and Clustering. Academic Press, New York (1977) 45-72
3. Nasrabadi, N., King, R.A.: Image Coding Using Vector Quantization: A Review. IEEE Trans. Commun. 36 (1988) 957-971
4. Hecht-Nielsen, R.: Neurocomputing. Addison-Wesley, Reading MA (1990)
5. Ahalt, S.C., Krishnamurthy, A.K., Chen, P., Melton, D.E.: Competitive Learning Algorithms for Vector Quantization. Neural Networks 3 (1990) 277-291
6. Xu, L., Krzyzak, A., Oja, E.: Rival Penalized Competitive Learning for Clustering Analysis, RBF Net, and Curve Detection. IEEE Trans. Neural Networks 4 (1993) 636-649
7. Xu, L.: Rival Penalized Competitive Learning, Finite Mixture, and Multisets Clustering. In: Proc. of International Joint Conference on Neural Networks (IJCNN'98). Vol. 3. IEEE Press, Alaska (1998) 2525-2530
8. Ma, J., Wang, T.: A Cost Function Approach to Rival Penalized Competitive Learning (RPCL). IEEE Trans. Systems, Man, and Cybernetics B (in press)
Dynamic Competitive Learning Seongwon Cho1, Jaemin Kim1, and Sun-Tae Chung2 1
School of Electronic and Electrical Engineering, Hongik University, 72-1 Sangsu-dong, Mapo-gu, Seoul 121-791, Korea
[email protected] 2 School of Electronic Engineering, Soongsil University, 511 Sangdo-dong, Dongjak-gu, Seoul 156-743, Korea
[email protected]
Abstract. In this paper, a new competitive learning algorithm called Dynamic Competitive Learning (DCL) is presented. DCL is a supervised learning method that dynamically generates output neurons and automatically initializes their weight vectors from training patterns. It introduces a new parameter called LOG (Limit of Grade) to decide whether an output neuron is created. If the class of at least one of the LOG nearest output neurons matches the class of the present training pattern, DCL adjusts the weight vector of that output neuron to learn the pattern. If the classes of all these nearest output neurons differ from the class of the training pattern, a new output neuron is created, and the training pattern is used to initialize its weight vector. The proposed method differs significantly from previous competitive learning algorithms in that the neuron selected for learning is not limited to the winner and the output neurons are generated dynamically during the learning process. In addition, the proposed algorithm has a small number of parameters, which are easy to determine and apply to real-world problems. Experimental results demonstrate the superiority of DCL over conventional competitive learning methods.
1 Introduction

Competitive learning has been used for rapid training of neural networks because of its low computational complexity. Typical competitive learning models, such as Simple Competitive Learning (SCL) [1], Frequency Sensitive Competitive Learning (FSCL) [2], [3], Self-Organizing Feature Map (SOFM) [4], [5], [6], and Learning Vector Quantization (LVQ) [7], [8], have a static network structure. Their learning performance depends strongly on the distribution of the initial weight vectors and on the number of output neurons, which has to be determined before learning. In contrast, as a typical model with a changeable network structure, Adaptive Resonance Theory (ART) [9] maintains plasticity by consistently integrating newly acquired knowledge into the whole knowledge base while keeping the previous knowledge, and it can simulate human learning behavior. However, its structure and learning algorithm are complex and the adjustment of its numerous learning parameters is difficult, so it is hard to apply to practical problems.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 448 – 455, 2006. © Springer-Verlag Berlin Heidelberg 2006
Dynamic Competitive Learning
449
Hence, in this paper, we propose a new supervised competitive learning neural network called dynamic competitive learning (DCL). In the proposed neural network, the output neurons are dynamically generated during the learning process and the initial weight vectors are automatically selected from the training patterns. The major differences between DCL and the conventional competitive learning models are that DCL introduces a new learning parameter, Limit of Grade (LOG), to decide when output neurons are generated, and that the neuron selected for learning is not limited to the winner. In DCL, the output layer is changeable and only a small number of parameters have to be determined by users. The paper consists of 4 sections. In Section II, DCL is described. Experimental results are presented in Section III. Section IV gives conclusions.
2 Dynamic Competitive Learning 2.1 Basic Concept The DCL neural network consists of 2 layers, an input layer and an output layer, as with conventional competitive learning neural networks. DCL is similar to SCL and FSCL in that only one weight vector is updated for a given training vector. However, in DCL the neurons of the output layer are dynamically generated during learning, and the next nearest output neurons are allowed to be updated if the class of the winning neuron with the minimum Euclidean distance is different from the class of the training vector. The proposed method decides whether to generate output neurons during the learning process by introducing a new learning parameter, LOG. DCL first checks whether the class of the winner is the same as that of the training vector. If the classes are identical, the winner learns. Otherwise the second nearest output neuron becomes a candidate for learning and its class is checked. The same procedure is repeated for the next nearest output neurons. LOG represents the number of candidates for learning. If all the candidates fail to learn, a new output neuron is generated and the given training vector is used as the initial weight vector for that output neuron. As learning proceeds, the LOG of each class increases monotonically to prevent the number of generated output neurons from diverging. The terminologies used in this paper are as follows:
t : the present learning time
n : total number of generated output neurons
J : total number of classes
j : index for class, 1 ≤ j ≤ J
i : index for the output neuron, 1 ≤ i ≤ n
x_j : input vector belonging to class j
LOG_j : LOG of class j
ε : gain for the LOG parameter
n_j : number of output neurons assigned to class j
w_i : weight vector for output neuron i
u_i : labelled class of output neuron i
u : vector consisting of the classes of all the output neurons, i.e.,
450
S. Cho, J. Kim, and S.-T. Chung
u = [u_1, u_2, …, u_n]^T
c_r : the r-th nearest output neuron (the winner is the first nearest output neuron)
α(t) : learning rate at time t.

2.2 Learning Algorithm The DCL algorithm is composed of parameter initialization, selection of candidates for learning, weight vector update, generation of a new output neuron, and parameter adjustment. The final weight vectors and the label vector u can be obtained after the whole learning and relabelling process. The detailed algorithm can be described as follows:

Step 0. Initialize parameters: t = 0, n = 0, α(0) ∈ (0, 1), n_j = 0 for j = 1, 2, …, J, LOG_j = 1 for j = 1, 2, …, J.
Step 1. While the stopping condition is false, do Steps 2-8.
Step 2. For each training input vector x_j, do Steps 3-7.
Step 3. For r = 1, …, LOG_j:
  a. Find c_r such that ‖x_j − w_{c_r}‖ = min_{i ≠ c_1, …, c_{r−1}} ‖x_j − w_i‖
  b. If u_{c_r} = j, update the weight vector and go to Step 5: w_{c_r}(new) = w_{c_r}(old) + α(t)·[x_j − w_{c_r}(old)]
  c. Next r.
Step 4. If u_{c_r} ≠ j for all r, then create a new output neuron and initialize its weight vector: n = n + 1, n_j = n_j + 1, w_n(new) = x_j, u_n = j.
Step 5. Adjust the limit of grade LOG_j.
Step 6. Reduce the learning rate α(t).
Step 7. Next t.
Step 8. Test the stopping condition: the condition may specify a fixed number of iterations.
Step 9. Relabelling.
In DCL the weight vector for an output neuron can be updated using several training vectors with different classes after initial labelling. Thus, after learning ends, relabelling is necessary for assigning the optimal class to each output neuron and removing dead neurons. DCL uses LOG as a parameter for output neuron generation. LOG takes a monotonically increasing form such as (1) or (2) to prevent the number of generated neurons from diverging.

LOG_j = 1 + ε · n_j   (1)

LOG_j = 1 + ε · n   (2)
The LOG parameter is associated with the number of output neurons, and the gain ε lies between 0 and 1 (0 < ε < 1). The form of the LOG parameter and its gain are predetermined before learning, and the number of generated neurons can be adjusted accordingly. As ε approaches 1, a smaller number of output neurons is generated. On the contrary, as ε approaches 0, a relatively large number of output neurons is generated. In particular, in the case of ε = 1 in (2), the number of generated output neurons is the same as the number of classes. In the case of ε = 0, only the winner learns and DCL is similar to LVQ, except that in DCL the weight vector is not updated when the training vector is classified incorrectly.
2.3 Characteristics of the DCL Neural Network DCL dynamically generates output neurons during learning using a changeable neural network structure, as in ART, instead of a static structure, as in SCL, FSCL, SOFM and LVQ. In this sense, it is not required to predetermine the number of output neurons or to initialize the weight vectors. The ART model is similar to DCL in the sense that it also has a changeable neural network structure, but the ART model, which has both bottom-up and top-down vectors, performs two-way learning and clusters inputs using unsupervised competitive learning. In contrast, DCL performs only forward learning and its implementation is simple because it has essentially one parameter, LOG. Since DCL allows the weight vector of one of the nearest output neurons to be updated or a new output neuron to be generated, it keeps the generation probability of dead neurons low, and it prevents the number of output neurons from diverging by introducing a monotonically increasing form of LOG. DCL constructs a more sophisticated classifier by assigning many output neurons near the decision surfaces where misclassification occurs frequently.
3 Experimental Results 3.1 Recognition of Remote Sensing Data In this section, to check the applicability and to demonstrate the performance of DCL on real problems, experiments with remote sensing data were conducted to compare the proposed DCL with other neural networks.
452
S. Cho, J. Kim, and S.-T. Chung
A. Data Set The data set for the experiment is a multi-spectral earth observational remote sensing data set called Flight Line C1 (FLC1) [10], [11]. The geographical location of the FLC1 is the southern part of Tippecanoe County, Indiana, U.S.A. This multi-spectral image was collected with an airborne scanner in June 1966 at noon time. The FLC1 consists of 12 band signals. Each point of one spectral image represents one of 256 gray levels. In this experiment, only 8 spectral bands out of the 12 bands are used. Each data vector is generated by concatenating the eight corresponding band signal values. 8 dominant farm products (alfalfa, corn, oats, red clover, soybean, wheat, bare soil, rye) are chosen to represent 8 particular classes. Training and testing with the networks were done with 200 signal samples per class and 375 other signal samples per class. Therefore, the total numbers of the training and the testing samples are 1600 and 3000, respectively. B. Simulation Results DCL generated 89 output neurons. In order to have a fair comparison, the number of output neurons used was 89 for FSCL and 10 x 10 for SOFM. In the case of DCL, the learning rate coefficient was chosen as α(t) = 0.3 · (1 − t / Total number of iterations)
and the gain for the LOG parameter was chosen as ε = 0.2. In the case of FSCL and SOFM, the learning rate coefficients were chosen as α(t) = 0.9 · (1 − t / Total number of iterations)
and the first 89 or 100 training patterns were used to initialize the weight vectors. Table 1 and Table 2 show the experimental results with the FLC1 data set. The training and testing accuracies of DCL were 97.667% and 93.856%, respectively. The training and testing accuracies of FSCL were 94.125% and 92.217%, respectively. The training and testing accuracies of SOFM were 92.677% and 91.172%, respectively.

Table 1. Recognition Rates of DCL with the FLC1 data set

Epochs   No. of neurons   training(%)   testing(%)
50       82               97.938        93.9
60       84               97.875        93.7
70       84               97.688        93.9
80       82               97.438        93.933
90       89               97.688        93.867
100      84               97.438        93.833
Average  84.1675          97.667        93.856
Table 2. Recognition Rates of FSCL and SOFM with the FLC1 data set

Epoch    FSCL training(%)  FSCL testing(%)  SOFM training(%)  SOFM testing(%)
50       94                91.733           92.688            92.267
60       94.75             92.167           92.75             90.267
70       93.125            91.333           92.438            90.2
80       95.312            92.733           92.625            92
90       93.25             92.767           92.5              91.1
100      94.312            93.567           93.062            91.2
Average  94.125            92.217           92.677            91.172
3.2 Handwritten Numeral Recognition A pattern recognition system is generally divided into 2 parts: first, feature extraction, which computes the parameters that provide the maximum information about the given pattern; second, a classifier, which recognizes the object using the extracted features. In the recognition of handwritten letters, we want to recognize patterns without being influenced by distortions or transformations. For handwritten numeral recognition independent of the position, orientation or size of the pattern, Fourier descriptors were used as the features. A. Data Set The experimental data are 2000 handwritten Arabic numerals (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) with various positions, sizes and orientations. They were written by 100 persons and acquired using a CCD camera and an image grabber. Each pattern has 128 x 128 pixels and 128 gray levels. 14 Fourier descriptors were computed for each pattern. Training and testing with the networks were done with 100 patterns per class and 100 other patterns per class. B. Simulation Results The number of output neurons, learning rate and LOG parameter are set the same as in the experiments with remote sensing data. The initial learning rates of DCL, LVQ and SOFM were chosen as 0.3, 0.4 and 0.9, respectively. In the case of LVQ, the initialization of the weight vectors was carried out by choosing training patterns belonging to each class. In the case of SOFM, the first 100 training patterns were used for the initial weight vectors. The recognition rates of DCL, LVQ and SOFM are shown in Tables 3 and 4.

Table 3. Recognition Rate of DCL with the handwritten numeral data set

Epochs   No. of neurons   training(%)   testing(%)
50       110              97.4          84.4
60       110              97.6          84.6
70       110              97.8          84.6
80       110              97.8          84.4
90       111              97.8          84.6
100      111              97.8          84.6
Average  110.333          97.7          84.533
Table 4. Recognition Rate of LVQ and SOFM with the handwritten numeral data set

Epoch    LVQ training(%)  LVQ testing(%)  SOFM training(%)  SOFM testing(%)
50       92.2             83.4            85.6              76.8
60       92               82.2            86.4              76
70       91.8             82.6            86.2              75.2
80       91.6             82.6            85.8              79.2
90       91.8             82.6            86.6              78.2
100      92               82.8            86.8              79.2
Average  91.9             82.7            86.233            77.433
4 Conclusions We presented a new competitive learning neural network called DCL. It dynamically generates output neurons instead of requiring their number to be determined before learning, and the initial weight vectors are selected from the training patterns. DCL has the following features: 1) the learning algorithm itself includes the initialization process of the weight vectors; 2) it has a simple structure and learning algorithm and a small number of parameters compared to conventional neural networks with changeable structure; 3) the number of dead neurons can be reduced. For the best recognition, an appropriate LOG gain needs to be chosen according to the distribution of the given patterns. For the performance evaluation of the proposed method, we conducted experiments using practical remote sensing data and handwritten numeral data, and demonstrated its superiority in comparison to FSCL, LVQ and SOFM.
Acknowledgement This work was supported by 2004 Hongik University Research Fund.
References 1. Rumelhart, D.E., McClelland, J.L.: Parallel Distributed Processing. In: Rumelhart, D.E., Zipser, D. (eds.): Feature Discovery by Competitive Learning. MIT Press, Cambridge, MA (1988) 2. Ahalt, S.C., Krishnamurthy, A.K., Chen, P., Melton, D.E.: Competitive Learning Algorithms for Vector Quantization. Neural Networks 3 (1990) 277-290 3. DeSieno, D.: Adding a Conscience to Competitive Learning. Proc. IEEE International Conference on Neural Networks, San Diego, California (1988) I117-I124 4. Kohonen, T.: Self-Organization and Associative Memory. 3rd edn. Springer-Verlag, Berlin Heidelberg New York (1989) 5. Kangas, J.A., Kohonen, T.K., Laaksonen, J.T.: Variants of Self-Organizing Maps. IEEE Trans. Neural Networks 1(1) (1990) 93-99
6. Yin, F., Wang, J.: Advances in Neural Networks. Springer-Verlag, Berlin Heidelberg New York (2005) 7. Kohonen, T.: Learning Vector Quantization. Neural Networks 1(1) (1988) 303 8. Fausett, L.: Fundamentals of Neural Networks. Prentice Hall (1994) 9. Carpenter, G.A., Grossberg, S.: The ART of Adaptive Pattern Recognition by a Self-Organizing Neural Network. Computer (1988) 77-88 10. Cho, S., Ersoy, O.K., Lehto, M.: Parallel, Self-Organizing, Hierarchical Neural Networks with Competitive Learning and Safe Rejection Schemes. IEEE Trans. Circuits and Systems 40(9) (1993) 556-566 11. Cho, S.: Parallel, Self-Organizing, Hierarchical Neural Networks with Competitive Learning and Safe Rejection Scheme. Ph.D. Thesis, Purdue University (1992)
Hyperbolic Quotient Feature Map for Competitive Learning Neural Networks

Jinwuk Seok 1, Seongwon Cho 2, and Jaemin Kim 2

1 Embedded S/W Research Division, ETRI, 161, Gajeong-dong, Yoosung-gu, Daejun 305-700, Korea
[email protected]
2 School of Electronic and Electrical Engineering, Hongik University, 72-1 Sangsu-dong, Mapo-gu, Seoul 121-791, Korea
[email protected]
Abstract. In this paper, we present a new learning method called the hyperbolic quotient feature map for competitive learning neural networks. Previous neural network learning algorithms did not consider their topological properties, and thus their dynamics were not clearly defined. We show that the weight vectors obtained by competitive learning decompose the input vector space and map it to the quotient space X/R. In addition, we define a quotient function which maps [1, ∞) ⊂ Rn to [0, 1) and derive the proposed algorithm from a performance measure built with the quotient function. Experimental results for pattern recognition of remote sensing data indicate the superiority of the proposed algorithm in comparison to conventional competitive learning methods.
1 Introduction

In the case of Backpropagation (BP), its objective function depends on the "Kullback distance" obtained from the "Kullback information", and its stability, convergence, and convergence rate have been studied [1], [2]. The Hopfield model (or Chua-Lin-Kennedy model) for an objective function with equality constraints has Lyapunov stability [3]. The Genetic Algorithm (GA), which has been researched vigorously in recent years, has been proven to be a Markov chain. Ritter and Schulten derived the Fokker-Planck equation for the Self-Organizing Feature Map (SOFM), and analyzed the convergence condition and dynamic properties with this equation [4], [5]. However, in previous analyses of neural network algorithms, no topological analysis had been done except for SOFM. In this paper, we analyze the topological structure of simple competitive learning (SCL) and its performance measure. In addition, we construct a new performance measure with a canonical function on the quotient space, and derive a learning equation for the hyperbolic quotient feature map neural network. The paper is organized as follows. Section 2 presents the performance measure of SCL. In Section 3, the topological information of the input vector space is described. In Section 4, the hyperbolic quotient competitive learning algorithm is derived. Section 5 discusses experimental results obtained with the proposed algorithm in comparison to SCL. Section 6 gives conclusions. J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 456–463, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Performance Measure of SCL
The performance measure of SCL at time t ∈ R+ is as follows:

J_t = (1/2) Σ_u^U Σ_r^W ‖x^u − w_t^r‖²   (1)
In the above equation, x^u is the u-th input vector such that x^u ∈ (R^n, d), and w_t^r is the r-th weight vector at time t such that w_t^r ∈ (R^n, d). The indices u and r are elements of the index set Λ ⊆ R+ := [0, ∞), represent the indices of the input vector and the weight vector, respectively, and satisfy 0 ≤ u ≤ U, 0 ≤ r ≤ W. Let the directional derivative h_t be the gradient of J_t for the performance measure of SCL; then we obtain

h_t = ∇J_t = −(x^u − w_t^r)   (2)
In equation (2), if we choose the step size ε(t) as a monotone decreasing function of time t, then we obtain the following learning equation:

w_{t+1}^r = w_t^r − ε(t)h_t = w_t^r + ε(t)(x^u − w_t^r)   (3)
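As a concrete illustration of update rule (3), one iteration can be written as follows (an illustrative sketch, not the authors' code; names are ours, and ε(t) is passed in as a precomputed step size):

```python
import numpy as np

def scl_step(w, x, step):
    """One SCL iteration, eq. (3): find the winner r (nearest weight
    vector to x) and move it toward x by the step size eps(t)."""
    r = int(np.argmin(np.linalg.norm(w - x, axis=1)))  # competition
    w[r] = w[r] + step * (x - w[r])                    # eq. (3) update
    return w, r
```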
Suppose that, in the above equation, the stochastic nonlinear programming criterion is used. Then the step size ε(t) satisfies the following condition:

lim_{t→∞} ε(t) = 0,   lim_{t→∞} ∫_0^t ε(τ)dτ = ∞   (4)
Consider the level set L_α of J_t, and suppose a separated level set L_α^s ⊆ L_α satisfies the following conditions for an open ε-neighborhood N(p, ε) = {x ∈ R^n | d(x, p) < ε}:

∃M < ∞ s.t. ∀y ∈ L_α^s, ‖y‖ < M   (5)

∀x ∈ L_α^s, ∀ε > 0, L_α^s ∩ N(x, ε) − {x} ≠ ∅   (6)
In addition, suppose that the performance measure J_t is locally Lipschitz and twice continuously differentiable, and that for a local minimum x̂ the Hessian of J_t, H(x), satisfies the following condition:

∃ 0 < m ≤ M < ∞ s.t. ∀y ∈ R^n, m‖y‖² ≤ ⟨y, H(x̂)y⟩ ≤ M‖y‖²   (7)
The boundary set ∂L_α^s of the level set L_α^s with the limit point x̂ has a quadratic function form. Thus, the noise term has a large effect in the boundary set ∂L_α^s. If L_α^s is a globally convex set, the norm of the transfer vector ‖x^u − w_t^r‖ becomes larger and the sequence {w_t^r}_{t=0}^∞ uniformly converges to x̂. However, the performance measure J_t consists of the sum of the quadratic functions Σ_r^W ‖x^u − w_t^r‖² for each u. Thus the number of local minima of J_t increases as the index u increases. Competitive learning is basically one of the stochastic optimal strategies, and the weight vector may become stuck in a local minimum. However, it reaches the global minimum with some probability.
3 The Topological Information of Input Vector Space
Supposing that a finite number of input vectors exist in the input vector space (R^n, d), the hyperplanes made by the weight vectors w_t^r construct the finite k(r)-simplexes σ_{w_t^r}^{k(r)} in (R^n, d), where k(r) is a function of the weight vector index r such that k(r) ∈ {x ∈ Z++ | x ≥ 2} [6]. From the above induction, the input vector space (R^n, d) is homotopic with |K| = ∪_{σ∈K} σ. The limit point ŵ^r of the sequence {w_t^r}_{t=0}^∞ generated by competitive learning is one of the limit points in the simplex σ_{w_t^r}^{k(r)}. Considering the topology of the input space, let the subbasis be σ_{w_t^r}^{k(r)} − ∪∂σ_{w_t^r}^{k(r)} and let (X, d) ⊂ (R^n, d) be an open set (or space), abbreviated as X. Then we construct the topology for the input vector space such that

T = {X, ∅, ∪_r (σ_{w_t^r}^{k(r)} − ∪∂σ_{w_t^r}^{k(r)}), ∪∂σ_{w_t^r}^{k(r)}}   (8)
On the above topology, suppose that x, x′ ∈ σ_{w_t^r}^{k(r)} − ∪∂σ_{w_t^r}^{k(r)}. Then there exists a relation R such that R : x ∼ x′. Hence, we can define an equivalence class R[x] = {y ∈ X | yRx}, which makes σ_{w_t^r}^{k(r)} − ∪∂σ_{w_t^r}^{k(r)} a partial subspace denoted as X/R = {R[x] | x ∈ X}. In addition, the input space X is a topological space by the above definition, and thus we can introduce a canonical function q(x) which maps X to X/R. We consider a function which maps X/R to the level set L_α as a new performance measure J_t^N combined with the canonical function q(x); then we obtain a surjective, continuous, open (or closed) function.

4 The Induction of Quotient Canonical Competitive Learning Algorithm

4.1 Sub-Alexandrov Compactification
In the above results, the performance measure J_t for the competitive learning neural network maps Σ_r ‖x^u − w_t^r‖² to a level set L_α defined on the real line R for each u. Hence, a large noise strongly affects the value of the performance measure J_t. Instead, the performance measure should be governed by the partition X/R: an input vector with large noise should have little effect on the performance measure, while an input vector with small noise should affect it strongly. Consequently, if an algorithm has this property, more simplexes of the partition X/R are located in the dense clusters of input vectors. In order to locate each simplex in X/R more closely in the dense clusters of input vectors, we introduce a new performance measure J_t^Q which has a large differential value when J_t^Q is small and, in contrast, a small differential value when J_t^Q is large. In addition, because the partition X/R is constructed from the whole set of weight vectors and each weight vector should be located in the center of a dense cluster of input vectors, the performance measure should contain only the term of summation over the weight vectors. We denote the above
function as Net_1 = Σ_r ‖x^u − w_t^r‖². The function Net_1 is defined on [0, ∞), but we want it to be defined on [1, ∞). For this, let Net_2 be such that Net_2 = exp(Net_1/2σ²) = exp(Σ_r ‖x^u − w_t^r‖²/2σ²). For the compactification, let C denote the sub-real line R[1, ∞) in the Euclidean 2-space R², and let S denote the sphere with center (0, 0) and radius 1. The line passing through the "north pole" p̂ = (0, 1) ∈ S and any point p ∈ C intersects the sphere S in exactly one point p′ distinct from p̂. Let f : C → S be defined by f(p) = p′. Then f is, in fact, a homeomorphism from C, which is not compact, onto the subset S − {p̂} of the sphere S, and S is compact. Hence S is a compactification of C.

4.2 The Induction of a New Performance Measure Using Sub-Alexandrov Compactification
Let the equation of the line through the "north pole" p̂ and x = Net_x in Figure 1 be such that

x/Net_x + y/p̂ = 1   (9)

Since the "north pole" is p̂ = (0, 1), the equation of the sphere S is the following:

x² + y² = 1   (10)

Substituting (9) into (10), we obtain an equation for y which denotes the compact measure J_t^Q for an input vector x^u:

y = (Net_x² − 1)/(Net_x² + 1)   (11)

Since Net_x in (11) is exp(Σ_r ‖x^u − w_t^r‖²/2σ²), we obtain a new performance measure as follows:

J_t^Q(x, w) = Σ_u [exp(Σ_r ‖x^u − w_t^r‖²/σ²) − 1] / [exp(Σ_r ‖x^u − w_t^r‖²/σ²) + 1] = Σ_u tanh(Σ_r ‖x^u − w_t^r‖²/2σ²)   (12)

In equation (12), substituting σ² with σ²(t) such that σ²(t) ↓ 0, the Lebesgue measure of the gradient of σ²(t)J_t^Q(x, w) converges to 0.

Theorem 1. The Lebesgue measure of the gradient of σ²(t)J_t^Q(x, w) converges to 0.

Proof. The gradient of σ²(t)J_t^Q(x, w) is as follows:

∇(σ²(t)J_t^Q(x, w)) = −sech²(Σ_r ‖x^u − w_t^r‖²/2σ²(t))(x^u − w_t^r)

Let Net = Σ_r ‖x^u − w_t^r‖². For the ε-neighborhood N(0, ε) = {Net ∈ R+ | Net < ε}, Lebesgue-integrate ∇(σ²(t)J_t^Q(x, w)) as follows:
m(∇(σ²(t)J_t^Q(x, w))) ≤ m(‖∇(σ²(t)J_t^Q(x, w))‖) = ‖∫_{N(0,ε)} −sech²(Net/σ²(t)) dNet‖ ≤ σ²(t)·tanh(ε/σ²(t)) ≤ σ²(t)·‖sup tanh(·)‖
Since σ²(t) ↓ 0, we can pick a t_f ∈ R+ for an arbitrary positive number δ satisfying the following proposition: ∃t_f > t ∈ R[0, t] s.t. σ(t_f) < δ. Hence, the gradient of σ²(t)J_t^Q(x, w) converges weakly to 0 in Lebesgue measure. (Q.E.D.)

4.3 The Induction of the Quotient Canonical Feature Map Algorithm

Considering the corresponding performance measure, let the directional derivative h_t be the gradient of the performance measure:

h_t = ∇J_t = −sech²(Σ_r ‖x^u − w_t^r‖²/2σ²(t))(x^u − w_t^r)   (13)
Then, substituting σ²(t), we obtain the following learning equation:

w_{t+1}^r = w_t^r − h_t = w_t^r + sech²(Σ_r ‖x^u − w_t^r‖²/2σ²(t))(x^u − w_t^r)   (14)
It can be easily verified from Theorem 1 that the sequence {w_t^r} converges weakly to a limit point. The hyperbolic function for the gradient of the performance measure, sech²(Net/2σ²(t)), is defined on R[0, 1], so it is a candidate for the canonical function X → X/R. Therefore, we construct a canonical function as follows:

∀x ∈ X,  q(x) = ∫_X x·sech²(Σ_r ‖x^u − w_t^r‖²/2σ²(t)) dx / ∫_X sech²(Σ_r ‖x^u − w_t^r‖²/2σ²(t)) dx   (15)

Since the function q(x) is a surjective, continuous and open (or closed) function, it is the canonical function which maps X → X/R. The proposed learning algorithm is as follows:

step 0: Initialize the weight vectors.
step 1: Find the weight vector satisfying the following for an input vector x^u: w_t^r = min_{r̄} ‖x^u − w_t^{r̄}‖
step 2: Compute the search direction for the input vector x^u: h_t = ∇J_t = −sech²(Σ_r ‖x^u − w_t^r‖²/2σ²(t))(x^u − w_t^r)
step 3: Update w_t^r as follows: w_{t+1}^r = w_t^r + sech²(Σ_r ‖x^u − w_t^r‖²/2σ²(t))(x^u − w_t^r)
step 4: Replace t by t + 1.
step 5: Go to step 1.
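Steps 1-3 can be sketched in code as follows. This is only an illustrative rendering under our own naming; the sech² term of (14) is assumed to act as a self-scaling step size in place of the constant step of SCL.

```python
import numpy as np

def hq_step(w, x, sigma2):
    """One hyperbolic-quotient learning step, eq. (14).

    The winner r is moved toward x with gain sech^2(Net / (2*sigma2)),
    where Net = sum_r ||x - w_r||^2, so the gain lies in (0, 1] and
    shrinks as the total distance Net (e.g. for noisy inputs) grows.
    """
    dists = np.linalg.norm(w - x, axis=1)
    r = int(np.argmin(dists))                        # step 1: winner
    net = float(np.sum(dists ** 2))                  # Net = sum_r ||x - w_r||^2
    gain = 1.0 / np.cosh(net / (2.0 * sigma2)) ** 2  # sech^2 term of (13)-(14)
    w[r] = w[r] + gain * (x - w[r])                  # step 3: update
    return w, gain
```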
4.4 The Local Convergence Properties of the Proposed Algorithm
An important issue in the analysis of learning algorithms for neural networks is well-behavedness in a suitably defined sense. From the practical viewpoint, this issue is very important because disturbances such as noise or error are always present in a real system and can knock it out of equilibrium. Therefore, in this section, we discuss the stability of the proposed algorithm, and prove that the proposed algorithm has the well-behavedness property.

Definition 1. The directional derivative h_t consists of the following:

h_t = λ_t(x_t^μ − w_t^r)   (16)

λ_t = ∇(σ²(t)J_t^Q) · (x_t^μ − w_t^r)^{−1} = sech²(Σ_r^R ‖x_t^μ − w_t^r‖²/2σ(t)²)   (17)
Theorem 2. The sequence {w_t^r}_{t=0}^∞ produced by the proposed algorithm is a Cauchy sequence in measure.

Proof. For an arbitrary positive n satisfying 0 < n < ∞,

‖w_t^r − w_{t−n}^r‖ = ‖w_{t−1}^r + λ_{t−1}(x_{t−1}^μ − w_{t−1}^r)χ(x_{t−1}^u, B°) − w_{t−n}^r‖
= ‖w_{t−1}^r − w_{t−2}^r + w_{t−2}^r − ⋯ − w_{t−n−1}^r + w_{t−n−1}^r − w_{t−n}^r + λ_{t−1}(x_{t−1}^μ − w_{t−1}^r)χ(x_{t−1}^u, B°)‖
= ‖Σ_{k=0}^{n−1} λ_{t−k}(x_{t−k}^μ − w_{t−k}^r)χ(x_{t−k}^u, B°)‖   (18)

Hence, we obtain m{‖w_t^r − w_{t−n}^r‖ ≥ ε} as follows:

m{‖w_t^r − w_{t−n}^r‖ ≥ ε} = m{w_{t−n}^r | w_{t−n}^r ∉ B°(w_t^r, ε)}   (19)

Therefore,

m{w_{t−n}^r | w_{t−n}^r ∉ B°(w_t^r, ε)} ≤ ∫_{B°(w_t^r, ε)^C} ‖w_t^r − w_{t−n}^r‖ dw_{t−k}^r
≤ ∫_{B°(w_t^r, ε)^C} Σ_{k=0}^{n−1} ‖λ_{t−k}(x_{t−k}^μ − w_{t−k}^r)‖ · ‖χ(x_{t−k}^u, B°)‖ dw_{t−k}^r
≤ Σ_{k=0}^{n−1} ∫_{B°(w_t^r, ε)^C} ‖λ_{t−k}(x_{t−k}^μ − w_{t−k}^r)‖ dw_{t−k}^r
= Σ_{k=0}^{n−1} ∫_{B°(w_t^r, ε)^C} ‖∇_w σ²(t)J_{t−k}^Q‖ dw_{t−k}^r = Σ_{k=0}^{n−1} ∫_{B°(w_t^r, ε)^C} ‖sech²(Net/σ²(t−k))‖ dNet   (20)

By Theorem 1, the above Lebesgue integral value has the following range:

Σ_{k=0}^{n−1} ∫_{B°(w_t^r, ε)^C} ‖sech²(Net/σ²(t−k))‖ dNet ≤ Σ_{k=0}^{n−1} ∫_{R[0,∞)} ‖sech²(Net/σ²(t−k))‖ dNet ≤ Σ_{k=0}^{n−1} σ²(t−k) ≤ nσ²(t−k)   (21)

Since lim_{t→∞} σ(t) = 0, σ(t) ↓ 0, and n is a finite value, there exists a suitable n₀ ≤ t − k such that 0 < σ(n₀) < ε/n, and we obtain m{‖w_t^r − w_{t−n}^r‖ ≥ ε} ≤ ε. Hence, {w_t^r}_{t=0}^∞ is a Cauchy sequence in Lebesgue measure. (Q.E.D.)

In Theorem 2, the proposed algorithm generates the limit point w_*^r of {w_t^r}_{t=0}^∞. Moreover, we simply note that the limit point w_*^r is a local minimizer of J_t^Q ∈ C^∞.
Fig. 1. Sub-Alexandrov compactification

Table 1. The experiment for FLC1 8 class data

Epoch     SCL Training(%)  SCL Testing(%)  Proposed Training(%)  Proposed Testing(%)
100       90.19            88.63           91.38                 88.27
200       87.56            83.3            90.25                 87.43
300       90.81            87.93           90.69                 88.53
400       91.12            90.23           91.69                 89.8
500       89.75            89.2            90.06                 88.53
600       88.94            86.27           89.38                 86.1
700       91.69            88.93           90.63                 87.97
800       90.56            87.37           90.69                 88.7
900       88.13            83.67           90.38                 89.03
1000      88.38            86.87           90.81                 87.8
Average   89.71            87.24           90.59                 88.22
Variance  1.33             2.18            0.62                  0.95

5 Experimental Results
The data set for the experiment is the multispectral earth observational remote sensing data called "Flight Line C1" (FLC1). The geographical location of the FLC1 is the southern part of Tippecanoe County, Indiana, U.S.A. The FLC1 consists of 8 band signals. Each band signal value represents one of 256 gray levels, requiring a minimum of 8 bits for binary representation. 8 farm products were chosen to represent 8 classes (alfalfa, corn, oats, red clover, soybean, wheat, bare soil, rye). Training and testing of the networks were done with 200 signal samples per class and 375 other signal samples per class. For the 4-class data, 20 weight vectors were used, and for the 8-class data, 40 weight vectors, respectively. The weight vector initialization was carried out by choosing 20 or 40 input vectors randomly. The learning rate ε(t) in the conventional SCL algorithm and σ²(t) in the proposed algorithm are as follows:

ε(t) = 0.9 · (1 − t / Total number of iterations)
σ²(t) = 10.0 · (1 − t / Total number of iterations)
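Both schedules are the same linear decay with different initial values; as a small hypothetical helper (name ours):

```python
def linear_decay(initial, t, total):
    """Linearly decaying schedule used above:
    eps(t) = 0.9 * (1 - t/total), sigma^2(t) = 10.0 * (1 - t/total)."""
    return initial * (1.0 - t / total)
```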
6 Conclusions
In this paper, we presented a canonical function which maps X → X/R based on the analysis of the topological information in the input vector space. A new performance measure based on our topological analysis was defined and the learning equation for the quotient canonical feature map neural network was derived. In addition, we proved the convergence of the proposed algorithm by showing that the gradient of the derived performance measure converges weakly to Lebesgue measure 0. Experimental results for remote sensing data recognition showed that the proposed algorithm produced higher classification accuracy.
Acknowledgement This work was supported by 2004 Hongik University Research Fund.
References 1. Cox, D.R., Hinkley, D.V., Reid, N., Rubin, D.B., Silverman, B.W.: Networks and Chaos - Statistical and Probabilistic Aspects. Chapman & Hall, New York (1993) 2. Yin, F., Wang, J.: Advances in Neural Networks. Springer-Verlag, Berlin Heidelberg New York (2005) 3. Hopfield, J.J., Tank, D.W.: Neural Computation of Decisions in Optimization Problems. Biol. Cybern. 52 (1985) 141-152 4. Ritter, H., Schulten, K.: The Convergence Properties of Kohonen's Topology Conserving Maps: Fluctuations, Stability, and Dimension Selection. Biol. Cybern. 60 (1988) 59-71 5. Gihman, I.I., Skorohod, A.V.: The Theory of Stochastic Processes. 2nd edn. Springer-Verlag, Berlin Heidelberg New York (1980) 6. Luenberger, D.G.: Optimization by Vector Space Methods. John Wiley and Sons, New Jersey (1968)
A Gradient Entropy Regularized Likelihood Learning Algorithm on Gaussian Mixture with Automatic Model Selection

Zhiwu Lu 1 and Jinwen Ma 2

1 Institute of Computer Science & Technology of Peking University, Beijing 100871, China
[email protected]
2 Department of Information Science, School of Mathematical Sciences, and LMAM, Peking University, Beijing 100871, China
[email protected]
Abstract. In Gaussian mixture (GM) modeling, it is crucial to select the number of Gaussians for a sample data set. In this paper, we propose a gradient entropy regularized likelihood (ERL) algorithm on Gaussian mixture to solve this problem under regularization theory. It is demonstrated by the simulation experiments that the gradient ERL learning algorithm can select an appropriate number of Gaussians automatically during parameter learning on a sample data set and lead to a good estimation of the parameters of the actual Gaussian mixture, even in cases where two or more actual Gaussians overlap strongly.
1 Introduction

As a typical statistical model, the Gaussian mixture (GM) has been widely used in a variety of practical applications. Although there have been several statistical or unsupervised learning approaches to this modeling problem, such as the k-means algorithm [1] and the EM algorithm [2], the number k of Gaussians in the data set is usually assumed to be known in advance. Unfortunately, in many instances this key information is not available, and then we have to select the number of Gaussians, or the mixture model, before or during parameter estimation. The traditional approach to this kind of model selection problem is to choose the optimal number k* of Gaussians with some statistical criterion such as Akaike's information criterion [3] or the Bayesian inference criterion [4]. However, since we need to repeat the entire parameter estimation at a number of different values of k, the process of evaluating these criteria incurs a large computational cost. Some more efficient approaches have also been developed on Gaussian mixture with automatic model selection, using a mechanism by which an appropriate number of Gaussians can be selected automatically during parameter learning, with the mixing proportions of the extra mixture components being reduced to zero. Under the Bayesian Ying-Yang (BYY) harmony learning theory [5, 6], a gradient BYY harmony learning algorithm [7] was proposed for Gaussian mixture modeling, which can perform model selection automatically along with parameter estimation. J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 464–469, 2006. © Springer-Verlag Berlin Heidelberg 2006
A Gradient Entropy Regularized Likelihood Learning Algorithm
However, the number of Gaussians cannot be detected accurately by this BYY harmony learning algorithm when two or more Gaussians overlap to a high degree. In this paper, we propose a gradient entropy regularized likelihood (ERL) learning algorithm to solve the model selection problem for Gaussian mixtures in light of regularization theory [8]. The simulation experiments demonstrate that the proposed gradient ERL learning algorithm can automatically determine the number of actual Gaussians in the sample data while at the same time producing a good estimate of the parameters of the original mixture. Moreover, even when two or more actual Gaussians overlap to a relatively high degree, the gradient ERL learning algorithm can still perform model selection appropriately and outperforms the gradient BYY harmony learning algorithm.
2 Gradient Entropy Regularized Likelihood Learning Algorithm
We consider the following Gaussian mixture model:

p(x | Θ) = Σ_{l=1}^k α_l p(x | θ_l),  Σ_{l=1}^k α_l = 1,  α_l ≥ 0,  (1)

p(x | θ_l) = (2π)^{−n/2} |Σ_l|^{−1/2} exp(−(1/2)(x − m_l)^T Σ_l^{−1} (x − m_l)),  (2)

where n is the dimensionality of x, k is the number of Gaussians, and p(x | θ_l) are Gaussian densities with parameters of mean vectors and covariance matrices, that is, θ_l = (m_l, Σ_l) for l = 1, ..., k. Given a sample data set S = {x_t}_{t=1}^N generated from a Gaussian mixture model with k∗ actual Gaussians and setting k ≥ k∗, the negative log-likelihood function on the Gaussian mixture model p(x | Θ) is given by

L(Θ) = −(1/N) Σ_{t=1}^N ln( Σ_{l=1}^k α_l p(x_t | θ_l) ).  (3)
The well-known maximum likelihood (ML) learning is just an implementation of minimizing L(Θ). With the posterior probability that x_t arises from the l-th Gaussian component

P(l | x_t) = α_l p(x_t | θ_l) / Σ_{j=1}^k α_j p(x_t | θ_j),  (4)

we have the discrete Shannon entropy of these posterior probabilities for the sample x_t:

E(x_t) = −Σ_{l=1}^k P(l | x_t) ln P(l | x_t),  (5)
Z. Lu and J. Ma
which is globally minimized at

P(l_0 | x_t) = 1,  P(l | x_t) = 0 (l ≠ l_0),  (6)
that is, the sample x_t is totally classified into, or subject to, the l_0-th Gaussian component. We now consider the average or mean entropy over the sample set S:

E(Θ) = −(1/N) Σ_{t=1}^N Σ_{l=1}^k P(l | x_t) ln P(l | x_t),  (7)

and use it to regularize the log-likelihood function by

H(Θ) = L(Θ) + γE(Θ),  (8)
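As a rough illustration (not part of the paper), the regularized objective of Eq. (8) can be evaluated directly from Eqs. (3), (4) and (7); the function names and the NumPy dependency are our own choices:

```python
import numpy as np

def gauss_pdf(X, m, S):
    """Gaussian density p(x | m, S) of Eq. (2), evaluated at each row of X."""
    n = len(m)
    d = X - m
    quad = np.einsum('ti,ij,tj->t', d, np.linalg.inv(S), d)
    return np.exp(-0.5 * quad) / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(S)))

def erl_objective(X, alphas, means, covs, gamma):
    """H(Theta) = L(Theta) + gamma * E(Theta), Eq. (8)."""
    # weighted component densities alpha_l * p(x_t | theta_l), shape (N, k)
    D = np.column_stack([a * gauss_pdf(X, m, S)
                         for a, m, S in zip(alphas, means, covs)])
    mix = D.sum(axis=1)                       # p(x_t | Theta), Eq. (1)
    L = -np.mean(np.log(mix))                 # negative log-likelihood, Eq. (3)
    post = D / mix[:, None]                   # posteriors P(l | x_t), Eq. (4)
    E = -np.mean((post * np.log(post + 1e-300)).sum(axis=1))  # mean entropy, Eq. (7)
    return L + gamma * E
```

For γ = 0 this reduces to the plain negative log-likelihood; increasing γ penalizes mixtures whose posteriors remain ambiguous.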
where γ > 0 is the regularization factor. That is, E(Θ) is a regularization term that reduces the model complexity so that the Gaussian mixture can be made as simple as possible by minimizing H(Θ). In order to solve the minimization problem of H(Θ) without constraint conditions, we utilize the following substitution:

α_l = exp(β_l) / Σ_{j=1}^k exp(β_j).  (9)
According to the derivatives of H(Θ) with respect to the parameters m_l, Σ_l and β_l, respectively, we have the following gradient learning algorithm to minimize H(Θ):

Δm_l = (η/N) Σ_{t=1}^N U(l | x_t) Σ_l^{−1} (x_t − m_l),  (10)

ΔΣ_l = (η/2N) Σ_{t=1}^N U(l | x_t) Σ_l^{−1} [(x_t − m_l)(x_t − m_l)^T − Σ_l] Σ_l^{−1},  (11)

Δβ_l = (η/N) Σ_{t=1}^N Σ_{j=1}^k U(j | x_t)(δ_{jl} − α_l),  (12)

where

U(l | x_t) = P(l | x_t) (1 + γ Σ_{j=1}^k (δ_{jl} − P(j | x_t)) ln(α_j p(x_t | θ_j))),
and η is the learning rate, which is usually selected as a small positive constant. Compared with the EM algorithm for Gaussian mixtures, this gradient-type learning algorithm implements a regularization or penalization mechanism on the mixing proportions during the parameter learning process, which leads to automatic model selection. The regularization factor is generally selected by experience.
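A minimal sketch of one batch iteration of the update rules (10)–(12) follows. This is our own illustration, not the authors' code; it assumes SciPy for the Gaussian densities and all names are hypothetical:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def erl_step(X, alphas, means, covs, betas, gamma=0.5, eta=0.1):
    """One gradient ERL step (Eqs. (10)-(12)); returns updated parameters."""
    N, k = len(X), len(alphas)
    # weighted densities alpha_l * p(x_t | theta_l) and posteriors, Eq. (4)
    D = np.column_stack([alphas[l] * mvn(means[l], covs[l]).pdf(X) for l in range(k)])
    post = D / D.sum(axis=1, keepdims=True)
    logd = np.log(D + 1e-300)
    # U(l|x_t) = P(l|x_t) * (1 + gamma * sum_j (delta_jl - P(j|x_t)) ln(alpha_j p(x_t|theta_j)))
    U = post * (1.0 + gamma * (logd - (post * logd).sum(axis=1, keepdims=True)))
    new_means, new_covs = [], []
    for l in range(k):
        Sinv = np.linalg.inv(covs[l])
        d = X - means[l]
        new_means.append(means[l] + eta / N * Sinv @ (U[:, l] @ d))          # Eq. (10)
        M = np.einsum('t,ti,tj->ij', U[:, l], d, d) - U[:, l].sum() * covs[l]
        new_covs.append(covs[l] + eta / (2 * N) * Sinv @ M @ Sinv)           # Eq. (11)
    new_betas = betas + eta / N * (U.sum(axis=0) - U.sum() * np.asarray(alphas))  # Eq. (12)
    new_alphas = np.exp(new_betas) / np.exp(new_betas).sum()                 # Eq. (9)
    return new_alphas, new_means, new_covs, new_betas
```

Iterating this step until |ΔH| falls below a tolerance, and then discarding components whose mixing proportion drops below a small threshold (the paper uses 0.001), gives the automatic model selection behaviour described above.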
3 Simulation Results
In order to test the performance of the gradient ERL learning algorithm, several simulation experiments were carried out on eight different sets of sample data generated from Gaussian mixtures. Moreover, a comparison is made between the gradient ERL learning algorithm and the BYY harmony learning algorithm, especially in cases where two or more Gaussians overlap strongly. The eight sets of sample data for the simulation experiments are drawn from mixtures of four or three bivariate Gaussian densities (i.e., n = 2). As shown in Fig.1, each sample data set is generated at a different degree of overlap among the clusters (i.e., Gaussians) in the Gaussian mixture by controlling the mean vectors and covariance matrices of the Gaussian distributions, and with equal or unequal mixing proportions of the clusters in the mixture by controlling the number of samples from each Gaussian density. For a sample data set with k∗ clusters or Gaussians, we implement the gradient ERL algorithm with k ≥ k∗ and η = 0.1. The other parameters are initialized randomly within certain intervals. In all the experiments, the algorithm is stopped if |ΔH| < 10^{−6}. As a typical case, we give the experimental results of the gradient ERL learning algorithm on the sample data set of Fig.1(d), where k∗ = 4. For two different regularization factors γ = 1.0 and γ = 0.5, the experimental results of the algorithm with k = 8 are shown in Fig.2(a) and Fig.2(b), respectively. We can observe that four Gaussian components are located accurately, while the mixing proportions of the other four Gaussian components are reduced to below 0.001, i.e., these Gaussian components are extra and can be discarded. Hence, the correct number of Gaussians has been detected on this data set. Moreover, we can observe that the smaller the regularization factor γ is, the more slowly the ERL algorithm converges to a reasonable solution.
We also make a comparison with the BYY harmony learning algorithm on the sample set from Fig.1(h), which contains three Gaussians (i.e., k∗ = 3), two of which overlap to a high degree. Additionally, the three Gaussians have
Fig. 1. Eight sets of sample data (a)–(h) used in the experiments
Fig. 2. The experimental results of automatic detection of the number of Gaussians on the sample set from Fig.1(d) by the gradient ERL algorithm with different regularization factors. (a) The results by ERL with γ = 1.0 (stopped after 2438 iterations); (b) the results by ERL with γ = 0.5 (stopped after 2539 iterations)
Fig. 3. The experimental results of automatic detection of the number of Gaussians on the sample set from Fig.1(h) by different learning algorithms. (a) The results by the BYY learning algorithm; (b) the results by the ERL learning algorithm
different numbers of samples, which makes this sample set harder to learn. Both the gradient ERL and BYY learning algorithms were implemented on this sample set with k = 8. As shown in Fig.3(a), the number of Gaussians is not correctly detected by the gradient BYY algorithm. However, the gradient ERL learning algorithm with γ = 0.3 detects the three Gaussian components correctly, as shown in Fig.3(b), with the five extra Gaussian components being canceled automatically once their mixing proportions fall below 0.001. Therefore, the ERL learning algorithm can be considered more powerful than the BYY harmony learning algorithm for model selection. Further experiments on the other sample sets also detected the correct number of Gaussians in similar cases. Additionally, we find that the ERL learning algorithm converges with an average error of less than
0.1 between the estimated parameters and the true parameters in the original mixture.
4 Conclusions
We have investigated automatic model selection and parameter estimation for Gaussian mixtures via a gradient entropy regularized likelihood (ERL) learning algorithm. The simulation experiments have demonstrated that the proposed gradient ERL learning algorithm can detect the number of Gaussians automatically during parameter learning, with a good estimate of the parameters of the original mixture at the same time, even on sample sets with a high degree of overlap, and can thus outperform the BYY harmony learning algorithm in this respect.
References
1. Devijver, P.A., Kittler, J.: Pattern Recognition: A Statistical Approach. Prentice Hall, Englewood Cliffs, N.J. (1982)
2. Redner, R.A., Walker, H.F.: Mixture Densities, Maximum Likelihood and the EM Algorithm. SIAM Review 26(2) (1984) 195–239
3. Akaike, H.: A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control 19(6) (1974) 716–723
4. Schwarz, G.: Estimating the Dimension of a Model. The Annals of Statistics 6(2) (1978) 461–464
5. Xu, L.: BYY Harmony Learning, Structural RPCL, and Topological Self-Organizing on Mixture Modes. Neural Networks 15(8–9) (2002) 1231–1237
6. Lu, Z., Cheng, Q., Ma, J.: A Gradient BYY Harmony Learning Algorithm on Mixture of Experts for Curve Detection. Lecture Notes in Computer Science 3578 (2005) 250–257
7. Ma, J., Wang, T., Xu, L.: A Gradient BYY Harmony Learning Rule on Gaussian Mixture with Automated Model Selection. Neurocomputing 56 (2004) 481–487
8. Cox, D.D., O'Sullivan, F.: Asymptotic Analysis of Penalized Likelihood and Related Estimators. The Annals of Statistics 18(6) (1990) 1676–1695
Self-organizing Neural Architecture for Reinforcement Learning

Ah-Hwee Tan

Intelligent Systems Centre and School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798
[email protected]
Abstract. Self-organizing neural networks are typically associated with unsupervised learning. This paper presents a self-organizing neural architecture, known as TD-FALCON, that learns cognitive codes across multi-modal pattern spaces, involving states, actions, and rewards, and is capable of adapting and functioning in a dynamic environment with external evaluative feedback signals. We present a case study of TD-FALCON on a mine avoidance and navigation cognitive task, and illustrate its performance by comparing it with a state-of-the-art reinforcement learning approach based on the gradient descent backpropagation algorithm.
1 Introduction
Reinforcement learning [3] is a learning paradigm wherein an autonomous system learns to adjust its behaviour according to feedback from the environment. This paper describes a natural extension of Adaptive Resonance Theory (ART) [2] for developing an integrated reinforcement learner that adapts its behaviour through interacting with the environment. The proposed neural architecture, known as FALCON (Fusion Architecture for Learning, Cognition, and Navigation), learns multi-channel mappings simultaneously across multi-modal input patterns, involving states, actions, and rewards, in an online and incremental manner. Compared with prior systems, FALCON presents a truly integrated solution in the sense that there is no separate reinforcement learning module or Q-value table. Using competitive coding as the underlying principle of computation, the network dynamics encompasses a myriad of learning paradigms, including unsupervised learning, supervised learning, and reinforcement learning. In this paper, we focus on a class of FALCON networks, known as TD-FALCON, that learns the value functions of the state-action space estimated through temporal difference (TD) methods. To evaluate the proposed system, experiments have been conducted on a minefield navigation task, which involves an autonomous vehicle (AV) learning to navigate through obstacles to reach a stationary target (goal) within a specified number of steps. Experimental results show that TD-FALCON produces a much more stable performance and yet learns faster, by roughly two orders of magnitude, than a state-of-the-art reinforcement learning system using the gradient-based backpropagation (BP) network.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 470–475, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 FALCON Architecture
FALCON employs a 3-channel architecture (Figure 1), comprising a cognitive field F2^c and three input fields, namely a sensory field F1^{c1} for representing current states, a motor field F1^{c2} for representing actions, and a feedback field F1^{c3} for representing reward values. The generic network dynamics of FALCON, based on fuzzy ART operations [1], is described below.
Fig. 1. The FALCON architecture
Input vectors: Let S = (s_1, s_2, ..., s_n) denote the state vector, where s_i ∈ [0, 1] indicates sensory input i. Let A = (a_1, a_2, ..., a_m) denote the action vector, where a_i ∈ [0, 1] indicates a possible action i. Let R = (r, r̄) denote the reward vector, where r ∈ [0, 1] is the reward signal value and r̄ (the complement of r) is given by r̄ = 1 − r.
Activity vectors: Let x^{ck} denote the F1^{ck} activity vector for k = 1, ..., 3. Let y^c denote the F2^c activity vector.
Weight vectors: Let w_j^{ck} denote the weight vector associated with the j-th node in F2^c for learning the input patterns in F1^{ck} for k = 1, ..., 3. Initially, F2^c contains only one uncommitted node and its weight vectors contain all 1's.
Parameters: The FALCON dynamics is determined by choice parameters α^{ck} > 0, learning rate parameters β^{ck} ∈ [0, 1], contribution parameters γ^{ck} ∈ [0, 1] and vigilance parameters ρ^{ck} ∈ [0, 1] for k = 1, ..., 3.

2.1 Predicting
In the predicting mode, FALCON receives input pattern(s) from one or more fields and predicts the pattern(s) in the remaining fields. The prediction process comprises three key steps, namely code activation, code competition, and activity readout.
Code activation: Given the activity vectors x^{c1}, x^{c2} and x^{c3}, for each F2^c node j, the choice function T_j^c is computed as follows:

T_j^c = Σ_{k=1}^3 γ^{ck} |x^{ck} ∧ w_j^{ck}| / (α^{ck} + |w_j^{ck}|),  (1)
where the fuzzy AND operation ∧ is defined by (p ∧ q)_i ≡ min(p_i, q_i), and the norm |·| is defined by |p| ≡ Σ_i p_i for vectors p and q.
Code competition: A code competition process follows, under which the F2^c node with the highest choice function value is identified. The winner is indexed at J, where

T_J^c = max{T_j^c : for all F2^c nodes j}.  (2)

When a category choice is made at node J, y_J^c = 1, and y_j^c = 0 for all j ≠ J. This indicates a winner-take-all strategy.
Activity readout: The chosen F2^c node J performs a readout of its weight vectors to the input fields F1^{ck} such that

x^{c2(new)} = x^{c2(old)} ∧ w_J^{c2}.  (3)

2.2 Learning
In the learning mode, FALCON performs code activation and code competition (as described in the previous section) to select a winner J based on the activity vectors x^{c1}, x^{c2}, and x^{c3}. To complete the learning process, template matching and template learning are performed as described below.
Template matching: Before code J can be used for learning, a template matching process checks that the weight templates of code J are sufficiently close to their respective input patterns. Specifically, resonance occurs if for each channel k, the match function m_J^{ck} of the chosen code J meets its vigilance criterion:

m_J^{ck} = |x^{ck} ∧ w_J^{ck}| / |x^{ck}| ≥ ρ^{ck}.  (4)
If any of the vigilance constraints is violated, mismatch reset occurs, in which the value of the choice function T_J^c is set to 0 for the duration of the input presentation. With a match tracking process, at the beginning of each input presentation the vigilance parameter ρ^{c1} equals a baseline vigilance ρ̄^{c1}. If a mismatch reset occurs, ρ^{c1} is increased until it is slightly larger than the match function m_J^{c1}. The search process then selects another F2^c node J under the revised vigilance criterion until a resonance is achieved.
Template learning: Once a resonance occurs, for each channel k, the weight vector w_J^{ck} is modified by the following learning rule:

w_J^{ck(new)} = (1 − β^{ck}) w_J^{ck(old)} + β^{ck} (x^{ck} ∧ w_J^{ck(old)}).  (5)

When an uncommitted node is selected for learning, it becomes committed and a new uncommitted node is added to the F2^c field. FALCON thus expands its network architecture dynamically in response to the input patterns. The network dynamics of the FALCON architecture described above can be used to support a myriad of cognitive operations. We show how FALCON can be used to learn action policies and value functions for reinforcement learning in the next section.
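To make the predicting/learning cycle concrete, here is a compressed sketch of Eqs. (1)–(5). It is our own illustration, not the author's code: complement coding and match tracking are omitted, and all parameter values are placeholders.

```python
import numpy as np

class MiniFalcon:
    """Minimal 3-channel fuzzy-ART sketch of FALCON's code activation,
    competition, template matching, and template learning."""

    def __init__(self, dims, alpha=0.1, beta=1.0,
                 gamma=(1/3, 1/3, 1/3), rho=(0.2, 0.2, 0.5)):
        self.alpha, self.beta, self.gamma, self.rho = alpha, beta, gamma, rho
        # one uncommitted node; its weight vectors contain all 1's
        self.w = [np.ones((1, d)) for d in dims]

    def choice(self, x):                                   # Eq. (1)
        T = np.zeros(len(self.w[0]))
        for k in range(3):
            num = np.minimum(x[k], self.w[k]).sum(axis=1)  # |x ^ w_j|
            T += self.gamma[k] * num / (self.alpha + self.w[k].sum(axis=1))
        return T

    def learn(self, x):
        for J in np.argsort(-self.choice(x)):              # Eq. (2), reset on mismatch
            if all(np.minimum(x[k], self.w[k][J]).sum() >= self.rho[k] * x[k].sum()
                   for k in range(3)):                     # vigilance test, Eq. (4)
                for k in range(3):                         # template learning, Eq. (5)
                    self.w[k][J] = ((1 - self.beta) * self.w[k][J]
                                    + self.beta * np.minimum(x[k], self.w[k][J]))
                if J == len(self.w[0]) - 1:                # uncommitted node committed:
                    for k in range(3):                     # append a fresh one
                        self.w[k] = np.vstack([self.w[k], np.ones(self.w[k].shape[1])])
                return J
        return None
```

Presenting a (state, action, reward) triple to `learn` either recodes an existing cognitive node or commits a new one, which is how the F2^c field grows dynamically.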
3 TD-FALCON
TD-FALCON incorporates Temporal Difference (TD) methods to estimate and learn the value function Q(s, a), which indicates the goodness of taking a certain action a in a given state s. The general sense-act-learn algorithm for TD-FALCON is summarized in Table 1. Given the current state s, the FALCON network is used to predict the value of performing each available action. The value functions are then processed by an action selection strategy (also known as a policy) to select an action. Upon receiving a feedback, a TD formula is used to compute a new estimate of the Q value. The new Q value is then used as the teaching signal for TD-FALCON to learn the association of the current state and the chosen action to the estimated value. The details of the action selection strategy and the temporal difference equation are described as follows.

Table 1. Generic flow of the TD-FALCON algorithm
1. Initialize the FALCON network.
2. Given the current state s, for each available action a in the action set A, predict the value of the action Q(s, a) by presenting the corresponding state and action vectors S and A to FALCON.
3. Based on the value functions computed, select an action a from A following an action selection policy.
4. Perform the action a, observe the next state s', and receive a reward r (if any).
5. Estimate the value function Q(s, a) following a temporal difference formula given by ΔQ(s, a) = αTD_err.
6. Present the corresponding state, action, and reward (Q-value) vectors, namely S, A, and R, to FALCON for learning.
7. Update the current state by s = s'.
8. Repeat from Step 2 until s is a terminal state.
3.1 Action Selection Policy
The action selection policy refers to the strategy for selecting an action from the set of available actions in a given state. The simplest action selection policy is to pick the action with the highest value predicted by the TD-FALCON network. However, if an agent keeps selecting the optimal action, it will not be able to explore and discover better alternative actions. There is thus a fundamental tradeoff between exploitation, i.e., sticking to the best actions believed, and exploration, i.e., trying out other seemingly inferior and less familiar actions. The ε-greedy policy selects the action with the highest value with a probability of 1 − ε and takes a random action with probability ε. With a constant ε value, the agent will always explore the environment with a fixed level of randomness. In practice, it may be beneficial to have a higher ε value to encourage the exploration of paths in the initial stage and a lower ε value to optimize the performance by exploiting familiar paths in the later stage. A decay ε-greedy policy is thus adopted to gradually reduce the value of ε over time.
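A decay ε-greedy policy can be sketched in a few lines (illustrative only; the particular decay schedule and constants are our own assumptions, not taken from the paper):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Exploit (argmax) with probability 1 - epsilon; explore uniformly otherwise."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# decay schedule: heavy exploration early, near-greedy behaviour later
epsilon, eps_min, decay = 0.5, 0.005, 0.995
for trial in range(2000):
    # action = epsilon_greedy(predicted_q_values, epsilon) ... run the trial ...
    epsilon = max(eps_min, epsilon * decay)
```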
3.2 Value Function Estimation
One key step of TD-FALCON is the temporal difference equation ΔQ(s, a) = αTD_err, where α ∈ [0, 1] is the learning parameter and TD_err is the error term. Using Q-learning [4], TD_err is computed by

TD_err = r + γ max_{a'} Q(s', a') − Q(s, a),  (6)

where r is the immediate reward value, γ ∈ [0, 1] is the discount parameter, and max_{a'} Q(s', a') denotes the maximum estimated value of the next state s'. With value iteration, the value function Q(s, a) is expected to converge to r + γ max_{a'} Q(s', a') over time. As ART and TD-FALCON typically assume all input values are bounded between 0 and 1, we develop a Bounded Q-Learning rule

ΔQ(s, a) = αTD_err (1 − Q(s, a))  (7)

that incorporates a self-scaling term directly.
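A direct transcription of Eqs. (6)–(7) can look like this (a sketch; parameter values are illustrative):

```python
def bounded_q_update(q_sa, reward, q_next_max, alpha=0.5, gamma=0.9):
    """Return the new Q(s, a) under the Bounded Q-Learning rule.

    td_err follows Eq. (6); the (1 - Q(s, a)) factor of Eq. (7) scales the
    increment down as Q(s, a) approaches 1, keeping estimates near [0, 1].
    """
    td_err = reward + gamma * q_next_max - q_sa   # Eq. (6)
    return q_sa + alpha * td_err * (1.0 - q_sa)   # Q + Delta-Q, Eq. (7)
```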
4 The Minefield Navigation Task
The objective of the cognitive task is to navigate through a minefield to a randomly selected target position in a specified time frame without hitting a mine. In each trial, the autonomous vehicle (AV) starts from a random position in the field, and repeats the cycles of sense, act, and learn. A trial ends when the system reaches the target (success), hits a mine (failure), or exceeds 30 sense-act-learn cycles (out of time). The target and the mines remain stationary during the trial. For sensing, the AV has a coarse 180-degree forward view based on five sonar sensors. In each direction i, the sonar signal is measured by s_i = 1/(1 + d_i), where d_i is the distance to an obstacle in direction i. Other sensory inputs include the current and target bearings. In each step, the system chooses one of five possible actions, namely move left, move diagonally left, move straight ahead, move diagonally right, and move right. We conduct experiments with both immediate and delayed evaluative feedback. For both reward schemes, at the end of a trial, a reward of 1 is given when the AV reaches the target, and a reward of 0 is given when the AV hits a mine. For the immediate reward scheme, a reward is estimated at each step of the trial by computing a utility function utility = 1/(1 + rd), where rd is the remaining distance between the AV and the target position. To put the performance of TD-FALCON in perspective, we further conduct experiments using a reinforcement learning system based on the standard Q-learning rule with a gradient descent backpropagation (BP) algorithm as the function approximator. The BP learner employs a standard three-layer Perceptron architecture and has been shown to produce superior performance in many prior applications [5]. We experiment with a varying number of hidden nodes empirically and obtain the best results with 36 nodes.
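The sensing and reward formulas above amount to two one-liners (function names are our own, hypothetical choices):

```python
def sonar_inputs(distances):
    """s_i = 1 / (1 + d_i) for each of the five sonar directions."""
    return [1.0 / (1.0 + d) for d in distances]

def immediate_reward(remaining_distance):
    """utility = 1 / (1 + rd); approaches 1 as the AV closes on the target."""
    return 1.0 / (1.0 + remaining_distance)
```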
Figure 2 summarizes the performance of TD-FALCON and the BP learner in terms of success rate, hit-mine rate, and out-of-time rate in a 16 by 16 minefield
Fig. 2. The performance of the TD-FALCON and BP learners: (a) FALCON with immediate reward; (b) FALCON with delayed reward; (c) BP with immediate reward; (d) BP with delayed reward
containing 10 mines. The BP learner generally takes a very large number of trials (around 90,000) to reach a 90% success rate. In contrast, TD-FALCON consistently achieves the same level of performance within the first 1000 trials. In other words, TD-FALCON learns roughly two orders of magnitude faster than the BP learner. Considering network complexity, the BP learner has a highly compact network architecture. In terms of stability, however, TD-FALCON is clearly a much more reliable learner, consistently producing superior performance.
References
1. Carpenter, G.A., Grossberg, S., Rosen, D.B.: Fuzzy ART: Fast Stable Learning and Categorization of Analog Patterns by an Adaptive Resonance System. Neural Networks 4(6) (1991) 759–771
2. Carpenter, G.A., Grossberg, S.: Adaptive Resonance Theory. In: Arbib, M.A. (ed.): The Handbook of Brain Theory and Neural Networks. MIT Press, Cambridge (2003) 87–90
3. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
4. Watkins, C.J.C.H., Dayan, P.: Q-learning. Machine Learning 8(3) (1992) 279–292
5. Si, J., Barto, A.G., Powell, W.B., Wunsch, D. (eds.): Handbook of Learning and Approximate Dynamic Programming. Wiley-IEEE Press (2004)
On the Efficient Implementation Biologic Reinforcement Learning Using Eligibility Traces

SeungGwan Lee

School of Computer Science and Information Engineering, Catholic University, 43-1, Yeokgok 2-Dong, Wonmi-Gu, Bucheon-Si, Gyeonggi-Do, 420-743, Korea
[email protected]
Abstract. The eligibility trace is one of the basic mechanisms in reinforcement learning for handling delayed reward. In this paper, we use a meta-heuristic method to solve hard combinatorial optimization problems. Our proposed solution builds on the Ant-Q learning method to solve the Traveling Salesman Problem (TSP). The approach is population-based, using positive feedback as well as greedy search, and we suggest an ant reinforcement learning algorithm using eligibility traces, called the replace-trace method (Ant-TD(λ)). Although replacing traces change the updates only slightly, they can produce a significant improvement in learning rate. Our experiments show that the proposed reinforcement learning method converges to the optimal solution faster than ACS and Ant-Q.
1 Introduction
Reinforcement learning is learning by interaction, in which an agent learns through trial-and-error interaction with its environment. The agent attempts an action that can be taken in the given state, changes its state, and receives a reward value for the action. In this paper we build on the Ant-Q algorithm [7], [8] for combinatorial optimization introduced by Colorni, Dorigo and Maniezzo, and we suggest a new ant reinforcement learning algorithm using eligibility traces, often called a replace-trace method (Ant-TD(λ)), on top of Ant-Q learning. The proposed Ant-TD(λ) reinforcement learning method searches for the goal using the TD-error [3], [4], [5] and eligibility traces [13], and has the characteristic of converging to the optimal solution faster than the original ACS [11], [12] and Ant-Q. The rest of the paper is organized as follows. In Section 2, we introduce Ant-Q and Ant-TD reinforcement learning. Section 3 describes the new ant reinforcement learning algorithm using eligibility traces, called the replace-trace method (Ant-TD(λ)). Section 4 presents experimental results. Finally, Section 5 concludes the paper and describes directions for future work.
2 Ant-Q Algorithm

2.1 Ant-Q
The Ant-Q learning method [7], [8] proposed by Colorni, Dorigo and Maniezzo is an extension of the Ant System (AS) [1], [2], [6], [9], [10]; it is a reinforcement learning reinterpretation of ant algorithms in view of Q-learning.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 476–481, 2006. © Springer-Verlag Berlin Heidelberg 2006
In Ant-Q, an agent k situated in node r moves to node s using the following rule, called the pseudo-random proportional action choice rule (or state transition rule):

s = argmax_{u∈J_k(r)} [AQ(r, u)]^δ · [HE(r, u)]^β  if q ≤ q_0 (exploitation);  s = S otherwise (exploration).  (1)

AQ(r, u) is the Ant-Q value, a positive real value associated with the edge (r, u). It is the counterpart of the Q-values in Q-learning, and is intended to indicate how useful it is to move to node u when in node r. AQ(r, u) is changed at run time. HE(r, u) is a heuristic value associated with edge (r, u) which allows a heuristic evaluation of which moves are better (in the TSP, the inverse of the distance). Let k be an agent making a tour; J_k(r) is the set of nodes still to be visited from the current node r. The initial parameter is set to AQ(r, u) = AQ_0 = 1/(average length of edges · n). Here δ and β are parameters which weigh the relative importance of the learned AQ-values and the heuristic values, q is a value chosen randomly with uniform probability in [0, 1], q_0 (0 ≤ q_0 ≤ 1) is a parameter, and S is a random variable selected according to the distribution given by Eq.(2), which gives the probability with which an agent in node r chooses the node s to move to:

p_k(r, s) = [AQ(r, s)]^δ · [HE(r, s)]^β / Σ_{u∈J_k(r)} [AQ(r, u)]^δ · [HE(r, u)]^β  if s ∈ J_k(r);  0 otherwise.  (2)

The goal of Ant-Q is to learn the AQ-values so as to find better solutions stochastically. The AQ-values are updated by the following Eq.(3):

AQ(r, s) ← (1 − α) · AQ(r, s) + α · (ΔAQ(r, s) + γ · Max_{z∈J_k(s)} AQ(s, z)).  (3)
Here α (0 < α < 1) is the pheromone decay parameter and γ is the discount rate. Max_{z∈J_k(s)} AQ(s, z) is the maximum reinforcement value received from the external environment; as global reinforcement it is zero. Also, ΔAQ is the reinforcement value; the local reinforcement is always zero, while the global reinforcement, which is given after all the agents have finished their tours, is computed by the following Eq.(4):

ΔAQ(r, s) = W / L_{k_ib}  if (r, s) ∈ tour done by the agent k_ib;  0 otherwise,  (4)

where L_{k_ib} is the length of the tour done by the best agent, which did the shortest tour in the current iteration, and W is a parameter, set to 10.

2.2 Ant-TD
To solve temporal credit assignment problems, reinforcement learning using TD can serve as a method to learn predictions.
Traditional learning methods such as supervised learning wait until the final result happens, keeping a record of the predictions for all outputs calculated at intermediate stages. Then, as the training error, they use the difference between the prediction value for the output of each state and the final result. But reinforcement learning using TD does not wait for the final result. At each learning step, the training error uses the difference between the prediction for the output of the present state and the prediction for the output of the next state; this is known as the TD-error (Temporal Difference error). The prediction for the output of the present state is updated to approximate the prediction for the output of the next state in TD-learning. Thus, reinforcement learning of this simple form, which uses the difference between the prediction values for the outputs of successive states, is TD-learning. TD-learning calculates the Q-function value of the present state with Eq.(5), using the TD-error:

Q(s_t, a_t) ← (1 − α) · Q(s_t, a_t) + α · TD_error  (5)

Here, α is the learning rate, and the TD-error is calculated with Eq.(6) as the difference between the prediction of the present state and the prediction of the next state:

TD_error = r_{t+1} + γ [ Max_{a_{t+1}∈A(s_{t+1})} Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]  (6)
Here r_t is the reinforcement value and γ is the discount rate. The goal of Ant-Q is to learn the AQ-values so as to find better solutions stochastically. The AQ-values are updated by the following Eq.(7); applying Eq.(6) to Ant-Q yields Eq.(8):

AQ(r, s) ← (1 − α) · AQ(r, s) + α · (ΔAQ(r, s) + γ · Max_{z∈J_k(s)} AQ(s, z))  (7)

TD_error = ΔAQ(r, s) + γ [ Max_{z∈J_k(s)} AQ(s, z) − AQ(r, s) ]  (8)
Finally, an ant reinforcement learning model that applies the TD-error to Ant-Q [14] calculates the Q-function value for the edge (r, s) of the present state with Eq.(9). Again, α (0 < α < 1) is the pheromone decay parameter and γ is the discount rate; Max_{z∈J_k(s)} AQ(s, z) is the maximum reinforcement value received from the external environment (as global reinforcement it is zero), and ΔAQ is the reinforcement value (the local reinforcement is always zero):

AQ(r, s) ← (1 − α) · AQ(r, s) + α · (ΔAQ(r, s) + γ · [ Max_{z∈J_k(s)} AQ(s, z) − AQ(r, s) ])  (9)

where ΔAQ(r, s) = 0 for local updating, and Max_{z∈J_k(s)} AQ(s, z) − AQ(r, s) = 0 for global updating.
3 Ant-TD(λ): Ant-TD Using Eligibility Traces
An eligibility trace is one of the basic mechanisms of reinforcement learning. For example, in the popular TD(λ) algorithm, the λ refers to the use of an eligibility trace. Almost any temporal-difference (TD) method, such as Q-learning, can be combined with eligibility traces to obtain a more general method that may learn more efficiently [13]. An eligibility trace is a temporary record of the occurrence of an event, such as the visiting of a state or the taking of an action. The trace marks the memory parameters associated with the event as eligible for undergoing learning changes. When a TD error occurs, only the eligible states or actions are assigned credit for the error. Thus, eligibility traces help bridge the gap between events and training information. Like TD methods themselves, eligibility traces are a basic mechanism for temporal credit assignment.
Fig. 1. Feature of Replacing Eligibility Traces
In TD(λ), there is an additional memory variable associated with each state: its eligibility trace. The eligibility trace for edge (r, s) is denoted e(r, s). On each step, the eligibility trace for the maximum state visited on the step (the one attaining max AQ(s, z)) is set to 1, and the eligibility traces for the other states decay by γλ:

e(r, s) = 1, if AQ(r, s) = max_{z∈J_k(s)} AQ(s, z); e(r, s) = γ · λ · e(r_{t−1}, s_{t−1}), otherwise   (10)

where γ is the discount rate and λ is the trace-decay parameter. This kind of eligibility trace is called a replacing trace because it is replaced (reset to 1) each time the state is visited, and then fades away gradually while the state is not the maximum, as illustrated in Fig. 1. Finally, the ant reinforcement learning algorithm using eligibility traces, called the replace-trace method Ant-TD(λ), is updated by Eq. (11):

AQ(r, s) ← (1 − α) · AQ(r, s) + α · (ΔAQ(r, s) + γ · [ max_{z∈J_k(s)} AQ(s, z) − AQ(r, s) ]) · e(r, s)   (11)

where ΔAQ(r, s) = 0 if local updating, and max_{z∈J_k(s)} AQ(s, z) − AQ(r, s) = 0 if global updating.
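The replacing-trace rule of Eq. (10) can be sketched as follows. This is an illustrative sketch under the assumption that traces are kept per edge in a dictionary; names and defaults are not from the paper.

```python
def update_traces(e, edges, max_edge, gamma=0.3, lam=0.7):
    """Replacing-trace update of Eq. (10): the trace of the edge that
    currently attains the maximum AQ-value is reset to 1; every other
    trace decays by gamma * lambda."""
    for edge in edges:
        e[edge] = 1.0 if edge == max_edge else gamma * lam * e.get(edge, 0.0)
    return e
```

The returned traces e(r, s) would then multiply the TD-error term in the Ant-TD(λ) update of Eq. (11).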
S. Lee

4 Experimental Results
We measure the performance of the proposed ant reinforcement learning algorithm using eligibility traces, the replace-trace method Ant-TD(λ), through comparison with the original ant models (ACS and Ant-Q). The basic environment parameters for the experiment were set as follows, with the optimum values determined experimentally: β = 2, α = 0.1, q₀ = 0.9, γ = 0.3, λ = 0.7, m = n, W = 10 and AQ₀ = 1/(average length of edges · n). For the initial positions, one agent was assigned to each node at random, and the termination condition was a fixed number of cycles or finding the known optimum value. Table 1 shows the best and average tour lengths achieved by each algorithm (ACS, Ant-Q and Ant-TD(λ)) over 10 trials of 1000 cycles each on R×R grid problems; the performance of the proposed method is the best of the three. Table 1. Performance evaluation of Ant-TD(λ)
Node     ACS avg  ACS best  Ant-Q avg  Ant-Q best  Ant-TD(λ) avg  Ant-TD(λ) best
4×4      160      160       160        160         160            160
5×5      254      254       254        254         254            254
6×6      360      360       362        360         360            360
7×7      510      507       509        502         494            494
8×8      661      654       654        648         640            640
9×9      790      764       782        753         775            749
10×10    971      959       963        941         952            936

5 Conclusion and Future Work
In this paper, we proposed a new ant reinforcement learning algorithm using eligibility traces, called the replace-trace method Ant-TD(λ). The proposed Ant-TD(λ) improves on Ant-Q: it converges faster to the optimal solution by using the TD-error to solve the temporal credit-assignment problem while agents complete their tour cycles. Although replacing traces change the update only slightly, they can produce a significant improvement in learning rate. The proposed Ant-TD(λ) ant model uses, at each learning step, the difference between the prediction for the output of the present state and the prediction for the output of the next state, and updates the present prediction toward the next one. In future work, we need to study several possible ways to generalize replacing eligibility traces in Ant-TD(λ).
References
1. Colorni, A., Dorigo, M., Maniezzo, V.: An Investigation of Some Properties of an Ant Algorithm. In: Manner, R., Manderick, B. (eds.): Proceedings of the Parallel Problem Solving from Nature Conference. Elsevier Publishing (1992) 509-520
2. Colorni, A., Dorigo, M., Maniezzo, V.: Distributed Optimization by Ant Colonies. In: Varela, F., Bourgine, P. (eds.): Proceedings of the First European Conference on Artificial Life. Elsevier Publishing (1991) 134-144
3. Watkins, C.J.C.H.: Learning from Delayed Rewards. Ph.D. Thesis, King's College, Cambridge, U.K. (1989)
4. Fiechter, C.N.: Efficient Reinforcement Learning. Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory (1994) 88-97
5. Barnard, E.: Temporal-Difference Methods and Markov Models. IEEE Trans. Systems, Man and Cybernetics 23 (1993) 357-365
6. Gambardella, L.M., Dorigo, M.: Solving Symmetric and Asymmetric TSPs by Ant Colonies. Proceedings of the IEEE International Conference on Evolutionary Computation. IEEE Press (1996) 622-627
7. Gambardella, L.M., Dorigo, M.: Ant-Q: A Reinforcement Learning Approach to the Traveling Salesman Problem. In: Prieditis, A., Russell, S. (eds.): Proceedings of ML-95, Twelfth International Conference on Machine Learning. Morgan Kaufmann (1995) 252-260
8. Dorigo, M., Gambardella, L.M.: A Study of Some Properties of Ant-Q. In: Voigt, H.M., Ebeling, W., Rechenberg, I., Schwefel, H.S. (eds.): Proceedings of PPSN IV, Fourth International Conference on Parallel Problem Solving from Nature. Lecture Notes in Computer Science, Springer-Verlag, Berlin Heidelberg (1996) 656-665
9. Dorigo, M., Maniezzo, V., Colorni, A.: The Ant System: Optimization by a Colony of Cooperating Agents. IEEE Trans. Systems, Man and Cybernetics-Part B 26(1) (1996) 29-41
10. Stutzle, T., Hoos, H.: The Ant System and Local Search for the Traveling Salesman Problem. Proceedings of the IEEE 4th International Conference on Evolutionary Computation (1997)
11. Gambardella, L.M., Dorigo, M.: Ant Colony System: A Cooperative Learning Approach to the Traveling Salesman Problem. IEEE Trans. Evolutionary Computation 1(1) (1997)
12. Stutzle, T., Dorigo, M.: ACO Algorithms for the Traveling Salesman Problem. In: Miettinen, K., Makela, M., Neittaanmaki, P., Periaux, J. (eds.): Evolutionary Algorithms in Engineering and Computer Science. Wiley (1999)
13. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press (1998)
14. Lee, S.G.: Multiagent Reinforcement Learning Algorithm Using Temporal Difference Error. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag, Berlin Heidelberg (2005) 627-633
Combining Label Information and Neighborhood Graph for Semi-supervised Learning

Lianwei Zhao¹, Siwei Luo¹, Mei Tian¹, Chao Shao¹, and Hongliang Ma²

¹ School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
[email protected]
² Logistics Science Research Institute, General Logistics Department, Beijing 100071, China
Abstract. In this paper, we consider the problem of combining the labeled and unlabeled examples to boost the performance of semi-supervised learning. We first define the label information graph, and then incorporate it with neighborhood graph. We propose a new regularized semi-supervised classification algorithm, in which the regularization term is based on this modified Graph Laplacian. According to the properties of Reproducing Kernel Hilbert Space (RKHS), the representer theorem holds, so the solution can be expressed by the Mercer kernel of examples. Experimental results show that our algorithm can use unlabeled and labeled examples effectively.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 482 – 488, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction
The problem of learning from labeled and unlabeled examples has received significant attention in recent years. It is also known as semi-supervised learning, and a number of algorithms have been proposed for it, including co-training [4], random field models [5,6] and graph-based approaches [7,8]. However, learning from examples can be seen as an ill-posed inverse problem, and regularizing an inverse problem means finding a meaningful, stable solution, so in this paper we focus on regularization approaches. Both measure-based regularization [9] and information regularization [10] take density into consideration and can place the decision boundary in a region of low density in 2D examples; however, they are difficult to apply to high-dimensional real-world data sets. Manifold regularization [1, 2] exploits the geometry of the marginal distribution to incorporate unlabeled examples within a geometrically motivated regularization term. A key assumption of manifold-based learning algorithms is that the data lies on a low-dimensional manifold and can be approximated by a weighted discrete graph, so how to use the labeled and unlabeled examples to construct an approximating graph is very important. In this paper, we first define the label information graph and then incorporate it with the neighborhood graph. Using a Graph Laplacian regularizer, we propose a regularized semi-supervised classification algorithm. According to the properties of a Reproducing Kernel Hilbert Space (RKHS), the representer theorem holds, so the solution can be expressed by the Mercer kernel of the examples. There is only one regularization parameter, reflecting the tradeoff between the Graph Laplacian and the
complexity of the solution. Experimental results show that our algorithm can use unlabeled and labeled examples effectively and is more robust than Transductive SVM and LapSVM. This paper is organized as follows. Section 2 briefly reviews the Graph Laplacian. In Section 3, we define the label information graph with labeled examples and propose the regularized semi-supervised classification algorithm. Experimental results on synthetic and real-world data are shown in Section 4, followed by conclusions in Section 5.
2 Graph Laplacian
The Graph Laplacian [3] has played a crucial role in several recently developed algorithms [11] because it approximates the natural topology of the data and is simple to compute. Consider a neighborhood graph G = (V, E) whose vertices are the example points V = {x_1, x_2, …, x_{l+u}} and whose edge weights {W_ij}_{i,j=1}^{l+u} represent pairwise similarity relationships between examples. The neighborhood of x_j can be defined as those examples that are closer than ε, or as the k nearest neighbors of x_j. To ensure that the embedding function f is smooth, a natural choice is the empirical estimate I(G), which measures how much f varies across the graph:

I(G) = (1 / (2 Σ_{i,j} W_ij)) Σ_{i,j=1}^{l+u} (f(x_i) − f(x_j))² W_ij   (1)

where 2 Σ_{i,j} W_ij is a normalizing factor, so that 0 ≤ I(G) ≤ 1.
Defining f̂ = [f(x_1), …, f(x_{l+u})]^T and L = D − W as the Graph Laplacian matrix, where D is the diagonal matrix given by D_ii = Σ_{j=1}^{l+u} W_ij, I(G) can be rewritten as:

I(G) = (1 / (2 Σ_{i,j} W_ij)) f̂^T L f̂   (2)
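The quadratic form of Eq. (2) is easy to compute directly; the following NumPy sketch (function name and normalization choice are ours, following Eq. (2)) builds L = D − W and evaluates the smoothness estimate:

```python
import numpy as np

def smoothness(W, f):
    """I(G) in the form of Eq. (2): f^T L f / (2 * sum_ij W_ij),
    with L = D - W the (unnormalized) graph Laplacian."""
    L = np.diag(W.sum(axis=1)) - W
    return float(f @ L @ f) / (2.0 * W.sum())
```

A constant function f gives I(G) = 0 (perfectly smooth), while a function that separates strongly connected vertices gives a larger value.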
3 Regularized Semi-supervised Classification
We assume that if two points x_1, x_2 ∈ X are close in the input space, then the conditionals p(y | x_1) and p(y | x_2) are close in the intrinsic geometry of P_X, as in [1]. However, the adjacency matrix of the neighborhood graph does not take into consideration the information carried by the labeled examples. The regularization term
I (G ) , especially for binary case classifiers, is proportional to the number of separated neighbors. Therefore for labeled examples, if x i and x j have the same label, they should not be separated by the decision boundary, so we can redefine the relationship between x i and x j by strengthening it.
Consider all the sample points {x_1, x_2, …, x_{l+u}}. The Least Squares algorithm solves with the squared loss function Σ_{i=1}^{l} V(x_i, y_i, f) = Σ_{i=1}^{l} (y_i − f(x_i))², which is based on minimizing the error on the labeled examples. It is important to observe that:

2(l − 1) Σ_{i=1}^{l} (y_i − f(x_i))² ≥ Σ_{i,j=1}^{l} ((f(x_i) − f(x_j)) − (y_i − y_j))²   (3)

If Σ_{i=1}^{l} (y_i − f(x_i))² → 0, then

Σ_{i,j=1}^{l} ((f(x_i) − f(x_j)) − (y_i − y_j))² → 0   (4)
So if |y_i − y_j| < δ, then |f(x_i) − f(x_j)| < ε, where δ, ε → 0 and δ, ε > 0. Define the (l + u) × (l + u) matrix J as follows:

J_ij = 1 or W_ij, if i, j ≤ l and |y_i − y_j| < δ; J_ij = 0 or −W_ij, if i, j ≤ l and |y_i − y_j| ≥ δ; J_ij = 0, otherwise   (5)

This can be seen as a label information graph G′ = (V, E′), whose vertices are the labeled or unlabeled example points V = {x_1, x_2, …, x_{l+u}}, and whose edge weights J_ij represent appropriate pairwise label relationships between labeled examples i and j. So the right-hand side of equation (3) can be rewritten as:

Σ_{i,j=1}^{l+u} (f(x_i) − f(x_j))² J_ij   (6)
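Building the matrix of Eq. (5) is straightforward; the sketch below uses the {1, 0} weighting variant of Eq. (5) (choosing that variant, the function name, and the tolerance default are our illustrative assumptions):

```python
import numpy as np

def label_info_graph(y_labeled, n_total, delta=1e-6):
    """Label-information matrix J of Eq. (5), using the {1, 0} variant:
    J_ij = 1 when labeled examples i and j carry (numerically) the same
    label, 0 otherwise. Rows/columns beyond the first l stay zero."""
    l = len(y_labeled)
    J = np.zeros((n_total, n_total))
    for i in range(l):
        for j in range(l):
            if i != j and abs(y_labeled[i] - y_labeled[j]) < delta:
                J[i, j] = 1.0
    return J
```

The {W_ij, −W_ij} variant of Eq. (5) would instead copy (or negate) the neighborhood weights for same-label (or different-label) pairs.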
This term can be seen as the label information carried by the labeled examples; it penalizes classifiers that separate examples having the same label. The neighborhood graph and the label information graph have the same vertices and can be incorporated together. So the optimization problem has the following objective function:

min_{f∈H_K} H[f] = γ‖f‖²_K + Σ_{i,j=1}^{l+u} (f(x_i) − f(x_j))² (W_ij + J_ij) = γ‖f‖²_K + f̂^T L_a f̂   (7)

where L_a = D − (W + J), D is the diagonal matrix given by D_ii = Σ_{j=1}^{l+u} (W_ij + J_ij), and γ is a regularization parameter that controls the complexity of the clustering function. This has the same form as unsupervised regularized spectral clustering [1]; the representer theorem holds, and the solution can be expressed by the kernel of the examples as follows:

f(x) = Σ_{i=1}^{l+u} α_i K(x, x_i)   (8)
where α can be solved by an eigenvalue problem, and the regularization parameter γ can be selected by the L-curve approach. The complete semi-supervised learning algorithm consists of the following steps.

Step 1. Construct the adjacency graph G = (V, E) with (l + u) nodes using k nearest neighbors. Choose edge weights W_ij with binary or heat-kernel weights, construct the label information graph G′ = (V, E′), and then compute the Graph Laplacian L_a.

Step 2. Regularized semi-supervised classification. At this step, we use equation (7) as the objective function.

Step 3. Label the unlabeled examples. First, select one labeled example from M = {x_i, y_i}_{i=1}^{l}. Without loss of generality, we select {x_1, y_1}; all the examples clustered with {x_1, y_1} receive the same label y_1, while the others receive a different label. So for every {x_i, y_i} ∈ M, we get a predicted label ŷ_i.

Step 4. Compute Σ_{i=1}^{l} ‖y_i − ŷ_i‖². Stop if Σ_{i=1}^{l} ‖y_i − ŷ_i‖² ≤ threshold; otherwise, select the i-th labeled example where i = arg max_i ‖y_i − ŷ_i‖.

Step 5. Adjust the weights J_ij. For the selected i-th example, find the labeled examples j satisfying {‖y_j − y_i‖ ≤ δ, ‖y_j − ŷ_j‖ ≤ ε, 1 ≤ j ≤ l}, then adjust the weights J_ij and re-compute the matrix L_a. Go to Step 2.
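Steps 1-2 above can be sketched as follows. This is an illustrative sketch of the graph-construction part only, with a binary k-NN weighting; the function name, the brute-force distance computation, and k's default are our assumptions, not the authors' code:

```python
import numpy as np

def combined_laplacian(X, J, k=2):
    """Step 1 sketch: binary k-NN adjacency W over all (l+u) points,
    then the modified Laplacian La = D - (W + J) used in Eq. (7).
    X: (l+u, d) data matrix; J: label-information matrix from Eq. (5)."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dist.
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[1:k + 1]:             # k nearest, skip self
            W[i, j] = 1.0
    W = np.maximum(W, W.T)                               # symmetrize
    A = W + J
    return np.diag(A.sum(axis=1)) - A
```

Step 2 would then minimize the objective of Eq. (7) with this L_a, e.g. via the eigenvalue problem mentioned above.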
4 Experimental Results
We first conducted experiments on the two-moons dataset. The dataset contains 200 unlabeled sample points, and all labeled points are sampled randomly from the unlabeled points. Figure 1 shows the results of semi-supervised classification using
our algorithm on the two-moons dataset with added Gaussian noise, with 0 and 5 labeled points respectively. With 0 labeled points the method can be regarded as unsupervised regularized clustering. From the figure, it is clear that unsupervised classification failed to label the points correctly; after adding noise, the dataset no longer has a regular geometric structure. When using only 5 labeled points for each class with the proposed algorithm, we find the optimal solution, as shown in Figure 1. We also show experimental results on the USPS dataset. Using binary weights for the edges of the neighborhood graph, that is W_ij = 0 or 1, we show the results of 45 binary classification problems. We use the first 400 images for each handwritten digit, processed with PCA down to 100 dimensions as in [1]. As shown in Figure 2, we compare the error rates of Transductive SVM, LapSVM, and the proposed method at the precision-recall breakeven points in the ROC curves. Both experiments show clearly that our proposed algorithm is more robust than Transductive SVM and LapSVM, as given in Figure 2.
clustering
2
2
1
1
0
0
-1
-1 -1
0
1
2
-1
0
1
2
Fig. 1. Regularized Semi-supervised classification on two moons dataset added Guassian noise with 0, 5 labeled points respectively 8
15
LapSVM ReguSCoM
10
5
0 0
6 Error Rates
Error Rates
TSVM ReguSCoM
4
2
10 20 30 40 45 Classification Problems
0 0
10 20 30 40 45 Classification Problems
Fig. 2. Comparing the error rate of ReguSCoM, Transductive SVM, and LapSVM at the precision-recall breakeven points
5 Conclusions
Semi-supervised learning aims to benefit from a large number of unlabeled examples and a few labeled examples. In this paper, by combining the labeled and unlabeled examples, we define a modified Graph Laplacian and propose a new regularized semi-supervised classification algorithm. Our method yields encouraging experimental results on both synthetic data and real-world datasets. In future work, we will investigate other approaches to constructing the graph from labeled and unlabeled examples, to further improve the performance of semi-supervised learning algorithms.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (60373029) and the National Research Foundation for the Doctoral Program of Higher Education of China (20050004001). The authors would like to thank Dr. M. Belkin for useful suggestions.
References
1. Belkin, M., Niyogi, P., Sindhwani, V.: Manifold Regularization: A Geometric Framework for Learning from Examples. Department of Computer Science, University of Chicago, TR-2004-06
2. Belkin, M., Niyogi, P., Sindhwani, V.: On Manifold Regularization. Department of Computer Science, University of Chicago, TR-2004-05
3. Belkin, M., Niyogi, P.: Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15(6) (2003) 1373-1396
4. Blum, A., Mitchell, T.: Combining Labeled and Unlabeled Data with Co-training. In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory. Madison, Wisconsin, USA. ACM (1998) 92-100
5. Szummer, M., Jaakkola, T.: Partially Labeled Classification with Markov Random Walks. In: Dietterich, T.G., Becker, S., Ghahramani, Z. (eds.): Advances in Neural Information Processing Systems 14. MIT Press (2001) 945-952
6. Zhu, X., Ghahramani, Z., Lafferty, J.: Semi-supervised Learning Using Gaussian Fields and Harmonic Functions. In: Fawcett, T., Mishra, N. (eds.): Proceedings of the Twentieth International Conference on Machine Learning. USA. AAAI Press (2003) 912-919
7. Blum, A., Chawla, S.: Learning from Labeled and Unlabeled Data Using Graph Mincuts. In: Brodley, C.E., Danyluk, A.P. (eds.): Proceedings of the Eighteenth International Conference on Machine Learning. USA. Morgan Kaufmann (2001) 19-26
8. Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Schoelkopf, B.: Learning with Local and Global Consistency. In: Thrun, S., Saul, L., Schoelkopf, B. (eds.): Advances in Neural Information Processing Systems 16. Cambridge, MA. MIT Press (2004)
9. Bousquet, O., Chapelle, O., Hein, M.: Measure Based Regularization. In: Thrun, S., Saul, L., Schoelkopf, B. (eds.): Advances in Neural Information Processing Systems 16. Cambridge, MA. MIT Press (2004)
10. Szummer, M., Jaakkola, T.: Information Regularization with Partially Labeled Data. In: Becker, S., Thrun, S., Obermayer, K. (eds.): Advances in Neural Information Processing Systems 15. Canada. MIT Press (2003) 1025-1032
11. Krishnapuram, B., Williams, D., Xue, Y., Hartemink, A., Carin, L., Figueiredo, M.A.T.: On Semi-Supervised Classification. In: Saul, L.K., Weiss, Y., Bottou, L. (eds.): Advances in Neural Information Processing Systems 17. Cambridge, MA. MIT Press (2005) 721-728
A Cerebellar Feedback Error Learning Scheme Based on Kalman Estimator for Tracing in Dynamic System*

Liang Liu, Naigong Yu, Mingxiao Ding, and Xiaogang Ruan**

School of Electronic Information and Control Engineering, Beijing University of Technology, Beijing 100022, P.R. China
{liuliang, mxdin}@emails.bjut.edu.cn, {yunaigong, adrxg}@bjut.edu.cn
Abstract. Motivated by recent physiological and anatomical evidence, a new feedback error learning scheme is proposed for tracing in motor control system. In the scheme, the model of cerebellar cortex is regarded as the feedforward controller. Specifically, a neural network and an estimator are adopted in the cerebellar cortex model which can predict the future state and eliminate faults caused by time delay. Then the new scheme was used to control inverted pendulum. The simulation experimental results show that the new scheme can learn to control the inverted pendulum for tracing successfully.
1 Introduction
The cerebellum has been called the "brain's engine of agility," since it plays a key role in robust, adaptive neuromuscular control of multi-jointed movements and helps to achieve smooth and dynamically efficient trajectories [1]. Presently, research at the intersection of cerebellar neurobiology and control engineering is very active [2]. Several early computational cerebellar models are derived from the regular structure of the cerebellum. In Albus' perceptron model, inputs are summed in association units and connected via adjustable weights to an output unit. Marr's view was that parallel fiber-Purkinje cell synapses should increase when a parallel fiber is active at about the time that a climbing fiber fires [3]; in this way, Purkinje cells become selectively responsive to parallel fiber patterns. Albus, however, proposed an error-correction rule: a weight should decrease if the Purkinje cell was erroneously active. This is supported by Ito et al.'s later experiments. The concept of an internal model was first introduced into the field of neurophysiology by Ito [4], who suggested that there are internal models for motor learning and control in the central nervous system (CNS). Functional magnetic resonance imaging (fMRI) experiments on cerebellar activity showed that internal models do exist in the cerebellum [5]. Kawato proposed the feedback error learning (FEL) model, which comprises a fixed feedback controller that ensures the stability of the system and an adaptive feedforward controller that improves control performance [6]. The FEL scheme in the cerebellum is supported by advances in neurophysiology [7].
* This work is supported by National Natural Science Foundation of China (60375017). ** Corresponding author.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 489 – 495, 2006. © Springer-Verlag Berlin Heidelberg 2006
One landmark of the work in cerebellar neurobiology and control engineering is Paulin's research on sensory prediction for dynamic state estimation and control [8]. In his view, the cerebellum can learn an implicit model of the dynamics between motor output and sensory input, from which to provide motor command structures with predictive information about relevant state variables. Inspired by his idea, this paper proposes a new feedback error learning scheme containing a cerebellar cortex model. The cerebellar cortex model can learn the dynamic inverse model of the controlled object according to the instructive information from the outputs of the feedback controller and the estimator, which is used to change the weights of the forward model. The inverted pendulum is adopted as the control object for the new scheme. This paper is organized as follows: in Section 2, present models of the cerebellar cortex are reviewed; in Section 3, the principle of the Kalman estimator is described; in Section 4, based on the FEL model and its learning regulation, a new feedback error learning scheme for the motor control system is designed; in Section 5, simulations on the inverted pendulum are performed to verify the effectiveness of the new scheme (problem formulation, followed by analysis of the simulation results that validate the proposed FEL scheme); finally, conclusions are stated in Section 6.
2 Cerebellar Cortex Model
According to Kawato's FEL model, the engineering abstraction model of the cerebellar cortex is shown in Fig. 1(a). In his view, the connection synapses between parallel fibers (PF) and Purkinje cells (PC) can be adjusted by the teacher signal, which is provided through the climbing fiber (CF) channel. The output of the model provides the feedforward motor command (FFMC).

[Figure 1 comprises two block diagrams built from mossy fibers (MF), granule cells (GC), parallel fibers (PF), Purkinje cells (PC), climbing fibers (CF), the inferior olive (IO) and nuclear cells (NUC), with excitatory and inhibitory connections and a teacher signal: (a) Kawato's cerebellar cortex model; (b) the cerebellar microcomplex.]
Fig. 1. The models of cerebellum
A model of the adaptive control of saccades was developed by Schweighofer et al. [9]. This model inherits the idea of a structural and functional unit suggested by Ito [4], called a microcomplex, as shown in Fig. 1(b). Input excites granule cells (GC) and nuclear cells (NUC), which provide the output of the unit via mossy fibers (MF). The PC receive input from the PF, and NUC activity is modulated by inhibitory
action from the PC. Learning occurs at the Purkinje cell-parallel fiber synapses as a function of a second input system, the CF originating in the inferior olive (IO). In most on-line learning cases, however, a problem is that the error signal is delayed and has to be temporally matched with the action that caused it. Therefore, a device that can eliminate the delay is indispensable, and an estimator is introduced.
3 Kalman Estimator
The Kalman filter is taken as the estimator because of its powerful ability to support estimation of future states, even when the precise nature of the modeled system is unknown [10]. Consisting of a set of mathematical equations that provide an efficient computational method to estimate the state of a process, the Kalman filter addresses the general problem of trying to estimate the state x ∈ ℝⁿ of a discrete-time controlled process.
x(t + 1) = Ax(t ) + Bu (t ) + w(t ) ,
(1)
where x(t ) is the state of the error signal and w(t ) is a random driving force of the system, having covariance Γ(t ) . The measurement data are y (t ) = Cx(t ) + v(t ) ,
(2)
where v(t ) is the measurement error term. It has covariance ∑ (t ) . The optimal estimate xˆ(t + 1) of system’s error state based on measurement up to time t is showed by
xˆ (t + 1) = [ A − K (t )C ]xˆ (t ) + K (t ) y (t ) .
(3)
The Kalman gain K determines the strength of coupling between the new data and the current state estimate. Its formulation shown as K (t ) = APC T [CPC T + ∑(t )]−1 ,
(4)
where P represents the covariance of the current state estimate, and can be derived from the matrix Riccati equation:

P(t) = AP(t−1)Aᵀ + Γ(t−1) − AP(t−1)Cᵀ[CP(t−1)Cᵀ + Σ(t−1)]⁻¹[AP(t−1)Cᵀ]ᵀ   (5)
The equations (3)-(5) above are Kalman filter equations. They can be modified to optimally estimate the state at any time, based on current available information [10].
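One step of the recursion of Eqs. (3)-(5) can be sketched directly in NumPy. This is a generic illustration of those three equations, not the authors' implementation; the function name and argument order are ours:

```python
import numpy as np

def kalman_step(A, C, Gamma, Sigma, x_hat, P, y):
    """One step of Eqs. (3)-(5): gain per Eq. (4), state estimate update
    per Eq. (3), Riccati covariance recursion per Eq. (5)."""
    APC = A @ P @ C.T
    S_inv = np.linalg.inv(C @ P @ C.T + Sigma)
    K = APC @ S_inv                                      # Eq. (4)
    x_next = (A - K @ C) @ x_hat + K @ y                 # Eq. (3)
    P_next = A @ P @ A.T + Gamma - APC @ S_inv @ APC.T   # Eq. (5)
    return x_next, P_next
```

In the scalar case with A = C = 1, Γ = 0 and Σ = P = 1, the gain is K = 0.5, so the new estimate is the midpoint of the old estimate and the measurement.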
4 Feedback Error Learning Scheme
The FEL model has been successfully applied to various control tasks, such as the control of a robot arm with pneumatic actuators. Here we aim to establish a new FEL scheme based on the FEL model.
4.1 New FEL Scheme
In this paper, in order to use the inverted pendulum for the tracing task in the later experiments, an inverse model must be trained to generate the appropriate command for a desired movement of the plant. We propose a new feedback error learning scheme for tracing that requires information from three sources: 1) the input and output of the NNFC (Neural Network Feedforward Controller), which specify the current tracing requirement and previous performance, respectively; 2) the output of the CFC (Conventional Feedback Controller), which adjusts the NNFC as in the original FEL; and 3) the output of the Kalman estimator, which serves as an additional teacher signal as in [11]. The framework of our new feedback error learning scheme for tracing is shown in Fig. 2.
[Figure 2 is a block diagram: the desired trajectory r(t) is compared with the actual trajectory y to form the errors e(t) and e(t+1); the cerebellar cortex (NNFC with Kalman estimator) produces u_ff, the CFC produces u_fb, and the summed command u drives the controlled object.]
Fig. 2. The new FEL scheme with a cerebellar cortex model
4.2 The Learning Regulation for the New FEL Scheme
The new feedback error learning scheme for tracing is formulated as follows. The controlled object's dynamic equation is defined as:

f(θ, θ̇, x, ẋ) = u   (6)

We define the error signal as:

e = [θ_d − θ, x_d − x]^T   (7)

The response of the CFC is formulated as:

u_fb = K_d ė + K_p e   (8)

The NNFC is expressed as:

u_ff = Φ(e, w)[αr + (1 − α)y]   (9)

where w is the synaptic weight vector of the NNFC and |α| ≤ 1 is the coefficient of the inputs. Thus the total control law is given as:

u = u_fb + u_ff = K_d ė + K_p e + Φ(e, w)[αr + (1 − α)y]   (10)
With respect to 0 < μ < 1 and k^T ≥ 0, the learning rule in the NNFC is

dw/dt = −η (∂Φ/∂w)^T (μ u_fb + (1 − μ) k^T e)   (11)
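A discretized version of learning rule (11) can be sketched as follows. This is an illustrative Euler-discretized sketch under our own assumptions: the Jacobian ∂Φ/∂w is assumed available (e.g. via backpropagation through the NNFC), and all names and default values are hypothetical.

```python
import numpy as np

def nnfc_weight_update(w, phi_grad, u_fb, e, k, eta=0.05, mu=0.8, dt=0.01):
    """Euler step of Eq. (11): dw/dt = -eta * (dPhi/dw)^T
    (mu * u_fb + (1 - mu) * k^T e).
    phi_grad: Jacobian dPhi/dw, shape (n_outputs, n_weights)."""
    teacher = mu * u_fb + (1.0 - mu) * float(k @ e)
    return w - dt * eta * phi_grad.T @ np.atleast_1d(teacher)
```

With μ close to 1 the rule reduces to classical feedback-error learning, driven almost entirely by the CFC output u_fb.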
5 Simulation and Experimental Results
In this part, the inverted pendulum is used as a typical nonlinear plant to verify the new feedback error learning scheme.

5.1 Problem Formulation
A SIMO system in real circumstances, the inverted pendulum balancing task involves a pendulum mounted on top of a wheeled cart that travels along a track. This system is described by the following equations of motion:

(M + m)ẍ + ml cos θ θ̈ + dẋ − ml sin θ (θ̇)² = u   (12)

(4ml²/3)θ̈ + ml cos θ ẍ − mgl sin θ = 0   (13)
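Equations (12)-(13) are linear in the accelerations (ẍ, θ̈), so each simulation step can solve a 2×2 system and integrate. The sketch below is ours, with a simple Euler integrator; the parameter values (masses, length, friction) are illustrative, not taken from the paper:

```python
import numpy as np

def pendulum_step(state, u, dt=0.01, M=1.0, m=0.1, l=0.5, d=0.1, g=9.8):
    """One Euler step of Eqs. (12)-(13). state = [theta, theta_dot, x, x_dot]."""
    th, thd, x, xd = state
    # Rows: Eq. (12) and Eq. (13), unknowns [x_ddot, theta_ddot].
    A = np.array([[M + m,              m * l * np.cos(th)],
                  [m * l * np.cos(th), 4 * m * l**2 / 3]])
    b = np.array([u - d * xd + m * l * np.sin(th) * thd**2,
                  m * g * l * np.sin(th)])
    xdd, thdd = np.linalg.solve(A, b)
    return np.array([th + dt * thd, thd + dt * thdd,
                     x + dt * xd, xd + dt * xdd])
```

The upright state with zero input is a fixed point of this map, consistent with the balancing goal stated above.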
The goal of the inverted pendulum balancing and tracing task is to balance the upright pole and follow the reference signal within the range of the rail in a short time; that is, the reference signal is r = [θ_d, θ̇_d, x_d, ẋ_d]^T. The desired position of the cart can be set to any reasonable point on the rail; without loss of generality, the rail origin is selected here as the desired position of the cart.

5.2 Simulation Experimental Results
Tracing. Firstly, the desired trajectory was given as r(t) = 0, similar to the balancing problem. The performance of the proposed new FEL scheme, depicted in Fig. 3(b), is superior to that of the FEL model shown in Fig. 3(a). The results indicate that the more information about the future is available in the devised cerebellar cortex model, the more favorable the obtained trajectory.
[Figure 3 shows two panels plotting balance performance (pole angle, traced trajectory, cart position) against time over 0-5 s: (a) the curves of the FEL model; (b) the curves of the new FEL scheme.]
Fig. 3. The tracing performances with zero reference signal
Robustness. Then the reference signal is given as r(t) = A sin(4πt), and the track length of the wheeled cart is limited by |x| < 0.6 m. The results can be seen in Fig. 4(a). Then some disturbances are added to the controller's output. When two increments are set as Δu = 350 (t = 1 s) and Δu = 500 (t = 2.5 s), the actual trajectory in Fig. 4(b) still shows good tracing curves.
[Figure 4 shows two panels plotting balance performance (pole angle, traced trajectory, cart position) against time over 0-5 s: (a) the curves without disturbance; (b) the curves with disturbances.]
Fig. 4. The tracing performance of the new FEL scheme is presented by adopting a sine function as the input in the left column. In the right column, the robustness of the new scheme is shown by adding disturbances at times 1 s and 2.5 s, with increments 350 and 500 respectively, to the summed controller output conforming to u = u_ff + u_fb.
6 Conclusions
A novel feedback error learning scheme for tracing in motor control systems has been described. Inspired by the cerebellar circuitry, a cerebellar cortex model has been developed within the new scheme. A neural network is used to simulate the synaptic plasticity resulting from correlations in activity between the parallel fibers and Purkinje cells, driven by the teacher signal arriving via climbing fiber inputs, and a Kalman estimator is introduced to enhance the capacity for dynamic state estimation. In the motor control system, the devised cerebellar cortex model learns the dynamic inverse model of the controlled object according to the instructive information from both the feedback controller and the estimator, which is used to change the weights of the forward model. The simulation results show that this new FEL scheme not only ensures error convergence, as in conventional adaptive control, but also acquires an exact inverse of the controlled object without loop delay. We anticipate building on this type of scheme for explaining cerebellar function and for use in robotic applications in the future.
References
1. Wickelgren, I.: The Cerebellum: The Brain's Engine of Agility. Science 281 (1998) 1588-1590
2. Wolpert, D.M., Miall, R.C., Kawato, M.: Internal Models in the Cerebellum. Trends in Cognitive Sciences 2 (1998) 338-347
A Cerebellar Feedback Error Learning Scheme Based on Kalman Estimator
495
3. Marr, D.: A Theory of Cerebellar Cortex. Journal of Physiology 202 (1969) 437-470
4. Ito, M.: The Cerebellum and Neural Control. Raven Press, New York (1984)
5. Tamada, T., Miyauchi, S., Imamizu, H., Yoshioka, T., Kawato, M.: Activation of the Cerebellum in Grip Force and Load Force Coordination: an fMRI Study. Neuroimage 6 (1999) S492
6. Kawato, M., Gomi, H.: A Computational Model of Four Regions of the Cerebellum Based on Feedback Error Learning. Biological Cybernetics 69 (1992) 95-103
7. Doya, K., Kimura, H., Kawato, M.: Neural Mechanisms of Learning and Control. IEEE Control Systems Magazine 4 (2001) 42-54
8. Paulin, M.G.: The Role of the Cerebellum in Motor Control and Perception. Brain Behavior and Evolution 41(1) (1993) 39-50
9. Schweighofer, N., Arbib, M.A., Dominey, P.F.: A Model of the Cerebellum in Adaptive Control of Saccadic Gain. I. The Model and Its Biological Substrate. Biological Cybernetics 75 (1996) 19-28
10. Davis, M.H.A., Vinter, R.B.: Stochastic Modelling and Control. Chapman and Hall, London, New York (1985)
11. Luo, Z.W., Fujii, S., Saitoh, Y., Muramatsu, E., Watanabe, K.: Feedback-error Learning for Explicit Force Control of a Robot Manipulator Interacting with Unknown Dynamic Environment. Proceedings of the IEEE International Conference on Robotics and Biomimetics 2004, 448, August 2004
An Optimal Iterative Learning Scheme for Dynamic Neural Network Modelling Lei Guo1,2 and Hong Wang2 1 Research Institute of Automation, Southeast University, Nanjing 210096, China
[email protected] 2 Control Systems Centre, The University of Manchester, Manchester, M60 1QD, UK
[email protected]
Abstract. In this paper, iterative learning control is applied to a dynamic neural network modelling process for nonlinear (stochastic) dynamical systems. When (B-spline) neural networks are used to model a nonlinear dynamical functional, such as the output stochastic distributions in repetitive processes or batch processes, the parameters in the basis functions will be tuned using the generalized iterative learning control (GILC) scheme. The GILC scheme differs from the classical model-based ILC laws in that no dynamical input-output differential or difference equations are given a priori. For the "model free" GILC problem, we propose an optimal design algorithm for the learning operators by introducing an augmented difference model in state space. A sufficient and necessary solvability condition is given, for which a robust optimization solution with an LMI-based design algorithm can be provided.
1 Introduction
Neural network (NN) based approximations have been effectively used in many modelling and control problems for complex dynamic systems [1,2]. In [1], the genetic algorithm has been used in the optimization of the parameters of radial basis function (RBF) networks. In [2] the orthogonal least squares algorithm with basis pursuit and D-optimality has been studied. Repetitive processes exist in many industrial cases (e.g. the batch processes), where the NN models can be adjusted based on the obtained information. How to use the information learned in the last batch to perform and improve a new nonlinear model dynamically is a significant and challenging problem. One typical example is motivated from stochastic systems, where the so-called stochastic distribution control (SDC) problem has been effectively applied in chemical and mechatronics engineering [3]. In this paper we will use the B-spline expansions in SDC problems to illustrate the iterative learning strategies for the parameter optimization in the NN-based modelling. It is noted that the proposed approaches are not limited to SDC problems. In SDC problems, it is supposed that the probability density functions (PDFs) of the system output have been measured or estimated. To transform the infinite dimensional control problem into a finite dimensional one, B-spline NNs have been adopted to
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 496 – 501, 2006. © Springer-Verlag Berlin Heidelberg 2006
approximate the PDFs so that the PDF tracking and the fault detection problems can be performed [4-6]. However, the basis functions in the B-spline expansions have been supposed to be fixed in a deterministic domain. One possible drawback is that dynamical changes may lead to serious approximation errors. Otherwise, one may unnecessarily select redundant basis functions so that the weighting system has a high dimension.
The basic idea of ILC is to improve the control signal for the present operation cycle by feeding back the control error from the previous cycle. ILC has shown its applications in many processes, as an established control method in situations where the purpose of the system is to perform a particular operation repeatedly [8-10]. However, for such a learning process with respect to the parameters of a group of B-spline basis functions for a nonlinear distribution, most of the classical ILC laws are no longer available. Most mature ILC schemes are designed for plants with given linear models, whereas here the measurement for optimization or learning is a functional rather than the value of some signals. In [7], an iterative learning control (ILC) framework has been discussed for the SDC problem, where at each batch process the parameters of the basis functions can be tuned via an approximate ILC law. The primary idea in [7] is, at each step, to make the tracking error between the practical PDF and the target one smaller than at the last step, where linearization has to be used to provide a convergence criterion. However, it is difficult for such a criterion to provide a constructive design algorithm with the required constraints on the weights. In this paper, we consider the real-time optimization of the B-spline expansions based on the PDFs by using the ILC framework. The iterative learning process is used to tune the parameters that decide the shape and location of the basis functions.
The proposed results can be applied to general neural network modelling processes for nonlinear dynamics. The problem can also be regarded as a generalized dynamical identification problem if the parameters of the B-spline neural network are considered as the system parameters. By introducing an augmented vector related to the performance index function, the original generalized ILC (GILC) problem can be formulated as a classical nonlinear performance control problem, for which some nonlinear or robust design methodologies can be applied. It is noted that this framework also has potential significance in other ILC problems with nonlinear (non-quadratic) performance functions.
2 Formulation of the GILC Problem
The following B-spline model for the SDC problems can be found in [3-6]. To simplify the mathematical description, we only consider one batch where only the selection of the basis functions is considered. For a dynamic stochastic system with the input u_{k,i}(t) ∈ R^m and the output y_{k,i}(t) ∈ [a, b], the conditional PDF of the output y_{k,i}(t) is denoted by γ(u_{k,i}(t), z). The index k represents the step of batch while i stands for the step of control within the k-th batch. In the previous works, the basis functions are usually supposed to be fixed while the weights can be controlled (see [3-6]). In this paper the linear B-spline model is adopted as follows
γ(θ_k, u_{k,i}, z) = B(θ_k, z) V(u_{k,i}, t) + h( V(u_{k,i}, t) ) b_n(θ_k, z)        (1)
where h(V(t)) is a known nonlinear function treated as an irrelevant term in this paper (see [4]). In the tracking problem of SDC, the main task is to find the control law and the B-spline expansions such that the measured output PDFs can follow a target one, denoted by γ_g(z). For a group of B-spline basis functions, the parameters to be tuned in the k-th batch are denoted by θ_k. For example, θ_k can be defined as the height and/or width of the quadratic polynomials (see [7]), or the mean and variance of a Gaussian distribution. In this paper, only the learning law of θ_k will be considered.
The classical model-based ILC scheme needs a well-established plant model. The purpose of ILC is to find the learning rate such that the closed-loop system is convergent and satisfies the stability and other performance requirements. Simple comparison shows that two main differences exist between the classical model-based ILC formulations and the considered problem. Firstly, equality (1), as the relationship connecting the weights and the parameters in the basis functions with the input, is not a classical model in the conventional context, where a differential or difference model has been provided for learning. To some extent, such a problem is a type of "model free" problem since no conventional models can be used for learning. Secondly, the "model free" situation requires a generalization of the traditional output error since neither y_k − y_{k−1} nor y_k − y_d can be measured for the learning procedures. Instead, the measured information is the stochastic distribution of the system output. Noting that the objective of the learning B-spline modelling is to find θ_k such that the B-spline expansion (1) can converge to a target PDF γ_g(z), we select, as an analogue of the conventional tracking error y_k − y_d,

J_{k,i}(θ_k) = ∫_a^b ( γ(θ_k, u_{k,i}, z) − γ_g(z) )² dz        (2)

as the "error distance" for the learning procedure. Denote the cumulative distance J_k(θ_k) = Σ_{i=1}^m J_{k,i}(θ_k) as the error for learning. The learning law for θ_k can be given by

θ_k = ϒ_1 θ_{k−1} + ϒ_2 J_{k−1}(θ_{k−1})        (3)
where ϒ_1 and ϒ_2 are two learning operators to be determined. The objective is reduced to finding ϒ_1 and ϒ_2 such that J_k(θ_k) can be minimized in the k-th batch process. For the above problem, few available ILC design algorithms are applicable. It is shown that not only the measurement for learning but also the performance function for optimization will be a nonlinear function of θ_k. It is noted that even for the classical ILC problems, most feasible methods depend on linearization [8]. To this end, we denote an auxiliary vector ξ_k = J_k(θ_k) and introduce a new quadratic performance index ‖ξ_k‖_2². Thus, the above iterative learning problem for θ_k can be formulated as

θ_k = ϒ_1 θ_{k−1} + ϒ_2 ξ_{k−1}
ξ_k = J_k( ϒ_1 θ_{k−1} + ϒ_2 ξ_{k−1} )

subject to the performance cost function ‖ξ_k‖_2², for which classical control methods can be applied. In order to obtain a feasible algorithm with a satisfactory convergence rate, we will perform a global optimization procedure using an H_∞ formulation.
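To make the error distance (2) concrete, the sketch below approximates J(θ) by a Riemann sum on a grid. The Gaussian-shaped basis functions (with θ holding mean/width pairs), the fixed weight, and the grid resolution are all illustrative assumptions; the paper's actual bases are B-splines.

```python
import numpy as np

# Hypothetical Gaussian-shaped "basis functions"; theta holds (mean, width)
# pairs, standing in for the tunable parameters of the real B-spline bases.
def model_pdf(theta, weights, z):
    means, widths = theta[:, 0], theta[:, 1]
    basis = np.exp(-0.5 * ((z[:, None] - means) / widths) ** 2)
    dz = z[1] - z[0]
    basis = basis / (basis.sum(axis=0) * dz)  # normalize each basis to unit area
    return basis @ weights

def error_distance(theta, weights, target, z):
    """J(theta) = integral over [a,b] of (gamma(theta,z) - gamma_g(z))^2 dz, cf. Eq. (2)."""
    dz = z[1] - z[0]
    diff = model_pdf(theta, weights, z) - target
    return float(np.sum(diff ** 2) * dz)

z = np.linspace(0.0, 4.0, 401)                  # output interval [a, b]
dz = z[1] - z[0]
target = np.exp(-0.5 * ((z - 2.0) / 0.5) ** 2)
target = target / (target.sum() * dz)           # target PDF gamma_g(z)

weights = np.array([1.0])
J_good = error_distance(np.array([[2.0, 0.5]]), weights, target, z)  # matches target
J_bad = error_distance(np.array([[1.0, 0.3]]), weights, target, z)   # mislocated basis
```

A smaller J signals a basis-function parameterization whose induced PDF is closer to the target, which is exactly the quantity the learning law (3) feeds back.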
3 A Solution of the Transformed GILC Problem
It is shown that the equation ξ_k = J_k(θ_k) = J_k(ϒ_1 θ_{k−1} + ϒ_2 ξ_{k−1}) is highly nonlinear with respect to θ_k, depending on the form of the basis functions. Within the k-th batch, one can obtain J_k(θ_k) = F θ_k + G_1 r + G_2 δ_{k−1} to represent the linearization of J_k(θ_k) with respect to θ_k, where F, G_1 and G_2 are known matrices, r is a known input related to the target PDF, and δ_{k−1} is the modelling error and unmodelled dynamics, which can be supposed to satisfy ‖δ_k‖_2 < 1 by selecting a proper G_2. As such, the original system can be further described as

θ_k = ϒ_1 θ_{k−1} + ϒ_2 ξ_{k−1}
ξ_k = F ϒ_1 θ_{k−1} + F ϒ_2 ξ_{k−1} + G_1 r + G_2 δ_{k−1}        (4)
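Under the linearization, recursion (4) is a linear iteration whose convergence is governed by the spectral radius of the augmented matrix A = F̄ϒ. The numpy sketch below, with all matrices invented for illustration, simulates (4) for hand-picked learning operators and checks that the iterates settle when ρ(A) < 1.

```python
import numpy as np

n = 2                                        # toy dimension for theta_k and xi_k
F = 0.5 * np.eye(n)                          # invented linearization matrix
G1 = 0.2 * np.eye(n)
r = np.array([1.0, -0.5])                    # known input related to the target PDF
Y1, Y2 = 0.6 * np.eye(n), -0.3 * np.eye(n)   # hand-picked learning operators

# Augmented iteration matrix A = Fbar @ [Y1 Y2], with Fbar = [[I], [F]]
Fbar = np.vstack([np.eye(n), F])
A = Fbar @ np.hstack([Y1, Y2])
rho = max(abs(np.linalg.eigvals(A)))         # spectral radius; < 1 => convergence

theta, xi = np.ones(n), np.zeros(n)
for _ in range(200):                         # batch-to-batch recursion (4), delta = 0
    theta_next = Y1 @ theta + Y2 @ xi
    xi = F @ theta_next + G1 @ r             # xi_k = F theta_k + G1 r
    theta = theta_next
```

After enough batches the pair (theta, xi) reaches the fixed point of the recursion, illustrating why the design problem reduces to choosing ϒ_1, ϒ_2 that make A stable.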
It is shown that the transformed formulation (4) can be considered as a model-based ILC problem [9,10]. In this paper, in order to guarantee the convergence performance and the robust performance, a robust optimization problem will be studied. Define ρ_k = W_1 θ_k + W_2 ξ_k as the reference output, where W_i (i = 1, 2) are two weighting matrices. Then we can formulate the following H_∞ optimization problem: find ϒ_i (i = 1, 2) such that (4) is asymptotically stable in the absence of the exogenous inputs and satisfies ‖ρ_k‖_2 < ν, where ν is a given parameter. To simplify the description, in this paper we suppose that the initial condition is zero. The following result gives a sufficient and necessary condition for the above optimization problem based on LMIs. Denote ϒ = [ϒ_1 ϒ_2], W = [W_1 W_2], A = F̄ ϒ, and

F̄ = [ I ; F ],  G = [ 0 0 ; G_1 G_2 ],  φ_k = [ θ_k ; ξ_k ],  d_k = [ r ; δ_k ]

Theorem. System (4) is asymptotically stable in the absence of d_k and satisfies ‖ρ_k‖_2 < ν if and only if there are matrices Q > 0 and R satisfying

[ −Q + WᵀW    0       RᵀF̄ᵀ ;
    0        −γ²I      Gᵀ   ;
   F̄R         G        −Q  ] < 0        (5)

where γ² = ν² / ( ‖r‖_2² + 1 ). In this case, the learning rates can be calculated by ϒ = R Q⁻¹.

Proof: (4) can be rewritten as
φ_{k+1} = A φ_k + G d_k, for which it has been shown that it is asymptotically stable in the absence of d_k and satisfies ‖ρ_k‖_2² < γ² ‖d_k‖_2² if and only if there exists a matrix P > 0 satisfying

[ AᵀPA − P + WᵀW    AᵀPG ;
      GᵀPA          −γ²I ] < 0

Using the Schur complement and Q = P⁻¹, it can be verified that (5) is equivalent to the above inequality with R = ϒQ. In this case, it can be seen that ‖ρ_k‖_2² < γ² ‖d_k‖_2² ≤ γ² ( ‖r‖_2² + 1 ) = ν². Q.E.D.
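The key step in the proof is the Schur complement lemma: for a symmetric block matrix M = [[X, Y], [Yᵀ, Z]] with Z < 0, M < 0 holds if and only if the Schur complement X − Y Z⁻¹ Yᵀ < 0. The small numerical check below (random matrices, unrelated to the paper's data) illustrates the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

def is_neg_def(M):
    """Negative definiteness of a symmetric matrix via its largest eigenvalue."""
    return bool(np.linalg.eigvalsh((M + M.T) / 2).max() < 0)

# A random symmetric negative definite block matrix M = [[X, Y], [Y^T, Z]]
B = rng.standard_normal((5, 5))
M = -(B @ B.T + np.eye(5))                   # negative definite by construction
X, Y, Z = M[:3, :3], M[:3, 3:], M[3:, 3:]

schur = X - Y @ np.linalg.inv(Z) @ Y.T       # Schur complement of Z in M

# Lemma: M < 0  <=>  Z < 0 and X - Y Z^{-1} Y^T < 0
lemma_holds = is_neg_def(M) == (is_neg_def(Z) and is_neg_def(schur))
```

This is exactly the manipulation that turns the quadratic-in-P stability inequality into the linear matrix inequality (5) in the unknowns Q and R.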
Remark: The above result also provides an available design algorithm for the learning rates, where (5) is a linear matrix inequality (LMI) with respect to Q > 0 and R. A PI-type GILC is expected to eliminate the static error, which will be considered in future research [8]. Using proper B-spline basis functions and the generalized distance, it is possible that J_k(θ_k) can be selected as a linear function of θ_k so that the linearization procedure is unnecessary. In some cases, constraints are required in the learning procedures. These topics will be further discussed in the future.
4 Conclusions
For the NN modelling procedure for a nonlinear dynamical output function used in SDC problems, we present a new type of ILC law to adjust the parameters in the basis functions. One feature is that no classical models are required for the ILC process. Secondly, the information for learning is the generalized distance between the measured PDFs and the target PDF. An augmented vector is introduced such that the GILC problem can be transformed into an H_∞ optimization one, where an LMI-based design method can be obtained for the two learning rates. It is noted that the GILC can be applied to some general parameter learning problems in neural network modelling.

Acknowledgement. This work is supported by the NSF of China (No. 60474050 and 60472065), the Program of NCET of China and the Short Visiting Fund of the Royal Society of UK.
References
1. Chen, S., Wu, Y., Luk, B.L.: Combined Genetic Algorithm Optimization and Regularized Orthogonal Least Squares Learning for RBF Networks. IEEE Trans. on Neural Networks 10 (1999) 1239-1244
2. Hong, X., Brown, M., Chen, S., Harris, C.J.: Sparse Model Identification Using Orthogonal Forward Regression with Basis Pursuit and D-Optimality. IEE Proc.-Control Theory Appl. 151 (2004) 491-498
3. Wang, H.: Bounded Dynamic Stochastic Systems: Modelling and Control. Springer-Verlag, London (2000)
4. Guo, L., Wang, H.: PID Controller Design for Output PDFs of Stochastic Systems Using Linear Matrix Inequalities. IEEE Trans. on Systems, Man and Cybernetics, Part B 35 (2005) 65-71
5. Guo, L., Wang, H.: Applying Constrained Nonlinear Generalized PI Strategy to PDF Tracking Control through Square Root B-Spline Models. International Journal of Control 77 (2004) 1481-1492
6. Guo, L., Wang, H.: Fault Detection and Diagnosis for General Stochastic Systems Using B-Spline Expansions and Nonlinear Filters. IEEE Trans. on Circuits and Systems-I: Regular Papers 52 (2005) 1644-1652
7. Wang, H., Zhang, J.F., Yue, H.: Iterative Learning Control of Output Shaping in Stochastic Systems. Proceedings of the 2005 IEEE International Symposium on Intelligent Control, Cyprus (2005) 1225-1230
8. Xu, J.X., Tan, Y.: Linear and Nonlinear Iterative Learning Control. Springer-Verlag, London (2003)
9. Norrlof, M., Gunnarsson, S.: Experimental Comparison of Some Classical Iterative Learning Control Algorithms. Technical report, Control & Communication Group, Linkoping. http://www.control.isy.liu.se/publications (2002)
10. Amann, N., Owens, D.H., Rogers, E.: Predictive Optimal Iterative Learning Control. International Journal of Control 69 (1998) 203-226
Delayed Learning on Internal Memory Network and Organizing Internal States Toshinori Deguchi1 and Naohiro Ishii2 1
Gifu National College of Technology, Gifu, 501–0495, Japan
[email protected] 2 Aichi Institute of Technology, Aichi, 470–0392, Japan
[email protected]
Abstract. Elman presented a network with a context layer for time-series processing. The context layer keeps the output of the hidden layer and is connected to the hidden layer for the next calculation of the time series. In this paper, the context layer is reformed into the internal memory layer, which is connected from the hidden layer with connection weights to form the internal memory. The internal memory then plays an important role in the learning of the time series. We developed a new learning algorithm, called the time-delayed back-propagation learning, for the internal memory. The ability of the network with the internal memory layer is demonstrated by applying it to a simple sinusoidal time series.
1 Introduction
For time-series learning, Elman's network [1, 2] has been used [3, 4, 5]. Elman's network is a simple recurrent network which has a context layer as a copy of the hidden layer, and is able to use the information of the context layer for the next calculation. Since the context layer is a simple copy of the hidden layer, the outputs of the hidden layer are sent directly to the context layer. In this paper, connection weights are introduced to the connections from the hidden layer to the context layer [6]. With these connections, the context layer is no longer a simple copy of the hidden layer but the internal memory of the network. Therefore, this layer is named the internal memory layer in this paper. It is expected that the network with the internal memory layer is more powerful for learning time series. For learning of this network, we developed a new learning algorithm, called here the time-delayed back-propagation learning [6]. In this paper, the internal memory network and the time-delayed back-propagation learning are summarized, and the advantage of this network is presented with a simple example of a time series.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 502–508, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Network and Learning
2.1 Internal Memory Network
Elman introduced the context layer to the 3-layer back-propagation network. In this paper, the connection weights from the hidden layer to the context layer are introduced to Elman's network as shown in Fig. 1. While the context layer in Elman's network holds the outputs of the hidden layer, this layer keeps the internal state calculated from the outputs of the hidden layer. Therefore, this layer is named the internal memory layer, and the 3-layer network with the internal memory layer is called the internal memory network.

[Figure: input layer, hidden layer, and output layer, with an internal memory layer fed from and feeding back into the hidden layer]
Fig. 1. Internal memory network
Because the network has the internal memory calculated from the output of the hidden layer, the learning ability of this network is expected to be higher than Elman's. The equations of the internal memory network are defined as follows.

x_j^h(t) = f( Σ_{k=0}^{N_i} w_{kj}^{ih} x_k^i(t) + Σ_{k=0}^{N_m} w_{kj}^{mh} x_k^m(t−1) − θ_j^h ),  j = 1, ..., N_h        (1)

x_j^m(t) = f( Σ_{k=0}^{N_h} w_{kj}^{hm} x_k^h(t) − θ_j^m ),  j = 1, ..., N_m        (2)

x_j^o(t) = f( Σ_{k=0}^{N_h} w_{kj}^{ho} x_k^h(t) − θ_j^o ),  j = 1, ..., N_o        (3)

where x_j^i(t), x_j^h(t), x_j^m(t), and x_j^o(t) are the outputs of the j-th neuron at step t in the input layer, the hidden layer, the internal memory layer, and the output layer, respectively. θ_j^h, θ_j^m, and θ_j^o are the threshold values of the j-th neuron in the hidden layer, the internal memory layer, and the output layer, respectively. w_{kj}^{ih}, w_{kj}^{mh}, w_{kj}^{hm}, and w_{kj}^{ho} are the connection weights from the k-th neuron in the input layer to the j-th neuron in the hidden layer, from the k-th neuron in the internal memory layer to the j-th neuron in the hidden layer, from the k-th neuron in the hidden layer to the j-th neuron in the internal memory layer, and from the k-th neuron in the hidden layer to the j-th neuron in the output layer, respectively. N_i, N_h, N_m, and N_o are the numbers of neurons in the input layer, the hidden layer, the internal memory layer, and the output layer, respectively, and f is the sigmoid output function described in Equation (4),
f(x) = 1 / ( 1 + exp(−x) )        (4)
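The forward pass defined by Equations (1)-(4) can be sketched in a few lines of numpy. The layer sizes, weight initialization range, and the class wrapper below are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))              # Eq. (4)

class InternalMemoryNet:
    """Forward pass of Eqs. (1)-(3); sizes and init are illustrative choices."""

    def __init__(self, n_in, n_hid, n_mem, n_out, seed=0):
        rng = np.random.default_rng(seed)
        u = lambda *s: rng.uniform(-0.5, 0.5, s)
        self.W_ih, self.W_mh = u(n_in, n_hid), u(n_mem, n_hid)   # w^{ih}, w^{mh}
        self.W_hm, self.W_ho = u(n_hid, n_mem), u(n_hid, n_out)  # w^{hm}, w^{ho}
        self.th_h, self.th_m, self.th_o = u(n_hid), u(n_mem), u(n_out)
        self.x_m = np.zeros(n_mem)               # internal memory state x^m(t-1)

    def step(self, x_i):
        x_h = sigmoid(x_i @ self.W_ih + self.x_m @ self.W_mh - self.th_h)  # Eq. (1)
        self.x_m = sigmoid(x_h @ self.W_hm - self.th_m)                    # Eq. (2)
        return sigmoid(x_h @ self.W_ho - self.th_o)                        # Eq. (3)

net = InternalMemoryNet(n_in=1, n_hid=30, n_mem=30, n_out=1)
outs = [net.step(np.array([(np.sin(np.pi * t / 10) + 1) / 2])) for t in range(20)]
```

Note how the memory state computed in Eq. (2) at step t is consumed by Eq. (1) at step t+1, which is exactly what distinguishes this network from Elman's copied context layer.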
2.2 Delayed Learning Algorithm
Back-propagation learning [7] cannot be applied directly to the internal memory network, because the internal memory layer does not have a desired output, while the output layer does. For the internal memory networks, time-delayed back-propagation learning, or simply 'delayed learning', is developed. The basic idea of the delayed learning is the expansion of the network in time as shown in Fig. 2. The outputs of the internal memory layer at step t−1 are the inputs of the hidden layer at step t. The errors are back-propagated from the hidden layer at step t to the outputs of the internal memory layer at step t−1. This alone is not enough when there are few neurons in the output layer, because the errors from the output layer are sometimes so small that the hidden layer is not able to learn. In the delayed learning, the hidden layer at step t−1 therefore receives the errors from both the output layer at step t−1 and the internal memory layer at step t−1.
Let y_i(t) be the desired value of the i-th neuron in the output layer at step t. The error signal of the i-th neuron in the output layer is

δ_i^o = x_i^o(t) (1 − x_i^o(t)) (y_i(t) − x_i^o(t))        (5)

As in the usual back-propagation, the error signal of the i-th neuron in the hidden layer is

δ_i^h = x_i^h(t) (1 − x_i^h(t)) Σ_{j=0}^{N_o} δ_j^o w_{ij}^{ho}        (6)
Then, the error signal of the internal memory layer at step t−1 is calculated:

δ̄_i^m = x_i^m(t−1) (1 − x_i^m(t−1)) Σ_{j=0}^{N_h} δ_j^h w_{ij}^{mh}        (7)

Again, the errors of the output layer and of the hidden layer at step t−1 are calculated by the next equations:

δ̄_i^o = x_i^o(t−1) (1 − x_i^o(t−1)) (y_i(t−1) − x_i^o(t−1))        (8)
[Figure: the network unfolded over steps t−1 and t, with supervised signals at t−1 and t]
Fig. 2. Time-delayed back-propagation learning
δ̄_i^h = x_i^h(t−1) (1 − x_i^h(t−1)) ( Σ_{j=0}^{N_o} δ̄_j^o w_{ij}^{ho} + Σ_{j=0}^{N_m} δ̄_j^m w_{ij}^{hm} )        (9)

Then, at step t−1, the connection weights are changed as follows:

w_{ij}^{pq}(t+1) = w_{ij}^{pq}(t) + η δ̄_j^q x_i^p(t−1),  (p,q) = (i,h), (m,h), (h,m), (h,o)        (10)

When a momentum term is adopted,

w_{ij}^{pq}(t+1) = w_{ij}^{pq}(t) + η δ̄_j^q x_i^p(t−1) + α ( w_{ij}^{pq}(t) − w_{ij}^{pq}(t−1) ),  (p,q) = (i,h), (m,h), (h,m), (h,o)        (11)
If you need another back-step, you can calculate the error signals at t − 2. If you need more, you can get another back-step in the same way. Therefore, the delayed learning provides 1-delay, 2-delay, . . . , n-delay learning.
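A single 1-delay update following Equations (5)-(10) can be sketched as below. The tiny layer sizes, the omission of the thresholds and of the momentum term (11), and the two-step toy series are simplifying assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
n_i, n_h, n_m, n_o = 1, 5, 5, 1
W_ih = rng.uniform(-0.5, 0.5, (n_i, n_h))   # input  -> hidden
W_mh = rng.uniform(-0.5, 0.5, (n_m, n_h))   # memory -> hidden
W_hm = rng.uniform(-0.5, 0.5, (n_h, n_m))   # hidden -> memory
W_ho = rng.uniform(-0.5, 0.5, (n_h, n_o))   # hidden -> output
eta = 0.05

def forward(x_i, x_m_prev):
    x_h = sigmoid(x_i @ W_ih + x_m_prev @ W_mh)   # thresholds omitted for brevity
    return x_h, sigmoid(x_h @ W_hm), sigmoid(x_h @ W_ho)

# Two consecutive steps t-1 and t of a toy series with desired outputs y
m0 = np.zeros(n_m)
x1, y1 = np.array([0.2]), np.array([0.4])    # step t-1
x2, y2 = np.array([0.4]), np.array([0.6])    # step t
h1, m1, o1 = forward(x1, m0)
h2, m2, o2 = forward(x2, m1)

d_o2 = o2 * (1 - o2) * (y2 - o2)             # Eq. (5): output error at step t
d_h2 = h2 * (1 - h2) * (d_o2 @ W_ho.T)       # Eq. (6): hidden error at step t
d_m1 = m1 * (1 - m1) * (d_h2 @ W_mh.T)       # Eq. (7): memory error at step t-1
d_o1 = o1 * (1 - o1) * (y1 - o1)             # Eq. (8): output error at step t-1
d_h1 = h1 * (1 - h1) * (d_o1 @ W_ho.T + d_m1 @ W_hm.T)   # Eq. (9)

# Eq. (10): all weight changes use error signals and activations at step t-1
W_ih += eta * np.outer(x1, d_h1)
W_mh += eta * np.outer(m0, d_h1)
W_hm += eta * np.outer(h1, d_m1)
W_ho += eta * np.outer(h1, d_o1)
```

An n-delay variant would repeat the (7)-(9) pattern over n earlier steps before applying the updates.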
3 Simulations on Simple Time-Series
To demonstrate the internal memory network and the delayed learning, the sinusoidal values of sin(x) sampled at every π/10 are used. Therefore, this time series is cyclic and repeats after 20 steps. Each value was transformed by (sin(x) + 1)/2 to fit the neuron output range from 0 to 1, as in Equation (12):

a(t) = ( sin(πt/10) + 1 ) / 2        (12)

In these simulations, 5 types of network are used. One is Elman's network with 1 neuron for the input, 30 hidden units, 30 in the context layer, and 1
for the output. The others are the internal memory networks with 1 neuron for the input, 30 neurons for the hidden units, 30 neurons in the internal memory layer, and 1 for the output. The internal memory networks have variations of 1-delay, 2-delay, 4-delay, and 8-delay learning. In the simulations, the input of the network at step t is a(t) and the desired output is a(t+1). If the output is fed back to the input continuously, the network will generate the time series of a(t). In this series, the value 0 comes twice a cycle, and the next values are different. Therefore, the network has to form an internal state which can tell the difference. In these learnings, the parameters are set as α = 0.0005 and η = 0.05. One cycle of values of sin(t) sampled at π/10 is counted as one learning time. The networks learned for 100,000,000 learning times. The maximum errors during one cycle at every 1,000,000 learning times are shown in Fig. 3.

[Figure: maximum error in a cycle versus learning times (up to 1×10^8) for Elman's net and the 1-delay, 2-delay, 4-delay, and 8-delay networks]
Fig. 3. Maximum errors during one cycle at every 1,000,000 learning times
Elman's network and the networks with 1-delay and 2-delay delayed learning did not lower the errors. At the end of the learning, the maximum error of the 4-delay delayed learning has descended to 0.017. The 8-delay delayed learning shows almost the same results as the 4-delay case. This suggests a lower bound on the delay needed in the learning.
4 Organizing Structure
After 100,000,000 learning times of the 4-delay delayed learning, the outputs of the network are shown in Fig. 4.
[Figure: outputs of the network and of internal memory neurons #8, #14, and #22 over 20 steps]
Fig. 4. Outputs and the internal state of neuron #8, #14, and #22 in 4-delay learning
The outputs lie on a sinusoidal curve. This means that the network has learned the time series and that the network has formed proper internal states in the internal memory layer. For viewing the internal memory, the outputs of the neurons in the internal memory layer are also shown in Fig. 4. First, the output of neuron #22 has a shape similar to the cos(t) function, which is the derivative of sin(t). Next, the output of neuron #14 is in opposite phase to cos(t). The output of neuron #8 is a sin(t)-shaped curve out of phase by 2 steps. During the delayed learning, the network organized these internal states.
5 Conclusion
In this paper, connection weights are introduced into the connection from the hidden layer to the context layer of Elman's network. With these connection weights, the context layer is no longer a simple copy of the hidden layer but the internal memory of the network. With the time-delayed back-propagation learning, the internal memory network has a greater ability to learn time series. The advantage of this network is presented with a simple example of a time series, and part of the organized internal states is reported.
References
1. Elman, J.L.: Finding Structure in Time. Cognitive Science 14 (1990) 179-211
2. Elman, J.L.: Learning and Development in Neural Networks: The Importance of Starting Small. Cognition 48 (1993) 71-99
3. Koskela, T., Lehtokangas, M., Saarinen, J., Kaski, K.: Time Series Prediction with Multilayer Perceptron, FIR and Elman Neural Networks. Proc. of the World Congress on Neural Networks, INNS Press, San Diego, USA (1996) 491-496
4. Cholewo, T.J., Zurada, J.M.: Sequential Network Construction for Time Series Prediction. Proc. of the IEEE Intl. Joint Conf. on Neural Networks, Houston, Texas, USA (1997) 2034-2039
5. Giles, C.L., Lawrence, S., Tsoi, A.C.: Noisy Time Series Prediction Using a Recurrent Neural Network and Grammatical Inference. Machine Learning 44(1/2) (2001) 161-183
6. Iwasa, K., Deguchi, T., Ishii, N.: Acquisition of the Time-Series Information in the Network with Internal Memory (in Japanese). IEICE Technical Report NC2001-71 (2001) 7-12
7. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning Internal Representations by Error Propagation. In: Rumelhart, D.E., McClelland, J.L. (eds.): Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. MIT Press, Cambridge, MA (1986) 318-362
A Novel Learning Algorithm for Feedforward Neural Networks Huawei Chen and Fan Jin School of Information Science & Technology, Southwest Jiaotong University, P.O. Box 406, Chengdu, Sichuan 610031, China
[email protected]
Abstract. A novel learning algorithm called BPWA for feedforward neural networks is presented, which adjusts the weights during both the forward phase and the backward phase. It calculates the minimum norm least-squares solution as the weights between the hidden layer and the output layer in the forward pass, while the backward pass adjusts the weights connecting the input layer to the hidden layer according to the error gradient descent algorithm. The algorithm is compared with the Extreme Learning Machine, the BP algorithm and the LMBP algorithm on function approximation and classification tasks. The experimental results demonstrate that the proposed algorithm performs well.
1 Introduction
The theory of neural networks shows that a three-layer feedforward net with a sufficient number of hidden units of sigmoidal activation type can approximate any continuous or other kinds of function to any desired accuracy [1,2]. Although a neural network has the ability to approximate any function in theory, how to assign appropriate weights to the neural network is still a hard problem [3]. Backpropagation (BP) is the most widely used training algorithm for finding the weights of feedforward neural networks [4]. The error information is propagated backwards and the weights are adjusted according to gradient descent within the solution's vector space. Although the BP algorithm was the first successful training algorithm for multi-layer feedforward neural networks, it easily converges to local minima and converges slowly in practice. Many variations have been proposed with the aim of making it run faster and making it less likely to become stuck in local minima, such as conjugate gradient backpropagation [5] and the Levenberg-Marquardt (LMBP) algorithm [6], etc. Both the BP algorithm and its variants modify the weights only in the backward pass.
Recently, a new learning scheme for single hidden layer feedforward neural networks (SLFNs) called the extreme learning machine (ELM) [7] was proposed by Huang. In ELM, the output weights (connecting the hidden layer to the output layer) are directly calculated by using the Moore-Penrose generalized inverse, while the input weights (connecting the input layer to the hidden layer) and hidden biases can be randomly chosen. It is interesting and surprising that ELM can learn very fast as well as achieve good generalization performance. However, ELM may require more
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 509 – 514, 2006. © Springer-Verlag Berlin Heidelberg 2006
hidden neurons in many cases [8], and ELM cannot make its error small enough because it only calculates the weights once and does not make use of the input weights. In this paper, combining gradient-descent learning with the idea of ELM, a novel hybrid algorithm taking advantage of both is proposed. The proposed algorithm adjusts the weights during both the forward phase and the backward phase according to the generalized inverse and the error gradient-descent principle, respectively.
2 Model and Learning Problem of SLFNs
Huang pointed out the difference between universal approximation on compact input sets and approximation on a finite set [7]. Huang and Babri also showed that a SLFN with at most N hidden neurons and with almost any nonlinear activation function can learn N distinct observations with zero error [9]. For N arbitrary distinct samples (x_i, t_i), where x_i = [x_{i1}, x_{i2}, ..., x_{in}]ᵀ ∈ Rⁿ and t_i = [t_{i1}, t_{i2}, ..., t_{im}]ᵀ ∈ Rᵐ, standard SLFNs with Ñ hidden neurons and activation function g(x) are mathematically modeled as a linear system
∑ β g (w • x i
i
j
+ bi ) = o j
j = 1,..., N
(1)
i =1
where $w_i = [w_{i1}, w_{i2}, \ldots, w_{in}]^T$ is the weight vector linking the $i$th hidden neuron and the input neurons, $\beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^T$ is the weight vector linking the $i$th hidden neuron and the output neurons, and $b_i$ is the threshold of the $i$th hidden neuron. $w_i \cdot x_j$ denotes the inner product of $w_i$ and $x_j$. That the SLFN with $\tilde{N}$ hidden neurons and activation function $g(x)$ can approximate the $N$ distinct samples $(x_j, t_j)$ with zero error means that

$$H\beta = T \qquad (2)$$
where $H$ is the hidden layer output matrix of the neural network. So, for fixed arbitrary input weights $w_i$ and hidden layer biases $b_i$, training an SLFN is simply equivalent to finding a least-squares solution $\hat{\beta}$ of the linear system $H\beta = T$. The best weight vector is $\hat{\beta} = H^+ T$ [8,10], where $H^+$ is the Moore-Penrose generalized inverse of $H$.
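The least-squares step above can be sketched in a few lines. The following is a minimal illustration, not the authors' code; the function name, all sizes, the toy target, and the choice of a logistic sigmoid are our own assumptions:

```python
import numpy as np

def elm_output_weights(X, T, W_in, b):
    """Compute the output weights beta_hat = H^+ T for fixed
    (e.g. randomly chosen) input weights W_in and hidden biases b.

    X: (N, n) inputs, T: (N, m) targets,
    W_in: (n_hidden, n), b: (n_hidden,).
    Returns the hidden-layer output matrix H and beta_hat."""
    H = 1.0 / (1.0 + np.exp(-(X @ W_in.T + b)))   # sigmoid hidden layer
    beta_hat = np.linalg.pinv(H) @ T              # Moore-Penrose solution of H beta = T
    return H, beta_hat

# Usage on a toy regression task (all sizes here are illustrative)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 2))
T = (np.sin(X[:, 0]) + X[:, 1] ** 2).reshape(-1, 1)
W_in = rng.standard_normal((20, 2))
b = rng.standard_normal(20)
H, beta = elm_output_weights(X, T, W_in, b)
mse = np.mean((H @ beta - T) ** 2)
```

Because `np.linalg.pinv` yields the minimum norm least-squares solution, the computed `beta` coincides with what `np.linalg.lstsq` would return for the same system.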
3 Bi-Phase Weights Adjusting (BPWA) Algorithm

Based on the Moore-Penrose generalized inverse and the structure of the SLFN, Huang proposed ELM. It should be pointed out that although $\hat{\beta}$ is the minimum norm least-squares solution, it cannot guarantee that the output error falls below a given error threshold or becomes arbitrarily small, because it makes no use of the input weights. Another problem is that ELM usually requires more hidden neurons than other learning algorithms. Incorporating the generalized inverse with backpropagation, a hybrid scheme is proposed here. In this algorithm the activation functions of the hidden neurons are assumed to be sigmoidal. The algorithm calculates the minimum norm least-squares solution as the weights between the hidden layer and the output layer in the forward pass, while the backward pass adjusts the weights connecting the input layer to the hidden layer according to the error gradient-descent algorithm. Since it adjusts the weights during both phases, it is called the Bi-Phase Weights Adjusting learning algorithm. The main idea of BPWA is illustrated in Fig. 1: BPWA divides weight adjustment into two passes, during which it calculates or modifies different parts of the weight vectors. The learning algorithm of BPWA is described as follows.

Fig. 1. The diagram of the Bi-Phase Weights Adjusting algorithm (forward pass: adjusting the output weights; backward pass: adjusting the input weights)
3.1 Output Weights Calculation During Forward Pass
For any given fixed arbitrary input weights and hidden layer biases:

1. Calculate the hidden layer output matrix $H$.
2. From $H$, compute $\hat{\beta} = H^+ T$, which minimizes the error between the output and the target.
3. Calculate the error between the network output $H\hat{\beta}$ and the target $T$. If the error is less than the given error threshold, or the learning epoch has reached the maximum epoch, stop; otherwise the learning process continues with the backward pass.

3.2 Input Weights Adjusting Based on Gradient Descent During Backward Pass
The backward pass modifies the weights connecting the input layer to the hidden layer, including the hidden layer biases. Note that the output weights connecting the hidden layer to the output layer do not change during this pass; only the input weights are adjusted, according to error gradient descent. The forward pass calculates the optimal output weights that minimize the error, and the backward pass adjusts the input weights whenever the error is still larger than the error threshold. The convergence of this algorithm is therefore fast, since weights are adjusted during both phases. Unlike traditional training algorithms, BPWA trains the parameters of the neural network during both passes. Owing to the minimum norm least-squares solution of the output weights, BPWA can achieve a small learning error as well as a small generalization error once it has trained the network successfully.
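The bi-phase loop described above can be sketched as follows. This is our own simplified reading, not the authors' implementation: the backward pass is coded as a single full-batch gradient step on the input weights with the output weights held fixed, and the function name, learning rate, and stopping rule are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bpwa_train(X, T, n_hidden, lr=0.05, n_epochs=200, tol=1e-3, seed=0):
    """Sketch of the bi-phase scheme: the forward phase solves the
    output weights by pseudoinverse (as in ELM); the backward phase
    takes one gradient step on the input weights and biases while the
    output weights are held fixed."""
    rng = np.random.default_rng(seed)
    W = 0.5 * rng.standard_normal((n_hidden, X.shape[1]))  # input weights
    b = 0.5 * rng.standard_normal(n_hidden)                # hidden biases
    beta = np.zeros((n_hidden, T.shape[1]))
    mse = np.inf
    for _ in range(n_epochs):
        # forward phase: hidden outputs and least-squares output weights
        H = sigmoid(X @ W.T + b)              # (N, n_hidden)
        beta = np.linalg.pinv(H) @ T          # (n_hidden, m)
        E = H @ beta - T                      # residual
        mse = np.mean(E ** 2)
        if mse < tol:
            break
        # backward phase: gradient of 0.5*||E||^2 w.r.t. W and b,
        # with beta treated as a constant during this pass
        delta = (E @ beta.T) * H * (1.0 - H)  # (N, n_hidden)
        W -= lr * (delta.T @ X) / X.shape[0]
        b -= lr * delta.mean(axis=0)
    return W, b, beta, mse

# Usage: approximate f(x1, x2) = x1 * x2 with a 2-8-1 network
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
T = (X[:, 0] * X[:, 1]).reshape(-1, 1)
W, b, beta, final_mse = bpwa_train(X, T, n_hidden=8)
```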
H. Chen and F. Jin
4 Performance Evaluation

In this section, the performance of the proposed algorithm is tested on function approximation and classification problems and compared with the BP algorithm, the LMBP algorithm, and ELM. The simulations are carried out in Matlab 6.0 on an AMD Athlon XP 1800+ CPU with 384 MB of memory. All results shown in this paper are mean values over 100 trials.

4.1 Two-Dimensional Function Approximation
The target function is as follows [11]:

$$f(x_1, x_2) = 3(1 - x_1)^2 e^{-x_1^2 - (x_2+1)^2} - 10\left(\frac{x_1}{5} - x_1^3 - x_2^5\right) e^{-x_1^2 - x_2^2} - \frac{1}{3} e^{-(x_1+1)^2 - x_2^2} \qquad (3)$$
where $x_1 \in [-2, 2]$ and $x_2 \in [-2, 2]$. For training purposes, the outputs are normalized to the interval $(-1, 1)$. For each trial, the maximum number of epochs was fixed at 1000. An experiment was deemed successful if it achieved a mean squared error (mse) value of 0.04. The results are presented in Table 1. The network architecture used here is 2-7-1. MSE is the mean of the mse achieved over all experiments for the task, SEMSE is the average mse over the successful experiments, and SucRate is the ratio of the number of experiments that achieved an mse of 0.04 to the total number of experiments.

Table 1. Performance of BPWA, ELM, BP, and LMBP on approximating f(x1, x2)

Algorithm   MSE        SEMSE      SucRate(%)   Training Time(s)
BPWA        0.027227   0.027227   100          0.02312
ELM         0.028618   0.02755    95           0.02172
BP          0.10816    -          0            10.7438
LMBP        0.030121   0.030121   100          0.67641
In this experiment, BP could not reach the error threshold within the maximum number of learning epochs, and thus failed. ELM indeed has a very fast learning speed, but it does not succeed every time. Although both BPWA and LMBP work well, BPWA clearly outperforms LMBP.

4.2 Benchmarking with the Iris Problem
The Iris database is a famous benchmark in the pattern recognition field and is often used to compare different methods. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. Every instance has 4 attributes. We used a 4-6-3 neural network to classify the iris types. The whole dataset was divided into a training set and a testing set, each containing 75 instances randomly selected from the Iris database, with no overlap between them.
A Novel Learning Algorithm for Feedforward Neural Networks
For one instance, if the three output values satisfy the relation

$$|y_i^d - y_i| < \theta \qquad (4)$$

the classification of this instance is considered correct; otherwise it is considered false. Here $\theta$ is a threshold, set to $\theta = 0.5$. The correct rate, one performance measure for evaluating the classification, is defined as

$$E = \frac{n}{N} \times 100\% \qquad (5)$$
where $n$ is the number of correctly classified instances and $N$ is the size of the training or testing set. Training SucRate is the mean correct classification ratio over the training set, and Testing SucRate is the mean correct classification ratio over the test set. The mean results over 100 trials are summarized in Table 2.

Table 2. Performance of BPWA, ELM, BP, and LMBP on the Iris database

Algorithm   Training SucRate(%)   Testing SucRate(%)   Training Time(s)
BPWA        98.227                94.56                0.06297
ELM         85.493                80.747               0.00953
BP          22.36                 20.987               0.47875
LMBP        99.88                 93.533               0.48562
From the table above, ELM trained fastest, but its correct classification ratios are below 90 percent for both training and testing. BP is again the worst. BPWA and LMBP yield similar correct classification ratios on the training set, while the testing correct ratio of BPWA is slightly better than that of LMBP, and the training time of BPWA is about one eighth that of LMBP. From the experiments above, we can see that BPWA performed well in the simulations. Since LMBP is one of the best gradient-based learning algorithms and ELM is the fastest learning algorithm, we can conclude that BPWA performs well in both learning speed and generalization ability.
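The correctness criterion of Eqs. (4)-(5) amounts to a simple per-instance check. A minimal sketch (the function name and toy data are ours, not from the paper):

```python
import numpy as np

def correct_rate(Y_pred, Y_target, theta=0.5):
    """Correct classification rate per Eqs. (4)-(5): an instance is
    counted as correct when every output component stays within
    theta of its target value."""
    ok = np.all(np.abs(Y_pred - Y_target) < theta, axis=1)
    return 100.0 * np.sum(ok) / len(ok)

# Toy usage with one-hot targets for a 3-class problem
Y_target = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
Y_pred = np.array([[0.8, 0.1, 0.2],    # correct
                   [0.2, 0.6, 0.1],    # correct
                   [0.4, 0.6, 0.7]])   # wrong: |0.6 - 0| >= 0.5
rate = correct_rate(Y_pred, Y_target)  # 2 of 3 correct
```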
5 Conclusion

This paper proposed a new learning algorithm for feedforward neural networks, BPWA, which adjusts the weights during both the forward phase and the backward phase. The algorithm calculates the minimum norm least-squares solution as the weights between the hidden layer and the output layer in the forward pass, while the backward pass adjusts the weights connecting the input layer to the hidden layer according to the error gradient-descent algorithm. Experimental results show that the proposed algorithm performs well in comparison with the BP algorithm, the LMBP algorithm, and ELM. Although BPWA is discussed here for single hidden layer feedforward neural networks, the algorithm can clearly also be applied to training multilayer feedforward neural networks.
References

1. Hornik, K., Stinchcombe, M., White, H.: Multilayer Feedforward Networks are Universal Approximators. Neural Networks 2(5) (1989) 359-366
2. Chen, T., Chen, H.: Universal Approximation to Nonlinear Operators by Neural Networks with Arbitrary Activation Functions and Its Application to Dynamical Systems. IEEE Trans. Neural Networks 6(4) (1995) 911-917
3. Judd, J. S.: Learning in Networks is Hard. In: Proc. the 1st IEEE International Conference on Neural Networks, Vol. 2. New York (1987) 685-692
4. Rumelhart, D. E., Hinton, G. E., Williams, R. J.: Learning Internal Representations by Error Propagation. In: Parallel Distributed Processing, Vol. 1. MIT Press, Cambridge (1986)
5. Hagan, M. T., Menhaj, M. B.: Training Feedforward Networks with the Levenberg-Marquardt Algorithm. IEEE Trans. on Neural Networks 5(6) (1994) 989-993
6. Huang, G.-B., Zhu, Q.-Y., Siew, C.-K.: Extreme Learning Machine: a New Learning Scheme of Feedforward Neural Networks. In: Proceedings 2004 International Joint Conference on Neural Networks, Vol. 2 (2004) 985-990
7. Huang, G.-B., Zhu, Q.-Y., Siew, C.-K., Saratchandran, P., Sundararajan, N.: Can Threshold Networks Be Trained Directly? IEEE Transactions on Circuits and Systems II: Express Briefs (accepted for publication)
8. Huang, G.-B., Babri, H. A.: Upper Bounds on the Number of Hidden Neurons in Feedforward Networks with Arbitrary Bounded Nonlinear Activation Functions. IEEE Trans. on Neural Networks 9(1) (1998) 224-229
9. Bartlett, P. L.: The Sample Complexity of Pattern Classification with Neural Networks: the Size of the Weights Is More Important than the Size of the Network. IEEE Trans. on Information Theory 44(2) (1998) 525-536
10. Chandra, P., Singh, Y.: An Activation Function Adapting Training Algorithm for Sigmoidal Feedforward Networks. Neurocomputing 61(10) (2004) 429-437
On H∞ Filtering in Feedforward Neural Networks Training and Pruning

He-Sheng Tang¹, Song-Tao Xue², and Rong Chen¹

¹ Research Institute of Structural Engineering and Disaster Reduction, Tongji University, Shanghai, 200092, China
[email protected]
² Department of Architecture, School of Science and Engineering, Kinki University, Kowakae 3-4-1, Higashi Osaka City, 577-0056, Japan
Abstract. An efficient training and pruning method based on the H∞ filtering algorithm is proposed for feedforward neural networks (FNNs). A weight importance measure linking prediction error sensitivity, obtained from H∞ filtering training, and a weight-salience-based pruning technique are derived. Extensive experimental results indicate that the proposed method yields better pruning during the training process without losing the network's generalization capacity, and also provides a robust global optimization training algorithm for arbitrary given network structures.
1 Introduction

For neural network (NN) design, there are two crucial problems: one is the choice of a fast and robust training algorithm, and the other is the choice of a suitable or, ideally, minimal network topology. In neural network training, the best-known online training method is the backpropagation algorithm (BPA) [1], which is essentially a first-order stochastic gradient descent method and exhibits slow learning [2]. To overcome this slowness, many modified schemes based on classical nonlinear programming techniques have been suggested to speed up training [3][4]. Recently, a class of second-order descent methods inspired by the theory of system identification and nonlinear filtering [5] has been introduced to estimate the weights of a neural network. The extended Kalman filter (EKF) [6][7][8] and the recursive least squares (RLS) method have been applied to multilayer perceptrons [9][10][11]. Although the EKF algorithm improves the learning speed, it requires knowledge of the noise statistics, and both its convergence and its final values depend, to a great extent, on the initial guess of those statistics. The authors have previously presented suboptimal H∞ filtering to train feedforward multilayer networks independently of the noise statistics [12].

Besides the training algorithm, another concern in the practical application of NNs is the choice of a suitable model architecture. An unsuitable topology will increase the training time or even cause non-convergence, and it usually decreases the generalization capability of the network. If there are too few weights, the network may not be trained to learn the training data for the system mapping. On the other hand, if the network size is too large, overfitting may occur, leading to worse generalization capacity [13][14]. Thus, in order to eliminate unnecessary weights, a pruning algorithm is applied. There are different pruning and model selection methods: the Akaike Information Criterion (AIC) and cross-validation techniques [15][16], which require tens of networks to be exhaustively trained before the correct network size is determined; simple weight decay [17][18]; the error-sensitivity-based Optimal Brain Damage (OBD) [19] and Optimal Brain Surgeon (OBS) [20] methods and OBD-like pruning methods [21][22]; and growing methods [4], which may be sensitive to initial conditions and become trapped in local minima [23]. As H∞ filtering has been shown to be more efficient and robust than the Kalman filter [24][25], it is interesting to ask whether the H∞ filtering training method can be applied together with network pruning.

The objective of the present study is to develop an FNN training and pruning method based on the H∞ filtering algorithm for the identification of nonlinear systems. The presented method reduces the complexity of the network during training without diminishing the network's estimation capacity. Also, being independent of the statistics of the disturbances on the network's inputs and outputs, it provides a natural global optimization training algorithm for arbitrary given network structures. Examples of nonlinear system identification are given to verify the usefulness and effectiveness of the proposed method.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 515-523, 2006. © Springer-Verlag Berlin Heidelberg 2006
H.-S. Tang, S.-T. Xue, and R. Chen
2 H∞ Filtering Algorithm in Neural Network Training

Let $y_k = f(w_k, u_k)$ be the transfer function of a single-layer FNN, where $y_k$ is the output, $u_k$ is the input, and $w_k$ is its parameter vector, combining the weight matrices $w^1$, $w^2$ and $w^3$. Given a set of training data, the training of a neural network can be formulated as a filtering problem [6][8]. In this case, a discrete-time FNN's behavior can be described by the following nonlinear state-space model:

$$w_{k+1} = w_k + v_k \qquad (1)$$

$$y_k = f(w_k, u_k) + n_k \qquad (2)$$
Eq. (1) is known as the process equation, where $v_k$ is the process noise and the state of the system is given by the network's weight parameter values $w_k$. Eq. (2) is the observation or measurement equation; it represents the desired network response vector $y_k$ as a nonlinear function $f(\cdot)$ of the input vector $u_k$ and the weight parameter vector $w_k$, augmented by the random measurement noise $n_k$. To apply the optimal H∞ filtering algorithm, a linear Taylor approximation of $f(w_k, u_k)$ at $\hat{w}^-_{k-1}$ (the prediction of $w_k$) and $u_k$ is considered here, that is,

$$f(w_k, u_k) \approx f(\hat{w}^-_{k-1}, u_k) + C_k (w_k - \hat{w}^-_{k-1}) \qquad (3)$$

where $C_k = \partial f / \partial w \,|_{u = u_k,\, w = \hat{w}^-_{k-1}}$. A new quantity is introduced as follows:

$$\eta_k = y_k - f(\hat{w}^-_{k-1}, u_k) + C_k \hat{w}^-_{k-1} \qquad (4)$$

The entries of $\eta_k$ are all known at time $k$, and therefore $\eta_k$ can be regarded as an observation vector at time $k$. Hence, the nonlinear model (Eq. (2)) is approximated by the linear model

$$\eta_k = C_k w_k + n_k \qquad (5)$$
The problem addressed by H∞ filtering is to find an estimate $\hat{w}_k$ of $w_k$ given $u_j, \eta_j$ ($j = 0, 1, \ldots, k$). The suboptimal H∞ estimation [25][26] is interested not necessarily in the estimation of $w_k$ itself but in the estimation of some arbitrary linear combination of $w_k$ using the noise-corrupted observations $\eta_j$ ($j = 0, 1, \ldots, k$), i.e.,

$$z_k = L_k w_k \qquad (6)$$

where $L_k \in R^{q \times n}$. Unlike the modified Wiener/Kalman filter, which minimizes the variance of the estimation error, the design criterion of the H∞ filter is to provide a uniformly small estimation error, $z_k - \hat{z}_k$, for any $n_k \in l_2$ and $w_0 \in R^n$. The H∞ filter searches for the $\hat{z}_k$ that is optimal among all possible $\hat{z}_k$ in the sense that the supremum of the performance measure is less than a positive pre-chosen noise attenuation factor $\gamma^2$, i.e., the worst-case performance measure satisfies

$$\sup_{w_0, \{n_k, v_k\}} \frac{\sum_{k=0}^{N-1} \| z_k - \hat{z}_k \|^2}{\| w_0 - \hat{w}_0 \|^2_{P_0^{-1}} + \sum_{k=0}^{N-1} \left( \| n_k \|^2 + \| v_k \|^2 \right)} < \gamma^2 \qquad (7)$$
where $\hat{w}_0$ is an a priori estimate of $w_0$ and $w_0 - \hat{w}_0$ represents the unknown initial condition error; $P_0^{-1} > 0$ is a weighting matrix, i.e., a positive definite matrix reflecting a priori knowledge of how close the initial guess $\hat{w}_0$ is to $w_0$. Let $\gamma > 0$ be a prescribed level of noise attenuation. Then an optimized H∞ filtering algorithm for neural network training can be derived:

$$\hat{w}_k = \hat{w}^-_{k-1} + K_k (\eta_k - C_k \hat{w}^-_{k-1}) = \hat{w}^-_{k-1} + K_k \left( y_k - f(\hat{w}^-_{k-1}, u_k) \right), \quad \hat{w}^-_k = \hat{w}_k, \quad \hat{w}^-_{-1} = \hat{w}_0 \qquad (8)$$

where $\hat{w}_k$ is an a posteriori estimate of the state at step $k$. The gain $K_k$ of the filter is given by

$$K_k = P_k C_k^T (I + C_k P_k C_k^T)^{-1} \qquad (9)$$

$$P_{k+1} = P_k (I + C_k^T C_k P_k - \gamma^{-2} L_k^T L_k P_k)^{-1} \qquad (10)$$

where the attenuation factor $\gamma$ must be tuned so that $P_k$ remains positive definite.
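The recursion of Eqs. (8)-(10) can be exercised on a toy problem. The sketch below is our own illustration, not code from the paper; the function name and the linear test case (for which the Taylor linearization is exact) are assumptions:

```python
import numpy as np

def hinf_step(w_prev, P, y_k, f_k, C_k, gamma, L_k=None):
    """One training update following Eqs. (8)-(10).
    w_prev: prior weight estimate (n,); P: Riccati matrix (n, n);
    y_k, f_k: observed and predicted outputs (p,);
    C_k: Jacobian df/dw at w_prev (p, n); gamma: attenuation level.
    L_k defaults to C_k, the choice used later in Section 3."""
    if L_k is None:
        L_k = C_k
    p, n = C_k.shape
    K = P @ C_k.T @ np.linalg.inv(np.eye(p) + C_k @ P @ C_k.T)   # Eq. (9)
    w = w_prev + K @ (y_k - f_k)                                  # Eq. (8)
    P_next = P @ np.linalg.inv(np.eye(n) + C_k.T @ C_k @ P
                               - gamma ** -2 * L_k.T @ L_k @ P)   # Eq. (10)
    return w, P_next

# Usage: recover the weights of a linear map y = C w (a linear "network",
# so f_k = C_k w_prev exactly)
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
w, P = np.zeros(2), 10.0 * np.eye(2)
for _ in range(200):
    C = rng.standard_normal((1, 2))
    w, P = hinf_step(w, P, C @ w_true, C @ w, C, gamma=10.0)
```

With a sufficiently large `gamma` the recursion behaves like an RLS-type estimator and `P` stays positive definite, consistent with the tuning remark above.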
3 H∞ Filtering in Neural Network Pruning

In this section, the conjunction of network training and pruning with the H∞ filtering algorithm is illustrated. Without loss of generality, the network employed here is a feedforward architecture with $n_I$ input units, $n_H$ hidden sigmoid units, and a single linear output unit. The initial network is fully connected between layers and implements a nonlinear mapping from the input space $u_k$ to the target output space $\hat{y}_k = f_k(u_k, w_k)$, where $f(\cdot)$ is the actual output mapping function, $w$ is the vector of network parameters, and $\hat{y}_k$ is the prediction of the target output $y_k$. For a given training set, the cost function can then be expressed as

$$E(w) = \frac{1}{2N} \sum_{k=1}^{N} (y_k - f_k)^2 \qquad (11)$$
where $N$ is the number of training examples. Under the assumption that the network is fully trained, that is, the cost function $E$ has settled to a local or global minimum on the error surface, the second derivative of $E$ with respect to $w$, the Hessian matrix [19], can be approximated as

$$H(N) \approx \frac{1}{N} \sum_{k=1}^{N} \left( \frac{\partial f_k}{\partial w} \right) \left( \frac{\partial f_k}{\partial w} \right)^T \qquad (12)$$
To illustrate the connection between the matrix $P_k$ and the Hessian matrix $H$ of the cost function, the Riccati equation (10) can be rewritten in an alternative form that is more convenient for analysis. By the matrix inversion lemma (MIL), the following update for $P_k^{-1}$ is obtained:

$$P_{k+1}^{-1} = P_k^{-1} + C_k^T C_k - \gamma^{-2} L_k^T L_k \qquad (13)$$
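The matrix inversion lemma step from Eq. (10) to Eq. (13) can be checked numerically. Below is a small self-contained check on random data of our own choosing (sizes and the value of $\gamma$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, gamma = 4, 2, 5.0
C = rng.standard_normal((p, n))
L = C.copy()                       # L_k chosen equal to C_k, as in the text
A = rng.standard_normal((n, n))
P = A @ A.T + n * np.eye(n)        # a symmetric positive definite P_k

# Eq. (10): P_{k+1} = P_k (I + C^T C P_k - gamma^{-2} L^T L P_k)^{-1}
P_next = P @ np.linalg.inv(np.eye(n) + C.T @ C @ P - gamma ** -2 * L.T @ L @ P)

# Eq. (13): P_{k+1}^{-1} = P_k^{-1} + C^T C - gamma^{-2} L^T L
P_next_info = np.linalg.inv(np.linalg.inv(P) + C.T @ C - gamma ** -2 * L.T @ L)
```

The two forms agree because $P(I + MP)^{-1} = (P^{-1} + M)^{-1}$ for $M = C^T C - \gamma^{-2} L^T L$.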
Suppose that the weight parameters and the 'error covariance matrix' $P_k$ both converge. Without loss of generality, it is convenient to select the matrix $L_k$ equal to $C_k$. The Riccati recursion (13), with initial condition $P_0$, can then be rewritten as

$$P_{k+1}^{-1} = P_0^{-1} + (1 - \gamma^{-2}) \sum_{i=0}^{k} C_i^T C_i \qquad (14)$$
From Eq. (12) and Eq. (14), the inverse of the Hessian matrix of the cost function is approximately

$$H^{-1} \approx N (1 - \gamma^{-2}) P_{k+1} \left[ I - (P_{k+1} - P_0)^{-1} P_{k+1} \right] \qquad (15)$$
The pruning procedure simplifies the computations by the further assumption that the Hessian matrix $H$ is diagonal [19]. The saliency of each parameter is then

$$S_i = \frac{1}{2} [H]_{i,i}\, w_{[i]}^2 \qquad (16)$$
where $[H]_{i,i}$ is the $i$-th diagonal element of $H$ and $w_{[i]}$ is the $i$-th weight. The pruning strategy is to find the lowest-saliency parameters, which are then selected for deletion. The weights are pruned in this order until the change in the estimated training error exceeds the tolerance limit.
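The saliency ranking of Eq. (16) with the stopping rule just described can be sketched as follows. The helper name, the linear toy model, and the tolerance are illustrative assumptions of ours, not the paper's code:

```python
import numpy as np

def prune_by_saliency(weights, H_diag, err_fn, tol):
    """Diagonal-Hessian pruning per Eq. (16): rank the weights by
    S_i = 0.5 * H_ii * w_i^2, zero them out from the smallest saliency
    upward, and stop before the training error exceeds tol.
    err_fn(w) must return the training error for weight vector w."""
    order = np.argsort(0.5 * H_diag * weights ** 2)  # smallest saliency first
    w = weights.copy()
    for i in order:
        trial = w.copy()
        trial[i] = 0.0
        if err_fn(trial) > tol:
            break                                    # next deletion would hurt
        w = trial
    return w

# Usage: linear model y = X w_star with two nearly irrelevant weights
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 5))
w_star = np.array([2.0, 1e-3, -1.5, 5e-4, 3.0])
y = X @ w_star

def err_fn(w):
    return np.mean((X @ w - y) ** 2)

H_diag = np.mean(X ** 2, axis=0)   # outer-product diagonal, cf. Eq. (12)
w_pruned = prune_by_saliency(w_star, H_diag, err_fn, tol=1e-4)
```

For this toy model, only the two near-zero weights have saliencies small enough to be pruned before the error tolerance is hit.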
4 Illustrative Numerical Examples

A nonlinear hysteretic system [27] is selected to illustrate the applicability of the proposed method. The nonlinear system is described by the following second-order differential equations:

$$\frac{d^2 u(t)}{dt^2} + z\!\left(\frac{du(t)}{dt}, u(t), z(t)\right) = p(t) \qquad (17)$$

$$\frac{dz(t)}{dt} = 24.5\,\frac{du(t)}{dt} - 2\left|\frac{du(t)}{dt}\right| z(t)\,|z(t)|^{2} - 0.5\,\frac{du(t)}{dt}\,|z(t)|^{3} \qquad (18)$$
where $u(t)$ and $p(t)$ are the system output and input, respectively. An FNN is used to model the dynamic behavior of the system; namely, the FNN is trained to identify $z(t)$. An FNN with 3 input neurons ($du(t)/dt$, $u(t)$, and $z(t-1)$), 30 hidden neurons (with hyperbolic tangent activation function), and one output neuron ($z(t)$, with linear activation function) is trained to capture the unknown system. In this case, the total number of weights is 120. In this study, the effectiveness of noise-injection training is also investigated: noise is artificially added to the training data. The noise level is defined as the value of the standard deviation; for instance, if the standard deviation is 0.05, the noise level in the data is 5% in the root-mean-square (RMS) sense. Training and test samples for the nonlinear system identification problem are shown in Fig. 1. The training set contains 1000 samples with 3% RMS noise, and the test set contains 500 samples. The evolution of some of the estimated network weights over the training iterations is shown in Fig. 2. Some weights converge to stable values very quickly, within the first 500 iterations; this fast convergence demonstrates that the H∞ filter is a very fast training algorithm. The final values of $P_k$ (a 120×120 matrix, Fig. 3) from the Riccati recursion show that $P_k$ is almost diagonally dominant. Fig. 4 shows the estimated training error against the number of weights in the pruned network. It is clear that only about 70 weights are enough to capture the unknown system without increasing the training error dramatically. After training, another 500 pairs of testing data are passed to the pruned network. Fig. 5 depicts a segment of the actual and the desired network output for the test set; the output of the pruned network is quite close to the desired output.
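Training data of the kind shown in Fig. 1 can be generated by integrating Eqs. (17)-(18) directly. The sketch below uses forward Euler with a step size and excitation of our own choosing, and codes the hysteretic law as reconstructed in Eq. (18); it is an illustration, not the paper's data-generation code:

```python
import numpy as np

def simulate_hysteretic(p, dt=0.005):
    """Forward-Euler integration of Eqs. (17)-(18):
    u'' + z = p(t),  z' = 24.5 u' - 2 |u'| z |z|^2 - 0.5 u' |z|^3.
    p: input samples at spacing dt. Returns u, u', z trajectories."""
    n = len(p)
    u, du, z = np.zeros(n), np.zeros(n), np.zeros(n)
    for k in range(n - 1):
        ddu = p[k] - z[k]                     # Eq. (17)
        dz = (24.5 * du[k]                    # Eq. (18)
              - 2.0 * abs(du[k]) * z[k] * abs(z[k]) ** 2
              - 0.5 * du[k] * abs(z[k]) ** 3)
        u[k + 1] = u[k] + dt * du[k]
        du[k + 1] = du[k] + dt * ddu
        z[k + 1] = z[k] + dt * dz
    return u, du, z

# Usage: drive the system with a sinusoidal load and inspect z(t)
t = np.arange(0.0, 5.0, 0.005)
p = 2.0 * np.sin(2.0 * np.pi * 0.5 * t)
u, du, z = simulate_hysteretic(p)
```

For the coefficients in Eq. (18) the restoring force $z$ saturates (at about $|z| \approx 2.1$ for sustained one-directional motion), which is what produces the hysteresis loops such a model is used to represent.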
Fig. 1. Training and test samples for the system identification problem: (a) p(t); (b) u(t); (c) du(t)/dt; (d) z(t)

Fig. 2. Convergence of the estimated weights

Fig. 3. Values of P_k after training the H∞ network

Fig. 4. Number of weights in the pruned network versus training error
Fig. 5. A segment of the actual and the desired network output for the test set in the system identification example. The solid line represents the desired output; the dotted line corresponds to the actual output of: (a) the unpruned network; (b) the network pruned by our approach
To demonstrate that the generalization capability is not much affected when some weights in the network are pruned, another set of input signals (the validation set) is fed to the pruned network. A system input p(t) (Fig. 6(a)) is selected for the validation set, and the corresponding desired outputs are obtained. Fig. 6(b) depicts a segment of the actual and the desired network output corresponding to this input; the instantaneous error is shown in Fig. 6(c). The figure shows that the generalization capability of the pruned network does not change much.
Fig. 6. A segment of the actual and the desired network output for the validation set in the system identification example. (a) Validation input of system; (b) comparison between actual output (solid line) and the pruned H∞-network output (dashed line); (c) instantaneous error
5 Conclusions

In this paper, we have derived an efficient FNN training and pruning method using the H∞ filtering algorithm. An FNN model pruned with H∞ filtering provides a good architecture for generalization capacity and requires fewer connection weights than a fully connected network. Examples of nonlinear system identification are carried out to evaluate the performance of the network. The results show the effectiveness of neural network pruning and training by the H∞ filtering algorithm.
References

1. Rumelhart, D., Hinton, G., and Williams, G.: Learning Internal Representations by Error Propagation. In: Parallel Distributed Processing, Vol. 1. MIT Press, Cambridge, MA (1986) 318-362
2. Robbins, H., and Monro, S.: A Stochastic Approximation Method. Ann. Math. Stat. 22 (1951) 400-407
3. Bojarczak, O. S. P., and Stodolski, M.: Fast Second-order Learning Algorithm for Feedforward Multilayer Neural Networks and Its Application. Neural Networks 9(9) (1996) 1583-1596
4. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall, Inc. (1999)
5. Anderson, B. D. O., and Moore, J. B.: Optimal Filtering. Prentice-Hall, Englewood Cliffs, NJ (1979)
6. Singhal, S., and Wu, L.: Training Multilayer Perceptrons with the Extended Kalman Algorithm. In: Touretzky, D. S. (ed.): Advances in Neural Information Processing Systems, Vol. 1. Morgan Kaufmann, San Mateo, CA (1989) 133-140
7. Fukuda, W. K. T., and Tzafestas, S. G.: Learning Algorithms of Layered Neural Networks via Extended Kalman Filters. Int. J. Syst. Sci. 22(4) (1991) 753-768
8. Iiguni, Y., Sakai, H., and Tokumaru, H.: A Real-time Learning Algorithm for a Multilayered Neural Network Based on the Extended Kalman Filter. IEEE Trans. Signal Processing 40(4) (1992) 959-966
9. Chen, S., Cowan, C. F. N., Billings, S. A., and Grant, P. M.: Parallel Recursive Prediction Error Algorithm for Training Layered Neural Networks. Int. J. Contr. 51(6) (1990) 1215-1228
10. Kollias, S., and Anastassiou, D.: An Adaptive Least Squares Algorithm for the Efficient Training of Artificial Neural Networks. IEEE Trans. Circuits Syst. 36(8) (1989) 1092-1101
11. Leung, C. S., Wong, K. W., Sum, J., and Chan, L. W.: On-line Training and Pruning for RLS Algorithms. Electron. Lett. 32(23) (1996) 2152-2153
12. Tang, H., and Sato, T.: Structural Damage Detection Using the Neural Network and H∞ Algorithm. In: Kundu, T. (ed.): Health Monitoring and Smart Nondestructive Evaluation of Structural and Biological Systems III, Proceedings of SPIE, Vol. 5394. San Diego, CA (2004) 454-463
13. Abrahart, R. J., See, L., and Kneal, P. E.: Investigating the Role of Saliency Analysis with a Neural Network Rainfall-runoff Model. Computers & Geosciences 27(8) (2001) 921-928
14. Makarynskyy, O., Pires-Silva, A. A., Makarynska, D., and Ventura-Soares, C.: Artificial Neural Networks in Wave Predictions at the West Coast of Portugal. Computers & Geosciences 31(4) (2005) 415-424
15. Akaike, H.: A New Look at the Statistical Model Identification. IEEE Trans. Automat. Control AC-19(6) (1974) 716-723
16. Stone, M.: An Asymptotic Equivalence of Choice of Model by Cross-validation and Akaike's Criterion. J. Roy. Stat. Soc. Ser. B 39(1) (1977) 44-47
17. Krogh, A., and Hertz, J.: A Simple Weight Decay Can Improve Generalization. In: Touretzky, D. S. (ed.): Advances in Neural Information Processing Systems, Vol. 4. Morgan Kaufmann, San Mateo, CA (1992) 950-957
18. Williams, P. M.: Bayesian Regularization and Pruning Using a Laplace Prior. Neural Comput. 7(1) (1995) 117-143
19. LeCun, Y., Denker, J. S., and Solla, S. A.: Optimal Brain Damage. In: Touretzky, D. S. (ed.): Advances in Neural Information Processing Systems, Vol. 1. Morgan Kaufmann, San Mateo, CA (1990) 396-404
20. Hassibi, B., and Stork, D. G.: Second-order Derivatives for Network Pruning: Optimal Brain Surgeon. In: Hanson, S. J., Cowan, J. D., and Lee Giles, C. (eds.): Advances in Neural Information Processing Systems, Vol. 4. Morgan Kaufmann, CA (1993) 164-171
21. Sum, J., Leung, C., Young, G. H., and Kan, W.: On the Kalman Filtering Method in Neural-Network Training and Pruning. IEEE Trans. on Neural Networks 10(1) (1999) 161-166
22. Leung, C., Wong, K., Sum, P., and Chan, L.: A Pruning Method for the Recursive Least Squared Algorithm. Neural Networks 14(2) (2001) 147-174
23. Prechelt, L.: A Quantitative Study of Experimental Evaluations of Neural Network Learning Algorithms: Current Research Practice. Neural Networks 9(3) (1996) 457-462
24. Sato, T., and Qi, K.: Adaptive H∞ Filter: Its Application to Structural Identification. Journal of Engineering Mechanics 124(11) (1998) 1233-1240
25. Hassibi, B., Sayed, A. H., and Kailath, T.: Indefinite-Quadratic Estimation and Control: A Unified Approach to H2 and H∞ Theories. SIAM (1998)
26. Didinsky, G., Pan, Z., and Basar, T.: Parameter Identification of Uncertain Plants Using H∞ Methods. Automatica 31(9) (1995) 1227-1250
27. Wen, Y.-K.: Method for Random Vibration of Hysteretic Systems. Journal of Engineering Mechanics 102(EM2) (1976) 249-263
A Node Pruning Algorithm Based on Optimal Brain Surgeon for Feedforward Neural Networks

Jinhua Xu¹ and Daniel W.C. Ho²

¹ Department of Computer Sciences, East China Normal University
[email protected]
² Department of Mathematics, City University of Hong Kong
[email protected]
Abstract. In this paper, a node pruning algorithm based on optimal brain surgeon is proposed for feedforward neural networks. First, the neural network is trained to an acceptable solution using a standard training algorithm. After the training process, orthogonal factorization is applied to the outputs of the nodes in the same hidden layer to identify and prune the dependent nodes. Then, a unit-based optimal brain surgeon (UB-OBS) pruning algorithm is proposed to prune the insensitive hidden units to further reduce the size of the neural network, with no retraining needed. Simulations are presented to demonstrate the effectiveness of the proposed approach.
1
Introduction
The feedforward neural network (NN) is one of the most popular neural network topologies. One of the difficulties of using feedforward neural networks is determining the network architecture. In general, when a network is too small, it may not learn the problem well, while a network that is too large may lead to overfitting and poor generalization performance. Algorithms that can find an appropriate network architecture automatically are thus highly desirable. There are three major approaches to this problem: pruning algorithms [1], constructive algorithms [2], and regularization [3]. Recent interest has been growing in pruning algorithms, which start with an oversized network and remove unnecessary network parameters, either during training or after convergence to a local minimum. Network parameters considered for removal are individual weights, hidden units, and input units. Sensitivity analysis has been used successfully to prune irrelevant parameters from feedforward neural networks. Sensitivity analysis techniques quantify the relevance of a network parameter as the influence that small parameter perturbations have on a performance function. However, sensitivity measures can fail to identify correlations among nodes [4]. Other post-training pruning algorithms based on the correlations among nodes are proposed in [5, 6, 7]. In this paper, a new post-training pruning algorithm is proposed for feedforward neural networks. After the training process, orthogonal factorization is applied to the outputs of the nodes in the same hidden layer to identify and prune the linearly dependent nodes. Then, a new sensitivity measure based on optimal brain surgeon (UB-OBS) is proposed to prune the insensitive hidden units to further reduce the size of the neural network, with no retraining needed.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 524-529, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Problem Formulation
The neural networks considered in this paper are feedforward neural networks. For simplicity, the algorithm is derived for feedforward networks with one hidden layer; it is straightforward to extend it to networks with multiple hidden layers. For a one-hidden-layer feedforward network, the input–output relationship can be represented as

ŷ = W1 σ(W1 x + B1) replaced by: ŷ = W2 σ(W1 x + B1) + B2,  (1)

where x ∈ R^n and y ∈ R^p are the network input and output, respectively; n is the input dimension and p is the output dimension. W1 ∈ R^{m×n} and W2 ∈ R^{p×m} are the weight matrices from the input layer to the hidden layer and from the hidden layer to the output layer, respectively. B1 ∈ R^m and B2 ∈ R^p are the bias vectors of the hidden and output neurons, respectively, where m is the number of hidden neurons. σ is an activation function, such as a sigmoid function. With batch learning, a set of training patterns {x(t), y(t), t = 1, ..., N} is given. The output of the i-th hidden node for the t-th training pattern is denoted σ_i(t) = σ(W1_i x(t) + b1_i). Define the matrix

Φ := [φ_1, φ_2, ..., φ_m, φ_{m+1}],  (2)

where φ_i = [σ_i(1), σ_i(2), ..., σ_i(N)]^T, i = 1, ..., m, and φ_{m+1}(j) = 1 for j = 1, ..., N is the output of a bias unit for all training patterns. The output of the network is denoted Ŷ := [ŷ(1), ŷ(2), ..., ŷ(N)]^T, and letting W := [W2, B2] we obtain

Φ W^T = Ŷ.  (3)

The training problem can then be formulated as the optimization problem

min_θ V(θ) = tr{(Y − Φ W^T)^T (Y − Φ W^T)},  (4)

where Y = [y(1), y(2), ..., y(N)]^T and θ represents the weight parameters of the NN. Standard NN training algorithms such as backpropagation (BP) or Newton's method can be used to train the NN to a local minimum.
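As a concrete illustration, the activation matrix Φ of equation (2) can be assembled for a batch of patterns as follows; a minimal numpy sketch, where the network sizes and the logistic sigmoid are arbitrary choices of ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_activation_matrix(X, W1, B1):
    """Build Phi = [phi_1, ..., phi_m, phi_{m+1}] from equation (2):
    row t holds the hidden-node outputs for training pattern x(t),
    plus a constant 1 for the bias unit."""
    N = X.shape[0]
    Sigma = sigmoid(X @ W1.T + B1)               # N x m matrix of sigma_i(t)
    return np.hstack([Sigma, np.ones((N, 1))])   # N x (m+1)

# Example sizes (our own): n = 3 inputs, m = 5 hidden nodes, N = 10 patterns.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
W1 = rng.standard_normal((5, 3))
B1 = rng.standard_normal(5)
Phi = hidden_activation_matrix(X, W1, B1)
print(Phi.shape)  # (10, 6)
```

The last column of Φ is identically 1, matching the bias unit φ_{m+1} in the text.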
J. Xu and D.W.C. Ho

3 Dependent Nodes Pruning Algorithm Using QR Factorization
Consider a network trained by a standard training algorithm to a local minimum of the error function V in equation (4). QR factorization with column pivoting is applied to the matrix Φ to detect its independent columns:

Φ · P = Q · A,  (5)

where Q is an N × N unitary matrix, A is an upper triangular matrix of the same dimensions as Φ, and P is a permutation matrix with P^T P = P P^T = I, chosen so that the absolute values of the diagonal elements of A are in decreasing order. If |a_rr|/|a_11| ≥ ρ and |a_jj|/|a_11| < ρ for j = r+1, ..., min(m+1, N), where ρ is a very small positive number, then A has r nonzero diagonal elements, and therefore the matrix Φ has r independent columns. Decompose A in block matrix form as

A = [ A11  A12 ]
    [  0    0  ],  (6)

where A11 is an r × r upper triangular matrix and r is the number of nonzero diagonal elements of A. Substituting equation (5) into (3) and letting W̄ := W P gives

A W̄^T = Q^T Ŷ.  (7)

Correspondingly, decompose Q and W̄ in block matrix form as

Q = [Q1, Q2],   W̄ = [Wr, Wp],  (8)

where Wr and Wp are p × r and p × (m + 1 − r) matrices representing the output weights of the independent and dependent nodes, respectively:

[ A11  A12 ] [ Wr^T ]   [ Q1^T ]
[  0    0  ] [ Wp^T ] = [ Q2^T ] Ŷ.  (9)

One can verify that Q2^T Ŷ = 0. Setting Wp = 0, the independent weights Wr can be easily obtained by solving the linear equation

A11 Wr^T = Q1^T Ŷ.  (10)

From the above analysis, the output weights of the m + 1 − r dependent nodes can be set to zero and the output weights of the independent nodes recalculated using equation (10), without any effect on the network's input–output behavior.
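The two-stage computation above (pivoted QR to find r, then equation (10) for the surviving weights) can be sketched as follows; a hedged illustration using scipy, with toy dimensions and data of our own choosing:

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def prune_dependent_nodes(Phi, Y_hat, rho=1e-8):
    """QR with column pivoting exposes the numerically dependent columns
    of Phi; the weights of the surviving (independent) columns then solve
    A11 Wr^T = Q1^T Y_hat, as in equation (10)."""
    Q, A, perm = qr(Phi, pivoting=True)        # Phi[:, perm] = Q @ A
    diag = np.abs(np.diag(A))
    r = int(np.sum(diag / diag[0] >= rho))     # number of independent columns
    Wr_T = solve_triangular(A[:r, :r], Q[:, :r].T @ Y_hat)
    # Reassemble the full weight matrix: pruned (dependent) columns get zeros.
    W = np.zeros((Y_hat.shape[1], Phi.shape[1]))
    W[:, perm[:r]] = Wr_T.T
    return W, perm[r:]                         # weights, indices of pruned nodes

# A 6-column Phi whose last column duplicates the first, so r should be 5.
rng = np.random.default_rng(1)
Phi = rng.standard_normal((50, 5))
Phi = np.hstack([Phi, Phi[:, :1]])
Y_hat = Phi @ rng.standard_normal((6, 2))
W, pruned = prune_dependent_nodes(Phi, Y_hat)
print(len(pruned))                    # 1 dependent node detected
print(np.allclose(Phi @ W.T, Y_hat))  # input-output behavior preserved: True
```

The check at the end confirms the claim in the text: zeroing the dependent columns and re-solving for Wr leaves Φ W^T unchanged.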
4 Unit-Based Optimal Brain Surgeon Pruning
In this section, a novel sensitivity measure, the unit-based optimal brain surgeon (UB-OBS), is proposed to remove insensitive nodes, building on the optimal brain surgeon developed by Hassibi et al. [4]. For notational simplicity, the pruned network is described as
Φ W^T = Y,  (11)
where Φ is an N × r matrix of full column rank, and the remaining weights W can be easily obtained as

W = Y^T Φ (Φ^T Φ)^{-1}.  (12)

Pruning the j-th hidden node is equivalent to setting the corresponding weights from the j-th node to all the output nodes to zero, that is, W_ij = 0, i = 1, ..., p. The new weight matrix W after pruning needs to be updated so that the output Φ W^T remains as close as possible to Y. The functional Taylor series of the error function in (4) with respect to the output weights is

δV = (∂V/∂W) δW + (1/2) δW^T · H · δW + O(‖δW‖³),  (13)

where H = ∂²V/∂W² = Φ^T Φ is the Hessian matrix. The first term vanishes for a network trained to a local minimum, and the third term is zero. The updating process can be formulated as the optimization problem

min_{δW} δV   subject to   δW e_j = −W_j,  (14)

where e_j and W_j are the j-th columns of the identity matrix and of W, respectively, that is, W_j = W e_j. The constraint in (14) expresses that the j-th node is to be pruned. Solving the constrained problem in (14) by the Lagrangian method, δW and δV are obtained as follows:

δW = −W_j e_j^T (Φ^T Φ)^{-1} / (Φ^T Φ)^{-1}_{jj},  (15)
δV = W_j^T W_j / (Φ^T Φ)^{-1}_{jj}.  (16)

Therefore the sensitivity measure of the minimal effect of pruning the j-th node and readjusting the remaining weights can be defined as

mes_j := W_j^T W_j / (Φ^T Φ)^{-1}_{jj}.  (17)

The lower the value of mes_j, the smaller the effect on the output. Hence an insensitive node j* to be pruned can be chosen according to equation (17), and the updated weights Ŵ can be calculated using equation (15).
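The UB-OBS selection and weight readjustment of equations (15)–(17) can be sketched as follows; a minimal numpy illustration, with dimensions and data of our own choosing:

```python
import numpy as np

def ub_obs_prune_one(Phi, W):
    """Rank hidden nodes by the UB-OBS measure (17),
    mes_j = W_j^T W_j / (Phi^T Phi)^{-1}_{jj},
    pick the least sensitive node j*, and readjust the remaining weights
    with the update delta W from equation (15)."""
    Hinv = np.linalg.inv(Phi.T @ Phi)   # inverse Hessian, (Phi^T Phi)^{-1}
    mes = np.array([W[:, j] @ W[:, j] / Hinv[j, j] for j in range(W.shape[1])])
    j = int(np.argmin(mes))             # least influential node
    dW = -np.outer(W[:, j], Hinv[j, :]) / Hinv[j, j]   # equation (15)
    return W + dW, j, mes

rng = np.random.default_rng(2)
Phi = rng.standard_normal((40, 4))      # N = 40 patterns, r = 4 nodes
W = rng.standard_normal((2, 4))         # p = 2 outputs
W_new, j, mes = ub_obs_prune_one(Phi, W)
print(np.allclose(W_new[:, j], 0.0))    # pruned column is zeroed: True
```

By construction, the update (15) drives the j*-th column of W exactly to zero while redistributing the error over the remaining weights, so no retraining is needed.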
5 Simulation Results
In this section, a regression problem is used to demonstrate the effectiveness of the proposed node pruning algorithm. The performance of the network is measured by the fraction of variance unexplained (FVU) metric; for details on the meaning of the FVU, see [8]. Note that the FVU is proportional to the mean square error (MSE).
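As a sketch of the metric (the precise definition is given in [8]), the FVU can be read as the residual sum of squares normalized by the total variance of the target, which makes it proportional to the MSE:

```python
import numpy as np

def fvu(y_true, y_pred):
    """Fraction of variance unexplained: residual sum of squares divided
    by the total variance of the target."""
    resid = np.sum((y_true - y_pred) ** 2)
    total = np.sum((y_true - np.mean(y_true)) ** 2)
    return resid / total

y = np.array([1.0, 2.0, 3.0, 4.0])
print(fvu(y, y))                 # perfect fit -> 0.0
print(fvu(y, np.full(4, 2.5)))   # predicting the mean -> 1.0
```

An FVU of 0 corresponds to a perfect fit and an FVU of 1 to a model no better than predicting the target mean.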
Table 1. Comparison of the mean training and generalization FVUs for the HAS regression problem (SNR = −3 dB, 40 runs)

Sensitivity measure    Number of hidden nodes    Training FVU    Generalization FVU
Measure in [5]                   10                 0.7764             0.3864
Proposed measure                 10                 0.7368             0.2758
[Figure 1. Comparison of the FVU for training data (vertical axis, 0.7–1) versus the number of hidden nodes (horizontal axis, 0–25) using different sensitivity measures (solid line: proposed measure; dashed line: measure in [5]).]
Consider the following five-dimensional function (HAS) [8]:

g(x1, x2, x3, x4, x5) = 0.0647 (12 + 3x1 − 3.5x2² + 7.2x3³)(1 + cos 4πx4)(1 + 0.8 sin 3πx5).

Three thousand random samples with additive noise (SNR = −3 dB) were generated for network training. Ten thousand random samples without additive noise were generated for verifying the network's generalization capability. Forty independent runs were performed, where for each run the network was initialized with a different set of weight values. Neural networks with 20 hidden nodes were trained to an acceptable solution, i.e., a training FVU of less than 0.71. Then, in the first stage, dependent nodes were pruned using QR factorization; in the second stage, insensitive nodes were selected and pruned until ten hidden nodes remained, and the FVUs for training and generalization were checked. The mean training and generalization FVUs for the HAS function are summarized in Table 1. The sensitivity measure in [5] was defined as

Ṽ_j = (φ_j^T φ_j)(W_j^T W_j).  (18)
For comparison, this measure is also used in the second stage to select the insensitive nodes, and the output weights of the remaining nodes are readjusted by computing the pseudo-inverse. The result is also shown in Table 1. It can be seen that both the training and the generalization performance are much better using our proposed sensitivity measure. Figure 1 displays the mean training FVU after each hidden node is pruned, for both measures. It follows that the proposed algorithm has less effect on the neural network's performance after each hidden node is pruned.
6 Conclusion
In this paper, a node pruning algorithm based on optimal brain surgeon is proposed for feedforward neural networks. In the first stage, orthogonal factorization is applied to the outputs of the nodes in the same hidden layer to identify and prune the linearly dependent nodes. In the second stage, a unit-based optimal brain surgeon (UB-OBS) pruning algorithm is used to prune the insensitive hidden units to further reduce the size of the neural network. No retraining is needed.
References
1. Reed, R.: Pruning Algorithms - A Survey. IEEE Trans. Neural Networks 4(5) (1993) 740-747
2. Kwok, T., Yeung, D.: Constructive Algorithms for Structure Learning in Feedforward Neural Networks for Regression Problems. IEEE Trans. Neural Networks 8(3) (1997) 630-645
3. Girosi, F., Jones, M., Poggio, T.: Regularization Theory and Neural Network Architectures. Neural Comput. 7(3) (1995) 219-269
4. Hassibi, B., Stork, D.G.: Second Order Derivatives for Network Pruning: Optimal Brain Surgeon. In: Hanson, S.J. (Ed.): Advances in Neural Information Processing, Vol. 5. Morgan Kaufmann (1993) 164-171
5. Castellano, G., Fanelli, A.M., Pelillo, M.: An Iterative Pruning Algorithm for Feedforward Neural Networks. IEEE Trans. Neural Networks 8(3) (1997) 519-531
6. Chandrasekaran, H., Chen, H., Manry, M.T.: Pruning of Basis Functions in Nonlinear Approximators. Neurocomputing 34(1) (2000) 29-53
7. Fletcher, L., Katkovnik, V., Steffens, F.E., Engelbrecht, A.P.: Optimizing the Number of Hidden Nodes of a Feedforward Artificial Neural Network. In: IEEE World Congress on Computational Intelligence, Anchorage, Alaska (1998) 1608-1612
8. Ma, L., Khorasani, K.: New Training Strategies for Constructive Neural Networks with Application to Regression Problems. Neural Networks 17(4) (2004) 589-609
A Fast Learning Algorithm Based on Layered Hessian Approximations and the Pseudoinverse

E.J. Teoh, C. Xiang, and K.C. Tan
Department of Electrical and Computer Engineering, National University of Singapore, 4 Engineering Drive 3, Singapore 117576
[email protected]

Abstract. In this article, we present a simple, effective learning method for an MLP that is based on approximating the Hessian using only local information, specifically the correlations of output activations from previous layers of hidden neurons. Training the hidden layer weights with this Hessian approximation, combined with training the final output layer of weights using the pseudoinverse [1], yields improved performance at a fraction of the computational and structural complexity of conventional learning algorithms.
1 Introduction
Typical learning algorithms for training feedforward neural architectures for applications such as classification, regression, and forecasting utilize well-known optimization techniques. These numerical optimization methods usually exploit first-order (Jacobian) and second-order (Hessian) information. Typically, the Jacobian or gradient of the cost function can be computed quite readily and conveniently; the same cannot be said of the Hessian, particularly for larger networks as the number of free parameters (synaptic weights) increases. With this in mind, approximation of the Hessian matrix, commonly found in numerous quasi-Newton approaches to optimization, is of great interest. Although more efficient methods for computing the Hessian exactly have been suggested [2, 3], exact computation of the Hessian is costly; moreover, in many practical applications, such accurate computation is not required. Some of the more well-known approximation techniques include using only the diagonal entries of the Hessian, omitting certain terms (such as the second partial derivatives) [4], or numerical differentiation [3]. While various approximations to the Hessian have been proposed, computing the Hessian based solely on local information, such as the (input and output) signals available to the weights of a particular layer, is often desirable, as this may allow the training of a multi-layered neural architecture to be partitioned into layers, each of which can be trained separately. On the other hand, non-gradient based approaches, such as evolutionary algorithms (EAs) and the ELM paradigm [1], which trains only the output layer weights while leaving the hidden layer weights randomly initialized with a uniform probability distribution, provide interesting alternative methods of training a neural network.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 530–536, 2006.
© Springer-Verlag Berlin Heidelberg 2006

Non-gradient based techniques are attractive particularly because they are less likely to converge to a local minimum, which holds true for increasingly complex problems. However, EAs are largely stochastic optimizers and are hence slow in discovering solutions; the ELM approach is fast in obtaining solutions, but at the cost of needing an architecturally more complex structure (as measured by the number of hidden layer neurons used). In [1], the Moore–Penrose generalized inverse (more commonly known as the pseudoinverse) was used to train the output weights of the feedforward network (either a conventional MLP or an RBF network) while leaving the hidden layer weights (initialized randomly from a uniform probability distribution) untrained. While this method is extremely fast in the training stage, a relatively large number of hidden layer neurons is needed; some form of training of the hidden layer weights, if utilized, can always obtain a smaller, or minimal, network architecture using fewer hidden neurons. The number of iterations (or epochs) additionally used to train the hidden layer weights need not be exhaustive, as we will subsequently demonstrate.
2 The Proposed Approach
Given a cost function E(w): R^n → R, such as the sum-of-squares cost function, we can construct a localized model of E(w) using a Taylor series expansion about a nominal point w0 and subsequently analyze the effect of perturbing the parameter vector w about w0 (the local minimum). Thus we obtain

E(w) = E(w0) + J(w0)(w − w0) + (1/2)(w − w0)^T H(w0)(w − w0) + O(‖w − w0‖³),

where J(w) and H(w) are the Jacobian (gradient) and Hessian of the cost function, respectively. The essential idea underlying the proposed approach is the layer-wise (iterative) estimation of the Hessian, which, together with the gradient (Jacobian) information, forms a weight update rule for the hidden layer weights. This, coupled with the least-squares or pseudoinverse approach for the weights of the final (output) layer, is the basis of our proposed training algorithm, which offers faster convergence and better generalizability. We manipulate the weights indirectly through the modification of the desired activations of each layer; this is derived by noticing that the layers of the network separate into a linear block (given by the affine form of the locally induced field) and a nonlinear block (given by the activation function of the nodes). Consider the activation of a layer k at time step (iteration) t as Yk(t), where

Yk(t) = Xk(t) Wk(t),  (1)
Xk+1(t) = [f(Yk(t))  1].  (2)
The new estimate for Yk at the next time step is Yk (t + 1) = Yk (t) + Dk (t) where Dk (t) is the estimate for the direction in which the new activation for layer k should proceed. Dk (t) can be derived using a variety of methods, but for simplicity, is found using the gradient descent on the cost function with respect to the
output activations of the hidden layers Yk(t), i.e., Dk(t) = −η(t) ∇_{Yk(t)} E(w, t). Thus,

Yk(t+1) = Yk(t) − η(t) ∇_{Yk(t)} E(w, t),  (3)
Yk(t+1) = Xk(t) Wk(t+1),  (4)
Xk(t) Wk(t+1) = Xk(t) Wk(t) − η(t) ∇_{Yk(t)} E(w, t),  (5)
Wk(t+1) = Xk(t)† Xk(t) Wk(t) − η(t) Xk(t)† ∇_{Yk(t)} E(w, t),  (6)
Wk(t+1) = Wk(t) − η(t) Xk(t)† ∇_{Yk(t)} E(w, t).  (7)

Clearly, ∇_{Wk(t)} E(w, t) = Xk(t)^T ∇_{Yk(t)} E(w, t), so that ∇_{Yk(t)} E(w, t) = (Xk(t)^T)^{-1} ∇_{Wk(t)} E(w, t) when the inverse exists. By virtue of the fact that Xk(t)† = (Xk(t)^T Xk(t))^{-1} Xk(t)^T, this then leads to

Wk(t+1) = Wk(t) − η(t) Xk(t)† ∇_{Yk(t)} E(w, t),  (8)
Wk(t+1) = Wk(t) − η(t) (Xk(t)^T Xk(t))^{-1} Xk(t)^T ∇_{Yk(t)} E(w, t),  (9)
Wk(t+1) = Wk(t) − η(t) (Xk(t)^T Xk(t) + λ1 I)^{-1} ∇_{Wk(t)} E(w, t),  (10)
Wk(t+1) = Wk(t) − η(t) (Φk(t) + λ1 I)^{-1} ∇_{Wk(t)} E(w, t).  (11)
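A single application of the update rule (11) can be sketched as follows; the toy layer, gradient, and parameter values below are illustrative choices of ours (a linear least-squares layer, so the gradient has a closed form):

```python
import numpy as np

def layered_hessian_step(Wk, Xk, grad_Wk, eta=0.1, lam1=1e-3):
    """Update rule (11): precondition the weight gradient with the inverse
    of the regularized activation correlation matrix Phi_k = Xk^T Xk."""
    Phi_k = Xk.T @ Xk                         # local Hessian estimate
    return Wk - eta * np.linalg.solve(Phi_k + lam1 * np.eye(Phi_k.shape[0]),
                                      grad_Wk)

# Toy layer with E = 0.5*||Xk Wk - D||^2, whose gradient w.r.t. Wk is
# Xk^T (Xk Wk - D); repeated steps should reach the least-squares solution.
rng = np.random.default_rng(4)
Xk = rng.standard_normal((30, 4))
D = rng.standard_normal((30, 2))
Wk = np.zeros((4, 2))
for _ in range(50):
    grad = Xk.T @ (Xk @ Wk - D)
    Wk = layered_hessian_step(Wk, Xk, grad, eta=1.0, lam1=1e-6)
err = np.linalg.norm(Xk.T @ (Xk @ Wk - D))
print(err < 1e-6)   # gradient norm at the least-squares solution: True
```

With η = 1 and a small λ1, the preconditioned step is nearly a Newton step for this quadratic cost, which is why convergence is so fast here.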
Here, Φk(t) represents the correlation matrix of the activation outputs of the preceding layer, and λ1 is the regularization parameter for the hidden layer weights. The gradient (Jacobian) of the cost function with respect to the weights in layer k, ∇_{Wk(t)} E(w, t), can be computed recursively layer by layer. This approach was previously suggested in [5, 6] (without the use of the pseudoinverse) and offers a simpler way of building an approximation to the Hessian in a second-order approach. The use of the pseudoinverse in the layer-by-layer computation of the weights affords us the ability to introduce regularization terms to 'control' the resulting size of the weights in each layer, as measured by its L2-norm. This prevents overtraining of the network and is similar in spirit to 'early stopping'. As was rigorously discussed by Bartlett, the size of the weights of the network is important for its generalization performance [7]. The computational advantage obtained from the proposed approach, however, depends on the number of neurons used in the hidden layer(s). Another point worth noting is that this layered estimation of the Hessian assumes the independence of weights on different layers; such an assumption is generally acceptable when learning is close to convergence [6]. The convergence rate using only the pseudoinverse as a learning algorithm is extremely fast; however, using the pseudoinverse to determine the weights of a layer in a multi-layer feedforward network is optimal only with respect to the weights of the other layers. That is to say, optimality is only guaranteed for that layer: in a global context, the weights found using the pseudoinverse are locally optimal. Based on Cover's 1965 function counting theorem [8], we know that projecting samples or observations from a low dimensional input space to a higher
dimensional feature space makes the transformed features more likely to be linearly separable. Placed in our context, we can then use a single hidden layer with a large number of neurons (with randomly initialized hidden layer weights) and find the output layer weights using the pseudoinverse. Note that the output nodes need not use a linear activation function, for if f(y) = d, then y = f^{-1}(d), where d is the desired signal and f^{-1}(·) is the functional inverse; the pseudoinverse is then used with y as the desired signal. Like other numerical minimization algorithms, the method suggested here is an iterative procedure. The gradient (Jacobian) of the sum-of-squares cost function with respect to the weights in the different layers can be found quite readily. The estimate of the Hessian is based on a layer-wise approximation that uses only local information (the correlation of the activation signals from the previous layer); clearly, this Hessian approximation is a function of the layer index k and the iteration step t. The output layer, using a linear activation function, is computed using the pseudoinverse. For the final (output) layer (k = L), with the superscript † denoting the pseudoinverse and d the desired signals,

XL−1(t) wL(t+1) = d,  (12)
wL(t+1) = XL−1(t)† d,  (13)
wL(t+1) = (XL−1(t)^T XL−1(t) + λ2 I)^{-1} XL−1(t)^T d.  (14)
XL−1(t) is the set of activation (output) values from the previous layer at the current time step. The system is usually overdetermined, as there are more samples than hidden layer neurons in layer L − 1. This training process, in a way, allows a layer-by-layer approach to determining the appropriate set of weights for each hidden layer; by utilizing different learning algorithms on different layers, we are able to exploit the properties and characteristics of each layer.
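The regularized output-layer solution (14) can be sketched as follows; a minimal numpy illustration with arbitrary sizes of ours, where setting λ2 = 0 recovers the plain pseudoinverse (least-squares) solution:

```python
import numpy as np

def output_layer_weights(X_prev, d, lam2=1e-3):
    """Equation (14): ridge-regularized least-squares solution for the
    linear output layer, w_L = (X^T X + lam2 I)^{-1} X^T d."""
    G = X_prev.T @ X_prev + lam2 * np.eye(X_prev.shape[1])
    return np.linalg.solve(G, X_prev.T @ d)

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 7))   # overdetermined: 100 samples, 7 neurons
d = rng.standard_normal((100, 1))
w = output_layer_weights(X, d, lam2=0.0)
w_ls, *_ = np.linalg.lstsq(X, d, rcond=None)
print(np.allclose(w, w_ls))         # lam2 = 0 matches the pseudoinverse: True
```

A positive λ2 shrinks the weight norm, which is the explicit regularization mechanism discussed in the following section.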
3 Experimental Results
The datasets used to demonstrate the effectiveness and efficiency of the proposed approach are taken from the UCI Machine Learning Repository. We compare the proposed approach with the ELM paradigm [1] and an MLP-based neural network (using the NETLAB toolbox). All initial simulation parameters (such as the weights, ∈ [−1, 1]) were randomly initialized but are the same for the different approaches. The comparison of the proposed algorithm with the back-propagation algorithm (scaled conjugate gradient variant) is based on CPU time. The speed-up of the proposed algorithm over the BP algorithm is approximately 3–5 times; notably, it is difficult to compare the speeds of algorithms based solely on CPU time, as different implementation and coding techniques may bias the results and hence the run times. Moreover, the convergence behaviors of the algorithms differ; for example, the ELM approach converges in a single step ('one-shot'
Table 1. Performance comparisons – mean accuracies (standard deviations)

Dataset       | MLP train       | MLP test        | Fast train (λ=0.001) | Fast test (λ=0.001) | Fast train (λ=1) | Fast test (λ=1) | ELM train       | ELM test
Diabetes      | 0.7777 (0.0146) | 0.7517 (0.0238) | 0.8184 (0.0178)      | 0.7403 (0.0245)     | 0.7945 (0.0132)  | 0.7629 (0.0193) | 0.7690 (0.0188) | 0.7530 (0.0256)
Breast cancer | 0.9680 (0.0072) | 0.9659 (0.0106) | 0.9789 (0.0110)      | 0.9668 (0.0072)     | 0.9790 (0.0054)  | 0.9666 (0.0078) | 0.9666 (0.0087) | 0.9640 (0.0110)
Biomedical    | 0.9166 (0.0199) | 0.8861 (0.0329) | 0.9285 (0.0333)      | 0.8671 (0.0380)     | 0.9178 (0.0157)  | 0.8821 (0.0298) | 0.8963 (0.0170) | 0.8871 (0.0309)
Sonar         | 0.8695 (0.0409) | 0.7395 (0.0492) | 0.9924 (0.0091)      | 0.7388 (0.0392)     | 0.9729 (0.0126)  | 0.7507 (0.0358) | 0.7270 (0.0451) | 0.6776 (0.0626)
Ionosphere    | 0.8884 (0.0237) | 0.8357 (0.0340) | 0.9886 (0.0072)      | 0.8736 (0.0270)     | 0.9744 (0.0097)  | 0.8664 (0.0245) | 0.8227 (0.0383) | 0.7967 (0.0388)
Heart         | 0.8646 (0.0211) | 0.8088 (0.0357) | 0.9404 (0.0174)      | 0.7837 (0.0350)     | 0.8981 (0.0163)  | 0.8295 (0.0251) | 0.8087 (0.0369) | 0.7963 (0.0462)
Glass         | 0.9617 (0.0127) | 0.9184 (0.0275) | 0.9913 (0.0097)      | 0.8635 (0.0948)     | 0.9710 (0.0117)  | 0.9219 (0.0224) | 0.9340 (0.0182) | 0.9073 (0.0272)
Liver         | 0.7351 (0.0282) | 0.6736 (0.0369) | 0.7836 (0.0406)      | 0.6581 (0.0341)     | 0.7539 (0.0199)  | 0.7032 (0.0377) | 0.7069 (0.0252) | 0.6706 (0.0404)
E.Coli        | 0.9503 (0.0165) | 0.8039 (0.1232) | 0.9656 (0.0123)      | 0.9250 (0.0318)     | 0.9670 (0.0084)  | 0.9467 (0.0173) | 0.9543 (0.0125) | 0.8150 (0.1677)
Iris          | 0.9498 (0.0269) | 0.9450 (0.0342) | 0.9682 (0.0318)      | 0.9420 (0.0461)     | 0.9587 (0.0149)  | 0.9423 (0.0327) | 0.9431 (0.0269) | 0.9307 (0.0421)
IMOX          | 0.9214 (0.0217) | 0.8695 (0.0432) | 0.9674 (0.0186)      | 0.8976 (0.0322)     | 0.9552 (0.0158)  | 0.9126 (0.0283) | 0.8864 (0.0249) | 0.8453 (0.0414)
Wine          | 0.9498 (0.0298) | 0.9373 (0.0388) | 0.9687 (0.0322)      | 0.9320 (0.0457)     | 0.9598 (0.0145)  | 0.9417 (0.0309) | 0.9491 (0.0365) | 0.9263 (0.0427)
learning), as no iterative correction of the weights is required since the training of the linear output layer of weights is achieved via the pseudoinverse. The MLP, on the other hand, takes longer to converge, while the proposed algorithm's convergence speed lies between the two, depending on the problem complexity and network size. The tansig nonlinearity is used in the hidden layers. The results comparing the performance of the MLP, the proposed approach (using two different regularization parameters), and the ELM are given in Table 1. 60% of the available data was used for training and the remaining 40% for testing; these were randomly selected for each of the 50 runs, with training for 10 epochs and 10 hidden neurons used.
4 Discussion
The 'size' of the weights, as measured by the norm of the weights at each layer, is smaller using the proposed approach than with backpropagation. For the output layer weights, this can be attributed to the operation of the pseudoinverse, which obtains the minimum-norm set of weights that solves an overdetermined problem in a least-squares sense. For better generalization performance, clearly, exactly fitting the network to the data of
the training set is not desirable. Some form of regularization is advantageous in preventing the network from being overtrained; however, determining the appropriate amount of regularization is an art in itself. Regularization is introduced explicitly through the regularization parameter λ and implicitly through the approximation of the Hessian. The approximations made during the weight update in the training phase are akin to regularization, since little attempt is made to fit the samples in the training data exactly. In the approach presented here, the Hessian is approximated on a layer-by-layer basis using the correlation matrix of the output activations from the preceding layer. Moreover, exact computation of the Hessian is not only computationally expensive but may also result in an ill-conditioned, or even singular, matrix. The regularization term introduced in the computation of the pseudoinverse affects the classification performance of the resulting feedforward network: generally, a larger regularization term leads to poorer training performance but better testing performance. Geometrically, this corresponds to the construction of increasingly smoother separating hyperplanes in the hidden layer space as larger regularization is used. Clearly, the magnitude of the regularization term assists the generalization ability of the resulting network only up to a certain value, beyond which any increase is detrimental to both the training and testing performance. While the ELM is fast in the training stage, from the perspective of architectural or structural complexity it requires more hidden neurons than is typically expected of a network that is trained on all its weights. In other words, the output layer weights found from the use of the pseudoinverse in the ELM method are dependent on (or 'slaved' [9] to) the randomly initialized weights of the hidden layer.
The output layer weights are globally optimal, but only with respect to the hidden layer weights found during the training process (as in the proposed approach) or initially randomized (as in the ELM paradigm). The use of the pseudoinverse to solve for the output layer weights in a least-squares sense quickens convergence of the algorithm. As such, fewer training epochs are required for the proposed algorithm, which essentially combines the speed of the ELM and second-order gradient descent methods with the minimal network requirement of MLPs trained on all layers. The strength of the proposed approach lies in (1) its rate of convergence, (2) its speed resulting from its simplicity of implementation, and (3) a minimal network structure requirement.
5 Conclusion
The proposed approach is less complex on each iteration (epoch) while still retaining the fast convergence of the pseudoinverse used in [1], yet uses a smaller network; by exploiting Cover's 1965 theorem [8], together with the pseudoinverse and an approximation to the Hessian, it demonstrates an increase in performance on both the training and testing sets, in addition to a faster convergence rate. The pseudoinverse is used to train the output layer weights while
a second-order method using a Hessian approximation is used to train the hidden layer weights. The regularization terms introduced allow fine-tuning of the weights in all layers. The approach proposed here can possibly be applied to networks with multiple hidden layers, which is a direction for future exploration. As the structural complexity of the network increases, or if noise is present in the training examples, it makes little sense to impose a 'strong' or 'strict' learning algorithm, which would tend to overtrain the network and thus decrease the ability of the resulting network to generalize to unseen data. Correspondingly, approximate methods incorporating regularization terms allow better and faster training techniques to be applied.
Acknowledgement
The authors would like to thank Dr. Huang Guang-Bin for directing them to some of his earlier work, which provided some direction for their initial work.
References
1. Huang, G., Zhu, Q., Siew, C.: Extreme Learning Machine: A New Learning Scheme of Feedforward Neural Networks. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN), IEEE (2004)
2. Bishop, C.: Exact Calculation of the Hessian Matrix for the Multi-Layer Perceptron. Neural Computation 4(4) (1992) 494–501
3. Buntine, W., Weigend, A.: Computing Second Derivatives in Feed-Forward Networks: A Review. IEEE Trans. Neural Networks 5(3) (1994) 480–488
4. Press, W., Flannery, B., Teukolsky, S., Vetterling, W.: Numerical Recipes in C: The Art of Scientific Computing, 2nd ed. Cambridge University Press (1994)
5. Scalero, R., Tepedelenlioglu, N.: A Fast New Algorithm for Training Feedforward Neural Networks. IEEE Trans. Signal Processing 40(1) (1992)
6. Parisi, R., Claudio, E.D., Orlandi, G., Rao, B.: A Generalized Learning Paradigm Exploiting the Structure of Feedforward Neural Networks. IEEE Trans. Neural Networks 7(6) (1996) 1450–1460
7. Bartlett, P.: The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights is More Important than the Size of the Network. IEEE Trans. Information Theory 44(2) (1998) 525–536
8. Cover, T.: Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition. IEEE Trans. Electronic Comput. 14 (1965) 326–334
9. Lowe, D.: Adaptive Radial Basis Function Nonlinearities, and the Problem of Generalization. In: 1st IEE International Conference on Artificial Neural Networks (1989) 171–175
A Modular Reduction Method for k-NN Algorithm with Self-recombination Learning

Hai Zhao and Bao-Liang Lu
Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Rd., Shanghai 200240, China
[email protected]
Abstract. A difficulty faced by existing reduction techniques for the k-NN algorithm is that they require loading the whole training data set; these approaches therefore become inefficient when used to solve large-scale problems. To overcome this deficiency, we propose a new method for reducing samples for the k-NN algorithm. The basic idea behind the proposed method is a self-recombination learning strategy, originally designed for combining classifiers, which speeds up response time by reducing the number of base classifiers to be checked and improves generalization performance by rearranging the order of training samples. Experimental results on several benchmark problems indicate that the proposed method is valid and efficient.
1 Introduction
Given a training set of previously labeled samples and an unknown sample x, the k-nearest neighbor (k-NN) algorithm assigns to x the class label most frequently represented among the k closest prototypes. The k-NN algorithm and its derivatives have been shown to perform well for pattern classification in many domains over the last 40 years. In theory, the k-NN algorithm is asymptotically optimal in the Bayesian sense [1]; in other words, k-NN performs as well as any other classifier, provided that there is an arbitrarily large number of (representative) prototypes available and the volume of the k-neighborhood of x is arbitrarily close to zero for all x. Basically, the k-NN algorithm needs the entire training set as the representative set. Therefore, the space required to store the complete set and the high computational cost of evaluating new samples are often criticized. The k-NN algorithm is also sensitive to outliers, i.e., noisy or even erroneously labeled prototypes. To overcome these drawbacks, various techniques
To whom correspondence should be addressed. This work was supported by the National Natural Science Foundation of China under the grants NSFC 60375022 and NSFC 60473040.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 537–544, 2006. c Springer-Verlag Berlin Heidelberg 2006
have been developed. Two main types of algorithms can be identified: prototype generation and prototype selection. The first group focuses on merging the initial prototypes into a small set of prototypes so that the performance of the k-NN algorithm is optimized; examples of such techniques are the k-means algorithm [2] and the learning vector quantization algorithm [3]. The second group aims at reducing the initial training set and/or increasing the accuracy of the NN prediction. Various editing and condensing methods have been developed, such as the condensing algorithm [4] and the editing algorithm [5]. Wilson et al. recently gave a good review of these reduction techniques for the k-NN algorithm [6]. All of these existing reduction algorithms, however, still need to store the whole training set first to meet the requirement of fast processing, because all of them are in effect k-NN based self-tests of the training set. Therefore, unless parallel or modular techniques are introduced into the k-NN algorithm, the existing k-NN reduction techniques cannot efficiently reduce a large-scale data set. A modular k-NN algorithm has been proposed in our previous work [7]. This modular k-NN algorithm is mainly suited to very large-scale data sets, and a user cannot divide a training data set into too many subsets if a stable combining accuracy is to be maintained. Otherwise, auxiliary techniques, such as clustering, should be introduced to help partition the training data set [8][9]. However, clustering procedures also require additional computing time and the same storage space unless a parallel or modular version of the clustering algorithm exists. In short, unless we can directly solve the reduction problem for the k-NN algorithm, or there exists a method that can carry out both partition and reduction at the same time, we will face many other concomitant problems.
In this paper, a self-recombination learning based reduction method is proposed to overcome the above difficulties. The proposed method is based on a combining classifier derived from the quadratic decomposition of a two-class task. It is also a framework for parallel or modular self-test based on the k-NN algorithm.
2 Combining Classifier
Any multi-class classification problem can be solved with two-class classification techniques according to the well-known one-against-rest and one-against-one task decomposition methods. In particular, the one-against-one task decomposition method is an inherently parallel and modular processing method for binary classifier modules. Thus, we only consider two-class combining classifiers in the remainder of this paper. We first discuss the decomposition of a two-class classification task. Let X^+ and X^- be the given positive and negative training data sets, respectively, for a two-class classification task T,
X^+ = {(x_i^+, +1)}_{i=1}^{l^+},   X^- = {(x_i^-, -1)}_{i=1}^{l^-}    (1)
where x_i^+ ∈ R^n and x_i^- ∈ R^n are the input vectors, and l^+ and l^- denote the numbers of positive and negative training samples, respectively. X^+ and X^- can be partitioned into N^+ and N^- subsets, respectively,

X_j^+ = {(x_i^{+j}, +1)}_{i=1}^{l_j^+},   j = 1, ..., N^+    (2)

X_j^- = {(x_i^{-j}, -1)}_{i=1}^{l_j^-},   j = 1, ..., N^-

where ∪_{j=1}^{N^+} X_j^+ = X^+, 1 ≤ N^+ ≤ l^+, and ∪_{j=1}^{N^-} X_j^- = X^-, 1 ≤ N^- ≤ l^-. Matching all these subsets, X_i^+ and X_j^-, where i = 1, ..., N^+ and j = 1, ..., N^-, yields N^+ × N^- relatively smaller and (if needed) more balanced two-class subset pairs. Each of them can be expressed as

T^{(i,j)} = X_i^+ ∪ X_j^-,   i = 1, ..., N^+,  j = 1, ..., N^-    (3)
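The quadratic decomposition of Eqs. (2)-(3) can be sketched as follows. This is a minimal illustration under the assumption of a simple contiguous split; the toy data and subset counts are hypothetical:

```python
from itertools import product

def quadratic_decompose(pos, neg, n_pos_subsets, n_neg_subsets):
    """Split the positive and negative sets into subsets and pair them,
    yielding the N+ x N- two-class training subsets T(i, j) of Eq. (3)."""
    def split(data, n):
        size = (len(data) + n - 1) // n  # ceiling division
        return [data[k * size:(k + 1) * size] for k in range(n)]
    pos_subsets = split(pos, n_pos_subsets)
    neg_subsets = split(neg, n_neg_subsets)
    # T(i, j) = X_i^+ union X_j^-
    return {(i, j): pos_subsets[i] + neg_subsets[j]
            for i, j in product(range(n_pos_subsets), range(n_neg_subsets))}

# Toy example: 6 positive and 4 negative samples, N+ = 3, N- = 2
pos = [([p], +1) for p in range(6)]
neg = [([n], -1) for n in range(4)]
subsets = quadratic_decompose(pos, neg, 3, 2)
print(len(subsets))  # 3 x 2 = 6 subset pairs
```

Each T(i, j) contains one positive and one negative subset, so all pairs are smaller and more balanced than the full training set.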
Assign a k-NN classifier as a base classifier to learn from each subset T^{(i,j)}. In the learning phase, all of the k-NN base classifiers are independent of each other and can work efficiently in parallel. Where no confusion arises, we also denote the corresponding k-NN base classifier by T^{(i,j)}. For convenience, the base classifiers T^{(i,j)}, j = 1, ..., N^-, are defined as a 'positive group' with group label i, and the base classifiers T^{(i,j)}, i = 1, ..., N^+, are defined as a 'negative group' with group label j. Intuitively, if all classifiers in a positive group support the positive class for an unknown input sample (note that this is a very strong condition), then it is natural to assign this unknown sample the positive class label. The same holds for a negative sample and its related negative group. We call a positive group all of whose member base classifiers support the positive class a 'positive winning group', and a negative group all of whose member base classifiers support the negative class a 'negative winning group'. Note that positive and negative winning groups cannot exist at the same time, because the two groups must share one base classifier, and this classifier cannot output two different classification results simultaneously. To finish the combination without any preference, we define the combination strategy as follows: the combining output is the positive class when no negative winning group exists, and the negative class when no positive winning group exists. Sometimes neither a positive nor a negative winning group exists in practice, which is why we state the definitions with these negated conditions.
To handle effectively the case where no winning group exists, and to realize an efficient combination, the following symmetrical selection algorithm is given as a practical realization of the above combining strategy.
1) Set the initial labels, i = 1 and j = 1.
2) Repeat the following operations:
   (a) Check the base classifier T^{(i,j)}.
   (b) If T^{(i,j)} supports the positive class, then j = j + 1.
   (c) If T^{(i,j)} supports the negative class, then i = i + 1.
   (d) If i = 1 + N^+, then the combining output is the negative class and the algorithm ends here.
   (e) If j = 1 + N^-, then the combining output is the positive class and the algorithm ends here.
We now show that the number of base classifiers checked by the symmetrical selection algorithm for each input sample will never be larger than N^+ + N^- - 1. Intuitively, imagine that all outputs of the base classifiers are represented in a binary matrix with N^- columns and N^+ rows. The algorithm is then a search procedure in this matrix, starting at the top-left corner and ending near the bottom-right corner. Each check of a base classifier excludes one row or one column of the matrix, and the search turns to the next row or column without any backtracking. The maximum-length search from the top-left corner to the bottom-right corner covers N^+ + N^- - 1 elements. That is, the number of base classifiers checked by symmetrical selection for an input sample is at most N^+ + N^- - 1, which is much less than the total number of base classifiers, N^+ × N^-. The theoretical analysis in our existing study has given a probability model for symmetrical selection based combination through the performance estimation of the two-class combining classifier [10].
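The symmetrical selection procedure above can be sketched as a walk over the output matrix. The `predict` callback standing in for the base classifier T^{(i,j)} is a hypothetical placeholder:

```python
def symmetrical_selection(predict, n_pos, n_neg):
    """Walk the N+ x N- output matrix from the top-left corner.
    predict(i, j) returns +1 or -1, the vote of base classifier T(i, j).
    Returns (combined label, classifiers checked); at most N+ + N- - 1 checks."""
    i, j, checked = 0, 0, 0
    while True:
        checked += 1
        if predict(i, j) > 0:
            j += 1          # a positive vote moves along the row
            if j == n_neg:  # all of positive group i agreed: positive winning group
                return +1, checked
        else:
            i += 1          # a negative vote moves down a row
            if i == n_pos:  # all of negative group j agreed: negative winning group
                return -1, checked

# Toy example: every base classifier votes positive,
# so the first positive group wins after N- = 4 checks
label, checked = symmetrical_selection(lambda i, j: +1, n_pos=3, n_neg=4)
print(label, checked)  # 1 4
```

Even in the worst case of alternating votes, the walk terminates within N^+ + N^- - 1 checks, matching the bound derived in the text.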
3 Reduction with Self-recombination Learning
Self-recombination learning (SRL) originates from a simple idea. Consider the search procedure of the symmetrical selection algorithm in the output matrix of all base classifiers: the start point is the base classifier in the top-left corner, and the end point lies in the rightmost column or the bottom row. Notice that the nearer a base classifier is to the bottom-right corner of the output matrix, the less chance it has of being checked and the less influence it has on the final combining output. Thus, we may obtain higher combining accuracy by arranging the base classifiers with higher error rates into the more rightward columns and the lower rows. Finally, we may remove the base classifiers trained on 'bad' samples to realize sample reduction. The proposed reduction algorithm with self-recombination learning can be described as follows.
1) Specify the volumes of all subsets.
2) Set the maximum number of self-learning turns, M, and the counter of self-learning times, t = 1.
3) While t < M, do
   (a) Extract the specified number of training samples, in order, from each training set to yield the training subsets of each class.
   (b) Record the classification outputs of all k-NN base classifiers for all samples in the training set.
   (c) Record the combining outputs for all samples of the training set according to the symmetrical selection combination.
   (d) Move all incorrectly classified samples to the rear part of both the positive and negative training sets.
4) Remove all k-NN base classifiers trained on incorrectly classified samples.
Two potential problems remain in the above self-learning scheme. One is how to determine the stopping criterion of SRL; we experimentally suggest that a stable training accuracy can serve as this criterion. The other is how to rearrange the correctly classified samples in a proper order. We can usually accomplish both tasks in one step, moving the correctly and incorrectly classified samples to the corresponding parts while rearranging the order of the samples within each part. We call this procedure the moving and rearranging algorithm (MRA). Assuming the total number of samples is N_Max, a suitable MRA that uses few exchanges and yields relatively stable learning can be described as follows.
1) Set the serial number of the sample to be handled, i = 1.
2) While i <= N_Max, do
   (a) If sample i is correctly classified, then i = i + 1. Otherwise, do
      i. Set the counter, C = 0, and the serial number, j = i + 1.
      ii. If sample j is correctly classified, then exchange samples i and j and go to (a); otherwise, let j = j + 1 and C = C + 1.
      iii. If C = N_Max - j, then the algorithm ends here; otherwise, go to ii.
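The moving and rearranging step can be sketched as an in-place partition that keeps the relative order of the correctly classified samples. This is a minimal illustration; the boolean `correct` flags are hypothetical:

```python
def moving_and_rearranging(samples, correct):
    """Move incorrectly classified samples to the rear part of the set.
    correct[i] is True if sample i was correctly classified; the relative
    order of correctly classified samples is preserved."""
    i, n = 0, len(samples)
    while i < n:
        if correct[i]:
            i += 1
            continue
        # scan forward for the next correctly classified sample to swap in
        j = i + 1
        while j < n and not correct[j]:
            j += 1
        if j == n:  # only misclassified samples remain: done
            break
        samples[i], samples[j] = samples[j], samples[i]
        correct[i], correct[j] = correct[j], correct[i]
        i += 1
    return samples

data = ['a', 'b', 'c', 'd', 'e']
flags = [True, False, True, False, True]
print(moving_and_rearranging(data, flags))  # ['a', 'c', 'e', 'd', 'b']
```

The front part then holds the correctly classified samples in their original order, ready for the next self-recombination turn, while the rear part holds the samples whose base classifiers will be removed.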
4 Experiments
Two data sets, shown in Table 1, from the UCI Repository [11] and the STATLOG benchmark repository [12] are chosen for this study. To decompose the training data set, the smaller class is divided into equal-size subsets, and the larger class is also divided into subsets of the same size as those of the smaller class. In the k-NN algorithm, the optional values of k are set to 7, 13, 19, and 25, respectively. Comparisons of classification accuracy after reduction through multi-turn self-recombination learning on each data set are shown in Fig. 1. The experimental results show that a more stable and higher combining accuracy appears as the iterations of self-recombination learning increase. To show that the classification accuracy after reduction is superior to that obtained using all samples, we give a comparative study of the two cases. The experimental results shown in Fig. 2 verify that the reduction is effective on the Census income data set. Actually, the comparison results are the
Table 1. Distributions of the samples in each class on all data sets

Data Set      | Training samples (Total / Positive / Negative) | Test samples (Total / Positive / Negative) | #Inputs
Breast cancer | 20000 / 14118 / 5882                           | 7700 / 5482 / 2218                         | 9
Census income | 20107 / 15102 / 5005                           | 10055 / 7552 / 2503                        | 14
Fig. 1. Classification accuracy after reduction through multi-turn self-recombination learning and different partition scales. (a) Census income data set, where k = 7, and (b) Breast cancer data set, where k = 25.
Fig. 2. Comparison of test accuracy with reduction and without reduction on Census income data set, k = 7
Fig. 3. Comparison of training accuracy (storage percentage) and test accuracy on the Census income data set, where the smaller class is divided into 8 parts
same for the other values of k on the Census income and other data sets. However, the comparison results for the other two data sets are very close, since their original classification accuracies are already very high. Comparisons of test accuracy after reduction and training accuracy (which is also the storage percentage after reduction) after multi-turn self-recombination learning on each data set are shown in Fig. 3. The experimental results show a consistent trend between the training accuracy and the test accuracy, which suggests that a stable training accuracy can indeed be taken as the stopping criterion of self-learning.
5 Related Work
We use a 'quadratic' decomposition of the two-class training data set; that is, the number of produced subsets or base classifiers is the product of the numbers of single-class subsets. The min-max modular (M3) classifier [13] is also a quadratic decomposition based combining classifier. The difference between the proposed combining classifier and the M3 classifier lies in the combination strategy. The min-max combination in the M3 classifier needs to check O(N^+ × N^-) base classifiers for an input sample, while symmetrical selection combination needs at most N^+ + N^- - 1. In addition, the min-max combination prefers to output the negative class label, because the min-max combining procedure is asymmetrical with respect to the two classes. As for combining accuracy, our experimental results show that the two combination strategies behave very similarly in most cases, and that symmetrical selection combination is somewhat better in some unbalanced cases. Admittedly, the quadratic decomposition is not a common task decomposition method. Most combining classifiers are based on a 'linear' decomposition of the training data set, in which the whole training data set is divided into several parts [14]. This kind of decomposition produces fewer subsets and base classifiers, but it loses some flexibility to control the ratio of the numbers of samples among the classes, while the quadratic decomposition is just the opposite. Moreover, linear decomposition cannot provide the mechanism for the reduction method developed in this paper. Some recent work also concerns modular reduction [15]. The difference between our work and the existing approach is that the latter only considers an extreme case in which each subset contains only two different samples, and it is only suitable for 1-NN reduction, while our scheme is more general. In this extreme case, the combining classifier in [15] performs exactly like a 1-NN classifier.
Therefore, it is not surprising that self-recombination learning based reduction in this case is equivalent to Wilson's edited nearest neighbor rule [16].
6 Conclusions
A modular and parallel reduction method with self-recombination learning for the k-NN algorithm has been proposed in this paper. One of the most important features of the proposed method is that the adjustment of the partitioned training data set and the reduction task can be done at the same time, while the sizes of all decomposed training subsets are kept unchanged. Thus, the proposed method is also an effective modular k-NN algorithm. As future work, we hope to develop a faster and more stable moving and rearranging algorithm to improve self-recombination learning. A more effective stopping criterion for SRL is needed, too.
References
1. Dasarathy, B. V. (ed.): Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. Los Alamitos, CA: IEEE Computer Society Press (1991)
2. MacQueen, J. B.: Some Methods for Classification and Analysis of Multivariate Observations. Berkeley Symposium on Mathematical Statistics and Probability (1967) 281-297
3. Kohonen, T.: Self-Organizing Maps. Springer, Germany (1995)
4. Hart, P. E.: The Condensed Nearest Neighbor Rule. IEEE Trans. on Information Theory 14(5) (1967) 515-516
5. Devijver, P. A. & Kittler, J.: Pattern Recognition: A Statistical Approach. Prentice-Hall, London (1982)
6. Wilson, D. R. & Martinez, T. R.: Reduction Techniques for Instance-Based Learning Algorithms. Machine Learning 38(3) (2000) 257-286
7. Zhao, H. & Lu, B. L.: A Modular k-Nearest Neighbor Classification Method for Massively Parallel Text Categorization. In: Zhang, J., He, J.-H., Fu, Y. (eds.): Computational and Information Science: First International Symposium, CIS 2004. Lecture Notes in Computer Science, Vol. 3314. Springer-Verlag, Berlin Heidelberg New York (2004) 867-872
8. Giacinto, G. & Roli, F.: Automatic Design of Multiple Classifier Systems by Unsupervised Learning. Proceedings of Machine Learning and Data Mining in Pattern Recognition, First International Workshop, MLDM'99, Leipzig, Germany (1999) 131-143
9. Frosyniotis, D., Stafylopatis, A. & Likas, A.: A Divide-and-Conquer Method for Multi-net Classifiers. Pattern Analysis and Applications 6 (2003) 32-40
10. Zhao, H.: A Study on Min-Max Modular Classifier (in Chinese). Doctoral dissertation, Shanghai Jiao Tong University (2005)
11. Blake, C. L. & Merz, C. J.: UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science (1998)
12. Rätsch, G., Onoda, T. & Müller, K.: Soft Margins for AdaBoost. Machine Learning (2000) 1-35
13. Lu, B. L. & Ito, M.: Task Decomposition and Module Combination Based on Class Relations: A Modular Neural Network for Pattern Classification. IEEE Transactions on Neural Networks 10 (1999) 1244-1256
14. Chawla, N. V., Hall, L. O., Bowyer, K. W. & Kegelmeyer, W. P.: Learning Ensembles from Bites: A Scalable and Accurate Approach. Journal of Machine Learning Research 5 (2004) 421-451
15. Yang, Y. & Lu, B. L.: Structure Pruning Strategies for Min-Max Modular Network. In: Wang, J., Liao, X. and Yi, Z. (eds.): Advances in Neural Networks - ISNN 2005: Second International Symposium on Neural Networks. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag, Berlin Heidelberg New York (2005) 646-651
16. Wilson, D. L.: Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man, and Cybernetics 2-3 (1972) 408-421
Selective Neural Network Ensemble Based on Clustering Haixia Chen1, Senmiao Yuan1, and Kai Jiang2 1
College of Computer Science and Technology, Jilin University, Changchun 130025, China
[email protected] 2 The 45th Research Institute of CETC, Beijing 101601, China
[email protected]
Abstract. To improve the generalization ability of a neural network ensemble, a selective method based on clustering is proposed. The method follows the overproduce-and-choose paradigm. It first produces a large number of individual networks, and then clusters these networks according to their diversity. The networks with the highest classification accuracy in each cluster are selected for the final integration. Experiments on ten UCI data sets showed the superiority of the proposed algorithm over two similar ensemble learning algorithms.
1 Introduction
Neural network ensembles have received considerable attention in the past decade, as they can significantly improve the classification accuracy of learning systems by training a finite number of base networks and then combining their outputs to achieve a 'group consensus'. Theoretical and experimental results reported in the literature clearly show that classifier combination is effective only when the individual classifiers are accurate and diverse, that is, when they exhibit low error rates and make errors on different parts of the input space [1], [2], [3], [4]. However, reported experimental and theoretical results also indicate that it is very difficult to create classifiers that are both accurate and diverse. Therefore, the fundamental need for methods aiming at the design of accurate and diverse ensembles is currently acknowledged [5]. To achieve error diversity in an ensemble, several approaches have been proposed, which can be divided into two categories: the direct strategy and the overproduce-and-choose strategy [6]. The first strategy aims at generating an error-diverse ensemble directly. The family of boosting methods, which adaptively change the distribution of the training set according to the former classifiers' performance, belongs to this strategy; however, these have been noted to have unstable performance [7]. The latter strategy is based on the creation of an initially large set of candidate classifiers and the subsequent choice of a group of the classifiers based on different criteria [5], [6], [7], [8]. In this paper, we propose a new algorithm for Selecting Networks based on Clustering (SNC). The underlying paradigm of the algorithm is overproduce and choose. It first produces a large group of networks, and then clusters those classifiers according to their diversity. The base classifiers with the highest accuracy in each cluster are chosen
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 545 – 550, 2006. © Springer-Verlag Berlin Heidelberg 2006
for final integration. The main aim of the algorithm is to build an ensemble with fewer base classifiers and higher generalization performance. This is a valuable characteristic when the classification system runs in environments with limited resources. We present our algorithm in Section 2, give experimental results in Section 3, and conclude in Section 4.
2 Selective Ensemble Learning Based on Clustering
Suppose there are C patterns (ω_1, ..., ω_C) in the classification system. The target posterior distribution is q(x) = [p(ω_1|x), ..., p(ω_C|x)]^T. Given a data set with N examples, D = {(x^(i), y^(i))}_{i=1}^N, L base neural networks are constructed. The estimated posterior distribution for classifier h_i is h_i(x) = [p_i(ω_1|x), ..., p_i(ω_C|x)]^T. Suppose those classifiers are integrated by weighted majority voting, with classifier weights satisfying 0 < w_i < 1 and Σ_{i=1}^L w_i = 1. The output of the ensemble is h(x) = Σ_{i=1}^L w_i h_i(x). Let p(x) be the distribution of x. Then E = ∫ p(x)(h(x) - q(x))^2 dx is the error of the ensemble, E_i = ∫ p(x)(h_i(x) - q(x))^2 dx is the error of network h_i, and Ē = Σ_{i=1}^L w_i E_i is the weighted average of the errors of the base networks. A_i = ∫ p(x)(h_i(x) - h(x))^2 dx is the difference between network h_i and the ensemble, and Ā = Σ_{i=1}^L w_i A_i is its weighted average. It has been proved in [1] that

E = Ē - Ā.    (1)
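Formula (1), the ambiguity decomposition of Krogh and Vedelsby [1], can be checked numerically in the scalar case. The random targets and predictions below are purely illustrative:

```python
import random

random.seed(0)
L, N = 5, 100
q = [random.random() for _ in range(N)]                      # target values q(x)
h = [[random.random() for _ in range(N)] for _ in range(L)]  # base network outputs
w = [1.0 / L] * L                                            # uniform weights, sum to 1

# ensemble output h(x) = sum_i w_i h_i(x)
ens = [sum(w[i] * h[i][k] for i in range(L)) for k in range(N)]

# empirical analogues of E, E-bar and A-bar over the sample
E = sum((ens[k] - q[k]) ** 2 for k in range(N)) / N
E_bar = sum(w[i] * sum((h[i][k] - q[k]) ** 2 for k in range(N)) / N for i in range(L))
A_bar = sum(w[i] * sum((h[i][k] - ens[k]) ** 2 for k in range(N)) / N for i in range(L))

print(abs(E - (E_bar - A_bar)) < 1e-9)  # True: E = E_bar - A_bar holds exactly
```

The identity is algebraic, so it holds for any choice of targets, predictions, and convex weights; only floating-point rounding separates the two sides.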
According to Formula (1), the performance of an ensemble is determined by two factors: the performance of each network and the diversity of the ensemble. On the one hand, we want to reduce the error rate of each network so as to reduce the averaged error rate Ē. On the other hand, we want the networks to be as different as possible so as to increase the averaged difference Ā. However, it is not easy to find a good tradeoff between these two factors. In fact, no network has perfect generalization performance, so the inclusion of a network is justified when its contribution to the averaged difference exceeds its contribution to the averaged error rate. The generalization performance of the system is not directly proportional to the ensemble size, so it helps to improve the generalization performance of the ensemble by first training a large group of base networks and then selecting a subset of them.
2.1 The Proposed Algorithm
Our algorithm follows the overproduce-and-choose strategy. Suppose M candidate networks have been constructed independently; from these candidates, a subset of L networks is selected. The rationale behind this choice is that it is difficult to "directly" generate accurate and diverse classifiers [5]. The main idea
is to group the candidate networks by diversity and choose the most accurate classifier in each group for integration. First, the data set is divided into a training set DT, a validation set DV, and a test set DP using stratified random sampling, so as to keep the class distribution of examples in each set approximately the same as in the original data set. The training set is used for training candidate neural networks in the overproduce phase, the validation set is used to assess the networks in the choose phase, and the test set is used for evaluating the generalization performance of the final ensemble. In our implementation, we generate candidate networks by the Random Subspace Method (RSM) [9]. The method first generates a set of candidate feature subsets randomly. Then, DT is projected on each candidate feature subset S_i and forms the training set DT_i for training the multi-layer perceptron network h_i. The number of input nodes of h_i is determined by the size of the corresponding feature subset, |S_i|, and the number of output nodes is C. According to the geometric pyramid rule, the number of hidden nodes is set to √(|S_i| · C). Next, clustering is used to group the candidate networks according to a diversity measure. We use the double fault (DF) measure because, among the 6 measures tested in [10], it was observed to have the highest correlation (0.86-0.99) with the ensemble error across different data sets and classification methods. The double fault measure between two classifiers h_i and h_k on a data set with N examples is defined as

DF_{i,k} = C_{00} / N,    (2)

where C_{00} is the number of examples that both h_i and h_k classify wrongly. Note that in the choose phase the optimal group of classifiers could be obtained by "exhaustive enumeration"; unfortunately, the number of possible groups is 2^M, so we use the hierarchical agglomerative clustering method instead. The similarity between h_i and h_k is

δ(h_i, h_k) = 1 - DF_{i,k}.    (3)

The distance between clusters Γ_i and Γ_k is

Δ(Γ_i, Γ_k) = max_{h_j ∈ Γ_i, h_l ∈ Γ_k} δ(h_j, h_l).    (4)
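The double fault measure of Eq. (2) and the derived similarity of Eq. (3) can be sketched directly. The correctness patterns below are hypothetical toy data:

```python
def double_fault(errors_i, errors_k):
    """DF of Eq. (2): fraction of examples that both classifiers get wrong.
    errors_* are boolean lists, True where the classifier errs."""
    both_wrong = sum(a and b for a, b in zip(errors_i, errors_k))
    return both_wrong / len(errors_i)

def similarity(errors_i, errors_k):
    """delta of Eq. (3)."""
    return 1.0 - double_fault(errors_i, errors_k)

# Toy error patterns for 3 classifiers on 4 examples
e1 = [True, True, False, False]
e2 = [True, False, True, False]
e3 = [True, True, False, False]   # identical faults to e1

print(double_fault(e1, e2))  # 0.25 (one shared error)
print(double_fault(e1, e3))  # 0.5  (both errors shared)
print(similarity(e1, e2))    # 0.75
```

These pairwise values are what the hierarchical agglomerative clustering of Eq. (4) consumes when merging clusters until L groups remain.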
A larger cluster is obtained by agglomerating the two nearest clusters in the lower hierarchy until L clusters are found. In the choose phase, the candidate network with the highest classification accuracy on DV in each cluster is chosen. The weighted average of the chosen predictors gives the final prediction. The weight for predictor h_i is

w_i = log( (1 - error_DV(h_i)) / error_DV(h_i) ),    (5)

where error_D(h) is the sample error rate of a classifier h with respect to data set D.
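The log-odds weighting of Eq. (5) can be sketched as follows; the validation error rates are hypothetical:

```python
import math

def predictor_weight(error_rate):
    """Eq. (5): w_i = log((1 - error) / error). Higher weight for lower
    validation error; assumes 0 < error_rate < 1."""
    return math.log((1.0 - error_rate) / error_rate)

# Hypothetical validation error rates of the chosen networks
errors = [0.1, 0.2, 0.3]
weights = [predictor_weight(e) for e in errors]
print([round(w, 3) for w in weights])  # [2.197, 1.386, 0.847]
```

A classifier at chance level (error 0.5) receives weight zero, so the scheme automatically discounts uninformative members of the selected subset.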
3 Experiments
To test the performance of the proposed algorithm, experiments were conducted on ten data sets taken from the UCI machine learning repository [11]. For each data set, 60 percent of the examples are assigned to the training set, and the remaining 40 percent are divided into approximately equal halves as the validation set and the test set. Experimental results are averaged over 30 runs with different starting random seeds. Table 1 compares the averaged accuracy of SNC, RSM, and Giacinto's algorithm (abbreviated as GRF) [5] with M = 70 and L = 9 on the ten data sets. GRF outperformed RSM on 7 out of 10 data sets and was worse on only 2, while SNC outperformed RSM on 8 out of 10 data sets and was worse on only one. Selecting only a few networks from the original group thus helps to improve ensemble performance. Table 2 shows the statistically significant win/draw/loss record between the three algorithms, where a win or loss is only counted if the difference in values was determined to be significant at the 0.05 level by a paired t-test. SNC was statistically significantly better than RSM on 6 data sets and better than GRF on 4 data sets, drawing with the compared algorithms on the remaining data sets. GRF was statistically significantly better than RSM on 3 data sets and worse on one. We therefore conclude that SNC outperformed RSM and GRF in our experiments.

Table 1. Performance comparisons with fixed parameters

Accuracy | breast | diab  | glass | iono  | liver | lymph | soy   | tic   | vote  | zoo
RSM      | 0.744  | 0.747 | 0.603 | 0.901 | 0.627 | 0.847 | 0.999 | 0.734 | 0.921 | 0.937
GRF      | 0.741  | 0.769 | 0.611 | 0.902 | 0.642 | 0.849 | 0.999 | 0.756 | 0.922 | 0.915
SNC      | 0.740  | 0.771 | 0.619 | 0.902 | 0.649 | 0.861 | 0.999 | 0.761 | 0.922 | 0.949

Table 2. Statistically significant win/draw/loss record (column vs. row)

win/tie/loss | SNC   | GRF
SNC          | -     |
GRF          | 4/6/0 | -
RSM          | 6/4/0 | 3/6/1
Fig. 1(a) compares the accuracies of the three algorithms, averaged over the ten data sets, with varying initial ensemble size M. Fig. 1(b) compares the averaged coverage of the three algorithms, and Fig. 1(c) compares the averaged diversity. Coverage is defined by Brodley and Lane [4] as the percentage of the examples that at least one base classifier can classify correctly. The averaged diversity measure is calculated as
Div(H, D) = ( 2 Σ_{i=1}^{L-1} Σ_{j=i+1}^{L} Σ_{k=1}^{N} Dif(h_i(x_k), h_j(x_k)) ) / ( N · L · (L - 1) ),    (6)

where Dif(a, b) is zero if a = b and 1 otherwise.
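Eq. (6) is the average pairwise disagreement rate and can be sketched directly; the toy predictions below are hypothetical:

```python
from itertools import combinations

def averaged_diversity(preds):
    """Div(H, D) of Eq. (6): average pairwise disagreement over all
    classifier pairs. preds[i][k] is classifier i's label for example k."""
    L, N = len(preds), len(preds[0])
    total = sum(preds[i][k] != preds[j][k]
                for i, j in combinations(range(L), 2)
                for k in range(N))
    return 2.0 * total / (N * L * (L - 1))

# Toy predictions of 3 classifiers on 4 examples
preds = [[0, 1, 0, 1],
         [0, 1, 1, 1],
         [1, 1, 0, 1]]
print(averaged_diversity(preds))  # 4 disagreements over 12 pair-example slots: 1/3
```

The normalization by N · L · (L - 1) / 2 pair-example comparisons keeps the measure in [0, 1], so scores from ensembles of different sizes are comparable.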
Fig. 1. Performance comparisons averaged over the ten data sets with varying initial ensemble size: (a) averaged accuracy, (b) averaged coverage, (c) averaged diversity
The average accuracies of SNC were higher than those of RSM and GRF for all initial ensemble sizes tested. Coverage increased for all three methods as the initial ensemble size increased. We believe this is because a larger ensemble provides more chances to find classifiers that make different errors. This also led to the increase of accuracy and diversity shown in Fig. 1(a) and 1(c). However, the accuracy increase was not as evident as the coverage increase: accuracy increased by no more than 0.01 for the three algorithms (with GRF increasing the most), while coverage increased by about 0.04 as the ensemble size grew from 50 to 100. Thus, increasing the initial ensemble size had less effect on accuracy than on coverage, and a significant coverage increase did not guarantee a significant increase in accuracy. For averaged diversity, GRF achieved the best score. The main difference between GRF and SNC is that the former intends to choose a group of maximally different classifiers, while the latter relies on the clustering algorithm to control diversity and finds the most accurate classifier in each cluster regardless of its contribution to the ensemble's diversity. We therefore attribute SNC's superior accuracy but inferior averaged diversity to the difference between the choose phases of the two methods. The average diversity score of RSM was the lowest of the three tested algorithms and, contrary to the scores of GRF and SNC, it went down as the initial ensemble size increased. We believe this is because both GRF and SNC have specially designed mechanisms for deleting improper base networks, whereas RSM
550
H. Chen, S. Yuan, and K. Jiang
kept all the initial base classifiers for integration, so its performance was hampered by redundant networks.
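The coverage and diversity figures discussed above can be computed directly from ensemble predictions. The paper does not give its exact formulas, so the sketch below assumes the common plain-disagreement diversity measure and defines coverage as the fraction of samples that at least one member classifies correctly:

```python
import numpy as np

def pairwise_diversity(preds):
    """Average pairwise disagreement between ensemble members.

    preds: (n_classifiers, n_samples) array of predicted class labels.
    Returns the mean fraction of samples on which a pair disagrees.
    """
    n = preds.shape[0]
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.mean(preds[i] != preds[j])
            pairs += 1
    return total / pairs

def coverage(preds, targets):
    """Fraction of samples classified correctly by at least one member."""
    correct_any = np.any(preds == targets, axis=0)
    return float(np.mean(correct_any))

# three classifiers, four samples (illustrative data)
preds = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 1],
                  [1, 1, 1, 0]])
targets = np.array([0, 1, 1, 1])
print(pairwise_diversity(preds))  # 0.5
print(coverage(preds, targets))   # 1.0
```

This illustrates the observation above: coverage can be high (here every sample is covered by some member) even when individual accuracies are modest.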
4 Conclusions
In this paper, we proposed a selective neural network ensemble learning method based on clustering. The method follows the overproduce-and-choose paradigm. Experimental results on ten UCI data sets showed that the proposed method outperformed the other two state-of-the-art algorithms. We believe it is the specially designed selection mechanism in the algorithm that accounts for this superiority. In future research we are interested in adapting the method to incremental learning in concept-changing environments.
Acknowledgements. This work is supported by the National Natural Science Foundation of China under grant No. 60275026.
References
1. Krogh, A., Vedelsby, J.: Neural Network Ensembles, Cross Validation, and Active Learning. In: Tesauro, G., Touretzky, D., Leen, T. (eds.): Advances in Neural Information Processing Systems 7. MIT Press, Cambridge, MA (1995) 231–238
2. Kuncheva, L.I., Skurichina, M., Duin, R.P.W.: An Experimental Study on Diversity for Bagging and Boosting with Linear Classifiers. Information Fusion 3 (2002) 245–258
3. Bauer, E., Kohavi, R.: An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine Learning 36(1–2) (1999) 105–139
4. Brodley, C., Lane, T.: Creating and Exploiting Coverage and Diversity. In: Proc. AAAI-96 Workshop on Integrating Multiple Learned Models (1996) 8–14
5. Giacinto, G., Roli, F., Fumera, G.: Design of Effective Multiple Classifier Systems by Clustering of Classifiers. In: Proc. of ICPR 2000, 15th Int'l Conf. on Pattern Recognition, Barcelona, Spain (2000) 3–8
6. Patridge, D., Yates, W.B.: Engineering Multiversion Neural-Net Systems. Neural Computation 8(4) (1996) 869–893
7. Rätsch, G., Onoda, T., Müller, K.R.: Soft Margins for AdaBoost. Machine Learning 42(3) (2001) 287–320
8. Oliveira, L.S., Morita, M., Sabourin, R., Bortolozzi, F.: Multi-Objective Genetic Algorithms to Create Ensembles of Classifiers. In: Coello Coello, C.A., et al. (eds.): EMO 2005, LNCS 3410 (2005) 592–606
9. Ho, T.K.: The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8) (1998) 832–844
10. Chindaro, S., Sirlantzis, K., Fairhurst, M.: Analysis and Modelling of Diversity Contribution to Ensemble-Based Texture Recognition Performance. In: Oza, N.C., et al. (eds.): MCS 2005, LNCS 3541 (2005) 387–396
11. Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California, Dept. of Information and Computer Science, Irvine, CA (1998)
An Individual Adaptive Gain Parameter Backpropagation Algorithm for Complex-Valued Neural Networks Songsong Li1, Toshimi Okada1, Xiaoming Chen2,3, and Zheng Tang2 1
Faculty of Engineering, Toyama Prefectural University, 939-0398, Imizu Shi, Toyama, Japan
[email protected] 2 Faculty of Engineering, Toyama University, 930-8555, 3190 Gofuku, Toyama, Japan
[email protected] 3 Faculty of Information Science and Engineering, Shenyang University of Technology, 110023, 58 Xinghua Street, Tiexi, Shenyang, China
Abstract. The complex-valued backpropagation algorithm has been widely used; however, the local minima problem often occurs during learning. We propose an individual adaptive gain parameter backpropagation algorithm for complex-valued neural networks to solve this problem: the gain parameter of the sigmoid function in the hidden layer is specified for each learning pattern. The proposed algorithm is tested on a benchmark problem, and the simulation results show that it is capable of preventing complex-valued network learning from getting stuck in local minima.
1 Introduction
Complex-valued neural networks, whose parameters (weights and thresholds) are all complex numbers, can be used in fields dealing with complex numbers, such as telecommunications, speech recognition, and image processing with the Fourier transformation [1, 4]. The conventional backpropagation algorithm (called here Real-BP) is widely recognized as an effective method for training feed-forward neural networks, and the Complex-BP is a complex-valued version of the Real-BP, proposed independently by several researchers in the early 1990s [1, 2, 3, 4]. The Complex-BP is known as an effective algorithm, but this gradient-descent-based algorithm often suffers from local minima, as does the Real-BP. We have previously proposed one method for this problem [11]; here we propose another. Some researchers have noted that many local minima problems are closely related to neuron saturation in the hidden layer. When such a problem occurs, neurons in the hidden layer lose their sensitivity to input signals: they get stuck in the extreme areas of the activation function and the propagation chain is almost blocked [5], [6]. The gain parameter in the conventional backpropagation algorithm is usually set as a constant and does not change during learning. Some researchers have proposed that the learning ability of a network can be improved by modulating the gain parameter of the activation function.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 551 – 557, 2006. © Springer-Verlag Berlin Heidelberg 2006
In this method, we set two gain parameters, real_gp
and ima_gp, for the real part and the imaginary part respectively, for the hidden-layer neuron activation function for a given pattern p. real_gp and ima_gp are updated according to the feedback from the output layer to the hidden layer, so that the weights connecting the hidden layer and the output layer are modified harmoniously.
2 The Complex-Valued Backpropagation Algorithm
The inputs, outputs, weights, and thresholds of the neural network are all complex numbers. The total input $Y_n$ of neuron n is defined as:

$$Y_n = \sum_m W_{nm} X_m + \theta_n \qquad (1)$$
where $W_{nm}$ is the weight connecting neurons n and m, $X_m$ is the input signal from neuron m, and $\theta_n$ is the threshold value of neuron n. Writing $Y_n = x + iy$, x and y are real numbers. The output $O_n$ of neuron n is obtained from the output of the activation function. According to the activation function used, the Complex-BP algorithm can be divided into the fully Complex-BP and the split Complex-BP [7]. For the fully Complex-BP, the output of neuron n is defined as:
$$O_n = f_c(Y_n) = f_c(x + iy) \qquad (2)$$
where $f_c$ is a complex nonlinear activation function that is analytic and bounded in the complex plane. For the split Complex-BP, the output of neuron n is defined as:
$$O_n = f_R(\mathrm{Re}[Y_n]) + i f_R(\mathrm{Im}[Y_n]) = f_R(x) + i f_R(y) \qquad (3)$$
where $f_R$ is a bounded real activation function. In this paper we use the split Complex-BP for clarity. Since a three-layer neural network can approximate any continuous nonlinear mapping, and for the sake of simplicity, we discuss the three-layer complex-valued network shown in Fig. 1. It can be described as below:
$$H_m = f_R(U_m) = f_R\Big(\mathrm{Re}\Big[\sum_l W_{lm} I_l + \theta_m\Big]\Big) + i f_R\Big(\mathrm{Im}\Big[\sum_l W_{lm} I_l + \theta_m\Big]\Big) \qquad (4)$$

$$O_n = f_R(S_n) = f_R\Big(\mathrm{Re}\Big[\sum_m V_{mn} H_m + \lambda_n\Big]\Big) + i f_R\Big(\mathrm{Im}\Big[\sum_m V_{mn} H_m + \lambda_n\Big]\Big) \qquad (5)$$
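A minimal sketch of the forward pass of Eqs. (4)–(5), applying the real sigmoid to the real and imaginary channels separately as in Eq. (3); the 1-5-1 network shape and the random initialization are illustrative assumptions:

```python
import numpy as np

def f_R(x, g=1.0):
    """Real logistic activation, applied channel-wise."""
    return 1.0 / (1.0 + np.exp(-g * x))

def split_act(z, g=1.0):
    # split activation: f_R on real and imaginary parts independently (Eq. 3)
    return f_R(z.real, g) + 1j * f_R(z.imag, g)

def forward(I, W, theta, V, lam):
    """Forward pass of the 3-layer split complex-valued network (Eqs. 4-5)."""
    U = W @ I + theta            # hidden net input
    H = split_act(U)             # hidden output, Eq. (4)
    S = V @ H + lam              # output net input
    return split_act(S), H       # network output, Eq. (5)

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 1)) + 1j * rng.normal(size=(5, 1))
theta = rng.normal(size=5) + 1j * rng.normal(size=5)
V = rng.normal(size=(1, 5)) + 1j * rng.normal(size=(1, 5))
lam = rng.normal(size=1) + 1j * rng.normal(size=1)
O, H = forward(np.array([0.3 + 0.7j]), W, theta, V, lam)
```

Because both channels pass through the logistic function, the real and imaginary parts of every output lie in (0, 1).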
We define the square error as:

$$E_p = \frac{1}{2}\sum_{n=1}^{N}\big\{|\mathrm{Re}(T_n) - \mathrm{Re}(O_n)|^2 + |\mathrm{Im}(T_n) - \mathrm{Im}(O_n)|^2\big\} \qquad (6)$$
To minimize the square error Ep, the conventional split Complex-BP performs the following learning rule:
Fig. 1. A three-layer complex-valued neural network
$$\Delta V_{mn} = H_m\,\Delta\lambda_n \qquad (7)$$

$$\Delta\lambda_n = \eta\{\mathrm{Re}(\Delta_n)\,\mathrm{Re}[f'(O_n)] + i\,\mathrm{Im}(\Delta_n)\,\mathrm{Im}[f'(O_n)]\} \qquad (8)$$

$$\Delta W_{lm} = I_l\,\Delta\theta_m \qquad (9)$$

$$\begin{aligned}\Delta\theta_m = \eta\Big\{\ &\mathrm{Re}[f'(H_m)]\sum_n\big(\mathrm{Re}[\Delta_n]\,\mathrm{Re}[f'(O_n)]\,\mathrm{Re}[V_{mn}] + \mathrm{Im}[\Delta_n]\,\mathrm{Im}[f'(O_n)]\,\mathrm{Im}[V_{mn}]\big)\\ -\,i\,&\mathrm{Im}[f'(H_m)]\sum_n\big(\mathrm{Re}[\Delta_n]\,\mathrm{Re}[f'(O_n)]\,\mathrm{Im}[V_{mn}] - \mathrm{Im}[\Delta_n]\,\mathrm{Im}[f'(O_n)]\,\mathrm{Re}[V_{mn}]\big)\Big\}\end{aligned} \qquad (10)$$
where $\Delta_n$ is the error between the actual output $O_n$ and the target pattern $T_n$.
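The output-layer part of this rule, Eqs. (7)–(8), can be sketched as follows. The learning rate, array orientation, and the use of the split logistic sigmoid are illustrative assumptions; the extracted equations do not show whether $H_m$ enters conjugated, so it is used here as printed:

```python
import numpy as np

def split_dsig(o):
    # channel-wise derivative of the logistic, written in terms of o = f(net)
    return o.real * (1.0 - o.real) + 1j * o.imag * (1.0 - o.imag)

def output_layer_update(V, lam, H, O, T, eta=0.1):
    """Output-layer update of the split Complex-BP, Eqs. (7)-(8) (a sketch)."""
    delta = T - O                             # error Delta_n
    fp = split_dsig(O)
    # Eq. (8): real and imaginary channels are corrected separately
    dlam = eta * (delta.real * fp.real + 1j * delta.imag * fp.imag)
    # Eq. (7): Delta V_mn = H_m * Delta lambda_n
    dV = np.outer(dlam, H)                    # rows index outputs n, columns hidden m
    return V + dV, lam + dlam

V = np.ones((1, 3), dtype=complex)
lam = np.zeros(1, dtype=complex)
H = np.array([0.5 + 0.5j, 0.2 + 0.8j, 0.9 + 0.1j])
T = np.array([1.0 + 0.0j])
V2, lam2 = output_layer_update(V, lam, H, T.copy(), T)            # zero error: no change
V3, lam3 = output_layer_update(V, lam, H, np.array([0.5 + 0.5j]), T)  # weights move
```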
3 The Individual Adaptive Gain Parameter for Complex-BP
The backpropagation algorithm seeks a set of proper weights and thresholds that minimizes an overall error function E. Usually, the weights and thresholds are initialized randomly at the beginning of learning. For some patterns, the network approximates its proper location; unfortunately, for other patterns the error minimization process may be stopped by neuron saturation in the hidden layer. Here, neuron saturation in the hidden layer refers to the situation where the inputs to the hidden-layer nodes are so high (or so low) that all the neurons in the hidden layer are forced to produce an output response very close to the bound of the
activation function. When the output layer of the network has not reached its desired outputs and neuron saturation occurs in the hidden layer, the local minima problem may happen. Such a phenomenon is caused by the activation function. The sigmoid function, usually with a gain parameter, is commonly used in learning [8]. The function is shown below:
$$f_R(x) = 1/(1 + e^{-g_i x}) \qquad (11)$$
The derivative of the sigmoid function with gain parameter is shown as:
$$f_R'(x) = g_i\,(1 - f_R(x))\, f_R(x) \qquad (12)$$
Changing the gain parameter changes the derivative of the function. There are two extreme areas in this function; once the activity level of a hidden-layer neuron approaches these extreme areas, f'(x) is almost 0. Servan-Schreiber, Printz, Cohen and other researchers proposed a method based on this finding [9], [10]: they used the gain parameter in backpropagation networks to simulate phenomena associated with catecholamine manipulation. For the activation function with a gain parameter, a smaller gain keeps the derivative f'(x) relatively large over a wider range. Therefore, it is reasonable to give each individual training pattern an adaptive gain parameter for the activation functions in the hidden layer. When the output layer has not reached the desired outputs, the gain parameters should be adapted to restrict the activation functions in the hidden layer to ranges that ensure the neurons do not get stuck in the extreme areas. This method is based on the assumption that if the network can reach the desired output, the saturated neurons in the hidden layer will not be in the extreme areas. Burrows and Niranjan have also shown that optimal solutions to the problem of training a feed-forward neural network for a particular data set are obtained when the majority of the hidden-layer neurons act in the linear region of their sigmoid activation, and only a small number of them act in the nonlinear region. In this method, we set two gain parameters real_gp and ima_gp for the hidden-layer neuron activation function for a given pattern p. real_gp and ima_gp are updated according to the feedback Ap from the output layer to the hidden layer, so that the weights connecting the hidden layer and the output layer are modified harmoniously. Ap is defined as the degree of approximation to the desired output in the output layer. To compute this value, we need to introduce two parameters.
One is defined as the maximal error value in the output layer for pattern p:

$$\mathrm{Re}[e_p] = \max_j\big(|\mathrm{Re}[t_{pj} - o_{pj}]|\big) \qquad (13)$$

$$\mathrm{Im}[e_p] = \max_j\big(|\mathrm{Im}[t_{pj} - o_{pj}]|\big) \qquad (14)$$
The other parameter H denotes the average value of the difference between teacher signals; it is a constant for a given learning problem. For most classification problems, the teacher signals are either '0' or '1', so H can be defined as 0.5. Then the approximation degree Ap in the output layer for pattern p is given by:
$$A_p = e_p / H \qquad (15)$$
The gain parameter for pattern p at epoch k is adapted by the following rule:

$$real\_g_p(k) = \begin{cases} H/\mathrm{Re}[e_p(k-1)], & \text{if } \mathrm{Re}[A_p(k-1)] > 1.0 \\ 1.0, & \text{otherwise} \end{cases} \qquad (16)$$

$$ima\_g_p(k) = \begin{cases} H/\mathrm{Im}[e_p(k-1)], & \text{if } \mathrm{Im}[A_p(k-1)] > 1.0 \\ 1.0, & \text{otherwise} \end{cases} \qquad (17)$$
As we can see, the gain parameter is adapted only if $A_p(k-1) > 1.0$. When the network starts to approximate the teacher signal, the gain is adapted back to its original value; therefore, the network topology is not changed.
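The whole per-pattern adaptation, Eqs. (13)–(17), fits in a few lines; the sample teacher/output values are illustrative, and H = 0.5 follows the 0/1-targets convention stated above:

```python
import numpy as np

H_CONST = 0.5  # average teacher-signal separation for 0/1 targets

def adapt_gains(T, O):
    """Per-pattern gain update, Eqs. (13)-(17).

    T, O: complex arrays of teacher and actual outputs for one pattern.
    Returns (real_g, ima_g) for the next epoch.
    """
    re_e = np.max(np.abs(T.real - O.real))          # Eq. (13)
    im_e = np.max(np.abs(T.imag - O.imag))          # Eq. (14)
    # A_p = e_p / H > 1 means the output is still far from the target
    real_g = H_CONST / re_e if re_e / H_CONST > 1.0 else 1.0  # Eq. (16)
    ima_g = H_CONST / im_e if im_e / H_CONST > 1.0 else 1.0   # Eq. (17)
    return real_g, ima_g

T = np.array([1.0 + 1.0j, 0.0 + 0.0j])   # teacher signals
O = np.array([0.1 + 0.6j, 0.8 + 0.2j])   # actual outputs
g_r, g_i = adapt_gains(T, O)             # real gain shrinks; imaginary stays 1
```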
4 Simulation
In order to demonstrate the effectiveness of the proposed method, we compare its performance with that of the conventional complex-valued backpropagation method with momentum on the detection-of-symmetry problem and on a real task. The complex-valued learning patterns of the symmetry problem are described in [11]. The real task is to use the network to identify the defect signals of EMAT. The defect signals are divided into 4 patterns, with depths of 0.4 mm, 0.6 mm, 0.8 mm, and 1 mm; all defects are 4 cm long.

Table 1. Simulation results for the symmetry problem. Learning rate is 0.5, momentum factor is 0.5. (CC-BP is the conventional complex-BP, IC-BP is the proposed complex-BP, and CR-BP is the real-BP.)

              Methods         CC-BP    IC-BP    CR-BP
              Architecture    3-1-1    3-1-1    6-2-2
Error=0.1     Success rate    75%      84%      89%
              Average epochs  326      327      2484
              CPU time        17ms     28ms     15ms
Error=0.01    Success rate    75%      84%      0%
              Average epochs  1714     1654     >50000
              CPU time        21ms     37ms     —
Error=0.001   Success rate    74%      83%      0%
              Average epochs  12756    12930    >50000
              CPU time        180ms    361ms    —
We ran 100 trials with initial weight vectors drawn randomly from (−1.0, 1.0) and performed the learning with each method. If the learning took more than 50000 epochs, it was counted as failed. Three aspects of the performance of the algorithms were assessed: success rate, learning epochs, and CPU time. The algorithm tries to reduce the chance of the network slipping into a local minimum; if the success rate is larger while learning epochs and CPU time are not too much larger, we consider the method effective. If there are N patterns, this method takes 4N additional variables of memory. The simulation results are shown in Table 1 and Table 2. We can see that the success rate is improved while the convergence epochs are almost unchanged; this is also visible in the typical learning curves of the detection-of-symmetry problem in Fig. 2. The computer used had a Pentium 4 2.8 GHz CPU and 512 MB of memory.
Fig. 2. Typical learning curves for the symmetry problem. (Learning rate is 0.5 and momentum factor is 0.5.)

Table 2. Simulation results for EMAT signal identification. Learning rate is 0.1, momentum factor is 0.1. (CC-BP is the conventional complex-BP, IC-BP is the proposed complex-BP, and CR-BP is the real-BP.)

              Methods         CC-BP     IC-BP     CR-BP
              Architecture    128-3-1   128-3-1   256-6-2
Error=0.1     Success rate    71%       87%       51%
              Average epochs  19971     27357     39107
              CPU time        1131ms    1941ms    1013ms
Error=0.01    Success rate    69%       86%       0%
              Average epochs  30147     43045     >50000
              CPU time        27013ms   35023ms   —
5 Conclusion
In this paper, we specified the gain parameter of the sigmoid function in the hidden layer for each learning pattern. The proposed algorithm was tested on the symmetry problem
and on a real task. The simulation results show that it is capable of preventing complex-valued network learning from getting stuck in local minima.
References
1. Benvenuto, N., Piazza, F.: On the Complex Backpropagation Algorithm. IEEE Transactions on Signal Processing 40 (1992) 967–969
2. Georgiou, G.M., Koutsougeras, C.: Complex Domain Backpropagation. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 39 (1992) 330–334
3. Kim, M.S., Guest, C.C.: Modification of Backpropagation Networks for Complex-Valued Signal Processing in Frequency Domain. In: Proc. Internat. Joint Conf. Neural Networks 3 (1990) 27–31
4. Nitta, T., Furuya, T.: A Complex Back-Propagation Learning. Transactions of the Information Processing Society of Japan 32 (1991) 1319–1329
5. Goerick, C., Seelen, W.V.: On Unlearnable Problems or a Model for Premature Saturation in Backpropagation Learning. In: Proceedings of the European Symposium on Artificial Neural Networks '96, Belgium (1996) 13–18
6. Haykin, S.: Neural Networks: A Comprehensive Foundation. Macmillan Publishing, New York (1994)
7. Wang, X.G., Tang, Z., Tamura, H., Ishi, M.: Multilayer Network Learning Algorithm Based on Pattern Search Method. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Science E86-A (2003) 1869–1875
8. Cybenko, G.: Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals, and Systems 2 (1989) 303–314
9. Servan-Schreiber, D., Printz, H., Cohen, J.D.: A Network Model of Catecholamine Effects: Gain, Signal-to-Noise Ratio, and Behavior. Science 249 (1990) 892–895
10. Wang, D.L.: Pattern Recognition: Neural Networks in Perspective. IEEE Intelligent Systems 8(4) (1993) 52–60
11. Chen, X., Tang, Z., Li, S.: A Modified Error Function for the Complex-Value Backpropagation Neural Networks. Neural Information Processing – Letters and Reviews 9(1) (2005) 1–7
Training Cellular Neural Networks with Stable Learning Algorithm Marco A. Moreno-Armendariz1, Giovanni Egidio Pazienza2, and Wen Yu3 1
Escuela de Ingeniería, Dirección de Posgrado e Investigación, Universidad La Salle, Benjamin Franklin 47, Col. Condesa, México D.F., 06140, México
[email protected] 2 Enginyeria i Arquitectura La Salle, Universitat Ramon Llull, Pg. Bonanova 8, 08022 Barcelona, Spain
[email protected] 3 Departamento de Control Automático, CINVESTAV-IPN, A.P. 14-740, Av. IPN 2508, México D.F., 07360, México
[email protected]
Abstract. In this paper we propose a new stable learning algorithm for Cellular Neural Networks. Our approach is based on input-to-state stability theory, so as to obtain learning laws that do not need robust modifications. Here we present only a theoretical study, leaving experimental evidence for future work.
1 Introduction
Among the different kinds of neural networks (NNs), Cellular Neural Networks (CNNs) [3] have one of the simplest topologies. Because of this, CNN functionality is determined by a small number of parameters, usually three matrices called cloning templates. Sometimes it is possible to design a cloning template for a specific task [12], or alternatively one can resort to template libraries [7]. Nevertheless, finding an explicit learning algorithm for the templates of a CNN able to perform a given operation is still an open problem. Initial results on CNN learning were restricted to binary output, and the stability of the network was assumed [13, 8]. These algorithms give useful templates if the initial conditions are not too far from the desired output, because of the presence of local minima in the state space. [6, 1, 9] use different gradient-based methods and suffer from the typical problems of such methods, like getting caught in local minima. [5, 4] consider genetic algorithms, which are able to explore the whole state space but often take a long time to converge. In order to speed up the process, [2] proposes a method based on simulated annealing, although finding an adequate annealing schedule heuristically is usually difficult. Recently, [10] tries to settle the direct influence of the CNN parameters on the output, using a two-neuron CNN. This may be a good way to avoid local minima, but further investigation in this field is necessary.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 558–563, 2006. © Springer-Verlag Berlin Heidelberg 2006
Our algorithm is based on input-to-state stability (ISS) theory, which is an alternative approach to analyzing stability besides the Lyapunov method, and has already been applied to other kinds of NNs [11]. Our aim is to obtain new learning laws that do not need robust modifications. The paper is structured as follows: first, we introduce some important definitions about ISS; then, we explain the algorithm in detail; finally, we sketch conclusions.
2 Preliminaries
The main concern of this section is to introduce some concepts of ISS. Consider the following discrete-time (DT hereafter) state-space nonlinear system

$$x(k+1) = f[x(k), u(k)] \qquad (1)$$

where $u(k) \in \mathbb{R}^m$ is the input vector, $x(k) \in \mathbb{R}^n$ is the state vector, and $f$ is a general nonlinear smooth function, $f \in C^\infty$. Let us recall the following definitions.

Definition 1. (a) If a function $\gamma(s)$ is continuous and strictly increasing with $\gamma(0) = 0$, $\gamma(s)$ is called a $\mathcal{K}$-function. (b) If for a function $\beta(s,t)$, $\beta(s,\cdot)$ is a $\mathcal{K}$-function, $\beta(\cdot,t)$ is decreasing, and $\lim_{t\to\infty}\beta(\cdot,t) = 0$, then $\beta(s,t)$ is called a $\mathcal{KL}$-function. (c) If a function $\alpha(s)$ is a $\mathcal{K}$-function and $\lim_{s\to\infty}\alpha(s) = \infty$, $\alpha(s)$ is called a $\mathcal{K}_\infty$-function.

Definition 2. (a) System (1) is said to be input-to-state stable if there are a $\mathcal{K}$-function $\gamma(\cdot)$ and a $\mathcal{KL}$-function $\beta(\cdot,\cdot)$ such that, for each $u \in L_\infty$, i.e., $\sup_k\{\|u(k)\|\} < \infty$, and each initial state $x^0 \in \mathbb{R}^n$, it holds that

$$\|x(k, x^0, u(k))\| \le \beta(\|x^0\|, k) + \gamma(\|u(k)\|)$$

(b) A smooth function $V: \mathbb{R}^n \to \mathbb{R}_{\ge 0}$ is called an ISS-Lyapunov function for system (1) if there are $\mathcal{K}_\infty$-functions $\alpha_1(\cdot)$, $\alpha_2(\cdot)$, $\alpha_3(\cdot)$ and a $\mathcal{K}$-function $\alpha_4(\cdot)$ such that for each $s \in \mathbb{R}^n$, each $x(k) \in \mathbb{R}^n$, and each $u(k) \in \mathbb{R}^m$,

$$\alpha_1(\|s\|) \le V(s) \le \alpha_2(\|s\|)$$
$$V_{k+1} - V_k \le -\alpha_3(\|x(k)\|) + \alpha_4(\|u(k)\|)$$

Theorem 1. For a DT nonlinear system, the following are equivalent: a) it is ISS; b) it is robustly stable; c) it admits a smooth ISS-Lyapunov function.
3 Learnable Cellular Neural Networks
The DT-CNN operation is described by the following system of equations

$$x_{ij}(k+1) = \theta f\Big(\sum_{n,l \in N(ij)} A_{n-i,l-j}\, x_{ij}(k) + \sum_{n,l \in N(ij)} B_{n-i,l-j}\, u_{ij}(k) + I\Big) \qquad (2)$$
where $x_{ij}$ is the state of the cell in position $ij$; $u_{ij}$ is the input of the same cell; $A$ represents the interaction between cells, which is local (since the summations are taken over the set $N$ of indexes of neighboring cells) and space-invariant (because the weights depend on the difference between cell indexes rather than on their absolute values); $B$ represents forward connections issuing from a neighborhood of inputs; $I$ is a bias; $N(i,j)$ is the set of indexes corresponding to cell $ij$ itself and a neighborhood; $\theta$ is a scale constant, $\theta > 0$. The operation performed by the network is fully defined by the so-called cloning templates $\{A, B, I\}$. In this paper we consider binary output, so the activation function $f$ is defined as the sign function, that is

$$f(x) = \begin{cases} 1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0 \end{cases}$$

In order to realize learning, we approximate the sign function by means of a sigmoid function $\phi(x)$ of the following form

$$\phi(x) = \frac{a}{1 + e^{-bx}} - c$$

If $a$, $b$ and $c$ are selected as suitable values, $\phi(x)$ is similar to $f(x)$, and (2) can be rewritten as

$$x_{ij}(k+1) = \theta\phi[W_1(k)\, x_{ij}(k) + W_2(k)\, u_{ij}(k) + W_3(k)] \qquad (3)$$
where the matrices $W_1(k) = [A_{n-i,l-j}] \in \mathbb{R}^{n \times n}$, $W_2(k) = [B_{n-i,l-j}] \in \mathbb{R}^{n \times n}$, $W_3(k) = [I] \in \mathbb{R}^{n \times n}$ are the feedback template, control template, and bias of the $ij$-th CNN cell (that is, the weights of the NN). The nonlinear system represented by (3) is bounded-input bounded-output (BIBO) stable, because $x(k)$ and $u(k)$ are bounded. The desired state of $x_{ij}(k)$ is defined as $x^*_{ij}(k)$. According to the Stone-Weierstrass theorem, the nonlinear system (3) can be written as
$$x^*_{ij}(k+1) = \theta\phi[W_1^*\, x^*_{ij}(k) + W_2^*\, u_{ij}(k) + W_3^*] + \mu(k) \qquad (4)$$

where $W_1^*$, $W_2^*$ and $W_3^*$ are constant weights which minimize the modeling error $\mu(k)$. Since $\phi$ is a bounded function, $\mu(k)$ is bounded as $\mu^2(k) \le \bar{\mu}$, where $\bar{\mu}$ is an unknown positive constant. The error is defined as

$$e(k) = x_{ij}(k) - x^*_{ij}(k) \qquad (5)$$

From (4) and (3),

$$e(k+1) = \theta\phi[W_1(k)\, x_{ij}(k) + W_2(k)\, u_{ij}(k) + W_3(k)] - \theta\phi[W_1^*\, x^*_{ij}(k) + W_2^*\, u_{ij}(k) + W_3^*] - \mu(k)$$

Using a Taylor series around the point $W_1(k)\, x_{ij}(k) + W_2(k)\, u_{ij}(k) + W_3(k)$,

$$\phi[W_1(k)\, x_{ij}(k) + W_2(k)\, u_{ij}(k) + W_3(k)] - \phi[W_1^*\, x_{ij}(k) + W_2^*\, u_{ij}(k) + W_3^*] = \phi'\,\widetilde{W}_1(k)\, x_{ij}(k) + \phi'\,\widetilde{W}_2(k)\, u_{ij}(k) + \phi'\,\widetilde{W}_3(k) + \varepsilon(k) \qquad (6)$$
where $\widetilde{W}_1(k) = W_1(k) - W_1^*$, $\widetilde{W}_2(k) = W_2(k) - W_2^*$, $\widetilde{W}_3(k) = W_3(k) - W_3^*$, and $\varepsilon(k)$ is the second-order approximation error. $\phi'$ is the derivative of the nonlinear activation function $\phi$ at the point $W_1(k)\, x_{ij}(k) + W_2(k)\, u_{ij}(k) + W_3(k)$. So

$$e(k+1) = \theta\phi'\,\widetilde{W}_1(k)\, x_{ij}(k) + \theta\phi'\,\widetilde{W}_2(k)\, u_{ij}(k) + \theta\phi'\,\widetilde{W}_3(k) + \theta\zeta(k) \qquad (7)$$

where $\zeta(k) = \varepsilon(k) - \mu(k)$. The following theorem gives a stable learning algorithm for CNNs.

Theorem 2. The following gradient updating law can make the error $e(k)$ bounded (stable in an $L_\infty$ sense) for the system (3) representing a CNN:

$$W_1(k+1) = W_1(k) - \eta(k)\,\phi'\, x_{ij}(k)\, e^T(k)$$
$$W_2(k+1) = W_2(k) - \eta(k)\,\phi'\, u_{ij}(k)\, e^T(k)$$
$$W_3(k+1) = W_3(k) - \eta(k)\,\phi'\, e^T(k)$$

where $\eta(k)$ satisfies

$$\eta(k) = \begin{cases} \dfrac{\eta}{1 + \|\phi'\, x_{ij}(k)\|^2 + \|\phi'\, u_{ij}(k)\|^2 + \|\phi'\|^2}, & \text{if } \frac{1}{\theta}\|e(k+1)\| \ge \|e(k)\| \\[2mm] 0, & \text{if } \frac{1}{\theta}\|e(k+1)\| < \|e(k)\| \end{cases} \qquad (8)$$

with $0 < \eta \le 1$. The average of the identification error satisfies

$$\limsup_{T\to\infty} \frac{1}{T}\sum_{k=1}^{T} \|e(k)\|^2 \le \frac{\eta}{\pi}\,\bar{\zeta} \qquad (9)$$

where $\pi = \frac{\eta}{(1+\kappa)^2}$, $\kappa = \max_k\big(\|\phi'\, x_{ij}(k)\|^2 + \|\phi'\, u_{ij}(k)\|^2 + \|\phi'\|^2\big)$, $\bar{\zeta} = \max_k \zeta^2(k)$, and $T > 0$ is the identification time.
Proof. Select the Lyapunov function

$$V(k) = \|\widetilde{W}_1(k)\|^2 + \|\widetilde{W}_2(k)\|^2 + \|\widetilde{W}_3(k)\|^2$$

where $\|\widetilde{W}(k)\|^2 = \sum_i \widetilde{w}_i(k)^2 = tr\{\widetilde{W}^T(k)\,\widetilde{W}(k)\}$. From the updating law (8),

$$\widetilde{W}_1(k+1) = \widetilde{W}_1(k) - \eta(k)\,\phi'\,x_{ij}(k)\,e^T(k)$$

So

$$\begin{aligned}
\Delta V(k) &= V(k+1) - V(k)\\
&= \|\widetilde{W}_1(k) - \eta(k)\phi' x_{ij}(k)e^T(k)\|^2 - \|\widetilde{W}_1(k)\|^2 + \|\widetilde{W}_2(k) - \eta(k)\phi' u_{ij}(k)e^T(k)\|^2 - \|\widetilde{W}_2(k)\|^2\\
&\quad + \|\widetilde{W}_3(k) - \eta(k)\phi' e^T(k)\|^2 - \|\widetilde{W}_3(k)\|^2\\
&= \eta^2(k)\|e(k)\|^2\|\phi' x_{ij}(k)\|^2 - 2\eta(k)\,\phi' x_{ij}(k)\,\widetilde{W}_1(k)\,e^T(k)\\
&\quad + \eta^2(k)\|e(k)\|^2\|\phi' u_{ij}(k)\|^2 - 2\eta(k)\,\phi' u_{ij}(k)\,\widetilde{W}_2(k)\,e^T(k)\\
&\quad + \eta^2(k)\|e(k)\|^2\|\phi'\|^2 - 2\eta(k)\,\phi'\,\widetilde{W}_3(k)\,e^T(k)
\end{aligned}$$

From (7) we have

$$\frac{1}{\theta}\,e(k+1) = \phi'\,\widetilde{W}_1(k)\,x_{ij}(k) + \phi'\,\widetilde{W}_2(k)\,u_{ij}(k) + \phi'\,\widetilde{W}_3(k) + \zeta(k) \qquad (10)$$

Using (10) and $\eta(k) \ge 0$,

$$\begin{aligned}
&-2\eta(k)\,\phi' x_{ij}(k)\,\widetilde{W}_1(k)\,e^T(k) - 2\eta(k)\,\phi' u_{ij}(k)\,\widetilde{W}_2(k)\,e^T(k) - 2\eta(k)\,\phi'\,\widetilde{W}_3(k)\,e^T(k)\\
&= -2\eta(k)\,e^T(k)\Big[\frac{1}{\theta}e(k+1) - \zeta(k)\Big] \le -2\eta(k)\,\|e(k)\|\,\frac{1}{\theta}\|e(k+1)\| + 2\eta(k)\,\|e(k)\|\,\|\zeta(k)\|
\end{aligned}$$

If $\frac{1}{\theta}\|e(k+1)\| \ge \|e(k)\|$, then $e^T(k)\,\frac{1}{\theta}e(k+1) \ge \|e(k)\|^2$ and, since $0 < \eta \le 1$,

$$\begin{aligned}
\Delta V(k) &\le \eta^2(k)\|e(k)\|^2\big(\|\phi' x_{ij}(k)\|^2 + \|\phi' u_{ij}(k)\|^2 + \|\phi'\|^2\big) - \eta(k)\|e(k)\|^2 + \eta(k)\zeta^2(k)\\
&= -\eta(k)\Big[1 - \eta\,\frac{\|\phi' x_{ij}(k)\|^2 + \|\phi' u_{ij}(k)\|^2 + \|\phi'\|^2}{1 + \|\phi' x_{ij}(k)\|^2 + \|\phi' u_{ij}(k)\|^2 + \|\phi'\|^2}\Big]\|e(k)\|^2 + \eta(k)\zeta^2(k)\\
&\le -\pi e^2(k) + \eta\zeta^2(k)
\end{aligned} \qquad (11)$$

where $\pi = \frac{\eta}{(1+\kappa)^2}$ and $\kappa = \max_k\big(\|\phi' x_{ij}(k)\|^2 + \|\phi' u_{ij}(k)\|^2 + \|\phi'\|^2\big)$. As $\pi > 0$,

$$n\,\min_i(\widetilde{w}_i^2) \le V_k \le n\,\max_i(\widetilde{w}_i^2)$$

where $n\min_i(\widetilde{w}_i^2)$, $n\max_i(\widetilde{w}_i^2)$ and $\pi e^2(k)$ are $\mathcal{K}_\infty$-functions and $\eta\zeta^2(k)$ is a $\mathcal{K}$-function, so $V_k$ admits a smooth ISS-Lyapunov function as in Definition 2. From Theorem 1, the dynamics of the error are input-to-state stable. The "INPUT" corresponds to the second term of the last line in (11), i.e. the modeling error $\zeta(k) = \varepsilon(k) - \mu(k)$; the "STATE" corresponds to the first term of the last line in (11), i.e. the identification error $e(k)$. Because the "INPUT" $\zeta(k)$ is bounded and the dynamics are ISS, the "STATE" $e(k)$ is bounded.

If $\frac{1}{\theta}\|e(k+1)\| < \|e(k)\|$, then $\Delta V(k) = 0$; $V(k)$ is constant, and $W_1(k)$, $W_2(k)$, $W_3(k)$ are constant. Since $\|e(k+1)\| < \theta\|e(k)\|$, $e(k)$ is also bounded. (11) can be rewritten as

$$\Delta V(k) \le -\pi e^2(k) + \eta\zeta^2(k) \le -\pi e^2(k) + \eta\bar{\zeta} \qquad (12)$$

Summing (12) from 1 up to $T$, and using $V(T) > 0$ and the fact that $V(1)$ is a constant,

$$\pi\sum_{k=1}^{T} e^2(k) \le V(1) - V(T) + T\eta\bar{\zeta} \le V(1) + T\eta\bar{\zeta}$$

and (9) is established.
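The dead-zoned updating law of Theorem 2 can be sketched in code as follows; the scalar treatment of $\phi'$, the vector-valued bias $W_3$, and all numeric values are illustrative assumptions:

```python
import numpy as np

def stable_update(W1, W2, W3, e_next, e_curr, x, u, dphi, theta, eta=0.5):
    """One step of the dead-zoned gradient law of Theorem 2 (a sketch).

    dphi: scalar derivative of the activation at the operating point;
    e_next, e_curr: error vectors e(k+1), e(k); 0 < eta <= 1.
    """
    if np.linalg.norm(e_next) / theta >= np.linalg.norm(e_curr):
        # Eq. (8): normalized learning rate outside the dead-zone
        denom = (1.0 + np.linalg.norm(dphi * x) ** 2
                 + np.linalg.norm(dphi * u) ** 2 + dphi ** 2)
        eta_k = eta / denom
    else:
        eta_k = 0.0  # inside the dead-zone: weights are frozen
    W1 = W1 - eta_k * dphi * np.outer(x, e_curr)
    W2 = W2 - eta_k * dphi * np.outer(u, e_curr)
    W3 = W3 - eta_k * dphi * e_curr   # bias treated as a vector here
    return W1, W2, W3

x = np.array([1.0, -1.0]); u = np.array([0.5, 0.5])
args = (x, u, 0.5, 1.0)  # dphi, theta
frozen = stable_update(np.eye(2), np.eye(2), np.zeros(2),
                       np.array([0.1, 0.0]), np.array([1.0, 0.0]), *args)
moving = stable_update(np.eye(2), np.eye(2), np.zeros(2),
                       np.array([2.0, 0.0]), np.array([1.0, 0.0]), *args)
```

When the error is already shrinking faster than the rate $\theta$, the law returns the weights untouched; otherwise the normalized step is applied.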
Remark 1. The condition $\frac{1}{\theta}\|e(k+1)\| \ge \|e(k)\|$ defines a dead-zone. If $\theta$ is small, the dead-zone also becomes small. The situation is similar to the one illustrated in [11], but in that work the same approach is applied to feedforward NNs, whereas CNNs are a type of recurrent NN. The main difference with respect to the final result is that the dead-zone of CNNs depends on $\theta$, whereas the dead-zone of feedforward NNs depends on the upper bound of the unmodeled dynamics.
4 Conclusions
In this paper, we proposed a new stable algorithm to obtain cloning templates for DT Cellular Neural Networks in a supervised setting. Thanks to ISS theory, we can affirm that such updating laws keep the error bounded and the whole algorithm is quite robust. We have shown only a mathematical approach to the problem, but we will soon apply our theoretical solution experimentally. In our opinion, there is still room for improvement on this topic, above all in understanding the reasons for the intrinsic difficulty of CNN learning.
References
1. Balsi, M.: Recurrent Back-Propagation for CNNs. In: Proc. of ECCTD'93, Davos (1993) 677–682
2. Chandler, B., Rekeczky, C., Nishio, Y., Ushida, A.: Adaptive Simulated Annealing in CNN Template Learning. IEICE Trans. Fundamentals 82(2) (1999) 398–402
3. Chua, L.O., Yang, L.: Cellular Neural Networks: Theory. IEEE Trans. Circuits and Systems 35(10) (1988) 1257–1272
4. Gómez-Ramírez, E., Mazzanti, F.: Cellular Neural Networks Learning using Genetic Algorithms. In: Díaz de León, J., Yáñez, C. (eds.): Reconocimiento de Patrones: avances y perspectivas. IPN, México (2002)
5. Kozek, T., Roska, T.: Genetic Algorithm for CNN Template Learning. IEEE Trans. Circuits and Systems I: Fundamental Theory and Appl. 40(6) (1993) 392–402
6. Nossek, J.A.: Design and Learning with Cellular Neural Networks. In: Proc. of IEEE CNNA-94, Rome (1994) 137–146
7. Roska, T., Kék, L., Nemes, L., Zarándy, A., Brendel, M.: CSL-CNN Software Library. Report of the Analogical and Neural Computing Laboratory, Computer and Automation Institute, Hungarian Academy of Sciences, Budapest, Hungary (2000)
8. Szolgay, P., Kozek, T.: Optical Detection of Layout Errors of Printed Circuit Boards using Learned CNN Templates. Report DNS-8-1991, Dual and Neural Computing Systems Res. Lab., Comp. Aut. Inst., Hung. Acad. Sci., Budapest (1991)
9. Tetzlaff, R., Wolf, D.: A Learning Algorithm for the Dynamics of CNN with Nonlinear Templates – Part I: Discrete-Time Case. In: Proc. IEEE CNNA-96 (1996) 461–466
10. Vilasís-Cardona, X., Vinyoles-Serra, M.: On Cellular Neural Network Learning. In: Proc. of ECCTD 2005, Cork (2005)
11. Yu, W., Li, X.: Discrete-Time Neuro Identification without Robust Modification. IEE Proceedings – Control Theory and Applications 150(3) (2003) 311–316
12. Zarándy, A.: The Art of CNN Template Design. Int. J. of Circuit Theory Appl. 27 (1999) 5–23
13. Zou, F., Schwarz, S., Nossek, J.A.: Cellular Neural Network Design using a Learning Algorithm. In: Proc. IEEE CNNA-90 (1990) 73–81
A New Stochastic PSO Technique for Neural Network Training Yangmin Li and Xin Chen Department of Electromechanical Engineering, Faculty of Science and Technology, University of Macau, Av. Padre Tomás Pereira S.J., Taipa, Macao SAR, P.R. China
[email protected],
[email protected]
Abstract. Recently, Particle Swarm Optimization (PSO) has been widely applied to training neural networks. To improve the performance of PSO in the high-dimensional solution spaces that always occur when training NNs, this paper introduces a new paradigm of particle swarm optimization named stochastic PSO (S-PSO). The feature of S-PSO is its high ability for exploration. Consequently, when the swarm size is relatively small, S-PSO performs much better than traditional PSO in training NNs. Hence, if S-PSO is used for NN training, the computational cost of training can be reduced significantly.
1 Introduction
As an attempt to model the processing power of the human brain, an artificial neural network is viewed as a universal approximator for any nonlinear function. Up to now many algorithms for training neural networks have been developed; the backpropagation (BP) method is a popular one. Instead of BP, this paper introduces a new particle swarm optimization (PSO) method for training NNs. Since PSO was first developed in 1995 [1], it has been an increasingly hot topic in the artificial intelligence community. Due to PSO's advantages in terms of simple structure and easy implementation in practice, it is widely used in many fields that involve optimization algorithms [2], [3], [4]. Up to now, more and more publications refer to training neural networks with PSO. It is normally accepted that PSO-NN has the following advantages: 1) There is no restriction, critical for BP, that the transfer function in the hidden layer be differentiable, so more transfer functions can be selected to fulfill different requirements. 2) Compared with BP, the PSO training algorithm is less likely to be trapped in local minima. But as a stochastic method, PSO suffers from the "curse of dimensionality", which implies that its performance deteriorates as the dimensionality of the search space increases.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 564–569, 2006. © Springer-Verlag Berlin Heidelberg 2006
To overcome this problem, a cooperative approach was proposed in which the solution space is divided into several subspaces with
A New Stochastic PSO Technique for Neural Network Training
565
lower dimension, and several swarms in these subspaces cooperate with each other to find the global best solution [5]. However, such a method induces a problem known as stagnation. Since the performance of PSO is determined by the exploration and exploitation of its particles, improving exploration ability lets particles explore the solution space more efficiently and thus improves PSO performance. In this paper, we propose a stochastic PSO (S-PSO) that accomplishes NN training with a relatively small swarm size but high efficiency.
2 Stochastic PSO with High Exploration Ability
Since many papers already address NN training with PSO, we introduce the main idea only briefly. Given a multi-layer neural network, all its weights are combined into a vector, which is viewed as a point in a solution space. A swarm is then formed whose members (particles) represent such solution candidates. According to a certain criterion, normally the minimal mean square error between target patterns and NN outputs, all particles congregate to a position whose coordinates represent the best solution they have found. The dimension of the solution space therefore equals the number of weights of the NN. Take a three-layer fully connected feed-forward network as an example: with five neurons in the hidden layer and biases in the hidden and output layers, even a network with one input and one output has sixteen weights to be trained.

Now let us estimate the swarm size sufficient for training. Consider the traditional PSO updating principle with constriction coefficient K:

$$v_i(n+1) = K\big[v_i(n) + c_{i1} r_{i1}(n)\big(P_i^d(n) - X_i(n)\big) + c_{i2} r_{i2}(n)\big(P_i^g(n) - X_i(n)\big)\big],$$
$$X_i(n+1) = X_i(n) + v_i(n+1), \qquad (1)$$

where $X_i = [\,x_{i1}\; x_{i2}\; \cdots\; x_{iD}\,]$ denotes the current position; $v_i$ denotes the current velocity; $c_1$ and $c_2$ are the acceleration coefficients; $P_i^d(n)$ is the best position found by particle i so far; and $P_i^g(n)$ is the best position found by particle i's neighborhood. Obviously, random exploration is constricted within the subspace spanned by $\{P_i^d(n) - X_i(n),\, P_i^g(n) - X_i(n)\}$; that is, the direction of exploration is restricted. At the same time, the intensity of exploration is determined entirely by the rate of decrease of $P_i^d(n) - X_i(n)$ and $P_i^g(n) - X_i(n)$. To maintain exploration ability, a swarm usually needs many particles, so the swarm size is several times, even ten times, the dimension of the solution space.
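For concreteness, the constricted update in Eq. (1) can be sketched as follows. This is a minimal NumPy sketch; the function name and array layout are our own choices, not from the paper:

```python
import numpy as np

def constricted_pso_step(X, V, P_d, P_g, K=0.729, c1=2.05, c2=2.05, rng=None):
    """One update of the traditional constricted PSO, Eq. (1).

    X, V : (M, D) current positions and velocities of M particles
    P_d  : (M, D) personal best positions P_i^d(n)
    P_g  : (M, D) neighborhood best positions P_i^g(n) (rows may repeat)
    """
    rng = rng or np.random.default_rng()
    r1 = rng.random(X.shape)  # r_i1(n) ~ U[0, 1], drawn per component
    r2 = rng.random(X.shape)  # r_i2(n) ~ U[0, 1]
    V = K * (V + c1 * r1 * (P_d - X) + c2 * r2 * (P_g - X))
    return X + V, V
```

Note that each velocity term lies in the span of (P_d − X) and (P_g − X), which is precisely the restricted exploration direction the text describes.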
Even for a simple SISO NN with a five-neuron hidden layer, the dimension is already sixteen, so a traditional swarm may need dozens of particles. Hence the crucial shortcoming of PSO-NN is that the swarm size is so large that the computational burden becomes unacceptable. Since the relatively low exploration ability is induced by constraining both the direction and the intensity of exploration to the relative distances between particles' current positions and
566
Y. Li and X. Chen
the best solutions found by them and their neighborhoods, we try to introduce into the updating principle a random exploration velocity that is independent of the positions. Based on the explicit representation (ER) of PSO [6], we propose a new stochastic PSO (S-PSO) defined as follows.

Definition of Stochastic PSO. Given a swarm of M particles, the position of particle i is defined as $X_i = [\,x_{i1}\; x_{i2}\; \cdots\; x_{iD}\,]^T$, where D is the dimension of the swarm space. The updating principle for an individual particle is

$$v_i(n+1) = \varepsilon(n)\big[v_i(n) + c_{i1} r_{i1}(n)\big(P_i^d(n) - X_i(n)\big) + c_{i2} r_{i2}(n)\big(P_i^g(n) - X_i(n)\big) + \xi_i(n)\big],$$
$$X_i(n+1) = \alpha X_i(n) + v_i(n+1) + \frac{1-\alpha}{\phi_i(n)}\big(c_{i1} r_{i1}(n) P_i^d(n) + c_{i2} r_{i2}(n) P_i^g(n)\big), \qquad (2)$$

where $c_1$ and $c_2$ are positive constants; $P_i^d(n)$ is the best solution found by particle i so far; $P_i^g(n)$ is the best position found by particle i's neighborhood; and $\phi_i(n) = \phi_{i1}(n) + \phi_{i2}(n)$ with $\phi_{i1}(n) = c_{i1} r_{i1}(n)$ and $\phi_{i2}(n) = c_{i2} r_{i2}(n)$.

Applying the theory of stochastic approximation, we can prove that the updating principle converges with probability one if the following assumptions hold:
1) $\xi_i(n)$ is a random velocity with a continuous uniform distribution and constant expectation $\Xi_i = E\xi_i(n)$;
2) $\varepsilon(n) \to 0$ as n increases, and $\sum_{n=0}^{\infty} \varepsilon(n) = \infty$;
3) $0 < \alpha < 1$;
4) $r_{i1}(n)$ and $r_{i2}(n)$ are independent random variables uniformly distributed on [0, 1], each with expectation 0.5.

Let $P^* = \arg\inf_{\lambda \in \mathbb{R}^D} F(\lambda)$ denote the unique optimal position in the solution space. Then the swarm converges to $P^*$ if $\lim_n P_i^d(n) \to P^*$ and $\lim_n P_i^g(n) \to P^*$. Owing to page limitations, we only sketch the main idea of the proof.
Define $Y(n) = X(n) - P^*$ and $\theta(n) = [\,v(n)\; Z(n)\,]^T = [\,v(n)\; Y(n) - E_n Q_r(n)\,]^T$, where

$$Q_r(n) = \frac{1}{\phi(n)}\big[\phi_1(n)\big(P^d(n) - P^*\big) + \phi_2(n)\big(P^g(n) - P^*\big)\big].$$

Then the updating principle can be expressed in the standard form of a stochastic ODE:
$$\theta(n+1) = \theta(n) + \varepsilon(n)H(n).$$
(3)
Applying Lyapunov theory for stochastic processes, a Lyapunov function is defined as

$$L(\theta(n)) = \frac{1}{2}\,\theta^T \begin{bmatrix} 1 & 0 \\ 0 & \Phi \end{bmatrix} \theta = \frac{1}{2}\big(v^2(n) + \Phi Z^2(n)\big). \qquad (4)$$

After some calculation, we find that for $n > N_k$, where $N_k$ is a sufficiently large integer, there is a positive non-decreasing function $k(\theta(n))$ such that

$$E_n L(\theta(n+1)) - L(\theta(n)) \le -k(\theta(n)) + E\,b_1(n)\big(Q_r(n) - E_n Q_r(n)\big)^2, \qquad (5)$$

where $b_1(n) = \Phi\big(1 - \alpha + \varepsilon(n)\phi(n)\big)^2 + \big(\varepsilon(n)\phi(n)\big)^2$. Therefore $\theta(n)$ returns to the neighborhood of the set $\{\theta \mid k(\theta(t)) = \Phi(1-\alpha)E(Q_r(t) - E_n Q_r(t))^2\}$ infinitely often as $n \to \infty$.
It can be proved that the following condition on the asymptotic rate of change holds:

$$\lim_n \sup_{j \ge n} \max_{0 \le t \le T} \big| M^0(jT + t) - M^0(jT) \big| = 0, \qquad (6)$$
where $M^0(t) = \sum_{i=0}^{m(t)-1} \varepsilon(i)\,\delta M(i)$ and $\delta M(n) = H(n) - E_n H(n)$. Hence as $n \to \infty$, the stochastic factor cannot force $\theta(n)$ out of the vicinity of the set $\{\theta \mid k(\theta(t)) = \Phi(1-\alpha)E(Q_r(t) - E_n Q_r(t))^2\}$ infinitely often; that is, $\theta(n)$ must converge to this set with probability one. If $\lim_n Q_r(n) = 0$, or $P^d(n)$ and $P^g(n)$ converge to $P^*$, this set becomes $\{\theta \mid k(\theta(t)) = 0\}$, and the swarm must converge to $P^*$. The following properties are obtained:
1) While n is less than the $N_k$ that makes inequality (5) hold, the updating principle is non-convergent, so particles move away from the best positions recorded by themselves and their neighborhoods. This phenomenon can be viewed as strong exploration, with all particles exploring the solution space. When $n > N_k$, the particles start to converge.
2) The additional random velocity $\xi(n)$, independent of a particle's position, is very useful for maintaining the intensity of exploration: when particles congregate too fast, they retain some exploration behavior and avoid being trapped in local minima.
These two properties imply that S-PSO has stronger exploration ability than traditional PSO, so S-PSO can accomplish NN training with a relatively small swarm size.
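The S-PSO update of Eq. (2) can be sketched as follows. The gain schedule ε(n) = 3.5/(n+1)^0.4 is the one used in the experiments of Section 3; the magnitude of the random exploration velocity (xi_scale) is our own illustrative assumption, since the paper only requires ξ to be uniformly distributed with constant expectation:

```python
import numpy as np

def stochastic_pso_step(X, V, P_d, P_g, n, c1=3.5, c2=3.5, alpha=0.95,
                        xi_scale=0.1, rng=None):
    """One S-PSO update per Eq. (2)."""
    rng = rng or np.random.default_rng()
    eps = 3.5 / (n + 1) ** 0.4                 # eps(n) -> 0, sum eps(n) = inf
    phi1 = c1 * rng.random(X.shape)            # phi_i1(n) = c_i1 * r_i1(n)
    phi2 = c2 * rng.random(X.shape)            # phi_i2(n) = c_i2 * r_i2(n)
    xi = rng.uniform(-xi_scale, xi_scale, X.shape)  # random exploration velocity
    V = eps * (V + phi1 * (P_d - X) + phi2 * (P_g - X) + xi)
    X = alpha * X + V + (1 - alpha) / (phi1 + phi2) * (phi1 * P_d + phi2 * P_g)
    return X, V
```

Note that for small n, eps exceeds 1, which reproduces the initially non-convergent, strongly exploratory phase described in property 1); a swarm sitting exactly at P_d = P_g = X with ξ suppressed is a fixed point of the update.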
3 Neural Network Training with S-PSO
To test the feasibility of S-PSO for training, we run a test on non-linear artificial function approximation, comparing against two other training methods: BP and traditional PSO. Since the test investigates the learning dynamics of the three algorithms, there is no need to build a complex neural network, so a standard one-input, one-output, three-layer feed-forward network whose hidden layer has five neurons is chosen, as Fig. 1 shows. There are sixteen weights to be optimized. To allow the use of BP, a differentiable sigmoid function and a linear function are chosen as the transfer functions of the hidden layer and the output layer, respectively. We use batch training; that is, in each epoch (iteration) the weights are updated once after all data have been presented to the net. The parameters for the PSO training methods are as follows. For S-PSO, c1 = c2 = 3.5, α = 0.95, and ε(n) = 3.5/(n + 1)^0.4. For traditional PSO, c1 = c2 = 2.05 and K = 0.729. To speed up the convergence of BP, gradient descent with momentum is applied as the BP learning rule, with the momentum constant set to 0.9. The NN toolbox in MATLAB 6.5 is used to build the BP neural network. Each weight in the three NNs is initialized within [0, 1]. A data set of 40 points is presented to the NNs for training, in which the data are sampled from a smooth
Fig. 1. Feed-forward neural network
Table 1. Comparative Results

Algorithm (swarm size)   Average of MSE   Minimum of MSE   Maximum of MSE   Std. Dev.
S-PSO (20)               0.6636 × 10^-2   0.9527 × 10^-3   0.02109          0.1187 × 10^-2
S-PSO (35)               0.5465 × 10^-2   0.1155 × 10^-2   0.01078          0.5102 × 10^-3
Traditional PSO (20)     0.01067          0.1408 × 10^-2   0.02151          0.2468 × 10^-3
Traditional PSO (35)     0.3648 × 10^-2   0.1078 × 10^-2   0.01056          0.6556 × 10^-3
Backpropagation          0.2367 × 10^-2   0.1040 × 10^-2   0.5743 × 10^-2   0.7082 × 10^-3
Fig. 2. (a) Convergence of mean square error over the iterations for S-PSO, traditional PSO, and BP; (b) results of function approximation (network output Y vs. input X)
continuous curve with small disturbances. Each training algorithm is tested over 30 runs, each run comprising 1500 iterations. To investigate the performance of PSO under different swarm sizes, two sizes are tested: 20, which is slightly greater than the dimension of the solution space, and 35, which is more than twice the dimension. All test results are listed in Table 1. We observe that S-PSO training with a swarm of 20 particles performs much better than traditional PSO; we attribute this improvement to the random velocity ξ(n). But when the swarm size increases to 35, more than twice the solution dimension, the performance of traditional
PSO catches up with S-PSO, because the larger size lets traditional PSO explore better, and traditional PSO inherently converges faster than S-PSO. This fast convergence can be observed in Fig. 2(a), where traditional PSO converges as fast as BP, while, due to the divergence property mentioned previously, S-PSO converges much more slowly than the other two. The function approximation result of one run is shown in Fig. 2(b), in which the black circles denote the data presented to the NN, and the red and blue lines denote the outputs of the networks trained with S-PSO and BP, respectively.
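The training setup above, a 1-5-1 network whose PSO fitness is the batch MSE over the data set, can be sketched as follows. The particle layout (which slice of the 16-dimensional vector holds which weights) and the use of tanh as the sigmoid are our own illustrative assumptions; the paper fixes only the total of sixteen weights:

```python
import numpy as np

def nn_forward(particle, x):
    """Decode a 16-dim particle into a 1-5-1 network and evaluate it on inputs x.

    Assumed layout: 5 input->hidden weights, 5 hidden biases,
    5 hidden->output weights, 1 output bias.
    """
    w1, b1 = particle[0:5], particle[5:10]
    w2, b2 = particle[10:15], particle[15]
    h = np.tanh(np.outer(x, w1) + b1)  # sigmoid-type hidden layer, shape (N, 5)
    return h @ w2 + b2                 # linear output layer, shape (N,)

def fitness(particle, x, d):
    """Batch mean square error between targets d and network outputs."""
    return np.mean((d - nn_forward(particle, x)) ** 2)
```

Each PSO particle is simply one such 16-dimensional weight vector, and the swarm minimizes `fitness` over all 40 training points at once (batch training).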
4 Conclusion
PSO is an important evolutionary technique for optimization that has become increasingly popular for training neural networks, owing to the advantages mentioned above. But since the dimension of the solution space equals the number of weights in the NN, the swarm size is normally so large that the computational cost becomes a heavy burden. This paper proposes a new stochastic PSO (S-PSO) whose advantage is an independent random velocity ξ(n) that improves the exploration ability of the swarm, so that S-PSO can accomplish NN training with a relatively small swarm. Applying S-PSO therefore reduces the computational cost of PSO training significantly while preserving the advantages of PSO training.
Acknowledgement This work was supported by the Research Committee of University of Macau under Grant no. RG082/04-05S/LYM/FST.
References
1. Kennedy, J., Eberhart, R.C.: Particle Swarm Optimization. Proceedings of the IEEE International Conference on Neural Networks, Perth, Australia (1995) 1942-1948
2. Juang, C.F.: A Hybrid of Genetic Algorithm and Particle Swarm Optimization for Recurrent Network Design. IEEE Transactions on Systems, Man, and Cybernetics–Part B: Cybernetics 34(2) (2004) 997-1006
3. Li, Y., Chen, X.: Mobile Robot Navigation Using Particle Swarm Optimization and Adaptive NN. In: Wang, L., Chen, K., Ong, Y.S. (eds.): Advances in Natural Computation. LNCS 3612, Springer (2005) 628-631
4. Messerschmidt, L., Engelbrecht, A.P.: Learning to Play Games Using a PSO-Based Competitive Learning Approach. IEEE Transactions on Evolutionary Computation 8(3) (2004) 280-288
5. van den Bergh, F., Engelbrecht, A.P.: A Cooperative Approach to Particle Swarm Optimization. IEEE Transactions on Evolutionary Computation 8(3) (2004) 225-239
6. Clerc, M., Kennedy, J.: The Particle Swarm: Explosion, Stability, and Convergence in a Multidimensional Complex Space. IEEE Transactions on Evolutionary Computation 6(1) (2002) 58-73
A Multi-population Cooperative Particle Swarm Optimizer for Neural Network Training Ben Niu1,2, Yun-Long Zhu1, and Xiao-Xian He1,2 1
Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China 2 Graduate School of the Chinese Academy of Sciences, Beijing 100049, China {niuben, ylzhu}@sia.cn
Abstract. This paper presents a new learning algorithm, the Multi-Population Cooperative Particle Swarm Optimizer (MCPSO), for neural network training. MCPSO is based on a master-slave model, in which a population consists of a master group and several slave groups. The slave groups execute a single PSO or its variants independently to maintain the diversity of the particles, while the master group evolves based on its own information and also the information of the slave groups. The particles in the master group and the slave groups co-evolve during the search process through a parameter termed the migration factor. MCPSO is applied to train a multilayer feed-forward neural network on three benchmark classification problems. The performance of MCPSO for neural network training is compared to that of back propagation (BP), the genetic algorithm (GA), and standard PSO (SPSO), demonstrating its effectiveness and efficiency.
1 Introduction
Artificial neural networks (ANNs) have been widely used in many areas. In particular, the back-propagation (BP) network [1] has been successfully adopted to solve many problems, such as systems control, data compression, and optimization. Error back-propagation is one of the most popular methods for training BP neural networks. However, it relies on the gradient information of an objective function, is easily trapped in local minima, and is of limited use for complex optimization problems. The advent of evolutionary computation (EC) has brought a new technique to the optimal design of neural networks [2]. Recently, a swarm intelligence method, particle swarm optimization (PSO), has been applied to many problems, including neural network design [3]. As noted by Angeline [4], the original PSO, while successful in the optimization of several difficult benchmark problems, has difficulty controlling the balance between exploration and exploitation, namely when fine-tuning around the optimum is attempted. Shi and Eberhart [5] introduced a linearly decreasing weight approach to control this balance, i.e., decreasing the inertia weight from a relatively large value (wmax) to a
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 570–576, 2006. © Springer-Verlag Berlin Heidelberg 2006
small value (wmin) over the course of the PSO run. Clerc [6] also introduced a constriction factor approach to balance the global exploration and local exploitation abilities of the swarm. In this work we present a new approach to controlling the exploration/exploitation balance by introducing a multi-population cooperative scheme [7, 8]. The proposed scheme (MCPSO) consists of a master group and several slave groups in a population. The slave groups perform independently and produce many new promising particles, positioned where the best fitness values have been found. The master group collects the information of these particles from time to time during the evolution process and then updates its own particles' states using the best current positions discovered by all particles, both in the slave groups and in its own group. The interactions between the master group and the slave groups influence the balance between exploration and exploitation and maintain some diversity in the population even as it approaches convergence, thus reducing the risk of convergence to local sub-optima. The proposed scheme, MCPSO, is applied to train a multilayer feed-forward neural network on three benchmark classification problems. Experimental results are presented to illustrate the effectiveness and competitiveness of MCPSO.
2 Review of Standard PSO (SPSO)
In PSO, the potential solutions, called particles, fly in a D-dimensional search space with a velocity that is dynamically adjusted according to each particle's own experience and that of its neighbors. The i-th particle is represented as $x_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$, where $x_{id} \in [l_d, u_d]$, $d \in [1, D]$, and $l_d$, $u_d$ are the lower and upper bounds of the d-th dimension, respectively. The velocity of particle i is represented as $v_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$, which is clamped to a maximum velocity vector $v_{\max}$. The best previous position of the i-th particle is recorded and represented as $P_i = (P_{i1}, P_{i2}, \ldots, P_{iD})$, also called pbest. The index of the best particle among all particles in the population is denoted by g, and $P_g$ is called gbest. At each iteration step t, the particles are manipulated according to the following equations:

$$v_i(t+1) = w \cdot v_i(t) + R_1 c_1 \big(P_i - x_i(t)\big) + R_2 c_2 \big(P_g - x_i(t)\big), \qquad (1)$$
$$x_i(t+1) = x_i(t) + v_i(t+1), \qquad (2)$$

where w is the inertia weight; $c_1$ and $c_2$ are acceleration constants; and $R_1$, $R_2$ are random vectors with components uniformly distributed in [0, 1].
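The SPSO iteration of Eqs. (1)-(2) can be sketched as follows. This is a minimal NumPy sketch; the function name, the choice of v_max, and the array layout are our own illustrative assumptions:

```python
import numpy as np

def spso_step(x, v, p_best, g_best, w=0.9, c1=1.5, c2=1.5, v_max=1.0, rng=None):
    """One SPSO iteration per Eqs. (1)-(2): inertia-weighted velocity update,
    followed by the position update; velocity clamped to +/- v_max."""
    rng = rng or np.random.default_rng()
    R1 = rng.random(x.shape)  # components uniform in [0, 1]
    R2 = rng.random(x.shape)
    v = w * v + R1 * c1 * (p_best - x) + R2 * c2 * (g_best - x)
    v = np.clip(v, -v_max, v_max)  # clamp to the maximum velocity vector
    return x + v, v
```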
3 MCPSO Algorithm
The initial inspiration for PSO was the coordinated movement of groups of animals in nature, e.g., schools of fish or flocks of birds. It reflects the cooperative relationship among the individuals within a group. In a natural ecosystem, however, many species have developed cooperative interactions with other species to improve their survival. Such cooperative co-evolution, also called symbiosis, can be found in organisms ranging from cells to higher animals, including the common mutualism between plants and animals [9]. Inspired by the phenomenon of symbiosis in natural ecosystems, a master-slave mode is incorporated into SPSO, and the Multi-population (species) Cooperative PSO (MCPSO) is thus obtained. In our approach, the population consists of a master group and several slave groups. The slave groups encourage global exploration (moving to previously unencountered areas of the search space), while the master group promotes local exploitation, i.e., fine-tuning the current search area. The symbiotic relationship between the master group and the slave groups can keep the right balance of exploration and exploitation, which is essential for the success of a given optimization task. The master-slave communication model, shown in Fig. 1, is used to assign fitness evaluations and maintain algorithm synchronization.
Fig. 1. The master-slave model
Independent populations (species) are called slave groups. Each slave group executes a single PSO or one of its variants, including the update of position and velocity and the creation of a new local population. When all slave groups are ready with their new generations, each slave group sends its best local individual to the master group. The master group selects the best of all received individuals and evolves according to the following equations:

$$v_i^M(t+1) = w\,v_i^M(t) + R_1 c_1 \big(p_i^M - x_i^M(t)\big) + \Phi R_2 c_2 \big(p_g^M - x_i^M(t)\big) + (1-\Phi) R_3 c_3 \big(P_g^S - x_i^M(t)\big), \qquad (3)$$
$$x_i^M(t+1) = x_i^M(t) + v_i^M(t+1), \qquad (4)$$
where M denotes the master group; $p_g^M$ and $P_g^S$ are the best previous particles among all the particles in the master group and the slave groups, respectively; $R_3$ is a random value between 0 and 1; and $c_3$ is an acceleration constant. For a minimization problem, the migration factor Φ is given by

$$\Phi = \begin{cases} 0 & Gbest^S < Gbest^M \\ 0.5 & Gbest^S = Gbest^M \\ 1 & Gbest^S > Gbest^M, \end{cases} \qquad (5)$$

where $Gbest^M$ and $Gbest^S$ are the fitness values determined by $p_g^M$ and $P_g^S$, respectively.
The pseudocode of the MCPSO algorithm is listed in Table 1.

Table 1. Pseudocode for the MCPSO algorithm

Algorithm MCPSO
Begin
  Initialize all the populations
  Evaluate the fitness value of each particle
  Repeat
    Do in parallel
      Node i, 1 ≤ i ≤ K   // K is the number of slave groups
    End do in parallel
    Barrier synchronization   // wait for all processes to finish
    Select the fittest global individual (P_g^S or p_g^M) from all the swarms
    Evolve the master group
      // update the velocity and position using (3) and (4), respectively
    Evaluate the fitness value of each particle
  Until a termination condition is met
End
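The master-group step defined by Eqs. (3)-(5) can be sketched as follows. The default parameter values follow the settings reported in Section 4.2 (c1 = c2 = 1.5, c3 = 0.8); the function names and array layout are our own illustrative choices:

```python
import numpy as np

def migration_factor(gbest_S, gbest_M):
    """Phi per Eq. (5): for minimization, pull toward whichever group
    currently holds the better (smaller) fitness."""
    if gbest_S < gbest_M:
        return 0.0   # slave best is better: weight (1 - Phi) = 1 on the slave term
    if gbest_S > gbest_M:
        return 1.0   # master best is better: full weight on the master term
    return 0.5

def master_step(X, V, P_i, p_gM, P_gS, gbest_M, gbest_S,
                w=0.9, c1=1.5, c2=1.5, c3=0.8, rng=None):
    """One master-group update per Eqs. (3)-(4).

    X, V : (M, D) master-group positions and velocities
    P_i  : (M, D) personal bests; p_gM, P_gS : (D,) master / slave global bests
    """
    rng = rng or np.random.default_rng()
    phi = migration_factor(gbest_S, gbest_M)
    R1, R2 = rng.random(X.shape), rng.random(X.shape)
    R3 = rng.random()  # scalar random value in [0, 1]
    V = (w * V + R1 * c1 * (P_i - X) + phi * R2 * c2 * (p_gM - X)
         + (1 - phi) * R3 * c3 * (P_gS - X))
    return X + V, V
```

The migration factor is what implements the symbiosis: when a slave group discovers a better solution than the master group, the master's social term switches entirely to that slave's global best.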
4 Experiment Studies
In order to demonstrate the performance of MCPSO, it is applied to the training of multilayer feed-forward neural networks (MFNNs) for classification problems. The performance of MCPSO is also compared with that obtained using BP, GA, and SPSO training.

4.1 Training MFNNs Using the MCPSO Algorithm
Upon adopting MCPSO to train an MFNN, two key problems must be resolved, namely encoding the MFNN and designing the fitness function. For a three-layer MFNN, the free parameters to be coded include the weights and biases, which can be arranged as a one-dimensional vector, i.e.,
$$\big\{\, w^{(IH)}_1 \cdots w^{(IH)}_{I\times H},\;\; w^{(HO)}_1 \cdots w^{(HO)}_{H\times O},\;\; b^{(H)}_1 \cdots b^{(H)}_H,\;\; b^{(O)}_1 \cdots b^{(O)}_O \,\big\},$$

where I, H, and O are the numbers of neurons in the input, hidden, and output layers, respectively; $w^{(IH)}$ is the vector of weights between the input and hidden layers; $w^{(HO)}$ is the vector of weights between the hidden and output layers; $b^{(H)}$ is the vector of biases of the hidden layer; and $b^{(O)}$ is the vector of biases of the output layer. The size of the vector is

$$D = I \times H + H \times O + H + O.$$

In particular, each particle in MCPSO contains a set of weights and biases of the MFNN, so the dimension of each particle equals D. The MFNN is trained using MCPSO by moving the particles in the weight space to minimize the mean squared error (MSE):

$$MSE = \frac{1}{\#patterns}\,\frac{1}{O} \sum_{p=1}^{\#patterns} \sum_{k=1}^{O} \big(d_{kp} - y_{kp}\big)^2, \qquad (6)$$

where $d_{kp}$ is the k-th node of the desired output and $y_{kp}$ is the k-th network output for pattern p.

4.2 Numerical Examples
Three benchmark classification problems, i.e., Iris, New-thyroid, and Glass, are used for testing. The data sets for these problems can be obtained from the UCI repository. The network configurations are listed in Table 2.

Table 2. Network configuration

Problem       Architecture   #weights
Iris          4-3-3          27
New-thyroid   5-4-3          39
Glass         9-8-7          143

In applying MCPSO, the number of slave groups is K = 3 and the population size of each swarm is n = 20, i.e., 80 individuals are initially randomly generated in a population. For the master group, the inertia weights wmax, wmin, the acceleration constants c1, c2, and the migration coefficient c3 are set to 0.9, 0.4, 1.5, 1.5, and 0.8, respectively. In the slave groups, the inertia weights and acceleration constants are the same as those used in the master group.
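The particle encoding and MSE fitness of Section 4.1 can be sketched as follows. The reshaping order within each block and the tanh hidden activation are our own illustrative assumptions; the paper fixes only the block layout {w^(IH), w^(HO), b^(H), b^(O)} and the total dimension D:

```python
import numpy as np

def decode(particle, I, H, O):
    """Split a D-dim particle into (w_IH, w_HO, b_H, b_O) per Sec. 4.1."""
    a, b, c = I * H, I * H + H * O, I * H + H * O + H
    w_IH = particle[:a].reshape(I, H)
    w_HO = particle[a:b].reshape(H, O)
    b_H, b_O = particle[b:c], particle[c:]
    return w_IH, w_HO, b_H, b_O

def mse(particle, X, D_target, I, H, O):
    """Eq. (6): mean squared error over all patterns and all O output nodes."""
    w_IH, w_HO, b_H, b_O = decode(particle, I, H, O)
    hidden = np.tanh(X @ w_IH + b_H)   # hidden activation (choice is an assumption)
    Y = hidden @ w_HO + b_O            # network outputs, shape (#patterns, O)
    return np.mean((D_target - Y) ** 2)
```

For the Iris configuration (4-3-3), D = 4·3 + 3·3 + 3 + 3 = 27, matching Table 2; `mse` is the fitness each particle carries into the MCPSO updates.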
In SPSO, the population size is set to 80 and the initial individuals are the same as those used in MCPSO. For fair comparison, the other parameters wmax, wmin, c1, and c2 are the same as those defined for MCPSO. In GA, the population size is set to 80, with crossover probability Pc = 0.4 and mutation probability Pm = 0.1. In BP, the learning rate η and the momentum α are set to 0.3 and 0.9, respectively. In each experiment, each data set was randomly divided into two parts: 2/3 as the training set and 1/3 as the test set. All results reported below are averages computed over 10 runs.

Table 3. Performance comparisons using different training algorithms

Data set      Algorithm   Train Correct   Test Correct   MSEt     MSEg
Iris          BP          0.9550          0.9224         0.0516   0.0591
              GA          0.9332          0.9132         0.0138   0.0186
              PSO         0.9712          0.9628         0.0103   0.0274
              MCPSO       0.9921          0.9724         0.0115   0.0161
New-thyroid   BP          0.9148          0.7214         0.1823   0.1954
              GA          0.9135          0.8356         0.0144   0.0169
              PSO         0.9642          0.9421         0.0098   0.0252
              MCPSO       0.9828          0.9645         0.0085   0.0188
Glass         BP          0.6896          0.6142         0.1585   0.1185
              GA          0.7356          0.6241         0.1526   0.1492
              PSO         0.8086          0.6857         0.1248   0.1425
              MCPSO       0.8548          0.7024         0.0598   0.0786
Table 3 shows the experimental results of the four algorithms averaged over 10 runs, where Train Correct and Test Correct are the average correct rates of classification and generalization on the training and test sets, respectively, and MSEt and MSEg are the mean square errors averaged over 10 runs on the training and test sets, respectively. Note that the BP-based MFNN is evolved for 3000 generations in each of the 10 runs. On the smallest network (the Iris problem), the training performance of MCPSO is far better than that of BP and GA but slightly worse than that of SPSO. However, among the four methods it achieves the highest classification accuracy on the test set, which suggests that the results found by MCPSO are more stable than those of the other methods. On the New-thyroid and Glass problems, SPSO slightly outperformed GA and BP; however, the average results generated by MCPSO are significantly superior to those generated by any other approach. The results obtained clearly establish the competitiveness of MCPSO with classical algorithms such as BP, GA, and SPSO. We may conclude that MCPSO can compete both with other evolutionary approaches and with more classical techniques, at least on some data sets, in terms not only of accuracy but also of robustness of the results.
5 Conclusions
This paper has proposed a new optimization method, MCPSO, for neural network training. MCPSO operates in a master-slave mode. The evolution of the slave groups tends to amplify the diversity of individuals in the population and consequently to generate more promising particles for the master group. The master group updates its particle states based on both its own experience and that of the slave groups. This new method is less susceptible to premature convergence and less likely to be stuck in local optima. The feasibility of MCPSO was demonstrated on three benchmark classification problems, in comparison with other algorithms.
Acknowledgements This work is supported by the National Natural Science Foundation of China (Grant No. 70431003) and the National Basic Research Program of China (Grant No. 2002CB312200). The first author would like to thank Prof. Q.H Wu of Liverpool University, UK, for many valuable comments.
References
1. Nekovei, R., Sun, Y.: Back-propagation Network and its Configuration for Blood Vessel Detection in Angiograms. IEEE Trans. Neural Networks 6(1) (1995) 64-72
2. Billings, S.A., Zheng, G.L.: Radial Basis Function Network Configuration Using Genetic Algorithms. Neural Networks 8(6) (1995) 877-890
3. Tandon, V., El-Mounayri, H., Kishawy, H.: NC End Milling Optimization Using Evolutionary Computation. Int. J. Mach. Tools Manuf. 42(5) (2002) 595-605
4. Angeline, P.J.: Evolutionary Optimization versus Particle Swarm Optimization: Philosophy and Performance Difference. In: Waagen, D., Porto, V.W., Saravanan, N., Eiben, A.E. (eds.): Evolutionary Programming VII. Lecture Notes in Computer Science, Vol. 1447. Springer-Verlag, Berlin Heidelberg New York (1998) 745-754
5. Kennedy, J., Eberhart, R.C., Shi, Y.: Swarm Intelligence. Morgan Kaufmann Publishers, San Francisco, CA (2001)
6. Clerc, M., Kennedy, J.: The Particle Swarm: Explosion, Stability, and Convergence in a Multidimensional Complex Space. IEEE Trans. Evol. Comput. 6(1) (2002) 58-73
7. Niu, B., Zhu, Y.L., He, X.X.: Construction of Fuzzy Models for Dynamic Systems Using Multi-population Cooperative Particle Swarm Optimizer. In: Wang, L.P., Jin, Y.C. (eds.): Fuzzy Systems and Knowledge Discovery. Lecture Notes in Computer Science, Vol. 3613. Springer-Verlag, Berlin Heidelberg New York (2005) 987-1000
8. Niu, B., Zhu, Y.L., He, X.X.: Multi-population Cooperative Particle Swarm Optimization. In: Capcarrere, M., Freitas, A.A., Bentley, P.J., Johnson, C.G., Timmis, J. (eds.): Advances in Artificial Life. Lecture Notes in Computer Science, Vol. 3630. Springer-Verlag, Berlin Heidelberg New York (2005) 874-883
9. Genkai-Kato, M., Yamamura, N.: Evolution of Mutualistic Symbiosis without Vertical Transmission. Comp. Biochem. Physiol. 123(3) (1999) 269-278
Training RBF Neural Network with Hybrid Particle Swarm Optimization Haichang Gao1, Boqin Feng1, Yun Hou1, and Li Zhu2 1
School of Electronics and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China
[email protected] 2 School of Software Engineering, Xi’an Jiaotong University, Xi’an 710049, China
Abstract. Particle swarm optimization (PSO) has been used to train neural networks. But the particles can collapse so quickly that the swarm exhibits a potentially dangerous stagnation characteristic, which can make it impossible to arrive at the global optimum. In this paper, a hybrid PSO with simulated annealing and chaos search techniques (HPSO) is adopted to solve this problem. The HPSO is proposed to train radial basis function (RBF) neural networks. Experimental results on benchmark function optimization and data set classification problems (Iris, Glass, Wine, and New-thyroid) demonstrate the effectiveness and efficiency of the proposed algorithm.
1 Introduction
Particle swarm optimization (PSO) is an evolutionary computation technique introduced by Kennedy [1], inspired by the social behavior of birds. Similar to the genetic algorithm (GA), PSO is a population-based optimization tool [2]. But unlike GA, PSO has no evolution operators such as crossover and mutation, and compared with GA it has some attractive advantages. It has memory, so knowledge of good solutions is retained by all particles, and it features constructive cooperation between particles: particles in the swarm share information among themselves. PSO has been successfully applied in many areas: function optimization, artificial neural network training, fuzzy system control, and others [3]. But the particles can collapse so quickly that the swarm exhibits a potentially dangerous stagnation characteristic, which can make it impossible to arrive at the global optimum. In this paper, a hybrid PSO with simulated annealing and chaos search techniques (HPSO) is adopted to solve this problem.
2 RBF Neural Network Radial basis function (RBF) networks were introduced into the neural network by Broomhead [4]. Due to the better approximation capabilities, simpler network structures and faster learning algorithms, RBF networks have been widely used in many fields. A RBF neural network has a three-layer architecture with on feedback [5]. The input layer which consists of a set of source nodes connects the network to the J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 577 – 583, 2006. © Springer-Verlag Berlin Heidelberg 2006
environment. The hidden layer consists of H hidden neurons (radial basis units) with radial activation functions; the Gaussian function is often selected as the activation function. The output of the i-th hidden neuron, $z_i$, is a radial basis function that defines a spherical receptive field given by the following equation:

$$z_i = \Phi(\|x - c_i\|) = \exp\big(-\|x - c_i\|^2 / (2\sigma_i^2)\big), \quad \forall i, \qquad (1)$$

where $c_i$ and $\sigma_i$ are the center and the width of the i-th hidden unit, respectively. Each neuron in the hidden layer has a substantial finite spherical activation region, determined by the Euclidean distance between the input vector x and the center $c_i$, normalized with respect to the scaling factor $\sigma_i$. The output layer, a set of summation units, supplies the response of the network.
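The forward pass of such a network, Gaussian hidden units per Eq. (1) followed by the summation layer, can be sketched as follows. The linear output weights and the function name are our own illustrative assumptions:

```python
import numpy as np

def rbf_forward(x, centers, widths, weights):
    """Single-output Gaussian RBF network response.

    x       : (D,)   input vector
    centers : (H, D) radial-basis unit centers c_i
    widths  : (H,)   unit widths sigma_i
    weights : (H,)   output-layer summation weights (assumed linear output)
    """
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared Euclidean distances ||x - c_i||^2
    z = np.exp(-d2 / (2.0 * widths ** 2))     # hidden activations per Eq. (1)
    return z @ weights                        # summation-unit output
```

An input lying exactly on a center activates that unit at its maximum value of 1, and the activation decays spherically with distance at a rate set by sigma_i.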
3 Hybrid PSO Training of the RBF Neural Network PSO has a strong ability to find promising solutions quickly, but it suffers from premature convergence to local optima. SA has a strong ability to find the global optimum and can escape local optima. Chaotic motion can traverse every state in a region without repetition, according to its own dynamics. Combining PSO with chaos search and SA, so that each technique compensates for the weaknesses of the others, the hybrid PSO strategy (HPSO) is proposed. 3.1 Particle Swarm Optimization
The basic PSO model consists of a swarm of particles moving in an n-dimensional search space. Each particle has a position represented by a position vector X and a velocity represented by a velocity vector V. The particles move while trying to find the solution of the problem being solved: they find the global best solution by simply adjusting the trajectory of each individual towards its own best location and towards the best particle of the swarm at each time step. The position and velocity of the i-th particle in the n-dimensional search space can be represented as X_i = [x_i1, x_i2, ..., x_in] and V_i = [v_i1, v_i2, ..., v_in], respectively. Each particle keeps its own best position P_id, corresponding to the best objective value it has obtained so far; the global best particle is denoted by P_gd, the best position found by any particle so far. At each iteration step, the velocity is updated and the particle is moved to a new position. The update from the previous velocity to the new velocity is calculated as follows:
V_id' = ω·V_id + G1·rand()·(P_id − X_id) + G2·rand()·(P_gd − X_id) . (2)
where G1 and G2 are constants called acceleration coefficients, ω is called the inertia factor, and rand() is a random number uniformly distributed in the range [0,1]. The new position is determined by the sum of the previous position and the new velocity, and can be calculated according to the following equation:
Training RBF Neural Network with Hybrid Particle Swarm Optimization
579
X_id' = X_id + V_id . (3)
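Equations (2) and (3) together define one swarm update step. The following vectorized sketch shows it; the default parameter values are illustrative, not the paper's settings:

```python
import numpy as np

def pso_step(X, V, P, g, w=0.6, G1=2.0, G2=2.0, rng=None):
    """One PSO iteration following equations (2) and (3).

    X, V : (n_particles, n) positions and velocities
    P    : (n_particles, n) personal best positions P_id
    g    : (n,) global best position P_gd
    """
    if rng is None:
        rng = np.random.default_rng(0)
    r1 = rng.random(X.shape)          # rand() drawn independently per dimension
    r2 = rng.random(X.shape)
    V_new = w * V + G1 * r1 * (P - X) + G2 * r2 * (g - X)   # equation (2)
    X_new = X + V_new                                       # equation (3)
    return X_new, V_new
```

Note that if every particle already sits at its personal and global best, both attraction terms vanish and the swarm only coasts on its inertia.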
Due to its simple concept, easy implementation and quick convergence, PSO has attracted much attention and found wide application. But the performance of simple PSO depends strongly on its parameters, and it often suffers from being trapped in local optima. Researchers have analyzed it empirically [6] and theoretically [7] and shown that the particles oscillate in different sinusoidal waves and converge quickly, sometimes prematurely, especially for PSO with a small inertia factor ω or small acceleration coefficients G1 and G2. 3.2 Chaos Optimization
Chaotic motion can traverse every state in a region without repetition, according to its own dynamics. Chaos has three important dynamic properties: sensitive dependence on initial conditions, intrinsic stochasticity and ergodicity. Chaos is in essence deeply related to evolution: in chaos theory, biological evolution is regarded as feedback randomness, where this randomness is caused not by outside disturbance but by intrinsic elements [8]. The logistic equation [9], introduced to describe the evolution of biological populations, is the most common and simple chaotic function:
x_{n+1} = L·x_n (1 − x_n) . (4)
where L is a control parameter between 0 and 4.0. When L = 4.0 the system is proved to be in a chaotic state: given an arbitrary initial value in (0,1) not equal to 0.25, 0.5 or 0.75, the chaos trajectory will eventually visit any point of (0,1) without repetition. Suppose the target function of the continuous optimization problem is:
f* = f(x_i*) = min f(x_i), x_i ∈ [a_i, b_i], i = 1, 2, ..., n. (5)
Then the chaos optimization strategy proceeds as follows:
Step 1: initialization. Let k = 1, k' = 1, x_i^k = x_i(0), x_i* = x_i(0), f* = f(0), a_i^{k'} = a_i, b_i^{k'} = b_i, where k is the iteration index of the chaos variables, k' is the refinement-search index, x_i* is the best chaos variable found so far, and f* is the current best solution, initialized to a large number.
Step 2: map the chaos variable x_i^k into the range of the optimization variable to obtain mx_i^k:
mx_i^k = a_i^{k'} + x_i^k (b_i^{k'} − a_i^{k'}) . (6)
Step 3: search according to the chaos optimization strategy: if f(mx_i^k) < f*, set f* = f(mx_i^k) and x_i* = x_i^k; otherwise, go on.
Step 4: let k = k + 1, x_i^k = 4·x_i^k (1 − x_i^k); repeat steps 2 and 3 until f* remains unchanged for a certain number of steps.
Step 5: reduce the search range of the chaos variable:
a_i^{k'+1} = mx_i* − C(b_i^{k'} − a_i^{k'}) , b_i^{k'+1} = mx_i* + C(b_i^{k'} − a_i^{k'}) . (7)
where the adjustment coefficient C ∈ (0, 0.5) and mx_i* is the current best solution.
Step 6: restore the optimization variable x_i*:
x_i* = (mx_i* − a_i^{k'+1}) / (b_i^{k'+1} − a_i^{k'+1}) . (8)
Repeat steps 2 to 5 using the new chaos variable y_i^k = (1 − A)·x_i* + A·x_i^k, where A is a small number; let k' = k' + 1, until f* remains unchanged for a certain number of steps.
Step 7: finish the calculation after several repetitions of steps 5 and 6. The final mx_i* is the best optimization variable, and f* is the best solution. 3.3 Simulated Annealing
SA is based on the idea of neighborhood search. Kirkpatrick [10] suggested that a form of SA could be used to solve complex optimization problems. The algorithm works by selecting candidate solutions in the neighborhood of the given candidate solution, and it attempts to avoid entrapment in a local optimum by sometimes accepting a move that deteriorates the value of the objective function. With a suitable distribution scheme, SA provides reasonable control over the initial temperature and the cooling schedule, so that it performs effective exploration and achieves good confidence in the solution quality. For the annealing function, an exponential cooling schedule is used to adjust the temperature, t_{k+1} = μ·t_k, where μ ∈ (0,1) is the decrease rate. This is often considered a good cooling method because it offers a fair compromise between a computationally fast schedule and the ability to reach low-energy states. 3.4 Training Algorithm of HPSO
The HPSO algorithm for training the RBF neural network can be summarized as follows:
Step 1: Initialize the structure, activation function and objective function of HPSO.
Step 2: Initialize the algorithm parameters of HPSO (i.e., initialize positions X_i and velocities V_i randomly; initialize the temperature T_0 and the cooling parameter μ; initialize P_id, and set P_gd to the position of the particle with the best fitness). Set limits on the particles' velocities and positions.
Step 3: Evaluate and store the initial position and fitness of each particle. Evaluate and store the global best position and fitness of the swarm.
Step 4: Update the particles' velocities V_i and positions X_i by equations (2) and (3). Update the individual best position and fitness of each particle.
Step 5: Perform the chaos search for the best particle. Decrease the search space according to equations (7) and (8). Update the global best position and fitness of the swarm.
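A compact sketch of the whole training loop follows. It is a simplification, not the paper's exact algorithm: the chaos search is reduced to a logistic-map perturbation inside a small box around the best solution, the SA part to a Metropolis-style acceptance rule with exponential cooling, and all function names and parameter values are illustrative assumptions:

```python
import numpy as np

def hpso_minimize(f, dim, n_particles=20, iters=50,
                  w=0.6, G1=2.0, G2=2.0, t0=1000.0, mu=0.9, seed=0):
    """Simplified HPSO: PSO update (eqs. 2-3) + logistic-map chaos search
    around the global best + SA-style acceptance with exponential cooling."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-2.0, 2.0, (n_particles, dim))
    V = np.zeros((n_particles, dim))
    P = X.copy()
    pf = np.array([f(p) for p in P])                  # personal best fitness
    g, gf = P[pf.argmin()].copy(), float(pf.min())    # global best
    t = t0
    for _ in range(iters):
        # Step 4: PSO velocity and position update, equations (2)-(3)
        r1, r2 = rng.random(X.shape), rng.random(X.shape)
        V = w * V + G1 * r1 * (P - X) + G2 * r2 * (g - X)
        X = X + V
        fx = np.array([f(p) for p in X])
        better = fx < pf
        P[better], pf[better] = X[better], fx[better]
        if pf.min() < gf:
            g, gf = P[pf.argmin()].copy(), float(pf.min())
        # Step 5: chaos search around the best solution (logistic map)
        s, sf = g.copy(), gf
        c = rng.uniform(0.1, 0.9, dim)
        for _ in range(20):
            c = 4.0 * c * (1.0 - c)                   # equation (4), L = 4
            cand = s + 0.1 * (2.0 * c - 1.0)          # small local perturbation
            cf = float(f(cand))
            # Step 6: SA rule sometimes accepts a deteriorating move
            if cf < sf or rng.random() < np.exp(-(cf - sf) / t):
                s, sf = cand, cf
            if sf < gf:
                g, gf = s.copy(), sf
        t *= mu                                       # t_{k+1} = mu * t_k
    return g, gf
```

The global best is only ever replaced by a strictly better solution, so the returned fitness is monotone non-increasing over the iterations.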
Fig. 1. Average fitness logarithm value curve of function F
Step 6: Perform the annealing operation: decrease the temperature, t_{k+1} = update(t_k), and set k = k + 1. Step 7: If the stopping criterion is not satisfied, go to step 4. Otherwise, output the best solution found so far.
4 Experiments In this section HPSO was applied to benchmark function optimization and to dataset classification problems (Iris, Glass, Wine and New-thyroid). The benchmark function optimization experiment was carried out to demonstrate the effectiveness of the proposed algorithm in detail. 4.1 Benchmark Function Optimization
Three algorithms (HPSO, simple PSO [11] and SA) were compared on the DeJong benchmark function:
F = 100(x_1² − x_2)² + (1 − x_1)² . (9)
where −2.048 ≤ x_i ≤ 2.048 (i = 1, 2). The function is continuous and unimodal, with optimum x* = (1, 1) and f(1,1) = 0. From the average fitness logarithm value curve of function F in Fig. 1, we can see that HPSO outperforms PSO and SA in finding the global optimum, and that it can avoid the problem of local optima effectively.
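Equation (9) is easy to code and check at the optimum:

```python
def dejong_f(x1, x2):
    """DeJong test function of equation (9)."""
    return 100.0 * (x1 ** 2 - x2) ** 2 + (1.0 - x1) ** 2

print(dejong_f(1.0, 1.0))   # 0.0 at the optimum x* = (1, 1)
```

The quadratic penalty 100(x_1² − x_2)² forms a narrow curved valley, which is what makes this benchmark hard for simple local search.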
4.2 Benchmark Datasets Experiment
Four datasets, all classification problems obtained from the UCI machine learning repository [12], were selected for the experiment. The attributes, classes and number of instances of each dataset are given in Table 1. Each method was run 10 times on every dataset, and the average value was taken as the experimental result.
Table 1. Characteristics of the datasets used in the experiment

Dataset       Examples   Input attributes   Output classes
Iris          150        4                  3
Glass         214        9                  7
Wine          178        13                 3
New-thyroid   215        5                  3
Three training algorithms (HPSO, PSO and newrb) were compared [11]. The newrb routine is included in the Matlab neural network toolbox as the standard training algorithm for RBF neural networks. The parameters of PSO and HPSO were set as follows: inertia weight ω decreasing linearly from 0.8 to 0.2, acceleration coefficients G1 = G2 = 2.5, and an initial temperature of 1000 for HPSO. The test results, the accuracy rates of the three methods on the different datasets, are listed in Table 2.
Table 2. Comparative accuracy rates of the three algorithms on different datasets

              HPSO              PSO               newrb
Dataset       Train    Test     Train    Test     Train    Test
Iris          0.9989   0.9875   0.99     0.98     0.9850   0.9560
Glass         0.9124   0.7572   0.8042   0.6620   0.9850   0.6174
Wine          0.9991   0.9668   1        0.9631   0.9375   0.6554
New-thyroid   0.9763   0.9534   0.9650   0.9444   0.9240   0.6204
From Table 2 it can be seen that the accuracy rates of HPSO on both the training and test sets outperform those of simple PSO and newrb. Thus the HPSO algorithm proposed in this paper for the RBF neural network is more effective.
5 Conclusion This paper presents a hybrid PSO with simulated annealing and a chaos search technique to train RBF neural networks. The HPSO algorithm combines the strengths of PSO, SA and chaos search, so that each technique compensates for the weaknesses of the others. Experimental results on benchmark function optimization and on dataset classification problems (Iris, Glass, Wine and New-thyroid) demonstrate the effectiveness and efficiency of the proposed algorithm.
Acknowledgements The authors would like to thank the anonymous reviewers for their careful reading of this paper and for their helpful comments. This work was supported by the National High Technology Development Plan of China (863) under grant no. 2003AA1Z2610.
References 1. Kennedy, J., Eberhart, R.C.: Particle Swarm Optimization. In: Proceedings of IEEE International Conference on Neural Networks, Perth, Australia (1995) 1942–1948 2. Eberhart, R.C., Shi, Y.: Comparison between Genetic Algorithms and Particle Swarm Optimization. In: Proceedings of the 7th Annual Conference on Evolutionary Programming (1998) 611–616 3. Kennedy, J., Eberhart, R.C., Shi, Y.: Swarm Intelligence. Morgan Kaufmann Publishers, San Francisco, CA (2001) 4. Broomhead, D.S., Lowe, D.: Multivariable Functional Interpolation and Adaptive Networks. Complex Systems 2 (1988) 321–355 5. Catelani, M., Fort, A.: Fault Diagnosis of Electronic Analog Circuits Using a Radial Basis Function Network Classifier. Measurement 28 (2000) 147–158 6. Kennedy, J.: Bare Bones Particle Swarms. In: Proceedings of the IEEE Swarm Intelligence Symposium (2003) 80–87 7. Trelea, I.C.: The Particle Swarm Optimization Algorithm: Convergence Analysis and Parameter Selection. Information Processing Letters 85(6) (2003) 317–325 8. Zhang, T., Wang, H.W., Wang, Z.C.: Mutative Scale Chaos Optimization Algorithm and Its Application. Control and Decision 14(3) (1999) 285–288 9. Moon, F.C.: Chaotic and Fractal Dynamics: An Introduction for Applied Scientists and Engineers. John Wiley & Sons, New York (1992) 10. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by Simulated Annealing. Science 220(4598) (1983) 671–680 11. Liu, Y., Qing, Z., Shi, Z.W.: Training Radial Basis Function Networks with Particle Swarms. In: ISNN 2004. Springer-Verlag, Berlin Heidelberg (2004) 317–322 12. Blake, C., Keogh, E., Merz, C.J.: UCI Repository of Machine Learning Databases, www.ics.uci.edu/~mlearn/MLRepository.htm (2003)
Robust Learning by Self-organization of Nonlinear Lines of Attractions Ming-Jung Seow and Vijayan K. Asari Department of Electrical and Computer Engineering, Old Dominion University, Norfolk, VA 23529, USA {mseow, vasari}@odu.edu
Abstract. A mathematical model for learning a nonlinear line of attraction is presented in this paper. This model encapsulates attractive fixed points scattered in the state space, representing patterns with similar characteristics, as an attractive line. The dynamics of this nonlinear line attractor network is designed to operate between stable and unstable states. These criteria can be used to circumvent the plasticity-stability dilemma by using the unstable state as an indicator to create a new line for an unfamiliar pattern. This novel learning strategy utilizes the stability (convergence) and instability (divergence) criteria of the designed dynamics to induce self-organizing behavior, which helps to create complex dynamics in an unsupervised manner. Experiments performed on the CMU face expression database show that the proposed model can perform pattern association and pattern classification tasks in a few iterations and with great accuracy.
1 Introduction
The human brain memorizes information using a dynamical system made of interconnected neurons. Retrieval of information is accomplished in an associative sense: starting from an arbitrary state that might be an encoded representation of a visual image, the brain activity converges to another state that is stable, which is what the brain remembers. Associative memory can be modeled using a recurrent network, in which the stored memories are represented by the dynamics of the network's convergence. In most models of associative memory, memories are stored as attracting fixed points at discrete locations in the state space. Fixed-point attractors may not be suitable for patterns which exhibit similar characteristics. To precisely characterize the similarity of images and other perceptual stimuli, it would be more appropriate to represent the pattern association using a nonlinear line attractor network that encapsulates the attractive fixed points scattered in the state space with an attractive nonlinear line, where each fixed point corresponds to similar patterns [1][2]. On the basis of studies of the olfactory bulb of an animal, Freeman suggested that in the rest state, the dynamics of the neural cluster in an animal is chaotic [3]. Conversely, when a familiar scent is presented to the animal, the neural system rapidly simplifies its behavior and the dynamics becomes more orderly. He found that J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 584 – 589, 2006. © Springer-Verlag Berlin Heidelberg 2006
when a rabbit's olfactory cells are inactive, the electrical activity follows a loose chaotic pattern; but when they are active, this pattern explodes into a far more definite pattern of activity. According to other researchers [4], the background chaotic activity of the brain enables the system to jump rapidly from one attractor to another when presented with the appropriate input. That is, if the input does not send the system into one of the attractors, it is considered a novel input [5]. Some researchers have long speculated that chaotic processes play fundamental roles in mental processes [3]. In designing a recurrent neural network, it is usually of prime importance to guarantee the convergence of the dynamics of the network. We propose to modify this picture: if the brain remembers by converging to the state representing familiar patterns, it should also diverge from such states when presented with an unknown encoded representation of a visual image. That is, the identification of an instability mode can be an indication that a presented pattern is far away from any stored pattern and therefore cannot be associated with current memories. These properties can be used to circumvent the plasticity-stability dilemma by using the fluctuating mode as an indicator to create new states. We propose to capture this behavior using a novel neural architecture and learning algorithm, in which the dynamical system performs self-organization utilizing a stability mode and an instability mode. This self-organizing behavior of the nonlinear line attractor model can help to create complex dynamics in an unsupervised manner.
2 Nonlinear Line Attractor Network Model
Information related to similar data resides in a pattern manifold, which can be visualized as a curved line in the state space. This nonlinear line encapsulates attractive fixed points representing patterns with similar characteristics. A simple model for learning a pattern manifold based on these observations was presented and demonstrated in our previous work [2][6]. In asymmetric networks, there is no known Lyapunov function guaranteeing convergence to an attractor; the trajectories in the state space of asymmetric networks can include chaotic and limit-cycle behaviors. It is easily seen that the dynamics of the nonlinear line attractor network presented in [6] does not guarantee global stability: it can only guarantee convergence and stability if the input pattern is sufficiently close to one of the trained patterns in its region of convergence. The symmetry of the synaptic connection matrix has been a constraint from the biological standpoint: symmetry has been essential for the existence of a landscape picture for the dynamics of the network, and asymmetry excluded such a landscape. Parisi pointed out in [7] that an attempt to implement a process of learning in symmetric artificial neural networks would encounter difficulties, because every stimulus would quickly run into an attractor of the spin-glass phase which always accompanies the retrieval states; consequently, every stimulus would be perceived as a familiar pattern. In the next section, we show that the nonlinear line attractor network presented in [6] is specifically designed to operate between stable and unstable states. That is, when the network is able to reach equilibrium (stable), the input is considered
as one of the stored patterns. Conversely, if the network is unable to reach equilibrium (unstable), the input is considered to be dissimilar to the stored patterns and therefore a pattern of another class.
3 Self-organizing Algorithm
The proposed self-organizing nonlinear line attractor network is designed to solve the stability-plasticity problem [8] in a dynamical environment; that is, it is designed to operate between stability and instability. It uses the instability mode (divergence) to self-organize in real time and produces stable associations while learning input patterns beyond those originally stored. Fig. 1 shows the basic concept of the self-organizing line attractor network. The system creates the first module F1 with the set of training data. A new module F2 is created using the data rejected (unlearned) by module F1. This process continues, successively creating new modules until all the data are stored (learned).

Fig. 1. Illustration of the self-organizing algorithm: modules F1–F5 are circular regions represented by the nonlinear line attractor network; solid lines represent the convergence of patterns and dashed lines the divergence of patterns
The self-organizing nonlinear line attractor network operates by testing the stability of the nonlinear line attractor networks. If a subset of the training input vectors is not stable, a new line attractor is created using the data that were unstable in the previous nonlinear line attractors, leaving the stable data in the previous line attractors undisturbed.
Self-organization algorithm. The various operating stages of the self-organizing nonlinear attractor can be summarized as:
1. Initialize r = 1.
2. Create a nonlinear line attractor network Fr according to equations (1) to (5) presented in [2].
3. Calculate the threshold function by the following steps:
   a. Calculate the difference between the predicted output and the actual output for the synaptic weights between the ith node and the jth node for all P patterns.
   b. Calculate the error histogram as in Fig. 2.
   c. Find the minimum and maximum error starting from 0 in a continuous region.
   d. If a minimum and maximum exist, as shown in Fig. 2a, choose those values as threshold values; otherwise, if there is no minimum or maximum, as in Fig. 2b, set the threshold values to zero.
0
max
Frequency
Frequency
0
min
max
Error
Error
(a) Histogram with min-max (b) Histogram without min-max
Fig. 2. Error histogram to find the threshold function
4. Perform a test with all the trained patterns on the Fr network structure.
5. Form a new input set from the unstable inputs obtained in step 4.
6. Prepare to create a new network by modifying r: r = r + 1.
7. Repeat steps 2 through 6 until there are no more unstable inputs in step 5.
Fig. 3 illustrates the formation of lines of the response of the ith neuron due to the excitations of the jth neuron by the self-organization algorithm for k = 1. When the inputs shown in Fig. 3a are applied to the self-organizing algorithm, F1 is created. The line created (represented by the green line) has the decision boundary (represented by the red lines) shown in Fig. 3b. The diverged patterns become a new set of inputs to the self-organizing algorithm, creating a new network F2; this second set of inputs creates the decision boundary shown in Fig. 3c. As a consequence, the final result contains two lines whose decision boundary is shown in Fig. 3d.

Fig. 3. Self-organization of the nonlinear line attractor network: (a) five-blocks training data; (b) creating F1; (c) unstable patterns forming a new line, creating F2; (d) final boundary
4 Dynamics of the Network
The dynamics of the system based on the self-organizing algorithm and the nonlinear line attractor network is presented in this section. The system can be used in two modes of operation, namely associative memory and pattern classifier.
4.1 Pattern Classifier
The system operates as a pattern classifier, evolving in iteration t according to equations (6) to (7) in [6]. The network is designed to operate between stable and unstable states. That is, when the network is able to reach equilibrium (stable), the input is considered as one of the stored patterns. Conversely, if the network is unable to reach equilibrium (unstable), the input is considered to be dissimilar to the stored patterns and therefore of a different class.
4.2 Associative Memory
As an associative memory, the network evolves in iteration t according to equations (6) to (8) in [2]. The stability and associability of the nonlinear line attractor are also examined in [2].
5 Applications
Experiments were conducted with images of faces from the CMU face expression database described in [9], a database of 975 face images of 13 people captured with different expressions. The example images were of size 64 × 64, with gray scale ranging from 0 to 255. The self-organizing nonlinear line attractor network was trained using k = 1 on a specific person class, with the goal of learning complex pattern manifolds of expressions. In the pattern association task, example face images were corrupted by zeroing the pixels inside a 25 × 25 patch chosen at a random location. Fig. 4 shows a few examples of this experiment: the original image is corrupted by removing part of the image, and after applying it to the network, the missing part of the face is filled in within three iterations. We compared all the reconstructed facial expressions with the original facial expressions; there are no significant differences between the original and the reconstructed versions. Additional experiments were also performed with the original images, and the network is able to retain these trained images; that is, the network converges in one iteration if the original trained images are not modified.
Fig. 4. Pattern Reconstruction
In the pattern classification task, example images drawn from another person class were used. Fig. 5 shows an example of the divergence dynamics of a pattern. The dynamics of the network can be interpreted as follows: during the first six iterations, the network tries to converge to the closest learned pattern in its memory; since the presented face image is drawn from another person class, the image finally diverges. We trained 13 networks on 13 people, each with 20 continuous expressions, where each network stores one person. We obtained 100% accuracy when testing the full database; that is, the network converges for familiar patterns and diverges for dissimilar patterns.
Fig. 5. Divergence of a pattern
6 Conclusion
A novel learning strategy based on the stability (convergence) and instability (divergence) criteria of a nonlinear line attractor network is presented in this paper. These criteria can be used to circumvent the plasticity-stability dilemma by using the instability mode as an indicator to create a new line for an unfamiliar pattern. This self-organizing behavior of the nonlinear line attractor model helps to create complex dynamics in an unsupervised manner. The dynamics of the system can be used in two modes of operation, namely associative memory and pattern classifier. In the associative memory mode, the network is used for retrieving information; in the classification mode, it is used to discriminate information. Training of the nonlinear line attractor network is very fast, since the only main computation is a least-squares fit of linear models. The network has very fast convergence dynamics due to the convergence to a line of attraction rather than to several fixed points.
References 1. Seung, H.S.: Learning Continuous Attractors in Recurrent Networks. Advances in Neural Information Processing Systems (1998) 654–660 2. Seow, M.J., Asari, K.V.: Associative Memory Using Nonlinear Line Attractor Network for Multi-Valued Pattern Association. In: Wang, J., Liao, X., Yi, Z. (eds.): ISNN 2005. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag (2005) 485–490 3. Freeman, W.J., Barrie, J.M.: Analysis of Spatial Patterns of Phase in Neocortical Gamma EEGs in Rabbit. Journal of Neurophysiology 84 (2000) 1266–1278 4. Harter, D., Kozma, R.: Nonconvergent Dynamics and Cognitive Systems. Cognitive Science (2003) 5. Yao, Y., Freeman, W.J.: Model of Biological Pattern Recognition with Spatially Chaotic Dynamics. Neural Networks 3 (1990) 153–170 6. Seow, M.J., Asari, K.V.: Recurrent Network as a Nonlinear Line Attractor for Skin Color Association. In: Yin, F., Wang, J., Guo, C. (eds.): ISNN 2004. Lecture Notes in Computer Science, Vol. 3173. Springer-Verlag (2004) 870–875 7. Parisi, G.: Asymmetric Neural Networks and the Process of Learning. J. Phys. A 19 (1986) L675 8. Carpenter, G.A., Grossberg, S.: ART 2: Stable Self-Organization of Pattern Recognition Codes for Analog Input Patterns. Applied Optics 26 (1987) 4919–4930 9. Liu, X., Chen, T., Vijaya Kumar, B.V.K.: Face Authentication for Multiple Subjects Using Eigenflow. Pattern Recognition: Special Issue on Biometrics 36 (2003) 313–328
Improved Learning Algorithm Based on Generalized SOM for Dynamic Non-linear System Kai Zhang1, Gen-Zhi Guan1, Fang-Fang Chen2, Lin Zhang1, and Zhi-Ye Du1 1 School of Electrical Engineering, Wuhan University, Wuhan, Hubei, China
[email protected] 2 College of Finance, Hunan University, Shijiachong 6#, Changsha, Hunan, China
Abstract. This paper proposes an improved learning algorithm based on a generalized SOM for dynamic non-linear system identification. To improve the convergence speed and accuracy of the SOM algorithm, the improved self-organizing algorithm first applies multiple local models instead of a global model, and secondly adjusts the weights of the computing output layers along with the weights of the competing neuron layer during the training process. We prove that the improved algorithm is convergent if the network has suitable initial weights and small positive real parameters. Simulation results using our improved generalized SOM show an improvement for non-linear systems compared to traditional neural network control systems.
1 Introduction The Self-Organizing Map introduced by Kohonen [1] is a self-organizing network based on competitive learning and has been used in many engineering applications [2-5]. Artificial neural networks have been applied widely to non-linear system modeling in recent years, with feed-forward neural networks used especially often [4]. Because all the weights of the network have to be updated during training, feed-forward neural networks have slow convergence and poor memory. To improve the convergence speed and the accuracy of the model, multiple local models are used to replace the global model in references [2,3], and SOM is applied to build the multiple local models, but the estimates of the model parameters are not accurate enough to obtain a satisfying performance index. A new learning algorithm based on a generalized self-organizing map is also introduced in reference [3], but the model still converges slowly. In this paper, we propose a new learning algorithm based on a generalized SOM. We suppose that there are two outer computing output layers corresponding to the output layer. In the algorithm, the learning of multiple local models instead of a global model is considered, and the weights of the computing output layers are updated during training along with the weights of the neurons within a neighboring region, in a similar way. The convergence of the improved algorithm based on the generalized SOM for dynamic non-linear system identification is then proved. We also analyze the relation between the convergence of the network and its initial values. J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 590 – 598, 2006. © Springer-Verlag Berlin Heidelberg 2006
This paper is organized as follows: Section 2 introduces the non-linear system model. Section 3 presents the improved SOM learning algorithm. Section 4 proves the convergence of the improved algorithm. Section 5 presents simulations. Section 6 draws conclusions.
2 Non-linear System Model Suppose that a discrete non-linear system is time-invariant and finite-dimensional:

y(t) = f(x(t)) + v(t) . (1)

x(t) = [y_1(t−1), ..., y_1(t−n_y1), ..., y_r(t−1), ..., y_r(t−n_yr), u_1(t−1), ..., u_1(t−n_u1), ..., u_s(t−1), ..., u_s(t−n_us)] . (2)

where the output y(t) ∈ R^r; the input u(t) ∈ R^s; x(t) ∈ R^m; f: R^m → R^r; r, m, s are positive integers; m = n_y1 + ... + n_yr + n_u1 + ... + n_us; and v(t) denotes independent noise.
Suppose that f is continuously differentiable and the order of the system is known. Expanding f(x(t)) at a point wc(t) close enough to x(t) using a Taylor series and ignoring the higher-order terms gives a local linear model. If y(t) is expanded at only one point of the whole input space, the result is not accurate enough; if the whole input space is divided into many small regions and y(t) is expanded in each of them, the accuracy improves greatly. Replacing f(wc(t)) by Ac(t) and f'(wc(t)) by Bc(t) gives:

ŷ(t) = Ac(t) + Bc^T(t)(x(t) − wc(t)) + v(t) . (3)

where Ac(t) is an r-dimensional vector and Bc(t) is an m × r matrix. The following algorithm, based on the improved self-organizing neural network, realizes the optimization of the parameters Ac(t), Bc(t) and wc(t).
3 Improved SOM Learning Algorithm

Suppose that there are two external output layers (A_c and B_c) corresponding to the output layer; Fig. 1 shows the network structure. Neurons in the competition layer correspond one-to-one to neurons in the computing layers, and the weight w_c of the activated neuron in the competition layer corresponds to A_c and B_c, so the output of the computing layers can match the local model of the non-linear object, as in function (3). In reference [3], A_c and B_c are learned by a gradient method that differs from the learning method of w_c, which leads to a slower learning speed. In this paper, A_i and B_i follow the distribution of the network topology: A_i corresponds to an r-dimensional output network, B_i corresponds to an r × m-dimensional output network, and w_c, which identifies the activated neuron in the competition layer, corresponds to A_c and B_c. In the training process, we modify not only the weights of the neuron activated in the competition layer but also w_i, A_i and B_i (i ∈ N_c) in its neighboring region. In this way the network forms two projective topologies that develop at the same time and gradually match each other, approaching the anticipated input-output relation. The update formulas are as follows:

w_i(t+1) = w_i(t) + α h(c, i)(x(t) − w_i(t))|e(t)|,  i ∈ N_c(t);  w_i(t+1) = w_i(t),  i ∉ N_c(t)   (4)

A_i(t+1) = A_i(t) + 2η h(c, i) e(t),  i ∈ N_c(t);  A_i(t+1) = A_i(t),  i ∉ N_c(t)   (5)

B_i(t+1) = B_i(t) + 2η h(c, i)(x(t) − w_i(t)) e(t),  i ∈ N_c(t);  B_i(t+1) = B_i(t),  i ∉ N_c(t)   (6)

Fig. 1. The structure of the network: an input layer (x_1, …, x_n), an inner competing output layer with weights w_i, and two outer computing output layers A_i and B_i; N_c denotes the neighboring region of the winning neuron c

where e(t) denotes the error, e(t) = y(t) − ŷ(t); η and α are learning rates; and h(c, i) is the neighboring-region function, which determines the size of the modified neighboring region N_c. N_c is large in the beginning, with c identifying its center point; N_c diminishes gradually as the number of training iterations increases and finally contains a single neuron. Suppose that h(c, i) is a bell-shaped curve whose initial magnitude depends on the parameter β:

h(c, i) = exp(−β ‖c − i‖² t)   (7)
4 Convergence of the Improved SOM Algorithm

4.1 Proof of Convergence

Suppose that the local linear model of the non-linear system within a small region is defined by the function y(t+1) = f(x(t)). Ignoring the noise, the model output is:

ŷ(t+1) = A_c + B_c^T (x(t) − w_c)   (8)

In functions (4)–(6), if i = c then h(c, i) = 1, and the equations of the network trained to the k-th step can be written as:

w_c(k+1) = w_c(k) + α (x(t) − w_c(k)) |e(k)|
A_c(k+1) = A_c(k) + 2η e(k)   (9)
B_c(k+1) = B_c(k) + 2η (x(t) − w_c(k)) e(k)

Definition. Suppose that an algorithm based on the generalized SOM for dynamic non-linear system identification satisfies function (9). If the input {u(t)} and output {y(t)} are bounded sequences, the parameters α and η are suitably small positive real numbers, and A_i, B_i and w_i (i = 1, 2, …, N, where N is the number of neurons) have suitable initial values, then the network is convergent.
For simplicity we assume that the system is SISO; the argument extends to the general case. Function (8) with the true values of the parameters is rewritten as:

y(t+1) = φ^T θ_0   (10)

The output and error of the model trained to the k-th step are defined as:

ŷ(t+1) = φ^T θ̂(k)   (11)

e(k) = y(t+1) − ŷ(t+1) = φ^T θ_0 − φ^T θ̂(k) = −φ^T θ̃(k)   (12)

where (semicolons separate stacked blocks of a column vector)

θ_0 = [A_c − B_c^T w_c ; B_c],  θ̂(k) = [Â_c(k) − B̂_c^T(k) ŵ_c(k) ; B̂_c(k)],  φ^T = [1  x^T(t)],

θ̃(k) = θ̂(k) − θ_0 .   (13)

Hence,

θ̂(k+1) = [Â_c(k+1) − B̂_c^T(k+1) ŵ_c(k+1) ; B̂_c(k+1)].

Let r(k) = x(t) − ŵ_c(k). Using function (9) and defining

K(k) = 2η [1 − r^T(k) ŵ_c(k) − α|e(k)| r^T(k) r(k) − B̂_c^T(k) r(k) sgn(e(k)) ; r(k)],

where sgn(e(k)) = 1 for e(k) > 0 and −1 for e(k) < 0, we obtain:

θ̂(k+1) = θ̂(k) + e(k) K(k)   (14)

θ̃(k+1) = θ̃(k) − K(k) φ^T θ̃(k)   (15)

K(k) and φ are m-dimensional (m = n_a + n_b + 1) vectors.

Assumption 1. K(k) and φ are nonzero vectors.

If K(k) and φ satisfy Assumption 1, it holds that K(k) = p_2(k) φ with p_2(k) ∈ R^{m×m}. With a matrix p_1 ∈ R^{m×m} we can then write:

φ = p_1 [1 0 ⋯ 0]^T,  K(k) = p_2(k) φ = p_2(k) p_1 [1 0 ⋯ 0]^T   (16)
Using function (16), function (15) is rewritten as:

θ̃(k+1) = θ̃(k) − p_2(k) p_1 [1 0 ⋯ 0]^T [1 0 ⋯ 0] p_1^T θ̃(k)

Let θ̄(k) = p_1^T θ̃(k); thus,

θ̄(k+1) = p_1^T θ̃(k+1),   (17)

where p_1 = [g_1, g_2, …, g_n].

Assumption 2. g_1 ∈ R^m is a nonzero and bounded column vector.

Applying (17) to function (15), the first component satisfies

θ̄_1(k+1) = g_1^T θ̃(k+1),  θ̄_1(k) = g_1^T θ̃(k),   (18)

where θ̄_1(k+1) is the first row of θ̄(k+1). According to (16), function (17) is written as:

θ̄(k+1) = θ̄(k) − p_1^T p_2(k) p_1 diag(1, 0, …, 0) θ̄(k),

where p_1^T p_2(k) p_1 = [q_1(k), q_2(k), …, q_n(k)]. Defining q_1(k) = [l_1(k), l_2(k), …, l_n(k)]^T, we thus have

θ̄(k+1) = θ̄(k) − [l_1(k) 0 ⋯ 0; l_2(k) 0 ⋯ 0; ⋮; l_n(k) 0 ⋯ 0] θ̄(k)   (19)

and in particular θ̄_1(k+1) = (1 − l_1(k)) θ̄_1(k).

Assumption 3. |1 − l_1(k)| < 1.

If |1 − l_1(k)| < 1 for all k > 0, the system is stable according to equation (19), that is, lim_{k→∞} θ̄_1(k) = 0. According to function (18), because g_1 ∈ R^m is a nonzero and bounded column vector, we can get lim_{k→∞} θ̃(k) = 0. Therefore θ̂(k) converges to θ_0.
According to functions (13) and (14),

θ̃(k+1) = θ̃(k) + p_2(k) p_1 [1 0 ⋯ 0]^T e(k);

using function (17), we thus have

θ̄(k+1) = θ̄(k) + q_1(k) e(k),  θ̄_1(k+1) = θ̄_1(k) + l_1(k) e(k),

where θ̄(k) = p_1^T θ̃(k). According to equation (19), θ̄_1(k+1) = θ̄_1(k) − l_1(k) θ̄_1(k); hence e²(k) = θ̄_1²(k). Because lim_{k→∞} θ̄_1(k) = 0, it follows that lim_{k→∞} e(k) = 0. The algorithm is therefore proved to be convergent.

4.2 Discussion of the Convergence Assumptions
There are three assumptions in the foregoing proof: (1) K(k) and φ are nonzero vectors; (2) g_1 ∈ R^m is a nonzero and bounded column vector; (3) |1 − l_1(k)| < 1. In order to ensure stability, we suppose 0 < l_1(k) < 1. Based on linear transformation theory, we choose P_1 and P_2(k) in function (16) as follows:

φ = [φ 0]_{m×m} [1 0 ⋯ 0]^T,  K(k) = [K(k) 0]_{m×m} [1 0 ⋯ 0]^T,

so that P_1 = [φ 0]_{m×m} and P_2(k) P_1 = [K(k) 0]_{m×m}. Then

P_3(k) = P_1^T P_2(k) P_1 = [φ^T K(k) 0 ⋯ 0; 0 0 ⋯ 0; ⋮; 0 0 ⋯ 0]_{m×m},

q_1(k) = [φ^T K(k) 0 ⋯ 0]^T,  l_1(k) = φ^T K(k),  and

g_1^T = φ^T = [1, y(t), …, y(t − n_a + 1), u(t), …, u(t − n_b + 1)].

If the system is bounded-input and bounded-output, it satisfies Assumption 2; and if the system satisfies Assumption 3, it also satisfies Assumption 1. We now discuss the conditions under which Assumption 3 holds.

First, consider the initial values. Because the input and output {y(t), u(t)} are bounded sequences, let y_max be the maximum of y(t) and u_max the maximum of u(t), and define M = max(y_max, u_max). We can choose the initial values of ŵ_c as follows: the first n_a elements of ŵ_c, which correspond to the output, are 0.5 y_max, and the n_b elements which correspond to the input are 0.5 u_max. The initial values of Â_c and B̂_c are small positive real numbers. Define α = 2η; then r(k) = x(t) − ŵ_c(k) and e(k) = y(t+1) − (Â_c(k) + B̂_c^T(k) r(k)) are bounded. l_1(k) is calculated as:

l_1(k) = α(1 + r^T(k) r(k)(1 − α|e(k)|) − B̂_c^T(k) r(k) sgn(e(k)))

From the above function we can see that the condition 0 < l_1(k) < 1 is satisfied when α is a small positive real number.

4.3 Selecting the Parameters of the Self-Organizing Neural Network
Parameters α and η in functions (4)–(6) influence the convergence and the convergence speed of the model. When the input and output signals are in a certain range, oversized α and η make the training process diverge; on the contrary, training is slow when α and η are too small. So we should select suitable α and η based on the signals in order to ensure the convergence of the network. We should also select suitable initial values of A_i, B_i and w_i (w_i can be the average value corresponding to the input and output signals, or random numbers ranging between the input and output crest values). The value of β decides the magnitude of the initial neighboring region N_c. Simulation experiments show that a β which makes the initial N_c as large as one third of the whole network is of the right magnitude.
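The choice of β can be checked numerically. The sketch below (our own illustration) counts how many neurons fall inside the initial neighborhood under Eq. (7), where we arbitrarily take a neuron as "inside" when h(c, i) at t = 1 exceeds 0.5; the threshold is an assumption, not from the paper.

```python
import math

def neighborhood_size(n_neurons, c, beta, t=1, threshold=0.5):
    """Count neurons i with h(c, i) = exp(-beta * (c - i)**2 * t) above threshold."""
    return sum(1 for i in range(n_neurons)
               if math.exp(-beta * (c - i) ** 2 * t) > threshold)

# with 50 neurons, beta ~ 0.01 makes the initial region roughly 1/3 of the map
size = neighborhood_size(50, c=25, beta=0.01)
```

A much larger β (for example β = 1) collapses the initial region to the winner alone, which would defeat the cooperative topology formation described above.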
5 Simulation

Case 1. Consider the static non-linear system defined as follows:

y = (x + 8)/(8 − π),  −8 ≤ x ≤ −π;
y = 1 + 0.3 sin(2x),  −π < x ≤ 1.25π;
y = 1.3,  1.25π < x ≤ 8;
y = 0,  otherwise.

The parameters and initial values are: sampling period T = 0.1; number of neurons 50; the initial value of the first column of w_i is 0.75 and that of the second column is 0.5; the mean of x is 0 and x ∈ [−8, +8]; α = 2η = 0.02; the initial value of A_i is 1, equal to the initial value of B_i.
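The piecewise test function can be coded directly; note that it is continuous at x = −π (both adjacent branches give 1) and at x = 1.25π (both give 1.3), which is a quick consistency check on the branch boundaries listed above.

```python
import math

def target(x):
    """Static non-linear test function of Case 1."""
    if -8 <= x <= -math.pi:
        return (x + 8) / (8 - math.pi)
    if -math.pi < x <= 1.25 * math.pi:
        return 1 + 0.3 * math.sin(2 * x)
    if 1.25 * math.pi < x <= 8:
        return 1.3
    return 0.0
```

Sampling this function over x ∈ [−8, +8] produces the training data for the network in Case 1.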
Fig. 2. Relation between the squared error and the number of training iterations. The simulation result after 300 training iterations of the improved algorithm is equivalent to the result after 10,000 iterations in reference [3].
Case 2. Consider the strongly dynamic non-linear system presented by Narendra and Parthasarathy [6]:

y(t) = y(t−1)/(1 + y²(t−1)) + u³(t−1)
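This benchmark is easy to simulate; the sketch below (our own code, with an arbitrary input sequence for illustration) rolls the plant forward under a given input sequence to generate identification data.

```python
def plant_step(y_prev, u_prev):
    """Narendra-Parthasarathy benchmark: y(t) = y(t-1)/(1+y(t-1)^2) + u^3(t-1)."""
    return y_prev / (1 + y_prev ** 2) + u_prev ** 3

def simulate(u_seq, y0=0.0):
    """Roll the plant forward from y0 under the input sequence u_seq."""
    y, ys = y0, []
    for u in u_seq:
        y = plant_step(y, u)
        ys.append(y)
    return ys

ys = simulate([0.5, 0.5, 0.0])
```

Pairs (y(t−1), u(t−1)) → y(t) collected this way form the regression data x(t) used by the SOM model in this case.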
We construct a predictive control [7] system (also known as model-based control, a control methodology developed in the 1970s) using the improved learning algorithm based on the generalized SOM and this strongly dynamic non-linear system.
Fig. 3. Identification process of the system. Fig. 4. Control process of the model-forecast (predictive) control. (Block diagrams: the controlled object with input u and output y_p, the neural network model with output y_m, and the learning algorithm driven by the difference between y_p and y_m; y_r denotes the reference.)
We train the Neural Network (NN) using the identification model shown in Fig. 3.
Fig. 5. Training data of the neural network
where x^T(t) = [y(t−1), u(t−1)], u(t) ∈ [0, 1], y(t) ∈ [0, 1.5], w_i ∈ R^{50×2}; the initial value of the first column of w_i is 0.75 and that of the second column is 0.5; the number of neurons is 50; α = 2η = 0.02; the initial value of A_i is 1, equal to the initial value of B_i; and the input of the system is a random signal ranging between 20 and 23. Fig. 5 shows the curve of the modeling errors; the output of the network trained for 300 iterations follows the expected output well. Then we add Gaussian noise (variance 0.2, mean 0) to the input. According to the scheme shown in Fig. 4, we run the predictive control system.
Fig. 6. Simulation result of the predictive control system for the strongly non-linear system. The predictive control system using the improved learning algorithm has great anti-jamming capability.
6 Conclusion

As shown above, if the network has suitable initial weights and parameters, the improved algorithm is convergent. Simulation results show that modeling a dynamic non-linear system with the improved algorithm based on the generalized SOM gives rapid convergence and high fitting precision. In particular, in the predictive control model, the improved SOM algorithm shows great anti-jamming capability.
References

1. Kohonen, T.: Self-Organized Formation of Topologically Correct Feature Maps. Biological Cybernetics 43(1) (1982) 59–69
2. Martinetz, T.M., Ritter, H.J., Schulten, K.J.: Three-Dimensional Neural Net for Learning Visuomotor Coordination of a Robot Arm. IEEE Trans. Neural Networks 1(1) (1990) 131–136
3. Ding, L., Xi, Y.-K.: Generalized Self-Organized Learning in Neural Network Modeling for Nonlinear Plant. Acta Electronica Sinica 20(10) (1992) 56–60
4. Liang, J., Meng, Z.-Y., Zhang, B.: Short Time Load Forecast of ANN Electronic System. Journal of Shandong University of Technology 28(3) (1998) 249–252
5. Wang, X.-A., Wicker, S.B.: An Artificial Neural Net Viterbi Decoder. IEEE Trans. Communications 44(2) (1996) 165–171
6. Narendra, K.S., Parthasarathy, K.: Identification and Control of Dynamical Systems Using Neural Networks. IEEE Trans. Neural Networks 1(1) (1990) 10–21
7. da Silva, R.N., Filatov, N., Lemos, J.M., Unbehauen, H.: A Dual Approach to Start-up of an Adaptive Predictive Controller. IEEE Trans. Control Systems Technology 13(6) (2005) 877–883
Q-Learning with FCMAC in Multi-agent Cooperation Kao-Shing Hwang, Yu-Jen Chen, and Tzung-Feng Lin Department of Electrical Engineering, National Chung Cheng University, Chia-Yi, Taiwan
[email protected]
Abstract. In general, Q-learning needs well-defined quantized state spaces and action spaces to obtain an optimal policy for accomplishing a given task. This makes it difficult to apply to real robot tasks, because the quantization of continuous state and action spaces degrades the performance of the learned behavior. In this paper, we propose a fuzzy-based CMAC method that calculates the contribution of each neighboring state and generates a continuous action value, in order to make motion smooth and effective. A momentum term to speed up training has been designed and implemented in a multi-agent system for real robot applications.
1 Introduction

Reinforcement learning has gained significant attention in recent years [1, 2]. As a learning method that does not need a model of its environment and can be used online, reinforcement learning is well studied for multi-agent systems, where agents know little about other agents and the environment changes during learning. Applications of reinforcement learning in multi-agent systems include robot soccer [3], pursuit games [4, 5] and coordination games [6]. In this paper, we discuss a cooperation task for two robots connected by a bar that constrains each other's motion. The goal is to get across a gate in the middle of a 5x9 grid map, as depicted in Fig. 1.
Fig. 1. The task environment and robots
In general, when implementing real robot applications, common reinforcement learning methods such as Q-learning need well-defined quantized state spaces and action spaces to converge. This makes them difficult to apply to real robot tasks because of the poor quantization of the state and action spaces, and the resulting robot behavior is not smooth either.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 599 – 606, 2006. © Springer-Verlag Berlin Heidelberg 2006

A fuzzy-based cerebellar model articulation controller
(FCMAC) method that uses multiple overlapping tilings of the space to produce a quantized state space with fine resolution has been proposed [7]. In addition, since the task involves many states, the robots need much training time. In neural networks, the multilayer perceptron uses a momentum term [8]; we use this concept to improve our training. This article is organized as follows: FCMAC-based continuous valued Q-learning is introduced in Section 2. The extension of the algorithm to a multi-agent system is described in Section 3. Simulation results are illustrated in Section 4. Finally, a conclusion is drawn in Section 5.
2 FCMAC-Based Continuous Valued Q-Learning

A multi-agent Q-learning system is more complex than a single-agent one. Before describing the multi-agent FCMAC-based continuous valued Q-learning, we present the FCMAC-based continuous valued Q-learning for a single-agent system.

2.1 State Representation and Action Interpolation

We describe in the following how to extend conventional discrete Q-learning to continuous valued Q-learning. In this method, an interpolated action value is computed as a weighted linear combination of action values at representative states and actions, which can generate smooth motions of the real robot. First, we quantize the state/action space adequately. The quantized states and actions are the representative states x_1, …, x_N and the representative actions u_1, …, u_M, respectively. A general state is represented as a contribution vector over the representative states, x = (w_{x_1}, …, w_{x_N}), and an action is represented as u = (w_{u_1}, …, w_{u_M}) as well [9]. The contribution vector indicates the closeness of the neighboring representative states/actions, and the contribution values sum to one. The contribution of each representative state is calculated using the FCMAC.
Fuzzy Cerebellar Model Articulation Controller. The standard univariate basis function of the CMAC is binary, so the network modeling capability is only piecewise constant; univariate basis functions of higher-order piecewise polynomials, which can generate a smoother output, have recently been investigated [10, 11]. A crisp set can be considered a special form of fuzzy set, to which an instance may or may not belong to some degree. This property matches the question of whether a state variable excites a specific region in the sensor layer of the CMAC. The membership function μ_i: x → [0, 1] associates each state variable x with a number representing its grade of membership. These membership grades have a peak value at the center of the excited region and decrease as the representative state moves toward the edge of the excited region. Different grades of membership are assigned to the corresponding cell as it is excited by a given state variable.
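The replacement of logic AND/OR by the algebraic-product T-norm and algebraic-sum T-conorm can be sketched for a 2-D input: each dimension gets a bell-shaped membership grade for the excited cell, and the cell activation is the product of the two grades. The Gaussian shape and the cell width below are our assumptions for illustration, not the paper's exact basis functions.

```python
import math

def membership(x, center, width):
    """Grade in [0, 1], peaked at the cell center, decaying toward the edges."""
    return math.exp(-((x - center) / width) ** 2)

def cell_activation(x1, x2, c1, c2, width=1.0):
    """2-D FCMAC cell: algebraic-product T-norm of the per-dimension grades."""
    return membership(x1, c1, width) * membership(x2, c2, width)

def t_conorm(a, b):
    """Algebraic sum, used as T-conorm when combining overlapping tilings."""
    return a + b - a * b
```

Both operations keep outputs in [0, 1] and are smooth in the inputs, which is what produces the smoother mapping surface described above.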
Fig. 2. The nonlinear mapping of 2-D FCMAC
Fig. 3. The nonlinear mapping result of 2-D FCMAC
The 2-D FCMAC operations, where each subset is offset relative to the others along the hyperdiagonal in the input hyperspace, are illustrated in the following schematic diagram shown in Fig. 2. The nonlinear mapping of the FCMAC is implemented by replacing the logic AND and OR operations in the CMAC with the commonly used T-norm and T-conorm operations, respectively. The nonlinear mapping result of the FCMAC is shown in Fig. 3. The algebraic product of T-norm and algebraic sum of T-conorm are adopted in this work since they produce smoother output surfaces and can make the system analysis available as well. Extension to the Continuous Space. When the robot perceives the current sensor information as a continuous state x, we define a weighting factor vector ( w x ,L, w x ) calculated by FCMAC method mentioned above. This weighting factor 1
N
i
wx represents how a current continuous state x is influenced by a representative state
xi. When the robot perceives its environment as a continuous state x, this continuous state x is described by weighted linear combination of representative state xi as
x = ∑ i =1 w x x i . Similarly in an action space, a continuous action command u is denoted by weighted linear combination of representative action uj as: N
i
j M u = ∑ j =1 w u u j .
(1)
In conventional Q-learning, an optimal action at the current state is defined as an action that maximizes the Q value at that state; such a mapping from states to actions is called a policy π. In the same way, an optimal action u* at a continuous state x is defined as

u* = Σ_{i=1}^{N} w_{x_i} argmax_a Q(x_i, a) .   (2)
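Equation (2) amounts to a weighted vote: each representative state contributes its own greedy representative action, weighted by that state's membership. A minimal sketch under our notation (a tabular Q indexed as Q[i][j] = Q(x_i, u_j)):

```python
def optimal_action(Q, state_weights, actions):
    """u* = sum_i w_x_i * argmax_a Q(x_i, a)  (Eq. (2))."""
    u_star = 0.0
    for i, w in enumerate(state_weights):
        j_best = max(range(len(actions)), key=lambda j: Q[i][j])
        u_star += w * actions[j_best]
    return u_star

# two representative states, representative actions u_1 = -1, u_2 = +1
Q = [[0.2, 0.8],   # state x_1 prefers +1
     [0.9, 0.1]]   # state x_2 prefers -1
u = optimal_action(Q, [0.75, 0.25], [-1.0, 1.0])
```

Because the state weights sum to one, u* always lies within the convex hull of the representative actions, which is what yields the smooth commands described above.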
u* is obtained as the summation, over representative states x_i, of the product of the weighting factor of x_i and the representative action u_j that maximizes the Q value at x_i. From (1) and (2), we get

w_{u_j} = Σ_{i=1}^{N} [ w_{x_i} | u_j = argmax_a Q(x_i, a) ] .   (3)
2.2 Updating the Q Function

After the optimal action u* is executed, the agent makes a transition from the old state to a new state. The value function V^π(x_i) under policy π can be taken as the action value Q at the representative state x_i with the optimal action u*:

V^π(x_i) = Q(x_i, u*) = Σ_{j=1}^{M} w*_{u_j} Q(x_i, u_j) .   (4)
The Q value is updated in the two stages described in the following.

First Stage. From (2), the optimal action u* is obtained as the summation of weighting factors of representative states x_i and the representative actions u_j that maximize the Q value at each x_i. At each representative state, the action value that has the maximum value, selected for the contribution to u*, should be updated:

Q_{t+1}(x_i, u_j) = Q_t(x_i, u_j) + α w_{x_i} [r + γ V^π(x′) − Q_t(x_i, u_j)] ,   (5)

where r is the reward received from the environment, γ is a discount rate, u_j = argmax_{u_j} Q(x_i, u_j), and x′ is the next state.
Second Stage. In the first stage, the representative states excited by the agent only update the value of the representative action selected by the greedy policy at each representative state. From (2) and (3), since u* is obtained from the state weighting factors, and the action weighting factors are in turn calculated from u*, the action weighting factors carry information about the contribution of the state weighting factors. In this stage, we therefore use the weighting factors of all the representative actions to reinforce the representative state that has the maximum weighting contribution:

Q_{t+1}(x_i, u_j) = Q_t(x_i, u_j) + α w_{x_i} w_{u_j} [r + γ V^π(x′) − Q_t(x_i, u_j)] ,   (6)

where x_i = argmax_{x_i} w_{x_i}.
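The two stages differ only in which table entries they touch and by which weights: stage 1 updates, at every excited representative state, the greedy action entry weighted by w_{x_i} (Eq. (5)); stage 2 reinforces all action entries of the single most-excited state, weighted by w_{x_i} w_{u_j} (Eq. (6)). A sketch under our notation:

```python
def two_stage_update(Q, wx, wu, r, V_next, alpha=0.1, gamma=0.9):
    """In-place two-stage update of Eqs. (5)-(6); Q[i][j] = Q(x_i, u_j)."""
    # Stage 1: for each representative state, update its greedy action value
    for i, w in enumerate(wx):
        j = max(range(len(Q[i])), key=lambda j: Q[i][j])
        Q[i][j] += alpha * w * (r + gamma * V_next - Q[i][j])
    # Stage 2: reinforce every action of the most-excited state
    i_star = max(range(len(wx)), key=lambda i: wx[i])
    for j, w in enumerate(wu):
        Q[i_star][j] += alpha * wx[i_star] * w * (r + gamma * V_next - Q[i_star][j])
    return Q

Qt = two_stage_update([[0.0, 1.0], [0.0, 0.0]], [0.8, 0.2], [0.5, 0.5],
                      r=1.0, V_next=0.0)
```

Entries whose current value already equals the TD target are left unchanged, so repeated updates are stable.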
Momentum Term. In order to speed up training, we use the concept of momentum. A momentum term of the value function in terms of the current continuous state x and the next continuous state x′ is defined as

ΔV(x, x′) = Σ_{i=1}^{N} w_{x′_i} max_{u_j∈A} Q(x′_i, u_j) − Σ_{i=1}^{N} w_{x_i} max_{u_j∈A} Q(x_i, u_j) .   (7)
We add this momentum term to improve and speed up learning, so (5) can be rewritten as

Q_{t+1}(x_i, u_j) = Q_t(x_i, u_j) + α w_{x_i} [r + ΔV + γ V^π(x′) − Q_t(x_i, u_j)] .   (8)
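The momentum term of Eq. (7) is simply the change in interpolated state value between the current and next continuous states; adding it to the TD target gives Eq. (8). A sketch under our notation:

```python
def interp_value(Q, weights):
    """V(x) = sum_i w_x_i * max_j Q(x_i, u_j), the interpolated greedy value."""
    return sum(w * max(Q[i]) for i, w in enumerate(weights))

def momentum(Q, wx, wx_next):
    """Eq. (7): dV = V(x') - V(x) under the interpolated greedy values."""
    return interp_value(Q, wx_next) - interp_value(Q, wx)

Q = [[0.0, 1.0], [0.5, 0.0]]
dV = momentum(Q, wx=[1.0, 0.0], wx_next=[0.0, 1.0])
```

A positive ΔV (the agent is moving toward higher-valued states) enlarges the update of Eq. (8), which is the speed-up effect reported in the simulations.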
3 Multi-agent Q-Learning

3.1 Conventional Discrete Multi-agent Q-Learning

As Littman noted in [12], no agent lives in a vacuum; it must interact with other agents in the environment to achieve its goal. Multi-agent systems differ from single-agent systems in that several agents exist in the environment, modeling each other's goals and actions. From an individual agent's perspective, multi-agent systems differ from single-agent systems most significantly in that other agents can affect the environment dynamically: in addition to the uncertainty of the system itself, other agents may intentionally change the environment [13]. When multiple agents learn simultaneously, each particular agent is learning the value of actions in a non-stationary environment, so the convergence of the original Q-learning algorithm is not necessarily guaranteed in a multi-agent system. As is well known, under certain assumptions about the way actions are selected at each state over time, Q-learning converges to the optimal value function V*. The simplest way to extend this to the multi-agent stochastic game (SG) setting is just to add a subscript to the formulation above, that is, to have the learning agent pretend that the environment is passive:

Q_k^{t+1}(x⃗, u_k) = Q_k^t(x⃗, u_k) + α [r_k(x⃗, u⃗) + γ V_k(x⃗′) − Q_k^t(x⃗, u_k)] ,   (9)

where x⃗ denotes the joint state of all P agents (x_1, …, x_P), u⃗ denotes the joint action of all P agents (u_1, …, u_P), Q_k^t(x⃗, u_k) is the Q value of agent k in terms of the joint state and agent k's own action u_k at time t, r_k is the reward function of agent k, and x⃗′ is the next joint state, with

V_k(x⃗) = max_{u_k∈A_k} Q_k(x⃗, u_k) ,   (10)

where A_k is the action space of agent k. Several authors have tested variations of this algorithm [14]. However, this approach is poorly motivated because it incorrectly assumes that the Q-values are independent of the actions selected by the other agents. The cure to this problem is to define the Q-values as a function of all agents' actions:

Q_k^{t+1}(x⃗, u⃗) = Q_k^t(x⃗, u⃗) + α [r_k(x⃗, u⃗) + γ V_k(x⃗′) − Q_k^t(x⃗, u⃗)] .   (11)
For (by definition, two-player) zero-sum SGs, Littman suggests the minimax-Q learning algorithm, in which V is updated with the minimax of the Q-values [12]:

V_i(x⃗) = max_{P_1∈Π(A_1)} min_{u_2∈A_2} Σ_{u_1∈A_1} P_1(u_1) Q_i(x⃗, u_1, u_2) ,  i = 1, 2 .   (12)
Although it can be extended to general-sum SGs, minimax-Q is no longer well motivated in those settings. In our problem, the two agents have to cooperate to achieve the same goal, so we can simply define V as:

V_i(x⃗) = max_{u_1∈A_1, u_2∈A_2} Q_i(x⃗, u_1, u_2) ,  i = 1, 2 .   (13)

This shows that the Q-values of the players define a game in which there is a globally optimal action profile (meaning that the payoff of any agent under that joint action is no less than its payoff under any other joint action).

3.2 FCMAC-Based Q-Learning Used in a Multi-agent System
The purpose of the present work is to find a solution that helps the two robots cooperate to get across the gate. Conventional discrete Q-learning does not perform very well in controlling the robots when implemented in a real robot application. Therefore we replace it with continuous valued Q-learning to obtain smooth continuous output command actions. We use the FCMAC to calculate the weighting factors of the quantized representative states, and obtain the action as a linear combination of the quantized representative actions. The details of our multi-agent FCMAC-based continuous valued Q-learning are as follows.

For all representative states x_k^{i_k} (i_k = 1, …, N) and representative actions u_k^{j_k} (j_k = 1, …, M), let Q_k^t(x_1^{i_1} ⋯ x_P^{i_P}, u_1^{j_1} ⋯ u_P^{j_P}) = 0 (k = 1, …, P). Initialize all agents' states x_k and repeat steps 1–9 until learning is terminated.

Step 1. Use the FCMAC to calculate all weighting factors w_k^{x_i} from x_k (k = 1, …, P).
Step 2. Calculate u_k* in terms of w_k^{x_i}.
Step 3. Get all the weighting factors w_k^{u_j} (k = 1, …, P).
Step 4. Take actions u_1*, …, u_P*, and observe r and x_1′, …, x_P′.
Step 5. Calculate w_k^{x_i′} and w_k^{u_j′} from x_k′, in the same way as steps 1–3.
Step 6. Calculate the next-state value function.
Step 7. Calculate the momentum term.
Step 8. Update the representative Q values by stage 1 and stage 2.
Step 9. If the agents achieve the goal, return them to their initial positions.
4 Simulation Results

To demonstrate the algorithm proposed above, we give a simple task to two robots connected to each other by a straight bar. Their goal is to cooperate with each other to pass through the gate in a 9x5 grid map. Each agent has four representative actions: up, down, left and right. Both agents perform actions and update Q-values simultaneously.
Fig. 4. Performance comparison of single agent
Fig. 5. Performance comparison of multi-agent
Fig. 6. Performance comparison of momentum term
Figures 4 and 5 show comparisons between performing only stage 1 and performing stages 1 and 2, with a single agent and with two agents, respectively. The x-axis is the learning trial and the y-axis is the number of steps in one trial. Stage 2 clearly speeds up learning, especially in the multi-agent system. Figure 6 compares using and not using the momentum term when performing both stages; the axes are the same. It shows that the momentum term also improves learning. Although the convergence of discrete Q-learning is better than that of the proposed method in simulations, Q-learning with the FCMAC performs well in reality. This is because states are continuous in the real world and noise disturbs state transitions; furthermore, continuous actions are inherently better than discrete actions.
5 Conclusion and Future Work

In this paper, we propose an approach that combines the FCMAC, which calculates the weighting factors of the agents' states, with multi-agent Q-learning to generate continuous action commands, and apply it to real robot applications. Although conventional discrete valued Q-learning learns well and converges in a task that has well-defined discrete quantized state and action spaces, it performs poorly in real-world applications. This is because the state of an agent in the real world is continuous, and flexible continuous action outputs are better than inflexible discrete action outputs.
The task in this paper is simple, and the environment is regular and varies little. The proposed method may not learn well if the environment is highly unpredictable or has more obstacles. In future work, we will apply our method to more complicated tasks under changing environmental conditions.
References

1. Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4 (1996) 237–285
2. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press/Bradford Books (1998)
3. Balch, T.: Learning Roles: Behavioral Diversity in Robot Teams. In Sec. 8
4. Tan, M.: Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents. In: Proceedings of the Tenth International Conference on Machine Learning (1993) 330–337
5. Jong, E.D.: Non-Random Exploration Bonuses for Online Reinforcement Learning. In Sec. 8
6. Claus, C., Boutilier, C.: The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems. In Sec. 8
7. Hwang, K.S., Lin, C.S.: Smooth Trajectory Tracking of Three-Link Robot: A Self-Organizing CMAC Approach. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 28(5) (1998)
8. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning Internal Representations by Error Propagation. In: Rumelhart, D.E., McClelland, J.L. (eds.): Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, Chap. 8. MIT Press, Cambridge, MA (1986) 318–362
9. Takahashi, Y., Takeda, M., Asada, M.: Continuous Valued Q-learning for Vision-Guided Behavior Acquisition. In: Proceedings of the 1999 IEEE/SICE/RSJ International Conference on Multisensor Fusion and Integration for Intelligent Systems (1999) 255–260
10. Brown, M., Harris, C.J.: A Perspective and Critique of Adaptive Neurofuzzy Systems Used for Modeling and Control Applications. Int. J. Neural Syst. 6(2) (1995) 197–220
11. Jou, C.-C.: A Fuzzy Cerebellar Model Articulation Controller. In: IEEE Int. Conf. Fuzzy Systems (1992) 1171–1178
12. Littman, M.L.: Markov Games as a Framework for Multi-Agent Reinforcement Learning. In: Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick (1994) 157–163
13. Stone, P., Veloso, M.: Multiagent Systems: A Survey from a Machine Learning Perspective (1997)
14. Sen, S. (ed.): Collected Papers from the AAAI-97 Workshop on Multiagent Learning. AAAI Press (1997)
Q Learning Based on Self-organizing Fuzzy Radial Basis Function Network Xuesong Wang, Yuhu Cheng, and Wei Sun School of Information and Electrical Engineering China University of Mining and Technology, Xuzhou, Jiangsu 221008, P.R. China {wangxuesongcumt, chengyuhu, sw3883204}@163.com
Abstract. A fuzzy Q learning method based on a self-organizing fuzzy radial basis function (FRBF) network is proposed in this paper to solve the 'curse of dimensionality' problem caused by state-space generalization. An FRBF network is used to represent continuous actions and the corresponding Q values. An interpolation technique is adopted to represent the appropriate utility value for the winning local action of every fuzzy rule. Neurons are organized by the FRBF network itself. The structure and parameter learning methods, based on new neuron-adding and neuron-merging techniques and a gradient descent algorithm, are simple and effective, with high accuracy and a compact structure. Simulation results on the balancing control of an inverted pendulum illustrate the performance and applicability of the proposed fuzzy Q learning scheme to real-world problems with continuous states and continuous actions.
1 Introduction

Q learning is an effective reinforcement learning method for solving Markov Decision Problems (MDPs) with incomplete information. Since Watkins and Dayan [1] proposed the Q learning algorithm and proved its convergence, it has received broad attention. However, most research in the field of Q learning has focused on discrete domains. Because the states and actions of many control systems are in fact continuous, the state and action spaces must be discretized before Q learning can be applied to such control problems. Munos [2] pointed out that such discretization is likely to cause the following problems: (a) the hidden state problem easily occurs when the state space is divided coarsely; (b) the curse of dimensionality appears when the state space is enormous; (c) the Markov property of the system cannot be guaranteed after discretization. Much research has been done to make Q learning handle continuous states and continuous actions. Smith [3] used two SOM networks to approximate the state and action spaces respectively; the algorithm is quite complex due to the coordination problem between the two SOM networks. A Q learning method based on a dynamic neural field was proposed by Gross et al. [4], in which a neural vector quantization technique was used to cluster similar states and to code action values. Although the method meets the demands of Q learning for real-world problems with continuous states and continuous actions, the low efficiency of its action selection hampers its application. Jouffe
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 607 – 615, 2006. © Springer-Verlag Berlin Heidelberg 2006
[5] designed two fuzzy reinforcement learning methods, fuzzy AC learning and fuzzy Q learning, based on dynamic planning theory. These two methods merely tune the parameters of the consequent part of a fuzzy inference system (FIS) using the reinforcement signal, while the premise part of the FIS is fixed during the learning process. Therefore, they cannot realize adaptive construction of the rule base. FISs are suitable for representing fuzzy and uncertain knowledge, which accords well with human reasoning, but they lack self-learning and self-adaptive abilities. RBF networks, in turn, have the advantages of parallel computation, fault tolerance, and self-learning, but they are not suitable for representing knowledge. Hence, it is natural to combine the features of these two systems, which has led to the powerful fuzzy RBF (FRBF) network. Aiming at effective control of systems with continuous states and continuous actions, a fuzzy Q learning method based on an FRBF network is proposed in this paper.
2 The Architecture of Fuzzy Q Learning

The architecture of fuzzy Q learning, based on an FRBF network, is schematically shown in Fig. 1. The meaning of each layer is described as follows.

Layer 1 is an input layer. Each neuron in this layer denotes a crisp input variable s_i, where i is the input variable index. The input vector s = (s_1, s_2, ..., s_n)^T ∈ R^n is transmitted to the next layer directly.

Layer 2 is a rule layer. Each neuron in the rule layer represents the premise part of a fuzzy rule and has n differentiable Gaussian membership functions:

    MF_ij = exp( −(s_i − μ_ij)² / (2σ_ij²) ),   i = 1, 2, ..., n,  j = 1, 2, ..., h,    (1)

where MF_ij is the ith membership function in the jth rule neuron, and μ_ij and σ_ij are the center and the width of MF_ij, respectively. At time step t, the output of the jth rule neuron is computed from the n-dimensional input observation s_t as in Eq. (2). φ^j(s_t) is the product of the n inner membership functions and denotes the firing strength of the jth rule neuron:

    φ^j(s_t) = Π_{i=1..n} MF_ij(s_i) = exp( −Σ_{i=1..n} (s_it − μ_ij)² / (2σ_ij²) ),    (2)

where s_it is the ith input variable within the input vector s_t at time step t.

Layer 3 is a normalized layer. The number of neurons in this layer is equal to that of the rule layer. The role of this layer is to normalize the firing strength of each rule. The output Φ^j of the jth neuron in this layer denotes the normalized firing strength of the jth fuzzy rule:
    Φ^j(s_t) = φ^j / Σ_{l=1..h} φ^l = exp( −Σ_{i=1..n} (s_it − μ_ij)² / (2σ_ij²) ) / Σ_{l=1..h} exp( −Σ_{i=1..n} (s_it − μ_il)² / (2σ_il²) ).    (3)
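As a concrete illustration of Eqs. (2) and (3), the firing strengths and their normalization can be sketched as follows (a minimal sketch; the array shapes and example values are ours, not from the paper):

```python
import numpy as np

def firing_strengths(s, mu, sigma):
    """Eq. (2): Gaussian firing strength of each of the h rules for input s."""
    return np.exp(-np.sum((s - mu) ** 2 / (2.0 * sigma ** 2), axis=1))

def normalized_strengths(s, mu, sigma):
    """Eq. (3): firing strengths normalized so that they sum to one."""
    phi = firing_strengths(s, mu, sigma)
    return phi / phi.sum()

# Example: h = 2 rule neurons over an n = 2 dimensional state (illustrative values)
mu = np.array([[0.0, 0.0], [1.0, 1.0]])   # centers, shape (h, n)
sigma = np.ones((2, 2))                   # widths, shape (h, n)
Phi = normalized_strengths(np.array([0.0, 0.0]), mu, sigma)
```

By construction, Φ sums to one and the rule whose center is closest to the input fires most strongly.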
Fig. 1. The architecture of fuzzy Q learning
Layer 4 is a discrete action competition layer. Each neuron corresponds to the consequent part of a fuzzy rule. Each fuzzy rule has a set of possible discrete actions and the corresponding Q values; for example, the jth rule has n_j discrete actions {a_1^j, a_2^j, ..., a_{n_j}^j} and Q values {q_1^j, q_2^j, ..., q_{n_j}^j}. Based on the Q values and the action selection policy, one candidate action from the discrete action set is selected as the winning local action a^{j*}:

    a^{j*}(s_t) = π_greedy{ q_k^j } = arg max_{a_k^j} q_k^j,    (4)
where k is a candidate action index and π represents the policy used to select the action; in this paper, the completely greedy policy is used.
Layer 5 is an output layer, made up of a continuous action part and the corresponding Q value part. The generation of the continuous action depends upon the winning local action of each fuzzy rule and the vector of firing strengths of the fuzzy rules.
The final action b(s_t) is obtained as the sum of the winning local actions weighted by their normalized firing strengths:

    b(s_t) = Σ_{j=1..h} π_greedy{ q_k^j } · Φ^j(s_t) = Σ_{j=1..h} a^{j*}(s_t) · Φ^j(s_t).    (5)
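Eqs. (4) and (5) can be sketched as follows (the rule count, discrete action sets, and Q values are illustrative assumptions):

```python
import numpy as np

def blended_action(Phi, actions, q):
    """Eqs. (4)-(5): each rule j greedily picks its winning local action,
    and the final continuous action is their Phi-weighted sum."""
    winners = np.array([a[np.argmax(qj)] for a, qj in zip(actions, q)])  # Eq. (4)
    return float(np.dot(Phi, winners)), winners                          # Eq. (5)

Phi = np.array([0.7, 0.3])                    # normalized firing strengths, h = 2
actions = [np.array([-10.0, 0.0, 10.0])] * 2  # discrete action set of each rule
q = [np.array([0.1, 0.5, 0.2]), np.array([0.0, 0.1, 0.9])]
b, winners = blended_action(Phi, actions, q)
```

Here rule 1 wins with action 0 and rule 2 with action +10, so the blended action is 0.7·0 + 0.3·10 = 3.0, a value that need not belong to either discrete set.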
Once the final action is determined, the utility value of the winning local action of every rule can be calculated to realize the credit assignment of the final action to each fuzzy rule. Because the final action is generally not included in the discrete action set, an interpolation technique is introduced to obtain the utility value q^{j*}(a^{j*}, b), as shown in Eq. (6). Interpolation [6] is a method for obtaining the functional value of unknown parameters from a finite set of known data pairs:

    q^{j*}(a^{j*}(s_t), b(s_t)) = Σ_{k=1..n_j} d(a_k^j, b) · q_k^j / Σ_{k=1..n_j} d(a_k^j, b),    (6)

    d(a_k^j, b) = exp( −(a_k^j − b)² / (2ξ²) ),   ξ = (1 / (n_j − 1)) Σ_{k=1..n_j} | a_k^j − b |,    (7)

where d(a_k^j, b) is a Gaussian kernel function that determines how much each (a_k^j, q_k^j) pair in the jth rule contributes to the calculation of q^{j*}(a^{j*}, b). The value of ξ in the kernel function is determined by the number of candidate actions in the discrete action set of the rule and the spacing between these candidate actions, so that the resulting interpolated curve represents the utility value properly. The Q value of the final action is calculated as the sum of the utility values of the winning local actions weighted by their normalized firing strengths:
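The kernel interpolation of Eqs. (6) and (7) can be sketched as follows (the example action set and Q values are ours):

```python
import numpy as np

def utility(a_j, q_j, b):
    """Eqs. (6)-(7): kernel-weighted utility of rule j's actions for the
    final action b; a_j, q_j are the rule's discrete actions and Q values."""
    xi = np.abs(a_j - b).sum() / (len(a_j) - 1)   # Eq. (7): kernel width
    d = np.exp(-(a_j - b) ** 2 / (2.0 * xi ** 2)) # Gaussian kernel weights
    return float(np.dot(d, q_j) / d.sum())        # Eq. (6): normalized blend

a_j = np.array([-10.0, 0.0, 10.0])
q_j = np.array([0.2, 0.8, 0.1])
u = utility(a_j, q_j, 1.0)
```

Because the weights are normalized, the interpolated utility always lies between the smallest and largest Q value of the rule, and it is pulled toward the Q values of the candidate actions nearest to b.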
    Q(s_t, b(s_t)) = Σ_{j=1..h} q^{j*}(a^{j*}, b) · Φ^j(s_t).    (8)
The Q value of the final action is then propagated back to update the Q value of each discrete action using the temporal difference (TD) method. The TD error δ_t is the temporal difference of the Q value between successive states in the state transition:
    δ_t = r_t + γ Q(s_{t+1}, b(s_{t+1})) − Q(s_t, b(s_t)),    (9)

where s_t and s_{t+1} are the states at times t and (t+1) respectively, r_t is the external reinforcement reward signal, and γ is the discount factor (0 < γ < 1) that determines the weight given to future rewards. The TD error
in fact indicates the merit of the selected action. Therefore, the error function can be defined as

    E(t) = (1/2) δ_t².    (10)
Based on the TD error, the updating equation of the Q value of each discrete action is

    q_k^j(t+1) = q_k^j(t) + α_q · δ_t · Φ^j(s_t) · d(a_k^j, b) / Σ_{k=1..n_j} d(a_k^j, b),    (11)

where α_q is the learning rate (0 < α_q < 1).
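A sketch of the TD error and Q value update of Eqs. (9) and (11); the defaults γ = 0.95 and α_q = 0.3 are the simulation settings of Section 4, everything else is illustrative:

```python
import numpy as np

def td_update(q_j, a_j, Phi_j, b, r, Q_next, Q_now, gamma=0.95, alpha=0.3):
    """Eqs. (9) and (11): TD error for the transition, then the
    kernel-weighted update of rule j's discrete-action Q values."""
    delta = r + gamma * Q_next - Q_now            # Eq. (9)
    xi = np.abs(a_j - b).sum() / (len(a_j) - 1)   # Eq. (7): kernel width
    d = np.exp(-(a_j - b) ** 2 / (2.0 * xi ** 2))
    return q_j + alpha * delta * Phi_j * d / d.sum(), delta  # Eq. (11)

q_j = np.zeros(3)
new_q, delta = td_update(q_j, np.array([-10.0, 0.0, 10.0]),
                         Phi_j=0.5, b=0.0, r=1.0, Q_next=0.0, Q_now=0.0)
```

Since the kernel weights sum to one, each update distributes exactly α_q·δ_t·Φ^j of credit across the rule's discrete actions, concentrated on those nearest the executed action b.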
3 The Learning Algorithm of Fuzzy Q Learning

When a neural network is used to constitute a fuzzy inference system, the key problem is how to design a learning algorithm that makes both the structure and the parameters of the network optimal. The learning process of fuzzy Q learning is in fact FRBF network learning, which includes structure learning and parameter learning.

3.1 Structure Learning

Keeping the network structure compact during the learning process yields better generalization performance and much higher convergence speed.

3.1.1 Adding Neurons
(1) TD error criterion. The TD error criterion is based on local errors defined for each basis function as follows [7].

Definition: For each basis function Φ^j, the running averages of the locally weighted TD error, f_j, and of the squared TD error, g_j, are calculated by

    f_j(t+1) = (1 − γ_c Φ^j) f_j(t) + γ_c Φ^j φ^j δ_t,    (12)

    g_j(t+1) = (1 − γ_c Φ^j) g_j(t) + γ_c Φ^j φ^j δ_t²,    (13)

where γ_c is an attenuating coefficient. When L_j(t) = g_j(t) / f_j(t) is larger than a threshold θ_L, a new neuron is added. But when the environment is stationary, i.e., the state transition probability is constant, the average of the TD
error converges to zero, and consequently L_j diverges. To avoid this problem, we add a rule that stops adding neurons when g_j is smaller than a constant θ_g. Thus, the criterion of adding neurons based on the TD error is

    L_j > θ_L  and  g_j > θ_g.    (14)
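The running averages and the adding test of Eqs. (12)-(14) can be sketched as follows; the parameter values are the ones listed in Section 4, and the sketch assumes f_j is nonzero when the ratio is evaluated:

```python
def adding_criterion(f_j, g_j, Phi_j, phi_j, delta,
                     gamma_c=0.42, theta_L=2.8, theta_g=0.012):
    """Eqs. (12)-(14): update rule j's locally weighted TD-error averages
    and test whether a new neuron should be added."""
    f_j = (1 - gamma_c * Phi_j) * f_j + gamma_c * Phi_j * phi_j * delta       # Eq. (12)
    g_j = (1 - gamma_c * Phi_j) * g_j + gamma_c * Phi_j * phi_j * delta ** 2  # Eq. (13)
    add = (g_j / f_j > theta_L) and (g_j > theta_g)                           # Eq. (14)
    return f_j, g_j, add

# One step from zero averages with a unit TD error: no neuron is added yet
f, g, add = adding_criterion(0.0, 0.0, Phi_j=1.0, phi_j=1.0, delta=1.0)
```

Intuitively, g_j stays large while f_j shrinks when the TD error keeps a large magnitude but an inconsistent sign near rule j, which is exactly when a finer state-space resolution is needed there.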
(2) If-part criterion. Intuitively, a FIS should always be able to infer an appropriate control action for each state of the control system; this is called the "completeness" property. The completeness of a fuzzy controller depends on its data base and rule base. From the view of the data base, completeness mainly means that the supports of the fuzzy sets should cover the related discourse domains to a definite level ε; this property of a FIS is called ε-completeness [8]. From the view of the rule base, completeness mainly means that an extra rule can be added whenever a fuzzy condition is not included in the rule base, or whenever the match degree (firing strength) between some input and the predefined fuzzy conditions is lower than the level ε; otherwise, no dominant rule would be activated in the latter case. Therefore, when Eq. (15) is satisfied, meaning that no neuron in the rule layer covers the current input data s_t well, new neurons should be added so that the membership value of each input datum is larger than ε:

    max_j φ^j(s_t) < ε.    (15)
If the TD error criterion (14) and the if-part criterion (15) are both satisfied for the current input s_t, the center and the width of the new neuron are set as

    μ_new = s_t,   σ_new = τ ‖ s_t − μ_nearest ‖,    (16)

where ‖ s_t − μ_nearest ‖ denotes the Euclidean distance between the current input state and its nearest rule neuron, and τ is an overlapping coefficient that gives the basis functions of adjacent neurons a proper overlap region. At the initial learning phase there are no neurons in the rule layer, and the first input observation is taken as the first neuron: its center is set to that observation, while its width is set to a predefined initial value. For each subsequent input, whether to add neurons is judged according to the above two criteria.

3.1.2 Merging Neurons
As Q learning continues to explore its environment, the number of rule neurons grows. It is therefore essential to merge highly similar rule neurons in order to keep the size of the FRBF network to a minimum without severely degrading
its capability of searching for the optimal control policy. An approximate but practical method to measure the similarity between two fuzzy sets is proposed here. The shape and location of the membership function of each rule neuron depend mainly on the center and the width of its Gaussian function. If the centers and widths of rule neurons j and p satisfy

    ‖ μ_j − μ_p ‖ < Δμ_min  and  ‖ σ_j − σ_p ‖ < Δσ_min,    (17)

we can deduce that these two neurons are similar, where Δμ_min and Δσ_min are merging thresholds. If Eq. (17) is satisfied, the two neurons are merged into one, and the numbers of neurons in both the rule layer and the normalized layer are reduced by one.

3.2 Parameter Learning

Some parameters, namely the centers and widths of the rule neurons, need to be adjusted on-line. A gradient descent algorithm is adopted for this parameter learning, as shown in Eqs. (18) and (19):
    μ_ij(t+1) = μ_ij(t) + η_μ δ_t · q^{j*} · Φ^j(s_t) · (1 − Φ^j(s_t)) · (s_i − μ_ij) / σ_ij²,    (18)

    σ_ij(t+1) = σ_ij(t) + η_σ δ_t · q^{j*} · Φ^j(s_t) · (1 − Φ^j(s_t)) · (s_i − μ_ij)² / σ_ij³,    (19)

where η_μ and η_σ are learning rates.
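Eqs. (18) and (19) amount to the following per-rule update (a sketch; η_μ = 0.06 and η_σ = 0.045 are the values from Section 4, and the winning-action utility q^{j*} is passed in as u_j):

```python
import numpy as np

def update_rule_params(mu_j, sigma_j, s, delta, u_j, Phi_j,
                       eta_mu=0.06, eta_sigma=0.045):
    """Eqs. (18)-(19): gradient-descent adjustment of the center and
    width vectors of rule j; u_j is the utility of its winning action."""
    common = delta * u_j * Phi_j * (1.0 - Phi_j)
    mu_new = mu_j + eta_mu * common * (s - mu_j) / sigma_j ** 2             # Eq. (18)
    sig_new = sigma_j + eta_sigma * common * (s - mu_j) ** 2 / sigma_j ** 3 # Eq. (19)
    return mu_new, sig_new

mu, sig = update_rule_params(np.zeros(2), np.ones(2),
                             s=np.array([1.0, -1.0]), delta=1.0,
                             u_j=1.0, Phi_j=0.5)
```

For a positive TD error and utility, the center is pulled toward the current state and the width is enlarged, both in proportion to how strongly (but not fully) the rule fired.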
4 Simulation Results

To assess the learning performance of the proposed fuzzy Q learning, the typical inverted pendulum balancing problem is considered in this section. The dynamics and the parameters of the inverted pendulum system are the same as in [5]. The only feedback signal received by the fuzzy Q learning controller from the external environment is a failure signal: when the angle of the pole leaves the range [−12°, +12°] or the cart collides with the end of the track at position −2.4 m or +2.4 m, the environment issues a failure signal. Hence, the reward r_t is defined as

    r(t) = 1  if |θ(t)| ≤ 12° and |x| ≤ 2.4 m;   r(t) = 0  otherwise.    (20)
The input variables of the FRBF network are the four state variables (x, ẋ, θ, θ̇) of the system, while its outputs are the control action F and the corresponding Q value. The goal of this control problem is to train the fuzzy Q learning controller to determine a sequence of forces applied to the cart that balances the pole as long as possible without failure. A control strategy is deemed successful if it balances the pole for 10 000 time steps within one trial. In order to acquire control experience from various situations, the controller starts learning from a stochastic initial state until a control failure occurs, and restarts learning after each failure. The parameters of the fuzzy Q learning controller are set as follows: γ = 0.95 , α q = 0.3 ,
γ c = 0.42 , θ L = 2.8 , θ g = 0.012 , ε = 0.1354 , τ = 0.68 , Δμ min = 0.01 ,
Δσ min = 0.01 , η μ = 0.06 , ησ = 0.045 .
Two groups of simulations, with discrete action sets {−10, 0, +10} and {−10, −5, 0, +5, +10}, are carried out. In this example, 50 runs are simulated; a run ends when a successful controller is found or when the run fails. A failure run occurs if no successful controller is found after 100 trials. The controllers with three and five discrete actions balance the inverted pendulum successfully after about 20 and 13 trials respectively. Fig. 2 gives the simulation results of 5 runs for three and for five discrete actions. As can be seen, the fuzzy Q learning controller performs very well in both cases. Increasing the number of discrete actions increases the learning difficulty (a larger action search space), but on the other hand it reduces the difficulty of finding a compositional action.

Fig. 2. Learning curves of the fuzzy Q learning controller (balancing steps vs. trials): (a) three discrete actions; (b) five discrete actions
5 Conclusions

In this research, a new fuzzy Q learning method based on an FRBF network is proposed, combining the knowledge-representation property of the FIS with the self-learning property of the RBF network. The FRBF network is used to learn the Q value of each discrete action. An interpolation technique is adopted to represent the appropriate utility value of the winning local action of every fuzzy rule, and the final action
is obtained as the sum of the winning local actions weighted by their normalized firing strengths. Moreover, the FRBF network is able to add and merge rule neurons dynamically with a novel self-organizing approach, according to the task complexity and the progress of learning. Computer simulations of inverted pendulum balancing control verified the validity and performance of the proposed fuzzy Q learning method on problems with continuous states and continuous actions.
Acknowledgements This research is supported by the Scientific and Technological Foundation for the Youth, CUMT.
References
1. Watkins, C.J.C.H., Dayan, P.: Technical Note: Q-Learning. Machine Learning 8(3) (1992) 279-292
2. Munos, R.: A Study of Reinforcement Learning in the Continuous Case by the Means of Viscosity Solutions. Machine Learning 40(3) (2000) 265-299
3. Smith, A.J.: Applications of the Self-Organizing Map to Reinforcement Learning. Neural Networks 15 (2002) 1107-1124
4. Gross, H.M., Stephan, V., Krabbes, M.: A Neural Field Approach to Topological Reinforcement Learning in Continuous Action Spaces. In: Proceedings of the IEEE World Congress on Computational Intelligence, Vol. 3. San Diego, CA (1998) 3460-3465
5. Jouffe, L.: Fuzzy Inference System Learning by Reinforcement Methods. IEEE Transactions on Systems, Man and Cybernetics 28(3) (1998) 338-355
6. Kim, M.S., Hong, S.G., Lee, J.J.: On-line Fuzzy Q-Learning with Extended Rule Interpolation Technique. In: Proceedings of the 1999 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vol. 2. Kyongju (1999) 757-762
7. Samejima, K., Omori, T.: Adaptive Internal State Space Construction Method for Reinforcement Learning of a Real-World Agent. Neural Networks 12 (1999) 1143-1155
8. Meesad, P., Yen, G.G.: Accuracy, Comprehensibility and Completeness Evaluation of a Fuzzy Expert System. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 11(4) (2003) 445-466
A Fuzzy Neural Networks with Structure Learning Haisheng Lin1, Xiao Zhi Gao2, Xianlin Huang1, and Zhuoyue Song1 1 Department of Control Science and Engineering, Harbin Institute of Technology, Harbin, P.R. China {haishenglin318, zhuoyuesong}@yahoo.com.cn,
[email protected] 2 Institute of Intelligent Power Electronics, Helsinki University of Technology, Espoo, Finland
[email protected]
Abstract. This paper presents a novel clustering algorithm for the structure learning of fuzzy neural networks. Our novel clustering algorithm uses the reward and penalty mechanism for the adaptation of the fuzzy neural networks prototypes for every training sample. This new clustering algorithm can on-line partition the input data, pointwise update the clusters, and self-organize the fuzzy neural structure. No prior knowledge of the input data distribution is needed for initialization. All rules are self-created, and they automatically grow with more incoming data. Our learning algorithm shows that supervised clustering algorithms can be used for the structure learning for the on-line selforganizing fuzzy neural networks. The control of the inverted pendulum is finally used to demonstrate the effectiveness of our learning algorithm.
1 Introduction

There are many kinds of neural fuzzy systems proposed in the literature, most of them suitable only for off-line operation [1] [2], although some on-line learning methods for neural fuzzy systems exist. In [3], a fuzzy neural network exhibiting some self-adaptive properties was developed: an aligned clustering algorithm establishes a new rule based on a pre-specified criterion in a real-time environment, while the backpropagation algorithm optimizes the parameters of the new rule. In this paper, we propose a novel on-line clustering algorithm for the structure learning of our fuzzy neural networks. By adopting the "backpropagation" mechanism, supervised clustering algorithms such as Learning Vector Quantization (LVQ) [4] can be used in the structure learning of fuzzy neural networks. This paper is organized as follows. Section 2 introduces the structure of the fuzzy neural networks. Section 3 describes the new clustering algorithm for structure identification. Section 4 describes the parameter learning algorithm. In Section 5, these two algorithms are used to control the inverted pendulum to demonstrate their effectiveness. Finally, Section 6 concludes this paper.
2 Structure of Fuzzy Neural Networks In this section, we describe the structure of the fuzzy neural networks. As we can see in Fig. 1, the fuzzy neural networks have five layers with nodes. J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 616 – 622, 2006. © Springer-Verlag Berlin Heidelberg 2006
Layer 1: Each node in this layer only transmits input values to the next layer directly. Thus, the function of the ith node is

    f = u_i^1 = x_i,  and  a = f.    (1)

Layer 2: Each node in this layer corresponds to one linguistic label of one of the input variables in Layer 1. The operation performed in this layer is

    f = −(1/2) ( (u_i − m_ij) / σ_ij )²,  and  a = e^f,    (2)

where m_ij and σ_ij are, respectively, the center and the width of the Gaussian membership of the jth term of the ith input variable x_i.

Layer 3: Nodes in this layer are rule nodes and constitute the antecedents of the fuzzy rule base. The input and output functions of the jth rule node are

    f = Π_{i=1..n} u_i^3,  and  a = f.    (3)

Layer 4: The nodes in this layer are called "output-term nodes". The links in Layer 4 perform the fuzzy OR operation on rules that have the same consequent:

    f = Σ_{j=1..J} u_j^4,  and  a = min(1, f).    (4)

Layer 5: These nodes, together with the attached Layer 5 links, act as the defuzzifier. The following functions perform the Center Of Area (COA) defuzzification method:

    f = Σ_i w_ij^5 u_i^5 = Σ_i (m_ij σ_ij) u_i^5,  and  a = f / Σ_i σ_ij u_i^5.    (5)

Based on the above structure, an on-line learning algorithm will be proposed to determine the proper centers (m_ij's) and widths (σ_ij's) of the term nodes.
Fig. 1. Structure of fuzzy neural networks
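A minimal sketch of the forward pass through Layers 1-5 (Eqs. (1)-(5)); for simplicity it assumes one rule node per input cluster, each wired to its own output node, which is a simplification of the general rule/output wiring:

```python
import numpy as np

def forward(x, Im, Isig, Om, Osig):
    """Eqs. (1)-(5): forward pass, one rule per input cluster (assumption)."""
    # Layers 1-3: Gaussian memberships multiplied into rule activations
    a3 = np.exp(-0.5 * np.sum(((x - Im) / Isig) ** 2, axis=1))  # Eqs. (2)-(3)
    # Layer 4: fuzzy OR, capped at 1 (here each rule feeds its own node)
    a4 = np.minimum(1.0, a3)                                    # Eq. (4)
    # Layer 5: centre-of-area defuzzification
    return float(np.sum(Om * Osig * a4) / np.sum(Osig * a4))    # Eq. (5)

Im = np.array([[0.0], [1.0]]); Isig = np.ones((2, 1))           # input clusters
y = forward(np.array([0.0]), Im, Isig,
            Om=np.array([-1.0, 1.0]), Osig=np.array([1.0, 1.0]))
```

An input at the first cluster's center produces an output pulled toward that cluster's output center (−1), while the weaker activation of the second rule moderates it.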
3 Learning Algorithm for Structure Identification

In this section, we propose a novel clustering algorithm for the structure learning of the fuzzy neural networks. In this learning method, only the training data need to be provided from the outside world. At the beginning of learning there are no input/output-term nodes and no rule nodes; they are created dynamically as learning proceeds. The algorithm is as follows:

Step 1: Initialize the fuzzy system with zero clusters: In = 0, On = 0.

Step 2: The first input and output training vectors are selected as the centers of the first clusters in the input and output spaces, respectively, i.e., Im_1 = x_1, Iσ_1 = 0 and Om_1 = y_1, Oσ_1 = 0. We connect the first cluster in the input space to the first cluster in the output space, and set the number of data belonging to each cluster as Ic_1 = 1 and Oc_1 = 1.

Step 3: For an input data point [x_i, y_i], compute the distances between the input vector and the existing input-space clusters using the Euclidean metric:

    d_p = Σ_{l=1..q} ‖ x_i^l − Im_p^l ‖²,   0 ≤ p ≤ n,    (6)

where q is the dimension of the input training vector, x_i is the ith input training vector, n is the number of existing clusters, and l indexes the dimensions of the input training vector. The nearest cluster j (winner neuron j) is chosen by selecting the minimum d_j. If d_j is larger than a certain value d_vigilance, we assume that this input datum does not belong to any existing cluster; we form a new cluster and set In = In + 1, Im_j = x_j and Iσ_j = 0. The newly added cluster is the winner cluster in the input space. If d_j is smaller than d_vigilance, cluster j is the winner cluster in the input space. The same procedure is also applied in the output space, yielding the winner cluster there.

Step 4: We check the mapping from input clusters to output clusters.
(1) If the winner cluster in the input space is a new cluster, we connect it to the winner cluster in the output space, and update the center (Om_winner), variance (Oσ_winner) and counter (Oc_winner) of that output cluster:

    Oσ²_winner = [ Oc_winner × (Oσ²_winner + Om²_winner) + y² ] / (Oc_winner + 1) − ( (Om_winner × Oc_winner + y) / (Oc_winner + 1) )²,    (7)

    Om_winner = (Om_winner × Oc_winner + y) / (Oc_winner + 1),  and  Oc_winner = Oc_winner + 1.    (8)
(2) If the winner cluster of the input space is already connected to the winner cluster of the output space, we use the same update as in (1) for the center, variance and counter of the winner cluster in the input space (with the input sample x).
(3) If the winner cluster of the input space is not connected to the winner cluster of the output space, we punish the winner cluster of the input space:

    Iσ²_winner = [ Ic_winner × (Iσ²_winner + Im²_winner) − x² ] / (Ic_winner − 1) − ( (Im_winner × Ic_winner − x) / (Ic_winner − 1) )²,    (9)

    Im_winner = (Im_winner × Ic_winner − x) / (Ic_winner − 1),  and  Ic_winner = Ic_winner − 1.    (10)
After that, we return to Step 3 to search for another cluster in the input space that matches the winner cluster in the output space.
Unsupervised clustering algorithms are usually employed for the on-line structure learning of self-organizing fuzzy neural networks. Our structure learning algorithm is actually a supervised clustering algorithm; in this paper we show that supervised clustering algorithms are also suitable candidates. Adopting supervised clustering, the fuzzy neural network obtains its initial structure faster, and the training algorithms are more effective.
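The reward and penalty updates of Eqs. (7)-(10) can be sketched as follows; note that our reading of the typographically damaged source uses the output sample y in Eqs. (7)-(8) and decrements the counter in Eq. (10), so the penalty exactly undoes a matching reward:

```python
def reward_update(Om, Ovar, Oc, y):
    """Eqs. (7)-(8): absorb sample y into the winner cluster's
    running mean and variance; the counter is incremented."""
    m_new = (Om * Oc + y) / (Oc + 1)
    var_new = (Oc * (Ovar + Om ** 2) + y ** 2) / (Oc + 1) - m_new ** 2
    return m_new, var_new, Oc + 1

def penalty_update(Im, Ivar, Ic, x):
    """Eqs. (9)-(10): remove sample x from the winner cluster's
    running statistics; the counter is decremented."""
    m_new = (Im * Ic - x) / (Ic - 1)
    var_new = (Ic * (Ivar + Im ** 2) - x ** 2) / (Ic - 1) - m_new ** 2
    return m_new, var_new, Ic - 1

# Adding then removing the same sample restores the original statistics
m1, v1, c1 = reward_update(0.5, 0.25, 2, 2.0)
m2, v2, c2 = penalty_update(m1, v1, c1, 2.0)
```

Both updates are the standard incremental forms of the sample mean and variance, so clusters can be maintained pointwise without storing past data.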
4 Parameter Learning of Fuzzy Neural Networks

We use the backpropagation algorithm to tune the parameters of the fuzzy neural networks. Suppose y1(t) is the desired output and y(t) is the current output. If node k in Layer 4 is connected to node j in Layer 3, the output of node k in Layer 4 can be computed. We first introduce several quantities:

    f_k^4(t) = Σ_j Π_{i=1..Id} exp( −(x_i − Im_ij(t))² / (Iσ_ij(t))² ),    (11)

    f1(t) = Σ_{k=1..Od} Oσ_k(t) × f_k^4(t),    (12)

    f2(t) = Σ_{k=1..Od} Om_k(t) × Oσ_k(t) × f_k^4(t),    (13)

where the sum in (11) runs over the rule nodes j connected to node k, and Id and Od are the dimensions of the input and output data. The centers and variances of the clusters in Layer 5 are updated by

    Om_k(t+1) = Om_k(t) + η × [y1(t) − y(t)] × Oσ_k(t) × f_k^4(t) / f1(t),    (14)

    Oσ_k(t+1) = Oσ_k(t) + η × [y1(t) − y(t)] × [ Om_k(t) × f_k^4(t) × f1(t) − f2(t) × f_k^4(t) ] / f1²(t).    (15)

If node k in Layer 4 is connected to node j in Layer 3, the error propagated to node j in Layer 3 is

    error_j^3(t+1) = [y1(t) − y(t)] × [ Om_k(t) × f_k^4(t) × f1(t) − f2(t) × f_k^4(t) ] / f1²(t).    (16)
The centers and variances of the clusters in Layer 2 are updated by

    Im_ij(t+1) = Im_ij(t) + η × error_j^3(t) × f_j^3(t) × 2 × (x_i − Im_ij(t)) / (Iσ_ij(t))²,    (17)

    Iσ_ij(t+1) = Iσ_ij(t) + η × error_j^3(t) × f_j^3(t) × 2 × (x_i − Im_ij(t))² / (Iσ_ij(t))³.    (18)
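The Layer 5 updates of Eqs. (12)-(15) can be sketched as follows (η is an illustrative learning rate; the vector a4 collects the Layer 4 activations f_k^4):

```python
import numpy as np

def output_layer_update(Om, Osig, a4, y_d, y, eta=0.1):
    """Eqs. (12)-(15): gradient update of the output clusters' centers
    and widths, driven by the output error y_d - y."""
    f1 = np.sum(Osig * a4)                                        # Eq. (12)
    f2 = np.sum(Om * Osig * a4)                                   # Eq. (13)
    e = y_d - y
    Om_new = Om + eta * e * Osig * a4 / f1                        # Eq. (14)
    Osig_new = Osig + eta * e * (Om * a4 * f1 - f2 * a4) / f1**2  # Eq. (15)
    return Om_new, Osig_new

Om_new, Osig_new = output_layer_update(np.array([0.0, 1.0]),
                                       np.array([1.0, 1.0]),
                                       a4=np.array([1.0, 1.0]),
                                       y_d=1.0, y=0.5)
```

In the example the network underestimates the target, so both output centers are nudged upward and the widths shift so as to weight the higher-center cluster more in the COA average.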
The structure and parameter learning algorithms of the fuzzy neural networks will be examined using simulations in Section 5.
5 Simulations

In this section, we use the inverted pendulum problem to demonstrate the effectiveness of our fuzzy neural networks. The dynamics of the inverted pendulum system are characterized by four state variables: θ (angle of the pole with respect to the vertical axis), θ̇ (angular velocity of the pole), x (position of the cart on the track), and ẋ (velocity of the cart). The behavior of these state variables is governed by the following two second-order differential equations [5]:

    θ̈ = [ g sin θ + cos θ · ( −F − m l θ̇² sin θ ) / (m_c + m) ] / [ l ( 4/3 − m cos² θ / (m_c + m) ) ],

    ẍ = [ F + m · l · ( θ̇² sin θ − θ̈ cos θ ) ] / (m_c + m),    (19)

where g (acceleration due to gravity) is 9.8 m/s², m_c (mass of the cart) is 1.0 kg, m (mass of the pole) is 0.1 kg, l (half-length of the pole) is 0.5 m, and F is the applied force in Newtons, which can vary from −20 N to 20 N.
The differential equations in (19) are solved by the following difference equations with a time step of h = 0.02 s, using the Euler approximation. Let x1(t) = θ(t) and x2(t) = θ̇(t); then

    x1(t+h) = x1(t) + h · x2(t),   x2(t+h) = x2(t) + h · ẋ2(t),    (20)

where

    ẋ2(t) = [ g sin x1(t) − cos x1(t) · ( F(t) + m l x2²(t) sin x1(t) ) / (m_c + m) ] / [ l ( 4/3 − m cos² x1(t) / (m_c + m) ) ].    (21)
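Eqs. (20) and (21) give the following simulation step (parameter values are those stated below Eq. (19)):

```python
import math

G, MC, M, L, H = 9.8, 1.0, 0.1, 0.5, 0.02  # g, cart mass, pole mass, half-length, step

def theta_acc(x1, x2, F):
    """Eq. (21): angular acceleration of the pole."""
    num = G * math.sin(x1) - math.cos(x1) * (F + M * L * x2**2 * math.sin(x1)) / (MC + M)
    den = L * (4.0 / 3.0 - M * math.cos(x1)**2 / (MC + M))
    return num / den

def euler_step(x1, x2, F):
    """Eq. (20): one Euler step of size H for (theta, theta_dot)."""
    return x1 + H * x2, x2 + H * theta_acc(x1, x2, F)

x1, x2 = euler_step(0.05, 0.0, 0.0)
```

With no applied force and a small positive angle, the angular velocity after one step is positive, i.e., the unforced pole keeps falling, which is what the controller must counteract.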
In this paper, we use direct inverse control for the trajectory control of the inverted pendulum. The modeling mechanism of the inverse of the inverted pendulum system is shown in Fig. 2. After the fuzzy neural network is trained, it is put into the control of the inverted pendulum. The popular direct inverse control structure is shown in Fig. 3.
Fig. 2. Inverse system modeling

Fig. 3. Inverse control structure
Example 1. We first try to balance the inverted pendulum. The initial state is x1(0) = π/60, x2(0) = 0, and the number of iteration steps is T = 100. The controlled trajectory is illustrated in Fig. 4. From this simulation example, we can conclude that the angle of the inverted pendulum is successfully balanced by our fuzzy neural controller.

Example 2. In this example, we verify the tracking ability of our fuzzy neural controller on the sinewave signal y_d(t) = (π/30) sin(10t). The initial condition of the inverted pendulum is x1(0) = π/60, x2(0) = 0, and the number of iterations is T = 100. Figure 5 shows the controlled sinewave trajectory of the inverted pendulum with this initial state; the dotted line represents the desired trajectory. From the result, we see that the inverted pendulum is well controlled.

Fig. 4. Controlled balance trajectory of the inverted pendulum with x1(0) = π/60, x2(0) = 0

Fig. 5. Controlled sinewave trajectory of the inverted pendulum with x1(0) = π/60, x2(0) = 0
6 Conclusions

A novel clustering algorithm is proposed here for the structure learning of fuzzy neural networks. This clustering algorithm can on-line partition the input data and self-organize the fuzzy neural structure. No prior knowledge of the distribution of the input data is needed for the initialization of the fuzzy rules; they are automatically generated from the incoming training data. Our fuzzy neural networks can use this on-line training algorithm for both structure and parameter training. The effectiveness of our learning algorithm is verified by the trajectory control of the inverted pendulum.
Acknowledgements X. Z. Gao’s work was funded by the Academy of Finland under Grant 201353.
References
1. Jang, J.S.: ANFIS: Adaptive-Network-Based Fuzzy Inference System. IEEE Trans. Syst., Man, Cybern. 23(3) (1993) 665-685
2. Lin, C.T., Lee, C.S.G.: Neural-Network-Based Fuzzy Logic Control and Decision System. IEEE Trans. Comput. 40(12) (1991) 1320-1336
3. Juang, C.F., Lin, C.T.: An On-Line Self-Constructing Neural Fuzzy Inference Network and Its Applications. IEEE Trans. Fuzzy Syst. 6(1) (1998) 12-32
4. Kohonen, T.: The Self-Organizing Map. Proc. of the IEEE 78(9) (1990) 1464-1480
5. Williams, V., Matsuoka, K.: Learning to Balance the Inverted Pendulum Using Neural Networks. In: Proc. IEEE Int. Joint Conf. Neural Networks, Vol. 1 (1991) 214-219
Reinforcement Learning-Based Tuning Algorithm Applied to Fuzzy Identification

Mariela Cerrada1, Jose Aguilar1, and André Titli2

1 Universidad de Los Andes, Control Systems Department-CEMISID, Mérida, Venezuela
{cerradam, aguilar}@ula.ve
2 DISCO Group, LAAS-CNRS, Toulouse cedex 4, France
[email protected]
Abstract. In on-line applications, reinforcement learning based algorithms make it possible to take environment information into account in order to propose an action policy that serves the overall optimization objectives. In this work, a learning algorithm based on reinforcement learning and temporal differences is presented that allows on-line parameter adjustment for identification tasks. Accordingly, the reinforcement signal is defined generically so as to minimize the temporal difference.
1 Introduction
The Reinforcement Learning (RL) problem has been widely researched and applied in several areas [1, 2, 3, 4, 5, 6, 7, 8]. In dynamical environments, the learning agent receives rewards or penalties according to its performance in order to learn good actions. In identification problems, information from the environment is needed in order to propose an approximate model; thus, RL can be used for on-line information gathering. Off-line learning algorithms have reported suitable results in system identification; however, these results are bounded by the available data, their quality and quantity. In this sense, the development of on-line learning algorithms for system identification is an important contribution. In this work, an on-line learning algorithm based on RL using the Temporal Difference (TD) method is presented for identification purposes. Here, the basic propositions of RL with TD are used and, as a consequence, the linear TD(λ) algorithm proposed in [1] is modified and adapted for system identification, and the reinforcement signal is defined generically according to the temporal difference and the identification error. Thus, the main contribution of this paper is the proposal of a generic on-line identification algorithm based on RL. The proposed algorithm is applied to the parameter adjustment of a Dynamical Adaptive Fuzzy Model (DAFM) [9], and an illustrative example of time-varying non-linear identification is presented.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 623–630, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Theoretical Background

2.1 Reinforcement Learning and Temporal Differences
RL deals with the problem of learning based on trial and error in order to achieve an overall objective [1]. RL is related to problems where the learning agent does not know in advance what it must do. At time t, (t = 0, 1, 2, ...), the agent receives the state S_t and, based on this information, it chooses an action a_t. As a consequence, the agent receives a reinforcement signal or reward r_{t+1}. In the infinite-horizon case, a discount weights the received rewards and the discounted expected gain is defined as:

R_t = r_{t+1} + μ r_{t+2} + μ² r_{t+3} + ... = Σ_{k=0}^{∞} μ^k r_{t+k+1}   (1)
where μ, 0 ≤ μ ≤ 1, is the discount rate, which determines the present value of future rewards. On the other hand, the TD method solves the prediction problem by taking into account the difference (error) between two prediction values at successive instants, given by a function P. According to the TD method, the adjustment law for the parameter vector θ of the prediction function P(θ) is given by the following equation [2]:

θ_{t+1} = θ_t + η (P(x_{t+1}, θ_t) − P(x_t, θ_t)) ∂P(x_t, θ_t)/∂θ   (2)

where x_t is the vector of available data at time t and η, 0 ≤ η ≤ 1, is the learning rate. The term in parentheses is the temporal difference; equation (2) is the TD algorithm, and it can be used on-line in an incremental way. The RL problem can be viewed as a prediction problem where the objective is the estimation of the discounted gain defined by equation (1) by using the TD algorithm. Let R̂_t be the prediction of R_t; then, from equation (1) and by replacing the real value of R_{t+1} by its estimated value R̂_{t+1}, the prediction error between R_t and R̂_t is defined by equation (3), which describes a temporal difference:

Δ = R_t − R̂_t = r_{t+1} + μ R̂_{t+1} − R̂_t   (3)

By denoting R̂ as P and by replacing the temporal difference in (2) by the one defined by equation (3), the parameter adjustment law is [1]:

θ_{t+1} = θ_t + η (r_{t+1} + μ P(x_{t+1}, θ_t) − P(x_t, θ_t)) ∂P(x_t, θ_t)/∂θ   (4)
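The update law (4) can be sketched concretely for the special case of a linear prediction function P(x, θ) = θᵀx, an assumption made here only for illustration, so that ∂P/∂θ = x:

```python
import numpy as np

def td_update(theta, x_t, x_next, r_next, eta=0.1, mu=0.9):
    """One step of the parameter adjustment law (4), assuming a linear
    prediction function P(x, theta) = theta @ x, so dP/dtheta = x."""
    p_t = theta @ x_t                     # P(x_t, theta_t)
    p_next = theta @ x_next               # P(x_{t+1}, theta_t)
    delta = r_next + mu * p_next - p_t    # temporal difference of Eqn. (3)
    return theta + eta * delta * x_t

theta = td_update(np.zeros(3), np.array([1.0, 0.0, 0.5]),
                  np.array([0.0, 1.0, 0.5]), r_next=1.0)
```

With λ > 0, the gradient term x_t would be replaced by an eligibility trace accumulating past gradients, as in the linear TD(λ) algorithm of [1].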
2.2 Dynamical Adaptive Fuzzy Models
Without loss of generality, a MISO (Multiple Inputs-Single Output) fuzzy logic model is a linguistic model defined by the following M fuzzy rules:

R^(l): IF x_1 is F_1^l AND ... AND x_n is F_n^l THEN y is G^l
(5)
where x_i is a linguistic input variable on the domain of discourse U_i; y is the linguistic output variable on the domain of discourse V; F_i^l and G^l are fuzzy sets on U_i and V, respectively, (i = 1, ..., n) and (l = 1, ..., M), each one defined by its membership function. The DAFM is obtained from the previous rule base (5) by supposing input values defined by fuzzy singletons, gaussian membership functions for the fuzzy sets of the output variables, and defuzzification by the center-average method. Then, the inference mechanism provides the following model [9]:

y(X, t) = [Σ_{l=1}^{M} γ^l(u^l, t) Π_{i=1}^{n} exp(−(x_i − α_i^l(v_i^l, t))² / β_i^l(w_i^l, t))] / [Σ_{l=1}^{M} Π_{i=1}^{n} exp(−(x_i − α_i^l(v_i^l, t))² / β_i^l(w_i^l, t))]   (6)
where X = (x_1 x_2 ... x_n)^T is the vector of linguistic input variables x_i at time t; α(v, t_j), β(w, t_j) and γ(u, t_j) are time-dependent functions; v_i^l and w_i^l are parameters associated to the variable x_i in rule l; u^l is a parameter associated to the center of the output fuzzy set in rule l.

Definition 1. Let x_i(t_j) be the value of the input variable x_i to the DAFM at time t_j to obtain the output y(t_j). The generic structures of the functions α_i^l(v_i^l, t_j), β_i^l(w_i^l, t_j) and γ^l(u^l, t_j) in equation (6) are defined by the following equations [9]:

α_i^l(v_i^l, x_i(t_j)) = v_i^l · (Σ_{k=j−δ1}^{j} x_i(t_k)) / (δ1 + 1);  δ1 ∈ ℕ   (7)

β_i^l(w_i^l, σ_i²(t_j)) = w_i^l · ((Σ_{k=j−δ1}^{j} (x̄_i(t_j) − x_i(t_k))²) / (δ1 + 1) + ε);  ε ∈ ℝ⁺   (8)

γ^l(u^l, y(t_j)) = u^l · (Σ_{k=j−δ2}^{j−1} y(t_k)) / δ2;  δ2 ∈ ℕ   (9)

where x̄_i(t_j) denotes the mean of x_i over the same time window.
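A minimal sketch of evaluating the inference mechanism (6), with the membership parameters α_i^l, β_i^l and γ^l assumed already computed from Eqns. (7)-(9); the numerical values below are illustrative only, not taken from the paper:

```python
import numpy as np

def dafm_output(x, alpha, beta, gamma):
    """Inference mechanism of the DAFM, Eqn. (6).  alpha and beta are
    (M, n) arrays holding alpha_i^l and beta_i^l at the current time;
    gamma is the (M,) array of rule centers gamma^l."""
    # firing strength of rule l: prod_i exp(-(x_i - alpha_i^l)^2 / beta_i^l)
    w = np.exp(-(((x - alpha) ** 2) / beta).sum(axis=1))
    return float((gamma * w).sum() / w.sum())   # center-average defuzzification

# two rules (M = 2), two inputs (n = 2)
x = np.array([0.2, -0.1])
alpha = np.array([[0.0, 0.0], [1.0, 1.0]])
beta = np.ones((2, 2))
gamma = np.array([0.0, 1.0])
y = dafm_output(x, alpha, beta, gamma)
```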
3 RL-Based Identification Algorithm for DAFM
In this work, the fuzzy identification problem is solved by using the weighted identification error as the prediction function in the RL problem, and by suitably defining the reinforcement value according to the identification error. Thus, the minimization of the prediction error (3) leads to the minimization of the identification error. The critic (learning agent) is used to predict the performance of the identification as an approximator of the system's behavior. The prediction function is defined as a function of the identification error e(t, θ_t) = y(t) − y_e(t, θ_t), where y(t) denotes the real value of the system output at time t and y_e(t, θ_t) denotes the estimated value given by the identification model using the values of θ available at time t.
Let P be the proposed non-linear prediction function in equation (10):

P(x_t, θ_t) = (1/2) Σ_{k=t−K}^{t} (μλ)^{t−k} e²(k, θ_t)   (10)

where e(t, θ_t) = y(t) − y_e(t, θ_t) defines the identification error and K defines the size of the time interval. Then:

∂P(x_t, θ_t)/∂θ = Σ_{k=t−K}^{t} (μλ)^{t−k} e(k, θ_t) ∂e(k, θ_t)/∂θ   (11)
By replacing (11) into (4), the following learning algorithm for the parameter adjustment is obtained:

θ_{t+1} = θ_t + η (r_{t+1} + μ P(x_{t+1}, θ_t) − P(x_t, θ_t)) Σ_{k=t−K}^{t} (μλ)^{t−k} e(k, θ_t) ∂e(k, θ_t)/∂θ   (12)

The function P(x_{t+1}, θ_t) in equation (13) is obtained from (10) and, finally, by replacing (13) into (12), the proposed learning algorithm is given:

P(x_{t+1}, θ_t) = (1/2) e²(t + 1, θ_t) + μλ P(x_t, θ_t)   (13)
In the prediction problem of the discounted expected gain R_t, a good estimation of R_t given by R̂_t is expected; this implies that P(x_t, θ_t) goes to r_{t+1} + μ P(x_{t+1}, θ_t). This condition is obtained from equation (3). Given that the prediction function is a weighted sum of the squared identification error e²(t), it is expected that:

0 ≤ r_{t+1} + μ P(x_{t+1}, θ_t) < P(x_t, θ_t)   (14)

On the other hand, a suitable adjustment of the identification model means that the following condition is satisfied:

0 < P(x_{t+1}, θ_t) < P(x_t, θ_t)   (15)

The reinforcement r_{t+1} is defined so as to satisfy the expected condition (14), taking condition (15) into account. Then, by using equations (10) and (13):

r_{t+1} = 0,  if P(x_{t+1}, θ_t) ≤ P(x_t, θ_t)
r_{t+1} = −(1/2) μ e²(t + 1, θ_t),  if P(x_{t+1}, θ_t) > P(x_t, θ_t)   (16)
In this way, the identification error in the prediction function P(x_{t+1}, θ_t), according to equation (13), is rejected by using the reinforcement in equation (16). The learning rate η in (12) is defined by equation (17). Parameters μ and λ can depend on the system dynamics: small values in the case of slow dynamical systems, and values around 1 in the case of fast dynamical systems.

η(t) = η(t − 1) / (ρ + η(t − 1)),  0 < ρ < 1   (17)
The proposed identification learning algorithm can be studied as a gradient-descent method with respect to the parametric predictive function P. In the gradient-descent method, the objective is to find the minimal value of an error measure on the parameter space, denoted by J(θ), by using the following parameter adjustment:

θ_{t+1} = θ_t + Δθ_t = θ_t + 2α (E{z|x_t} − P(x_t, θ)) ∇_θ P(x_t, θ)   (18)

In this case, the error measure is defined as:

J(θ, x) = (E{z|x} − P(x, θ))²   (19)

where E{z|x} is the expected value of the real value z, given the knowledge of the available data x. In this work, the learning algorithm (12) is like the learning algorithm (18), based on the gradient-descent method, where r_{t+1} + μ P(x_{t+1}, θ_t) plays the role of the expected value E{z|x} in (19). By appropriately selecting r_{t+1} according to (16), the expected value in the learning problem is defined in two ways:

E{z|x} = μ P(x_{t+1}, θ_t),  if P(x_{t+1}, θ_t) ≤ P(x_t, θ_t)   (20)

or

E{z|x} = μ² λ P(x_t, θ_t),  if P(x_{t+1}, θ_t) > P(x_t, θ_t)   (21)

Then, the parameter adjustment is made at each iteration in order to attain the expected value of the prediction function P, according to the predicted value P(x_{t+1}, θ_t) and the real value P(x_t, θ_t). In both cases, the expected value is smaller than the obtained real value P(x_t, θ_t), and the selected value of r_{t+1} defines the magnitude of the defined error measure.
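Putting Eqns. (11)-(13), (16) and (17) together, one iteration of the proposed tuning law can be sketched as follows. The function and variable names are ours, and the error history e(k, θ_t) and its gradients are assumed to be supplied by the identification model:

```python
import numpy as np

def rl_tuning_step(theta, P_t, e_hist, e_next, grad_e_hist, eta, mu=0.9, lam=0.9):
    """One iteration of the RL-based tuning law.
    e_hist: errors e(k, theta_t) for k = t-K..t, shape (K+1,);
    grad_e_hist: gradients de(k)/dtheta, shape (K+1, dim(theta));
    P_t: current value P(x_t, theta_t)."""
    K = len(e_hist) - 1
    decay = (mu * lam) ** np.arange(K, -1, -1)            # weights (mu*lam)^(t-k)
    P_next = 0.5 * e_next ** 2 + mu * lam * P_t           # Eqn. (13)
    r_next = 0.0 if P_next <= P_t else -0.5 * mu * e_next ** 2   # Eqn. (16)
    grad_P = (decay * e_hist) @ grad_e_hist               # Eqn. (11)
    theta_new = theta + eta * (r_next + mu * P_next - P_t) * grad_P  # Eqn. (12)
    return theta_new, P_next

def eta_schedule(eta_prev, rho=0.5):
    """Decreasing learning rate, Eqn. (17)."""
    return eta_prev / (rho + eta_prev)

theta_new, P_next = rl_tuning_step(np.zeros(2), 1.0, np.array([0.0, 1.0]),
                                   0.0, np.eye(2), eta=0.1)
eta1 = eta_schedule(0.01)   # the example below starts from eta(0) = 0.01
```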
4 Illustrative Example
This section shows an illustrative example of the fuzzy identification of time-varying non-linear systems by using the proposed on-line RL-based identification algorithm to adjust the parameters v_i^l, w_i^l and u^l of the DAFM described in Section 2.2. The performance of the fuzzy identification is evaluated according to the relative identification error e_r = (y(t) − y_e(t))/y(t), normalized on [0, 1]. The system is described by the following difference equation:

y(t + 1) = [y(t) y(t − 1) y(t − 2) u(t − 1) (y(t − 2) − 1) + u(t)] / [a(t) + y(t − 2)² + y(t − 1)²] = g[·]   (22)
Fig. 1. Fuzzy identification using off-line tuning algorithm (top: relative error vs. time (sec); bottom: real output (solid) and estimated output (dashed))
where a(t) = 1 + 0.1 sin(2πt/100). In this case, the unknown function g[·] is estimated by using the DAFM and, additionally, a sudden change in a(t) is introduced by setting a(t) = 5 for t > 450. Figure 1 shows the performance of the DAFM using the off-line gradient-based tuning algorithm, with initial conditions on the interval [0, 1] and using the input signal (23). After an extensive training phase, the fuzzy model with M = 8 is chosen.

u(t) = 1.5 + (0.8 sin(2πt/250) + 0.2 sin(2πt/25)),  if 301 < t < 550
u(t) = sin(2πt/250),  otherwise   (23)

In the following, the fuzzy identification performance using the proposed RL-based tuning algorithm is presented. Here, λ = μ = 0.9, K = 5, and the learning rate is set up by equation (17) with η(0) = 0.01. After experimental trials, performance approaching the accuracy obtained by off-line adjustment is obtained with M = 20; figure 2 shows the tuning algorithm performance. However, a good performance is also obtained with M = 8. Table 1 shows the comparative values of the RMSE. Figure 3 shows the sensitivity of the algorithm to the initial conditions, and figure 4 shows the algorithm performance under changes in the internal dynamics, taking a(t) = 1 + 0.3 sin(2πt/10).

Table 1. Comparison between the proposed on-line algorithm and off-line tuning

M    RMSE on-line   RMSE off-line
8    0.0323         0.0156
10   0.0339         0.0837
15   0.0339         0.0308
20   0.0205         0.1209
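The plant (22) with the input signal (23) and the sudden change a(t) = 5 for t > 450 can be reproduced directly; this is an illustrative sketch, not the authors' code:

```python
import numpy as np

def u(t):
    """Input signal of Eqn. (23)."""
    if 301 < t < 550:
        return 1.5 + 0.8 * np.sin(2 * np.pi * t / 250) + 0.2 * np.sin(2 * np.pi * t / 25)
    return np.sin(2 * np.pi * t / 250)

def simulate(T=800, t_change=450):
    """Plant of Eqn. (22); a(t) switches to 5 after t_change."""
    y = np.zeros(T + 1)
    for t in range(2, T):
        a = 5.0 if t > t_change else 1.0 + 0.1 * np.sin(2 * np.pi * t / 100)
        num = y[t] * y[t-1] * y[t-2] * u(t-1) * (y[t-2] - 1) + u(t)
        y[t + 1] = num / (a + y[t-2] ** 2 + y[t-1] ** 2)
    return y

y = simulate()
```

An identification experiment would feed the pairs (x_t, y(t+1)) from this loop to the DAFM and apply the tuning law at each step.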
Fig. 2. Fuzzy identification using RL-based tuning algorithm. Initial conditions on [0.5, 1.5]. (top: relative error vs. time index; bottom: real output (solid) and estimated output (dashed))
Fig. 3. Fuzzy identification using RL-based tuning algorithm. Initial conditions on [0, 1]. (top: relative error vs. time index; bottom: real output (solid) and estimated output (dashed))
Fig. 4. Fuzzy identification using RL-based tuning algorithm (top: relative error vs. time index; bottom: real output (solid) and estimated output (dashed))
The previous tests show that the performance and sensitivity of the proposed on-line algorithm are adequate with respect to the initial conditions of the DAFM parameters, changes in the internal dynamics, and changes in the input signal. Table 1 also shows that the number of rules M does not strongly determine the global performance of the proposed on-line algorithm.
5 Conclusions
In this work, an on-line tuning algorithm based on reinforcement learning for identification problems has been proposed. Both the prediction function and the reinforcement signal have been defined by taking into account the identification error, and the obtained algorithm can be studied as a gradient-descent-based method. In order to show the performance of the algorithm, an illustrative example of time-varying non-linear system identification using a DAFM has been developed. The performance of the on-line algorithm is adequate with respect to the main aspects to be taken into account in on-line identification: the initial conditions of the model parameters, changes in the internal dynamics, and changes in the input signal. This highlights the usefulness of on-line learning algorithms, and the proposed RL-based on-line tuning algorithm could be an important contribution to system identification in dynamical environments with perturbations, for example, in the process control area.
References
1. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. The MIT Press, Cambridge (1998)
2. Sutton, R.: Learning to Predict by the Methods of Temporal Differences. Machine Learning 3 (1988) 9-44
3. Miller, S., Williams, R.: Temporal Difference Learning: A Chemical Process Control Application. In: Murray, A. (ed.): Applications of Artificial Neural Networks. Kluwer, Norwell (1995)
4. Singh, S., Sutton, R.: Reinforcement Learning with Replacing Eligibility Traces. Machine Learning 22 (1995) 123-158
5. Schapire, R., Warmuth, M.: On the Worst-case Analysis of Temporal Difference Learning Algorithms. Machine Learning 22 (1996) 95-121
6. Tesauro, G.: Temporal Difference Learning and TD-Gammon. Communications of the Association for Computing Machinery 38(3) (1995) 58-68
7. Si, J., Wang, Y.: On-Line Learning Control by Association and Reinforcement. IEEE Transactions on Neural Networks 12(2) (2001) 264-276
8. Van-Buijtenen, W., Schram, G., Babuska, R., Verbruggen, H.: Adaptive Fuzzy Control of Satellite Attitude by Reinforcement Learning. IEEE Transactions on Fuzzy Systems 6(2) (1998) 185-194
9. Cerrada, M., Aguilar, J., Colina, E., Titli, A.: Dynamical Membership Functions: An Approach for Adaptive Fuzzy Modeling. Fuzzy Sets and Systems 152 (2005) 513-533
A New Learning Algorithm for Function Approximation Incorporating A Priori Information into Extreme Learning Machine

Fei Han1,2, Tat-Ming Lok3, and Michael R. Lyu4

1 Intelligent Computing Lab, Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, P.O. Box 1130, Hefei, Anhui 230031, China
2 Department of Automation, University of Science and Technology of China, Hefei 230027, China
3 Information Engineering Dept., The Chinese University of Hong Kong, Shatin, Hong Kong
4 Computer Science & Engineering Dept., The Chinese University of Hong Kong, Shatin, Hong Kong
[email protected],
[email protected],
[email protected]
Abstract. In this paper, a new algorithm for function approximation is proposed to obtain better generalization performance and a faster convergence rate. The new algorithm incorporates architectural constraints derived from a priori information of the function approximation problem into the Extreme Learning Machine. On the one hand, according to the Taylor theorem, the activation functions of the hidden neurons in this algorithm are polynomial functions. On the other hand, the Extreme Learning Machine is adopted, which analytically determines the output weights of a single-hidden-layer FNN. In theory, the new algorithm tends to provide the best generalization at an extremely fast learning speed. Finally, several experimental results are given to verify the efficiency and effectiveness of the proposed learning algorithm.
1 Introduction

Most traditional learning algorithms for feedforward neural networks (FNN) use the backpropagation (BP) algorithm to derive the update formulae of the weights [1]. However, these learning algorithms have the following major drawbacks. First, they are apt to be trapped in local minima. Second, they do not take into account the network structure features or the properties of the problem involved, so their generalization capabilities are limited [2-7]. Finally, since gradient-based learning is time-consuming, they converge very slowly [8-9]. In [10-11], a learning algorithm referred to as the Hybrid-I method was proposed. In this algorithm, cost terms for the additional functionality based on the first-order derivatives of the neural activations at the hidden layers were designed to penalize the input-to-output mapping sensitivity. In [12], a modified hybrid learning algorithm (MHLA) based on the Hybrid-I algorithm was proposed to improve the generalization performance. Nevertheless, the experimental results show that the computational requirements of the above two algorithms

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 631 – 636, 2006. © Springer-Verlag Berlin Heidelberg 2006
are actually relatively large. These learning algorithms can improve the generalization performance to some degree, but none of them yields the best result. In [13], the relations between the single-hidden-layer FNN (SLFN) and the corresponding hidden-layer-to-output-layer network were studied in depth. In [8-9], a learning algorithm for the SLFN called the Extreme Learning Machine (ELM) was proposed. ELM randomly chooses the input weights and analytically determines the output weights of the SLFN through a simple generalized inverse operation of the hidden layer output matrix. Therefore, ELM has good generalization performance with a much faster learning speed. However, ELM does not consider the network structure features or the properties of the problem involved either, so its generalization performance is also limited to some extent. In this paper, a new learning algorithm for the function approximation problem incorporating a priori information into ELM is proposed. The new learning algorithm selects the activation functions of the hidden neurons as polynomial functions on the basis of the Taylor series expansion. Moreover, the new algorithm analytically determines the output weights of the SLFN through a simple generalized inverse operation of the hidden layer output matrix, according to ELM. Finally, theoretical justification and simulation results are given to verify the better generalization performance and faster convergence rate of the proposed constrained learning algorithm.
2 Extreme Learning Machine

In order to find an effective solution to the problems caused by the BP learning algorithm, Huang [8-9] proposed ELM. Since a feedforward neural network with a single nonlinear hidden layer is capable of forming an arbitrarily close approximation of any continuous nonlinear mapping, ELM is restricted to such networks. Consider N arbitrary distinct samples (x_i, t_i), where x_i = [x_i1, x_i2, ..., x_in]^T ∈ R^n and t_i = [t_i1, t_i2, ..., t_im]^T ∈ R^m. That the SLFN with H hidden neurons and activation function g(x) can approximate these N samples with zero error means that

H wo = T   (1)
where

H(wh_1, wh_2, ..., wh_H, b_1, b_2, ..., b_H, x_1, x_2, ..., x_N) =

| g(wh_1 · x_1 + b_1)  ...  g(wh_H · x_1 + b_H) |
|        ...           ...         ...          |
| g(wh_1 · x_N + b_1)  ...  g(wh_H · x_N + b_H) |  (N × H)

wo = [wo_1^T, ..., wo_H^T]^T  (H × m),   T = [t_1^T, ..., t_N^T]^T  (N × m)   (2)

where wh_i = [wh_i1, wh_i2, ..., wh_in]^T is the weight vector connecting the i-th hidden neuron and the input neurons, wo_i = [wo_i1, wo_i2, ..., wo_im]^T is the weight vector connecting the i-th hidden neuron and the output neurons, and b_i is the threshold of the i-th hidden neuron. To make ELM easier to understand, the following theorem is introduced:
Theorem 2.1 [14]. Let there exist a matrix G such that Gy is a minimum-norm least-squares solution of a linear system Ax = y. Then it is necessary and sufficient that G = A⁺, the Moore-Penrose generalized inverse of matrix A.

In the course of learning, first, the input weights wh_i and the hidden layer biases b_i are arbitrarily given and need not be adjusted at all. Second, according to Theorem 2.1, the smallest-norm least-squares solution of the linear Eqn. (1) is:

wo = H⁺ T   (3)

From the above discussion, it can be seen that ELM attains the minimum training error and the smallest norm of weights, and the smallest norm of weights tends to yield the best generalization performance. Since the smallest-norm least-squares solution of Eqn. (1) is obtained by an analytical method and none of the parameters of the SLFN need to be adjusted iteratively, ELM converges much faster than the BP algorithm.
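The two-step procedure (a random, fixed hidden layer followed by the pseudoinverse solution wo = H⁺T) can be sketched as follows; this is an illustration with sigmoid hidden units, not the authors' code:

```python
import numpy as np

def elm_train(X, T, H=10, seed=0):
    """ELM training sketch: random input weights and biases stay fixed;
    the output weights are the minimum-norm least-squares solution of
    Eqn. (1), computed with the Moore-Penrose pseudoinverse (Eqn. (3))."""
    rng = np.random.default_rng(seed)
    wh = rng.standard_normal((H, X.shape[1]))      # input weights wh_i (fixed)
    b = rng.standard_normal(H)                     # hidden biases b_i (fixed)
    Hmat = 1.0 / (1.0 + np.exp(-(X @ wh.T + b)))   # N x H hidden output matrix
    wo = np.linalg.pinv(Hmat) @ T                  # Eqn. (3): wo = H+ T
    return wh, b, wo

def elm_predict(X, wh, b, wo):
    return (1.0 / (1.0 + np.exp(-(X @ wh.T + b)))) @ wo

# fit the benchmark y = sin(2x)/2 used later in the paper
X = np.linspace(0.0, np.pi, 126).reshape(-1, 1)
T = np.sin(2 * X) / 2
wh, b, wo = elm_train(X, T, H=10)
mse = np.mean((elm_predict(X, wh, b, wo) - T) ** 2)
```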
3 New Learning Algorithm Incorporating a Priori Information into ELM

3.1 Architectural Constraints from a Priori Information

According to the Taylor theorem, if a function meets the conditions that the Taylor theorem requires, the function has the corresponding Taylor expansion as follows:

f(x) = f(x_0) + Σ_{k=1}^{n} (x^k / k!) f^(k)(x_0) + (x^(n+1) / (n+1)!) f^(n+1)(ξ),  x, x_0, ξ ∈ D(f(x)),  ξ ∈ (x, x_0) or ξ ∈ (x_0, x)   (4)
where D(f(x)) denotes the domain of definition of the function f(x). From Eqn. (4), it can be seen that a function that meets the conditions of the Taylor theorem can be expressed as a weighted sum of polynomial functions. In order to approximate the function f(x) more accurately by the FNN φ(x), we let the FNN φ(x) be expressed as a weighted sum of polynomial functions according to the above a priori information. So an SLFN is adopted for approximating the function, and the transfer function of the k-th hidden neuron is selected as the function x^k / k!, (k = 1, 2, ..., n). Then the FNN φ(x) can be expressed as follows:

φ(x) = Σ_{k=1}^{n} wo_k wh_k (x^k / k!) − wo_{n+1}   (5)

where wo_k denotes the synaptic weight from the output neuron to the k-th neuron at the hidden layer, and wh_k denotes the synaptic weight from the k-th neuron at the hidden layer to the input neuron. The output layer is a linear neuron.
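As a quick numerical check of the truncation in Eqn. (4) with x_0 = 0, a degree-9 polynomial of the form Σ x^k f^(k)(0)/k! already tracks sin(x) to within about 0.01 on [−3.1, 3.1]:

```python
import math

def taylor_sin(x, n=9):
    """Truncated expansion of sin(x) around 0, Eqn. (4) with x0 = 0;
    only the odd derivatives of sin are nonzero at 0."""
    return sum((-1) ** (k // 2) * x ** k / math.factorial(k)
               for k in range(1, n + 1, 2))

# worst-case truncation error over a grid on [-3.1, 3.1]
err = max(abs(taylor_sin(x / 10) - math.sin(x / 10)) for x in range(-31, 32))
```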
3.2 New Learning Algorithm
In order to improve the generalization performance and obtain a faster convergence rate, a new algorithm incorporating a priori information into ELM is proposed as follows. First, according to Subsection 3.1, an SLFN as shown in Eqn. (5) is adopted for approximating the function. The weights from the input layer to the hidden layer are all fixed to one, i.e., wh_k = 1, k = 1, 2, ..., n. Then, according to Section 2, the weights from the output neuron to the hidden neurons are analytically determined by Eqn. (3). Since these weights are determined analytically, the learning speed of the new algorithm can be thousands of times faster than that of the BP algorithm. Moreover, according to Eqn. (3), since the smallest-norm least-squares solution is obtained, the new algorithm tends to have better generalization performance. Finally, since the new learning algorithm incorporates architectural constraints from a priori information into the SLFN, it has better generalization performance than ELM. From this new algorithm, the following conclusion can easily be deduced:

Conclusion 1. Assume that the FNN φ(x), which is expressed as Eqn. (5), is used to approximate the function f(x) by the above new learning algorithm. The function f(x) meets the conditions that the Taylor theorem requires and 0 ∈ D(f(x)). Then the following equations can be obtained:
wo_k ≈ f^(k)(0), k = 1, 2, ..., n;   wo_{n+1} ≈ −f(0)   (6)

Proof. Comparing Eqn. (4) and Eqn. (5), we notice that f(x) ≈ φ(x) and wh_k = 1, (k = 1, 2, ..., n) in the new learning algorithm. Therefore, Eqn. (6) can easily be deduced. Q.E.D.
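A sketch of the whole procedure (the names are ours): hidden unit k outputs x^k/k! with wh_k = 1, a constant unit carries wo_{n+1}, and the output weights are obtained analytically as in ELM. Fitting the benchmark y = sin(2x)/2 also lets Conclusion 1 be checked numerically, since f'(0) = 1:

```python
import numpy as np
from math import factorial

def poly_elm_fit(x, y, n=10):
    """Hidden unit k outputs x^k / k! (input weights wh_k fixed to 1);
    a constant column carries -wo_{n+1} as in Eqn. (5); output weights
    are the minimum-norm least-squares solution, as in ELM."""
    Hmat = np.column_stack([x ** k / factorial(k) for k in range(1, n + 1)]
                           + [-np.ones_like(x)])
    wo = np.linalg.pinv(Hmat) @ y          # analytic output weights
    return wo, Hmat

x = np.linspace(0.0, np.pi, 126)
y = np.sin(2 * x) / 2
wo, Hmat = poly_elm_fit(x, y)
mse = np.mean((Hmat @ wo - y) ** 2)
```

Here wo[0] should be close to f'(0) = 1, in line with Conclusion 1.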
4 Experimental Results

To demonstrate the improved generalization performance and fast convergence rate of the new learning algorithm, we conduct experiments with two functions: a bimodal function y = sin(2x)/2 and a multimodal function y = (1 − (40x/π) + 2(40x/π)² − 0.4(40x/π)³) e^(−x/2). In this section, the new algorithm is compared with the traditional BP algorithm, the Hybrid-I algorithm, MHLA and ELM. The activation functions of the neurons in all layers for the BP algorithm, the Hybrid-I algorithm and MHLA are the tangent sigmoid function. The activation functions of the hidden neurons for ELM are the sigmoid function. In all five learning algorithms, the number of hidden neurons is 10. For each function, 126 training samples are selected from [0, π] at identically spaced intervals. Likewise, 125 testing samples are selected from [0.0125, π − 0.0125] at identically spaced intervals. In order to statistically compare the approximation accuracies and CPU time for the two functions with the above five algorithms, we conducted the experiments fifty times for each algorithm; the corresponding results are summarized in Tables 1-2.
Table 1. The approximation accuracies and CPU time for y = sin(2x)/2 with the five algorithms

LA        Training error   Testing error   CPU time
BP        1.2956e-5        1.1925e-5       53.5160s
Hybrid-I  5.0663e-6        5.0400e-6       65.7138s
MHLA      2.6359e-6        2.5472e-6       75.5460s
ELM       5.3231e-11       4.8595e-11      0.0631s
New LA    2.3636e-12       2.1346e-12      0.2020s

Table 2. The approximation accuracies and CPU time for y = (1 − (40x/π) + 2(40x/π)² − 0.4(40x/π)³) e^(−x/2) with the five algorithms

LA        Training error   Testing error   CPU time
BP        5.2511e-4        4.5036e-4       54.0630s
Hybrid-I  2.6711e-4        2.1255e-4       75.5628s
MHLA      1.5123e-4        1.1102e-4       85.8660s
ELM       5.3450e-6        5.1332e-6       0.0825s
New LA    1.4665e-6        1.0535e-6       0.3523s
From the above results, the following conclusions can be drawn. First, the generalization performance of the new algorithm and ELM is much better than that of the BP algorithm, the Hybrid-I algorithm and MHLA, because the testing error of the new algorithm and ELM is much smaller than that of the other three algorithms. This result rests on the fact that the new algorithm and ELM obtain the smallest-norm least-squares solution through Eqn. (3), whereas the other three algorithms do not. Second, the new algorithm and ELM converge much faster than the BP algorithm, the Hybrid-I algorithm and MHLA. This is because the new learning algorithm and ELM obtain the solution by an analytical method, whereas the other three algorithms obtain the solution through thousands of iterative calculations. Third, compared with ELM, the new algorithm has better generalization. This is chiefly because the new algorithm considers a priori information from the function approximation problem. Finally, the new learning algorithm converges slightly more slowly than ELM. This rests on the fact that the new algorithm requires more time to calculate the hidden neuron outputs than ELM.
5 Conclusions

In this paper, a new learning algorithm incorporating architectural constraints into ELM was proposed for the function approximation problem. The architectural constraints are extracted from a priori information of the approximated function based on the Taylor series expansion, and they are realized by selecting the activation functions of the hidden neurons as polynomial functions. Furthermore, the new algorithm analytically determines the output weights of the SLFN through a simple generalized inverse operation of the hidden layer output matrix
according to ELM. Therefore, the new algorithm has much better generalization performance and a faster convergence rate than traditional gradient-based learning algorithms. Finally, theoretical justification and simulation results were given to verify the efficiency and effectiveness of the proposed new learning algorithm. Future research will include applying this new learning algorithm to more numerical computation problems.
Acknowledgement This work was supported by the National Science Foundation of China (Nos.60472111, 30570368 and 60405002).
References
1. Ng, S.C., Cheung, C.C., Leung, S.H.: Magnified Gradient Function with Deterministic Weight Modification in Adaptive Learning. IEEE Transactions on Neural Networks 15(6) (2004) 1411-1423
2. Baum, E., Haussler, D.: What Size Net Gives Valid Generalization? Neural Comput. 1(1) (1989) 151-160
3. Huang, D.S.: A Constructive Approach for Finding Arbitrary Roots of Polynomials by Neural Networks. IEEE Transactions on Neural Networks 15(2) (2004) 477-491
4. Huang, D.S., Chi, Z.: Finding Roots of Arbitrary High Order Polynomials Based on Neural Network Recursive Partitioning Method. Science in China Ser. F Information Sciences 47(2) (2004) 232-245
5. Huang, D.S., Ip, Horace H.S., Chi, Z.: A Neural Root Finder of Polynomials Based on Root Moments. Neural Computation 16(8) (2004) 1721-1762
6. Huang, D.S., Ip, Horace H.S., Chi, Z., Wong, H.S.: Dilation Method for Finding Close Roots of Polynomials Based on Constrained Learning Neural Networks. Physics Letters A 309(5-6) (2003) 443-451
7. Karras, D.A.: An Efficient Constrained Training Algorithm for Feedforward Networks. IEEE Trans. Neural Networks 6(6) (1995) 1420-1434
8. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme Learning Machine: A New Learning Scheme of FNN. In: 2004 International Joint Conference on Neural Networks (IJCNN 2004), July 25-29, Budapest, Hungary (2004) 985-990
9. Huang, G.B., Siew, C.K.: Extreme Learning Machine with Randomly Assigned RBF Kernels. International Journal of Information Technology 11(1) (2005) 16-24
10. Jeong, S.Y., Lee, S.Y.: Adaptive Learning Algorithms to Incorporate Additional Functional Constraints into Neural Networks. Neurocomputing 35(1-4) (2000) 73-90
11. Jeong, D.G., Lee, S.Y.: Merging Back-propagation and Hebbian Learning Rules for Robust Classifications. Neural Networks 9(7) (1996) 1213-1222
12. Han, F., Huang, D.S., Cheung, Y.-M., Huang, G.B.: A New Modified Hybrid Learning Algorithm for FNN. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag (2005) 572-577
13. Huang, D.S.: Systematic Theory of Neural Networks for Pattern Recognition. Publishing House of Electronic Industry of China, Beijing (1996) 109-110
14. Serre, D.: Matrices: Theory and Application. Springer-Verlag (2002) 147
Robust Recursive Complex Extreme Learning Machine Algorithm for Finite Numerical Precision

Junseok Lim 1, Koeng Mo Sung 2, and Joonil Song 3

1 Dept. of Electronics Engineering, Sejong University, Kunja, Kwangjin, 98, 143-747, Seoul, Korea
[email protected]
2 School of Electrical Engineering, Seoul National University, Seoul, Korea
3 Samsung Electronics Co., Ltd.

Abstract. Recently, a new learning algorithm for the single-hidden-layer feedforward neural network (SLFN) named the complex extreme learning machine (C-ELM) has been proposed in [1]. In this paper, we propose a numerically robust recursive least-squares-type C-ELM algorithm. The proposed algorithm improves the performance of C-ELM, especially under finite numerical precision. Computer simulation results for various precision levels show that the proposed algorithm improves the numerical robustness of C-ELM.
1 Introduction
Recently, a new learning algorithm for the single-hidden-layer feedforward neural network (SLFN) named the complex extreme learning machine (C-ELM) has been proposed by Huang et al. [1]. Unlike traditional approaches (such as BP algorithms), which may face difficulties in manually tuning control parameters (learning rate, learning epochs, etc.), C-ELM avoids such issues and reaches good solutions analytically. The learning speed of C-ELM is extremely fast compared to other traditional methods. The simulation results in [1] have shown that the C-ELM equalizer is superior to the CRBF, CMRAN, and CBP equalizers in terms of symbol error rate (SER) and learning speed. In algorithmic structure, C-ELM is based on the block least-squares method. Therefore, it is easy to adopt the RLS (recursive least squares) method for real-time applications, as CRBF and CBP do. However, due to its lack of numerical robustness, RLS causes trouble in real implementations with finite precision. The source of these numerical instabilities in RLS schemes is the update of the inverse of the exponentially windowed input-signal autocorrelation matrix,

R_xx(n) = λ^n δ I + Σ_{k=1}^{n} λ^{n−k} x(k) x^T(k),   (1)
where x(k) = [x_1(k) x_2(k) ... x_N(k)]^T, k ≥ 1, is the input signal vector sequence, N is the number of system parameters, δ is a small positive constant,

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 637–643, 2006.
© Springer-Verlag Berlin Heidelberg 2006
and λ, 0 < λ < 1, is the forgetting factor. Several authors have shown that the primary source of numerical instability is the loss of symmetry in the numerical representation of R_xx(n) [2]. This fact has spurred the development of alternative QR-decomposition-based least-squares (QRD-LS) methods [2] that employ the QR decomposition of the weighted data matrix

X^T(n) = [(λ^n δ)^{1/2} I   λ^{(n−1)/2} x(1)   λ^{(n−2)/2} x(2)   ···   x(n)]   (2)

to calculate an upper-triangular square-root factor, or Cholesky factor, R(n) of R_xx(n) = R^T(n) R(n) = X^T(n) X(n). QRD-LS methods possess good numerical properties [2]. Even so, QRD-LS methods are less popular because they require N divide and N square-root operations at each time step, calculations that are difficult to perform on typical DSP hardware. In [3], a new RLS algorithm has been proposed; it is similar to a recently developed natural-gradient prewhitening procedure [4], [5]. In this paper we propose a new C-ELM algorithm based on the least-squares prewhitening in [3] and show that it improves the estimation performance under finite precision. This paper is organized as follows. Section 2 summarizes C-ELM. In Section 3, we propose a numerically robust recursive C-ELM algorithm. Section 4 compares the performance of the conventional C-ELM equalizer with that of the proposed one. Discussions and conclusions are given in Section 5.
2 Complex Extreme Learning Machine (C-ELM) Algorithm
Given a series of complex-valued training samples (z(i), d(i)), i = 1, 2, ..., N, where z(i) ∈ C^p and d(i) ∈ C^1, the actual outputs of the single-hidden-layer feedforward network (SLFN) with complex activation function g_c(z) for these N training data are given by

o_i = Σ_{k=1}^{Ñ} β_k g_c(w_k · z_i + b_k),   i = 1, ..., N,   (3)

where the column vector w_k ∈ C^p is the complex input weight vector connecting the input-layer neurons to the kth hidden neuron, β_k ∈ C^1 is the complex output weight connecting the kth hidden neuron to the output neuron, and b_k ∈ C^1 is the complex bias of the kth hidden neuron. w_k · z_i denotes the inner product of the column vectors w_k and z_i, and g_c(z) is a fully complex activation function. The above N equations can be written compactly as

Hβ = o,   (4)

and in practical applications the number Ñ of hidden neurons is usually much less than the number N of training samples, so that Hβ = d, where

H(w_1, ..., w_Ñ, z_1, ..., z_N, b_1, ..., b_Ñ) =
⎡ g_c(w_1 · z_1 + b_1)  ···  g_c(w_Ñ · z_1 + b_Ñ) ⎤
⎢         ...           ...          ...          ⎥ ,   (5)
⎣ g_c(w_1 · z_N + b_1)  ···  g_c(w_Ñ · z_N + b_Ñ) ⎦
β = [β_1, ..., β_Ñ]^T,   o = [o_1, ..., o_N]^T,   d = [d_1, ..., d_N]^T.   (6)
The complex matrix H is called the hidden-layer output matrix. As analyzed by Huang et al. [1], for fixed input weights w_i and hidden-layer biases b_i, we can obtain the least-squares solution β̂ of the linear system Hβ = d with minimum norm of the output weights β, which usually tends to give good generalization performance. The resulting β̂ is given by

β̂ = H⁺ d,   (7)

where the complex matrix H⁺ is the Moore-Penrose generalized inverse of the complex matrix H. Thus, ELM can be extended from the real domain to a fully complex domain in a straightforward manner. The three steps of the fully complex ELM (C-ELM) algorithm can be summarized as follows:

Algorithm C-ELM: Given a training set N = {(z(i), d(i)) | z(i) ∈ C^p, d(i) ∈ C^1, i = 1, ..., N}, a complex activation function g_c(z), and a hidden neuron number Ñ:
1. Randomly choose the complex input weights w_k and the complex biases b_k, k = 1, ..., Ñ.
2. Calculate the complex hidden-layer output matrix H.
3. Calculate the complex output weight vector β using formula (7).
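For illustration, the three steps above can be sketched in a few lines of NumPy. This is a hedged sketch, not the authors' implementation: the weight range, the plain (non-conjugated) product standing in for the complex inner product, and the use of arcsinh as the fully complex activation (borrowed from Section 4) are illustrative assumptions.

```python
import numpy as np

def celm_train(Z, d, n_hidden, radius=1.0, rng=np.random.default_rng(0)):
    """Steps 1-3 of C-ELM for complex inputs Z (N x p) and targets d (N,).

    Weights and biases are drawn uniformly from a complex square of
    half-width `radius` (the equalizer in Section 4 uses radius 0.1)."""
    p = Z.shape[1]
    # Step 1: randomly choose complex input weights and biases.
    W = (rng.uniform(-radius, radius, (n_hidden, p))
         + 1j * rng.uniform(-radius, radius, (n_hidden, p)))
    b = (rng.uniform(-radius, radius, n_hidden)
         + 1j * rng.uniform(-radius, radius, n_hidden))
    # Step 2: complex hidden-layer output matrix H, Eq. (5),
    # with arcsinh as the fully complex activation g_c.
    H = np.arcsinh(Z @ W.T + b)
    # Step 3: minimum-norm least-squares output weights, Eq. (7).
    beta = np.linalg.pinv(H) @ d
    return W, b, beta

def celm_predict(Z, W, b, beta):
    return np.arcsinh(Z @ W.T + b) @ beta
```

When the number of hidden neurons exceeds the number of training samples, the pseudoinverse solution interpolates the training data exactly (up to rounding), which reflects the "smallest training error" claim of Eq. (7).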
3 Robust Recursive C-ELM Using Least-Squares Prewhitening
In this section, we apply the prewhitened recursive least-squares procedure to the C-ELM cost function given by

J(β(t)) = Σ_{i=1}^{t} λ^{t−i} |d(i) − h^H(i) β(t)|²,   (8)

where h^H(i) = [g_c(w_1 · z_i + b_1)  ···  g_c(w_Ñ · z_i + b_Ñ)]. This is the exponentially weighted least-squares cost function, which is well studied in adaptive filtering [2]. J(β(t)) is minimized by β(t) = C_hh^{-1}(t) c_hd(t), where

C_hh(t) = Σ_{i=1}^{t} λ^{t−i} h(i) h^H(i),   c_hd(t) = Σ_{i=1}^{t} λ^{t−i} d(i) h(i),   (9)

and the a priori estimation error is

e(n) = d(n) − h^H(n) β(n − 1).   (10)
The conventional RLS algorithm [2] for adjusting β(n) is

β(n) = β(n − 1) + e(n) k(n),   (11)

k(n) = C_hh^{-1}(n − 1) h(n) / (λ + h^H(n) C_hh^{-1}(n − 1) h(n)).   (12)

We can replace C_hh^{-1}(n − 1) in (12) with P^H(n − 1) P(n − 1) to obtain

k(n) = P^H(n − 1) P(n − 1) h(n) / (λ + h^H(n) P^H(n − 1) P(n − 1) h(n)).   (13)

We apply the least-squares prewhitening algorithm in [4] as follows:

P^H(n) P(n) = C_hh^{-1}(n).   (14)

Consider the update for C_hh^{-1}(n) in the conventional RLS algorithm, given by

C_hh^{-1}(n) = (1/λ) [ C_hh^{-1}(n − 1) − C_hh^{-1}(n − 1) h(n) h^H(n) C_hh^{-1}(n − 1) / (λ + h^H(n) C_hh^{-1}(n − 1) h(n)) ].   (15)

Substituting P^H(n) P(n) for C_hh^{-1}(n) in (15) yields

P^H(n) P(n) = (1/λ) P^H(n − 1) [ I − v(n) v^H(n) / (λ + ‖v(n)‖²) ] P(n − 1),   (16)

where the prewhitened signal vector v(n) is

v(n) = P(n − 1) h(n).   (17)
The vector v(n) enjoys the following property: if h(n) is a wide-sense stationary sequence, then

lim_{n→∞} E[v(n) v^H(n)] ≅ (1 − λ) I.   (18)

That is, as P(n) converges, the elements of v(n) are approximately uncorrelated with variance (1 − λ). We decompose the matrix in large brackets on the RHS of (16) as

B^H(n) B(n) = I − v(n) v^H(n) / (λ + ‖v(n)‖²).   (19)

Then P(n − 1) can be updated using

P(n) = (1/√λ) B(n) P(n − 1).   (20)
The matrix on the RHS of (19) has one eigenvalue equal to λ/(λ + ‖v‖²) in the direction of v(n) and N − 1 unity eigenvalues in the null space of v(n), or

B^H(n) B(n) = I − v(n) v^H(n)/‖v(n)‖² + [λ/(λ + ‖v(n)‖²)] v(n) v^H(n)/‖v(n)‖².   (21)

Due to the symmetry and orthogonality of the two terms in (21), a symmetric square-root factor of B^H(n) B(n) is

B(n) = I − v(n) v^H(n)/‖v(n)‖² + √(λ/(λ + ‖v(n)‖²)) v(n) v^H(n)/‖v(n)‖².   (22)

Substituting (22) into (20) yields, after some algebraic simplification, the update

P(n) = (1/√λ) [ P(n − 1) − ς(n) v(n) u^H(n) ],   (23)

where

u(n) = P^H(n − 1) v(n),   (24)

ς(n) = (1/‖v(n)‖²) [ 1 − √(λ/(λ + ‖v(n)‖²)) ].   (25)

Equations (17) and (23)–(25) constitute the new adaptive square-root factorization algorithm. Substituting (24) into (13), the gain vector simplifies to

k(n) = u(n) / (λ + ‖v(n)‖²).   (26)
Therefore, combining (26) with (10) and (11), we obtain a new numerically robust recursive C-ELM algorithm. Its complexity is similar to that of the conventional recursive C-ELM algorithm, but its numerical properties are much improved.
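For concreteness, one iteration of the resulting recursion — (10), (11), (17), and (23)–(26) — can be sketched in NumPy as follows. This is a hedged sketch with illustrative names, not the authors' implementation; P is initialized as δ^(−1/2)·I so that P^H P = C_hh^{-1}(0).

```python
import numpy as np

def prewhitened_rls_step(P, beta, h, d, lam):
    """One step of the prewhitened recursive update for the output weights.

    P    : square-root factor with P^H P = C_hh^{-1}
    beta : current output weight vector
    h    : hidden-layer output vector at time n
    d    : desired output at time n
    lam  : forgetting factor, 0 < lam < 1
    """
    v = P @ h                                    # Eq. (17): prewhitened input
    u = P.conj().T @ v                           # Eq. (24)
    nv2 = np.real(v.conj() @ v)                  # ||v(n)||^2
    k = u / (lam + nv2)                          # Eq. (26): gain vector
    e = d - h.conj() @ beta                      # Eq. (10): a priori error
    beta = beta + e * k                          # Eq. (11)
    zeta = (1.0 - np.sqrt(lam / (lam + nv2))) / nv2        # Eq. (25)
    P = (P - zeta * np.outer(v, u.conj())) / np.sqrt(lam)  # Eq. (23)
    return P, beta
```

Note that the update of P keeps the factorization P^H P = C_hh^{-1} exact in arithmetic, which is the source of the improved finite-precision behavior relative to propagating C_hh^{-1} directly.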
4 Performance Evaluation
In this section, the well-known complex nonminimum-phase channel model introduced by Cha and Kassam [6] is used to evaluate the performance of the recursive least-squares C-ELM equalizer. The channel model is of order 3 with nonlinear distortion for BPSK signaling. The channel output z(n) (which is also the input of the equalizer) is given by

z(n) = o(n) + 0.1 o²(n) + 0.05 o³(n) + v(n),   v(n) ~ N(0, 0.01),
o(n) = (0.34 − j0.27) s(n) + (0.87 + j0.43) s(n − 1) + (0.34 − j0.21) s(n − 2),   (27)
[Figure: (a) normal C-ELM with 40-bit precision; (b) the proposed algorithm with 40-bit precision; (c) normal C-ELM with 30-bit precision; (d) the proposed algorithm with 30-bit precision; (e) normal C-ELM with 20-bit precision; (f) the proposed algorithm with 20-bit precision]

Fig. 1. Performance comparison
where N(0, 0.01) denotes the white Gaussian noise (of the nonminimum-phase channel) with mean 0 and variance 0.01. The equalizer input dimension is chosen as 10. The BPSK symbol sequence s(n) is passed through the channel, and the real and imaginary parts of each symbol take values from the set {±1}. The fully complex activation functions of both C-ELM and the proposed algorithm are chosen as
arcsinh(x), where x = w · z + b. Both the input weight vectors w_k and the biases b_k of C-ELM and of the proposed algorithm are randomly chosen from a complex area centered at the origin with radius 0.1. Both equalizers, the normal C-ELM and the proposed algorithm, are trained with 1000 data symbols at 30 dB SNR. The number of hidden neurons of C-ELM and of the proposed algorithm is set to 50. Fig. 1 shows the results for three finite-precision cases, from 20 bits to 40 bits, comparing normal C-ELM with the proposed algorithm. As observed from Fig. 1, the performance of the proposed algorithm remains almost unchanged regardless of the precision, while C-ELM degrades severely at shorter word lengths.
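The data-generation side of this experiment can be sketched as follows. This is a hedged sketch: the uniform rounding quantizer and its scaling are illustrative assumptions used to mimic b-bit storage, not the authors' fixed-point model, and real-valued channel noise is assumed.

```python
import numpy as np

def channel(s, rng):
    """Channel of Eq. (27): order-3 linear part plus nonlinear distortion."""
    o = ((0.34 - 0.27j) * s[2:] + (0.87 + 0.43j) * s[1:-1]
         + (0.34 - 0.21j) * s[:-2])
    v = np.sqrt(0.01) * rng.normal(size=o.shape)   # N(0, 0.01); real noise assumed
    return o + 0.1 * o**2 + 0.05 * o**3 + v

def quantize(x, bits):
    """Round real and imaginary parts to a b-bit grid
    (illustrative scaling assumption)."""
    step = 2.0 ** (1 - bits)
    return step * (np.round(x.real / step) + 1j * np.round(x.imag / step))

rng = np.random.default_rng(0)
# symbols whose real and imaginary parts are drawn from {+1, -1}
s = ((2 * rng.integers(0, 2, 1002) - 1)
     + 1j * (2 * rng.integers(0, 2, 1002) - 1)).astype(complex)
z = channel(s, rng)      # 1000 equalizer inputs, Eq. (27)
z20 = quantize(z, 20)    # what a 20-bit implementation would observe
```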
5 Discussions and Conclusions
In this paper, we have proposed a numerically robust recursive C-ELM, and its performance has been tested by simulation for various numerical precision cases. The simulations have shown that the proposed algorithm improves the numerical robustness of the C-ELM algorithm.
References
1. Huang, M., Saratchandran, P., Sundararajan, N.: Fully Complex Extreme Learning Machine. Neurocomputing 68 (2005) 306–314
2. Haykin, S.: Adaptive Filter Theory, 3rd edn. Prentice-Hall, Upper Saddle River, NJ (1996)
3. Douglas, S.C.: Numerically-Robust O(N²) Recursive Least-Squares Estimation Using Least Squares Prewhitening. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2000), Vol. 1 (2000) 412–415
4. Dasilva, F.M., Almeida, L.B.: A Distributed Decorrelation Algorithm. In: Gelenbe, E. (ed.): Neural Networks: Advances and Applications. Elsevier Science, Amsterdam (1991) 145–163
5. Douglas, S.C., Cichocki, A.: Neural Networks for Blind Decorrelation of Signals. IEEE Trans. Signal Processing 45 (1997) 2829–2842
6. Cha, I., Kassam, S.A.: Channel Equalization Using Adaptive Complex Radial Basis Function Networks. IEEE J. Sel. Areas Commun. 13 (1995) 122–131
Evolutionary Extreme Learning Machine – Based on Particle Swarm Optimization

You Xu and Yang Shu

Department of Mathematics, Nanjing University, Nanjing 210093, P.R. China
[email protected]
Abstract. A new off-line learning method for single-hidden-layer feedforward neural networks (SLFN) called the Extreme Learning Machine (ELM) was introduced by Huang et al. [1, 2, 3, 4]. ELM is not the same as traditional BP methods, as it can achieve good generalization performance at extremely fast learning speed. In ELM, the hidden neuron parameters (the input weights and hidden biases, or the RBF centers and impact factors) are pre-assigned randomly, so a set of non-optimized parameters may prevent ELM from reaching the global minimum in some applications. Adopting the idea in [5] that a single-hidden-layer feedforward neural network can be trained by a hybrid approach that takes advantage of both ELM and an evolutionary algorithm, this paper introduces a kind of evolutionary algorithm called particle swarm optimization (PSO), which, combined with the ideas of ELM, can train networks better suited to some prediction problems.
1 Introduction
The ELM algorithm [1, 2, 3, 4] is a novel learning algorithm for single-hidden-layer feedforward neural networks (SLFNs). In ELM, the hidden neuron parameters (the input weights (linking the input layer to the hidden layer) and hidden biases, or the RBF centers and impact factors) are randomly chosen, and the output weights (linking the hidden layer to the output layer) are determined by calculating the Moore-Penrose (MP) generalized inverse. ELM divides the learning procedure of an SLFN into two parts: the first part decides the nonlinear parameters, and the other decides the linear parameters. For instance, the model of a standard SLFN can be described as follows:

y_k = f_k(x) = Σ_{j=1}^{q} θ_{kj} φ(a_j),   (1)

a_j = Σ_{i=1}^{d} ω_{ij} x_i − ω_{0j},   j = 1, 2, ..., q,   (2)

where a_j is the input activation of the jth neuron in the hidden layer; φ(·) is the activation function, usually the sigmoid; q is the number of neurons in the hidden layer (NNHL); ω_{ij}, ω_{0j} (i = 1, 2, ..., d; j = 1, 2, ..., q) are the weights and biases; y ∈ R^m is the output and x ∈ R^d is the input. Here (1) is linear. In fact, we can rewrite the input-output relationship as follows:

y = f(x, ω, θ) = Σ_{j=1}^{q} φ(x, ω_j) θ_j;   (3)

thus, for N distinct samples {x_i, y_{di}}, i = 1, 2, ..., N, the learning model of these networks can be represented in a unified form as follows:

⎡ y_{d1} ⎤   ⎡ φ(x_1, ω_1) ··· φ(x_1, ω_q) ⎤ ⎡ θ_1 ⎤
⎢  ...   ⎥ = ⎢     ...     ...     ...     ⎥ ⎢ ... ⎥ + ε.   (4)
⎣ y_{dN} ⎦   ⎣ φ(x_N, ω_1) ··· φ(x_N, ω_q) ⎦ ⎣ θ_q ⎦

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 644–652, 2006.
© Springer-Verlag Berlin Heidelberg 2006
If all the ω_j's are determined, then in order to minimize the training error, that is, the norm of ε, we need to find the least-squares solution of (4). In fact, ELM finds its smallest-norm least-squares solution. In short, ELM assigns the parameters ω_j arbitrarily in the input layer—in the BP network these are the input weights and biases, while in the RBF case they are the kernel centers and impact widths. Then it calculates the hidden-layer output matrix H. The output vector is θ = H† · T, where H† represents the Moore-Penrose (MP) generalized inverse of the matrix H. As θ is the smallest-norm least-squares solution of the linear problem Hθ = T, ELM claims that θ gives the network both the smallest training error and good generalization performance [1]. ELM is not the same as traditional gradient-based methods, because it can achieve good generalization performance at an extremely fast learning speed. However, the solution of ELM depends strongly on the particular set of input weights and hidden biases. In fact, ELM finds the global minimum of a localized linear problem, so different sets of input weights and hidden biases may yield different performances. Therefore, although ELM achieves the global minimum of the linear system, this may not be the global minimum in the problem space [6]. This kind of issue can be addressed in two different ways: one is to increase the number of hidden units so that some randomly assigned parameters become very close to the best set of parameters [6, 3]; the other is to find the best set of parameters without increasing the hidden units. Xu [6] adopts a kind of gradient-based algorithm to find the best set, but it cannot avoid the local-minima problem. Huang et al. [5] introduce a hybrid of a differential evolution algorithm and the ELM method, called E-ELM, to train an SLFN. It is well known that evolutionary algorithms can avoid the local-minima problem by using a global searching strategy.
In E-ELM, users need to choose the mutation factor (the constant F) and the crossover factor (the constant CR).¹ In other words, to use the E-ELM algorithm in cases like pattern recognition, three parameters must be decided and tuned manually. Is there a kind of evolutionary algorithm that keeps the advantage of global searching but has fewer parameters, or none, to tune? The answer to this question is the particle swarm optimization (PSO) algorithm.

¹ The selection factor in E-ELM is 0, which is not stated explicitly.

PSO is a population-based stochastic optimization technique developed by Eberhart and Kennedy in 1995, inspired by the social behavior of bird flocking and fish schooling. The concept of the particle swarm originated from the simulation of a simplified social system: the original intent was to graphically simulate the choreography of a bird flock or fish school. However, it was found that the particle swarm model could be used as an optimizer. In this algorithm, a swarm consists of many particles, where each individual particle keeps track of its position, velocity, and best position. The global best position in the problem space can be found via iterations. In each iteration, a particle updates its own position by keeping track of two values: the best position pbest this particle has reached, and the best position gbest the whole swarm has reached. Whether a position is better or worse is judged by the fitness function, whose value can be considered as the distance from the best position. The particles then update their velocities and positions according to the following formulae:

v_{t+1} = ω × v_t + c_1 · rand_1 · (pbest_t − present_t) + c_2 · rand_2 · (gbest_t − present_t),   (5)

present_{t+1} = present_t + v_{t+1},   (6)
where v is the velocity of a particle and present is its current position; rand_1 and rand_2 are random numbers in the range (0, 1); ω is called the inertia factor and usually lies in the interval [0.4, 0.9]; c_1 and c_2 are the learning factors and are usually set to the constant 2. So the only parameter that needs manual tuning is ω. During the iterations, a particle learns from the society and moves towards the potential best position using both its own and social knowledge. In fact, every iteration is like an evolution, and every position change is like a mutation or crossover. Therefore, from this viewpoint, we can classify the PSO algorithm as a kind of evolutionary computing. However, PSO has a few differences from the traditional genetic algorithm. In the first place, the sorting procedure over the fitness values and the selection procedure over the population are no longer necessary in this algorithm; besides, it needs neither a crossover nor a mutation operation, which makes the algorithm much simpler. In the second place, no factor in this algorithm needs manual tuning—in fact, the inertia factor usually equals the constant 0.5 and works well in most applications. This is unlike the traditional genetic algorithm, which requires at least three parameters,² making the algorithm much more difficult to implement in some applications.
² The mutation factor, the crossover factor, and the selection factor.
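A single iteration of (5)–(6), together with the pbest/gbest bookkeeping described above, can be sketched as follows. This is a hedged sketch with illustrative names; a minimization convention (lower fitness is better) is assumed.

```python
import numpy as np

def pso_step(pos, vel, pbest, gbest, fitness, w=0.5, c1=2.0, c2=2.0,
             rng=np.random.default_rng()):
    """One PSO iteration following Eqs. (5)-(6).

    pos, vel : (n_particles, dim) current positions and velocities
    pbest    : (n_particles, dim) best position found by each particle
    gbest    : (dim,) best position found by the whole swarm
    fitness  : callable mapping a position to a scalar (lower is better)
    """
    r1 = rng.random(pos.shape)
    r2 = rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)  # Eq. (5)
    pos = pos + vel                                                    # Eq. (6)
    # update the personal and global bests
    for i in range(len(pos)):
        if fitness(pos[i]) < fitness(pbest[i]):
            pbest[i] = pos[i]
        if fitness(pbest[i]) < fitness(gbest):
            gbest = pbest[i].copy()
    return pos, vel, pbest, gbest
```

Because gbest is only replaced by a strictly better position, its fitness is nonincreasing over the iterations.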
2 The PSO-ELM Algorithm
The essential issues in optimizing the performance of ELM are 1) to optimize the hidden neuron parameters and 2) to reduce the number of hidden neurons, so as to give the network better and quicker performance [1, 5, 6]. To achieve these two goals, many algorithms derived from ELM have been proposed. The Evolutionary Extreme Learning Machine (E-ELM) [5] is a good example that uses a differential evolution algorithm together with ELM to train networks suitable for pattern recognition problems. Here we simply replace the differential evolution algorithm with the PSO algorithm. Obviously, the restriction that Φ be differentiable with respect to ω is no longer a necessary condition for applying the algorithm. Moreover, fewer factors need tuning, which makes the final implementation much easier. However, PSO has some detailed problems that need to be considered carefully. The first one is that particles might fly out of bound in position or velocity.

1. The position is out of bound in a certain dimension. It is easy to see that when rand_1 is quite close to 1 and rand_2 is quite close to 0, the particle will fly over the global best position; if the position gbest is quite close to the boundary in a certain dimension, occasionally the particle will go out of bound. Out-of-bound particles are no longer valid solutions to the original problem, so this issue should be considered carefully. Unfortunately, hardly any researchers who applied the PSO algorithm to the training of neural networks have treated this issue seriously, as they adopted the BP algorithm to train the network, which needs unbounded values to avoid local minima. However, in the ELM algorithm, the input weights and hidden biases should be bounded. For instance, if we choose all the hidden biases smaller than −10, the elements of the hidden-layer output matrix will be very close to 1 (assuming that Φ is the sigmoid), and thus the MP generalized inverse of the matrix H becomes trivial.
In fact, the rank of H can then be considered as 1 rather than the K we assumed, where K is the number of hidden neurons. In fact, all the weights should be in the range [−1, 1] and the hidden biases should be in the range [−1, 1].³ We adopt these constraints directly, and there are three different strategies to deal with the position out-of-bound issue:

(1) The boundary is absorptive. Out-of-bound particles stay at the boundary in that dimension, as if the boundary absorbed them and they adhered to it.

(2) The boundary is reflective. Out-of-bound particles reverse their direction in that dimension and leave the boundary keeping the same speed.
³ Such normalization is not stated explicitly in [1], but it can be obtained from the MATLAB source code of ELM, which is available at the ELM portal http://www.ntu.edu.sg/home/egbhuang.
(3) The boundary is abysmal. Out-of-bound particles disappear, which means they are not taken into account anymore.

In this algorithm, for ease of implementation, we adopt strategy (2): we simply reverse the directions of out-of-bound particles while keeping their speeds unchanged.

2. The particle's velocity is out of bound in a certain dimension. To some extent, this constraint is a necessary condition that prevents the PSO algorithm from being trapped in local minima and ensures an even global search of the solution space. If a maximal velocity v_max^i is given in the ith dimension and v_new^i > v_max^i, then the velocity in this dimension is normalized to

v_new^i = (v_max^i / |v_new^i|) v_new^i.   (7)
To train a neural network quickly, according to [5], an algorithm should optimize the input weights and hidden biases with some search strategy and adopt ELM to calculate the output-layer weights. Actually, in most cases these two steps are carried out iteratively. It is true that the PSO algorithm can train a neural network together with the BP algorithm; however, BP is much slower than ELM. ELM also attains the global minimum of the linear subproblem, which means ELM can achieve a good result in the output layer. Therefore, we adopt PSO in the procedure of searching for the best set of input parameters, and employ ELM to train the output layer directly. The fitness function, the root mean square error (RMSE) of PSO, is defined as follows:
RMSE = √( Σ_{i=1}^{N} ‖ε_i‖² / (m × N) ).   (8)

In the first step, the algorithm generates a swarm of input values and then performs evolutionary computation until the maximum number of iterations or the minimum criterion is reached. In fact, it is difficult to define the minimum criterion here, as the RMSE and other factors may not have a limit of 0, and it is usually impossible to pre-evaluate this limit before the calculation. However, as the PSO algorithm is stochastic, if the RMSE keeps unchanged during the evolution, or δ_RMSE is smaller than a pre-defined ε, we have sufficient reason to conclude that the best point has been found. Based on the analysis stated above, the algorithm using PSO and ELM is given below; we name it PSO-ELM.⁴

Algorithm PSO-ELM. Given a training set ℵ = {(x_i, t_i) | x_i ∈ R^N, t_i ∈ R, i = 1, ..., N}, a maximum iteration count, and ε:
Step 1: Generate the initial swarm and set the boundary;
Step 2: Calculate the output weights using the ELM algorithm;
⁴ The MATLAB source code can be obtained from the authors via e-mail.
Step 3: Update the particle positions, and calculate the RMSE and δ_RMSE;
Step 4: If δ_RMSE < ε or the maximal iteration count is attained, go to Step 5; else go to Step 2;
Step 5: Output the best position and EXIT.
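Putting the pieces together, the whole procedure can be sketched as follows. This is a hedged sketch, not the authors' MATLAB implementation: tanh stands in for the sigmoid-like activation, the boundary is handled by simple clipping (strategy (1)) rather than reflection for brevity, and a fixed iteration count replaces the δ_RMSE stopping test; all names are illustrative.

```python
import numpy as np

def elm_fitness(particle, X, T, q):
    """RMSE (Eq. 8, m = 1) after ELM solves the output layer for one
    particle, which flattens the input weights (q x d) and biases (q,)."""
    N, d = X.shape
    W = particle[:q * d].reshape(q, d)
    b = particle[q * d:]
    H = np.tanh(X @ W.T + b)              # hidden-layer output matrix
    theta = np.linalg.pinv(H) @ T         # ELM output weights
    err = H @ theta - T
    return float(np.sqrt(np.mean(err ** 2)))

def pso_elm(X, T, q, n_particles=10, iters=30, w=0.5, c=2.0,
            rng=np.random.default_rng(0)):
    """PSO searches the bounded hidden parameters; ELM solves the
    output layer at every fitness evaluation (Steps 1-5)."""
    dim = q * (X.shape[1] + 1)
    pos = rng.uniform(-1.0, 1.0, (n_particles, dim))   # Step 1
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()
    pcost = np.array([elm_fitness(p, X, T, q) for p in pos])  # Step 2
    gbest = pbest[np.argmin(pcost)].copy()
    for _ in range(iters):                              # Steps 3-4
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        vel = w * vel + c * r1 * (pbest - pos) + c * r2 * (gbest - pos)
        pos = np.clip(pos + vel, -1.0, 1.0)             # keep within bounds
        cost = np.array([elm_fitness(p, X, T, q) for p in pos])
        better = cost < pcost
        pbest[better] = pos[better]
        pcost[better] = cost[better]
        gbest = pbest[np.argmin(pcost)].copy()
    return gbest, float(pcost.min())                    # Step 5
```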
3 Prediction Problem
According to conventional viewpoints, a single-hidden-layer feedforward neural network can deal with certain kinds of static problems (without a time factor). For instance, such networks are applied to regressing a multi-variable function, especially the nonlinear orbit of a robot on an assembly line; to classifying samples out of a raw data set, as in pattern recognition or disease diagnosis; or to compressing data, e.g., encoding and extracting the characteristics of a digital image. However, a single-hidden-layer feedforward neural network might not be a suitable model for dynamic problems with dynamic behaviors, e.g., prediction problems and real-time adaptive control systems. In other words, there are neither time factors nor feedback mechanisms in BP networks, which makes them ill-suited to dynamic models. In fact, there are two different kinds of prediction: predicting the results using the information provided by the last state, and predicting the results using all the information available so far. The first one is essentially not dynamic; it has no relationship with the time factor, since whenever you have the last state, you will get the same current state. Analogously to dynamic programming in operations research [7], where the current state depends only on the last state and has no relationship with earlier ones, the prediction of the future relates only to current factors; this is called the Markov property. Because it is hard to find a formula describing the relationship between these two states, and even harder to find which factors actually affect the results, we employ neural networks to deal with such problems. Additionally, the relationship is usually nonlinear, which makes the neural network a particularly suitable model for such problems. Researchers have already applied neural networks to prediction cases such as water-quality pollution prediction [8].
We rewrote the program according to the algorithm provided in [8] using the MATLAB neural network toolbox. This algorithm uses the PSO algorithm for global search and accelerates the convergence of the swarm using the Levenberg-Marquardt (LM) algorithm. Unfortunately, we could not obtain sufficient data for this problem. Instead, we took the statistical data on farm production from 1978 to 2000 from the annals published by the National Statistics Bureau of China in 2003 [9]. Seven factors affect the production of farm goods: irrigation area, amount of fertilizer, electric power, total mechanical power, labor force, planting acreage, and area hit by natural calamity. In Table 1 we use x1, ..., x7 to denote these seven factors. To eliminate the errors caused by the different dimensions of the different factors, a Z-Score transformation is performed on the raw data. The Z-Score is a measure of the distance, in standard deviations, of a sample from the mean, calculated as (X − X̄)/σ, where X̄ is the mean and σ
Table 1. Standard values of corn production and its influencing factors in China

Year  Production      x1       x2       x3       x4       x5       x6       x7
1978    -1.7992   -0.7009  -1.5436  -1.1178  -1.4246  -2.2198   2.5280   0.6992
1979    -1.3592   -0.6891  -1.3545  -1.0748  -1.2842  -1.8086   2.1035  -0.9361
1980    -1.5453   -0.7248  -1.1834  -1.0193  -1.1665  -1.3167   1.4531  -0.1978
1981    -1.4734   -0.8223  -1.0549  -0.9481  -1.0860  -0.7364   0.7234  -0.8765
1982    -0.9993   -0.9456  -0.9553  -0.9087  -1.0055  -0.4193   0.2438  -1.8292
1983    -0.4721   -0.8005  -0.8185  -0.8530  -0.8841  -0.0905   0.4314  -1.6029
1984    -0.1499   -0.8598  -0.7437  -0.8111  -0.7570  -0.0639   0.0585  -2.0075
1985    -0.6035   -0.9893  -0.7101  -0.7458  -0.6351  -0.9541  -1.2363  -0.2208
1986    -0.4040   -0.9304  -0.5654  -0.6326  -0.4595  -0.8764  -0.5669   0.1758
1987    -0.1914   -0.8754  -0.5012  -0.5278  -0.2970  -0.6080  -0.4595  -0.5471
1988    -0.3627   -0.8838  -0.3683  -0.4504  -0.1472  -0.2170  -0.8266   0.7112
1989    -0.1460   -0.7158  -0.1667  -0.3362  -0.0186   0.4405  -0.1591   0.1552
1990     0.4762    0.0559   0.0512  -0.2577   0.0366   1.0386   0.2451  -1.0644
1991     0.3001    0.1860   0.2520  -0.0851   0.0953   1.6060  -0.1242   1.3680
1992     0.4186    0.4244   0.3689   0.1239   0.1745   1.5063  -0.6865   0.7770
1993     0.6411    0.4671   0.5761   0.3245   0.3045   0.9864  -0.7028   0.4184
1994     0.4579    0.4768   0.7313   0.6577   0.4756   0.6073  -1.3328   1.3082
1995     0.8040    0.6389   0.9891   0.9221   0.6751   0.3697  -0.8468  -0.0123
1996     1.4139    0.9804   1.2080   1.1504   0.8844   0.3203  -0.0492   0.1549
1997     1.2472    1.2465   1.3506   1.3938   1.1833   0.4367   0.0675   1.0771
1998     1.5387    1.5746   1.4471   1.4840   1.4583   0.5646   0.3480   0.6069
1999     1.4758    1.8424   1.4850   1.6750   1.7848   0.7551   0.1473   0.5834
2000     0.7326    2.0479   1.5057   2.0355   2.0930   0.6788  -1.3589   1.2574
Table 2. Performance comparison

Algorithm     Training Time (s)  Output 1999  Output 2000  Rel. Error 1999  Rel. Error 2000
PSO-LM        10.921             1.4416       0.7018       2.32%            4.20%
BP (MATLAB)   0.732              1.4573       0.7737       1.25%            5.61%
PSO-ELM       4.478              1.4407       0.7010       2.33%            4.31%
is the standard deviation. In the experiment, we train the network using the data from 1978 to 1998 and try to predict the production in the years 1999 and 2000. The network has 7 inputs, 15 hidden neurons, and 1 output. We make a comparison between PSO-ELM and the algorithm provided in [8]. Remarkably, the results of these two algorithms are quite close to each other; moreover, the BP algorithm also provides similarly good results. The relative prediction errors of PSO-ELM are 2.33% and 4.31% for the years 1999 and 2000, respectively. Thanks to the quick learning speed of ELM, the total learning time of PSO-ELM is less than 5 seconds, while the PSO-LM algorithm takes more than 10 seconds on average. As we could not obtain the original data from [8] and make an exact comparison between these two algorithms, the comparison here might be a very particular
Evolutionary ELM – Based on Particle Swarm Optimization
651
case that does not reveal the full characteristics of them. However, as PSO-ELM adopts the quick learning speed from ELM, it is always true that PSO-ELM algorithm is much quicker than the hybrid PSO and BP algorithm.
4 Conclusions
In this paper, in order to simplify the Evolutionary ELM algorithm, a hybrid PSO and ELM algorithm was introduced. Although many researchers have applied similar PSO algorithms to the training of neural networks, they rarely considered the boundary conditions. In fact, the boundary condition is quite a significant part of the PSO algorithm: without these constraints, a PSO algorithm will in all probability be trapped in local minima. Additionally, the ELM algorithm has this restriction as well, although it is not stated explicitly in [1]. After a careful analysis of these issues, the PSO-ELM algorithm was given. Moreover, by adopting ELM in the training step, the complete training procedure takes less time than other PSO-based algorithms. We also discussed the applicability of feedforward neural networks to prediction problems and identified the situations in which this model can be applied. Finally, experimental results on a real-world problem showed that this algorithm achieves the same precision as traditional BP algorithms. There are three main advantages of PSO-ELM. First, it needs only one parameter, compared to three parameters in the traditional genetic method. Second, the constraint that the activation function be differentiable with respect to x is no longer required. It is true that ELM can be used in cases such as the threshold network, where the activation function is not differentiable [10]; however, in [5] this kind of constraint is needed. PSO-ELM removes the constraint, which makes the algorithm more widely applicable. In fact, we would like to do future work in this field and plan to implement such an algorithm on embedded systems. Third, the learning speed of ELM makes this algorithm much quicker than PSO-based algorithms such as PSO-LM.
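As a rough illustration of how the pieces fit together, the following sketch couples a standard global-best PSO loop with ELM's least-squares step. The swarm size, inertia and acceleration coefficients, tanh activation, and training-RMSE fitness are our own assumptions, not the settings used in the paper:

```python
import numpy as np

def pso_elm(X, T, K, n_particles=20, iters=50, bound=1.0, seed=0):
    """Minimal PSO-ELM sketch: each particle encodes the hidden-layer
    parameters; ELM's least-squares step scores it."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    dim = (m + 1) * K                             # input weights plus biases
    pos = rng.uniform(-bound, bound, (n_particles, dim))
    vel = np.zeros((n_particles, dim))

    def fitness(p):                               # training RMSE after the ELM step
        W, b = p[:m * K].reshape(m, K), p[m * K:]
        H = np.tanh(X @ W + b)
        theta = np.linalg.pinv(H) @ T             # smallest-norm least-squares solution
        return np.sqrt(np.mean((H @ theta - T) ** 2))

    pbest = pos.copy()
    pbest_f = np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, -bound, bound)   # boundary constraint on particles
        f = np.array([fitness(p) for p in pos])
        better = f < pbest_f
        pbest[better], pbest_f[better] = pos[better], f[better]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest, float(pbest_f.min())
```

The `np.clip` call is the boundary condition discussed above: it keeps every particle inside the admissible range instead of letting the swarm wander off.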
References
1. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme Learning Machine: A New Learning Scheme of Feedforward Neural Networks. In: Proceedings of International Joint Conference on Neural Networks (IJCNN2004), Budapest, Hungary (25-29 July, 2004)
2. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme Learning Machine: Theory and Applications. Neurocomputing (in press) (2006)
3. Huang, G.B., Siew, C.K.: Extreme Learning Machine with Randomly Assigned RBF Kernels. International Journal of Information Technology 11 (2005)
4. Huang, G.B., Siew, C.K.: Extreme Learning Machine: RBF Network Case. In: Proceedings of the Eighth International Conference on Control, Automation, Robotics and Vision (ICARCV 2004), Kunming, China (6-9 Dec, 2004)
5. Zhu, Q.Y., Qin, A., Suganthan, P., Huang, G.B.: Evolutionary Extreme Learning Machine. Pattern Recognition 38 (2005) 1739-1763
652
Y. Xu and Y. Shu
6. Xu, Y.: A Gradient-Based ELM Algorithm in Regressing Multi-variable Functions. In: Proceedings of International Symposium on Neural Networks (ISNN2006), Chengdu, China (29-31 May, 2006)
7. Bellman, R.E.: Dynamic Programming. Princeton University Press, Princeton, N.J. (1957)
8. Chau, K.: A Split-Step PSO Algorithm in Prediction of Water Quality Pollution. In: Proceedings of the International Symposium on Neural Networks (ISNN 2005), Chongqing, China (30 May - 1 June, 2005) 1034-1039
9. National Bureau of Statistics of China: The Annals (2003) of Statistical Data in China. China Statistics Press, Beijing (2004)
10. Huang, G.B., Zhu, Q.Y., Mao, K.Z., Siew, C.K., Saratchandran, P., Sundararajan, N.: Can Threshold Networks Be Trained Directly? IEEE Transactions on Circuits and Systems - II 53 (2006)
A Gradient-Based ELM Algorithm in Regressing Multi-variable Functions You Xu Department of Mathematics, Nanjing University, Nanjing 210093, P.R. China
[email protected]
Abstract. A new off-line learning algorithm for single-layer feedforward neural networks (SLFNs), called Extreme Learning Machine (ELM), was introduced by Huang et al. [1, 2, 3, 4]. ELM is not the same as the traditional BP method, as it can achieve good generalization performance at an extremely fast learning speed. In ELM, the hidden-neuron parameters (the input weights and hidden biases, or the RBF centers and impact factors) are pre-determined randomly, so a set of non-optimized parameters might prevent ELM from achieving the global minimum in some applications. This paper tries to find a set of optimized input weights using a gradient-based algorithm for training SLFNs whose activation function is differentiable.
1 Introduction
It is well known that single-layer feedforward neural networks (SLFNs) can approximate functions on a compact set of R^N [5]. Recently a new training method called ELM was introduced by Huang et al. [1, 2, 3, 4]. This method can regress a multi-variable function quickly. Here is a short introduction to ELM; for the full illustration, please refer to [2]. ELM divides the learning problem in an SLFN into two parts: the first is to decide the nonlinear parameters, the other is to decide the linear parameters. For instance, the model of the standard SLFN can be described as follows:

y = f(x) = Σ_{j=1}^{K} θ_j φ(a_j),     (1)

a_j = Σ_{i=1}^{m} ω_ij x_i − ω_0j,   j = 1, 2, ..., K;     (2)

where a_j is the input activation of the jth neuron in the hidden layer; φ(·) is the activation function, usually the sigmoid; K is the number of neurons in the hidden layer (NNHL); ω_ij, ω_0j (i = 1, ..., m; j = 1, 2, ..., K) are the weights and biases of the additive hidden neurons; y ∈ R is the output and x ∈ R^m is the input. Here (1) is linear in θ.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 653-658, 2006. © Springer-Verlag Berlin Heidelberg 2006
Here is another example, the radial basis function (RBF) network: the output of an RBF network with K kernels, for an input vector x ∈ R^n, is given by

y = f(x) = Σ_{j=1}^{K} θ_j φ_j(x)     (3)

where φ_j(·) is the radial basis function, usually Gaussian:

φ_j(x) = exp( −‖x − μ_j‖² / (2σ_j²) )     (4)

Besides these two types, the general regression neural network [6], fuzzy neural networks, and threshold networks [7] also have an input-output relationship of the form

y = f(x, ω, θ) = Σ_{j=1}^{K} φ(x, ω_j) θ_j,     (5)

thus, for N distinct samples {x_i, y_i, i = 1, 2, ..., N}, the learning model of these networks can be represented in the unified form

[ y_1 ]   [ φ(x_1, ω_1) ... φ(x_1, ω_K) ] [ θ_1 ]
[ ... ] = [     ...              ...    ] [ ... ] + ε     (6)
[ y_N ]   [ φ(x_N, ω_1) ... φ(x_N, ω_K) ] [ θ_K ]
If all the ω_j are determined, then in order to minimize the training error, that is, the norm of ε, the least-squares solution of (6) should be found. In fact, ELM finds the smallest-norm least-squares solution of the linear system (6). In short, ELM randomly assigns the parameters ω_j of the input layer (in the BP network these are the input weights and biases, while in the RBF case they are the kernel centers and impact widths). It then calculates the hidden-layer output matrix H. The output vector is θ = H† · T, where H† denotes the Moore-Penrose (MP) generalized inverse of H. As θ is the smallest-norm least-squares solution of the linear problem Hθ = T [8], ELM claims that θ gives the network both the smallest training error and good generalization performance [2].
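The two steps just described (random hidden parameters, then θ = H†T) are only a few lines of linear algebra in NumPy. This is a minimal sketch for illustration; the sigmoid activation and uniform initialization are our own arbitrary choices:

```python
import numpy as np

def elm_train(X, T, K, seed=0):
    """ELM training of an SLFN: random hidden parameters, then the
    Moore-Penrose step theta = pinv(H) @ T."""
    rng = np.random.default_rng(seed)
    N, m = X.shape
    W = rng.uniform(-1.0, 1.0, (m, K))      # random input weights
    b = rng.uniform(-1.0, 1.0, K)           # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))  # hidden-layer output matrix
    theta = np.linalg.pinv(H) @ T           # smallest-norm least-squares solution
    return W, b, theta

def elm_predict(X, W, b, theta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ theta
```

`np.linalg.pinv` computes the Moore-Penrose generalized inverse, so `theta` is exactly the smallest-norm least-squares solution of Hθ = T referred to above.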
2 Parameters and Performances
The ELM method has very good performance and directly reaches the global minimum of a solution space that is determined only by the number of hidden-layer units and the pre-assigned parameters. However, as the results are computed from these pre-determined parameters, different sets of values imply different performances. The reason is given in the next section, and examples are provided at the end of this article. In fact, the solution of ELM is just the smallest-norm least-squares solution of a linear system. Using equation (6) above, the problem can be restated as:

Y = Φ(X, ω)θ + ε,     (7)
where Y ∈ R^N, X ∈ R^{N×m}, ω ∈ R^{m×K}, θ ∈ R^K, ε ∈ R^N.

J = ‖Y − Φ(X, ω)θ‖²  = J(ω, θ)     (8)

{ω*, θ*} = arg min J(ω, θ).     (9)

Assume that rank(Φ(X, ω)) = K; then an orthogonal factorization Φ = QRV^T can be made, where Q ∈ R^{N×K} and V ∈ R^{K×K} have orthonormal columns and R ∈ R^{K×K}. Thus, according to the definition in [1], θ = V R^{−1} Q^T Y. Replacing θ in (8), we get:

Ĵ = ‖Y − Φ(X, ω)θ‖² = ‖Y − ΦΦ† Y‖² = Ĵ(ω)     (10)
Here Ĵ is determined only by ω and does not depend on θ; hence the global minimum of J can be obtained in two separate steps: find the minimal point of Ĵ, then calculate θ using ELM. But the procedure of finding the minimal point of Ĵ requires the value of θ, so these two steps are implemented iteratively. Suppose that Ĵ is differentiable with respect to ω; then we have:

Theorem 1. Y can be written Y1 + Y2, where Y1 = Φ(X, ω)θ ∈ R(Φ) and Y2 ∈ N(Φ^T); here R(·) and N(·) denote the image (range) space and null space, respectively. We have:

∂Ĵ/∂ω_kj = −2 Y2^T Φ̇ θ,  where Φ̇ = ∂Φ/∂ω_kj,     (11)

Ĵ = Y2^T Y2     (12)
Proof. Replacing Y with Y1 + Y2 in (10) yields (12). Here we prove (11). Write A_Φ for ΦΦ† and omit ω in the context.

Since Φ = A_Φ Φ, differentiating gives Φ̇ = Ȧ_Φ Φ + A_Φ Φ̇, so Ȧ_Φ Φ = (I − A_Φ)Φ̇.

Since Ĵ = ‖(I − A_Φ)Y‖² = Y^T (I − A_Φ) Y, and A_Φ = A_Φ A_Φ implies Ȧ_Φ = Ȧ_Φ A_Φ + A_Φ Ȧ_Φ, we have

∂Ĵ/∂ω_kj = −2 Y^T (I − A_Φ)(Ȧ_Φ A_Φ + A_Φ Ȧ_Φ) Y
         = −2 Y^T (I − A_Φ) Ȧ_Φ A_Φ Y
         = −2 Y^T (I − A_Φ) Φ̇ Φ† Y
         = −2 (Y2^T + (Φθ)^T)(I − ΦΦ†) Φ̇ θ
         = −2 Y2^T Φ̇ θ + 2 Y2^T ΦΦ† Φ̇ θ − 2 (Φθ)^T (I − ΦΦ†) Φ̇ θ.

The last two terms are 0 (Y2^T Φ = 0 and (I − ΦΦ†)Φ = 0), which gives (11).
Theorem 1 provides a new algorithm for the regression of a multi-variable function using ELM together with gradient descent; as Step 4 below uses a gradient-descent method, the algorithm is named Gradient ELM (G-ELM).

Algorithm G-ELM. Given a training set ℵ = {(x_i, t_i) | x_i ∈ R^m, t_i ∈ R, i = 1, ..., N}, a differentiable activation function g(x), and the number of hidden neurons K:

Step 1: Randomly assign the input weights ω(0) ∈ R^{m×K};
Step 2: Given ω(k), calculate θ(k) and Y2;
Step 3: If Ĵ = Y2^T Y2 < ε, go to Step 6;
Step 4: Calculate ∂Ĵ/∂ω_ij using (11), and update

ω_ij(k + 1) = ω_ij(k) − h ∂Ĵ/∂ω_ij,     (13)

Step 5: Go to Step 2;
Step 6: ω(k), θ(k) is the solution. End.

Note that in Step 2, θ is calculated using ELM.
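A compact sketch of this loop is given below. For brevity it approximates the gradient in Step 4 by finite differences on Ĵ rather than evaluating the analytic expression (11), and it omits hidden biases; everything else (random ω(0), the ELM step for θ, the stopping test on Y2ᵀY2) follows the algorithm above:

```python
import numpy as np

def g_elm(X, Y, K, h=0.05, eps=0.1, max_iter=100, delta=1e-5, seed=0):
    """Sketch of G-ELM; the gradient is a finite-difference approximation."""
    rng = np.random.default_rng(seed)
    N, m = X.shape
    W = rng.uniform(-1.0, 1.0, (m, K))            # Step 1: random input weights

    def jhat(W):
        Phi = 1.0 / (1.0 + np.exp(-(X @ W)))      # hidden output matrix Phi(X, w)
        Y2 = Y - Phi @ (np.linalg.pinv(Phi) @ Y)  # Y2 = (I - Phi Phi†) Y
        return Y2 @ Y2                            # Jhat = Y2^T Y2, cf. (12)

    for _ in range(max_iter):
        j0 = jhat(W)
        if j0 < eps:                              # Step 3: stop when error is small
            break
        G = np.zeros_like(W)                      # Step 4: numerical gradient
        for idx in np.ndindex(*W.shape):
            Wp = W.copy()
            Wp[idx] += delta
            G[idx] = (jhat(Wp) - j0) / delta
        W = W - h * G                             # gradient-descent update (13)
    theta = np.linalg.pinv(1.0 / (1.0 + np.exp(-(X @ W)))) @ Y  # final ELM step
    return W, theta
```

The finite-difference loop is what makes this sketch much slower than plain ELM; the analytic gradient (11) would replace it in a serious implementation.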
3 Examples
We use the peaks function in MATLAB as the target function; peaks is a function of two variables, obtained by translating and scaling Gaussian distributions:

peaks(x, y) = 3(1 − x)² e^{−x² − (y+1)²} − 10 (x/5 − x³ − y⁵) e^{−x² − y²} − (1/3) e^{−(x+1)² − y²}
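For reference, the target function transcribes directly into NumPy (a Python rendering of the MATLAB definition; not part of the authors' code):

```python
import numpy as np

def peaks(x, y):
    """MATLAB's peaks function: a sum of translated, scaled Gaussians."""
    return (3 * (1 - x) ** 2 * np.exp(-x ** 2 - (y + 1) ** 2)
            - 10 * (x / 5 - x ** 3 - y ** 5) * np.exp(-x ** 2 - y ** 2)
            - np.exp(-(x + 1) ** 2 - y ** 2) / 3)
```

At the origin the second term vanishes, so peaks(0, 0) = 3/e − 1/(3e) = 8/(3e).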
We use a three-layer neural network with 2 inputs, 30 hidden units, and 1 output. The input values are randomly generated in [−3, 3] following a uniform distribution and normalized to [0.2, 0.8]; z is normalized to [0, 1]. 1000 training samples and 300 testing samples are generated randomly, with ε = 0.1 and h = 0.05. A training run needs about 14 seconds on an AMD Sempron 2500+ with 512 MB of RAM, compared with 0.1 seconds for the original ELM algorithm. The training speed is the cost of the performance gain. Table 1 shows the performance comparison of the two algorithms, ELM and G-ELM. We can draw the conclusion that G-ELM tends to find a set of optimized input weights compared with the original ELM. However, this algorithm reduces the RMSE only a little when the number of hidden neurons is adequate. Therefore, in this situation the original ELM can already perform well in training SLFNs. This conclusion extends naturally to RBF networks: if there are enough kernels, the ELM method is good enough for the training of RBF
Notes: the source code can be obtained from the author via email; the value ε = 0.1 was chosen, after several experiments, in order to avoid over-fitting.
Table 1. Performance comparison (RMSE) of G-ELM and ELM

Sample   ELM Training   ELM Testing   G-ELM Training   G-ELM Testing
1           0.1121         0.5686         0.1117           0.4328
2           0.1280         0.1479         0.1167           0.1204
3           0.1183         0.2766         0.1165           0.2548
4           0.1160         2.2133         0.1114           1.9252
5           0.1145         0.3164         0.1037           0.2876
networks. This statement was mentioned in [3, 4], and here is experimental support. From the experiment we know that fewer neurons lead to a bigger RMSE. The testing RMSE for sample 3 is quite different because the variable x in the training data lies in [−3, −0.6] ∪ [0.6, 3], whereas in the testing data it lies in [−3, 3]. Sample 5 is a good example illustrating that different input weights may yield different performances.
4 Conclusion and Future Work
In fact, the essential idea of ELM is to separate the training of a network into two parts: first determine the input weights, then optimize the output weights. The first step can be done by assigning the input weights and hidden biases randomly in some applications, especially when the number of neurons is sufficient for the network to perform well. Compared with the original ELM, G-ELM reduces the RMSE only on a small scale, as in the experiment we chose 30 hidden-layer units. Another example is that the ELM algorithm can be used in training RBF networks [3, 4], and ELM is applicable to many RBF kernels. In theory, if enough kernels are randomly chosen and some of them happen to be very close to the best kernels, the performance of the network will be quite good. This idea is fully illustrated in [3, 4]. However, as mentioned above, in some problems different hidden-neuron parameters may imply different performances. The Evolutionary ELM [9] is an algorithm that aims to solve this problem; it selects the input weights using a differential evolution algorithm. But the selection procedure needs the values of the output weights, which can be obtained using ELM. This kind of algorithm also essentially separates the training of the network into two steps: the first calculates the output weights and the RMSE on the training data; the second selects the offspring (input weights) with better performances, and the two steps are implemented iteratively. G-ELM just replaces the evolutionary algorithm with a gradient-based algorithm; the idea of separating the parameters is the same. However, even though the results provided above illustrate that G-ELM performs well, a lot of fundamental work on G-ELM remains to be done: for instance, how to avoid local minima, or under what conditions the algorithm converges. All of these questions need to be solved in the near future to make this algorithm more useful.
References
1. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme Learning Machine: A New Learning Scheme of Feedforward Neural Networks. In: Proceedings of International Joint Conference on Neural Networks (IJCNN2004), Budapest, Hungary (25-29 July, 2004)
2. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme Learning Machine: Theory and Applications. Neurocomputing (in press) (2006)
3. Huang, G.B., Siew, C.K.: Extreme Learning Machine with Randomly Assigned RBF Kernels. International Journal of Information Technology 11 (2005)
4. Huang, G.B., Siew, C.K.: Extreme Learning Machine: RBF Network Case. In: Proceedings of the Eighth International Conference on Control, Automation, Robotics and Vision (ICARCV 2004), Kunming, China (6-9 Dec, 2004)
5. Hornik, K., Stinchcombe, M., White, H.: Universal Approximation of an Unknown Mapping and Its Derivatives Using Multilayer Feedforward Networks. Neural Networks 3 (1990) 551-560
6. Specht, D.: A General Regression Neural Network. IEEE Transactions on Neural Networks 2 (1991) 568-576
7. Huang, G.B., Zhu, Q.Y., Mao, K.Z., Siew, C.K., Saratchandran, P., Sundararajan, N.: Can Threshold Networks Be Trained Directly? IEEE Transactions on Circuits and Systems - II 53 (2006)
8. Burden, R.L., Faires, J.D.: Numerical Analysis. Third edn. Prindle, Weber & Schmidt, Boston (1985)
9. Zhu, Q.Y., Qin, A., Suganthan, P., Huang, G.B.: Evolutionary Extreme Learning Machine. Pattern Recognition 38 (2005) 1739-1763
A New Genetic Approach to Structure Learning of Bayesian Networks Jaehun Lee, Wooyong Chung, and Euntai Kim School of Electrical and Electronic Engineering, Yonsei University, 134 Shinchon-Dong, Seodaemun-Gu, Seoul 124-749, Korea
[email protected]
Abstract. In this paper, a new approach to structure learning of Bayesian networks (BNs) based on a genetic algorithm is proposed. The proposed method explores a wider solution space than the previous method. In the previous method, the ordering among the nodes of the BN was fixed and their conditional dependencies, represented by the connectivity matrix, were learned; in the proposed method, the ordering as well as the conditional dependencies among the BN nodes is learned. To implement this method using the genetic algorithm, we represent an individual of the population as a pair of chromosomes: the first represents the ordering among the BN nodes and the second represents their conditional dependencies. To implement the proposed method, new crossover and mutation operations which are closed in the set of admissible individuals are introduced. Finally, a computer simulation exploits real-world data and demonstrates the performance of the method.
1 Introduction

Bayesian networks (BNs) are one of the best-known formalisms for reasoning under uncertainty in Artificial Intelligence (AI) [1]. A BN is a graphical model that denotes a joint probability distribution among variables of interest based on their probabilistic relationships. The structure of a BN is a directed acyclic graph, abbreviated DAG. Each node represents a variable which ranges over a discrete domain and is connected with its parent nodes. Each arc represents the conditional dependency between the two connected nodes. The problem of searching for the structure of a BN that best reflects the conditional dependencies in a database of cases is difficult, because even a small number of nodes to connect leads to a large number of possible directed acyclic graph structures. To solve this problem, several methods have been reported [3], [4], and genetic algorithms (GAs) are certainly among them [6]. Some authors have worked on inducing the structure of trees or polytrees from a database of cases [3]; more relevant works on structure learning of multiply connected networks have also been developed [4]. GAs have been successfully applied to various optimization problems in the real world and are well suited to exploring complex search spaces. But the GA is not a ready-made optimization tool, and the plain GA cannot be applied to the structure learning of BNs because the plain crossover and mutation operations are not closed (not allowed) in the set of BN structure individuals. In this paper, we suggest a new genetic

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 659-668, 2006. © Springer-Verlag Berlin Heidelberg 2006
approach for structure learning of BNs from a database of cases. The proposed method searches a wider solution space than the previous method, which also uses a genetic algorithm for structure learning of BNs. This paper is organized as follows. In Section 2, a brief introduction to BNs and GAs is given. In Section 3, a new algorithm for structure learning of BNs based on a genetic algorithm is presented and new genetic operations are introduced. In Section 4, the proposed method is applied to real-world benchmark data, a database of semiconductor operation problems, through computer simulation. Finally, in Section 5, some conclusions are drawn.

2 Preliminaries: Bayesian Networks and Genetic Algorithms

2.1 Bayesian Networks

Bayesian networks and associated schemes constitute a probabilistic framework for reasoning under uncertainty and in recent years have gained popularity in the artificial intelligence community. From the structural point of view, BNs are directed acyclic graphs (DAGs), where the nodes are random variables and the arcs specify the dependency relationships that must hold between the random variables [2]. A Bayesian network is composed of a network structure and a set of parameters associated with that structure. In general, the structure consists of nodes which are connected by directed arcs and form a directed acyclic graph. Each node represents a domain variable that can take on a finite set of values. Each arc represents a dependency between two nodes. To specify the probability distribution of a BN, one must give prior probabilities for all root nodes (nodes with no predecessors) and conditional probabilities for all other nodes, given all possible combinations of their direct predecessors. These numbers, in conjunction with the DAG, specify the BN completely. The joint probability of any particular instantiation of all variables in a BN can be calculated as follows:
2 Preliminaries: Bayesian Networks and Genetic Algorithms 2.1 Bayesian Networks Bayesian networks and associated schemes constitute a probabilistic framework for reasoning under uncertainty and in recent years have gained popularity in the community of artificial intelligence. From the structure point of view, BNs are directed acyclic graphs (DAGs), where the nodes are random variables, and the arcs specify the dependency relationships that must be held between the random variables [2]. A Bayesian network is composed of a network structure and a set of parameters associated with the structure. In general, the structure consists of nodes which are connected by directed arcs and form a directed acyclic graph. Each node represents a domain variable that can take on a finite set of values. Each arc represents a dependency between two nodes. To specify the probability distribution of a BN, one must give prior probabilities for all root nodes (nodes with no predecessors) and conditional probabilities for all other nodes, given all possible combinations of their direct predecessors. These numbers in conjunction with the DAG specify the BN completely. The joint probability of any particular instantiation of all variables in a BN can be calculated as follows: P (U ) = ∏ P ( Ai | pa ( Ai ))
(1)
i
where U = {A1 , A2 ,L, An } represents nodes of a Bayesian network and pa( Ai ) is the parent set of the variable Ai . For example the probability of the whole network shown in Fig. 1 is given by P( A1 ) P( A2 ) P( A3 ) P( A4 | A1 , A2 ) P( A5 | A2 , A3 ) P( A6 | A4 ) P( A7 | A5 ) . The process of constructing BNs is called Bayesian networks learning. Learning BNs can he separated into two tasks, structure learning and parameter learning. Structure learning is to create an appropriate structure for BNs which accommodates prior knowledge from samples of data. Parameter learning is to calculate the conditional probabilistic distribution for the given BN structure and is done by samples of data. The most popular parameter learning method is EM algorithm [8]. In this paper we focus upon structure learning of the BNs and create an appropriate structure for the BNs which best reflects the samples of given data.
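The factorization (1) can be evaluated mechanically once the conditional probability tables are given. The sketch below uses a made-up three-node chain purely for illustration; the table entries are invented numbers, not data from the paper:

```python
def joint_probability(assignment, parents, cpt):
    """P(U) = prod_i P(A_i | pa(A_i)).

    assignment: {node: value}; parents: {node: tuple of parent nodes};
    cpt: {node: {(parent values..., value): probability}}."""
    p = 1.0
    for node, value in assignment.items():
        pa_vals = tuple(assignment[q] for q in parents[node])
        p *= cpt[node][pa_vals + (value,)]
    return p

# Hypothetical chain A1 -> A2 -> A3 with binary variables.
parents = {"A1": (), "A2": ("A1",), "A3": ("A2",)}
cpt = {
    "A1": {(0,): 0.6, (1,): 0.4},
    "A2": {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8},
    "A3": {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.5, (1, 1): 0.5},
}
prob = joint_probability({"A1": 1, "A2": 1, "A3": 0}, parents, cpt)
# P(A1=1) * P(A2=1 | A1=1) * P(A3=0 | A2=1) = 0.4 * 0.8 * 0.5
```

Summing this product over all instantiations of U returns 1, which is a quick sanity check that the tables are properly normalized.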
Fig. 1. An example of Bayesian networks (nodes A1-A7, with arcs A1→A4, A2→A4, A2→A5, A3→A5, A4→A6, and A5→A7, matching the factorization above)
2.2 Genetic Algorithms

Genetic algorithms (GAs) are based on a biological metaphor: they view learning as a competition among a population of evolving candidate solutions. A fitness function evaluates each solution to decide which ones will contribute to the next generation of solutions. Then, through operations analogous to gene transfer in sexual reproduction, a new population of candidate solutions is created [5]. The general procedure for a GA is summarized as follows. An initial population of individuals (or chromosomes) is created by random selection. Next, a fitness value is assigned to each chromosome, depending on how close it actually comes to solving the problem. Each chromosome is a kind of solution to the problem; these solutions are not to be confused with answers to the problem: think of them as candidates that the system employs in order to reach the answer. Chromosomes with high fitness values are more likely to reproduce offspring, which can mutate after reproduction. The offspring is a mixture of the father and mother and consists of a combination of their genes; this process is known as crossover. If the new generation contains a solution that produces a satisfactory performance or fitness, then the algorithm stops and the problem is said to be solved. If this is not the case, the algorithm is repeated until the condition is satisfied.
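The general procedure reads as the following skeleton. This is a generic sketch (with truncation selection for brevity, rather than fitness-proportional selection); the representation-specific operators of Section 3 would be plugged in as the init, crossover, and mutate callbacks:

```python
import random

def genetic_algorithm(init, fitness, crossover, mutate,
                      pop_size=20, generations=50, p_mut=0.1, seed=0):
    """Generic GA loop: evaluate, select, recombine, mutate, repeat."""
    rng = random.Random(seed)
    pop = [init(rng) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]        # keep the fitter half
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = crossover(a, b, rng)
            if rng.random() < p_mut:             # occasional mutation
                child = mutate(child, rng)
            children.append(child)
        pop = children
    return max(pop, key=fitness)
```

A real run would also add a stopping test on the best fitness, as described in the text.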
3 Structure Learning of Bayesian Networks Using GAs In this section, a new approach for structure learning of BNs based on genetic algorithm is proposed. The proposed method searches the wider solution space than the previous method which also uses the genetic algorithm for structure learning of BNs. 3.1 The Previous Method In this subsection, the previous genetic method for structure learning of BNs is explained briefly [6]. The BN structure with n variables is represented by an n× n connectivity matrix C of which element cij is represented as
662
J. Lee, W. Chung, and E. Kim
c_ij = 1 if i is a parent of j, and c_ij = 0 otherwise,     (2)

and each individual of the population is encoded as a chromosome

c11 c12 ... c1n c21 c22 ... c2n ... cn1 cn2 ... cnn     (3)

With this representation, the general crossover and mutation operators would produce illegal BN structures [6]. In the previous method, to overcome this problem, the connectivity matrix was confined to being triangular and the chromosome was encoded as

x = c12 c13 ... c1n c23 c24 ... c2n ... c(n-2)(n-1) c(n-2)n c(n-1)n     (4)

for the connectivity matrix

    [ c11 c12 ... c1n ]
C = [ c21 c22 ... c2n ]     (5)
    [  :   :       :  ]
    [ cn1 cn2 ... cnn ]
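The encoding (4) and its inverse are straightforward to implement; the following sketch assumes the strictly-upper-triangular convention above:

```python
import numpy as np

def encode_triangular(C):
    """Flatten the strictly upper-triangular part of an n x n connectivity
    matrix into the chromosome c12 c13 ... c1n c23 ... c(n-1)n."""
    n = C.shape[0]
    return [int(C[i, j]) for i in range(n) for j in range(i + 1, n)]

def decode_triangular(x, n):
    """Inverse mapping: chromosome back to a (necessarily acyclic) matrix."""
    C = np.zeros((n, n), dtype=int)
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            C[i, j] = x[k]
            k += 1
    return C
```

Because only entries above the diagonal exist, any bit string of length n(n-1)/2 decodes to a DAG, which is exactly why the plain genetic operators are closed under this encoding.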
That is, the ordering between the variables of the BN is fixed, and a node A_i can have another node A_j as a parent only if A_j comes before A_i in the ordering. Under the assumption of an ordering between the BN variables (i.e., triangulation of the connectivity matrix), it is clear that the plain genetic operations are closed in the set of BNs.

3.2 The Proposed Method

In the previous method, the ordering between the nodes was predetermined and their connections were confined to being triangular, so the search space was not fully explored to find the fittest BN structure. If we do not have prior knowledge about the ordering between variables (which is usual in most actual problems), the resulting structure will not be satisfactory. In this paper, we propose a new genetic method which can be applied to the structure learning of BNs. In the proposed method, the connectivity matrix is not confined to being triangular, and it is not learned either: it is inherited from the parent. Instead, the ordering between the variables is learned through the genetic algorithm. To implement this strategy, a BN structure is represented as a pair of chromosomes (the ordering chromosome and the connection chromosome): the first denotes the ordering between the BN nodes and the second is the connectivity matrix, which need not be triangular. The ordering chromosome consists of indices of variables. If there are n root nodes (nodes with no parents), the first n genes of the chromosome are the indices of the root nodes. The next genes are the indices of the children of the root nodes, without overlaps. This structure is repeated until all the variables show up. The connection chromosome denotes the connectivity
matrices which define the dependency relations between the variables. The genetic operations are applied only to the first chromosome, but the second one is automatically changed to accommodate the change of the first chromosome. The following examples show the basic idea of the suggested scheme.

Example 1 (coding BNs). Consider the two BN structures shown in Fig. 2. Each one is represented as a pair of chromosomes: Fig. 2(a) is represented by x1 and C1, and Fig. 2(b) by x2 and C2.

x1 = 126354,  x2 = 543126     (6)

     [ 0 1 0 0 0 1 ]        [ 0 0 1 0 0 1 ]
     [ 0 0 1 0 0 0 ]        [ 0 0 0 0 0 1 ]
C1 = [ 0 0 0 1 0 0 ] , C2 = [ 0 1 0 0 0 0 ]     (7)
     [ 0 0 0 0 0 0 ]        [ 1 0 1 0 0 0 ]
     [ 0 0 0 0 0 0 ]        [ 0 0 1 1 0 0 ]
     [ 0 0 0 0 1 0 ]        [ 0 0 0 0 0 0 ]

Fig. 2. Two BNs (Example 1)
The genetic operations are applied only to the ordering chromosomes. When two ordering chromosomes undergo the crossover operation, the first parts of the offspring chromosomes, before the crossover point, are the same as those of the parent chromosomes. The second parts of the offspring chromosomes, after the crossover point, come from the other parent. To prevent duplication of a digit and an illegal chromosome, the duplicated digits are deleted from the second part. As stated above, the connection chromosome is also automatically changed to accommodate the change of the ordering chromosome. The following Example 2 gives a simple example of the suggested crossover operation.
664
J. Lee, W. Chung, and E. Kim
Example 2 (crossover operation). Consider the two BNs in Example 1 and apply the crossover to them at the point between the second and the third gene. The offspring start with

x1′ = 12xxxx,  x2′ = 54xxxx     (8)

Since x1′ already contains "1" and "2", we delete those two digits from x2 = 543126 and the remaining subchromosome is "5436". The concatenation of the two parts yields the offspring ordering chromosome x1′ = 125436. In the same way, we concatenate "54" and "1263", and the other offspring ordering chromosome becomes x2′ = 541263.

x1′ = 125436,  x2′ = 541263     (9)

      [ 0 1 0 0 1 0 ]         [ 0 0 0 0 0 1 ]
      [ 0 0 0 1 0 0 ]         [ 1 0 1 0 0 0 ]
C1′ = [ 0 0 0 0 0 0 ] , C2′ = [ 0 0 0 0 0 0 ]     (10)
      [ 0 0 0 0 0 1 ]         [ 1 1 0 0 0 0 ]
      [ 0 0 0 0 0 0 ]         [ 1 0 0 1 0 0 ]
      [ 0 0 1 0 0 0 ]         [ 0 0 1 0 0 0 ]
The offspring BNs are depicted in Fig. 3.
Fig. 3. Offspring BNs after crossover (Example 2)
It can be noted that the offspring are DAGs, so they are valid; hence the proposed crossover operation is closed in the set of BNs. Next, let us consider mutation. In mutation, two randomly chosen genes of the ordering chromosome are exchanged. As in crossover, the connection chromosome is also automatically changed to accommodate the change of the ordering chromosome.
Example 3 (mutation operation). Consider the BN structure of Fig. 4(a), which consists of three variables. Using the proposed representation, the structure is represented by

x = 123     (11)

    [ 0 1 1 ]
C = [ 0 0 0 ]     (12)
    [ 0 0 0 ]

Suppose that the first and the second genes are exchanged by mutation; the result becomes

x′ = 213     (13)

     [ 0 0 0 ]
C′ = [ 1 0 1 ]     (14)
     [ 0 0 0 ]
Fig. 4(b) depicts the resulting BN. Like the crossover, the mutation is also closed in the set of BNs.
Fig. 4. Two BNs before and after mutation (Example 3)
From the above examples, we can see that the proposed method finds the answer in a wider search space than the previous one.
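The ordering-chromosome crossover and mutation described in Examples 2 and 3 can be sketched as follows (the accompanying connection-chromosome bookkeeping is omitted for brevity):

```python
def order_crossover(a, b, point):
    """Crossover for ordering chromosomes: keep a's prefix up to the
    crossover point, then append b's genes with duplicates removed."""
    head = a[:point]
    tail = [g for g in b if g not in head]  # delete duplicated digits
    return head + tail

def order_mutate(x, i, j):
    """Mutation: exchange two genes of the ordering chromosome."""
    x = list(x)
    x[i], x[j] = x[j], x[i]
    return x
```

Applied to the Example 2 parents 126354 and 543126 with crossover point 2, these return the offspring orderings 125436 and 541263 derived above; since both operations only permute node indices, every offspring is again a valid ordering.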
4 Simulation Results

4.1 Application of the Proposed Method to Semiconductor Problems

In this section, we apply the proposed method to the semiconductor operation problem. The semiconductor operation problem is a simple example of a belief network for diagnosing why a semiconductor won't operate, based on wafer, open lines, particles, etc. Fig. 5 presents the structure of the semiconductor operation problem network. All nodes of the network are discrete variables; some of them have three possible discrete states and the others have two. We use the Netica tool to generate a database of 2000 cases. Fig. 5 shows part of the generated database.
Fig. 5. The real structure of the semiconductor operation problem network
Using the database of cases, we build the BN in such a way that the probabilistic relationships between the variables are modeled. To apply the new genetic encoding and genetic operations and to evaluate the effectiveness of a given BN structure, we use the theorem proved in [7]. The theorem says

P(Bs, D) = P(Bs) Π_{i=1}^{n} Π_{j=1}^{q_i} [ (r_i − 1)! / (N_ij + r_i − 1)! ] Π_{k=1}^{r_i} N_ijk!     (15)

where D is a database of m cases, Bs is a belief network structure containing n discrete variables, each variable X_i has r_i possible values and its parents pa(X_i) have q_i possible configurations, N_ijk is the number of cases in D in which variable X_i takes its kth value and its parents pa(X_i) take their jth configuration, and N_ij = Σ_{k=1}^{r_i} N_ijk.
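In practice the product is evaluated in logarithms to avoid overflow. The sketch below follows the counting conventions above, using lgamma with k! = Γ(k + 1); the dictionary-based data format is our own choice for illustration:

```python
from math import lgamma

def log_ch_score(data, parents, arities):
    """log P(D | Bs) from the counting formula above, per variable.

    data: list of {variable: value} rows; parents: {variable: tuple of
    parent variables}; arities: {variable: number of possible values}."""
    total = 0.0
    for i, pa in parents.items():
        r = arities[i]
        counts = {}                      # parent configuration j -> [N_ijk for k]
        for row in data:
            j = tuple(row[p] for p in pa)
            counts.setdefault(j, [0] * r)[row[i]] += 1
        for nijk in counts.values():
            nij = sum(nijk)
            total += lgamma(r) - lgamma(nij + r)        # log (r-1)!/(N_ij + r - 1)!
            total += sum(lgamma(k + 1) for k in nijk)   # log prod_k N_ijk!
    return total
```

Parent configurations that never occur in D contribute log((r − 1)!/(r − 1)!) = 0, so iterating only over observed configurations is enough.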
Therefore, in this paper we use

P(D | Bs) = Π_{i=1}^{n} Π_{j=1}^{q_i} [ (r_i − 1)! / (N_ij + r_i − 1)! ] Π_{k=1}^{r_i} N_ijk!     (16)
as a fitness function to evaluate the structure of the given BN. Fig. 6 compares the fitness of the previous and the proposed methods. The vertical axis denotes the minus logarithm of (16) and the horizontal axis denotes the number of generations. So as the BN is close to the given database, the probability (16) goes to 1 and the minus logarithm of (16) goes to zero. It can be seen that the proposed method starts to outperform the previous method after only two generations and keeps much outperforming for the next almost one hundred generations. The reason for the excellence of the proposed method might be that wider search space is explored compared with the previous method. But the proposed method also has a shortcoming that the genetic operations are only applied to the ordering chromosome and the connection chromosome is changed in such a way that the change of the
A New Genetic Approach to Structure Learning of Bayesian Networks
667
ordering chromosome is accommodated. Finally, to overcome this shortcoming, we combine the previous and the proposed methods: first, we find an appropriate shape of the BN structure using the previous method, and then we adjust the BN structure by applying the proposed method. Fig. 7 compares the performances of the previous method, the proposed method, and the mixture of both.

Fig. 6. Comparison of the performances of the previous and proposed methods (−log P(D|Bs), vertical axis, versus generation 1–97, horizontal axis)
Fig. 7. A comparison of the evolution of the best individual found with the previous method, the proposed method, and the mixture of both (−log P(D|Bs) versus generation 1–97)
In the early generations of the mixture method, the previous method is employed to find a suitable structure of the BN. When the performance no longer improves, the genetic encoding and operations are switched to those of the proposed method. Table 1 compares the performances of the three methods. It can be seen that the mixture method outperforms not only the previous method but also the proposed method. The likely reason is that the previous method, used in the early generations of the algorithm, finds a better shape of the BN than the proposed method does.
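In practice the product in (16) is evaluated in log form, since the factorials overflow quickly. A minimal sketch of such an evaluator (the function and variable names are our own; the counts N_ij and N_ijk are tallied directly from the list of cases):

```python
import math
from collections import Counter

def log_score(data, parents, arities):
    """log P(D | Bs) from Eq. (16).

    data:    list of cases, each a tuple of value indices, one per variable
    parents: parents[i] is the tuple of parent indices of variable X_i
    arities: arities[i] is r_i, the number of values of X_i
    """
    total = 0.0
    for i, pa in enumerate(parents):
        r = arities[i]
        # N_ijk: count of cases with parent configuration j and value k of X_i
        n_ijk = Counter((tuple(c[p] for p in pa), c[i]) for c in data)
        # N_ij = sum_k N_ijk
        n_ij = Counter()
        for (j, _k), n in n_ijk.items():
            n_ij[j] += n
        for n in n_ij.values():
            # log[(r_i - 1)! / (N_ij + r_i - 1)!]
            total += math.lgamma(r) - math.lgamma(n + r)
        for n in n_ijk.values():
            total += math.lgamma(n + 1)  # log N_ijk!  (unseen counts add log 0! = 0)
    return total
```

A GA would then minimize −log_score, which is exactly the quantity plotted on the vertical axes of Figs. 6 and 7.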
Table 1. A comparison of performance for three methods
               Previous method   Proposed method   Mixture of both
Performance        142.597           13.598             7.098
5 Conclusions

In this paper, a new approach to the structure learning of BNs from databases of cases has been suggested. In the proposed method, the connectivity matrix of the BN is not confined to being triangular, and a wider solution space is explored by encoding the BN as a pair consisting of an ordering chromosome and a connection chromosome. A mixture of the previous and the proposed methods is also considered, to compensate for the weak points of each. The proposed method was applied to the semiconductor operation problem, and its effectiveness was verified.
Acknowledgments

This work was supported by the Ministry of Commerce, Industry and Energy of Korea.
References

1. Jensen, F.V.: Introduction to Bayesian Networks. Technical Report IR 93–2003, Dept. of Mathematics and Computer Science, Univ. of Aalborg, Denmark (1993)
2. Jensen, F.V.: Bayesian Networks and Decision Graphs. Springer-Verlag, Berlin Heidelberg New York (2001)
3. Acid, S., De Campos, L.M., Gonzalez, A., Molina, R., Perez de la Blanca, N.: Learning with CASTLE. In: Kruse, R., Siegel, P. (eds.): Symbolic and Quantitative Approaches to Uncertainty. Lecture Notes in Computer Science, Vol. 548. Springer-Verlag, Berlin Heidelberg New York (1991) 99–106
4. Chickering, D.M., Geiger, D., Heckerman, D.: Learning Bayesian Networks: Search Methods and Experimental Results. Fifth International Workshop on Artificial Intelligence and Statistics (1995) 112–128
5. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. 3rd edn. Springer-Verlag, Berlin Heidelberg New York (1996)
6. Larrañaga, P., Poza, M., Yurramendi, Y., Murga, R., Kuijpers, C.: Structure Learning of Bayesian Networks by Genetic Algorithms: A Performance Analysis of Control Parameters. IEEE Trans. Pattern Analysis and Machine Intelligence 18(9) (1996) 912–926
7. Cooper, G.F., Herskovits, E.A.: A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning 9(4) (1992) 309–347
8. Zhang, S.Z., Zhang, Z.N., Yang, N.H., Zhang, N.H., Wang, X.K.: An Improved EM Algorithm for Bayesian Networks Parameter Learning. Machine Learning and Cybernetics 3 (2004) 1503–1508
Research on Multi-Degree-of-Freedom Neurons with Weighted Graphs

Shoujue Wang1, Singsing Liu1, and Wenming Cao2

1 Artificial Neural Networks Laboratory, Institute of Semiconductors, Chinese Academy of Sciences, P.O. Box 912, Beijing 100083, P.R. China
2 Institute of Intelligent Information System, Information Engineering College, Zhejiang University of Technology, Hangzhou 310032, P.R. China
[email protected]
Abstract. In this paper, we redefine the sample-point set in the feature space from the point of view of weighted graphs and propose a new covering model — Multi-Degree-of-Freedom Neurons (MDFN). Based on this model, we describe a geometric learning algorithm with 3-degree-of-freedom neurons. It identifies the topological character of the sample-point set in the feature space, which differs from the traditional "separation" approach. Experimental results demonstrate the general superiority of this algorithm over the traditional PCA+NN algorithm in terms of efficiency and accuracy.
1 Introduction
The problem of Biomimetic Pattern Recognition (BPR) [1] can be loosely defined as finding an optimal compact covering of a given sample set. Traditional pattern recognition works on the premise that the whole feature space of samples is already known, and it separates one class from another by division. However, this is not consistent with human cognition [1], and the feature space has to be re-divided even when a single new class is added. To solve these problems, BPR was proposed, and it has since been advanced and widely applied in practice [3][5][8][6][7]. BPR is guided by high-dimensional geometric informatics [9]. It "recognizes" one kind of samples by studying the distribution of the sample points in the feature space and then covering them with proper graphs. Since it is impossible to construct a perfect covering set that collects all the samples of one class, we try to find a union of many simple units that approximates the perfect covering set, and an artificial neural network is an appropriate choice for constructing the covering set in BPR [2]. The DBF neuron [4], the hyper-sausage neuron (HSN), etc. were proposed successively in the past few years. Compared to the hyper-plane, hyper-ball, and hyper-cone used in traditional pattern recognition, they are better covering graphs, as depicted in Fig. 1. However, they are still not good enough, because they cannot match the characteristics of the feature space precisely — just like trying to cover a 3-dimensional

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 669–675, 2006. © Springer-Verlag Berlin Heidelberg 2006
Fig. 1. HSN chains based BPR in comparison with BP and RBF networks. The triangles represent samples to be recognized; the circles and crosses represent samples to be distinguished from the triangles. The polygonal line denotes the classification boundaries of traditional BP networks, and the big circle denotes radial basis function (RBF) networks. Only HSN chains keep topological connectivity and minimize the total covering volume. [9]
ball with line segments. Since an n-dimensional object is better covered with an n-dimensional simplex, which is constructed from n + 1 affinely independent points in R^n, higher-dimensional covering models should be studied. Here we propose a new covering model — Multi-Degree-of-Freedom Neurons (MDFN) — from the point of view of weighted graphs. In Section 2, we recall some basic notions used in the sequel, give the definitions of some basic terms, and then formalize our model. Based on this model, we present the learning algorithm in Section 3, taking 3-degree-of-freedom neurons as an example. Experiments on face recognition in 3-dimensional Euclidean space, compared with the traditional PCA+NN algorithm, demonstrate its efficiency in Section 4, and in Section 5 we conclude the paper with hints on future work.
2 The Multi-Degree-of-Freedom Neurons

2.1 Preliminaries on Weighted Graph
First of all, let us recall some basic definitions on weighted graphs; most of them are familiar from graph theory [10], so this subsection does not dwell on these definitions more than clarity requires: our main purpose is to relate these concepts to the feature space. Suppose that all of the sample points lie in the feature space E^n after the process of feature extraction and regulation, and that they are affinely independent.

Definition 1 (Vertex). All sample points compose the sample-point set V, and the elements of V are the vertices (or nodes, or points).
Definition 2 (Edge). The relationships between pairs of sample points in V compose the edge set E, and the elements of E are the edges (or lines). The elements of E are 2-element subsets of V.

Definition 3 (Graph). A graph is a pair G = (V, E). The number of vertices of a graph G is its order, written |G|.

Definition 4 (Weight). A weight is the value associated with an edge. In the feature space E^n, the extent of space between two points shows the degree of their difference. In this paper we define the weight (or cost) of an edge to be the Euclidean distance between the two sample points that form it. Since the Euclidean distance is undirected, the graphs in this paper are undirected graphs.

Definition 5 (Subgraph). If V' ⊆ V and E' ⊆ E, then G' = (V', E') is a subgraph of G (and G a supergraph of G').

Definition 6 (Path). A path (or simple path) is a non-empty graph P = (V, E) of the form

V = {x_0, x_1, ..., x_k},   E = {x_0 x_1, x_1 x_2, ..., x_{k-1} x_k},

where the x_i are all distinct. The vertices x_0 and x_k are linked by P and are called its ends; the vertices x_1, ..., x_{k-1} are the inner vertices of P. The number of edges of a path is its length, and a path of length k is denoted by P^k. A path shows the process of evolution from one sample point to another.

Definition 7 (Distance). The distance d_G(x, y) in G of two vertices x, y is the sum of the weights of a shortest x–y path in G; if no such path exists, we set d_G(x, y) := ∞. "Shortest" means that the sum of weights along this path is the least among all paths from x to y.

Definition 8 (Complete). If all the vertices of G are pairwise adjacent, then G is complete, denoted K^n. A complete graph on n vertices is in fact an (n−1)-simplex; a K^3 is called a triangle graph and a K^4 a tetrahedron graph.

The advantage of describing the feature space with graph theory is that we need not consider the coordinates of the sample points during computation, but only the points themselves and the connections between them.
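Definitions 4 and 7 translate directly into code. A minimal sketch (the function names and the dictionary-based adjacency structure are our own choices; d0 is the neighbor threshold introduced in Section 2.2 below):

```python
import heapq
import math

def build_graph(points, d0):
    """Edge set from Definition 4: pairs of points closer than the neighbor
    threshold d0 are joined by an edge weighted by their Euclidean distance."""
    n = len(points)
    adj = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if d < d0:
                adj[i][j] = adj[j][i] = d
    return adj

def graph_distance(adj, s, t):
    """d_G(s, t) from Definition 7: weight of a shortest s-t path
    (Dijkstra's algorithm); math.inf if no path exists."""
    dist = {s: 0.0}
    pq = [(0.0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == t:
            return d
        if d > dist.get(u, math.inf):
            continue  # stale queue entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return math.inf
```

Note that d_G can exceed the straight-line distance, which is exactly the point of Definition 7: the graph distance follows the sample distribution rather than cutting across empty regions of the feature space.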
2.2 The Definitions of Multi-Degree-of-Freedom Neurons

Given a class of sample points forming a set V. As noted above, it is not faithful to measure the feature space by straight-line Euclidean distance alone, but the extent of space between two points does indicate, comparatively, the degree of their difference. So it is sometimes necessary to set a neighbor threshold d_0. If the Euclidean distance d between a pair of vertices is smaller than d_0, they are neighbors
and form an edge, whose weight is d. If d is larger than d_0, the two points are too far apart to be neighbors and form no edge. In this way we obtain a relationship between the sample points in V, that is, E. Besides, suppose θ is a constant scalar, and let

U = {x | d(x, G) < θ},    (1)

where d(x, G) denotes the distance from the point x to the graph G; U is the covering of this class of samples, which is in fact a convex polyhedron covered with a θ-thick layer in R^n. Our goal is to find a weighted graph G made up of V and E, and a covering U that shows the geometric characteristics of these samples and covers them properly. So we describe the Multi-Degree-of-Freedom Neuron (MDFN) in mathematical terms by the following threshold function:

f(x, θ) = 1 if x ∈ U,  0 if x ∉ U,

where x is the input signal and f(x, θ) is the output due to the input signal. Here we do not consider the synaptic weights of the neuron. The input signal x can be an arbitrary vector in R^n. If x is inside the hyper-surface U, the neuron is activated and returns output signal 1, and vice versa. The neuron's degree of freedom is measured by the order of K: if the order of K is r + 1, it is an r-degree-of-freedom neuron (r-DFN). Several straightforward conclusions follow immediately.

Corollary 1. A 0-DFN is an RBF neuron.

Corollary 2. A 1-DFN is a hyper-sausage neuron.

Corollary 3. A 2-DFN is a ψ3-neuron.
3 Geometrical Learning Algorithm
Suppose we have an (r − 1)-DFN; it is easy to deduce an r-DFN by the method that follows. We therefore take geometrical learning with a 3-DFN as an example. Suppose V is the sample-point set of one class, v_1, v_2, ..., v_{|G|} ∈ V, and the points are affinely independent. The distances between all pairs of vertices are already known, so we obtain the edge set E according to a presettable neighbor threshold d_0. Thus we get a weighted graph G = (V, E).

Step 1. Find two vertices v_1 and v_2 that form the edge v_1 v_2 with the least weight w_12. Compute the least distance d_G(v_1, v_2) over all paths P^2, and record the inner vertex as v_3. We now have a triangle graph K^3_{1,2,3} constructed from v_1, v_2 and v_3. To obtain a tetrahedron K^4, we need another vertex v_4. In consideration of neighboring vertices' similarity, the distance from one vertex to another in K^3_{1,2,3} along any path P^2 passing through v_4 should be minimized; that is, v_4 should satisfy the following items simultaneously:
Research on Multi-Degree-of-Freedom Neurons with Weighted Graphs
673
1. d_G(v_1, v_2) along the path v_1 v_4 v_2 should be as small as possible.
2. d_G(v_1, v_3) along the path v_1 v_4 v_3 should be as small as possible.
3. d_G(v_2, v_3) along the path v_2 v_4 v_3 should be as small as possible.

In short, the destination vertex v_4 must minimize (w_14 + w_24 + w_34).

Definition 9. We call the vertex v_4 satisfying the three items above the nearest neighbor of the triangle K^3_{1,2,3}.

v_1, v_2, v_3 and v_4 compose a tetrahedron K^4_{1,2,3,4}. Record K^4_{1,2,3,4} as K^4_1; it is obviously a subgraph of G.
Step 2. If the distance from a surplus vertex of G outside K^4_1 to the vertices of K^4_1 is smaller than the least weight in K^4_1, then that vertex does not influence the final result and can be removed from V.

Definition 10. Given a tetrahedron K^4 and a vertex v in V, there are 4 weights between v and the tetrahedron's 4 vertices. We choose the sum of the least three weights as the measure of the distance from this vertex to the tetrahedron. The vertex with the least such sum in V is the nearest neighbor of the tetrahedron. The 3 vertices corresponding to the three least weights compose a K^3, which we name the nearest K^3 of the tetrahedron to v.

From the remaining vertices in V(G) − V(K^4_1), pick out the nearest neighbor of K^4_1 and record it as v_5. Then v_5 and the three vertices of the corresponding nearest K^3 to v_5 compose a new tetrahedron K^4_2. K^4_2 is adjacent to K^4_1, and they share a common triangle graph.

Step 3. Exclude from the surplus points the vertices inside the covering volumes of the former (i − 1) tetrahedron graphs. Among the vertices outside the covering volume, find the nearest neighbors of all the former (i − 1) tetrahedra, pick out the least one, and record it as v_i. Then v_i and the three vertices of the corresponding nearest K^3 compose K^4_{i−3}.

Step 4. Repeat Step 3 until all effective vertices have been involved. Finally we obtain at most |G| − 3 tetrahedra. According to formula (1), preset a constant scalar θ and we get a covering U_i for each tetrahedron K^4_i. The covering of this class of sample points is then their union:

U = ∪_i U_i .
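Definition 10 and the vertex selection of Steps 2 and 3 can be sketched as follows (the function names are our own; for brevity the weights are taken as plain Euclidean distances rather than thresholded graph edges):

```python
import math

def tet_distance(points, v, tet):
    """Definition 10: the distance from vertex v to a tetrahedron is the
    sum of the three least of the four weights between v and the
    tetrahedron's vertices; also returns the corresponding nearest K^3."""
    ds = sorted((math.dist(points[v], points[u]), u) for u in tet)
    nearest_k3 = tuple(u for _, u in ds[:3])
    return sum(d for d, _ in ds[:3]), nearest_k3

def nearest_neighbor(points, tet, candidates):
    """The candidate vertex with the least Definition-10 distance; it and
    its nearest K^3 form the next tetrahedron of the chain (Step 3)."""
    v = min(candidates, key=lambda u: tet_distance(points, u, tet)[0])
    return v, tet_distance(points, v, tet)[1]
```

Chaining these calls, each new vertex glues a fresh tetrahedron onto the previous ones along a shared triangle, which is how the algorithm arrives at the bound of at most |G| − 3 tetrahedra.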
4 Experiments and Results
Both the UMIST and the Yale databases are used in our experiments. We use the Yale database to analyze the influence of illumination and expression, and the UMIST database to analyze the influence of face position. We compare PCA+MDFN with the traditional PCA+NN algorithm.
Table 1. Comparison of the efficiency using the Yale database
            Correct rate   Rejection rate   Error rate
PCA+NN         70%              0              30%
PCA+MDFN       95%             100%             5%
Table 2. Comparison of the efficiency using the UMIST database

            Correct rate   Rejection rate   Error rate
PCA+NN        72.5%             0             27.5%
PCA+MDFN       85%             100%            15%
Fig. 2. Comparison of the efficiency when M is changed
Experiment 1. Under variation of illumination and expression, as Table 1 shows, the correct rate of PCA+MDFN is higher than that of PCA+NN, and it correctly rejects all of the unacquainted images. So PCA+MDFN is superior to PCA+NN under variation of illumination and expression.

Experiment 2. Under variation of face position, the correct rate of PCA+MDFN is again higher than that of PCA+NN, and it correctly rejects all of the unacquainted images. So PCA+MDFN is also superior to PCA+NN under variation of face position.

Experiment 3. We compare the efficiency of PCA+MDFN and the traditional PCA+NN algorithm as the dimension of the eigenvector space changes. The results are depicted in Fig. 2. As the number of eigenvectors M increases, the recognition rate rises, and so does the complexity. When M is larger than 30, the recognition rate no longer increases noticeably. Because PCA face recognition based on MDFN relies on the topological character of the sample sets in the feature space, which differs from "classification", its recognition rate is much better.
5 Conclusion
In this paper we proposed a new geometric covering model, explored the concept of MDFN with weighted graphs, and showed its relevance to pattern recognition. Based on this model we presented our MDFN algorithm, giving an example for a neuron with 3 degrees of freedom, and demonstrated its superiority over traditional methods. In addition, we presented a successful application of our algorithm to the problem of face recognition. Meanwhile, some problems remain. Firstly, the choice of the scalar θ is crucial when constructing an MDFN, but at present we preset it by experience. Secondly, the design of our experiments is too simple to fully show the superiority of MDFN, and experiments with a large number of samples should be done in the future. We also plan to make further researches on the properties of MDFN with weighted graphs and to investigate its usefulness as a tool for high-dimensional geometric informatics.
References

1. Wang, S.J.: Biomimetic (Topological) Pattern Recognition - A New Model of Pattern Recognition Theory and Its Applications. Acta Electronica Sinica 30(1) (2002) 1-4
2. Wang, S.J., Zhao, X.T.: Biomimetic Pattern Recognition Theory and Its Applications. China J. Electron. 13(3) (2004) 373-377
3. Wang, S.J., et al.: Multi-camera Human-face Personal Identification System Based on the Biomimetic Pattern Recognition. Acta Electronica Sinica 31(1) (2003) 1-3
4. Cao, W.M., Feng, H., Wang, S.J.: The Application of DBF Neural Networks for Object Recognition. Inf. Sci. 160(1-4) (2003) 153-160
5. Wang, S.J., et al.: Face Recognition: Biomimetic Pattern Recognition vs. Traditional Pattern Recognition. Acta Electronica Sinica 32(7) (2004) 1057-1061
6. Qin, H., Wang, S.J.: Comparison of Biomimetic Pattern Recognition, HMM and DTW for Speaker-independent Speech Recognition. Acta Electronica Sinica 33(5) (2005) 957-560
7. Wang, S.J., Cao, W.M., et al.: Research on Speaker-independent Continuous Figure Speech-recognition Based on High-dimensional Space Covering and Dynamic Scanning. Acta Electronica Sinica 33(10) (2005) 1790-1093
8. Cao, W.M., et al.: Iris Recognition Algorithm Based on Point Covering of High-dimensional Space and Neural Network. Lecture Notes in Artificial Intelligence 3587 (2005) 305-313
9. Wang, S.J., Jiangliang, L.: Geometrical Learning, Descriptive Geometry, and Biomimetic Pattern Recognition. Neurocomputing 67 (2005) 9-28
10. Diestel, R.: Graph Theory. 2nd edn. Springer-Verlag, New York (2000)
Output PDF Shaping of Singular Weights System: Monotonical Performance Design

Hong Yue1,2, Aurelie J.A. Leprand1, and Hong Wang1,2

1 The University of Manchester, Manchester M60 1QD, UK
2 Institute of Automation, Chinese Academy of Sciences, Beijing, China
Abstract. For a group of stochastic systems whose output probability density function (PDF) is modelled by the B-spline neural networks with singular weights dynamics, the output PDF is controlled towards the desired PDF by an innovative optimization design with monotonically decreasing performance. The controller is developed based on the transformed equivalent non-singular system with sufficient conditions given for the monotonic performance.
1 Introduction
For stochastic systems subjected to arbitrary stochastic noises, control of the output probability density function (PDF) is a development of the past decade using the modelling technique of B-spline neural networks [1, 2]. Research on singular control systems has been carried out for many years [3]. Most of the important issues of non-singular control systems, such as controllability and observability, stability, parameter estimation, model reduction, and feedback control, have been extended to singular systems, especially to linear time-invariant singular systems. Further research on complex singular systems with time-varying parameters, nonlinearities, time delay, etc. has also achieved positive results. In the past two decades there have been quite a few studies on the control of stochastic singular systems [4], [5], [6]. However, little has been reported on the output PDF control of such systems. In our earlier work [7], a PDF controller was developed for the singular B-spline weights system, which minimizes the distance between the output PDF and the desired PDF at each time instant. In practice, however, it is often required that the system adopt a monotonic behavior when approaching the target. This motivates the work in this paper, where a feasible optimization algorithm is investigated for PDF tracking with monotonically decreasing performance.
2 B-Spline PDF Model with Singular Weights Dynamics
Denote by γ(y, u(k)) the PDF of the system output at time k, with y defined on a bounded interval [a, b]. The discretized PDF model of a singular weights system can then be represented as follows:

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 676–682, 2006. © Springer-Verlag Berlin Heidelberg 2006
γ(y, u(k)) = C(y)V(k) + L(y),    (1)
E x(k+1) = A x(k) + B u(k),    (2)
V(k) = D x(k).    (3)
Eq. (1) is the static output PDF model approximated by the B-spline neural networks, where V(k) ∈ R^n is the independent weights vector, and C(y) ∈ R^n and L(y) are formulated from the pre-specified B-spline functions [2]. Eqns. (2) and (3) form the singular dynamic model of the weights vector, where x is the state vector and u(k) is the control input. A, B, D, E are the system parameter matrices with proper dimensions, and E is singular. In order to apply the results from non-singular PDF systems efficiently, the dynamic part of the above model is first transformed into an equivalent non-singular model as follows [7]:

x_1(k+1) = A_1 x_1(k) + B_1 u(k),    (4)
x_2(k) = −B_2 u(k),    (5)
V(k) = D_1 x_1(k) + D_2 x_2(k).    (6)

Here A_1 ∈ R^{q×q}, B_1 ∈ R^q, B_2 ∈ R^{n−q}, D_1 ∈ R^{(n−1)×q}, D_2 ∈ R^{(n−1)×(n−q)}.
3 Monotonically Decreasing Performance Design

3.1 Condition 1
The aim of the controller design is to find a control sequence u(k) such that the output PDF γ(y, u(k)) is made as close as possible to the desired PDF. Consequently, the instant performance function to be minimized is given by

J_k = ∫_a^b [γ(y, u(k)) − g(y)]² dy + R u²(k),    (7)

where g(y) is the continuous target PDF and R > 0 is the pre-specified weighting factor for limiting u(k). From equations (1) and (6), the difference between the output PDF and the desired PDF is

γ(y, u(k)) − g(y) = C(y)D_1 x_1(k) + L(y) − g(y) − C(y)D_2 B_2 u(k) = g̃(y, k) − H(y)u(k),    (8)

where g̃(y, k) = C(y)D_1 x_1(k) + L(y) − g(y) and H(y) = C(y)D_2 B_2. Substituting (8) into (7), the performance function J_k can be rewritten as

J_k = ∫_a^b g̃²(y, k) dy − 2[∫_a^b g̃(y, k)H(y) dy] u(k) + (R + S)u²(k),    (9)

where S = ∫_a^b H²(y) dy > 0 is known. The following condition is set to guarantee a monotonically decreasing performance:

J_k ≤ ρ J_{k−1},  0 < ρ < 1.    (10)
Substituting (9) into (10), we have

(S + R)u²(k) − 2[∫_a^b g̃(y, k)H(y) dy] u(k) + ∫_a^b g̃²(y, k) dy − ρ J_{k−1} ≤ 0.    (11)

Denote a_1 = S + R, b_1 = −2 ∫_a^b g̃(y, k)H(y) dy, and c_1 = ∫_a^b g̃²(y, k) dy − ρ J_{k−1}. Then (11) can be written as

a_1 u²(k) + b_1 u(k) + c_1 ≤ 0.    (12)
Since a_1 > 0, the control input u(k) should be designed to satisfy the following two constraints simultaneously:

(−b_1 − √(b_1² − 4a_1 c_1)) / (2a_1) ≤ u(k) ≤ (−b_1 + √(b_1² − 4a_1 c_1)) / (2a_1),    (13)

Δ_1 = b_1² − 4a_1 c_1 ≥ 0.    (14)
Substituting g̃(y, k) and H(y) into b_1 and c_1, we have

b_1² = 4B_2² x_1^T(k) η_1 x_1(k) − 8B_2² η_2 x_1(k) + 4B_2² η_3,    (15)
c_1 = x_1^T(k) θ_1 x_1(k) + 2θ_2 x_1(k) + θ_3 − ρ J_{k−1},    (16)

where η_1 = D_1^T a_{11} D_2 D_2^T a_{11} D_1, η_2 = a_{12} D_2 D_2^T a_{11} D_1, η_3 = D_2^T a_{12}^T a_{12} D_2, θ_1 = D_1^T a_{11} D_1, θ_2 = a_{12} D_1, θ_3 = ∫_a^b [L(y) − g(y)]² dy, a_{11} = ∫_a^b C^T(y)C(y) dy, and a_{12} = ∫_a^b [L(y) − g(y)] C(y) dy are known once the B-splines are chosen.

3.2 Condition 2: b_1² − 4a_1 c_1 ≥ 0
Considering (15) and (16), condition (14) can be further expressed as

Δ_1 = x_1^T(k) α x_1(k) + β x_1(k) + γ_1 ≥ 0,    (17)

where α = 4B_2² η_1 − 4(S + R)θ_1, β = −8B_2² η_2 − 8(S + R)θ_2, and γ_1 = 4B_2² η_3 − 4(S + R)θ_3 + 4ρ(S + R)J_{k−1}. Eqn. (17) is equivalent to

−x_1^T(k) α x_1(k) − β x_1(k) ≤ γ_1.    (18)

An upper bound for the left-hand side of (18) can be derived as follows:

−x_1^T(k) α x_1(k) − β x_1(k) ≤ ||α|| ||x_1(k)||² + ||β|| ||x_1(k)||.    (19)

Taking (18) and (19) together, a sufficient condition for (17) is obtained:

||α|| ||x_1(k)||² + ||β|| ||x_1(k)|| − γ_1 ≤ 0.    (20)
For this system, it is obvious that ||α|| ≥ 0 and ||β|| ≥ 0, and the solution ||x_1(k)|| must be real and positive. Therefore, (20) can be satisfied when ||β||² + 4||α||γ_1 ≥ 0 and 0 ≤ ||x_1|| ≤ N_2, where N_2 is the positive root of the polynomial in (20) (the other root is negative by analysis). In the following, we assume it is possible to find an instant k at which ||x_1(k)|| ≤ N_2; the problem is then to guarantee the validity of this condition at time k + 1. For this purpose, the following theorem is presented.

Theorem 1. For the transformed state system (4), if ||A_1|| < 1, then when ||x_1(k)|| ≤ N_2, there exists u(k) ∈ R such that ||x_1(k+1)|| ≤ N_2 also holds.

Proof: It can be seen from equation (4) that

||x_1(k+1)||² = x_1^T(k) A_1^T A_1 x_1(k) + 2x_1^T(k) A_1^T B_1 u(k) + B_1^T B_1 u²(k).    (21)
Denote λ_1 = B_1^T B_1, λ_2 = 2x_1^T(k) A_1^T B_1, and λ_3 = x_1^T(k) A_1^T A_1 x_1(k) − N_2². Then ||x_1(k+1)||² ≤ N_2² can be represented as

λ_1 u²(k) + λ_2 u(k) + λ_3 ≤ 0.    (22)
Equation (22) has two real solutions only if Δ_2 = λ_2² − 4λ_1 λ_3 ≥ 0, in which case

u_{1,2} = (−λ_2 ± √(λ_2² − 4λ_1 λ_3)) / (2λ_1).    (23)

That is to say, when Δ_2 ≥ 0 and u(k) ∈ [u_1, u_2], ||x_1(k+1)|| ≤ N_2 is satisfied. By introducing the expressions of λ_1, λ_2 and λ_3, Δ_2 can be calculated as

Δ_2 = x_1^T(k) [(A_1^T B_1)(A_1^T B_1)^T − A_1^T A_1 B_1^T B_1] x_1(k) + N_2² B_1^T B_1.    (24)

Since ||x_1(k)|| ≤ N_2, i.e., x_1^T(k) x_1(k) ≤ N_2², we have

Δ_2 ≥ x_1^T(k) [(A_1^T B_1)(A_1^T B_1)^T − A_1^T A_1 B_1^T B_1] x_1(k) + x_1^T(k) B_1^T B_1 x_1(k)
    = x_1^T(k) [(A_1^T B_1)(A_1^T B_1)^T + B_1^T B_1 (I − A_1^T A_1)] x_1(k).    (25)

It can be seen from this inequality that Δ_2 ≥ 0 when ||A_1^T A_1|| = ||A_1||² ≤ 1. This proves Theorem 1.

To summarize, the control input u(k) guarantees the monotonically decreasing performance (10) of the PDF system with singular weights dynamics when the following conditions are satisfied:

– The transformed system (4) is open-loop stable;
– Condition 1: (−b_1 − √(b_1² − 4a_1 c_1))/(2a_1) ≤ u(k) ≤ (−b_1 + √(b_1² − 4a_1 c_1))/(2a_1);
– Condition 2: (−λ_2 − √(λ_2² − 4λ_1 λ_3))/(2λ_1) ≤ u(k) ≤ (−λ_2 + √(λ_2² − 4λ_1 λ_3))/(2λ_1).
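Numerically, the two conditions amount to intersecting two intervals for u(k). A minimal sketch (the function names are our own; it assumes a_1 > 0 and λ_1 > 0 and, as one reasonable choice among the admissible values, clips the unconstrained minimizer −b_1/(2a_1) of the quadratic (9) into the admissible interval):

```python
import math

def admissible_u(a1, b1, c1, lam1, lam2, lam3):
    """Intersect the interval from Condition 1 (Eq. 13) with the one from
    Condition 2 (Eq. 23); returns (lo, hi), or None if either discriminant
    is negative or the intersection is empty."""
    d1 = b1 * b1 - 4 * a1 * c1
    d2 = lam2 * lam2 - 4 * lam1 * lam3
    if d1 < 0 or d2 < 0:
        return None
    lo = max((-b1 - math.sqrt(d1)) / (2 * a1), (-lam2 - math.sqrt(d2)) / (2 * lam1))
    hi = min((-b1 + math.sqrt(d1)) / (2 * a1), (-lam2 + math.sqrt(d2)) / (2 * lam1))
    return (lo, hi) if lo <= hi else None

def control_input(a1, b1, c1, lam1, lam2, lam3):
    """Pick the admissible u(k) closest to the unconstrained minimizer
    -b1/(2*a1) of the quadratic performance function J_k."""
    box = admissible_u(a1, b1, c1, lam1, lam2, lam3)
    if box is None:
        return None
    u_star = -b1 / (2 * a1)
    return min(max(u_star, box[0]), box[1])
```

When the intersection is empty, the monotonic-decrease requirement (10) cannot be met at this step, which is consistent with the conditions of Theorem 1 being sufficient rather than necessary.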
4 Model Simulation
Suppose the output PDF is defined on the interval [−3, 3] and can be approximated by γ(y, u(k)) = w1 φ1 (y) + w2 φ2 (y) + w3 φ3 (y). The three basis functions are defined as in [7]. Consider the transformed model with A1 being stable.
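The exact B-spline basis functions of [7] are not reproduced here; as an illustrative stand-in (our own assumption, not the paper's basis), normalized triangular "hat" functions centered at −2, 0 and 2 on [−3, 3] have the same flavor and let one check that a target mixture of the form used below is a valid PDF:

```python
def hat(center, width=1.0):
    """A triangular 'hat' function normalized to unit area: an
    illustrative stand-in for the B-spline basis functions of [7]."""
    def phi(y):
        return max(0.0, 1.0 - abs(y - center) / width) / width
    return phi

phi1, phi2, phi3 = hat(-2.0), hat(0.0), hat(2.0)

def g(y):
    """A target mixture in the shape of the paper's desired PDF:
    g(y) = 1/3 * phi2(y) + 2/3 * phi3(y)."""
    return phi2(y) / 3.0 + 2.0 * phi3(y) / 3.0

def integrate(f, a=-3.0, b=3.0, n=60000):
    """Midpoint-rule integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))
```

Because each basis function integrates to one, any convex combination of them is automatically a PDF, which is the property the B-spline model (1) relies on.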
Fig. 1. Initial, final and target PDFs
Fig. 2. Evolution of the performance function
The parameters are

A_1 = [0.50  −0.25; 1/9  1/6],  B_1 = [1; 2],  B_2 = 0.4,  D_1 = [0.5  0.9; 0.2  0.1],  D_2 = [1; 0.6].

The desired PDF for this simulation is selected as g(y) = (1/3)φ_2(y) + (2/3)φ_3(y), and the initial conditions are chosen by imposing γ(y, u(0)) = B_1(y): u(0) = −0.5, x_1(0) = [0.4; 1], x_2(0) = 0.2, V(0) = [1; 0]. The weighting factor is chosen as R = 0.1, and the decreasing factor of the performance index is ρ = 0.8. The comparison of the initial PDF and the final PDF can be seen in Fig. 1: the output PDF is controlled towards the desired PDF, although there is a steady-state tracking error. From Fig. 2, it is clearly seen that the PDF tracking error decreases monotonically during the control process.
5 Conclusions
In this paper, linear B-spline neural networks are used to approximate the output PDF of non-Gaussian stochastic systems, where the B-spline weights vector exhibits singular dynamic behavior. In order to enable a convenient PDF controller design, the singular system model is transformed into an equivalent non-singular state-space model. The main contribution of this work is the development of sufficient conditions that guarantee the monotonically decreasing performance. The simulation study shows that, although there exists a steady-state PDF tracking error, the monotonically decreasing performance can be obtained by the proposed algorithm. In fact, using the linear B-spline expansions, it can be verified that the dynamics of the PDF weights satisfy a differential-algebraic equation. Further research with guaranteed tracking and stability performance can proceed based on PID-type tracking schemes [8], and some results on singular systems can be applied [9].
Acknowledgement

This work is supported by The Outstanding Overseas Chinese Scholars Fund of the Chinese Academy of Sciences (2004-1-4).
References

1. Wang, H.: Robust Control of the Output Probability Density Functions for Multivariable Stochastic Systems. IEEE Trans. Automat. Contr. 44(11) (1999) 2103-2107
2. Wang, H.: Bounded Dynamic Stochastic Systems: Modelling and Control. Springer-Verlag, London (2000)
3. Dai, L.: Singular Control Systems. Springer-Verlag, Berlin (1989)
4. Darouach, M., Zasadzinski, M., Mehdi, D.: State Estimation of Stochastic Singular Linear Systems. Int. J. Systems Science 2(2) (1993) 345-354
5. Liu, W.Q., Yan, W.Y., Teo, K.L.: On Initial Instantaneous Jumps of Singular Systems. IEEE Trans. Automat. Contr. 40(8) (1995) 1650-1655
6. Nikoukhah, R., Campbell, S.L., Delebecque, F.: Kalman Filtering for General Discrete-Time Linear Systems. IEEE Trans. Automat. Contr. 44(10) (1999) 1829-1839
7. Yue, H., Leprand, A.J.A., Wang, H.: Stochastic Distribution Control of Singular Systems: Output PDF Shaping. Acta Automatica Sinica 31(1) (2005) 151-160
8. Guo, L., Wang, H.: PID Controller Design for Output PDFs of Stochastic Systems Using Linear Matrix Inequalities. IEEE Trans. Syst. Man Cybern. Part B 35(1) (2005) 65-71
9. Guo, L., Malabre, M.: Robust H-infinity Control for Descriptor Systems with Non-Linear Uncertainties. Int. J. Control 76(12) (2003) 1254-1262
Stochastic Time-Varying Competitive Neural Network Systems

Yi Shen1, Meiqin Liu2,*, and Xiaodong Xu3

1 Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
[email protected]
2 College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China
[email protected]
3 College of Public Administration, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
[email protected]
Abstract. In this paper we reveal that environmental noise will suppress a potential population explosion in stochastic competitive neural network systems with variable delay. To reveal these interesting facts, we stochastically perturb the competitive neural network system with variable delay ẋ(t) = diag(x_1(t), . . . , x_n(t))[b + Ax(t − δ(t))] into the Itô form dx(t) = diag(x_1(t), . . . , x_n(t))[b + Ax(t − δ(t))]dt + σx(t)dw(t), and show that although the solution to the original delay system may explode to infinity in a finite time, with probability one that of the associated stochastic delay system does not.
1 Introduction
Deterministic subclasses of the competitive neural network systems are well known and have been extensively investigated in the literature concerning ecological population modeling. One particularly interesting subclass describes the facultative mutualism of two species, where each one enhances the growth of the other, represented through the deterministic system

ẋ_1(t) = x_1(t)(b_1 − a_11 x_1(t) + a_12 x_2(t)),
ẋ_2(t) = x_2(t)(b_2 − a_22 x_2(t) + a_21 x_1(t)),   (1)
for positive constants a_12 and a_21. The associated dynamics have been developed by many authors [1-9]. Now, in order to avoid having a solution that explodes at a finite time, a_12 a_21 is required to be smaller than a_11 a_22. To illustrate what happens when the latter condition does not hold, suppose that a_11 = a_22 = α and a_12 = a_21 = β (i.e., we have a symmetric system) and α² < β². Moreover, let us assume that b_1 = b_2 = b ≥ 1 and that both species have the same initial
* Corresponding author.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 683–688, 2006. © Springer-Verlag Berlin Heidelberg 2006
value x_1(0) = x_2(0) = x_0 > 0. Then the resulting symmetry reduces system (1) to the single deterministic system ẋ(t) = x(t)[b + (−α + β)x(t)], whose solution is given by

x(t) = b / ( −(−α + β) + ((b + (−α + β)x_0)/x_0) e^{−bt} ).
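As a quick numerical sanity check of this closed-form solution (a sketch, not part of the paper; the values b = 1, α = 1, β = 2, x_0 = 1 are arbitrary illustrative choices), one can compare it against a forward-Euler integration of the reduced equation on an interval ending before the blow-up time:

```python
import math

def x_exact(t, b, gamma, x0):
    # closed-form solution of dx/dt = x(b + gamma*x), gamma = -alpha + beta
    return b / (-gamma + ((b + gamma * x0) / x0) * math.exp(-b * t))

b, alpha, beta, x0 = 1.0, 1.0, 2.0, 1.0
gamma = -alpha + beta  # = 1

# forward-Euler integration up to t = 0.4 (the blow-up occurs at t = ln 2)
dt, steps = 1e-5, 40000
x = x0
for _ in range(steps):
    x += dt * x * (b + gamma * x)
t_end = dt * steps
print(x, x_exact(t_end, b, gamma, x0))

# the denominator of the closed form vanishes at the finite explosion time
t_star = (1.0 / b) * (math.log(b + gamma * x0) - math.log(gamma * x0))
print(t_star)  # ln 2 ~ 0.6931
```

The denominator vanishes exactly at the finite explosion time discussed next.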
Now the assumption that α² < β² causes x(t) to explode at the finite time t = (1/b){ln[b + (−α + β)x_0] − ln[(−α + β)x_0]}. The same is true of the delay competitive neural network systems [5]. Nevertheless, this can be avoided, even when the condition a_12 a_21 < a_11 a_22 does not hold, by introducing (stochastic) environmental noise. Throughout this paper, unless otherwise specified, we let (Ω, F, {F_t}_{t≥0}, P) be a complete probability space with a filtration {F_t}_{t≥0} satisfying the usual conditions (i.e., it is increasing and right continuous while F_0 contains all P-null sets). Moreover, let w(t) be a one-dimensional Brownian motion defined on the filtered space and let R_+^n = {x ∈ R^n : x_i > 0 for all 1 ≤ i ≤ n}. Finally, denote the trace norm of a matrix A by |A| = √(trace(AᵀA)) (where Aᵀ denotes the transpose of a vector or matrix A) and its operator norm by ‖A‖ = sup{|Ax| : |x| = 1}. Let τ > 0 and let C([−τ, 0]; R_+^n) denote the family of all continuous functions from [−τ, 0] to R_+^n. To pose the question in general form, we consider a time-varying competitive neural network system with n interacting components, which corresponds to the case of facultative mutualism, namely

ẋ_i(t) = x_i(t)(b_i + Σ_{j=1}^{n} a_ij x_j(t − δ(t))),   1 ≤ i ≤ n.
This system can be rewritten in the matrix form

ẋ(t) = diag(x_1(t), . . . , x_n(t))(b + Ax(t − δ(t))),   t ≥ 0,   (2)
where x(t) = (x_1(t), . . . , x_n(t))ᵀ, b = (b_i)_{1×n}, A = (a_ij)_{n×n}, and δ : R_+ → [0, τ] with δ′(t) ≤ δ̄ < 1, where δ̄ is a constant. Stochastically perturbing each parameter a_ij → a_ij + σ_ij ẇ(t) results in the new stochastic form

dx(t) = diag(x_1(t), . . . , x_n(t))((b + Ax(t − δ(t)))dt + σx(t)dw(t)),   t ≥ 0.   (3)
Here σ = (σ_ij)_{n×n}, and we impose the condition

(H) σ_ii ≠ 0,  σ_ij σ_ik ≥ 0,  i, j, k = 1, . . . , n.

For a stochastic system to have a unique global solution (i.e., no explosion in a finite time) for any given initial value {x(t) : −τ ≤ t ≤ 0} ∈ C([−τ, 0]; R_+^n), the
coefficients of system (3) are generally required to satisfy both the linear growth condition and the local Lipschitz condition [10]. However, the coefficients of the system (3) do not satisfy the linear growth condition, though they are locally Lipschitz continuous, so the solution of the system (3) may explode at a finite time. Under the simple hypothesis (H), in this paper we show that this solution is positive and global.
2 Positive and Global Solutions
Theorem 2.1. Under hypothesis (H), for any system parameters b ∈ R^n and A ∈ R^{n×n}, and any given initial data {x(t) : −τ ≤ t ≤ 0} ∈ C([−τ, 0]; R_+^n), there is a unique solution x(t) to the system (3) on t ≥ −τ and the solution will remain in R_+^n with probability 1, namely x(t) ∈ R_+^n for all t ≥ −τ almost surely.

Proof. Since the coefficients of the system are locally Lipschitz continuous, for any given initial data {x(t) : −τ ≤ t ≤ 0} ∈ C([−τ, 0]; R_+^n) there is a unique maximal local solution x(t) on t ∈ [−τ, τ_e), where τ_e is the explosion time [10]. To show this solution is global, we need to show that τ_e = ∞ a.s. Let k_0 > 0 be sufficiently large for

1/k_0 < min_{−τ≤t≤0} |x(t)| ≤ max_{−τ≤t≤0} |x(t)| < k_0.

For each integer k ≥ k_0, define the stopping time

τ_k = inf{t ∈ [0, τ_e) : x_i(t) ∉ (1/k, k) for some i = 1, . . . , n},

where throughout this paper we set inf ∅ = ∞ (as usual ∅ denotes the empty set). Clearly, τ_k is increasing as k → ∞. Set τ_∞ = lim_{k→∞} τ_k, whence τ_∞ ≤ τ_e a.s. If we can show that τ_∞ = ∞ a.s., then τ_e = ∞ a.s. and x(t) ∈ R_+^n a.s. for all t ≥ 0. In other words, to complete the proof all we need to show is that τ_∞ = ∞ a.s. To show this statement, let us define a C²-function V : R_+^n → R_+ by

V(x) = Σ_{i=1}^{n} (x_i^θ − θ lg(x_i)),
where 0 < θ < 1. The nonnegativity of this function can be seen from u^θ − θ lg u > 0 on u > 0. Let k ≥ k_0 and T > 0 be arbitrary. For 0 ≤ t ≤ τ_k ∧ T, we can apply the Itô formula to ∫_{t−δ(t)}^{t} |x(s)|² ds + V(x(t)) to obtain that

d[ ∫_{t−δ(t)}^{t} |x(s)|² ds + V(x(t)) ]
≤ [ |x(t)|² − (1 − δ̄)|x(t − δ(t))|² ] dt
  + Σ_{i=1}^{n} { θ(x_i^{θ−1}(t) − x_i^{−1}(t)) x_i(t) [ (b_i + Σ_{j=1}^{n} a_ij x_j(t − δ(t))) dt + Σ_{j=1}^{n} σ_ij x_j(t) dw(t) ]
  + 0.5 (θ(θ − 1) x_i^{θ−2}(t) + θ x_i^{−2}(t)) x_i²(t) ( Σ_{j=1}^{n} σ_ij x_j(t) )² dt }
= { |x(t)|² − (1 − δ̄)|x(t − δ(t))|² + Σ_{i=1}^{n} θ(x_i^θ(t) − 1)(b_i + Σ_{j=1}^{n} a_ij x_j(t − δ(t)))
  + 0.5 Σ_{i=1}^{n} (θ + θ(θ − 1) x_i^θ(t)) ( Σ_{j=1}^{n} σ_ij x_j(t) )² } dt
  + Σ_{i=1}^{n} Σ_{j=1}^{n} θ(x_i^θ(t) − 1) σ_ij x_j(t) dw(t).   (4)
Compute

Σ_{i=1}^{n} θ(x_i^θ(t) − 1)(b_i + Σ_{j=1}^{n} a_ij x_j(t − δ(t)))
≤ Σ_{i=1}^{n} θ b_i (x_i^θ(t) − 1) + Σ_{i=1}^{n} Σ_{j=1}^{n} [ 0.25 n θ² a_ij² (1 − δ̄)^{−1} (x_i^θ(t) − 1)² + n^{−1} (1 − δ̄) x_j²(t − δ(t)) ]
= Σ_{i=1}^{n} θ b_i (x_i^θ(t) − 1) + 0.25 n θ² (1 − δ̄)^{−1} Σ_{i=1}^{n} Σ_{j=1}^{n} a_ij² (x_i^θ(t) − 1)² + (1 − δ̄) |x(t − δ(t))|²

and

Σ_{i=1}^{n} ( Σ_{j=1}^{n} σ_ij x_j(t) )² ≤ Σ_{i=1}^{n} Σ_{j=1}^{n} σ_ij² Σ_{j=1}^{n} x_j²(t) = |σ|² |x(t)|².
Moreover, by hypothesis (H),

Σ_{i=1}^{n} x_i^θ(t) ( Σ_{j=1}^{n} σ_ij x_j(t) )² ≥ Σ_{i=1}^{n} σ_ii² x_i^{2+θ}(t).
Substituting these into (4) yields

d[ ∫_{t−δ(t)}^{t} |x(s)|² ds + V(x(t)) ] ≤ F(x(t)) dt + Σ_{i=1}^{n} Σ_{j=1}^{n} θ(x_i^θ(t) − 1) σ_ij x_j(t) dw(t),   (5)
where

F(x) = (1 + 0.5 θ |σ|²)|x|² + Σ_{i=1}^{n} θ b_i (x_i^θ − 1) + 0.25 n θ² (1 − δ̄)^{−1} Σ_{i=1}^{n} Σ_{j=1}^{n} a_ij² (x_i^θ − 1)² − 0.5 θ (1 − θ) Σ_{i=1}^{n} σ_ii² x_i^{2+θ}.   (6)

It is straightforward to see that F(x) is bounded, say by K, in R_+^n. We therefore obtain that

d[ ∫_{t−δ(t)}^{t} |x(s)|² ds + V(x(t)) ] ≤ K dt + Σ_{i=1}^{n} Σ_{j=1}^{n} θ(x_i^θ(t) − 1) σ_ij x_j(t) dw(t).

Integrating both sides from 0 to τ_k ∧ T, and then taking expectations, yields

E{ ∫_{τ_k∧T − δ(τ_k∧T)}^{τ_k∧T} |x(s)|² ds + V(x(τ_k ∧ T)) } ≤ ∫_{−τ}^{0} |x(s)|² ds + V(x(0)) + K E(τ_k ∧ T).

Consequently,

E V(x(τ_k ∧ T)) ≤ ∫_{−τ}^{0} |x(s)|² ds + V(x(0)) + K T.   (7)
Note that for every ω ∈ {τ_k ≤ T}, there is some i such that x_i(τ_k, ω) equals either k or 1/k, and hence V(x(τ_k, ω)) is no less than either k^θ − θ lg(k) or k^{−θ} − θ lg(k^{−1}) = k^{−θ} + θ lg(k). Consequently,

V(x(τ_k, ω)) ≥ (k^θ − θ lg(k)) ∧ (k^{−θ} + θ lg(k)).

It then follows from (7) that

∫_{−τ}^{0} |x(s)|² ds + V(x(0)) + K T ≥ E[ 1_{τ_k≤T}(ω) V(x(τ_k, ω)) ] ≥ P{τ_k ≤ T} [ (k^θ − θ lg(k)) ∧ (k^{−θ} + θ lg(k)) ],

where 1_{τ_k≤T} is the indicator function of {τ_k ≤ T}. Letting k → ∞ gives lim_{k→∞} P{τ_k ≤ T} = 0 and hence P{τ_∞ ≤ T} = 0. Since T > 0 is arbitrary, we must have P{τ_∞ < ∞} = 0, so P{τ_∞ = ∞} = 1 as required.

Remark 1. It is well known that the systems (1) and (2) may explode to infinity at a finite time for some system parameters b ∈ R^n and A ∈ R^{n×n}. However, the explosion will no longer happen as long as there is noise. In other words, this result reveals the important property that the environmental noise suppresses the explosion for the time-varying delay system.
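To see Remark 1 at work numerically, the sketch below integrates the scalar, zero-delay special case of (3), dx = x(b + γx)dt + σx² dw with γ = β − α (all parameter values here are illustrative assumptions, as is the simplification δ(t) ≡ 0). An Euler scheme is applied to ln x(t), which keeps the iterate positive and makes the suppression mechanism explicit: Itô's formula gives the drift b + γx − 0.5σ²x² for ln x, so the noise-induced term −0.5σ²x² dominates for large x.

```python
import math
import random

random.seed(42)
b, gamma, sigma, x0 = 1.0, 1.0, 1.0, 1.0
dt, n_steps = 1e-3, 2000  # horizon T = 2

# deterministic part: dx/dt = x(b + gamma*x) explodes at t = ln 2 ~ 0.693
x_det, t_explode = x0, None
for n in range(n_steps):
    x_det += dt * x_det * (b + gamma * x_det)
    if x_det > 1e6:
        t_explode = (n + 1) * dt
        break

# stochastic part, via Ito's formula:
#   d(ln x) = (b + gamma*x - 0.5*sigma^2*x^2) dt + sigma*x dw,
# i.e. for large x the drift of ln x is strongly negative -- this
# noise-induced -0.5*sigma^2*x^2 term is what prevents the explosion.
log_x = math.log(x0)
path = []
for _ in range(n_steps):
    x = math.exp(log_x)
    drift = b + gamma * x - 0.5 * sigma**2 * x**2
    log_x += drift * dt + sigma * x * math.sqrt(dt) * random.gauss(0.0, 1.0)
    path.append(math.exp(log_x))

print("deterministic explosion time ~", t_explode)
print("stochastic path max on [0, 2]:", max(path))
```

On this run the deterministic iterate passes 10^6 shortly after the theoretical blow-up time ln 2, while the seeded stochastic path remains positive and finite on the whole interval.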
3 Conclusion
In this paper we have shown that environmental noise plays a key role in suppressing a potential population explosion in stochastic time-varying delay competitive neural network systems. We have shown that although the solution to the original delay system may explode to infinity in a finite time, with probability one that of the associated stochastic time-varying delay system does not.
Acknowledgments. The work was supported by the Natural Science Foundation of Hubei (2004ABA055) and the National Natural Science Foundation of China (60574025, 60074008).
References
1. Ahmad, A., Rao, M.R.M.: Asymptotically Periodic Solutions of N-competing Species Problem with Time Delay. J. Math. Anal. Appl. 186 (1994) 557-571
2. Bereketoglu, H., Gyori, I.: Global Asymptotic Stability in A Nonautonomous Lotka-Volterra Type System with Infinite Delay. J. Math. Anal. Appl. 210 (1997) 279-291
3. Freedman, H.I., Ruan, S.: Uniform Persistence in Functional Differential Equations. J. Differential Equations 115 (1995) 173-192
4. Gopalsamy, K.: Stability and Oscillations in Delay Differential Equations of Population Dynamics. Kluwer Academic, Dordrecht (1992)
5. He, X., Gopalsamy, K.: Persistence, Attractivity, and Delay in Facultative Mutualism. J. Math. Anal. Appl. 215 (1997) 154-173
6. Kolmanovskii, V., Myshkis, A.: Applied Theory of Functional Differential Equations. Kluwer Academic, Dordrecht (1992)
7. Kuang, Y.: Delay Differential Equations with Applications in Population Dynamics. Academic Press, Boston (1993)
8. Kuang, Y., Smith, H.L.: Global Stability for Infinite Delay Lotka-Volterra Type Systems. J. Differential Equations 103 (1993) 221-246
9. Teng, Z., Yu, Y.: Some New Results of Nonautonomous Lotka-Volterra Competitive Systems with Delays. J. Math. Anal. Appl. 241 (2000) 254-275
10. Mao, X.: Exponential Stability of Stochastic Differential Equations. Dekker, New York (1994)
Heterogeneous Centroid Neural Networks

Dong-Chul Park¹, Duc-Hoai Nguyen¹, Song-Jae Lee¹, and Yunsik Lee²

¹ ICRL, Dept. of Information Engineering, Myong Ji University, Korea
{parkd, dhoai, songjae}@mju.ac.kr
² SoC Research Center, Korea Electronics Tech. Inst., Seongnam, Korea
[email protected]
Abstract. The Tied Mixture Hidden Markov Model (TMHMM) is an important approach to reduce the number of free parameters in speech recognition. However, this model suffers from degradation in recognition accuracy due to its Gaussian Probability Density Function (GPDF) clustering error. This paper proposes a clustering algorithm called a Heterogeneous Centroid Neural Network (HCNN) for use in TMHMMs. The algorithm utilizes a Centroid Neural Network (CNN) to cluster acoustic feature vectors in the TMHMM. The HCNN uses a heterogeneous distance measure to allocate more code vectors in the heterogeneous areas where probability densities of different states overlap each other. When applied to an isolated Korean digit word recognition problem, the HCNN reduces the error rate by 9.39% over CNN clustering, and 14.63% over the traditional K-means clustering. Keywords: speech recognition, tied mixture, unsupervised clustering, Hidden Markov Model.
1 Introduction
Most approaches to statistical speech recognition are based on Continuous Density Hidden Markov Models (CDHMMs), which use Gaussian mixture densities to model HMM state observation likelihoods. However, the CDHMM requires a large number of free parameters. This obstacle leads to high computational cost during the parameter estimation process and the observation likelihood calculation. Recently, parameter tying has been employed to efficiently overcome this weakness. Several parameter tying models have been proposed, such as tied mixture [1], tied state [2], and substate models [3]. A combination of state tying and mixture tying was also proposed by Liu and Fung [4]. Among these approaches, the Tied Mixture Hidden Markov Model (TMHMM) has received widespread attention. In this paper, we propose an algorithm that applies CNN to the TMHMM. The key concept is that in the CDHMM the acoustic data (feature vectors) are clustered state by state, separately; in contrast, in the TMHMM, feature vectors of all states are put into a shared pool and clustered without consideration of the state to which they belong. Therefore, the clustering errors in the TMHMM arise from the areas where the distributions of feature vectors of different states overlap.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 689–694, 2006. © Springer-Verlag Berlin Heidelberg 2006
Fig. 1. A tied mixture Hidden Markov Model using 3 states in speech recognition
2 Preliminaries

2.1 Tied Mixture Hidden Markov Model
Although Continuous Density HMMs offer very high recognition accuracy, they require relatively high computational cost because of the large number of code vectors [5]. A TMHMM is a model in which all Gaussian components are stored in a pool and all state output distributions share this pool. Fig. 1 illustrates this model for the case of a single data stream. For simplicity, we consider the single-stream case of a TMHMM. The output distribution for state j is defined as:

b_j(o_t) = Σ_{m=1}^{M_s} c_jm · ℵ[o_t, μ_m, σ_m²]   (1)

where M_s is the number of Gaussian mixture components in the pool, ℵ is the likelihood of the component m, and c_jm is the corresponding weight for the state j with respect to the component m.
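Equation (1) is straightforward to evaluate once the shared pool and the per-state weights are given. The following sketch (illustrative values; diagonal covariances assumed, and all names are our own) computes b_j(o_t) for one state:

```python
import math

def gauss_pdf(o, mu, var):
    """Diagonal-covariance Gaussian likelihood of observation o."""
    p = 1.0
    for ok, mk, vk in zip(o, mu, var):
        p *= math.exp(-0.5 * (ok - mk) ** 2 / vk) / math.sqrt(2.0 * math.pi * vk)
    return p

def tied_mixture_likelihood(o, pool, weights):
    """b_j(o) = sum_m c_jm * N(o; mu_m, sigma_m^2), as in Eq. (1).

    pool    : list of (mu_m, var_m) components shared by all states
    weights : per-state mixture weights c_jm, summing to 1
    """
    return sum(c * gauss_pdf(o, mu, var) for c, (mu, var) in zip(weights, pool))

# a toy pool of two 1-D components shared by every state
pool = [([0.0], [1.0]), ([3.0], [1.0])]
w_state = [0.5, 0.5]
print(tied_mixture_likelihood([0.0], pool, w_state))
```

Because the pool is shared, only the weight vectors differ between states, which is what reduces the number of free parameters.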
2.2 Clustering Error in TMHMM and the Heterogeneous Level
In the pool, the feature vectors are grouped into many clusters (also called Gaussian components). Each cluster may contain feature vectors from one or many states. For convenience, we define "homogeneous group" and "heterogeneous group": a homogeneous group is a group in which all feature vectors belong to the same state, while a heterogeneous group is a group that contains feature vectors of more than one state. In a TMHMM, the state observation likelihood depends strongly on the clustering result of the pool. Therefore, one of the major sources of recognizer error is the heterogeneous groups. Fig. 2 illustrates a typical heterogeneous group. Assume that the i-th cluster contains feature vectors from 2 states, N_i^1 vectors from the first state and N_i^2 from the second state. Visually, the N_i^1
Fig. 2. A typical heterogeneous group
and N_i^2 feature vectors form 2 child clusters inside the heterogeneous cluster. The distribution of all feature vectors in this heterogeneous cluster will be represented by a unique pair of mean μ_i and covariance Σ_i, denoted by g_i(μ_i, Σ_i). Similarly, we model the distribution of the N_i^1 feature vectors of the first state and the N_i^2 feature vectors of the second state within that cluster as g_i^1(μ_i^1, Σ_i^1) and g_i^2(μ_i^2, Σ_i^2), respectively.
2.3 Heterogeneity Level
In order to measure the heterogeneity of a cluster, we introduce the term heterogeneity level. This measure tells us how heterogeneous a cluster is. The heterogeneity level is measured as the sum of "differences" between a cluster and its child clusters:

H_i = Σ_{j=1}^{N_i} d(g_i, g_i^j)   (2)

where H_i is the heterogeneity level of the i-th cluster, N_i is the number of HMM states, and d(g_i, g_i^j) is the "difference" between cluster g_i and its child cluster g_i^j. We also assume that d(g_i, g_i^j) is zero if the j-th state has no feature vector in the cluster g_i. In order to represent the "difference" between 2 clusters, the Bhattacharyya distance is adopted in this paper. The Bhattacharyya distance is a separability measure between 2 Gaussian distributions and is defined as follows:

d(g_i, g_i^j) = (1/8)(μ_i − μ_i^j)^T [ (Σ_i + Σ_i^j)/2 ]^{−1} (μ_i − μ_i^j) + (1/2) ln( |(Σ_i + Σ_i^j)/2| / (|Σ_i| |Σ_i^j|)^{1/2} )   (3)
where the superscript T denotes the transpose. The first term of Eq. (3) gives the class separability due to the difference between the mean values, while the second term gives the class separability due to the difference between the covariance matrices.
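Eq. (3) reduces to a few lines of linear algebra. The sketch below (a hypothetical helper, not from the paper) implements it with NumPy; for identical Gaussians the distance is 0, and for equal covariances only the mean term survives:

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussians, Eq. (3)."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term_mean = 0.125 * diff @ np.linalg.inv(cov) @ diff
    term_cov = 0.5 * np.log(np.linalg.det(cov) /
                            np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term_mean + term_cov

print(bhattacharyya([0.0, 0.0], np.eye(2), [0.0, 0.0], np.eye(2)))  # 0.0
print(bhattacharyya([0.0, 0.0], np.eye(2), [2.0, 0.0], np.eye(2)))  # 0.5
```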
3 Heterogeneous CNN (HCNN)

3.1 CNN Algorithm
When an input vector x is applied to the network at time n, the weight update equations for winner neuron j and loser neuron i in CNN can be written as follows:

w_j(n + 1) = w_j(n) + (1/(N_j + 1)) [x(n) − w_j(n)]
w_i(n + 1) = w_i(n) − (1/(N_i − 1)) [x(n) − w_i(n)]   (4)

where w_j(n) and w_i(n) represent the weight vectors of the winner neuron and the loser neuron, respectively, while N_i and N_j denote the numbers of data vectors in clusters i and j at the time of iteration, respectively. When CNN is compared with SOM, the CNN requires neither a predetermined schedule for learning gain nor the total number of iterations for clustering. A more detailed description of CNN can be found in [6, 7].
3.2 Heterogeneous CNN (HCNN)
As noted in Section 2.2, one of the major sources of clustering errors in the TMHMM arises from the overlapped area between the acoustic distributions of different states. Therefore, it would be better to allocate more code vectors to that overlapped area. This can be achieved by adding a multiplier to the Euclidean distance measure of the original CNN algorithm:

d_H(x(n), w(n)) = (n_h / H_x(n)) · d(x(n), w(n))   (5)
where H_x(n) is the heterogeneity level of the feature vector x(n), and d(x(n), w(n)) is the Euclidean distance between x(n) and the code vector w(n). We refer to d_H(x(n), w(n)) as the heterogeneous distance.
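A minimal sketch of one HCNN competition step, combining the CNN update rule (4) with the heterogeneous distance (5). All names here are hypothetical, and the heterogeneity level H_x of the input and the constant n_h are assumed to be supplied from outside (in the TMHMM setting H_x would come from Eq. (2)):

```python
import numpy as np

def hcnn_step(x, H_x, w, counts, prev_winner=None, n_h=1.0):
    """One competition step of HCNN (a sketch).

    x      : input feature vector, with heterogeneity level H_x
    w      : (K, d) array of code vectors
    counts : number of vectors currently assigned to each cluster
    """
    # heterogeneous distance, Eq. (5): a large H_x shrinks the distances of
    # this vector, so heterogeneous regions contribute less distortion and
    # the clustering can afford to dedicate more code vectors to them
    d = (n_h / H_x) * np.linalg.norm(w - x, axis=1)
    winner = int(np.argmin(d))
    # CNN update, Eq. (4): the winner moves toward x ...
    w[winner] += (x - w[winner]) / (counts[winner] + 1)
    counts[winner] += 1
    # ... and the previous winner (the loser) moves away when x switches
    # cluster (assumes the loser cluster still holds more than one vector)
    if prev_winner is not None and prev_winner != winner:
        w[prev_winner] -= (x - w[prev_winner]) / (counts[prev_winner] - 1)
        counts[prev_winner] -= 1
    return winner, w, counts

w = np.array([[0.0, 0.0], [10.0, 0.0]])
counts = np.array([1, 1])
winner, w, counts = hcnn_step(np.array([2.0, 0.0]), 1.0, w, counts)
print(winner, w[0])  # winner 0; its code vector moves to [1. 0.]
```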
4 Experiments and Results
The experiments utilize the isolated word ETRI (Electronics and Telecommunications Research Institute) Speech Database for Korean digits. 6,451 utterances were used. The speech excerpts are recorded through wired, wireless, PCS, and cellular phone environments from 981 males and 1,019 females. The feature vectors are composed of 39 components: 12 MFCCs and 1 energy value (13 features), their first order time derivatives (13 features) and second order time derivatives (13 features). During training, 5-state left-to-right HMMs are used to model each word. Because the database includes 13 digits as mentioned above, we need 13 HMMs to model them, accordingly. 5,451 utterances were used for training and 1,000 utterances for testing. Fig. 3 shows the experiment results for 3 different algorithms: K-means, CNN,
Fig. 3. Recognition rates of HCNN in comparison with K-means and CNN
Fig. 4. Recognition rates under wireless environment
and HCNN. In order to compare the accuracy of these algorithms, we altered the number of Gaussian components over a range from 70 to 500. By allocating more code vectors to the overlapped areas during training, HCNN gives significant improvement over CNN and K-means, especially for the case when the number of Gaussian components exceeds 300. The average accuracy of HCNN is 88.8%, corresponding to an error rate of 11.2%. In comparison to K-means (average error rate 13.12%), HCNN reduces the error rate by (13.12 − 11.2)/13.12 = 14.63%. Similarly, the error is reduced by 9.39% in comparison to CNN. Fig. 4 gives more details on the testing results and shows the performance for the case of the wireless environment. The speech excerpts in the wireless environment
are more difficult to recognize because of the high noise level. The improvement of HCNN for the case of the wireless environment is the most impressive, especially when a high number of Gaussian components is selected. This implies that the proposed algorithm can show improvement if the overlapping in the acoustic distributions is large.
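The error-rate reductions quoted above are relative changes of the error rate, e.g.:

```python
def relative_reduction(base_err, new_err):
    """Relative error-rate reduction, in percent."""
    return 100.0 * (base_err - new_err) / base_err

# K-means baseline (13.12% error) vs. HCNN (11.2% error)
print(round(relative_reduction(13.12, 11.2), 2))  # 14.63
```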
5 Conclusion
A Heterogeneous Centroid Neural Network (HCNN) is proposed in this paper. The HCNN is based on the CNN. However, using the heterogeneous distance measure, the HCNN attempts to allocate more code vectors in the heterogeneous areas where probability densities of different states overlap each other. The proposed HCNN is then applied to an isolated Korean digit word recognition problem, using the ETRI speech database. The results show that HCNN can give a 14.63% reduction of recognition error over the K-means algorithm, and a 9.39% error reduction over the CNN algorithm.
References
1. Huang, X.D.: Phoneme Classification Using Semicontinuous Hidden Markov Models. IEEE Trans. on Acous., Speech, and Sig. Proc. 40(5) (1992) 1062-1067
2. Huang, X., Acero, A., Hon, H., Reddy, R.: Spoken Language Processing: a Guide to Theory, Algorithm, and System Development. Prentice-Hall, Englewood Cliffs (2001)
3. Gu, L., Rose, K.: Substate Tying with Combined Parameter Training and Reduction in Tied-Mixture HMM Design. IEEE Trans. on Acous., Speech, and Sig. Proc. 10(3) (2002) 137-145
4. Liu, Y., Fung, P.: State Dependent Phonetic Tied Mixtures with Pronunciation Modeling for Spontaneous Speech Recognition. IEEE Trans. on Acous., Speech, and Sig. Proc. 12(4) (2004) 351-364
5. Dermatas, E., Kokkinakis, G.: Algorithm for Clustering Continuous Density HMM by Recognition Error. IEEE Trans. on Acous., Speech, and Sig. Proc. 4(3) (1996) 231-234
6. Park, D.C.: Centroid Neural Network for Unsupervised Competitive Learning. IEEE Trans. on Neural Networks 11(2) (2000) 520-528
7. Park, D.C., Woo, J.Y.: Weighted Centroid Neural Network for Edge Preserving Image Compression. IEEE Trans. on Neural Networks 12(5) (2001) 1134-1146
Building Multi-layer Small World Neural Network

Shuzhong Yang¹, Siwei Luo¹, and Jianyu Li²

¹ School of Computer and Information Technology, Beijing Jiaotong University, 100044, Beijing, China
² School of Computer and Software, Communication University of China, 100024, Beijing, China
[email protected]
Abstract. Selecting a rational structure is a crucial problem in applying multi-layer neural networks. In this paper a novel method is presented to solve this problem. The method breaks through traditional methods, which only determine the hidden structure, by also learning the topological connectivity so that the connectivity structure has the small world characteristic. The experiments show that the small world neural network learned with our method reduces the learning error and learning time and improves the generalization when compared to networks of regular connectivity.
1 Introduction

Traditional structure learning methods [1] usually only determine the hidden structure of a neural network, which includes the number of hidden layers and the number of nodes per hidden layer, and then train the network using various algorithms such as the back-propagation algorithm and the conjugate gradient optimization algorithm, etc. In fact the topology of the network also has an important impact on the performance of networks [2]. Herein the topology is considered from the viewpoint of complex network theory. In recent years, there has been a growing interest in evolving complex networks. The small world characteristic [3] is one of the most important characteristics of complex networks in the real world. Investigations show that a small world network has many advantages compared with a regular network. The traditional multi-layer neural network can be considered as a regular network. So if we can evolve it into a small world neural network, the network can be expected to have better performances. The rest of the paper is organized as follows: In Section 2, we present the procedure of determining the hidden structure of a multi-layer neural network. In Section 3, we show how to evolve the regular neural network into a small world neural network. In Section 4, we give some experimental results which compare the performances of the small world neural network and the regular multi-layer neural network. Finally, conclusions and future work are given in Section 5.
2 Decision of Hidden Structure

Selecting a rational hidden structure is also an important problem in model selection. Cross-validation [4], AIC [5], and BIC [6] are all accepted rules for choosing

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 695–700, 2006. © Springer-Verlag Berlin Heidelberg 2006
a rational hidden structure, and we usually need to use their generalizations in applications, such as NIC [7] (the generalization of AIC), etc. In this paper we will introduce an optimization method based on algebraic equation theory [8] to select the hidden structure.

2.1 Definitions

A multi-layer neural network is composed of an input layer, several hidden layers, and an output layer. In a neural network, all nodes in each layer are fully connected with those in their adjacent layers. Table 1 lists all variables and their corresponding meanings which will be used in this paper.

Table 1. All variables and their corresponding meanings in this paper
Variable   Meaning                                         Variable   Meaning
n_i        the number of input nodes                       n_p        the number of given samples
n_o        the number of output nodes                      x_p        the pth input sample
nhnum      the number of hidden layers                     y_p        the pth teacher output
nh_j       the number of nodes for the jth hidden layer    o_p        the pth real output

In Table 1, j = 1, 2, . . . , nhnum and p = 1, 2, . . . , n_p.
Neural networks are usually based on supervised learning. The mean square error function is selected as the objective function of the algorithm, which can be written as

f(z) = (1/n_p) Σ_{p=1}^{n_p} ‖y_p − o_p(z)‖²   (1)
where z is the weight vector. We can minimize the above function by solving the following nonlinear optimization problem

min{f(z)},  z ∈ R^n,
z = [z_1, z_2, . . . , z_n],  n = (n_i + 1)nh_1 + Σ_{j=1}^{nhnum−1} (nh_j + 1)nh_{j+1} + (nh_{nhnum} + 1)n_o.   (2)
Denote o_{p,l}(z) as the output of the lth node in the output layer for the pth sample. Then

f(z) = (1/n_p) Σ_{p=1}^{n_p} Σ_{l=1}^{n_o} (y_{p,l} − o_{p,l}(z))²   (3)

Ideally, there exists a z* with f(z*) = 0. Then
Σ_{p=1}^{n_p} Σ_{l=1}^{n_o} (y_{p,l} − o_{p,l}(z*))² = 0   (4)
or

y_{p,l} − o_{p,l}(z*) = 0,  p = 1, 2, . . . , n_p,  l = 1, 2, . . . , n_o.   (5)
From (5) we can see that the number of equations is n_p · n_o and the number of variables is n. According to algebraic equation theory, only when n is not less than n_p · n_o does the equation group (5) have rational solutions. In the next subsection, we will explain how to select the hidden structure under the condition n = n_p · n_o.

2.2 Decision of Hidden Layers

According to equation (2), letting n = n_p · n_o, we get

(n_i + 1)nh_1 + Σ_{j=1}^{nhnum−1} (nh_j + 1)nh_{j+1} + (nh_{nhnum} + 1)n_o = n_p · n_o.   (6)
As is well known, the number of nodes per hidden layer is at least 1. If we let nh_j = 1, j = 1, 2, . . . , nhnum, then

(n_i + 1) · 1 + Σ_{j=1}^{nhnum−1} (1 + 1) · 1 + (1 + 1)n_o = n_p · n_o.   (7)
From (7), we can evaluate the maximal number of hidden layers:

nhnum_max ≤ (n_p(n_o − 2) − n_i)/2 + 0.5.   (8)
For example, if n_i = 3, n_o = 4, n_p = 8, then nhnum_max ≤ 7. In fact, we usually need very few hidden layers to solve the applications. Having the maximal number of hidden layers, the hidden structure can be determined. In this paper, we adopt the following optimization procedure to solve this problem:

min{g(u)},  s.t. u_j ≥ 1,   (9)

where the objective function is

g(u) = [ (n_i + 1)u_1 + Σ_{j=1}^{nhnum−1} (u_j + 1)u_{j+1} + (u_{nhnum} + 1)n_o − n_p · n_o ]²,  u = [u_1, u_2, . . . , u_{nhnum}].   (10)

For each nhnum = 1, 2, . . . , nhnum_max, compute an optimal u* by optimizing (9). Letting nh_j = int(u_j* + 0.5), we can get

nc_{nhnum} = (n_i + 1)nh_1 + Σ_{j=1}^{nhnum−1} (nh_j + 1)nh_{j+1} + (nh_{nhnum} + 1)n_o.   (11)

We select u* as the optimal solution for which |nc_{nhnum*} − n_p · n_o| is minimal. Thus we get the number of hidden layers nhnum* and the vector of node numbers for every hidden layer, u* = [u_1*, u_2*, . . . , u_{nhnum*}*].
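The selection procedure (9)–(11) can be mimicked by a brute-force integer search instead of the continuous optimization used above (a sketch; the per-layer bound of 30 nodes is an arbitrary assumption and, for brevity, only nhnum = 1 and 2 are scanned):

```python
from itertools import product

def n_weights(n_i, hidden, n_o):
    # Eq. (11): weights-plus-biases count of a candidate structure
    layers = [n_i] + list(hidden) + [n_o]
    return sum((layers[k] + 1) * layers[k + 1] for k in range(len(layers) - 1))

def select_structure(n_i, n_o, n_p, max_layers=2, max_nodes=30):
    target = n_p * n_o
    best = None  # (|nc - target|, hidden-layer sizes)
    for nhnum in range(1, max_layers + 1):
        for hidden in product(range(1, max_nodes + 1), repeat=nhnum):
            gap = abs(n_weights(n_i, hidden, n_o) - target)
            if best is None or gap < best[0]:
                best = (gap, hidden)
    return best

# the paper's example: n_i = 3, n_o = 4, n_p = 8, so the target is n = 32
nhnum_max = int((8 * (4 - 2) - 3) / 2 + 0.5)  # Eq. (8): gives 7, as in the text
print(nhnum_max)
# with these inputs a two-layer structure such as (1, 4) matches n = 32 exactly
print(select_structure(3, 4, 8))
```

Note that the exhaustive search may land on a different structure than the rounded continuous optimum reported in Table 2, since several integer structures can come equally close to the target weight count.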
3 Learning Small World Topology

In this section, a method similar to the WS model [3] will be used to evolve a regular neural network into a small world neural network; it differs from the WS model in how connections are rewired. The following is the detailed procedure for learning the small world topology.

1. Start from a regular multi-layer neural network with full connectivity between adjacent layers.
2. Rewire each edge at random with probability p (0 < p ≪ 1). Note that when reconnecting, two neurons in the same layer cannot be connected, and two neurons already having an edge between them cannot be reconnected.
3. Repeat step 2 until the neural network exhibits the small world characteristic.

Figure 1 gives an example of how to obtain the small world topology by randomly cutting and rewiring connections.
Fig. 1. Scheme of network connection topologies obtained by randomly cutting and rewiring connections, starting from regular neural network (left) and going to small world neural network (right)
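The three steps above can be sketched as follows (a hypothetical implementation; edges are stored as pairs of globally numbered neurons, and layer membership enforces the two constraints of step 2):

```python
import random

def build_regular(layer_sizes):
    """Fully connected multi-layer topology; neurons are numbered globally."""
    offsets = [0]
    for s in layer_sizes:
        offsets.append(offsets[-1] + s)
    layer_of = {}
    for k, s in enumerate(layer_sizes):
        for i in range(offsets[k], offsets[k] + s):
            layer_of[i] = k
    edges = set()
    for k in range(len(layer_sizes) - 1):
        for a in range(offsets[k], offsets[k + 1]):
            for b in range(offsets[k + 1], offsets[k + 2]):
                edges.add((a, b))  # stored as (lower layer, higher layer)
    return edges, layer_of

def rewire(edges, layer_of, p, rng):
    """One pass of step 2: each edge is rewired with probability p."""
    nodes = list(layer_of)
    new_edges = set(edges)
    for e in list(new_edges):
        if rng.random() >= p:
            continue
        a, _ = e
        b_new = rng.choice(nodes)
        # constraint: never connect two neurons of the same layer
        if layer_of[b_new] == layer_of[a]:
            continue
        cand = (a, b_new) if layer_of[a] < layer_of[b_new] else (b_new, a)
        # constraint: never duplicate an existing edge
        if cand in new_edges:
            continue
        new_edges.remove(e)
        new_edges.add(cand)
    return new_edges

rng = random.Random(0)
edges, layer_of = build_regular([3, 4, 4, 2])
rewired = rewire(edges, layer_of, p=0.1, rng=rng)
print(len(edges), len(rewired))  # 36 36: rewiring preserves the edge count
```

A rewiring pass preserves the number of connections, so the network's parameter budget is unchanged while shortcuts across non-adjacent layers appear.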
4 Experimental Results

In this section we compare the performances of the small world neural network and regular neural networks constructed by various methods based on Cross-validation [4], BIC [6], and NIC [7], respectively. The performances include learning error, learning time, and generalization. We use many groups of synthetic data to carry out the experiments. Since we cannot show all results, we only select two groups of representative data. One group
Table 2. Two representative conditions in experiments

                  Data1              Data2
                  nhnum*   u*        nhnum*   u*
Cross-validation  1        5         1        13
BIC               1        4         1        12
NIC               1        4         1        12
Our method        2        (2,2)     5        (3,2,3,2,2)
Fig. 2. Performances of traditional methods and our method for Data1

Table 3. Generalization of traditional methods and our method for Data1 and Data2

                  Data1    Data2
Cross-validation  100%     96.7%
BIC               100%     93.3%
NIC               100%     93.3%
Our method        100%     96.7%
has 30 training samples and 10 test samples in two dimensions, and the other has 90 training samples and 30 test samples in five dimensions. The results for structure learning under the two conditions are shown in Table 2. Since traditional methods and our
method have similar performances on the first group of samples, we only give the comparison of performances for the second group of samples in Fig. 2. The comparison of generalization is given in Table 3. Although we only give two groups of experimental results, from all groups we can see that when the number of hidden layers is small, our method has no advantage over traditional methods; only when the number of hidden layers is more than 3 does our method gradually show distinct advantages. It reduces the learning time and learning error and has equal or higher performance in generalization (Table 3). According to small world theory, we know that when the number of hidden layers is small, a regular neural network also has the small world characteristic, so learning the small world topology cannot improve the performances any further.
5 Conclusions

In this paper we present a novel method for the construction of multi-layer neural networks. The method breaks through the limitation of traditional regular neural networks, which only select the hidden structure before training, by also learning the small world topology. The resultant network has more advantages than regular networks. The experimental results also show that the small world neural network has higher performance in several respects. In the future we will validate our method on massive neural networks and apply it to more applications.
Acknowledgements. The research is supported by the National Natural Science Foundation of China (No. 60373029) and the Doctoral Foundation of China (No. 20020004020).
References
1. Magnitskii, N.A.: Some New Approaches to the Construction and Learning of Artificial Neural Networks. Computational Mathematics and Modeling 12(4) (2001) 293-304
2. Torres, J.J., et al.: Influence of Topology on the Performance of a Neural Network. Neurocomputing (2004) 229-234
3. Watts, D., Strogatz, S.: Collective Dynamics of Small-World Networks. Nature 393 (1998) 440-442
4. Stone, M.: Cross-validatory Choice and Assessment of Statistical Prediction. Journal of the Royal Statistical Society, Series B 36 (1974) 111-147
5. Schwarz, G.: Estimating the Dimension of a Model. Annals of Statistics 6 (1978) 461-464
6. Akaike, H.: A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control AC-19 (1974) 716-723
7. Murata, N., Yoshizawa, S., Amari, S.: Network Information Criterion—Determining the Number of Hidden Units for an Artificial Neural Network Model. IEEE Transactions on Neural Networks (1994) 865-871
8. Hou, X., Hu, Y., Li, Y., Xu, X.: Rational Structure of Multi-Layer Artificial Neural Network. Journal of Northeastern University 24(1) (2003) 35-38
Growing Hierarchical Principal Components Analysis Self-Organizing Map Stones Lei Zhang, Zhang Yi, and Jian Cheng Lv Computational Intelligence Laboratory, School of Computer Science and Engineering, University of Electronic Science and Technology of China Chengdu 610054, P.R. China
[email protected],
[email protected],
[email protected] http://cilab.uestc.edu.cn
Abstract. In this paper, we propose a new self-growing hierarchical principal components analysis self-organizing neural network model. This dynamically growing model extends the ability of the PCASOM model to represent the hierarchical structure of the input data. It overcomes the shortcoming of the PCASOM model that a fixed network architecture must be defined prior to training. Experimental results show that the proposed model has better performance on traditional clustering problems.
1 Introduction
The Principal Components Analysis Self-Organizing Map (PCASOM) [1], an extension of the ASSOM model [2], uses the covariance matrix of its respective input subset to extract the most similar features of a part of the input data. Based on PCA theory, the PCASOM builds a description of the local geometry of the input data. The PCASOM model and the SOM model [3] both intend to partition high-dimensional data into a number of clusters while creating topographic feature maps. However, the representation ability of the PCASOM is more elaborate, since the PCASOM uses the vector basis and mean value of each neuron, while the SOM only uses a mean vector. Nevertheless, the PCASOM lacks the ability to extract the hierarchical structure of the data. Furthermore, it is difficult for the designer to decide the architecture and the number of parameters of the neural network model in advance. Several algorithms whose structure adapts during the training process have been proposed to overcome this architectural design problem, such as incremental grid growing [4], the growing grid [5], the dynamic growing SOM [6], the growing hierarchical SOM [7] and its extensions [8,9], etc. In this paper, we combine the PCASOM with the self-growing hierarchical method [7], which adapts its architecture during unsupervised training to represent the hierarchical character of the data, and obtain a new growing hierarchical topographic map whose capability of describing the data is more elaborate. J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 701–706, 2006. c Springer-Verlag Berlin Heidelberg 2006
702
S.L. Zhang, Z. Yi, and J.C. Lv
The next section reviews the PCASOM neural network. The Growing Hierarchical Principal Components Analysis Self-Organizing Map (GHPCASOM), extended from the PCASOM, is considered in Section 3. Finally, Sections 4 and 5 deal with the experimental results and the conclusion.
2 The PCASOM Model
In this section, we give a brief review of the PCASOM model [1]. The PCASOM model uses covariance matrices to store information. The covariance matrix R_j(t) of an input vector X is defined as in [1]. Every unit j of the PCASOM has its correlated orthonormal vector basis B^j(t), formed by the K_j eigenvectors corresponding to the K_j largest eigenvalues of R_j(t) at time instant t. The differences between the input vectors x^i(t), (i = 1, ..., N) and the estimated means e_j(t) are projected onto the vector bases of all the neurons. The training process of the PCASOM has the following steps:
(a) Winner lookup. The neuron c that has the minimum sum of projection errors is the winner:

c = arg min_j Σ_{i=1}^{N} || x^i(t) − e_j(t) − Orth(x^i(t) − e_j(t), B^j(t)) ||²,   (1)
where Orth(X, B) is the orthogonal projection of vector X on the basis B. In this way we look for the neuron j which represents the input x^i(t) best.
(b) Learning with the input sample X(t):
(b.1) Degree of neighborhood. The neurons form a rectangular lattice. When a neuron c wins the competition, the topology of the network requires that all of its neighbors also be updated, according to the degree of neighborhood π_{i,c} between the winning neuron c and its neighbor i. The degree of neighborhood is defined as

π_{i,c}(t) = exp( −d²_{i,c} / (2σ(t)²) ),   (2)

where d_{i,c} is the distance between the winning neuron c and its neighbor neuron i, and π_{i,c} is a Gaussian-like function, with σ(t) → 0 as t → ∞. The size of the neighborhood is controlled by σ(t), defined as

σ(t) = σ₀ exp(−t/τ),   (3)

where τ is a time-constant parameter.
(b.2) Neuron update. For every neuron j, using the instant t to denote the current training iteration,

e_j(t+1) = e_j(t) + η_e(t) π_{j,c}(t) [ (1/N) Σ_{i=1}^{N} x^i(t) − e_j(t) ],   (4)

R*_j(t+1) = (1/N) Σ_{i=1}^{N} (x^i − e_j(t+1))(x^i − e_j(t+1))^T,   (5)
R_j(t+1) = R_j(t) + η_R(t) π_{j,c}(t) [ R*_j(t+1) − R_j(t) ],   (6)

where η_R(t) is the learning rate for the covariance matrix and η_e(t) is the learning rate for the mean.
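The winner lookup of Eq. (1) and the update rules of Eqs. (4)-(6) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the batch layout (rows of `X` are the input vectors), and the convention that each basis matrix has orthonormal rows are our own assumptions.

```python
import numpy as np

def find_winner(X, means, bases):
    # Eq. (1): the winner c minimizes the summed squared projection error
    # of the batch X (N x d) over all units. With B holding orthonormal
    # rows, diff @ B.T @ B is Orth(x - e_j, B^j), so resid is the part of
    # the difference that the unit's subspace cannot represent.
    errors = []
    for e, B in zip(means, bases):
        diff = X - e
        resid = diff - diff @ B.T @ B
        errors.append(np.sum(np.linalg.norm(resid, axis=1) ** 2))
    return int(np.argmin(errors))

def update_neuron(X, e_j, R_j, pi_jc, eta_e, eta_R):
    # Eqs. (4)-(6): pull the mean toward the batch mean, estimate the
    # batch covariance R*_j, and blend it into the stored covariance R_j.
    e_new = e_j + eta_e * pi_jc * (X.mean(axis=0) - e_j)
    diff = X - e_new
    R_star = diff.T @ diff / len(X)
    R_new = R_j + eta_R * pi_jc * (R_star - R_j)
    return e_new, R_new
```

After each update, the unit's basis would be refreshed from the leading eigenvectors of the new covariance matrix, as described in the text.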
3 The GHPCASOM Neural Networks
The key idea of the GHPCASOM is to use a self-growing hierarchical structure of multiple layers, each constituted by a number of independent PCASOMs. The self-growing hierarchical method, which adapts to the hierarchical relations in the input data, is extended from the GHSOM [7]. One PCASOM is used at the first layer of the hierarchy and provides a rather rough representation of the main clusters. For every unit in this map, a PCASOM that offers a more detailed view of the data may be refined at the next layer of the hierarchy. A graphical example of the training process and topology of a GHPCASOM is given in Fig. 1(1).
Fig. 1. The training process of GHPCASOM: (1) the topology of GHPCASOM; (2) the growth process of GHPCASOM. Insertion of units: (a) a row or (b) a column of units is inserted between the error unit e and the neighboring unit d.
Initial Setup of the Global Network Control: The key to the GHPCASOM model is its adaptation to the input data. The quality of this adaptation is measured by the projection of the difference between the input vector X and the estimated mean e onto the vector bases of all the neurons. We use the projection error (pe) to describe how well a data item is represented by its best matching unit. The pe of a unit i is calculated according to

pe_i = Σ_{x_j ∈ C_i} || x_j − e_i − Orth(x_j − e_i, B^i) ||,   (7)

where x_j are the elements of the subset C_i of input vectors mapped onto unit i, e_i is the mean vector of unit i, and B^i is the vector basis of unit i. GHPCASOM starts by creating a "virtual" layer with only one neuron. The virtual layer serves as a representation of the whole data set and is
necessary for calculating the global control of the growth process. So, we calculate the global control value pe_0 of the unit that forms the start layer as

pe_0 = Σ_{x_j} || x_j − e − Orth(x_j − e, B) ||,   (8)

where e is the mean of the input data, the direction of the orthonormal vector basis B is horizontal, and x_j, j = 1, ..., N are the input vectors. The pe_i measures the dissimilarity of all input data mapped onto a particular unit i and will be used to control the growth process of the neural network. The global termination criterion is as follows:

pe_i < τ₂ · pe_0,   (9)
where τ₂ is a user-defined parameter. The addition of further units provides more map space for data representation until all units satisfy ineq. (9). The GHPCASOM architecture is initialized by creating a new PCASOM formed by 2 × 2 units. The initial covariance matrix R_i and the vector e_i of unit i are set as in [1].
Training and Growth Process of the Growing PCASOM: The training process, similar to that of PCASOM [1], is iterated until the fixed number of training iterations is reached. If a unit in the map does not fulfill ineq. (9), we select the unit with the highest pe and denote it the error unit e. A new row or column of units is inserted between e and its most dissimilar adjacent neighbor d, i.e., the unit at the largest distance with respect to the covariance matrix, as shown in Fig. 1(2). The covariance matrices and the vector bases of the new units are initialized as the averages of their corresponding neighbors. Then the training process starts again from the original state of PCASOM [1].
Termination of the Growth Process: In order to reveal the hierarchical structure present in the data, each map shall only describe a portion of the data similarity [7]. The growth process of a single map m continues until the following inequality is fulfilled:

mpe_m < τ₁ · pe_u,   (10)

where pe_u is the pe of the corresponding unit u in the upper layer, mpe_m is the mean of the pe_i of all units in the same map, and τ₁ is a user-defined parameter that controls the depth of the hierarchy of the GHPCASOM.
Hierarchical Growth: When the training of a map is stopped by ineq. (10), the pe of every unit of the map has to be checked against ineq. (9) to decide on expanding a sublayer. Suppose unit p is expanded to form a new 2 × 2 map in the subsequent layer of the hierarchy; the covariance matrix and the vector basis of every unit in the new layer are initialized to mirror the orientation of the neighboring units of its parent p.
We initialize the covariance matrix and the vector basis with the mean of the parent and its neighbors in the respective directions.
Summary of the algorithm: The GHPCASOM model can be summarized as follows:
Step 1: Calculate the global termination criterion pe_0.
Step 2: Initialize the first-layer PCASOM neural network formed by 2 × 2 units.
Step 3: Train the PCASOM; if ineq. (10) is not satisfied, go to Step 4, else go to Step 5.
Step 4: Check the pe_i of every unit; units that satisfy ineq. (9) will not be expanded to a sublayer. Insert a new row or column to grow the PCASOM; go to Step 3.
Step 5: Check the pe_i of every unit; if ineq. (9) is not satisfied, go to Step 6, else go to Step 7.
Step 6: For each unit to expand, create a new 2 × 2 PCASOM; go to Step 3.
Step 7: When the training of all the PCASOMs is finished, end.
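As a rough sketch of how the projection errors of Eqs. (7)-(8) and the growth criteria of ineqs. (9)-(10) could drive this control flow, the following helpers are illustrative only; the names `projection_error` and `grow_map` are not taken from the paper, and the map geometry is abstracted to a flat list of per-unit errors.

```python
import numpy as np

def projection_error(X, e, B):
    # Eqs. (7)/(8): summed norm of the residual of X - e outside the
    # subspace spanned by the orthonormal rows of B.
    diff = X - e
    resid = diff - diff @ B.T @ B
    return float(np.sum(np.linalg.norm(resid, axis=1)))

def grow_map(pe_units, pe0, pe_upper, tau1, tau2):
    # Decide the next action for one map.
    # pe_units: projection error of every unit on this map.
    # pe_upper: pe of the corresponding unit in the upper layer.
    if np.mean(pe_units) >= tau1 * pe_upper:
        # Ineq. (10) not yet satisfied: the map is still too coarse, so a
        # row/column is inserted next to the worst unit (the "error unit").
        return ("insert", int(np.argmax(pe_units)))
    # Map-level growth finished: expand the units still violating ineq. (9)
    # into 2x2 child PCASOMs on the next layer.
    expand = [i for i, pe in enumerate(pe_units) if pe >= tau2 * pe0]
    return ("expand", expand) if expand else ("done", None)
```

A driver loop would call `grow_map` after each training round, retrain after every insertion, and recurse into each expanded unit, matching Steps 3-7 above.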
4 Experiments
Our experiments are devoted to the clustering performance of the GHPCASOM neural network. We selected the animal dataset from the UCI Repository of Machine Learning Databases and Domain Theories [10]. The animal dataset consists of 16 animals described by 13 features, such as how many legs they have, whether they can swim, hunt, fly and so on [3]. For the GHSOM and GHPCASOM models, the values of the parameters controlling the growth process and the hierarchy are τ₁ = 0.5, τ₂ = 0.001. Fig. 2 depicts a flat representation of the GHSOM and GHPCASOM trained with the animal dataset. In Fig. 2(a), the dataset is split by the GHSOM neural network into 6 clusters on the first layer, distinguishing the mammals into three clusters on the left and the birds on the right. The cluster representing all birds except the eagle is further elaborated on the second layer. In Fig. 2(b), the animals are clustered into three groups on the first layer, corresponding to large mammals, medium mammals,
Fig. 2. (a) GHSOM trained with the animal dataset. (b) GHPCASOM trained with the animal dataset.
and birds, by the GHPCASOM neural network. The large mammals are split into 2 clusters on the second layer, carnivores and herbivores, according to whether they have hooves. The medium mammals are further separated into catamounts and canines on the corresponding second layer. The hierarchical clustering results show that the GHPCASOM model possesses the capability to capture the inherent hierarchical relations of the input data. Furthermore, it appears to have slightly better quality than the original GHSOM.
5 Concluding Comments
In this paper, we have proposed a new growing hierarchical PCASOM neural model. Since the GHPCASOM model combines the advantages of the GHSOM and PCASOM networks, its input representation capability is more elaborate and its architecture is flexible. Experiments show that it has preferable performance in representing the hierarchical structure of the input data.
Acknowledgment This work was supported by the “Chunhui Plan” (No. Z2004-2-51009).
References
1. Lopez-Rubio, E., Munoz-Perez, J., Gomez-Ruiz, J.A.: A Principal Components Analysis Self-Organizing Map. Neural Networks 17 (2004) 261-270
2. Kohonen, T.: Emergence of Invariant-Feature Detectors in the Adaptive-Subspace SOM. Biological Cybernetics 75 (1996) 281-291
3. Kohonen, T.: The Self-Organizing Map. Proc. IEEE 78 (1990) 1464-1480
4. Blackmore, J., Miikkulainen, R.: Incremental Grid Growing: Encoding High-Dimensional Structure into a Two-Dimensional Feature Map. In: Proc. IEEE Int. Conf. Neural Networks, San Francisco, CA 1 (1993) 450-455
5. Fritzke, B.: Growing Grid: A Self-Organizing Network with Constant Neighborhood Range and Adaptation Strength. Neural Processing Letters 2 (1995) 1-5
6. Alahakoon, D., Halgamuge, S.K., Srinivasan, B.: Dynamic Self-Organizing Maps with Controlled Growth for Knowledge Discovery. IEEE Trans. Neural Networks 11 (2000) 601-614
7. Rauber, A., Merkl, D., Dittenbach, M.: The Growing Hierarchical Self-Organizing Map: Exploratory Analysis of High-Dimensional Data. IEEE Trans. Neural Networks 13(6) (2002) 1331-1341
8. Moreno, S., Allende, H., Rogel, C., Salas, R.: Robust Growing Hierarchical Self-Organizing Map. In: Cabestany, J., Prieto, A., Sandoval, D.F. (Eds.): IWANN 2005, LNCS, Vol. 3512, Springer-Verlag, Berlin Heidelberg (2005) 341-348
9. Pampalk, E., Widmer, G., Chan, A.: A New Approach to Hierarchical Clustering and Structuring of Data with Self-Organizing Maps. Intell. Data Analysis 8 (2004) 131-149
10. Murphy, P.M.: UCI Repository of Machine Learning Databases and Domain Theories [online]. Available: http://www.ics.uci.edu/ mlearn/MLRepository.html. Date of access: March 2001
Hybrid Neural Network Model Based on Multi-layer Perceptron and Adaptive Resonance Theory Andrey Gavrilov, Young-Koo Lee, and Sungyoung Lee Kyung Hee University, 1, Soechen-ri, Giheung-eop, Yongin-shi, Gyeonggi-do, 449-701, Korea
[email protected],
[email protected],
[email protected]
Abstract. A hybrid neural network model is considered. The model consists of the ART-2 model for clustering and a perceptron for preprocessing of images. The perceptron provides invariant recognition of objects. This model can be used in mobile robots for recognition of new objects or scenes in the robot's field of view during its movement.
1 Introduction
Application of the Adaptive Resonance Theory (ART) model of Grossberg and Carpenter [1] (in particular ART-2) is rather attractive for solving classification and clustering problems, because this model combines the properties of plasticity and stability, and it does not demand a priori knowledge of a fixed quantity of necessary classes. Many different modifications of this model and its combinations with other neural networks are known [2,3,4,5]. But practically all of them use supervised learning and demand a teacher; thus one of the essential features of the ART model was given up. In this paper we suggest a hybrid neural network model based on ART-2 and a multi-layer perceptron with the error back propagation training algorithm (MLP-ART). In this model we try to keep the unsupervised learning and remove one essential disadvantage of ART. The ART-2 model assumes usage of only one layer of neurons (not counting the input layer, associated with sensors). As a result, the neural network works only with the metric of primary features and calculates distances between images (for classification or creation of a new cluster, i.e., an output neuron), usually using the Euclidean distance. This absence of adaptation to transformations of the input vector in real time is the reason why the ART-2 model is almost never used in real applications. For example, for clustering and pattern recognition by a mobile robot [6] it is required to recognize an object in different perspectives and located in different parts of the field of vision, i.e., recognition should be invariant with respect to transformations of the image, such as shifts, rotations and others. Multi-layer perceptrons provide invariance of recognition because their hidden layers form secondary features during learning. One could say that in perceptrons each layer provides a conversion of one feature space into another. J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 707 – 713, 2006.
© Springer-Verlag Berlin Heidelberg 2006
708
A. Gavrilov, Y.-K. Lee, and S. Lee
In this paper a hybrid model is offered. It combines the advantages of the multilayer perceptron with learning by error back propagation [7] and of the ART-2 model. This paradigm was first proposed in [8]. In the present paper the paradigm and the algorithms of the model undergo substantial improvements and further development, and new experiments and conclusions are presented.
2 Main Concepts and Learning Algorithm of the Hybrid Neural Network Model
In the suggested model the first few layers of neurons are organized as a perceptron with forward connections. Its outputs are the inputs of the Grossberg-Carpenter ART-2 model. The perceptron converts the metric of primary features into a metric of secondary features in a space of considerably smaller dimension. The ART-2 neural network classifies images using these secondary features. Training the perceptron by the error back propagation (EBP) algorithm provides "attraction" of the output vector of the perceptron to the centre of the cluster recognized by ART-2. The weight vector of the recognized cluster serves as the desired output vector of the multi-layer perceptron. One could say that the recognized class is a context in which the system tries to recognize subsequent images as the same, and within some limits the system "is ready to recognize" them in this manner.
Fig. 1. Structure of hybrid neural network
Action of the suggested model is described by the following unsupervised learning algorithm:
1. In the perceptron, the weights of connections are formed equal to half of the quantity of neurons in the previous layer. The quantity of output neurons Nout of ART-2 is set to zero.
2. The next example is presented to the inputs, and the outputs of the perceptron are calculated.
3. If Nout = 0, an output neuron is formed with link weights equal to the values of the inputs of the ART-2 model (the outputs of the perceptron).
4. If Nout > 0, the ART-2 model calculates the distances between its input vector and the centres of the existing clusters (the weight vectors of the output neurons):

d_j = √( Σ_i (x_i − w_ij)² ),   (1)

where x_i is the ith component of the input vector of ART-2 and w_ij is the ith component of the weight vector of the jth output neuron (cluster). If the distance for the winner neuron is more than a defined value R (a vigilance, or radius of the cluster), a new cluster is formed as in step 3.
5. If the distance for the winner neuron is less than R, the weights of connections for the winner neuron are recalculated in the ART-2 model, moving the centre of the cluster closer to the input vector:

w_im = w_im + (x_i − w_im) / (1 + N_m),   (2)

where N_m is the number of input vectors recognized as belonging to the mth cluster. For the perceptron, a recalculation of weights by the error back propagation (EBP) algorithm is executed. The new weight vector of the winning output neuron of ART-2 is treated as the desired output vector for EBP, and the number of iterations can be small (in particular, there can be only one iteration).
6. The algorithm repeats from step 2 while there are examples left in the training set.
Action of the suggested model is illustrated in Fig. 2, which shows the space of secondary features; output vectors of the perceptron (input vectors of the ART-2 model) and centres of clusters are represented as points. The following points are shown: 1) a new image for which a cluster with radius R is created; 2) a new image recognized as belonging to this cluster; 3) the new centre of the cluster calculated in step 5 of the algorithm; 4) a new output vector of the perceptron, moved closer to the centre of the cluster as a result of executing the error back propagation algorithm; 5) a new image recognized as belonging to another cluster.
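The six steps above can be sketched as a single per-image update. This is a minimal illustration under stated assumptions: `hybrid_step` is our own name, `mlp_forward` and `mlp_backprop` are stand-in callables for the perceptron's forward pass and one EBP iteration (the paper's perceptron is not reproduced here), and the distance is taken as Euclidean per Eq. (1).

```python
import numpy as np

def hybrid_step(x, mlp_forward, mlp_backprop, clusters, counts, R):
    # One pass of the unsupervised hybrid algorithm for a single image x.
    y = mlp_forward(x)                       # secondary-feature vector
    if not clusters:                         # step 3: first cluster
        clusters.append(np.array(y, dtype=float)); counts.append(1)
        return 0
    d = [np.sqrt(np.sum((y - w) ** 2)) for w in clusters]   # Eq. (1)
    m = int(np.argmin(d))
    if d[m] > R:                             # step 4: vigilance exceeded
        clusters.append(np.array(y, dtype=float)); counts.append(1)
        return len(clusters) - 1
    # step 5: move the centroid toward y, Eq. (2), then pull the
    # perceptron output toward the updated centroid via one EBP iteration
    clusters[m] = clusters[m] + (y - clusters[m]) / (1 + counts[m])
    counts[m] += 1
    mlp_backprop(x, target=clusters[m])
    return m
```

The returned index is the recognized cluster; a freshly created cluster signals a "novel" image, which is the behavior exploited in the robot scenario.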
Fig. 2. Explanation of action of hybrid model
Experiments carried out earlier [8] with simple artificial images showed that the distance between the centroid of a cluster and the input vector increases slowly during successful recognition of this cluster, and that its value depends on the number of iterations of the error back propagation algorithm.
3 Experiments
For research on the suggested model, a framework for simulation of a neural network handling sequences of patterns has been developed. As sequences of images we used shots from video, reduced beforehand to a size of 100×100 pixels. Figure 3 shows three examples of images used in the experiments (the first images of the sequences).
Fig. 3. The images (first images of the series) used in the experiments
The first series of images is a sequence with moving cars; the second is a moving bus in the same place; the third is moving images of chairs, similar to the patterns perceived by a mobile robot while moving in a room. The suggested unsupervised training algorithm showed good results in the experiments when each successive example in the training set changed rather little. The following parameters of the model were used:
- Quantity of input neurons (pixels): 10000 (100×100);
- Quantity of neurons in the hidden layer of the perceptron: 20;
- Quantity of output neurons of the perceptron (in the input layer of ART-2), Nout: 10;
- Radius of cluster R, used in the experiments in different manners: 1) adapted and then fixed; 2) calculated for every image by the formula S/(Na²), where S is the average input signal and Na is the number of output neurons of the perceptron; 3) calculated as 2Dmin, where Dmin is the minimal distance between the input vector of ART-2 and the weight vectors for the previous image;
- Activation function of the neurons of the perceptron: rational sigmoid with parameter a = 1;
- Learning step of the perceptron: 1;
- Number of iterations of recalculation of the perceptron weights: from 1 to 10.
Some results of the experiments are shown in Figs. 4 and 5. From Fig. 4 one can see the effect of partial forgetting. In repeated series of images the system sometimes makes mistakes and creates new clusters, and when the vigilance is calculated from the last minimal distance the forgetting effect is more pronounced. Moreover, during the first presentation of series 2 the system makes mistakes at points 25 and 26, recognizing the red bus as a red car. Fig. 5 shows the change of the distance between the output vector of the MLP (the input vector of ART-2) and the centroid of the recognized cluster for different numbers of iterations of the EBP algorithm: 1, 3, 5, 7, 9. The dark line corresponds to the value 1. One can see that the distance decreases as a result of the action of the EBP algorithm. Only at the last point is the EBP algorithm unable to counteract the strong change of the input vector; the result is the creation of a new cluster for the next image. The results of the experiments show that:
1. The proposed model and learning algorithm may be used for processing real visual images to detect novelty in a series of images, ignoring insufficiently strong differences between contiguous images.
Fig. 4. Number of recognized cluster for the sequence of images of series 1, 2, 1, 2 (dark points correspond to the 2nd method of calculating the vigilance, light points to the 1st)
Fig. 5. Distance between the output vector of the MLP and the centroid of the recognized cluster, for the sequence of images of series 1, at different numbers of iterations of the EBP algorithm: 1, 3, 5, 7, 9
2. One may suppose that for every sufficiently homogeneous series of images it is possible to select a value of the vigilance which provides intuitively correct behavior of the model.
3. When processing strongly differing series of images, the vigilance value needs to be calculated during learning (for every image). For this we propose the empirical formula S/(Na²), where S is the average input signal and Na is the number of output neurons of the perceptron. Tying the vigilance value to the last minimal distance between the input vector and the centroid of a cluster proved to be worse.
4. Better results were obtained when the creation of a new cluster was forbidden for only one image (after which the system again tries to create a new cluster). In this case such a new cluster is replaced by the next new one. In this way the algorithm opposes a strong increase of the number of clusters (output neurons) and the influence of random outliers on the structure of the learned neural network.
4 Conclusions
The suggested hybrid neural network model can be used in a mobile robot when it is necessary to watch the sequence of images seen by the robot during its movement and to extract from it new images (objects), i.e., essential changes in the scene visible to the robot. Moreover, this model may be used in security systems for recognition of new objects or persons in the camera's field of view. A modification of this algorithm could limit the quantity of created clusters. In this case, if the quantity of clusters (output neurons) has reached the limit, it must be decided what to do with images which are not recognized, i.e., which cannot be related to any existing cluster. One may then increase the parameter R (the radius of the clusters) and apply the recognition algorithm again, and so on until the new image is related to one of the clusters. After that, it is necessary to reduce the quantity of clusters (output neurons), to unify clusters whose centers appear
in one cluster, and to change the weights of links between the outputs of the perceptron and the output neuron-clusters. Another modification of this algorithm could be supervised training, in which, before applying the procedure of increasing the radius and decreasing the quantity of output neurons, the system asks a "teacher" what to do: to apply this procedure or to create a new cluster. The "teacher" can be not only an inquiry to the user, but also any additional test for the novelty of an image. The following further research on the suggested hybrid neural network model is planned:
- a mathematical substantiation of the suggested algorithms,
- continued research on the influence of perceptron and ART-2 parameters on the results of the neural network,
- testing of the suggested model on a software model of the mobile robot and on the real robot.
Acknowledgement This work was supported by MIC Korea (IITA Visiting Professorship Program). Dr. S.Y.Lee is the corresponding author.
References
1. Carpenter, G.A., Grossberg, S.: Pattern Recognition by Self-Organizing Neural Networks. MIT Press, Cambridge, MA (1991)
2. Lee, S.J., Hou, C.L.: An ART-Based Construction of RBF Networks. IEEE Trans. Neural Networks 13(6) (2002) 1308-1321
3. Baxter, R.A.: Supervised Adaptive Resonance Networks. In: Proceedings of the Conference on Analysis of Neural Network Applications, ACM, Fairfax, VA (1991) 123-137
4. Wang, J.H., Rau, J.D., Liu, W.J.: Two-Stage Clustering via Neural Networks. IEEE Trans. Neural Networks 14(3) (2003) 606-615
5. Lee, H.M., Chen, C.M., Lu, Y.F.: A Self-Organizing HCMAC Neural-Network Classifier. IEEE Trans. Neural Networks 14(1) (2003) 15-27
6. Gavrilov, A.V., Gubarev, V.V., Jo, K.-H., Lee, H.-H.: Hybrid Neural-Based Control System for Mobile Robot. In: Proceedings of the 8th Korea-Russia International Symposium on Science and Technology KORUS-2004, Vol. 1, TPU, Tomsk (2004) 31-35
7. Rumelhart, D.E., McClelland, J.L. (Eds.): Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vols. I, II. MIT Press (1986)
8. Gavrilov, A.V.: Hybrid Neural Network Based on Models of Multi-Layer Perceptron and Adaptive Resonance Theory. In: Proceedings of the 9th Korean-Russian International Symposium on Science and Technology KORUS-2005, NSTU, Novosibirsk (2005) 119-122
Evolving Neural Networks Using the Hybrid of Ant Colony Optimization and BP Algorithms Yan-Peng Liu, Ming-Guang Wu, and Ji-Xin Qian Institute of Systems Engineering, Zhejiang University, Hangzhou, 310027, China
[email protected]
Abstract. The ant colony optimization (ACO) algorithm has a powerful ability to search for the global optimal solution, while the backpropagation (BP) algorithm features rapid convergence to local optima. A proper hybrid of the two algorithms (ACO-BP) may accelerate the evolution of neural networks and improve the forecasting precision of the trained networks. The ACO-BP scheme adopts ACO to search for the optimal combination of weights in the solution space, and then uses the BP algorithm to obtain the accurate optimal solution quickly. The ACO-BP and BP algorithms were applied to problems of function approximation and modeling quantitative structure-activity relationships of herbicides. Experimental results show that the proposed ACO-BP scheme is more efficient and effective than the BP algorithm. Furthermore, ACO-BP reliably performs well when the number of hidden nodes varies.
1 Introduction
The backpropagation (BP) algorithm is currently the most widely used search technique for evolving neural networks (NNs) [1,2]. The BP algorithm is gradient descent in essence, which has been discussed in many textbooks [3]. One characteristic of the gradient descent algorithm is rapid convergence to local optima; therefore, adopting the BP algorithm to evolve NNs may have several drawbacks. The ant colony optimization (ACO) algorithm is another ecological system algorithm, proposed by Dorigo et al. [4,5,6,7]. As a swarm intelligence algorithm, it has the advantages of global optimization and easy realization, and it has been successfully used in solving combinatorial optimization problems. In this paper, a hybrid of the ACO and BP (ACO-BP) algorithms is proposed to evolve NNs. The ACO-BP algorithm first uses the ACO algorithm to search for a near-optimal solution and then adopts the BP algorithm to find the accurate solution. The former avoids being trapped in local optima, while the latter can rapidly find the accurate solution and thereby accelerates the evolving speed. This paper is organized as follows. A brief review of the ACO and BP algorithms is provided in Section 2. In Section 3, the proposed ACO-BP algorithm is detailed. Section 4 gives the performance evaluation. The conclusion is reported in Section 5. J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 714 – 722, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 ACO and BP Algorithms
The ant colony optimization algorithm draws its inspiration from the behavior of real ants as they move from their nests towards a food source along a shorter path. ACO has been successfully applied to many complex combinatorial optimization problems, such as the TSP [4], quadratic assignment problems [8], vehicle routing problems [9] and so on. The basic ACO algorithm introduced by Dorigo et al. [4] is outlined in Fig. 1. Since the ACO algorithm simultaneously searches in many directions using a population of ants, the probability of finding a global optimum greatly increases.
Ant Colony Optimization Algorithm
1. Initialize
   Represent the underlying problem by a weighted connected graph.
   Set an initial pheromone level on every edge.
2. Repeat
   2.1. For each ant do
        Randomly select a starting node.
        Repeat
            Move to the next node according to a node transition rule.
        Until a complete tour is fulfilled.
   2.2. For each edge do
        Update the pheromone intensity using a pheromone updating rule.
   Until the stopping criterion is satisfied.
3. Output the global best tour.
Fig. 1. The basic ACO algorithm flow chart
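The loop of Fig. 1 can be made concrete with a minimal Ant System for a small symmetric TSP. This is an illustrative sketch rather than the authors' code: the purely pheromone-proportional transition rule and the constants `rho` and `Q` are simplifying assumptions (the usual heuristic visibility term is omitted).

```python
import random

def aco_tsp(dist, n_ants=20, n_iters=100, rho=0.9, Q=1.0):
    """Minimal Ant System for a symmetric TSP over a distance matrix."""
    n = len(dist)
    tau = [[1.0] * n for _ in range(n)]          # initial pheromone on every edge

    def tour_length(tour):
        return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

    best_tour, best_len = None, float("inf")
    for _ in range(n_iters):
        tours = []
        for _ in range(n_ants):
            start = random.randrange(n)          # randomly select a starting node
            tour, unvisited = [start], set(range(n)) - {start}
            while unvisited:                     # node transition rule:
                cur = tour[-1]                   # pheromone-proportional choice
                cand = list(unvisited)
                nxt = random.choices(cand, [tau[cur][j] for j in cand])[0]
                tour.append(nxt)
                unvisited.discard(nxt)
            tours.append(tour)
        # pheromone updating rule: evaporate, then deposit on traversed edges
        for row in tau:
            for j in range(n):
                row[j] *= rho
        for tour in tours:
            length = tour_length(tour)
            if length < best_len:
                best_tour, best_len = tour[:], length
            for i in range(n):
                a, b = tour[i], tour[(i + 1) % n]
                tau[a][b] += Q / length
                tau[b][a] += Q / length
    return best_tour, best_len
```

On a small instance the colony quickly concentrates pheromone on the shortest cycle, which is the global-search property the paper relies on.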
The BP algorithm was developed and popularized by Rumelhart and McClelland [10,11]. Consider batch-style BP training. Given a set of P training samples, training can be characterized as minimizing the following sum-squared error:
J(W) = (1/2) Σ_{s=1..P} Σ_{i=1..N_M} (d_{s,i} − y^M_{s,i})²    (1)

where d_{s,i} and y^M_{s,i} are the i-th target and actual outputs for the s-th training pattern respectively, W is the vector of all the weights and biases in the network, and N_M is the number of output units. An initial weight vector W_0 is iteratively adapted according to the following recursion to find an optimal weight vector, where the positive constant η is the learning rate:
W_{k+1} = W_k − η ∂J(W)/∂W    (2)
In the first phase, the actual outputs of the network are computed forward from the input layer to the output layer. In the second phase, the gradient is
Y.-P. Liu, M.-G. Wu, and J.-X. Qian
calculated in a back-propagating fashion, which makes it possible to adjust the weights in a descent direction. This procedure is repeated over the training patterns until all error signals between the desired and actual outputs are sufficiently small.
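Equations (1) and (2) amount to one batch gradient-descent step. Below is a sketch for a one-hidden-layer network with sigmoid hidden units and a linear output; the paper does not give code, so the function name, shapes, and learning rate here are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_batch_step(W1, b1, W2, b2, X, D, eta=0.1):
    """One batch-mode BP step for a one-hidden-layer network with sigmoid
    hidden units and a linear output, minimizing J(W) of Eq. (1)."""
    # first phase: forward pass from input layer to output layer
    H = sigmoid(X @ W1 + b1)
    Y = H @ W2 + b2
    E = Y - D                               # output error, shape (P, N_M)
    # second phase: gradients of J = 0.5 * sum(E**2), computed backwards
    gW2 = H.T @ E
    gb2 = E.sum(axis=0)
    dH = (E @ W2.T) * H * (1.0 - H)         # error propagated to hidden layer
    gW1 = X.T @ dH
    gb1 = dH.sum(axis=0)
    # gradient-descent update of Eq. (2)
    return W1 - eta * gW1, b1 - eta * gb1, W2 - eta * gW2, b2 - eta * gb2
```

Repeated application of this step drives J(W) downwards, which is exactly the rapid local convergence the text attributes to BP.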
3 Hybrid of ACO and BP Algorithms

3.1 The Basic Idea

Once the architecture of a neural network is decided, the network must be trained before use. Suppose there are D parameters in the network, consisting of all the weights and biases. Evolving the NN can then be regarded as searching for the optimal combination of the D parameters in their solution spaces. There are numerous candidate points for each parameter, so the number of candidate combinations is also enormous, and in all likelihood the resulting combinatorial objective function is multimodal. As mentioned above, ACO is therefore a natural choice for this combinatorial optimization. To apply ACO, the definition space of each parameter is split into a set of discrete points, each point being a candidate value of the corresponding parameter. An ant must choose exactly one value for each parameter from among the candidate points, much as an ant visits each city exactly once in the TSP, and it records the chosen tag. A pheromone table is needed for each parameter, formed as in Table 1, where w_i is the i-th parameter to be optimized, a_i is a split calibration (which we call a point), τ(i) represents the pheromone intensity of point a_i, and m is the number of shares into which the space is divided, so there are m+1 points in total for every parameter.

Table 1. Pheromone table for each weight or bias
w_i:
Tag                  | 1     | 2     | …   | m+1
Split calibration    | a_1   | a_2   | …   | a_{m+1}
Pheromone intensity  | τ(1)  | τ(2)  | …   | τ(m+1)
When an ant reaches the parameter w_i, it selects a value according to the probability

P(i) = τ(i) / Σ_{1≤j≤m+1} τ(j)    (3)
An ant finishes a tour when it selects values for all the parameters. Then it returns to its nest and updates the pheromone table simultaneously according to the equation:
τ(i) ← ρ·τ(i) + Δτ(i)    (4)

where ρ ∈ (0,1) is the pheromone duration coefficient and Δτ(i) = Q/Err denotes the increment of pheromone: Q is a constant, and Err is the error between the actual outputs of the network and the target outputs.

A better combination of parameters can be found by the ACO scheme; however, the output precision of the NN is not very high, because the feasible region of the parameters is split into discrete points. A further search algorithm is therefore needed to improve the output precision. As mentioned above, BP is a good choice, since it converges rapidly to a local optimum. The BP algorithm usually initializes all the weights of the network with small random values, so it runs the risk of being trapped in a local minimum; here, ACO provides BP with better initial values. Consequently, both the training effectiveness and the evolving speed can be enhanced. The basic idea of the hybrid ACO-BP algorithm is thus simple: use ACO to search for the optimal combination of all the network parameters, and then use BP to find the accurate value of each parameter. The framework of the ACO-BP scheme is shown in Fig. 2.
[Flow chart: Start → Initialization → loop { Release ants → Search for the best combination → Update pheromone } until the termination criterion is satisfied → Output the best parameters as the initial weights and biases of BP → BP training → End]
Fig. 2. Framework of combining the ACO and BP algorithms
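The ACO phase of Fig. 2, with point selection by Eq. (3) and the update of Eq. (4), can be sketched as below. `aco_search_weights` and `err_fn` are hypothetical names, and a small guard against division by zero is added to the deposit Δτ = Q/Err; treat the sketch as an assumption about the scheme, not the authors' implementation.

```python
import random

def aco_search_weights(err_fn, n_params, w_min=-2.0, w_max=2.0, m=59,
                       n_ants=30, n_iters=100, rho=0.9, Q=1.0, tau0=1.0):
    """ACO over discretized network parameters: one pheromone table per
    parameter (Table 1), point selection by Eq. (3), update by Eq. (4)."""
    points = [w_min + k * (w_max - w_min) / m for k in range(m + 1)]
    tau = [[tau0] * (m + 1) for _ in range(n_params)]   # pheromone tables
    best, best_err = None, float("inf")
    for _ in range(n_iters):
        for _ in range(n_ants):
            # Eq. (3): pick one tagged point per parameter, pheromone-proportionally
            tags = [random.choices(range(m + 1), tau[i])[0]
                    for i in range(n_params)]
            w = [points[t] for t in tags]
            err = err_fn(w)
            if err < best_err:
                best, best_err = w, err
            # Eq. (4): evaporation everywhere (so Δτ = 0 for unvisited points),
            # deposit Δτ = Q/Err on the points the ant visited
            for table in tau:
                for k in range(m + 1):
                    table[k] *= rho
            for i, t in enumerate(tags):
                tau[i][t] += Q / max(err, 1e-12)
    return best, best_err
```

The returned `best` vector is what the BP phase would then take as its initial weights and biases.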
3.2 Details of the ACO-BP Algorithm

Given that the architecture of a feed-forward neural network has been fixed, the numbers of connection weights and biases are determined accordingly. The detailed steps of evolving the neural network with the ACO-BP scheme are as follows.

Step 1: Initialization. Divide the range [Wmin, Wmax] of each parameter uniformly into m shares. Usually, Wmin = −2 and Wmax = 2 are not a bad choice. Create a pheromone table for each parameter; every point starts with the same amount of pheromone τ0. Then N ants start from the nest, and each one executes Step 2.

Step 2: Touring. Every ant moves from one parameter to the next, selecting a value with the probability given by equation (3). It selects one and only one value for each parameter and records it. When it has chosen values for all the parameters, the ant reaches the destination. The values recorded by the ant constitute all the parameters of the network, which is thereby fully determined. The training samples are fed into the network and the outputs obtained; the errors between the target outputs and the actual outputs are calculated. Then the ant executes Step 3.
Step 3: Pheromone table updating. The ant goes back to its nest along the path it has traversed, and the pheromone intensity of the corresponding points on the path is updated according to equation (4): for the points it has visited, Δτ = Q/Err, and for all other points, Δτ = 0.
Step 4: Checking the stop criterion. When all the ants have returned to the nest, ACO finishes one iteration. Repeat Steps 2–3 until the maximum number of iterations has been reached or all the ants converge on one path.

Step 5: Evolving the network using the BP algorithm. Use the best parameters found by ACO as the initial weights and biases for BP. Calculate the errors between the desired and actual outputs, and propagate the errors from the output layer back layer by layer. The weight vector is modified according to equation (2). Repeat the training process until the error satisfies the accuracy constraint or the maximum number of epochs has been reached.
Step 6: Testing the network. Use the test samples to check the generalization ability of the trained network. If the errors satisfy the desired criterion, stop; otherwise, restart from Step 1.

3.3 Parameter Selection

The parameters of the ACO-BP scheme have an important effect on its performance. They are analyzed as follows.

(1) Wmin, Wmax and the split number m. [Wmin, Wmax] should cover all the potential weight solutions as far as possible, and its width should be appropriate. If the interval between Wmin and Wmax is small, ACO cannot fully exert its advantage because the differences between combinations are too small. If the interval is large while the split number m is fixed, the precision of finding the optimal solution is low, and the best solution may be missed in certain cases. So a proper choice of these three parameters is very important. Generally speaking, Wmin = −2, Wmax = 2 and m = 59–79 are not a bad choice.

(2) The number of ants N and the duration coefficient ρ. A large N endows the scheme with a powerful ability to explore more candidate solutions per iteration, but at the expense of more CPU time in a non-parallel implementation. A small N is not good for exploring new paths, especially in the later phase of the ACO algorithm. Generally, N can be set to 30–70. The duration coefficient ρ enables the
ants to forget old solutions and explore new paths. Usually ρ can be set to 0.8–0.95. The other ACO parameters in the ACO-BP scheme can take the typical values used for the TSP. As for the learning rate and the other BP parameters, their relationship with training effectiveness is discussed in Ref. [3].
4 Simulation Results

To evaluate the performance of the proposed ACO-BP scheme, two simulation experiments were performed, and standard BP (SBP for short) and ACO-BP are compared on the basis of the simulation results. Both algorithms were implemented on a PC with a 1.7 GHz Pentium IV processor.

Experiment 1. Nonlinear system identification is one of the most important application fields of neural networks. The function adopted is the following, proposed by Mackay [12]:
f(x) = 1.1 (1 − x + 2x²) exp(−x²/2)    (5)
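Function (5) is easy to reproduce for generating training data. In the sketch below, the sampling range, sample count, and noise level are illustrative choices and not taken from the paper.

```python
import math
import random

def f(x):
    """Target function of Eq. (5)."""
    return 1.1 * (1.0 - x + 2.0 * x * x) * math.exp(-x * x / 2.0)

def make_samples(n=100, lo=-4.0, hi=4.0, noise_std=0.0, seed=None):
    """Training pairs (x, f(x) + noise). The sampling range and noise level
    are illustrative choices, not taken from the paper."""
    rng = random.Random(seed)
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return [(x, f(x) + rng.gauss(0.0, noise_std)) for x in xs]
```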
A three-layer network architecture is adopted for both the ACO-BP and SBP algorithms. The network has one hidden layer with 6, 7, 8, 9, and 10 hidden nodes respectively. The transfer function of the hidden layer is the standard sigmoid, and that of the output layer is linear (purelin). For the ACO-BP scheme, the number of ants, Wmin, Wmax, and the maximum number of iterations are set to 30, −2, 2, and 100 respectively. The split number is 59, i.e., the range [−2, 2] is divided into 59 equal shares, giving 60 points. The BP stage of ACO-BP uses the same parameters as the SBP with which it is compared: the learning rate is 0.003, the maximum number of epochs is 20000, and the stopping error is 0.5. Each experiment is repeated 50 times. The error measure used to evaluate both algorithms is defined as follows:
MSSE = (1/n_s) Σ_{i=1..n_s} (y_i − ŷ_i)² × 100    (6)
where n_s is the number of samples used during training, and y_i and ŷ_i are the actual output and the target output respectively. The results are shown in Table 2. Since the ability of a neural network to predict out-of-sample rather than in-sample is more important for these problems, the following discussion focuses on each algorithm's performance on the testing data. The first column gives the number of hidden nodes; columns 2–9 show the maximum, minimum, and mean test errors and the CPU time of the SBP and ACO-BP algorithms respectively.
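The error measure of Eq. (6) translates directly into code:

```python
def msse(actual, target):
    """Mean sum-squared error scaled by 100, Eq. (6)."""
    ns = len(actual)
    return 100.0 * sum((y - t) ** 2 for y, t in zip(actual, target)) / ns
```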
Table 2. Test error and CPU time

Hidden          BP                            ACO-BP
nodes   Max   Min   Mean  Time(s)     Max   Min   Mean  Time(s)
  6     8.45  0.04  0.30   27.0       0.45  0.01  0.16   20.5
  7     8.70  0.02  0.49   26.8       0.53  0.04  0.16   21.5
  8     8.66  0.04  0.63   28.4       0.38  0.05  0.17   22.2
  9     8.83  0.05  0.68   32.3       0.73  0.05  0.18   22.4
 10     8.44  0.07  0.43   28.3       0.62  0.06  0.18   22.8
Fig. 3. Curves of identification by the ACO-BP (a) and BP (b) algorithms. The training set includes noise with mean 0 and standard deviation 1. The dashed line is the real curve of function (5); the solid line is the approximation by the neural network.
From the results, we can observe that ACO-BP reliably outperforms SBP. The differences between the maximum and minimum test errors of ACO-BP are much smaller than those of SBP, showing that the former is more robust. Furthermore, the mean test error of ACO-BP is smaller than that of SBP, and its CPU time is shorter, too. This indicates that ACO-BP outperforms SBP in both computation precision and speed. When the number of hidden neurons varies, the performance of ACO-BP is still reliably better than that of SBP, which matters for the design of neural networks, since there are currently no firm rules for choosing the number of hidden nodes. Fig. 3 shows the identification curves from one experiment; the identification result of ACO-BP (a) is clearly better than that of SBP (b). But why does ACO-BP run faster? The main reason is the decrease in iterative epochs: for the ACO-BP algorithm, we obtained satisfactory results in 100 ACO iterations plus, on average, 10000 BP epochs. But
for the SBP algorithm, the same results required as many as 20000 epochs. In other words, 100 ACO iterations are roughly as effective as 10000 SBP epochs, so the time spent by ACO-BP is much less than that spent by SBP.

Experiment 2. To further demonstrate the effectiveness of the ACO-BP scheme, another experiment was performed: a chemical application of the BP feed-forward neural network to modeling quantitative structure-activity relationships of herbicides. The data come from Ref. [13] and were divided into two subsets, a training set and a test set. To compare the effectiveness of the ACO-BP and BP schemes fairly, we trained and tested the network using ten different combinations of the two subsets. For ACO-BP, the ACO stage has the same parameter settings as in Experiment 1, and the BP stage uses the same parameters as the SBP with which it is compared. Each experiment is repeated 50 times. Table 3 shows the results obtained: as far as evolving the neural network is concerned, ACO-BP outperforms SBP in both training error and test error, and the time spent by the ACO-BP scheme is less than that of SBP.

Table 3. Training error, test error, and CPU time
        BP                                   ACO-BP
No.   Training  Test     Time(s)      Training  Test     Time(s)
      error     error                 error     error
1     6.67       9.82     18.9        6.12       9.03     12.9
2     7.20       5.81     19.1        6.42       5.00     13.9
3     6.59       9.95     19.2        5.82      10.07     14.1
4     5.67      14.62     19.3        5.82      14.79     13.8
5     7.38       3.15     19.7        6.70       2.66     13.3
6     6.02      15.30     20.8        6.01      14.02     13.3
7     6.74       8.58     20.0        6.18       7.02     12.8
8     7.12       6.20     20.1        6.42       6.19     13.1
9     7.51       2.81     20.0        6.89       2.81     13.1
10    6.27       8.54     20.2        6.24       7.06     13.1
Aver. 6.72       8.48     19.7        6.26       7.87     13.3
5 Conclusion

The BP algorithm usually used to evolve feed-forward neural networks is unstable and is easily trapped in local optima, but it is simple and converges rapidly to a local optimum. The ant colony optimization algorithm, in contrast, has a powerful ability to search for the global optimum. A proper hybrid of the two algorithms can therefore accelerate the evolving speed and improve the forecasting precision of the trained neural network. The basic idea of the ACO-BP scheme is to use ACO
to search for the best combination of all the network parameters, and then adopt the BP algorithm to rapidly find the accurate values. When the number of hidden nodes of the neural network varies, the performance of ACO-BP remains stable, which is important for network design given that there are no specific rules for choosing the number of hidden nodes. Simulation results show the effectiveness and efficiency of the ACO-BP algorithm.
References
1. Sexton, R.S., Dorsey, R.E.: Reliable Classification Using Neural Networks: A Genetic Algorithm and Backpropagation Comparison. Decision Support Systems 30 (2000) 11–22
2. Liu, Z.J., Liu, A.X., Wang, C.Y., Niu, Z.: Evolving Neural Network Using Real Coded Genetic Algorithm (GA) for Multispectral Image Classification. Future Generation Computer Systems 20 (2004) 1119–1129
3. Wang, L.: Intelligent Optimization Algorithms with Applications. Tsinghua University Press, Beijing, China (2001)
4. Dorigo, M., Maniezzo, V., Colorni, A.: Ant System: Optimization by a Colony of Cooperating Agents. IEEE Transactions on Systems, Man, and Cybernetics, Part B 26(1) (1996) 29–41
5. Dorigo, M., Gambardella, L.M.: Ant Colony System: A Cooperative Learning Approach to the Traveling Salesman Problem. IEEE Transactions on Evolutionary Computation 1(1) (1997) 53–66
6. Colorni, A., Dorigo, M., Maniezzo, V.: Distributed Optimization by Ant Colonies. In: Proceedings of the First European Conference on Artificial Life, Paris (1991)
7. Colorni, A., Dorigo, M., Maffioli, F., et al.: Heuristics from Nature for Hard Combinatorial Problems. International Transactions in Operational Research 3(1) (1996) 1–21
8. Maniezzo, V., Colorni, A.: The Ant System Applied to the Quadratic Assignment Problem. IEEE Transactions on Knowledge and Data Engineering 11(5) (1999) 769–778
9. Bell, J.E., McMullen, P.R.: Ant Colony Optimization Techniques for the Vehicle Routing Problem. Advanced Engineering Informatics 18 (2004) 41–48
10. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning Internal Representations by Error Propagation. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge, MA (1986) 318–362
11. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning Representations by Back-propagating Errors. Nature 323 (1986) 533–536
12. MacKay, D.J.C.: Bayesian Interpolation. Neural Computation 4(3) (1992) 415–447
13. Zhang, L.P., Yu, H.J., Chen, D.Z., Hu, S.X.: Application of Neural Networks Based on Particle Swarm Algorithm for Modeling Quantitative Structure-Activity Relationships of Herbicides. Chinese Journal of Analytical Chemistry 12 (2004) 1590–1594
A Genetic Algorithm with Modified Tournament Selection and Efficient Deterministic Mutation for Evolving Neural Network Dong-Sun Kim1 , Hyun-Sik Kim1 , and Duck-Jin Chung2 1
DxB·Communication Convergence Research Center, Korea Electronics Technology Institute, 68 Yatop-Dong, Bundang-gu, SeongNam-Si, Gyeonggi-Do 463-816, Korea {dskim, hskim}@keti.re.kr 2 Information Technology and Telecommunications, Inha University, 253 Younghyun-Dong, Nam-Gu, Incheon 402-751, Korea
[email protected]
Abstract. In this paper, we present a genetic algorithm (GA) based on tournament selection (TS) and deterministic mutation (DM) for evolving neural network systems. We use the population diversity to determine the mutation probability, sustaining the convergence capacity and preventing the local-optimum problem of the GA. In addition, tournament selection considers both the worst-fitness and best-fitness individuals in the population to speed up convergence. Experimental results on mathematical problems and a pattern recognition problem show that the proposed method improves the convergence capability by about 34.5% and reduces the computation by about 40% compared with the conventional method.
1 Introduction
The training of feed-forward neural networks (NNs) by backpropagation (BP) is a time-consuming and complex task of great importance [1]. To overcome this problem, we apply a genetic algorithm (GA) to determine the parameters of an NN automatically, and propose an efficient GA that reduces the iterative computation time and enhances the training capacity of the NN. The use of the GA has steadily increased as a main adaptive mechanism for solving many large-scale optimization problems because of its robust capability of exploring the solution space of a given problem [2,3]. The GA is an optimization algorithm based on Darwin's theory of natural selection, which states that an individual with a higher fitness level is more likely to survive into the next generation [4,5]. A genetic algorithm treats a candidate solution of the problem as a gene, and the collection of genes is called the population. The goal of the genetic algorithm is to come up with a good, though not necessarily optimal, solution to the problem. The general genetic algorithm uses several simple operations to simulate evolution. After an initial population is generated randomly, the fitness of each individual in the population is calculated. The selection operator reproduces the individuals selected to form a new population

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 723–731, 2006. © Springer-Verlag Berlin Heidelberg 2006
according to each individual's fitness. Thereafter, crossover and mutation are performed on the population, and these operations are repeated until some condition is satisfied. The crossover operation swaps part of the genetic bit string between parents, meaning that descendants inherit characteristics from both parents, as in the real world. The mutation operation inverts some bits of the whole bit string at a very low rate [6]. These factors increase the diversity of genes, so that the individuals in the population evolve toward higher fitness. However, the GA suffers from long computation times and local-minima problems because of its iterative adaptive process and evolution mechanism. In this paper, we propose an efficient self-adjusting method for the mutation probability, which selects the mutation probability according to changes in the population diversity, applied here to image pattern recognition. The proposed method maintains a suitable diversity in the population and improves the exploration capacity.
2 Selection Scheme for Genetic Algorithm
We employ a steady-state algorithm with a modified tournament selection and replacement scheme to enhance the convergence speed, as described in Algorithm 1.

Algorithm 1. A genetic algorithm with the proposed tournament selection

  Initialize parent and best chromosome
  While not termination-condition do
    Select two chromosomes from the population
      Chrom_alpha = rnd(population), Chrom_beta = rnd(population)
    Copy the previous parent for the other parent
      Parent_beta = previous Parent_alpha
    Decide one parent from the two chromosomes using the fitness value
      Parent_alpha = Max of Fitness(Chrom_alpha, Chrom_beta)
      Worse_fitness = Min of Fitness(Chrom_alpha, Chrom_beta)
    Do mutation and crossover
    Evaluate the fitness on a specific problem
      Child_fit(alpha, beta) = function(Child_alpha_data, Child_beta_data)
    Update the population with the new offspring, compared against the worse chromosome
      New_chrom = Max of Fitness(Child_alpha_data, Child_beta_data)
      IF (New_chrom_fitness > Worse_fitness) then
        Population(Worse_fitness) = New_chrom
      END IF
      IF (New_chrom_fitness > Best_chrom_fitness) then
        Best_chrom = New_chrom
      END IF
  End While
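A runnable Python rendering of Algorithm 1 might look like the following. The bit-string representation, one-point crossover, and fixed mutation rate are stand-ins for the operators discussed later in the paper, and all names are illustrative.

```python
import random

def steady_state_ga(fitness, n_bits=30, pop_size=128, p_c=1.0, p_m=0.05,
                    max_gen=10000, seed=None):
    """Steady-state GA following Algorithm 1: the better of two randomly
    drawn chromosomes becomes one parent, the previous winner the other,
    and the best offspring replaces the worse of the two drawn chromosomes."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)[:]
    prev_alpha = best[:]
    for _ in range(max_gen):
        i, j = rng.randrange(pop_size), rng.randrange(pop_size)
        a, b = pop[i], pop[j]
        alpha = a if fitness(a) >= fitness(b) else b    # tournament winner
        worse_idx = j if alpha is a else i              # loser's slot
        beta = prev_alpha                               # previous winner
        prev_alpha = alpha[:]
        # one-point crossover
        if rng.random() < p_c:
            cut = rng.randrange(1, n_bits)
            c1 = alpha[:cut] + beta[cut:]
            c2 = beta[:cut] + alpha[cut:]
        else:
            c1, c2 = alpha[:], beta[:]
        # bit-flip mutation
        for child in (c1, c2):
            for k in range(n_bits):
                if rng.random() < p_m:
                    child[k] ^= 1
        new = max((c1, c2), key=fitness)                # better offspring
        if fitness(new) > fitness(pop[worse_idx]):      # replace worse chromosome
            pop[worse_idx] = new
        if fitness(new) > fitness(best):
            best = new[:]
    return best
```

On an easy maximization problem such as counting ones, the replace-the-worse policy steadily raises the population's fitness, which is the convergence pressure the text describes.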
Our proposed GA updates the population whenever the offspring's fitness is better than the worse parent's. It improves the selection scheme and the preservation
of chromosomes compared to the survival-based GA, because increasing the number of chromosomes with a high possibility of attaining an optimal solution pressures the search to converge as fast as possible, compared to the random selection of the survival-based GA.
3 Evolutionary Operators in Genetic Algorithm
There are two main evolutionary operators: mutation and crossover. Through the evolution of a population consisting of several chromosomes, the GA searches the entire search space for the optimal solution. The information for optimization is stored in each individual and passed to the next generation. The probability of passing this information on depends on the fitness of the individual, which is determined through fitness evaluation; an individual with high fitness has a high probability of survival in the next generation. Crossover exchanges chromosome information between two individuals at a random crossover point to create new offspring that inherit a combination of superior characteristics from both parents. Depending on the number of crossover points, crossover can be classified into one-point, multi-point, and uniform crossover; in uniform crossover, all genes have an equal probability of being exchanged. The mutation operator replaces the value of a randomly selected gene within the chromosome with that of its rival gene to create a new individual similar to the existing one. Such an operation can be considered a limited or partial random search, and it maintains the population's diversity by restoring lost rival genes. As the selection, crossover, and mutation processes are repeated, the population is replaced with individuals of higher fitness, eventually finding the optimal solution to the given problem.

3.1 A Mutation Probability Adjustment Using Population Diversity
In the early stages of evolution, the GA explores through its randomly generated population, but it gradually shifts to exploitation as the problem is solved. In this process, the main role of the mutation probability is to give the GA a chance to consider new individuals, preventing premature convergence to a local optimum. A large mutation probability provides opportunities to evaluate new individuals, but it lengthens the evolution process and the convergence time. We conclude that an appropriate mutation probability p_m(t) must be determined at each evolution step, and that the maximum value of p_m(t) should decrease exponentially to keep the GA from excessive exploration. To decide a suitable mutation probability, we define the population diversity D from the average Hamming distance H between two individuals α and β:

D = 1 − H(α, β) = 1 − (1/L) Σ_{i=1..L} |α_i − β_i|    (1)
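Equation (1) can be computed directly for two bit strings:

```python
def population_diversity(alpha, beta):
    """Population diversity D = 1 − H(α, β) of Eq. (1), where H is the
    average Hamming distance between two bit strings of equal length L."""
    L = len(alpha)
    hamming = sum(abs(a - b) for a, b in zip(alpha, beta)) / L
    return 1.0 - hamming
```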
where L is the length of each string and α_i denotes the i-th bit of string α. As the population converges to a solution, the diversity D increases, while a randomly distributed population has low diversity. In addition, the mutation probability takes values in 1/L ≤ p_m(t) ≤ 0.5 [4,5], and we exponentially limit its variation range over the evolution time to prevent excessively strong exploration. Therefore, p_m(t) incorporating the population diversity D for the next generation is given by

p_m(t) = 0.5 exp[(t/T) ln(2/L)] · D^n    (2)

where t, n, and T denote the current generation number, the order of D, and the maximum generation number for evolution respectively. The order n of the population diversity is an impact factor determining the sensitivity of p_m(t) to the population diversity: a small n gives the GA more opportunities to explore new individuals but decreases the exploitation capability. Our proposed method adapts the mutation probability to the population diversity, sustaining the convergence capacity and decreasing the probability of premature convergence in complex problems with many local optima.

3.2 Determination of n and Experimental Results
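The adaptive schedule of Eq. (2) above is easy to code. Note that the printed formula is partly garbled in this copy; the form used below is reconstructed so that, for D = 1, p_m decays exponentially from 0.5 at t = 0 to the stated lower bound 1/L at t = T, matching the bounds 1/L ≤ p_m(t) ≤ 0.5 given in the text. Treat the exact functional form as an assumption.

```python
import math

def mutation_probability(t, T, L, D, n=2):
    """Adaptive mutation probability of Eq. (2): for D = 1 it decays
    exponentially from 0.5 at t = 0 to 1/L at t = T, scaled by D**n."""
    return 0.5 * math.exp(math.log(2.0 / L) * (t / T)) * D ** n
```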
To show the influence of n and the validity of the proposed algorithm, we compare against a survival-based GA with a fixed mutation probability; performance is evaluated by solving complex problems with local minima: the DeJong function, the Rastrigin function, and a mathematical test function [5].

1) DeJong function:

f1 = Σ_{i=1..3} x_i²,   where −5.12 ≤ x_i ≤ 5.12    (3)

2) Rastrigin function:

f2 = 30 + Σ_{i=1..3} (x_i² − 10 cos(2π x_i)),   where −5.12 ≤ x_i ≤ 5.12    (4)

3) Mathematical function:

f3 = 21.5 + x·sin(4πx) + y·sin(20πy),   where 0 ≤ x ≤ 12.1, 4.1 ≤ y ≤ 5.8    (5)

For the experiment, the population size, the probability of 1-point crossover, the string length L, and the maximum generation number T are 128, 1, 30 bits, and 10000 respectively. To determine the fixed mutation probability p_m for the survival-based GA, we measured the average generation number ξ for convergence and the exploitation capacity over 10000 runs for 0.03 ≤ p_m ≤ 0.5 on each problem. We evaluate the exploitation capacity of the GA by the number ρ of runs that find the solution. The
Table 1. Simulation results by the order of population diversity n
(ξ: number of generations, ρ: number of runs finding the solution)

n      0.5          1.0          1.5          (*) 2.0      2.5          3.0
       ξ     ρ      ξ     ρ      ξ     ρ      ξ     ρ      ξ     ρ      ξ     ρ
f1     8320  6400   7383  8800   6748  9000   5701  9000   5449  8600   5077  7700
f2     8483  6500   7453  7500   6665  8000   5694  8200   5190  7300   7923  6300
f3     8379  8600   7895  9800   7864  9900   6552  9900   6206  9600   5813  9400
Table 2. Evaluation of simulation results
(ξ: number of generations, ρ: number of runs finding the solution)

           J.J. Kim et al. [3]    Proposed method     Relative ratio
Function   ξ(a)      ρ(c)         ξ(b)     ρ(d)       (a−b)/a ×100   (d−c)/c ×100
f1         5769      5600         5701     9000       1.2%           60.7%
f2         5842      6000         5691     8200       2.6%           36.7%
f3         6606      9300         6552     9900       0.8%           6.1%
selected p_m for the experiments is 0.05 for f1 and f3, and 0.06 for f2. As shown in Table 1, a large n results in premature convergence and decreases the performance of the GA through weak exploration. Conversely, a small n increases the convergence speed, yet it degrades the performance of the GA as it becomes smaller because of overly strong exploration. For this reason, we choose the optimum value of n from the experimental results, as marked by asterisks in Table 1. Final experimental results comparing the average generation number and the convergence capacity are shown in Table 2. All results in Table 2 are averages over 10000 simulations with different random-number seeds. According to the results, a mutation probability based on population diversity increases the performance of genetic algorithms on problems with many local optima by providing strong exploitation and dynamic exploration.
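The three benchmark functions (3)-(5) translate directly into code (in Eq. (4), the cosine argument is read as 2πx_i; the printed 2πy appears to be a typesetting slip):

```python
import math

def f1(x):
    """DeJong sphere, Eq. (3); minimum 0 at x = (0, 0, 0)."""
    return sum(xi * xi for xi in x)

def f2(x):
    """Rastrigin function, Eq. (4), for three variables; minimum 0 at the origin."""
    return 30.0 + sum(xi * xi - 10.0 * math.cos(2.0 * math.pi * xi) for xi in x)

def f3(x, y):
    """Multimodal test function, Eq. (5), to be maximized."""
    return 21.5 + x * math.sin(4.0 * math.pi * x) + y * math.sin(20.0 * math.pi * y)
```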
4 Genetic Algorithm for Neural Networks
The GA has several advantages over other optimization algorithms because it is a derivative-free stochastic optimization method based on the features of natural selection and biological evolution [4,6]. One of its most significant advantages is robustness against getting trapped in local minima, together with the flexibility to facilitate parameter optimization in complex models such as NNs. For this reason, we focus on learning the network parameters and optimizing the connection weights with the GA when designing an NN architecture.

4.1 Chromosome Representation
A feed-forward NN can be thought of as a weighted digraph with no closed paths, described by an upper or lower diagonal adjacency matrix with real-valued entries. The nodes are placed in a fixed order according to layers. An adjacency matrix is an N × N array with
elements. The nodes should be in a fixed order according to layers. An adjacency matrix is an N × N array in which elements [1]. if < i, j >∈ /E if < i, j >∈ E
ηij = 0 ηij =0
for all i ≤ j for all i ≤ j
(6) (7)
where i, j = 1, 2, …, N; ⟨i, j⟩ is an ordered pair representing an edge or link between neurons i and j; E is the set of all edges of the graph; and N is the total number of neurons in the network. The biases of the network appear on the diagonal: η_ij is nonzero when i equals j. Thus, the adjacency matrix of a digraph can contain all information about the connectivity, weights, and biases of a network. A layered feed-forward network is one in which every path from an input node to an output node has the same length; thus, an n-layered NN has path length n, and the adjacency matrix of the corresponding feed-forward NN is an upper or lower diagonal matrix [7].

4.2 Fitness Function
The fitness function for the NN problem is the sum squared error (SSE) of the network. If y_d is the target output and y is the actual output, it can be defined as

J(ω, θ) = Σ_{d ∈ outputs} e_d²,   where e_d = y_d − y    (8)

Here ω and θ denote the weights and biases linking each neuron unit to the previous neuron layer. The objective is to minimize J with respect to the weights and biases ω_ji, ω_kj, θ_j, and θ_k.
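Equation (8) in code, for a vector of target and actual outputs:

```python
def sse_fitness(targets, outputs):
    """Sum squared error J of Eq. (8), accumulated over the output units."""
    return sum((yd - y) ** 2 for yd, y in zip(targets, outputs))
```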
5 Pattern Recognition for Performance Estimation
The proposed algorithm is used in an experiment to distinguish four coins with similar circular patterns and to recognize the rotation angle of each coin from its pattern. In general, coins can be distinguished by their size and weight, but in this experiment we attempted the distinction under the assumption that all pattern sizes are the same at the pixel level. Figure 1 shows the basic image patterns of the coins used in the experiment. To evaluate the capability of calculating the rotation angle and recognizing the pattern, each basic pattern is
Fig. 1. Coin pattern for experiment
A Genetic Algorithm with Modified Tournament Selection
729
Fig. 2. Pattern generation flow
Fig. 3. Variation of the population according to the number of fitness calculations
rotated by 30°, 180° and 270° to create twelve test patterns, and four additional patterns with random noise are generated before normalizing the pixel size. As shown in Figure 2, the image of each coin is extracted from the background using the projection method. The extracted images are resized to 64 × 64 and binary image patterns are generated. For fast and accurate recognition, each individual is expressed in Gray code, and a 2-point crossover is used to preserve good schemata. Each chromosome consists of 11 bits: the first two bits are used for pattern recognition and the remaining bits for rotation-angle calculation. The absolute difference between two patterns, the similarity, is used for evaluating the fitness value; the initial difference value of each of the four patterns is approximately 0.7. In addition, bilinear interpolation is applied to obtain the exact image pattern after rotation. Figure 3 shows the change of the population with similar angles over the generations. If the angle of maximum similarity is defined as θ, similar angles are any values between (θ − 10°) and (θ + 10°). As the number of generations increases, the exploration area concentrates around θ. These characteristics of the GA reduce the number of calculations needed for recognition and improve the recognition speed. In the experiment, the input patterns obtained by rotating the basic pattern by 30°, 180° and 270° are recognized correctly and the exact degrees
Table 3. Number of generations for finding the pattern with maximum similarity

Input Pattern   Rotation angle   Number of generations           Standard
                                 Average    Max      Min         Deviation
Pattern 1       30°              200.8      278      150         46.21
                180°             148.0      192       80         38.56
                270°             184.2      300       78         84.44
Pattern 2       30°              215.1      262      132         49.00
                180°             155.8      276       86         86.97
                270°             212.4      294      130         60.12
Pattern 3       30°              240.8      284      172         37.84
                180°             226.8      290      162         44.19
                270°              97.0      278       84         63.31
Pattern 4       30°              200.8      290       82         63.13
                180°             174.6      234       78         44.99
                270°             170.2      238      156         32.14
Total                            195.7      300       78         59.23
Table 4. Comparison of experimental results

                     Pattern matching method   Survival-based GA   Proposed GA
Rotation numbers     512                       255                 195.7
Similarity           2048                      255                 195.7
are found. As shown in Tables 3 and 4, pattern recognition with the proposed GA reduces the computational cost by about 40% and improves the convergence speed of the rotation-angle decision by about 10% compared with the previous pattern matching method.
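The Gray encoding used for the chromosomes guarantees that adjacent integer values differ in a single bit, which suits a smooth angle search. A sketch of encoding and decoding (the 2-bit/9-bit split follows the text; the mapping of the 9-bit index onto degrees is our assumption):

```python
def to_gray(n):
    # Binary-reflected Gray code of a non-negative integer.
    return n ^ (n >> 1)

def from_gray(g):
    # Invert the Gray code by cumulative XOR of the shifted value.
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def decode_chromosome(bits):
    """11-bit chromosome: first 2 bits select the pattern (0-3), the
    remaining 9 bits Gray-encode a rotation-angle index (0-511)."""
    assert len(bits) == 11
    pattern = bits[0] * 2 + bits[1]
    gray = int("".join(str(b) for b in bits[2:]), 2)
    # Assumed mapping of the 9-bit index onto [0, 360) degrees.
    angle = from_gray(gray) * 360.0 / 512.0
    return pattern, angle
```

Because neighbouring angles differ in only one bit, a single mutation moves the search to a nearby angle, which supports the concentration around θ described above.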
6 Conclusion
In this paper, a genetic algorithm using a new mutation probability has been proposed for evolving neural network systems. The probability is adjusted by population diversity and evolution time. Experimental results show that the proposed method can enhance performance by reinforcing both the exploitation and the exploration of genetic algorithms. Also, in order to improve convergence speed, each individual is represented in Gray code, which has the minimum Hamming distance to adjacent codes, and a new tournament selection algorithm is applied by modifying the reproduction method. Our proposed method can easily be applied to other types of genetic algorithms to improve their performance, and can be used in various applications such as evolvable hardware, speech recognition, and computer vision.
References

1. Hicklin, J., Demuth, H.: Modeling Neural Networks on the MPP. Proceedings of the 2nd Symposium on the Frontiers of Massively Parallel Computation (1988) 39-42
2. Ho, C. W., Lee, K. H., Leung, K. S.: A Genetic Algorithm Based on Mutation and Crossover with Adaptive Probabilities. Proceedings of the 1999 Congress on Evolutionary Computation (1999) 768-775
3. Sunghoon, J.: Queen-bee Evolution for Genetic Algorithms. Electron. Lett. 39(6) (2003) 575-576
4. Kim, J. J., Choi, Y. H., Lee, C. H., Chung, D. J.: Implementation of a High-Performance Genetic Algorithm Processor for Hardware Optimization. IEICE Trans. Electron. (1) (2002) 195-203
5. Hesser, J., Manner, R.: Towards an Optimal Mutation Probability for Genetic Algorithms. Proceedings of the First Conference on Parallel Problem Solving from Nature (1990) 23-32
6. Gong, D., Pan, F., Sun, X.: Research on a Novel Adaptive Genetic Algorithm. Proceedings of the 2002 IEEE International Symposium on Industrial Electronics (2002) 357-359
7. Yentis, R., Zaghloul, M. E.: VLSI Implementation of Locally Connected Neural Networks for Solving Partial Differential Equations. IEEE Trans. Circuits Syst. I, Fundam. Theory Appl. 43(8) (1996) 687-690
A Neural Network Structure Evolution Algorithm Based on e, m Projections and Model Selection Criterion Yunhui Liu, Siwei Luo, Ziang Lv, and Hua Huang School of Computer and Information Technology, Beijing Jiaotong University, 100044, Beijing, China
[email protected]
Abstract. According to biological and neurophysiological research, there is a blooming burst of synapses during the physiological growth of a newborn infant's brain. These myriad nerve connections are pruned, and the dendrites of neurons change their conformation, during the infant's subsequent cognitive development. Simulating this pruning process, a new neural network structure evolution algorithm is proposed based on e and m projections in information geometry and a model selection criterion. The structure evolution process is formulated as iterative e, m projections and is stopped using the model selection criterion. Experimental results verify the validity of the algorithm.
1 Introduction

Biological and neurophysiological research indicates that nearly half of the human genes are dedicated to the intricate task of constructing and maintaining the brain. Despite this seemingly large army of genes, it is not enough to choreograph the trillions of connections necessary for a functioning brain. Therefore, the brain does not follow a detailed step-by-step plan and construct itself cell by cell and synapse by synapse. Rather, our brains construct and evolve their structure in a quite different way: first, we overproduce neurons, and then trim those deemed unnecessary. The neurons in the infant brain blossom and bloom at an incredible rate [1][2]. At the peak of growth, an astounding 250,000 cells are born each minute, and in the two months before and after birth an amazing 1.8 million new synapses are made each second, resulting in a stunningly dense thicket of neurons. However, this complex network does not last. The cells and connections that are not essential are weeded out. Neurons and synapses that are either inactive or not required to meet the demands presented by the environment are deleted, a process called "pruning". Beginning in the first two years of life and lasting into adolescence, an average of 20 billion synapses are pruned each day, shaping the brain into an efficient machine capable of meeting the demands of the environment. This biological research indicates that structure reduction, or pruning, plays a very important role in the structure evolution of the human brain, and simulating this process is helpful for studying the structure learning problem of neural networks. In this paper, the structure reduction process is studied by simulating the brain's evolution process in newborn babies; thus, the term "structure evolution" here mainly means structure reduction. First, an information geometric explanation of structure

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 732 – 738, 2006. © Springer-Verlag Berlin Heidelberg 2006
reduction is given. In the information geometric framework, most kinds of neural networks form exponential or mixture manifolds, which have a natural hierarchical structure. Such a parameter space has rich geometrical structures that are responsible for the dynamic behaviors of learning. The structure evolution problem is formulated as iterative e and m projections from the current manifold to its submanifold or conversely. Then, a model selection criterion is used to decide the stop time of the evolution process. Finally, some experimental results are given to verify the algorithm.
2 Related Theory of Information Geometry

Information geometry [3], which originated from the information structure of manifolds of probability distributions, has been successfully applied to many fields such as information theory and neural networks. It is good at studying the properties which a family of probability distributions possesses as a whole. Such a family forms a geometrical manifold with many important geometric structures, which are very important in studying the learning process of parameterized statistical models. A neural network with modifiable parameters θ = (θ1, ..., θn) can be described by a parameterized family of probability distributions S = {p(x, θ)} from the viewpoint of statistics. The set of all possible neural networks realized by changing θ forms an n-dimensional manifold S called the neuromanifold, where θ plays the role of a coordinate system. Each point in S represents a network depicted by a probability distribution p(x; θ) or p(y | x; θ). Various important neural networks, such as the Boltzmann machine, the stochastic multilayer perceptron and the mixture of expert networks, are shown to be in the exponential or mixture family, which can be described by a dually flat manifold structure [4]. In many cases, such manifold systems have hierarchical structures such that systems with a smaller number of parameters are included in the space with a larger number of parameters. Hierarchical systems have interesting geometry closely related to the dynamical behaviors of learning of the system. Suppose a flat manifold S with parameter θ = (θ1, ..., θn) and two dual coordinate systems θ and η. Consider a nested series of e-flat submanifolds [4]
    E1 ⊂ E2 ⊂ ··· ⊂ En                                              (1)
where every Ek is an e-flat submanifold of Ek+1. Each Ek is automatically dually flat. We call such a nested series an e-flat hierarchical structure, or e-structure. A typical example of the e-structure is the exponential-type distributions [3]. If we consider the k-cut by which

    θk+ = (θ^{k+1}, θ^{k+2}, ···, θ^n),                              (2)

    ηk− = (η1, η2, ···, ηk)                                          (3)

then, when θk+ is fixed equal to a constant vector ck+ = (c^{k+1}, ..., c^n) but the other θ coordinates are free, we have an e-flat submanifold for each k:

    Ek(ck+) = { p(x, θ) | θk+ = ck+ }                                (4)

These give a foliation of the entire manifold:

    ∪_{ck+} Ek(ck+) = S                                              (5)

For ck+ = 0, a hierarchical structure is introduced in S because

    E1(0) ⊂ E2(0) ⊂ ··· ⊂ En(0) = S                                  (6)
Dually to the above, we can define the m-structure and m-flat foliations. By the above analysis, a single structure can be divided into many hierarchical structures by dividing the parameter vector. Every such hierarchical structure forms a dually flat submanifold embedded in its flat enveloping manifold, and the dimensionality of the submanifold is smaller than that of its enveloping manifold. From equation (6), we know that if we fix more and more elements of the parameter vector at zero, a sequence of submanifolds corresponding to the new parameter vectors is produced, each embedded in the next. This action of setting some elements of the parameter vector to zero corresponds exactly to pruning some connections or neurons of the neural network, which is the very meaning of structure evolution in this paper.
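In code, this correspondence between zeroing coordinates and pruning is direct. A small sketch (ours) shows how fixing more and more parameters at zero yields a nested sequence of submodels:

```python
import numpy as np

def prune(theta, indices):
    """Fix the chosen coordinates of the parameter vector at zero:
    this moves from the enveloping manifold to the submanifold in
    which the corresponding connections are removed."""
    theta = np.asarray(theta, dtype=float).copy()
    theta[list(indices)] = 0.0
    return theta

theta = np.array([0.8, -0.3, 1.2, 0.05, -0.7])   # a point of S
sub1 = prune(theta, [3])        # one connection pruned
sub2 = prune(theta, [1, 3])     # two pruned: contained in the first submodel
```

Every coordinate zeroed in sub1 stays zero in sub2, mirroring the nesting of submanifolds in Eq. (6).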
3 Structure Evolution Algorithm

3.1 Evolution as Iterative e, m Projections
In this section, we study connection pruning, simulating the brain's growth process of a newborn baby. In fact, the same analysis can be done in the case of node pruning. For simplicity, we consider the first evolution step of removing one connection from the initial network; the succeeding pruning steps can be analyzed in the same way. Suppose the initial, overly complicated network with parameter θ forms a manifold S = {p(x, θ)} and the true distribution p* = p(x, θ*) is included in it. Consider all potential network models after pruning one connection, E^i_{n−1} = S \ vi, i = 1, ..., n, where n is the number of connections in S and vi is the i-th connection to be pruned. Each E^i_{n−1} forms a submanifold of S. Then we perform an m projection from p* to each E^i_{n−1} to find the point p*_{n−1} that minimizes the divergence D^i(p*, p_{n−1}) from p* to a point p_{n−1} in E^i_{n−1}; that is, we get

    p*_{n−1} = arg min_{p_{n−1} ∈ E^i_{n−1}} D^i(p*, p_{n−1})

and choose the connection j to prune which minimizes D^i(p*, p*_{n−1}), that is, j = arg min_{i ∈ [1, n]} D^i(p*, p*_{n−1}).
If the true distribution is known, the above process can find the pruned network by m projection alone. This is an ideal case; in practice, the true distribution or true parameter is unknown, so we must introduce e projection into this pruning
process. In our former work [5], we used the empirical distribution to approximate the true distribution and used the m projection only in the pruning process. Although this method has already shown good performance, it fits only a simplified ideal case and is understandably sensitive to the sample size. In this paper, we bring in the other information projection of information geometry, the e projection, and the empirical distribution now serves only as the initial value of the iterative process. The first evolution step is thus changed as follows. First, we use the empirical distribution (often we use the corresponding parameter to denote this distribution; in the manifold it is a point) as the starting point of the iteration. Unlike in the above algorithm, after the m projection from the current manifold to its submanifold, we further perform the e projection in the converse direction. The divergence D^i(p, p_{n−1}), where p is in the current manifold and p_{n−1} in E^i_{n−1}, decreases during the projection process until it reaches its minimum. When the divergence reaches its minimum, the parameters in the current manifold and its submanifold both reach their optimal values. The same process is applied to all n potential submanifolds, and the one with the shortest divergence is selected, as in the former algorithm. The e projection here corresponds to a feedback mechanism, which makes the evolution algorithm more exact. The above is the first evolution step, and the succeeding evolution steps can be done similarly. In a word, consider all potential unit removals {E_{k+1} \ vi | i = 1, ..., n} and select the unit which minimizes D(p*_{k+1}, p*_k), where E_{k+1} is the current model and p*_k in E_k is the iterative result of p*_{k+1} in E_{k+1} after several e, m projections.

3.2 Stop Time Decided by Model Selection Criterion
Model selection criteria can be used to reconcile two desirable yet conflicting properties: model complexity and model-data fitness. Commonly used criteria include the Akaike Information Criterion (AIC) [6], the Bayesian Information Criterion (BIC) [7] and the Minimum Description Length (MDL) principle [8]. All these criteria balance the contrary goals of maximizing data fidelity while minimizing model complexity; the minimum value of the criterion represents a trade-off between the two conflicting properties. In our algorithm, we use a model selection criterion to decide the stop time of the evolution. Here we use the simple BIC. This criterion fits into the framework min C(p_E) ≡ D(p*, p_E) + δk, where k is the number of parameters in the model E and δ ≥ 0 is chosen as δ_BIC = log N / (2N), with N the number of samples. Consider a sequence of embedded manifolds E_{n−k} ⊂ E_{n−(k−1)} ⊂ ··· ⊂ E_{n−1} ⊂ E_n = S, where E_{n−k} has had k connections removed from the initial manifold S. The cost function is defined as C(p*_{n−k}) = D(p*_n, p*_{n−k}) + δ k_{E_{n−k}}, where the first term denotes the model-data fitness and the second term the model complexity, which adopts the complexity form of BIC. p*_n is the probability of the initial, overly complicated network S, approximating the true probability, estimated by the e, m projection process.
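The BIC-style cost is simple to compute; this helper (ours) just instantiates the framework above:

```python
import math

def bic_cost(divergence, k, n_samples):
    """C(p_E) = D(p*, p_E) + delta * k, with delta = log(N) / (2N),
    where k counts the parameters of the candidate model E."""
    delta = math.log(n_samples) / (2.0 * n_samples)
    return divergence + delta * k
```

Between two candidate models, the one with the smaller C wins, trading fit (the divergence) against complexity (the parameter count k).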
Then the cost function is decomposed by the Pythagorean theorem to get:

    C(p*_{n−k}) = Σ_{i=n−k}^{n−1} [ D(p*_i, p*_{i+1}) − δ ] + δ k_{E_n}    (7)

Hence, so long as there exists a favorable connection removal such that D(p*_i, p*_{i+1}) < δ, we may decrease the cost function by projecting to a lower-order model. Also, just as depicted above, the most favorable connection removal is the one which minimizes the KL divergence D(p*_i, p*_{i+1}).

3.3 Algorithm Description
The basic idea of this algorithm is as follows:

1. Take the current structure of model M as a fully connected network F, with the empirical parameter θ*_M as the initial value, and δ = log N / (2N), where N is the sample size.
2. Consider all potential one-connection-less models M_i = {M \ v_i, i = 1, ..., n}. Perform e, m projections between the current model and the M_i until convergence, and obtain the final KL divergences D(θ*_M, θ*_{M_i}) corresponding to each submanifold. Then select the minimum KL divergence D(θ*_M, θ*_{M_{i*}}).
3. If D(θ*_M, θ*_{M_{i*}}) < δ, accept M_{i*} as the new current model and jump to 2; if D(θ*_M, θ*_{M_{i*}}) > δ, stop.
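The loop structure of steps 1-3 can be sketched as follows. This is our own schematic: `divergence_after_removal` is a hypothetical callback standing in for the iterative e, m projections the paper uses to evaluate each one-connection-less model.

```python
import math

def evolve_structure(connections, divergence_after_removal, n_samples):
    """Greedily remove the connection whose deletion costs the least
    KL divergence, stopping once even the best removal reaches
    delta = log(N) / (2N) (step 3)."""
    delta = math.log(n_samples) / (2.0 * n_samples)
    conns = set(connections)
    while conns:
        best = min(conns, key=lambda c: divergence_after_removal(conns, c))
        if divergence_after_removal(conns, best) >= delta:
            break                 # no favorable removal left: stop
        conns.remove(best)        # accept the one-connection-less model
    return conns
```

With N = 100 samples, δ ≈ 0.023, so connections whose removal costs less divergence than that are pruned one by one.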
4 Experimental Results

Because this algorithm is based on information geometry, which originated largely from ideas in probability and statistics, a probability network is adopted as the experimental object, and the samples are generated by Monte Carlo simulation following a Gaussian distribution. The performance of the algorithm is tested by its ability to recover the true network that generated the samples. The process is as follows:

1. Construct a "true" neural network S as the true model to generate data.
2. Randomly generate the mean and covariance matrix of the sample and generate the sample by Monte Carlo simulation.
3. Construct a fully connected network with the same number of nodes as S and apply the new structure evolution algorithm to obtain a new network S'.
4. Compare S' with the true model S and compute their "result error degree" to measure the validity of the algorithm. The "result error degree" is defined as the ratio of the number of false connections (including redundant connections and falsely removed connections) to the number of true connections.

The left figure in Fig. 1 shows the true model S, which contains 16 nodes and 19 connections. The sample size is 10^4. The structure evolution algorithm is applied to a fully connected network with 16 nodes, finally yielding a new network S'. Different sets of samples are generated, the process is repeated, and the "result error degree" comparing S and S' is computed; the result error degrees are listed below:
Table 1. Result error degree

Sample   Redundant by false   Reduced by false
1        0                    0
2        1/19                 1/19
3        1/19                 0
4        0                    0
An error degree equal to zero means the overly complicated network model can be evolved to the true model exactly by our algorithm. The result error degrees are all very low, verifying the validity of the algorithm.
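The "result error degree" of step 4 can be computed directly from edge sets; a minimal sketch (ours):

```python
def result_error_degree(true_edges, learned_edges):
    """Ratio of false connections (redundant plus falsely removed)
    to the number of true connections."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    redundant = learned_edges - true_edges    # present but should not be
    removed = true_edges - learned_edges      # absent but should be
    return (len(redundant) + len(removed)) / len(true_edges)
```

With the 19-connection true model, a single spurious connection gives 1/19, matching the entries in Table 1.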
Fig. 1. The true model adopted in the experiment (left) and the approximation curves after evolution (right)
The right figure in Fig. 1 shows three approximation curves when the sample size is 10^4. The abscissa denotes the number of pruned connections; the red dash-dotted line is the curve of the cost function, the blue line the curve of the divergence between the truth and the model, and the green dashed line the curve of the divergence between the data and the model. The figure illustrates the approximation relation between these three curves: they are adjacent to one another, which testifies to the validity of the definition of the cost function and of the model's evolution process.
5 Conclusions

This paper gives a new structure evolution algorithm for neural networks based on e, m projections and a model selection criterion. This "evolution" is inspired by the growth process of an infant's brain. The result of this paper also helps in understanding the geometric properties of pruning. This information geometric guideline for pruning weights/nodes has a more solid theoretical foundation.
Exploring new inspirations from biology and neuroscience is a wellspring of new neural network algorithms. Thus, introducing research results from biology, psychology and neuroscience into the neural network domain has a bright future.
Acknowledgements. The research is supported by the National Natural Science Foundation of China (No. 60373029) and the Scientific Foundation of Beijing Jiaotong University (No. 2005RC044).
References

1. Rosemary, A. R.: Cognitive Development: Psychological and Biological Perspectives. Allyn & Bacon, Boston, MA (1994)
2. Kalat, J. W.: Introduction to Psychology. 4th edn. Brooks/Cole Publ. Co., Pacific Grove (1996)
3. Amari, S.: Differential-Geometrical Methods in Statistics. Lecture Notes in Statistics. Springer-Verlag, Berlin Heidelberg (1985)
4. Amari, S.: Information Geometry on Hierarchy of Probability Distributions. IEEE Trans. Information Theory 47(5) (2001) 1701-1711
5. Liu, Y. H., Luo, S. W.: Information Geometric Analysis of Neural Network Pruning. Journal of Computer Research and Development (to be published) (2006)
6. Akaike, H.: Information Theory and an Extension of the Maximum Likelihood Principle. In: Petrov, B. N., Csaki, F. (eds.): 2nd Int. Symposium on Information Theory, Budapest (1973) 267-281
7. Heckerman, D., Chickering, D.: A Comparison of Scientific and Engineering Criteria for Bayesian Model Selection. Statistics and Computing 10(1) (2000) 55-62
8. Barron, A. R., Rissanen, J., Yu, B.: The Minimum Description Length Principle in Coding and Modeling. IEEE Trans. Information Theory 44(6) (1998) 2743-2760
A Parallel Coevolutionary Immune Neural Network and Its Application to Signal Simulation Zhu-Hong Zhang, Xin Tu, and Chang-Gen Peng College of Science, Guizhou University, Guiyang, Guizhou, 550025, China
[email protected]
Abstract. This work proposes a parallel coevolutionary immune neural network to decide the weights and thresholds of a feedforward network, so as to deal with high-dimensional or strongly nonlinear signal simulation. The network is developed based on the feedforward network, a novel simple artificial immune model, antibody representation, and coevolutionary ideas of the antibody populations of the immune system, in which an evolution mechanism originating from the humoral immune response is designed as the fundamental local evolution scheme. Through practical application and comparative analysis, numerical experiments illustrate that the proposed network can effectively simulate practical signals and is superior to the compared algorithms.
1 Introduction

The immune neural network is a potential research area in the field of intelligent computation, in which many algorithms have been reported based on feedforward or feedback neural networks and immune algorithms [1-4]. Their main aim is to decide the weights and thresholds of the network by means of immune optimization algorithms or an immune feedback law. However, since the weights and thresholds are considered as one mutually connected variable vector, these algorithms result in extremely high computational complexity. To overcome this drawback, several authors [5] integrated the idea of dividing the decision variables into an evolutionary algorithm to design a coevolutionary algorithm, but they only obtained some theoretical results. Besides, the idea of task decomposition has been used to design distributed artificial immune systems [6-7]. In the literature, the authors of [6] proposed a distributed coevolutionary immune model to solve the n-TSP problem by way of a distributed artificial immune model and a simulated annealing algorithm; it achieves good results only for small-scale problems [1]. Further, the authors of [7] designed a parallel classifier system by transforming the clonal selection algorithm [8] into a parallel algorithm. Even if this classifier can reduce the whole runtime of data learning, it does not consider how to deal with high-dimensional data learning. To date, some authors have been dedicated to proposing intelligent algorithms to solve difficult engineering optimization problems. However, little work has been done on designing parallel immune neural network models to cope with complex signal simulation. Thus, in this paper we propose a parallel coevolutionary immune neural network (PCINN) based on the feedforward network, antibody representation, a novel simple artificial immune model, a new local evolution scheme, parallel evolution of antibody populations, and intercommunication between

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 739 – 746, 2006. © Springer-Verlag Berlin Heidelberg 2006
antibodies. These schemes distinguish it from the reported immune neural network models. Through application to signal simulation of practical data from an electric power network and a Tai element, our numerical experiments illustrate that the proposed network not only rapidly searches for the desired weights and thresholds, but also performs better than the compared algorithms.
2 Basic Descriptions and Simple Artificial Immune Model

2.1 Optimization of Feedforward Network

As is well known, minimizing the energy function of a three-layer feedforward network amounts to solving the following problem (P):

    Minimize_{x ∈ R^N}  E(x) = (1/2) Σ_{p=1}^{M} Σ_{k=1}^{L} (y_pk − d_pk)²,        (1)

where y_pk = h(net_pk), net_pk = W_2k^T v_p + θ_2k, 1 ≤ k ≤ L; v_pj = g(net_pj), net_pj = W_1j^T X_p + θ_1j, 1 ≤ j ≤ K; x ≡ [W, θ] = (w_11, θ_11, w_12, θ_12, ..., w_1K, θ_1K; w_21, θ_21, w_22, θ_22, ..., w_2L, θ_2L); w_1i = (w_1i1, w_1i2, ..., w_1in), θ_1i ∈ R^1, 1 ≤ i ≤ K; w_2j = (w_2j1, w_2j2, ..., w_2jK), θ_2j ∈ R^1, 1 ≤ j ≤ L; and N = K(n + 1) + L(K + 1). Here w_1i and θ_1i denote the synaptic weight vector and threshold of the i-th neuron of the hidden layer, and w_2j and θ_2j the weight vector and threshold of the j-th neuron of the output layer. (X_p, d_p) is the p-th input-output pair with X_p ∈ R^n, d_p = (d_p1, d_p2, ..., d_pL), and 1 ≤ p ≤ M; g(.) and h(.) are the activation functions of the hidden layer and output layer, respectively.

2.2 Simple Artificial Immune Model

An antibody consists of a variable region and a constant region. The variable region is primarily responsible for recognizing the structural patterns of antigens, and the constant region for a variety of effector functions with invariable structural patterns. For simplicity, an antibody is represented as a chromosome composed of multiple variable gene segments and invariable ones generated from their gene libraries. Basically, the process by which an antibody learns the patterns of an antigen is an evolutionary process, in which the evolution mechanism is considered important in eliminating the invading antigen. By simulating it, we can design a local search operator. Besides, parallel evolution of antibody populations and their mutual intercommunication speed up the process of antibody learning, which helps to rapidly obtain excellent antibodies to eliminate the invader. These considerations motivate the simple artificial immune model (Fig. 1) below. In this model, B1, B2, ..., and Bm represent m mutually different antibody subpopulations, each of which is composed of many different antibodies and evolves with the help of T cells. Intercommunication takes place between these populations. Finally, their excellent antibodies carry out the task of eliminating the antigen.
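For concreteness, the energy E(x) of Eq. (1) for one candidate parameter vector can be evaluated with a plain forward pass. This sketch (ours) assumes sigmoid hidden units g and linear output units h:

```python
import numpy as np

def energy(W1, th1, W2, th2, X, D):
    """E = 1/2 * sum_p sum_k (y_pk - d_pk)^2 for a three-layer
    feedforward net: sigmoid hidden layer (K units), linear output
    layer (L units).  X is M x n inputs, D is M x L targets."""
    V = 1.0 / (1.0 + np.exp(-(X @ W1.T + th1)))   # hidden activations
    Y = V @ W2.T + th2                            # network outputs
    return 0.5 * float(np.sum((Y - D) ** 2))
```

A coevolutionary GA would treat the flattened (W1, θ1, W2, θ2) as the antibody and minimize this energy.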
[Figure: schematic of the simple artificial immune model. The antigen Ag (minimize the energy function of the feedforward network) stimulates antibody subpopulations B1, B2, ..., Bm, which evolve under the regulation of T cells and intercommunicate; their antibodies Ab eliminate the antigen.]

Fig. 1. Simple artificial immune model
3 Evolutionary Mechanism (EM)

We partition the decision vector of Problem (P) above into L subsegments (gene segments) encoded with real numbers. Correspondingly, an antibody can be represented as a block in which only one gene segment is considered variable and the others (L−1 segments) invariable. Meanwhile, the antigen is viewed as the problem itself. Besides, for a given antibody subpopulation, we also suppose that all antibodies have L−1 identical invariable gene segments but different variable gene segments. By abstracting some simple metaphors of the immune response process, we can design EM to evolve one antibody subpopulation into another. Given a subpopulation A, the affinity of antibody x in A is designated aff(x) = 1/(1 + exp(−λE(x))), with 0 < λ < 1. The density of the antibody is the ratio of antibodies similar to x in A. Meanwhile, the design of the survival probability of x is taken from Ref. [9], and the average density α of all antibodies in A denotes the mean of the densities of these antibodies. Hence, EM can be described as follows:

Step 1. Given an antibody subpopulation A, calculate the affinities and densities of all antibodies as well as their average density.
Step 2. Select the 100(1−α)% higher-affinity antibodies to constitute a population B.
Step 3. Execute Cell Reproduction [9] on B, and obtain a clone population C.
Step 4. Calculate the mutation probabilities of all clones according to

    p_m(x) = 1 − ς exp( −(max{aff(y) | y ∈ B} − aff(x)) / (max{aff(y) | y ∈ B} − min{aff(y) | y ∈ B}) ),    (2)

with x ∈ B ∩ C and 0.8 < ς < 1. Further, the variable gene segments of each clone are submitted to Gaussian mutation; note that their invariable gene segments are not changed. Eliminate the identical clones, and obtain the mutated clone population C*.
Step 5. Calculate the survival probabilities of all antibodies in A ∪ C* through Eq. (5), and select η% of the cells to form population D, where η is an adaptive parameter.
Step 6. Randomly generate μ% new antibodies to replace the lower-affinity antibodies in D, and hence obtain a new subpopulation E.
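The affinity and mutation-probability formulas of EM are easy to evaluate; a sketch (ours), with λ and ς chosen inside the ranges the text prescribes:

```python
import math

def affinity(E_x, lam=0.5):
    """aff(x) = 1 / (1 + exp(-lambda * E(x))), with 0 < lambda < 1."""
    return 1.0 / (1.0 + math.exp(-lam * E_x))

def mutation_prob(aff_x, aff_max, aff_min, zeta=0.9):
    """Eq. (2): the further a clone's affinity falls below the best
    in B, the larger its mutation probability; 0.8 < zeta < 1."""
    spread = aff_max - aff_min
    return 1.0 - zeta * math.exp(-(aff_max - aff_x) / spread)
```

The best clone mutates with probability 1 − ς, the worst with 1 − ς·e^{−1}, so weaker clones explore more aggressively.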
4 PCINN Formulation

Based on EM and Fig. 1 above, PCINN is described in detail as follows:

Step 1. Randomly generate N1 variable gene segments from the first gene library. Each of these segments, together with L−1 invariable gene segments taken from the other L−1 gene libraries respectively, constitutes an antibody. Note that these antibodies have an identical invariable gene segment at the i-th gene segment locus, 2 ≤ i ≤ L. All of these make up the first antibody subpopulation A_n1. Similarly, generate M−1 antibody subpopulations A_n2, A_n3, ..., A_nM, where the sizes of all populations are decided by the user. Set n ← 1.
Step 2. Execute EM on A_n1, A_n2, ..., A_nM respectively. M corresponding subpopulations B_n1, B_n2, ..., B_nM are created.
Step 3. Intercommunication. Determine the highest-affinity antibody (i.e., the best antibody Ab) among all antibodies from B_n1, B_n2, ..., B_nM. All invariable gene segments of antibodies from those populations that did not produce Ab are replaced by the corresponding gene segments of Ab. Hence, obtain A_(n+1)1, A_(n+1)2, ..., A_(n+1)M.
Step 4. If the termination criterion is satisfied, the best antibody in Step 3 is the desired solution, and the procedure ends. Otherwise, set n ← n+1 and return to Step 2.
5 Performance Criteria PCINN is compared to the three algorithms reported: BP neural network algorithm BP, adaptive BP neural network ABP and Clonal selection algorithm CLONALG [8]. The terminal numbers are 200 for both CLONALG and PCINN, and 10000 for both BP and ABP, respectively. Take the learning rates of BP and ABP as 0.01. We improve CLONALG by utilizing real decoding and hypermutation. Its parameter settings are: selection rate 0.5, population size 80, the ratio of new inserted antibodies 8%, and the total number of clones 240 in each generation. Besides, for PCINN, take the ratio of new inserted antibodies 10%. Finally, associated with different examples below, the activation functions of hind layer and output of feedforward network are different. Let E(n, k) be the minimal one of error function values (decided by Eq.(1)) of all current antibodies at the n-th generation and in the k-th execution. (a) Average convergence. We may justify whether each of the algorithms is convergent by observing a series of average error values given by
averc(n) = (1/m) · Σ_{k=1}^{m} E(n, k).   (3)
A Parallel Coevolutionary Immune Neural Network
743
(b) Average error and variance. These are the mean averm and variance averv of the minimal errors (on the training samples) at the terminal iteration over m runs, together with the prediction average error paverm and prediction variance paverv.
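The criteria (a) and (b) can be computed directly from the per-run error records. A small sketch follows; the exact variance convention used in the paper is an assumption (population variance is used here), and the function names are illustrative.

```python
from statistics import mean, pvariance

def averc(E, n):
    """Average convergence value at generation n, Eq. (3): the mean of
    E(n, k) over the m runs; E is a runs-by-generations table."""
    return mean(row[n] for row in E)

def summarize_runs(train_errors, pred_errors):
    """Aggregate terminal errors over m runs: averm/averv on the training
    errors, paverm/paverv on the prediction errors (names follow the text)."""
    return {
        "averm": mean(train_errors),
        "averv": pvariance(train_errors),
        "paverm": mean(pred_errors),
        "paverv": pvariance(pred_errors),
    }
```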
6 Examples and Comparisons

Example 1. The problem of modeling an electric power network with 8 inputs and 6 outputs is considered. 33 input-output data pairs are taken from the Chinese undergraduate modeling competition (2004). The hidden layer of the feedforward network contains 7 neurons with sigmoid activation functions; correspondingly, the output layer contains 6 neurons with activation functions h_j(x) = a_j x + b_j, 1 ≤ j ≤ 6, where a_j and b_j are parameters to be determined. The variables (w_{i1}, w_{i2}, …, w_{i8}, θ_i) and (w_{j1}, w_{j2}, …, w_{j6}, θ_j, a_j, b_j) are viewed as the gene segments of the i-th hidden neuron and the j-th output neuron, with 1 ≤ i ≤ 7 and 1 ≤ j ≤ 6, respectively. Connected together, these segments form the complete decision vector of Eq. (1), scattered into gene segments. The first 23 sample pairs are used to train the structure of the network; the remaining ones are used to examine the prediction capacity of the trained network. Each of the four algorithms is executed 50 times on problem (P) of Section 2.1. The statistics of the best error results found at the terminal iteration of each algorithm are listed in Table 1. They show that PCINN attains the smallest average error, average variance, prediction average error and prediction average variance, while ABP and BP are superior to CLONALG but require a large number of iterations.

Table 1. Comparison of the average results for the four algorithms with 50 runs
            BP            ABP           CLONALG       PCINN
averm       7.6948e-03    3.24796e-03   3.205e-02     5.94888e-04
averv       6.0321e-03    2.50461e-03   5.021e-02     2.03745e-04
paverm      5.0375e-03    5.59379e-03   7.624e-03     6.98186e-05
paverv      5.0522e-03    2.59140e-03   4.125e-03     7.13574e-05
Since the iteration count of BP and ABP (10000) is too large to depict their search curves with the Matlab toolbox, we only compare the average convergence of CLONALG and PCINN (Fig. 2). PCINN converges rapidly in every run (50 runs), whereas CLONALG converges slowly in the runs where it converges at all. Owing to space limitations, only one average simulation curve, simulating the first practical output curve, is drawn (Fig. 3); each marker, e.g. "+", denotes the mean of the 50 output results (50 runs) for a given input signal. Evidently PCINN achieves the best simulation quality in every run.
744
Z.-H. Zhang, X. Tu, and C.-G. Peng
Fig. 2. Comparison of average convergence of CLONALG and PCINN with 50 runs
Fig. 3. Comparison of the average simulation results for the first output curve with 50 runs
Example 2. 200 data pairs (n, y_n) of the Tai element, 1015 ≤ n ≤ 1035, are taken from Fudan University in China, where n is the time and y_n is a performance index value of the Tai element at moment n. Our purpose is to use them to establish a mathematical model. The distribution curve of these data pairs (Fig. 5) shows that the noise disturbance is very severe. Because of the strong nonlinearity, we choose the activation functions of the hidden layer (20 neurons) and output layer (1 neuron) of the feedforward network as g_i(x) = a_i sin(λx + b_i) + c_i, 1 ≤ i ≤ 20, and h(x) = ax² + bx + c, where a_i, b_i, c_i, a, b and c are the parameters to be determined. These, together with the weight vectors and thresholds, determine a decision vector partitioned as in Example 1. The first 180 sample pairs are used to train the structure of the network; the others are used to examine its prediction capacity. Table 2 reports the same statistics as in Example 1. Fig. 4 shows that CLONALG fails to converge over the 50 runs whereas PCINN converges, and Fig. 5 shows that PCINN effectively simulates the real curve in every run.
Table 2. Comparison of the average results for the four algorithms with 50 runs

            BP          ABP         CLONALG     PCINN
averm       0.203719    0.153929    0.190581    0.0502776
averv       0.514731    0.21424     0.2115      0.03182
paverm      0.445528    2.65454     0.604265    0.342854
paverv      0.642103    3.13022     0.51321     0.418621
Fig. 4. Comparison of average convergence of CLONALG and PCINN with 50 runs
Fig. 5. Comparison of average simulation results with 50 runs
7 Conclusions

A parallel coevolutionary immune neural network is designed for high-dimensional or strongly nonlinear signal simulation. It includes a feedforward network, a simple artificial immune model and a novel local evolution mechanism. The focus is on several aspects: division of the decision variables, representation and coevolution of antibodies, and intercommunication among antibody populations. The proposed network,
together with several other selected algorithms, is applied to two examples. The comparative results illustrate that the proposed network is evidently superior to the selected algorithms.
Acknowledgement This work is supported by NSFC (60565002), President Foundation (20040706) and Natural Science Foundation (20052108) of Guizhou Province, and Doctorate Foundation (2004001) of Guizhou University, P.R. China.
References

1. Timmis, J., Knight, T., de Castro, L.N., et al.: An Overview of Artificial Immune Systems. In: Computation in Cells and Tissues: Perspectives and Tools for Thought. Natural Computation Series. Springer, Heidelberg (2004) 51-86
2. Oeda, S., Ichimura, T., Yamashita, T.: Biological Immune System by Evolutionary Adaptive Learning of Neural Networks. In: Proc. of the 2002 Congress on Evolutionary Computation (CEC '02), 2(12) (2002) 1976-1981
3. Kim, D.H., Lee, K.Y.: Neural Networks Control by Immune Network Algorithm Based on Auto-Weight Function Tuning. In: Proc. of the 2002 International Joint Conference on Neural Networks, 2(12) (2002) 1469-1474
4. Wu, J.Y., Chung, Y.K.: Artificial Immune System for Solving Generalized Geometric Problems: Preliminary Results. In: Proc. of the 2005 Conference on Genetic and Evolutionary Computation, Washington DC, USA (2005) 329-336
5. Subbu, R., Sanderson, A.C.: Modeling and Convergence Analysis of Distributed Coevolutionary Algorithms. In: Proc. of the IEEE International Congress on Evolutionary Computation (2000) 217-223
6. Toma, N.: An Immune Coevolutionary Algorithm for N-th Agent's Traveling Salesman Problem. In: Proc. 2003 IEEE International Symposium on Computational Intelligence in Robotics and Automation, Kobe, Japan (7) (2003) 1498-1503
7. Watkins, A., Timmis, J.: Exploiting Parallelism Inherent in AIRS, an Artificial Immune Classifier System. In: Nicosia, G., et al. (eds.): Third International Conference on Artificial Immune Systems, No. 3239 in LNCS. Springer, Heidelberg (2004) 427-438
8. de Castro, L.N., Von Zuben, F.J.: The Clonal Selection Algorithm with Engineering Applications. In: Workshop Proc. of GECCO'00, Workshop on Artificial Immune Systems and Their Applications, Las Vegas, USA (7) (2000) 36-37
9. Zhang, Z.H.: Study on Theory and Applications of Intelligent Optimization and Immune Network Algorithms in Artificial Immune Systems. Doctorate Thesis, No. 106112040055, Chongqing University, China (2004)
A Novel Elliptical Basis Function Neural Networks Optimized by Particle Swarm Optimization

Ji-Xiang Du1,2, Chuan-Min Zhai3, Zeng-Fu Wang1, and Guo-Jun Zhang2

1 Department of Automation, University of Science and Technology of China
2 Intelligent Computing Lab, Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, P.O. Box 1130, Hefei, Anhui 230031, China
3 Department of Mechanical Engineering, Hefei University
[email protected], [email protected]
Abstract. In this paper, a novel model of elliptical basis function neural networks (EBFNN) is proposed. First, a geometry analytic algorithm is applied to construct the hyper-ellipsoid units of the hidden layer of the EBFNN, i.e., an initial structure, which is then pruned by the particle swarm optimization (PSO) algorithm. The experimental results demonstrate that the proposed hybrid optimization algorithm for the EBFNN model is feasible and efficient, and that the EBFNN is not only parsimonious but also generalizes better than the RBFNN.
1 Introduction

The radial basis function neural network (RBFNN) is a special type of neural network model with several distinctive features. Since it was first proposed, the RBFNN has attracted a high degree of attention and interest in the research community. Usually, when used as a pattern classifier, the outputs of the RBFNN represent the posterior probabilities of the training data by a weighted sum of Gaussian basis functions with diagonal covariance matrices, which control the spread of the kernel function of the corresponding RBF unit; the RBF units therefore perform a hyper-spherical division of the input samples [1]. It would be beneficial if full covariance matrices could be incorporated into the RBFNN structure, so that complex distributions could be well represented without a large number of basis functions. The RBF units would then take hyper-ellipsoidal shapes, enhancing the approximation capability of conventional RBFNN models. This paper therefore introduces a novel EBFNN model with hyper-ellipsoidal units, in an attempt to obtain better classification capability than the conventional RBFNN. The paper is organized as follows. Section 2 introduces how the geometry analytic algorithm is used to initialize the structure of the EBFNN. Section 3 describes the further structure optimization based on the PSO. Experimental results are presented in Section 4, and Section 5 concludes the paper.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 747 – 751, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Initializing the Structure of the EBFNN by the Geometry Analytic Algorithm

Generally, it would be more reasonable and beneficial if hyper-ellipsoidal units could be adopted in the RBFNN. The output of an EBF network can be defined as:

y_j(x) = Σ_{i=1}^{l} w_{ji} h_i(x),  j = 1, 2, …, m,   (1)

h_i(x) = exp{ −D_i(x) / α_i² },  i = 1, 2, …, l,   (2)
where x is the input vector, α_i is the shape parameter controlling the spread of the i-th basis function, and D_i(x) is the distance between the input vector and the center of the i-th hyper-ellipsoid unit. Since each unit of the EBFNN is a hyper-ellipsoidal basis unit, the hyper-ellipsoid is first defined as follows:

Definition. A hyper-ellipsoid represented by a linear operator L contains a point x = (x_1, x_2, …, x_d)^T on its surface if and only if ‖L(x − μ)‖ = 1, where the matrix L is non-singular and real-valued and μ is the center of the hyper-ellipsoid.

In other words, a hyper-ellipsoid represented by a non-singular matrix L is the pre-image of the unit hyper-sphere under the linear transformation of the space determined by L. This representation allows the same ellipsoid to be represented by multiple linear operators coinciding up to a rotation of the space [2]. Equation (2) can then be written as

h_i(x) = exp{ −‖L_i(x − μ_i)‖² / α_i² },  i = 1, 2, …, l.   (3)
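Eq. (3) can be evaluated directly from its definition. The sketch below assumes the operator is supplied as a list of matrix rows; all names are illustrative.

```python
import math

def ebf_activation(x, L, mu, alpha):
    """EBF hidden-unit output, Eq. (3): h(x) = exp(-||L(x - mu)||^2 / alpha^2).
    L is the linear operator mapping the hyper-ellipsoid to the unit
    hyper-sphere, given as a list of rows."""
    d = [xi - mi for xi, mi in zip(x, mu)]                      # x - mu
    Ld = [sum(row[j] * d[j] for j in range(len(d))) for row in L]
    return math.exp(-sum(v * v for v in Ld) / alpha ** 2)
```

At the center (x = μ) the activation is 1; on the ellipsoid surface (‖L(x − μ)‖ = 1) it equals exp(−1/α²).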
Suppose that a negative sample v resides on the surface of the unit hyper-sphere I and a positive sample w lies outside it (as shown in Fig. 1(a)). The enlargement of the hyper-sphere must be in the direction of a vector e that is orthogonal to v and lies in the two-dimensional plane defined by v, w and the center μ. The dilation coefficient k must be chosen so that w resides on the surface of the resulting hyper-ellipsoid:

k = ‖w_e‖ / ‖w̃_e‖ = √( (‖w‖² − b²) / (1 − b²) ),  b = ⟨w, v⟩ = vᵀw / ‖v‖,   (4)

and the dilation itself is

D = I + ( √( (‖w‖² − b²) / (1 − b²) ) − 1 ) · eeᵀ / ‖e‖²,  e = w − (vᵀw / ‖v‖²) v.   (5)
A Novel Elliptical Basis Function Neural Networks Optimized by PSO
749
Fig. 1. Illustration of the geometry analytic algorithm (a) enlargement; (b) contraction
In order to modify an arbitrary hyper-ellipsoid L, one combines L with the operation C = D⁻¹ and replaces v and w by v̂ = Lv and ŵ = Lw respectively; the final formula is:

L' = [ I + ( √( (1 − b²) / (‖ŵ‖² − b²) ) − 1 ) · eeᵀ / ‖e‖² ] L,   (6)

v̂ = Lv,  ŵ = Lw,  b = v̂ᵀŵ / ‖v̂‖,  e = ŵ − (v̂ᵀŵ / ‖v̂‖²) v̂.
The same computation applies without modification to contraction (as shown in Fig. 1(b)); in Fig. 1 the resulting hyper-ellipsoid is drawn with a dotted line. Enlarging a hyper-ellipsoid, however, requires certain conditions, which also serve as the criterion for creating a new hyper-ellipsoid:

(1) w must reside between the hyper-planes h₁ and h₂, i.e. ‖ŵ_v‖ < ‖v̂‖, which is equivalent to v̂ᵀŵ < ‖v̂‖²;

(2) 1 < k ≤ k_w, to avoid creating needle-shaped hyper-ellipsoids.
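One enlargement step of Eq. (6) can be sketched as below. This follows the reconstruction above (with e taken orthogonal to v̂, which is what makes both v and w land on the new surface); the function name and test geometry are illustrative.

```python
import numpy as np

def enlarge(L, v, w):
    """Rank-one update of Eq. (6): v lies on the current hyper-ellipsoid
    ||Lx|| = 1 and w outside it; the returned L' keeps v on the surface
    and places w exactly on it."""
    v_h, w_h = L @ v, L @ w                          # map into unit-sphere space
    b = v_h @ w_h / np.linalg.norm(v_h)
    e = w_h - (v_h @ w_h) / (v_h @ v_h) * v_h        # direction orthogonal to v_h
    s = np.sqrt((1 - b ** 2) / (w_h @ w_h - b ** 2)) # contraction factor along e
    C = np.eye(len(v)) + (s - 1) * np.outer(e, e) / (e @ e)
    return C @ L
```

A quick consistency check: starting from the unit circle with v = (1, 0) and w = (0, 2), the update yields L' = diag(1, 1/2), and both ‖L'v‖ and ‖L'w‖ equal 1.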
3 Optimizing the EBFNN Structure Based on the PSO

After constructing the initial hyper-ellipsoid centers of the EBFNN, the number of centers needs to be pruned further to improve the generalization performance of the EBFNN. In addition, an appropriate evaluation of the shape parameters of the kernel function is a very important step in optimizing the EBFNN. Therefore, in order
Fig. 2. The encoding scheme of particle for the PSO
to avoid the large computational cost demanded by the standard GA, the particle swarm optimization (PSO) is preferred here for pruning the centers and finding the controlling parameters simultaneously [3]. For the particular problem of optimizing the EBFNN, each particle is encoded in two parts, as shown in Fig. 2: the first part has l components (or bits) encoding the centers, and the second part has l components encoding the shape parameters (sp), where l denotes the number of initial hyper-ellipsoid (HE) centers. Our objective is to select as few hidden centers as possible under the given accuracy, so a fitness function is defined as follows:

f(ε₀, l) = n_C / l + ( sign(e_R − ε₀) + 1 ) / (2 · eps),   (7)

where ε₀ represents the error given in advance, e_R is the actual root mean squared error (RMSE) of the output layer, n_C is the number of selected hidden centers, and eps is a pre-given constant.
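Under the reading of Eq. (7) reconstructed above (the original formula is garbled, so the exact grouping is an interpretation), the fitness is the fraction of retained centres plus a large constraint penalty of 1/eps whenever the RMSE misses the target:

```python
def fitness(n_c, l, e_r, eps0, eps=1e-6):
    """Fitness of Eq. (7) as interpreted here: n_C/l plus
    (sign(e_R - eps0) + 1) / (2*eps). With a small eps, any particle whose
    RMSE e_r exceeds the target eps0 is penalized by roughly 1/eps."""
    sign = (e_r > eps0) - (e_r < eps0)   # sign(e_r - eps0) in {-1, 0, 1}
    return n_c / l + (sign + 1) / (2 * eps)
```

When the accuracy constraint is met the penalty vanishes and fitness reduces to the proportion of selected centres, which the PSO minimizes.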
4 Experimental Results and Discussions

In order to test our optimization algorithm for the EBFNN, the telling-two-spirals-apart (TTSA) problem and the iris classification problem are used. First, we generated 70 different training sample sets, with sizes ranging from 40 to 1400. For each training set we recorded the number of hyper-ellipsoid units initially constructed by the geometry analytic algorithm and after optimization by the PSO, as well as the number of hyper-spheres constructed by the moving median center hyper-sphere covering (MMCHS) algorithm [4]. All results are shown in Fig. 3. The number of initially constructed hyper-ellipsoid units is at most about 36, and after PSO optimization about 20-26 hyper-ellipsoids are selected as the centers of the EBFNN, whereas about 44 hyper-spheres are constructed by the MMCHS as the centers of the RBFNN. The three methods were also tested on the iris problem, with the results plotted in Fig. 3. Obviously, for the same training set the RBFNN needs more centers than the EBFNN to cover the regions of the sample space. Moreover, the hybrid PSO algorithm is efficient and convergent and remarkably reduces the number of centers.
Fig. 3. Number of the formed units vs. the training samples for TTSA and iris problem
Table 1. Classification performance comparison for the telling-two-spirals-apart problem

                         Rooted variances of the noise
Classifier  Method  NC   0.010  0.025  0.050  0.075  0.100
EBFNN       PSO     22   99.5   89.0   75.5   55.5   54.0
EBFNN       GA      30   97.5   84.5   71.0   53.2   52.5
RBFNN       PSO     29   98.5   88.5   75.0   55.0   53.0
Next, a set of 200 samples for the telling-two-spirals-apart problem was produced, and the initial parameters of the EBFNN were optimized by the PSO. The testing results on training samples corrupted by zero-mean Gaussian white noise with different variances are shown in Table 1 for the various methods. It can be seen that the EBFNN optimized by the PSO has the most reduced structure and better generalization capability than the RBFNN.
5 Conclusions

A two-step learning and optimization scheme for elliptical basis function neural networks (EBFNN) is proposed: the initial hidden-layer units of the EBFNN are constructed by a geometry analytic algorithm, and the particle swarm optimization algorithm is then applied to prune this initial structure. The experimental results showed that the proposed hybrid optimization algorithm can design an EBFNN that is more efficient and parsimonious than the RBFNN, with better generalization capability.
Acknowledgements This work was supported by the NSF of China (Nos. 60472111 and 60405002).
References

1. Huang, D.S.: Systematic Theory of Neural Networks for Pattern Recognition. Publishing House of Electronic Industry of China, Beijing (1996)
2. Kositsky, M., Ullman, S.: Learning Class Regions by the Union of Ellipsoids. In: Proceedings of the 13th International Conference on Pattern Recognition (1996) 750-757
3. Kennedy, J., Eberhart, R.C.: A Discrete Binary Version of the Particle Swarm Algorithm. In: Proceedings of the IEEE Int'l Conference on Systems, Man, and Cybernetics (1997) 4104-4108
4. Zhang, G.J., Wang, X.F., Huang, D.S., Chi, Z., Cheung, Y.M., Du, J.X., Wan, Y.Y.: A Hypersphere Method for Plant Leaves Classification. In: Proceedings of the 2004 International Symposium on Intelligent Multimedia, Video & Speech Processing (ISIMP 2004), Hong Kong, China (2004) 165-168
Fuzzy Neural Network Optimization by a Particle Swarm Optimization Algorithm

Ming Ma1,2, Li-Biao Zhang1, Jie Ma1, and Chun-Guang Zhou1,∗

1 College of Computer Science and Technology, Jilin University, Key Laboratory for Symbol Computation and Knowledge Engineering of the National Education Ministry of China, Changchun 130012, China
[email protected]
2 Information Manage Center, Beihua University, Jilin 132013, China
[email protected]
Abstract. Designing a set of fuzzy neural networks can be considered as solving a multi-objective optimization problem. An algorithm for solving this problem is presented, based on particle swarm optimization with an improved selection scheme for the global and individual extrema, and it searches for the Pareto-optimal set of the fuzzy neural network optimization problem. Numerical simulations on taste identification of tea show the effectiveness of the proposed algorithm.
1 Introduction

Fuzzy neural networks have developed considerably in both theory and application in recent years. However, the decision of "which fuzzy neural network is the best" in terms of two conflicting criteria, performance and complexity, is usually decided by which network most closely achieves the user's purpose for a given problem [1]. Designing a set of fuzzy neural networks can therefore be considered as solving a multi-objective optimization problem, which has promoted research on how to identify an optimal and efficient fuzzy neural network structure. Because there is no unique global best solution in a multi-objective problem, solving such a problem means finding a set of solutions (the Pareto-optimal set) [2]. Traditional multi-objective optimization turns the multi-objective problem into a single-objective one through a weighted sum; however, this method requires a priori knowledge of the problem itself, so it cannot solve truly multi-objective problems. Evolutionary algorithms are population-based techniques that can search for several solutions in the solution space simultaneously and can improve search efficiency through the similarity of different solutions, making them well suited to multi-objective optimization. Schaffer studied multi-objective optimization using vector-evaluated genetic algorithms in the 1980s [3], and in recent years many evolutionary algorithms have been proposed and successfully applied to multi-objective optimization problems [4].
∗ Corresponding author.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 752 – 761, 2006. © Springer-Verlag Berlin Heidelberg 2006
Particle Swarm Optimization (PSO) is an optimization algorithm proposed by Kennedy and Eberhart in 1995 [5][6]. It is easy to understand and implement, and it has been applied to many optimization problems [7][8], where it is often more effective than traditional algorithms. The application of PSO to multi-objective optimization problems is therefore very promising. In this paper, we propose a multi-objective-PSO-based design procedure for a fuzzy neural network, and obtain satisfactory results on taste identification of tea. The rest of this paper is organized as follows: Section 2 gives an overview of PSO, Section 3 introduces the weighted fuzzy neural network, Section 4 describes the proposed multi-objective PSO, Section 5 presents the simulation on taste identification of tea and the experimental results, and Section 6 concludes.
2 Particle Swarm Optimization PSO originated from the research of food hunting behaviors of birds. Each swarm of PSO can be considered as a point in the solution space. If the scale of swarm is N, then the position of the i-th (i=1,2…N ) particle is expressed as Xi . The “best” position passed by the particle is expressed as pBest [i ] . The speed is expressed with Vi . The index of the position of the “best” particle of the swarm is expressed with g. Therefore, swarm i will update its own speed and position according to the following equations [5][6]: Vi=w*Vi+c1*rand()*(pBest[i]-Xi)+c2*Rand() *( pBest[g] -Xi) ,
(1)
Xi =Xi+Vi ,
where c1 and c2 are two positive constants, rand() and Rand() are two random numbers within the range [0,1],and w is the inertia weight.
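The update of Eq. (1) can be sketched per particle as follows; the coefficient values shown are common defaults from the PSO literature, not values taken from this paper.

```python
import random

def pso_step(x, v, pbest, gbest, w=0.729, c1=1.49445, c2=1.49445):
    """One particle update, Eq. (1): the new velocity mixes inertia with
    pulls toward the particle's own best (pbest) and the swarm best (gbest);
    the position then moves by the new velocity."""
    v = [w * vi + c1 * random.random() * (pi - xi)
              + c2 * random.random() * (gi - xi)
         for vi, xi, pi, gi in zip(v, x, pbest, gbest)]
    x = [xi + vi for xi, vi in zip(x, v)]
    return x, v
```

When pbest and gbest coincide with the current position, the stochastic terms vanish and the particle simply coasts on its damped velocity.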
3 Fuzzy Neural Network Architecture

Different fuzzy inference mechanisms can be distinguished by the consequents of their fuzzy if-then rules [9], such as the Mamdani and Takagi-Sugeno inference systems; the Mamdani mechanism is adopted in this paper. The weighted fuzzy neural network (WFNN) is an adaptive network based on an improved fuzzy weighted reasoning method, in which we try to list all possible fuzzy rules. The fuzzy if-then rules have the form:

OR_{j=1}^{m} ( IF ( AND_{i=1}^{n} (x_i is A_{ij}) ) THEN Y is w_{j1}/z_1, w_{j2}/z_2, …, w_{jl}/z_l ),   (2)

where m is the number of fuzzy rules, n is the number of input variables, l is the number of consequent fuzzy sets, z_i (i = 1, …, l) is a constant, w_{ji} (j = 1, …, m; i = 1, …, l) is a weight parameter, and A_{ij} is the antecedent fuzzy set of the j-th rule. The output value is then calculated as follows:
z₀ = ( Σ_{i=1}^{l} f( Σ_{j=1}^{m} u_j w_{ji} ) · z_i ) / ( Σ_{i=1}^{l} f( Σ_{j=1}^{m} u_j w_{ji} ) ),   (3)

where u_j is the extent to which a rule is activated; each u_j is calculated as u_j = u_{A1j}(x_1) AND u_{A2j}(x_2) … AND u_{Anj}(x_n), where u_{Aij}(x_i) is the membership value of x_i in the fuzzy set A_{ij}. The weighted fuzzy neural network is a seven-layer feedforward network, as shown in Fig. 1.
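The defuzzified output of Eq. (3), as reconstructed here, can be sketched as below. The inner function f is left pluggable because the text does not fix it at this point; using the identity for f, and the function name itself, are assumptions.

```python
def wfnn_output(u, w, z, f=lambda s: s):
    """Defuzzified output z0 of Eq. (3): each consequent constant z_i is
    weighted by f(sum_j u_j * w_ji), then normalized by the sum of those
    weights. u: rule activations (length m); w: m-by-l weight matrix."""
    a = [f(sum(u[j] * w[j][i] for j in range(len(u)))) for i in range(len(z))]
    s = sum(a)
    return sum(ai * zi for ai, zi in zip(a, z)) / s
```

With a single fully activated rule weighting two consequents equally, the output is simply the midpoint of the two consequent constants.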
Fig. 1. The weighted fuzzy neural network architecture. Layer A is the input layer, and the x_i (i = 1, …, n) are the input signals; layers B and C perform fuzzification of the input data, with w_b and w_c the parameters of the membership functions; layers D and E perform the fuzzy inference, where the w_e are weight parameters and each w_{ei} represents the importance of the corresponding fuzzy rule; layers F and G perform defuzzification of the output data, with the w_f being weight parameters.
4 Multi-objective PSO

4.1 Objective Functions

As illustrated in Fig. 1, each node in the D layer is connected with each node in the E layer, and each connection between the D layer and the E layer represents a fuzzy rule. Designing the weighted fuzzy neural network amounts to selecting a small number of fuzzy rules with high performance; this task is performed by maximizing the accuracy while minimizing the number of selected rules. In the proposed algorithm, we define a real-valued vector as a particle to represent a solution; the vector has the form:

X = (x₁, x₂, …, x_m, x_{m+1}, …, x_n),   (4)

where m is the number of connections between the D layer and the E layer and n is the number of all parameters; the x_i (i = 1, …, m) represent the weight parameters of the connections between the D layer and the E layer (each corresponding to a w_{ei} in Fig. 1), and the x_i (i = m+1, …, n) represent the other connection parameters or the parameters of the membership functions. According to the improved fuzzy weighted reasoning
method, we let x_i (i = 1, …, m) > 0, and during execution of the algorithm x_i is reset to a random number in (0,1) whenever x_i < 0. We then formulate our optimization problem as follows:

Minimize f₁(X) and maximize f₂(X),   (5)

where f₁(X) describes the complexity of the WFNN and f₂(X) describes its performance. Usually there is no particle that is optimal with respect to both objectives; a particle X that is not dominated by any other particle is said to be a Pareto-optimal solution. A particle X_B is said to dominate another particle X_A if both of the following inequalities hold:

f₁(X_A) ≥ f₁(X_B),  f₂(X_A) ≤ f₂(X_B),   (6)

and at least one of the following inequalities holds:

f₁(X_A) > f₁(X_B),  f₂(X_A) < f₂(X_B).   (7)
Now we give the expressions of the functions f₁(x) and f₂(x). We define a vector E = (e_{x1}, e_{x2}, …, e_{xm}) to represent the effectiveness of the fuzzy rules; each e_{xi} (i = 1, …, m) is 1 or 0 and represents a connection between the D layer and the E layer: if e_{xi} is 1 the corresponding fuzzy rule is enabled, otherwise it is disabled. The value of each e_{xi} is determined from the corresponding x_i (i = 1, …, m) as follows. Step 1: all connections leaving one node of the D layer in Fig. 1 form a group, so each group is composed of the corresponding x_i's. Step 2: calculate the sum of the x_i values in each group, and the proportion of each group sum relative to all groups; the result is U = (u₁, u₂, …, u_{N1}),
(8)
where 0 < uᵢ < 1 (i = 1, 2, …, N1). Step 3: within each group, calculate the proportion of each x_i value to the group sum; the result is ZX = (z_{x1}, z_{x2}, …, z_{xm}),
(9)
where 0 < z_{xi} < 1 (i = 1, 2, …, m). Step 4: each uᵢ (i = 1, 2, …, N1) from Step 2 represents the importance of the fuzzy rules in the corresponding group; if its value is smaller than some small threshold, the importance of those fuzzy rules is negligible and they are all disabled. Step 5: each z_{xi} (i = 1, 2, …, m) from Step 3 represents the importance of the corresponding fuzzy rule within its group; if the value of one z_{xi} is much bigger than the others in the group, that fuzzy rule is enabled and the other rules in the group are disabled, and if the value of a z_{xi} is smaller than some small threshold, the corresponding rule is disabled. Step 6: the vector E = (e_{x1}, e_{x2}, …, e_{xm}) represents the effectiveness of all the fuzzy rules: if a fuzzy rule is disabled in Step 4 or Step 5, the corresponding e_{xi} is 0, otherwise it is 1.
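The rule-effectiveness procedure above can be sketched as follows. The paper does not give numeric thresholds for "a small value" or "much bigger than the others", so group_eps, rule_eps and dominance below are illustrative assumptions, as is the function name.

```python
def rule_effectiveness(groups, group_eps=0.05, rule_eps=0.05, dominance=0.9):
    """Steps 1-6 sketched: `groups` is a list of lists of the x_i weights,
    one inner list per D-layer node. Returns the flat vector E of 0/1
    flags, one per connection, in group order."""
    total = sum(sum(g) for g in groups)
    E = []
    for g in groups:
        gsum = sum(g)
        u = gsum / total                      # Eq. (8): group importance
        zx = [xi / gsum for xi in g]          # Eq. (9): in-group proportions
        for z in zx:
            if u < group_eps:                 # Step 4: whole group disabled
                E.append(0)
            elif max(zx) > dominance and z < max(zx):  # Step 5: one rule dominates
                E.append(0)
            elif z < rule_eps:                # Step 5: negligible rule disabled
                E.append(0)
            else:
                E.append(1)
    return E
```

f₁(X) of Eq. (10) is then just 1 plus the sum of the returned flags.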
According to the above steps we have obtained the vector E = (e_{x1}, e_{x2}, …, e_{xm}); the function f₁(x) is then defined as follows:

f₁(X) = 1 + Σ_{i=1}^{m} e_{xi},   (10)
where e_{xi} (i = 1, …, m) is 1 or 0: if e_{xi} is 1 the corresponding fuzzy rule is enabled, otherwise it is disabled. This function measures the complexity of the evolved WFNN. For a particle X, the vector E = (e_{x1}, e_{x2}, …, e_{xm}) describes the structure of the WFNN, and on the basis of this structure we define the function f₂(x) as:

f₂(X) = 1 / ( (1/2) Σ_K (O − T)² ),   (11)
where K is the number of samples, T is the teacher signal, and O is the output of the network. It measures the performance of the evolved WFNN on the training data.

4.2 Introduction of the Algorithms

The successful application of PSO to many single-objective optimization problems reflects its effectiveness. However, PSO cannot be applied directly to multi-objective optimization problems, because there are essential differences between multi- and single-objective optimization: the solution of the former is one or several sets of solutions, whereas the latter yields a single solution or a group of solutions in a series. On the other hand, the successful application of genetic algorithms to multi-objective optimization, and the similarity between PSO and genetic algorithms, suggest that PSO is a promising method for multi-objective problems. There is, however, a great difference between PSO and genetic algorithms: in genetic algorithms the chromosomes share information, which moves the whole population gradually into a better area, whereas in PSO the information is sent out by the best particle, which the other individuals follow, converging quickly to a point. Applying PSO directly to multi-objective optimization may therefore easily cause the swarm to converge to a local area of the Pareto front. In view of these considerations, this paper proposes a PSO whose best solutions are assessed and chosen on the basis of the WFNN. First, the global best solutions gBest1 and gBest2 are found in the swarm using f₁(x) and f₂(x), and the individual best solutions pBest1 and pBest2 are found for each particle using f₁(x) and f₂(x). Then, when the velocity and position of each particle are updated, the "average" of gBest1 and gBest2 is used as the global best solution gBest, and the "average" of pBest1 and pBest2 as the individual best solution pBest. Generally, the "average" is defined as follows:
gBest = α·gBest1 + β·gBest2,   (12)
pBest = α*·pBest1 + β*·pBest2,   (13)

where α, β and α*, β* are weights satisfying the following conditions:

0 ≤ α, β, α*, β* ≤ 1,   (14)

α + β = 1;  α* + β* = 1.   (15)
In the above-mentioned method, α, β and α*, β* are fixed values for all components, and the features of the individual components are not considered. In this paper we propose a heuristic method, based on the WFNN, to define the "average". For a swarm, let

gBest = (x₁^{gBest}, x₂^{gBest}, …, x_n^{gBest}).   (16)

Then:

x_i^{gBest} = α_i x_i^{gBest1} + β_i x_i^{gBest2}  (i = 1, 2, …, n),   (17)

where 0 ≤ α_i ≤ 1, 0 ≤ β_i ≤ 1, and α_i + β_i = 1. For a particle, let

pBest = (x₁^{pBest}, x₂^{pBest}, …, x_n^{pBest}).   (18)

Then:

x_i^{pBest} = α_i* x_i^{pBest1} + β_i* x_i^{pBest2}  (i = 1, 2, …, n),   (19)
where 0 ≤ α_i* ≤ 1, 0 ≤ β_i* ≤ 1, and α_i* + β_i* = 1. For the different components we adopt different methods to determine the values of α_i, β_i and α_i*, β_i*, as follows:

α_i = z_{xi}^{gBest1} / ( z_{xi}^{gBest1} + z_{xi}^{gBest2} )  (i = 1, 2, …, m),   (20)

β_i = z_{xi}^{gBest2} / ( z_{xi}^{gBest1} + z_{xi}^{gBest2} )  (i = 1, 2, …, m),   (21)
α_i* = z_{xi}^{pBest1} / ( z_{xi}^{pBest1} + z_{xi}^{pBest2} )  (i = 1, 2, …, m),   (22)

β_i* = z_{xi}^{pBest2} / ( z_{xi}^{pBest1} + z_{xi}^{pBest2} )  (i = 1, 2, …, m),   (23)
where each z(xi) (i = 1, 2, ..., m) is obtained from expression (9). The vector Z = (z(x1), z(x2), ..., z(xm)) represents the relative importance of all the fuzzy rules, and it has an inherent relation to the two objective functions f1(x) and f2(x). During the execution of the algorithm, αi, βi and αi*, βi* are therefore adjusted dynamically as the importance of the corresponding fuzzy rules changes. Because each xi (i = m+1, ..., n) in (4) has no relation to the value of the function f1(x), we let:

αi = αi* = 0  and  βi = βi* = 1    (i = m+1, ..., n) .    (24)
This heuristic method causes the swarms to move toward the Pareto front faster, and the experiment below confirms its effectiveness. In PSO, the behavior of each swarm is determined mainly by gBest and pBest, so the method proposed in this paper makes the swarms behave quite differently as they move toward solutions. That is to say, each swarm moves toward a different solution in the solution region. Thus, the whole group is finally scattered over the Pareto optimal set, which avoids the whole group collapsing into a local area of the Pareto optimal set.
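The component-wise combination of the two guiding solutions described by equations (17) through (24) can be sketched as follows. This is an illustrative sketch only: the function name, the array layout, and the way the rule-importance values z(xi) are passed in are assumptions, not part of the original algorithm specification.

```python
import numpy as np

def combine_guides(g1, g2, z, m):
    """Combine two guide positions component-wise, as in Eqs. (17)-(24).

    g1, g2 : guide positions found under objectives f1 and f2 (length n).
    z      : pair of arrays (length m each) with the rule-importance
             values z(x_i) evaluated for g1 and g2, per expression (9).
    m      : number of components tied to the fuzzy rules; components
             m..n-1 have no relation to f1, so they are taken from the
             f2-guide (Eq. 24: alpha_i = 0, beta_i = 1).
    """
    z1, z2 = z
    alpha = z1 / (z1 + z2)        # Eqs. (20)/(22)
    beta = 1.0 - alpha            # Eqs. (21)/(23): alpha_i + beta_i = 1
    out = np.empty_like(g1, dtype=float)
    out[:m] = alpha * g1[:m] + beta * g2[:m]   # weighted "average"
    out[m:] = g2[m:]                           # Eq. (24)
    return out
```

The same routine serves for both the global guide (gBest1, gBest2) and the individual guide (pBest1, pBest2), since equations (20)-(21) and (22)-(23) have the same form.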
5 Numerical Simulations

The taste signals of five kinds of tea are used in our experiment [10]. The original taste signals are 5-dimensional. In order to eliminate the correlation among the original taste signals and reduce the input dimension of the recognition system, Principal Component Analysis (PCA) is applied to the original signals, yielding the 2-dimensional data shown in Table 1. The C-means clustering algorithm is adopted to obtain the fuzzy subsets of the input variables and their membership functions. For input variable x1, the number of clusters is 5, with cluster centers 0.1393, 0.3113, 0.4431, 0.6197, and 0.7888; for input variable x2, the number of clusters is 5, with cluster centers 0.3330, 0.6259, 0.6968, 0.7698, and 0.8352. For each input variable we therefore define 5 fuzzy subsets, {smallest, smaller, mid, bigger, biggest}, and use the corresponding cluster center as the mean of the Gaussian membership function of each subset. According to the expected value of the result, for the output variable T we define 5 fuzzy subsets {T1, T2, T3, T4, T5} representing the 5 kinds of tea, and use the corresponding expected value as the mean of the Gaussian membership function of each subset.
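The construction of Gaussian membership functions from cluster centers can be sketched as below. The cluster centers for x1 are the values reported in the text; the Gaussian width sigma is an assumption, since the paper does not state how the widths were chosen.

```python
import numpy as np

# Cluster centers for input x1 as reported in the text.
CENTERS_X1 = [0.1393, 0.3113, 0.4431, 0.6197, 0.7888]
SIGMA = 0.1  # assumed width; not specified in the paper

def memberships(x, centers=CENTERS_X1, sigma=SIGMA):
    """Degree of membership of x in each Gaussian fuzzy subset
    ({smallest, smaller, mid, bigger, biggest}), one value per center."""
    c = np.asarray(centers)
    return np.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))
```

At a cluster center the corresponding membership is exactly 1, and it decays toward the neighboring subsets, which is the intended behavior of center-based Gaussian fuzzification.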
Table 1. Sample data of taste signals of five kinds of tea

 #    X1        X2    T
 1    0.566667  0.79  1.0
 2    0.583333  0.8   1.0
 3    0.333333  0.83  1.0
 4    0.5       0.84  1.0
 5    0.866667  0.28  0.75
 6    0.85      0.4   0.75
 7    0.75      0.69  0.75
 8    0.75      0.71  0.75
 9    0.65      0.83  0.50
10    0.666667  0.85  0.50
11    0.766667  0.62  0.50
12    0.766667  0.63  0.50
13    0.316667  0.74  0.25
14    0.283333  0.67  0.25
15    0.4       0.71  0.25
16    0.433333  0.69  0.25
17    0.1       0.77  0.0
18    0.116667  0.78  0.0
19    0.166667  0.76  0.0
20    0.2       0.76  0.0
In Table 1, 20 standard samples are provided as the input signal. In actual conditions, however, the input signal often includes noise. In order to assess the recognition capability of the WFNN on taste signals, both the original taste signals of tea and signals polluted by noise are used in our identification experiment. If the original signal is A, the signal polluted by noise is A' = A + A × η × rand, where 0 ≤ η ≤ 1 is the noise level (η = 0.2 in the experiment) and rand is a random number uniformly distributed on [-1, +1]. With the above method, 60 data were produced as training samples and 40 data as test samples. In the proposed algorithm, the parameters of the multi-objective PSO are: learning rates c1 = c2 = 2; inertia weight decreasing linearly from 0.8 to 0.4; a population of 30 particles; and a stopping condition of 5000 generations. Simulation results are summarized in Table 2. This table shows the non-dominated solutions with high classification rates (i.e., higher than 80%) obtained by a single trial of our algorithm.

Table 2. Simulation results on test data
Number of fuzzy rules   Classification rate
17                      80%
18                      82.5%
22                      85%
25                      87.5%
Multiple non-dominated solutions can be obtained in a single run of our multi-objective PSO. This is one advantage of multi-objective PSO for fuzzy neural network optimization over single-objective methods: the tradeoff between accuracy and interpretability of the fuzzy neural network is clearly exposed by the obtained non-dominated solutions. In our multi-objective PSO algorithm, a heuristic method is proposed to obtain the global best solution gBest and the individual best solution pBest. A random weighting method is presented in [11]; its characteristic is to randomly specify the weights in (12) and (13) at each iteration during the execution of the algorithm. We performed the above computer simulation 20 times with each of the two methods. Simulation results are summarized in Table 3. This table shows the non-dominated solutions with high classification rates (i.e., higher than 80%) for each method.

Table 3. Comparison between two kinds of methods
Heuristic method                       Random weighting method
Rules   Classification rate            Rules   Classification rate
15      80%                            19      80%
18      82.5%                          21      82.5%
22      85%                            24      85%
23      87.5%                          25      87.5%
25      90%
The effect of the heuristic method becomes clear when it is compared with the random weighting method. From Table 3 we can see that the heuristic method obtained better results than the random weighting method. Compared with random weighting, the heuristic method therefore causes the swarms to move toward the Pareto front faster.
6 Conclusion

Based on the standard PSO and the weighted fuzzy neural network, the proposed multi-objective algorithm solves the fuzzy neural network optimization problem. Numerical experiments on taste identification of tea demonstrate the effectiveness of the algorithm.
Acknowledgement This work was supported by the National Natural Science Foundation of China under Grant No. 60433020, “985” Project of Jilin University and the Key Laboratory for Symbol Computation and Knowledge Engineering of the National Education Ministry of China.
References
1. Yen, G., Lu, H.: Hierarchical Genetic Algorithm for Near-Optimal Feedforward Neural Network Design. International Journal of Neural Systems 12(1) (2002) 31-43
2. Coello, C.A.: A Comprehensive Survey of Evolutionary-Based Multiobjective Optimization Techniques. Knowledge and Information Systems 1(3) (1999) 269-308
3. Schaffer, J.D.: Multiple Objective Optimization with Vector Evaluated Genetic Algorithms. In: Proceedings of the First International Conference on Genetic Algorithms and Their Applications. Lawrence Erlbaum, Hillsdale, NJ (1985) 93-100
4. Zitzler, E.: Evolutionary Algorithms for Multiobjective Optimization: Methods and Applications. Ph.D. Thesis, Swiss Federal Institute of Technology Zurich (1999)
5. Eberhart, R., Kennedy, J.: A New Optimizer Using Particle Swarm Theory. In: Proc. Sixth International Symposium on Micro Machine and Human Science (Nagoya, Japan). IEEE Service Center, Piscataway, NJ (1995) 39-43
6. Kennedy, J., Eberhart, R.: Particle Swarm Optimization. In: IEEE International Conference on Neural Networks (Perth, Australia), Volume IV. IEEE Service Center, Piscataway, NJ (1995) 1942-1948
7. Parsopoulos, K.E., Vrahatis, M.N.: Particle Swarm Optimizer in Noisy and Continuously Changing Environments. In: Artificial Intelligence and Soft Computing. IASTED/ACTA Press, Anaheim, CA, USA (2001) 289-294
8. Zhang, L.B., Zhou, C.G., Liang, Y.C.: Solving Multiobjective Optimization Problems Using Particle Swarm Optimization. In: 2003 Congress on Evolutionary Computation (Canberra, Australia), Volume IV. IEEE Service Center, Piscataway, NJ (2003) 2400-2405
9. Kuncheva, L.I.: How Good Are Fuzzy If-Then Classifiers? IEEE Trans. on Systems, Man and Cybernetics, Part B: Cybernetics 30(4) (2000) 501-509
10. Zheng, Y., Zhou, C.G., Huang, Y.X.: Taste Identification of Tea through a Fuzzy Neural Network Based on Fuzzy C-Means Clustering (in Chinese). Mini-Micro Systems 25(7) (2004) 1290-1294
11. Ishibuchi, H., Yamamoto, T.: Fuzzy Rule Selection by Multi-Objective Genetic Local Search Algorithms and Rule Evaluation Measures in Data Mining. Fuzzy Sets and Systems 141 (2004) 59-88
Fuzzy Rule Extraction Using Robust Particle Swarm Optimization

Sumitra Mukhopadhyay(1) and Ajit K. Mandal(2)

(1) A.K. Choudhury School of Information Technology, University of Calcutta, India
[email protected]
(2) ETCE Dept., Jadavpur University, India
[email protected]
Abstract. Automatic fuzzy rule extraction usually assumes the realization of fuzzy if-then rules using a pre-assigned structure rather than an optimal one. In this paper, Particle Swarm Optimization (PSO) is used to simultaneously evolve the structure and the parameters of the fuzzy rule base. However, the existing PSO-based adaptation employs randomness, which makes the rate of convergence dependent on the initial states, and the end result cannot be reproduced repeatedly with a pre-assigned number of iterations. The algorithm has been modified by removing the randomness in parameter learning, making it very robust. The scheme provides the flexibility of extracting the optimal set of fuzzy rules for a prescribed residual error in function approximation and prediction. Simulation studies and a comprehensive analysis demonstrate that an efficient learning technique, as well as structure development of the fuzzy system, can be achieved by the proposed approach.

1 Introduction

The need for prediction, approximation and modeling arises in many disciplines of engineering. The conventional approach aims at finding a global target function using one of several applicable approximation techniques. However, the performance of these tools is not satisfactory, especially in modeling ill-defined, highly nonlinear, uncertain and chaotic systems. In this situation, fuzzy logic or a neural network can be used as a universal approximator [1] up to any prescribed accuracy, provided that sufficient hidden neurons or fuzzy rules are available. It has been established by various researchers [2]-[5] that the Fuzzy Neural Network (FNN), in particular the radial basis function network (RBFN), offers a feasible approach to system modeling and structure identification based on expert knowledge and self-organized learning. The task of finding a set of weights that minimizes the residuals of a feedforward NN is nonlinear and dynamic; the weights are trained using hybrid [3], back-propagation (BP) [4] or evolutionary computation (EC) techniques. Gradient descent in BP suffers from a slow speed of learning and susceptibility to being trapped at local minima. Besides, since the distributed weights of a BP neural network capture the rules of the fuzzy system, the net is not amenable to the extraction of fuzzy rules. PSO, a simple EC method introduced by Kennedy and Eberhart, simulates the social behavior [6]-[8] of a swarm of birds searching for food. It can be used as an efficient learning technique for the adjustment of the parameters of a NN.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 762 – 767, 2006. © Springer-Verlag Berlin Heidelberg 2006
In this paper, an RBFN-based system is used to simultaneously evolve the structure and the parameters of the fuzzy implicative rules from sample data, starting from a single node. It minimizes the RMS of the residual error to obtain the optimal number of rules. Here, PSO has been used as a learning scheme with a fuzzy consequence, of which the Sugeno type is obtained as a special case. A number of particles are scattered initially, covering the entire space. Since the global minimum is also a local minimum in some neighborhood, the best particle will yield the global solution for a prescribed error with a set of optimal fuzzy rules. The learning with random parameters proposed in [8] has been modified by iterative adaptation of the parameters, which makes the process very robust.
2 Architecture of Network

A multi-input, single-output RBFN is considered here to perform as a Takagi-Sugeno-Kang model-based fuzzy system [5]. The results presented can be readily extended to MIMO cases. The RBFNs [5] used in the paper are designated as Type I, II and III, based on the inference scheme adopted, and the parameters are trained using the PSO technique.

Consequence is a First-Order Linear Equation [5]. The Type I fuzzy implicative rule is:

Rule i: If (x1 is A1i) and ... and (xn is Ani) Then (y1 is fi1(x)) and ... and (yp is fip(x)),

where

fik(x) = aik0 + aik1 x1 + aik2 x2 + ... + aikn xn .    (1)

Consequence is a Constant [5]. The consequence of the Type II rule retains only the constant term on the RHS of relation (1).

Consequence is a Fuzzy Variable [5]. The ith (Type III) fuzzy implicative rule is:

Rule i: If (x1 is A1i) and ... and (xn is Ani) Then (y1 is Bi1) and ... and (yp is Bip),

where Bik is a fuzzy variable.
3 PSO Learning Algorithm

The PSO [8] is an optimization technique which resembles a flock of flying birds. The ith particle is represented as Xi = (xi1, ..., xiD), its velocity as Vi = (vi1, vi2, ..., viD), the best previous position ('pbest') of the particle as Pi = (pi1, ..., piD), and the best particle among all particles in the population as 'gbest' (g).

3.1 Proposed Robust PSO (RPSO)

The PSO has been adopted for learning of the NN. The random terms in the learning equation of the existing PSO [8] were intended to introduce diversity in the particles' positions and velocities so as to span the entire search space. However, in ill-defined and time-varying problems, the algorithm fails to converge to the optimal solution at all, or within a finite number of iterations. Even the use of clustering techniques over the sample data to select initial parameter values for particles spanning the entire domain was not very fruitful.
In view of this, we have removed the random parameters from the updating law of the existing PSO [8], and the modified learning law is given by:

v_id = w × v_id + c1 (p_id − x_id) + c2 (p_gd − x_id) ,    (2)

where

x_id = x_id + v_id

and

c_i = c_imax − [(c_imax − c_imin) L] / Lmax ,    i = 1, 2 .

Any other adaptive variation of the coefficient c_i may be used in equation (2).

3.2 PSO Based Self Organized Learning Technique
In order to generate and train the system using RPSO, a hierarchical self-organizing system is implemented, which is a modification of HiSSOL [5]. The main idea is that initially Q particles are initialized and scattered in the search domain. The particles grow to have different dimensions, and a node is recruited one at a time based on the accommodation boundary of the RBF unit [5]. The general scheme is outlined below.

Let Lmax = maximum learning cycle, uq = the number of hidden nodes of the qth particle, r(L) = accommodation boundary, n = the number of training data, p = the number of outputs.

Step 1: Initialize Q particles with no hidden node assigned to them.
Step 2: Set the weights, gbest, pbest, and the other parameters of all the particles. Set L = 1.
Step 3: Set q = 1.
Step 4: Set j = uq. At the Lth learning cycle, the error of the qth particle is calculated:

Eq = (1/2) Σ_{z=1..n} Σ_{k=1..p} (t_zkq − y_zkq)² ,    (3)

where t_zkq and y_zkq are the desired and actual values of the kth output unit for the zth training pattern of the qth particle. Determine pbest and the data point contributing the maximum error during the error calculation for the particle.
Step 5: If Eq > E_threshold, go to Step 6; otherwise go to Step 7.
Step 6: If there is a hidden node such that the training pattern generating the maximum error falls inside its hypersphere H, go to Step 7. Otherwise, create a new hidden node and go to Step 8. The parameters of the new hidden node are assigned as follows: set j = j + 1; mean vector C_jq = input pattern corresponding to the maximum error; deviation σ_jq = σ_init; weight parameter vectors a_jq, w_jq, y*_jq.
Step 7: If equal(size(particle), size(gbest)) = FALSE, then modify the structure of the system using subroutine struct_mod and apply the parameter learning.

struct_mod:
  If (size(particle) < size(gbest))
    If (error_particle > Gbest_error)
      Then (Append the extra parameters of gbest to the original particle);
    Else (parameters of gbest = parameters of particle);
  Else
    If (error_particle > Gbest_error)
      Then (Append zeros to gbest);
    Else (parameters of gbest = parameters of particle);
  Return.
/* Here error_particle is the RMSE generated by the particle in each learning cycle and Gbest_error is the error produced by gbest. */

Step 8: uq = j. If equal(q, Q) = TRUE, go to Step 9; otherwise set q = q + 1 and go to Step 4.
Step 9: Compare the evaluation with the group's previous best (pbest[gbest]). Set L = L + 1.
Step 10: If Gbest_error < E_threshold or L > Lmax, stop and select 'gbest'. Otherwise modify the boundary r(L) and go to Step 3.
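The deterministic RPSO update law of equation (2), with the linearly decreasing coefficient schedule, can be sketched as below. The coefficient limits and velocity bound are the values given in the simulation section (c_max = 4.0, c_min = 1.4, velocity in [+2, -2]); the function name and array handling are assumptions for illustration.

```python
import numpy as np

def rpso_step(x, v, pbest, gbest, w, L, Lmax,
              c_max=4.0, c_min=1.4, v_bound=2.0):
    """One deterministic RPSO update per Eq. (2): no random factors.

    x, v, pbest : one particle's position, velocity, and best position.
    gbest       : best position found by the whole swarm.
    The coefficient c1 = c2 = c decreases linearly from c_max to c_min
    over Lmax learning cycles.
    """
    c = c_max - (c_max - c_min) * L / Lmax   # linear schedule
    v = w * v + c * (pbest - x) + c * (gbest - x)
    v = np.clip(v, -v_bound, v_bound)        # velocity bound of Sec. 4
    return x + v, v
```

Because no random numbers enter the update, repeated runs from the same initial swarm reproduce the same trajectory, which is the robustness property the paper emphasizes.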
4 Simulation Studies

In this section the effectiveness of the proposed learning technique is demonstrated, using 20 particles with a velocity bound of [+2, -2], c_imax = 4.0 and c_imin = 1.4.

Example 1: Nonlinear Dynamic System Identification. The plant [5] below is considered for identification:

y(t + 1) = y(t) y(t − 1) [y(t) + 2.5] / [c + y²(t) + y²(t − 1)] + u(t) .    (4)
Figure 1 shows the result of identification using the proposed RPSO as the learning rule, and the results of a comparative study are shown in Table 1.

Example 2: Mackey-Glass Time-Series Prediction [5]. The data is generated with

x(t + 1) = (1 − a) x(t) + b x(t − τ) / [1 + x¹⁰(t − τ)] .    (5)
Table 1. Comparison of the performance of RPSO-trained RBFN with existing algorithms

RBFN       Learning   Rules   RMSE (training)   RMSE (validation)
Type I     RPSO       14      0.0192            0.0588
           PSO        17      0.0211            0.0598
           BP         35      -------           0.1384
Type II    RPSO        7      0.0252            0.0962
           PSO        20      0.0358            0.0756
           BP         47      -------           0.0668
Type III   RPSO        9      0.0233            0.0521
           PSO        12      0.0202            0.0435
           BP         27      -------           0.0998
Fig. 1. Type II RBF based system: (a) growth of neurons, (b) squared error during training, (c) training result [desired (─), actual (----)], (d) testing result [original (─), predicted (----)]

Fig. 2. Type III RBF based system: (a) growth of neurons, (b) squared error during training, (c) training result [desired (─), actual (----)], (d) testing result [original (─), predicted (----)]
Table 2. Comparison of the performance of different RBFNs (using existing PSO and RPSO)

RBFN       Learning   Rules   RMSE (training)   RMSE (validation)
Type I     PSO        30      0.0644            0.0722
           RPSO       25      0.0399            0.0465
Type II    PSO        37      0.0281            0.0382
           RPSO       27      0.0265            0.0276
Type III   PSO        16      0.0284            0.0289
           RPSO       14      0.0266            0.0267
For training we extract 1000 data points between t = 124 and t = 1123; 1000 data points between t = 1124 and t = 2123 are used for validation. Table 2
records the comparative results and Figure 2 shows the result of prediction obtained by RPSO.
5 Conclusion

The RPSO-based algorithm presented in this paper makes the convergence of the error independent of the initial conditions. Here a learning cycle consists of one iteration only, in contrast to a number of iterations as in [5]. The RPSO is also amenable to the sequential and gradual building up of particles with different dimensions covering the whole domain, and offers a choice of selecting the best-trained neural net with the optimal number of nodes from the population of particles. The results in Tables 1 and 2 reveal that the number of rules obtained with the proposed RPSO is lower in each case than that obtained by the other methods, and the RMSE values are either smaller or comparable.
References
1. Wang, L.: Fuzzy Systems are Universal Approximators. Proc. Int. Conf. Fuzzy Syst., San Diego (1992) 1163-1170
2. Jang, J.S.R.: ANFIS: Adaptive Network Based Fuzzy Inference System. IEEE Trans. Syst. Man Cybern. 23 (1993) 665-684
3. Lin, C.T.: A Neural Fuzzy Control System with Structure and Parameter Learning. Fuzzy Sets and Systems 70 (1995) 183-212
4. Juang, C.F., Lin, C.T.: An On-Line Self Constructing Neural Fuzzy Inference Network and its Application. IEEE Trans. Fuzzy Syst. 6 (1998) 12-32
5. Cho, B.K., Wang, B.H.: RBF Based Adaptive Fuzzy Systems and Their Application to System Identification and Prediction. Fuzzy Sets and Systems 83 (1996) 325-339
6. Eberhart, R., Kennedy, J.: A New Optimizer Using Particle Swarm Theory. Sixth International Symposium on Micromachine and Human Science, IEEE (1995) 39-43
7. Kennedy, J.: The Particle Swarm: Social Adaptation of Knowledge. International Conf. on Evolutionary Computation, Indianapolis, Indiana, IEEE Service Center (1997) 303-308
8. Shi, Y.: Particle Swarm Optimization. Electronic Data Systems, Inc., Kokomo, IN 46902, USA, IEEE Neural Network Society (2004)
A New Design Methodology of Fuzzy Set-Based Polynomial Neural Networks with Symbolic Gene Type Genetic Algorithms

Seok-Beom Roh(1), Sung-Kwun Oh(2) and Tae-Chon Ahn(1)

(1) Department of Electrical, Electronic and Information Engineering, Wonkwang University, 344-2, Shinyong-Dong, Iksan, Chon-Buk, 570-749, South Korea
{nado, tcahn}@wonkwang.ac.kr
(2) Department of Electrical Engineering, Suwon University, San 2-2 Wau-ri, Bongdam-eup, Hwaseong-si, Gyeonggi-do, 445-743, South Korea
[email protected]
Abstract. In this paper, we propose a new design methodology for fuzzy-neural networks, Fuzzy Set-based Polynomial Neural Networks (FSPNN), with symbolic genetic algorithms. We have developed a design methodology (genetic optimization using symbolic genetic algorithms) to find the optimal structure for fuzzy-neural networks expanded from the Group Method of Data Handling (GMDH). The parameters of the FSPNN fixed with the aid of symbolic genetic optimization, which has the search capability to find the optimal solution in the solution space, are the number of input variables, the order of the polynomial, the number of membership functions, and the specific subset of input variables. The augmented, genetically developed FPNN (gFPNN) results in a structurally optimized network and comes with a higher level of flexibility than the conventional FPNNs. The GA-based design procedure applied at each layer of the FPNN leads to the selection of the most suitable nodes (FSPNs) available within the FPNN. Symbolic genetic algorithms are capable of reducing the solution space more than conventional genetic algorithms with binary gene-type chromosomes. The performance of the genetically optimized FSPNN (gFSPNN) with the aid of symbolic genetic algorithms is quantified through experiments on a number of modeling benchmark datasets already used in fuzzy or neurofuzzy modeling.
1 Introduction

Recently, a great deal of attention has been directed towards the use of Computational Intelligence techniques such as fuzzy sets, neural networks, and evolutionary optimization for system modeling. Researchers in system modeling face a multitude of challenging and conflicting objectives, such as compactness, approximation ability, and generalization capability, which they wish to satisfy. Fuzzy sets emphasize the linguistic transparency of models and the role of a model designer whose prior knowledge about the system may be very helpful in facilitating identification. In addition, to build models with substantial approximation capabilities, advanced tools are needed.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 768 – 773, 2006. © Springer-Verlag Berlin Heidelberg 2006
As one of the representative, sophisticated design approaches comes a family of fuzzy polynomial neuron (FPN)-based self-organizing neural networks (abbreviated as FPNN or SOPNN and treated as a new category of neuro-fuzzy networks) [1], [4]. The design procedure of FPNNs tends to produce overly complex networks and comes with a repetitive computational load caused by the trial-and-error method that is part of the development process. The latter is in essence inherited from the original GMDH algorithm, which requires repetitive parameter adjustment by the designer. In this study, addressing the above problems of the conventional SOPNN (especially the FPN-based SOPNN called "FPNN") [1], [4] as well as of the GMDH algorithm, we introduce a new genetic design approach together with a new FSPN structure treated as an FPN within the FPNN. With this new design in mind, we refer to such networks as genetically optimized FPNNs with fuzzy set-based PNs ("gFPNN" for short). Determining the optimal values of the parameters of an individual FSPN (viz. the number of input variables, the order of the polynomial corresponding to the type of fuzzy inference method, the number of membership functions (MFs), and the specific subset of input variables) leads to a structurally and parametrically optimized network.
2 The Architecture and Development of Fuzzy Set-Based Polynomial Neural Networks (FSPNN)

The FSPN encapsulates a family of nonlinear "if-then" rules. Put together, FSPNs result in a self-organizing Fuzzy Set-based Polynomial Neural Network (FSPNN). Each rule reads in the form

if xp is Ak then z is Ppk(xi, xj, apk),
if xq is Bk then z is Pqk(xi, xj, aqk),    (1)

where aqk is a vector of the parameters of the conclusion part of the rule, while P(xi, xj, a) denotes the regression polynomial forming the consequence part of the fuzzy rule. The activation levels of the rules contribute to the output of the FSPN, computed as a weighted average of the individual condition parts (functional transformations) P_K (note that the rule index K is a shorthand notation for the two indices of the fuzzy sets used in rule (1), that is, K = (l, k)).
z = Σ_{l=1}^{total inputs} [ Σ_{k=1}^{rules related to input l} μ_(l,k) P_(l,k)(xi, xj, a_(l,k)) / Σ_{k=1}^{rules related to input l} μ_(l,k) ]
  = Σ_{l=1}^{total inputs} Σ_{k=1}^{rules related to input l} μ̃_(l,k) P_(l,k)(xi, xj, a_(l,k)) .    (2)
In the above expression, we use an abbreviated notation to describe the normalized activation level of the kth rule, in the form
[Figure 1 shows the two-layer topology of the FSPN-based FPNN and the internal structure of an FSPN module: a fuzzy set-based processing (F) part with triangular or Gaussian membership functions (2 ≤ M ≤ 5 MFs per input), followed by a polynomial mapping (P) part using simplified or regression-polynomial fuzzy inference, with the consequent built from either the selected input variables or the entire set of system input variables.]

Fig. 1. A general topology of the FSPN-based FPNN along with the structure of the generic FSPN module (F: fuzzy set-based processing part, P: the polynomial form of mapping)
μ̃_(l,k) = μ_(l,k) / Σ_{k=1}^{rules related to input l} μ_(l,k) .    (3)
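The weighted-average FSPN output of equations (2) and (3) can be sketched as follows. This is an illustrative sketch only: the function name and the nested-list representation of rule activations and pre-evaluated consequent polynomials are assumptions, not the paper's implementation.

```python
import numpy as np

def fspn_output(mu, P):
    """Weighted-average FSPN output per Eqs. (2)-(3).

    mu : list over inputs l; mu[l] holds the activation levels mu_(l,k)
         of the rules tied to input l.
    P  : matching list; P[l][k] is the value of the consequent polynomial
         P_(l,k)(xi, xj, a_(l,k)) already evaluated at the current input.
    """
    z = 0.0
    for mu_l, P_l in zip(mu, P):
        mu_l = np.asarray(mu_l, dtype=float)
        mu_tilde = mu_l / mu_l.sum()             # Eq. (3): per-input normalization
        z += float(mu_tilde @ np.asarray(P_l))   # Eq. (2): weighted average
    return z
```

Note that the normalization in equation (3) is carried out per input variable l, so each input's rule group contributes a convex combination of its own consequents.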
3 Genetic Optimization of FSPNN with the Aid of Symbolic Gene-Type Genetic Algorithms

Let us briefly recall that GAs are a stochastic search technique based on the principles of evolution, natural selection, and genetic recombination, simulating a process of "survival of the fittest" in a population of potential solutions (individuals) to the given problem. GAs aim at the global exploration of a solution space. They help pursue potentially fruitful search paths while examining randomly selected points, in order to reduce the likelihood of being trapped in local minima. The main features of genetic algorithms are individuals viewed as strings, population-based optimization (where the search is realized through the genotype space), and stochastic search mechanisms (selection and crossover). Conventional genetic algorithms use binary gene-type chromosomes; symbolic gene-type genetic algorithms instead use symbolic gene-type chromosomes. With symbolic gene-type genetic algorithms we are able to reduce the solution space, which is their important advantage.
In order to enhance the learning of the FPNN, we use GAs to complete the structural optimization of the network by optimally selecting parameters such as the number of input variables (nodes), the order of the polynomial, and the input variables within an FSPN. In this study, the GA uses the serial method of binary type, roulette-wheel selection, one-point crossover, and binary inversion (complementation) as the mutation operator. To retain the best individual and carry it over to the next generation, we use an elitist strategy [3], [8].
4 The Design Procedure of Genetically Optimized FPNN (gFPNN)

The framework of the design procedure of the genetically optimized Fuzzy Polynomial Neural Networks (FPNN) with fuzzy set-based PNs (FSPNs) comprises the following steps:

[Step 1] Determine the system's input variables.
[Step 2] Form the training and testing data.
[Step 3] Specify the initial design parameters.
[Step 4] Decide upon the FSPN structure through the use of the genetic design.
[Step 5] Carry out fuzzy set-based fuzzy inference and coefficient parameter estimation for fuzzy identification in the selected node (FSPN).
[Step 6] Select the nodes (FSPNs) with the best predictive capability and construct the corresponding layer.
[Step 7] Check the termination criterion.
[Step 8] Determine new input variables for the next layer.
[Figure 2 shows how an FSPN structure is encoded on a symbolic gene-type chromosome: sub-chromosomes select (i) the number of input variables r, (ii) the order of the polynomial (Type 1 to Type 4, i.e., simplified or regression-polynomial fuzzy inference), (iii) the number of membership functions per input (2, 3, 4, or 5, triangular or Gaussian), and (iv) the input variables themselves; the consequent part of the fuzzy rules is built from either the selected input variables or the entire set of system input variables.]

Fig. 2. The FSPN design: structural considerations and mapping the structure onto a chromosome
5 Experimental Studies

This time-series data (296 input-output pairs), describing relationships in a gas furnace, has been intensively studied in the previous literature [5], [6], [7], [8], [9], [10]. The delayed terms of the methane gas flow rate u(t) and the carbon dioxide density y(t) are used as system input variables: u(t−3), u(t−2), u(t−1), y(t−3), y(t−2), and y(t−1). The output variable is y(t). The first part of the dataset (consisting of 148 pairs) was used for training; the remaining part of the series serves as the testing set. Fig. 3 illustrates the optimization process of the gFSPNN by visualizing the values of the performance index obtained in successive generations of the GA when using Type T*.
0.0135 0.013
0.14
Performance Index
Performance Index
0.0125 0.012 0.0115
1st layer
0.011
2nd layer
3rd layer
0.0105
0.13
0.12
1st layer
2nd layer
3rd layer
0.11
0.01
0.1 0.0095 0.009
0.09 0
50
100
150
200
250
300
350
400
450
0
50
100
150
(a) Training error
200
250
300
350
400
450
Generation
Generation
(b) Testing error
Fig. 3. The optimization process quantified by the values of the performance index (in case of using Gaussian MF with Max=5 and Type T*)
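The two MF families used throughout the experiments (triangular and Gaussian) can be sketched as follows. This is a generic illustration; the apex and width parameters are assumptions, since the text only names the MF types and their counts.

```python
import math

# Generic sketch of the two membership-function families named in the
# paper. Parameterizations are assumptions, not the authors' settings.

def triangular_mf(x, left, apex, right):
    """Triangular MF: 0 outside [left, right], 1 at the apex."""
    if x <= left or x >= right:
        return 0.0
    if x <= apex:
        return (x - left) / (apex - left)
    return (right - x) / (right - apex)

def gaussian_mf(x, apex, sigma):
    """Gaussian-like MF centered at the apex with spread sigma."""
    return math.exp(-((x - apex) ** 2) / (2 * sigma ** 2))
```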
Table 1 summarizes a comparative analysis of the performance of the network with other models.

Table 1. Comparative analysis of the performance of the network; considered are models reported in the literature

Model                                                              PI      PIs      EPIs
Box and Jenkins' model [7]                                         0.710
Tong's model [8]                                                   0.469
Sugeno and Yasukawa's model [9]                                    0.190
Pedrycz's model [5]                                                0.320
Oh and Pedrycz's model [6]                                         0.123   0.020    0.271
Kim et al.'s model [10]                                                    0.034    0.244
Proposed gFSPNN, Triangular, 3rd layer (Max=3), Type III (SI=6)            0.0114   0.1072
Proposed gFSPNN, Gaussian-like, 3rd layer (Max=3)                          0.0091   0.0968

PI - performance index over the entire data set, PIs - performance index on the training data, EPIs - performance index on the testing data.
A New Design Methodology of Fuzzy Set-Based Polynomial Neural Networks
773
6 Concluding Remarks

In this study, we have investigated the symbolic-gene-type GA-based design of Fuzzy Polynomial Neural Networks. The genetic optimization applied at each stage (layer) of the FPNN has resulted in a compact and optimal topology. The design methodology emerges as a hybrid structural optimization framework (based on the GMDH method and genetic optimization) combined with parametric learning, regarded as a two-phase design procedure. The GMDH method comprises a structural phase, realized as self-organization driven by an evolutionary algorithm (rooted in the survival of the fittest), and the ensuing parametric phase of Least Squares Estimation (LSE)-based learning. The comprehensive experimental studies involving well-known datasets quantify a superb performance of the network in comparison with the existing fuzzy and neuro-fuzzy models. Most importantly, the proposed framework of genetic optimization supports an efficient structural search resulting in structurally and parametrically optimal network architectures.

Acknowledgements. This work has been supported by KESRI (R-2004-B-133-01), which is funded by MOCIE (Ministry of Commerce, Industry and Energy).
References

1. Oh, S.K., Pedrycz, W.: Self-organizing Polynomial Neural Networks Based on PNs or FPNs: Analysis and Design. Fuzzy Sets and Systems 142(2) (2004) 163-198
2. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag, Berlin Heidelberg New York (1996)
3. De Jong, K.A.: Are Genetic Algorithms Function Optimizers? In: Manner, R., Manderick, B. (eds.): Parallel Problem Solving from Nature 2. North-Holland, Amsterdam (1992)
4. Oh, S.K., Pedrycz, W.: Fuzzy Polynomial Neuron-Based Self-Organizing Neural Networks. Int. J. of General Systems 32 (2003) 237-250
5. Pedrycz, W.: An Identification Algorithm in Fuzzy Relational Systems. Fuzzy Sets and Systems 13 (1984) 153-167
6. Oh, S.K., Pedrycz, W.: Identification of Fuzzy Systems by Means of an Auto-tuning Algorithm and Its Application to Nonlinear Systems. Fuzzy Sets and Systems 115(2) (2000) 205-230
7. Box, G.E.P., Jenkins, G.M.: Time Series Analysis, Forecasting and Control. Holden Day, California (1976)
8. Oh, S.K., Pedrycz, W.: The Design of Self-organizing Polynomial Neural Networks. Information Sciences 141 (2002) 237-258
9. Tong, R.M.: The Evaluation of Fuzzy Models Derived from Experimental Data. Fuzzy Sets and Systems 13 (1980) 1-12
10. Sugeno, M., Yasukawa, T.: A Fuzzy-Logic-Based Approach to Qualitative Modeling. IEEE Trans. Fuzzy Systems 1(1) (1993) 7-31
11. Kim, E.T., et al.: A Simple Identified Sugeno-Type Fuzzy Model via Double Clustering. Information Sciences 110 (1998) 25-39
Design of Fuzzy Polynomial Neural Networks with the Aid of Genetic Fuzzy Granulation and Its Application to Multi-variable Process System

Sung-Kwun Oh1, In-Tae Lee1, and Jeoung-Nae Choi2

1 Department of Electrical Engineering, The University of Suwon, San 2-2 Wau-ri, Bongdam-eup, Hwaseong-si, Gyeonggi-do, 445-743, South Korea
[email protected]
2 Department of Electrical Electronic and Information Engineering, Wonkwang University, 344-2, Shinyong-Dong, Iksan, Chon-Buk, 570-749, South Korea
Abstract. In this paper, we propose a new architecture of Fuzzy Polynomial Neural Networks (FPNN) by means of genetically optimized Fuzzy Polynomial Neurons (FPNs) and discuss its comprehensive design methodology involving mechanisms of genetic optimization, especially Genetic Algorithms (GAs). The conventional FPNNs developed so far are based on mechanisms of self-organization and evolutionary optimization. The proposed FPNN gives rise to a structurally optimized network and comes with a substantial level of flexibility in comparison to the one we encounter in conventional FPNNs. It is shown that the proposed genetic algorithms-based Fuzzy Polynomial Neural Networks are more useful and effective than the existing models for nonlinear processes. We experimented with the Medical Imaging System (MIS) dataset to evaluate the performance of the proposed model.
1 Introduction

Most recently, the omnipresent trends in system modeling are concerned with a broad range of techniques of computational intelligence (CI) that dwell on the paradigms of fuzzy modeling, neurocomputing, and genetic optimization [1], [2]. The list of evident landmarks in the area of fuzzy and neurofuzzy modeling [3], [4] is impressive. While the accomplishments are profound, there are still a number of open issues regarding the structure of the models along with their comprehensive development and testing. Data concerning software products and software processes are crucial to their better understanding and, in the sequel, to establishing more effective ways of enhancing software quality. Accordingly, we are vitally interested in the development of adaptive and highly nonlinear models that are capable of handling the efficacies of software processes. As one of the representative and advanced design approaches comes a family of fuzzy polynomial neuron (FPN)-based self-organizing neural networks (abbreviated as FPNN or PNN and treated as a new category of neuro-fuzzy networks) [6], [9], [13].

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 774 - 779, 2006. © Springer-Verlag Berlin Heidelberg 2006
In this study, in addressing the above problems with the conventional SOPNN (especially the FPN-based PNN called "FPNN" [6], [9], [13]) as well as the GMDH algorithm, we introduce a new genetic design approach. Bearing this new design in mind, we will be referring to these networks as genetic algorithms based FPNN ("GAs-based FPNN" for brief). To assess the performance of the proposed model, we experiment with the well-known medical imaging system (MIS) dataset [8] widely used in software engineering.
2 The Architecture and Development of Fuzzy Polynomial Neural Networks (FPNN)

In this section, we elaborate on the architecture and the development process of the FPNN. This network emerges from the genetically optimized multi-layer perceptron architecture based on GAs and the extended GMDH method.

2.1 FPNN Based on Fuzzy Polynomial Neurons (FPNs)

We start with a fuzzy polynomial neuron (FPN). This neuron, regarded as a generic type of processing unit, dwells on the concepts of fuzzy sets and neural networks. The FPN encapsulates a family of nonlinear "if-then" rules. When put together, FPNs result in a self-organizing Fuzzy Polynomial Neural Network (FPNN). The FPN consists of two basic functional modules. The first one, labeled by F, is a collection of fuzzy sets (here denoted by {Al} and {Bk}) that form an interface between the input numeric variables and the processing part realized by the neuron. Here xq and xp stand for input variables. The second module (denoted here by P) refers to the function-based nonlinear (polynomial) processing that involves some input variables. In other words, the FPN realizes a family of multiple-input single-output rules. Each rule reads in the form

if xp is Al and xq is Bk then z is Plk(xi, xj, alk)
(1)
where alk is a vector of the parameters of the conclusion part of the rule, while Plk(xi, xj, alk) denotes the regression polynomial forming the consequence part of the fuzzy rule. The activation levels of the rules contribute to the output of the FPN, computed as a weighted average of the individual condition parts (functional transformations) PK (note that the index of the rule, namely K, is a shorthand notation for the two indexes of fuzzy sets used in the rule (1), that is, K = (l, k)):

$$z = \frac{\sum_{K=1}^{\text{all rules}} \mu_K \, P_K(x_i, x_j, a_K)}{\sum_{K=1}^{\text{all rules}} \mu_K} = \sum_{K=1}^{\text{all rules}} \tilde{\mu}_K \, P_K(x_i, x_j, a_K) \qquad (2)$$

$$\tilde{\mu}_K = \frac{\mu_K}{\sum_{L=1}^{\text{all rules}} \mu_L} \qquad (3)$$
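Eqs. (2)-(3) amount to a normalized weighted average of the rule consequents. A minimal sketch, assuming the rule activations and the evaluated consequent polynomials are already available:

```python
# Sketch of the FPN output in Eqs. (2)-(3): a weighted average of
# per-rule polynomial consequents with activations normalized over all
# rules. The activation and consequent values here are placeholders.

def fpn_output(activations, consequents):
    """activations: mu_K per rule; consequents: P_K(x_i, x_j; a_K)
    already evaluated at the current input."""
    total = sum(activations)
    return sum(mu / total * p for mu, p in zip(activations, consequents))
```

For example, two rules with activations 0.2 and 0.8 and consequent values 1.0 and 3.0 yield 0.2*1.0 + 0.8*3.0 = 2.6.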
2.2 Genetic Algorithms Based FPNN In order to enhance the learning of the FPNN and augment its performance, we use genetic algorithms to obtain the structural optimization of the network by optimally
selecting such parameters as the number of input variables (nodes), the order of the polynomial, and the input variables within an FPN. In this study, for the optimization of the FPNN model, the GAs use binary serial coding, roulette-wheel selection, one-point crossover, and a binary inversion (complementation) operation as the mutation operator. To retain the best individual and carry it over to the next generation, we use an elitist strategy.
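The genetic operators named above (roulette-wheel selection, one-point crossover, binary inversion mutation, and the elitist strategy) can be sketched as follows. This is an illustrative skeleton, not the authors' code; the crossover rate 0.65 and mutation rate 0.1 follow Table 1, while the population-handling details are assumptions.

```python
import random

# Minimal sketch of one GA generation with the operators named in the
# text: roulette-wheel selection, one-point crossover, bit-inversion
# mutation, and an elitist copy of the best individual.

def roulette_select(population, fitnesses, rng):
    pick = rng.uniform(0, sum(fitnesses))
    acc = 0.0
    for ind, fit in zip(population, fitnesses):
        acc += fit
        if acc >= pick:
            return ind
    return population[-1]

def one_point_crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(bits, rate, rng):
    # binary inversion: flip each bit with probability `rate`
    return [1 - g if rng.random() < rate else g for g in bits]

def next_generation(population, fitnesses, rng, pc=0.65, pm=0.1):
    # elitist strategy: the best individual survives unchanged
    elite = max(zip(population, fitnesses), key=lambda t: t[1])[0]
    new_pop = [elite]
    while len(new_pop) < len(population):
        p1 = roulette_select(population, fitnesses, rng)
        p2 = roulette_select(population, fitnesses, rng)
        c1, c2 = (one_point_crossover(p1, p2, rng)
                  if rng.random() < pc else (p1[:], p2[:]))
        new_pop.append(mutate(c1, pm, rng))
        if len(new_pop) < len(population):
            new_pop.append(mutate(c2, pm, rng))
    return new_pop
```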
3 The Design Procedure of Genetically Optimized FPNN

We use FPNs as the building blocks of the network. Each neuron of the network realizes a polynomial type of partial description (PD) of the mapping between input and output variables. The input-output relation formed by the FPNN algorithm can be described as

$$y = f(x_1, x_2, \ldots, x_n) \qquad (4)$$

The estimated output ŷ reads as the following polynomial:

$$\hat{y} = \hat{f}(x_1, x_2, \ldots, x_n) = c_0 + \sum_{k_1} c_{k_1} x_{k_1} + \sum_{k_1 k_2} c_{k_1 k_2} x_{k_1} x_{k_2} + \sum_{k_1 k_2 k_3} c_{k_1 k_2 k_3} x_{k_1} x_{k_2} x_{k_3} + \cdots \qquad (5)$$
where the c_k's are the coefficients. The framework of the design procedure of the FPNN based on the genetically optimized multi-layer perceptron architecture comprises the following steps:

[Step 1] Determine the system's input variables.
[Step 2] Form training and testing data.
[Step 3] Specify the initial design parameters.
[Step 4] Decide the FPN structure using the genetic design.
[Step 5] Carry out fuzzy inference and coefficient parameter estimation for fuzzy identification in the selected node (FPN).
[Step 6] Select nodes (FPNs) with the best predictive capability and construct their corresponding layer.
[Step 7] Check the termination criterion.
[Step 8] Determine new input variables for the next layer.
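Step 5 estimates the coefficients of each partial description by standard least squares. A minimal sketch for a linear consequent in two inputs, z = c0 + c1*x1 + c2*x2; the particular polynomial form and the function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Sketch of the parametric phase: estimating partial-description
# coefficients by ordinary least squares for an assumed linear
# consequent in two inputs.

def fit_partial_description(x1, x2, y):
    X = np.column_stack([np.ones_like(x1), x1, x2])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def predict(coeffs, x1, x2):
    return coeffs[0] + coeffs[1] * x1 + coeffs[2] * x2
```

Higher-order consequents (Type 3 and Type 4) would simply add quadratic and cross-term columns to the design matrix.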
4 Experimental Studies

In this section, we illustrate the development of the GAs-based FPNN and show its performance on a well-known and widely used dataset in software engineering, the medical imaging system (MIS) [8] data. We consider a MIS subset of 390 software modules written in Pascal and FORTRAN for modeling. These modules consist of approximately 40,000 lines of code. To design an optimal model from the MIS, we study 11 system input variables, namely LOC, CL, TChar, TComm, MChar, DChar, N, NE, NF, V(G), and BW. The output variable of the model is the number of changes made to the software module during its development. In
Table 1. Summary of the parameters of the genetic optimization

Parameters (1st to 5th layer)
- Maximum generation: 300
- Total population size: 150
- Selected population size (W): 40
- Crossover rate: 0.65
- Mutation rate: 0.1
- String length: 3+3+30+5
- Maximal no. (Max) of inputs to be selected: 1 <= l <= Max (2~4)
- Polynomial type (Type T) of the consequent part of fuzzy rules: 1 <= T <= 4
- Consequent input type to be used for Type T (*): Type T
- Membership Function (MF) type: Triangular or Gaussian-like
- No. of MFs per input: 2 or 3
the case of the MIS data, the performance index is defined as the mean squared error (MSE). Table 1 summarizes the list of parameters used in the genetic optimization of the network. The population size selected from the total population (150) is equal to 40. The process is realized as follows. 150 nodes are generated in each layer of the network. The parameters of all nodes generated in each layer are estimated and the network is evaluated using both the training and testing data sets. Then we compare these values and choose 40 FPNs that produce the best (lowest) value of the performance index. The maximal number (Max) of inputs to be selected is confined to two to four (2-4) entries. The polynomial order of the consequent part of fuzzy rules is chosen from four types, that is, Type 1, Type 2, Type 3, and Type 4. Fig. 1 illustrates the detailed optimal topology of the GAs-based FPNN for three layers when using Max=3 and Gaussian-like MFs; the results of the network have been reported as PI=25.804 and EPI=12.637.

[figure: three-layer GAs-based FPNN topology; system inputs LOC, CL, TChar, TComm, MChar, DChar, N, NE, NF, V(G), and BW feed selected FPN nodes across three layers, producing the output ŷ]

Fig. 1. Optimal network structure of GAs-based FPNN (for 3 layers)

As shown in Fig. 1, the proposed network
enables the architecture to be a structurally more optimized and simplified network than the conventional FPNN. In the nodes (FPNs) of Fig. 1, 'FPNn' denotes the nth FPN (node) of the corresponding layer; the numeric values in rectangles before a node (neuron) give the number of membership functions per input variable; the number on the left side denotes the number of nodes (inputs or FPNs) coming into the corresponding node; and the number on the right side denotes the polynomial order of the conclusion part of the fuzzy rules used in the corresponding node. Table 2 summarizes a comparative analysis of the performance of the network with other models. The experimental results clearly reveal that it outperforms the existing models both in terms of significant approximation capabilities (lower values of the performance index on the training data, PIs) as well as superb generalization abilities (expressed by the performance index on the testing data, EPIs). The performance index values of the proposed genetically optimized FPNN (GAs-based FPNN) can be much lower than those of the conventional optimized FPNN, as shown in Table 2. PI (EPI) is defined as the mean squared error (MSE) computed between the experimental data and the respective outputs of the network.
Table 2. Comparative analysis of the performance of the network; considered are models reported in the literature

Model        Structure                                            PI       EPI
SONFN [11]   Simplified, Generic type architecture (5th layer, Case 1)   40.753   17.898
SONFN [11]   Linear, Basic type architecture (5th layer, Case 2)         35.745   17.807
FPNN [12]    2 inputs, Triangular (Type 2, 5th layer)                    32.195   18.462
FPNN [12]    2 inputs, Gaussian (Type 1, 5th layer)                      49.716   31.423
FPNN [12]    3 inputs, Triangular (Type 1, 5th layer)                    32.251   19.622
FPNN [12]    3 inputs, Gaussian (Type 1, 5th layer)                      39.093   19.983
Our model    Max-inputs 2, Triangular (5th layer)                        27.406   16.641
Our model    Max-inputs 2, Gaussian (5th layer)                          25.435   14.976
Our model    Max-inputs 3, Triangular (5th layer)                        27.162   17.598
Our model    Max-inputs 3, Gaussian (5th layer)                          25.804   12.637
Our model    Max-inputs 4, Triangular (5th layer)                        26.835   14.919
Our model    Max-inputs 4, Gaussian (5th layer)                          30.69    15.39
5 Concluding Remarks

In this study, the GA-based design procedure of Fuzzy Polynomial Neural Networks (FPNN), along with its architectural considerations, has been investigated. The GA-based design procedure applied at each stage (layer) of the FPNN leads to the selection of the preferred nodes (or FPNs) with optimal local characteristics (such as the number of input variables, the order of the consequent polynomial of fuzzy rules, the input variables, and the number of membership functions) available within the FPNN. These options contribute to the flexibility of the resulting architecture of the network. The design methodology emerges as a hybrid structural optimization and parametric learning regarded as a two-phase design procedure. Through the proposed
framework of genetic optimization, we can efficiently search for the optimal network architecture (being both structurally and parametrically optimized), and this design facet becomes crucial in improving the performance of the resulting model.

Acknowledgements. This work has been supported by KESRI (R-2004-B-274), which is funded by MOCIE (Ministry of Commerce, Industry and Energy).
References

1. Pedrycz, W.: Computational Intelligence: An Introduction. CRC Press, Florida (1998)
2. Peters, J.F., Pedrycz, W.: Computational Intelligence. In: Encyclopedia of Electrical and Electronic Engineering (Edited by J.G. Webster). John Wiley & Sons, New York 22 (1999)
3. Pedrycz, W., Peters, J.F. (eds.): Computational Intelligence in Software Engineering. World Scientific, Singapore (1998)
4. Takagi, H., Hayashi, I.: NN-driven Fuzzy Reasoning. Int. J. of Approximate Reasoning 5(3) (1991) 191-212
5. Cordon, O., et al.: Ten Years of Genetic Fuzzy Systems: Current Framework and New Trends. Fuzzy Sets and Systems 141(1) (2004) 5-31
6. Oh, S.K., Pedrycz, W.: Self-organizing Polynomial Neural Networks Based on PNs or FPNs: Analysis and Design. Fuzzy Sets and Systems 142 (2004) 163-198
7. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. 3rd edn. Springer-Verlag, Berlin Heidelberg New York (1996)
8. Lyu, M.R. (ed.): Handbook of Software Reliability Engineering. McGraw-Hill (1995)
9. Oh, S.K., Pedrycz, W.: Fuzzy Polynomial Neuron-Based Self-Organizing Neural Networks. Int. J. of General Systems 32(3) (2003) 237-250
10. Wang, L.X., Mendel, J.M.: Generating Fuzzy Rules from Numerical Data with Applications. IEEE Trans. Systems, Man, Cybern. 22(6) (1992) 1414-1427
11. Oh, S.K., Pedrycz, W., Park, B.J.: Relation-based Neurofuzzy Networks with Evolutionary Data Granulation. Mathematical and Computer Modelling (2003) 202-212
12. Oh, S.K., Pedrycz, W.: Fuzzy Polynomial Neuron-Based Self-Organizing Neural Networks. Int. J. of General Systems 32(3) (2003) 237-250
13. Park, B.J., Lee, D.Y., Oh, S.K.: Rule-based Fuzzy Polynomial Neural Networks in Modeling Software Process Data. Int. J. of Control, Automation, and Systems 1(3) (2003) 321-331
14. Yamakawa, T.: A New Effective Learning Algorithm for a Neo Fuzzy Neuron Model. 5th IFSA World Conference (1993) 1017-1020
A Novel Self-Organizing Fuzzy Polynomial Neural Networks with Evolutionary FPNs: Design and Analysis

Ho-Sung Park1, Sung-Kwun Oh2, and Tae-Chon Ahn1

1 School of Electrical Electronic and Information Engineering, Wonkwang University, 344-2, Shinyong-Dong, Iksan, Chon-Buk, 570-749, South Korea
{neuron, tcahn}@wonkwang.ac.kr
2 Department of Electrical Engineering, The University of Suwon, San 2-2 Wau-ri, Bongdam-eup, Hwaseong-si, Gyeonggi-do, 445-743, South Korea
[email protected]
Abstract. In this study, we introduce a new category of neurofuzzy networks, Self-Organizing Fuzzy Polynomial Neural Networks (SOFPNN), and develop a comprehensive design methodology involving mechanisms of genetic algorithms and information granulation. With the aid of information granulation, we determine the initial locations (apexes) of the membership functions and the initial values of the polynomial functions used in the premise and consequent parts of the fuzzy rules, respectively. The GA-based design procedure applied at each layer of the SOFPNN leads to the selection of preferred nodes with specific local characteristics (such as the number of input variables, the order of the polynomial, a collection of the specific subset of input variables, and the number of membership functions) available within the network.
1 Introduction

While neural networks, fuzzy sets, and evolutionary computing as the technologies of Computational Intelligence (CI) have expanded and enriched the field of modeling quite immensely, they have also given rise to a number of new methodological issues and increased our awareness about the tradeoffs one has to make in system modeling [1]. GMDH was introduced by Ivakhnenko in the early 1970s [2]. GMDH-type algorithms have been extensively used since the mid-1970s for prediction and modeling of complex nonlinear processes. While providing a systematic design procedure, GMDH comes with some drawbacks. To alleviate the problems associated with the GMDH, Self-Organizing Neural Networks (SONN, called "SOFPNN") were introduced by Oh and Pedrycz [3], [4], [5] as a new category of neural networks or neurofuzzy networks. Although the SOFPNN has a flexible architecture whose potential can be fully utilized through a systematic design, it is difficult to obtain the structurally and parametrically optimized network because of the limited design of the nodes located in each layer of the SOFPNN. In this study, considering the above problems of the conventional SOFPNN, we introduce a new structure and organization of fuzzy rules as well as a new genetic design approach. Under this new interpretation of fuzzy rules, information granules are embedded directly into the rules. In a nutshell, each fuzzy rule describes the related

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 780 - 785, 2006. © Springer-Verlag Berlin Heidelberg 2006
information granule. The determination of the optimal values of the parameters available within an individual FPN leads to a structurally and parametrically optimized network through the genetic approach.
2 The Conventional SOFPNN with Fuzzy Polynomial Neuron (FPN) and Its Topology

The FPN consists of two basic functional modules. The first one, labeled by F, is a collection of fuzzy sets that form an interface between the input numeric variables and the processing part realized by the neuron. The second module (denoted here by P) provides the function-based nonlinear (polynomial) processing. The detailed FPN involving a certain regression polynomial is shown in Table 1.

Table 1. Different forms of regression polynomial building a FPN

Order of the polynomial          1 input      2 inputs        3 inputs
Order 0 (Type 1)                 Constant     Constant        Constant
Order 1 (Type 2)                 Linear       Bilinear        Trilinear
Order 2 (Type 3)                 Quadratic    Biquadratic-1   Triquadratic-1
Order 2 (Type 4)                 Quadratic    Biquadratic-2   Triquadratic-2

1: Basic type, 2: Modified type
3 The Optimization of SOFPNN

3.1 Optimization of SOFPNN by Structure Identification

With the aid of GAs [6], the optimal structure design available within an FPN (viz. the number of input variables, the order of the polynomial, a collection of the specific subset of input variables, and the number of MFs) leads to a structurally optimized network. Furthermore, we determine the initial locations (apexes) of the membership functions in the premise part and the initial values of the polynomial functions in the consequence part of the fuzzy rules by information granulation [7] (using HCM [8]).

3.2 Optimization of SOFPNN by Parameter Identification

For the parametric identification, we obtain an effective model in which the apexes of the MFs are identified by the GA so as to reflect the characteristics of the given data. In particular, a genetically dynamic search method is introduced in the identification of the parameters. It leads to rapid convergence toward the optimum over a limited region or under a boundary condition.
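The HCM-based placement of MF apexes mentioned in Section 3.1 can be illustrated with a one-dimensional hard c-means iteration. This is a generic Lloyd-style sketch under an assumed evenly spaced initialization, not the detailed procedure of [7], [8].

```python
# Illustrative 1-D hard c-means (HCM) sketch for placing candidate MF
# apexes on a single input variable. Initialization and iteration count
# are assumptions.

def hcm_1d(values, c, iters=50):
    """Return c cluster prototypes (candidate MF apexes) for 1-D data."""
    vmin, vmax = min(values), max(values)
    # spread initial prototypes evenly over the data range
    protos = [vmin + (vmax - vmin) * (i + 0.5) / c for i in range(c)]
    for _ in range(iters):
        clusters = [[] for _ in range(c)]
        for v in values:
            j = min(range(c), key=lambda k: abs(v - protos[k]))
            clusters[j].append(v)
        protos = [sum(cl) / len(cl) if cl else p
                  for cl, p in zip(clusters, protos)]
    return sorted(protos)
```

The returned prototypes would serve as the apexes of triangular or Gaussian-like MFs in the premise part of the rules.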
4 The Algorithm and Design Procedure of SOFPNN The framework of the design procedure of the SOFPNN based on genetically optimized FPNs consists of the following steps.
[Step 1] Determine the system's input variables. Define the system's input variables xi (i=1, 2, ..., n) related to the output variable y.
[Step 2] Form training and testing data. The input-output data set (xi, yi) = (x1i, x2i, ..., xni, yi), i=1, 2, ..., N, is divided into two parts, that is, a training and a testing dataset.
[Step 3] Decide the apexes of the MFs by information granulation (HCM). As mentioned in Section 3.1, we obtain the new apexes of the MFs by information granulation, as shown in Fig. 1.
[Step 4] Decide the initial design information for constructing the SOFPNN structure. Here we decide upon the essential design parameters of the SOFPNN structure, including a) the initial specification of the fuzzy inference method and the fuzzy identification, and b) the initial specification for the decision of the SOFPNN structure.
[Step 5] Decide upon the SOFPNN structure with the use of the genetic design. When it comes to the organization of the chromosome representing (mapping) the structure of the SOFPNN, we divide the chromosome used for genetic optimization into four sub-chromosomes.
[Step 6] Carry out fuzzy inference and coefficient parameter estimation for fuzzy identification in the selected nodes (FPNs). At this step, the regression polynomial inference is considered. In the fuzzy inference, we consider two types of membership functions, namely triangular and Gaussian-like membership functions. The consequent parameters are produced by the standard least squares method, and the new consequent part of each fuzzy rule is constructed by information granulation (HCM).
[Step 7] Select nodes (FPNs) with the best predictive capability and construct their corresponding layer. The nodes (FPNs) are generated through the genetic design. In the GAs, we calculate the fitness function as in Eq. (1):

F (fitness function) = 1 / (1 + EPI)    (1)

[Step 8] Check the termination criterion. As far as the performance index is concerned (reflecting the numeric accuracy of the layers), termination is straightforward and comes in the form

F1 <= F*    (2)

where F1 denotes the maximal fitness value occurring at the current layer whereas F* stands for the maximal fitness value that occurred at the previous layer. In this study, we use one performance index, the Mean Squared Error (MSE).
[Step 9] Determine new input variables for the next layer. If (2) has not been met, the model is expanded. The outputs of the preserved nodes (z1i, z2i, ..., zWi) serve as new inputs to the next layer (x1j, x2j, ..., xWj) (j=i+1). The SOFPNN algorithm is carried out by repeating steps 4-9.
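Steps 5-9 form a layer-growth loop: evolve candidate FPNs for a layer, preserve the best W of them, and stop when the best fitness F1 = 1/(1+EPI) no longer exceeds the previous layer's F*. A schematic sketch in which `evolve_layer` stands in for the whole genetic design phase (all names here are assumptions, not the authors' code):

```python
# Schematic sketch of the layer-by-layer construction in Steps 5-9.
# evolve_layer(inputs) -> list of (node, EPI) candidates for one layer.

def build_network(evolve_layer, inputs, w=30, max_layers=10):
    f_star = 0.0
    layers = []
    for _ in range(max_layers):
        candidates = evolve_layer(inputs)
        candidates.sort(key=lambda ne: ne[1])        # lowest EPI first
        best = candidates[:w]                        # preserve W nodes
        f1 = 1.0 / (1.0 + best[0][1])                # Eq. (1)
        if f1 <= f_star:                             # Eq. (2): stop
            break
        layers.append([node for node, _ in best])
        # outputs of the preserved nodes become the next layer's inputs
        inputs = [node for node, _ in best]
        f_star = f1
    return layers
```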
5 Experimental Studies

We illustrate the performance of the network and elaborate on its development by experimenting with data coming from the NOx emission process of a gas turbine power plant [9]. To come up with a quantitative evaluation of the network, we use the standard MSE performance index. Table 2 summarizes the list of parameters used in the genetic optimization of the networks.

Table 2. Computational overhead and a list of parameters of the GAs and the SOFPNN

Parameters (1st, 2nd, and 3rd layers)
- Maximum generation: 150 (S.O); 300 x no. of nodes in the 1st layer (P.O)
- Total population size: 150 (S.O); 100 (P.O)
- Selected population size (S.O): 30
- Crossover rate: 0.65
- Mutation rate: 0.1
- String length: 3+3+30+5+90 (1st layer); 3+3+30+5 (2nd and 3rd layers)
- Maximal no. of inputs to be selected (Max): 1 <= l <= Max (2~3)
- Polynomial type (Type T) of the consequent part of fuzzy rules: 1 <= T* <= 4 (1st layer); 1 <= T <= 4 (2nd and 3rd layers)
- Membership Function (MF) type: Triangular or Gaussian
- No. of MFs per each input (M): 2 or 3

l, T, Max: integers; S.O: Structure Optimization; P.O: Parameter Optimization.
Table 3 summarizes the results.

Table 3. Performance index of SOFPNN for NOx (S.O: Structure Optimization, P.O: Parameter Optimization)

Case of selected input variables:
MF          Max   Node(M)         T   S.O PI   S.O EPI   P.O PI   P.O EPI
Triangular  2     4(3) 27(2)      4   0.072    0.145     0.016    0.068
Triangular  3     5(2) 13(3) 0    4   0.027    0.111     0.014    0.036
Gaussian    2     23(3) 29(2)     3   0.014    0.352     0.012    0.180
Gaussian    3     2(3) 16(2) 0    3   0.003    0.055     0.004    0.134

Case of entire system input variables:
MF          Max   Node(M)         T   S.O PI   S.O EPI   P.O PI   P.O EPI
Triangular  2     14(2) 20(3)     4   0.002    0.045     0.003    0.017
Triangular  3     6(3) 15(2) 0    4   0.001    0.048     0.002    0.008
Gaussian    2     2(3) 14(2)      2   0.001    0.027     0.002    0.024
Gaussian    3     15(2) 28(2) 0   4   0.001    0.027     0.001    0.023
Fig. 1 illustrates the detailed optimal topology of the SOFPNN when using Max=3 and triangular MFs; the results of the network have been reported as PI=0.002 and EPI=0.008 in the 3rd layer. Fig. 2 illustrates the optimization process by visualizing the values of the performance index obtained in successive generations of the GAs.
[figure: genetically optimized SOFPNN topology; system inputs Tamb, COM, LPT, Pcd, and Texh feed selected FPN nodes across three layers, producing the output ŷ]
Fig. 1. Genetically optimized SOFPNN architecture
[figure: performance index versus generation (0-300), each curve showing the structure optimization phase followed by the parameter optimization phase; panels (a) training data error and (b) testing data error]
Fig. 2. The optimization process quantified by the values of the performance index

Table 4. Comparative analysis of the performance of the network; considered are models reported in the literature

Model               Structure                              PIs     EPIs
Regression model                                           17.68   19.23
FNN model [10]      GA, Simplified                         7.045   11.264
FNN model [10]      GA, Linear                             4.038   6.028
FNN model [10]      Hybrid (GA+Complex), Simplified        6.205   8.868
FNN model [10]      Hybrid (GA+Complex), Linear            3.830   5.397
Multi-FNNs [11]     Linear                                 1.076   2.250
gHFPNN [12]         Triangular, 3rd layer (Max=2)          0.008   0.082
gHFPNN [12]         Triangular, 5th layer (Max=2)          0.008   0.081
gHFPNN [12]         Gaussian-like, 3rd layer (Max=2)       0.016   0.132
gHFPNN [12]         Gaussian-like, 5th layer (Max=2)       0.016   0.116
Our model           Triangular, 3rd layer (Max=2)          0.003   0.017
Our model           Triangular, 3rd layer (Max=3)          0.002   0.008
Our model           Gaussian-like, 3rd layer (Max=2)       0.002   0.024
Our model           Gaussian-like, 3rd layer (Max=3)       0.001   0.023
6 Concluding Remarks

In this study, we introduced and investigated a new architecture and comprehensive design methodology of Self-Organizing Fuzzy Polynomial Neural Networks (SOFPNN), and discussed their topologies. The design methodology comes as a structural and parametric optimization viewed as two fundamental phases of the design process. In the design of the SOFPNN, the characteristics inherent to the entire experimental data used in the construction of the SOFPNN architecture are reflected in the fuzzy rules available within each FPN. Therefore, information granulation based on the HCM (Hard C-Means) clustering method was adopted. With the aid of information granulation, we determine the initial locations (apexes) of the membership functions and the initial values of the polynomial functions used in the premise and consequent parts of the fuzzy rules, respectively. The GA-based design procedure applied at each stage (layer) of the SOFPNN leads to the selection of the optimal nodes with local characteristics available within the SOFPNN.

Acknowledgement. This paper was supported by Wonkwang University in 2004.
References

1. Cherkassky, V., Gehring, D., Mulier, F.: Comparison of Adaptive Methods for Function Estimation from Samples. IEEE Trans. Neural Networks 7 (1996) 969-984
2. Ivakhnenko, A.G.: Polynomial Theory of Complex Systems. IEEE Trans. on Systems, Man and Cybernetics 1 (1971) 364-378
3. Oh, S.K., Pedrycz, W.: The Design of Self-organizing Polynomial Neural Networks. Information Sciences 141 (2002) 237-258
4. Oh, S.K., Pedrycz, W., Park, B.J.: Polynomial Neural Networks Architecture: Analysis and Design. Computers and Electrical Engineering 29 (2003) 703-725
5. Oh, S.K., Pedrycz, W.: Fuzzy Polynomial Neuron-Based Self-Organizing Neural Networks. Int. J. of General Systems 32 (2003) 237-250
6. De Jong, K.A.: Are Genetic Algorithms Function Optimizers? In: Manner, R., Manderick, B. (eds.): Parallel Problem Solving from Nature 2. North-Holland, Amsterdam (1992)
7. Zadeh, L.A.: Toward a Theory of Fuzzy Information Granulation and Its Centrality in Human Reasoning and Fuzzy Logic. Fuzzy Sets and Systems 90 (1997) 111-117
8. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, New York (1981)
9. Vachtsevanos, G., Ramani, V., Hwang, T.W.: Prediction of Gas Turbine NOx Emissions Using Polynomial Neural Network. Technical Report, Georgia Institute of Technology, Atlanta (1995)
10. Oh, S.K., Pedrycz, W., Park, H.S.: Hybrid Identification in Fuzzy-neural Networks. Fuzzy Sets and Systems 138 (2003) 399-426
11. Oh, S.K., Pedrycz, W., Park, H.S.: Multi-FNN Identification Based on HCM Clustering and Evolutionary Fuzzy Granulation. Simulation Modelling Practice and Theory 11 (2003) 627-642
12. Oh, S.K., Pedrycz, W., Park, H.S.: Multi-layer Hybrid Fuzzy Polynomial Neural Networks: A Design in the Framework of Computational Intelligence. Neurocomputing 64 (2004) 397-431
Design of Fuzzy Neural Networks Based on Genetic Fuzzy Granulation and Regression Polynomial Fuzzy Inference Sung-Kwun Oh1, Byoung-Jun Park2, and Witold Pedrycz2,3 1 Department of Electrical Engineering, The University of Suwon, San 2-2 Wau-ri, Bongdam-eup, Hwaseong-si, Gyeonggi-do, 445-743, South Korea
[email protected] 2 Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2G6, Canada 3 Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Abstract. In this paper, new architectures and a comprehensive design methodology of Genetic Algorithms (GAs)-based Fuzzy Relation-based Fuzzy Neural Networks (FRFNN) are introduced, and dynamic search-based GAs are introduced to achieve rapid optimal convergence over a limited region or under a boundary condition. The proposed FRFNN is based on Fuzzy Neural Networks (FNN) with an extended structure of the fuzzy rules formed within the networks. In the consequence part of the fuzzy rules, three different forms of regression polynomials, namely constant, linear, and modified quadratic, are taken into consideration. The structure and parameters of the FRFNN are optimized by the dynamic search-based GAs. The proposed model is contrasted with the performance of conventional FNN models reported in the literature.
1 Introduction An omnipresent tendency is to exploit techniques of CI [1] by embracing neurocomputing, which imitates the neural structure of a human [2], fuzzy modeling, which uses the linguistic knowledge and experience of experts [3], [4], and genetic optimization, which is based on natural law [5], [6]. Two of the most successful approaches have been the hybridization attempts made in the framework of CI. Neuro-fuzzy systems are one of them [7], [8]. A different approach to hybridization leads to genetic fuzzy systems [6], [9]. In this paper, new architectures and a comprehensive design methodology of Genetic Algorithms (GAs)-based Evolutionally optimized Fuzzy Neural Networks (EoFNN) are introduced for the effective analysis and solution of nonlinear problems and complex systems. The proposed EoFNN is based on Fuzzy Neural Networks (FNN). In the consequence part of the fuzzy rules, three different forms of regression polynomials, namely constant, linear, and modified quadratic, are considered. This methodology can effectively reduce the number of parameters and
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 786 – 791, 2006. © Springer-Verlag Berlin Heidelberg 2006
improve the performance of a model. GAs, being a global optimization technique, determine optimal parameters in a vast search space, but they cannot effectively avoid a large amount of time-consuming iteration, because GAs find optimal parameters within a fixed given space (region). To alleviate these problems, dynamic search-based GAs are introduced to achieve rapid optimal convergence over a limited region or under a boundary condition. To assess the performance of the proposed model, we exploit a well-known numerical example.
2 Polynomial Fuzzy Inference Architecture of FNN The overall network of conventional rule-based fuzzy neural networks (FNN [8], [9]) consists of fuzzy rules as shown in (1) and (2). The networks are classified into two main categories according to the type of fuzzy inference, namely the simplified and the linear fuzzy inference. The learning of the FNN is realized by adjusting the connection weights wi or wki of the nodes, and as such it follows a standard Back-Propagation (BP) algorithm.

Ri: If x1 is A1i and … and xk is Aki then Cyi = wi
(1)
Ri: If x1 is A1i and … and xk is Aki then Cyi = w0i + w1i·x1 + … + wki·xk
(2)
In this paper, we propose the polynomial fuzzy inference based FNN (pFNN). The proposed pFNN is obtained from the integration and extension of conventional FNNs. The topology of the proposed pFNN consists of the aggregation of fuzzy rules such as (3). The consequence part of pFNN consists of the summation of a constant term, a linear sum of the input variables, and a linear sum of combinations of two variables over the entire set of input variables. The polynomial structure of the consequence part in pFNN emerges in the networks through the connection weights. This network structure of the modified quadratic polynomial involves the simplified (Type 0), linear (Type 1), and polynomial (Type 2) fuzzy inferences, and the fuzzy inference structure of pFNN can be defined by the selection of the Type (order of the polynomial).

Ri: If x1 is A1i and … and xk is Aki then
Cyi = w0i + w1i·x1 + … + wki·xk + wk+1,i·x1·x2 + … + wk+j,i·xk'·xk
(3)
[Layer 1] Input layer
[Layer 2] Computing activation degrees of linguistic labels
[Layer 3] Computing firing strengths of the premise rules
[Layer 4] Normalization of the activation (firing) degree of the rule
[Layer 5] Multiplying the normalized activation degree of the rule by the connection weight

fi = μi × Cyi, where
  Type 0: Cyi = w0i
  Type 1: Cyi = w0i + w1i·x1 + … + wki·xk
  Type 2: Cyi = w0i + w1i·x1 + … + wk+1,i·x1·x2 + …
(4)
[Layer 6] Computing the output of pFNN

ŷ = Σi=1..n fi = Σi=1..n μi·Cyi = (Σi=1..n μi·Cyi) / (Σi=1..n μi)

(5)
The learning of the proposed pFNN is realized by adjusting the connection weights w, which organize the consequence networks of pFNN. The standard Back-Propagation (BP) algorithm is utilized as the learning method in this study. The complete update formulas, including the momentum terms, are given in (6) for the case of Type 2:

Δw0i(t+1) = 2·η·(yp − ŷp)·μi + α·Δw0i(t)
Δwki(t+1) = 2·η·(yp − ŷp)·μi·xk + α·Δwki(t)
Δwk+j,i(t+1) = 2·η·(yp − ŷp)·μi·xk'·xk + α·Δwk+j,i(t)
(6)
where Δwki(t) = wki(t) − wki(t−1), and η and α are constrained to the unit interval. The proposed pFNN can be designed to adapt to the characteristics of a given system; moreover, it has the ability to derive a simple structure from a complicated model of a nonlinear system, because the pFNN comprises consequence structures of various orders (Types) for the fuzzy rules.
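The layer computations and the update rule above can be sketched as follows; the function names and the flattened regressor layout are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def pfnn_forward(mu, Cy):
    """Combine rule activations and consequent values as in eq. (5):
    normalize the firing strengths (layer 4), weight the consequents
    (layer 5), and sum (layer 6)."""
    mu_bar = mu / mu.sum()
    return float((mu_bar * Cy).sum())

def bp_update(w, x, mu_i, y, y_hat, dw_prev, eta=0.1, alpha=0.9):
    """One BP step with momentum, eq. (6): w holds all consequent weights
    of one rule and x the matching regressor vector (1, x1, ..., xk,
    x1*x2, ...); eta is the learning rate, alpha the momentum factor."""
    dw = 2.0 * eta * (y - y_hat) * mu_i * x + alpha * dw_prev
    return w + dw, dw
```

Stacking the constant, linear, and cross terms of a rule into one regressor vector lets the three update formulas of (6) collapse into a single vector operation.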
3 Evolutionally Optimized Fuzzy Neural Networks We introduce new architectures and a comprehensive design methodology of genetic algorithms (GAs [5], [6]) based EoFNN. The search for a global solution in the process of optimization using GAs can stagnate for various reasons. One factor is how to suitably define the search space for a global solution. Generally, a search space (range) is defined for a given system, and the string length (the number of bits) is set for the defined range; the search for the optimal solution is based on these. If the search range is defined on a large scale, then the number of bits increases for the given space. This demands much computing time, slows the search for solutions, and so on, so the GAs show a drop in performance. If a small-scale solution space is given, then the string length is shortened. This can ease the computing burden, but it degrades the quality (resolution) of the solutions. Therefore, in order to remedy these problems, we
Fig. 1. Dynamic search-based GAs (successive adjustment of the search range [xlow, xhigh] around the basis solution xbasis by Δx)
Design of Fuzzy Neural Networks Based on Genetic Fuzzy Granulation
789
introduce the dynamic search-based GAs. This methodology discovers an optimal solution by adjusting the search range. The adjustment of the range is based on the moving distance of a basis solution; the basis solution is determined in advance over a sufficiently large space. The procedure of adjusting the space at each step of the dynamic search-based GAs is shown in Fig. 1.
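As a sketch, one step of the range adjustment of Fig. 1 might look as follows; the shrink factor and the clipping policy are assumptions, since the paper does not specify them.

```python
def adjust_range(x_basis, x_low, x_high, shrink=0.5):
    """One step of the dynamic search: re-center the search interval on the
    current basis solution x_basis and narrow its width, so the GA string
    length can stay short while the solution is progressively refined.
    `shrink` (the fraction the width is reduced to) is an assumed knob."""
    half = shrink * (x_high - x_low) / 2.0
    new_low, new_high = x_basis - half, x_basis + half
    # keep the adjusted window inside the original bounds
    if new_low < x_low:
        new_low, new_high = x_low, x_low + 2.0 * half
    if new_high > x_high:
        new_low, new_high = x_high - 2.0 * half, x_high
    return new_low, new_high
```

Because the interval shrinks around the basis solution, the same number of bits encodes a finer resolution at every step, which is the trade-off the text describes.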
4 Experimental Studies In this experiment, we use the three-input nonlinear function as in [3], [4]. This dataset was analyzed using Sugeno's method [3]. We consider 40 pairs of the original input-output data. The performance index (PI) is defined by (7). 20 out of the 40 pairs of input-output data are used as the learning set, and the remaining part serves as the testing set.

E (PI or E_PI) = (1/n) Σp=1..n (|yp − ŷp| / yp) × 100 (%)

(7)
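Eq. (7) is a mean absolute percentage error; a direct transcription:

```python
def performance_index(y, y_hat):
    """Relative error index E of eq. (7): the mean, over the n data pairs,
    of the absolute error scaled by the true output, in percent."""
    n = len(y)
    return 100.0 * sum(abs(yp - yhp) / yp for yp, yhp in zip(y, y_hat)) / n
```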
Table 1 summarizes the results of the EoFNN architectures. This table covers the tuning methodologies using the dynamic search-based GAs. Case A (separate tuning) includes two auto-tuning processes, namely structure and parameter tuning. In the first process, the structure of a given model is tuned; that is, the input variables of the premise and consequence parts, the membership functions, and the order of the polynomial are set. Then the parameters of the identified structure are tuned in the second process. Case B (consecutive tuning) also includes structure and parameter tuning; however, the two tuning processes are not done separately but at the same time. That is, the input variables of the premise and consequence parts, the parameters of the membership functions, and the order of the polynomial are tuned together. Table 1. Performance index of EoFNN for the nonlinear function
Case   Premise Inputs    MFs     Para.     Consequence Inputs   Order   PI      E_PI
A      GAs (x1,x2,x3)    2×2×2   Min-Max   GAs                  GAs     1.856   4.500
       Tuned             2×2×2   GAs       Tuned                Tuned   0.255   0.875
B      GAs (x1,x2,x3)    2×2×2   GAs       GAs                  GAs     0.246   0.341
R1: If x1 is A11 and x2 is A21 and x3 is A31 then Cy1 = w01 + w11·x1 + w31·x3
R2: If x1 is A11 and x2 is A21 and x3 is A32 then Cy2 = w02 + w12·x1 + w22·x2 + w32·x3
R3: If x1 is A11 and x2 is A22 and x3 is A31 then Cy3 = w03 + w13·x1 + w23·x2 + w33·x3
R4: If x1 is A11 and x2 is A22 and x3 is A32 then Cy4 = w04 + w24·x2
R5: If x1 is A12 and x2 is A21 and x3 is A31 then Cy5 = w05 + w15·x1 + w25·x2 + w35·x3
R6: If x1 is A12 and x2 is A21 and x3 is A32 then Cy6 = w06
R7: If x1 is A12 and x2 is A22 and x3 is A31 then Cy7 = w07 + w27·x2 + w37·x3
R8: If x1 is A12 and x2 is A22 and x3 is A32 then Cy8 = w08 + w28·x2
(8)
R1: … then Cy1 = w01 + w11·x1 + w31·x3
R2: … then Cy2 = w02
R3: … then Cy3 = w03 + w23·x2
R4: … then Cy4 = w04
R5: … then Cy5 = w05 + w15·x1 + w25·x2 + w35·x3
R6: … then Cy6 = w06 + w36·x3
R7: … then Cy7 = w07 + w17·x1 + w27·x2 + w37·x3
R8: … then Cy8 = w08 + w28·x2 + w38·x3

(9)
The EoFNN designed by method A consists of 8 (= 2³) fuzzy rules, as in (8). Method B carries out the tuning of structure and parameters at the same time, unlike method A. The EoFNN designed by method B consists of 8 (= 2³) fuzzy rules, as in (9). The premise structures of the fuzzy rules are the same as those of method A. The output characteristic of the architecture obtained by means of method B is better than that of method A. As shown in (8) and (9), the consequence structure of the fuzzy rules includes 23 parameters in (8) and 20 parameters in (9), respectively. Therefore, the EoFNN generated by method B is preferred as an optimal architecture with respect to output performance and model simplicity. Table 2 covers a comparative analysis including several previous models. The proposed EoFNNs come with higher accuracy and improved prediction capabilities.
Table 2. Comparison of performance with other modeling methods

Model                                PI      E_PI    No. of rules
Linear model [4]                     12.7    11.1    —
GMDH [4,10]                          4.7     5.7     —
Sugeno's [3,4]     Fuzzy I           1.5     2.1     3
                   Fuzzy II          1.1     3.6     4
Shin-ichi's [7]    FNN Type 1        0.84    1.22    8(2³)
                   FNN Type 2        0.73    1.28    4(2²)
                   FNN Type 3        0.63    1.25    8(2³)
FNN [9]            Simplified        2.865   3.206   9(3+3+3)
                   Linear            2.670   3.063   9(3+3+3)
Multi-FNN [9]      Simplified        0.865   0.956   9(3+3+3)
                   Linear            0.174   0.689   9(3+3+3)
Proposed model (EoFNN)               0.246   0.341   8(2×2×2)
5 Concluding Remarks In this paper, new architectures and a comprehensive design methodology of Evolutionally optimized Fuzzy Neural Networks (EoFNN) have been discussed for the effective analysis and solution of nonlinear problems and complex systems. In addition, dynamic search-based GAs have been introduced to achieve rapid optimal convergence over a limited region or under a boundary condition. The proposed EoFNN is based on Fuzzy Neural Networks (FNN) with an extended structure of the fuzzy rules. The structure and parameters of the proposed EoFNN are optimized by GAs. The proposed EoFNN can be designed to adapt to the characteristics of a given system; moreover, it has the ability
to derive a simple structure from a complicated model of a nonlinear system. This methodology can effectively reduce the number of parameters and improve the performance of a model. Acknowledgements. This work has been supported by KESRI (I-2004-0-074-0-00), which is funded by MOCIE (Ministry of Commerce, Industry and Energy).
References
1. Pedrycz, W., Peters, J.F.: Computational Intelligence and Software Engineering. World Scientific, Singapore (1998)
2. Chan, L.W., Fallside, F.: An Adaptive Training Algorithm for Back Propagation Networks. Computer Speech and Language 2 (1987) 205-218
3. Kang, G., Sugeno, M.: Fuzzy Modeling. Transactions of the Society of Instrument and Control Engineers 23(6) (1987) 106-108
4. Park, M.Y., Choi, H.S.: Fuzzy Control System. Daeyoungsa, Korea (1990) 143-158
5. Goldberg, D.E.: Genetic Algorithms in Search, Optimization & Machine Learning. Addison-Wesley (1989)
6. Cordon, O. et al.: Ten Years of Genetic Fuzzy Systems: Current Framework and New Trends. Fuzzy Sets and Systems 141(1) (2004) 5-31
7. Horikawa, S.I., Furuhashi, T., Uchikawa, Y.: On Fuzzy Modeling Using Fuzzy Neural Networks with the Back Propagation Algorithm. IEEE Transactions on Neural Networks 3(5) (1992) 801-806
8. Park, H.S., Oh, S.K.: Rule-Based Fuzzy-Neural Networks Using the Identification Algorithm of GA Hybrid Scheme. International Journal of Control Automation and Systems 1(1) (2003) 101-110
9. Park, H.S., Oh, S.K.: Multi-FNN Identification Based on HCM Clustering and Evolutionary Fuzzy Granulation. International Journal of Control Automation and Systems 1(2) (2003) 194-202
10. Kondo, T.: Revised GMDH Algorithm Estimating Degree of the Complete Polynomial. Transactions of the Society of Instrument and Control Engineers 22(9) (1986) 928-934
A New Fuzzy ART Neural Network Based on Dual Competition and Resonance Technique Lei Zhang, Guoyou Wang, and Wentao Wang Lab. For Biometric and Medical Image Processing, Institute for Pattern Recognition and Artificial Intelligence, Huazhong University of Science & Technology, Wuhan, 430074, China
[email protected]
Abstract. A new Fuzzy ART neural network model based on a dual competition and resonance technique is proposed. This new model extends the competition and resonance method of the input nodes to the output nodes, combining input-node and output-node competition and resonance, and thus solves the over-segmentation of the FART algorithm caused by increases of the vigilance in image segmentation applications. Experimental results have shown that, compared with the original FART, this algorithm has better segmentation and antinoise performance.
1 Introduction Fuzzy ART [1] was introduced into the neural network literature by Carpenter et al. in 1991, and since then it has been established as one of the premier neural network architectures for solving classification problems. To deal with supervised learning tasks, the so-called Fuzzy ARTMAP classifier was developed around the combination of two Fuzzy ART modules by Carpenter et al. in 1992. To reduce the sensitivity of Fuzzy ARTMAP to the order of training samples, output combinations of independently trained Fuzzy ARTMAP systems were proposed [2], [3], [4]: Williamson proposed a Fuzzy ARTMAP based on Gaussian distributions (Gaussian ARTMAP), Verzi improved Fuzzy ARTMAP in 1998 and proposed BARTMAP, and Eduardo proposed an improved neural network in 2002. However, all of these algorithms are sensitive to the vigilance: as the vigilance increases, the number of output patterns grows saliently, the output patterns become very similar to each other, and the false segmentation rate increases. To solve this problem, we propose a new Fuzzy ART neural network model based on a dual competition and resonance technique. This new model extends the competition and resonance method of the input nodes to the output nodes, and combines output patterns according to the vigilance to avoid high similarity between the output patterns.
2 FART Neural Network and Our New Model The structure of the FART neural network is illustrated in Fig. 1(a). It has three fields, denoted F0, F1, and F2. The node of field F0 expresses
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 792 – 797, 2006. © Springer-Verlag Berlin Heidelberg 2006
the input vector, the field F2 expresses the active code, and the field F1 receives the bottom-up input from the field F0 and the top-down input from the field F2 and performs matching. The activity vector of the field F0 is expressed as I = (I1, I2, …, Im), where Ii ∈ [0,1], i = 1, 2, …, m; the activity vector of the field F1 is expressed as x = (x1, x2, …, xm); the activity vector of the field F2 is expressed as y = (y1, y2, …, ym). When y = 0, learning proceeds without sample training; when y ≠ 0, learning proceeds with sample training. Although the fuzzy ART neural network is an effective classifier, it suffers from memory instability. When new samples occur, the training of the vectors of the output nodes in F2 has adverse impacts on their similarity with the other categories, and this leads to the situation that the vigilance cannot control the independence of the output nodes. Consequently, the final classification results contain a tremendous amount of false classification. Aiming at this deficiency of the standard Fuzzy ART neural network, a new Fuzzy ART neural network based on a dual competition and resonance technique, as shown in Fig. 1(b), is proposed. Based on the standard Fuzzy ART network, after each input-node resonance we add a resonance process for the output nodes in F2. If all the nodes of F2 are tried and none can satisfy the vigilance ρt, the resonance of the output nodes completes. With this improvement, we join the nodes in F2 according to the similarity between them; this ensures low similarity among the output nodes and reduces the impact of the vigilance on the final number of classes.
Fig. 1. (a) The model of the Fuzzy ART algorithm; (b) the model of the improved Fuzzy ART algorithm
3 Fuzzy ART Neural Network Based on Dual Competition and Resonance Technique The algorithm of Fuzzy ART neural network based on dual competition and resonance technique contains complement coding, input node selection, input node
resonance, learning processing, and output-node resonance. Compared with the standard Fuzzy ART algorithm, we add output-node processing after each input node's resonance to ensure low similarity between the output nodes. The algorithm of the Fuzzy ART neural network based on the dual competition and resonance technique is explained below:
a) Complement Coding
We normalize the input vectors and apply complement coding, so that all coded input vectors have the same norm. If a ∈ [0,1]^M in the field F0, the normalized network input vector is I = (a, a^c) ∈ [0,1]^(2M), where a^c = {ai^c | ai^c = 1 − ai}.
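A minimal sketch of complement coding, assuming plain Python lists for the input vector a:

```python
def complement_code(a):
    """Complement coding: map a in [0,1]^M to I = (a, a^c) in [0,1]^(2M).
    The city-block norm |I| then equals M for every input, which gives the
    constant-norm property the algorithm relies on."""
    return list(a) + [1.0 - ai for ai in a]
```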
Let the weight vectors of the nodes of F2 be initialized as wji = 1, and set all the nodes in F2 as not restricted; once a node is selected, it becomes restricted.
b) Input Node Selection
For the nodes of F2, the selection uses the winner-take-all method. When an input vector arrives, all the nodes in F2 are activated and take part in the competition. In order to select the winner node, the choice function Tj is calculated as follows:

Tj(I) = |I ∧ wj| / (α + |wj|)
(1)
If at the moment t a node of F2 is active, a class choice is executed in the system; the selected class is marked J, as follows:

TJ = max j {Tj : j = 1, 2, 3, …, N}

(2)

Tj(I) measures the matching degree between the current input vector I and wj, the weight vector of the node of category j. If more than one node attains the maximum value, the node with the smallest index is selected. When the class J is selected, yJ = 1, and yj = 0 for all j ≠ J.
c) Input Node Resonance
The two vectors resonate in the field F1. The activity vector of F1 is calculated as follows:

x = I,        if F2 is not active
x = I ∧ wj,   if F2 is active

(3)
If the node J is active, I, the input vector from F0, and wj, the expected vector from F2, are matched; when x and I are of high similarity, they form a new vector x. The similarity is calculated as follows:

|I ∧ wj| / |I| ≥ ρ

(4)

Here, ρ is the vigilance of the input node.
A New Fuzzy ART Neural Network Based on Dual Competition
795
If formula (4) is satisfied, resonance occurs and wj is updated. Otherwise, if |I ∧ wj| / |I| < ρ, the system is reset, the node is forbidden (Tj = 0), and the search for the next winner node continues. If all the nodes of F2 have been tried and none satisfies formula (4), a new node is added to F2 with weight wN+1 = 1, and learning continues with the new input.
d) Learning Processing
If resonance of the system occurs, then the weight is updated as:

wj(new) = β (I ∧ wj(old)) + (1 − β) wj(old)

(5)
Here β ∈ [0,1] is the learning rate, and β = 1 stands for fast learning.
e) Output Node Resonance
Each pair of output-node vectors resonates in the field F2; the activity vector of F2 is calculated as follows:

x = wi,        if F2 is not active
x = wi ∧ wj,   if F2 is active

(6)
If the node i is active, wi and wj, an expected vector from F2, are matched; when wi and wj are of high similarity, they form a new vector x. The similarity is calculated as follows:

|wi ∧ wj| / max(|wi|, |wj|) ≥ ρt,   i, j ∈ F2, i ≠ j

(7)
Here, ρt is the vigilance of the output node. If formula (7) is satisfied, resonance occurs and wi is updated as follows:

wi(new) = wi(old) ∧ wj,   i, j ∈ F2, i ≠ j

(8)

Otherwise, if |wi ∧ wj| / max(|wi|, |wj|) < ρt, the system is reset, the node is forbidden, and the search continues with the next output node. If all the nodes of F2 have been tried and none satisfies formula (7), the resonance of the output nodes completes.
f) Searching for the Next Input Node
Search for the next input node, return to a), and repeat the above processing. When all the input nodes have been trained, the processing is over, and the number of nodes in F2 is the number of classes.
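Steps a)-f) can be sketched as follows. This is a simplified reading, not the authors' code: in particular, the handling of the absorbed node j after the merge of eq. (8) (deleted here) and the winner-search order are assumptions of this sketch.

```python
import numpy as np

def fart_dual(inputs, rho=0.9, rho_t=0.9, alpha=0.001, beta=1.0):
    """Dual competition/resonance Fuzzy ART sketch. `inputs` are
    complement-coded rows; fast learning (beta = 1) is the default."""
    W = []                                   # F2 weight vectors
    for I in inputs:
        # b)-c) winner-take-all choice (eq. (1)-(2)) and vigilance (eq. (4))
        T = [np.minimum(I, w).sum() / (alpha + w.sum()) for w in W]
        for j in np.argsort(T)[::-1]:
            if np.minimum(I, W[j]).sum() / I.sum() >= rho:
                # d) learning, eq. (5)
                W[j] = beta * np.minimum(I, W[j]) + (1 - beta) * W[j]
                break
        else:
            W.append(I.copy())               # commit a new F2 node
        # e) output-node resonance: merge overly similar nodes, eq. (7)-(8)
        merged = True
        while merged:
            merged = False
            for i in range(len(W)):
                for j in range(i + 1, len(W)):
                    sim = np.minimum(W[i], W[j]).sum() / max(W[i].sum(),
                                                             W[j].sum())
                    if sim >= rho_t:
                        W[i] = np.minimum(W[i], W[j])
                        del W[j]             # assumed fate of the merged node
                        merged = True
                        break
                if merged:
                    break
    return W
```

Lowering ρt merges more output nodes, which is the mechanism the paper uses to keep the category count from exploding as ρ grows.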
4 Experiment and Results A sample image (360×288), shown in Fig. 2, is adopted to illustrate the difference in the output vectors between the original FART algorithm and the one proposed in this article. We choose the gray value and the mean value of the eight-neighborhood as the input vector of each pixel, and expand it to 4 dimensions by using complement
coding. In the algorithm, we set β = 1, using the fast learning algorithm. Fig. 2(a) is the original image. The segmentation results using the original FART algorithm and the proposed algorithm when ρ = 0.9 are shown in Fig. 2(b) and (c), respectively. In order to test the antinoise performance of the proposed algorithm, we add Gaussian noise to the original image, as shown in Fig. 2(d); the segmentation results of the original FART and the proposed method are shown in Fig. 2(e) and (f), respectively. As Table 1 shows, as the vigilance increases, the threshold of similarity selection becomes more rigid and the number of categories also increases. Moreover, from the data of the output vectors obtained under the condition ρ = 0.8, it is clear that using the original FART algorithm results in high similarity between many different categories. For example, the similarity rates between the 2nd and the 4th and 5th categories, as well as between the 1st and 6th, are all over 80%; in fact, these five categories should be merged into only two. Due to the memory instability of the algorithm, redundancy of the output categories exists. However, the algorithm proposed in this article enhances stability by adding a constraint via competition and resonance between the output nodes while retaining the one between the input nodes, which can tremendously control the abnormal increase of the number of categories and restrain the similarity of the output nodes.
Fig. 2. The results of the test images: (a) the original image; (b) the result of the FART when ρ = 0.9; (c) the result of the proposed method when ρ = ρt = 0.9; (d) the image with noise; (e) the result of the FART when ρ = 0.9; (f) the result of the proposed method when ρ = ρt = 0.9

Table 1. The comparison of the output patterns of the Fuzzy ART algorithm and the proposed method as controlled by the vigilance

ρ                                          ≤0.5   0.6   0.8   0.9
Output patterns of the original FART       1      2     6     12
Output patterns of the new FART            1      2     3     4
A New Fuzzy ART Neural Network Based on Dual Competition
797
Table 2. The eigenvectors of the output patterns of the standard FART algorithm when ρ = 0.8

No.                    1        2        3        4        5        6
Weighting vector       0.0000   0.1586   0.3019   0.1490   0.1701   0.0056
of the output node     0.7803   0.6313   0.4705   0.8509   0.6034   0.7603
                       0.0100   0.1769   0.3103   0.2091   0.1690   0.0204
                       0.8108   0.6392   0.5146   0.7908   0.6620   0.8210
Table 3. The eigenvectors of the output patterns of the improved FART algorithm when ρ = ρt = 0.8

No.                    1        2        3
Weighting vector       0.0000   0.1586   0.3019
of the output node     0.7603   0.6034   0.4705
                       0.0100   0.1579   0.3103
                       0.7608   0.6230   0.5146
5 Conclusion A new Fuzzy ART neural network model based on a dual competition and resonance technique has been proposed. This new model solves the over-segmentation of the FART algorithm by combining the output patterns. Experimental results and theoretical analysis have shown that, compared with the original FART, this algorithm has better clustering and antinoise performance. However, the proposed algorithm still cannot solve the over-training in the input-node resonance and learning processes; more attention should be paid to the study of how to make the algorithm keep its sensitivity to new input data while maintaining the stability of the trained data.
Acknowledgment This work was supported by the Aeronautical Fund of P. R. China (1.3JW0505) and the National Defense Science Foundation of P. R. China (51401020201JW0521).
References
1. Carpenter, G.A., Grossberg, S., Rosen, D.B.: Fuzzy ART: Fast Stable Learning and Categorization of Analog Patterns by an Adaptive Resonance System. Neural Networks 4 (1991) 759–771
2. Carpenter, G.A., Grossberg, S., Markuzon, N.: Fuzzy ARTMAP: A Neural Network Architecture for Incremental Supervised Learning of Analog Multidimensional Maps. IEEE Transactions on Neural Networks 3 (1992) 698–713
3. Huang, J., Georgiopoulos, M., Heileman, G.L.: Fuzzy ART Properties. Neural Networks 8 (1995) 202–213
4. Williamson, J.R.: Gaussian ARTMAP: A Neural Network for Fast Incremental Learning of Noisy Multidimensional Maps. Neural Networks 9 (1996) 881–897
Simulated Annealing Based Learning Approach for the Design of Cascade Architectures of Fuzzy Neural Networks Chang-Wook Han and Jung-Il Park School of Electrical Engineering & Computer Science, Yeungnam University, 214-1, Dae-dong, Gyongsan, Gyongbuk, 712-749 South Korea {cwhan, jipark}@yumail.ac.kr
Abstract. This paper is concerned with an optimization method for cascade architectures of fuzzy neural networks. The structure of the network, which involves the selection of a subset of input variables and their distribution across the individual logic processors (LPs), is optimized with the use of genetic algorithms (GA). We discuss random signal-based learning employing simulated annealing (SARSL), a local search technique aimed at further refinement of the connections of the neurons (GA-SARSL). A standard data set is discussed with respect to the performance of the constructed networks and their interpretability.
1 Introduction The cascade architectures of fuzzy neural networks [1] implement a logic-based mapping between input and output spaces viewed as multidimensional unit hypercubes; that is, we are concerned with relationships from [0,1]^n to [0,1]^m, and the networks thus come with interpretation transparency augmented by learning capabilities. The cascade nature of the network supports modularity and allows building reduced models on the basis of a subset of the most essential input variables. This feature is particularly appealing when we are concerned with the design of fuzzy models in the presence of a high number of variables whose reduction becomes inevitable. The primary objective of this study is to develop the cascade architectures of fuzzy neural networks with a genetic algorithm (GA) and random signal-based learning employing simulated annealing (SARSL), analyze their properties, and introduce a comprehensive development environment. The GA optimizes the order of the variables and the binary connections of the neurons, and then SARSL refines the binary connections; we call this scheme GA-SARSL. The results are compared with the optimization method in [1], which considered gradient-based learning rather than SARSL.
2 Cascade Architectures of Fuzzy Neural Network Logic processors (LPs) consisting of AND and OR neurons are the basic functional modules of the network and are combined into a cascaded structure. The essence of this architecture is to stack the LPs one on another. This results in a certain sequence of
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 798 – 803, 2006. © Springer-Verlag Berlin Heidelberg 2006
input variables. To assure that the resulting network is homogeneous, we use LPs with only two inputs, as shown in Fig. 1. In this sense, with "n" input variables, we end up with (n−1) LPs being used in the network. Each LP is fully described by a set of connections (V and w). To emphasize the cascade-type architecture of the network, we index each LP by referring to its connections as V[ii] and w[ii], with "ii" being the index of the LP in the cascade sequence. For more details about the networks, please refer to [1].
Fig. 1. A cascaded network realized as a nested collection of LPs (x1, …, xn are the inputs; z1, … are the intermediate variables)
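The evaluation of such a cascade can be sketched as follows. The internal size of each LP (a layer of two-input AND neurons feeding a single OR neuron) and the connection convention are assumptions of this sketch; the algebraic product is used as the t-norm and the probabilistic sum as the s-norm, the pair used later in the experiments.

```python
def s_norm(a, b):
    return a + b - a * b          # probabilistic sum

def t_norm(a, b):
    return a * b                  # algebraic product

def and_neuron(x, v):
    """AND neuron: t-norm aggregation of s-norm(x_i, v_i) pairs."""
    out = 1.0
    for xi, vi in zip(x, v):
        out = t_norm(out, s_norm(xi, vi))
    return out

def or_neuron(z, w):
    """OR neuron: s-norm aggregation of t-norm(z_i, w_i) pairs."""
    out = 0.0
    for zi, wi in zip(z, w):
        out = s_norm(out, t_norm(zi, wi))
    return out

def cascade(xs, V, w):
    """Evaluate the nested cascade of (n-1) two-input LPs: LP1 combines
    (x1, x2); each later LP combines the previous intermediate variable
    with the next input. V[ii] lists the AND-neuron connections of LP ii,
    w[ii] the OR-neuron connections."""
    z = xs[0]
    for ii in range(len(xs) - 1):
        pair = [z, xs[ii + 1]]
        hidden = [and_neuron(pair, v) for v in V[ii]]
        z = or_neuron(hidden, w[ii])
    return z
```

With zero AND connections and a unit OR connection, a single LP reduces to a plain product of its two inputs, which is a quick sanity check on the connection semantics.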
3 GA-SARSL Development Evolutionary optimization [2][3] is an attractive avenue to exploit in the development of the cascade network. In this learning scenario, we arrange all elements to be optimized (that is, a sequence and a subset of the input variables, and the connections of the logic processors) into a single chromosome and carry out their genetic optimization. The considered form of the chromosome for this optimization is described in Fig. 2. The input sequence of the variables (which is involved in the structural optimization of the network) and the connections (parametric optimization) are the phenotype of the chromosome. The sequence gene of the input variables to be used in the model (we allow for a high level of flexibility by choosing only a subset of the input variables) consists of "n" real numbers in the unit interval. These entries are assigned integer numbers that correspond to their ranking in the chromosome. The first "p" entries of the resulting sequence (assuming that we are interested in "p" variables) are then used in the structure of the network. For instance, if the chromosome consists of the entries 0.5 0.94 0.1 0.82 0.7 (n = 5) and we are interested in p = 3 variables, the ranking leads to the sequence of integers 2 4 5 1 3, and we choose x2, x4, and x5.
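The decoding of the sequence gene can be sketched as follows, reproducing the example above (the ranking is read here as ordering the variables by descending gene value, which matches the example):

```python
def decode_sequence(genes, p):
    """Order the variables by their real-valued sequence genes (largest
    first) and keep the first p. For genes (0.5, 0.94, 0.1, 0.82, 0.7)
    the order is x2 x4 x5 x1 x3, and p = 3 selects x2, x4, x5."""
    order = sorted(range(len(genes)), key=lambda i: genes[i], reverse=True)
    sequence = [i + 1 for i in order]        # 1-based variable indices
    return sequence, sequence[:p]
```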
Fig. 2. Structure of a chromosome for the optimization (sequence genes 1 … n followed by the weights of LP1 … LPp−1)
The connections resulting from the GA are binary. Instead of going ahead with continuous connections, the intent of the GA is to focus on the structure and rough (binary) values of the connections. This is legitimate in light of the general character of genetic optimization: we can explore the search space, but there is no guarantee that a detailed solution will be found. The promising Boolean solution can next be refined by allowing the values of the connections to take any value in the unit interval. Such refinement is accomplished by the SARSL, which is quite complementary to the GA: while it can easily be trapped in a local minimum, it leads to a detailed solution. The complete learning mode is thus composed as a sequence of GA followed by SARSL, expressed as GA-SARSL. The SARSL [4] is a modified random signal-based learning (RSL) aided by simulated annealing (SA). The main idea of RSL is that a random value randomly agitates the state within the range of the learning rate in order to find the optimal state. If the learning rate is quite small, the state can exploit the search space (fine learning) but cannot explore it (coarse learning). On the contrary, if the learning rate is big enough, exploration is possible rather than exploitation; this is a trade-off between small and large learning rates. Since the candidate solution moves in a downhill direction very quickly with a small learning rate, RSL is a very effective algorithm for local search. The SA allows a system to change its state to a higher-energy state occasionally, so that it has a chance to jump out of local minima and seek the global minimum [5]. The SA possesses a formal proof of convergence to the global optimum [6][7]. This convergence proof relies on a very slow cooling schedule for setting the temperature. While this cooling schedule is impractical, it identifies a useful trade-off, where longer cooling schedules tend to lead to better-quality solutions.
In SA, downhill moves are always accepted, whereas uphill moves are accepted with a probability (P) that is a function of the temperature. Briefly, the SARSL proceeds as follows:

Step 1. Initialize the state at random. Initialize the temperature T.
Step 2. Evaluate the new state using RSL.
Step 3. Let Qold = performance of the old state and Qnew = performance of the new state.
Step 4. If Qnew is better than Qold, or if the acceptance probability P (a function of T and Qnew − Qold) exceeds a uniform random number, then the new state is accepted; else the new state is rejected.
Step 5. Cool down the temperature as T = T * cooling rate.
Step 6. If the stopping criterion is satisfied, end the algorithm; else go to Step 2.
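The loop in Steps 1–6 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the RSL move is simplified to a uniform agitation of every state component within the learning rate, the acceptance probability is taken to be the usual Metropolis form exp(−(Qnew − Qold)/T), and all names (`sarsl`, `cost`, etc.) are hypothetical.

```python
import math
import random

def sarsl(cost, init_state, lr=0.05, t0=1.0, cooling=0.98, max_iter=2000, seed=0):
    """Sketch of SA-based random signal learning (Steps 1-6).

    `cost` maps a state (list of floats in [0,1]) to a performance value
    to be minimized; the RSL move is simplified here to a uniform
    agitation of each component within +/- lr.
    """
    rng = random.Random(seed)
    state = list(init_state)
    q_old = cost(state)
    temp = t0
    for _ in range(max_iter):
        # Step 2: agitate the state randomly within the learning rate (RSL move),
        # clipping each component back into the unit interval.
        new_state = [min(1.0, max(0.0, s + rng.uniform(-lr, lr))) for s in state]
        # Step 3: evaluate the performance of the new state.
        q_new = cost(new_state)
        # Step 4: always accept downhill moves; accept uphill moves with
        # probability exp(-(Qnew - Qold) / T) (Metropolis criterion).
        if q_new <= q_old or math.exp(-(q_new - q_old) / temp) > rng.random():
            state, q_old = new_state, q_new
        # Step 5: cool down the temperature.
        temp *= cooling
    return state, q_old
```

With the Table 1 settings (learning rate 0.05, initial temperature 1, cooling rate 0.98, 2000 iterations), the loop behaves almost like pure RSL once the temperature has decayed, which matches the intended role of the SARSL as a local refiner of the GA's binary solution.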
4 Experimental Studies

The experiments deal with the Box-Jenkins gas-furnace data [8], which is commonly used in the development of various models. The genetic optimization is supported by a standard form of GA with real-number encoding, as shown in Fig. 2. The parameters of the optimization environment are listed in Table 1 and are fixed for all experiments. In the experiments, we confine ourselves to the two most commonly encountered models of triangular norms, implementing the t-norm as the algebraic product and realizing the s-norm in the form of the probabilistic sum.
Simulated Annealing Based Learning Approach for the Design of Cascade Architectures
801
Table 1. Experimental setup of the parameters of the optimization environment
Algorithm   Parameter               Value
GA          Population size         100
            Maximum generation      200
            Crossover rate          0.8
            Mutation rate           0.03
SARSL       Learning rate           0.05
            λ                       5
            Initial temperature     1
            Cooling rate            0.98
            Maximum iteration no.   2000
A GA optimizes all parameters (binary connections) and the structure. By carrying out SARSL refinement, we enhance the parametric details of the network and reveal how much they impact its performance. The performance of the network is described by the RMSE between the desired output and the actual output. To obtain reliable experimental results, each network was simulated 10 times independently.

In this experiment, the Box-Jenkins gas-furnace data are used to compare the performance of the proposed method with the method in [8], which constructed a fuzzy system based on the significance degree of the input variables. The method used in [1] is also considered for comparison. The Box-Jenkins gas-furnace data form a single-input single-output time series for a gas furnace process, with gas flow rate u(t) as input and CO2 concentration y(t) as output. We consider the 10 input variables described in [8]. The first half of the data is used for training and the remaining data for testing. For each variable, we set up three Gaussian fuzzy sets with an overlap of 0.5; with these fuzzy sets, we end up with 30 inputs (fuzzy sets). For all variables we accept three labels: SMALL, MEDIUM, and LARGE.

In the experiments, we limit the number of input variables and consider 3, 5, 7, and 9 inputs. The selection of these variables is also genetically optimized; in this way we can reduce the size of the model as well as reveal a hierarchy of the input fuzzy sets. Table 2 shows the average performance index (RMSE) over 10 independent simulations for the GA mode of learning with 3, 5, 7, and 9 input variables (fuzzy sets) and for the SARSL refinement of the result obtained by the GA (GA-SARSL). Shaded entries indicate the best results of the performance index. The values in brackets are the results for the testing set.
In this table, "Method in [1]" denotes the results obtained using a gradient-based learning method rather than the SARSL for further refinement of the GA-optimized binary connections. From the results we can see that the binary connections optimized by the GA are indeed further refined by the SARSL. Strictly speaking, a direct comparison of the proposed method with the algorithm introduced in [8] is not possible because the same training and testing data were not used. Nevertheless, the performance of the proposed method is superior to that of the algorithms in [8] and [1] with respect to the testing-data prediction RMSE. The interpretation of the networks leads to interesting results. We assume that the inputs are binary to simplify the rules, and organize them in Table 3, which shows one of the ten simulations with 5 input variables (fuzzy sets).
Table 2. Average performance index (RMSE) over 10 independent simulations for the GA mode of learning with 3, 5, 7, and 9 input variables, and SARSL refinement of the result obtained by GA (GA-SARSL). The values in brackets are the results for the testing set.
Algorithm       3            5            7            9
GA              0.50 (0.53)  0.45 (0.47)  0.46 (0.49)  0.45 (0.48)
GA-SARSL        0.40 (0.42)  0.35 (0.37)  0.37 (0.38)  0.36 (0.38)
Method in [1]   0.41 (0.42)  0.37 (0.39)  0.38 (0.39)  0.37 (0.39)
Table 3. A collection of fuzzy rules produced by the resulting networks
Algorithm   Rules
GA          If [y(t-1) is small AND u(t-4) is large AND u(t-3) is large] OR [y(t-2) is small] then [y(t) is small]
            If [y(t-1) is medium AND u(t-4) is medium] then [y(t) is medium]
            If [y(t-1) is large AND u(t-4) is medium AND u(t-3) is medium] then [y(t) is large]
GA-SARSL    If [y(t-1) is small AND u(t-4) is large AND u(t-3) is large] then [y(t) is small]
            If [y(t-1) is medium AND u(t-4) is medium] then [y(t) is medium]
            If [y(t-1) is large AND u(t-4) is medium AND u(t-3) is medium] then [y(t) is large]
5 Conclusions

We have introduced and developed cascade architectures of fuzzy neural networks designed with GA-SARSL. The network exhibits a transparent logic structure that supports interpretation. We showed that the genetically optimized network helps concentrate processing on a subset of the most essential variables. The comprehensive development environment built around GA-SARSL supports the complete structural and parametric design of the network. The experimental results quantify the performance of the network and provide an interesting interpretation of the available data.
References

1. Pedrycz, W., Reformat, M., Han, C.W.: Cascade Architectures of Fuzzy Neural Networks. Fuzzy Optimization and Decision Making 3(1) (2004) 5-37
2. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, New York (1989)
3. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. 3rd edn. Springer-Verlag, Berlin (1996)
4. Han, C.W., Park, J.I.: Design of a Fuzzy Controller Using Random Signal-based Learning Employing Simulated Annealing. Proc. of the 39th IEEE Conference on Decision and Control (2000) 396-397
5. Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P.: Optimization by Simulated Annealing. Science 220(4598) (1983) 671-680
6. Romeo, F., Sangiovanni-Vincentelli, A.: A Theoretical Framework for Simulated Annealing. Algorithmica 6 (1991) 302-345
7. Sullivan, K.A., Jacobson, S.H.: A Convergence Analysis of Generalized Hill Climbing Algorithms. IEEE Trans. Automatic Control 46(8) (2001) 1288-1293
8. Kilic, K., Sproule, B.A., Turksen, I.B., Naranjo, C.A.: A Fuzzy System Modeling Algorithm for Data Analysis and Approximate Reasoning. Robotics and Autonomous Systems 49 (2004) 173-180
A New Fuzzy Identification Method Based on Adaptive Critic Designs

Huaguang Zhang¹, Yanhong Luo¹, and Derong Liu²

¹ School of Information Science and Engineering, Northeastern University, Shenyang, Liaoning 110004, P.R. China
² Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, Illinois 60607-7053, USA
[email protected]
Abstract. A new fuzzy identification method for unknown nonlinear discrete systems is presented by introducing adaptive critic designs into the generalized fuzzy hyperbolic model (GFHM). This method minimizes the long-time error rather than the immediate error to improve the identification result. We first represent the GFHM with a neural network structure, and then utilize adaptive critic designs (ACDs) to obtain the optimal parameters of the network so that the long-time identification error is minimized. Finally, we give a simulation example to verify the effectiveness of this identification method.
1 Introduction
Since the 1960s, much research has been done in the field of model identification of dynamic systems [1]. However, the field is still not complete, either in theory or in practical applications. Fuzzy identification is known to be a feasible scheme for solving the model identification problem. Recently, Zhang proposed a new class of fuzzy models, namely the fuzzy hyperbolic model (FHM) [2] and the generalized fuzzy hyperbolic model (GFHM) [3]. The GFHM has certain advantages over other fuzzy models and has been proved to be a universal approximator [3], so we choose the GFHM to model nonlinear systems in this paper. However, previous identification algorithms pay attention only to the immediate error between the outputs of the model and the plant, without considering the long-time error. Here we aim to improve the identification result by introducing ACDs into the identification scheme of the GFHM. ACDs are designs seeking optimal control in nonlinear environments. Because of the "curse of dimensionality" [4, 9, 11], a "critic" is built to approximate the cost function by using function approximation structures. There are three basic methods: Heuristic Dynamic Programming (HDP), Dual Heuristic Programming (DHP), and Globalized Dual Heuristic Programming (GDHP) [5, 8, 11]. If the critic module takes the action signal as part of its inputs, the designs are called ADACDs [6, 11]. In this paper, we choose ADACDs to realize the identification algorithm for the plant.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 804-809, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 The Neural Network Structure of GFHM for Nonlinear Systems
Consider a class of SISO discrete nonlinear systems

y(k + 1) = f(y(k), y(k − 1), …, y(k − n + 1), u(k), u(k − 1), …, u(k − p + 1))   (1)

where f(·) is an unknown nonlinear function, and u(k) ∈ R and y(k) ∈ R are the input and output at the kth instant. For MIMO systems, this method is also applicable. For convenience, we define two groups of new variables as follows:

x_1 = y(k), x_2 = y(k − 1), …, x_n = y(k − n + 1),
u_1 = u(k), u_2 = u(k − 1), …, u_p = u(k − p + 1).   (2)
In the GFHM, the membership functions of the fuzzy sets P_x and N_x are given as [3]:

μ_{Px}(x_i) = e^{−(x_i − k_i)²/2},   μ_{Nx}(x_i) = e^{−(x_i + k_i)²/2}   (3)

where k_i > 0. However, these fuzzy sets alone cannot cover the whole input space; therefore, the input variable x_i is transformed linearly.

Definition 1. Given a plant with n input variables r = (r_1(t), ···, r_n(t))^T and one output variable y, define the generalized input variables as

r̄_1 = r_1 − d_{11}, …, r̄_{w_1} = r_1 − d_{1 w_1}, r̄_{w_1+1} = r_2 − d_{21}, …, r̄_m = r_n − d_{n w_n}   (4)

to make the fuzzy sets cover the whole input space, where m = Σ_{i=1}^{n} w_i and d_{ij} is the jth linear shift of r_i. Then the output y is represented by 2^m rules, where the lth rule is given as:

R^l: IF (r_1 − d_{11}) is F_{r_{11}} and … and (r_1 − d_{1 w_1}) is F_{r_{1 w_1}} and (r_2 − d_{21}) is F_{r_{21}} and … and (r_2 − d_{2 w_2}) is F_{r_{2 w_2}} and … and (r_n − d_{n w_n}) is F_{r_{n w_n}} THEN y = c_{F_{11}} + ··· + c_{F_{1 w_1}} + c_{F_{21}} + ··· + c_{F_{2 w_2}} + ··· + c_{F_{n w_n}}

where the F_{r_{ij}} include the fuzzy subsets P_r and N_r, and the presence of the constant c_{F_{ij}} in the "THEN" part depends on whether F_{r_{ij}} appears in the "IF" part. Define x̄ and ū as generalized input vectors. We then obtain the fuzzy model [3]:

y = Σ_{i=1}^{m} p_i + Σ_{j=1}^{q} q_j + A_1 tanh(K_{x̄} x̄) + B tanh(K_{ū} ū)   (5)

where p_i and q_j are constants related to c_{F x̄_i} and c_{F ū_j}, A_1 and B are vectors related to c_{F x̄_i} and c_{F ū_j}, and K_{x̄} and K_{ū} are diagonal matrices constituted by k_{x̄_i} and k_{ū_j}. The corresponding detailed network structure of this fuzzy model is shown in Fig. 1, and will be called NGFHM in the following sections. The network model consists of one input layer, three hidden layers and one output layer, where x and u are the generalized input vectors given in (2), and the output is
Fig. 1. Neural network structure for GFHM
the approximate value of the real output at the (k + 1)st instant. The parameters d_ij and h_ij represent the linear shifts of x_i and u_j, and the biases p_i (i = 1, 2, …, m) and q_j (j = 1, 2, …, q) are fed to the nodes of the third hidden layer. Letting f_1(x) = f_3(x) = f_4(x) = x and f_2(x) = tanh(x), the nonlinear model is obtained as in (5). Finally, k_i = k_{x̄_i}, g_j = k_{ū_j}, and a_i, p_i and b_j, q_j (i = 1, 2, …, m; j = 1, 2, …, q) are the parameters left to be adjusted.
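As a concrete reading of Eq. (5), the sketch below evaluates the GFHM output for given generalized input vectors, passing the diagonal matrices K_x̄ and K_ū as 1-D arrays of their diagonals. The function name is hypothetical; the parameter values used in the check come from the simulation study in Section 4.

```python
import numpy as np

def gfhm_output(x_bar, u_bar, p, q, A1, B, kx, ku):
    """Eq. (5): y = sum_i p_i + sum_j q_j + A1.tanh(Kx x_bar) + B.tanh(Ku u_bar),
    where kx and ku hold the diagonals of K_x and K_u."""
    return (p.sum() + q.sum()
            + A1 @ np.tanh(kx * x_bar)
            + B @ np.tanh(ku * u_bar))
```

At x̄ = ū = 0 the tanh terms vanish, so the output reduces to the sum of the biases, which gives a quick sanity check of an implementation.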
3 ACD-Based Learning Algorithm for NGFHM
The diagram of the whole identification scheme is given in Fig. 2. Here the ACD is used to adjust the parameters of the NGFHM. The performance index is defined as

J[x(t), t] = Σ_{k=t}^{∞} γ^{k−t} U[x(k), u(k), k]   (6)

where U is the utility function and 0 < γ < 1 is the discount factor. According to [6], the output of the critic network will be the estimated value of the index function J(t) if we minimize the error function

E_c(t) = (1/2) (Ĵ(t) − [γ Ĵ(t + 1) + U(t)])²   (7)

where Ĵ(t) = Ĵ[x(t), u(t), t, w_c] and γ is the discount factor. The utility function U is defined as

U(t) = (y(t + 1) − ŷ(t + 1))².   (8)
Fig. 2. The identification structure diagram of the system with ACD learning algorithm
The critic network can be trained to approximate the index function J(t) by using the gradient-based weight-update rule

w_c(t + 1) = w_c(t) + β_c (−∂E_c(t)/∂w_c(t))   (9)

where β_c > 0 is the learning rate and w_c(t) is the weight vector of the critic network. The NGFHM is trained together with the critic network to obtain the optimal weight parameters. Here we aim to minimize the long-time error between the approximate value ŷ and the real output y, i.e., to minimize J(t). So the output objective can be defined as "0":

E_a(t) = (1/2) [J(t) − 0]²   (10)

Therefore, the weight-update rule is given by

w_a(t + 1) = w_a(t) + β_a (−∂E_a(t)/∂w_a(t))   (11)

where β_a > 0 is the learning rate and w_a(t) is the weight vector of the NGFHM. In summary, the critic network is first trained to approximate J(t), and then the NGFHM is trained to minimize J(t). After one training cycle is completed, we can stop or continue the training procedure according to the system's performance.
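The shape of updates (9) and (11) can be illustrated with a linear critic Ĵ(t) = w_c · φ(t) over some feature vector φ. The paper's critic is a neural network, so this is only a sketch of the computation, with all names (`critic_update`, `phi`, etc.) hypothetical; as in Eq. (7), the target γĴ(t+1) + U(t) is held fixed while differentiating.

```python
import numpy as np

def critic_update(wc, phi_t, phi_t1, U_t, gamma=0.95, beta_c=0.05):
    """One step of Eq. (9) for a linear critic J_hat(t) = wc . phi(t), with
    Ec(t) = 0.5 * (J_hat(t) - [gamma * J_hat(t+1) + U(t)])**2 as in Eq. (7)."""
    td_error = wc @ phi_t - (gamma * (wc @ phi_t1) + U_t)
    # dEc/dwc = td_error * phi(t); move against the gradient.
    return wc - beta_c * td_error * phi_t

def action_update(wa, J_t, dJ_dwa, beta_a=0.05):
    """One step of Eq. (11): Ea(t) = 0.5 * J(t)**2, so dEa/dwa = J(t) * dJ/dwa."""
    return wa - beta_a * J_t * dJ_dwa
```

In the full scheme `dJ_dwa` would be obtained by backpropagating the critic's output through the NGFHM; here it is simply passed in as a given vector.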
4 Simulation Study
Consider the system [7]:

y(k + 1) = [y(k) y(k − 1) y(k − 2) u(k − 1) (y(k) − 1) + 2u(k)] / [1 + y(k)² + y(k − 1)² + y(k − 2)²]
where the initial values are u(1) = u(2) = 0 and y(1) = y(2) = y(3) = 0. Assume the sampling interval is 1 s. Based on experience, we obtain the generalized input variables

x = [x_1, x_2, x_3, x_4]^T = [y(k) + 0.013, y(k) − 0.663, y(k − 1) + 0.124, y(k − 2) − 0.137]^T,
u = [u_1, u_2, u_3]^T = [u(k) + 0.654, u(k) + 0.652, u(k − 1) − 0.485]^T,

with initial parameters p_1 = −0.175, p_2 = 0.435, p_3 = 0.282, p_4 = −0.147, q_1 = 0.24, q_2 = −0.307, q_3 = 0.191, A_1 = [−0.18, 0.5467, −0.428, 0.988], B = [−0.471, 0.715, 0.4587], k_x = diag[−1.123, −0.431, 1.542, 2.874], k_u = diag[0.243, −0.716, 1.648]. After these initial values are obtained, we train the whole system using the MATLAB Neural Networks Toolbox. First the critic network is trained; after 46 epochs, the error is small enough. After the successful training of the critic network, we train the action network together with the critic network. The inputs of both the plant and the NGFHM are set to sin(2πk/25), and the learning rate is set to 0.4. After 800 training cycles, we obtain the action network's simulation result (for a total of 68 time instants) shown in Fig. 3. The tracking performance is satisfactory.

Fig. 3. The identification result, where the solid line denotes the estimated output y_es and the dashed line denotes the real output y of the plant (vertical axis: output value; horizontal axis: time instant, 0-90)
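The benchmark plant and the excitation u(k) = sin(2πk/25) are easy to reproduce; the sketch below generates the reference trajectory that the NGFHM is trained to track. The function name and the 0-based indexing are illustrative only.

```python
import math

def simulate_plant(n_steps=68):
    """Roll out y(k+1) = [y(k)y(k-1)y(k-2)u(k-1)(y(k)-1) + 2u(k)]
    / [1 + y(k)^2 + y(k-1)^2 + y(k-2)^2] driven by u(k) = sin(2*pi*k/25)."""
    y = [0.0, 0.0, 0.0]        # y(1) = y(2) = y(3) = 0
    u = [0.0, 0.0]             # u(1) = u(2) = 0
    for k in range(3, n_steps):
        u.append(math.sin(2 * math.pi * k / 25))
        num = y[-1] * y[-2] * y[-3] * u[-2] * (y[-1] - 1) + 2 * u[-1]
        den = 1 + y[-1] ** 2 + y[-2] ** 2 + y[-3] ** 2
        y.append(num / den)
    return y
```

The quadratic denominator keeps the trajectory bounded, consistent with the range plotted in Fig. 3.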
5 Conclusion
In this paper, we have presented a new identification method for nonlinear discrete systems by combining ACDs and the generalized fuzzy hyperbolic model. The GFHM was first constructed as a neural network, and then the ACD was utilized to adjust the parameters of the network until they reach optimal values. The simulation study shows the validity of the presented identification method.
Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants No. 60325311, 60572070 and 60534010. The authors would like to thank the guest editors and the anonymous reviewers for their constructive suggestions for this paper.
References

1. Pedrycz, W., Czogala, E., Hirota, K.: Some Remarks on the Identification Problem in Fuzzy Systems. Fuzzy Sets and Systems 12 (1984) 185-189
2. Zhang, H.G., Quan, Y.B.: Modelling, Identification and Control of a Class of Nonlinear System. IEEE Trans. on Fuzzy Systems 9 (2001) 349-354
3. Zhang, H.G., Wang, Z.L.: Generalized Fuzzy Hyperbolic Model: A Universal Approximator. ACTA Automatica Sinica 30(3) (2004) 416-422
4. Bellman, R.E.: Dynamic Programming. Princeton University Press, Princeton, NJ (1957)
5. Prokhorov, D.V., Wunsch, D.C.: Adaptive Critic Designs. IEEE Trans. on Neural Networks 8 (1997) 997-1007
6. Liu, D., Xiong, X., Zhang, Y.: Action-Dependent Adaptive Critic Designs. Proceedings of the INNS-IEEE International Joint Conference on Neural Networks, Washington, DC 2 (2001) 990-995
7. Hou, Z.S.: Non-Parameter Model and Adaptive Control Theory. 1st edn. Science Press, Beijing, China (1999)
8. Balakrishnan, S.N., Biega, V.: Adaptive-Critic-Based Neural Networks for Aircraft Optimal Control. Journal of Guidance, Control, and Dynamics 19 (1996) 893-898
9. Murray, J.J., Cox, C.J., Lendaris, G.G., Saeks, R.: Adaptive Dynamic Programming. IEEE Transactions on Systems, Man, and Cybernetics-Part C: Applications and Reviews 32 (2002) 140-153
10. Si, J., Wang, Y.-T.: On-Line Learning Control by Association and Reinforcement. IEEE Transactions on Neural Networks 12 (2001) 264-276
11. Werbos, P.J.: Approximate Dynamic Programming for Real-Time Control and Neural Modeling. In: White, D.A., Sofge, D.A. (eds.): Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches (Chapter 13). Van Nostrand Reinhold, New York (1992)
Impacts of Perturbations of Training Patterns on Two Fuzzy Associative Memories Based on T-Norms

Wei-Hong Xu¹,², Guo-Ping Chen¹, and Zhong-Ke Xie²

¹ College of Mathematics and Computer Science, Jishou University, 416000 Jishou, China
[email protected]
² College of Computer and Communications Engineering, Changsha University of Science and Technology, 410077 Changsha, China
Abstract. In the real world there is generally a perturbation between a collected training pattern and the corresponding actual pattern, and such perturbations may degrade the performance of a fuzzy neural network. We therefore propose a type of robustness of fuzzy associative memories (FAMs) with respect to perturbations of training patterns. We then show that, under the maximum-weight-matrix learning algorithm, a Max-T0 FAM has poor robustness of this type, whereas a Max-TL FAM has good robustness, where the two FAMs are based on the t-norm T0 and the Lukasiewicz t-norm, respectively. Finally, a simulation experiment validates our theoretical results.
1 Introduction

Since B. Kosko put forward the fuzzy associative memory (FAM) and the fuzzy bidirectional associative memory (FBAM) in 1987 [1-2], fuzzy neural networks have attracted considerable attention for their fruitful applications in pattern recognition, system forecasting, control, decision making, and so on. Their vitality is rooted in the integration of merits of fuzzy theory and neural networks. To construct a fuzzy system, we often need to prepare practical knowledge describing the behavior of the actual system. This knowledge may come from the experience of experts or from data captured by sensors, and its final form is usually represented as fuzzy rules for fuzzy inference systems, or as training patterns for fuzzy neural networks. However, the uncertainty of fuzzy information, the limited precision of sensors, and ambient noise commonly introduce small differences (perturbations) between the collected data and the corresponding actual data. Mingsheng Ying [3] proposed the concepts of maximum and average perturbation of fuzzy sets, and estimated the perturbation parameters of fuzzy reasoning when Zadeh's compositional rule of inference (CRI) is adopted. Kaiyuan Cai [4] then derived robustness results for various implication operators and inference rules in terms of δ-equalities of fuzzy sets. Lei Zhang [5] treated fuzzy reasoning as an optimization problem and further analyzed its robustness. Lately, Weihong Xu [6] investigated the influence of general fuzzy implication operators on the robustness of fuzzy reasoning with perturbations of rules and gave several necessary and sufficient conditions, and Yongming Li [7] set forth some measures of robustness of fuzzy connectives and implication operators and used them to discuss robustness. However, for fuzzy neural networks, few papers [8] have been concerned with the congeneric problem of the robustness or sensitivity of the networks to perturbations of training patterns. So far, scholars in the fuzzy community have mainly been interested in the fault tolerance [9] and robustness of fuzzy networks in the case where only the input patterns or some parameters carry noise, while the training patterns are assumed to be accurate.

In this article, we present a robustness concept of fuzzy neural networks with respect to training-pattern perturbations. As a typical instance, we analyze such robustness of a Max-TL FAM and a Max-T0 FAM when a so-called max-weight-matrix learning algorithm is employed, where TL and T0 are t-norms [10]; further, a simulation experiment is given to validate our theoretical results.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 810-817, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Correlative Concepts

Throughout the paper we assume that I = {1, 2, …, n}, J = {1, 2, …, m} and K = {1, 2, …, p} are nonempty subscript sets, and that A = (a_1, a_2, …, a_n) ∈ [0,1]^n and ΔA = (Δa_1, Δa_2, …, Δa_n) ∈ [−1,1]^n; however, a_i + Δa_i is set to 1 if a_i + Δa_i > 1, and adjusted to 0 if a_i + Δa_i < 0, for each i ∈ I. Other similar situations follow the same stipulation.

Definition 1 [3]. For arbitrary A*, A ∈ [0,1]^n, the following value is called the maximum perturbation between A* and A: H(A*, A) = ∨_{i∈I} |a*_i − a_i|.

Definition 2. If a pattern pair (A, B) is perturbed into (A + ΔA, B + ΔB), and the inequality H(A + ΔA, A) ∨ H(B + ΔB, B) ≤ γ holds, then we say this pattern pair carries a maximum γ perturbation.

Generally, there may be a small difference (perturbation) between a collected pattern pair (A_k, B_k) = ((a_k1, a_k2, …, a_kn), (b_k1, b_k2, …, b_km)) and the corresponding actual pattern pair (C_k, D_k) = ((c_k1, c_k2, …, c_kn), (d_k1, d_k2, …, d_km)) in the real world. Assuming that each training pattern pair (A_k, B_k) carries a maximum γ_k perturbation into (C_k, D_k), the value γ = ∨_{k=1}^{p} γ_k is called the maximum perturbation of the training set {(A_k, B_k) | k ∈ K}. The notation f(X, training_set) stands for the output of a certain FAM for input X when a learning algorithm f is employed with training set training_set.

Definition 3. Suppose that a fuzzy associative memory (FAM) model employs a learning algorithm f. If there exists a positive constant h > 0 such that for arbitrary given γ ∈ (0,1] and any set = {(A_k, B_k) | k ∈ K} carrying a maximum γ perturbation into new_set, the inequality H(f(X, set), f(X, new_set)) ≤ hγ always holds for any input X, then we say that the FAM model has universally good robustness to perturbations of training patterns under the learning algorithm f.

Regarding Definition 3, we should emphasize that: (1) (A_k, B_k) and the maximum perturbation γ ∈ (0,1] are arbitrarily given, and p, the number of pattern pairs, is also arbitrarily given; (2) the connection weight matrix of the FAM generally changes as the training pattern set changes, even if the adopted learning algorithm remains unchanged.

Definition 4. Suppose that T is a t-norm; the concomitant implication operator of T is represented by R_T(a, b) = ∨{x ∈ [0,1] | a T x ≤ b}, ∀a, b ∈ [0,1].
For a t-norm T, a Max-T FAM can be characterized as Y = X ∘_T W, which can also be expressed pointwise as

y_j = ∨_{i∈I} (x_i T w_ij), ∀j ∈ J,

where W is an n-by-m connection weight matrix from input neurons to output neurons, and n and m are the numbers of neurons in the X-layer and the Y-layer, respectively. The t-norms discussed in the paper are

a T_0 b = a ∧ b if a + b > 1, and 0 otherwise;   a T_L b = (a + b − 1) ∨ 0,   ∀a, b ∈ [0,1],

the latter being the Lukasiewicz t-norm. It is easy to verify that

R_{T_0}(a, b) = 1 if a ≤ b, and (1 − a) ∨ b if a > b;   R_{T_L}(a, b) = (1 − a + b) ∧ 1.
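The two t-norms and their concomitant implications are straightforward to code; the small sketch below (function names hypothetical) also lets one check the residuation property R_T(a, b) = ∨{x : a T x ≤ b} numerically.

```python
def t0(a, b):
    """T0: a ∧ b if a + b > 1, else 0."""
    return min(a, b) if a + b > 1 else 0.0

def tl(a, b):
    """Lukasiewicz t-norm: (a + b - 1) ∨ 0."""
    return max(a + b - 1.0, 0.0)

def r_t0(a, b):
    """Concomitant implication of T0 (the R0 operator)."""
    return 1.0 if a <= b else max(1.0 - a, b)

def r_tl(a, b):
    """Concomitant implication of TL: (1 - a + b) ∧ 1."""
    return min(1.0 - a + b, 1.0)
```

Note that `r_t0` jumps from 1 down to (1 − a) ∨ b as b crosses below a, whereas `r_tl` changes continuously; this discontinuity is what underlies the contrast established in Sections 3 and 4.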
Remark 1. The implication operator R_{T_0} is often written as R_0 in many articles; in particular, it was used by Guojun Wang [11] as an improvement of Zadeh's CRI method.

As is well known, the performance of a neural network depends on its connection weights, which are provided by a learning algorithm. So research on the type of robustness characterized by Definition 3 needs to be closely associated with the learning algorithms of the networks under discussion. Fortunately, an efficient learning algorithm has been proposed for the Max-T FAM; its effectiveness means that it finds the maximum of all connection weight matrices that make the network store the given pattern pairs reliably, as long as the network has the ability to store them reliably at all. In this article, the algorithm is called the max-weight-matrix learning algorithm, described below [12-14]. Suppose that a given set of training pattern pairs is

set = {(A_k, B_k) | A_k = (a_k1, a_k2, …, a_kn), B_k = (b_k1, b_k2, …, b_km), k ∈ K}.

Step 1: Based on the kth pattern pair (A_k, B_k) and the implication operator R_T, a temporary weight matrix W_k is determined as W_k = (w_ij^(k))_{n×m} = (a_ki R_T b_kj)_{n×m}.
Step 2: With the fuzzy intersection operator ∩, all the temporary weight matrices W_k are combined to form the final weight matrix W = ∩_{k∈K} W_k = (∧_{k∈K} (a_ki R_T b_kj))_{n×m}.
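Steps 1-2 and the Max-T recall can be sketched as follows, using the Lukasiewicz pair as the t-norm and implication; function names are hypothetical.

```python
import numpy as np

# Lukasiewicz t-norm and its concomitant implication (Definition 4).
tl = lambda a, b: max(a + b - 1.0, 0.0)
r_tl = lambda a, b: min(1.0 - a + b, 1.0)

def max_weight_matrix(train_pairs, r_op=r_tl):
    """Max-weight-matrix learning: W = min over k of (a_ki R_T b_kj)."""
    mats = [np.array([[r_op(a, b) for b in B] for a in A])
            for A, B in train_pairs]
    return np.minimum.reduce(mats)

def max_t_recall(X, W, t_op=tl):
    """Max-T recall: y_j = max over i of (x_i T w_ij)."""
    return np.array([max(t_op(x, w) for x, w in zip(X, col)) for col in W.T])
```

For the single pair in the check below, the stored output is recalled exactly; in general, exact recall holds whenever reliable storage is possible at all, which is the property the algorithm is designed for.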
3 The Type of Robustness of Max-TL FAM

Lemma 1 [6,7,9]. (1) |∨_{i∈I} a'_i − ∨_{i∈I} a_i| ≤ ∨_{i∈I} |a'_i − a_i|. (2) If δ > 0 and |a'_i − a_i| ≤ δ for each i ∈ I, then |∧_{i∈I} a'_i − ∧_{i∈I} a_i| ≤ δ. (3) |∧_{i∈I} a'_i − ∧_{i∈I} a_i| ≤ ∨_{i∈I} |a'_i − a_i|.

Lemma 2. |a T_L b − c T_L d| ≤ 2(|a − c| ∨ |b − d|) and |a R_{T_L} b − c R_{T_L} d| ≤ 2(|a − c| ∨ |b − d|), ∀a, b, c, d ∈ [0,1]. (The proof is trivial and omitted here.)
Remark 2. Lemma 2 implies that the two operators T_L and R_{T_L} satisfy the well-known Lipschitz condition.

Theorem 1. A Max-TL FAM has universally good robustness to perturbations of training patterns under the max-weight-matrix learning algorithm.
Proof. We use the max-weight-matrix learning algorithm to train the Max-TL FAM. Let the positive constant be h = 4. For any training pattern set set = {(A_k, B_k) | k ∈ K} and any given γ_k ∈ (0,1], suppose set carries a maximum γ = ∨_{k∈K} γ_k perturbation into new_set = {(C_k, D_k) | k ∈ K}. Let W1 and W2 be the connection weight matrices of the Max-TL FAM obtained with the training sets set and new_set, respectively. For any input X ∈ [0,1]^n, we have the following sequence of derivations:

H(Y1(X), Y2(X)) = ∨_{j∈J} |y1_j − y2_j| = ∨_{j∈J} |∨_{i∈I} (x_i T_L w1_ij) − ∨_{i∈I} (x_i T_L w2_ij)|
≤ ∨_{j∈J} ∨_{i∈I} |(x_i T_L w1_ij) − (x_i T_L w2_ij)|   (by Lemma 1)
≤ ∨_{j∈J} ∨_{i∈I} 2(|x_i − x_i| ∨ |w1_ij − w2_ij|)   (by Lemma 2)
≤ 2 ∨_{j∈J} ∨_{i∈I} |∧_{k∈K} (a_ki R_{T_L} b_kj) − ∧_{k∈K} (c_ki R_{T_L} d_kj)|   (from the learning algorithm)
≤ 2 ∨_{j∈J} ∨_{i∈I} ∨_{k∈K} |(a_ki R_{T_L} b_kj) − (c_ki R_{T_L} d_kj)|   (by Lemma 1)
≤ 4 ∨_{j∈J} ∨_{i∈I} ∨_{k∈K} (|a_ki − c_ki| ∨ |b_kj − d_kj|)   (by Lemma 2)
≤ 4 ∨_{j∈J} ∨_{i∈I} ∨_{k∈K} (H(A_k, C_k) ∨ H(B_k, D_k)) ≤ 4 ∨_{j∈J} ∨_{i∈I} ∨_{k∈K} γ_k = 4γ.

According to Definition 3, this completes the proof of the theorem. Further, we can arrive at the following conclusion, whose proof is similar to that of Theorem 1: supposing that any training set set carries a maximum γ perturbation into new_set, and any input X carries an arbitrary maximum δ perturbation into X*, then the inequality H(Y1(X), Y2(X*)) ≤ 2(δ ∨ 2γ) holds.
4 The Type of Robustness of Max-T0 FAM

4.1 Theoretical Analysis of the Robustness

Theorem 2. A Max-T0 FAM does NOT have universally good robustness to perturbations of training patterns under the max-weight-matrix learning algorithm.
Proof. Let any h > 0 be given. If h ∈ (0, 0.5], let γ be any value in (0, 0.5); if h ∈ (0.5, +∞), let γ = 1/(4h). Next let s = 0.5 − γh. Then the two inequalities 0 < 0.5 − s < 0.5 and 0 < 0.5 − γ < 0.5 always hold. Now we construct a special set of training pattern pairs

set = {(A_1, B_1) | A_1 = (0.5, 0.5, …, 0.5), B_1 = (0.5, 0.5, …, 0.5)},

and make set carry a maximum γ perturbation into

new_set = {(C_1, D_1) | C_1 = (0.5, 0.5, …, 0.5), D_1 = (0.5 − γ, 0.5 − γ, …, 0.5 − γ)}.

After the Max-T0 FAM model is trained with the max-weight-matrix learning algorithm based on set and on new_set, respectively, the two connection weight matrices W1 and W2 come out:

W1 = (∧_{k∈K} (a_ki R_{T_0} b_kj))_{n×m} = (a_1i R_{T_0} b_1j)_{n×m} = (1)_{n×m},
W2 = (∧_{k∈K} (c_ki R_{T_0} d_kj))_{n×m} = (c_1i R_{T_0} d_1j)_{n×m} = (0.5 ∨ (0.5 − γ))_{n×m} = (0.5)_{n×m}.

Now let any input X satisfying x_i ∈ (0.5 − s, 0.5) for all i ∈ I be presented both to the Max-T0 FAM with weight matrix W1 and to the Max-T0 FAM with weight matrix W2. For the same input X, the difference (perturbation) between the two output vectors is

H(Y1(X), Y2(X)) = ∨_{j∈J} |y1_j − y2_j| = ∨_{j∈J} |∨_{i∈I} (x_i T_0 1) − ∨_{i∈I} (x_i T_0 0.5)|
= ∨_{j∈J} |∨_{i∈I} x_i − ∨_{i∈I} 0|   (by the definition of T_0 and the restriction on X)
= ∨_{i∈I} x_i > 0.5 − s = γh.

Therefore H(Y1(X), Y2(X)) > hγ, so the theorem holds by Definition 3 and the arbitrariness of h.

4.2 Simulated Experiment and Analysis
The aim of the experiment is to not only support further the conclusion of Theorems 1 but also illustrate an application of a Max-T0 FAM in image processing. The maxweight-matrix learning algorithm [12-14] will be adopted to train the Max- T0 FAM. (1) Without loss of generality, the images A and B in Figs.1 and 2 can be regarded as actual data in real world, they both contain respectively 64-by-64 pixels of which the gray levels are in the range of {0 / 256,1 / 256, L,255 / 256} . (2) The images C and D in Figs. 3 and 4 are viewed as results from small perturbations of image A and image B, respectively. In other words, the images C and D can be considered as the captures in data acquisition, respectively from the images A and B. The approach of perturbation of image A is that each pixel increases smally its gray level, so the image should get bright, then a short dark line that has gray level 0.5 is placed in the left-central of the image. For the image B, its process of perturbation makes the image also lighten by adding another little positive numeric to each pixel. (3) Containing 64 * s pixels, each of s rows of image A from top to bottom, is used as the first pattern of a training pattern pair, and the corresponding s rows of image B constitute the second pattern of the pattern pair. Thus from (image A, image B), a set old _ set of the p training pairs is developed where p = total _ row _ numbers _ of _ image / s = 64 / s .
Similarly, another set new_set of p training pairs is derived from (image C, image D). Next we construct a Max-T FAM model in which the node numbers of the X-layer and the Y-layer are n = 64·s and m = 64·s, respectively. Now the Max-T0 FAM
Impacts of Perturbations of Training Patterns on Two FAM Based on T-Norms
815
model is trained, producing two different connection weight matrices based on old_set and new_set, respectively. Let s = 32, so p = 32 in this experiment.
(4) An input image E in Fig. 5, obtained by slightly perturbing image A in Fig. 1, is divided into p input patterns as in item (3) above. Moreover, image E differs from both image A and image C.
(5) When the above p input patterns are presented to the Max-T0 FAM trained on old_set, p output patterns of the Max-T0 FAM are produced, which form the image in Fig. 6. Similarly, feeding the same p input patterns to the Max-T0 FAM trained on new_set yields p new output patterns that compose the image in Fig. 7.
(6) Check the difference between the two output images in Figs. 6 and 7. There are obvious differences in gray level between the child's eyes in the two images, and differences are also visible in the brightness contrast, obscurity, and image definition of the two images. In particular, on careful inspection of the image in Fig. 7, the lower half and the area around the midline of the chin and nose tip were strongly affected by the perturbations of the training pattern pairs. To illustrate the difference between the two images in Figs. 6 and 7 visually, we give the image in Fig. 8, in which a black pixel at a given coordinate means the two pixels in Figs. 6 and 7 have equal gray levels at that coordinate, a dark pixel implies the two gray levels are not equal, and a white pixel denotes that the difference between the two gray levels is larger than twice the maximum perturbation of the training pattern pairs.
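The three-level coding of the difference image described in item (6) can be sketched as follows; this is a hypothetical illustration, and the gray value 0.33 used for "dark" pixels is an arbitrary choice, not from the paper:

```python
# Hypothetical sketch of the difference image: black (0.0) where the two
# outputs agree, dark gray (0.33, an arbitrary illustrative value) where
# they differ, and white (1.0) where the difference exceeds twice the
# maximum training-pattern perturbation.

def difference_image(img1, img2, max_perturbation):
    """Compare two gray-level images given as lists of rows of floats."""
    return [[1.0 if abs(p1 - p2) > 2 * max_perturbation
             else 0.33 if p1 != p2
             else 0.0
             for p1, p2 in zip(row1, row2)]
            for row1, row2 in zip(img1, img2)]
```

For a row perturbed by at most 0.05, for example, a pixel pair differing by 0.4 would be marked white while a pair differing by 0.01 would be marked dark.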
Fig. 1. Actual image A in real world
Fig. 3. Image C from perturbed A
Fig. 5. Input image E
Fig. 2. Actual image B in real world
Fig. 4. Image D from perturbed B
Fig. 6. Output of Max-T0 FAM based on (A, B)
816
W.-H. Xu, G.-P. Chen, and Z.-K. Xie
Fig. 7. Output of Max-T0 FAM based on (C, D)
Fig. 8. Difference between Fig. 6 and Fig. 7
According to the definition of the t-norm T0, if a pair (a, b) satisfying a + b > 1 (resp. a + b ≤ 1) is perturbed slightly into a pair (c, d) satisfying c + d ≤ 1 (resp. c + d > 1), then the value of T0 changes abruptly (discontinuously) from a ∧ b (resp. 0) to 0 (resp. c ∧ d). Similarly, by the definition of the implication operator R_T0, there are also certain perturbations of a pair (a, b) under which the function R_T0 changes discontinuously. The short dark line with gray level 0.5 that appears in Fig. 3 guarantees that such perturbations take place, making the values of T0 or R_T0 change discontinuously during the network's training or output computation. Moreover, we find that more such short dark lines produce a larger difference between the two images in Figs. 6 and 7; eventually the undesirable situation easily occurs in which the image in Fig. 6 is entirely unlike the image in Fig. 7. The difference between the two images in Figs. 6 and 7 is caused by the perturbations of the training pattern pairs, because the same learning algorithm and the same set of input patterns are used for the Max-T0 FAM model. When TL is used instead of T0 with everything else unchanged in the above experiment, i.e., considering a Max-TL FAM, two new images are obtained corresponding to Figs. 6 and 7, and a new difference image, corresponding to Fig. 8, is generated in Fig. 9. Interestingly, there are areas of white pixels in Fig. 8 while Fig. 9 contains no white pixel at all. This implies that the Max-TL FAM has better robustness to training pattern perturbations than the Max-T0 FAM when the max-weight-matrix learning algorithm is employed.
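The discontinuity of T0 described above can be illustrated with a small sketch; the definition T0(a, b) = a ∧ b if a + b > 1, else 0, is taken from the behavior described in this paragraph, and the function names are illustrative:

```python
# Sketch of the t-norm T0: T0(a, b) equals min(a, b) when a + b > 1 and
# 0 otherwise, so a small perturbation of (a, b) across the line
# a + b = 1 changes the value discontinuously.

def t0(a, b):
    """T0(a, b) = a ∧ b if a + b > 1, else 0."""
    return min(a, b) if a + b > 1 else 0.0

def max_t0_output(x, w):
    """Max-T0 composition: y_j = max_i T0(x_i, w[i][j])."""
    return [max(t0(xi, row[j]) for xi, row in zip(x, w))
            for j in range(len(w[0]))]
```

For instance, t0(0.51, 0.51) is 0.51, while the slightly perturbed t0(0.49, 0.51) drops to 0.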
Fig. 9. Difference between the two outputs of Max-TL FAM
5 Conclusions
The significance of this new notion of robustness for a general FAM is as follows: (1) it provides a measure of the influence of training pattern perturbations on the FAM; (2) analyzing and even controlling such robustness is of some benefit for supervising the acquisition of training patterns; (3) a fuzzy neural network may have several learning algorithms, and different algorithms may give the network different degrees of robustness [8]. Accordingly, this type of robustness may be taken into account
when we want to choose a suitable learning algorithm. Seeing that training pattern pairs may be perturbed in common cases, and that there is a great diversity of fuzzy neural networks and learning algorithms, the investigation of the robustness characterized by Definition 3 has practical significance and may yield further valuable results.
Acknowledgment
The work was supported by the Hunan Provincial Natural Science Foundation of China under Grant No. 05JJ40004, the Scientific Research Fund of the Hunan Provincial Education Department of China under Grant 04C509, and the Doctoral Research Fund of Changsha University of Science and Technology under Grant 05xxrc024. The authors would like to thank the anonymous reviewers for their valuable suggestions.
References
1. Kosko, B.: Fuzzy Associative Memories. In: Kandel, A. (ed.): Fuzzy Expert Systems. Addison-Wesley, Reading, MA, USA (1987)
2. Kosko, B.: Bidirectional Associative Memory. IEEE Transactions on Systems, Man and Cybernetics 18 (1988) 49-60
3. Ying, M.: Perturbation of Fuzzy Reasoning. IEEE Transactions on Fuzzy Systems 7 (1999) 625-629
4. Cai, K.Y.: Robustness of Fuzzy Reasoning and δ-Evaluations of Fuzzy Sets. IEEE Trans. on Fuzzy Systems 9 (2001) 738-750
5. Zhang, L., Cai, K.Y.: Optimal Fuzzy Reasoning and Its Robustness Analysis. Int. J. of Intelligent Systems 19 (2004) 1033-1049
6. Xu, W., Chen, G., Yang, J., Ye, Y.: Influence of Fuzzy Implication Operators on Robustness of Fuzzy Reasoning with Perturbations of Rules. Chinese Journal of Computers 28 (2005) 1701-1707
7. Li, Y., Li, D., Pedrycz, W., Wu, J.: An Approach to Measure the Robustness of Fuzzy Reasoning. Int. J. of Intelligent Systems 20 (2005) 393-413
8. Xu, W., Song, L., Yang, J.: Influences and Control of Perturbations of Training Pattern Pairs on Fuzzy Bidirectional Associative Memories. Chinese Journal of Computers (2006) to appear
9. Liu, P.: Max-Min Fuzzy Hopfield Neural Networks and an Efficient Learning Algorithm. Fuzzy Sets and Systems 112 (2001) 41-49
10. Wang, L.X.: A Course in Fuzzy Systems and Control. Prentice-Hall, Englewood Cliffs, NJ (1997)
11. Wang, G.: The Full Implicational Triple I Method for Fuzzy Reasoning. Science in China (Series E) 29 (1999) 45-53
12. Fan, J.B., Jin, F., Shi, Y.: An Efficient Learning Algorithm for Fuzzy Associative Memories. Acta Electronica Sinica 24 (1996) 112-115
13. Yang, Q., Chen, M., Yu, Y.: A Learning Algorithm of Fuzzy Associative Memories Based on a Class of T-norm Operations. Systems Engineering and Electronics 22 (2000) 73-75
14. Stamou, G.B., Tzafestas, S.G.: Neural Fuzzy Relational Systems with a New Learning Algorithm. Mathematics and Computers in Simulation 51 (2000) 301-314
Alpha-Beta Associative Memories for Gray Level Patterns
Cornelio Yáñez-Márquez, Luis P. Sánchez-Fernández, and Itzamá López-Yáñez
Centro de Investigación en Computación, Instituto Politécnico Nacional, Laboratorio de Inteligencia Artificial, Av. Juan de Dios Bátiz s/n, México, D.F., 07738, México
{cyanez, lsanchez, itzama}@cic.ipn.mx
http://www.cornelio.org.mx

Abstract. In this paper, we show how the binary Alpha-Beta associative memories, created and developed by Yáñez-Márquez and introduced in [1-3], can be used to operate with gray-level patterns (namely gray-level images), improving on the results presented by Sossa et al. in [4]. To achieve our goal, given a fundamental set of gray-level patterns, we find the binary representation of each entry and then build a binary Alpha-Beta associative memory. After that, a given gray-level pattern, or a distorted version of it, is recalled by converting its entries to a binary representation, recalling it with the binary associative memory, and finally converting the binary output pattern back into a gray-level pattern. Experimental results show the efficiency of the new memories. It is important to point out that this solution is simpler and more elegant than the one presented in [4].
1 Introduction
Basic concepts about associative memories were established three decades ago in [5-7]; nonetheless, here we use the concepts, results and notation introduced in Yáñez-Márquez's PhD Thesis [1]. An associative memory M is a system that relates input patterns and output patterns as follows: x → M → y, with x and y the input and output pattern vectors, respectively. Each input vector forms an association with a corresponding output vector. For k a positive integer, the corresponding association is denoted (x^k, y^k). Associative memory M is represented by a matrix whose ij-th component is m_ij. Memory M is generated from an a priori finite set of known associations, known as the fundamental set of associations. If μ is an index, the fundamental set is represented as {(x^μ, y^μ) | μ = 1, 2, …, p}, with p the cardinality of the set. The patterns that form the fundamental set are called fundamental patterns. If it holds that x^μ = y^μ for all μ ∈ {1, 2, …, p}, M is auto-associative; otherwise it is hetero-associative, in which case it is possible to establish that there exists μ ∈ {1, 2, …, p} for which x^μ ≠ y^μ. A distorted version of a pattern x^k to be recovered is denoted x̃^k. If, when a distorted version of x^ϖ with ϖ ∈ {1, 2, …, p} is fed to an associative memory M, the output corresponds exactly to the associated pattern y^ϖ, we say that recall is perfect. Among the variety of associative memory models described in the scientific literature, two models deserve emphasis because of their relevance: the morphological associative memories, introduced by Ritter et al. [8], and the Alpha-Beta associative memories [1]. In this paper we propose an extension of the binary operators Alpha and Beta, the foundation for the Alpha-Beta associative memories [1], that allows memorizing and then recalling k-valued input and output patterns. Sufficient conditions for perfect recall and examples are provided.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 818-823, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Alpha-Beta Associative Memories
The αβ associative memories are of two kinds and are able to operate in two different modes. The operator α is used in the learning phase, and the operator β is the basis for the pattern recall phase. The properties of the algebraic operators α and β allow the αβ memories to exhibit characteristics similar to those inherent to the binary version of the morphological associative memories, in the sense of learning capacity, type and amount of noise against which the memory is robust, and the sufficient conditions for perfect recall. At the heart of the mathematical tools used in the Alpha-Beta model are two binary operators designed specifically for these memories. First, we define the sets A = {0, 1} and B = {00, 01, 10}; then the operators α : A × A → B and β : B × A → A are defined in tabular form:

  α : A × A → B          β : B × A → A
  x  y  α(x, y)          x   y  β(x, y)
  0  0    01             00  0    0
  0  1    00             00  1    0
  1  0    10             01  0    0
  1  1    01             01  1    1
                         10  0    1
                         10  1    1

The sets A and B, together with the α and β operators and the usual ∧ (minimum) and ∨ (maximum) operators, form the algebraic system (A, B, α, β, ∧, ∨), which is the mathematical basis for the Alpha-Beta associative memories. The ij-th entry of the matrix y ⊕ xᵗ is [y ⊕ xᵗ]_ij = α(y_i, x_j). If we consider the fundamental set of patterns {(x^μ, y^μ) | μ = 1, 2, …, p}, where x^μ = (x_1^μ, x_2^μ, …, x_n^μ)ᵗ ∈ Aⁿ and y^μ = (y_1^μ, y_2^μ, …, y_m^μ)ᵗ ∈ A^m, then the ij-th entry of the matrix y^μ ⊕ (x^μ)ᵗ is [y^μ ⊕ (x^μ)ᵗ]_ij = α(y_i^μ, x_j^μ).
The following two important results, obtained in the aforementioned PhD Thesis [1], are presented. Let x ∈ Aⁿ and let P be a matrix of dimensions m × n. The operation P ∪β x gives as a result a column vector of dimension m, with i-th component

(P ∪β x)_i = ∨_{j=1}^{n} β(p_ij, x_j).

Let x ∈ Aⁿ and let P be a matrix of dimensions m × n. The operation P ∩β x gives as a result a column vector of dimension m, with i-th component

(P ∩β x)_i = ∧_{j=1}^{n} β(p_ij, x_j).
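Assuming the integer encoding 0, 1, 2 for the B-values 00, 01, 10, the operator tables and the two β operations above can be sketched as follows (an encoding choice and illustrative names, not from the paper):

```python
# The alpha and beta operators from the tables above, with the B-values
# 00, 01, 10 encoded as the integers 0, 1, 2 (an illustrative choice).

ALPHA = {(0, 0): 1, (0, 1): 0, (1, 0): 2, (1, 1): 1}   # alpha: A x A -> B
BETA = {(0, 0): 0, (0, 1): 0, (1, 0): 0,
        (1, 1): 1, (2, 0): 1, (2, 1): 1}               # beta: B x A -> A

def beta_max(P, x):
    """(P ∪β x)_i = max over j of beta(p_ij, x_j)."""
    return [max(BETA[(pij, xj)] for pij, xj in zip(row, x)) for row in P]

def beta_min(P, x):
    """(P ∩β x)_i = min over j of beta(p_ij, x_j)."""
    return [min(BETA[(pij, xj)] for pij, xj in zip(row, x)) for row in P]
```

For example, for the memory V = [[1, 2], [0, 1]] built from the single pattern x = (1, 0)ᵗ via v_ij = α(x_i, x_j), beta_min(V, [1, 0]) returns [1, 0], recovering the pattern.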
3 The New Alpha-Beta Associative Memories
In this section we show how binary αβ memories can be used to operate with gray-level patterns, namely gray-level images. Without loss of generality, we analyze only the case of the Alpha-Beta auto-associative memories of kind V. First, we need to define an operator and an additional vector operation that are useful for converting a positive integer to a binary column vector.

Definition 1. Let r and k be two positive integers. The expanding binary operator ε(r, k) produces a k-dimensional binary column vector whose entries are the binary expansion, in descending order, of the integer r.

Definition 2. Let r, s and k be three positive integers. The concatenation of ε(r, k) with ε(s, k) is a binary column vector of dimension 2k, with the binary chain of r placed top-down before the binary chain of s, and is denoted by

C[ε(r, k), ε(s, k)] = (ε(r, k)ᵗ, ε(s, k)ᵗ)ᵗ.

Now we can use the information from the previous sections to describe the new Alpha-Beta associative memories. Let k and m be two positive integers. We define the set A = {0, 1, 2, …, 2^k − 1}. If the new fundamental gray-level patterns have dimension m, then the new fundamental set with cardinality p takes the form {(g^μ, g^μ) | μ = 1, …, p}, where g^μ ∈ A^m and g_i^μ ∈ A for all i ∈ {1, 2, …, m}. Again, the set B = {00, 01, 10}.
LEARNING PHASE (three steps)
STEP 1: For each μ = 1, 2, …, p, obtain ε(g_i^μ, k) for i = 1, 2, …, m, and set x^μ = C[ε(g_1^μ, k), ε(g_2^μ, k), …, ε(g_m^μ, k)].
STEP 2: For each μ = 1, 2, …, p, from each couple (x^μ, x^μ) build the matrix [x^μ ⊕ (x^μ)ᵗ]_{n×n}.
STEP 3: Apply the binary ∨ operator to the matrices obtained in step 2 to get V = ∨_{μ=1}^{p} [x^μ ⊕ (x^μ)ᵗ]. The ij-th entry is given by v_ij = ∨_{μ=1}^{p} α(x_i^μ, x_j^μ). It is obvious that v_ij ∈ B for all i ∈ {1, 2, …, n} and all j ∈ {1, 2, …, n}.

RECALLING PHASE (two cases)
CASE 1: Recall of a fundamental pattern g^ϖ ∈ A^m, with ϖ ∈ {1, 2, …, p} (three steps).
STEP 1: Obtain x^ϖ = C[ε(g_1^ϖ, k), ε(g_2^ϖ, k), …, ε(g_m^ϖ, k)].
STEP 2: Perform the operation V ∩β x^ϖ. The result is a binary column vector of dimension n = mk, with i-th component

(V ∩β x^ϖ)_i = ∧_{j=1}^{n} β(v_ij, x_j^ϖ) = ∧_{j=1}^{n} β( [ ∨_{μ=1}^{p} α(x_i^μ, x_j^μ) ], x_j^ϖ ).

STEP 3: By means of a top-down process, the binary chain formed by the k binary entries starting at x^ϖ_{(i−1)k} up to x^ϖ_{ik−1} is converted back to the positive integer g_i^ϖ; do this for i = 1, 2, …, m. The fundamental pattern g^ϖ ∈ A^m has been recovered by the associative memory.

CASE 2: Recall of a pattern from a pattern g̃, which is an altered version, with additive noise, of one of the fundamental patterns g^ϖ ∈ A^m, with ϖ ∈ {1, 2, …, p}.

The conditions for perfect recall for the patterns x and x̃ in the above cases are the same as those applied to the binary patterns used in the Alpha-Beta auto-associative memory of type V. For the patterns in the fundamental set, the following theorem establishes these perfect recall conditions:
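The learning and recalling phases above can be sketched compactly as follows; this is a minimal sketch (not the authors' implementation), with B encoded as the integers 0/1/2 and all function names illustrative:

```python
# Sketch of the new Alpha-Beta autoassociative memory (type V) for
# gray-level patterns: encode each gray value as its k-bit expansion,
# learn V with v_ij = max over mu of alpha(x_i, x_j), and recall with
# y_i = min over j of beta(v_ij, x_j).

ALPHA = {(0, 0): 1, (0, 1): 0, (1, 0): 2, (1, 1): 1}
BETA = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1, (2, 0): 1, (2, 1): 1}

def expand(g, k):
    """epsilon(g, k): k-bit binary expansion of g, most significant bit first."""
    return [(g >> (k - 1 - b)) & 1 for b in range(k)]

def encode(pattern, k):
    """Concatenate the k-bit expansions of all entries (Definitions 1 and 2)."""
    return [bit for g in pattern for bit in expand(g, k)]

def learn(patterns, k):
    """Learning phase: V with v_ij = max over mu of alpha(x_i, x_j)."""
    xs = [encode(g, k) for g in patterns]
    n = len(xs[0])
    return [[max(ALPHA[(x[i], x[j])] for x in xs) for j in range(n)]
            for i in range(n)]

def recall(V, pattern, k):
    """Recalling phase: V ∩β x, then decode k bits at a time."""
    x = encode(pattern, k)
    y = [min(BETA[(vij, xj)] for vij, xj in zip(row, x)) for row in V]
    return [int("".join(map(str, y[i:i + k])), 2) for i in range(0, len(y), k)]
```

For instance, after V = learn([[200, 13, 77]], 8), recall(V, [200, 13, 77], 8) returns [200, 13, 77], as the theorem for the fundamental set guarantees.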
Theorem 3. Let k and m be two positive integers, and let {(g^μ, g^μ) | μ = 1, …, p} be the fundamental set of the new Alpha-Beta auto-associative memory of type V, where g^μ ∈ A^m, g_i^μ ∈ A for all i ∈ {1, 2, …, m}, and A = {0, 1, 2, …, 2^k − 1}. Then the new Alpha-Beta auto-associative memory provides perfect recall for all patterns in the fundamental set.

The proof of this theorem is strongly based on a similar proof detailed in [1], which is supported by the fact that perfect recall in an Alpha-Beta hetero-associative memory is guaranteed if, for each μ ∈ {1, 2, …, p} and for every row index i, there exists a column index j such that the ij-th term of the matrix representing the memory is equal to α(y_i^μ, x_j^μ). Since for the memory from Theorem 3 it holds that v_ii = 1 = α(g_i^μ, g_i^μ) for all μ ∈ {1, 2, …, p}, it is clear that the condition for perfect recall is satisfied. ♦
4 Experiments with Gray Level Images
In this section the new Alpha-Beta associative memories are tested with nine gray-level images. The images, shown in Figure 1, are 60 by 20 pixels with 256 gray levels. Only the new Alpha-Beta auto-associative memories of type V were tested.
Fig. 1. Images of the nine objects used to test the new αβ associative memories
LEARNING PHASE
Each of the nine images was first converted to a vector of 1,200 positive integer entries (60×20). Then, for each of these vectors, the three steps of the learning phase in Section 3 were carried out. At the end, the new Alpha-Beta auto-associative memory of type V was generated.
RECALLING PHASE
All nine patterns in the fundamental set were perfectly recalled, confirming Theorem 3. To perform the experiments with altered versions of the fundamental patterns, the images in the fundamental set were corrupted with additive noise. Three groups of images were generated: the first with weak additive noise (10%), the second with medium additive noise (30%), and the third with severe additive noise (50%), a huge amount of noise. Twenty-seven corrupted images were obtained by randomly changing some pixel values to blank. In most cases the desired image was correctly recalled; in other cases it was not. Notice that even though the level of noise introduced in the third column is severe (in any system, 50% of noise is a huge amount), some of the recalls are still perfect.
5 Conclusion and Future Work
We have shown how binary Alpha-Beta associative memories can be used to efficiently recall patterns with more than two gray levels. This is possible because a pattern can be decomposed into binary patterns. It is worth mentioning that the proposed technique can be adapted to any kind of binary associative memory, as long as its input patterns can be obtained from the binary expansions of the original patterns. Currently, we are investigating how to use the proposed approach in the presence of mixed noise and other variants. We are also working toward the proposal of new associative memories based on other mathematical results.
Acknowledgements. The authors would like to thank the Instituto Politécnico Nacional (Secretaría Académica, COFAA, SIP, and CIC), CONACyT, and SNI for their financial support of this work.
References
1. Yáñez-Márquez, C.: Associative Memories Based on Order Relations and Binary Operators (in Spanish). PhD Thesis. Center for Computing Research, México (2002)
2. Yáñez-Márquez, C., Díaz de León-Santiago, J.L.: Memorias Asociativas Basadas en Relaciones de Orden y Operaciones Binarias. Computación y Sistemas, México, 6(4) (2003) 300-311
3. Yáñez-Márquez, C., Díaz de León-Santiago, J.L.: Memorias Asociativas con Respuesta Perfecta y Capacidad Infinita. In: Albornoz, A., Alvarado, M. (eds.): Taller de Inteligencia Artificial TAINA'99. México (1999) 245-257
4. Sossa, H., Barrón, R., Cuevas, F., Aguilar, C., Cortés, H.: Binary Associative Memories Applied to Gray Level Pattern Recalling. In: Lemaître, C., Reyes, C., González, J. (eds.): Advances in Artificial Intelligence. Lecture Notes in Computer Science, Vol. 3315. Springer-Verlag, Berlin Heidelberg New York (2004) 656-666
5. Kohonen, T.: Correlation Matrix Memories. IEEE Transactions on Computers 21(4) (1972) 353-359
6. Kohonen, T.: Self-Organization and Associative Memory. Springer-Verlag, Berlin Heidelberg New York (1989)
7. Hassoun, M.H.: Associative Neural Memories. Oxford University Press, New York (1993)
8. Ritter, G.X., Sussner, P., Diaz-de-Leon, J.L.: Morphological Associative Memories. IEEE Transactions on Neural Networks 9 (1998) 281-293
Associative Memories Based on Discrete-Time Cellular Neural Networks with One-Dimensional Space-Invariant Templates Zhigang Zeng1,2 and Jun Wang2 1
School of Automation, Wuhan University of Technology, Wuhan, Hubei, 430070, China [email protected] 2 Department of Automation and Computer-Aided Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong [email protected]
Abstract. In this paper, discrete-time cellular neural networks with one-dimensional space-invariant templates are designed as associative memories. The obtained results enable both hetero-associative and auto-associative memories to be synthesized by assuring the global asymptotic stability of the equilibrium point, with the data fed via external inputs rather than initial conditions. It is shown that the criteria herein ensure that the designed input matrix can be obtained by using a one-dimensional space-invariant cloning template. Finally, a specific example is included to demonstrate the applicability of the methodology.
1 Introduction
In recent years, several methods have been proposed for designing associative memories using both continuous-time cellular neural networks (CNNs) and discrete-time cellular neural networks (DTCNNs) ([1]-[7]). There are two methods for realizing associative memories. In the first method, CNNs have multiple equilibrium points which are directly regarded as associative memories [1]-[4]. It is necessary that each dynamic trajectory converge to one of the asymptotically stable equilibrium points of the network. In these cases, the input data are fed via initial conditions, and the outputs reach their steady-state values at an equilibrium point which depends on the initial state. Each equilibrium point has an attractive region; when the initial state of the CNN is located in this attractive region, the corresponding equilibrium point is associatively memorized as the desired pattern. In the second method, CNNs have a unique equilibrium point which is globally asymptotically stable. By assuring that each trajectory converges to a unique equilibrium point depending on the input instead of the initial state, such CNNs enable both hetero-associative and auto-associative memories to be designed [6], [7]. However, with the pseudoinverse technique ([6], [7]) of the recent literature, it is difficult to obtain an input matrix with a space-invariant cloning

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 824-829, 2006. © Springer-Verlag Berlin Heidelberg 2006
template, which ensures that the desired pattern can be associatively memorized by the corresponding input. In this paper, CNNs with space-invariant cloning templates are developed based on stability theory ([8]-[12]). Using the obtained results, in the designed DTCNNs the input patterns and the output patterns are in one-to-one correspondence.
2 Preliminaries
Let B^n denote the set of n-dimensional bipolar vectors, i.e., B^n = {x ∈ ℝ^n : x_i = 1 or −1, i = 1, 2, …, n}. Hence, B^n is made up of 2^n elements. We have the following design problem:

Design Problem. Given p (p ≤ 2^{2n}) pairwise vectors (s^(1), u^(1)), (s^(2), u^(2)), …, (s^(p), u^(p)), with s^(ℓ), u^(ℓ) ∈ B^n, ℓ ∈ {1, 2, …, p}, design a neural network such that if u^(ℓ) (ℓ ∈ {1, 2, …, p}) is input to this neural network, then s^(ℓ) is the unique output vector of the network. When s^(ℓ) ≠ u^(ℓ), s^(ℓ) is said to be hetero-associatively memorized by u^(ℓ); when s^(ℓ) = u^(ℓ), s^(ℓ) is said to be auto-associatively memorized by u^(ℓ).

Let Υ = {1, 2, …, n}; Υ_1 = {i : s_i^(1) = s_i^(2) = ⋯ = s_i^(p) = 1, i ∈ Υ}; Υ_{−1} = {i : s_i^(1) = s_i^(2) = ⋯ = s_i^(p) = −1, i ∈ Υ}; Υ_0 = Υ − Υ_1 − Υ_{−1}. For i ∈ Υ_0, let J_i^+ = {ℓ : s_i^(ℓ) = 1, ℓ ∈ {1, 2, …, p}} and J_i^- = {ℓ : s_i^(ℓ) = −1, ℓ ∈ {1, 2, …, p}}.
3 CNNs Described Using Cloning Templates
In this paper, we always assume that t ∈ N, where N is the set of all natural numbers. Consider a DTCNN described by a space-invariant template, where the cells are arranged on a rectangular array composed of N rows and M columns. The dynamics of such a DTCNN are governed by the following normalized equations:

Δy_ij(t) = −ĉ_ij y_ij(t) + Σ_{k=k1(i,r1)}^{k2(i,r1)} Σ_{l=l1(j,r2)}^{l2(j,r2)} ( a_{k,l} f(y_{i+k,j+l}(t)) + d_{k,l} u_{i+k,j+l} ) + v̂_ij,   (1)

where for i ∈ {1, 2, …, N}, j ∈ {1, 2, …, M}: k1(i, r1) = max{1 − i, −r1}, k2(i, r1) = min{N − i, r1}, l1(j, r2) = max{1 − j, −r2}, l2(j, r2) = min{M − j, r2}; y_ij denotes the state of the cell located at the crossing of the i-th row and the j-th column of the network; r1 and r2 denote the neighborhood radii and are positive integers; Â = (a_kl)_{(2r1+1)×(2r2+1)} is the feedback cloning template, defined by a (2r1+1) × (2r2+1) real matrix; D̂ = (d_kl)_{(2r1+1)×(2r2+1)} is the input cloning template, defined by a (2r1+1) × (2r2+1) real matrix; and ĉ_ij > 0. For a one-dimensional space-invariant CNN with neighborhood radius r = 1, its cloning templates are Â = [a_{−1}, a_0, a_1] and D̂ = [d_{−1}, d_0, d_1]. For i = 1, 2, …, N and j = 1, 2, …, M, let c_{(i−1)M+j} = ĉ_ij and v_{(i−1)M+j} = v̂_ij. Denote
C = diag{c_1, c_2, …, c_{NM}}. An alternative expression for the state equation of DTCNN (1) is obtained by ordering the cells in some way (e.g., by rows or by columns) and cascading the state variables into a state vector x = (x_1, x_2, …, x_{N×M})ᵀ := (y_11, y_12, …, y_1M, y_21, …, y_2M, …, y_NM)ᵀ. Denote n = N × M. The following compact form is then obtained:

Δx(t) = −Cx(t) + Af(x(t)) + Du + v,   (2)

where the vectors u and v are obtained from the u_ij and v_ij, respectively; f(x(t)) = (f(x_1(t)), f(x_2(t)), …, f(x_{N×M}(t)))ᵀ is the vector-valued activation function; and the coefficient matrices A and D are obtained from the templates Â and D̂ and have the circulant form

A = ⎡ a_0   a_1   0     ⋯    0   ⎤      D = ⎡ d_0   d_1   0     ⋯    0   ⎤
    ⎢ a_{−1} a_0   a_1   ⋯    0   ⎥          ⎢ d_{−1} d_0   d_1   ⋯    0   ⎥
    ⎢ 0     a_{−1} a_0   ⋱         ⎥          ⎢ 0     d_{−1} d_0   ⋱         ⎥   (3)
    ⎢             ⋱     ⋱    a_1 ⎥          ⎢             ⋱     ⋱    d_1 ⎥
    ⎣ 0     ⋯     a_{−1} a_0       ⎦          ⎣ 0     ⋯     d_{−1} d_0       ⎦
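As a sketch, the template-to-matrix construction in (3) and one iteration of the compact form (2) can be written as follows; the names are illustrative, and the activation f(s) = (|s + 1| − |s − 1|)/2 is the standard piecewise-linear CNN output function, an assumption not spelled out in this excerpt:

```python
# Build the banded coefficient matrix in (3) from a 1-D template
# [t_m1, t_0, t_p1] and apply one step of (2), x(t+1) = x(t) + Δx(t).

def band_matrix(t_m1, t_0, t_p1, n):
    """n x n matrix with t_0 on the diagonal, t_p1 above, t_m1 below."""
    return [[t_0 if i == j else t_p1 if j == i + 1 else t_m1 if j == i - 1 else 0.0
             for j in range(n)] for i in range(n)]

def f(s):
    """Piecewise-linear activation: f(s) = 0.5 * (|s + 1| - |s - 1|)."""
    return 0.5 * (abs(s + 1.0) - abs(s - 1.0))

def dtcnn_step(x, c, A, D, u, v):
    """One step of Δx = -Cx + A f(x) + D u + v, returning x + Δx."""
    n = len(x)
    fx = [f(xi) for xi in x]
    return [x[i] - c[i] * x[i]
            + sum(A[i][j] * fx[j] for j in range(n))
            + sum(D[i][j] * u[j] for j in range(n))
            + v[i]
            for i in range(n)]
```

Iterating dtcnn_step until the state stops changing and reading off f(x) gives the bipolar output pattern associated with the input u.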
4 Main Results

∀ℓ ∈ {1, 2, …, p}, let s_0^(ℓ) = s_{n+1}^(ℓ) = u_0^(ℓ) = u_{n+1}^(ℓ) = 0. For i ∈ Υ_1, let

v_i^+ = max_{ℓ∈{1,2,…,p}} { c_i − Σ_{j=−1}^{1} a_j s_{i+j}^(ℓ) − Σ_{j=−1}^{1} d_j u_{i+j}^(ℓ) },   v_i^- = +∞.   (4)

For i ∈ Υ_{−1}, let

v_i^- = min_{ℓ∈{1,2,…,p}} { −c_i − Σ_{j=−1}^{1} a_j s_{i+j}^(ℓ) − Σ_{j=−1}^{1} d_j u_{i+j}^(ℓ) },   v_i^+ = −∞.   (5)

For i ∈ Υ_0, let

v_i^+ = max_{ℓ∈J_i^+} { c_i − Σ_{j=−1}^{1} a_j s_{i+j}^(ℓ) − Σ_{j=−1}^{1} d_j u_{i+j}^(ℓ) },   (6)

v_i^- = min_{ℓ∈J_i^-} { −c_i − Σ_{j=−1}^{1} a_j s_{i+j}^(ℓ) − Σ_{j=−1}^{1} d_j u_{i+j}^(ℓ) }.   (7)

Theorem 1. If there exist constants a_{−1}, a_0, a_1 such that

Σ_{j=−1}^{1} |a_j| < min_{1≤i≤n} c_i ≤ max_{1≤i≤n} c_i < 1,   (8)

and there exist constants d_{−1}, d_0, d_1 such that ∀i ∈ Υ_0, ∀r ∈ J_i^+, ∀q ∈ J_i^-,

Σ_{j=−1}^{1} (u_{i+j}^(r) − u_{i+j}^(q)) d_j > 2c_i − 2a_0 − (s_{i−1}^(r) − s_{i−1}^(q)) a_{−1} − (s_{i+1}^(r) − s_{i+1}^(q)) a_1,   (9)

then for any bias vector v ∈ Π_{i=1}^{n} (v_i^+, v_i^-), the pairs (s^(1), u^(1)), (s^(2), u^(2)), …, (s^(p), u^(p)) are p associative memory pairwise vectors of the designed DTCNN with cloning templates [a_{−1}, a_0, a_1] and [d_{−1}, d_0, d_1].
Proof. ∀ℓ ∈ {1, 2, …, p}, choose

x_i*(ℓ) = ( Σ_{j=−1}^{1} a_j s_{i+j}^(ℓ) + Σ_{j=−1}^{1} d_j u_{i+j}^(ℓ) + v_i ) / c_i.   (10)

Since v ∈ Π_{i=1}^{n} (v_i^+, v_i^-), if s_i^(ℓ) = 1, then from (4) and (6),

v_i > c_i − Σ_{j=−1}^{1} a_j s_{i+j}^(ℓ) − Σ_{j=−1}^{1} d_j u_{i+j}^(ℓ).   (11)

Hence, when s_i^(ℓ) = 1, from (10) and (11),

x_i*(ℓ) > 1,   f(x_i*(ℓ)) = s_i^(ℓ).   (12)

Similarly, if s_i^(ℓ) = −1, then from (5) and (7),

v_i < −c_i − Σ_{j=−1}^{1} a_j s_{i+j}^(ℓ) − Σ_{j=−1}^{1} d_j u_{i+j}^(ℓ).   (13)

Hence, when s_i^(ℓ) = −1, from (10) and (13),

x_i*(ℓ) < −1,   f(x_i*(ℓ)) = s_i^(ℓ).   (14)

From (10), (12) and (14), ∀i ∈ {1, 2, …, n},

x_i*(ℓ) = ( Σ_{j=−1}^{1} a_j f(x_{i+j}*(ℓ)) + Σ_{j=−1}^{1} d_j u_{i+j}^(ℓ) + v_i ) / c_i;

i.e., x*(ℓ) = (x_1*(ℓ), …, x_n*(ℓ))ᵀ is an equilibrium point of (1). In addition, since (8) holds, it is easy to prove that DTCNN (1) has a globally asymptotically stable equilibrium point. Hence the pairwise vector (s^(ℓ), u^(ℓ)) is associatively memorized by DTCNN (1).

Theorem 2. If ∀i ∈ Υ_0, ∀r ∈ J_i^+, ∀q ∈ J_i^-, u_i^(r) − u_i^(q) = 2, then there exist constants d_{−1}, d_0, d_1 such that (9) holds.

Proof. Given d_{−1}, d_1, a_{−1}, a_0, a_1, choose

d_0 > max_{i∈Υ_0, r,q∈{1,2,…,p}} { 2c_i − 2a_0 − Σ_{j=−1,1} ( (u_{i+j}^(r) − u_{i+j}^(q)) d_j + (s_{i+j}^(r) − s_{i+j}^(q)) a_j ) } / 2.
Hence, (9) holds.

Theorem 3. If for j = 1, 2, 3 there exist i_j ∈ Υ_0, r_j ∈ J_{i_j}^+, q_j ∈ J_{i_j}^- such that

det ⎛ u_{i_1−1}^(r_1) − u_{i_1−1}^(q_1)   u_{i_1}^(r_1) − u_{i_1}^(q_1)   u_{i_1+1}^(r_1) − u_{i_1+1}^(q_1) ⎞
    ⎜ u_{i_2−1}^(r_2) − u_{i_2−1}^(q_2)   u_{i_2}^(r_2) − u_{i_2}^(q_2)   u_{i_2+1}^(r_2) − u_{i_2+1}^(q_2) ⎟ ≠ 0,   (15)
    ⎝ u_{i_3−1}^(r_3) − u_{i_3−1}^(q_3)   u_{i_3}^(r_3) − u_{i_3}^(q_3)   u_{i_3+1}^(r_3) − u_{i_3+1}^(q_3) ⎠

and ∀i ∈ Υ_0, ∀r ∈ J_i^+, q ∈ J_i^-, there exist constants α_i ≥ 0, β_i ≥ 0, γ_i ≥ 0 such that for j = −1, 0, 1,

u_{i+j}^(r) − u_{i+j}^(q) = α_i (u_{i_1+j}^(r_1) − u_{i_1+j}^(q_1)) + β_i (u_{i_2+j}^(r_2) − u_{i_2+j}^(q_2)) + γ_i (u_{i_3+j}^(r_3) − u_{i_3+j}^(q_3)),   (16)

then there exist constants d_{−1}, d_0, d_1 such that (9) holds.

Proof. Since (15) holds, for ε > 0 there exist constants d_{−1}, d_0, d_1 such that

Σ_{j=−1}^{1} (u_{i_1+j}^(r_1) − u_{i_1+j}^(q_1)) d_j = ε + 2c_{i_1} − 2a_0 − Σ_{j=−1,1} (s_{i_1+j}^(r_1) − s_{i_1+j}^(q_1)) a_j,
Σ_{j=−1}^{1} (u_{i_2+j}^(r_2) − u_{i_2+j}^(q_2)) d_j = ε + 2c_{i_2} − 2a_0 − Σ_{j=−1,1} (s_{i_2+j}^(r_2) − s_{i_2+j}^(q_2)) a_j,   (17)
Σ_{j=−1}^{1} (u_{i_3+j}^(r_3) − u_{i_3+j}^(q_3)) d_j = ε + 2c_{i_3} − 2a_0 − Σ_{j=−1,1} (s_{i_3+j}^(r_3) − s_{i_3+j}^(q_3)) a_j.

∀i ∈ Υ_0, ∀r ∈ J_i^+, q ∈ J_i^-, (16) and (17) imply that (9) holds.
5 Simulation Results
In this section, we give one example to illustrate the new results. A DTCNN is designed to behave as a heteroassociative memory. The input and output patterns are represented by two pairs of 6 × 6-pixel images (black pixel = −1; white pixel = 1), as shown in Fig. 1 a) and b), respectively.
Fig. 1. a) Input patterns, b) desired output patterns
Υ_0 = {3, 8, 9, 10, 11, 27, 28}. Choose the cloning templates Â = [0.1, 0.6, 0.1], D̂ = [1.5, 1, −0.5], and c_i = 0.9. Then conditions (8) and (9) hold. From (4), (5), (6) and (7), choose v = (0.8, 2.4, 0, −2.6, 0, 0, 0.5, 0, −1, 0.5, 0, 0, 0, 2.5, 3.5, −0.6, 1.5, 0, 0.5, 1.5, 0, −4, 1.5, 0.5, 0, 2.5, 2.4, 1.4, 3.5, 0.5, 0.5, 1.5, −2.5, −3.5, −2.5, 1)ᵀ.
Synthesize the desired DTCNN with the cloning templates Â, D̂ and bias vector v. Then, according to Theorem 1, (s^(1), u^(1)) and (s^(2), u^(2)) are 2 hetero-associative memory pairwise vectors of the designed DTCNN. Simulation results show that the designed DTCNN is able to recall the output patterns in Fig. 1 (b) starting from any initial value. These results are verified using random initial conditions.
Acknowledgement
This work was supported by the Hong Kong Research Grants Council under Grant CUHK4165/03E, and the Natural Science Foundation of China under Grant 60405002.
References
1. Chua, L.O., Yang, L.: Cellular Neural Networks: Theory. IEEE Trans. Circuits Syst. 35 (1988) 1257-1272
2. Brucoli, M., Carnimeo, L., Grassi, G.: Discrete-time Cellular Neural Networks for Associative Memories with Learning and Forgetting Capabilities. IEEE Trans. Circuits and Systems I 42 (1995) 396-399
3. Liu, D., Lu, Z.: A New Synthesis Approach for Feedback Neural Networks Based on the Perceptron Training Algorithm. IEEE Trans. Neural Networks 8 (1997) 1468-1482
4. Seiler, G., Schuler, A.J., Nossek, J.A.: Design of Robust Cellular Neural Networks. IEEE Trans. Circuits and Systems I 40 (1993) 358-364
5. Liu, D., Michel, A.N.: Sparsely Interconnected Neural Networks for Associative Memories with Applications to Cellular Neural Networks. IEEE Trans. Circuits and Systems II 41 (1994) 295-307
6. Grassi, G.: A New Approach to Design Cellular Neural Networks for Associative Memories. IEEE Trans. Circuits and Systems I 44 (1997) 835-838
7. Grassi, G.: On Discrete-time Cellular Neural Networks for Associative Memories. IEEE Trans. Circuits and Systems I 48 (2001) 107-111
8. Liao, X.X., Wang, J.: Algebraic Criteria for Global Exponential Stability of Cellular Neural Networks with Multiple Time Delays. IEEE Trans. Circuits and Systems I 50 (2003) 268-275
9. Zeng, Z.G., Wang, J., Liao, X.X.: Global Exponential Stability of a General Class of Recurrent Neural Networks with Time-varying Delays. IEEE Trans. Circuits and Syst. I 50 (2003) 1353-1358
10. Zeng, Z.G., Wang, J., Liao, X.X.: Stability Analysis of Delayed Cellular Neural Networks Described Using Cloning Templates. IEEE Trans. Circuits and Syst. I 51 (2004) 2313-2324
11. Zeng, Z.G., Wang, J., Liao, X.X.: Global Asymptotic Stability and Global Exponential Stability of Neural Networks with Unbounded Time-varying Delays. IEEE Trans. on Circuits and Systems II, Express Briefs 52 (2005) 168-173
12. Zeng, Z.G., Huang, D.S., Wang, Z.F.: Global Stability of a General Class of Discrete-time Recurrent Neural Networks. Neural Processing Letters 22 (2005) 33-47
Autonomous and Deterministic Probabilistic Neural Network Using Global k-Means

Roy Kwang Yang Chang1, Chu Kiong Loo2, and Machavaram V.C. Rao2

1 Faculty of Information Science and Technology, Multimedia University, Jalan Ayer Keroh Lama, 75450 Melaka, Malaysia [email protected]
2 Faculty of Engineering Technology, Multimedia University, Jalan Ayer Keroh Lama, 75450 Melaka, Malaysia {ckloo, machavaram.venkata}@mmu.edu.my
Abstract. We present a comparative study between Expectation-Maximization (EM) trained probabilistic neural networks (PNN) with random initialization and with initialization from Global k-means. To make the results more comprehensive, the algorithm was tested on both homoscedastic and heteroscedastic PNNs. Normally, the user has to define the number of clusters through trial and error, which makes random initialization stochastic in nature. Global k-means was chosen as the initialization method because it can autonomously find the number of clusters using a selection criterion and can provide deterministic clustering results. The proposed algorithm was tested on benchmark datasets and on real world data from the cooling water system of a power plant.
1 Introduction

To predict the time to failure, the application of Artificial Intelligence has been fruitful, especially using neural networks. Prognostic tools can be devised using soft computing know-how and, of course, the knowledge of experts. A good type of neural network to use is the statistics-based probabilistic neural network (PNN) introduced by Specht [1]. A PNN can take in input vectors from a machine and learn from these data. Later, when fully trained, it can be used to help classify the machine into states such as operational or failing. In this paper, the PNN will be used as the basic framework of our model, and training will be done utilizing the EM method with random initialization and with initialization from Global k-means. In the random initialization, EM is set with the number of clusters obtained separately from Global k-means, so that both methods assume the same conditions. A comparative study will be done between those two methods in the hope of building an autonomous and deterministic PNN.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 830-836, 2006. © Springer-Verlag Berlin Heidelberg 2006

2 Background

A quick overview of the three major components of the proposed system is given: PNN, Global k-means, and EM. This paper aims at trying to make the PNN more
accurate, along with giving it autonomous and deterministic characteristics. To aid the initialization of the EM algorithm, Global k-means was chosen instead of k-means because the k-means algorithm itself has an initialization problem: its results depend heavily on how "well" the initial parameter values are set. The PNN is a four-layer feedforward neural network whose first layer accepts input patterns. The second layer defines Gaussian basis functions and the third computes the approximation of the class probability function. The fourth layer is the decision layer, which attempts to minimize misclassification errors by choosing, over all classes, the biggest product of the prior probability and the conditional PDF. Let us assume we have a data set X = {X1,…,Xn} that we want to partition into M clusters C1,…,CM so that a clustering criterion is optimized. The criterion used is the squared Euclidean distance between each data point Xi in a cluster Ck and its cluster centroid mk. What differentiates Global k-means from k-means is that no initial parameter values are required. Global k-means proceeds in an incremental fashion: at each stage the algorithm adds one new cluster centroid. In general, let (m1(k),…,mk(k)) denote the final solution of the k-clustering problem. Once we have found the solution of the (k-1)-clustering problem, we try to find the solution of the k-clustering problem as follows: we perform N runs of the k-means algorithm with k clusters, where run n starts from the initial state (m1(k-1),…,m(k-1)(k-1), Xn). The best solution obtained from the N runs is taken as the solution (m1(k),…,mk(k)) of the k-clustering problem [2]. By executing in this manner, the algorithm not only finds the solution for M clusters but also discovers the solutions for all k-clustering problems with k = 1,…,M.
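The incremental procedure described above can be sketched as follows. This is a minimal illustration built on scikit-learn's `KMeans`; the function name and the return layout are ours, not the paper's, and the paper prescribes no particular implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def global_kmeans(X, M):
    """Incrementally solve the k-clustering problem for k = 1..M.

    For each k, every data point X[n] is tried as the initial position of
    the new centroid (the previous k-1 centroids are kept as the other
    initial positions), and the best of the N runs is retained.
    Returns a dict mapping k to (centroids, clustering error).
    """
    solutions = {}
    centroids = X.mean(axis=0, keepdims=True)          # optimal 1-clustering
    solutions[1] = (centroids, float(((X - centroids) ** 2).sum()))
    for k in range(2, M + 1):
        best = None
        for x in X:                                    # N candidate initial states
            init = np.vstack([centroids, x])
            km = KMeans(n_clusters=k, init=init, n_init=1).fit(X)
            if best is None or km.inertia_ < best.inertia_:
                best = km
        centroids = best.cluster_centers_
        solutions[k] = (centroids, best.inertia_)
    return solutions
```

Because every intermediate k-clustering solution is stored, a selection criterion can later be applied to `solutions` to pick the number of clusters autonomously.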
To accomplish this, the Global k-means algorithm is used to solve the k-clustering problem for a specified number of clusters, and the value of k is then chosen from the result using a selection criterion. The EM algorithm is an iterative process that alternates between an expectation step (E-step) and a maximization step (M-step) to compute the Maximum Likelihood (ML) estimate in the presence of missing or hidden data. The EM algorithm is guaranteed to converge to an ML estimate [3, 4], and its convergence rate is usually quite fast [5]. In general, however, it is hard to initialize, and the quality of the final solution depends on the quality of the initial solution: it may converge to a poor locally optimal solution, i.e., one that is acceptable but far from optimal [6]. Let us assume the data set X is partitioned into K subsets (classes), where X = X1 ∪ X2 ∪ ... ∪ XK, with each subset Xk containing Nk samples, which also means that
$\sum_{k=1}^{K} N_k = N$. The log posterior likelihood function of the data set would be

$$\log L_f = \sum_{k=1}^{K}\sum_{n=1}^{N_k}\log f_k(X_{n,k})\,, \qquad (1)$$

where $f_k(X)$ is the value of the class conditional PDF in the 3rd layer of the PNN:

$$f_k(X) = \sum_{m=1}^{M_k}\beta_{m,k}\,\rho_{m,k}(X)\,. \qquad (2)$$
$\rho_{m,k}(X)$ are the Gaussian basis functions defined in the 2nd layer of the PNN, whilst $\beta_{m,k}$ are the mixing coefficients of cluster $m$ in class $k$:

$$\rho_{m,k}(X) = \frac{1}{(2\pi\sigma_{m,k}^2)^{d/2}}\exp\!\left(-\frac{\|X-\upsilon_{m,k}\|^2}{2\sigma_{m,k}^2}\right), \qquad \sum_{m=1}^{M_k}\beta_{m,k} = 1\,. \qquad (3)$$
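Layers 2-4 of the PNN (the Gaussian basis functions of Eq. (3), the class-conditional mixture of Eq. (2), and the decision layer) can be sketched as below. The function names and the per-class parameter layout are our own illustration, not code from the paper.

```python
import numpy as np

def gaussian_basis(X, centers, sigmas):
    """rho_{m,k}(X) of Eq. (3): isotropic Gaussian kernels.  X is one
    d-dimensional pattern; centers is (M, d); sigmas is (M,) of variances."""
    d = X.shape[0]
    dist2 = ((X - centers) ** 2).sum(axis=1)
    return np.exp(-dist2 / (2 * sigmas)) / (2 * np.pi * sigmas) ** (d / 2)

def class_pdf(X, centers, sigmas, betas):
    """f_k(X) of Eq. (2): mixture of Gaussian basis functions."""
    return betas @ gaussian_basis(X, centers, sigmas)

def pnn_decide(X, classes, priors):
    """4th (decision) layer: pick the class with the biggest product of
    prior probability and conditional PDF.  `classes` holds one
    (centers, sigmas, betas) tuple per class."""
    scores = [p * class_pdf(X, c, s, b) for p, (c, s, b) in zip(priors, classes)]
    return int(np.argmax(scores))
```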
E-step: The missing/hidden data is estimated given the observed data and the current parameter estimate. Calculate the weight using

$$W_{m,k} = \frac{\beta_{m,k}\,\rho_{m,k}(X)}{\sum_{i=1}^{M_k}\beta_{i,k}\,\rho_{i,k}(X)}\,. \qquad (4)$$
M-step: Uses the data estimated in the E-step to form a likelihood function and determine the ML estimate of the parameters. Calculate the new values for the cluster centroid $\upsilon_{m,k}$, the variance $\sigma_{m,k}^2$, and the mixing coefficients $\beta_{m,k}$ using the weight calculated in the E-step:

$$\upsilon_{m,k} = \frac{\sum_{n=1}^{N_k} W_{m,k}(X_n)\,X_n}{\sum_{n=1}^{N_k} W_{m,k}(X_n)}\,, \qquad \sigma_{m,k}^2 = \frac{\sum_{n=1}^{N_k} W_{m,k}(X_n)\,\|X_n-\upsilon_{m,k}\|^2}{d\,\sum_{n=1}^{N_k} W_{m,k}(X_n)}\,, \qquad (5)$$

$$\beta_{m,k} = \frac{1}{N_k}\sum_{n=1}^{N_k} W_{m,k}(X_n)\,. \qquad (6)$$
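One combined E-step/M-step sweep (Eqs. (4)-(6)) for a single class can be sketched as follows; this is a minimal NumPy illustration under our own naming, not the authors' code.

```python
import numpy as np

def em_step(X, centers, sigmas, betas):
    """One EM iteration for one class of the PNN mixture.
    X: (N, d) samples of the class; centers: (M, d); sigmas, betas: (M,)."""
    N, d = X.shape
    # E-step (Eq. 4): responsibilities W[n, m] of cluster m for sample n
    dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    rho = np.exp(-dist2 / (2 * sigmas)) / (2 * np.pi * sigmas) ** (d / 2)
    W = betas * rho
    W /= W.sum(axis=1, keepdims=True)
    # M-step (Eqs. 5-6): weighted re-estimates of centers, variances, mixing
    Wsum = W.sum(axis=0)                               # (M,)
    centers = (W.T @ X) / Wsum[:, None]
    dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    sigmas = (W * dist2).sum(axis=0) / (d * Wsum)
    betas = Wsum / N
    return centers, sigmas, betas
```

Iterating `em_step` until the log-likelihood of Eq. (1) stops improving gives the usual EM training loop; the Global k-means centroids would be passed in as the initial `centers`.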
3 Results

To do a comparative study on the effects of using Global k-means to initialize the values of the parameters in EM, versus not using that initialization, a few well-known data sets were used. They are the Iris dataset [7] and the medical datasets, consisting of data from Cancer, Dermatology, Hepato, Heart, and Pima. A case study with data from the cooling water system in a power station is also presented for EM with Global k-means initialization.

3.1 Iris Dataset

The Iris dataset consists of 150 samples and 4 input features. The Iris dataset was run using a PNN, with training of the network utilizing the EM algorithm with randomly
Table 1. Results from both randomly initialized EM and the Global k-means initialized EM in homoscedastic and heteroscedastic PNN

Accuracy    Random initialization           Global k-means
            Homoscedastic  Heteroscedastic  Homoscedastic  Heteroscedastic
Min         95.71          94.29            97.86          95.71
Mean        96.29          95.36            -              -
Max         96.43          95.71            -              -
initialized cluster centroids and EM with Global k-means initialization. Both methods were executed in heteroscedastic PNN and in homoscedastic PNN. The mean accuracy of the homoscedastic version with random initialization is 96.29%, whilst the heteroscedastic version reports 95.36% accuracy. In both cases, however, they were outdone by the accuracy of the EM with Global k-means initialization, whose mean accuracy was 97.86% and 95.71% respectively for homoscedastic and heteroscedastic PNN. The Iris dataset was set as a 10-clustering problem for Global k-means, and the number of cluster centroids was then returned based on the minimum squared Euclidean distance between each data point in a cluster and its centroid. This number was then used to set the cluster parameter for random initialization, to help it get a better result and operate under conditions similar to Global k-means. But as the comparison data shows, Global k-means still outperforms random initialization.

3.2 Medical Datasets

The Cancer dataset contains 569 samples with a dimension of 30, the Dermatology dataset contains 358 samples with a dimension of 34, and the Hepato dataset contains 536 samples with a dimension of 9. The Heart dataset contains 270 samples with a dimension of 13 and 2 output labels: "0" for absence of heart disease and "1" for presence of heart disease. The Pima data set is available from the machine learning database at UCI [8]. The Pima dataset contains 768 samples with a dimension of 8 and has 2 classes: diabetes positive and diabetes negative.

Table 2. Correct classification rate for the Medical datasets by using the homoscedastic PNN. A ten-fold validation was employed.
Dataset      Random initialization        Global k-means
             Min     Mean    Max          Mean    Cluster
Cancer       90.00   90.63   90.96        91.92   9.3
Dermatology  60.76   64.28   65.50        69.31   18.3
Hepato       37.35   38.51   39.18        39.39   20.1
Heart        62.40   63.52   64.40        58.80   8.4
Pima         70.29   71.07   71.43        71.29   40.9
Table 3. Correct classification rate for the Medical datasets by using the heteroscedastic PNN. A ten-fold validation was employed.
Dataset      Random initialization        Global k-means
             Min     Mean    Max          Mean
Cancer       94.23   94.52   94.62        95.38
Dermatology  86.87   88.05   89.08        89.54
Hepato       51.22   52.47   53.27        58.57
Heart        75.60   78.00   78.80        82.80
Pima         66.86   68.17   68.86        69.00
Table 4. Results from other methods in comparison with Global k-means

Method            Cancer        Heart
Heteroscedastic   95.38         82.80
Homoscedastic     91.92         58.80
C 4.5 rules [8]   86.70±5.90    53.80±5.90
LFC [9]           94.40         75.10
LDA [9]           96.00         84.50
The medical datasets show improved performance of the EM with Global k-means initialization in both homoscedastic and heteroscedastic PNN. A higher M-clustering problem was first solved, and the number of clusters returned by the Global k-means algorithm was based on optimizing the clustering criterion. This number was then fed into the EM with random initialization, yet it could not perform as well as the EM with Global k-means initialization. For most of the datasets, even the maximum accuracy from the EM with random initialization is not higher than the mean of the EM with initialization from Global k-means. A comparison of homoscedastic and heteroscedastic PNN with Global k-means initialization was made with C 4.5 rules, LFC (Lookahead Feature Construction), and LDA (Linear Discriminant Analysis).

3.3 Case Study

This dataset was obtained from a power generation plant in Penang, Malaysia. The process under scrutiny was the circulating water (CW) system (Prai Power Station Stage 3, 1999) in the power station [10]. The CW system's objective is to make sure the main turbine condenser receives a sufficient amount of cooling water so that it may condense steam from the turbine exhaust and other steam flows that are channeled to the condensers. The data set used in this experiment consists of 2439 samples with a dimension of 12. Those 12 features comprise temperature and pressure measurements at various inlet and outlet points of the condenser. There were 1224 data samples (50.18%) that showed an inefficient heat transfer condition, whereas 1215 data samples (49.82%) showed an efficient heat transfer condition in the condenser [10]. A ten-fold validation for both homoscedastic and heteroscedastic PNN for EM with
Table 5. Results from the 10-fold validation homoscedastic PNN and heteroscedastic PNN for EM with Global k-means initialization
Accuracy    Initialization from Global k-means
            Homoscedastic  Heteroscedastic
Min         90.99          96.85
Mean        94.10          97.79
Max         95.95          99.10
Global k-means initialization was done. With the mean accuracy for homoscedastic PNN being 94.10% and for heteroscedastic PNN being 97.79%, this shows a very promising result for the proposed EM with Global k-means initialization.
4 Conclusion

Though EM training of the PNN is an excellent method, it can still be improved. To introduce the autonomous characteristic into EM, the Global k-means algorithm was executed ahead of the EM training to find the number of clusters based on minimizing the clustering error. Comparative results clearly indicate that even when set with the same number of clusters as Global k-means, EM with random initialization could not match the performance of EM with Global k-means initialization. A case study on the cooling water system in a power station was done for EM with Global k-means initialization, and the results returned were positive. This strengthens the case that EM with Global k-means initialization makes a good autonomous and deterministic probabilistic neural network.
References
1. Specht, D.F.: Probabilistic Neural Network. Neural Networks 3 (1990) 109-118
2. Likas, A., Vlassis, N., Verbeek, J.J.: The Global k-means Clustering Algorithm. Pattern Recognition 36 (2003) 451-461
3. Wu, C.: On the Convergence Properties of the EM Algorithm. Annals of Statistics 11 (1983) 95-103
4. Xu, L., Jordan, M.I.: On Convergence Properties of the EM Algorithm for Gaussian Mixtures. Neural Computation 8 (1996) 129-151
5. Yang, Z.R., Chen, S.: Robust Maximum Likelihood Training of Heteroscedastic Probabilistic Neural Networks. Neural Networks 11 (1998) 739-747
6. Ordonez, C., Omiecinski, E.: FREM: Fast and Robust EM Clustering for Large Data Sets. In: Proc. of the 11th CIKM. ACM Press, New York (2002) 590-599
7. Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. University of California, Irvine, Department of Information and Computer Sciences (1998)
8. Zarndt, F.: A Comprehensive Case Study: An Examination of Machine Learning and Connectionist Algorithms. MSc Thesis, Dept. of Computer Science, Brigham Young University (1995)
9. Ster, B., Dobnikar, A.: Neural Networks in Medical Diagnosis: Comparison with Other Methods. In: Bulsari, A. et al. (eds.): Solving Engineering Problems with Neural Networks. Proc. Int. Conference EANN '96. Systems Engineering Association, Turku (1996) 427-430
10. Chen, K.Y., Lim, C.P., Lai, W.K.: Application of a Neural Fuzzy System with Rule Extraction to Fault Detection and Diagnosis. Journal of Intelligent Manufacturing 16 (2005) 679-691
Selecting Variables for Neural Network Committees

Marija Bacauskiene1, Vladas Cibulskis1, and Antanas Verikas1,2

1 Department of Applied Electronics, Kaunas University of Technology, Studentu 50, LT-51368 Kaunas, Lithuania
2 Intelligent Systems Laboratory, Halmstad University, Box 823, S-30118 Halmstad, Sweden
[email protected], [email protected], [email protected]
Abstract. The aim of the variable selection is threefold: to reduce model complexity, to promote diversity of committee networks, and to find a trade-off between the accuracy and diversity of the networks. To achieve the goal, the steps of neural network training, aggregation, and elimination of irrelevant input variables are integrated based on the negative correlation learning [1] error function. Experimental tests performed on three real world problems have shown that statistically significant improvements in classification performance can be achieved from neural network committees trained according to the technique proposed.
1 Introduction
Numerous previous works on classification committees have demonstrated that an efficient committee should consist of networks that are not only very accurate but also diverse [2, 3]. Manipulating the training data set and employing different subsets of variables are the two most popular approaches used to achieve diversity of networks. In this work, aiming to promote diversity of committee networks and, more importantly, to find a trade-off between accuracy and diversity, the steps of neural network training and elimination of irrelevant input variables are integrated. To accomplish the integration, we add two additional terms to the Negative Correlation Learning (NCL) error function [1], which force input layer weights connected to irrelevant input variables to decay. The rationale behind the approach is the assumption that, due to the decay process, different input variables can prove useful in different committee networks. Observe that in NCL, all individual networks in the committee are trained simultaneously; if this is the case, the diversity of committee networks can therefore be increased. However, the increased diversity is often achieved through over-fitting, at the expense of accuracy of the networks. To control the over-fitting, we use an ordinary weight decay term.
This work was supported in part by the Lithuanian State Science and Studies Foundation.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 837–842, 2006. c Springer-Verlag Berlin Heidelberg 2006
2 Error Function
Unlike the other approaches, NCL trains all individual networks in the committee simultaneously, using the same training data for all the networks. The error function $E_i$ for the $i$th network in NCL is defined by [1]

$$E_i = \frac{1}{2P}\sum_{p=1}^{P}\left(o_{ip}-y_{ip}\right)^2 + \lambda\,\frac{1}{P}\sum_{p=1}^{P}\left(o_{ip}-\bar{o}_p\right)\sum_{j\neq i}\left(o_{jp}-\bar{o}_p\right) \qquad (1)$$
where $P$ is the number of training patterns, $o_{ip}$ is the output of the $i$th network for the $p$th training pattern, and $y_{ip}$ is the desired output for the $i$th network and $p$th pattern. The second term in Eq. (1) is the correlation penalty term for the $i$th network, where the parameter $0 \le \lambda \le 1$ adjusts the strength of the penalty and the committee output $\bar{o}_p$ is given by the average of the outputs of the networks. Aiming to promote diversity of committee networks and, more importantly, to find a trade-off between the accuracy and diversity of the networks, we add the following additional terms to the NCL error function (1):

$$R(w) = \varepsilon_1\sum_{i=1}^{N}\frac{\beta\sum_{j=1}^{n_h}|w_{ij}|}{1+\beta\sum_{j=1}^{n_h}|w_{ij}|} + \varepsilon_2\sum_{i=1}^{N}\sum_{j=1}^{n_h}(w_{ij})^2 + \varepsilon_3\sum_{i=1}^{n_h}\sum_{j=1}^{n_o}(w_{ij})^2 \qquad (2)$$
with $w_{ij}$ being a network weight, $n_h$ the number of hidden nodes, $n_o$ the number of output nodes, and $N$ the number of input variables; the constants $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_3$ and $\beta$ have to be chosen experimentally. The first term of the function $R(w)$ can be considered a measure of the total number of important input features. The second and third terms of $R(w)$ are exactly the weight-decay terms for the input-to-hidden and hidden-to-output layer weights, respectively. Weights connected to unimportant features should attain values near zero during the learning process. By using both the variable elimination and weight decay terms we aim to find a trade-off between accuracy and diversity of committee networks.
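The two pieces of the objective, the NCL error of Eq. (1) and the penalty $R(w)$ of Eq. (2), can be sketched in NumPy as follows. Function and variable names are ours, not the authors'; a real trainer would of course also need the gradients of these terms.

```python
import numpy as np

def ncl_error(outputs, targets, i, lam):
    """Eq. (1): NCL error of the i-th committee network.
    outputs: (L, P) outputs of the L networks on P patterns."""
    L, P = outputs.shape
    mean_out = outputs.mean(axis=0)                      # committee output o_p
    mse = ((outputs[i] - targets) ** 2).sum() / (2 * P)
    others = np.delete(np.arange(L), i)
    penalty = ((outputs[i] - mean_out)
               * (outputs[others] - mean_out).sum(axis=0)).sum() / P
    return mse + lam * penalty

def regularizer(W_in, W_out, eps1, eps2, eps3, beta):
    """Eq. (2): feature-saliency term plus the two weight-decay terms.
    W_in: (N, n_h) input-to-hidden weights; W_out: (n_h, n_o)."""
    row_l1 = np.abs(W_in).sum(axis=1)        # per-input-variable weight magnitude
    saliency = (beta * row_l1 / (1 + beta * row_l1)).sum()
    return eps1 * saliency + eps2 * (W_in ** 2).sum() + eps3 * (W_out ** 2).sum()
```

Note that each saliency summand saturates at 1 for large weights, which is why the first term behaves like a (soft) count of the features a network actually uses.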
3 Assessing Diversity of Networks
To assess the diversity, we propose using a statistic based on the generalization error. Let us assume that $X=[x_1\,x_2\,...\,x_P]^T$ is the training data set of $P$ points, $\mathbf{w}$ is the $q$-parameter vector of the network, and $Z=[z_1\,z_2\,...\,z_P]^T$ is the Jacobian matrix with $z_k = \left.\partial f(x_k,\mathbf{w})/\partial\mathbf{w}\right|_{\mathbf{w}=\mathbf{w}_{LS}}$, where $f(x_k,\mathbf{w})$ stands for the neural network output and $\mathbf{w}_{LS}$ means the least-squares estimate of the parameter vector. The diagonal elements $h_{kk} = z_k^T(Z^TZ)^{-1}z_k$, $k=1,...,P$, of the orthogonal projection matrix $Z(Z^TZ)^{-1}Z^T$, called leverages of the training data, satisfy the following relations: $\sum_{k=1}^{P}h_{kk} = q$, $0 \le h_{kk} \le 1$ [4]. We exploit leverages for
assessing the diversity of committee networks. It is known that the approximate $k$th leave-one-out error $e_k$ is [4]:

$$e_k = r_k/(1-h_{kk}) \qquad (3)$$

where $r_k = y_k - f(x_k,\mathbf{w}_{LS})$ is the $k$th residual and $y_k$ is the target. Then, having estimated $e_{ki}$ for each $k=1,...,P$ and each network $i=1,...,L$, to assess the diversity of the networks $i$ and $j$ we calculate the diversity measure $\rho$ based on the approximate leave-one-out error vectors $\mathbf{e}_i$ and $\mathbf{e}_j$:

$$\rho = 0.5 - \frac{0.5\,\mathbf{e}_i^T\mathbf{e}_j}{\|\mathbf{e}_i\|_2\,\|\mathbf{e}_j\|_2} \qquad (4)$$

where $\|\cdot\|_2$ stands for the Euclidean norm and $\rho \in [0,1]$.
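Eqs. (3)-(4) can be sketched as follows (a minimal illustration with our own naming; `pinv` is used instead of a plain inverse only for numerical safety):

```python
import numpy as np

def loo_errors(Z, residuals):
    """Approximate leave-one-out errors e_k = r_k / (1 - h_kk), Eq. (3).
    Z: (P, q) Jacobian of the network output w.r.t. the q weights at w_LS."""
    H = Z @ np.linalg.pinv(Z.T @ Z) @ Z.T       # orthogonal projection matrix
    leverages = np.diag(H)                      # sum to q, each in [0, 1]
    return residuals / (1 - leverages)

def diversity(e_i, e_j):
    """Eq. (4): rho in [0, 1]; 0.5 for orthogonal LOO error vectors,
    0 for identical ones, 1 for anti-correlated ones."""
    cos = (e_i @ e_j) / (np.linalg.norm(e_i) * np.linalg.norm(e_j))
    return 0.5 - 0.5 * cos
```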
4 Training and Aggregating Networks into a Committee
Numerous schemes have been proposed for aggregating outputs of multiple networks [5, 6]. In this study, we used the averaging aggregation scheme. The training algorithm proposed is encapsulated in the following eight steps.

1. Choose: the NCL parameter λ; the parameters ε1, ε2, ε3 and β used in the additional terms of the error function; the number of epochs for the initial training, T1; the number of networks included in the committee, L; and the parameter α engaged in the feature elimination process.
2. Include a noise variable into the data set. Randomly divide the data set available into training, cross-validation, and test data sets.
3. Separately train each committee member for T1 epochs by minimizing the mean squared error function augmented with the weight decay term.
4. Simultaneously train all the committee networks by minimizing the error function given by (1) augmented with (2).
5. For each committee member, eliminate the features i = 1, ..., N for which S_i < α S_noise, where

$$S_i = \sum_{j=1}^{n_h}|w_{ij}| \qquad (5)$$

with w_ij being a weight between the input and the hidden layer.
6. Retrain the committee without using the eliminated features.
7. Repeat Steps 5 and 6 until no feature can be eliminated for any of the committee members.
8. Eliminate the noise variable, set ε1 = 0, and retrain the committee.
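Step 5's elimination rule (Eq. (5)) can be sketched as below; the function name and the index-based interface are our own illustration.

```python
import numpy as np

def eliminate_features(W_in, noise_idx, alpha=1.2):
    """Per-feature saliency S_i = sum_j |w_ij| (Eq. 5); drop every feature
    whose saliency falls below alpha times that of the injected noise
    variable.  Returns the indices of the surviving real features."""
    S = np.abs(W_in).sum(axis=1)
    return [i for i in range(W_in.shape[0])
            if i != noise_idx and S[i] >= alpha * S[noise_idx]]
```

The injected noise variable thus acts as a data-driven threshold: any input whose connection weights have decayed to near the noise level is treated as irrelevant for that network.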
5 Experimental Investigations
In all the tests, we used 30 different partitionings of the data set into training, cross-validation, and test sets. The mean values and standard deviations of the average error rate were calculated from these 30 trials. To test the
schemes studied we used three real world problems. The data used are available at: www.ics.uci.edu/~mlearn/. The United States congressional voting records data set consists of 435 16-dimensional data points. Out of the 435 data points available, 197 samples were randomly selected for training, 21 samples for cross-validation, and 217 for testing. The Pima Indians diabetes data set contains 768 8-dimensional samples. From the data set, we have randomly selected 345 samples for training, 39 samples for cross-validation, and 384 samples for testing. There are 30 real-valued features and 569 instances in the Wisconsin diagnostic breast cancer (WDBC) data set. We randomly selected 269 samples for training, 30 samples for cross-validation, and 270 for testing.

5.1 Learning Parameters
Following the recommendations given in [7], values of the parameters ε1 and ε2 were chosen such that 0.1ε1 ≤ 1000ε2 ≤ 10ε1, more specifically ε1 = 0.1 and ε2 = 0.0001. The regularization constant ε3 and the parameter λ = 0.9 were found by cross-validation. The number of initial training epochs T1 = 10 and the parameter β = 10 worked well for all three data sets. In all the tests, we used the same number, L = 7, of committee members. The parameter α engaged in the feature elimination process was set to α = 1.2.

5.2 Simulation Results
Table 1 summarizes the results obtained from the feature elimination procedure when learning the Diabetes classification problem. In the table, '+/−' means that the particular feature is used/not used by that particular committee network. As can be seen, all the features are utilized. However, the feature use by the different networks is rather diverse. This is expected, since the diverse feature use allows creating less correlated networks. Various tests performed so far have shown that for the Diabetes data set, feature '2' is the most discriminative one [8]. As can be seen from Table 1, this feature is utilized by all the committee members. Table 1 also indicates that the committee utilizes only about 60% of the input variables available. Approximately the same feature elimination rate has

Table 1. The feature use by the committee networks for the Diabetes data set
Network\Feature   1   2   3   4   5   6   7   8
1                 +   +   +   +   +   +   −   +
2                 +   +   +   −   −   −   +   +
3                 −   +   −   −   +   −   −   +
4                 −   +   −   −   −   +   −   −
5                 +   +   −   +   +   +   +   −
6                 −   +   +   −   +   +   +   −
7                 −   +   −   +   +   −   −   +
Table 2. The average test data set error rate obtained from the two approaches

Method\Data   Diabetes       Voting        WDBC
NCL           24.23 (1.48)   5.28 (1.31)   3.70 (1.14)
Proposed      21.94 (1.11)   3.82 (0.61)   2.62 (1.31)
also been observed for the Voting data set, while for the WDBC data set more than 60% of the input variables were eliminated. Table 2 provides the average test data set error rate obtained from the committees trained according to the ordinary NCL and the approach proposed. In parentheses, the standard deviation of the error rate is provided. For all three data sets, an evident reduction in classification error is observed. Is the reduction in classification error statistically significant? The classification error is not bounded from above but is bounded from below by zero. Thus, it is reasonable to take the logarithm of the errors before comparing their means. If we assume that the classification errors are approximately log-normally distributed, then we can use the t-test to see if the two schemes are significantly different from the classification error rate viewpoint [9]. The t-test applied has shown that the difference between the error rates obtained is significant with 95% confidence for all three data sets. Thus, the technique not only reduces the model complexity but also improves the committee performance. Fig. 1 presents the accuracy-diversity (ρ-error) diagrams for the networks trained according to the NCL and the proposed approaches using the Diabetes data set. The networks created using the proposed training technique are more accurate than those obtained by the NCL. The following observations can be made regarding the networks' diversity. On average, networks obtained from the NCL are more diverse than those produced by the technique proposed. However, as the diagrams indicate, the technique proposed creates a considerable number of networks whose diversity is as high as, or even higher than, that obtained from the NCL. As for the trade-off between the accuracy and diversity of networks, the results presented in Table 2 substantiate the fact that the technique
Fig. 1. The ρ-error diagrams for the networks trained using the Diabetes data set: (left) NCL, (right) proposed
proposed is superior to the NCL. The NCL obtains the increased diversity at the expense of accuracy. The reduced accuracy is not compensated for by the increased diversity. Concerning the generalization error vectors ei , the ρ-error diagrams presented in Fig. 1 indicate that a considerable number of neural networks created by the NCL exhibit almost uncorrelated vectors ei , ej . It is interesting to observe that analysis of the ρ-error diagrams has shown that for the Breast cancer data set, the increased average performance obtained from the committees trained according to the technique proposed is mainly due to the increased diversity of the networks, while for the Voting records data set, the increased average performance is due to both the slightly increased diversity and accuracy of the committee networks.
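The significance test described above, a t-test on log-transformed error rates under a log-normality assumption, can be sketched with SciPy; the function name and the sample data in the test are ours.

```python
import numpy as np
from scipy import stats

def compare_error_rates(errors_a, errors_b, alpha=0.05):
    """Two-sample t-test on log error rates (assumed approximately
    log-normal), as used to check the 95%-confidence significance of the
    difference between two training schemes.  Returns (significant, p)."""
    t, p = stats.ttest_ind(np.log(errors_a), np.log(errors_b))
    return bool(p < alpha), float(p)
```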
6 Conclusions
This paper is concerned with variable selection for neural classification committees. The technique developed aims to promote diversity of committee networks and, more importantly, to find a trade-off between the accuracy and diversity of the networks. To achieve the goals, the steps of neural network training, aggregation, and elimination of irrelevant input variables are integrated based on the negative correlation learning error function. The experiments conducted have shown that the increased performance obtained from the committees trained according to the technique proposed can be due to both the increased diversity and accuracy of committee networks. By applying the technique, statistically significant improvements in classification performance have been achieved.
References
1. Liu, Y., Yao, X.: Ensemble Learning via Negative Correlation. Neural Networks 12 (1999) 1399-1404
2. Bacauskiene, M., Verikas, A.: Selecting Salient Features for Classification Based on Neural Network Committees. Pattern Recognition Letters 25 (2004) 1879-1891
3. Verikas, A., Lipnickas, A., Malmqvist, K.: Selecting Neural Networks for Making a Committee Decision. In: Dorronsoro, J.R. (ed.): Lecture Notes in Computer Science, Volume 2415. Springer-Verlag, Heidelberg (2002) 420-425
4. Monari, G., Dreyfus, G.: Local Overfitting Control via Leverages. Neural Computation 14 (2002) 1481-1506
5. Kuncheva, L.I., Bezdek, J.C., Duin, R.P.W.: Decision Templates for Multiple Classifier Fusion. Pattern Recognition 34 (2001) 299-314
6. Verikas, A., Lipnickas, A., Malmqvist, K., Bacauskiene, M., Gelzinis, A.: Soft Combination of Neural Classifiers: A Comparative Study. Pattern Recognition Letters 20 (1999) 429-444
7. Setiono, R., Liu, H.: Neural-Network Feature Selector. IEEE Transactions on Neural Networks 8 (1997) 654-662
8. Verikas, A., Bacauskiene, M.: Feature Selection with Neural Networks. Pattern Recognition Letters 23 (2002) 1323-1335
9. Cohen, P.R.: Empirical Methods for Artificial Intelligence. MIT Press, Cambridge, Massachusetts (1995)
An Adaptive Network Topology for Classification

Qingyu Xiong, Jian Huang, Xiaodong Xian, and Qian Xiao

The Key Lab of High Voltage Engineering & Electrical New Technology of Education Ministry of China, Automation College, Chongqing University, Chongqing 400044, China [email protected]
Abstract. Constructive learning algorithms have proved to be powerful methods for training feedforward neural networks. In this paper, we present an adaptive network topology with a constructive learning algorithm. It consists of SOM and RBF networks as a basic network and a cluster network, respectively. The SOM network performs unsupervised learning to place the SOM output cells at suitable positions in the input space. The weight vectors belonging to its output cells are also transmitted to the hidden cells in the RBF network as the centers of the RBF activation functions. As a result, a one-to-one correspondence is produced between the output cells of the SOM and the hidden cells of the RBF network. The RBF network performs supervised training using the delta rule. The output errors of the RBF network are used to determine where to insert a new SOM cell according to a rule. This also makes it possible to let the RBF cells grow as the SOM output cells increase, until a performance criterion is fulfilled or a desired network size is obtained. The simulation results for the two-spirals benchmark show that the proposed adaptive network structure can achieve good performance and generalization results.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 843-848, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction

Multilayer feedforward neural networks (MFNN) trained using the backpropagation learning algorithm are limited to searching for a suitable set of weights within an a priori fixed network topology. This mandates the selection of an appropriate network topology for the learning problem at hand. However, there are no known efficient methods for determining the optimal network topology for a given problem. Optimal network model selection involves matching the complexity of the problem to be approximated with the complexity of the model. Networks that are too small are unable to learn the problem adequately, while overly large networks tend to overfit the training data and consequently give poor generalization performance [1]. This suggests the need for algorithms that learn both the network topology and its weights. When it is decided to change the structure of the network during training, there are two different approaches: adding nodes and connection weights to the network, or deleting them from the network, according to some algorithmic criterion. Constructive learning algorithms offer an attractive framework for the incremental construction of near-minimal neural network architectures. These algorithms start with a small network, adding and training neurons as needed until a satisfactory
Q. Xiong et al.
solution is found [2][3]. Network pruning offers another approach to dynamically determining an appropriate network topology. Pruning techniques [4][5] begin by training a larger-than-necessary network and then eliminate weights and neurons that are deemed redundant. Constructive algorithms offer several significant advantages over pruning-based algorithms [6], but when using a constructive method one has to decide: when to add a neuron; where to add it; the connectivity of the added neuron; the weight initialization for the added neuron; and how to train it. In this paper, we propose an adaptive network structure with a constructive learning algorithm, composed of two networks: a basic network and a cluster network. In this structure, an RBF network and a Kohonen SOM network are adopted as the basic network and the cluster network, respectively. The output nodes of the SOM and the hidden nodes of the RBF network are in one-to-one correspondence (see Fig. 1).
Fig. 1. Structure of adaptive network topology
2 Adaptive Structure
2.1 Basic Structure
Let R = (r1, r2, ···, rN) denote the input vector, and let Y be the output of the basic network. Each node in the output layer of the SOM has a weight vector Wl attached to it, where Wl = (wl1, wl2, ···, wlN) (l ∈ L). Wl is also transmitted to the corresponding hidden node in the RBF network as the center of its RBF activation function. The usual procedure for training such a network consists of two consecutive phases: an unsupervised one and a supervised one. The network starts with a minimal size (usually three cells) and then sequentially adds neurons according to an optimality criterion. The SOM network performs unsupervised learning. It generates an ordered mapping of the input data onto a low-dimensional topological structure; that is, the SOM defines a topology-preserving mapping from the input data space onto the output cells. The weight vectors Wl of the SOM output cells are then transmitted to the hidden cells of the RBF network as the centers of their RBF activation functions. The RBF network performs supervised training using the delta rule. The current output errors of the RBF network are used as the error values of the current best-matching cell in
the SOM to determine where to insert a new SOM cell according to an insertion rule. Because of the one-to-one connection, this also lets the RBF network cells grow until a performance condition is fulfilled or a desired number of cells is reached.
2.2 Center Adaptation
When an input vector R is presented to the network, the Euclidean distances between each Wl and R are first computed in the SOM. The output cells of the SOM compete with each other, and the cell c whose weight vector is closest to the current input vector (minimum distance) wins; it is called the best-matching cell:
c = arg min_{l ∈ L} {||R − Wl||}        (1)
where L is the set of output cells in the SOM network. After determining the best-matching cell for the current input pattern, the match between the best-matching cell and its topological neighbors is strengthened according to the principle proposed by Kohonen [7]. There are, however, two differences: (1) the adaptation strength is constant over time; specifically, we use constant adaptation parameters αc and αn for the best-matching cell and its neighboring cells, respectively; (2) only the best-matching cell and its first topological neighbors are adapted. Let Nc denote the set of direct topological neighbors of the winner c. An unsupervised update step can then be formulated as follows:
Wc ← Wc + αc · (R − Wc)
Wl ← Wl + αn · (R − Wl)   for l ∈ Nc
Wl ← Wl                   otherwise        (2)
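Eqs. (1)-(2) can be sketched in a few lines of NumPy. This is a hypothetical illustration, not the authors' code; the `neighbors` adjacency structure and the adaptation rates are assumptions for the sake of the example.

```python
import numpy as np

def som_update(W, neighbors, r, alpha_c=0.1, alpha_n=0.01):
    """One unsupervised adaptation step (Eqs. 1-2).

    W         : (L, N) array of SOM weight vectors, updated in place
    neighbors : dict mapping a cell index to the set of its direct
                topological neighbours (assumed representation)
    r         : (N,) input vector
    """
    # Eq. (1): the best-matching cell is the nearest weight vector
    c = int(np.argmin(np.linalg.norm(W - r, axis=1)))
    # Eq. (2): move the winner strongly, its first neighbours weakly
    W[c] += alpha_c * (r - W[c])
    for l in neighbors[c]:
        W[l] += alpha_n * (r - W[l])
    return c
```

The function returns the winner's index c, which the error-counter update of Section 2.2 needs later.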
The input to the basic network is the same as the input to the SOM. Each Gaussian cell l in the RBF network has an associated vector Wl, indicating the center of its Gaussian function in the input space, and a standard deviation σl defined using the mean distance to its first neighboring Gaussian cells:
σl² = mean(||Wc − Wl||²)   for c ∈ Nl        (3)
For a given input vector R, the activation of RBF cell l is:

fl(R) = exp(−||R − Wl||² / σl²)        (4)
The Gaussian cells in the RBF network are fully connected to the linear output node by weighted connections wl. The activation of the output node of the RBF network is therefore computed as:
Y = Σ_{l ∈ L} wl · fl        (5)
where L is the set of hidden cells in the RBF network, identical to the set of output cells in the SOM network. The weights wl of the output layer are trained to produce the desired value T at the output node of the RBF network, using the common delta rule:
wl ← wl + β (T − Y) fl   for all l ∈ L        (6)

where β is the learning rate. The error caused by the current input is used to determine where to insert a new cell in the input space. For each cell l in the SOM, we define a local counter variable Vl that accumulates the errors of the inputs for which cell l was the best match. Because the cells keep moving slightly, more recent errors should be weighted more strongly than older ones; this is guaranteed by decreasing all counter variables by a certain fraction α after each adaptation step:
Vl ← Vl + |T − Y|²   for l = c
Vl ← α · Vl           otherwise        (7)

where α ∈ (0, 1) is the decay rate.
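Eqs. (4)-(7) amount to a Gaussian forward pass, a delta-rule update of the output weights, and a decayed error counter per SOM cell. A minimal NumPy sketch of these three steps follows; the vectorized layout, function names and default rates are our own assumptions, not the paper's implementation.

```python
import numpy as np

def rbf_forward(W, sigma2, w_out, r):
    """Eqs. (4)-(5): Gaussian activations and the linear output.
    W: (L, N) centers, sigma2: (L,) per-cell variances, w_out: (L,) weights."""
    f = np.exp(-np.sum((W - r) ** 2, axis=1) / sigma2)  # Eq. (4)
    return float(w_out @ f), f                          # Eq. (5)

def delta_rule_step(W, sigma2, w_out, r, t, beta=0.05):
    """Eq. (6): one supervised update of the output weights toward target t."""
    y, f = rbf_forward(W, sigma2, w_out, r)
    w_out += beta * (t - y) * f
    return y

def update_counters(V, c, err, alpha=0.9):
    """Eq. (7): the best-matching cell c accumulates the squared output
    error; every other counter decays by the factor alpha."""
    winner = V[c] + err ** 2
    V *= alpha
    V[c] = winner
```

Repeated delta-rule steps drive the output toward the target, and the counters V then indicate where the remaining error concentrates in the input space.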
2.3 Insertion of New Cell
After a fixed number M of adaptation steps in the SOM and the RBF network, we determine the cell s with the largest error counter value among the Vl:

s = arg max_{l ∈ L} {Vl}        (8)
Then, we look for a cell with the largest error counter value among the first neighbors Ns of cell s. This is the cell f satisfying:
f = arg max_{l ∈ Ns} {Vl}        (9)
Among the common neighbors Ns ∩ Nf of cells s and f, we determine the cell u with the largest distance from cell s:

u = arg max_{l ∈ Ns ∩ Nf} {||Wl − Ws||}        (10)
We can now insert a new cell I between cells s and u. Its position in the input space is initialized as:

WI = (Ws + Wu) / 2        (11)
To preserve the topological relationships between the SOM output cells after inserting cell I, we set cells s and u and their common neighbors Ns ∩ Nu as cell I's
first neighbors. The original neighboring relationship between cells s and u is then removed. In this way, a local optimum with respect to both the SOM structure and the basic network's output error is pursued, because new cells are always inserted in the regions where the training error is locally maximal. Whenever a new cell I is inserted, a new weight wI connecting cell I to the output cell of the RBF network must be determined. We initialize wI to 0 so that the insertion does not disturb the output value of the RBF network. To prevent the network from growing indefinitely, a stopping criterion has to be defined. In the simplest case, one can specify a maximum number of allowed cells and halt the process when this number is reached or surpassed.
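The insertion procedure of Eqs. (8)-(11), including the zero-initialized output weight and the rewiring of neighborhoods, can be sketched as follows. This is a hypothetical NumPy implementation: the adjacency representation and the fallback when s and f share no common neighbor are our assumptions, not part of the paper.

```python
import numpy as np

def insert_cell(W, V, neighbors, w_out):
    """Grow the network by one cell (Eqs. 8-11).
    W: (L, N) SOM centers, V: (L,) error counters, w_out: (L,) RBF
    output weights, neighbors: dict of index -> set of indices."""
    s = max(range(len(V)), key=lambda l: V[l])                      # Eq. (8)
    f = max(neighbors[s], key=lambda l: V[l])                       # Eq. (9)
    common = neighbors[s] & neighbors[f]
    # Eq. (10): farthest common neighbour of s and f; fall back to f
    # itself if no common neighbour exists (our assumption)
    u = max(common, key=lambda l: np.linalg.norm(W[l] - W[s])) if common else f
    i = len(V)                                                      # new cell index
    W = np.vstack([W, (W[s] + W[u]) / 2.0])                         # Eq. (11)
    V = np.append(V, 0.0)
    w_out = np.append(w_out, 0.0)   # zero weight: output unchanged at insertion
    # rewire topology: I neighbours s, u and their common neighbours
    new_nb = {s, u} | (neighbors[s] & neighbors[u])
    neighbors[i] = new_nb
    for l in new_nb:
        neighbors[l].add(i)
    neighbors[s].discard(u)         # cancel the old s-u neighbour link
    neighbors[u].discard(s)
    return W, V, w_out, neighbors
```

A loop that alternates M adaptation steps with one call to this function, stopping at a maximum cell count, reproduces the growth procedure described above.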
3 Simulation
To examine the abilities of the proposed adaptive method, the system was tested on the two-spirals benchmark problem [8], which is an extremely hard problem for algorithms of the back-propagation family. Each of the two spirals makes three complete turns in the plane, with 32 points per turn plus an end point, for a total of 97 points. The data set thus consists of 194 points belonging to two interspersed spiral-shaped classes, 97 samples per class, one class with target output +1 and the other with −1 (see Fig. 2(a)). The trained SOM network and the corresponding decision regions are shown in Fig. 2(b) and (c). The decision regions form two well-separated spirals, and the two interspersed spiral-shaped classes are separated very well.
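The two-spirals data set described above (three turns, 32 points per turn plus an end point, 97 points per class) is commonly generated as in the sketch below. The radius and angle formulas follow the widely used CMU benchmark convention and are our assumptions, not taken from the paper.

```python
import numpy as np

def two_spirals(points_per_spiral=97):
    """Generate the two-spirals benchmark: points_per_spiral samples
    per class, labelled +1 and -1; the second spiral is the point
    reflection of the first through the origin."""
    i = np.arange(points_per_spiral)
    phi = i * np.pi / 16.0               # 32 points per full turn
    r = 6.5 * (104 - i) / 104.0          # radius shrinks toward the centre
    x, y = r * np.sin(phi), r * np.cos(phi)
    pts = np.concatenate([np.stack([x, y], 1), np.stack([-x, -y], 1)])
    labels = np.concatenate([np.ones(points_per_spiral),
                             -np.ones(points_per_spiral)])
    return pts, labels
```

With the default of 97 points per spiral, the angle spans 6π, i.e. three full turns, and the full set contains 194 labelled points.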
Fig. 2. (a) The two-spiral problem; (b) positions of the output cells in the SOM network; (c) output performance of the RBF network
4 Conclusions
In this paper, we have proposed a new adaptive method composed of an RBF network and a Kohonen SOM network. The SOM network performs unsupervised learning and defines a topology mapping from the input data onto its output nodes.
The weight vectors of the SOM output cells are transmitted to the hidden cells of the RBF network as the centers of the RBF activation functions. The RBF network performs supervised learning using the delta rule, and its current output errors determine where to insert a new SOM unit according to an insertion condition. Because of the one-to-one connection between the SOM output cells and the RBF hidden cells, this also lets the hidden cells of the RBF network grow. Our simulations show that the proposed neural network has good generalization ability even on the two-spiral benchmark problem. One can also say that a local optimum regarding the SOM's structure and the RBF network's output error is found, because new cells are always inserted in those regions of the input space where the probability of error is locally maximal.
Acknowledgements
The work described in this paper was supported by Project 60375024 of the NSFC, Project 2002AA2430031 of the 863 Program, and the Applied Basic Research Grants Committee of Science & Technology of Chongqing.
References
1. Duda, R., Hart, P.: Pattern Classification and Scene Analysis. Wiley, New York (1973)
2. Gallant, S.: Neural Network Learning and Expert Systems. MIT Press, Cambridge (1993)
3. Honavar, V., Uhr, L.: Generative Learning Structures for Generalized Connectionist Networks. Inform. Sci. 70 (1/2) (1993) 75-108
4. Reed, R.: Pruning Algorithms—A Survey. IEEE Trans. Neural Networks 4 (1993) 740-747
5. Finnoff, W., Hergert, F., Zimmermann, H.G.: Improving Model Selection by Nonconvergent Methods. Neural Networks 6 (1993) 771-783
6. Kwok, T.Y., Yeung, D.Y.: Constructive Algorithms for Structure Learning in Feedforward Neural Networks for Regression Problems. IEEE Trans. Neural Networks 8 (1997) 630-645
7. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (1995)
8. Carpenter, G.A., Grossberg, S., Markuzon, N., Reynolds, J.H., Rosen, D.B.: Fuzzy ARTMAP: A Neural Network Architecture for Incremental Supervised Learning of Analog Multidimensional Maps. IEEE Trans. Neural Networks 3 (5) (1992) 698-713
A Quantitative Comparison of Different MLP Activation Functions in Classification Emad A. M. Andrews Shenouda Department of Computer Science, University of Toronto, Toronto, ON, Canada [email protected]
Abstract. Multilayer perceptrons (MLPs) have proven very successful in many applications, including classification. The activation function is the source of the MLP's power, and its careful selection has a huge impact on network performance. This paper gives a quantitative comparison of the four most commonly used activation functions, including the Gaussian RBF, over ten different real datasets. Results show that the sigmoid activation function substantially outperforms the other activation functions. Also, by using only the needed number of hidden units in the MLP, we improved its convergence time to be competitive with RBF networks most of the time.
1 Introduction
The introduction of the back-propagation (BP) algorithm is a landmark in neural networks (NN) [5]. The earliest description of BP was presented by Werbos in his PhD dissertation in 1974 [12], but it did not gain much publicity until it was independently rediscovered by Le Cun, Parker, Hinton and Williams [7]. The MLP is perhaps the most famous implementation of BP. It is used very successfully in many applications in various domains such as prediction, function approximation and classification. For classification, the MLP can be considered a super-regression machine that can draw complicated decision borders between nonlinearly separable patterns [5]. The nonlinearity power of the MLP comes from the fact that all its neurons use a nonlinear activation function to calculate their outputs. An MLP with one hidden layer can form single convex decision regions, while adding more hidden layers can form arbitrary disjoint decision regions. In [6], Huang et al. showed that single hidden layer feedforward neural networks (SLFNs) with some unbounded activation functions can also form disjoint decision regions with arbitrary shapes. If a linear activation function is used, the whole MLP becomes a simple linear regression machine. The value of the activation function determines not only the decision borders but also the total signal strength the neuron will produce and receive. In turn, this affects almost all aspects of solving the problem at hand: the quality of the network's initial state, the speed of convergence and the efficiency of the synaptic weight updates. As a result, a careful selection of the activation function has a huge impact on MLP classification performance. In theory, BP is universal in this matter, such
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 849 – 857, 2006. © Springer-Verlag Berlin Heidelberg 2006
E.A.M.A. Shenouda
that any activation function can be used as long as it has a first derivative. Activation functions can be categorized into three basic families [15]: linear, e.g. step-like functions; logistic, e.g. sigmoidal; and radial basis functions (RBF), e.g. Gaussian. In spite of the importance of the activation function as an integral part of any feedforward network, it has not been well investigated in the NN literature. Although Michie et al. showed that both MLP and RBF are computationally equivalent, they could not interpret the discrepancy in classification performance: the MLP outperformed the RBF but needed an order of magnitude more training time [2]-[15]. In this paper, we compare the performance of the three families of functions with respect to classification over 10 different real datasets, using both batch and online learning. With careful MLP pruning, our results show that the MLP with the sigmoid activation function is superior for classification, with training time less than or competitive with that of the RBF.
1.1 Related Work
The mathematical and theoretical foundation of various MLP activation functions, including RBF, is given by Duch and Jankowski in [15]. In [14], the authors provided a comprehensive survey of different activation functions. Others have compared the performance of different activation functions in MLPs. In [3], the authors provided a visual comparison of the speed of convergence of different linear and sigmoid functions. However, the comparison did not provide much information, since a single randomly generated, easy dataset was used. Comparing the performance of MLP and RBF networks has attracted the attention of more researchers. While the RBF outperformed the MLP for voice recognition in [10], other authors showed that the MLP is superior to the RBF in fault tolerance and resource allocation applications [8]-[9]. Various attempts exist to combine MLP and RBF networks into a single hybrid network that is less complex and performs better.
In [1], the authors presented a new MLP-RBF architecture that outperformed both individual networks in an equalizer signal application. In [4], the authors provided a unified framework for both MLP and RBF via conic section transfer functions. For classification, the most comprehensive source of comparison is the StatLog report [2]. The report presents a performance comparison between different NN and statistical algorithms for classification over 22 different datasets. Both MLP and RBF did well most of the time, but the results showed a huge discrepancy between RBF and MLP in terms of cross-validation error: the MLP outperformed the RBF most of the time, and the RBF sometimes did not report valid results. The only problem with the MLP's performance in this report is its training time, which was an order of magnitude greater than that of the RBF networks. In [13], Wang and Huang systematically studied both the sigmoidal and RBF activation functions in protein classification and reached the same conclusion as we do. Wang et al.'s work is significant for using Extreme Learning Machines (ELM) instead of BP. Unlike BP, which works only with differentiable activation functions, ELM can work with any nonlinear activation function; also, ELM does not need any control parameters.
According to Wang et al., ELM achieved higher classification accuracy with up to four orders of magnitude less training time compared to BP. In addition to confirming that the MLP with the sigmoid activation function is superior for classification applications, our contribution in this paper is to prune the MLP size significantly, improving its training time to be less than the time needed by the RBF most of the time. Also, we show the relation between the problem's dimensionality and number of classes on one side and the performance of the activation function on the other. We carefully observed the network's initial state for each activation function and related it to the network's ability to generalize.
2 Activation Functions in Comparison
Let ni denote the net input to hidden unit i and oi the unit's output under the given activation function. From the linear family we use the linear activation function, for constants a and b:

oi = a·ni + b,    derivative: a        (1)

From the logistic family, we use the two most commonly used functions:

• The sigmoid function (the asymmetric logistic function [5]):

oi = 1 / (1 + e^(−ni)),    derivative: oi (1 − oi)        (2)

• The hyperbolic tangent function (the symmetric logistic function [5]):

oi = (e^(ni) − e^(−ni)) / (e^(ni) + e^(−ni)),    derivative: 1 − oi²        (3)

These functions are the most widely used in MLP implementations because their derivatives are easy to compute and can be expressed directly as functions of the unit's output. Also, their curve shape contains a linear region, which makes them the most suitable functions for regularizing the network using weight decay [11]. From the RBF family, we use the Gaussian activation function:

oi = exp(−||ni − mi||² / (2σ²)),    derivative: −((ni − mi) / σ²) · oi        (4)

where m and σ are the center and the width, respectively. The reader is directed to [16] for more net-input calculations and activation functions.
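For reference, Eqs. (1)-(4) and their derivatives can be written directly in NumPy. This is a sketch, not the paper's code; the Gaussian derivative here follows from the 2σ² form of the exponent, and all default parameter values are assumptions.

```python
import numpy as np

def linear(n, a=1.0, b=0.0):
    """Eq. (1): linear activation; derivative is the constant a."""
    o = a * n + b
    return o, a * np.ones_like(n)

def sigmoid(n):
    """Eq. (2): asymmetric logistic; derivative o(1 - o)."""
    o = 1.0 / (1.0 + np.exp(-n))
    return o, o * (1.0 - o)

def tanh_act(n):
    """Eq. (3): symmetric logistic (hyperbolic tangent); derivative 1 - o^2."""
    o = np.tanh(n)
    return o, 1.0 - o ** 2

def gaussian(n, m=0.0, sigma=1.0):
    """Eq. (4): Gaussian RBF with center m and width sigma."""
    o = np.exp(-(n - m) ** 2 / (2.0 * sigma ** 2))
    return o, -((n - m) / sigma ** 2) * o
```

Note that the sigmoid and tanh derivatives are expressed purely in terms of the output o, which is what makes them cheap to evaluate during back-propagation.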
3 Datasets Selection
We tried to cover a wide range of dataset characteristics. Table 1 summarizes the datasets used.
Table 1. Datasets characteristics
DS  Dataset Name    Cases  Dimension  Classes  % continuous  % nominal  Missing
                                               features      features   values
 1  Optdigit         5620      64       10         100            0      No
 2  B_C_W             699       9        2         100            0      Yes
 3  Wpbc              198      30        2         100            0      Yes
 4  Wdbc              569      30        2         100            0      No
 5  Dermatology       366      34        6          97            3      Yes
 6  Car              1728       6        4           0          100      No
 7  Monks            1296       6        2           0          100      No
 8  Pima_diabates     768       8        2         100            0      No
 9  Sonar             208      60        2         100            0      No
10  votes             435      16        2           0          100      Yes
4 Number of Hidden Units and Centers of RBF
To reduce MLP training time, all networks used in the experiments have one hidden layer, which is the case by definition in RBF networks. For the MLP, network pruning is recommended by both Zurada and Haykin [17]-[5]. Instead of pruning only the weights, we started with a number of hidden units equal to twice the number of input features and then, as long as the CV error was unaffected, kept dividing this number by 2. As a result, the number of hidden units in all MLPs used was between 4 and 25. From the four methods proposed in [5] to adapt RBF centers, we chose the one closest to the MLP implementation, where the number of hidden units is fixed and the centers are adapted via unsupervised learning. The widths of all functions are kept constant at 1. Cross-validation (CV) is our main method for regularization and stopping.
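The halving procedure for choosing the number of hidden units can be sketched as follows. Here `train_eval`, a function that trains a network of a given hidden-layer size and returns its CV error, is a hypothetical interface standing in for the authors' training setup, and the tolerance is an assumed parameter.

```python
def prune_hidden_units(train_eval, n_features, tolerance=1e-3):
    """Halving search over hidden-layer sizes: start at twice the
    number of input features and keep dividing by 2 while the
    cross-validation error does not degrade beyond the tolerance."""
    h = 2 * n_features
    best_err = train_eval(h)
    while h > 1:
        err = train_eval(h // 2)
        if err > best_err + tolerance:   # the smaller net is worse: stop
            break
        h, best_err = h // 2, err
    return h, best_err
```

Each halving at most doubles the search cost relative to training the final network once, so the procedure stays cheap compared to an exhaustive sweep over sizes.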
5 Results
Results for the different activation functions with batch learning are tabulated in Table 2; the analogous results for online learning are in Table 3. In both tables, A is the network's initial state (the MSE after the first 50 epochs), B is the best MSE achieved in training, C is the MSE for testing, and D is the training time in seconds. (Entries in columns A, B and C are in units of 10^-3; bold figures in the original tables mark the best testing MSE.)

Table 2. Results for batch learning

        Sigmoid                    Tanh                      Linear                    Gaussian (RBF)
DS    A     B     C     D       A     B     C     D       A     B     C     D       A     B     C     D
 1  83.3   7.7  11.6 35536     177  11.2  35.1 13295     329   329   330   346     332  82.3  91.2 55461
 2   186    22   4.1  1783    93.4   2.8  16.5   622     136   136  68.3    64     540  87.6  17.9  1909
 3   167   108   124   186     478     2   385  1020     623   623   458    60     613   455   716   187
 4   179  11.2  17.2  1023     123  10.6  90.4   169     257   245   245    84     470  90.5  11.3   559
 5  88.8   3.9   3.8  2621    57.7   6.8  17.4   250     320   320   383   158     298  25.5  18.7 40594
 6  38.1  24.3  94.5   350    20.8  16.7   189   318    76.9  72.6   257   322     126  63.2   307   650
 7   205   202   202  1239     810   799   817   194     809   809   808   120     811   804   901   339
 8   178   121   123   867     514   416   493   164     520   520   502    60     698   527   507  1395
 9   184  95.5   168    82     193   2.9  1454   218     808   808   812    54     801   369  1133   367
10   149   135   166   116     532   164   676   218     533   512   583    78     592   510   550   160

Table 3. Results for the online learning

        Sigmoid                    Tanh                      Linear                    Gaussian (RBF)
DS    A     B     C     D       A     B     C     D       A     B     C     D       A     B     C     D
 1  7.52     1  7.11  1330     122  13.1  23.9  1490     348   348   354   290      74  49.5  56.6 10071
 2  24.2  .739  23.8  1446    55.6     0  32.2 734.4     160   160    69    90     101  90.3  29.8   360
 3   107     0   288   600     246     0   833   656     656   558    86   655     489  1098   169   280
 4  15.5     1  49.6   130    71.6  58.1  14.9    65     729   729   562    75     110  65.3  40.6   207
 5   4.6     0  6.39   858     5.2     0    17   277     326   326   377   146    38.9  14.6    20 35781
 6  11.4   3.7  11.8  1389    24.3  19.8  50.3  1200    62.6  62.6   174   943    38.2  30.5   106   600
 7   204   179   202   454     895   894   963   300     894   894   984   650    1366  1361  1459   340
 8   124  84.9   167   733     670   295   750   734     703   703   499   390     581   544   535   597
 9  10.7     0   164   250     349   1.1   447   310     382   382   671   224     220  42.4  1038   297
10   133  41.1   143   288     525   357   935   316     629   629   582   120     537   513   552   228
6 Results Analysis
We provide result analysis with respect to MSE, training time and network initial state in both online and batch learning. Unlike the StatLog report [2], we did not find a huge discrepancy in performance between the MLP and the RBF. The linear function, despite its expected poor performance compared to the other functions, receives special attention in the network initial state section.
6.1 Test MSE
From the tables above and Figures 1 and 2, we observe that the sigmoidal activation function achieved the best classification MSE in 9 of the 10 datasets. On dataset 4, the RBF got the best MSE in both online and batch learning, which is still close to the MLP's MSE. It is also natural for the tanh MSE to be the closest to the sigmoid MSE most of the time: both functions belong to the logistic family and are geometrically equivalent except for their minimum values, i.e. the minimum saturation region. For the sigmoid, tanh and RBF, it is observable that:
1- Except for dataset 4 in online learning, the MSEs of the three functions tend to increase or decrease together for each dataset.
2- The maximum MSE for each dataset is the same or very close in online and batch learning.
3- The sigmoid function shows the least variance in its MSE values in both online and batch learning. The difference between the average online test MSE and the average batch test MSE over the 10 datasets was 0.0149, 0.0771 and 0.0682 for the sigmoid, tanh and RBF, respectively. This means that dataset characteristics such as dimensionality, feature types and number of classes do not have a remarkable effect on the sigmoid function's ability to generalize.
Results show no clear or direct relation between dataset dimensionality and the network's ability to generalize for the three activation functions. For example, all three produced a lower MSE on dataset 1 than on dataset 6, whose dimension is less than one tenth as large, while the situation is reversed for datasets 2 and 9.
Fig. 1. Test (MSE), online learning
Fig. 2. Test (MSE), batch learning
6.2 Training Time
The MLP with sigmoid activation not only achieved the best MSE but also needed the least, or nearly the least, training time in either online or batch learning on all datasets. For example, on dataset 3 the sigmoid function consumed slightly more time than the RBF using online learning, but less time for the same dataset in batch learning; the situation is reversed for dataset 4. This means that by choosing the right training method and only the needed number of hidden units, we can reduce the MLP training time to be less than or competitive with the RBF networks' training time. Although the difference is minimal, the tanh function tended to consume less time than the sigmoid in many cases; this small difference should not give the tanh preference over the sigmoid, because the sigmoid's MSE is the best most of the time. Excluding the first and fourth datasets, the batch learning training time is, on average, less than the online learning training time.
Fig. 3. Training time, online learning
Fig. 4. Training Time, batch learning
For the sigmoid, tanh and RBF we can observe the following:
1- Most of the time, in both online and batch learning, the training times of the three functions tend to increase or decrease together for each dataset.
2- The sigmoid and tanh functions tend to show less variance than the RBF networks in the time needed per dataset. This means that the dimensionality of the dataset does not have a remarkable effect on the MLP training time, as long as there are enough examples.
3- On the other hand, the results show a strong relation between the number of classes and the time needed for training in both MLP and RBF. This is observable by inspecting datasets 1 and 9: they have dimensionalities of 64 and 60 and numbers of classes of 10 and 2, respectively. Although their dimensionalities are close and both consist of 100% continuous data, the time needed for the first dataset is an order of magnitude more than for the 9th. The same relation holds for nominal data types in datasets 6 and 7.
6.3 The Network Initial State
By the network initial state we mean the MSE after 50 epochs of training. This section is very important, as it carries, to some extent, the explanation of the previous analysis. It also clarifies the behavior of the linear activation function. Concerning the linear activation function, observing columns A and B in Tables 2 and 3 makes clear that the final MSE after training is identical to the initial network state in 18 experiments out of 20, or 90% of the time, and in the remaining 10% the difference is almost negligible. This happens because the linear activation function turns the whole MLP into an ordinary linear regression machine, which does not learn more from further iterations over the same data. It might, however, improve its MSE when more examples are introduced.
Fig. 5. Network initial state after 50 epochs, online learning
Fig. 6. Network initial state after 50 epochs, batch learning
Our results for this function comply with those in [3], where the authors stated that they experienced the same behavior from the step-like function after 120 epochs. That is why it would be misleading to compare the linear
activation function's training time with that of the other functions: it reaches its maximum generalization ability very quickly but always produces the highest MSE. From the tables above and Figures 5 and 6, it is observable that the sigmoid activation function tends to produce the best network state after the first 50 epochs. The next best initial state is that of the tanh, while the RBF and linear activation functions come last. The most important observation here is the analogy between Figures 5 and 1 and between Figures 6 and 2: the MSE values obtained are relatively symmetrical for each function on each dataset. This means that whenever a function reaches a good initial state, it generalizes well, and vice versa. Observing the network initial state is thus an essential task: seeing how fast the MSE drops and what value it reaches can indicate that the network parameters need modification or that the number of hidden units is insufficient. Sometimes merely re-randomizing the weights and neuron threshold values will put the network in a better initial state. Observing the network initial state was also very useful while pruning MLP hidden units to decrease the training time.
7 Concluding Remarks and Future Work
We showed that the sigmoidal activation function is the best for classification applications in terms of time and MSE. We showed a relation between dataset dimensionality and number of classes on one side and the time needed for training on the other. The results also indicated that the network initial state highly impacts the test MSE and, consequently, the network's overall ability to generalize. Still, more research can and should be done on MLP activation functions. It would be very interesting to use the EM algorithm to adapt the centers of RBF networks. Also, the performance of RBF networks with a huge fixed number of centers versus ones with few but adapted centers is still an open issue: Haykin showed that they should perform the same, but he also showed that this contradicts the NETtalk experiment, where using unsupervised learning to adapt the centers of the RBFN always resulted in poorer generalization than the MLP, while the opposite held when supervised learning was used.
Acknowledgement
The author would like to thank Prof. Anthony J. Bonner for his valuable advice during the final preparation of this manuscript.
References
1. Lu, B., Evans, B.L.: Channel Equalization by Feedforward Neural Networks. In: IEEE Int. Symposium on Circuits and Systems, Vol. 5. Orlando, FL (1999) 587-590
2. Michie, D., Spiegelhalter, D.J., Taylor, C.C.: Machine Learning, Neural and Statistical Classification. Ellis Horwood, London (1994)
3. Piekniewski, F., Rybicki, L.: Visual Comparison of Performance for Different Activation Functions in MLP Networks. In: IJCNN 2004 & FUZZ-IEEE, Vol. 4. Budapest (2004) 2947-2953
4. Dorffner, G.: A Unified Framework for MLPs and RBFNs: Introducing Conic Section Function Networks. Cybernetics and Systems 25(4) (1994) 511-554
5. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall, New Jersey (1999)
6. Huang, G., Chen, Y., Babri, H.A.: Classification Ability of Single Hidden Layer Feedforward Neural Networks. IEEE Transactions on Neural Networks 11(3) (2000) 799-801
7. Le Cun, Y., Touretzky, D., Hinton, G., Sejnowski, T.: A Theoretical Framework for Back-propagation. In: The Connectionist Models Summer School (1988) 21-28
8. Li, Y., Pont, M.J., Jones, N.B.: A Comparison of the Performance of Radial Basis Function and Multi-layer Perceptron Networks in Condition Monitoring and Fault Diagnosis. In: The International Conference on Condition Monitoring. Swansea (1999) 577-592
9. Arahal, M.R., Camacho, E.F.: Application of the RAN Algorithm to the Problem of Short Term Load Forecasting. Technical Report, University of Sevilla, Sevilla (1996)
10. Finan, R.A., Sapeluk, A.T., Damper, R.I.: Comparison of Multilayer and Radial Basis Function Neural Networks for Text-Dependent Speaker Recognition. In: IEEE Int. Conf. on Neural Networks, Vol. 4. Washington DC (1996) 1992-1997
11. Karkkainen, T.: MLP in Layer-Wise Form with Applications to Weight Decay. Neural Computation 14(6) (2002) 1451-1480
12. Werbos, P.J.: Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Doctoral Thesis, Applied Mathematics, Harvard University, Boston (1974)
13. Wang, D., Huang, G.: Protein Sequence Classification Using Extreme Learning Machine. In: IJCNN 2005, Vol. 3. Montréal (2005) 1406-1411
14. Duch, W., Jankowski, N.: Survey of Neural Transfer Functions. Neural Computing Surveys 2 (1999) 163-212
15. Duch, W., Jankowski, N.: Transfer Functions: Hidden Possibilities for Better Neural Networks. In: 9th European Symposium on Artificial Neural Networks. Bruges (2001) 81-94
16. Hu, Y., Hwang, J.: Handbook of Neural Network Signal Processing. 3rd edn. CRC Press, Florida (2002)
17. Zurada, J.M.: Introduction to Artificial Neural Systems. PWS Publishing, Boston (1999)
Estimating the Number of Hidden Neurons in a Feedforward Network Using the Singular Value Decomposition

Eu Jin Teoh, Cheng Xiang, and Kay Chen Tan

Department of Electrical and Computer Engineering, National University of Singapore, 4 Engineering Drive 3, Singapore 117576
[email protected]
Abstract. We attempt to quantify the significance of increasing the number of neurons in the hidden layer of a feedforward neural network architecture using the singular value decomposition (SVD). Through this, we extend some well-known properties of the SVD in evaluating the generalizability of single hidden layer feedforward networks (SLFNs) with respect to the number of hidden neurons. The generalization capability of the SLFN is measured by the degree of linear independency of the patterns in hidden layer space, which can be indirectly quantified from the singular values obtained from the SVD, in a post-learning step.
1
Introduction
The ability of a neural network to generalize well on unseen data depends on a variety of factors, the most important of which is the matching of the network complexity with the degrees of freedom, or information, inherent in the training data. Matching this information with complexity is critical for networks that are to generalize well. Previous work on estimating the number of hidden layer neurons has often focused on the learning capabilities of the SLFN on a training set without accounting for the possibility of the network being overtrained [1, 2, 3, 4, 5, 6]. Projection onto increasingly higher-dimensional (hidden layer) space is achieved by implementing increasingly larger numbers of neurons in the hidden layer. Cover's seminal work, using concepts from combinatorial geometry, rigorously showed that patterns projected onto higher dimensions are likelier to be linearly separable [7]. This work was followed by Huang [1] and Sartori and Antsaklis [2]. The use of the SVD in neural networks has been briefly highlighted by [8, 9, 10]. Geometrical methods such as those in [11] have been used to great effect in providing deeper insight into the nature of the problem; however, such methods are only applicable to problems in two or three dimensions.
1.1
The Singular Value Decomposition (SVD)
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 858–865, 2006. © Springer-Verlag Berlin Heidelberg 2006

Given a real matrix H ∈ R^{N×n}, applying the SVD results in the orthogonal decomposition H = UΣV^T. The decoupling property of the SVD allows the
expression of the original matrix as a sum of the first n rank-one terms formed from the columns of U and V, weighted by the singular values, i.e.

H = Σ_{i=1}^{n} σ_i u_i v_i^T.

The matrix M_i = u_i v_i^T is known as the i-th mode of H, where H = σ_1 M_1 + σ_2 M_2 + … + σ_n M_n. The rank of H is determined by observing that the n largest singular values are nonzero, whereas the N − n smallest singular values are zero. The use of the SVD, specifically its spectral properties through its singular values (the diagonal entries of Σ), provides us with an indication of the degree of linear independency of the matrix H. With this, the SVD of a perturbed matrix H̃ reveals the rank, in the sense that its n − k smallest singular values are bounded by the Frobenius norm of E [12]. Algorithms for computing the SVD can be found in [13], where its derivations, concepts and applications are discussed further in depth. This article will focus on the use of the SVD as a robust tool for estimating the number of hidden layer neurons, without dwelling on the algorithmic details of the SVD, which can be found in the literature highlighted above.
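The mode expansion above is easy to check numerically. A minimal NumPy sketch (the matrix and the rank tolerance are our own illustrative choices, not from the paper):

```python
import numpy as np

# Hypothetical hidden-layer activation matrix: N = 6 patterns, n = 4 neurons,
# built so that one column is a linear combination of two others (rank 3).
rng = np.random.default_rng(0)
H = rng.standard_normal((6, 4))
H[:, 3] = H[:, 0] + H[:, 1]          # introduce linear dependence

U, s, Vt = np.linalg.svd(H, full_matrices=False)

# Reconstruct H as the weighted sum of its rank-one modes M_i = u_i v_i^T.
H_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))

# The number of nonzero singular values reveals the rank.
rank = int(np.sum(s > 1e-10 * s[0]))
print(rank)                           # 3: column 3 adds no new direction
```

The fourth singular value is zero up to rounding error, which is exactly the rank-revealing behavior the text relies on.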
2
Estimating the Number of Hidden Layer Neurons
Every neuron in the hidden layer constructs a hyperplane in the input feature space [1]. The additional separating capability of the system gained from the hyperplane created by an additional hidden neuron depends on its 'uniqueness', an indication of which is given by the rank of the matrix H. Quite clearly, and in a geometrical sense, the rank of H denotes the dimension of the space that the columns of H occupy, representing the number of separating hyperplanes of the system. In the case where n = N − 1 hidden neurons are used (with a suitable nonlinear activation function) such that H is full-ranked, the rank requirement of [1, 2] is satisfied, giving rise to N − 1 separating hyperplanes for the N training samples. Using the simple matrix inverse [1, 2], we are guaranteed perfect reconstruction of the training set. Intuitively, if H is of higher rank, H better fits the training data, at the expense of generalization as rank(H) approaches N. This full-rank condition also ensures that the patterns projected onto the hidden layer space are linearly independent; accordingly this is also known as φ-general position as described in [7] (h is equivalent to φ in the context of this article). The actual rank (say k) of H may be different from its numerical rank (say n), where k ≤ n. Such a situation usually arises when the original matrix X is 'contaminated' by E, resulting in a matrix X̃ such that X̃ = X + E [12], with rank(X) = k and rank(X̃) = n. These 'contaminations', which we commonly refer to as noise, obstruct the characterization of the true properties of the system or problem given the observed training samples. As will be highlighted subsequently, this 'noise' corresponds to the marginal role played by additional hidden neurons that tend to fit features of the training samples which are not truly representative of the intrinsic underlying distribution of observations.
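To illustrate, a small sketch of this saturation effect (the random data, sigmoid activations and weight scales are our own assumptions, not the paper's experimental setup): the rank of the hidden-layer output matrix H of an SLFN can never exceed min(N, n), so beyond some point additional neurons contribute columns that are no longer linearly independent.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 8, 2                          # 8 training samples, 2 input features
X = rng.standard_normal((N, d))

def hidden_matrix(X, n, seed=42):
    """Hidden-layer output matrix H (N x n) of an SLFN with random input
    weights and sigmoid activations (an illustrative setup)."""
    r = np.random.default_rng(seed)
    W = r.standard_normal((X.shape[1], n))
    b = r.standard_normal(n)
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

# rank(H) is capped at min(N, n): once it saturates, further neurons add
# columns that cannot be linearly independent of the existing ones.
ranks = {n: int(np.linalg.matrix_rank(hidden_matrix(X, n))) for n in (2, 4, 8, 12)}
print(ranks)
```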
Aside from noise, these 'contaminations' can also consist of observational errors, or features that are present only in the training set but not the testing set. The numerical rank of H is defined as the number of linearly independent columns of H. Equivalently, the rank of a matrix H is the smallest n which satisfies σ_n > σ_{n+1} = 0, with σ_i (i = 1, 2, …, N) being the singular values of H. However, a value k which describes the actual rank of the matrix H is more useful for us in estimating the appropriate number of hidden neurons to be used in the SLFN for a given problem. This suggests that the column space of H_k spans a region that is almost identical to that spanned by the column space of H_n. Conceptually, we may view H_n as the result of a perturbation of a rank-k matrix H_k. Thus we use a more general concept of the rank of a matrix, by specifying a tolerance ε below which singular values are treated as zero.

Theorem 1. Define the numerical ε-rank k_ε of the matrix H with respect to some tolerance ε [14] by:

k_ε = k_ε(H, ε) ≡ min_{‖E‖_2 ≤ ε} rank(H + E)   (1)
This states that if there is a 'gap' of size ε between the k_ε-th and the (k_ε + 1)-th singular values (i.e., σ_{k_ε} > ε > σ_{k_ε+1}), then H has actual rank (ε-rank) k_ε. The larger this gap is, the more 'robust' the matrix H is to perturbation (the more 'confident' we are that the matrix H can be truncated at k). Hansen [14] further elucidates that the ε-rank of a matrix H is equal to the number of columns of H that are guaranteed to be linearly independent for any perturbation of H with norm less than or equal to the tolerance ε (i.e., robust to perturbations E such that ‖E‖_2 ≤ ε). The sensitivity to the errors E is now influenced by the ratio σ_1/σ_k instead of the usual condition number of H, cond(H) = σ_1/σ_n. Golub et al. [15], from which the following definition originates, provide an excellent treatment of the concept of numerical rank. To avoid possible problems when ε is itself perturbed, the definition of actual rank is refined by introducing δ as an upper bound for ε for which the numerical rank remains at least equal to k.

Theorem 2. The matrix H has numerical rank (δ, ε, k) with respect to the norm ‖·‖ if δ, ε and k satisfy the following:
i) k = inf{rank(B) : ‖A − B‖ ≤ ε}
ii) ε < δ ≤ sup{η : ‖A − B‖ ≤ η ⇒ rank(B) ≥ k}

Here σ_k provides an upper bound for δ while σ_{k+1} provides a lower bound for ε, i.e. δ ≤ σ_k and ε ≥ σ_{k+1}; the best ratio ε/δ that can be obtained is σ_{k+1}/σ_k. Note that the norms used above are either the L2 or Frobenius norms. The above definitions are equivalent to saying that the columns of H remain linearly independent when H is perturbed by E up to a threshold determined by ‖E‖_2 ≤ ε. This result also means that the singular values of H satisfy σ_k > ε > σ_{k+1}.
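In practice, the ε-rank of Theorem 1 amounts to counting the singular values larger than the tolerance. A small sketch (the matrix, noise level and tolerances are our own illustrative choices):

```python
import numpy as np

def eps_rank(H, eps):
    """Numerical eps-rank: number of singular values greater than eps
    (equivalently, min rank over perturbations with ||E||_2 <= eps)."""
    s = np.linalg.svd(H, compute_uv=False)
    return int(np.sum(s > eps))

# A rank-2 matrix 'contaminated' by small noise E: actual rank 2,
# but numerical rank 3.
rng = np.random.default_rng(1)
A = np.outer([1., 2, 3], [1., 0, 1]) + np.outer([0., 1, 1], [1., 1, 0])
E = 1e-6 * rng.standard_normal((3, 3))
H = A + E

s = np.linalg.svd(H, compute_uv=False)
print(s)                        # two O(1) values, one O(1e-6) value: a clear 'gap'
print(eps_rank(H, eps=1e-3))    # 2: any tolerance inside the gap recovers k
```

Any ε placed inside the gap (σ_2 > ε > σ_3) returns the same k, which is the robustness property discussed above.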
3
A Pruning/Growing Technique Based on the SVD
In the previous sections, our notion of 'small' singular values was purely arbitrary; in this section, we define 'small' singular values in a relative sense. Here we describe some possible methods for selecting the threshold γ. These methods are largely heuristic in nature, as is common in the determination of model order in system identification problems [16]. The derivations
of the following criteria can be found in [17, 13]. Let σ_i denote the i-th singular value, σ̂_i the normalized singular value, and γ the threshold value, with Δ_i = σ_{i+1} − σ_i. Criteria (1)-(5) have been previously considered in [17, 13]. Here, in criteria (6) and (7), we propose using both the singular values σ and their associated sensitivities in determining the actual rank k of the matrix H_n. Presently, the threshold value γ is chosen heuristically and is largely problem-dependent. The possible measures or criteria that can be used in determining the appropriate, or necessary, value of k = R(H, γ) are:

(1) min_k { σ_{k+1}/σ_1 ≤ γ } or min_k { σ_k/σ_1 < γ }
(2) min_k { Δ_k ≥ γ }
(3) min_k { Σ_{i=1}^{k} σ_i / Σ_{i=1}^{n} σ_i ≥ γ } or min_k { Σ_{i=k+1}^{n} σ_i / Σ_{i=1}^{n} σ_i < γ }
(4) min_k { Σ_{i=1}^{k} σ_i² / Σ_{i=1}^{n} σ_i² ≥ γ } or min_k { Σ_{i=k+1}^{n} σ_i² / Σ_{i=1}^{n} σ_i² < γ }
(5) min_k { (Σ_{i=1}^{k} σ_i² / Σ_{i=1}^{n} σ_i²)^{1/2} ≥ γ }
(6) min_k { λσ_k + (1 − λ)Δ_k ≤ γ }
(7) min_k { λσ̂_k + (1 − λ)Δ̂_k ≤ γ }

By varying the threshold γ we are able to 'control' the confidence that we have in the selected model complexity of the network – a higher γ reflects a greater level of confidence that k = n.
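Criteria (4) and (7) can be sketched as follows. This is our reading of the formulas; λ = 0.5 as in the paper's simulations, the handling of Δ̂ via absolute first differences is an assumption on our part, and the test singular values are illustrative:

```python
import numpy as np

def k_energy(s, gamma=0.95):
    """Criterion (4): smallest k whose leading squared singular values
    capture at least a fraction gamma of the total 'energy'."""
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.argmax(energy >= gamma)) + 1

def k_weighted(s, gamma, lam=0.5):
    """Criterion (7): smallest k with lam*s_hat_k + (1-lam)*d_hat_k <= gamma,
    using normalized singular values and their first differences as the
    sensitivity term (our assumed normalization)."""
    s_hat = s / s[0]
    d_hat = np.abs(np.diff(s_hat, append=0.0))   # |sigma_{k+1} - sigma_k|, normalized
    score = lam * s_hat + (1 - lam) * d_hat
    below = np.nonzero(score <= gamma)[0]
    return int(below[0]) + 1 if below.size else len(s)

s = np.array([10.0, 6.0, 3.0, 0.05, 0.01])   # three dominant singular values
print(k_energy(s, 0.95))                      # 3
```

With γ = 0.95, criterion (4) truncates after the third singular value, matching the visual 'gap' in the spectrum.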
4
Simulation Results and Discussion
We investigate the performance of the pruning/growing method, using the singular values and the associated sensitivity values obtained from performing the SVD on H (in a post-learning step), on two types of classification datasets: (i) two toy datasets, and (ii) four well-known datasets from the UCI Machine Learning repository. The SLFN for each dataset was trained for 20 runs of 5000 epochs each to ensure that it is sufficiently well trained. The back-propagation algorithm was implemented using a batch scaled conjugate gradient (SCG) variant [18]. Figures 1-6 show the toy datasets, while for the real-life classification datasets, Figures 7, 8, 9 and 10 show the mean training and testing accuracies and the suggested number of hidden neurons using criterion (7) of Section 3. For the toy datasets, however, we use criterion (4) with a threshold of γ = 0.95 to demonstrate our ideas. The classification accuracies are measured on a normalized scale in [0, 1].

Banana Dataset: See Figures 1, 2 and 3. Geometrically, the use of 2-3 hyperplanes constructed from 2-3 corresponding hidden neurons would be sufficient (see Figure 1). The first three singular values are dominant based on criterion (4) (Figure 3). Setting the threshold γ = 0.95, we obtain k = 3.

Lithuanian Dataset: See Figures 4, 5 and 6. Geometrically, the use of 3-4 hyperplanes constructed from 3-4 corresponding hidden neurons would be sufficient (see Figure 4). The first four singular values are dominant based on criterion (4) (Figure 6). Setting the threshold γ = 0.95, we obtain k = 4.
Fig. 1. Banana hidden layer space: 1-8 neurons
Fig. 2. Banana: Training and testing accuracies
Fig. 3. Banana: criteria (4) (γ = 0.95)

4.1
Discussion
From the simulation results, increasing the number of hidden neurons after a certain threshold has been reached has only a marginal effect on the resulting performance of the classifier, mostly on the training set. Generally, using more hidden neurons than necessary has a detrimental effect on the classifier's performance on the testing set, as this more complex SLFN will exploit the additional free parameters of the network to over-fit the samples in the training set. For the problems described in the previous section, instead of using the singular values or their sensitivities in isolation, we use a λ-weighted (or regularized) linear combination of σ_i and Δ_i. For our simulations, we take λ = 0.5. A possible approach is to manually determine the actual rank k from a visual observation of the proposed criteria, which is graphically similar to that obtained from the right columns of the figures. There is no ideal threshold for every problem; unless prior information about the problem is available, fine-tuning the threshold involves a qualitative rather than analytical approach, as is to be expected of system identification problems [16]. The decay of the singular values shows a range of hidden neurons that should be used in solving a given problem, such that the hyperplanes constructed from these neurons are as 'unique' as possible – the redundant number of neurons is identified (not the identities of the redundant neurons).

Fig. 4. Lithuanian hidden layer space: 1-8 neurons
Fig. 5. Lithuanian: Training and testing accuracies
Fig. 6. Lithuanian: criteria (4) (γ = 0.95)
Fig. 7. Iris: 2 neurons
Fig. 8. Diabetes: 3 neurons
Fig. 9. Breast cancer: 2 neurons
Fig. 10. Heart: 3 neurons

Although subset selection [13] is a possible technique, where the k most linearly independent columns
of H ∈ R^{N×n} are picked, such an approach is quite complex, requiring the construction of a permutation matrix, and is hence much slower than retraining the network using the reduced number k of hidden neurons. We argue that exact identification of the redundant neurons is neither feasible nor useful, because the training algorithm constructs the separating hyperplanes based on the number of free weights available, fitting the distribution of the training samples. The training algorithm constructs these hyperplanes to be as linearly independent as possible (as a function of the number of hidden neurons) to improve the class separation; this means that the addition of every new hyperplane alters the positioning of the previous hyperplanes. Retraining the network using the same training algorithm results in better accuracy and is more efficient from a computational perspective, as the training algorithm attempts to maximize the linear independency of the reduced set (k instead of n) of hidden neurons. This holds until a certain number of hidden neurons is used, after which additional hidden neurons can no longer be constructed to be linearly independent (refer to the hyperplanes constructed for the toy datasets – when increasingly larger numbers of hidden neurons are used, the hyperplanes tend to lie closely parallel), and this manifests itself as the decay in the singular values of the matrix H.
5
Conclusion
This article presents a simple framework for using the singular values, as well as their corresponding sensitivities, of the hidden layer output activation matrix H to indicate the necessary or appropriate number of hidden neurons (as well as a robustness measure of this estimate) that should be used in an SLFN for a given training set. While the current approach involves a rather qualitative and manual determination of the thresholds and number of hidden neurons, the use of the rank-revealing SVD allows us to gain better practical and empirical insight into the geometry of the hidden layer space, which would otherwise be difficult, if not outright impossible, given that many
of today's practical problems involve high-dimensional features and high-complexity models.
References

1. Huang, S., Huang, Y.: Bounds on the Number of Hidden Neurons of Multilayer Perceptrons in Classification and Recognition. IEEE International Symposium on Circuits and Systems 4 (1990) 2500-2503
2. Sartori, M., Antsaklis, P.: A Simple Method to Derive Bounds on the Size and to Train Multi-Layer Neural Networks. IEEE Trans. on Neural Networks 2(4) (1991) 467-471
3. Tamura, S.: Capabilities of a Three-Layer Feedforward Neural Network. Proc. Int. Joint Conf. on Neural Networks (1991)
4. Tamura, S., Tateishi, M.: Capabilities of a Four-Layered Feedforward Neural Network: Four Layers Versus Three. IEEE Trans. on Neural Networks 8(2) (1997) 251-255
5. Huang, G., Babri, H.: Upper Bounds on the Number of Hidden Neurons in Feedforward Networks with Arbitrary Bounded Nonlinear Activation Functions. IEEE Trans. on Neural Networks 9(1) (1998)
6. Huang, G.: Learning Capability and Storage Capacity of Two-Hidden-Layer Feedforward Networks. IEEE Trans. on Neural Networks 14(2) (2003) 274-281
7. Cover, T.: Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition. IEEE Trans. on Electronic Computers 14 (1965) 326-334
8. Hayashi, M.: A Fast Algorithm for the Hidden Units in a Multilayer Perceptron. Proc. Int. Joint Conf. on Neural Networks 1 (1993) 339-342
9. Tamura, S., Tateishi, M., Matsumoto, M., Akita, S.: Determination of the Number of Redundant Hidden Units in a Three-Layered Feedforward Neural Network. Proc. Int. Joint Conf. on Neural Networks 1 (1993) 335-338
10. Psichogios, D., Ungar, L.: SVD-NET: An Algorithm that Automatically Selects Network Structure. IEEE Trans. on Neural Networks 5(3) (1994) 513-515
11. Xiang, C., Ding, S., Lee, T.: Geometrical Interpretation and Architecture Selection of MLP. IEEE Trans. on Neural Networks 16(1) (2005) 84-96
12. Stewart, G.: Determining Rank in the Presence of Error. Technical Report TR-92-108, Institute for Advanced Computer Studies, and TR-2972, Department of Computer Science, University of Maryland, College Park (Oct 1992)
13. Golub, G., Van Loan, C.: Matrix Computations. 3rd edn. Johns Hopkins University Press, Baltimore (1996)
14. Hansen, P.: Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion. SIAM, Philadelphia (1998)
15. Golub, G., Klema, V., Stewart, G.: Rank Degeneracy and the Least Squares Problem. Technical Report STAN-CS-76-559, Computer Science Department, Stanford University (1976)
16. Ljung, L.: System Identification: Theory for the User. Prentice-Hall, Englewood Cliffs, NJ (1987)
17. Konstantinides, K., Yao, K.: Statistical Analysis of Effective Singular Values in Matrix Rank Determination. IEEE Trans. on Acoustics, Speech and Signal Processing 36(5) (1988) 757-763
18. Moller, M.: A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning. Neural Networks 6 (1993) 525-533
Neuron Selection for RBF Neural Network Classifier Based on Multiple Granularities Immune Network Jiang Zhong, Chun Xiao Ye, Yong Feng, Ying Zhou, and Zhong Fu Wu College of Computer Science and Technology, Chongqing University, Chongqing, 400044, China {zjstud, yecx, fengyong, yingzhou, wzf}@cqu.edu.cn
Abstract. The central problem in training a radial basis function (RBF) neural network is the selection of hidden layer neurons, which includes the selection of the centers and widths of those neurons. In this paper, we propose to select hidden layer neurons based on a multiple granularities immune network. First, a multiple granularities immune network (MGIN) algorithm is employed to reduce the data, obtain the candidate hidden neurons, and construct an original RBF network containing all candidate neurons. Second, a procedure for removing redundant neurons is used to obtain a smaller network. Experimental results show that the resulting network tends to generalize well.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 866-872, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction

Radial basis function (RBF) neural networks have found considerable application in nonlinear approximation and pattern classification. The performance of an RBF neural network depends very much on how it is trained. Training an RBF neural network involves selecting hidden layer neurons and estimating weights. The problem of neuron selection has been pursued in a variety of ways, based on different understandings or interpretations of the RBF neural network [1-5]. In [1], an orthogonal least squares (OLS) based algorithm is presented. The RBF neural network is interpreted in terms of its layered architecture, where the role of the hidden layer is simply to map samples from the input space to the hidden layer space; neuron selection is then handled as a linear model selection problem in the hidden layer space. In [2], selecting neurons in the hidden layer space based on a data-structure-preserving criterion is proposed, where data structure denotes the relative location of samples in the high-dimensional space. This algorithm has some shortcomings: first, because it starts from an RBF neural network with all samples as hidden neurons, it cannot handle large-scale problems; second, a uniform value is chosen for the width parameter of the hidden neurons, selected by repeated trials; last, it is an unsupervised neuron selection algorithm and therefore cannot utilize the class label information. In this paper we propose a novel algorithm: it first uses a multiple granularities artificial immune network to find the candidate hidden neurons, then constructs an RBF neural network with all candidate hidden neurons and employs a removal criterion to delete redundant hidden neurons. This new algorithm takes full
advantage of the class label information and starts with a small neural network; hence it is likely to be more efficient and is expected to generalize well. This paper is organized as follows. In Section 2, the multiple granularities immune network algorithm is developed to obtain candidate hidden neurons. In Section 3, a new neuron selection criterion is introduced. Experimental studies are presented in Section 4, and concluding remarks are given in Section 5.
2 Multiple Granularities Immune Network Algorithm

To obtain the neurons of the hidden layer of the RBF network efficiently, one can use a clustering algorithm such as K-means, SOM, or an artificial immune network (AIN). Here we employ a variation of the AIN algorithm to construct the original hidden layer of the RBF network. The original AIN method is an unsupervised algorithm [7, 8], so it is difficult to determine the optimal number of neurons based on the class label information. The main problem with the original AIN algorithm for hidden layer construction is that it computes under a single granularity, whereas classification takes place under different granularities. In this section, we give a multiple granularities AIN algorithm for selecting hidden neurons. The immune clone, immune mutation and immune suppression operations are defined in [7]. The multiple granularities immune network (MGIN) algorithm is described as follows:

Input: data set X, and the descent factor a of the granularity
Output: the candidate hidden neurons, R
Step 1: Compute the radius r of the dataset hypersphere and let r be the immune suppression parameter; let R = ∅ and X' = X.
Step 2: Construct an artificial immune network M based on X'.
Step 3: Let X' = ∅. Take the cells of M as cluster centers and partition the samples based on the Gaussian radial basis function, with width parameter equal to the suppression parameter r. If partition i contains data points of only one class and M_i is its center, let R = R ∪ {M_i}; otherwise add the data points of partition i to X'.
Step 4: If X' ≠ ∅, let r = r × a and go to Step 2; otherwise return R as the hidden neurons and stop.

Algorithm 1. MGIN for candidate hidden neurons
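The control flow of Algorithm 1 can be sketched as below. The immune network construction of Step 2 is replaced here by a simple greedy covering stand-in (the full AIN operators are defined in [7]); the radius shrinking and purity test follow the steps above, and the data are illustrative. The sketch assumes no two identical samples carry different labels, so the loop always terminates.

```python
import numpy as np

def build_immune_network(X, r):
    """Stand-in for the AIN construction of Step 2 [7]: greedily pick memory
    cells so that every sample lies within suppression radius r of a cell."""
    cells = []
    for x in X:
        if all(np.linalg.norm(x - c) > r for c in cells):
            cells.append(x)
    return np.array(cells)

def mgin(X, y, a=0.5):
    """Multiple-granularities loop of Algorithm 1 (sketch): shrink the
    suppression radius r by factor a until every partition is pure."""
    r = 0.5 * np.max(np.linalg.norm(X - X.mean(axis=0), axis=1))  # data radius
    R, widths = [], []
    Xc, yc = X, y
    while len(Xc) > 0:
        M = build_immune_network(Xc, r)
        # assign each remaining sample to its nearest memory cell
        part = np.argmin(np.linalg.norm(Xc[:, None] - M[None], axis=2), axis=1)
        keep = []
        for i, m in enumerate(M):
            labels = yc[part == i]
            if len(labels) and len(set(labels.tolist())) == 1:
                R.append(m); widths.append(r)          # pure partition -> neuron
            else:
                keep.extend(np.nonzero(part == i)[0])  # impure -> refine later
        Xc, yc = Xc[keep], yc[keep]
        r *= a
    return np.array(R), np.array(widths)

X = np.array([[0., 0], [0, 1], [5, 5], [5, 6]])
y = np.array([0, 0, 1, 1])
centers, widths = mgin(X, y)
print(len(centers))   # one pure-class candidate neuron per cluster here: 2
```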
Algorithm 1 has time complexity O(m·N) [7], where N is the number of training samples and m is the maximum size of R. By the construction of R, a neighborhood classifier can be built on the hidden neurons R, where the distance function is the Gaussian radial basis function.

Theorem 1. Let V be the set of centers of a neighborhood classifier. Then an RBF network classifier can be constructed based on V.

Proof. Suppose the number of classes is K, so that the RBF network classifier has K output neurons, and that V contains m points. We construct an RBF network classifier as in Fig. 1. Let v_j be a neuron of the hidden layer that is a center of class i, and let W_{j,t} be the weight between neuron v_j and output neuron t, with

W_{j,t} = 1 if class(v_j) = t, and W_{j,t} = −1 otherwise.

We will prove that this RBF network classifies the data correctly. Let x_i be an arbitrary sample of the dataset with class label k, and define

d_{ij} = ‖x_i − v_j‖,
ds_i = min_{class(v_j)=k} ‖x_i − v_j‖,  dd_i = min_{class(v_j)≠k} ‖x_i − v_j‖,
Δ_i = dd_i − ds_i,  Δ = min_{i=1,…,N} Δ_i.

Hence Δ may be seen as the minimum separation between different classes in the nearest-neighbor classification. The output for class k satisfies

F_k(x_i) = Σ_{class(v_j)=k} e^{−d_{ij}/δ} − Σ_{class(v_f)≠k} e^{−d_{if}/δ}
 ≥ e^{−ds_i/δ} − (m − 1) e^{−dd_i/δ}
 = e^{−ds_i/δ} (1 − (m − 1) e^{−(dd_i − ds_i)/δ})
 ≥ e^{−ds_i/δ} (1 − (m − 1) e^{−Δ/δ}).

If the width parameter δ of the radial basis function satisfies δ ≤ Δ / ln(m − 1), then F_k(x_i) > 0 and F_f(x_i) < 0 when f ≠ k. According to the class decision criterion, the output class label must be k.
Fig. 1. RBF network architecture
After the RBF network classifier has been constructed, we can employ the removal criterion described in Section 3 to remove redundant neurons.
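The constructive proof of Theorem 1 translates directly into code. A sketch with hypothetical centers (the ±1 weights and Gaussian-of-distance activations follow the proof; the data and width value are illustrative):

```python
import numpy as np

def rbf_classify(x, centers, labels, delta, n_classes):
    """RBF classifier from Theorem 1 (sketch): each hidden neuron contributes
    +1 weight to its own class output and -1 to every other class output."""
    act = np.exp(-np.linalg.norm(centers - x, axis=1) / delta)
    F = np.array([act[labels == t].sum() - act[labels != t].sum()
                  for t in range(n_classes)])
    return int(np.argmax(F))

centers = np.array([[0., 0.], [5., 5.]])
labels = np.array([0, 1])   # class of each hidden neuron
# delta: width parameter; the theorem gives the sufficient bound
# delta <= Delta/ln(m-1) for m > 2 hidden neurons
print(rbf_classify(np.array([0.2, -0.1]), centers, labels, 1.0, 2))  # 0
print(rbf_classify(np.array([5.1, 4.9]), centers, labels, 1.0, 2))   # 1
```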
3 Neuron Selection for the RBF Neural Network

In the context of pattern classification, data in the high-dimensional hidden layer space are more likely to be linearly separable than in the input space. It is believed, however, that the high-dimensional hidden layer space might be mostly empty, and samples might still be linearly separable even in a space with dimensionality lower
than m. Deleting unnecessary dimensions not only reduces the architecture of the RBF network, and hence the computation involved in testing, but also alleviates the curse of dimensionality. A variety of approaches have been proposed for determining the hidden layer space. The single output of an RBF neural network with m hidden layer neurons can be described by

f(x_i) = w_0 + Σ_{j=1}^{m} w_j g(‖x_i − c_j‖)   (1)
where x_i and f(x_i) denote the input and output of the network, c_j denotes the center vector of hidden layer neuron j, ‖x_i − c_j‖ is the Euclidean distance between x_i and c_j, and g(·) is the radial basis function. Although f(x_i) is a nonlinear function of the input vector x_i, it has a linear relationship with the outputs of the hidden layer neurons. Motivated by this linear structure of the RBF neural network in the hidden layer space, the idea of linear model selection was introduced to RBF neural network neuron selection, with the objective of selecting a subset of neurons that minimizes the cost function

C = (1/N) Σ_{i=1}^{N} (f(x_i) − y_i)²   (2)
where N is the number of training samples and y_i is the target output of training sample x_i. The importance of a neuron can be evaluated based on the reduction or inflation of C after adding or deleting the neuron. In this paper we employ a neuron selection criterion for the RBF network classifier [3]. Suppose there is an RBF network with m hidden layer neurons and C output neurons. If neuron k is removed, the result of output neuron t with the remaining m − 1 hidden neurons for the input x_i is

f'_t(x_i) = w_{t0} + Σ_{j=1}^{k−1} w_{tj} g(‖x_i − c_j‖) + Σ_{j=k+1}^{m} w_{tj} g(‖x_i − c_j‖)   (3)
Thus, for an observation x_i, the error resulting from removing neuron k is given by

E(i, k) = max_{t=1,…,C} ‖f_t(x_i) − f'_t(x_i)‖ = max_{t=1,…,C} ‖w_{tk} g(‖x_i − c_k‖)‖   (4)
The error over all learned observations caused by removing neuron k is

E(k) = (1/N) Σ_{i=1}^{N} E(i, k)   (5)
In this paper, the new algorithm first constructs an RBF network with m neurons, and then removes the neuron k = arg min_{i∈[1…m]} E(i), provided the classification results still satisfy the pre-set requirements. This removal procedure is repeated to reduce the hidden layer until the stopping criterion is satisfied. The stopping criterion is specified in terms of the classification error rate (Er) and the number of hidden neurons (Hn).
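The removal loop of Eqs. (3)-(5) can be sketched as follows. The activation matrix G and weight matrix W are illustrative, the error budget stands in for the pre-set requirement, and the paper would retrain the network after each removal; here we only show the ranking-and-removal step:

```python
import numpy as np

def prune_rbf(G, W, err_budget):
    """Greedy removal per Eqs. (4)-(5) (sketch): repeatedly drop the hidden
    neuron whose removal changes the outputs least, while E(k) stays small.
    G: (N, m) hidden activations g(||x_i - c_j||); W: (m, C) output weights."""
    active = list(range(G.shape[1]))
    while len(active) > 1:
        # E(k) = mean over samples of max-over-outputs |w_tk * g_k(x_i)|
        E = [np.mean(np.max(np.abs(np.outer(G[:, k], W[k])), axis=1))
             for k in active]
        k_min = int(np.argmin(E))
        if E[k_min] > err_budget:
            break                      # stopping criterion reached
        del active[k_min]
    return active

rng = np.random.default_rng(2)
G = np.abs(rng.standard_normal((20, 5)))
W = np.array([[1.0, -1.0, 0.5],
              [0.8, 0.6, -1.0],
              [-0.7, 0.9, 0.4],
              [1.2, -0.5, 0.3],
              [1e-4, -1e-4, 1e-4]])   # neuron 4 barely contributes anywhere
kept = prune_rbf(G, W, err_budget=0.05)
print(4 in kept)                      # False: the near-zero-weight neuron is pruned
```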
4 Experiments

In this section, we present a few examples comprising six real-world problems. All the real-world examples are from the UCI Machine Learning Repository. For all problems, a Gaussian basis function is employed and the width parameter is determined automatically using the search procedure described in Section 2.

4.1 Iris Data
In the Iris data experiment, 75 patterns with four features are used as the training set and the remaining 75 as the testing set. The 75 training patterns are obtained by random selection from the original Iris data set of 150 patterns. The stopping criterion for the removal procedure is set to Er = 0.05 and Hn = 5. Because there are three classes in the Iris data, three output neurons are used. The new classifier is constructed to solve this problem; for the BP algorithm, the MSE goal is set to 0.014. Several different kinds of classification methods are compared with our proposed MGIN-based RBF network classifier on the Iris data classification problem. As shown in Table 1, a 96.7% testing accuracy rate is obtained by the new classifier on the Iris data. In comparison with the testing accuracy rates of the other models, for the case of random sampling with 50% training patterns and 50% test patterns, the new classifier has the best classification accuracy rate on the Iris data. SOM RBF is also a two-stage algorithm like the new one; the difference is that it employs SOM to obtain the candidate hidden neurons.

Table 1. Experimental Results on Iris data
Model                 Training accuracy (%)  Testing accuracy (%)
BP (4-5-3)            98.5                   96.5
Nearest               97.5                   96.3
SOM RBF (5 Neurons)   96.8                   95.4
OLS RBF (5 Neurons)   97.9                   96.0
MGIN RBF (5 Neurons)  98.1                   96.7
4.2 Several Benchmark Data Sets Testing From UCI Repository
In this section, we use five benchmark data sets from the UCI repository to further demonstrate the classification performance of the new classifier. The experimental conditions are the same as in the previous experiment. We again perform ten independent runs; half of the original data patterns (randomly selected) are used as training data and the remaining patterns are used as testing data.
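The 50%/50% random sampling procedure used in these runs can be sketched as follows. This is an assumed implementation; the `seed` parameter is added only to make individual runs reproducible.

```python
import random

def half_split(patterns, seed=None):
    """Randomly split patterns into equal-sized training and testing halves,
    as in the 50%/50% sampling procedure used for the UCI experiments."""
    rng = random.Random(seed)
    shuffled = list(patterns)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]
```

Repeating the split with ten different seeds yields the ten independent runs reported in the tables.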
Neuron Selection for RBF Neural Network Classifier
871
Table 2. Testing Results of Various Learning Models
Model      Liver  Breast  Echo  Wine  Va-Heart
BP         81.9   93.8    93.9  96.6  99.5
Nearest    69.7   92.1    90.3  84.5  95.1
SOM RBF    80.3   95.2    89.4  93.6  97.0
OLS RBF    79.6   93.7    91.5  95.7  98.2
MGIN RBF   81.5   95.4    92.1  95.9  98.5
According to the testing results in Table 2, the BP neural network has the best performance most of the time. However, the network structure of a BP neural network is difficult to determine for higher-dimensional pattern classification problems, and its convergence cannot be guaranteed. Moreover, the accuracy rate of the new classifier is clearly higher than that of the traditional RBF network classifier.
5 Conclusions

This paper proposes an MGIN-based neural-network classifier, which consists of two stages: a multiple-granularity immune network is employed to find the candidate hidden neurons, and a removal criterion is then used to delete the redundant neurons. Experimental results indicate that the new classifier has the best classification ability compared with the other conventional classifiers on the tested pattern classification problems.

Acknowledgement. This work is supported by the Graduate Student Innovation Foundation of Chongqing University of China (Grant No. 200506Y1A0230130) and the Research Fund for the Doctoral Program of Higher Education of China (Grant No. 2004061102).
References
1. Chen, S., Cowan, C.F., Grant, P.M.: Orthogonal Least Squares Learning Algorithm for Radial Basis Function Networks. IEEE Trans. Neural Networks 2(2) (1991) 302-309
2. Mao, K.Z., Huang, G.B.: Neuron Selection for RBF Neural Network Classifier Based on Data Structure Preserving Criterion. IEEE Trans. Neural Networks 16(6) (2005) 1531-1540
3. Huang, G.B., Saratchandran, P.: A Generalized Growing and Pruning RBF (GGAP-RBF) Neural Network for Function Approximation. IEEE Trans. Neural Networks 16(1) (2005) 57-67
4. Lee, S.J., Hou, C.L.: An ART-Based Construction of RBF Networks. IEEE Trans. Neural Networks 13(6) (2002) 1308-1321
5. Lee, H.M., Chen, C.M.: A Self-Organizing HCMAC Neural-Network Classifier. IEEE Trans. Neural Networks 14(1) (2003) 15-27
6. Miller, D., Rao, A.V.: A Global Optimization Technique for Statistical Classifier Design. IEEE Trans. on Signal Processing 44(12) (1996) 3108-3122
7. Timmis, J., Neal, M., Hunt, J.: An Artificial Immune System for Data Analysis. Biosystems 55(1) (2000) 143-150
8. Zhong, J., Wu, Z.F.: A Novel Dynamic Clustering Algorithm Based on Immune Network and Tabu Search. Chinese Journal of Electronics 14(2) (2005) 285-288
Hierarchical Radial Basis Function Neural Networks for Classification Problems

Yuehui Chen 1, Lizhi Peng 1, and Ajith Abraham 1,2

1 School of Information Science and Engineering, Jinan University, Jinan 250022, P.R. China
[email protected]
2 IITA Professorship Program, School of Computer Science and Engg., Chung-Ang University, Seoul, Republic of Korea
[email protected]
Abstract. The purpose of this study is to identify hierarchical radial basis function (HRBF) neural networks and to select the important input features for each sub-RBF neural network automatically. Based on pre-defined instruction/operator sets, a hierarchical RBF neural network can be created and evolved using a tree-structure-based evolutionary algorithm. This framework allows input variable selection and over-layer connections for the various nodes involved. The HRBF structure is developed using an evolutionary algorithm, and the parameters are optimized by the particle swarm optimization algorithm. Empirical results on benchmark classification problems indicate that the proposed method is efficient.
1 Introduction
A Hierarchical Neural Network (HNN) is a neural network architecture in which the problem is divided and solved in more than one step [1]. Ohno-Machado divides hierarchical networks into two architectures: bottom-up and top-down [1]. In bottom-up designs, several specialized networks are used to classify the instances and a top-level network (triage network) aggregates the results. In this design, all instances are used in all networks; however, the specialized networks work only on certain features. In contrast, in the top-down hierarchical architecture design, the top-level network divides the inputs to be classified among specialized networks. Many versions of HNN have been introduced and applied in various applications [1][3][4][5]. Erenshteyn and Laskov examine the application of a hierarchical classifier to the recognition of finger spelling [2]; they refer to hierarchical NNs as multistage NNs. The approach aims to minimize the network's learning time without reducing the accuracy of the classifier. Mat Isa et al. used a Hierarchical Radial Basis Function (HiRBF) network to increase RBF performance in diagnosing cervical cancer [3]. HiRBF cascades two RBF networks, where the networks have different structures but use the same algorithm. The first network classifies all data and performs a filtering process to ensure that only certain attributes

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 873–879, 2006. © Springer-Verlag Berlin Heidelberg 2006
Fig. 1. A basis function operator (left), and a tree-structural representation of a HRBF with instruction sets F = {+2 , +3 , +4 , +5 , +6 }, and T = {x1 , x2 , x3 } (right)
are fed to the second network. The study shows that HiRBF performs better than a single RBF. HRBF has also proved effective in the reconstruction of smooth surfaces from sparse, noisy data points [5]. In this paper, an automatic method for constructing HRBF networks is proposed. Based on pre-defined instruction/operator sets, an HRBF network can be created and evolved. HRBF allows input variable selection and over-layer connections for different nodes. In our previous work, the hierarchical structure was evolved using the Probabilistic Incremental Program Evolution (PIPE) algorithm with specific instructions [6][7] and Ant Programming [8]. In this research, the hierarchical structure is evolved using Extended Compact Genetic Programming, a tree-structure-based evolutionary algorithm. The fine-tuning of the parameters encoded in the structure is accomplished using particle swarm optimization (PSO). The novelty of this paper lies in the use of a flexible neural tree model for selecting the important features and improving the accuracy.
2 The Hierarchical RBF Model
The function set F and terminal instruction set T used for generating a hierarchical RBF model are described as S = F ∪ T = {+2, +3, ..., +N} ∪ {x1, ..., xn}, where +i (i = 2, 3, ..., N) denotes a non-leaf node instruction taking i arguments, and x1, x2, ..., xn are leaf node instructions taking no arguments. The output of a non-leaf node is calculated as an RBF neural network model (see Fig. 1). From this point of view, the instruction +i is also called a basis function operator with i inputs. The basis function operator is shown in Fig. 1 (left). In general, a basis function network can be represented as

y = ∑_{i=1}^{m} ωi ψi(x; θ),

where x ∈ R^n is the input vector, ψi(x; θ) is the i-th basis function, ωi is the corresponding weight of the i-th basis function, and θ is the parameter vector used in the basis functions. In this research, the Gaussian radial basis function is used,

ψi(x; θ) = ∏_{j=1}^{n} exp(−(xj − bj)² / aj²),

and the number of basis functions used in the hidden layer is the same as the number of inputs, that is, m = n. In the creation process of the HRBF tree, if a nonterminal instruction, i.e., +i (i = 2, 3, 4, ..., N), is selected, i real values are randomly generated and used to represent the connection strengths between the node
+i and its children. In addition, 2 × n² adjustable parameters ai and bi are randomly created as Gaussian radial basis function parameters. The output of the node +i can be calculated as above. The overall output of the HRBF tree is computed from left to right by a depth-first method, recursively.

Tree Structure Optimization. Finding an optimal or near-optimal neural tree is formulated as a product of evolution. In our previous studies, Genetic Programming (GP) and Probabilistic Incremental Program Evolution (PIPE) have been explored for structure optimization of the FNT [6][7]. In this paper, Extended Compact Genetic Programming (ECGP) [9] is employed to find an optimal or near-optimal HRBF structure. ECGP is a direct extension of ECGA to the tree representation, based on the PIPE prototype tree. In ECGA, Marginal Product Models (MPMs) are used to model the interaction among genes, represented as random variables, given a population of Genetic Algorithm individuals. MPMs are represented as measures of marginal distributions on partitions of random variables. ECGP is based on the PIPE prototype tree, and thus each node in the prototype tree is a random variable. ECGP decomposes or partitions the prototype tree into sub-trees, and the MPM factorizes the joint probability of all nodes of the prototype tree into a product of marginal distributions on a partition of its sub-trees. A greedy search heuristic is used to find an optimal MPM model under the framework of minimum encoding inference. ECGP can represent the probability distribution for more than one node at a time; thus, it extends PIPE in that the interactions among multiple nodes are considered.

Parameter Optimization with PSO. Particle Swarm Optimization (PSO) conducts searches using a population of particles, which correspond to individuals in an evolutionary algorithm (EA). A population of particles is randomly generated initially.
Each particle represents a potential solution and has a position represented by a position vector xi. A swarm of particles moves through the problem space, with the moving velocity of each particle represented by a velocity vector vi. At each time step, a function fi representing a quality measure is calculated using xi as input. Each particle keeps track of its own best position, which is associated with the best fitness it has achieved so far, in a vector pi. Furthermore, the best position among all the particles obtained so far in the population is kept track of as pg. In addition to this global version, another version of PSO keeps track of the best position among all the topological neighbors of a particle. At each time step t, by using the individual best position pi(t) and the global best position pg(t), a new velocity for particle i is computed by

vi(t + 1) = vi(t) + c1 φ1 (pi(t) − xi(t)) + c2 φ2 (pg(t) − xi(t))
(1)
where c1 and c2 are positive constants and φ1 and φ2 are uniformly distributed random numbers in [0, 1]. The term vi is limited to the range ±vmax; if the velocity violates this limit, it is set to the corresponding limit. Changing the velocity in this way enables particle i to search around its individual best position pi and the global best position pg. Based on the updated velocities, each particle changes its position according to the following equation:
xi (t + 1) = xi (t) + vi (t + 1).
(2)
For details of the general learning algorithm, please refer to refs. [6] and [7].
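The update rules (1) and (2) can be sketched for a single particle as follows. This is a minimal sketch; the constants c1, c2 and vmax are illustrative defaults, not values taken from the paper.

```python
import random

def pso_step(x, v, p_best, g_best, c1=2.0, c2=2.0, v_max=1.0):
    """One PSO step for a single particle: update the velocity by (1),
    clamp it to [-v_max, v_max], then update the position by (2)."""
    new_x, new_v = [], []
    for xi, vi, pi, pg in zip(x, v, p_best, g_best):
        vi = vi + c1 * random.random() * (pi - xi) + c2 * random.random() * (pg - xi)
        vi = max(-v_max, min(v_max, vi))  # limit velocity to +/- v_max
        new_v.append(vi)
        new_x.append(xi + vi)
    return new_x, new_v
```

In the full algorithm this step would be applied to every particle in the swarm, with `p_best` and `g_best` refreshed from the fitness function after each move.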
3 Classification Using HRBF Paradigms
Two common benchmark classification problems, the Iris data [10] and the Wine data, were studied in this research. For the Iris example, we used all 150 patterns to design a HRBF classifier system via the proposed algorithm. The instruction set used is S = {+2, +3, x1, x2, x3, x4}. For the Wine example, we used all 178 patterns to design a HRBF classifier system via the proposed algorithm. The instruction set used is S = {+2, +3, +4, x1, x2, ..., x13}. Tables 1 and 4 show the classification results of the conventional RBF and the proposed HRBF network for the Iris and Wine data sets, respectively. Tables 2 and 5 show the results of ten runs (i.e., ten different initializations of the parameters) for the Iris and Wine data sets, respectively. To estimate the performance of the proposed method on unseen data, five-fold cross-validation was performed on the Iris and Wine data sets. The evolved hierarchical architectures for the Iris and Wine data sets are shown in Fig. 2 and Fig. 3. Tables 3 and 6 report the results of five-fold cross-validation for the Iris and Wine data sets, respectively. For the Iris data, the average classification result is 100.0%

Table 1. Comparison of results on Iris data

Model              Recognition rate on total data set (%)
RBF [12]           95.3
HRBF (this paper)  99.5
Table 2. Results of ten runs on Iris data

Run      Misclassification  Recognition rate (%)  Features  Parameters  Training time (min)
1        1                  99.3                  3         64          2
2        1                  99.3                  4         60          6
3        0                  100                   3         84          15
4        1                  99.3                  4         108         7
5        0                  100                   4         108         8
6        1                  99.3                  3         60          9
7        1                  99.3                  4         64          7
8        1                  99.3                  4         84          4
9        1                  99.3                  3         104         5
10       0                  100                   4         108         11
Average  0.7                99.5                  3.6       84.4        7.4
Table 3. Five-fold cross-validation for Iris data

Fold                             1    2    3     4    5     Average
Training patterns                120  120  120   120  120   120
Misclassification (training)     0    0    0     0    0     0
Recognition rate (training) (%)  100  100  100   100  100   100
Testing patterns                 30   30   30    30   30    30
Misclassification (testing)      0    0    1     0    1     0.4
Recognition rate (testing) (%)   100  100  96.7  100  96.7  98.68
Fig. 2. The evolved HRBF architectures for five-fold cross-validation (Iris data)
Fig. 3. The evolved HRBF architectures for five-fold cross-validation (Wine data)

Table 4. Comparison of results on Wine data

Model              Recognition rate on total data set (%)
RBF [12]           98.89
HRBF (this paper)  99.6

Table 5. Results of ten runs on Wine data

Run      Misclassification  Recognition rate (%)  Features  Parameters  Training time (min)
1        1                  99.4                  5         85          6
2        1                  99.4                  4         60          5
3        1                  99.4                  4         64          13
4        0                  100                   5         107         12
5        1                  99.4                  5         84          9
6        0                  100                   6         113         10
7        1                  99.4                  4         64          7
8        1                  99.4                  6         108         8
9        0                  100                   6         79          12
10       1                  99.4                  4         84          11
Average  0.7                99.6                  4.9       84.8        10.3
Table 6. Five-fold cross-validation for Wine data

Fold                             1    2    3     4    5    Average
Training patterns                136  144  144   144  144  142.4
Misclassification (training)     0    0    0     0    0    0
Recognition rate (training) (%)  100  100  100   100  100  100
Testing patterns                 42   34   34    34   34   35.6
Misclassification (testing)      0    0    1     0    0    0.2
Recognition rate (testing) (%)   100  100  97.1  100  100  99.4
correct (no misclassifications) on the training data and 98.68% correct (about 0.4 misclassifications) on the test data. For the Wine data, the average classification result is 100.0% correct (no misclassifications) on the training data and 99.4% correct (about 0.2 misclassifications) on the test data.
4 Conclusions
Based on a novel representation and calculation of the hierarchical RBF models, an approach for evolving the HRBF was proposed in this paper. The hierarchical architecture and input selection method of the HRBF were accomplished using the ECGP algorithm, and the free parameters embedded in the HRBF model were optimized using a PSO algorithm. Simulation results show that the evolved HRBF models are effective for the classification of the Iris and Wine data. Our future work will concentrate on: (1) improving the convergence speed of the proposed method through parallel implementation of the algorithm; and (2) applying the proposed approach to more complex problems.
Acknowledgment. This research was supported by the NSFC under grant No. 60573065, and the Provincial Science and Technology Development Program of Shandong under grant No. SDSP2004-0720-03.
References
1. Ohno-Machado, L.: Medical Applications of Artificial Neural Networks: Connectionist Model of Survival. Ph.D. Dissertation. Stanford University (1996)
2. Erenshteyn, R., Laskov, P.: A Multi-Stage Approach to Fingerspelling and Gesture Recognition. In Proc. of the Workshop on the Integration of Gesture in Language and Speech (1996) 185-194
3. Mat Isa, N.A., Mashor, M.Y., Othman, N.H.: Diagnosis of Cervical Cancer using Hierarchical Radial Basis Function (HiRBF) Network. In Proc. of the Int. Conf. on Artificial Intelligence in Engineering and Technology (2002) 458-463
4. Ferrari, S., Maggioni, M., Alberto Borghese, N.: Multiscale Approximation with Hierarchical Radial Basis Functions Networks. IEEE Trans. on Neural Networks 15(1) (2004) 178-188
5. Ferrari, S., Frosio, I., Piuri, V., Alberto Borghese, N.: Automatic Multiscale Meshing Through HRBF Networks. IEEE Trans. on Instrumentation and Measurement 54(4) (2005) 1463-1470
6. Chen, Y., Yang, B., Dong, J.: Nonlinear System Modeling via Optimal Design of Neural Trees. Int. J. of Neural Systems 14(2) (2004) 125-137
7. Chen, Y., Yang, B., Dong, J., Abraham, A.: Time-Series Forecasting using Flexible Neural Tree Model. Information Sciences 174(3-4) (2005) 219-235
8. Chen, Y., Yang, B., Dong, J.: Automatic Design of Hierarchical TS-FS Models using Ant Programming and PSO Algorithm. Lecture Notes in Computer Science, Vol. 3516, Springer-Verlag, Berlin Heidelberg New York (2004) 285-294
9. Sastry, K., Goldberg, D.E.: Probabilistic Model Building and Competent Genetic Programming. In Genetic Programming Theory and Practice, Chapter 13 (2003) 205-220
10. Ishibuchi, H., Murata, T., Turksen, I.B.: Single-Objective and Two-Objective Genetic Algorithms for Selecting Linguistic Rules for Pattern Classification Problems. Fuzzy Sets Syst. 89 (1997) 135-150
11. Anderson, E.: The Irises of the Gaspe Peninsula. Bull. Amer. Iris Soc. 59 (1935) 2-5
12. Oyang, Y.J., Hwang, S.C., Ou, Y.Y., Chen, C.Y.: Data Classification with the Radial Basis Function Network Based on a Novel Kernel Density Estimation Algorithm. IEEE Trans. on Neural Networks (2005) 225-236
Biased Wavelet Neural Network and Its Application to Streamflow Forecast

Fang Liu 1, Jian-Zhong Zhou 1, Fang-Peng Qiu 2, and Jun-Jie Yang 1

1 School of Hydropower and Information Engineering, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
[email protected]
2 School of Management, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
[email protected]
Abstract. Long leading-time streamflow forecasting is a complex non-linear procedure, and traditional methods tend to suffer from slow convergence and low efficiency. A biased wavelet neural network (BWNN) based on the BP learning algorithm is proposed and used to forecast monthly streamflow. It inherits the multiresolution capability of wavelet analysis and the nonlinear input-output mapping trait of artificial neural networks. With the new set of biased wavelets, BWNN can effectively cut down the redundancy of the multiresolution calculation. A learning rate and momentum coefficient are employed in the BP algorithm to accelerate convergence and avoid falling into local minima. BWNN is applied to Fengtan reservoir as a case study. Its simulation performance is compared with the results obtained from autoregressive integrated moving average, genetic algorithm, feedforward neural network and traditional wavelet neural network models. It is shown that BWNN has a high model efficiency index and low computing redundancy, and provides satisfying forecast precision.
1 Introduction

Streamflow forecasting is very important in exploring and optimizing water resources management. Long leading-time forecasts with high accuracy give more scientific and efficient guidance for flood prevention, reservoir regulation and drainage basin management. Due to the complex non-linear process involved, such forecasts are generally built on qualitative analysis, since the corresponding quantitative analysis has greater errors, especially for extreme values of streamflow. Various methods have been adopted to solve this problem. Statistical forecasting is the most widely used [1]. Its basic principle is to seek and analyze the change rules of hydrology ingredients and their relations with other factors by statistics. Regression analysis [2] and time series analysis [3] are the main forms of such statistical methods, but they have the disadvantage of amplifying noise in the data when differencing. The threshold regression model [4] and projection pursuit model [5] were developed by setting restrictions on variables, yet they cannot fully explain many complex hydrological datasets with inherent static features. Recent research reveals that artificial neural networks (ANNs) have been widely used for modeling water resources variables. As ANNs are nonlinear data-driven methods, they are well suited to nonlinear input-output mapping.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 880 – 888, 2006. © Springer-Verlag Berlin Heidelberg 2006
However, slow convergence and local optimum problems inevitably arise in streamflow forecasting. Wavelet analysis theory is regarded as a great advance over Fourier analysis. By dilations and translations, the wavelet transform can extract the detail information of signals with multiresolution capability. In 1992, Zhang [6] explicated the concept and algorithm of the wavelet neural network (WNN). WNN inherits the merits of the wavelet transform and artificial neural networks. It implements the wavelet transform by adjusting the shape of the wavelet basis during the training period and achieves excellent approximation and pattern classification performance [7]. Adaptive WNNs [8] were presented to improve on WNN. Robert [9] illustrated adaptively biased wavelet expansions. The biased wavelets are a new set of nonzero-mean functions, and they show good performance when working with a "large number" of functions from multiresolution calculations. In this paper, we give the topology of the biased wavelet neural network (BWNN) and apply it to forecasting monthly streamflow. Section 2 introduces the basics of the wavelet neural network. In Section 3, the biased wavelets, the topology of BWNN and the BP-based learning algorithm are explained. In Section 4, the monthly streamflow forecast of Fengtan reservoir using BWNN is employed as a case study, and results from the modeling experiment are reported and compared with other algorithms. Finally, conclusions and suggestions are given in Section 5.
2 Wavelet Neural Network

The wavelet transform of a signal f(x) is defined as the integral transformation [10]:

W_f(a, b) = ∫_R f(x) Ψab(x) dx    (1)

in which Ψab(x) = |a|^(−1/2) Ψ((x − b)/a). Ψab(x) is called the wavelet function, Ψ(x) is the mother wavelet, and a and b are the dilation and translation parameters, respectively.
Fig. 1. The Topology of Wavelet Neural Network for approximation
According to [6], a signal f(x) can be approximated by a series of wavelets, obtained by dilating and translating the mother wavelet as follows:

f(x) = ∑_{k=1}^{K} wk Ψ((x − bk)/ak)    (2)
where wk are weight coefficients, K is the number of wavelets, and f(x) is the approximation output of WNN for the signal. Fig. 1 gives the topology of the network corresponding to formula (2).
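Formula (2) can be sketched directly; the Mexican hat mother wavelet adopted later in Section 3.3 serves as Ψ here (an illustrative sketch only):

```python
import math

def mexican_hat(x):
    # Mexican hat mother wavelet: (1 - x^2) * exp(-x^2 / 2)
    return (1.0 - x * x) * math.exp(-x * x / 2.0)

def wnn_output(x, w, a, b):
    """f(x) = sum over k of w_k * Psi((x - b_k) / a_k), formula (2)."""
    return sum(wk * mexican_hat((x - bk) / ak) for wk, ak, bk in zip(w, a, b))
```

Training the WNN amounts to adjusting the lists w, a and b so that this sum fits the target signal.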
3 Biased Wavelet Neural Network

3.1 Biased Wavelets

The set H of biased wavelets employed in this paper is defined in [9] as:

H = {h_{a,b,c}(x) = |a|^(−1/2) [Ψ((x − b)/a) + c Φ((x − b)/a)], a ∈ R+, b, c ∈ R}    (3)
in which Ψ ∈ L²(R) is a mother wavelet and the function Φ satisfies the following conditions:
1) Φ ∈ L²(R);
2) Φ̃(0) ≠ 0;
3) Φ(x) decreases rapidly to zero as x → ∞;
4) Φ̃(ω) decreases rapidly to zero as ω → ∞;

where Φ̃ denotes the Fourier transform of Φ. It is obvious that the proposed biased wavelets do not satisfy the "admissibility condition" (since Φ̃(0) ≠ 0), but using them can efficiently reduce the redundancy in the wavelet neural network, which is analyzed in detail hereafter.

3.2 Biased Wavelet Neural Network
Let the biased wavelets substitute for the mother wavelet Ψ in equation (2), and let S_{a,b,c}(x) be the corresponding summation. We can write:

S_{a,b,c}(x) = ∑_{k=1}^{K} wk · h_{ak,bk,ck}(x)
             = ∑_{k=1}^{K} wk · [Ψ((x − bk)/ak) + ck · Φ((x − bk)/ak)]
             = ∑_{k=1}^{K} wk · Ψ((x − bk)/ak) + ∑_{k=1}^{K} wk · ck · Φ((x − bk)/ak)
             = f(x) + L(x)    (4)
From equation (4), when c = 0, S_{a,b,c=0}(x) = f(x). That is to say, f(x) ⊂ S(x), and when the bias part L(x) is added to f(x), the signal to be approximated is expressed as:

f(x) = ∑_{k=1}^{K} wk · [Ψ((x − bk)/ak) + ck · Φ((x − bk)/ak)]    (5)
Based on the biased wavelets and equation (5), Fig. 2 shows the topology of the proposed biased wavelet neural network (BWNN). It is constructed from two important parts, the traditional part and the bias part. The traditional part is the same as that in the traditional wavelet neural network (WNN), while in the bias part the function Φ is multiplied by the coefficient c.
Fig. 2. The Topology of Biased Wavelet Neural Network for approximation. The dashed rectangular refers to integration of biased wavelon and the dotted ellipses indicate the important two parts of BWNN, traditional part and bias part, respectively.
The essence of the wavelet transform is mapping the 1-D signal f(x) isometrically to the 2-D wavelet space made up of the set W_f(a, b). In WNN, the wavelet transform of f(x) and its inverse transform do not correspond one-to-one. As the wavelet functions Ψab(x) are over-complete, they are not linearly independent and there exist relations among them, which results in redundancy. For BWNN, it is the bias part that cuts this redundancy down. BWNN is an adaptive form of WNN: according to the different selections of c and Φ, h_{abc}(x) has more freedom, and the time-frequency window changes flexibly over the whole time axis without fixed values, which means the shape of the wavelet adapts to the particular problem rather than being a fixed-shape wavelet with fitted parameters.
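Equation (5), the biased wavelon sum, can be sketched with the Mexican hat Ψ and Gaussian Φ adopted in Section 3.3 (the parameter values in the usage below are illustrative, not taken from the paper):

```python
import math

def psi(x):   # Mexican hat mother wavelet
    return (1.0 - x * x) * math.exp(-x * x / 2.0)

def phi(x):   # Gaussian bias function
    return math.exp(-x * x / 2.0)

def bwnn_output(x, w, a, b, c):
    """f(x) = sum over k of w_k * [Psi((x-b_k)/a_k) + c_k * Phi((x-b_k)/a_k)],
    equation (5). With all c_k = 0 this reduces to the traditional WNN of (2)."""
    return sum(wk * (psi((x - bk) / ak) + ck * phi((x - bk) / ak))
               for wk, ak, bk, ck in zip(w, a, b, c))
```

Setting every c_k to zero recovers the traditional WNN output, which matches the statement S_{a,b,c=0}(x) = f(x) below equation (4).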
3.3 The Learning Algorithm of BWNN Based on BP
In this section, the proposed BWNN is trained with the BP algorithm, and the weights wk, scales ak, translations bk and bias parameters ck are adjusted adaptively. Suppose that ym (m = 1, ..., M) is the observed value of the m-th signal and f(xm) is the calculated output of BWNN. Let t′ = (xm − bk)/ak, and define the cost function as:

E = (1/2) ∑_{m=1}^{M} [ym − f(xm)]²    (6)

The partial derivatives of E with respect to each parameter are:

∂E/∂wk = −∑_{m=1}^{M} (ym − f(xm)) · (Ψ(t′) + ck · Φ(t′))
∂E/∂bk = ∑_{m=1}^{M} (ym − f(xm)) · wk · (Ψ′(t′) + ck · Φ′(t′)) / ak
∂E/∂ak = ∑_{m=1}^{M} (ym − f(xm)) · wk · (Ψ′(t′) + ck · Φ′(t′)) · (xm − bk) / ak²
∂E/∂ck = −∑_{m=1}^{M} (ym − f(xm)) · wk · Φ(t′)    (7)
The learning rate η and momentum coefficient α are adopted in revising the parameters to avoid falling into local minima and to improve the convergence of BWNN. The iterative equations are:

wk(m+1) = wk(m) − η · ∂E/∂wk + α · Δwk(m)
bk(m+1) = bk(m) − η · ∂E/∂bk + α · Δbk(m)
ak(m+1) = ak(m) − η · ∂E/∂ak + α · Δak(m)
ck(m+1) = ck(m) − η · ∂E/∂ck + α · Δck(m)    (8)
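One iteration of the update equations (8) for a single parameter can be sketched as follows (a hypothetical helper; the gradient value would come from equations (7), and the default η and α are the initial values used in the case study):

```python
def momentum_update(theta, grad, delta_prev, eta=0.18, alpha=0.72):
    """Gradient step with momentum, as in the iterative equations (8):
    theta(m+1) = theta(m) - eta * dE/dtheta + alpha * delta(m).
    Returns the new parameter value and the step taken (the next delta)."""
    delta = -eta * grad + alpha * delta_prev
    return theta + delta, delta
```

The same helper would be applied to each wk, bk, ak and ck in turn, carrying each parameter's previous step as `delta_prev`.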
In the following section, BWNN shows its ability in monthly streamflow forecasting. The Mexican hat wavelet, used as the mother wavelet in the case study, is:

Ψ(x) = (1 − x²) exp(−x²/2)    (9)
The employed bias function is the Gaussian function, which exhibits good resolution in both the time and frequency domains:

Φ(x) = exp(−x²/2)    (10)
4 Case Study

Fengtan reservoir is located in Yuanling county, on the lower reaches of the You river, 45 kilometers away from the center of Yuanling. The reservoir drainage area is 17500 square kilometers and occupies 94.4% of the whole You river basin. Fengtan reservoir is a seasonal-regulation reservoir, and its basic information is listed in Table 1. The water supply of the Yuanling county area depends highly on Fengtan reservoir, and monthly streamflow forecasts help the operators to prepare well for future streamflow and make proper decisions. In this part, BWNN is verified on real data: the monthly streamflow time series recorded at Fengtan reservoir from Jan. 1952 to Dec. 2002, shown in Fig. 3. The monthly streamflow series is computed by summing the average daily flows and dividing by the number of days in the corresponding month. It is periodic, time dependent and shows significant linkage in the frequency domain, which leads to the nonstationary condition. Data from Jan. 1952 to Nov. 1997 are chosen to train the BWNN, and the rest, from Dec. 1997 to Dec. 2002, are used for testing. The constructed BWNN has a single input, sixteen biased wavelons and a single output. The initialization procedure for the parameters ak and bk resembles that in [11]. The values of wk and ck are uniformly sampled in [−1, 1], and usually ck begins at zero. As mentioned above, proper selection of the learning rate η and momentum coefficient α benefits the performance of BWNN. In the beginning, η = 0.18 and α = 0.72, and they are adjusted manually during convergence. In addition to the defined cost function E, the model efficiency index R² is also used to evaluate the model performance and estimate the approximation precision.
R² = 1 − ∑_{m=1}^{M} (ym − f(xm))² / ∑_{m=1}^{M} (f(xm) − f̄)²    (11)
where ym and f(xm) are the observed and calculated values, and f̄ is the mean of the streamflow time series. In general, R² > 0.8 indicates a good model. To further assess the performance of the proposed BWNN, it is compared to the other models listed in Table 2 using the cost function E and the model efficiency index R². BWNN has an R² index value of 0.94, indicating very satisfying forecasts, while the autoregressive integrated moving average (ARIMA) model is unsatisfactory, with the largest E. Since the monthly series remain nonstationary after differencing, the ARIMA model, in which the first- and second-order moments depend only on time differences, can hardly produce good results. The genetic algorithm (GA) model is a compromise: although it has a much lower E than ARIMA, it is the least efficient
model among the four evolutionary algorithms. The feedforward neural network (FNN) and traditional wavelet neural network (WNN) models, in columns four and five respectively, are configured like BWNN: the same network construction, the BP-based learning algorithm and the Mexican hat wavelet, except that FNN uses a sigmoid function. Both have acceptable index values and provide good forecast results with better R² index values, but they are not accurate enough when higher precision is required in the flood season. During the validation period, Fengtan reservoir had complex inflow cases, such that two months in sequence may run sharply from the maximum to the minimum (or the inverse). FNN and WNN cannot capture these sudden changes (peaks or troughs) at some points, which leads to worse statistical indexes than BWNN. Fig. 4 exhibits the forecasting results of BWNN and WNN. For the nonstationary monthly streamflow series, BWNN shows its superiority in lessening signal noise and cutting down redundancy with an adaptive bias coefficient and bias function. In streamflow forecasting, generally speaking, it is acceptable to supply a slightly larger value at a peak and a smaller one at a trough, because anticipating the worse condition allows more preparation. The maximum value of Jul. 1999 during the test period is detected by BWNN, and so is the minimum. However, WNN misses several peak values and even gets the trend of the series wrong for certain months. In the Fengtan drainage area, the flood season is from June to September, but inundation sometimes comes unexpectedly around November without any signs in advance, not even precipitation. Such a peak value is difficult to forecast, but in this case study BWNN supplies a close result for Nov. 2000. Although WNN leaves out part of the extreme values, it captures the overall annual streamflow of both the abundant-water years and the dry years.
Fig. 3. Observed monthly streamflow time series of Fengtan Reservoir from 1952 to 2002 (612 months in total)
Biased Wavelet Neural Network and Its Application to Streamflow Forecast
Table 1. Basic information for Fengtan Reservoir

Multi-year average precipitation      1415 mm
Multi-year average streamflow         15.9 billion m³
Maximum instantaneous streamflow      16900 m³/s
Minimum instantaneous streamflow      40 m³/s
Total reservoir capacity              1.73 billion m³
Table 2. Model validation statistics for the forecasts

Model    E       R²
ARIMA    670     0.64
GA       203.7   0.82
FNN      186.9   0.85
WNN      115.2   0.87
BWNN     38.1    0.94
Fig. 4. Forecasting results of the biased wavelet neural network (BWNN) and the traditional wavelet neural network (WNN), both based on the BP algorithm. The circled points at Jul. 1999 and Nov. 2000 are the maximum streamflow and an inundation outside the flood season, respectively, both of which are hard to forecast.
5 Conclusions and Suggestions

The BP-based biased wavelet neural network (BWNN) is presented and applied to monthly streamflow forecasting in this paper. Monthly streamflow is stochastic, periodic and influenced by many factors; because of this complexity, traditional statistical approaches have difficulty reflecting its nonlinear characteristics. BWNN is an improvement over the traditional wavelet neural network (WNN). Owing to its capability of multiresolution analysis and nonlinear mapping, BWNN can effectively decrease computing redundancy through flexible changes of the time-frequency window. In the
monthly streamflow forecast simulation of Fengtan reservoir, BWNN shows its effectiveness and its superiority in convergence and efficiency over ARIMA, GA, FNN and WNN. Given the limited scope of this study of BWNN, some suggestions are listed below:

1) For monthly series, owing to their typical periodicity, delayed copies of the signal could be added as input nodes of BWNN.

2) A general framework for constructing biased wavelets is still expected. The bias function used in the case study is a Gaussian, a compromise between the time and frequency domains. Other bias functions meeting the definition may be employed to better reflect the essence of the signals.

3) During the learning process, network convergence is nonlinear and may suffer from oscillation and local minima, an inherent deficiency of the BP algorithm. Careful selection of the learning rate and momentum coefficient can relieve this, but better learning algorithms are still preferred.

Acknowledgements. This research is supported by the Natural Science Foundation of China under grant No. 50579022.
References

1. Fan, X.Z.: Mid and Long-term Hydrologic Forecast. Hehai University Publishing Company, Nanjing (1999)
2. Zhang, L.P., Wang, D.Z.: Study of Mid-Long-term Hydrological Forecasting Based on Weather Factors. Water Resources and Power 21(3) (2003) 3–5
3. Zhang, W.S., Wang, Y.T.: Application of the Time Series Decomposable Model in Medium and Long-term Hydrologic Forecasting. Hydrology 21(1) (2001) 21–24
4. Jin, J.L., Yang, X.H.: Application of Threshold Regressive Model to Predicting Annual Runoff. Journal of Glaciology and Geocryology 22(3) (2000) 230–234
5. Jin, J.L., Wei, Y.M., Din, J.: Application of Projection Pursuit Threshold Regressive Model for Predicting Annual Runoff. Scientia Geographica Sinica 22(2) (2002) 171–175
6. Zhang, Q.H., Albert, B.: Wavelet Networks. IEEE Trans. Neural Networks 3(5) (1992) 889–898
7. Chen, T.P., Chen, H.: Approximation Capability to Functions of Several Variables, Nonlinear Functionals, and Operators by Radial Basis Function Neural Networks. IEEE Trans. Neural Networks 6(4) (1995) 904–910
8. Szu, H., Telfer, B., Kadambe, S.: Neural Network Adaptive Wavelets for Signal Representation and Classification. Optical Engineering 31(9) (1992) 1907–1916
9. Galvao, R.K.H., Takshi, Y., Tania, N.R.: Signal Representation by Adaptive Biased Wavelet Expansions. Digital Signal Processing 9(4) (1999) 225–240
10. Daubechies, I.: Ten Lectures on Wavelets. SIAM, Philadelphia (1992)
11. Zhou, B., Shi, A.G., Cai, F., Zhang, Y.S.: Wavelet Neural Networks for Nonlinear Time Series Analysis. In: Yin, F., Wang, J., Guo, C. (eds.): Lecture Notes in Computer Science, Vol. 3174. Springer-Verlag, Berlin Heidelberg New York (2004) 430–435
A Goal Programming Based Approach for Hidden Targets in Layer-by-Layer Algorithm of Multilayer Perceptron Classifiers*

Yanlai Li, Kuanquan Wang, and Tao Li

School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, P.R. China
{yanlai_l, wangkq, litao_l}@hit.edu.cn
Abstract. The layer-by-layer (LBL) algorithm is one of the well-known training algorithms for multilayer perceptrons. It converges fast with low computational complexity. Unfortunately, when LBL calculates the desired hidden targets, a set of linear equations must be solved. If the determinant of the coefficient matrix turns out to be zero, the solution is not unique, which results in the stalling problem. Furthermore, a truncation error is introduced by inverting the sigmoid function. Based on the idea of the goal programming technique, this paper proposes a new method to calculate the hidden targets. A satisfactory solution of the hidden targets is provided through a goal programming model. Furthermore, the truncation error can be avoided efficiently by assigning higher priority to the limits of the variable domain. The effectiveness of the proposed method is demonstrated by computer simulation on a mushroom classification problem.
1 Introduction

Multilayer perceptrons (MLPs) have been widely and successfully applied to classification tasks. Their most common training method is the backpropagation (BP) algorithm. However, BP is a gradient descent method with a fixed learning rate, which leads to slow convergence. Many algorithms have emerged to accelerate training, including the layer-by-layer (LBL) algorithm, which is based on the least squares method [1], [2]. Its training process is composed of three least squares problems. This algorithm shows fast convergence with much less computational complexity than the conjugate gradient or Newton methods. Unfortunately, a "stalling" problem may arise when solving for the hidden targets; once it occurs, convergence becomes extremely difficult. In order to remove the stalling problem inherent in the LBL algorithm, this paper presents a new method for the hidden targets based on the goal programming method [3]. A satisfactory solution of the hidden targets can be obtained through an achievement function. Furthermore, because the solutions are restricted to a predefined domain by assigning it a
This research is partly supported by NSFC (60571025).
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 889 – 894, 2006. © Springer-Verlag Berlin Heidelberg 2006
higher priority in the goal programming model, the truncation error caused by inverting the sigmoid function is avoided accordingly.
2 Description of the LBL Algorithm

For simplicity, we discuss a three-layered MLP with only one hidden layer. Assume that the numbers of units in the input, hidden and output layers are M, H and N, respectively. In addition, the number of training patterns is P. Suppose the p-th training pattern is r^p = [r_1^p, r_2^p, ..., r_M^p, 1]^T and its desired output is d^p = [d_1^p, d_2^p, ..., d_N^p]^T. For each pattern, the input and output of the hidden layer are

    s_j^p = Σ_{k=1}^{M+1} w_{jk}^{(1)} r_k^p   and   x_j^p = g(s_j^p) = 1/(1 + exp(−s_j^p)),

respectively. Similarly, those of the output layer are

    t_i^p = Σ_{j=1}^{H+1} w_{ij}^{(2)} x_j^p   and   z_i^p = g(t_i^p) = 1/(1 + exp(−t_i^p)),

with x_{H+1}^p = 1. Here W^{(2)} = {w_{ij}^{(2)}} and W^{(1)} = {w_{jk}^{(1)}} are the weight matrices of the output layer and the hidden layer, respectively. The aim of training is to minimize the following mean-square error (MSE):

    E_out = (1/2P) Σ_{p=1}^P Σ_{i=1}^N (t_i^p − b_i^p)²,                    (1)

where b_i^p = f^{−1}(d_i^p) is the desired input of the i-th output unit. In the LBL algorithm, each iteration step is decomposed into three independent least squares problems.

A. Optimization of the output weights W*^{(2)} and the hidden targets x*^p. Setting the first partial derivatives of E_out with respect to w_i^{(2)} and x^p to zero, we obtain

    ∂E_out/∂w_i^{(2)} = 0  ⇒  Σ_{p=1}^P x^p (x^p)^T w_i^{(2)} = Σ_{p=1}^P x^p b_i^p,     (2)

where x^p = (x_1^p, x_2^p, ..., x_H^p, 1)^T, w_i^{(2)} = (w_{i1}^{(2)}, w_{i2}^{(2)}, ..., w_{iH}^{(2)}, w_{i(H+1)}^{(2)})^T, and W^{(2)} = (w_1^{(2)}, w_2^{(2)}, ..., w_N^{(2)}). Furthermore,

    ∂E_out/∂x^p = 0  ⇒  (x^p)^T W^{(2)} = (b^p)^T.                          (3)

Define Q_2 = Σ_{p=1}^P x^p (x^p)^T and D_2 = Σ_{p=1}^P x^p (b^p)^T; then Eq. (2) can be written as

    Q_2 W^{(2)} = D_2,  or  W^{(2)} = Q_2^{−1} D_2.                          (4)

The optimized weights W*^{(2)} can be solved through Eq. (4). Then x*^p can be obtained approximately by substituting W*^{(2)} into Eq. (3).

B. Optimization of the hidden weight matrix W*^{(1)}. When x*^p is known, the desired hidden input h^p = [h_1^p, h_2^p, ..., h_H^p]^T is determined by h_j^p = f^{−1}(x_j^{*p}), j = 1, ..., H. Then W*^{(1)} can be obtained through a process similar to that for W*^{(2)}. The MSE of the hidden layer is defined as

    E_hid = (1/2P) Σ_{p=1}^P Σ_{j=1}^H (s_j^p − h_j^p)².                     (5)

Analogously to the former process, we obtain

    Q_1 W^{(1)} = D_1,  or  W^{(1)} = Q_1^{−1} D_1,                          (6)

where Q_1 = Σ_{p=1}^P r^p (r^p)^T and D_1 = Σ_{p=1}^P r^p (h^p)^T.

In case A, a procedure solves the hidden targets x*^p in order to optimize the parameters of the hidden layer. It should be pointed out that the coefficient matrix in Eq. (3) is positive semi-definite; when its determinant is zero, the LBL algorithm may come across a stalling problem, which makes the network difficult to converge. Moreover, when we calculate the inverse of the desired output, a truncation error arises from limiting the value to the domain of the sigmoid function. In order to remove these two problems, we propose a new method for the hidden targets based on a goal programming model.
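Step A above is ordinary least squares: solving Q₂W⁽²⁾ = D₂ is equivalent to regressing the desired inputs on the hidden outputs. A minimal numpy sketch with random stand-in data (the shapes and names are ours, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
P, H, N = 50, 4, 3                    # patterns, hidden units, output units

X = rng.standard_normal((P, H + 1))   # rows are (x^p)^T, with a bias column
X[:, -1] = 1.0
B = rng.standard_normal((P, N))       # rows are desired output-layer inputs (b^p)^T

Q2 = X.T @ X                          # sum_p x^p (x^p)^T
D2 = X.T @ B                          # sum_p x^p (b^p)^T
W2 = np.linalg.solve(Q2, D2)          # output weights, Eq. (4)

# W2 coincides with the ordinary least-squares solution of X W = B:
W2_lstsq, *_ = np.linalg.lstsq(X, B, rcond=None)
assert np.allclose(W2, W2_lstsq)
```

When Q₂ is singular (the stalling case this paper addresses), `np.linalg.solve` fails and the hidden targets obtained from Eq. (3) are no longer uniquely determined.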
3 Goal Programming Model for Hidden Targets

Suppose that the weights connecting the output layer and the hidden layer, w_{ij}^{(2)} (j = 1, ..., H+1, i = 1, ..., N), are already known. In order to calculate the hidden targets x^p = (x_1^p, x_2^p, ..., x_H^p, 1)^T, the error of the p-th pattern is considered:

    E_out^p = (1/2P) Σ_{i=1}^N ( Σ_{j=1}^{H+1} w_{ij}^{(2)} x_j^p − g^{−1}(d_i^p) )²,     (7)

where g^{−1}(y) = ln(y/(1−y)), 0 < y < 1, and d_i^p is 0.01 or 0.99, corresponding to the classification task. According to the properties of the sigmoid function g(·) and the value of d_i^p, the problem splits into two cases:

1) When d_i^p = 0.99, minimizing E^p amounts to maximizing Σ_{j=1}^{H+1} w_{ij}^{(2)} x_j^p over x^p. For simplicity, we denote it by max_{x^p} C_i x^p, where C_i = (w_{i1}^{(2)}, ..., w_{iH}^{(2)}, w_{i(H+1)}^{(2)}), i = 1, ..., N.

2) When d_i^p = 0.01, minimizing E^p amounts to maximizing Σ_{j=1}^{H+1} (−w_{ij}^{(2)}) x_j^p, which we also denote by max_{x^p} C_i x^p, with C_i = (−w_{i1}^{(2)}, ..., −w_{iH}^{(2)}, −w_{i(H+1)}^{(2)}).

In summary, the original problem min E^p becomes the following linear multi-goal programming problem:

    max_{x^p} { C_i x^p }                                                                (8)
    s.t.  x_j^p < 1,  x_j^p > 0,  j = 1, 2, ..., H.

The standard form of problem (8) is

    max Z = C x^p                                                                        (9)
    s.t.  A x^p < B,  x^p > 0,

where C = (c_{ij})_{N×(H+1)}, A = (a_{ij})_{H×(H+1)}, Z = [Z_1, ..., Z_N]^T, and B = [b_1, b_2, ..., b_H]^T.

We now further transform this multi-goal programming problem into a goal programming problem. Denote E = [e_1, e_2, ..., e_N]^T, where e_i is the desired value for Z_i, i = 1, 2, ..., N. Generally, the e_i cannot all be achieved. Therefore, to express the degree of the difference, we assign a positive and a negative deviation d_i^+, d_i^− (d_i^+, d_i^− ≥ 0) to every goal function, where d_i^+ denotes the amount by which the value exceeds e_i and d_i^− the amount by which it falls below e_i. The value of the i-th goal function (i = 1, 2, ..., N) cannot carry both deviations at the same time, so one of them must be zero, i.e. d_i^+ · d_i^− = 0. Accordingly, the goal function equations are set up as

    C x^p + D^− − D^+ = E,                                                              (10)

where D^+ = [d_1^+, d_2^+, ..., d_N^+]^T and D^− = [d_1^−, d_2^−, ..., d_N^−]^T.

After introducing the desired values and their positive and negative deviations, the original goal functions have been transformed into part of the constraints. We can likewise introduce positive and negative deviations for the restriction A x^p < B and add it to the final set of constraints. Now we have

    C x^p + D^− − D^+ = E
    A x^p + F^− − F^+ = B                                                               (11)
    x^p > 0,  D^− ≥ 0,  D^+ ≥ 0,  F^− ≥ 0,  F^+ ≥ 0,

where F^+ = [f_1^+, f_2^+, ..., f_H^+]^T and F^− = [f_1^−, f_2^−, ..., f_H^−]^T.

The next task is to find a feasible solution that makes every goal function close to its desired value, that is, to minimize all the deviations. The single synthetic goal that reflects how well the desired values of the goal functions in the original problem are achieved is called an achievement function. Using the method introduced in reference [3], we construct the following achievement function:

    min  Σ_l d_l^− + Σ_t d_t^+ + Σ_m f_m^− + Σ_n f_n^+,                                 (12)

where l, t ∈ [1, N] and m, n ∈ [1, H]. Because the goals in a multi-goal programming problem are not equally important, a priority degree q and a weight coefficient u are introduced into the model. For example, we can assign a very high priority to the constraint on our decision variable, x_j^p < 1, j = 1, ..., H; all other goals then have to be satisfied under this restriction, and the truncation error is thus avoided. Finally, we obtain a general goal programming model for the problem min E^p:

    min g = Σ_{k=1}^K q_k ( Σ_i u_{ki}^− d_i^− + Σ_j u_{kj}^+ d_j^+ + Σ_m v_{km}^− f_m^− + Σ_n v_{kn}^+ f_n^+ )   (13)
    s.t.  C x^p + D^− − D^+ = E
          A x^p + F^− − F^+ = B
          x^p > 0,  D^− ≥ 0,  D^+ ≥ 0,  F^− ≥ 0,  F^+ ≥ 0.

The variables in Eq. (13) have the following meanings: x^p is the decision variable, here the hidden target; c_{ij} is the coefficient of x_j^p in the i-th goal; e_i is the desired value of the i-th goal; a_{ij} is the coefficient of x_j^p in the i-th restriction; b_i is the right-hand-side constant of the i-th restriction; d_i^− and d_i^+ are the negative and positive deviation variables of the i-th goal; q_k is the priority degree of the k-th level (K levels altogether); u_{ki}^− and u_{ki}^+ (u_{ki}^−, u_{ki}^+ ≥ 0) are the weight coefficients of d_i^− and d_i^+ at priority q_k. The goal programming model can be solved by an improved simplex method. Then in each iteration step, we can get a satisfactory solution of the hidden targets x^p = (x_1^p, x_2^p, ..., x_H^p, 1)^T.
4 Simulation Results

The data are chosen from the UCI Machine Learning Repository [4]: a two-class mushroom classification problem. The MLP architecture for this problem is selected as 20-11-1, and 7448 patterns are selected. The BP, ABP, CG, and classical LBL
algorithms are also carried out for comparison. Table 1 lists the simulation results, and Fig. 1 shows the contrasting learning curves with the axis in logarithmic scale. From the results we can see that the new LBL algorithm with the goal programming method uses the fewest epochs and the least computation time to converge to 0.001. Table 1 also shows that its recognition rate is higher than those of the BP, CG, and LBL algorithms. We can therefore conclude that the new LBL algorithm performs best among the five algorithms.

Table 1. Results of different algorithms
Algorithm   Epochs   Time (s)   MSE     RT* (%)
BP          30000    2749       0.02    98.4
ABP         2370     248        10^-3   100
CG          219      46         10^-3   99.8
LBL         56       52         10^-3   99.7
New LBL     10       14         10^-3   100

* "RT" denotes "recognition rate".
Fig. 1. Contrast learning curve for the mushroom classification problem
5 Conclusions

This paper has described a new approach for solving the hidden targets in the LBL training algorithm of MLP classifiers. After setting up a goal programming model of the hidden targets according to the target values and assigning different priorities to the goal functions, we obtain a satisfactory solution of the hidden targets, avoiding the stalling problem and the truncation error as well. Simulation results indicate that the proposed method is efficient for solving these problems in the LBL algorithm.
References

1. Ergezinger, S., Thomsen, E.: An Accelerated Learning Algorithm for Multilayer Perceptrons: Optimization Layer by Layer. IEEE Trans. Neural Networks 6(1) (1995) 31–42
2. Wang, G.-J., Chen, C.-C.: A Fast Multilayer Neural Networks Training Algorithm Based on the Layer-by-layer Optimizing Procedures. IEEE Trans. Neural Networks 7(3) (1996) 768–775
3. Zhao, K.: Goal Programming and Its Applications. Tongji University Press (1987)
4. http://www.ics.uci.edu/~mlearn/MLRepository.html
SLIT: Designing Complexity Penalty for Classification and Regression Trees Using the SRM Principle

Zhou Yang¹, Wenjie Zhu², and Liang Ji¹

¹ State Key Laboratory of Intelligent Technology and Systems & Institute of Information Processing, Dept. of Automation, Tsinghua University, Beijing, 100084, China
² Dept. of Statistics and Actuarial Sciences, The University of Hong Kong, Hong Kong S.A.R.
[email protected]
Abstract. Statistical learning theory has formulated the Structural Risk Minimization (SRM) principle, based upon the functional form of a risk bound on the generalization performance of a learning machine. This paper addresses the application of this formula, which is equivalent to a complexity penalty, to model selection for decision trees, where the machine capacity of decision trees is quantified using an empirical estimation approach. Experimental results show that, for both classification and regression problems, this novel strategy of decision tree pruning performs better than alternative methods. We name classification and regression trees pruned by virtue of this methodology Statistical Learning Intelligent Trees (SLIT).
1 Introduction

The paramount goal of supervised learning is to extract from the sample data a mapping rule or model whose expected risk of prediction on future unknown data is as low as possible. The issue of complexity control is always of primary concern in adjusting the learning machine to obtain the most desirable generalization performance. There is a growing awareness that Statistical Learning Theory (SLT) [1] provides a sophisticated theoretical framework for complexity assessment and control. According to SLT, a confidence bound (popularly known as the VC-bound) on the generalization performance of the learning machine is formulated based on the machine's capacity (VC-dimension), which can be used for rigorous error provision and complexity selection. Once there exists a nested structure in the machine's admissible complexity, an effective trade-off between the complexity of the machine and the quality of fit on the training data can be made by seeking the minimum of this risk bound, i.e. the Structural Risk Minimization (SRM) principle. The support vector machines [1, 8] can be regarded as a renowned example of implementing the SRM principle in classification. In regression problems, where the ordinary squared-error loss is defined, a similar analytic bound is given in SLT. Both theoretical and empirical studies have shown that the SRM method based on this analytic bound also performs favorably in complexity control for building regression models [2, 4, 7].

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 895 – 902, 2006. © Springer-Verlag Berlin Heidelberg 2006
The decision tree methodology has proved to be valuable in learning and modeling in diverse areas, especially useful when the features of the problem have ordinal or nominal attributes (categorical variables with or without a natural ordering) blended with continuous ones. Note that decision tree can be viewed as a kind of learning machine that has an inborn sequence of nested structures (by constraining the size of the tree), associated with each size level a fixed machine capacity. Hence it would be highly suitable for utilizing the SRM induction principle while considering complexity control. The only technical obstacle is that the analytic expression of the capacity of a tree machine in terms of VC-dimension is extremely difficult to derive. Fortunately, [3] and [4] have proposed that the capacity of a learning machine can be estimated empirically through simulated procedures, and [5] further extends this methodology to machines with insufficient search abilities, typically to the decision trees. In this paper, we will show how this empirical estimation, when working together with the SRM form of complexity penalty, can bring about better generalization performances for both classification and regression tasks. The experiments on benchmark datasets show better results as compared to classical methods of decision tree complexity control (tree pruning). We name classification and regression trees pruned by this methodology as Statistical Learning Intelligent Trees, or SLIT for short.
2 Preliminaries and the Estimation Method

The mathematical abstraction of a learning machine is characterized by the designated set of functions or mappings ŷ = f(x; α), α ∈ Λ, that it can realize, where x is the predictor vector and y indicates the response; α is the parameter which specifies the mapping, and Λ is the set of all admissible parameters, referred to as the admissible space. Let z denote the pair (x, y). The training set {z_1, z_2, ..., z_l}, symbolized as Z(l), is independently and identically drawn from the underlying population distribution P(z) = P(x, y) = P(x)P(y|x). The training process can be viewed as a mapping from the training set Z(l) to a certain element α in the admissible space Λ, denoted as α = α_tr(Z(l)), where the subscript tr stands for 'training'. With a specific loss function L(y, f(x; α)) = Q(z; α), the prediction risk of a solution α ∈ Λ is defined as R(α) ≜ E[Q(z; α)] = ∫_D Q(z; α) dP(z), where D indicates the population distribution over the space Z; the empirical risk on the training set Z(l) is defined as R_emp(α; Z(l)) ≜ (1/l) Σ_{i=1}^l Q(z_i; α). Typically, we use the 0-1 loss for binary classification and the squared-error loss for regression.

Although we wish to optimize the value of the prediction risk, only the empirical risk is observable on the training set, and statistically there is a discrepancy between the prediction risk and the empirical risk. Therefore, it would be valuable to propose a formula which governs the statistical behavior of the prediction risk based on the training sample size l, the observed empirical risk R_emp, and the capacity of the machine. In
[1] and [6], the following form of confidence bound is provided for binary classification to control the increase of the prediction risk on generalization:

    R(α) ≤ R_emp(α; Z(l)) + Φ(l, h, R_emp(α; Z(l)), η),                     (1)

with the analytic function Φ(·) of the form

    Φ(l, h, R_emp, η) = (ε/2)(1 + √(1 + 4 R_emp/ε)),  where  ε = c (h(ln(2l/h) + 1) − ln η)/l,   (2)

where c is a universal constant, and h is an integer quantity corresponding to the capacity of the machine. A similar analytic bound is also formulated in [1, 6] for regression problems with squared-error loss. If the integer h corresponds to the maximum number of vectors x_1, x_2, ..., x_h that can be perfectly shattered by indicator functions in the admissible space Λ, i.e. the (generic) VC-dimension of the learning machine on Λ, it is proved [1, 6] that the analytic bound (1) holds with probability 1 − η, as does its counterpart form for regression problems. This risk bound, which can also be utilized as a complexity penalty, has been shown to perform favorably for complexity selection and control in both theoretical and empirical studies [1, 2, 4, 7], in case sufficient knowledge about the capacity of the learning machine (in terms of VC-dimension) is available. Such a functional form used as a complexity penalty is called the SRM induction principle [1].

However, for a wide range of learning machines, decision trees being a typical example, the exact expression for the VC-dimension is extremely difficult to derive from the combinatorial definition in [1]. Moreover, if the search ability of the learning machine is significantly insufficient, or equivalently, it only utilizes a small local subset of the space Λ during the search in every learning trial, it is necessary to make an offset to the generic VC-dimension that makes up for the incomplete use of the entire space Λ. For these kinds of learning machines, as pointed out by Vapnik et al. [3], the only practical solution is to measure the machine's capacity (VC-dimension) empirically through well-designed simulated experiments. The essential idea is that because some analytic forms of the bounds are satisfactorily tight, they can also serve as good approximations to the risk discrepancies when appropriately re-estimated constants in the formulae are filled in.
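The confidence term Φ of Eq. (2), as we read it, is easy to compute; a sketch (the default constant 0.159 is the rescaled value quoted later in Section 3, and the function name is ours):

```python
import math

def phi(l, h, r_emp, eta, c=0.159):
    """VC-type confidence term, Eq. (2) as we read it:
    eps = c*(h*(ln(2l/h)+1) - ln eta)/l, then Phi = (eps/2)*(1 + sqrt(1 + 4*R_emp/eps))."""
    eps = c * (h * (math.log(2 * l / h) + 1) - math.log(eta)) / l
    return (eps / 2.0) * (1.0 + math.sqrt(1.0 + 4.0 * r_emp / eps))

# The penalty grows with capacity h and shrinks with sample size l:
assert phi(1000, 20, 0.1, 0.1) > phi(1000, 10, 0.1, 0.1)
assert phi(4000, 10, 0.1, 0.1) < phi(1000, 10, 0.1, 0.1)
```

These two monotonicities are exactly what makes Φ usable as a complexity penalty: at a fixed sample size, larger-capacity machines pay a larger premium over their empirical risk.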
An accurate estimate of the VC-dimension gives the best fit between the theoretical formula and a set of experimental measurements of the risk discrepancy on data sets of varying sizes [3]. Define the extra risk on generalization, or EROG, as ξ(α; Z(l)) ≜ R(α) − R_emp(α; Z(l)). Recall that when the training algorithm α_tr(·) is specified, the solution α yielded from training can be viewed as a function of the training sample, α = α_tr(Z(l)). Thereby the prediction risk, the empirical risk, and the EROG ξ(α_tr(Z(l)); Z(l)) all depend solely on the random sample Z(l) drawn i.i.d. from the generator distribution D. The estimation mechanism used in our work, first formulated in [3] and further developed by [4] and [5], is based on the following approximate relation:
    ξ_{1−η}(R_emp) ≈ Φ(l, h, R_emp, η),                                     (3)

where ξ_{1−η}(R_emp) denotes the 1 − η quantile of the conditional distribution of ξ(α_tr(Z(l)); Z(l)) subject to a fixed value of R_emp, and the function Φ(·) takes the same form as (2) except with an appropriately rescaled parameter c̃ in place of c. In the measurement procedure, we first assign an adequate form to the simulated generator distribution D; hence the prediction risk of any solution α is determinate and can be computed through either analytic or Monte Carlo methods (for detailed instruction on devising the generator distribution, refer to [5]). In each experiment, a training sample Z(l) is drawn i.i.d. from the generator distribution D, the solution α_tr(Z(l)) is determined using a specific training algorithm tr, and the EROG ξ is obtained according to ξ = R(α) − R_emp(α; Z(l)). We only collect the EROG values from the experiments whose empirical risk is exactly equal to the predesignated value R_emp (note that the function Φ(·) needs to be filled in with a definite value of R_emp), until a certain number m of observations are gathered at the same sample size l (termed a design point by [4]). Then we take the quantile statistic ξ̂_{1−η} from these m observations as an estimate of the true conditional quantile ξ_{1−η}(R_emp) at this l, and conduct such measurements at a variety of design points l.

As proposed in [4, 5], if the training process is the special one that always yields the global minimum of the empirical risk, i.e. α_tr(Z(l)) = arg inf_{α∈Λ} R_emp(α; Z(l)), the generic VC-dimension gives a nice fit between the formula and the experimental curve of ξ_{1−η}(R_emp) over varying l. By finding the integer parameter ĥ that gives the best fit,

    ĥ = arg min_h Σ_{s=1}^d ( ξ̂_{1−η}(R_emp)_s − Φ(l_s, h, R_{emp,s}, η) )²,   (4)

a good estimator for the generic VC-dimension is hereby provided. If the minimization ability of the training process is significantly insufficient, i.e. R_emp(α_tr(Z(l)); Z(l)) ≫ inf_{α∈Λ} R_emp(α; Z(l)), it is argued in [5] that the estimator ĥ in (4) no longer serves as an approximation of the generic VC-dimension, but rather approximates the pseudo VC-dimension, which is specific to the training algorithm α_tr(·) as well as the admissible space Λ.

Definition. For a set of indicator functions Q(z; α), α ∈ Λ, given the admissible space Λ and the specific mapping α = α_tr(Z(l)), the pseudo VC-dimension with respect to Λ and α_tr(·) jointly is the minimal integer h*, for which the bound

    R(α_tr(Z(l))) ≤ R_emp(α_tr(Z(l)); Z(l)) + Φ(l, h*, R_emp(α_tr(Z(l)); Z(l)), η)   (5)

holds true with probability 1 − η simultaneously for all population distributions. The function Φ(l, h*, R_emp, η) and the constant c are the same as those in (2).
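The fitting step of Eq. (4) is a one-dimensional grid search over the integer h; a self-contained sketch (the measured quantiles below are synthetic, generated noise-free from the formula itself so the search should recover the planted h; names are ours):

```python
import math

def phi(l, h, r_emp, eta, c=0.159):
    # VC-type confidence term, Eq. (2) as we read it
    eps = c * (h * (math.log(2 * l / h) + 1) - math.log(eta)) / l
    return (eps / 2.0) * (1.0 + math.sqrt(1.0 + 4.0 * r_emp / eps))

def fit_h(design_points, xi_quantiles, r_emp, eta, h_max=200):
    """Eq. (4): the integer h minimizing the squared misfit between measured
    EROG quantiles and the analytic curve across the design points l_s."""
    def misfit(h):
        return sum((xi - phi(l, h, r_emp, eta)) ** 2
                   for l, xi in zip(design_points, xi_quantiles))
    return min(range(1, h_max + 1), key=misfit)

ls = [200, 400, 800, 1600, 3200]
true_h = 25
xi = [phi(l, true_h, 0.1, 0.1) for l in ls]   # idealized measurements
assert fit_h(ls, xi, 0.1, 0.1) == true_h
```

With real, noisy quantile measurements the recovered ĥ is only approximate, which is why [4] recommends spreading the design points over a wide range of sample sizes.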
Note that the function Φ(l, h, R_emp, η) is monotone with respect to h; as long as the generic VC-dimension on Λ exists, the pseudo VC-dimension (specific to a certain training mapping) exists and provides a lower bound on the former. More importantly, for a machine whose training process has significantly incomplete search ability, the pseudo VC-dimension provides a tighter bound on the generalization performance than the generic VC-dimension does (for a rigorous proof, see [5]).
3 The Estimated Capacity of Learning Machines

In order to further validate the empirical method that quantifies the capacity of a learning machine, a series of control experiments was first conducted on learning machines whose capacity in terms of VC-dimension is perfectly known theoretically, i.e. linear machines. For linear machines, the predictor vector x comes from R^k, and the set of mapping rules is defined by f(x; α) = θ(α_0 + Σ_{j=1}^k α_j x_j), where α = (α_0, α_1, ..., α_k)^T and θ(·) is the threshold function.

Since a training method that minimizes the empirical risk of a linear machine exists (the one described in [3]), which provides a satisfactory approximation to the 'ideal searcher' α_id(Z(l)) = arg inf_{α∈Λ} R_emp(α; Z(l)), the estimator ĥ computed from this simulated measuring procedure serves as a good approximation to the generic VC-dimension of this kind of learning machine, asserted theoretically to be k+1 [1]. We use the constant value c̃ = 0.159 and 1 − η = 0.1, as recommended in [5]. The estimated results of the generic VC-dimension for k+1 = 10, 20, 30, 40 are shown in Table 1.

Table 1. The true and empirically-estimated values of the generic VC-dimension in the control experiments

True generic VC-dim.        10   20   30   40
Estimated generic VC-dim.   11   20   30   39
The results of the control experiments indicate that the above-described method gives accurate estimates when quantifying the machine capacity, agreeing well with the theoretical values. For the learning machine of primary interest in this paper, decision trees, the predictor space X consists of a combination of nominal, ordinal and continuous variables, with the number of categories for each of the nominal or ordinal attributes specified. The parameter α indicates the structure of the tree and the output assignment at each terminal node. As suggested in [9, 10] and elsewhere, the number of splits s can be used as an effective index to characterize and constrain the complexity of the tree classifier. On a specific predictor space X, the set Λ_s comprises only the decision trees whose number of splits is less than or equal to s. The machine capacity of Λ_s is finite and determinate, yet irrelevant to the specific data distribution on X.
Because the standard tree-growing algorithm CART (Breiman et al. [9]) is greedy and does not minimize the empirical risk of the trained tree to the infimal value on Λ_s, the estimator ĥ computed this way approximates the pseudo VC-dimension associated with Λ_s. On the other hand, recall that the ultimate use of such estimates is to be filled into the SRM form of complexity penalty, the right side of (1); the pseudo VC-dimension then indeed provides a tighter risk bound than the generic quantity does, once the practical training process α_tr(·) is known and specified. We estimated the pseudo VC-dimension of decision trees for each splits number s on four different predictor spaces. The structures of the four predictor spaces correspond to the benchmark datasets adult census, german credit, Breiman#1, and abalone, respectively. Breiman#1 is a simulated regression problem proposed in [9], while the other three are from the UCI repository [12]. The structures of the four predictor spaces are summarized in Table 2, and the respective estimates of the pseudo VC-dimension are displayed in Table 3.
4 Learning Experiments

We propose in this paper that the SRM form of the complexity penalty indeed provides an effective approach to selecting among decision tree pruning choices, and that if the amended pseudo VC-dimension is filled in, as in (5), a tightened risk bound is obtained, which may lead to even more favorable performance in error provision and complexity control. In classification tasks, the formula Remp + Φ(l, h*, Remp, η), with the empirical constant c and the estimated pseudo VC-dimension (as listed in Table 3) filled in, is used to evaluate the penalized risk while seeking the optimal size of the tree. The pruning strategy of SLIT is as follows. First, the tree is fully grown until no more splits are possible. Then the subtree with the minimal empirical risk at each pruned size s is prepared as a candidate choice along the pruning sequence, and the nested structure of pruning choices is thereby formed. Finally, the single choice with the smallest penalized risk is selected. In the experiments, the competing pruning strategies linear penalty [9] and pessimistic pruning [11] were both employed for comparison. Besides, the oracle choice of pruning, an imaginary procedure that always selects the best pruning choice with the test set disclosed, was also measured. Note that this ideal pruning procedure sets a theoretical limit on the performance of any real pruning strategy. In regression tasks, the analytic risk bound for squared-error loss becomes [1, 2]

Remp ⋅ ( 1 − (h(ln(l/h) + 1) − ln η) / l )+^(−1) ,

where (·)+ denotes the positive part. Furthermore, we found that with the estimated pseudo VC-dimension filled in, this formula provides a favorable heuristic complexity penalty for regression trees as well, with all the pruning steps remaining the same. For comparison, the linear penalty method and the notion of the ideal pruning choice also apply to regression tasks, while pessimistic pruning and C4.5 are confined to classification tasks only. For the proposed pruning method SLIT, although the determinations of the pseudo VC-dimension at different tree sizes require
SLIT: Designing Complexity Penalty
901
Table 2. Summary of the predictor-space structures corresponding to the four datasets

Name       #cont.  #bin.  #nom.  categories                      #ord.  levels
adult      5       1      6      8, 7, 14, 6, 5, 4               1      16
german     3       3      10     4, 5, 10, 5, 4, 3, 4, 3, 3, 4   4      5, 4, 4, 4
Breiman#1  0       1      9      3, 3, 3, 3, 3, 3, 3, 3, 3       0      /
abalone    7       0      1      3                               0      /
Table 3. Estimates of the pseudo VC-dimension for decision trees on the four predictor-spaces, with respect to the number of splits

# splits  adult  german  Breiman#1  abalone
4         19     17      10         16
5         23     21      12         19
6         27     25      14         21
7         30     28      16         24
8         33     31      17         27
9         37     34      19         29
10        40     37      21         32
11        43     40      23         35
12        47     43      24         37
13        50     46      26         40
14        53     49      27         42
15        56     52      29         45
16        59     55      31         47
17        62     58      32         50
18        66     61      34         52
Note: the pseudo VC-dimension is relevant only to the structure of the predictor-spaces, and independent of the specific data in these datasets.

Table 4. Averaged performance in classification

classification          ideal   linear  pessi.  C4.5    SLIT
adult   avg risk (%)    16.26   17.55   17.69   18.48   16.80
        risk dev (%)    0.86    1.23    1.27    1.39    1.08
        exe time (s)    /       .0163   .0187   .0096   .0184
german  avg risk (%)    24.17   26.33   26.28   26.85   24.94
        risk dev (%)    1.03    1.28    1.34    1.49    1.12
Table 5. Averaged performance in regression

regression               ideal   linear  SLIT
Breiman#1  avg risk      5.701   6.486   6.114
           risk dev      0.96    1.13    1.05
           exe time (s)  /       0.296   0.475
abalone    avg risk      6.588   7.326   6.958
           risk dev      0.279   0.614   0.450
considerable computational expense, they are independent of the specific data and can be completed prior to the training stage, well before the real observations of the data are provided. At pruning time, the execution of SLIT involves only arithmetic calculations and runs competitively fast compared to the alternative methods. In every learning experiment, we randomly drew a training sample of size 400 from each of the datasets and used all of the remaining records in the respective dataset as the test set (the test set size for Breiman#1 is 5000, as recommended in [9]). We repeated the experiments 100 times and summarize the performance of each tree-pruning method in Table 4 (classification problems) and Table 5 (regression problems), respectively. The averaged execution times are also listed in the tables; all of the experiments were conducted on a 1.85GHz AMD Athlon machine with 512MB of RAM. We see that in both classification and regression tasks, SLIT consistently yields better averaged prediction results than its competitors, and comes very close to the ideal performance that any tree-pruning method can attain.
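The selection step of such a penalized-risk pruning can be sketched in a few lines. The penalty below is the generic Vapnik bound with the pseudo VC-dimension estimates (a row of Table 3) plugged in; the empirical constant c and the exact form of Φ used by SLIT are not reproduced here, so this is an illustration of the procedure rather than the paper's implementation.

```python
import math

def vc_penalty(l, h, r_emp, eta=0.05, c=1.0):
    """Generic SRM-style penalty Phi(l, h, Remp, eta): eps/2 * (1 +
    sqrt(1 + 4*Remp/eps)) with eps = c*(h(ln(l/h)+1) - ln eta)/l.
    An assumed stand-in for the paper's exact Phi."""
    eps = c * (h * (math.log(l / h) + 1) - math.log(eta)) / l
    return (eps / 2.0) * (1.0 + math.sqrt(1.0 + 4.0 * r_emp / eps))

def select_pruning(candidates, l, h_of_s, eta=0.05, c=1.0):
    """candidates: (s, Remp) pairs along the nested pruning sequence;
    h_of_s: maps the number of splits s to its estimated pseudo
    VC-dimension. Returns the (size, risk) minimizing the penalized risk."""
    best = None
    for s, r_emp in candidates:
        risk = r_emp + vc_penalty(l, h_of_s[s], r_emp, eta, c)
        if best is None or risk < best[1]:
            best = (s, risk)
    return best
```

Since the h_of_s table is precomputed before training, the loop above is the only work done at pruning time, which is why SLIT's execution cost in Tables 4 and 5 stays comparable to the simpler penalties.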
5 Conclusion

In this paper, we have successfully applied the SRM principle of induction, as proposed in [1], to the complexity selection (pruning) problem of decision trees, in both the classification and regression cases. The machine capacity of decision trees, with respect to tree size, is quantified using the empirical estimation method proposed in [3-5]. Experimental results have shown that this novel tree-pruning strategy outperforms the alternative methods in both classification and regression tasks.
Acknowledgements

The authors would like to thank Vladimir N. Vapnik of NEC Research Institute, Princeton, and Professor Jiawei Han of the University of Illinois at Urbana-Champaign for their help in improving the quality of this work. This research was supported by the Ministry of Science and Technology of China, No. 2003CB517106.
References
1. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
2. Cherkassky, V., Shao, X., Mulier, F., Vapnik, V.: Model Complexity Control for Regression Using VC Generalization Bounds. IEEE Trans. Neural Networks 10 (1999) 1075-1089
3. Vapnik, V., Levin, E., LeCun, Y.: Measuring the VC-dimension of a Learning Machine. Neural Computation 6 (1994) 851-876
4. Shao, X., Cherkassky, V., Li, W.: Measuring the VC-dimension Using Optimized Experimental Design. Neural Computation 12 (2000) 1969-1986
5. Yang, Z., Ji, L.: A New Way to Estimate the VC-dimension with Application to Decision Trees. Technical report DA-050812, Inst. of Information Processing, Dept. of Automation, Tsinghua University (2005) (submitted)
6. Vapnik, V.: Estimation of Dependences Based on Empirical Data. Springer-Verlag (1982)
7. Cherkassky, V., Ma, Y.Q.: Comparison of Model Selection for Regression. Neural Computation 15 (2003) 1691-1714
8. Cortes, C., Vapnik, V.: Support-Vector Networks. Machine Learning 20 (1995) 273-297
9. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Chapman & Hall (1993)
10. Mitchell, T.M.: Machine Learning. McGraw-Hill (1997)
11. Mansour, Y.: Pessimistic Decision Tree Pruning Based on Tree Size. In: Proc. 14th Int'l Conf. on Machine Learning (ICML 97) (1997) 195-201
12. Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. University of California, Irvine, Dept. of Information and Computer Science (1998)
Flexible Neural Tree for Pattern Recognition

Hai-Jun Li1,2, Zheng-Xuan Wang2, Li-Min Wang2, and Sen-Miao Yuan2

1 College of Computer Science, Yantai University, Yantai 264005, China
2 College of Computer Science and Technology, Jilin University, ChangChun 130012, China
Abstract. This paper presents a novel induction model named Flexible Neural Tree (FNT) for pattern recognition. FNT uses a decision tree for basic qualitative analysis and a neural network for subsequent quantitative analysis. The Pure Information Gain I(Xi; ϑ), defined as the test selection measure with which FNT constructs the decision tree, can handle continuous attributes directly. When the information embodied in a neural network node reveals new attribute relations, FNT extracts symbolic rules from the neural network to improve the performance of the decision process. Experimental studies on a set of natural domains show that FNT has clear advantages with respect to generalization ability.
1
Introduction
The discovery of decision rules and recognition of patterns from data examples is one of the most challenging problems in machine learning [1][2]. If the input space contains continuous attributes, decision tree methods need to apply discretization with threshold values so that continuous-valued predictive attributes can be incorporated into the learned tree. But applying discretization too early may create improper rectangular divisions, and consequently the decision rules may not reflect real data trends. Thus, it is very important to construct a model that discovers the dependencies among discrete and continuous attributes and improves the accuracy of decision tree algorithms. Neural networks are well known for modeling the nonlinear characteristics of sample data [3]. They have great advantages in dealing with noisy, inconsistent and incomplete data. Decision trees and neural networks are alternative methodologies for pattern classification. Atlas et al. [4] presented a performance comparison of these two methodologies in load forecasting, power security and vowel recognition, and found that each methodology has its own advantages. Some researchers indicated that hybrid approaches can take advantage of both symbolic and connectionist models. Setiono et al. [5] tried to extract symbolic rules from a neural network to improve the performance of the decision process. They used, in sequence, weight-decay back-propagation over a three-layer feed-forward network, a pruning process to remove irrelevant connection weights, a clustering of hidden unit activations, and extraction of rules from discretized unit activations. Zhou et al. [6] presented a novel approach named hybrid decision tree (HDT) that virtually embeds feed-forward neural networks in some leaves of a binary decision tree, which
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 903–908, 2006. © Springer-Verlag Berlin Heidelberg 2006
904
H.-J. Li et al.
is motivated by recognizing that dealing with unordered/ordered attributes is similar to performing qualitative/quantitative analysis. Based on the above considerations, we define a new test selection measure, the Pure Information Gain I(Xi; ϑ), and then propose a novel approach named Flexible Neural Tree (FNT), which uses a decision tree to process discrete or discretized attributes and subsequently uses a neural network to process continuous attributes when necessary. When the information embodied in a neural network node reveals new discrete attribute relations, FNT extracts symbolic rules from the neural network to improve the performance of the decision process. The remainder of this paper is organized as follows: Section 2 introduces the basic idea of the Pure Information Gain I(Xi; ϑ). Section 3 illustrates FNT in detail. Section 4 presents experimental results comparing the performance of FNT and HDT. Section 5 sums up the paper.
2
The Pure Information Gain I(Xi; ϑ)
In this discussion we use capital letters such as X, Y for attribute names, and lower-case letters such as x, y to denote specific values taken by those attributes. Let P(·) denote the probability, and p(·) the probability density function. Suppose training set T consists of predictive attributes {X1, · · ·, Xn} and class attribute C. Each attribute Xi is either continuous or discrete. Entropy is commonly used to characterize the impurity of an arbitrary collection of examples, and on its basis researchers have introduced many effective test selection measures. But entropy has a limitation when dealing with continuous attributes. Given a collection T, let H(C|Xi) represent the conditional entropy of C given Xi; then

H(C|Xi) = − Σ_{Xi} P(xi) Σ_C P(c|xi) log P(c|xi)    (1)

In the case of discrete attributes, the set of possible values is numerable. To compute the conditional probability we only need to maintain a counter for each attribute value and for each class. In the case of continuous attributes, the number of possible values is infinite, which makes it impossible to compute the conditional entropy directly. We begin by adding an implicit attribute ϑ. For any given attribute Xi, ϑ takes two values: ϑ(+) (if the corresponding instance is classified correctly by Xi) or ϑ(−) (if not). Let Xi represent one of the predictive attributes. According to Bayes' theorem, when some examples satisfy Xi = xi, their class labels are most likely to be

C* = arg max_{c∈C} P(c)P(xi|c)   (if Xi is nominal)
C* = arg max_{c∈C} P(c)p(xi|c)   (if Xi is continuous)    (2)
Then, the relationship between Xi, C and ϑ can be described as in Table 1:
Table 1. The relationship between Xi, C and ϑ

Xi     C     C*    ϑ
xi1    c1    c1    ϑ(+)
xi2    c2    c1    ϑ(−)
· · ·  · · · · · · · · ·
xiN    c2    c2    ϑ(+)
The conditional entropy of ϑ relative to this boolean situation is defined as

H(ϑ|Xi) = −P′(+) log P′(+) − P′(−) log P′(−)    (3)

where P′(+) is the proportion of ϑ(+) and P′(−) is the proportion of ϑ(−) under Xi's predictions. On the other hand, if only the class attribute C exists, the whole instance space is classified to the most probable class. Correspondingly the entropy of ϑ is

H(ϑ) = −P(+) log P(+) − P(−) log P(−)    (4)

where P(+) and P(−) are the corresponding proportions of ϑ(+) and ϑ(−), respectively. The information gain is

σ = H(ϑ) − H(ϑ|Xi)
  = [P′(+) log P′(+) − P(+) log P(+)] + [P′(−) log P′(−) − P(−) log P(−)]    (5)
  = I(+) + I(−)

where I(+) describes the positive information that attribute Xi embodies for classification, whereas I(−) describes irrelevant or negative information. On this basis, we define a new test selection measure, I(Xi; ϑ), to describe the extent to which Xi supplies pure information:

I(Xi; ϑ) = I(+) − I(−)    (6)

Put another way, I(Xi; ϑ) describes the extent to which the model constructed by attribute Xi fits the class attribute C. The predictive attribute that maximizes I(Xi; ϑ) is the most useful for improving classification accuracy.
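For a nominal attribute, the quantities above can be computed directly from counts. The following sketch follows equations (3)-(6); the function names and the majority-class tie-breaking are our own choices, so it should be read as an illustration rather than the authors' code.

```python
import math
from collections import Counter

def pure_information_gain(xs, cs):
    """Sketch of the Pure Information Gain I(Xi; theta) for a nominal
    attribute Xi. theta(+) marks examples that Xi's per-value Bayes
    prediction classifies correctly; the primed proportions are taken
    under Xi's model, the unprimed ones under the majority class alone."""
    n = len(xs)
    # prediction of Xi's model: the most probable class for each value xi
    best = {x: Counter(c for xi, c in zip(xs, cs) if xi == x).most_common(1)[0][0]
            for x in set(xs)}
    p_plus_prime = sum(best[x] == c for x, c in zip(xs, cs)) / n   # P'(+)
    # prediction from the class attribute alone: the global majority class
    majority = Counter(cs).most_common(1)[0][0]
    p_plus = sum(c == majority for c in cs) / n                    # P(+)

    def f(v):  # v * log2(v), with 0 log 0 = 0
        return v * math.log(v, 2) if v > 0 else 0.0

    i_plus = f(p_plus_prime) - f(p_plus)            # I(+) of equation (5)
    i_minus = f(1 - p_plus_prime) - f(1 - p_plus)   # I(-) of equation (5)
    return i_plus - i_minus                         # I(Xi; theta), eq. (6)
```

For a continuous attribute, the per-value counts would be replaced by the density-based prediction of equation (2).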
3
Constructive Induction of FNT
FNT combines neural network technologies and traditional decision tree learning capabilities to discover more compact and correct rules. FNT is implemented by replacing leaf nodes of a decision tree with feed-forward neural networks only when necessary. In each step of the learning process, the training examples falling into the current node are partitioned into different subsets or splits. When the current node is to
split, the diversities of its branches are measured and compared with a pre-set diversity threshold. If any branch contains continuous attributes only, the branch is replaced by a feed-forward neural network. Otherwise, if any diversity is smaller than the diversity threshold, the learning process terminates and future examples falling into the current node are classified to its most probable class. The diversity is measured as the proportion of training examples that do not belong to the most probable class of the node, and was initialized to 0.2 in our experiments. When new training examples are fed, FNT does not re-train on the entire training set. Instead, it learns the knowledge encoded in the new examples by modifying some parameters associated with existing hidden units, or by slightly adjusting the network topology, adaptively appending one or two hidden units and the relevant connections to the existing network. Since the network architecture is set up adaptively, the disadvantage of manually determining the number of hidden units, common to most feed-forward neural networks, is overcome. As more and more examples are fed into the neural network, saving all the used examples becomes too expensive to be practical; some rules may instead be inferred from the neural network. With the help of previous training examples saved in a neural network node, we can examine new training examples along with the saved ones to see whether a split should be updated and one or more new branches can be evolved from the node. That is, the neural network is used to find the dependencies among attributes. The training data and the results obtained by the neural nodes are then sent to the traditional decision tree learning algorithm to extract symbolic rules that include oblique decision hyperplanes instead of general input attribute relations.
The architecture of the FNT model is depicted as follows:
——————————————————————
TreeGrow(τ, θ)
Input: training set τ; threshold θ
Output: a hybrid decision tree with some neural leaf nodes.
Initialize(τ); /* τ is the training set of the current node */
If ConAttrOnly(τ) = True Then /* there exist continuous attributes only in the current node */
  NeuralNode(τ) /* replace the leaf node with a neural node */
Else
  X* = SelTest(τ); /* from the current node select the test X* with maximal Pure Information Gain */
  If IsCon(X*) Then
    X = Discretize(τ, X*) /* apply discretization to the continuous attribute X* */
    Partition(τ, τ′, X); /* partition τ according to the value of X*; τ′ is the training set of the child node */
    δ = PureGain(τ′) /* compute the diversity of the current node */
  End If
  If δ < θ Then /* compare δ with the threshold θ */
    MarkNode(τ) /* mark the leaf node with the most probable class label */
    Return(τ)
  Else
    TreeGrow(τ′, θ) /* the entire process is recursively repeated on the portion of τ that matches the test leading to the node */
  End If
End If
——————————————————————
4
Experiments and Comparisons
Our experiments compared FNT with another hybrid method, HDT, for classification. We chose 18 data sets from the UCI machine learning repository¹ for our experiments. Each data set consists of a set of classified examples described in terms of varying numbers of continuous and nominal attributes. In data sets with missing values, we considered the most frequent attribute value as a candidate. For comparison purposes, FNT used the same feed-forward neural network, FANNC, as HDT for classification. The parameters of FANNC are set to their default values [8]: the default responsive characteristic width αij of the Gaussian weight is set to 0.25, the bias of the hidden layer unit to 0.3, the responsive center adjustment step δ to 0.1, and the leakage competition threshold to 0.8.

Table 2. Average classification accuracy and standard deviation

Data set        HDT                FNT
Anneal          96.8475 ± 3.8571   98.8646 ± 1.3843 √
Audiology       78.8674 ± 2.8560   82.6654 ± 1.9573 √
Australian      86.9684 ± 2.0587   83.8765 ± 4.0834
Breast-w        91.8745 ± 2.3112   94.5213 ± 2.2561 √
German          70.8326 ± 3.9368   76.7114 ± 2.9366 √
Glass           68.1246 ± 2.6826   65.0336 ± 1.0245
Heart-c         76.8646 ± 2.0862   83.8221 ± 3.9355 √
Heart-h         81.2358 ± 1.0183   86.2521 ± 2.2837 √
Ionosphere      89.6535 ± 1.6242   88.0357 ± 0.9531
Iris            92.7972 ± 1.0443   96.3635 ± 0.9867 √
Kr-vs-Kp        99.9645 ± 0.4927   96.7368 ± 0.7313
Pima-indians    73.9537 ± 2.3747   77.0558 ± 2.7477 √
Primary-tumor   41.8762 ± 1.7470   47.9435 ± 2.8456 √
Segment         96.2467 ± 1.8946   95.8466 ± 1.7858
Sick-enthyroid  98.1452 ± 0.6376   96.0452 ± 2.3622
Soybean         91.6466 ± 1.2636   93.7373 ± 2.1234 √
Vehicle         71.9366 ± 1.9468   80.2572 ± 1.5671 √
Zoo             92.2562 ± 2.0366   95.6332 ± 3.0564 √

¹ ftp://ftp.ics.uci.edu/pub/machine-learning-databases
The classification performance was evaluated by ten-fold cross-validation in all experiments on each data set. Table 2 shows the classification accuracy and standard deviation for FNT and HDT, respectively. '√' indicates that the accuracy of FNT is higher than that of HDT at a significance level better than 0.05, using a two-tailed pairwise t-test on the results of the 50 trials on a data set. From Table 2, the significant advantage of FNT over HDT in terms of higher accuracy can be clearly seen.
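The significance check can be reproduced with a standard paired t statistic over the matched trial results; the helper below is a plain implementation we supply for illustration, not code from the paper.

```python
import math

def paired_t_test(acc_a, acc_b):
    """Two-tailed pairwise t statistic over matched trials, as used to
    compare FNT with HDT. Compare |t| with the critical value for n-1
    degrees of freedom (about 2.01 at the 0.05 level for 49 d.o.f.)."""
    n = len(acc_a)
    diffs = [x - y for x, y in zip(acc_a, acc_b)]
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # unbiased variance
    return mean / math.sqrt(var / n)
```

With 50 trials per data set, a '√' in Table 2 corresponds to |t| exceeding the 0.05 critical value for 49 degrees of freedom.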
5
Summary
In this paper, a novel machine learning approach named FNT is proposed. By defining the Pure Information Gain I(Xi; ϑ), FNT can directly select the proper test among continuous and discrete attributes, thus overcoming the restrictiveness of other measures derived from information theory, such as the information gain measure. FNT embeds a specific feed-forward neural network in some leaves of a decision tree to simulate human reasoning, in the sense that symbolic learning performs the qualitative analysis and neural learning performs the subsequent quantitative analysis. With respect to the final model, FNT can utilize the information learned from both the decision tree and the neural network to generate accurate and compact classification rules. Although more work remains to be done, our results indicate that FNT constitutes a promising addition to induction algorithms from the viewpoint of classification performance.
References
1. Huan, L., Rudy, S.: Feature Transformation and Multivariate Decision Tree Induction. In: Arikawa, S., Motoda, H. (eds.): Discovery Science. Lecture Notes in Artificial Intelligence, Vol. 1532. Springer-Verlag, Berlin Heidelberg New York (1998) 279-291
2. Masi, G.B., Stettner, D.L.: Bayesian Adaptive Control of Discrete-time Markov Processes with Long Run Average Cost. Int. J. Systems & Control Letters 34(3) (1998) 55-62
3. Hunter, A.: Feature Selection Using Probabilistic Neural Networks. Int. J. Neural Computation & Application 9(2) (2000) 124-132
4. Atlas, L., Cole, R., Uthusamy, M., Lippman, A.: A Performance Comparison of Trained Multi-layer Perceptrons and Trained Classification Trees. In: Zhong, S., Malla, S. (eds.): Proceedings of the IEEE International Conference on Computer Vision, Vol. 78. Osaka, Japan (1990) 1614-1619
5. Setiono, R., Huan, L.: Symbolic Representation of Neural Networks. Int. J. Computer 29(5) (1996) 71-77
6. Zhou, Z.H., Chen, Z.Q.: Hybrid Decision Tree. Knowledge-Based Systems 15(8) (2002) 515-528
7. Dougherty, J.: Supervised and Unsupervised Discretization of Continuous Features. In: Armand, P., Stuart, J.R. (eds.): Proceedings of the 12th International Conference on Machine Learning. Tahoe City, California, USA (1995) 194-201
8. Zhou, Z.H., Chen, S.F., Chen, Z.Q.: FANNC: A Fast Adaptive Neural Network Classifier. Int. J. Knowledge and Information Systems 1(2) (2000) 115-129
A Novel Model of Artificial Immune Network and Simulations on Its Dynamics∗

Lei Wang1, Yinling Nie1, Weike Nie2, and Licheng Jiao2

1 School of Computer Science and Engineering, Xi'an University of Technology, 710048 Xi'an, China
{leiwang, yinlingnie}@xaut.edu.cn
2 School of Electronic Engineering, Xidian University, 710071 Xi'an, China
{wknie, lchjiao}@mail.xidian.edu.cn
Abstract. A novel model of artificial immune network is first presented, and then a simulative study is made of its dynamic behaviors. In this model, a B cell plays the key role of taking antigens in and generating antibodies as its outputs. Under five different kinds of adjustment by suppressor T cells, the number of antibodies is kept at a certain level through influencing the B cell's activation. On the other hand, with help T cells, different B cells can cooperate with each other, which makes the system's dynamic behavior more complex, exhibiting phenomena such as limit cycles and chaos. Simulative results show that limit cycles and chaos may exist simultaneously when four units are connected, and that the network's characteristics are closely related to the intensity of the suppressor T cells' function.
1 Introduction

As we know, the theoretical development of the immune system was initiated by Niels K. Jerne, who introduced us to the mirror halls of immunology. The immune system is visualized somewhat like a gigantic computer in which constant communication and regulation take place among different cells [1]. Jerne postulated that pairs of antibodies, a special kind of proteins produced by the body's immune system that recognize and help fight infectious agents and other foreign substances invading the body, and their mirror images would spontaneously be produced during the development of the immune system, thus creating the possibility of communication networks and regulatory equilibriums; this serves as a model of the immune network. Since then, more and more efforts have been made to use Jerne's theories to set up artificial immune network models (AIN for short in the following). Among them, Zheng Tang proposed a model of a multiple-valued AIN based on the biological immune response network, which simulates the interaction between B cells and T cells in the natural immune system and has properties that resemble the immune response well [2]. Besides, Yoshiteru Ishida described an immune network model for the mutual recognition between antigen and antibody [3]. With the aim of clustering and filtering crude data sets, Leandro Nunes de Castro proposed another kind
∗ This research is supported by the National Science Foundation of China under grant no. 60372045.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 909 – 914, 2006. © Springer-Verlag Berlin Heidelberg 2006
910
L. Wang et al.
of AIN, named aiNet, which is a disconnected graph composed of a set of nodes, called cells or antibodies, and sets of node pairs called edges, with each connected edge assigned a number called its weight, or connection strength. Through theoretical comparisons between aiNet and the well-known artificial neural network, de Castro concluded that there are many similarities between them and that each area has much to gain from the other [4]. From the above research, it seems quite clear that natural immune mechanisms are very important for setting up the corresponding artificial models; however, another crucial factor, namely stability, has been omitted to some extent, even though it is one of the basic rules in designing information processing networks. In view of this limitation of these models, some research has focused on the stability problem of asymmetrical continuous network models [5], investigating the influence of the suppressing function on the system's global stability. However, it is necessary to note that most present research is still limited to investigating conditions of stability around the equilibrium point. It is generally believed that bifurcation and chaotic phenomena may occur in some highly nonlinear artificial immune networks as the parameters vary. However, deep-going research on how to construct such a network has not yet been done, and examples that integrate equilibrium points, limit cycles and chaos into one system or network have seldom been seen. Therefore, some new kind of AIN model is still needed, whose behavior should: 1, integrate equilibrium points, limit cycles, chaos and other characteristics, which can occur respectively with parameter variation in the same model; 2, have a simple structure and convenient realization; and 3, be provided with wider versatility.
2 The Model of Artificial Immune Network

In Tang's model, suppressor T cells (TS) compose a group that takes the role of the suppressing layer in the immune network and has a strictly monotonously increasing or decreasing nonlinear saturation relationship. However, besides the saturation phenomenon, interactions between excitement and suppression also frequently occur in the body's physiological processes, which means an AIN's nonlinear unit can be not only monotonous but non-monotonous as well. In view of this problem, another group, composed of help T cells (TH), is applied in this model for the memory function, which, we conjecture, makes the system appear non-monotonous in cooperation with TS. In fact, the cooperation between TS and TH sets up a feedback channel transferring information from the output layer (antibodies) to the input (antigens). If some antibodies corresponding to a special kind of antigen already exist, other similar antigens will not activate the immune network to the same degree when they invade the system; the system then appears suppressed to some extent. On the contrary, if some new antigens invade the system, then since no corresponding antibodies exist in the output layer, they will activate the immune network immediately and obviously, and the system
Fig. 1. The schematic structure of the artificial immune network model: a, the immune response network in nature, and b, the engineering system model corresponding to the immune response network on its left
appears excited. Based on these considerations, a novel AIN model is designed as follows. In this model, TH takes the role of a transconductance collecting information from the other immune units (Tij is defined as the connection weight of an Abj from another unit to this one), and TS is shown as a voltage-controlled nonlinear resistor simulating the action of antibodies on their target antigens. The differential equations corresponding to Fig. 1 can therefore be expressed as follows:

ci dui/dt = Σ_{j=1, j≠i}^{M} Tij (Abj − ui) − ui/ri + iii + Agi
Abi = Bi(ui),    i = 1, 2, …, N    (1)

For convenience, we let 1/Ri = 1/ri + Σ Tij and iii = Gi(uii) = Gi(Abi − ui); then the equation above becomes

dui/dt = Σ_{j=1, j≠i}^{M} (Tij/ci) Bj(uj) − ui/(ci Ri) + Gi(Bi(ui) − ui) + Agi,    i = 1, 2, …, N    (2)
where Gi(·) is a continuous, differentiable, monotonous function, and Bi(·) is defined as a continuous function which may take several forms, as shown in Fig. 2.
Fig. 2. Five types that a B cell could take, which are respectively: a, monotone-saturation type; b, excitement type; c, suppression type; d, excitement-suppression type; and e, suppression-excitement type
3 Simulation of the AIN's Dynamic Behaviors

Digital simulation of the above model shows that it possesses some very complex characteristics when the network has three- or four-unit connections. In this model, the nonlinear characteristics of Bi(·) are given by

Bi(ui) = ai ui + bi ui³    (3)

When ai > 0, bi > 0, the model belongs to the excitement type; oppositely, ai < 0, bi < 0 means the suppression type, and ai > 0, bi < 0 the excitement-suppression type. On the other hand, the characteristics of TS are defined as

iii = Gi(uii) = ci ki (1 − e^(−uii)) / (1 + e^(−uii))    (4)

where ki denotes the saturated feedback value and ci the feedback intensity; in particular, ci = 0 indicates no feedback and ci = 1 whole feedback.

3.1 Research on Dynamic Behaviors of Three-Order Networks
When a1 = 5, b1 = −1; a2 = 10, b2 = −2; a3 = 8, b3 = −3; ki = 1, ci = 1, only limit cycles occur even though the initial conditions differ, as shown in Fig. 3.
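To see these regimes concretely, equation (2) can be integrated numerically. The sketch below uses simple forward-Euler integration with the cubic B of (3) and the sigmoid feedback of (4); the antigen input Agi is set to zero, and the step size, parameter packaging and function names are our own assumptions rather than anything specified in the paper.

```python
import math

def simulate_ain(a, b, k, c, T, u0, R, dt=0.001, steps=2000):
    """Forward-Euler integration of equation (2) with B_i(u) = a_i u + b_i u^3
    (equation (3)) and the sigmoid feedback G of equation (4). The antigen
    input Agi is taken as zero; returns the trajectory of unit potentials."""
    n = len(u0)
    u = list(u0)
    traj = [tuple(u)]
    B = lambda i, x: a[i] * x + b[i] * x ** 3                               # eq. (3)
    G = lambda i, v: c[i] * k[i] * (1 - math.exp(-v)) / (1 + math.exp(-v))  # eq. (4)
    for _ in range(steps):
        du = []
        for i in range(n):
            coupling = sum(T[i][j] * B(j, u[j]) / c[i] for j in range(n) if j != i)
            leak = u[i] / (c[i] * R[i])
            du.append(coupling - leak + G(i, B(i, u[i]) - u[i]))
        u = [u[i] + dt * du[i] for i in range(n)]
        traj.append(tuple(u))
    return traj
```

Plotting pairs of the returned potentials against each other reproduces phase portraits of the kind shown in Figs. 3-6, with the attractor type depending on the a_i, b_i signs and the feedback intensity c_i.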
Fig. 3. Limit cycles of three-order network, in which, a, u1 =1, u2 =1, u3 =1; b, u1 =1.5, u2 =1.8, u3 =0.5; c, u1 = 1.5, u2 = 4.0, u3 = 0.5
When a1=8, b1= -1; a2=13, b2=-2; a3=6, b3=-3; ki=1, ci=1, chaos appears as in Fig.4.
Fig. 4. Chaos appearing in three-order network, in which, u1 =1, u2 =1, u3 =1
3.2 Research on Dynamic Behaviors of Four-Order Networks
When ki = 1, ci = 1, namely the whole-feedback scenario, and the nonlinear characteristics of all units belong to the excitement-suppression type, the simulation results show that only complicated limit cycles and chaos may occur. For instance, if a1 = 8, b1 = −1; a2 = 12, b2 = −2; a3 = 10, b3 = −3; a4 = 6, b4 = −4, then the corresponding limit cycle occurs as shown in Fig. 5(a). By only changing b1 from −1 to 1, and thereby turning that unit into the excitement type, a chaotic phenomenon appears, as shown in Fig. 5(b).
Fig. 5. Dynamic behaviors of four-order networks, in which, a, limit cycle; b, chaotic phenomenon
3.3 The Influence of Feed-Back Intensity
For easy comparison, the experiment starts from the parameter set for which chaos appears and changes only the feedback intensity c_i. The dynamic loci obtained with c_i = 0, 0.5, 1.0 and 1.2 are shown in Fig. 6. If c_i = 0 (no feedback), only an equilibrium point occurs; if c_i = 0.5, the locus becomes a limit cycle; if c_i = 1, the locus is chaotic; and if c_i = 1.2, the trajectory settles onto a limit cycle after a transient chaotic phase. Finally, it should be pointed out that when the nonlinear characteristic is of the monotonically increasing type, no limit cycle appears for any choice of parameters; the dynamic locus does not even spiral toward the equilibrium point. This shows that networks built from monotonically increasing saturation units have too simple a behavior to model the memory and response processes of the body's immune system.
Fig. 6. Influence of the feed-back intensity on dynamic locus, in which, a, ci=0; b, ci=0.5; c, ci=1; and d, ci=1.2
914
L. Wang et al.
4 Conclusion and Discussion

From the experimental results presented in this paper, we draw the following conclusions: (1) if the nonlinear part of the immune network is of the monotonically increasing saturation type, equilibrium points are the only attracting sets; (2) if it is of the excitement-suppression type, limit cycles may appear but chaos is unlikely to occur; (3) when most units are of the excitement-suppression type and the rest are of the excitement type, chaos may appear, which offers a simple way to design systems that produce chaos; (4) the complexity of the dynamic behavior of the artificial immune network is closely related to the intensity of the suppressor T cells. The calculations show that without the suppressant part the dynamic locus can be very simple, but as the suppressor T cells' intensity increases the loci tend toward complication and eventually chaos. The AIN model presented in this paper combines equilibrium points, limit cycles and chaos in one network, with a simple structure and easy operation. The simulation results suggest that, as the number of units increases, the system's behavior will become more varied and more complex. This offers a practical way to improve the intelligence of artificial immune networks.
References

1. Jerne, N.K.: The Generative Grammar of the Immune System. In: Wigzell, H. (eds.): Presentation Speech to Nobel Prize in Physiology or Medicine (1984) http://nobelprize.org/medicine/laureates/1984/presentation-speech.html
2. Xu, X.S., Wang, J.H., Tang, Z., Chen, X.M., Li, Y., Xia, G.P.: A Method to Improve the Transiently Chaotic Neural Network. In: Yin, F.L., Wang, J., Guo, C.G. (eds.): Advances in Neural Networks. Lecture Notes in Computer Science, Vol. 3173. Springer-Verlag, Berlin Heidelberg (2004) 401-405
3. Watanabe, Y., Ishida, Y.: Mutual Tests Using Immunity-based Diagnostic Mobile Agents in Distributed Intrusion Detection Systems. Journal of Artificial Life and Robotics 8(2) (2004) 163-167
4. De Sousa, J.S., Gomes, L. de C.T., Bezerra, G.B., de Castro, L.N., Von Zuben, F.J.: An Immune-Evolutionary Algorithm for Multiple Rearrangements of Gene Expression Data. Genetic Programming and Evolvable Machines 5(1) (2004) 157-179
5. Stepney, S., Smith, R., Timmis, J., Tyrrell, A., Neal, M., Hone, A.: Conceptual Frameworks for Artificial Immune Systems. In: Nicosia, G., et al. (eds.): Proceedings of International Conference on Artificial Immune Systems. Lecture Notes in Computer Science, Vol. 3239. Springer-Verlag, Berlin Heidelberg (2004) 53-64
A Kernel Optimization Method Based on the Localized Kernel Fisher Criterion Bo Chen, Hongwei Liu, and Zheng Bao National Lab of Radar Signal Processing, Xidian University, Xi’an, Shaanxi, 710071, P.R. China [email protected]
Abstract. It is widely recognized that how well the selected kernel matches the data determines the performance of kernel-based methods. Ideally the data would be linearly separable in the kernel-induced feature space, so the Fisher linear discriminant criterion can serve as a kernel optimization rule. In many applications, however, the data are not linearly separable even after the kernel transformation; a nonlinear classifier is then preferred, and the Fisher criterion is clearly not the best choice of kernel optimization rule. Motivated by this issue, in this paper we present a novel kernel optimization method that maximizes the local class linear separability in kernel space, increasing the local margins between the embedded classes via a localized kernel Fisher criterion and thereby improving the performance of nonlinear classifiers in the kernel-induced feature space. Extensive experiments are carried out to evaluate the efficiency of the proposed method.
1 Introduction

Kernel-based algorithms transform the input vector x, through a nonlinear function Φ(x), into a higher-dimensional space F in which only inner products need to be computed (via a kernel function):

K(x, y) = ⟨Φ(x), Φ(y)⟩ .    (1)
The choice of the right embedding is crucial, since different kernels create different data-distribution structures in the embedding space. The ability to assess the quality of an embedding is hence an important task for kernel machines. Recently Xiong et al. [1] proposed a method for optimizing the kernel function that uses the Fisher criterion to maximize class separability in the feature space, based on the data-dependent model [2]. Because it relies on the Fisher criterion, however, it carries the important premise that the data points are linearly separable in the kernel-induced feature space. Unfortunately this premise does not always hold; for some applications it is very difficult to find a kernel that makes the data points linearly separable in the embedding space. In this paper, we optimize the kernel by maximizing the class margin rather than the class linear separability.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 915 – 921, 2006. © Springer-Verlag Berlin Heidelberg 2006
Given two classes of data samples that are not linearly separable, the marginal local parts of the different classes are generally linearly separable; in other words, the classes can be regarded as consisting of multiple linearly separable subclasses. Motivated by this, we optimize the kernel by employing the Fisher criterion within each subclass to maximize the local class margin, which we refer to as the localized kernel Fisher criterion (LKFC). Since the empirical kernel map is exactly the KPCA transform [3] and the principal component space preserves the geometrical structure of {Φ(x_i)} when the kernel matrix K is centered, to compare the performance of different kernel optimization methods we first optimize the kernel by LKFC and then use the optimized kernel for KPCA feature extraction. Classification performance is evaluated with LSVM (linear SVM), GSVM (Gaussian-kernel SVM) and GMM (Gaussian mixture model) on synthetic and benchmark data sets.
2 Kernel Optimization Based on LKFC

As noted above, a data-dependent objective function is needed to evaluate kernel functions. The method in [1] employs a data-dependent kernel, similar to that in [2], as the kernel to be optimized:

k(x, y) = q(x) q(y) k_0(x, y) ,    (2)

where k_0(x, y), called the basic kernel, is an ordinary kernel such as a Gaussian kernel K(x, y) = exp(−‖x − y‖² / σ²), and q(⋅) is a factor function of the form

q(x) = α_0 + Σ_{i=1}^{n} α_i k_1(x, a_i) ,    (3)

where k_1(x, a_i) = exp(−‖x − a_i‖² / σ_1²), the {a_i} are called the "empirical cores," and the α_i are combination coefficients that need normalizing. Then, according to Theorem 1 in [1], the Fisher criterion in the kernel-induced feature space can be represented as

J(α) = tr(S_b) / tr(S_w) = (1_m^T B 1_m) / (1_m^T W 1_m) = (q^T B_0 q) / (q^T W_0 q) ,    (4)

where 1_m is the vector of ones of length m, α = (α_0, α_1, ..., α_n)^T, q = (q(x_1), q(x_2), ..., q(x_m))^T, and B and W are the between-class and within-class kernel scatter matrices, respectively. An updating equation for maximizing the class separability J through the standard gradient approach is then

α^(n+1) = α^(n) + η ( (K_1^T B_0 K_1) / (q^T W_0 q) − J(α^(n)) (K_1^T W_0 K_1) / (q^T W_0 q) ) α^(n) ,    (5)
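As a concrete sketch, the data-dependent kernel of Eqs. (2)-(3) can be assembled directly from Gram matrices. The cores and coefficients below are placeholders for the values the gradient rule (5) would tune.

```python
import numpy as np

def gaussian(X, Y, sigma2):
    """Gaussian kernel matrix exp(-||x - y||^2 / sigma^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def data_dependent_kernel(X, cores, alpha, sigma2=1.0, sigma2_1=1.0):
    """Build k(x, y) = q(x) q(y) k0(x, y) from Eqs. (2)-(3),
    with q(x) = alpha_0 + sum_i alpha_i k1(x, a_i)."""
    K0 = gaussian(X, X, sigma2)            # basic kernel k0
    K1 = gaussian(X, cores, sigma2_1)      # k1(x, a_i) evaluations
    q = alpha[0] + K1 @ alpha[1:]          # factor function q(x)
    return np.outer(q, q) * K0, q          # k = q q^T (elementwise) k0

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
cores = X[:5]                              # 5 "empirical cores"
alpha = np.full(6, 1.0 / 6.0)              # normalized coefficients (placeholder)
K, q = data_dependent_kernel(X, cores, alpha)
```

Since K = diag(q) K0 diag(q), the data-dependent kernel matrix stays symmetric positive semi-definite whenever k_0 is a valid kernel.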
where η(t) = η_0 (1 − t/T), η_0 is the initial learning rate, T is the total number of iterations, and t is the current iteration. The detailed algorithm can be found in [1]. The experimental results in [1] show that the method can improve the performance of kernel machines. However, this kernel optimization works under the given kernel model (2), and the optimization process only chooses the kernel parameters; it cannot guarantee that, for an arbitrary application, the data can be linearly separated in the feature space under the given kernel structure and optimization procedure. In that case the Fisher criterion is not applicable, since it only measures class linear separability. If the data points are nonlinearly separable in the feature space, a nonlinear classifier should be expected to outperform a linear one. To illustrate this, we use a two-dimensional two-class toy data set [4] (250 training and 1000 test samples) whose Bayes test error is 8%. KPCA with a Gaussian kernel was used to extract features, and LSVM and GSVM were used as classifiers. The value of σ for the Gaussian kernel of KPCA was selected by 5-fold cross-validation (CV). A Gaussian kernel with the same σ_0 as the σ for KPCA is used as the basic kernel in Xiong's method, and the value of σ_1 for the function k_1(⋅,⋅) in (3) is also selected by 5-fold CV. 50 local centers are selected to form the empirical core set {a_i}. The initial learning rate η_0 was set to 0.01 and the total iteration number T to 200. The results with a randomly chosen 200-example training set are shown in Fig. 1. The test error for KPCA with GSVM (9.4%) is better than with LSVM (10.7%), and the test error for optimized KPCA with GSVM (9.2%) is better than with LSVM (9.6%). These results show that the embedded data points are still nonlinearly related, even though the KPCA optimized by the method of [1] outperforms the original KPCA. Under this condition the Fisher criterion, a linear separability measure, is not suitable for directly evaluating the separability of the classes.
Fig. 1. (a) The original training set. (b) The training set in the empirical feature space. (c) The training set in the empirical feature space after Xiong's kernel optimization.
For a nonlinearly separable case, the data can be regarded as consisting of multiple linearly separable subclasses; we can therefore apply the Fisher criterion within each subclass to maximize the separability and thus increase the overall between-class margin. The idea is illustrated in Fig. 2. Fig. 2(a) shows the distribution of the two classes
in the embedding space, both of which are clearly not linearly separable. Because the Euclidean distance in the kernel-induced feature space can be calculated as ‖Φ(x_i) − Φ(x_j)‖² = k(x_i, x_i) + k(x_j, x_j) − 2 k(x_i, x_j), many linear clustering methods can be kernelized; our algorithm applies the kernel k-means clustering method, by which each class is clustered into three subclasses as shown in Fig. 2(b). Any two nearest embedded subclasses belonging to different classes are then linearly separable, and the Fisher criterion can be directly employed to evaluate their separability. Motivated by this, we propose a kernel optimization method based on the localized kernel Fisher criterion that augments the margins between local parts, so that the data points in the optimized kernel-induced feature space are easier for nonlinear classifiers to classify.
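A minimal sketch of the kernel k-means step follows, relying only on kernel evaluations via the distance identity above; the deterministic initialization is an assumption, since the paper does not specify one.

```python
import numpy as np

def kernel_kmeans(K, labels, n_clusters=2, iters=20):
    """Kernelized k-means assignment: the distance from Phi(x_i) to the
    mean of cluster C needs only kernel values,
    ||Phi(x_i) - m_C||^2 = K_ii - (2/|C|) sum_{j in C} K_ij
                          + (1/|C|^2) sum_{j,l in C} K_jl."""
    n = K.shape[0]
    for _ in range(iters):
        dist = np.empty((n, n_clusters))
        for c in range(n_clusters):
            idx = np.flatnonzero(labels == c)
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 0.3, (15, 2)), rng.normal(3, 0.3, (15, 2))])
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 4.0)
init = np.r_[np.zeros(15, dtype=int), np.ones(15, dtype=int)]
init[0], init[-1] = 1, 0                   # perturb the initial assignment
labels = kernel_kmeans(K, init)
```

On two well-separated Gaussian blobs the iteration corrects the perturbed assignment, putting each blob into its own cluster without ever touching explicit feature-space coordinates.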
Fig. 2. (a) The distribution of two classes in the embedding space. (b) The linear relations of subclasses of different classes.
Therefore the modified kernel optimization criterion is defined as the summation of all localized kernel Fisher scalars:

J(α) = Σ_{i=1}^{N·C} ω_i J_i(α) = Σ_{i=1}^{N·C} ω_i (1_m^T B^i 1_m) / (1_m^T W^i 1_m) = Σ_{i=1}^{N·C} ω_i (q_i^T B_0^i q_i) / (q_i^T W_0^i q_i) ,    (6)
where N is the number of classes, C is the number of subclasses of each class, and J_i(α) is the kernel Fisher criterion of the i-th couple of subclasses, which are nearest in the embedding space; B^i, W^i and q_i correspond to the i-th couple of subclasses, and ω_i is a real-valued weight. As shown in Fig. 2(b), what we want to optimize are the subclasses on the margin, such as the 1st, 3rd, 4th and 6th subclasses, rather than the 2nd and 5th; ω_i therefore controls the contribution of J_i(α) to the optimization and is defined as

ω_i = min(D)/D(i)  if min(D)/D(i) ≥ Th ;  ω_i = 0  if min(D)/D(i) < Th ,    (7)
where Th is a given threshold and D(i) represents the distance between the two subclasses of the i-th couple, defined in the feature space as
D(i) = ‖ (1/m) Σ_{j=1}^{m} Φ(x_ij) − (1/n) Σ_{j=1}^{n} Φ(y_ij) ‖ ,    (8)
where x_ij and y_ij belong to the two subclasses of the i-th couple. The equation can also be rewritten in kernel form:

D(i)² = (1/m²) Σ_{j=1}^{m} Σ_{k=1}^{m} k(x_ij, x_ik) + (1/n²) Σ_{j=1}^{n} Σ_{k=1}^{n} k(y_ij, y_ik) − (2/(mn)) Σ_{j=1}^{m} Σ_{k=1}^{n} k(x_ij, y_ik) .    (9)

Now, according to (4) and (5), our updating equation for maximizing the summation of the localized kernel Fisher criteria, J(α), is obtained as

α^(n+1) = α^(n) + η Σ_{i=1}^{N·C} ω_i ( (K_{i1}^T B_0^i K_{i1}) / (q_i^T W_0^i q_i) − J_i(α^(n)) (K_{i1}^T W_0^i K_{i1}) / (q_i^T W_0^i q_i) ) α^(n) .    (10)
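The pair weights can be computed directly from Gram matrices. The sketch below evaluates the kernel form of the subclass-mean distance as matrix means, applies the thresholded weight rule to a vector of pair distances, and checks the kernel form against the explicit mean distance under a linear kernel.

```python
import numpy as np

def subclass_distance_sq(Kxx, Kyy, Kxy):
    """Squared feature-space distance between two subclass means,
    Eq. (9): with the double sums written as matrix means,
    D^2 = mean(Kxx) + mean(Kyy) - 2 * mean(Kxy)."""
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

def margin_weights(D, th):
    """Weights of Eq. (7): w_i = min(D)/D(i) when the ratio reaches the
    threshold Th, else 0, so only near-margin subclass pairs contribute."""
    ratio = D.min() / D
    return np.where(ratio >= th, ratio, 0.0)

# sanity check with the linear kernel k(x, y) = x.y, where D^2 must equal
# the squared Euclidean distance between the two subclass means
rng = np.random.default_rng(2)
Xs = rng.normal(0.0, 1.0, (8, 3))      # one subclass (8 points in R^3)
Ys = rng.normal(2.0, 1.0, (6, 3))      # its nearest counterpart
D2 = subclass_distance_sq(Xs @ Xs.T, Ys @ Ys.T, Xs @ Ys.T)
explicit = ((Xs.mean(axis=0) - Ys.mean(axis=0)) ** 2).sum()
```

For distances D = (1, 2, 4) and Th = 0.4, the rule yields weights (1, 0.5, 0): the closest pair dominates and the farthest pair is switched off.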
Fig. 3. Summary of our kernel optimization method
A summary of our method is shown in Fig. 3. Fig. 4 shows the projection of the training and test sets of Ripley's data set in the empirical feature space after our kernel optimization. Except that the number of subclasses is set to 2, the parameters are the same as those of Xiong's kernel optimization. We use GSVM and LSVM classifiers to evaluate the classification performance. The test error of our method with GSVM (8.2%) demonstrates the improved classification performance, and it also outperforms LSVM (9.0%), which again indicates the nonlinear relations between the embedded data points.
Fig. 4. The projection of the training set in the feature space after our kernel optimization
3 Experimental Results on the UCI Data Sets

Our method has also been tested on eight benchmark data sets, which can be downloaded from the UCI benchmark repository [5].
As above, the kernel optimization methods are applied to KPCA, which preserves the structure of the data set in the empirical feature space. Linear (LSVM) and nonlinear (GSVM with C = 10^8, and GMM) classifiers are used to evaluate classification performance. First, the kernel parameters for the Gaussian kernels of KPCA (without kernel optimization) and GSVM are selected by 10-fold CV. A Gaussian kernel with the same σ_0 as the σ for KPCA is then used as the basic kernel in Xiong's method; σ_1 in (3) and the kernel parameter for GSVM are also selected by 10-fold CV. 20 local centers are selected to form the set {a_i}. The initial learning rate η_0 is set to 0.01 and the total iteration number T to 200. In our method, except that the number of subclasses is set to 2, the procedure for selecting the other parameters is the same as in Xiong's method. The classification performance of each method on the UCI data sets is listed in Table 1, where bold recognition rates indicate the best performance. Evidently the proposed method achieves better performance with the nonlinear classifiers than with the linear one, which again indicates the nonlinear relations of the data points in the kernel-induced feature space.

Table 1. Test classification performance comparison on the 8 data sets, where LS, GS and GM stand for the LSVM, GSVM and GMM classifiers, and KFC and LKFC correspond to the kernel optimization methods based on KFC and LKFC
Data set    |     KPCA (%)     | KPCA with KFC (%) | KPCA with LKFC (%)
            |  LS    GM    GS  |  LS    GM    GS   |  LS    GM    GS
Ionosphere  | 92.9  93.0  93.8 | 93.8  93.9  94.8  | 94.7  94.9  97.2
Monks1      | 80.8  90.1  89.8 | 81.5  92.2  92.9  | 94.1  95.1  98.1
Monks2      | 51.0  77.8  84.5 | 52.0  79.3  85.0  | 54.0  80.5  86.2
Monks3      | 82.6  93.6  92.5 | 90.0  94.5  97.0  | 91.5  95.7  97.4
Pima        | 75.5  80.3  76.2 | 74.1  81.8  78.5  | 74.9  76.7  82.0
Liver       | 71.5  66.0  73.8 | 73.7  67.0  74.1  | 75.0  67.5  76.6
WDBC        | 92.2  89.0  93.8 | 93.4  91.7  94.1  | 93.8  92.5  94.1
WBC         | 89.4  90.2  95.2 | 90.7  90.3  97.4  | 90.6  90.3  98.2
4 Conclusions

In this paper a novel kernel optimization algorithm based on the localized kernel Fisher criterion is proposed. It maximizes the margins between local parts of different classes in the kernel-induced feature space so that the overall margin between classes, and hence the classification performance, can be improved. Experimental results on several benchmark data sets show that the proposed method is efficient.
Acknowledgement This work is supported by the National Science Foundation of China (No. 60302009).
References

1. Xiong, H.L., Swamy, M.N.S., Ahmad, M.O.: Optimizing the Kernel in the Empirical Feature Space. IEEE Trans. Neural Networks 16(2) (2005) 460-474
2. Amari, S., Wu, S.: Improving Support Vector Machine Classifiers by Modifying Kernel Functions. Neural Networks 12(6) (1999) 783-789
3. Ruiz, A., Lopez-de-Teruel, P.E.: Nonlinear Kernel-based Statistical Pattern Analysis. IEEE Trans. Neural Networks 12(1) (2001) 16-32
4. Ripley, B.D.: Pattern Recognition and Neural Networks. Cambridge Univ. Press (1996)
5. Blake, C., Keogh, E., Merz, C.J.: UCI Repository of Machine Learning Databases. Dept. Inform. Comput. Sci., Univ. California, Irvine, CA (1998) [Online] Available: http://www.ics.uci.edu/mlearn
Genetic Granular Kernel Methods for Cyclooxygenase-2 Inhibitor Activity Comparison

Bo Jin and Yan-Qing Zhang

Department of Computer Science, Georgia State University, Atlanta, GA 30302-3994, USA
[email protected], [email protected]
Abstract. How to design powerful and flexible kernels to improve the system performance is an important topic in kernel based classification. In this paper, we present a new granular kernel method to improve the performance of Support Vector Machines (SVMs). In the system, genetic algorithms (GAs) are used to generate feature granules and optimize them together with fusions and parameters of granular kernels. The new granular kernel method is used for cyclooxygenase-2 inhibitor activity comparison. Experimental results show that the new method can achieve better performance than SVMs with traditional RBF kernels in terms of prediction accuracy.
1 Introduction

In the last decade, kernel methods [1][2][3][4] have been widely used in data classification and pattern recognition. How to design powerful and flexible kernels to improve system performance has become a hot topic in kernel-based learning. Traditional kernels such as RBF kernels and polynomial kernels do not consider feature relationships but simply treat all features as one unit in vector operations. With the growing interest in complicated classification problems such as biological and chemical data prediction, more powerful and more flexible kernels are needed. In [5], we presented the concept of granular kernel computing and proposed a hierarchical design method that integrates prior knowledge, such as object structures and feature relationships, into the kernel design: features within an input vector are grouped into feature granules according to the composition and structure of each data item, the similarity between two feature granules is measured by a granular kernel, and granular kernels are fused together by hierarchical trees. It is natural to generate feature granules from prior knowledge, but sometimes it is not easy to decide what kind of knowledge is useful. For such complex problems, we present a new granular kernel method to improve the performance of SVMs, in which GAs are used to generate feature granules and optimize them together with the fusions and parameters of the granular kernels.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 922 – 927, 2006. © Springer-Verlag Berlin Heidelberg 2006

In
applications, the new method is used for cyclooxygenase-2 inhibitor activity comparison. Experimental results show that the new system achieves better performance than GA-based SVMs with traditional RBF kernels in terms of prediction accuracy. It should be mentioned that Haussler [6] first gave a detailed treatment of decomposition-based kernel design and proposed convolution kernels; many decomposition-based kernels were then designed, such as string kernels [7][8] and tree kernels [9][10]. However, high-dimensional feature mapping and the necessary optimization were not provided in many of these kernels. The rest of the paper is organized as follows. Section 2 describes feature granules and granular kernels. The new genetic granular kernel method is presented in Section 3. Section 4 shows the experiments on cyclooxygenase-2 inhibitor activity comparison. Finally, Section 5 gives conclusions and future work.
2 Feature Granules and Granular Kernels

Definition 1. A feature granule space G of the input space X = R^n is a subspace of X, where G = R^m and 1 ≤ m ≤ n.

Definition 2. A feature granule g is a vector defined in a feature granule space G.

Definition 3. A granular kernel gK is a kernel that, for feature granules g, g' ∈ G, satisfies

gK(g, g') = ⟨φ(g), φ(g')⟩ ,    (1)

where φ is a mapping from the feature granule space to an inner product feature space R^E:

φ : g ↦ φ(g) ∈ R^E .    (2)
Granular Kernel Properties. Let X = R^n be the input space, G1 a feature granule space of X, and gK1, gK1' granular kernels over G1 × G1. Granular kernels are closed under sum, product, and multiplication by a positive constant over G1 × G1; that is, the following gK(g, g') are also granular kernels:

gK(g, g') = gK1(g, g') + gK1'(g, g') ,    (3)

gK(g, g') = gK1(g, g') × gK1'(g, g') ,    (4)

gK(g, g') = c × gK1'(g, g') ,    (5)
where g, g' ∈ G1 and c ∈ R^+. This property follows from the kernel closure properties introduced in [7]. Let G2 be another feature granule space of X with g2, g2' ∈ G2, and let gK2 be a granular kernel over G2 × G2. Then gK1(g1, g1') + gK2(g2, g2') and gK1(g1, g1') × gK2(g2, g2') are also kernels, since the corresponding kernel matrices are still symmetric and positive semi-definite. According to the granular kernel properties, we can define kernels with various structures, such as the granular kernel tree in Fig. 1, in which the connection node P1 may perform a sum or product operation and a positive connection weight may be associated with an edge.
Fig. 1. Granular kernel tree
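The closure properties above can be checked numerically: split each input into two hypothetical granules, build an RBF granular kernel on each, and verify that the sum, the elementwise (Schur) product, and a positive scaling of the Gram matrices stay symmetric positive semi-definite. The granule split below is an illustrative assumption.

```python
import numpy as np

def rbf(X, sigma2=1.0):
    """RBF Gram matrix on one feature granule."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 6))
g1, g2 = X[:, :2], X[:, 2:]          # two assumed feature granules per input
gK1, gK2 = rbf(g1), rbf(g2)          # a granular kernel on each granule

# closure under sum, Schur product and positive scaling, Eqs. (3)-(5)
for K in (gK1 + gK2, gK1 * gK2, 2.5 * gK1):
    assert np.allclose(K, K.T)
    assert np.linalg.eigvalsh(K).min() > -1e-9
```

The Schur product check is what licenses the product nodes of a granular kernel tree: the elementwise product of two PSD Gram matrices is again PSD.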
3 New Genetic Granular Kernel Method

In the system, the number of feature granules m and the granular kernel functions gK_i, 1 ≤ i ≤ m, are predefined. We use G_j to denote the j-th generation, where 1 ≤ j ≤ g and g is the total number of generations. Each generation has p chromosomes c_ij, i = 1, ..., p, and each chromosome c_ij has q genes g_t(c_ij). We use FGD(x, y) to represent the degree to which feature y belongs to feature granule x, where 1 ≤ x ≤ m, 1 ≤ y ≤ n and n is the total number of features: FGD(x, y) = 1 if feature y belongs to feature granule x, and 0 otherwise. Each individual has a table (see Table 1) recording all current FGD(x, y) values. Feature degrees and kernel parameters are represented by genes; the kernel parameters include the parameters of the granular kernels, the operations of the connection nodes, and the connection weights. For each feature, the degree for a granule is set to 1 if its related gene is larger than the genes of the feature's other degrees. For each connection node, the operation is set to sum if the related gene is larger than 0.5, and to product otherwise. The initial population is randomly generated; in each generation, chromosomes are evaluated, a new population is selected, and some chromosomes are modified by mutation and crossover.
Table 1. Feature granule table

Feature Granules | Feature 1 | Feature 2 | ... | Feature n
        1        |     0     |     0     | ... |     1
        2        |     1     |     0     | ... |     0
       ...       |    ...    |    ...    | ... |    ...
        m        |     0     |     1     | ... |     0
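A small sketch of the decoding step just described; treating the m·n granule-assignment genes as one flat block of the chromosome is an assumed layout, not one fixed by the paper.

```python
import numpy as np

def decode_fgd(genes, m, n):
    """Decode a chromosome's granule-assignment genes into the FGD table
    of Table 1: feature y is assigned to the granule x whose related gene
    is largest, so FGD(x, y) = 1 for exactly one x per feature."""
    g = np.asarray(genes, dtype=float)[: m * n].reshape(m, n)
    fgd = np.zeros((m, n), dtype=int)
    fgd[g.argmax(axis=0), np.arange(n)] = 1
    return fgd

def decode_op(gene):
    """Connection-node operation: sum if the gene exceeds 0.5, else product."""
    return "sum" if gene > 0.5 else "product"

rng = np.random.default_rng(4)
m, n = 4, 10                          # 4 feature granules, 10 features
fgd = decode_fgd(rng.random(m * n), m, n)
```

Because the decoding takes an argmax over granules, every column of the FGD table sums to exactly 1, matching the definition of FGD(x, y).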
In fitness evaluation, we use k-fold cross-validation to evaluate SVM performance on the training data set. The training set S̃ is separated into k mutually exclusive subsets S̃_v. For v = 1, ..., k, the data set Λ_v is used to train the SVM and S̃_v is used to evaluate the model:

Λ_v = S̃ − S̃_v ,  v = 1, ..., k .    (6)

After k rounds of training and testing on the different subsets, we obtain k prediction accuracies. The fitness f_ij of chromosome c_ij is calculated by Eq. (7), where Acc_v is the prediction accuracy on S̃_v:

f_ij = (1/k) Σ_{v=1}^{k} Acc_v .    (7)
In the selection operation, the roulette wheel method described in [11] is used to select individuals for the new population. In crossover, two chromosomes are selected randomly from the current generation as parents, a crossover point is randomly chosen, and parts of the chromosomes are exchanged between the parents to generate two children. In mutation, some chromosomes are randomly selected, some genes are randomly chosen from each selected chromosome, and the values of the mutated genes are replaced by random numbers.
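The fitness evaluation and the roulette-wheel selection can be sketched as follows; the per-fold accuracies below are hypothetical values, not results from the paper.

```python
import numpy as np

def fitness(cv_accuracies):
    """Fitness of a chromosome as the mean of its k cross-validation
    accuracies, following the averaging rule above."""
    return float(np.mean(cv_accuracies))

def roulette_select(fitnesses, size, rng):
    """Roulette-wheel selection [11]: each chromosome is drawn with
    probability proportional to its fitness."""
    p = np.asarray(fitnesses, dtype=float)
    p = p / p.sum()
    return rng.choice(len(p), size=size, p=p)

rng = np.random.default_rng(5)
pop_fitness = np.array([fitness([0.58, 0.62, 0.60]),   # hypothetical CV runs
                        fitness([0.74, 0.76, 0.75]),
                        fitness([0.89, 0.91, 0.90])])
picks = roulette_select(pop_fitness, size=3000, rng=rng)
counts = np.bincount(picks, minlength=3)
```

Over many draws the fittest chromosome is selected most often while weaker ones still survive with some probability, which is the point of the roulette wheel over plain truncation selection.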
4 Experiments

In the experiments, SVMs based on the new method are compared to SVMs with traditional RBF kernels (Eq. (8)), which are also optimized by GAs:

K(x, z) = exp(−γ ‖x − z‖²) .    (8)

4.1 Dataset and Experimental Setup

The dataset of cyclooxygenase-2 inhibitors includes 314 compounds, described in [12]. The log(IC50) cutoff used to discriminate active from inactive compounds is set to 2.5, giving 153 active and 161 inactive compounds. 109 features are selected to describe each compound, and each feature's absolute value is scaled to the range [0, 1]. To make a fair
comparison between the two kinds of systems, 3-fold cross-validation is used: the dataset is randomly shuffled and evenly split into 3 mutually exclusive parts; each time one part is chosen as the unseen testing set and the other two as the training set, and the three comparison results together decide which system is better. In the training phase of each system, we also use 3-fold cross-validation to evaluate that system's performance on the training set. In the experiments, granular kernel trees with 2, 4, 6, 8, 10 and 12 feature granules are evaluated, with the traditional RBF kernel chosen as each granular kernel function. For simplicity, the connection weights are always set to 1. The range of every RBF's γ is set to [0.00001, 1] and that of the regularization parameter C to [1, 256]. The crossover probability is 0.8 and the mutation ratio 0.2. The population size is set to 300 and the number of generations to 50. The SVM software package used in the experiments is LibSVM [13].

4.2 Experimental Results and Comparisons
The average 3-fold cross-validation accuracies are shown in Table 2, where #FG is the number of feature granules, Tr_acc is the training accuracy and Te_acc is the testing accuracy. The results with one feature granule are those of SVMs with traditional RBF kernels. From Table 2 we can see that SVMs with granular kernel trees almost always outperform SVMs with the traditional RBF kernel in terms of fitness, training accuracy and testing accuracy. In particular, when the number of granules is 4, 6 or 8, the testing accuracies of the new system are higher than those of SVMs with the traditional RBF kernel by 5.1%-5.7%.

Table 2. Prediction accuracy in average
#FG     |   1    |   2    |   4    |   6    |   8    |  10    |  12
Fitness | 81.9%  | 81.5%  | 82.2%  | 82%    | 82.2%  | 82.3%  | 83%
Tr_acc  | 94.4%  | 94.3%  | 96.2%  | 96.7%  | 96.5%  | 91.6%  | 94.6%
Te_acc  | 72.3%  | 72%    | 77.4%  | 78%    | 77.4%  | 74.9%  | 73.3%
5 Conclusion and Future Work

In this paper we present a new genetic granular kernel method in which GAs generate and optimize feature granules, together with granular kernel parameters and fusions, automatically and without the help of prior knowledge. The new method is applied to cyclooxygenase-2 inhibitor classification, and the experimental results show that the new system achieves better performance than SVMs with traditional RBF kernels. In the future, we will continue our research on soft feature granule partition and feature relationship discovery for more complex biomedical applications.
Acknowledgment Special thanks to Dr. Peter C. Jurs and Dr. Rajarshi Guha for providing the dataset of cyclooxygenase-2 inhibitors. This work is supported in part by NIH under P20
GM065762. Bo Jin is supported by Molecular Basis for Disease (MBD) Doctoral Fellowship Program, Georgia State University.
References

1. Boser, B., Guyon, I., Vapnik, V.N.: A Training Algorithm for Optimal Margin Classifiers. Proc. Fifth Annual Workshop on Computational Learning Theory (1992) 144-152
2. Cortes, C., Vapnik, V.N.: Support-Vector Networks. Machine Learning 20(3) (1995) 273-297
3. Vapnik, V.N.: Statistical Learning Theory. John Wiley and Sons, New York (1998)
4. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press (2004)
5. Jin, B., Zhang, Y.-Q., Wang, B.: Evolutionary Granular Kernel Trees and Applications in Drug Activity Comparisons. IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (2005) 121-126
6. Haussler, D.: Convolution Kernels on Discrete Structures. Technical Report UCSC-CRL-99-10, Department of Computer Science, University of California at Santa Cruz (1999)
7. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, New York (1999)
8. Lodhi, H., Shawe-Taylor, J., Cristianini, N., Watkins, C.: Text Classification Using String Kernels. In: Leen, T., Dietterich, T., Tresp, V. (eds.): Advances in Neural Information Processing Systems, Vol. 13. MIT Press (2001)
9. Collins, M., Duffy, N.: Convolution Kernels for Natural Language. In: Dietterich, T.G., Becker, S., Ghahramani, Z. (eds.): Advances in Neural Information Processing Systems, Vol. 14. MIT Press (2002)
10. Kashima, H., Koyanagi, T.: Kernels for Semi-Structured Data. Proceedings of the Nineteenth International Conference on Machine Learning (2002) 291-298
11. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. 3rd edn. Springer-Verlag, Berlin Heidelberg New York (1996)
12. Kauffman, G.W., Jurs, P.C.: QSAR and K-Nearest Neighbor Classification Analysis of Selective Cyclooxygenase-2 Inhibitors Using Topologically-Based Numerical Descriptors. J. Chem. Inf. Comput. Sci. 41(6) (2001) 1553-1560
13. Hsu, C.W., Chang, C.C., Lin, C.J.: A Practical Guide to Support Vector Classification. Department of Computer Science and Information Engineering, National Taiwan University (2001)
Support Vector Machines with Beta-Mixing Input Sequences
Luoqing Li and Chenggao Wan
Faculty of Mathematics and Computer Science, Hubei University, Wuhan 430062, China
{lilq, wanchenggao}@hubu.edu.cn
Abstract. This note focuses on a theoretical analysis of support vector machines with beta-mixing input sequences. Explicit bounds are derived on the rate at which empirical means converge to their true values when the underlying process is beta-mixing. The uniform convergence approach is then used to estimate the convergence rates of support vector machine algorithms with beta-mixing inputs.
1 Introduction
Automatic learning from examples is a longstanding goal in many fields of science. Motivated by the classification algorithms of Fisher [9], Rosenblatt [13], and Vapnik [16], support vector machines (SVMs) were introduced by Boser, Guyon and Vapnik [2] with polynomial kernels, and by Cortes and Vapnik [5] with general kernels. Since then there has been a rich body of research on SVMs, much of it devoted to understanding how high-performance learning may be achieved. SVMs are a widely used technique for classification and regression, gaining popularity due to their attractive features and promising performance, and they have been successfully applied to a number of real-world problems such as handwritten character and digit recognition, text categorization, and face and object detection in machine vision. For tutorial introductions and overviews of recent developments in SVM classification and regression, see Burges [3], Smola and Schölkopf [14], Vapnik [17], and Cristianini and Shawe-Taylor [6].
Up to now, almost all research on SVMs has supposed that the training samples are independent and identically distributed (i.i.d.) according to some unknown probability distribution. However, independence is a very restrictive concept in several ways. In applications such as the learning of dynamical systems, the assumption that the underlying sample sequence is i.i.d. cannot be justified; it is often an assumption rather than a deduction based on observations. It is also an "all or nothing" property: two random variables are either independent or they are not, and the definition does not permit an intermediate notion of being "nearly" independent. The notion of mixing puts "near independence" on a firm mathematical foundation and, moreover, permits one to derive a "robust" theory by proving that most of the desirable properties of i.i.d. stochastic processes
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 928–935, 2006. © Springer-Verlag Berlin Heidelberg 2006
Support Vector Machines with Beta-Mixing Input Sequences
929
are preserved when the underlying process is mixing. For learning theory with mixing input sequences we refer to [18] for details. In this note we make contributions to SVMs with beta-mixing input sequences. We analyze the 1-norm soft margin classifier and estimate bounds on its misclassification error. Explicit bounds are derived on the rate at which the empirical means converge to their true values when the underlying process is $\beta$-mixing; the uniform convergence approach is then used to estimate the convergence rates of the SVM algorithms with beta-mixing inputs. It is proved that the SVM 1-norm soft margin classifier converges under suitable conditions.
2 Beta-Mixing Processes
Let $Z_1, Z_2, \ldots, Z_m$ be a $\mathcal{Z}$-valued stationary stochastic process, where $\mathcal{Z}$ is a set equipped with a $\sigma$-algebra $\mathcal{S}$ of subsets of $\mathcal{Z}$. Without loss of generality, we assume that the stationary processes we consider are indexed by $j \in \mathbb{Z}$, the set of integers, and that the underlying probability space is the canonical space $(\mathcal{Z}^\infty, \mathcal{S}^\infty, P)$, where $\mathcal{Z}^\infty = \prod_{j=-\infty}^{\infty} \mathcal{Z}$ is the Cartesian product, $\mathcal{S}^\infty$ is the corresponding product $\sigma$-algebra, and $P$ is a shift-invariant probability measure on $(\mathcal{Z}^\infty, \mathcal{S}^\infty)$. Note that $P$ is shift invariant if and only if $\{Z_j : j \in \mathbb{Z}\}$ is a stationary process on $(\mathcal{Z}^\infty, \mathcal{S}^\infty, P)$. The process $\{Z_j : j \in \mathbb{Z}\}$ is defined by
$$Z_j(z) = z_j, \qquad z = (\ldots, z_{-1}, z_0, z_1, z_2, \ldots) \in \mathcal{Z}^\infty.$$
It is well known that the problem of uniform convergence of empirical means generated by i.i.d. samples is fairly well understood. There has also been some interest in studying the convergence of empirical means when $\{Z_j : j \in \mathbb{Z}\}$ is a mixing process. We now introduce the $\beta$-mixing process. Suppose that $P$ is a probability measure such that $\{Z_j : j \in \mathbb{Z}\}$ is a strictly stationary process under $P$. Let $P_{-\infty}^0$ and $P_1^\infty$ denote the semi-infinite marginals of $P$: $P_{-\infty}^0$ is the law of $\{Z_j : -\infty < j \le 0\}$ and $P_1^\infty$ is the law of $\{Z_j : 1 \le j < \infty\}$. The $\beta$-mixing (or Bernoulli mixing) coefficients of the stochastic process are then defined by
$$\beta(k, P) := \sup\big\{|P(A) - (P_{-\infty}^0 \times P_1^\infty)(A)| : A \in \sigma(Z_j : j \le 0 \text{ or } j \ge k)\big\}.$$
Since $\beta(k, P)$ is bounded below by zero and nonincreasing in $k$, the limit $\lim_{k\to\infty} \beta(k, P)$ exists. If $\lim_{k\to\infty} \beta(k, P) = 0$, the process $\{Z_j : j \in \mathbb{Z}\}$ is said to be $\beta$-mixing, or absolutely regular.
Let $P^*$ be the infinite product probability measure on $(\mathcal{Z}^\infty, \mathcal{S}^\infty)$ with the same one-dimensional marginals as $P$. Nobel and Dembo [12] showed that if $\{Z_j : j \in \mathbb{Z}\}$ is a $\beta$-mixing process under $P$, then whenever a family of functions has the property of uniform convergence of empirical means under $P^*$ (i.e., for the i.i.d. process $\{Z_j : j \in \mathbb{Z}\}$), it continues to have this property under $P$. Karandikar and Vidyasagar [10] and Yu [20] derived explicit upper bounds on the difference between the
930
L. Li and C. Wan
empirical means and their true values with respect to the underlying process which is β-mixing.
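As an illustration of this robustness (a sketch we add, not from the paper), the following snippet simulates a stationary AR(1) process, which is beta-mixing with geometrically decaying coefficients, and checks that the empirical mean of a bounded function of it still converges to its true value. The process parameters and sample sizes are arbitrary choices of ours.

```python
import numpy as np

# Illustration (ours, not from the paper): a stationary AR(1) process
# x_{t+1} = 0.5 x_t + eps_t is beta-mixing with geometrically decaying
# coefficients, and empirical means of a bounded function of it still
# converge to the true value, here P(X > 0) = 1/2 by symmetry.
rng = np.random.default_rng(0)

def ar1_path(m, phi=0.5):
    x = np.empty(m)
    # Draw the first point from the stationary law N(0, 1/(1 - phi^2)).
    x[0] = rng.normal(0.0, 1.0 / np.sqrt(1.0 - phi**2))
    for t in range(1, m):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

errors = [abs(float(np.mean(ar1_path(m) > 0)) - 0.5)
          for m in (100, 1000, 10000)]
print(errors)   # deviations shrink with m, up to Monte Carlo noise
```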
3 SVMs with Beta-Mixing Inputs
Let $\mathcal{X} \subset \mathbb{R}^n$ be a compact set and $Y = \{1, -1\}$. Any function $f : \mathcal{X} \to Y$ defines a classifier, or decision function. Let $P$ be a probability distribution on $\mathcal{Z} := \mathcal{X} \times Y$. The misclassification error of a classifier $f : \mathcal{X} \to Y$ is defined as the probability of the event $\{f(X) \ne Y\}$:
$$R(f) := P\{f(X) \ne Y\} = \int_{\mathcal{X}} P\{f(X) \ne Y \mid X = x\}\, dP(x).$$
Here $dP(x)$ is the marginal distribution of $P$ and $dP(y|x)$ is the conditional probability distribution. If we define the regression function of $P$ as
$$f_{\mathrm{reg}}(x) := \int_Y y\, dP(y|x), \qquad x \in \mathcal{X},$$
then one can see (e.g., [8]) that a best classifier is given by the sign of the regression function, $f_c := \mathrm{sgn}(f_{\mathrm{reg}})$. Here, for a function $f : \mathcal{X} \to \mathbb{R}$, the sign function is defined as $\mathrm{sgn}(f)(x) = 1$ if $f(x) \ge 0$ and $-1$ if $f(x) < 0$. We call $f_c$ the Bayes classifier or Bayes decision. Hence for any function $f : \mathcal{X} \to \mathbb{R}$ there holds $R(\mathrm{sgn}(f)) - R(f_c) \ge 0$.
SVMs aim at good approximations $f_{z_m}$ of the regression function, constructed by learning algorithms from a set of random samples $z_m = \{(x_i, y_i)\}_{i=1}^m$. To understand the approximation, we estimate the error $R(\mathrm{sgn}(f_{z_m})) - R(f_c)$. The SVM algorithm we investigate in this note is a Tikhonov regularization scheme associated with a Mercer kernel. A symmetric, positive semidefinite, continuous function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a Mercer kernel [1]. Typical examples of Mercer kernels include the Gaussian, polynomial, and spline kernels. The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with the kernel $K$ is defined as the closure of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in \mathcal{X}\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying $\langle K_x, K_{x'} \rangle_K = K(x, x')$. The reproducing property
$$f(x) = \langle f, K_x \rangle_K, \qquad \forall x \in \mathcal{X},\ f \in \mathcal{H}_K,$$
follows from the definition. The reproducing property together with the Schwarz inequality yields $|f(x)| \le \sqrt{K(x, x)}\, \|f\|_K$. Then $\|f\|_\infty \le \kappa \|f\|_K$, where $\kappa := \sup_{x \in \mathcal{X}} \sqrt{K(x, x)}$, so $\mathcal{H}_K$ can be embedded into the space of bounded functions. Since we are interested in designing approximation schemes from function values on discrete points, we consider RKHSs that can be embedded into $C(\mathcal{X})$, the space of continuous functions on $\mathcal{X}$. We denote the inclusion map from $\mathcal{H}_K$ to $C(\mathcal{X})$ by $I$. For $B_R := \{f \in \mathcal{H}_K : \|f\|_K \le R\}$ we denote by $\overline{I(B_R)}$ the closure of $I(B_R)$ in $C(\mathcal{X})$.
Support Vector Machines with Beta-Mixing Input Sequences
931
The SVM 1-norm soft margin classifier associated with a Mercer kernel $K$ is defined as $\mathrm{sgn}(f_{z_m})$, where $f_{z_m}$ is a minimizer of the following optimization problem:
$$f_{z_m} := \arg\min_{f \in \mathcal{H}_K} \frac{1}{2}\|f\|_K^2 + \frac{C}{m}\sum_{i=1}^m \xi_i \qquad (1)$$
$$\text{s.t.} \quad y_i f(x_i) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, m.$$
We write $(t)_+ = \max\{0, t\}$ and define the loss function $L$ by
$$L_f(z) = (1 - y f(x))_+, \qquad z = (x, y),\ f : \mathcal{X} \to \mathbb{R}. \qquad (2)$$
The risk functional associated with the loss function $L$ is defined by
$$\mathcal{E}(f) := \int_{\mathcal{X} \times Y} L_f(z)\, dP(x, y) = E_P(L_f) \qquad (3)$$
for $f : \mathcal{X} \to \mathbb{R}$. The corresponding empirical error is given by
$$\mathcal{E}_{z_m}(f) := \frac{1}{m}\sum_{j=1}^m L_f(Z_j) = \frac{1}{m}\sum_{i=1}^m (1 - y_i f(x_i))_+. \qquad (4)$$
It is then easy to check that the optimization problem (1) can be rewritten as
$$f_{z_m} = \arg\min_{f \in \mathcal{H}_K} \mathcal{E}_{z_m}(f) + \frac{1}{2C}\|f\|_K^2. \qquad (5)$$
Since the distribution $P$ is unknown, we cannot directly minimize the risk functional (3). It was proven in [21] (see also [4, 19]) that for any function $f : \mathcal{X} \to \mathbb{R}$ one has $\mathcal{E}(f) - \mathcal{E}(f_c) \ge 0$ and
$$R(\mathrm{sgn}(f)) - R(f_c) \le \mathcal{E}(f) - \mathcal{E}(f_c). \qquad (6)$$
This comparison relation can be used to investigate the convergence of the above SVM algorithm. Here convergence means
$$R(\mathrm{sgn}(f_{z_m})) \to \inf_{f \in \mathcal{H}_K} R(\mathrm{sgn}(f)) \quad \text{with confidence as } m, C \to \infty. \qquad (7)$$
However, the convergence (7) does not hold in general, as shown by a counterexample in [Wu, Q., Zhou, D.X.: Analysis of support vector machine classification. Technical Report, City University of Hong Kong, Hong Kong (2004)]. Although (7) fails in general, convergence can be derived when $P$ satisfies suitable conditions with respect to the kernel $K$. For i.i.d. input sequences the convergence of the SVM algorithm was shown in [11, 15].
Let $S$ be a metric space and $r > 0$. The covering number $\mathcal{N}(S, r)$ is the smallest $n$ such that there exist $n$ disks of radius $r$ that cover $S$; it is a complexity measure of the hypothesis space. The following lemma for the error analysis is standard [4]. For our purposes and for completeness, we state the error bounds and give an outline of the proofs.
Lemma 1. Suppose that $\{Z_j : j \in \mathbb{Z}\}$ is a $\beta$-mixing process under $P$ and $P^*$ is the product probability measure on $(\mathcal{Z}^\infty, \mathcal{S}^\infty)$ with the same one-dimensional marginals as $P$. Let $z_m = (x_i, y_i)_{i=1}^m$ be $m$ observations drawn from the $\beta$-mixing sequence $Z$ and let $R > 0$. Then for all $\varepsilon > 0$,
$$P^*\Big\{\sup_{f \in \overline{I(B_R)}} |\mathcal{E}_{z_m}(f) - \mathcal{E}(f)| \ge \varepsilon\Big\} \le 2\,\mathcal{N}\Big(\overline{I(B_R)}, \frac{\varepsilon}{4}\Big) \exp\Big(-\frac{m\varepsilon^2}{32(1 + \kappa R)^2}\Big).$$
Proof. Note that the loss function $L$ has the special property
$$|\mathcal{E}_{z_m}(f) - \mathcal{E}_{z_m}(g)| \le \|f - g\|_\infty \quad \text{and} \quad |\mathcal{E}(f) - \mathcal{E}(g)| \le \|f - g\|_\infty.$$
Let $J = \mathcal{N}(\overline{I(B_R)}, \frac{\varepsilon}{4})$ and choose $f_1, \ldots, f_J$ such that the disks $D_j$ centered at $f_j$ with radius $\frac{\varepsilon}{4}$ cover $\overline{I(B_R)}$. Since $0 \le L_f \le 1 + \kappa R$ for each $f \in B_R$, the Hoeffding inequality [7] gives
$$P^*\Big\{|\mathcal{E}_{z_m}(f) - \mathcal{E}(f)| \ge \frac{\varepsilon}{2}\Big\} \le 2\exp\Big(-\frac{m\varepsilon^2}{32(1 + \kappa R)^2}\Big).$$
For each $f \in \overline{I(B_R)}$ there is a $j$ such that $\|f - f_j\|_\infty \le \frac{\varepsilon}{4}$. It follows that
$$P^*\Big\{\sup_{f \in \overline{I(B_R)}} |\mathcal{E}_{z_m}(f) - \mathcal{E}(f)| \ge \varepsilon\Big\} \le \sum_{j=1}^J P^*\Big\{|\mathcal{E}_{z_m}(f_j) - \mathcal{E}(f_j)| \ge \frac{\varepsilon}{2}\Big\} \le 2J\exp\Big(-\frac{m\varepsilon^2}{32(1 + \kappa R)^2}\Big).$$
The proof is complete.
Recall that $z_i = (x_i, y_i) \in \mathcal{X} \times Y$ and that the stochastic process $\{Z_j : j \in \mathbb{Z}\}$ is defined by $Z_j(z) = z_j$ for $z = (\ldots, z_{-1}, z_0, z_1, z_2, \ldots) \in \mathcal{Z}^\infty$. For the random variable $L_f(Z) = (1 - y f(x))_+$ we set
$$\pi_m(P) := \sup_{f \in \overline{I(B_R)}} \Big|\frac{1}{m}\sum_{j=1}^m L_f(Z_j) - E_P(L_f(Z_1))\Big|.$$
Hence $\pi_m(P) = \pi_m(P^*)$, owing to $P^*$ being the product measure on $(\mathcal{Z}^\infty, \mathcal{S}^\infty)$ with the same one-dimensional marginals as $P$.
Fix a sequence $\{k_m\}$ with $k_m \le m$ and let $m' = \lfloor m/k_m \rfloor$ denote the integer part of $m/k_m$. Write $m = k_m m' + r_m$, where the remainder $r_m$ lies between $0$ and $k_m - 1$. Partition the set $I_m := \{1, \ldots, m\}$ as $I_m^1 \cup \cdots \cup I_m^{k_m}$, where
$$I_m^i := \{i, i + k_m, \ldots, i + m' k_m\} \quad \text{if } 1 \le i \le r_m,$$
and
$$I_m^i := \{i, i + k_m, \ldots, i + (m' - 1) k_m\} \quad \text{if } r_m + 1 \le i \le k_m.$$
Furthermore we define $\pi_{m,i}$ by
$$\pi_{m,i}(f) := \frac{1}{m' + 1}\sum_{j \in I_m^i} \big(L_f(Z_j) - E_P(L_f(Z_1))\big) \quad \text{for } 1 \le i \le r_m,$$
and
$$\pi_{m,i}(f) := \frac{1}{m'}\sum_{j \in I_m^i} \big(L_f(Z_j) - E_P(L_f(Z_1))\big) \quad \text{for } r_m + 1 \le i \le k_m.$$
Then
$$\frac{1}{m}\sum_{j=1}^m \big(L_f(Z_j) - E_P(L_f(Z_1))\big) = \sum_{i=1}^{r_m} \frac{m' + 1}{m}\pi_{m,i}(f) + \sum_{i=r_m+1}^{k_m} \frac{m'}{m}\pi_{m,i}(f).$$
Let $A_{m,i}$ denote the event $\sup_{f \in \overline{I(B_R)}} |\pi_{m,i}(f)| > \varepsilon$. It is obvious that
$$P(\pi_m(P) > \varepsilon) \le P\Big(\bigcup_{i=1}^{k_m} A_{m,i}\Big) \le \sum_{i=1}^{k_m} P(A_{m,i}). \qquad (8)$$
Next, observe that $A_{m,i} \in \sigma\{Z_i, Z_{i+k_m}, \ldots, Z_{i+m'k_m}\}$ for $1 \le i \le r_m$, and $A_{m,i} \in \sigma\{Z_i, Z_{i+k_m}, \ldots, Z_{i+(m'-1)k_m}\}$ for $r_m + 1 \le i \le k_m$. Hence, by Nobel and Dembo [12, Lemma 2] (see also Yu [20, Lemma 4.1]), it follows that
$$|P(A_{m,i}) - P^*(A_{m,i})| \le m'\,\beta(k_m, P) \quad \text{for } 1 \le i \le r_m,$$
and
$$|P(A_{m,i}) - P^*(A_{m,i})| \le (m' - 1)\,\beta(k_m, P) \quad \text{for } r_m + 1 \le i \le k_m.$$
Consequently, inequality (8) leads to
$$P(\pi_m(P) > \varepsilon) \le m' k_m\,\beta(k_m, P) + \sum_{i=1}^{k_m} P^*(A_{m,i}).$$
Recalling that $\pi_m(P) = \pi_m(P^*)$, we have the following estimates for $P^*(A_{m,i})$:
$$P^*(A_{m,i}) \le P^*(\pi_{m'+1}(P^*) > \varepsilon) \quad \text{for } 1 \le i \le r_m,$$
and
$$P^*(A_{m,i}) \le P^*(\pi_{m'}(P^*) > \varepsilon) \quad \text{for } r_m + 1 \le i \le k_m.$$
Summarizing the above, and using $m' k_m \le m$, we obtain the following conclusion, an explicit bound on $P(\pi_m(P) > \varepsilon)$ in terms of the $\beta$-mixing rate and $P^*(\pi_{m'}(P^*) > \varepsilon)$.
Lemma 2. Suppose that $\{Z_j : j \in \mathbb{Z}\}$ is a $\beta$-mixing process under $P$, a shift-invariant probability measure on $(\mathcal{Z}^\infty, \mathcal{S}^\infty)$, and that $P^*$ is the product probability measure on $(\mathcal{Z}^\infty, \mathcal{S}^\infty)$ with the same one-dimensional marginals as $P$. Let $z_m = (x_i, y_i)_{i=1}^m$ be $m$ observations drawn from the $\beta$-mixing sequence $Z$. Fix a sequence $\{k_m\}$ with $k_m \le m$ and let $m' = \lfloor m/k_m \rfloor$. Then
$$P(\pi_m(P) > \varepsilon) \le m\,\beta(k_m, P) + k_m \max\big\{P^*(\pi_{m'+1}(P^*) > \varepsilon),\ P^*(\pi_{m'}(P^*) > \varepsilon)\big\}.$$
We are now in a position to estimate the error bounds for SVMs with $\beta$-mixing input sequences.
Theorem 1. Let $P$ be a shift-invariant probability measure on $(\mathcal{Z}^\infty, \mathcal{S}^\infty)$ and let $\{Z_j : j \in \mathbb{Z}\}$ be a $\beta$-mixing process under $P$. Let $z_m = (x_i, y_i)_{i=1}^m$ be $m$ observations drawn from the $\beta$-mixing sequence $Z$. Fix a sequence $\{k_m\}$ with $k_m \le m$ and let $m' = \lfloor m/k_m \rfloor$. Let $C > 0$ and set
$$\eta := m\,\beta(k_m, P) + 2k_m\,\mathcal{N}\Big(\overline{I(B_{\sqrt{2C}})}, \frac{\varepsilon}{8}\Big)\exp\Big(-\frac{m'\varepsilon^2}{128(1 + \kappa\sqrt{2C})^2}\Big).$$
Then for every $\varepsilon > 0$, with confidence at least $1 - \eta$, we have
$$R(\mathrm{sgn}(f_{z_m})) - R(f_c) \le \varepsilon + \inf_{f \in \mathcal{H}_K}\Big\{\|f - f_c\|_{L^1} + \frac{1}{2C}\|f\|_K^2\Big\}.$$
Proof. Recall that the empirical risk minimizer $f_{z_m}$ is defined through (5), and denote by $f_{K,C}$ the minimizer of the regularized risk functional,
$$f_{K,C} := \arg\min_{f \in \mathcal{H}_K} \mathcal{E}(f) + \frac{1}{2C}\|f\|_K^2.$$
From the property of the loss function we deduce that $f_{z_m}, f_{K,C} \in B_{\sqrt{2C}}$. It is easy to check that
$$\mathcal{E}(f_{z_m}) - \mathcal{E}(f_c) \le \{\mathcal{E}(f_{z_m}) - \mathcal{E}_{z_m}(f_{z_m})\} + \{\mathcal{E}_{z_m}(f_{K,C}) - \mathcal{E}(f_{K,C})\} + \Big\{\mathcal{E}(f_{K,C}) - \mathcal{E}(f_c) + \frac{1}{2C}\|f_{K,C}\|_K^2\Big\}.$$
By Lemmas 1 and 2, with confidence $1 - \eta$ we have $|\mathcal{E}(f_{z_m}) - \mathcal{E}_{z_m}(f_{z_m})| \le \varepsilon/2$ and $|\mathcal{E}_{z_m}(f_{K,C}) - \mathcal{E}(f_{K,C})| \le \varepsilon/2$. The proof is completed by the comparison relation (6), Lemma 2, and the inequality $|\mathcal{E}(f) - \mathcal{E}(f_c)| \le \|f - f_c\|_{L^1}$.
As a corollary of Theorem 1, the SVM 1-norm soft margin classifier converges when $f_c$ lies in the closure of $\mathcal{H}_K$ in $L^1$. SVM algorithms with an offset term, and refined learning rates, are investigated by Wu and Zhou; we refer to [Wu, Q., Zhou, D.X.: Analysis of support vector machine classification. Technical Report, City University of Hong Kong, Hong Kong (2004)] for details.
Acknowledgement This work is supported in part by NSFC under grant 10371033.
References
1. Aronszajn, N.: Theory of Reproducing Kernels. Trans. Amer. Math. Soc. 68 (1950) 337–404
2. Boser, B.E., Guyon, I., Vapnik, V.: A Training Algorithm for Optimal Margin Classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, ACM (1992) 144–152
3. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2 (1998) 121–167
4. Chen, D.R., Wu, Q., Ying, Y.M., Zhou, D.X.: Support Vector Machine Soft Margin Classifiers: Error Analysis. J. of Machine Learning Research 5 (2004) 1143–1175
5. Cortes, C., Vapnik, V.: Support-Vector Networks. Machine Learning 20 (1995) 273–297
6. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines. Cambridge University Press (2000)
7. Cucker, F., Smale, S.: On the Mathematical Foundations of Learning. Bull. Amer. Math. Soc. 39 (2002) 1–49
8. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer-Verlag, Berlin Heidelberg New York (1996)
9. Fisher, R.A.: The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics 7 (1936) 179–188
10. Karandikar, R.L., Vidyasagar, M.: Rates of Uniform Convergence of Empirical Means with Mixing Processes. Statistics & Probability Letters 58 (2002) 297–307
11. Lin, Y.: Support Vector Machines and the Bayes Rule in Classification. Data Mining and Knowledge Discovery 6 (2002) 259–275
12. Nobel, A., Dembo, A.: A Note on Uniform Laws of Averages for Dependent Processes. Statistics & Probability Letters 17 (1993) 169–172
13. Rosenblatt, F.: Principles of Neurodynamics. Spartan Books, New York (1962)
14. Smola, A.J., Schölkopf, B.: A Tutorial on Support Vector Regression. Statistics and Computing 14 (2004) 199–222
15. Steinwart, I.: Support Vector Machines Are Universally Consistent. J. Complexity 18 (2002) 768–791
16. Vapnik, V.: Estimation of Dependences Based on Empirical Data. Springer-Verlag, Berlin Heidelberg New York (1982)
17. Vapnik, V.: Statistical Learning Theory. John Wiley & Sons (1998)
18. Vidyasagar, M.: Learning and Generalization with Applications to Neural Networks. Springer-Verlag, Berlin Heidelberg New York (2003)
19. Wu, Q., Ying, Y.M., Zhou, D.X.: Learning Theory: From Regression to Classification. In: Jetter, K., Buhmann, M., Haussmann, W., Schaback, R., Stoeckler, J. (eds.): Topics in Multivariate Approximation and Interpolation. Elsevier (2006) 101–133
20. Yu, B.: Rates of Convergence of Empirical Processes for Mixing Sequences. Ann. Probab. 22 (1994) 94–116
21. Zhang, T.: Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization. Ann. Statist. 32 (2004) 56–85
Least Squares Support Vector Machine on Gaussian Wavelet Kernel Function Set
Fangfang Wu and Yinliang Zhao
Institute of Neocomputer, Xi'an Jiaotong University, Xi'an 710049, People's Republic of China
[email protected], [email protected]
Abstract. The kernel function of a support vector machine (SVM) is an important factor in the learning result of the SVM. Based on wavelet decomposition and the conditions for a support vector kernel function, a Gaussian wavelet kernel function set for SVM is proposed. Each of these kernel functions is an orthonormal function that can approximate almost any curve in the space of square-integrable functions, which enhances the generalization ability of the SVM. According to the wavelet kernel function and regularization theory, a least squares support vector machine on the Gaussian wavelet kernel function set (LS-GWSVM) is proposed, which greatly simplifies the solution of the GWSVM. The LS-GWSVM is then applied to regression analysis and classification. Experimental results show that LS-GWSVM improves regression precision compared with LS-SVM using a Gaussian kernel.
1 Introduction
Vapnik's support vector machine (SVM) is a statistical model of data that simultaneously minimizes model complexity and data-fitting error [1]. SVM has attracted a lot of interest and has spurred voluminous work in machine learning, both theoretical and experimental. The remarkable generalization ability exhibited by SVM can be explained through margin-based VC theory [2]. SVM has been a promising method for data classification and regression [3]. For pattern recognition and regression analysis, the nonlinearity of SVM is achieved through kernel mappings, and the kernel function must satisfy Mercer's condition [4]. The Gaussian function is a commonly used kernel function and shows good generalization ability. However, with the kernel functions used so far, the SVM cannot approximate every curve in L2(R) (the space of square-integrable functions), because these kernel functions do not form a complete orthonormal basis. As a result, the regression SVM likewise cannot approximate every complicated function in L2(R). Accordingly, we need a new kernel function that can generate a complete basis through translation and dilation. Such functions already exist: they are the wavelet functions. Based on wavelet decomposition, this paper proposes an admissible support vector kernel function set, named the Gaussian wavelet kernel function set, and proves that kernel functions of this kind exist. The Gaussian wavelet kernel
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 936 – 941, 2006. © Springer-Verlag Berlin Heidelberg 2006
Least Squares SVM on Gaussian Wavelet Kernel Function Set
937
functions form an orthonormal basis of the L2(R) space. Combining this kernel function with the least squares support vector machine [5], we build a new SVM learning algorithm, the least squares support vector machine on the Gaussian wavelet kernel function set (LS-GWSVM). In Section 2 we propose the LS-GWSVM; in Section 3 we present experiments comparing it with other SVM algorithms; finally, in Section 4 we draw conclusions.
2 Least Squares Support Vector Machine on Gaussian Wavelet Kernel Function
2.1 The Conditions on a Support Vector Kernel Function
A support vector kernel function can be not only a dot-product kernel, such as $k(x, x') = k(\langle x \cdot x' \rangle)$, but also a translation-invariant kernel, such as $k(x, x') = k(x - x')$ [9]. In fact, a function is an admissible support vector kernel function if and only if it satisfies Mercer's condition.
Theorem 1 [4]: A symmetric function $k(x, x')$ is a kernel function of SVM if and only if, for every function $g \ne 0$ satisfying $\int_{\mathbb{R}^d} g^2(\xi)\, d\xi < \infty$, the following condition holds:
$$\iint_{\mathbb{R}^d \times \mathbb{R}^d} k(x, x')\, g(x)\, g(x')\, dx\, dx' \ge 0. \qquad (4)$$
This theorem provides a simple method for constructing kernel functions. Since it is hard to factor a translation-invariant kernel into two identical functions, we state the admissibility condition for translation-invariant kernels directly.
Theorem 2 [6], [7]: A translation-invariant function $k(x - x')$ is an admissible support vector kernel function if and only if the Fourier transform of $k(x)$ satisfies
$$F[k](\omega) = (2\pi)^{-d/2} \int_{\mathbb{R}^d} \exp(-j\,\omega \cdot x)\, k(x)\, dx \ge 0. \qquad (5)$$
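The Fourier-positivity condition above can be checked numerically for a candidate one-dimensional translation-invariant kernel by sampling $k$ on a grid and inspecting its discrete Fourier transform; the positive prefactor $(2\pi)^{-d/2}$ does not affect the sign. A sketch we add for illustration (grid extent and resolution are arbitrary choices):

```python
import numpy as np

# Numerical check of the condition in Theorem 2 for d = 1 (our own
# illustration; grid extent and resolution are arbitrary). The positive
# factor (2*pi)^(-1/2) does not affect the sign, so it is omitted.
x = np.linspace(-20.0, 20.0, 4001)
dx = x[1] - x[0]

def spectrum_min(k_vals):
    # For an even real kernel the DFT is real; ifftshift aligns x = 0
    # with index 0 so the transform approximates the continuous one.
    spec = np.real(np.fft.fft(np.fft.ifftshift(k_vals))) * dx
    return float(spec.min())

gauss = np.exp(-x**2 / 2.0)                # admissible: spectrum stays positive
box = np.where(np.abs(x) < 1.0, 1.0, 0.0)  # boxcar: sinc-type spectrum changes sign
print(spectrum_min(gauss), spectrum_min(box))
```

Running this shows the Gaussian's spectrum minimum is nonnegative up to round-off, while the boxcar's is clearly negative, so the boxcar is not an admissible translation-invariant kernel.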
2.2 Least Squares Support Vector Machine on Gaussian Wavelet Kernel Function
If a wavelet function $\psi(x)$ satisfies $\psi(x) \in L^2(\mathbb{R}) \cap L^1(\mathbb{R})$ and $\hat{\psi}(0) = 0$, where $\hat{\psi}$ is the Fourier transform of $\psi(x)$, then the wavelet function group is defined as
$$\psi_{a,m}(x) = (a)^{-1/2}\, \psi\Big(\frac{x - m}{a}\Big), \qquad m \in \mathbb{R},\ a > 0, \qquad (6)$$
where $a$ is the dilation coefficient, $m$ is the translation coefficient, and $\psi(x)$ is the mother wavelet. For a function $f(x) \in L^2(\mathbb{R})$, the wavelet transform of $f(x)$ is defined as
938
F. Wu and Y. Zhao
$$(W_\psi f)(a, m) = (a)^{-1/2} \int_{-\infty}^{+\infty} f(x)\, \psi\Big(\frac{x - m}{a}\Big)\, dx, \qquad (7)$$
and the inverse wavelet transform of $f(x)$ is
$$f(x) = C_\psi^{-1} \int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} \big[(W_\psi f)(a, m)\big]\, \psi_{a,m}(x)\, \frac{da}{a^2}\, dm, \qquad (8)$$
where $C_\psi$ in equation (8) is a constant depending on $\psi(x)$. The idea of wavelet decomposition is to approximate the function $f(x)$ by a linear combination of the wavelet function group. If the one-dimensional wavelet function is $\psi(x)$, then by the tensor product construction [7] a multidimensional wavelet function can be defined as
$$\psi_d(x) = \prod_{i=1}^d \psi(x_i). \qquad (9)$$
We can then build a translation-invariant kernel function as
$$k(x, x') = \prod_{i=1}^d \psi\Big(\frac{x_i - x_i'}{a_i}\Big), \qquad (10)$$
where $a_i > 0$ is the dilation coefficient of the wavelet.
Because a wavelet kernel function must satisfy the conditions of Theorem 2, only a few wavelet kernel functions can be expressed in terms of known functions [8], [9]. We now give one: the Gaussian wavelet kernel function, and we prove that it satisfies the admissibility condition for support vector kernel functions. The Gaussian wavelet function is defined as
$$\psi(x) = (-1)^p\, C_{2p}(x)\, \exp\Big(-\frac{x^2}{2}\Big), \qquad (11)$$
where $p$ is a nonnegative integer and $C_{2p}(x)\exp(-\frac{x^2}{2})$ is the $2p$-th derivative of the Gaussian function. Different values of $p$ give different Gaussian wavelet functions. When $p = 0$, $C_{2p}(x) = 1$ and the Gaussian wavelet function is in fact a Gaussian function. When $p = 1$, $C_{2p}(x) = x^2 - 1$ and the Gaussian wavelet function is the Marr (Mexican hat) wavelet. When $p = 2$, $C_{2p}(x) = x^4 - 6x^2 + 3$ and the Gaussian wavelet function is $\psi(x) = (x^4 - 6x^2 + 3)\exp(-\frac{x^2}{2})$. We can regard these wavelet functions as special cases of the Gaussian wavelet function set.
Theorem 3: The Gaussian wavelet kernel function is defined as
$$k(x, x') = k(x - x') = \prod_{i=1}^d \Big((-1)^p\, C_{2p}\Big(\frac{x_i - x_i'}{a_i}\Big)\exp\Big(-\frac{(x_i - x_i')^2}{2a_i^2}\Big)\Big), \qquad (12)$$
and this kernel function is an admissible support vector kernel function.
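A direct way to corroborate Theorem 3 numerically (an illustration we add; the sample size and widths are arbitrary) is to build the Gram matrix of kernel (12) for $p = 0, 1, 2$ and verify that its eigenvalues are nonnegative up to round-off:

```python
import numpy as np

# Gaussian wavelet kernel (12) for p = 0, 1, 2, with the polynomials
# C_0 = 1, C_2 = x^2 - 1, C_4 = x^4 - 6x^2 + 3 given above; sample
# sizes and widths below are arbitrary illustration choices.
C2P = {0: lambda u: np.ones_like(u),
       1: lambda u: u**2 - 1.0,
       2: lambda u: u**4 - 6.0 * u**2 + 3.0}

def gw_kernel(X, Xp, a=2.0, p=1):
    # k(x,x') = prod_i (-1)^p C_2p((x_i - x'_i)/a) exp(-(x_i - x'_i)^2 / (2 a^2))
    u = (X[:, None, :] - Xp[None, :, :]) / a
    return np.prod((-1.0) ** p * C2P[p](u) * np.exp(-u**2 / 2.0), axis=2)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
# Theorem 3 implies every Gram matrix is positive semidefinite, so the
# smallest eigenvalue should be nonnegative up to round-off.
mins = [float(np.linalg.eigvalsh(gw_kernel(X, X, a=2.0, p=p)).min())
        for p in (0, 1, 2)]
print(mins)
```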
Proof. We compute the Fourier transform of $k$:
$$F[k](\omega) = (2\pi)^{-d/2} \int_{\mathbb{R}^d} \exp(-j\,\omega \cdot x)\, k(x)\, dx = (2\pi)^{-d/2} \prod_{i=1}^d \int_{-\infty}^{+\infty} \exp(-j\,\omega_i x_i)\, (-1)^p\, C_{2p}\Big(\frac{x_i}{a_i}\Big)\exp\Big(-\frac{x_i^2}{2a_i^2}\Big)\, dx_i.$$
Substituting $u_i = x_i/a_i$ and using the fact that $(-1)^p C_{2p}(u)\exp(-u^2/2)$ is, up to sign, the $2p$-th derivative of $\exp(-u^2/2)$, integration by parts $2p$ times gives
$$F[k](\omega) = (2\pi)^{-d/2} \prod_{i=1}^d a_i^{2p+1}\, \omega_i^{2p} \int_{-\infty}^{+\infty} \exp(-j\,(a_i\omega_i)\, u_i)\exp\Big(-\frac{u_i^2}{2}\Big)\, du_i = (2\pi)^{-d/2} \prod_{i=1}^d a_i^{2p+1}\, \omega_i^{2p}\, (2\pi)^{1/2}\exp\Big(-\frac{a_i^2\omega_i^2}{2}\Big).$$
Hence
$$F[k](\omega) = \prod_{i=1}^d \Big(a_i^{2p+1}\, \omega_i^{2p}\, \exp\Big(-\frac{a_i^2\omega_i^2}{2}\Big)\Big) \ge 0. \qquad \blacksquare$$
The output function is then defined as
$$f(x) = \sum_{j=1}^l w_j \prod_{i=1}^d \Big(C_{2p}\Big(\frac{x_i - x_i^j}{a_i^j}\Big)\exp\Big(-\frac{(x_i - x_i^j)^2}{2(a_i^j)^2}\Big)\Big) + b, \qquad (13)$$
where $x_i^j$ is the value of the $i$-th attribute of the $j$-th training sample. Using the Gaussian wavelet kernel function, the regression function gains a new interpretation: we approximate any function $f(x)$ by a linear combination of the wavelet function group; that is, we find the wavelet coefficients of the decomposition of $f(x)$. Based on the Gaussian wavelet kernel function set, we obtain a new learning method, the least squares support vector machine on the Gaussian wavelet kernel function set (LS-GWSVM). This algorithm is in fact a least squares support vector machine in which the Gaussian wavelet kernel is used as the kernel function of the SVM.
Only one regularization parameter $\gamma$ needs to be determined for this algorithm, so the number of parameters of this kind of SVM is smaller than for other kinds of SVM, and the uncertain factors are reduced accordingly. Additionally, because the least squares method is used, this algorithm computes faster than other SVMs. Because LS-SVM cannot optimize the parameters of the kernel function, it is hard to select $l \times d$ dilation parameters. For convenience, we fix $a_i^j = a$, so the kernel function has a single parameter, and we use cross-validation to select the value of $a$.
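The LS-GWSVM training step then reduces to one linear system in the standard LS-SVM form of Suykens and Vandewalle [5]. The following sketch (our illustration; the width a and regularization gamma are arbitrary choices, not values from the paper) fits a superposition of sinusoids like the one used in the experiments below, using the p = 1 Gaussian wavelet (Mexican hat) kernel and a normalized squared-error measure:

```python
import numpy as np

# Minimal LS-SVM regression sketch in the spirit of Suykens and
# Vandewalle [5] (details assumed, not quoted from this paper), using
# the p = 1 Gaussian wavelet (Mexican hat) kernel with one width a.
def gw_kernel(X, Xp, a):
    u = (X[:, None] - Xp[None, :]) / a
    return (1.0 - u**2) * np.exp(-u**2 / 2.0)   # (-1)^1 (u^2 - 1) e^{-u^2/2}

def lssvm_fit(X, y, a=0.05, gamma=1000.0):
    m = len(X)
    K = gw_kernel(X, X, a)
    # LS-SVM dual: solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y].
    A = np.zeros((m + 1, m + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(m) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    b, alpha = sol[0], sol[1:]
    return lambda Xq: gw_kernel(Xq, X, a) @ alpha + b

x = np.linspace(0.0, 1.0, 200)
y = np.sin(2*np.pi*x) + np.sin(4*np.pi*x) + np.sin(10*np.pi*x)
f = lssvm_fit(x, y)
# Normalized squared error of the fit on the training grid.
E = float(np.sum((y - f(x))**2) / np.sum((y - y.mean())**2))
```

With a dense training grid and mild regularization the normalized error E is small; in the paper the width a would instead be selected by cross-validation.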
3 Experiments and Results
For the following two experiments, we use the approximation error [10]
$$E = \sum_{i=1}^l (y_i - f_i)^2 \Big/ \sum_{i=1}^l (y_i - \bar{y})^2, \qquad \bar{y} = \frac{1}{l}\sum_{i=1}^l y_i. \qquad (14)$$
3.1 Regression Performance of LS-GWSVM
We use LS-GWSVM to regress the following function:
$$y = \sin(2\pi x) + \sin(4\pi x) + \sin(10\pi x) \qquad (15)$$
The results of this experiment are given in Table 1.

Table 1. Regression results for the unitary function y

Kernel                            Kernel parameter   Training samples   Regression error
Gaussian kernel                   σ = 2              500                0.0417
Gaussian wavelet kernel (p = 1)   a = 2              500                0.0273
Gaussian wavelet kernel (p = 2)   a = 3.8            500                0.0236
3.2 Application of LS-GWSVM to Nonlinear System Identification
Consider a unitary nonlinear system
$$u(t + 1) = \frac{1.4\, e^{u(t)}}{1 + e^{u(t)}} + v(t)^2, \qquad (16)$$
with input $v(t) = 0.4\sin(2\pi t/200) + 0.2\cos(2\pi t/40) + 0.4\sin(2\pi t/1000)$, taking values in the interval $[-1, 1]$. We take 400 points as training samples and 80 points as testing samples, with initial input $u(1) = 0$. The identification results are given in Table 2.
Table 2. Identification results for the unitary nonlinear system

Kernel                            Kernel parameter   Approximation error
Gaussian kernel                   σ = 2              0.0376
Gaussian wavelet kernel (p = 1)   a = 2              0.0217
Gaussian wavelet kernel (p = 2)   a = 3              0.0179
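For reproducibility, the identification data can be generated as follows (a sketch we add; the recursion is read as $u(t+1) = 1.4e^{u(t)}/(1+e^{u(t)}) + v(t)^2$, an assumption on our part since the typeset formula in (16) is ambiguous):

```python
import numpy as np

# Sketch of generating the identification data (our addition). The
# recursion is read as u(t+1) = 1.4 e^{u(t)} / (1 + e^{u(t)}) + v(t)^2,
# an assumption since the typeset formula is ambiguous, with u(1) = 0.
def generate(n=480):
    t = np.arange(1, n + 1)
    v = (0.4 * np.sin(2*np.pi*t/200) + 0.2 * np.cos(2*np.pi*t/40)
         + 0.4 * np.sin(2*np.pi*t/1000))
    u = np.zeros(n)
    for k in range(n - 1):
        u[k + 1] = 1.4 * np.exp(u[k]) / (1.0 + np.exp(u[k])) + v[k]**2
    return v, u

v, u = generate()
v_train, u_train = v[:400], u[:400]   # 400 training points, as in the text
v_test, u_test = v[400:], u[400:]     # 80 testing points
```

Under this reading the trajectory stays bounded, since the logistic term lies in (0, 1.4) and v(t)^2 lies in [0, 1].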
4 Conclusion
For SVM learning, this paper proposes a new kernel function set, the Gaussian wavelet kernel function set. It can build an orthonormal basis of the L2(R) space, and with this kernel function we can approximate almost any complicated function in L2(R); the kernel function thus enhances the generalization ability of the SVM. Combining it with LS-SVM, a new algorithm named the least squares support vector machine on the Gaussian wavelet kernel function set is proposed. Experiments show that the Gaussian wavelet kernel function performs better than the Gaussian kernel function.
References
1. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, Berlin Heidelberg New York (1995)
2. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2(2) (1998) 121–167
3. Edgar, O., Robert, F., Federico, G.: Training Support Vector Machines: An Application to Face Detection. IEEE Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico (1997) 130–136
4. Mercer, J.: Functions of Positive and Negative Type and Their Connection with the Theory of Integral Equations. Philosophical Transactions of the Royal Society of London A 209 (1909) 415–446
5. Suykens, J.A.K., Vandewalle, J.: Least Squares Support Vector Machine Classifiers. Neural Processing Letters 9(3) (1999) 293–300
6. Burges, C.J.C.: Geometry and Invariance in Kernel Based Methods. In: Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA (1999) 89–116
7. Smola, A., Schölkopf, B., Müller, K.R.: The Connection between Regularization Operators and Support Vector Kernels. IEEE Trans. on Neural Networks 11(4) (1998) 637–649
8. Zhang, L., Zhou, W.D., Jiao, L.C.: Wavelet Support Vector Machines. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 34 (2004) 34–39
9. Wu, F.F., Zhao, Y.L.: Least Squares Littlewood-Paley Wavelet Support Vector Machine. MICAI 2005: Advances in Artificial Intelligence, 4th Mexican International Conference on Artificial Intelligence. Lecture Notes in Artificial Intelligence, Vol. 3789. Springer-Verlag, Berlin Heidelberg, Monterrey, Mexico (2005) 462–472
10. Zhang, Q., Benveniste, A.: Wavelet Networks. IEEE Transactions on Neural Networks 3(6) (1992) 889–898
A Smoothing Multiple Support Vector Machine Model
Huihong Jin 1,2, Zhiqing Meng 2, and Xuanxi Ning 1
1 College of Business and Administration, Nanjing University of Aeronautics and Astronautics, Jiangsu, China
2 College of Business and Administration, Zhejiang University of Technology, Zhejiang 310032, China
net [email protected], [email protected]
Abstract. In this paper, we study a smoothing multiple support vector machine (SVM) based on an exact penalty function. First, we formulate the optimization problem of the multiple SVM as an unconstrained, nonsmooth optimization problem via the exact penalty function. Then, we propose a twice-differentiable function that approximately smooths the exact penalty function, yielding an unconstrained, smooth optimization problem. An error analysis shows that an approximate solution of the multiple SVM can be obtained by solving the smoothed, unconstrained penalty problem. Finally, we give a corporate culture model built with the multiple SVM as a practical example. Numerical experiments illustrate that the smoothing multiple SVM achieves better precision than an artificial neural network.
1
Introduction
Recently, support vector machines (SVMs) have been employed for classification and prediction in pattern recognition and regression problems. However, many problems, such as face recognition [1], require more complex SVM models. The corporate culture model is a similarly complex problem [2-3]. An SVM problem requires solving a constrained mathematical program [4]. A common approach is to formulate the problem as a Lagrangian problem [5]. However, the Lagrangian problem still has constraints and many additional variables, which makes it difficult to solve. To overcome this difficulty, some researchers have studied smooth SVMs [6]. On the other hand, the penalty function is a good tool for constrained optimization problems such as the SVM problem, since it transforms a constrained optimization problem into an unconstrained one. Quan and Yang applied the penalty function to SVMs in 2003 [7]. It is well known that exact penalty functions are better than inexact ones for constrained optimization problems [11]. However, many exact penalty functions are not differentiable, so that good algorithms, such as Newton's method, cannot be applied. Therefore, the exact penalty function needs to be approximately smoothed [8,9,10]. In this paper, we study a multiple SVM model which consists

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 942–948, 2006.
© Springer-Verlag Berlin Heidelberg 2006
of multiple inputs and multiple outputs. The multiple SVM can be applied to build the corporate culture model.
2
Smoothing Multiple Support Vector Machine
Let R^n be an n-dimensional space. Given a set of data {(x^1, y^1), (x^2, y^2), · · · , (x^N, y^N)}, where x^i ∈ R^n and y^i ∈ R^p, the nonlinear multiple SVM first maps the input data {x^1, x^2, · · · , x^N} into a high-dimensional feature space R^m by a nonlinear column-vector function Φ_j : R^n → R^m and then performs a multiple regression in this feature space, so that y_j = w_j^T Φ_j(x) + b_j, j = 1, 2, · · · , p, where w_j ∈ R^m, b_j is the threshold, w_j^T Φ_j(x) denotes the dot product of w_j with Φ_j(x), and the superscript 'T' denotes transpose. To determine the unknowns (w_j, b_j) (j = 1, 2, · · · , p), we define the following optimization problem:

(PSVM)   min f(w, b, u, v) := (1/2) Σ_{j=1}^{p} ‖w_j‖² + C Σ_{j=1}^{p} Σ_{i=1}^{N} [ξ(u_j^i) + ξ(v_j^i)]    (1)

s.t.  g_j^i(w, b, u, v) := y_j^i − w_j^T Φ_j(x^i) − b_j − ε − u_j^i ≤ 0,    (2)
      h_j^i(w, b, u, v) := w_j^T Φ_j(x^i) + b_j − y_j^i − ε − v_j^i ≤ 0,    (3)
      u_j^i, v_j^i ≥ 0,   i = 1, 2, · · · , N,  j = 1, 2, · · · , p,    (4)
where ‖·‖ denotes the Euclidean norm, ε is the ε-insensitive loss zone [2], C is a positive constant, w = (w_1, w_2, · · · , w_p), b = (b_1, b_2, · · · , b_p), u = (u_1, u_2, · · · , u_p), v = (v_1, v_2, · · · , v_p), u_j = (u_j^1, u_j^2, · · · , u_j^N), v_j = (v_j^1, v_j^2, · · · , v_j^N), u_j^i, v_j^i are training errors, and ξ(·) is the loss function. A squared loss function is often used in the study: ξ(t) := (1/2) t². Note that with the use of the squared ε-insensitive loss zone, errors |y_j^i − w_j^T Φ_j(x^i) − b_j| that are less than ε contribute no cost to the objective function (1), since the training errors are zero in this case, i.e., u_j^i or v_j^i is zero in (2) or (3). To solve (PSVM), we define a penalty function as

F(w, b, u, v, ρ) := f(w, b, u, v) + ρ Σ_{j=1}^{p} Σ_{i=1}^{N} ( max{g_j^i(w, b, u, v), 0}^{1/2} + max{−u_j^i, 0}^{1/2} + max{h_j^i(w, b, u, v), 0}^{1/2} + max{−v_j^i, 0}^{1/2} ).    (5)

Define the corresponding unconstrained penalty optimization problem as follows:

(MPρ)   min_{(w,b,u,v)} F(w, b, u, v, ρ).
According to [11], F is an exact penalty function. Therefore, we can find a solution to (PSVM) by solving (MPρ) for some ρ. However, since the penalty function F(w, b, u, v, ρ) is not smooth, we smooth F by a twice-differentiable approximation as follows. Let p : R^1 → R^1, p(t) = (max{t, 0})^{1/2}. For ε > 0, we define q_ε : R^1 → R^1 by

q_ε(t) = 0,                                               if t ≤ 0,
q_ε(t) = t^{5/2} / (15 ε²),                               if 0 ≤ t ≤ ε,
q_ε(t) = t^{1/2} + (2ε/3) t^{−1/2} − (8/5) ε^{1/2},       if t ≥ ε.

It is easily shown that q_ε(t) is twice differentiable. Moreover, lim_{ε→0} q_ε(t) = p(t).
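As a minimal sketch (not from the paper; all names are ours), q_ε can be implemented directly and its approximation quality checked numerically. The check below confirms the two key properties used later: the pieces join continuously at t = ε, and the pointwise gap p(t) − q_ε(t) never exceeds (8/5)ε^{1/2}.

```python
import math

def p(t):
    """Square-root penalty term p(t) = max(t, 0)**(1/2)."""
    return math.sqrt(max(t, 0.0))

def q(t, eps):
    """Twice-differentiable smoothing of p(t) with parameter eps > 0."""
    if t <= 0.0:
        return 0.0
    if t <= eps:
        return t ** 2.5 / (15.0 * eps ** 2)
    return math.sqrt(t) + (2.0 * eps / 3.0) / math.sqrt(t) - 1.6 * math.sqrt(eps)

eps = 1e-2
# The two nonzero pieces agree at t = eps (value continuity) ...
assert abs(q(eps - 1e-12, eps) - q(eps + 1e-12, eps)) < 1e-9
# ... and the gap to p(t) stays within [0, (8/5) * sqrt(eps)] for every t.
for t in [0.0, 1e-5, eps / 2, eps, 0.1, 1.0, 100.0]:
    gap = p(t) - q(t, eps)
    assert -1e-12 <= gap <= 1.6 * math.sqrt(eps) + 1e-12
```

With four such terms per index pair (i, j), summing the per-term bound (8/5)ε^{1/2} over all pN pairs and scaling by ρ reproduces the constant (32/5)pNρε^{1/2} appearing in Lemma 2.1 below.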
Consider the nonlinear penalty function for (PSVM):

G(w, b, u, v, ρ, ε) = f(w, b, u, v) + ρ Σ_{j=1}^{p} Σ_{i=1}^{N} ( q_ε(g_j^i(w, b, u, v)) + q_ε(−u_j^i) + q_ε(h_j^i(w, b, u, v)) + q_ε(−v_j^i) ),    (6)
where ρ > 0. G(w, b, u, v, ρ, ε) is twice differentiable at every (w, b, u, v). Then we have the following unconstrained penalty problem:

(MP(ρ,ε))   min_{(w,b,u,v)} G(w, b, u, v, ρ, ε).
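To make the construction concrete, the sketch below assembles G for a tiny one-output (p = 1) regression problem and hands the unconstrained problem (MP(ρ,ε)) to a general-purpose optimizer. This is our illustration only: the toy data, the parameter values, the linear feature map Φ(x) = x, and the least-squares warm start are all invented for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: N = 5 scalar inputs, p = 1 output, linear feature map Phi(x) = x.
X = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
Y = 2.0 * X + 0.5                             # target lies exactly on a line
TUBE, C, RHO, EPS = 0.05, 10.0, 100.0, 1e-6   # eps-tube, loss weight, rho, smoothing eps

def q(t, eps):
    """Twice-differentiable smoothing of sqrt(max(t, 0)), elementwise."""
    t = np.asarray(t, dtype=float)
    tc = np.clip(t, 0.0, None)
    small = tc ** 2.5 / (15.0 * eps ** 2)
    safe = np.maximum(tc, eps)                # guard sqrt/division below eps
    big = np.sqrt(safe) + (2.0 * eps / 3.0) / np.sqrt(safe) - 1.6 * np.sqrt(eps)
    return np.where(t <= eps, small, big)

def G(z):
    """Smoothed penalty objective G(w, b, u, v, rho, eps) for the toy problem."""
    w, b, u, v = z[0], z[1], z[2:7], z[7:12]
    f = 0.5 * w ** 2 + C * 0.5 * (np.sum(u ** 2) + np.sum(v ** 2))
    g = Y - w * X - b - TUBE - u              # constraints (2)
    h = w * X + b - Y - TUBE - v              # constraints (3)
    return f + RHO * np.sum(q(g, EPS) + q(-u, EPS) + q(h, EPS) + q(-v, EPS))

# Warm start from the least-squares line, then polish the unconstrained problem.
z0 = np.zeros(12)
z0[0], z0[1] = np.polyfit(X, Y, 1)            # w = 2.0, b = 0.5
res = minimize(G, z0, method="Nelder-Mead",
               options={"maxiter": 20000, "maxfev": 20000,
                        "fatol": 1e-12, "xatol": 1e-10})
w, b = res.x[0], res.x[1]                     # recovered regression parameters
```

Here ε is the smoothing parameter of q_ε; shrinking it (for a suitable ρ) tightens the error bounds of the next section, at the price of a stiffer objective.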
In this section, we discuss the relationships among (PSVM), (MPρ) and (MP(ρ,ε)).

Definition 2.1. Let ε > 0. A vector (w, b, u, v) ∈ R^n × R^1 × R^N × R^N is said to be an ε-feasible solution of (PSVM) if

g_j^i(w, b, u, v) ≤ ε,   i = 1, 2, · · · , N,  j = 1, 2, · · · , p,
h_j^i(w, b, u, v) ≤ ε,   i = 1, 2, · · · , N,  j = 1, 2, · · · , p,
−u_j^i ≤ ε,  −v_j^i ≤ ε,   i = 1, 2, · · · , N,  j = 1, 2, · · · , p.
When ε > 0 is very small, an ε-feasible solution is an approximate solution of (PSVM), since lim_{ε→0} G(w, b, u, v, ρ, ε) = F(w, b, u, v, ρ) for all ρ. It is easy to prove the following Lemma 2.1.

Lemma 2.1. For ε > 0 and ρ > 0, we have

0 ≤ F(w, b, u, v, ρ) − G(w, b, u, v, ρ, ε) ≤ (32/5) pNρ ε^{1/2}.    (7)

By Lemma 2.1, we can easily obtain the following three conclusions.

Theorem 2.1. Let {ε_k} → 0 be a sequence of positive numbers. Assume that for each k, (w^k, b^k, u^k, v^k) is a solution to (MP(ρ,ε_k)) for some ρ > 0. Let (w^*, b^*, u^*, v^*)
be an accumulation point of the sequence {(w^k, b^k, u^k, v^k)}. Then (w^*, b^*, u^*, v^*) is an optimal solution to (MPρ).

Theorem 2.2. Let (w^*, b^*, u^*, v^*) be an optimal solution of (MPρ) and (w̄, b̄, ū, v̄) be an optimal solution of (MP(ρ,ε)). Then

0 ≤ F(w^*, b^*, u^*, v^*, ρ) − G(w̄, b̄, ū, v̄, ρ, ε) ≤ (32/5) pNρ ε^{1/2}.    (8)

Proof. From Lemma 2.1 we have

F(w, b, u, v, ρ) ≤ G(w, b, u, v, ρ, ε) + (32/5) pNρ ε^{1/2}.

Consequently,

inf_{(w,b,u,v)} F(w, b, u, v, ρ) ≤ inf_{(w,b,u,v)} G(w, b, u, v, ρ, ε) + (32/5) pNρ ε^{1/2},

which proves the right-hand inequality of (8). The left-hand inequality of (8) follows from Lemma 2.1.

Theorem 2.3. Let (w^*, b^*, u^*, v^*) be an optimal solution of (MPρ) and (w̄, b̄, ū, v̄) be an optimal solution of (MP(ρ,ε)). Furthermore, let (w^*, b^*, u^*, v^*) be feasible for (PSVM) and (w̄, b̄, ū, v̄) be ε-feasible for (PSVM). Then

0 ≤ f(w^*, b^*, u^*, v^*) − f(w̄, b̄, ū, v̄) ≤ (64/5) pNρ ε^{1/2}.    (9)
Proof. By Lemma 2.1 and Definition 2.1, we obtain (9). Theorems 2.2 and 2.3 mean that an approximate solution to (MP(ρ,ε)) is also an approximate solution to (MPρ) when ε > 0 is sufficiently small. Moreover, by Theorem 2.3, an approximate solution to (MP(ρ,ε)) also becomes an approximate optimal solution to (PSVM) if it is ε-feasible. Therefore, we may obtain an approximate optimal solution to (PSVM) by finding an approximate solution to (MP(ρ,ε)).
3
Corporate Culture Model Based on the Multiple SVM
Corporate culture is a very distinctive characteristic of individual organizations and has been a well known critical factor in the field of strategic management. It clearly sets forth the direction in which the company would like to go as an organizational entity (Denison, 1990)[2]. Regardless of the business nature, history, or size of the company, the effort toward developing a corporate culture varies. Many organizations have recognized the need to reshape their culture to suit the changing competitive business environment. However, searching out excellent corporate culture models and implementation frameworks is not an easy task. Denison (1990) argues that corporate culture has a close relationship with the organizational effectiveness that is a function of the interrelation of core
values and beliefs, company policies and practices, and the business environment of the organization [2]. Based on the multiple SVM, we propose a corporate culture model that consists of three grades: core values and beliefs, company policies and practices, and the business environment of the organization, denoted by y1, y2 and y3, together with 13 related factors: long term, process, teamwork, customer focus, quality, correct problems, find opportunities, leadership, increase market share, systems view, continuous improvement, planning intensive, and people-oriented, denoted by x1, x2, · · · , x13. We define the model

y_j = w_j^T Φ_j(x) + b_j,  j = 1, 2, 3,

where x = (x1, x2, · · · , x13)^T ∈ R^13. The parameters (w_j, b_j) (j = 1, 2, 3) are determined by solving the optimization problem (PSVM). In order to find the relationships within corporate culture, we collected sample data from a survey of 14 companies and chose the survey data of 245 individuals as training data. Then, to solve (PSVM), we carried out numerical experiments on these data using the approximately exact penalty function of (MP(ρ,ε)). In the experiments, we solved (MP(ρ,ε)) with penalty parameter ρ = 10, 100, 1000 and ε = 10^{−8}, and chose the RBF kernel function for the multiple SVM. For comparison with the multiple SVM, we ran the same experiment on these data with a BP artificial neural network. We obtained the predicted values for 9 individuals from the multiple SVM and from the BP neural network. For predicted values ŷ_i and factual values y_i, we computed the error Σ_{i=1}^{3} |y_i − ŷ_i| for each individual; the same error was computed for the predictions of the artificial neural network. The errors of the multiple SVM and of the BP neural network are listed in Table 1.

Table 1. Comparison of the error of the multiple SVM with the error of the neural network

Individuals   Error of Multiple SVM   Error of Neural Networks
1             0.012493                0.042535
2             0.023453                0.051523
3             0.037129                0.033172
4             0.028372                0.037980
5             0.042426                0.062467
6             0.127230                0.152787
7             0.022757                0.035782
8             0.053620                0.067255
9             0.056726                0.092772
Table 1 shows that the predictions of the multiple SVM are more precise than those of the neural network. The corporate culture model can approximately estimate the organizational effectiveness with respect to core values and beliefs, company policies and practices, and the business environment of the organization. This is very useful when a company changes its business policies.
4
Conclusion
This paper presents a smoothing multiple SVM problem (MP(ρ,ε)) based on an exact penalty function. The smoothed exact penalty function is twice differentiable. An approximate solution to (MP(ρ,ε)) is proved to be an approximate solution to (PSVM) when ε > 0 is sufficiently small. We give a corporate culture model based on the multiple SVM. Using the sample data surveyed from 14 companies, we compute approximate solutions to the corporate culture model (PSVM). Comparison with an artificial neural network shows that the approach is efficient for the prediction of multiple-output problems.
Acknowledgements This research work was partially supported by grant No. N05GL03 from Zhejiang Provincial Philosophical & Social Science Foundation.
References
1. Ko, J., Byun, H.: Multi-class Support Vector Machines with Case-Based Combination for Face Recognition. Lecture Notes in Computer Science, Vol. 2756. Springer-Verlag, Berlin Heidelberg New York (2003) 623-629
2. Denison, D.R.: Corporate Culture and Organizational Effectiveness. John Wiley & Sons, New York (1990)
3. Chin, K.-S., Pun, K.F., Ho, A.S.K., Lau, H.: A Measurement-Communication-Recognition Framework of Corporate Culture Change: An Empirical Study. Human Factors and Ergonomics in Manufacturing 12 (2002) 365-382
4. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2 (1998) 121-167
5. Lee, K.L., Billings, S.A.: Time Series Prediction Using Support Vector Machines, the Orthogonal and the Regularized Orthogonal Least-Squares Algorithms. International Journal of Systems Science 33 (2002) 811-821
6. Lee, Y.-J., Mangasarian, O.L.: SSVM: A Smooth Support Vector Machine for Classification. Computational Optimization and Applications 20 (2001) 5-22
7. Quan, Y., Yang, J.: An Improved Parameter Tuning Method for Support Vector Machines. Lecture Notes in Computer Science, Vol. 2639. Springer-Verlag, Berlin Heidelberg New York (2003) 607-610
8. Zenios, S.A., Pinar, M.C., Dembo, R.S.: A Smooth Penalty Function Algorithm for Network-Structured Problems. European Journal of Operational Research 64 (1993) 258-277
9. Meng, Z.Q., Dang, C.Y., Zhou, G., Zhu, Y., Jiang, M.: A New Neural Network for Nonlinear Constrained Optimization Problems. Lecture Notes in Computer Science, Vol. 3173. Springer-Verlag, Berlin Heidelberg New York (2004) 406-411
10. Yang, X.Q., Meng, Z.Q., Huang, X.X., Pong, G.T.Y.: Smoothing Nonlinear Penalty Functions for Constrained Optimization. Numerical Functional Analysis and Optimization 24 (2003) 351-364
11. Rubinov, A.M., Glover, B.M., Yang, X.Q.: Decreasing Functions with Applications to Penalization. SIAM Journal on Optimization 10 (1999) 289-313
Fuzzy Support Vector Machines Based on Spherical Regions Hong-Bing Liu1,2, Sheng-Wu Xiong1, and Xiao-Xiao Niu1 1
School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, P. R. China 2 Department of Computer Science, Xinyang Normal University, Xinyang 464000, P. R. China [email protected], [email protected], [email protected]
Abstract. Fuzzy Support Vector Machines (FSVMs) based on spherical regions are proposed in this paper. Firstly, the center of the sphere is determined by all the training data. Secondly, the membership functions are defined by the distances between each datum and the center of the sphere. Thirdly, using a suitable parameter λ, FSVMs are formed on the spherical regions. A one-against-one decision strategy is adopted so that the proposed FSVMs can be extended to multi-class problems. In order to verify the superiority of the proposed FSVMs, traditional two-class and multi-class machine learning benchmark datasets are used to test their feasibility and performance. The experimental results indicate that the new approach not only has higher precision but also downsizes the number of training data and reduces the running time.
1 Introduction

Support Vector Machines (SVMs) are very popular machine learning methods [1]. Due to their excellent generalization performance, they have been used in a wide range of learning problems, such as handwritten digit recognition [1], disease diagnosis [2] and face detection [3]. Classical problems such as multiple local minima and overfitting, common in neural networks, seldom occur in SVMs. However, SVMs have problems of their own. One of them is how to speed up SVM training and avoid the overfitting problem [4], especially in the case of large-scale learning problems. Another problem is how to extend SVMs to multi-class problems [5]. Therefore, it is valuable and important to develop improved SVMs which can reduce the training time and improve performance. Two kinds of FSVMs [4,6] have been used to address these two problems. In SVMs, the training data that lie on the boundary are very important and decisive in forming the hyperplane. If only the boundary data are extracted, the classification hyperplane can still be determined and the computation time can be reduced. In our previous research [7,8], some new FSVMs based on a reduced training set were formed

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 949 – 954, 2006.
© Springer-Verlag Berlin Heidelberg 2006
and the number of training data was downsized, but the pre-treatment of the training set is time-consuming because clustering techniques are iterative algorithms. The aim of this paper is to seek another method by which the number of training data and the computation time can be reduced. The method based on spherical regions selects the training data near the hyperplane to form the proposed FSVMs. Why do we select these data to form the learning machines? Firstly, the data in the spherical regions have more opportunity to become support vectors than the data outside. Secondly, during the training process we do not yet know the hyperplane, so the data in the spherical regions are selected as representatives of the data near the hyperplane. We examine the performance of the proposed FSVMs with respect to the number of training data (Tr), the number of support vectors (SVs), the training accuracy (Tr(%)), the testing accuracy (Ts(%)) and the total computation time (T(s)) including the training and testing processes. We explain the proposed FSVMs based on spherical regions in Section 2 and compare the new FSVMs with the traditional FSVMs on benchmark data in Section 3. Conclusions are drawn finally.
2 FSVMs Based on Spherical Regions

As mentioned above, these two FSVMs can solve the overfitting problem and reduce the unclassifiable region, but there is still room for improvement. For example, if the membership function is defined by the formula in reference [4], the data points far from the hyperplane are assigned larger penalty values. But these data are unlikely to be misclassified, because they lie on the correct side, far from the hyperplane. The method in this paper assigns a larger penalty value to the training data near the hyperplane and then discards the other training data, which are unlikely to be misclassified. In order to explain the proposed FSVMs clearly, we use two-class problems to illustrate the feasibility of the proposed methods.

2.1 Extraction of the Reduced Training Set

According to the idea of the proposed FSVMs, the training data near the hyperplane are extracted to form the FSVMs. Since the hyperplane is unknown at first, we replace the distances between the training data and the hyperplane with the distances between the data and the center of the sphere. For two-class problems, the data close to the center of the sphere will be near the hyperplane. So we can define the membership functions of the data in class1 and class2 as
u_{1i} = ( max_j ‖x_{1j} − mm‖ − ‖x_{1i} − mm‖ ) / ( max_j ‖x_{1j} − mm‖ − min_j ‖x_{1j} − mm‖ ),

u_{2i} = ( max_j ‖x_{2j} − mm‖ − ‖x_{2i} − mm‖ ) / ( max_j ‖x_{2j} − mm‖ − min_j ‖x_{2j} − mm‖ ),    (1)
where mm denotes the middle point of the two class centers. Formula (1) ensures that the data near the hyperplane have larger penalty values. The training set, consisting of two fuzzy sets, is obtained independently:
S_{1f} = {(x_{1i}, y_{1i}, u_{1i}) | x_{1i} ∈ R^d, y_{1i} = 1, u_{1i} ∈ [0,1], i = 1, 2, ..., l_1}    (2a)

S_{2f} = {(x_{2i}, y_{2i}, u_{2i}) | x_{2i} ∈ R^d, y_{2i} = −1, u_{2i} ∈ [0,1], i = 1, 2, ..., l_2}    (2b)
The purpose of the FSVMs is to select the data inside the spherical regions and then to train the learning machine. The spherical regions are determined by the following formula (3) using a parameter λ, a user-defined empirical value:

S_sr = {x_{1i} | u_{1i} ≥ λ, x_{1i} ∈ class1} ∪ {x_{2j} | u_{2j} ≥ λ, x_{2j} ∈ class2},  λ ∈ [0,1]    (3)
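Formulas (1) and (3) amount to a few lines of array code. The sketch below is our illustration (function and variable names are ours, and the Gaussian test data are invented): it computes the membership values and extracts the spherical region for a two-class training set.

```python
import numpy as np

def spherical_region(X1, X2, lam):
    """Select the training data inside the spherical region.

    X1, X2 : arrays of shape (l1, d) and (l2, d) with the class1/class2 data.
    lam    : user-defined threshold in [0, 1] (formula (3)).
    Returns the selected subsets of X1 and X2 and the membership values."""
    mm = 0.5 * (X1.mean(axis=0) + X2.mean(axis=0))   # middle point of the class centers

    def memberships(X):
        d = np.linalg.norm(X - mm, axis=1)
        return (d.max() - d) / (d.max() - d.min())   # formula (1): near mm -> close to 1

    u1, u2 = memberships(X1), memberships(X2)
    return X1[u1 >= lam], X2[u2 >= lam], u1, u2

# Two Gaussian blobs; a larger lam keeps only the data nearest the class boundary.
gen = np.random.default_rng(0)
X1 = gen.normal([1.0, 1.0], 0.5, size=(100, 2))
X2 = gen.normal([4.0, 4.0], 0.5, size=(100, 2))
S1, S2, u1, u2 = spherical_region(X1, X2, lam=0.7)
```

Note that the normalization assumes the points of a class are not all equidistant from mm; a degenerate class would need a guard against division by zero.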
Fig. 1 shows a spherical region of the training set including class1 and class2. The shaded circle represents the spherical region in 2-dimensional space. The data inside the circle are selected for training and the other data are discarded.
Fig. 1. Extraction of spherical region
2.2 Geometrical Interpretation of the Proposed FSVMs
In order to explain the idea clearly, we select a linearly separable two-class problem in 2-dimensional space to perform the training process. In the case of nonlinearly separable problems, the reduced training data can be selected in the original space and mapped into a high-dimensional space by using kernel methods. In this experiment, 200 data points of the training set are generated under two normal distributions with mu1 = [1, 1], mu2 = [4, 4], and

sigma1 = sigma2 = (  1     −0.3
                    −0.3     1  ).

Obviously, they are linearly separable. In Fig. 2, the data marked 'o' and '+' represent the source data, and the data marked '·' represent the selected training data. When λ = 0.7, the hyperplanes of the proposed FSVMs and of the conventional FSVMs are identical, but only 36 training data are used by the proposed method.
Fig. 2. Hyperplane1 (dashed line) and hyperplane2 (solid line) are the hyperplanes of the conventional FSVMs and of the proposed FSVMs, respectively, when λ = 0.7
3 Comparison of the New FSVMs and the Traditional FSVMs

To verify the performance of the proposed FSVMs, experiments were performed using some benchmark data [9]. In order to compare the computation time with the conventional FSVMs, all experiments were performed in the same operating environment. As for parameter selection, we mainly discuss the influence that the parameter λ has on the proposed FSVMs. We select the constant C using the cross-validation method [10] to achieve the maximal accuracy of the FSVMs. The dataset named BCW (breast-cancer-wisconsin) includes 239 positive 9-dimensional instances and 444 negative ones. In this experiment, 159 positive and 296 negative data are selected as training data at random. The remaining data are used to verify the generalization ability of the proposed FSVMs and the conventional FSVMs. From Table 1, we can see that the conventional FSVMs are not the best classifier; the best one is the proposed FSVMs with λ = 0.3, which uses 398 training data and 53 support vectors. From the table we can also see that the overfitting problem can be avoided with λ = 0.3 and that T(s) is reduced noticeably.

Table 1. Performance of the proposed FSVMs on BCW for dot kernels and C=100000
Classifiers           λ     Tr    SVs   Tr(%)    Ts(%)    T(s)
The proposed FSVMs    0.8   19    14    96.264   95.61    0.453
                      0.7   50    23    95.824   96.05    0.484
                      0.6   92    37    97.143   96.93    0.5
                      0.5   190   43    96.923   96.93    0.532
                      0.4   323   51    97.143   96.93    0.64
                      0.3   398   53    97.363   96.93    0.782
                      0.2   440   54    97.143   96.93    0.843
                      0.1   446   53    97.363   96.49    0.844
FSVMs                 -     455   53    97.363   96.491   0.859
Table 2. Performance of the new FSVMs and FSVMs on benchmark data for different kernels

Dot product kernel
Classifiers           λ     Tr(%)    Ts(%)    T1(s)    T2(s)
The proposed FSVMs    0.5   87.337   83.962   30.488   7.904
                      0.4   94.369   89.794   37.644   10.217
                      0.3   98.452   95.597   46.309   11.736
                      0.2   99.079   96.398   55.799   12.997
                      0.1   99.319   95.883   63.997   13.423
FSVMs                 -     99.359   95.483   78.657   13.627

Polynomial kernel (Poly2)
Classifiers           λ     Tr(%)    Ts(%)    T1(s)    T2(s)
The proposed FSVMs    0.5   89.311   86.735   27.779   11.655
                      0.4   95.41    92.853   33.153   13.672
                      0.3   99.453   97.599   41.379   16.234
                      0.2   99.6     97.627   48.89    18.297
                      0.1   99.907   97.684   53.686   19.752
FSVMs                 -     100      97.57    58.736   20.127

Polynomial kernel (Poly4)
Classifiers           λ     Tr(%)    Ts(%)    T1(s)    T2(s)
The proposed FSVMs    0.5   89.392   87.907   30.134   14.777
                      0.4   95.81    92.967   40.859   18.825
                      0.3   99.533   97.713   48.701   21.406
                      0.2   99.827   97.684   54.687   22.846
                      0.1   99.907   97.713   59.876   24.048
FSVMs                 -     100      97.599   64.377   24.249

Polynomial kernel (Poly6)
Classifiers           λ     Tr(%)    Ts(%)    T1(s)    T2(s)
The proposed FSVMs    0.5   93.528   90.938   31.265   15.408
                      0.4   95.97    93.025   42.622   19.3
                      0.3   99.573   97.713   52.659   21.775
                      0.2   99.786   97.656   59.08    23.766
                      0.1   99.907   97.713   64.555   24.859
FSVMs                 -     100      97.541   68.213   24.987
The digit recognition task can be solved using voting-scheme methods based on a combination of many binary classifiers. In the digit recognition experiment, we use the dataset named pendigits, which consists of 7494 training data and 3498 testing data [9]. The integration of the two kinds of FSVMs is adopted in this paper; namely, the implementation consists of 45 classifiers, each of which is a proposed FSVM. In the decision phase, the fuzzy membership functions in the directions orthogonal to the hyperplane are adopted. Because of the low precision obtained with the RBF kernel, we only list the results for the dot and polynomial kernels in Table 2. In the table, T1(s) and T2(s) represent the time of quadratic programming and of kernel computation,
respectively, and Poly i denotes the polynomial kernel of order d = i. Obviously, the best classifier is not the conventional FSVMs but the proposed FSVMs.
4 Conclusions

In this paper, we proposed FSVMs based on spherical regions. During the training process, the proposed learning machines define the membership functions by the distance between each datum and the center of the sphere. The proposed FSVMs are formed on the reduced training set, which is composed of the data inside the spherical regions. In the decision process, a one-against-one strategy is used for multi-class classification problems. With the help of numerical simulations on typical two-class and multi-class problem datasets, we demonstrate the superiority of our method. Firstly, acceptable and even better FSVMs can be obtained. Secondly, the proposed FSVMs can find the best results rapidly for linearly separable learning problems. Thirdly, the number of support vectors in the proposed FSVMs is less than or equal to that of the traditional FSVMs. Of course, there are some disadvantages to our method: since an array of values of the parameter λ must be tried, many new FSVMs need to be constructed.

Acknowledgement. This work was in part supported by the 973 Program (Grant No. 2004CCA02500) and NSFC (Grant No. 60572015).
References
1. Vapnik, V.N.: Statistical Learning Theory. John Wiley & Sons, New York (1998)
2. Wee, J.W., Lee, C.H.: Concurrent Support Vector Machine Processor for Disease Diagnosis. Lecture Notes in Computer Science, Vol. 3316. Springer-Verlag, Berlin Heidelberg New York (2004) 1129-1134
3. Buciu, I., Kotropoulos, C., Pitas, I.: Combining Support Vector Machines for Accurate Face Detection. Proceeding of International Conference on Image Processing. Thessaloniki, Greece (2001) 1054–1057
4. Inoue, T., Abe, S.: Fuzzy Support Vector Machines for Pattern Classification. Proceeding of International Joint Conference on Neural Networks. Washington (2001) 1449–1454
5. Abe, S.: Analysis of Multi-class Support Vector Machines. Proceeding of International Conference on Computational Intelligence for Modeling Control and Automation. Vienna (2003) 385-396
6. Huang, H.P., Liu, Y.H.: Fuzzy Support Vector Machines for Pattern Recognition and Data Mining. International Journal of Fuzzy Systems, Vol. 4 (2002) 826-835
7. Xiong, S.W., Liu, H.B., Niu, X.X.: Fuzzy Support Vector Machines Based on FCM Clustering. Proceeding of International Conference on Machine Learning and Cybernetics, Vol. 5. Guangzhou, China (2005) 2608-2613
8. Xiong, S.W., Niu, X.X., Liu, H.B.: Support Vector Machines Based on Subtractive Clustering. Proceeding of International Conference on Machine Learning and Cybernetics, Vol. 9. Guangzhou, China (2005) 4345-4350
9. ftp://ftp.ics.uci.edu/pub/machine-learning-databases
10. Montgomery, D.C.: Design and Analysis of Experiments. 5th edition. John Wiley and Sons, New York (2001)
Building Support Vector Machine Alternative Using Algorithms of Computational Geometry Marek Bundzel1, Tomáš Kasanický1,2, and Baltazár Frankovič2 1
Technical University of Košice, Faculty of Electrical Engineering and Informatics, Department of Cybernetics and Artificial Intelligence, Letná 9, Košice, 04001, Slovak Republic [email protected] 2 Slovak Academy of Sciences, Institute of Informatics, Dúbravská cesta 9, 845 07 Bratislava 45, Slovak Republic [email protected], [email protected]
Abstract. The task of pattern recognition is a task of dividing a feature space into regions separating the training examples belonging to different classes. Support Vector Machines (SVM) identify the most borderline examples, called support vectors, and use them to determine discrimination hyperplanes (hyper-curves). In this paper a pattern recognition method is proposed which represents an alternative to the SVM algorithm. Support vectors are identified using selected methods of computational geometry in the original space of features, i.e., not in the transformed space determined partially by the kernel function of SVM. The proposed algorithm enables the usage of kernel functions. The separation task is reduced to a search for an optimal separating hyperplane, or a Winner Takes All (WTA) principle is applied.
1
Motivation
Pattern recognition is based on a correct determination of the membership of particular examples to the individual classes. Classification techniques use several ways of selecting the reference examples. The selection method influences the bias, the robustness and the accuracy of the approach. SVMs are well known and respected for their generalization ability. An SVM searches for a decision boundary in the form of a hyperplane, and non-linearity of the solution is achieved by means of a transformation of the training examples; support vectors and a selected kernel function play the key role there ([3], [4]). But SVM training is also demanding on computer resources, and therefore (incremental) alternatives are being sought. The pattern recognition algorithm proposed in this paper identifies the reference examples – support vectors – using computational geometry algorithms. The support vectors are identified in the original space of features. This is in contrast to SVM, which identifies the support vectors in the transformed space. The separation hyperplane calculated by SVM in the transformed space satisfies the criteria of optimality stated in [4]. However, the optimality of the decision

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 955–961, 2006.
© Springer-Verlag Berlin Heidelberg 2006
Fig. 1. Example: edges created by CG methods – a. Delaunay triangulation, b. Gabriel graph, c. relative neighborhood graph.
boundary projected back into the space of features is not well defined. Our goal was to use a non-linear transformation of the training examples with special focus on the spatial distribution of support vectors in the original space of features. Usage of the simple Winner Takes All (WTA) principle for determining the class of an example enables the proposed algorithm to be used incrementally. Smooth boundaries and better generalization are achieved by using kernel functions and applying the proposed linear separation algorithm called Optimal Linear Separation.
2
Background
Computational geometry (CG) studies algorithms solving problems formulated in terms of mathematical geometry. Some purely geometrical problems arise out of the study of CG algorithms, and the study of such problems is considered to be part of CG as well. Delaunay triangulation, the Gabriel graph and the relative neighborhood graph were selected to be applied in the proposed approach. These algorithms are mainly used in computer graphics, CAD systems, cartography, etc. Delaunay triangulation: The triangulation was invented by Boris Delaunay in 1934 ([1]). The Delaunay triangulation for a set P of points in the plane is the triangulation DT(P) of P such that no point in P is inside the circumcircle of any triangle in DT(P). Figure 1 a. shows an example of Delaunay triangulation. The complexity O of the calculation depends on the count n of the points and the dimension d of the space of the points, O(n^⌈d/2⌉). Gabriel graph: The Gabriel graph GG(P) ([2]) is the set of edges l_ij, a subset of DT(P), for which the open circle (ball) with diameter [P_i P_j] contains no points other than P_i and P_j: GG(P) = {l_ij ⊂ P | ∀P_k ∈ P, (P_k − P_i)² + (P_k − P_j)² > (P_i − P_j)²}. The situation is shown in Figure 1 b. The condition for an edge to be created is stricter than in Delaunay triangulation, therefore fewer edges arise. The average complexity of the Gabriel graph is O(dn³) ([6]). Relative neighborhood graph: The relative neighborhood graph of P, the RNG(P), is a neighborhood graph where two points P_i and P_j are called relative
Building Support Vector Machine Alternative Using Algorithms
957
neighbors if all other points P_x in the set satisfy: distance(P_i, P_j) < max{distance(P_i, P_x), distance(P_j, P_x)}. The situation is shown in Figure 1c. N-dimensional and incremental implementations exist for all of the above algorithms. One of the first works discussing possible applications of CG to pattern recognition was [8]. A pioneering work comparing SVM with the graph-building algorithms was [6]. The author points out the possibility of using CG algorithms to identify support vectors, and compares the complexity and performance of the algorithms. Nevertheless, the author does not consider further use of the support vectors together with a kernel function. This is the basis of our method, ultimately increasing accuracy and generalization performance. The need to reduce the computational demands of SVM gave rise to various alternatives. DirectSVM ([9]) represents a purely geometrical approach to identifying support vectors and determining the optimal separating hyperplane. The method described here uses an incremental geometrical approach to identify support vectors, and the optimal separating hyperplane is determined by an iterative optimization procedure. The graph constructed during the training phase can potentially be utilized for other purposes (e.g., data mining).
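The CG-based identification of support vectors can be sketched in a few lines. This is a brute-force illustration, not the authors' implementation: instead of building DT(P) first, each candidate edge is tested directly against the Gabriel condition, which matches the O(dn³) bound quoted above. All function names are our own.

```python
import numpy as np

def gabriel_edges(X):
    """Brute-force Gabriel graph: edge (i, j) survives iff the open ball
    with diameter [X_i, X_j] contains no other point (O(d*n^3) overall)."""
    n = len(X)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            mid = (X[i] + X[j]) / 2.0
            r2 = np.sum((X[i] - X[j]) ** 2) / 4.0   # squared ball radius
            ok = True
            for k in range(n):
                if k != i and k != j and np.sum((X[k] - mid) ** 2) < r2:
                    ok = False
                    break
            if ok:
                edges.append((i, j))
    return edges

def cg_support_vectors(X, y):
    """Support vectors = points joined by a Gabriel edge to the other class."""
    sv = set()
    for i, j in gabriel_edges(X):
        if y[i] != y[j]:
            sv.update((i, j))
    return sorted(sv)
```

On four collinear points with a class change in the middle, only the two points flanking the class border are selected.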
3
Design
The proposed system and its process chart are depicted in Figure 2. The graph-building block creates a graph in the feature space (over the training data) using one of the above CG algorithms. The support vector selector identifies support vectors in the training data: support vectors are defined as points of different classes connected by an edge in the graph. Using the set of support vectors, the class membership of any example can be determined by the WTA algorithm, i.e., the evaluated example belongs to the same class as its closest support vector.
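The WTA classification step just described is a nearest-neighbor rule restricted to the support vectors; a minimal sketch (names are our own):

```python
import numpy as np

def wta_classify(x, sv_points, sv_labels):
    """Winner Takes All: assign the class of the nearest support vector."""
    d2 = np.sum((np.asarray(sv_points) - np.asarray(x)) ** 2, axis=1)
    return sv_labels[int(np.argmin(d2))]
```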
Fig. 2. Outline of the proposed system
958
M. Bundzel, T. Kasanick´ y, and B. Frankoviˇc
If a kernel function is to be used, the classifier is finalized to the form:

f(x_t) = Σ_{i=1}^{N_s} α_i y_i e^{−γ‖s_i − x_t‖²} + b.   (1)
The training data are non-linearly transformed by a Gaussian RBF in this case. The parameter γ of the Gaussian kernel is set by the user. The final block of the system, the Separator, searches for the parameters α of the hyperplane linearly separating the transformed examples. x_t stands for the evaluated example, s_i are the support vectors and y_i ∈ {−1, 1} represents the class of the individual support vectors. The Optimal Linear Separation (OLS) method was used to determine the optimal (widest-margin) separating hyperplane. The method described here was originally introduced in [7]. Let us consider two sets

A = {a_1, a_2, ..., a_{N_A}},  a_i ∈ R^n,  i = 1, 2, ..., N_A,   (2)
B = {b_1, b_2, ..., b_{N_B}},  b_i ∈ R^n,  i = 1, 2, ..., N_B.   (3)

Next let us consider a normal vector v ∈ R^n with unit norm:

v = {v_1, v_2, ..., v_n},  ‖v‖_2 = ( Σ_{i=1}^{n} v_i² )^{1/2} = 1.   (4)
For hyperplanes with normal vector v we consider the following criterial function, which describes the quality of the dichotomous separation achieved by the choice of direction:

Q_v = max{ [min_{a∈A}(a · v) − max_{b∈B}(b · v)], [min_{b∈B}(b · v) − max_{a∈A}(a · v)] }.   (5)

Then the Optimal Separation Problem can be formulated as finding the direction v*, for which:

Q_{v*} = max_{‖v‖_2 = 1} Q_v.   (6)
The 'optimal' direction, which maximizes the criterial function over the set of unit vectors v, is computed by an iterative numerical method. If the data are linearly separable, the separating hyperplane has the widest margin. Otherwise, the separating hyperplane has the narrowest margin relative to the closest misclassified examples.
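The criterion Q_v of (5) is cheap to evaluate for any direction. The paper does not specify the iterative method used for (6), so in the sketch below a crude random-restart search over unit vectors stands in for it; this is our own stand-in, not the authors' optimizer.

```python
import numpy as np

def Q(v, A, B):
    """Criterial function (5): signed margin width for direction v."""
    pa, pb = A @ v, B @ v
    return max(pa.min() - pb.max(), pb.min() - pa.max())

def optimal_direction(A, B, n_trials=2000, seed=0):
    """Stand-in for the iterative search in (6): best of random unit vectors."""
    rng = np.random.default_rng(seed)
    best_v, best_q = None, -np.inf
    for _ in range(n_trials):
        v = rng.normal(size=A.shape[1])
        v /= np.linalg.norm(v)
        q = Q(v, A, B)
        if q > best_q:
            best_q, best_v = q, v
    return best_v, best_q
```

For two linearly separable point sets, the best Q_v found approaches the true margin width.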
4
Experiments
Three different datasets were used. The comparative experiments with SVM were performed using the SVMlight implementation ([5]).
Table 1. Experimental results on spiral data. Overall accuracy calculated on the testing set.

Method                    Accuracy-Train (%)   Accuracy-Test (%)
Delaunay + WTA            100.0                87.92
Gabriel Graph + WTA       100.0                87.74
Relat. Neighbors + WTA    99.84                86.54
Delaunay + OLS            98.48                91.17
Gabriel Graph + OLS       98.32                90.38
Relat. Neighbors + OLS    96.48                88.32
SVM                       100.0                92.1
Fig. 3. Experimental results on spiral data. a. Gabriel graph + WTA, b. Gabriel graph + OLS, c. SVM. Support vectors are circled.
The 2-D 'Double Spiral' classification problem was tackled first. 1250 examples were drawn randomly from a 1000x1000 BW image and used for training; the rest were used for testing. The experimental results were evaluated statistically (Table 1) and graphically (Fig. 3). The counts of identified support vectors were: Delaunay triangulation 910, Gabriel graph 805, relative neighborhood graph 576, SVM 213. The parameter γ of the RBF kernel was set to 95 for both OLS and SVM. The second artificial dataset contained examples spatially distributed with varying density. The goal was to observe how the methods cope with this situation. The experimental results are provided in Fig. 4. The parameter γ of the RBF kernel was again set to 95 for both OLS and SVM. The third dataset represented multispectral, remotely sensed image data. The image, containing 775x475 pixels, was taken over the eastern Slovakia region and sensed by a Landsat satellite in 6 spectral bands. The main goal was land-use identification. Seven classes of interest were picked for the classification procedure; more details on the data are provided in ([7]). This application required extending the method described above to multiclass classification, which was trivial. The experimental results are provided in Table 2.
Fig. 4. Experimental results. a. Delaunay graph + WTA, b. Delaunay graph + OLS, c. SVM. Support vectors are circled.

Table 2. Experimental results on real world data. Overall accuracy is calculated on the testing set; per-class accuracies were calculated for each class (A, B, C, ...).

Method                 A (%)   B (%)   C (%)    D (%)   E (%)   F (%)   G (%)   Overall (%)
Gabriel Graph + WTA    80.77   82.23   100.00   75.94   62.57   99.57   85.14   86.96
Gabriel Graph + OLS    75.64   87.81   100.00   92.57   49.12   99.32   71.62   91.16
SVM                    83.33   79.96   100.00   94.57   60.82   99.40   72.97   91.6
Figure 3 and Table 1 document the positive influence of the OLS method with RBF kernel on classification accuracy and generalization, relative to the WTA method. Comparison of Figures 3b and 3c and the counts of identified support vectors indicates that the proposed CG approach identifies more reference examples than SVM, which is not always desirable. In contrast, the second experiment shows that the count of support vectors is higher for SVM than for CG (with the RBF parameter γ set to 95); SVM also identified support vectors far from the decision boundary. The experiment on real-world data documents the performance of the proposed method. An interesting insight is provided by the accuracy for class E, represented largely by noisy and/or conflicting examples. The sensitivity of CG support vector identification to noise remains a problem to be solved. A possible solution is to consider the contribution of several examples near a support vector (weighted averaging), producing more typical support vectors.
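The weighted-averaging remedy is only sketched in the text. One hypothetical realization (our own, not the authors') replaces each identified support vector by the mean of itself and its k nearest same-class neighbors:

```python
import numpy as np

def smooth_support_vectors(X, y, sv_idx, k=3):
    """Replace each support vector by the average of itself and its k nearest
    neighbours of the same class (hypothetical noise-reduction step)."""
    X = np.asarray(X, dtype=float)
    out = []
    for i in sv_idx:
        same = np.where(y == y[i])[0]                 # indices of the same class
        d2 = np.sum((X[same] - X[i]) ** 2, axis=1)
        nearest = same[np.argsort(d2)[:k + 1]]        # includes i itself
        out.append(X[nearest].mean(axis=0))
    return np.array(out)
```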
5
Conclusions
Since SVM searches for support vectors in the transformed space, the optimality of the decision boundary projected back into the original feature space can be put in question. The dotted lines in Fig. 4c represent images
of hyperplanes parallel to the separating hyperplane (an equidistant margin). It is obvious that these images no longer follow the principle of a margin in the feature space, i.e., their distance to the decision boundary is not constant in the direction perpendicular to it. The proposed method reduces this effect because the support vectors are selected in the feature space. It is also possible to use the CG-identified support vectors for describing the modeled system and for interpreting the relationships between examples of different classes in multiclass applications. The WTA implementation of the proposed method enables incremental learning. It is important to note that any linear separator (e.g., Rosenblatt's perceptron) can be used as the Separator in the algorithm. This opens a door for future incremental implementations of the method with kernel functions.
Acknowledgement. This work was partially supported by the Scientific Grant Agency of the Ministry of Education of the Slovak Republic and the Slovak Academy of Sciences under contract No. VEGA 2/4148/24.
References
1. Delaunay, B.: Sur la sphère vide. Izvestia Akademii Nauk SSSR, Otdelenie Matematicheskikh i Estestvennykh Nauk (1934)
2. Gabriel, K.R., Sokal, R.R.: A New Statistical Approach to Geographic Variation Analysis. Systematic Zoology, Vol. 18 (1969) 259-278
3. Vapnik, V., Chervonenkis, A.: Theory of Pattern Recognition (in Russian). Nauka, Moscow (1974). Wapnik, W., Tscherwonenkis, A.: Theorie der Zeichenerkennung (German translation). Akademie-Verlag, Berlin (1979)
4. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc. (1995)
5. Joachims, T.: Making Large-Scale SVM Learning Practical. In: Schölkopf, B., Burges, C., Smola, A. (eds.): Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge (1999)
6. Zhang, W., King, I.: A Study of the Relationship Between Support Vector Machine and Gabriel Graph. In: Proc. of Int. Joint Conference on Neural Networks (IEEE) (2002)
7. Bundzel, M.: Structural and Parametrical Adaptation of Artificial Neural Networks Using Principles of Support Vector Machines. Ph.D. Thesis, Technical University of Kosice, Slovakia, Faculty of Electrical Engineering and Informatics, Department of Cybernetics and Artificial Intelligence (2005), http://neuron.tuke.sk/bundzel/Ph.D-thesis/main.pdf
8. Urquhart, R.: Graph Theoretical Clustering Based on Limited Neighborhood Sets. Pattern Recognition 15(3) (1982) 173-187
9. Roobaert, D.: DirectSVM: A Fast and Simple Support Vector Machine Perceptron. In: Proc. of International Workshop on Neural Networks for Signal Processing (IEEE), Sydney, Australia (2000) 356-365
Cooperative Clustering for Training SVMs Shengfeng Tian, Shaomin Mu, and Chuanhuan Yin School of Computer and Information Technology, Beijing Jiaotong University, Beijing, 100044, P.R. China {sftian, chhyin}@center.njtu.edu.cn, [email protected]
Abstract. Support vector machines are currently very popular approaches to supervised learning. Unfortunately, the computational load of the training and classification procedures increases drastically with the size of the training data set. In this paper, a method called cooperative clustering is proposed. With this procedure, a set of data points of pre-determined size near the border of two classes is determined and taken as the set of support vectors; the training of the support vector machine is then performed on this set. With this approach, training efficiency and classification efficiency are achieved with small effects on generalization performance. The approach can also be used to reduce the number of support vectors in regression problems.
1 Introduction

In recent years, support vector machines (SVMs) [1] have rapidly gained popularity due to their excellent generalization performance in a wide variety of learning problems. But the SVM approach presents some challenges that may limit its usefulness in many applications. Firstly, the training algorithm needs to solve a quadratic programming (QP) problem. Since QP routines have high complexity, the training procedure requires huge memory and computational time for large-data applications. Secondly, the time taken by the classification procedure is proportional to the number of support vectors, so classification is slow if that number is large. To handle the first problem, several approaches have been suggested in the past few years, such as Chunking [2], Decomposition [3], and SMO [4]. In least squares support vector machines (LS-SVM) [5], the solution follows from a linear system instead of a quadratic programming problem and can be efficiently obtained by iterative methods such as conjugate gradient. But sparseness is lost: every data point contributes to the model. To handle the second problem, several approaches have been proposed to reduce the number of support vectors [6,7,8]. The drawback of these approaches is that additional operations are needed with respect to standard training algorithms, or the accuracy is lower than that of the standard methods. In this paper, we present a method to handle these two problems, achieving training efficiency and classification efficiency with small effects on generalization performance. The key procedure is called cooperative clustering. In this procedure, a set of data points of pre-determined size near the border of the two classes is determined; training is then performed on this set of data points.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 962 – 967, 2006. © Springer-Verlag Berlin Heidelberg 2006
This paper is organized as follows. In Section 2, we describe the cooperative clustering. In Section 3, we illustrate a training procedure for classification and a method to simplify the SVMs for regression. In Section 4 we give the results of numerical experimentation to prove the effectiveness and efficiency of the method. Finally, the conclusion is given in Section 5.
2 Cooperative Clustering

The fuzzy c-means (FCM) algorithm [9] has been utilized in a wide variety of applications. FCM partitions a given data set, X = {x_1, ..., x_n} ⊂ R^d, into c fuzzy subsets by minimizing the following objective function:

J_m(U, V) = Σ_{i=1}^{c} Σ_{k=1}^{n} u_ik^m ‖x_k − v_i‖²   (1)
where c is the number of clusters, n the number of data points, u_ik the membership of x_k in class i, m the quantity controlling clustering fuzziness, and V the set of cluster centers (v_i ∈ R^d). The matrix U with ik-th entry u_ik is constrained by the condition

Σ_{i=1}^{c} u_ik = 1,  k = 1, ..., n.   (2)
The FCM algorithm consists of the following steps:

(1) Initialize the cluster centers v_i, i = 1, ..., c.
(2) Compute the matrix U as follows:

u_ik = [ Σ_{j=1}^{c} (D_ik / D_jk)^{2/(m−1)} ]^{−1}   (3)

where D_ik = ‖x_k − v_i‖. If the set I_k = {i | 1 ≤ i ≤ c; D_ik = 0} is not empty, we set

u_ik = 0, ∀i ∈ {1, ..., c} − I_k, and Σ_{i∈I_k} u_ik = 1.   (4)
(3) Update the cluster centers v_i, i = 1, ..., c, as follows:

v_i = Σ_{k=1}^{n} (u_ik)^m x_k / Σ_{k=1}^{n} (u_ik)^m,  ∀i.   (5)
(4) Return the matrix U if no cluster center changed in Step (3); otherwise go to Step (2).

In support vector machines, the support vectors of two classes lie near the border between the classes. Obviously, the cluster centers of the two classes cannot replace the support vectors directly. In the cooperative clustering approach, we iteratively compute the cluster centers of both classes simultaneously and draw them towards the border between the classes. The cluster centers near the border can then be used to replace the support vectors approximately.
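As background, the FCM iteration of steps (1)-(4) can be sketched in NumPy. This is a minimal illustration with deterministic initialization (the first c points), not the reference implementation:

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100):
    """Fuzzy c-means: alternate the membership update (3) and the
    centre update (5). Centres are initialised with the first c points."""
    X = np.asarray(X, dtype=float)
    V = X[:c].copy()                                            # step (1)
    for _ in range(n_iter):
        D2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)     # D_ik squared
        D2 = np.maximum(D2, 1e-12)                              # guard for D_ik = 0, cf. (4)
        # step (2), eq. (3): u_ik = 1 / sum_j (D_ik/D_jk)^(2/(m-1))
        U = 1.0 / ((D2[:, :, None] / D2[:, None, :]) ** (1.0 / (m - 1))).sum(axis=2)
        W = U ** m
        V = (W.T @ X) / W.sum(axis=0)[:, None]                  # step (3), eq. (5)
    return V, U
```

On two well-separated clusters the centers converge to the cluster means and memberships stay normalized.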
Suppose that there are two-class data sets X+ and X−, each partitioned into c subsets. Let v+ = {v_1+, ..., v_c+} be the set of cluster centers in class +, v− = {v_1−, ..., v_c−} be the set of cluster centers in class −, and A be the distance matrix between the two sets v+ and v−. The ij-th entry a_ij of the matrix A is computed as

a_ij = ‖v_i+ − v_j−‖²,  i, j = 1, ..., c.   (6)

We iteratively take out each pair <v_p+, v_q−> of cluster centers with the smallest distance from the sets v+ and v−, according to the matrix A. The two cluster centers in
each pair <v_p+, v_q−> should move towards one another. Let r_p+ be the average radius of cluster p in class + and r_q− be the average radius of cluster q in class −. We have

r_p+ = Σ_{x_k∈X+} u_pk ‖x_k − v_p+‖ / Σ_{x_k∈X+} u_pk   (7)

r_q− = Σ_{x_k∈X−} u_qk ‖x_k − v_q−‖ / Σ_{x_k∈X−} u_qk   (8)

Then each pair <v_p+, v_q−> is updated as follows:

v_p+ = v_p+ + λ · r_p+ / (r_p+ + r_q−) · (v_q− − v_p+)   (9)

v_q− = v_q− + λ · r_q− / (r_p+ + r_q−) · (v_p+ − v_q−)   (10)

where λ ∈ (0, 1) is the quantity controlling the distance between v_p+ and v_q−. In this paper, we take λ = 0.8. The whole procedure of the cooperative clustering is as follows.
paper, we take λ = 0.8. The whole procedure of the cooperative clustering is as follows. +
-
(1) Initialize the cluster centers v i and v i , i = 1,…,c. (2) Compute the matrix U+ and U- for classes + and – with equations (3) and (4) respectively. +
-
(3) Compute the cluster centers v i and v i , i = 1,…,c, for classes + and – with equation (5) respectively. (4) Computer the distance matrix A with equation (6) and find out the set Vs of pairs
v +p and v q- , by searching through matrix A. (5) Update the cluster centers
v +p and v q- in set Vs with equations (9) and (10).
(6) Return the set Vs if no cluster center is changed in step (5) otherwise go to Step (2). With the above procedure, we can find c pairs of cluster centers. Each pair crosses the border of two classes.
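Steps (4)-(5) can be sketched as follows. A greedy pairing is assumed, since the paper does not specify exactly how the pairs are extracted from A; function and variable names are our own.

```python
import numpy as np

def cooperative_update(Vp, Vm, rp, rm, lam=0.8):
    """Steps (4)-(5): greedily pair the closest (+,-) centres via the
    distance matrix of eq. (6), then move each pair together with
    eqs. (9) and (10)."""
    A = ((Vp[:, None, :] - Vm[None, :, :]) ** 2).sum(-1)   # a_ij, eq. (6)
    pairs, used_p, used_q = [], set(), set()
    for idx in np.argsort(A, axis=None):                   # smallest distances first
        p, q = divmod(int(idx), A.shape[1])
        if p not in used_p and q not in used_q:
            pairs.append((p, q))
            used_p.add(p)
            used_q.add(q)
    for p, q in pairs:
        s = rp[p] + rm[q]
        vp, vq = Vp[p].copy(), Vm[q].copy()
        Vp[p] = vp + lam * rp[p] / s * (vq - vp)           # eq. (9)
        Vm[q] = vq + lam * rm[q] / s * (vp - vq)           # eq. (10)
    return Vp, Vm, pairs
```

With λ = 0.8 and equal radii, a single pair at distance 2 is drawn symmetrically towards its midpoint.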
3 Training Support Vector Machines

For classification, LS-SVM is among the fastest SVM training algorithms, but sparseness is lost: almost all data points become support vectors. If we train the support vector machine only on the cluster centers, LS-SVM is a good choice. Because the cluster centers found by cooperative clustering lie near the border between the two classes, they can replace the support vectors approximately, without obvious effects on classification performance. We call this algorithm CC-SVM. The training algorithm is efficient and sparseness is preserved because the number of training data is small.

For regression problems, standard training algorithms generally produce solutions with a large number of support vectors. Several approaches have been proposed to simplify the support vector solutions [6,7]. As an alternative, cooperative clustering can be revised to do this task. Given a training data set of n points (x_i, y_i), i = 1, ..., n, where x_i ∈ R^d and y_i ∈ R, we solve the problem with LS-SVM and obtain the estimation function. After training, we partition the training data into two classes +1 and −1 using the estimation function. Cooperative clustering can then be performed in d+1 dimensions, with a procedure slightly different from that in Section 2: in step (5), the cluster centers v_p+ and v_q− in the set Vs move towards one another and merge into a single point v_pq = (x_pq, y_pq). Letting v_p+ = (x_p, y_p) and v_q− = (x_q, y_q), we have

x_pq = (x_p + x_q)/2,  y_pq = Σ_i α_i K(x_pq, x_i) + b.   (11)

With these c cluster centers as training data and the LS-SVM algorithm, an estimation function with c support vectors is obtained.
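The merge of eq. (11) can be sketched as below. A Gaussian RBF kernel is assumed, and alpha and b are taken from the initial LS-SVM fit; names are our own.

```python
import numpy as np

def merge_pair(xp, xq, X_train, alpha, b, sigma=0.3):
    """Eq. (11): merge a (+,-) centre pair into one point; its target value is
    re-evaluated with the trained kernel expansion (Gaussian RBF assumed)."""
    x_pq = (np.asarray(xp, float) + np.asarray(xq, float)) / 2.0
    k = np.exp(-np.sum((np.asarray(X_train, float) - x_pq) ** 2, axis=1) / sigma ** 2)
    y_pq = float(np.dot(alpha, k) + b)
    return x_pq, y_pq
```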
4 Experiments

For the classification problem, the proposed approach is evaluated on an artificially generated data set. We choose the RBF kernel K(x, x') = exp[−‖x − x'‖²/σ²] with σ = 0.5 and the constant parameter C = 5 in the experiments. The dataset contains two classes, each with 24 points. Figure 1 shows the results, in which rectangles and circles denote data points of the two classes, large rectangles and circles denote support vectors of the two classes, and the line denotes the border between the two classes. Note that with CC-SVM the support vectors are distributed pairwise near the border. Because the positions of the support vectors need not coincide with the training data, the distribution of the support vectors can be more reasonable, and fewer support vectors can be used to obtain similar performance. Of course, the number of pairs should be selected carefully. For the regression problem, we estimate the function 10 sin(x)/x from noisy data. We choose the RBF kernel K(x, x') = exp[−‖x − x'‖²/σ²] with σ = 0.3 and the constant parameter C = 25. First, the LS-SVM training algorithm is used with a training dataset including all 100 points. Using the result, cooperative clustering is performed and the support vector machine is then trained with the LS-SVM algorithm and the 10 cluster
Fig. 1. Classification result
Fig. 2. Regression result
centers. Figure 2 shows the result. In this figure, circles denote data points and solid circles denote support vectors. Note that the number of support vectors is reduced from 100 to 10.
5 Conclusion

In this paper, we have proposed a new clustering method called cooperative clustering. For classification problems, clustering is performed simultaneously in the two classes. In this way, the cluster centers are distributed pairwise near the border between the two classes and can approximately replace the support vectors in support vector machines. For regression problems, each pair of cluster centers merges into one point, and the set of cluster centers forms training data without noise. Experiments show that training support vector machines on these cluster centers is efficient and has only a small effect on generalization performance.
Acknowledgements This paper is supported by the National Natural Science Foundation of China (No. 60442002) and the Science and Technology Foundation of Beijing Jiaotong University (No. 2004SM010).
References
1. Vapnik, V.N.: Statistical Learning Theory. John Wiley and Sons, New York (1998)
2. Boser, B., Guyon, I., Vapnik, V.N.: A Training Algorithm for Optimal Margin Classifiers. In: Haussler, D. (ed.): Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, ACM Press (1992) 144-152
3. Osuna, E., Freund, R., Girosi, F.: An Improved Training Algorithm for Support Vector Machines. In: Proceedings of the IEEE Workshop on Neural Networks for Signal Processing (1997) 276-285
4. Platt, J.C.: Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Technical Report MSR-TR-98-14, Microsoft Research (1998)
5. Suykens, J.A.K., Vandewalle, J.: Least Squares Support Vector Machine Classifiers. Neural Processing Letters 9(3) (1999) 293-300
6. Burges, C.J.C.: Simplified Support Vector Decision Rules. In: Saitta, L. (ed.): Proceedings of the 13th International Conference on Machine Learning, San Mateo, CA, Morgan Kaufmann Publishers, Inc. (1996) 71-77
7. Downs, T., Gates, K.E., Masters, A.: Exact Simplification of Support Vector Solutions. Journal of Machine Learning Research 2 (2001) 293-297
8. Lin, K., Lin, C.: A Study on Reduced Support Vector Machines. IEEE Transactions on Neural Networks 14(6) (2003) 1449-1459
9. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York (1981)
SVMV - A Novel Algorithm for the Visualization of SVM Classification Results Xiaohong Wang, Sitao Wu, Xiaoru Wang, and Qunzhan Li School of Electrical Engineering, Southwest Jiaotong University, 610031, Chengdu, P.R. China [email protected]
Abstract. In this paper, a novel algorithm, called support vector machine visualization (SVMV), is proposed. The SVMV algorithm is based on the support vector machine (SVM) and self-organizing mapping (SOM). High-dimensional data and binary classification results can be visualized in a low-dimensional space. Compared with traditional visualization algorithms such as SOM and Sammon's mapping, the SVMV algorithm delivers better visualization of classification results. Experimental results corroborate the effectiveness and usefulness of SVMV.
1 Introduction

To obtain useful knowledge from large amounts of high-dimensional data, classification is one of the most effective methods. However, most classification methods behave like a "black box" that cannot be easily understood in a given application field. If classification results could be seen directly in a 2-D space, it would be of great help to classification users. A simple and direct idea for visualizing high-dimensional data is to reduce its dimensionality to two or three dimensions using a dimension-reduction algorithm, e.g., principal component analysis (PCA) [1], multidimensional scaling (MDS) [2], self-organizing mapping (SOM) [3], etc. PCA is a simple linear dimension-reduction technique. MDS produces a geometric representation of the data in low dimensions; Sammon's mapping [4] is one of the widely used MDS methods. Curvilinear component analysis (CCA) [5], the ISOMAP method [6] and the relational perspective map (RPM) [7] extend traditional MDS by using different stress functions or distances. SOM is an unsupervised neural network approach that can be used for visualization. All these visualization algorithms are unsupervised and do not utilize class information during training: high-dimensional input data can be visualized in the reduced low-dimensional space, but the class boundary cannot be clearly visualized. A novel algorithm, called support vector machine visualization (SVMV), is proposed in this paper. The SVMV algorithm combines the support vector machine (SVM) [8] with SOM into a single algorithm. In SVMV, SVM is used for solving binary classification and SOM is used for visualizing data. In the next section, SVM and SOM are briefly introduced. In Section 3, the procedure of the SVMV algorithm is described in detail. In Section 4, two datasets are used to demonstrate the effectiveness and usefulness of the SVMV algorithm. Finally, the conclusion is drawn in Section 5.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 968 – 973, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Introduction of SVM and SOM

2.1 SVM

Suppose a dataset {(x_1, y_1), ..., (x_n, y_n)} is used for SVMV, where x_i ∈ R^d is the i-th input vector, y_i ∈ {±1} is the binary class label, n is the number of input data, and d is the input dimension. v-SVM [9], one of the SVM models, is adopted in this paper. v-SVM solves the following primal problem:

Min  φ(w) = (1/2)‖w‖² − vρ + (1/n) Σ_{i=1}^{n} ξ_i,
s.t.  y_i[(w · ϕ(x_i)) + b_0] ≥ ρ − ξ_i,  ξ_i ≥ 0,  ρ ≥ 0,  i = 1, 2, ..., n   (1)

where ϕ(x) is a mapping function, the solution of w and b_0 forms the decision function, ξ_i is a slack variable, ρ is a bias, and v is a lower bound on the ratio of the number of support vectors to the number of input data, as well as an upper bound on the classification error on the training data [9]. Instead of solving problem (1), v-SVM solves its dual problem:

Max  Q(α) = −(1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j ϕ(x_i)^T ϕ(x_j)
s.t.  0 ≤ α_i ≤ 1/n,  Σ_{i=1}^{n} α_i y_i = 0,  Σ_{i=1}^{n} α_i ≥ v,  i = 1, 2, ..., n   (2)
where α_i ≥ 0 is the Lagrange coefficient. Note that input data with nonzero α_i are called support vectors. The solution α_i of (2) can be used to compute w and b_0 in (1). The kernel function K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j) can be used as a trick to avoid computing dot products in the original space. The decision function is

f(x) = sgn(w · ϕ(x) + b_0) = sgn( Σ_{i=1}^{n_s} α_i y_i K(x_i, x) + b_0 ),   (3)
where n_s is the total number of support vectors. The absolute value of the bias P_Y(z), expressed by (4), is proportional to the distance between the datum z and the classification boundary:

P_Y(z) = Σ_{i=1}^{n_s} α_i y_i K(x_i, z) + b_0.   (4)
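Equations (3) and (4) can be evaluated directly once the α_i and b_0 are known. The sketch below assumes the Gaussian kernel K(x, x') = exp[−‖x − x'‖²/σ²] used later in the experiments; the trained coefficients are taken as inputs.

```python
import numpy as np

def rbf(A, z, sigma):
    """Gaussian kernel K(a, z) = exp(-||a - z||^2 / sigma^2), row-wise over A."""
    return np.exp(-np.sum((A - z) ** 2, axis=-1) / sigma ** 2)

def decision(z, SV, y_sv, alpha, b0, sigma=0.5):
    """Bias P_Y(z) of eq. (4); sign(P_Y(z)) is the decision function (3)."""
    K = rbf(SV, z, sigma)
    return float(np.dot(alpha * y_sv, K) + b0)
```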
2.2 SOM

In this paper, a rectangular grid with m neurons is used for the SOM output map. The basic SOM algorithm is iterative, with k as the iteration number. Each output neuron j has a d-dimensional weight vector w_j = [w_j1, ..., w_jd]. At each training step, an input vector is randomly chosen from the training data. The winning neuron, denoted by c, is the one whose weight vector w_c is closest to x_i:

c = arg min_j ‖x_i − w_j‖,  j ∈ {1, ..., m}.   (5)
The weight-updating rule in the sequential SOM algorithm can be written as

w_j(k+1) = w_j(k) + ε(k) h_jc(k) (x_i − w_j(k)),  ∀j ∈ N_c,
w_j(k+1) = w_j(k),  otherwise,   (6)

where N_c is the set of nodes neighboring c, h_jc(k) is the neighborhood function around c, and ε(k) is the learning rate. The computational complexity of SOM is O(mn) for one whole presentation of the input data. To avoid high computational complexity when a large number of neurons is used, an interpolation process on the SOM [10] is adopted. If the old SOM size is l × l, the size of the new interpolated SOM becomes (2l − 1) × (2l − 1), (4l − 3) × (4l − 3), etc.
3 Procedure of the SVMV Algorithm

In the SVMV algorithm, SVM is first used for binary classification. Secondly, SOM is used to obtain topology-preserving weight vectors in the input space. The weight vectors of SOM represent quantized input data with topology preservation. They can be classified according to the SVM results by (3) or (4). If the output neurons of SOM belonging to the same class are displayed with the same color, the SVM classification boundary can be easily visualized. The diagram of the SVMV algorithm is shown in Fig. 1. The detailed procedure of the SVMV algorithm is as follows:

Step 1) The SOM algorithm is used to obtain the weight matrix W = [w_j], j = 1, ..., m.
Step 2) Interpolation is performed and an extended weight matrix W' = [w'_j] (j = 1, ..., m') is formed to obtain a more precise map, where m' is the number of output neurons after interpolation.
Step 3) v-SVM is used to obtain the optimal values of α_i, i = 1, ..., n, according to (2).
Step 4) The bias matrix P_w = [P_Y(w'_j)] is computed by

P_Y(w'_j) = Σ_{i=1}^{n_s} α_i y_i K(x_i, w'_j) + b_0,  j = 1, ..., m'.   (7)

Step 5) Finally, the 2-D map of SOM is colored according to the bias matrix P_w. Neurons belonging to class 1 are colored yellow while those belonging to class 2 are colored white. To plot the classification boundary more clearly, neurons near the classification boundary are plotted with heavy color for class 1 and light color for class 2, while neurons far from the boundary are plotted with light color for class 1 and heavy color for class 2.
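Steps 4) and 5) amount to evaluating eq. (7) on every (interpolated) weight vector and thresholding the result. A sketch (the Gaussian kernel and the 1/2 class labels are assumptions; a real implementation would map P to color shades):

```python
import numpy as np

def color_map(W_som, SV, y_sv, alpha, b0, sigma=1.0):
    """Step 4)-5): bias P_Y(w'_j) of eq. (7) for every SOM weight vector;
    the sign gives the class colour, the magnitude the distance shading."""
    K = np.exp(-((W_som[:, None, :] - SV[None, :, :]) ** 2).sum(-1) / sigma ** 2)
    P = K @ (alpha * y_sv) + b0          # bias matrix P_w, eq. (7)
    labels = np.where(P >= 0, 1, 2)      # class 1 (yellow) / class 2 (white)
    return P, labels
```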
Fig. 1. The diagram of the SVMV algorithm
During the testing stage, class labels and the distance between a testing datum and the classification boundary can be directly visualized after the data are mapped to the 2-D SOM map according to equation (5). Note that if a testing datum lies near the classification boundary on the 2-D SOM map, it may be on the wrong side of the boundary due to the loss of information in dimension reduction.
4 Experimental Results

For SVMV, two experiments are performed on a P4 1.7 GHz machine with 256 MB memory; the program is written in MATLAB 6.5. Two different datasets, both from the UCI machine learning repository (http://www.ics.UCI.edu/~mlearn/), are used. SOM and Sammon's mapping are also used for comparison with SVMV. For SOM, the total number of iterations is 1000 and the learning rate is decreased from 1 to 0.001. OSU_SVM (http://www.ece.osu.edu/~maj/osu_svm/) is used for the implementation of v-SVM. The widely used Gaussian kernel is adopted in the experiments.

4.1 The Wisconsin Breast Cancer Database
The first dataset contains 699 instances with 9 dimensions. It is a binary classification problem: the instances are either benign or malignant. Since 16 instances contain missing values, they are not used in the experiment; the total number of instances used is therefore reduced to 683, with 444 benign and 239 malignant. The width σ of the Gaussian kernel is set to 0.91 and v is set to 0.05. For SOM, the map size is set to 20 × 20; the size of the interpolated SOM is 39 × 39. When an extended weight matrix with more interpolations is used, similar results are obtained. The visualization results by SOM, Sammon's mapping and SVMV are shown in Fig. 2, where star symbols denote benign instances and triangle symbols denote malignant instances. In the 2-D map by SOM shown in Fig. 2(a), some data overlap between the two classes such that the classification boundary cannot be directly seen. For Sammon's mapping, the two classes are heavily overlapped, as shown in Fig. 2(b); the information about the classification boundary is therefore lost to a great extent. For SVMV, the yellow area represents benign data and the white area represents malignant data, as shown in Fig. 2(c). The classification boundary between the two classes can be clearly seen in the 2-D visualization, and the distance between any input datum and the classification boundary can also be directly obtained. The two classes are well separated in the 2-D visualization.
X. Wang et al.
Fig. 2. Visualization of the Wisconsin cancer dataset (a) by SOM; (b) by Sammon’s mapping; (c) by SVMV
4.2 The Johns Hopkins University Ionosphere Database
The second dataset contains 351 data points with 34 dimensions. The 225 "good" radar return instances are those showing some evidence of structure in the ionosphere; the 126 "bad" radar return instances are those that do not. Compared with the first dataset, the classification boundary is more complicated. The width σ of the Gaussian kernel is set to 0.35 and v is set to 0.05. For SOM, the 2-D map size is set to 10 × 10, and the size of the interpolated SOM is 19 × 19. The visualization results of the three visualization algorithms are shown in Fig. 3, where star symbols denote "good" instances and triangle symbols denote "bad" instances.
Fig. 3. Visualization of the Ionosphere dataset (a) by SOM; (b) by Sammon’s mapping; (c) by the SVMV algorithm
For SOM, there is some overlap between the two classes, and the classification boundary cannot be seen directly and clearly in Fig. 3 (a). For Sammon's mapping, the overlap is even worse, such that it is difficult to find the classification boundary, as shown in Fig. 3 (b). For SVMV, the yellow area represents "good" instances while the white area represents "bad" instances, as shown in Fig. 3 (c). The classification boundary between the two classes can be clearly seen, although the two classes overlap to some extent in the 2-D visualization.
SVMV - A Novel Algorithm for the Visualization of SVM Classification Results
5 Conclusion

In this paper, a novel SVMV algorithm is proposed. By using the SVMV algorithm, the SVM classification boundary and the distance between input data and the classification boundary can be clearly visualized in a 2-D space. With the help of SVMV, classification results become easier for users to understand. Experimental results on two different datasets corroborate the effectiveness and usefulness of SVMV.
Acknowledgement This work is supported by Special Funds for Major State Basic Research Projects of P. R. China (G1998020312).
References

1. Jackson, J.E.: A User's Guide to Principal Components. John Wiley & Sons, New York (1991)
2. Cox, T.C., Cox, M.A.A.: Multidimensional Scaling. 2nd edn. Chapman & Hall/CRC, Boca Raton (2000)
3. Kohonen, T.: Self-Organizing Maps. 2nd edn. Springer, Berlin Heidelberg New York (1997)
4. Sammon, J.W.: A Nonlinear Mapping for Data Structure Analysis. IEEE Trans. on Computers 18(5) (1969) 401-409
5. Demartines, P., Hérault, J.: Curvilinear Component Analysis: A Self-Organizing Neural Network for Nonlinear Mapping of Data Sets. IEEE Trans. on Neural Networks 8(1) (1997) 148-154
6. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290(12) (2000) 2319-2323
7. Li, X.Z.: Visualization of High-Dimensional Data with Relational Perspective Map. Information Visualization 3(1) (2004) 49-59
8. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Berlin Heidelberg New York (1995)
9. Schölkopf, B., Smola, A.: New Support Vector Algorithms. Neural Computation 12(5) (2000) 1207-1245
10. Wu, S.T., Chow, W.S.: Support Vector Visualization and Clustering Using Self-Organizing Map and Support Vector One-Class Classification. In: Proc. of IEEE Int. Joint Conf. on Neural Networks. Portland, USA (2003) 803-808
Support Vector Machines Ensemble Based on Fuzzy Integral for Classification Genting Yan, Guangfu Ma, and Liangkuan Zhu Department of Control Science and Engineering, Harbin Institute of Technology, Harbin, Heilongjiang, 150001, P.R. China [email protected]
Abstract. Support vector machines (SVMs) ensembles have recently been proposed to improve classification performance. However, currently used fusion strategies do not evaluate the degree of importance of each individual component SVM classifier's output when combining the component predictions into the final decision. An SVMs ensemble method based on fuzzy integral is presented in this paper to deal with this problem. This method aggregates the outputs of the separate component SVMs together with the importance of each component SVM, which is subjectively assigned, in keeping with the nature of fuzzy logic. The simulation results demonstrate that the proposed method outperforms a single SVM and the traditional SVMs aggregation technique via majority voting in terms of classification accuracy.
1 Introduction

Recently, support vector machines ensembles have been applied in many areas to improve classification performance [1] [2] [3]. The experimental results in these applications show that an SVMs ensemble can achieve equal or better classification accuracy than a single support vector machine. However, the commonly used majority voting aggregation method adopted in these papers does not consider the degree of importance of each component SVM classifier's output when combining several independently trained SVMs into a final decision. To resolve this problem, a support vector machines ensemble strategy based on fuzzy integral is proposed in this paper. The presented method consists of four phases. First, we use the bagging technique to construct the component SVMs: several SVMs are trained independently on training sets generated by a bootstrap method from the original training set. Second, since posterior class probabilities are required when using the fuzzy integral to combine the component classification outputs into the overall decision, we obtain a probabilistic outputs model for each component SVM. Third, we assign the fuzzy densities, the degree of importance of each component SVM, based on how well these SVMs perform on their own training data. Finally, we aggregate the component predictions using the fuzzy integral, in which the relative importance of the different component SVMs is also considered. The fuzzy integral

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 974 – 980, 2006. © Springer-Verlag Berlin Heidelberg 2006
nonlinearly combines objective evidence, in the form of an SVM probabilistic output, with a subjective evaluation of the importance of that component SVM with respect to the decision. Experimental results show that the presented method is more efficient than the majority voting aggregation technique. This paper is organized as follows. Section 2 introduces the basic theory of the SVM classifier. In Sect. 3, probabilistic outputs models for support vector machines are provided. The bagging method for constructing the component SVMs and the fuzzy integral strategy for aggregating several trained SVMs are presented in Sect. 4. Section 5 presents experimental results on benchmark problems. Finally, conclusions and future work are given in Sect. 6.
2 Support Vector Machines

SVMs construct a classifier from a set of labeled patterns called training examples. Let $\{(x_i, y_i) \in R^d \times \{-1, 1\},\ i = 1, 2, \ldots, l\}$ be such a set of training examples. The SVM tries to find the optimal separating hyperplane $w^T x + b = 0$ that maximizes the margin of the nearest examples from the two classes. For a nonlinear classification problem, the original data are projected into a high-dimensional feature space $F$ via a nonlinear map $\Phi: R^d \to F$, so that the nonlinear classification problem is transformed into a linear one in the feature space $F$. By introducing the kernel function $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, it is not necessary to know $\Phi(\cdot)$ explicitly; the kernel function alone is enough for training the SVM. The corresponding optimization problem for nonlinear classification is

$$\min_{w, b, \xi} J(w, \xi) = \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i, \quad \text{s.t. } y_i (w^T \Phi(x_i) + b) \ge 1 - \xi_i,\ i = 1, 2, \ldots, l \tag{1}$$

where $C$ is a constant and $\xi_i$ are the slack variables. Eq. (1) can be solved by constructing a Lagrangian and passing to the dual:

$$\max_\alpha W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i, j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j), \quad \text{s.t. } \sum_{i=1}^{l} \alpha_i y_i = 0,\ 0 \le \alpha_i \le C \tag{2}$$

By solving the above problem (2), we can get the optimal hyperplane

$$\sum_i \alpha_i y_i K(x, x_i) + b = 0 \tag{3}$$

Then we can get the decision function of the form

$$f(x) = \mathrm{sgn}\Big[ \sum_i \alpha_i y_i K(x, x_i) + b \Big] \tag{4}$$
SVMs are originally designed for two-class classification. A one-against-all or one-against-one strategy can be used to realize multi-class classification. For details of support vector machines, see [4].
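As a sketch of the decision function in Eq. (4), the following assumes a Gaussian kernel and hypothetical trained parameters — the support vectors, $\alpha_i$, $y_i$, and $b$ below are illustrative stand-ins, not values from the paper:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, sigma=1.0):
    # Eq. (4): f(x) = sgn( sum_i alpha_i y_i K(x, x_i) + b )
    s = sum(a * y * rbf_kernel(x, sv, sigma)
            for a, y, sv in zip(alphas, labels, support_vectors))
    return int(np.sign(s + b))

# Hypothetical trained parameters, for illustration only
svs = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
alphas = [1.0, 1.0]
labels = [-1, 1]
b = 0.0
```

A query near the positive support vector, e.g. `svm_decision(np.array([1.9, 1.9]), svs, alphas, labels, b)`, is pulled toward the $+1$ class because its kernel similarity to that support vector dominates the sum.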
3 Probabilistic Outputs Model for Support Vector Machines

3.1 Two-Class Case

Given training data $x_i \in R^n$, $i = 1, \ldots, l$, labeled by $y_i \in \{-1, 1\}$, the binary support vector machine yields a decision function $f(x)$ such that $\mathrm{sign}(f(x))$ is the prediction for any test sample $x$. Instead of predicting the label, many applications require a posterior class probability $P(y = 1 \mid x)$. Platt proposes to approximate $P(y = 1 \mid x)$ by a sigmoid function

$$P(y = 1 \mid x) = \frac{1}{1 + \exp(A f(x) + B)} \tag{5}$$

with parameters $A$ and $B$ [5]. To estimate the best values of $(A, B)$, any subset of the $l$ training data ($N_+$ of them with $y_i = 1$ and $N_-$ of them with $y_i = -1$) can be used to solve the following maximum likelihood problem:

$$\min_{z = (A, B)} F(z) \tag{6}$$

where

$$F(z) = -\sum_{i=1}^{l} \big( t_i \log(p_i) + (1 - t_i) \log(1 - p_i) \big) \tag{7}$$

$$p_i = \frac{1}{1 + \exp(A f_i + B)}, \quad f_i = f(x_i), \quad t_i = \begin{cases} \dfrac{N_+ + 1}{N_+ + 2} & \text{if } y_i = 1 \\[6pt] \dfrac{1}{N_- + 2} & \text{if } y_i = -1 \end{cases}, \quad i = 1, 2, \ldots, l \tag{8}$$

3.2 Multi-class Case
A $K$-class classification problem can be efficiently solved by partitioning the original problem into a set of $K(K-1)/2$ two-class problems [6]. For any new $x$, we can calculate $r_{ij}$ according to Sect. 3.1 as an approximation of $u_{ij} = P(y = i \mid y = i \text{ or } j, x)$. Then, using all the $r_{ij}$, the goal is to estimate $p_i = P(y = i \mid x)$, $i = 1, \ldots, K$. The following algorithm can be used to obtain the class probabilities for any new $x$ [7]. Consider that

$$\Big( \sum_{j: j \ne i} P(y = i \text{ or } j \mid x) \Big) - (K - 2) P(y = i \mid x) = \sum_{j=1}^{K} P(y = j \mid x) = 1 \tag{9}$$

Using

$$r_{ij} \approx u_{ij} = \frac{P(y = i \mid x)}{P(y = i \text{ or } j \mid x)} \tag{10}$$

we can obtain

$$p_i = \frac{1}{\sum_{j: j \ne i} \frac{1}{r_{ij}} - (K - 2)} \tag{11}$$

As $\sum_{i=1}^{K} p_i = 1$ does not hold exactly, we must normalize the $p_i$.
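The two ingredients of this section can be sketched together: fitting Platt's sigmoid of Eqs. (5)-(8) by direct minimization of the negative log-likelihood, and the pairwise-coupling estimate of Eq. (11). This assumes SciPy is available, and the synthetic decision values below merely stand in for a trained SVM's outputs $f(x_i)$:

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt(f_vals, y):
    # Fit the sigmoid of Eq. (5) by minimizing the NLL F(z) of Eqs. (6)-(8)
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    # Regularized targets t_i from Eq. (8)
    t = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))

    def nll(z):
        A, B = z
        p = np.clip(1.0 / (1.0 + np.exp(A * f_vals + B)), 1e-12, 1 - 1e-12)
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

    return minimize(nll, x0=[0.0, 0.0], method="Nelder-Mead").x

def pairwise_to_class_probs(r):
    # Eq. (11): p_i = 1 / ( sum_{j != i} 1/r_ij - (K - 2) ), then normalize
    K = r.shape[0]
    p = np.array([1.0 / (sum(1.0 / r[i, j] for j in range(K) if j != i) - (K - 2))
                  for i in range(K)])
    return p / p.sum()

# Synthetic decision values standing in for f(x_i) of a trained SVM
rng = np.random.default_rng(0)
f_vals = np.concatenate([rng.normal(1.5, 1.0, 50), rng.normal(-1.5, 1.0, 50)])
y = np.concatenate([np.ones(50), -np.ones(50)])
A, B = fit_platt(f_vals, y)
prob = 1.0 / (1.0 + np.exp(A * 2.0 + B))   # estimated P(y=1 | f(x) = 2)
```

Since the positive class has larger decision values, the fitted $A$ comes out negative, so larger $f(x)$ maps to a higher $P(y = 1 \mid x)$.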
4 SVMs Ensemble

4.1 Bagging to Construct Component SVMs
The bagging algorithm generates $K$ training data sets $\{TR_k,\ k = 1, 2, \ldots, K\}$ by randomly re-sampling, with replacement, from the given original training data set $TR$ [8]. Each training set $TR_k$ is used to train one component SVM, and the component predictions are combined via fuzzy integral.

4.2 SVMs Ensemble Based on Fuzzy Integral
In the following, we introduce the basic theory of the fuzzy integral [9]. Let $X = \{x_1, x_2, \ldots, x_n\}$ be a finite set. A set function $g: 2^X \to [0, 1]$ is called a fuzzy measure if the following conditions are satisfied:

$$1.\ g(\emptyset) = 0,\ g(X) = 1 \qquad 2.\ g(A) \le g(B) \text{ if } A \subset B \text{ and } A, B \subset X \tag{12}$$

From the definition of the fuzzy measure $g$, Sugeno developed the so-called $g_\lambda$ fuzzy measure, which satisfies the additional property

$$g(A \cup B) = g(A) + g(B) + \lambda\, g(A)\, g(B) \tag{13}$$

for all $A, B \subset X$ with $A \cap B = \emptyset$, and for some $\lambda > -1$. Let $h: X \to [0, 1]$ be a fuzzy subset of $X$ and use the notation $A_i = \{x_1, x_2, \ldots, x_i\}$. For a $g_\lambda$ fuzzy measure, the values $g(A_i)$ can be determined recursively as

$$g(A_1) = g(\{x_1\}) = g_1, \qquad g(A_i) = g_i + g(A_{i-1}) + \lambda\, g_i\, g(A_{i-1}), \quad g_i = g(\{x_i\}),\ 1 < i \le n \tag{14}$$

$\lambda$ is given by solving the equation

$$\lambda + 1 = \prod_{i=1}^{n} (1 + \lambda g_i) \tag{15}$$

where $\lambda \in (-1, +\infty)$ and $\lambda \ne 0$. Suppose $h(x_1) \ge h(x_2) \ge \cdots \ge h(x_n)$ (if not, $X$ is rearranged so that this relation holds). Then the so-called Sugeno fuzzy integral $e$ with respect to the $g_\lambda$ fuzzy measure over $X$ can be computed by

$$e = \max_{i=1}^{n} \big[ \min(h(x_i), g(A_i)) \big] \tag{16}$$

Thus the calculation of the fuzzy integral with respect to a $g_\lambda$ fuzzy measure only requires knowledge of the fuzzy densities, where the $i$th density $g_i$ is interpreted as the degree of importance of the source $x_i$ towards the final decision. These fuzzy densities can be subjectively assigned by an expert or generated from training data. Let $\Omega = \{\omega_1, \omega_2, \ldots, \omega_c\}$ be a set of classes of interest and $S = \{\mathrm{SVM}_1, \mathrm{SVM}_2, \ldots, \mathrm{SVM}_m\}$ be a set of component SVMs. Let $h_k: S \to [0, 1]$ be the belief degree that a new sample $x$ belongs to class $\omega_k$; that is, $h_k(\mathrm{SVM}_i)$ is the probability assigned by $\mathrm{SVM}_i$ that the new sample $x$ is in class $\omega_k$. If we have $\{h_k(\mathrm{SVM}_i),\ i = 1, \ldots, m\}$ and know the fuzzy densities $\{g_k(\{\mathrm{SVM}_i\}),\ i = 1, \ldots, m\}$, the fuzzy integral $e_k$ for class $\omega_k$ can be calculated using (14) to (16). When the fuzzy integral values $\{e_k,\ k = 1, \ldots, c\}$ are obtained, we can get the final decision $k^* = \arg\max_k e_k$.
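Eqs. (14)-(16) can be sketched directly. Solving Eq. (15) by bisection below assumes every density satisfies $g_i < 1$ and $\sum_i g_i \ne 1$ (when the densities sum to less than 1 the root is positive, otherwise it lies in $(-1, 0)$):

```python
import numpy as np

def solve_lambda(densities):
    # Bisection for the root of prod(1 + lam*g_i) = lam + 1 (Eq. (15)), lam != 0.
    # Assumes all g_i < 1; if sum(g_i) == 1 the measure is additive (lam -> 0).
    g = np.asarray(densities, dtype=float)
    f = lambda lam: np.prod(1.0 + lam * g) - lam - 1.0
    if np.isclose(g.sum(), 1.0):
        return 0.0
    lo, hi = (1e-12, 1e6) if g.sum() < 1.0 else (-1.0 + 1e-12, -1e-12)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def sugeno_integral(h, densities):
    # Sort sources so h(x_1) >= ... >= h(x_n), then apply Eqs. (14)-(16)
    h = np.asarray(h, dtype=float)
    g = np.asarray(densities, dtype=float)
    order = np.argsort(-h)
    h, g = h[order], g[order]
    lam = solve_lambda(g)
    e, gA = 0.0, 0.0
    for hi, gi in zip(h, g):
        gA = gi + gA + lam * gi * gA     # Eq. (14): g(A_i) recursion
        e = max(e, min(hi, gA))          # Eq. (16): max of min(h, g(A_i))
    return e

# Example: three sources with equal densities summing to < 1 (so lam > 0)
e = sugeno_integral([0.9, 0.6, 0.3], [0.3, 0.3, 0.3])
```

Note that the recursion reproduces $g(A_n) = g(X) = 1$ whenever $\lambda$ solves Eq. (15), which is a convenient sanity check on the root finder.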
5 Experimental Results

Fig. 1 shows the general scheme of the proposed SVMs ensemble. The whole experimental process is divided into the following steps:

Step 1: Generate $m$ training data sets via bagging from the original training set according to Sect. 4.1, and train a component SVM on each of those training sets.
Step 2: Obtain the probabilistic outputs model of each component SVM according to Sects. 3.1 and 3.2.
Step 3: Assign the fuzzy densities $\{g_k(\{\mathrm{SVM}_i\}),\ k = 1, \ldots, c\}$, the degree of importance of each component $\mathrm{SVM}_i$, $i = 1, \ldots, m$, based on how well these SVMs perform on their own training data.
Step 4: Obtain the probabilistic outputs of each component SVM for a new test example.
Step 5: Compute the fuzzy integral $e_k$ for $\omega_k$, $k = 1, \ldots, c$, according to Sect. 4.2.
Step 6: Get the final decision $k^* = \arg\max_k e_k$.
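The six steps can be sketched end to end. To keep the example self-contained and runnable, nearest-centroid models with softmax probabilities stand in for the component SVMs with probabilistic outputs; the toy data, the 0.9 density scaling, and the number of components are all illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-class data; centroid models stand in for component SVMs
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(3, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

def train_component(Xb, yb):
    # "Component classifier": per-class centroids
    return np.array([Xb[yb == c].mean(axis=0) for c in (0, 1)])

def predict_proba(centroids, x):
    # Softmax of negative distances plays the role of probabilistic outputs
    d = np.array([np.linalg.norm(x - c) for c in centroids])
    p = np.exp(-d)
    return p / p.sum()

# Step 1: bagging -- bootstrap training sets, one component per set
m = 3
components = []
for _ in range(m):
    idx = rng.choice(len(X), size=len(X), replace=True)
    components.append((train_component(X[idx], y[idx]), idx))

# Step 3: fuzzy densities from each component's own training accuracy,
# scaled so every density stays below 1 (an illustrative convention)
densities = np.array([
    0.9 * np.mean([np.argmax(predict_proba(c, x)) == t
                   for x, t in zip(X[i], y[i])])
    for c, i in components])

def solve_lambda(g):
    # Bisection for prod(1 + lam*g_i) = lam + 1 (Eq. (15)); assumes g_i < 1
    f = lambda lam: np.prod(1.0 + lam * g) - lam - 1.0
    lo, hi = (1e-12, 1e6) if g.sum() < 1.0 else (-1.0 + 1e-12, -1e-12)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if f(lo) * f(mid) <= 0.0 else (mid, hi)
    return 0.5 * (lo + hi)

def sugeno(h, g):
    order = np.argsort(-h)
    h, g = h[order], g[order]
    lam = solve_lambda(g)
    e, gA = 0.0, 0.0
    for hi, gi in zip(h, g):
        gA = gi + gA + lam * gi * gA     # Eq. (14)
        e = max(e, min(hi, gA))          # Eq. (16)
    return e

# Steps 4-6: combine per-class evidence across components, decide by argmax e_k
def classify(x):
    H = np.array([predict_proba(c, x) for c, _ in components])  # m x c
    return int(np.argmax([sugeno(H[:, k], densities) for k in range(2)]))
```

A point near one blob's center is then assigned that blob's class, with the fuzzy integral weighting each component's vote by its density.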
Fig. 1. The proposed scheme of the SVMs ensemble
To evaluate the efficacy of the proposed method, two Statlog data sets, the heart data and the satimage data, are used; they are available at http://www.liacc.up.pt/ML/statlog/datasets.html. The former is a two-class classification problem and the latter is a multi-class classification problem. Experiments are implemented using the libsvm software [10].

5.1 Statlog Heart Data Classification
The heart data set contains two classes and 270 instances. 50 samples per class are selected randomly for training, and the remaining 170 samples are used for testing. For bagging, we randomly re-sample 70 samples with replacement from the original training data set. We train three component SVMs independently over the three training data sets generated by bagging and aggregate the three trained SVMs via fuzzy integral. Each component SVM uses the RBF kernel, and the corresponding parameters are selected by five-fold cross-validation. To avoid the tweak problem, ten experiments are performed and the average performance is reported in Table 1, together with the performance of a single SVM and of the SVMs ensemble via majority voting.

Table 1. The experimental results of the heart data set

Algorithm                            Test Accuracy (%)
Single SVM                           77.0588
SVMs ensemble via majority voting    78.8235
SVMs ensemble via fuzzy integral     80.5882
5.2 Statlog Satimage Data Classification
The satimage data set contains six classes. The training set includes 4435 samples and the testing set includes 2000 samples. The one-against-one strategy is adopted to realize multi-class classification. We train eight component SVMs. Each component SVM uses the RBF kernel, and the corresponding parameters are selected by five-fold cross-validation. For bagging, we randomly re-sample 3500 data samples with replacement from the original training data set. We train each component SVM independently over its own training data set and aggregate the eight trained SVMs via fuzzy integral. Table 2 shows the test results. From Tables 1 and 2, we can see clearly that the classification performance of the SVMs ensemble based on fuzzy integral is better than that of a single SVM and of the SVMs ensemble via majority voting.

Table 2. The experimental results of the satimage data set

Algorithm                            Test Accuracy (%)
Single SVM                           85.45
SVMs ensemble via majority voting    85.9
SVMs ensemble via fuzzy integral     87.4
6 Conclusions

This paper proposes a support vector machines ensemble strategy based on fuzzy integral for classification. The most important advantage of this approach is that not only are the classification results combined, but the relative importance of the different component SVM classifiers is also considered. The simulation results show the effectiveness and efficiency of our method. Future research will focus on how to set the fuzzy densities more reasonably and on using other methods, such as boosting, to construct the component SVMs.
References

1. Kim, H.C., Pang, S., Je, H.M., Kim, D., Bang, S.Y.: Constructing Support Vector Machine Ensemble. Pattern Recognition 36 (2003) 2757-2767
2. Anna, K.J., James, D.M., Marek, F., Ronald, M.S.: Computer-aided Polyp Detection in CT Colonography Using an Ensemble of Support Vector Machines. International Congress Series 1256 (2003) 1019-1024
3. Valentini, G., Muselli, M., Ruffino, F.: Cancer Recognition with Bagged Ensemble of Support Vector Machines. Neurocomputing 56 (2004) 461-466
4. Vapnik, V.: The Nature of Statistical Learning Theory. Wiley, New York (1998)
5. Platt, J.: Probabilistic Outputs for Support Vector Machines and Comparison to Regularized Likelihood Methods. In: Smola, A., Bartlett, P., Schuurmans, D. (eds.): Advances in Large Margin Classifiers. MIT Press, Cambridge, MA (2000)
6. Hsu, C.W., Lin, C.J.: A Comparison of Methods for Multi-class Support Vector Machines. IEEE Trans. Neural Networks 13(2) (2002) 415-425
7. Wu, T.F., Lin, C.J., Weng, R.C.: Probability Estimates for Multi-class Classification by Pairwise Coupling. Journal of Machine Learning Research 5 (2004) 975-1005
8. Breiman, L.: Bagging Predictors. Machine Learning 24(2) (1996) 123-140
9. Kwak, K.C., Pedrycz, W.: Face Recognition: A Study in Information Fusion Using Fuzzy Integral. Pattern Recognition Letters 26 (2005) 719-733
10. Chang, C.C., Lin, C.J.: LIBSVM: A Library for Support Vector Machines. (2001) http://www.csie.ntu.edu.tw/~cjlin/libsvm
An Adaptive Support Vector Machine Learning Algorithm for Large Classification Problems

Shu Yu 1, Xiaowei Yang 2, Zhifeng Hao 2, and Yanchun Liang 3

1 College of Computer Science and Engineering, South China University of Technology, Guangzhou 510640, China
2 School of Mathematical Science, South China University of Technology, Guangzhou 510640, China
3 College of Computer Science and Technology, Jilin University, Changchun 130012, China
Abstract. Based on incremental and decremental learning strategies, an adaptive support vector machine learning algorithm (ASVM) is presented for large classification problems in this paper. In the proposed algorithm, the incremental and decremental procedures are performed alternately, and a small-scale working set, which can cover most of the information in the training set and overcome the drawback of losing sparseness in the least squares support vector machine (LS-SVM), is formed adaptively. The classifier is constructed from this working set. In general, the number of elements in the working set is much smaller than that in the training set; therefore the proposed algorithm can be used not only to train on data sets quickly but also to test on them efficiently, while losing little accuracy. In order to examine the training speed and the generalization performance of the proposed algorithm, we apply both ASVM and LS-SVM to seven UCI datasets and a benchmark problem. Experimental results show that the novel algorithm is much faster than LS-SVM and loses little accuracy in solving large classification problems.
1 Introduction

The support vector machine (SVM) [1] is a powerful tool for data classification and function estimation. The training problem in SVM is equivalent to solving a linearly constrained convex quadratic programming (QP) problem with a number of variables equal to the number of data points. This optimization problem is known to be challenging when the number of data points exceeds a few thousand. To address this problem, many algorithms have been presented based on decomposition strategies [2-7], in which how to select the working set is a very important issue. As for the convergence analysis, some researchers have completed related work [8-9]. Compared with the decomposition algorithms, the successive overrelaxation (SOR) method [10] and the Lagrangian support vector machine (LSVM) [11] are based on iterative strategies. Making use of the Sherman-Morrison-Woodbury (SMW) matrix identity, LSVM can deal with large linear classification problems with a small number of features. However, it cannot handle large nonlinear classification problems efficiently [12]. In order to avoid solving the QP

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 981 – 990, 2006. © Springer-Verlag Berlin Heidelberg 2006
problem directly, based on equality constraints instead of inequality ones, the least squares support vector machine (LS-SVM) was designed [13]. Using the SMW matrix identity, LS-SVM can be applied to very large datasets with a small number of features [14]. However, the SMW identity may result in numerical instability [15]. In LS-SVM, the sparseness is lost. To overcome this shortcoming, a sparse strategy has been presented that prunes support values from the sorted support value spectrum [16]. Recently, several benchmark datasets were used to evaluate the performance of LS-SVM classifiers, with sparseness imposed by gradually pruning the support value spectrum [17]. In general, the sparseness is not formed adaptively and the results depend on the tolerance on the support values. At present, the SVM learning algorithms mentioned above are implemented offline in batch mode, which is not suited to sequential learning. To overcome this drawback, an online recursive SVM learning algorithm has been presented [18], in which one trains the SVM by adding one data point at each step while retaining the Kuhn-Tucker conditions on all previously checked training data points. Based on solving a QP problem, an online support vector classifier (OSVC) has also been put forward [19], which handles the data in sequence rather than in batch. Unfortunately, it only fits the case where the number of support vectors is very small; when the number of support vectors is large, OSVC suffers from computational difficulties, because the previous information is not reused in the current iteration and large QP problems have to be re-solved many times. Based on LS-SVM, an incremental algorithm has been proposed [20], in which computing the large matrix inverse is avoided by adding one data point at each step.
In this paper, based on the incremental and decremental learning strategies, an adaptive support vector machine learning algorithm (ASVM), which can utilize the previous information in the current iteration and avoid solving repeatedly the large QP problems, will be presented for large classification problems. During the training process, the incremental and decremental procedures are performed alternatively, and a small scale working set, which can cover most of the information of the training set, can be formed adaptively. Using this working set, one can construct the classifier. In order to examine the proposed algorithm, both ASVM and LS-SVM algorithms are applied to seven UCI datasets and a benchmark problem. Comparisons of the results obtained by using the existing and proposed algorithms are also performed. The rest of this paper is organized as follows. In Section 2, the standard LS-SVM is reviewed briefly. The incremental and decremental learning procedures are given in Section 3 and Section 4, respectively. The ASVM algorithm is presented in detail in Section 5. The numerical results of applying ASVM and LS-SVM algorithms to seven UCI data sets and a benchmark problem are given in Section 6. In Section 7, some discussions and conclusions will be given.
2 Standard LS-SVM

Consider a training set of $l$ pairs of data points $\{x_k, y_k\}_{k=1}^{l}$ for the binary classification problem, where $x_k \in R^n$ and $y_k \in \{-1, +1\}$. The QP problem based on equality constraints is the following [13]:

$$\min_{w, b, e} J(w, b, e) = \frac{1}{2} w^T w + \frac{1}{2} \gamma \sum_{k=1}^{l} e_k^2 \tag{1}$$

s.t.

$$y_k [w^T \varphi(x_k) + b] = 1 - e_k, \quad k = 1, \ldots, l \tag{2}$$

The corresponding Lagrangian is

$$L(w, b, e, \alpha) = J(w, b, e) - \sum_{k=1}^{l} \alpha_k \{ y_k [w^T \varphi(x_k) + b] - 1 + e_k \} \tag{3}$$

where the $\alpha_k$ are Lagrange multipliers; the nonzero $\alpha_k$ are called support values. The optimality conditions can be written as the following set of linear equations:

$$\begin{bmatrix} 0 & Y^T \\ Y & ZZ^T + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ \vec{1} \end{bmatrix} \tag{4}$$

where $Z = [\varphi(x_1) y_1, \ldots, \varphi(x_l) y_l]^T$, $Y = [y_1, \ldots, y_l]^T$, $\vec{1} = [1, \ldots, 1]^T$, and $\alpha = [\alpha_1, \ldots, \alpha_l]^T$. Mercer's condition is applied to the matrix $\Omega = ZZ^T$ with

$$\Omega_{ij} = y_i y_j \varphi(x_i)^T \varphi(x_j) = y_i y_j \psi(x_i, x_j) \tag{5}$$

To solve Eq. (4), we only need to invert the $l \times l$ matrix

$$H = ZZ^T + \gamma^{-1} I \tag{6}$$

Indeed, the second row of Eq. (4) gives $bY + H\alpha = \vec{1}$, and together with the first row this yields the explicit solution

$$b = \frac{Y^T H^{-1} \vec{1}}{Y^T H^{-1} Y}, \qquad \alpha = H^{-1} (\vec{1} - bY) \tag{7}$$

Given a new input $x$, the corresponding value $y(x)$ can be estimated using

$$y(x) = \mathrm{sign}\Big[ \sum_{k=1}^{l} \alpha_k y_k \psi(x, x_k) + b \Big] \tag{8}$$
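A minimal NumPy sketch of the closed-form solve in Eqs. (4)-(8), assuming a Gaussian kernel for $\psi$; the toy data and parameter values are illustrative, not from the paper:

```python
import numpy as np

def rbf(X1, X2, sigma=1.0):
    # psi(x, z) = exp(-||x - z||^2 / (2 sigma^2)), computed pairwise
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_train(X, y, gamma=1.0, sigma=1.0):
    # Eqs. (5)-(7): H = ZZ' + I/gamma with (ZZ')_ij = y_i y_j psi(x_i, x_j),
    # then b = Y' H^-1 1 / (Y' H^-1 Y) and alpha = H^-1 (1 - b Y)
    l = len(y)
    H = np.outer(y, y) * rbf(X, X, sigma) + np.eye(l) / gamma
    Hinv1 = np.linalg.solve(H, np.ones(l))
    HinvY = np.linalg.solve(H, y.astype(float))
    b = (y @ Hinv1) / (y @ HinvY)
    alpha = np.linalg.solve(H, np.ones(l) - b * y)
    return alpha, b

def lssvm_predict(X, y, alpha, b, Xnew, sigma=1.0):
    # Eq. (8): y(x) = sign( sum_k alpha_k y_k psi(x, x_k) + b )
    return np.sign(rbf(Xnew, X, sigma) @ (alpha * y) + b)

# Illustrative toy data: two well-separated blobs
rng = np.random.default_rng(0)
Xtr = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
ytr = np.array([-1.0] * 20 + [1.0] * 20)
alpha, b = lssvm_train(Xtr, ytr, gamma=10.0, sigma=1.0)
preds = lssvm_predict(Xtr, ytr, alpha, b, Xtr)
```

Note that the solution exactly satisfies the first row of Eq. (4), $Y^T \alpha = 0$, which makes a useful check on the linear algebra.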
3 Incremental Learning Procedure

Without loss of generality, suppose that the LS-SVM model based on $N$ data points has been built. Let

$$A_N = \begin{bmatrix} 0 & Y^T \\ Y & ZZ^T + \gamma^{-1} I \end{bmatrix}, \qquad \alpha_N = \begin{bmatrix} b \\ \alpha \end{bmatrix}, \qquad B_N = \begin{bmatrix} 0 \\ \vec{1} \end{bmatrix} \tag{9}$$

Then Eq. (4) can be rewritten as

$$A_N \alpha_N = B_N \ \Rightarrow\ \alpha_N = A_N^{-1} B_N \tag{10}$$

When a new data point $(x_{N+1}, y_{N+1})$ enters the current working set, one has

$$\alpha_{N+1} = A_{N+1}^{-1} B_{N+1} \tag{11}$$

where

$$A_{N+1} = \begin{bmatrix} A_N & d^T \\ d & c \end{bmatrix} \tag{12}$$

$$d = [\, y_{N+1}\ \ \Omega_{1, N+1}\ \ \Omega_{2, N+1}\ \cdots\ \Omega_{N, N+1} \,] \tag{13}$$

$$c = \Omega_{N+1, N+1} + \frac{1}{\gamma} \tag{14}$$

$$B_{N+1} = [\, B_N\ \ 1 \,]^T \tag{15}$$

From Ref. [20], one knows that the following equation holds:

$$A_{N+1}^{-1} = \begin{bmatrix} A_N^{-1} & \vec{0}^T \\ \vec{0} & 0 \end{bmatrix} + T \begin{bmatrix} A_N^{-1} d^T \\ -1 \end{bmatrix} \begin{bmatrix} d A_N^{-1} & -1 \end{bmatrix} \tag{16}$$

where $T = [\, c - d A_N^{-1} d^T \,]^{-1}$. It is clear that $A_{N+1}^{-1}$ can be obtained from $A_N^{-1}$ without computing a matrix inverse from scratch, and the corresponding coefficients and bias $\alpha_{N+1} = [b\ \ \alpha]^T$ can be obtained from Eq. (11).
4 Decremental Learning Procedure

When only the incremental learning procedure is used, the working set grows large, which leads to difficulties in training and testing. In this section, a decremental learning procedure is given to deal with this problem. Suppose that the LS-SVM model based on $N$ data points has been built and that one data point is to be pruned from the working set. If the $k$th data point is pruned, the corresponding matrix $A_{N-1}$ is obtained by deleting the $k$th row and column from the matrix $A_N$, and one has

$$\alpha_{N-1} = A_{N-1}^{-1} B_{N-1} \tag{17}$$

From Ref. [18], we know that the following theorem holds.

Theorem. Let $a_{ij}^{(N)}$ and $a_{ij}^{(N-1)}$ be the elements of the matrices $A_N^{-1}$ and $A_{N-1}^{-1}$, respectively. Then

$$a_{ij}^{(N-1)} = a_{ij}^{(N)} - \frac{a_{i, k}^{(N)}\, a_{k, j}^{(N)}}{a_{k, k}^{(N)}}, \qquad i, j = 1, \ldots, N;\ i, j \ne k \tag{18}$$

It shows that the matrix $A_{N-1}^{-1}$ can be obtained from $A_N^{-1}$ without computing the matrix inverse, and the corresponding coefficients and bias $\alpha_{N-1} = [b\ \ \alpha]^T$ can be obtained from Eq. (17).
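The two update rules, Eq. (16) for growing the inverse and Eq. (18) for shrinking it, can be sketched in NumPy. The matrix below is a generic symmetric positive-definite stand-in for $A_N$ (illustrative only; an actual LS-SVM system matrix has the bordered structure of Eq. (9)):

```python
import numpy as np

def add_point(A_inv, d, c):
    # Eq. (16): inverse of the bordered matrix [[A, d'], [d, c]] from A^{-1}
    v = A_inv @ d                        # A^{-1} d'
    T = 1.0 / (c - d @ v)                # T = (c - d A^{-1} d')^{-1}
    n = A_inv.shape[0]
    out = np.zeros((n + 1, n + 1))
    out[:n, :n] = A_inv + T * np.outer(v, v)
    out[:n, n] = -T * v
    out[n, :n] = -T * v
    out[n, n] = T
    return out

def remove_point(A_inv, k):
    # Eq. (18): a_ij^(N-1) = a_ij - a_ik a_kj / a_kk, then drop row/col k
    B = A_inv - np.outer(A_inv[:, k], A_inv[k, :]) / A_inv[k, k]
    return np.delete(np.delete(B, k, axis=0), k, axis=1)

# Sanity demo on a random symmetric positive-definite matrix
rng = np.random.default_rng(0)
R = rng.normal(size=(6, 6))
M = R @ R.T + 6.0 * np.eye(6)
A_inv = np.linalg.inv(M[:5, :5])
M_inv = add_point(A_inv, M[5, :5], M[5, 5])   # grow 5 -> 6
A_inv_back = remove_point(M_inv, 5)           # shrink 6 -> 5
```

Both operations cost $O(N^2)$ per point instead of the $O(N^3)$ of a fresh inversion, which is what makes the alternating incremental/decremental scheme cheap.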
5 The Adaptive SVM Learning Algorithm

The proposed algorithm is summarized in Algorithm 1.

Algorithm 1. (Adaptive Support Vector Machine Learning Algorithm, ASVM)
1.  Set $W = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $\exists\, y_i \ne y_j$
2.  Solve Eqs. (4) and (8) to obtain the current classifier $f_{current}$
3.  for $k = N + 1, \ldots, l$ do
4.      Input a new data point $S_k = \{x_k, y_k\}$
5.      if $f_{current}(S_k)$ is wrong then
6.          $W_1 = W \cup S_k$
7.          Adopt the incremental learning to obtain $f^*$
8.          Select a data point $S$ from $W_1$
9.          $W_2 = W_1 - S$
10.         Adopt the decremental learning to obtain $f^{**}$
11.         A new data point $S_{k+1} = \{x_{k+1}, y_{k+1}\}$ follows
12.         if $f^{**}(S_{k+1})$ is correct then
13.             $W = W_2$, $f_{current} = f^{**}$
14.         else
15.             $W = W_1$, $f_{current} = f^*$
16.         end if
17.     end if
18. end for
19. if $|W| == l$ then
20.     Terminate
21. else
22.     while the stop criterion is false do
23.         for $k = 1, \ldots, l$ do
24.             Input a new data point $S_k = \{x_k, y_k\}$
25.             if $S_k \notin W$ and $f_{current}(S_k)$ is wrong then
26.                 Repeat steps 6-16
27.             end if
28.         end for
29.     end while
30. end if

Before adopting the decremental learning step in the algorithm, one data point should be selected and removed from the working set. The selection rule is that the data point whose absolute support value $|\alpha_i|$ is the smallest is removed from the working set $W_1$, since removing the sample with the smallest absolute support value has very little influence on the performance of the classifier while preserving sparseness. After one iterative step is completed, the data points in the current working set can be used to calculate an objective value, given as follows:

$$Obj = \frac{1}{2} w^T w + \frac{1}{2} \gamma \sum_{S_k \in W} e_k^2 \tag{19}$$

where

$$w = \sum_k \alpha_k y_k \varphi(x_k), \qquad e_k = \frac{1}{\gamma} \alpha_k, \quad S_k \in W \tag{20}$$

From the above analysis, it is easy to see that the objective value of Eq. (19) increases until the algorithm stops. There certainly exists a working set covering most of the information in the training set, which enables the objective function to reach a fixed value. In Algorithm 1, the stop criterion is that the relative error of two adjacent iteration objective values is smaller than a given threshold $\varepsilon$, that is

$$\frac{Obj_{last} - Obj_{current}}{Obj_{last}} < \varepsilon \tag{21}$$
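Under the simplifying assumption that the small working-set system is re-solved directly (rather than maintained with the rank-one updates of Sects. 3 and 4), the adaptive loop of Algorithm 1 can be sketched as follows. The toy data, kernel and regularization parameters, and the end-of-sweep shortcut for "the next data point" are illustrative choices, not the paper's:

```python
import numpy as np

def rbf(A, B, s=1.0):
    return np.exp(-((A[:, None] - B[None, :]) ** 2).sum(-1) / (2.0 * s * s))

def lssvm_fit(X, y, gamma=10.0):
    # Direct solve of Eqs. (4) and (7) on the (small) working set
    l = len(y)
    H = np.outer(y, y) * rbf(X, X) + np.eye(l) / gamma
    Hi1 = np.linalg.solve(H, np.ones(l))
    HiY = np.linalg.solve(H, y.astype(float))
    b = (y @ Hi1) / (y @ HiY)
    return np.linalg.solve(H, np.ones(l) - b * y), b

def predict(Xw, yw, alpha, b, Xnew):
    return np.sign(rbf(Xnew, Xw) @ (alpha * yw) + b)

# Toy data: two Gaussian blobs
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 0.7, (100, 2)), rng.normal(2, 0.7, (100, 2))])
y = np.array([-1.0] * 100 + [1.0] * 100)

# Initial working set: one point per class (step 1)
W = [0, 100]
alpha, b = lssvm_fit(X[W], y[W])
for k in range(len(X)):
    if predict(X[W], y[W], alpha, b, X[k:k+1])[0] != y[k]:   # steps 4-5
        W1 = W + [k]                                          # step 6
        a1, b1 = lssvm_fit(X[W1], y[W1])                      # step 7
        j = int(np.argmin(np.abs(a1)))        # steps 8-9: smallest |alpha|
        W2 = [w for i, w in enumerate(W1) if i != j]
        a2, b2 = lssvm_fit(X[W2], y[W2])                      # step 10
        nxt = min(k + 1, len(X) - 1)          # step 11 (reuse last point at end)
        if predict(X[W2], y[W2], a2, b2, X[nxt:nxt+1])[0] == y[nxt]:
            W, alpha, b = W2, a2, b2                          # step 13
        else:
            W, alpha, b = W1, a1, b1                          # step 15

acc = np.mean(predict(X[W], y[W], alpha, b, X) == y)
```

On this easy toy problem the working set stays far smaller than the training set while the full-set accuracy remains high, mirroring the sparseness behavior reported for ASVM.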
6 Numerical Experiments

The experiments are run on an HP server with a 2.5 GHz Pentium IV processor and a maximum of 2 GB of memory, running Windows 2000 Server. All programs are written in C++ using Microsoft's Visual C++ 6.0 compiler. In order to evaluate the speed and performance of the proposed algorithm, ASVM and LS-SVM are applied to seven UCI datasets available from the UCI Machine Learning Repository [21] and a benchmark problem.
In the experiments, the RBF kernel function is employed; the parameter σ used for each dataset is shown in Table 1. The other parameters are: γ = 1.0, N = 2, ε = 0.01. The initial values of α_k are zero. In order to compare the accuracy of the methodologies, k-fold cross validation is used and the results are the averages over the k runs. Considering the computational time, we set k = 10 for the first five datasets and k = 5 for the last three. The results are listed in Table 2. To illustrate the idea that the working set obtained by ASVM can cover most of the information in the training set, Fig. 1 is given. To show the convergence of ASVM, the iterative processes on the Tic-Tac-Toe and KRKPA7 datasets are illustrated in Figs. 2 and 3.

Table 1. Parameter σ used for each data set

Data Set                                     σ
Checkerboard                                 0.1
Tic-Tac-Toe, KRKPA7, KRK                     1.0
Splice-junction, Letter Image Recognition    5.0
Image Segmentation                           10.0
Sat Image                                    15.0
Table 2. Comparison of the results obtained by LS-SVM and ASVM

Dataset (l × n)                  Algorithm   Training Correctness (%)   Testing Correctness (%)   Time (secs)    # of SVs
Tic-Tac-Toe (900 × 9)            LS-SVM      100.0                      99.31                     15.038         900
                                 ASVM        95.48                      93.45                     0.616          47
Segmentation (2200 × 19)         LS-SVM      99.52                      96.15                     298.536        2200
                                 ASVM        96.26                      92.94                     22.247         253
Splice-junction (3000 × 60)      LS-SVM      99.82                      91.15                     886.964        3000
                                 ASVM        94.91                      93.00                     178.827        411
KRKPA7 (3000 × 36)               LS-SVM      99.96                      99.23                     865.802        3000
                                 ASVM        96.66                      95.87                     25.445         153
Sat Image (4435 × 36)            LS-SVM      99.98                      95.12                     3906.050       4435
                                 ASVM        97.61                      94.80                     60.739         197
Checkerboard (10000 × 2)         LS-SVM      99.03                      98.62                     30118.612      10000
                                 ASVM        96.67                      95.98                     256.319        674
Letter Recognition (12000 × 16)  LS-SVM      96.95                      95.02                     103645.653     12000
                                 ASVM        94.51                      91.86                     3915.016       1676
KRK (15000 × 6)                  LS-SVM      99.91                      99.70                     145298.25      15000
                                 ASVM        97.79                      98.12                     195.859        310
S. Yu et al.
From Table 2, it can be seen that, compared with LS-SVM, ASVM greatly reduces the computational cost for large classification problems while losing little training accuracy and generalization performance. For example, on the KRK dataset, the training accuracy obtained by LS-SVM and ASVM is 99.91% and 97.79%, respectively, and the testing accuracy is 99.70% and 98.12%, respectively, while ASVM trains about 740 times faster than LS-SVM. The numbers of support vectors obtained by LS-SVM and ASVM are 15000 and 310, respectively, which shows that the proposed algorithm overcomes the loss of sparseness in LS-SVM. On the other datasets, the number of support vectors obtained by ASVM is likewise much smaller than that obtained by LS-SVM. Since the testing cost grows with the number of support vectors, ASVM is clearly also much faster than LS-SVM at testing time; a comparison of testing times is therefore omitted.
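The quoted speedup can be checked directly from the Table 2 entries for the KRK dataset:

```python
# Training times for KRK from Table 2 (seconds).
ls_svm_time, asvm_time = 145298.25, 195.859
speedup = ls_svm_time / asvm_time
assert 735 < speedup < 745   # roughly 740x faster, as stated in the text
```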
Fig. 1. The support vectors obtained by ASVM on Checkerboard dataset
Fig. 2. The convergence curve of ASVM applied to Tic-Tac-Toe
Fig. 3. The convergence curve of ASVM applied to KRKPA7
Fig. 1 illustrates the ASVM training results on the Checkerboard dataset. It can be seen that ASVM finds the working set adaptively, and the samples in the working set cover most of the information in the training set.
From Figs. 2 and 3, it can be seen that the objective value defined in Eq. (19) increases strictly monotonically with the number of iterations until the final working set is found. The reason is that before convergence, the size of the working set grows with each iteration, and the corresponding objective value grows with it.
7 Discussions and Conclusions

In this paper, based on incremental and decremental learning strategies, an adaptive and iterative support vector machine learning algorithm has been presented for large classification problems. During training, the incremental and decremental procedures are performed alternately, and a small working set is formed adaptively; it covers most of the information in the whole training set and overcomes the loss of sparseness in LS-SVM. Using this working set, one can construct the classifier. In general, the number of elements in the working set is much smaller than that in the training set, so the proposed algorithm can not only train on large datasets quickly but also test them efficiently while losing very little accuracy. The ASVM algorithm and LS-SVM are applied to seven UCI datasets and a benchmark problem, and the obtained results are compared. The results show that the proposed algorithm is convergent and that its training and testing speeds are much higher than those of LS-SVM on large classification problems, with little loss of training and testing accuracy. One possible extension of the proposed algorithm is to handle large datasets group by group rather than one sample at a time. The presented algorithm could also be promising for large multi-class problems and online learning problems with a sifting factor.
Acknowledgements This work has been supported by the National Natural Science Foundation of China (10471045, 60433020), Natural Science Foundation of Guangdong Province (970472, 000463, 04020079), Excellent Young Teachers Program of Ministry of Education of China, Fok Ying Tong Education Foundation (91005), Social Science Research Foundation of MOE (2005-241), Key Technology Research and Development Program of Guangdong Province (2005B10101010), Key Technology Research and Development Program of Tianhe District (051G041) and Natural Science Foundation of South China University of Technology (D76010).
References

1. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)
2. Joachims, T.: Making Large-Scale Support Vector Machine Learning Practical. In: Schölkopf, B., Burges, C.J.C., Smola, A.J. (eds.): Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA (1998) 169-184
3. Keerthi, S.S., Shevade, S.K., Bhattacharyya, C., Murthy, K.R.K.: Improvements to Platt's SMO Algorithm for SVM Classifier Design. Neural Computation 13(3) (2001) 637-649
4. Li, J.M., Zhang, B., Lin, F.Z.: An Improved Algorithm to Sequential Minimal Optimization. Journal of Software 14(5) (2003) 918-924 (in Chinese)
5. Osuna, E., Freund, R., Girosi, F.: An Improved Training Algorithm for Support Vector Machines. In: IEEE Workshop on Neural Networks and Signal Processing. IEEE Press, Amelia Island (1997) 276-285
6. Platt, J.C.: Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. In: Schölkopf, B., Burges, C.J.C., Smola, A.J. (eds.): Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA (1998) 185-208
7. Sun, J., Zheng, N.N., Zhang, H.: An Improved Sequential Minimization Optimization Algorithm for Support Vector Machine Training. Journal of Software 13(10) (2002) 2007-2012 (in Chinese)
8. Keerthi, S.S., Gilbert, E.G.: Convergence of a Generalized SMO Algorithm for SVM Classifier Design. Machine Learning 46(1-3) (2002) 351-360
9. Lin, C.J.: Convergence of the Decomposition Method for Support Vector Machines. IEEE Transactions on Neural Networks 12(6) (2001) 1288-1298
10. Mangasarian, O.L., Musicant, D.R.: Successive Overrelaxation for Support Vector Machines. IEEE Transactions on Neural Networks 10(5) (1999) 1032-1037
11. Mangasarian, O.L., Musicant, D.R.: Lagrangian Support Vector Machines. Journal of Machine Learning Research 1 (2001) 161-177
12. Yang, X.W., Shu, L., Hao, Z.F., Liang, Y.C., Liu, G.R., Han, X.: An Extended Lagrangian Support Vector Machine for Classification. Progress in Natural Science 14(6) (2004) 519-523
13. Suykens, J.A.K., Vandewalle, J.: Least Squares Support Vector Machine Classifiers. Neural Processing Letters 9(3) (1999) 293-300
14. Chua, K.S.: Efficient Computations for Large Least Square Support Vector Machine Classifiers. Pattern Recognition Letters 24(1-3) (2003) 75-80
15. Fine, S., Scheinberg, K.: Efficient SVM Training Using Low-Rank Kernel Representations. Journal of Machine Learning Research 2(2) (2002) 243-264
16. Suykens, J.A.K., Lukas, L., Vandewalle, J.: Sparse Approximation Using Least Squares Support Vector Machines. In: IEEE International Symposium on Circuits and Systems. Presses Polytechniques et Universitaires Romandes, Geneva, Switzerland (2000) 757-760
17. Van Gestel, T., Suykens, J.A.K., Baesens, B., Viaene, S., Vanthienen, J., Dedene, G., De Moor, B., Vandewalle, J.: Benchmarking Least Squares Support Vector Machine Classifiers. Machine Learning 54(1) (2004) 5-32
18. Cauwenberghs, G., Poggio, T.: Incremental and Decremental Support Vector Machine Learning. In: Advances in Neural Information Processing Systems 13, Denver, CO (2000) 409-415
19. Lau, K.W., Wu, Q.H.: Online Training of Support Vector Classifier. Pattern Recognition 36 (2003) 1913-1920
20. Liu, J.H., Chen, J.P., Jiang, S., Cheng, J.S.: Online LS-SVM for Function Estimation and Classification. Journal of University of Science and Technology Beijing 10 (2003) 73-77
21. Murphy, P.M., Aha, D.W.: UCI Repository of Machine Learning Databases (1992) http://www.ics.uci.edu/~mlearn/MLRepository.html
SVDD-Based Method for Fast Training of Multi-class Support Vector Classifier

Woo-Sung Kang, Ki Hong Im, and Jin Young Choi

School of Electrical Engineering and Computer Science, Automation and Systems Research Institute, Seoul Nat'l University, San 56-1 Shillim-dong, Kwanak-ku, Seoul 151-744, Korea, 550-749 {wskang, khim, jychoi}@neuro.snu.ac.kr
Abstract. Multi-class classifiers such as one-against-all or one-against-one methods built from binary classifiers are computationally expensive for multi-class problems due to the repeated use of training data: each binary classifier needs training data from at least two classes. To solve this problem, we propose a new multi-class classification method based on SVDD (Support Vector Domain Description) that needs only one class's data to describe each class. To verify the performance of the proposed method, experiments are carried out in comparison with two methods based on binary classification: one-against-all and one-against-one. The experimental results show that the proposed method reduces the order of the training time as the size of the training data increases.
1 Introduction
Multi-class classifiers that use support vector learning have typically been designed by combining several binary classifiers such as SVMs [1]. When binary classifiers such as SVMs are used to solve a multi-class problem, the number of variables used for the multi-class SVM is proportional to the number of classes, so the training speed depends on how the SVMs are combined into a multi-class classifier. The representative multi-class classification methods based on SVM are one-against-all [2] and one-against-one [3]. The one-against-all method is simple to implement, but its training is very slow, because the data of all classes have to be used to train each SVM. The one-against-one method is known to be more suitable for practical use than other methods [4]; however, it is also not ideal, since it uses the data set repeatedly. In order to reduce the computational load, we need to avoid the repeated use of data. Thus, we use the SVDD algorithm [6] to construct the multi-class classifier. Because each SVDD is trained with the data of only one class and does not need the data of the other classes, the training time depends only on the size of the whole data set. However, since the data of all classes except its own is not used when training each SVDD, there may be overlap between the boundaries of classes. To classify points in such overlap regions, we introduce a decision function using a
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 991-996, 2006. © Springer-Verlag Berlin Heidelberg 2006
992
W.-S. Kang, K.H. Im, and J.Y. Choi
contour map. Using this decision function, we can avoid the performance degradation due to the overlap. The paper is organized as follows. In Section 2, we analyze the computational load of the one-against-all and one-against-one methods. Section 3 describes the SVDD method and proposes a new multi-class support vector classifier based on SVDD. In Section 4, comparative experiments show the performance of the three multi-class classification methods. Finally, Section 5 concludes the paper with some remarks.
2 Computational Load Analysis of Multi-class Classifiers Using SVM
For the computational load analysis of the one-against-all and one-against-one methods, suppose that the numbers of classes and data are given as k and l, respectively.

2.1 One-Against-All Method
The one-against-all method [2] uses the data of all classes to train each SVM. To train the ith SVM, it uses all examples of the ith class with positive labels and all examples of the other classes with negative labels. Thus, given l training data, we must solve a dual problem with l variables to train one SVM, and the number of SVMs used for one-against-all equals the number of classes. Hence kl variables are used to design the multi-class classifier.

2.2 One-Against-One Method
In the one-against-one method [3], we solve a dual problem whose number of variables equals the number of data in the two classes being separated, so the average number of data for training each classifier is 2l/k. Since the number of SVMs used for the one-against-one method is k(k − 1)/2, a total of (k − 1)l variables is used for the design of the multi-class classifier.
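The variable counts of Sections 2.1-2.2 (and of the SVDD scheme analyzed later in Section 3.3) can be summarized in a small helper; the function and scheme names are ours, introduced only for illustration.

```python
def num_dual_variables(k, l, scheme):
    """Total number of dual variables needed to build a k-class classifier
    from l training examples, following the counting arguments in the text."""
    if scheme == "one-against-all":
        return k * l                 # k SVMs, each trained on all l examples
    if scheme == "one-against-one":
        # k(k-1)/2 SVMs, each trained on ~2l/k examples -> (k-1)l in total
        return (k * (k - 1) // 2) * (2 * l // k)
    if scheme == "svdd":
        return k * (l // k)          # k SVDDs, each on its own ~l/k examples
    raise ValueError(scheme)

assert num_dual_variables(5, 1000, "one-against-all") == 5000   # kl
assert num_dual_variables(5, 1000, "one-against-one") == 4000   # (k-1)l
assert num_dual_variables(5, 1000, "svdd") == 1000              # l
```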
3 Proposed Multi-class Classification Method by SVDD

3.1 Overview of Support Vector Domain Description
The objective of SVDD [6] is to find a sphere with minimum volume containing all or most of the data. For a sphere described by center a and radius R, we minimize the cost function

F(R, a, ξ_i) = R² + C Σ_i ξ_i,  (1)

where the parameter C gives the trade-off between the volume of the sphere and the number of errors. Eq. (1) is minimized under the constraints

‖φ(x_i) − a‖² ≤ R² + ξ_i,  ∀i, ξ_i ≥ 0,  (2)
where a is the center of the sphere and soft constraints are incorporated through the slack variables ξ_i. To solve this problem, we introduce the Lagrangian

L = R² − Σ_i α_i (R² + ξ_i − ‖φ(x_i) − a‖²) − Σ_i μ_i ξ_i + C Σ_i ξ_i,  α_i ≥ 0, μ_i ≥ 0.  (3)

Setting the derivatives of L with respect to R, a, and ξ_i to zero leads to

L = Σ_i α_i K(x_i, x_i) − Σ_{i,j} α_i α_j K(x_i, x_j),  0 ≤ α_i ≤ C,  Σ_i α_i = 1,  (4)

where K(x, y) = φ(x) · φ(y). We obtain the center a and radius R by maximizing (4) with respect to α_i. To determine whether a test point z is within the sphere, its distance to the center of the sphere is computed:

K(z, z) − 2 Σ_i α_i K(z, x_i) + Σ_{i,j} α_i α_j K(x_i, x_j) ≤ R².  (5)

When this distance is smaller than the radius, the test point z is accepted.
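A minimal numpy sketch of solving the dual (4) and evaluating the distance test (5). The Frank-Wolfe solver is our illustrative choice (it handles the simplex constraint Σ α_i = 1 directly, and we assume C is large enough that the box constraint stays inactive); it is not an algorithm from the paper.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def svdd_dual(K, n_iter=500):
    """Maximize sum_i a_i K_ii - a^T K a  subject to  a >= 0, sum(a) = 1,
    as in Eq. (4), by Frank-Wolfe steps toward the best simplex vertex."""
    n = K.shape[0]
    a = np.full(n, 1.0 / n)
    for t in range(n_iter):
        grad = np.diag(K) - 2 * K @ a   # gradient of the dual objective
        j = int(np.argmax(grad))        # vertex e_j maximizing the linearization
        step = 2.0 / (t + 2)
        a = (1 - step) * a
        a[j] += step
    return a

def center_distance2(z, X, a, K, sigma=1.0):
    """Left-hand side of Eq. (5): squared feature-space distance of z
    to the sphere center a = sum_i a_i phi(x_i)."""
    kz = rbf_kernel(z[None, :], X, sigma)[0]
    return 1.0 - 2 * a @ kz + a @ K @ a   # K(z, z) = 1 for the RBF kernel
```

A point near the training data then yields a smaller distance than one far away, so the acceptance test (5) behaves as expected.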
3.2 SVDD-Based Multi-class Support Vector Classifier
The structure of the proposed algorithm is similar to the one-against-all method: it constructs k SVDD models, where k is the number of classes. However, here the ith SVDD is trained with only the examples of the ith class. Thus, given l training data (x_1, y_1), ..., (x_l, y_l), where x_i ∈ R^n is an input and y_i ∈ {1, ..., k} is the class of x_i, training the ith SVDD is equivalent to solving the quadratic problem

min_{R_i, a_i, ξ^i} (R_i)² + C Σ_j ξ_j^i,  subject to  ‖φ(x_j) − a_i‖² ≤ (R_i)² + ξ_j^i, ∀j, ξ_j^i ≥ 0,  (6)

where x_j is an element of the ith class and j indexes the data in the ith class. After solving (6), we obtain k domain descriptions

R_m(z) = K(z, z) − 2 Σ_i α_i^m K(z, x_i^m) + Σ_{i,j} α_i^m α_j^m K(x_i^m, x_j^m),  (7)
m = 1, ..., k. We employ the decision function in (7) to determine the class. Simply, in a case that the data between classes is separable, if we can look for domain in
Fig. 1. Contour map by domain description in 2 dimensional input space
which the distance between the test data and its center is smaller than the radius, the class of the test data can be determined. However, there may be overlap between the boundaries of classes, since the data of all classes except its own is not used when training each SVDD. Thus, instead of looking for the domain that includes the test data, we use the contour map of the domain obtained by (7) to determine the class of a test input. An example contour map is shown in Fig. 1. The contour map can be used as a similarity measure between a center and a test point. As the value of the measure goes to zero, i.e., the test point is inside the smallest contour shown in Fig. 1, the point approaches the center of the corresponding class. Using this property, we redefine (7) as the following decision function:

D_m(z) = exp(−R_m(z) / a²),  m = 1, ..., k.  (8)

When z approaches the center of the mth class domain, the value of the decision function goes to 1; otherwise, it goes to 0 as z gets away from the center. The decision function is similar to a density function. The final decision is made by the criterion

class of x ≡ arg max_{i=1,...,k} D_i(x).  (9)
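Since exp(−r/s) is monotone decreasing in r, the rule (8)-(9) simply picks the class whose center is nearest in the feature space. A sketch, with the exponent's normalization written as a generic scale parameter `s2` and the distance values below purely hypothetical:

```python
import numpy as np

def classify(distances, s2=1.0):
    """Eqs. (8)-(9): given R_m(z), the squared feature-space distance of a
    test point z to each class center from Eq. (7), compute the decision
    values D_m(z) = exp(-R_m(z) / s2) and return the arg-max class."""
    D = np.exp(-np.asarray(distances, dtype=float) / s2)
    return int(np.argmax(D))

# z lies closest to the center of class 1, so it is assigned to class 1,
# even if it falls inside several overlapping class domains.
assert classify([1.5, 0.2, 0.9]) == 1
```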
3.3 Computational Load Analysis of the Proposed Algorithm
Suppose again that the numbers of classes and data are given as k and l, respectively. The proposed algorithm uses only the data of the corresponding class to train each SVDD, so the average number of data for training one SVDD is l/k, and the number of SVDDs equals the number of classes. Hence l variables are used to design the multi-class classifier, which indicates that the proposed method avoids the repeated use of training data. Experimental results show that the training of the proposed method is faster than that of multi-class SVM.
4 Experimental Results
This experiment has been carried out to show the fast training capability of the proposed algorithm. We test the training speed and accuracy of the proposed method on the Yale face database, which is composed of 165 gray-scale images of 15 individuals, with 11 images of different facial expressions per individual. We use the closely cropped set; the face images are downsampled to 21 × 30 pixels and reshaped to 1 × 630 vectors. In data preprocessing, the input dimension is reduced by PCA with 20 principal components, and 14 features are then extracted by PCA-FX [5]. The ratios of training to test data are 4:7, 5:6, 6:5, and 7:4, where each ratio uses 9 data sets with the same statistical properties. Our experimental results compare the three multi-class classification methods in terms of the mean and standard deviation of accuracy and of training time. The experiments were conducted on a Pentium 4 2.00 GHz with 512 MB RAM using MATLAB 6.5. For each optimization problem, we use the quadratic programming solver provided by MATLAB.

Table 1. The mean of training time (in seconds) in the Yale face database. The training of the proposed method is very fast in comparison with the other methods, and the time required for training does not increase on a large scale with the increase of training data.

| Train:Test | one-against-all Linear | one-against-all Gaussian | one-against-one Linear | one-against-one Gaussian | proposed method Linear | proposed method Gaussian |
| 4:7 | 10.93 | 10.29 | 3.25 | 3.25 | 0.18 | 0.24 |
| 5:6 | 17.56 | 16.81 | 4.94 | 4.95 | 0.22 | 0.27 |
| 6:5 | 26.61 | 25.71 | 6.62 | 6.61 | 0.27 | 0.28 |
| 7:4 | 38.51 | 38.08 | 8.19 | 8.29 | 0.29 | 0.30 |

Table 2. The mean and standard deviation of accuracy in the Yale face database. The accuracy of the proposed method is comparable to that of the other methods.

| Train:Test | one-against-all Linear | one-against-all Gaussian | one-against-one Linear | one-against-one Gaussian | proposed method Linear | proposed method Gaussian |
| 4:7 | 89.7(3.2) | 89.7(3.3) | 89.2(3.0) | 91.0(3.0) | 90.6(3.3) | 90.6(3.3) |
| 5:6 | 90.1(3.2) | 90.2(3.3) | 89.9(3.5) | 91.2(3.5) | 90.1(4.1) | 90.2(4.0) |
| 6:5 | 90.2(2.2) | 90.7(3.8) | 90.4(4.1) | 91.7(4.2) | 90.4(3.5) | 90.5(3.2) |
| 7:4 | 91.9(3.9) | 91.9(3.9) | 91.1(4.1) | 92.1(4.2) | 93.1(4.1) | 93.3(4.1) |

We tested the performance with Linear and Gaussian kernels in each multi-class classification method. Table 1 shows the mean of training time in each
multi-class classification method. The first column gives the ratio of training to test data, where the total number of data (training plus test) in each set is 165; training times are in seconds. Table 1 shows that as the number of training data increases, the time required for training the multi-class SVMs increases rapidly. In contrast, the training of the proposed method is very fast in comparison with the other methods, and its training time does not increase on a large scale with the increase of training data. From this result, we conclude that the training time is reduced by avoiding the repeated use of training data. The comparison of accuracy on the Yale face database is shown in Table 2, where the number in parentheses is the standard deviation of the accuracy. As shown in Tables 1 and 2, the training of the proposed algorithm is much faster than that of the multi-class SVMs, and its accuracy is comparable to those of the other methods, so the proposed method is well suited to multi-class classification.
5 Conclusions
Because the training of an SVM needs data with both positive and negative labels, the data set of each class must be used repeatedly. If only the data set of the corresponding class is used to train each classifier, however, this repeated use of data can be avoided; we therefore use SVDD to solve the problem. As shown in the experimental results, the training of the SVDD-based method is much faster than that of the one-against-one method, which is known to be suitable for practical use, while its classification performance is comparable with that of the multi-class SVM methods. Moreover, the training time of the proposed method does not grow with the training data as quickly as that of the other methods.
References

1. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2 (1998) 121-167
2. Bottou, L., Cortes, C., Denker, J., Drucker, H., Guyon, I., Jackel, L., LeCun, Y., Muller, U., Sackinger, E., Simard, P., Vapnik, V.: Comparison of Classifier Methods: A Case Study in Handwritten Digit Recognition. Proc. Int. Conf. Pattern Recognition (1994) 77-87
3. Schölkopf, B., Burges, C.J., Smola, A.J. (eds.): Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA (1999) 255-268
4. Hsu, C.W., Lin, C.J.: A Comparison of Methods for Multiclass Support Vector Machines. IEEE Trans. Neural Networks 13(2) (2002) 415-425
5. Park, M.J., Na, J.H., Choi, J.Y.: PCA-based Feature Extraction Using Class Information. IEEE Int. Conf. Systems, Man and Cybernetics (2005) 341-345
6. Tax, D.M.J., Duin, R.P.W.: Support Vector Domain Description. Pattern Recognition Letters 20 (1999) 1191-1199
Binary Tree Support Vector Machine Based on Kernel Fisher Discriminant for Multi-classification

Bo Liu¹·*, Xiaowei Yang², and Zhifeng Hao²

¹ College of Computer Science and Engineering, South China University of Technology, Guangzhou 510640, P.R. China [email protected]
² School of Mathematical Science, South China University of Technology, Guangzhou 510640, P.R. China
Abstract. In order to improve the accuracy of the conventional algorithms for multi-classifications, we propose a binary tree support vector machine based on Kernel Fisher Discriminant in this paper. To examine the training accuracy and the generalization performance of the proposed algorithm, One-against-All, One-against-One and the proposed algorithms are applied to five UCI data sets. The experimental results show that in general, the training and the testing accuracy of the proposed algorithm is the best one, and there exist no unclassifiable regions in the proposed algorithm.
1 Introduction

Support vector machines (SVMs) [1] are good tools for binary classification problems. Multi-class classification problems are usually converted into binary ones. Several methods have been proposed to decompose and reconstruct multi-classification problems; good reviews are given by Rifkin [2] and [3]. In the previous work, Vapnik first proposed the One-against-All algorithm with discrete decision functions [4]; however, some unclassifiable regions exist and the accuracies are very low. To overcome this problem, Vapnik [1] introduced continuous decision functions instead of the discrete ones. Inoue and Abe [5] presented fuzzy support vector machines for One-against-All classification. Abe [6] proved that the One-against-All algorithm with fuzzy membership functions is equivalent to the One-against-All algorithm with continuous decision functions. Kreßel [7] put forward the pairwise method (One-against-One), in which the unclassifiable regions are reduced. To deal with the remaining unclassifiable regions, Platt et al. [8] proposed the Decision Directed Acyclic Graph (DDAG) based on pairwise classification, and Abe put forward a fuzzy support vector machine based on pairwise classification [9]. Recently, Hao et al. [10] proposed the Twi-map Support Vector Machine based on the One-against-One algorithm to improve the accuracy of multi-classification.

* Corresponding author.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 997 – 1003, 2006. © Springer-Verlag Berlin Heidelberg 2006
In this paper, in order to improve the accuracy of the conventional algorithms for multi-classification, we propose a binary tree support vector machine based on the Kernel Fisher Discriminant. Experimental results show that, in general, the training and testing accuracy of the proposed algorithm is the best, and no unclassifiable regions exist in the proposed algorithm. The rest of this paper is organized as follows: In Section 2, we briefly review the One-against-All and One-against-One algorithms. In Section 3, the proposed algorithm is given. Experimental results are shown in Section 4, and some conclusions are given in Section 5.
2 Multi-classification Algorithms
Multi-classification problems refer to assigning each observation to one of C classes; they are usually converted into binary problems. Up to now, several methods have been proposed to decompose and reconstruct multi-classification problems. In this section, we introduce some conventional algorithms.

2.1 One-Against-All Algorithm

Considering the conventional support vector machine proposed by Vapnik [4], one needs to determine C decision functions for the C classes. The optimal hyperplane for class i against the remaining classes, which has the maximum margin between them, is

D_i(x) = w_i^T ψ(x) + b_i,  (1)

where w_i is an m-dimensional vector and b_i is a scalar.
Fig. 1. Discrete Decision Function

Fig. 2. Continuous Decision Function
Vapnik put forward a discrete-function decision rule similar to binary classification: for an input vector x, if

D_i(x) > 0  (2)

holds for exactly one i, then x is classified into class i.
By this formulation, the decision is discrete and there exist some unclassifiable regions, as shown in Fig. 1. To avoid this drawback, Vapnik [1] proposed a continuous-decision-function rule: the input vector x is classified into the class

arg max_i D_i(x).  (3)

Since the value of D_i(x) is continuous, the decision is continuous and the unclassifiable regions disappear, as shown in Fig. 2.

2.2 One-Against-One Algorithm
According to the conventional pairwise classification [7], one needs to determine C(C − 1)/2 decision functions for the C classes. The optimal hyperplane for class i against class j, which has the maximum margin between them, is

D_ij(x) = w_ij^T ψ(x) + b_ij = 0,  (4)

where w_ij is an m-dimensional vector and b_ij is a scalar.
Fig. 3. One-Against-One Algorithm
Defining the orientation of the optimal hyperplane by

D_ij(x) = −D_ji(x),  (5)

for the input vector x one computes

D_i(x) = Σ_{j≠i, j=1}^{c} sgn(D_ij(x)),  (6)

where

sgn(x) = 1 if x > 0, and 0 if x ≤ 0,  (7)

and classifies x into the class

arg max_{i=1,...,c} D_i(x).  (8)
If i is not unique in Eq. (8), then x is unclassifiable. For example, in Fig. 3, if x is in the shaded region, D_i(x) = 1 for i = 1, 2, 3; therefore, the shaded region is unclassifiable [7].
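The voting rule (5)-(8) can be sketched as follows; the pairwise decision functions are supplied as a dictionary keyed by (i, j), and the toy linear functions in the check are ours, chosen so that one query point lands in the unclassifiable region of Fig. 3.

```python
import numpy as np

def pairwise_vote(x, D, c):
    """Eqs. (5)-(8): class i earns one vote for each j with D_ij(x) > 0,
    using D_ji(x) = -D_ij(x) when only one orientation is stored; the
    winner is the arg-max of the votes. A tied maximum means x falls in
    an unclassifiable region (the shaded area of Fig. 3)."""
    votes = np.zeros(c, dtype=int)
    for i in range(c):
        for j in range(c):
            if i == j:
                continue
            d = D[(i, j)](x) if (i, j) in D else -D[(j, i)](x)
            votes[i] += int(d > 0)
    best = int(np.argmax(votes))
    unique = int((votes == votes[best]).sum()) == 1
    return best, unique

D = {(0, 1): lambda x: x[0], (0, 2): lambda x: x[1],
     (1, 2): lambda x: x[0] - x[1]}
assert pairwise_vote((1.0, 1.0), D, 3) == (0, True)   # clear winner: class 0
assert pairwise_vote((1.0, -1.0), D, 3)[1] is False   # three-way tie
```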
3 The Proposed Algorithm
According to the introductions above, we can easily conclude that in the original formulations of the one-against-all and pairwise approaches for multi-classification, unclassifiable regions exist. To overcome this drawback and improve the general performance of these approaches, in this section we propose a binary tree support vector machine based on the Kernel Fisher Discriminant.

Let S = {(x_1, y_1), (x_2, y_2), ..., (x_l, y_l)} be a training set, where x_i ∈ R^m and y_i ∈ {1, 2, ..., C}. For a nonlinear function ψ(x) that maps the data samples into a potentially much higher (and possibly infinite) dimensional feature space, the image of S under ψ(x) is

ψ(S) = {(ψ(x_1), y_1), (ψ(x_2), y_2), ..., (ψ(x_l), y_l)}.

In the feature space, the norm of a feature vector ψ(x) is given by

‖ψ(x)‖ = √(⟨ψ(x), ψ(x)⟩) = √(k(x, x)),  (9)

and the distance between feature vectors ψ(x) and ψ(z) can be computed as

‖ψ(x) − ψ(z)‖² = ⟨ψ(x) − ψ(z), ψ(x) − ψ(z)⟩ = k(x, x) − 2k(x, z) + k(z, z).  (10)
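Identity (10) is what lets every later quantity be computed from kernel evaluations alone, without ever forming ψ(x); a quick check with a linear kernel, where it must reduce to the ordinary squared Euclidean distance:

```python
import numpy as np

def feature_distance2(x, z, k):
    """Eq. (10): squared distance between psi(x) and psi(z) in the feature
    space, computed purely from kernel evaluations."""
    return k(x, x) - 2 * k(x, z) + k(z, z)

linear = lambda u, v: float(np.dot(u, v))   # k(x, z) = <x, z>, psi = identity
x, z = np.array([1.0, 2.0]), np.array([4.0, 6.0])
assert abs(feature_distance2(x, z, linear) - 25.0) < 1e-12   # |x - z|^2 = 9 + 16
```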
Therefore, the Kernel Fisher Discriminant [12][13] is defined as

max J_F(w) = ‖m_1 − m_2‖² / (s_1² + s_2²),  (11)

where m_i and s_i² are given by

m_i = (1/n_i) Σ_{k=1}^{n_i} ψ(x_ik),  i = 1, 2,  (12)

s_i² = Σ_{k=1}^{n_i} (ψ(x_ik) − m_i)²,  i = 1, 2,  (13)

and n_i is the number of samples in class i. The distance between the two classes is
‖m_1 − m_2‖² = (1/n_1²) Σ_{i=1}^{n_1} Σ_{j=1}^{n_1} K(x_1i, x_1j) + (1/n_2²) Σ_{i=1}^{n_2} Σ_{j=1}^{n_2} K(x_2i, x_2j) − (2/(n_1 n_2)) Σ_{i=1}^{n_1} Σ_{j=1}^{n_2} K(x_1i, x_2j),  (14)
and the scatter within each of the two classes is

s_i² = Σ_{j=1}^{n_i} ‖ψ(x_ij) − m_i‖² = Σ_{j=1}^{n_i} k(x_ij, x_ij) − (1/n_i) Σ_{j=1}^{n_i} Σ_{k=1}^{n_i} k(x_ij, x_ik),  i = 1, 2.  (15)
Based on the Kernel Fisher Discriminant, the proposed algorithm is as follows:

Step 1: Let the set S contain all the classes. In the feature space, partition S into two sets S_1 and S_2 according to Eq. (11).
Step 2: Treat all the classes in S_1 as one class and the classes in S_2 as the other class, and construct the separating hyperplane in the same feature space.
Step 3: Let S = S_1 and S = S_2, respectively, and repeat Steps 1 and 2 until only one class remains in each of S_1 and S_2.

From this procedure, one sees that for a C-class problem the number of hyperplanes to be computed is C − 1, which is less than for One-against-All and One-against-One. To improve the accuracy of the conventional algorithms, we choose the two groups used to construct each hyperplane based on the Kernel Fisher Discriminant in the feature space, which is much more reasonable: Eq. (11) maximizes the difference between the two groups while minimizing the variance of the individual classes, so the most separable classes are separated at the upper nodes of the binary tree, which reduces classification errors. However, at each node of the binary tree, if the number of classes to be grouped is N, there are 2^N possible groupings. To keep the binary tree balanced and to reduce the complexity of the algorithm, we make the numbers of classes in the two groups as equal as possible; that is, at each node we consider only the C_N^{[N/2]} grouping possibilities.
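The split criterion (11), with (14) and (15), can be evaluated directly from the kernel blocks of the two candidate groups; a minimal numpy sketch (the function names are ours):

```python
import numpy as np

def between_distance2(K11, K22, K12):
    """Eq. (14): ||m1 - m2||^2 from the within- and cross-group kernel blocks."""
    n1, n2 = K11.shape[0], K22.shape[0]
    return K11.mean() + K22.mean() - 2.0 * K12.sum() / (n1 * n2)

def within_scatter(Kii):
    """Eq. (15): s_i^2 = sum_j k(x_ij, x_ij) - (1/n_i) sum_{j,k} k(x_ij, x_ik)."""
    return np.trace(Kii) - Kii.sum() / Kii.shape[0]

def fisher_score(K11, K22, K12):
    """Eq. (11): the criterion J_F maximized when picking a node's split."""
    return between_distance2(K11, K22, K12) / (
        within_scatter(K11) + within_scatter(K22))

# Two tight, well-separated 1-D groups under a linear kernel:
X1, X2 = np.array([[0.0], [0.1]]), np.array([[5.0], [5.1]])
K11, K22, K12 = X1 @ X1.T, X2 @ X2.T, X1 @ X2.T
assert abs(between_distance2(K11, K22, K12) - 25.0) < 1e-9   # (5.05 - 0.05)^2
assert fisher_score(K11, K22, K12) > 1000                    # very separable split
```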
4 Simulation Experiments

The experiments are run on a PC with a 2.8 GHz Pentium IV processor and a maximum of 512 MB of memory. All the programs are written in C++, using Microsoft's Visual C++ 6.0 compiler. To evaluate the performance of the proposed algorithm, One-against-All [1], One-against-One, and the presented algorithm are applied to five UCI data sets available from the UCI Machine Learning Repository [14]. Data preprocessing is as follows:

- Iris dataset: 150 data with 4 features and 3 classes. We choose 100 samples at random for training and the rest for testing.
- Glass dataset: 214 data points with 9 features and 6 classes. We select 154 at random for training and the remaining 60 data points for testing.
- Wine dataset: 178 data points with 13 features and 3 classes. We choose 118 samples at random for training and the rest for testing.
- Auto-mpg dataset: 398 data points with 7 features and 3 classes. We select 298 at random for training and the remaining 100 data points for testing.
- Car Evaluation Database: 1728 data points with 6 features and 4 classes. We select 1228 samples for training and the remaining 500 data points for testing.
An RBF kernel function is employed; the parameter values and the results are shown in Table 1. Although the parameters may not be the optimal hyperparameters, we consider them reasonable.

Table 1. Comparison of the results (accuracy in %)

Dataset   Parameter   One-against-All       One-against-One       Binary Tree
          (γ, σ)      Testing   Training    Testing   Training    Testing   Training
Iris      (1, 1)      94        97          96        96          96        96
Glass     (1, 1)      51.667    68.831      58.333    70.779      58.333    71.429
Wine      (1, 1)      96.667    100         96.667    92.372      100       100
Auto      (1, 1)      69        71.233      76        77.054      77        76.712
Car       (1, 1)      92        92.182      92        92.182      92.2      92.182
From Table 1 we can see that for the Glass, Wine and Car datasets, the training and testing accuracies of the proposed method are no lower than those of the One-against-All and One-against-One algorithms. For the Iris and Auto datasets, although the training accuracy of the proposed algorithm is not the highest, its testing accuracy is. For classification, generalization performance matters more than training accuracy, and the proposed method has no unclassifiable regions, so the proposed approach compares favorably with the other two.
5 Conclusions

Support vector machines were originally designed for binary classification. Multi-class classification problems are usually converted into binary ones, and several methods have been proposed to decompose and reconstruct multi-class problems. To improve the accuracy of the conventional algorithms for multi-class classification, we propose a binary tree support vector machine based on the Kernel Fisher Discriminant. To examine the training accuracy and generalization performance of the proposed algorithm, One-against-All, One-against-One and the proposed algorithm were applied to five UCI data sets. Simulation results show that the testing accuracy of the proposed algorithm is the best among the compared methods, and that the proposed binary tree SVM has no unclassifiable regions.

Binary Tree Support Vector Machine Based on Kernel Fisher Discriminant    1003
Acknowledgements

This work has been supported by the National Natural Science Foundation of China (10471045, 60433020), the Natural Science Foundation of Guangdong Province (970472, 000463, 04020079), the Excellent Young Teachers Program of the Ministry of Education of China, the Fok Ying Tong Education Foundation (91005), the Social Science Research Foundation of MOE (2005-241), the Key Technology Research and Development Program of Guangdong Province (2005B10101010), the Key Technology Research and Development Program of Tianhe District (051G041) and the Natural Science Foundation of South China University of Technology (D76010).
References

1. Vapnik, V.N.: Statistical Learning Theory. John Wiley & Sons, New York (1998)
2. Rifkin, R., Klautau, A.: In Defense of One-vs-All Classification. Journal of Machine Learning Research 5 (2004) 101-141
3. Bredensteiner, E.J., Bennett, K.P.: Multicategory Classification by Support Vector Machines. Computational Optimization and Applications 12 (1999) 53-79
4. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag, London, UK (1995)
5. Inoue, T., Abe, S.: Fuzzy Support Vector Machines for Pattern Classification. In: Proceedings of the International Joint Conference on Neural Networks, Vol. 2 (2001) 1449-1454
6. Abe, S.: Analysis of Multiclass Support Vector Machines. In: International Conference on Computational Intelligence for Modelling, Control and Automation (2003) 385-396
7. Kreßel, U.H.G.: Pairwise Classification and Support Vector Machines. In: Schölkopf, B., Burges, C.J.C., Smola, A.J. (eds.): Advances in Kernel Methods: Support Vector Learning. The MIT Press (1999) 255-268
8. Platt, J.C., Cristianini, N., Shawe-Taylor, J.: Large Margin DAGs for Multiclass Classification. In: Solla, S.A., Leen, T.K., Müller, K.-R. (eds.): Advances in Neural Information Processing Systems 12. The MIT Press (2000) 547-553
9. Tsujinishi, D., Abe, S.: Fuzzy Least Squares Support Vector Machines for Multiclass Problems. Neural Networks 16 (2003) 785-792
10. Hao, Z.F. et al.: Twi-Map Support Vector Machine for Multi-classification Problems. In: Proceedings of ISNN 2005. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag, Berlin Heidelberg New York (2005) 869-874
11. Suykens, J.A.K.: Least Squares Support Vector Machines for Classification and Nonlinear Modelling. Neural Network World 10(1) (2000) 29-47
12. Mika, S. et al.: Fisher Discriminant Analysis with Kernels. In: Hu, Y.-H., Larsen, J., Wilson, E., Douglas, S. (eds.): Neural Networks for Signal Processing IX. IEEE, Piscataway, NJ (1999) 41-48
13. Roth, V., Steinhage, V.: Nonlinear Discriminant Analysis Using Kernel Functions. In: Solla, S.A., Leen, T.K., Müller, K.-R. (eds.): Advances in Neural Information Processing Systems 12. The MIT Press, Cambridge (2000) 568-574
14. Murphy, P.M., Aha, D.W.: UCI Repository of Machine Learning Databases (1992). http://www.ics.uci.edu/~mlearn/MLRepository.html
A Fast and Sparse Implementation of Multiclass Kernel Perceptron Algorithm

Jianhua Xu

Department of Computer Science, Nanjing Normal University, Nanjing 210097, China
[email protected]
Abstract. The original multiclass kernel perceptron algorithm is time consuming in its training and discriminating procedures. In this paper, the reduced kernel-based discriminant function of each class is defined only by the training samples from that class and a bias term. Apart from the bias terms, the number of variables to be solved therefore equals the total number of training samples regardless of the number of classes, and the final discriminant functions are sparse. This strategy speeds up the training and discriminating procedures effectively. Further, an additional iterative procedure with a decreasing learning rate is designed to improve the classification accuracy in the nonlinearly separable case. Experimental results on five benchmark datasets using ten-fold cross validation show that our modified training methods run at least two times and at most five times as fast as the original algorithm.
1 Introduction

Support vector machine (SVM) [1] and other kernel machines (e.g., kernel Fisher discriminant analysis [2], kernel neuron algorithm [3], minimal VC dimension classifiers [4]) were primarily designed for binary classification problems. However, many real-world applications are multiclass classification problems. Two strategies are now widely used to deal with them. One strategy is to reduce a multiclass problem to a set of binary classification problems; the "one-against-other" [1], "one-against-one" [5] and "error correcting output coding" [6] techniques have been used to design multiclass SVM algorithms, but they only consider classification information from two or a few classes at a time. The other strategy is to consider all classes in one optimization problem, giving the "all-together" algorithms. By extending the binary SVM form, two different multiclass SVM methods were derived [1], [7], [8]. It is very difficult to solve them directly, since the number of variables to be solved is about the product of the number of training samples and the number of classes. Hsu and Lin [9] presented two decomposition techniques to solve these "all-together" methods effectively, but their algorithmic routines are still complicated.

The linear perceptron algorithm and its multiclass version are the simplest linear methods [10]. Based on the kernel trick from SVM, a multiclass kernel perceptron algorithm (MKPA) considering all classes at once was proposed and analyzed in detail, in which the kernel-based discriminant functions are defined by all training samples and bias terms [11]. A slightly different form was given intuitively in [12]. But the number of variables to be solved in MKPA is almost equal to that in the "all-together" SVM methods, and its kernel-based discriminant functions are not sparse, which results in high computational costs in the training and discriminating procedures.

In [7] a multiclass SVM method based on linear programming was proposed using reduced kernel-based discriminant functions: for each class, the reduced discriminant function is defined only by the training samples from that class and a bias term. Thus, apart from the bias terms, the number of variables to be solved equals the number of training samples regardless of the number of classes, and the final kernel-based discriminant functions are sparse. In this paper, we use this strategy to speed up the training and discriminating procedures of MKPA simultaneously. Further, an additional iterative procedure with a learning rate decreasing from 1 to 0 is used to improve the classification accuracy in the nonlinearly separable case. In our experiments, five benchmark datasets and the RBF kernel are examined using ten-fold cross validation. The results demonstrate that the training time can be decreased markedly.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1004 – 1009, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 A Multiclass Kernel Perceptron Algorithm

Let the training set of c classes be {(x_1, y_1), ..., (x_i, y_i), ..., (x_l, y_l)}, where x_i ∈ R^n and y_i ∈ {1, 2, ..., c}, i = 1, 2, ..., l, are the training vectors and their class labels, and l is the total number of training samples. For a c-class classification problem, c kernel-based discriminant functions have to be defined:

f_i(x) = Σ_{m=1}^{l} α_m^i k(x_m, x) + β_i,  i = 1, ..., c,   (1)

where α_m^i and β_i denote the parameters of the i-th function and k(·,·) is a kernel function. A pattern vector x is assigned to the class whose discriminant function is largest, i.e.,

if f_p(x) = max_{i=1,...,c} f_i(x), then x ∈ ω_p.   (2)

According to decision rule (2), a training sample (x_q, y_q) with y_q = i ∈ {1, 2, ..., c} (i.e., x_q ∈ ω_i) is referred to as a misclassified sample if there is at least one j ≠ i for which f_i(x_q) ≤ f_j(x_q). We can use such a misclassified sample to update the parameters in (1) according to the following iterative learning rule:

α_m^i ⇐ α_m^i + k(x_m, x_q), β_i ⇐ β_i + 1;  α_m^j ⇐ α_m^j - k(x_m, x_q), β_j ⇐ β_j - 1.   (3)
In this paper, executing one iteration means checking all training samples once. For a linearly separable set in the feature space, the multiclass kernel perceptron algorithm (MKPA) converges to a solution within a limited number of iterations starting from arbitrary initial values [11]. For the nonlinearly separable case, our trick is to set a maximal number of iterations (T1) to stop the iterative procedure and accept the current values as the final solution.

In (1), the kernel-based discriminant function of each class depends on all training samples. This means that the training procedure is time consuming, since c(l+1) variables have to be solved. Additionally, the final discriminant functions are usually not sparse, which results in a high computational cost in the discriminating procedure.
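As a concrete illustration of Eqs. (1)–(3), the following self-contained sketch trains MKPA on an assumed toy three-class set with an RBF kernel; the data and parameter choices are ours, not the paper's benchmarks.

```python
import math

def rbf(a, b, sigma=1.0):
    # RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    d = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d / (2.0 * sigma ** 2))

def train_mkpa(X, y, c, T1=100):
    # One coefficient vector per class: c*(l+1) variables in total (Eq. (1)).
    l = len(X)
    K = [[rbf(X[p], X[q]) for q in range(l)] for p in range(l)]
    alpha = [[0.0] * l for _ in range(c)]
    beta = [0.0] * c

    def f(i, q):  # Eq. (1) evaluated at training sample q
        return sum(alpha[i][m] * K[m][q] for m in range(l)) + beta[i]

    for _ in range(T1):
        mistakes = 0
        for q in range(l):
            i = y[q]
            j = max((t for t in range(c) if t != i), key=lambda t: f(t, q))
            if f(i, q) <= f(j, q):           # misclassified (ties count)
                for m in range(l):           # Eq. (3) update
                    alpha[i][m] += K[m][q]
                    alpha[j][m] -= K[m][q]
                beta[i] += 1.0
                beta[j] -= 1.0
                mistakes += 1
        if mistakes == 0:                    # converged (separable case)
            break
    return alpha, beta

def predict(x, X, alpha, beta):
    scores = [sum(a[m] * rbf(X[m], x) for m in range(len(X))) + b
              for a, b in zip(alpha, beta)]
    return max(range(len(scores)), key=scores.__getitem__)

# Assumed toy 3-class problem with well-separated clusters.
X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0), (10.0, 1.0)]
y = [0, 0, 1, 1, 2, 2]
alpha, beta = train_mkpa(X, y, c=3)
print([predict(x, X, alpha, beta) for x in X])  # [0, 0, 1, 1, 2, 2]
```

Note that every class's coefficient vector has an entry for every training sample, which is exactly the density the next section removes.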
3 A Fast and Sparse Implementation of MKPA

In this section, we speed up the training procedure and construct sparse discriminant functions by using reduced kernel-based discriminant functions for MKPA, and we further improve the classification accuracy with an additional iterative procedure with a decreasing learning rate for the nonlinearly separable case.

In binary kernel machines (c = 2), the discriminant function is defined as

f(x) = Σ_{m=1}^{l} α_m y_m k(x, x_m) + β,   (4)

where y_i ∈ {+1, -1}, and its corresponding decision rule is

if f(x) > 0, then x ∈ ω_1; otherwise x ∈ ω_2.   (5)

Formula (4) can be divided into two parts:

f(x) = Σ_{m=1, y_m=+1}^{l} α_m k(x, x_m) - Σ_{m=1, y_m=-1}^{l} α_m k(x, x_m) + β.   (6)

If two new discriminant functions are defined, i.e.,

f_1(x) = Σ_{m=1, y_m=+1}^{l} α_m k(x, x_m) + β_1,  f_2(x) = Σ_{m=1, y_m=-1}^{l} α_m k(x, x_m) + β_2,   (7)

where β = β_1 - β_2, the decision rule (5) can be turned into

if f_1(x) > f_2(x), then x ∈ ω_1; otherwise x ∈ ω_2.   (8)

In fact, this is a special case of formula (2) for c = 2. Following (7), Weston and Watkins [7] proposed the following reduced kernel-based discriminant functions to construct a multiclass SVM algorithm based on linear programming:

f_i(x) = Σ_{m=1, y_m=i}^{l} α_m k(x_m, x) + β_i,  i = 1, ..., c.   (9)

For each class, the reduced discriminant function is defined only by the training samples from that class and a bias term. We now use the reduced functions (9) to improve MKPA. According to (9), the iterative learning rule (3) can be rewritten as

α_m ⇐ α_m + k(x_m, x_q), β_i ⇐ β_i + 1, if y_m = i;
α_m ⇐ α_m - k(x_m, x_q), β_j ⇐ β_j - 1, if y_m = j.   (10)

Since in (10), apart from the bias terms, the number of variables equals the number of training samples regardless of the number of classes, the training procedure can be sped up effectively. Moreover, the computational time of the discriminating procedure decreases markedly, since the final discriminant functions are sparse. We refer to this iterative form as the reduced multiclass kernel perceptron algorithm (RMKPA).
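A self-contained sketch of the reduced rule (10) on an assumed toy problem (our data and parameters, not the paper's benchmarks). The key difference from the full algorithm is that a single coefficient per sample is kept, and each f_i sums only over the samples of class i.

```python
import math

def rbf(a, b, sigma=1.0):
    # RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    d = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d / (2.0 * sigma ** 2))

def train_rmkpa(X, y, c, T1=100):
    # Eq. (9)/(10): a single alpha per training sample (l variables plus
    # c biases); f_i sums only over samples of class i, so it is sparse.
    l = len(X)
    K = [[rbf(X[p], X[q]) for q in range(l)] for p in range(l)]
    alpha = [0.0] * l
    beta = [0.0] * c

    def f(i, q):
        return sum(alpha[m] * K[m][q] for m in range(l) if y[m] == i) + beta[i]

    for _ in range(T1):
        mistakes = 0
        for q in range(l):
            i = y[q]
            j = max((t for t in range(c) if t != i), key=lambda t: f(t, q))
            if f(i, q) <= f(j, q):
                for m in range(l):            # Eq. (10) update
                    if y[m] == i:
                        alpha[m] += K[m][q]
                    elif y[m] == j:
                        alpha[m] -= K[m][q]
                beta[i] += 1.0
                beta[j] -= 1.0
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, beta

def predict(x, X, y, alpha, beta, c):
    scores = [sum(alpha[m] * rbf(X[m], x)
                  for m in range(len(X)) if y[m] == i) + beta[i]
              for i in range(c)]
    return max(range(c), key=scores.__getitem__)

X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0), (10.0, 1.0)]
y = [0, 0, 1, 1, 2, 2]
alpha, beta = train_rmkpa(X, y, c=3)
print([predict(x, X, y, alpha, beta, 3) for x in X])  # [0, 0, 1, 1, 2, 2]
```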
In MKPA, the maximal number of iterations is predetermined to handle the nonlinearly separable case, so it is possible that the final solution is not optimal [13]. Additionally, in our experiments it was found that RMKPA could slightly degrade the classification accuracy. Therefore, after the training set has been determined to be nonlinearly separable using RMKPA, we add the following iterative procedure with a learning rate decreasing from 1 to 0 to improve the classification accuracy:

α_m ⇐ α_m + η(t) k(x_m, x_q), β_i ⇐ β_i + η(t), if y_m = i;
α_m ⇐ α_m - η(t) k(x_m, x_q), β_j ⇐ β_j - η(t), if y_m = j,   (11)

where the learning rate is η(t) = 1/t, t = 1, 2, ..., T2. Our modified training method thus includes two steps, RMKPA (10) followed by the additional iterative procedure (11), and is referred to as the reduced and decreased multiclass kernel perceptron algorithm (RDMKPA).
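The two-stage schedule can be made concrete: assuming T1 passes of rule (10) at rate 1 followed by T2 passes of rule (11) with η(t) = 1/t, the per-pass learning rates form the sequence below.

```python
def rdmkpa_rates(T1, T2):
    # RDMKPA schedule: T1 RMKPA passes at rate 1 (Eq. (10)), followed by
    # T2 refinement passes with eta(t) = 1/t decreasing toward 0 (Eq. (11)).
    return [1.0] * T1 + [1.0 / t for t in range(1, T2 + 1)]

rates = rdmkpa_rates(3, 4)  # e.g. "100+100" in the experiments means T1 = T2 = 100
```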
4 Experimental Results and Analysis

In order to illustrate the performance of our modified multiclass kernel perceptron algorithms (RMKPA and RDMKPA), we choose five ten-fold cross-validation datasets from [14], summarized in Table 1 in increasing order of set size. Note that some of them were also tested in [7]. In our experiments, we scale each attribute to the interval [0, 1] and rearrange the training samples according to their class labels. All experiments are executed on a P4-3.0GE with 1024 MB RAM using the VC6.0 compiler.

Table 1. Benchmark datasets used in our experiments

                                           Average error rates (%)
Dataset        Size   Class   Dimension    Grappa [14]   W&W [7]
Iris           150    3       4            3.33          1.33
Wine           178    3       13           2.25          3.60
Ecoli          336    8       7            13.01         n/a
Dermatology    358    6       34           3.08          n/a
Vowel          990    11      10           6.26          34.80
We examine the RBF kernel, k(x, y) = exp(-‖x - y‖² / (2σ²)) with width σ, and the maximal number of iterations in detail. For all datasets we first estimate the testing accuracy using a two-dimensional grid scheme, where the RBF width takes the values 2², 2¹, 2⁰, ..., 2⁻⁶ and the maximal iterations 20, 200 and 2000; in RDMKPA, "100+100" means T1 = 100 and T2 = 100. Thus for each dataset 3 × 9 = 27 combinations are tested. Around the optimal parameters with the best testing accuracy, we then search a small width interval with a linear step, with maximal iterations 200 or 2000. Table 2 provides the detailed results. Columns 3, 4 and 5 give the optimal maximal iterations, the optimal RBF kernel width, and the average error rate and variance on the testing sets, respectively; the last column indicates the width interval and its linear step. Table 2 confirms that RMKPA can slightly degrade the classification accuracy, but RDMKPA cancels this degradation on iris and ecoli. Comparing Table 2 with Table 1, our error rates on iris, wine and vowel are lower than Grappa's.
Table 2. Best experimental results based on a small width interval with a linear step

Dataset       Algorithm   Iterations   Width   Testing set   Width interval and step
Iris          MKPA        200          0.41    2.67±3.27     [1.0, 0.1], 0.01
              RMKPA       200          0.38    3.33±3.33
              RDMKPA      100+100      0.36    2.67±3.27
Wine          MKPA        200          0.47    1.70±3.60     [1.0, 0.1], 0.01
              RMKPA       200          0.35    2.22±5.09
              RDMKPA      100+100      0.35    2.22±5.09
Ecoli         MKPA        200          0.106   18.85±5.24    [0.15, 0.01], 0.001
              RMKPA       200          0.06    17.01±5.17
              RDMKPA      100+100      0.07    16.22±5.18
Dermatology   MKPA        200          0.63    2.75±2.73     [1.5, 0.4], 0.001
              RMKPA       200          0.80    3.31±2.01
              RDMKPA      1000+1000    1.05    3.32±2.38
Vowel         MKPA        200          0.175   3.13±1.23     [0.3, 0.1], 0.05
              RMKPA       200          0.175   3.23±2.38
              RDMKPA      100+100      0.175   3.23±2.38
In Table 3, the average training times of MKPA, RMKPA and RDMKPA are compared for maximal iterations 200, where each average training time is calculated over all width values. Columns 4 and 6 give the ratios of average training time of RMKPA to MKPA and of RDMKPA to MKPA. Our two modified algorithms run at least two times and at most five times as fast as MKPA. Moreover, as the set size increases, the training-time savings become more evident. These experimental results show that RMKPA and RDMKPA speed up the training procedure of MKPA efficiently with satisfactory performance.

Table 3. Comparison of average training time (ms) for MKPA, RMKPA and RDMKPA (maximal iterations = 200)

Dataset       MKPA    RMKPA   RMKPA/MKPA   RDMKPA   RDMKPA/MKPA
Iris          116     45      0.40         47       0.41
Wine          75      36      0.48         36       0.48
Ecoli         1030    185     0.18         192      0.19
Dermatology   1115    279     0.25         278      0.25
Vowel         10648   1829    0.17         1929     0.18
5 Conclusions

The original multiclass kernel perceptron algorithm is time consuming in its training and discriminating procedures, since the number of variables to be solved is about the product of the number of training samples and the number of classes, and its kernel-based discriminant functions are not sparse. In this paper, we first use reduced kernel-based discriminant functions to speed up the training procedure and construct sparse discriminant functions. Then, an additional training procedure with a decreasing learning rate is added to further improve the classification accuracy. The experimental results show that our new methods save computational time markedly. Our further work is to examine and analyze the algorithms elaborately on large-scale benchmark databases.

This work is supported by the Natural Science Foundation of Jiangsu Province (No. BK2004142).
References

1. Vapnik, V.N.: Statistical Learning Theory. John Wiley & Sons, New York (1998)
2. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müller, K.-R.: Fisher Discriminant Analysis with Kernels. In: Hu, Y.H., Larsen, J., Wilson, E., Douglas, S. (eds.): Proceedings of the 1999 IEEE Workshop on Neural Networks for Signal Processing, Vol. 9. IEEE Press, New York (1999) 41-48
3. Xu, J., Zhang, X.: A Learning Algorithm with Gaussian Regularizer for Kernel Neuron. In: Yin, F., Wang, J., Guo, C. (eds.): Advances in Neural Networks - ISNN 2004. Lecture Notes in Computer Science, Vol. 3173. Springer-Verlag, Berlin Heidelberg New York (2004) 252-257
4. Xu, J.: Designing Nonlinear Classifiers through Minimizing VC Dimension Bound. In: Wang, J., Liao, X., Yi, Z. (eds.): Advances in Neural Networks - ISNN 2005. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag, Berlin Heidelberg New York (2005) 900-905
5. Kreßel, U.: Pairwise Classification and Support Vector Machines. In: Schölkopf, B., Burges, C.J.C., Smola, A.J. (eds.): Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA (1999) 255-268
6. Allwein, E.L., Schapire, R.E., Singer, Y.: Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers. Journal of Machine Learning Research 1 (2000) 113-141
7. Weston, J., Watkins, C.: Support Vector Machines for Multiclass Pattern Recognition. In: Verleysen, M. (ed.): Proceedings of the Seventh European Symposium on Artificial Neural Networks - ESANN 1999. D-Facto Publications, Brussels (1999) 219-224
8. Crammer, K., Singer, Y.: On the Learnability and Design of Output Codes for Multiclass Problems. Machine Learning 47(2-3) (2002) 201-233
9. Hsu, C.W., Lin, C.J.: A Comparison of Methods for Multiclass Support Vector Machines. IEEE Trans. Neural Networks 13(2) (2002) 415-425
10. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. 2nd edn. John Wiley & Sons, New York (2001)
11. Xu, J., Zhang, X.: A Multiclass Kernel Perceptron Algorithm. In: Zhang, M., Shi, Z. (eds.): Proceedings of the 2005 International Conference on Neural Networks and Brain, Vol. 2. IEEE Press, New York (2005) 717-721
12. Ben-Reuven, E., Singer, Y.: Discriminative Binaural Sound Localization. In: Becker, S.T., Obermayer, K. (eds.): Advances in Neural Information Processing Systems 15. MIT Press, Cambridge, MA (2002) 1229-1236
13. Gallant, S.I.: Neural Networks Learning and Expert Systems. MIT Press, Cambridge, MA (1993)
14. http://www.grappa.univ-lille3.fr/~torre/guide.php?id=datasets, =methods and =results
Mutual Conversion of Regression and Classification Based on Least Squares Support Vector Machines

Jing-Qing Jiang1,2, Chu-Yi Song2, Chun-Guo Wu1,3, Yang-Chun Liang1,*, Xiao-Wei Yang4, and Zhi-Feng Hao4

1 College of Computer Science and Technology, Jilin University, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Changchun 130012, China
[email protected]
2 College of Mathematics and Computer Science, Inner Mongolia University for Nationalities, Tongliao 028043, China
3 Key Laboratory of Information Science & Engineering of Railway Ministry / The Key Laboratory of Advanced Information Science and Network Technology of Beijing, Beijing Jiaotong University, Beijing 100044, China
4 School of Mathematical Science, South China University of Technology, Guangzhou 510640, China
Abstract. Classification and regression are two of the most interesting problems in pattern recognition. A regression problem can be changed into a binary classification problem, and a least squares support vector machine can be used to solve that classification problem; the optimal hyperplane is then the regression function. In this paper, a one-step method is also presented for the multi-category problem: it converts the classification problem into a function regression problem and solves the converted problem by least squares support vector machines. The method classifies the samples of all categories simultaneously by solving only a set of linear equations. Numerical experiments are performed and good performance is obtained. The simulation results show that regression and classification can be converted into each other based on least squares support vector machines.
1 Introduction

Support vector machines (SVMs) were developed by Vapnik [1] based on statistical learning theory. SVMs have been used successfully for solving classification and function regression problems, and many algorithms for training them have been presented and studied. Suykens [2] suggested the least squares support vector machine (LSSVM), in which the inequality constraints are replaced by equality constraints. In this way, solving a quadratic program is converted into solving a set of linear equations; the efficiency of training an SVM is greatly improved and the difficulty is reduced. Suykens [3] further studied the LSSVM for function regression. In this paper, classification and regression problems are both solved by LSSVM.

Function regression plays an important role in many fields, and several SVM algorithms have been presented for it. Tao [4] gave the idea that each regression problem can be changed into a classification problem.

* Corresponding author.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1010 – 1015, 2006. © Springer-Verlag Berlin Heidelberg 2006

Motivated by Tao, we deal with the regression problem by classification and solve the classification problem using LSSVM. Conversely, we present that the multi-category classification problem can be changed into a regression problem. Weston [5] introduced two methods to solve the multi-category problem in one step: one requires solving a quadratic program, the other a linear program. Angulo [6] proposed K-SVCR (Support Vector Classification-Regression) for the multi-category classification problem. Based on function regression, a one-step method for the multi-category classification problem was proposed in [7]; it solves the function regression by least squares support vector machines and only needs to solve a set of linear equations. The conversion between classification and regression based on LSSVM is the focus of this paper.
2 Regression Based on Classification

Based on the support vector machine, the optimal regression function f(x, w) = w^T φ(x) + b should satisfy the structural risk minimization principle [1], that is, minimize

J(w) = (1/2) w^T w + γ R_emp[f],   (1)

where γ is a constant and R_emp[f] is the empirical risk. Several loss functions can be used to measure the empirical risk. Choosing the Vapnik ε-insensitive loss function

|ξ|_ε = 0 if |ξ| ≤ ε, and |ξ|_ε = |ξ| - ε otherwise,

Eq. (1) can be rewritten as [4]

J(w) = (1/2) w^T w + (1/N) Σ_{i=1}^{N} |y_i - f(x_i)|_ε.

In particular, when |y_i - w^T φ(x_i) - b| ≤ ε, Eq. (1) is equivalent to

min (1/2) w^T w
subject to  y_i - w^T φ(x_i) - b ≤ ε  and  w^T φ(x_i) + b - y_i ≤ ε.   (2)

The constraint conditions can be rewritten as

(y_i - ε) - (w^T φ(x_i) + b) ≤ 0  and  (y_i + ε) - (w^T φ(x_i) + b) ≥ 0.   (3)

Set Q1 = {(x_i, y_i + ε), i = 1, 2, ..., N} and Q2 = {(x_i, y_i - ε), i = 1, 2, ..., N}. The regression problem above is thus converted into the classification problem between Q1 and Q2.
We only need to construct the optimal hyperplane ŵ^T φ(z) + b̂ = 0, where z = (x, y). This hyperplane is the optimal solution of

min (1/2) ŵ^T ŵ
subject to  ŵ^T φ(z_i) + b̂ > 0 for z_i ∈ Q1,  and  ŵ^T φ(z_i) + b̂ < 0 for z_i ∈ Q2.   (4)

This is a binary classification problem, and the optimal hyperplane is the regression function for (2). We solve this classification problem by a least squares support vector machine.
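The construction of Q1 and Q2 can be illustrated with a minimal sketch. The data, ε, and hyperplane weights here are hypothetical, and a plain linear hyperplane stands in for the kernel machine that the paper actually trains.

```python
def make_q1_q2(samples, eps):
    # Shift each target up and down by eps: Q1 is the positive class,
    # Q2 the negative class, with z = (x, y) as in the text.
    q1 = [(x, y + eps) for x, y in samples]
    q2 = [(x, y - eps) for x, y in samples]
    return q1, q2

# Hypothetical data lying on y = 2x + 1, and the matching hyperplane
# 2x - y + 1 = 0 written as w1*x + w2*y + b = 0 (chosen by hand here;
# the paper obtains it from the LSSVM classifier instead).
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
q1, q2 = make_q1_q2(samples, eps=0.1)
w1, w2, b = 2.0, -1.0, 1.0
above = all(w1 * x + w2 * y + b < 0 for x, y in q1)
below = all(w1 * x + w2 * y + b > 0 for x, y in q2)
print(above and below)  # True: the hyperplane separates Q1 from Q2
```

Solving w1*x + w2*y + b = 0 for y recovers the regression function, here y = -(w1*x + b)/w2 = 2x + 1.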
3 Multi-category Classification Based on Regression

To present the conversion from classification to regression, we propose a novel multi-category classification approach based on regression. For multi-category classification, the samples are given as S1 = {(x_i, y_i) | x_i ∈ R^n, y_i ∈ L, i = 1, 2, ..., N}, where y_i is a class label representing the class of sample x_i and L = {l_1, l_2, ..., l_K} is the label set. In function regression the sample set is S2 = {(x_i, y_i) | x_i ∈ R^n, y_i ∈ R, i = 1, 2, ..., N}; since L ⊂ R, we have S1 ⊂ S2. This observation motivates us to solve classification by the regression method, treating the label y_i as the value of the regression function. Following [3], we construct a regression function f(x, w) that classifies a sample x. The label of x is decided by

j* = argmin_{1 ≤ j ≤ K} |f(x, w) - l_j|,

that is, the label of sample x is l_{j*}. The label is considered as the value of the regression function, and for each x_i the value of the regression function at x_i should be as close as possible to its label; the goal is to choose a regression function that reflects the relation between x_i and its label. We choose the radial basis function as the kernel function. For multi-category classification, the regression function f(x, w) is used as the classifier: when the value of f(x, w) for a given sample x falls in the specified region of its label, x is classified correctly.
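The decision rule can be sketched directly; the label values below are assumed, and any regressor can supply f(x, w).

```python
def classify_by_regression(f_value, labels):
    # j* = argmin_{1 <= j <= K} |f(x, w) - l_j|: assign the label
    # nearest to the regressed value.
    return min(labels, key=lambda l: abs(f_value - l))

labels = [1, 2, 3]
print(classify_by_regression(2.3, labels))  # 2
print(classify_by_regression(0.7, labels))  # 1
```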
4 Numerical Experiments

The experiments are run on a DELL PC with a 2.8 GHz Pentium IV processor and 512 MB of memory, under the Microsoft Windows XP operating system. All programs are compiled with Microsoft's Visual C++ 6.0.
4.1 Regression Based on Classification
In order to examine the regression efficiency based on classification, numerical experiments are performed using two kinds of data sets. The first kind consists of the simple elementary functions f(x) = sin(x), f(x) = x² and f(x) = x³, used to examine the regression ability on known functions. The second kind consists of the Mackey-Glass (MG) system, the sample function f(x) = sinc(x), and the "housing" data set.

The MG system is a blood cell regulation model established in 1977 by Mackey and Glass [8]. It is described by

dx/dt = a·x(t - τ) / (1 + x^{10}(t - τ)) - b·x(t),

where τ = 17, a = 0.2, b = 0.1, Δt = 1, t ∈ (0, 400). The embedding dimensions are n = 4, 6, 8, respectively. The sample function is

sinc(x) = 1 if x = 0, and sinc(x) = sin(x)/x if x ≠ 0.

The "housing" data set concerns housing values in the suburbs of Boston. It has 506 samples with 13 continuous attributes. Because the attribute values vary greatly, they are normalized.
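A minimal sketch of generating MG training data by Euler integration of the delay equation above; the initial condition x0 and the constant pre-history are our assumptions, since the paper does not state them.

```python
def mackey_glass(n_steps, a=0.2, b=0.1, tau=17, dt=1.0, x0=1.2):
    # Euler integration of dx/dt = a*x(t-tau)/(1 + x(t-tau)^10) - b*x(t).
    x = [x0] * (tau + 1)              # constant history for the delay term
    for _ in range(n_steps):
        x_tau = x[-tau - 1]
        dx = a * x_tau / (1.0 + x_tau ** 10) - b * x[-1]
        x.append(x[-1] + dt * dx)
    return x[tau + 1:]

series = mackey_glass(400)
# Embedding with dimension n: input (x_t, ..., x_{t+n-1}), target x_{t+n}.
n = 4
pairs = [(series[t:t + n], series[t + n]) for t in range(len(series) - n)]
```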
Fig. 1. Testing results ("+") for "sin" (left) and "sinc" (right)
Table 1 shows the results of LSSVC for regression. Each data set has 1000 training samples and 100 testing samples. To run the proposed method, three parameters must be predetermined for each data set: the smoothing factor γ, the band width σ (specific to the Gaussian kernel function) and the ε of the Vapnik ε-insensitive loss function. We set γ = 2000, σ = 1 and ε = 0.01, except γ = 200000 for housing. Figure 1 shows the testing results for "sin" and "sinc"; the values of the regression function are close to the values of the real functions.
Table 1. Results of LSSVC for regression

                       Sin       Square    Cube      Sinc      MG 4      MG 6      MG 8      Housing
Training accuracy (%)  100       100       100       100       100       100       100       94.6
Testing MSE            2.41e-5   3.04e-5   2.72e-5   2.93e-5   2.25e-5   2.71e-5   2.74e-5   2.28e-3
4.2 Multi-category Classification Based on Regression
We use two types of experiments to demonstrate the performance of the proposed method: experiments in the plane using artificial data sets, which can be visualized, and experiments using benchmark data sets.

Table 2. Correct rate on artificial data sets

Name           Points   Classes   γ     σ     Training correct rate   Testing correct rate
Strip_1        1000     10        800   3     1.00000                 0.99600
Strip_2        1000     10        100   2     1.00000                 0.99600
Checkboard_1   4000     2         8     0.1   0.99200                 0.97500
Checkboard_2   4000     3         15    0.1   0.98850                 0.96350
Table 2 shows the results on the artificial data sets. The proposed method is then applied to three UCI data sets (iris, wine and glass) available from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). Table 3 shows the testing correct rate on the benchmark data sets, where qp-mc-sv and KSVCR denote the results from references [5] and [6], respectively, LSSVRC denotes the result of the proposed method, and Attr and Cla denote the numbers of attributes and categories. From Table 3 it can be seen that the multi-class classification problem can be dealt with by solving a regression problem, and that the correct rate of the proposed method is comparable to that of other multi-class classification methods.
Name
Points
Iris Wine Glass
150 178 214
Attr 4 13 9
Cla 3 3 7
γ
σ
qp-mc-sv
3000 300 2000
5 180 2
0.986700 0.964000 0.644000
KSVCR 0.980700 0.977100 0.695300
LSSVRC 1.000000 0.972973 0.722222
5 Conclusions and Discussions

A regression problem can be transformed into a classification problem. The optimal hyperplane can be obtained by least squares support vector machines, and this optimal hyperplane is the regression function. Numerical experiments on several functions show that the mean square error is satisfactory. Moreover, a novel one-step method for the multi-category classification problem is proposed in this paper. It converts the classification problem into a function regression problem and solves the regression by least squares support vector machines. This method only needs to solve a set of linear equations. The performance of the proposed algorithm is good on artificial data, and it is comparable to other classification methods on benchmark data sets. Therefore, classification problems and regression problems can be converted into each other.
Acknowledgments

The authors are grateful for the support of the National Natural Science Foundation of China (60433020), the Science-Technology Development Project of Jilin Province of China (20050705-2), the Doctoral Funds of the National Education Ministry of China (20030183060), the "985" Project of Jilin University, the Tianhe District Foundation (051G041), the Graduate Innovation Lab of Jilin University (503043), and the Natural Science Foundation of South China University of Technology (D76010).
References

1. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, Berlin Heidelberg New York (1995)
2. Suykens, J.A.K., Vandewalle, J.: Least Squares Support Vector Machine Classifiers. Neural Processing Letters 9(3) (1999) 293-300
3. Suykens, J.A.K., Lukas, L., Vandewalle, J.: Sparse Approximation Using Least Squares Support Vector Machines. In: IEEE International Symposium on Circuits and Systems (ISCAS 2000), Geneva, Switzerland (2000) 757-760
4. Tao, Q., Cao, J.D., Sun, D.M.: A Regression Method Based on the Support Vectors for Classification. Journal of Software 13(5) (2002) 1024-1028
5. Weston, J., Watkins, C.: Multi-class Support Vector Machines. Technical Report CSD-TR-98-04, Royal Holloway, University of London, Egham, UK (1998)
6. Angulo, C., Parra, X., Catala, A.: K-SVCR. A Support Vector Machine for Multi-class Classification. Neurocomputing 55 (2003) 57-77
7. Jiang, J.Q., Wu, C.G., Liang, Y.C.: Multi-category Classification by Least Squares Support Vector Regression. In: Wang, J., Liao, X.F., Yi, Z. (eds.): International Symposium on Neural Networks. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag, Berlin Heidelberg New York (2005) 863-868
8. Flake, G.W., Lawrence, S.: Efficient SVM Regression Training with SMO. Machine Learning 46(1-3) (2002) 271-290
Sparse Least Squares Support Vector Machine for Function Estimation

Liang-zhi Gan1,2, Hai-kuan Liu2, and You-xian Sun1

1 National Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou, China
[email protected]
2 Electrical Engineering Department, Xuzhou Normal University, Xuzhou, 221011, China
[email protected]
Abstract. A sparse least squares support vector machine (SLS-SVM) for regression is proposed to solve regression problems with large sample data. The samples are mapped into a Reproducing Kernel Hilbert Space (RKHS), where they span a subspace. We then find a basis of this subspace; the basis can represent all the samples linearly, so the least squares support vector machine can be obtained by solving a small set of linear equations. A numerical example illustrates that this approach can fit nonlinear models to large data sets. Compared with the common least squares support vector machine, this method finds a sparse solution without any pruning, and the computing speed is much faster because the final result is found by solving a small-scale set of linear equations.
1 Introduction

The Support Vector Machine (SVM) is based on statistical learning theory [1]. The Structural Risk Minimization (SRM) principle has been successfully applied in SVM, and many people are attracted by its excellent performance. Based on the standard SVM, many variants have been put forward, such as the Least Squares Support Vector Machine (LS-SVM) [2]. The difference between LS-SVM and the standard SVM is that the constraints are equalities, not inequalities. LS-SVM only needs to solve linear equations, not a quadratic programming problem. However, the standard LS-SVM cannot obtain a sparse solution. Suykens [3] gave a pruning method, which removes all support vectors whose coefficients are below some threshold. This method is considerably complex and obtains only an approximate solution. This paper proposes a way to obtain a Sparse Least Squares Support Vector Machine (SLS-SVM). The paper is organized as follows. In Section 2 we summarize the properties of the standard LS-SVM. We derive the theory of SLS-SVM and give the algorithm in Section 3. In Section 4 we illustrate the simulation results of the algorithm. Finally, the conclusion is provided in Section 5.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1016 – 1021, 2006. © Springer-Verlag Berlin Heidelberg 2006

2 Analysis of Standard LS-SVM

LS-SVM for function estimation minimizes an objective function including the sum of squared errors. Consider a training set containing N data points
D = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, where x_k ∈ R^n is the input vector and y_k ∈ R is the corresponding output. A supposed nonlinear transformation maps the input vector x_k into a high-dimensional feature space: x_k ↦ φ(x_k). The regression model in feature space is represented as follows:

    y(x) = w^T φ(x) + b                                             (1)

Finding the solution of LS-SVM in the feature space is equivalent to solving the following quadratic programming problem [2]:

    min Φ(w, b, e) = (1/2) w^T w + (γ/2) Σ_{k=1}^N e_k^2            (2)
    s.t. y_k = w^T φ(x_k) + b + e_k,  k = 1, …, N
where γ is a user-defined constant, which balances model complexity and approximation accuracy, and e_k is the approximation error at the k-th sampling point. For problem (2), the Lagrangian can be defined as follows:

    L = Φ(w, b, e) − Σ_{k=1}^N α_k [y(x_k) + e_k − y_k]             (3)
where α_k ∈ R is the Lagrangian multiplier. Optimizing the above problem, we get the following equations:

    [ 0     1_v^T        ] [ b ]   [ 0 ]
    [ 1_v   Ω + (1/γ) I  ] [ α ] = [ y ]                            (4)

    w = Σ_{k=1}^N α_k φ(x_k)                                        (5)

where I is the identity matrix, y = [y_1; …; y_N], 1_v = [1; …; 1], and α = [α_1; …; α_N]. The elements of the matrix Ω are Ω_kl = φ(x_k)^T φ(x_l). So we get the regression function:
    y = Σ_{k=1}^N α_k φ(x_k)^T φ(x) + b                             (6)

where α and b are the solutions to equation (4). If there were a "kernel function" k(x_i, x_j) such that k(x_i, x_j) = φ(x_i)^T φ(x_j), we would only need to use k(x_i, x_j) in the training algorithm, and would never need to explicitly know what φ(x) is. One example is

    k(x_i, x_j) = φ(x_i)^T φ(x_j) = exp(−‖x_i − x_j‖^2 / σ^2)       (7)
In this particular example, φ(x) is infinite-dimensional, and we have defined the dot product by (7). This feature space is the so-called Reproducing Kernel Hilbert Space (RKHS). But the solution of LS-SVM is not sparse: to get the solution of LS-SVM with N samples, we must solve an (N + 1)-dimensional set of linear equations. When the sampling data set is very large, it is very difficult to find the solution. To get a sparse solution, Suykens gave a pruning method [3], which is very complex and finds only an approximate solution. This paper gives a new way to obtain a sparse solution.
3 Sparse LS-SVM

Observing (5), we find that w is a linear combination of all samples in feature space. If we can find a basis of the sampling data set in feature space, w can be represented simply, so to find a sparse solution we only need to find this basis. All elements of the sampling data set {φ(x_k) : k = 1, …, N} span a subspace of the feature space. We can find a basis of this subspace in the following way. Without loss of generality, suppose φ(x_k) is an element of the sampling data set. By solving the following minimization problem, we know whether φ(x_k) belongs to the basis or not:

    min f(λ) = (φ(x_k) − Σ_{i≠k} λ_i φ(x_i))^T (φ(x_k) − Σ_{i≠k} λ_i φ(x_i))    (8)
It is easy to see that f(λ) ≥ 0. If min f(λ) = 0, φ(x_k) is not an element of the basis, because it can be linearly represented by the other elements; but if min f(λ) > 0, φ(x_k) is one of the basis elements. According to the Kuhn-Tucker conditions, we have:

    ∂f/∂λ = −2 K_0 + 2 K λ = 0                                      (9)

where K_0 = (k(x_1, x_k), …, k(x_{l−1}, x_k))^T and K is a square matrix with K_ij = k(x_i, x_j). The Mercer condition applies to the matrix K, so K is positive definite and K^{−1} exists. We have

    λ_min = K^{−1} K_0                                              (10)
The following steps are used to find the basis of the sampling data set in feature space:

1. Form two new sets X_l = {φ(x_1)} and X_A = ∅;
2. For every φ(x_k), k = 1, …, N, get the value of
       min f(λ) = (φ(x_k) − Σ_{φ(x_i)∈X_l} λ_i φ(x_i))^T (φ(x_k) − Σ_{φ(x_i)∈X_l} λ_i φ(x_i));
3. If the minimum of f(λ) is 0, φ(x_k) is added to X_A; else φ(x_k) is added to X_l;
4. Go back to step 2 until all the vectors are checked.
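The steps above can be sketched in kernel form: by Eqs. (8)-(10), the minimum of f(λ) evaluates to k(x_k, x_k) − K_0^T K^{−1} K_0, which needs no explicit φ. The tolerance value and the toy data below are illustrative assumptions.

```python
import numpy as np

def find_basis(X, kernel, tol=1e-8):
    # greedy basis search implementing steps 1-4 above
    basis = [0]                              # step 1: X_l = {phi(x_1)}
    for k in range(1, len(X)):               # step 2: check every sample
        K = kernel(X[basis], X[basis])
        K0 = kernel(X[basis], X[k:k + 1]).ravel()
        lam = np.linalg.solve(K, K0)         # lambda = K^{-1} K_0, Eq. (10)
        f_min = kernel(X[k:k + 1], X[k:k + 1])[0, 0] - K0 @ lam
        if f_min > tol:                      # step 3: phi(x_k) is independent
            basis.append(k)
    return basis

def gauss(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

# duplicated samples map to the same feature vector, so only one
# representative of each value enters the basis
X = np.array([[0.0], [1.0], [0.0], [2.0], [1.0]])
print(find_basis(X, gauss))   # -> [0, 1, 3]
```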
The set X = {x_1, …, x_N} is denoted as X' in feature space, that is, X' = [φ(x_1), …, φ(x_N)]. The above process splits X' into two parts: X_l and X_A. All the elements in X_l are linearly independent, and the elements in X_A can be linearly represented by X_l. The elements of the set X_l are different from each other, so we can regard it as a row vector: X_l = [φ(x_1), …, φ(x_l)]. Since we know that w = Σ_{k=1}^N α_k φ(x_k) and the elements in X_A can be linearly represented by X_l, w can be written as w = Σ_{φ(x_i)∈X_l} θ_i φ(x_i) = X_l θ, and we can rewrite formula (2) as:
    min_{θ,b} Φ(θ, b, e) = (1/2) (X_l θ)^T (X_l θ) + (γ/2) Σ_{k=1}^N e_k^2      (11)
    s.t. (X_l θ)^T φ(x_k) + b = y_k − e_k,  k = 1, …, N                         (12)

where θ = [θ_1, …, θ_l] and l is the number of elements in the set X_l. Substituting e_k (k = 1, …, N) from (12) into (11), we get:

    L(θ, b) = (1/2) θ^T X_l^T X_l θ + (γ/2) Σ_{j=1}^N (y_j − (X_l θ)^T φ(x_j) − b)^2    (13)
According to Lagrangian theory, we have ∂L/∂θ = 0 and ∂L/∂b = 0. This means:

    γ Σ_{φ(x_j)∈X'} (X_l^T φ(x_j) φ(x_j)^T X_l θ + X_l^T φ(x_j) b − X_l^T y_j φ(x_j)) + X_l^T X_l θ = 0    (14)

    Σ_{φ(x_j)∈X'} ((X_l θ)^T φ(x_j) + b − y_j) = 0                              (15)
The result can be written as the matrix equation (16):

    [ (1/γ) X_l^T X_l + Σ_{φ(x_j)∈X'} X_l^T φ(x_j) φ(x_j)^T X_l    X_l^T Σ_{φ(x_j)∈X'} φ(x_j) ] [ θ ]   [ X_l^T Σ_{φ(x_j)∈X'} y_j φ(x_j) ]
    [ Σ_{φ(x_j)∈X'} φ(x_j)^T X_l                                   N                          ] [ b ] = [ Σ_{j=1}^N y_j                  ]    (16)
The final regression model is

    y(x) = Σ_{φ(x_i)∈X_l} θ_i k(x_i, x) + b                                     (17)

where θ and b are the solutions to equation (16). The dimension of the set of equations (16) is l + 1. According to Burges [7], if the kernel function is k(x_1, x_2) = exp(−‖x_1 − x_2‖^2 / σ^2), the width parameter σ controls the VC dimension of the RBF SVM: when σ → 0+, the VC dimension of the RBF SVM tends to infinity, and when σ → +∞, it tends to zero. So σ is an important parameter; a suitable σ controls both the fitting ability of SLS-SVM and the dimension of the vector X_l.
We give the algorithm of SLS-SVM as follows:

1. Define a kernel function and find the basis X_l of the sampling set in feature space;
2. Solve the set of equations (16) to get θ and b;
3. Construct the regression function: y = Σ_{φ(x_i)∈X_l} θ_i k(x_i, x) + b.
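The algorithm can be sketched in kernel form: with G[i, j] = k(x_i, x_j) for x_i in the basis and x_j ranging over all samples, the blocks of Eq. (16) become (1/γ) K_ll + G G^T, G·1, 1^T G^T, and N. The fixed coarse basis and the parameter values below are illustrative assumptions standing in for step 1.

```python
import numpy as np

def gauss(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def sls_svm_fit(X, y, basis, gamma=100.0, sigma=1.0):
    # assemble and solve the (l + 1)-dimensional system of Eq. (16)
    Xl = X[basis]
    K_ll = gauss(Xl, Xl, sigma)
    G = gauss(Xl, X, sigma)                 # l x N cross-kernel matrix
    l, N = G.shape
    A = np.zeros((l + 1, l + 1))
    A[:l, :l] = K_ll / gamma + G @ G.T
    A[:l, l] = G.sum(axis=1)
    A[l, :l] = G.sum(axis=1)
    A[l, l] = N
    rhs = np.concatenate((G @ y, [y.sum()]))
    sol = np.linalg.solve(A, rhs)
    return sol[:l], sol[l]                  # theta, b

def sls_svm_predict(X, basis, theta, b, X_new, sigma=1.0):
    # y(x) = sum_i theta_i k(x_i, x) + b over the basis, Eq. (17)
    return gauss(X_new, X[basis], sigma) @ theta + b

# toy usage: a fixed coarse basis (every 5th sample) stands in for step 1
X = np.linspace(0, 2 * np.pi, 60)[:, None]
y = np.sin(X).ravel()
basis = list(range(0, 60, 5))
theta, b = sls_svm_fit(X, y, basis)
y_hat = sls_svm_predict(X, basis, theta, b, X)
```

The solved system is only (l + 1)-dimensional, so the cost drops sharply whenever l is much smaller than N.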
4 Simulation Result

Here we present an example to illustrate how effective the algorithm is. In Figure 1, the training data are sampled from the function y = sin x / x + e, where e ~ N(0, 0.1). 400 points are sampled, and the number of basis elements is 42. All the basis elements are marked with the symbol "O". Figure 2 shows the approximation result. We can easily see that, because many of the vectors are removed from the data set, the number of effective training vectors is much smaller than the size of the whole data set. Though the experiment is only in a two-dimensional space, it can easily be generalized to higher-dimensional spaces.

Fig. 1. Scatter diagram of the sample data set
Fig. 2. Simulation result
5 Conclusions

This paper proposed the SLS-SVM training algorithm for function estimation. The algorithm greatly reduces the need for computer memory and computing time, because the dimension of the matrix is greatly reduced. It also provides a foundation for online LS-SVM. Future work is to develop the online algorithms.
References

1. Vapnik, V.: Statistical Learning Theory. John Wiley and Sons, New York (1998)
2. Suykens, J.A.K.: Least Squares Support Vector Machine Classifiers. Neural Processing Letters 9 (1999) 293
3. Quandt, R.E., Ramsey, J.B.: Estimating Mixtures of Normal Distributions and Switching Regressions. Journal of the American Statistical Association 73 (1978) 730-752
4. Redner, R.A., Walker, H.F.: Mixture Densities, Maximum Likelihood and the EM Algorithm. SIAM Review 31(2) (1984) 195-239
5. Caudill, S.B., Acharya, R.N.: Maximum-likelihood Estimation of a Mixture of Normal Regressions: Starting Values and Singularities. Communications in Statistics - Simulation 27(3) (1998) 667-674
6. Yee, L., Ma, J.H., Zhang, W.X.: A New Method for Mining Regression Classes in Large Data Sets. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(1) (2001) 5-21
7. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2 (1998) 121-167
A Multiresolution Wavelet Kernel for Support Vector Regression

Feng-Qing Han1,2, Da-Cheng Wang1, Chuan-Dong Li2, and Xiao-Feng Liao2

1 Chongqing Jiaotong University, 400074 Chongqing, China
[email protected]
2 Chongqing University, 400044 Chongqing, China
Abstract. In this paper a multiresolution wavelet kernel function (MWKF) is proposed for support vector regression. Differently from traditional SVR, a dimension-reduction step is applied before the dimension-increasing mapping. The nonlinear mapping φ(x) from the input space S to the feature space has an explicit expression based on dimensionality reduction and wavelet multiresolution analysis, and this wavelet kernel function can be represented as an inner product. The method guarantees that the quadratic program of support vector regression has a feasible solution and requires no parameter selection in the kernel function. Numerical experiments demonstrate the effectiveness of this method.
1 Introduction

Support vector machines (SVM) are derived from the idea of the generalized optimal hyperplane with maximum margin between two classes, an idea which implements the structural risk minimization (SRM) principle of statistical learning theory [1]. Maximizing the margin plays an important role in capacity control, so that the SVM not only has small empirical risk but also good generalization performance. At present, SVMs have been applied successfully to a wide variety of domains, such as support vector regression (SVR) [2,3], machine learning [4,5], computer vision [6,7], and many others. The SVM is a linear classifier in the input space S = {x}, but it is easily extended to a nonlinear classifier by mapping the input space into a high-dimensional feature space F = {φ(x)}. By choosing an adequate mapping φ, the input data become linearly separable, or mostly linearly separable, in the high-dimensional feature space. We need not compute the mapped patterns φ(x) explicitly; instead we only need the inner products between mapped patterns. By choosing different kinds of kernels, SVM can realize Radial Basis Function (RBF), polynomial, and multi-layer perceptron classifiers.

Wavelet representations can serve as a tool for studying information over a range of scales. Their power lies in their ability to describe data, functions, or operators at different resolutions using building blocks of localized extent. This feature makes wavelets ideal for efficiently representing localized features (such as transients, singularities, and other discontinuities), because we only need to add detail where it really matters. The original wavelet constructions were developed primarily during the 1980s; of these classical constructions, the most famous are the orthonormal compactly supported wavelets that Ingrid Daubechies developed [8]. Classical wavelet bases are generated by considering a complete set of shifted and scaled versions of a single function, known as the mother wavelet. In this sense, the wavelet basis is said to be translation invariant and scale invariant. Recently, wavelets have been chosen as kernels for SVM [9,10]. These kernels are tensor-product wavelet and translation-invariant kernels, not inner-product kernels. Although there are many mother wavelets, only a few can be applied as kernels in SVM because of the Mercer condition.

Inspired by both SVR and one-dimensional wavelet multiresolution analysis, this paper presents a new wavelet kernel function for SVR, called the multiresolution wavelet kernel function. The paper is organized as follows. Section 2 introduces the SVR algorithm. Section 3 provides background on multiresolution analysis. Section 4 formulates the wavelet kernel function and the learning algorithm. Section 5 gives the experimental results. Finally, some conclusions are drawn in Section 6.
2 SVR Algorithm

Suppose the training data are {(x_i, y_i) | i = 1, 2, …, l}, (x_i, y_i) ∈ R^N × R. The output of an SVR is written as

    f(x) = <w, φ(x)> + b                                            (1)

where w is the weight vector, b the bias, and φ the nonlinear mapping from the input space S to the high-dimensional feature space F; <·,·> represents the inner product. Model (1) can be obtained by solving the following optimization problem [1]:

    min Z = (1/2) ‖w‖^2 + C Σ_i (ξ_i+ + ξ_i−)
    s.t. y_i − <w, φ(x_i)> − b ≤ ε + ξ_i+,
         −y_i + <w, φ(x_i)> + b ≤ ε + ξ_i−,
         ξ_i+, ξ_i− ≥ 0,  i = 1, 2, …, l                            (2)

where C is a regularization constant, controlling a compromise between maximizing the margin and minimizing the number of training-set errors, and ξ_i+, ξ_i− are nonnegative slack variables. After the introduction of Lagrange multipliers, the dual of this optimization problem is

    max Z = −(1/2) Σ_{i,j} (α_i − α_i′)(α_j − α_j′) K(x_i, x_j) − ε Σ_i (α_i + α_i′) + Σ_i y_i (α_i − α_i′)
    s.t. Σ_i (α_i − α_i′) = 0,
         0 ≤ α_i ≤ C,
         0 ≤ α_i′ ≤ C                                               (3)

where K(x_i, x_j) = <φ(x_i), φ(x_j)> is a nonlinear kernel function.
When QP (3) is solved, model (1) can be rewritten as [1]

    f(x) = Σ_{x_i∈SV} (α_i − α_i′) K(x_i, x) + b                    (4)

where α_i, α_i′ are the optimized solution of QP (3), and

    b = (1/N_nsv) { Σ_{0<α_i<C} [ y_i − Σ_{x_j∈SV} (α_j − α_j′) K(x_j, x_i) − ε ]
                  + Σ_{0<α_i′<C} [ y_i − Σ_{x_j∈SV} (α_j − α_j′) K(x_j, x_i) + ε ] }
3 Multiresolution Analysis

A multiresolution analysis [8] of L^2(R) is defined as a set of closed subspaces V_j, j ∈ Z, that exhibit the following properties:

(1) ⋯ ⊂ V_{−1} ⊂ V_0 ⊂ V_1 ⊂ V_2 ⊂ ⋯
(2) f(x) ∈ V_j ⇔ f(2x) ∈ V_{j+1}
(3) f(x) ∈ V_0 ⇔ f(x + 1) ∈ V_0
(4) ∩_j V_j = {0}
(5) clos{ ∪_j V_j } = L^2(R)
(6) a scaling function φ(t) ∈ V_0 exists such that the set {φ(t − k) | k ∈ Z} is a Riesz basis of V_0.

As a result, a sequence {h_k} exists such that the scaling function satisfies a refinement equation

    φ(t) = √2 Σ_k h_k φ(2t − k)                                     (5)

Define W_j as a complementary space of V_j in V_{j+1}, such that V_{j+1} = V_j ⊕ W_j. Consequently

    L^2(R) = ⊕_j W_j                                                (6)

A function ψ(t) is a wavelet if the set of functions {ψ(t − k) | k ∈ Z} is a Riesz basis of W_0. Since the wavelet is also an element of V_1, a sequence {g_k} exists such that

    ψ(t) = √2 Σ_k g_k φ(2t − k)                                     (7)

The set of wavelet functions {ψ_{j,k}(t) | j, k ∈ Z} is then a Riesz basis of L^2(R). The coefficients in the expansion of a function in the wavelet basis are given by the inner product with the dual wavelets ψ̃_{j,k}(t) = 2^{j/2} ψ̃(2^j t − k), such that for all f(t) ∈ L^2(R) we have

    f(t) = Σ_{j,k} <f, ψ̃_{j,k}> ψ_{j,k}(t)                          (8)
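The splitting V_{j+1} = V_j ⊕ W_j can be made tangible with the Haar system, where the V_j part of a discrete signal consists of pairwise averages and the W_j detail of pairwise differences; this toy example is only an illustration of the property, not part of the paper:

```python
import numpy as np

# Haar illustration of V_{j+1} = V_j (+) W_j: a signal at scale j+1 splits
# into pairwise averages (projection on V_j) and pairwise differences
# (detail in W_j), and the two parts reconstruct it exactly.
s = np.array([4.0, 2.0, 5.0, 7.0])          # element of V_{j+1}
avg = (s[0::2] + s[1::2]) / 2               # projection onto V_j
det = (s[0::2] - s[1::2]) / 2               # detail component in W_j

recon = np.empty_like(s)
recon[0::2] = avg + det
recon[1::2] = avg - det
print(avg, det, recon)    # [3. 6.] [ 1. -1.] [4. 2. 5. 7.]
```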
4 Multiresolution Wavelet Kernel Function

From Eq. (8), one can reconstruct a function exactly. But this holds only for functions of a single variable, not multivariable functions. For a multivariable function, construct a one-to-one mapping on the whole training data as follows:

    u : R^N → R,  u(x) = t,  x ∈ R^N, t ∈ R                         (9)

such that for training data x_i ≠ x_j,

    u(x_i) ≠ u(x_j)                                                 (10)

Define two hyperplanes P1 and P2:

    P1: Σ_{j,k} w_{j,k} ψ_{j,k}(u(x)) + b_1 = 0
    P2: Σ_{j,k} w_{j,k} ψ_{j,k}(u(x)) + b_2 = 0                     (11)

where w_{j,k} = <y, ψ̃_{j,k}>, b_1 = −min_i y_i, and b_2 = −max_i y_i. Each training point in {(x_i, y_i) | i = 1, 2, …, l} lies between P1 and P2. This means that the optimization problem (2) has a feasible solution when the nonlinear mapping φ(x) is defined as:

    φ(x) = (ψ_{j,k}(u(x)))_{∞×∞}                                    (12)

The multiresolution wavelet kernel function then has the following form:

    K(x, x′) = Σ_{j,k} ψ_{j,k}(u(x)) ψ_{j,k}(u(x′))                 (13)
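A hedged sketch of Eq. (13) follows, truncated to finite index sets for j and k as any implementation must be; the Haar mother wavelet and the index ranges are assumptions made here for concreteness (the paper does not fix a particular ψ):

```python
import numpy as np

def haar(t):
    # Haar mother wavelet: 1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere
    return np.where((t >= 0) & (t < 0.5), 1.0,
                    np.where((t >= 0.5) & (t < 1.0), -1.0, 0.0))

def mwkf(t, t_prime, J=range(-3, 4), k_range=range(-8, 9)):
    # truncated version of Eq. (13) for scalar arguments t = u(x);
    # psi_{j,k}(t) psi_{j,k}(t') = 2^j psi(2^j t - k) psi(2^j t' - k)
    K = 0.0
    for j in J:
        for k in k_range:
            K += (2.0 ** j) * haar(2.0 ** j * t - k) * haar(2.0 ** j * t_prime - k)
    return float(K)
```

Being a finite sum of separable terms ψ_{j,k}(t) ψ_{j,k}(t′), the truncated kernel is symmetric and positive semidefinite by construction.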
When we apply this kernel function in SVR, the sum must be truncated, because the index j varies over the whole integer set; the index k varies over a finite interval because the adopted wavelet has compact support or decays quickly. Model (4) now becomes

    f(x) = b + Σ_{x_i∈SV} Σ_{j∈J,k} (α_i − α_i′) ψ_{j,k}(u(x_i)) ψ_{j,k}(u(x))    (14)

where J ⊂ Z and α_i, α_i′ are the optimized solution of QP (3). How can the mapping u be constructed? There are many different maps satisfying Eqs. (9) and (10). We give an example as follows. Define

    u(x) = Σ_{k=1}^N c_k x^(k)                                      (15)

where x = (x^(1), x^(2), …, x^(N))^T ∈ R^N. Denote

    L_k = min_{i≠j} |x_i^(k) − x_j^(k)|,  H_k = max_{i≠j} |x_i^(k) − x_j^(k)|    (16)

Then the coefficients c_k are produced by the following procedure:

    c_1 = 1
    c_k = 1 + Σ_{i=1}^{k−1} c_i H_i / L_k,  for 2 ≤ k ≤ N           (17)

It is easy to show that the mapping u is a one-to-one mapping.
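The construction of Eqs. (15)-(17) can be sketched directly; note that Eq. (17) assumes L_k > 0, i.e. distinct coordinate values among the training points (the toy data below is an illustrative assumption):

```python
import numpy as np

def build_u(X):
    # coefficients c_k of Eqs. (16)-(17); assumes distinct coordinate
    # values per dimension, otherwise L_k = 0 and Eq. (17) divides by zero
    N = X.shape[1]
    L = np.empty(N)
    H = np.empty(N)
    for k in range(N):
        col = X[:, k]
        diffs = np.abs(col[:, None] - col[None, :])
        off_diag = diffs[~np.eye(len(col), dtype=bool)]
        L[k] = off_diag.min()                      # Eq. (16)
        H[k] = off_diag.max()
    c = np.ones(N)
    for k in range(1, N):
        c[k] = 1.0 + np.dot(c[:k], H[:k]) / L[k]   # Eq. (17)
    return c

X = np.array([[1.0, 2.0], [3.0, 5.0], [4.0, 9.0]])
c = build_u(X)
u = X @ c                 # u(x_i) = sum_k c_k x_i^(k), Eq. (15)
print(c, u)               # [1. 2.] [ 5. 13. 22.]
```

On this data the u-values of the three training points are all distinct, as Eq. (10) requires.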
5 Experimental Results

In this section, we perform simulations to evaluate the performance of this method. We have compared our MWKF, the tensor product of 1-D wavelet kernels, and the RBF kernel on the same examples, using the same training parameters in the optimization problem (3). To assess the approximation results and the generalization ability, a figure of merit is defined as

    δ(f) = Σ_i (y_i − f(x_i))^2 / Σ_i (y_i − ȳ)^2

where {(x_i, y_i) | i = 1, 2, …, m} is the test set and ȳ = Σ_i y_i / m.
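The figure of merit is a direct computation; the helper below is a straightforward transcription of the definition above:

```python
import numpy as np

def merit(y_true, y_pred):
    # delta(f) = sum (y_i - f(x_i))^2 / sum (y_i - ybar)^2 on the test set
    y_true = np.asarray(y_true, dtype=float)
    resid = ((y_true - np.asarray(y_pred, dtype=float)) ** 2).sum()
    spread = ((y_true - y_true.mean()) ** 2).sum()
    return resid / spread

print(merit([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))   # 0.0 for a perfect fit
```

A value of 0 indicates a perfect fit, and 1 indicates a fit no better than the constant predictor ȳ.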
5.1 Approximation of a Single-Variable Function

Fig. 1 shows the results for the approximation of the function

    f(x) = sin x + (1/3) sin 3x − 2 sin(x/2),  x ∈ [0, 2π]          (18)

For comparison we show the results with our MWKF, with the tensor product of 1-D wavelet kernels, and with the RBF kernel. The solid lines represent the original function and the dashed lines show the approximations.

Fig. 1. Approximation results for the one-dimensional function: (a) with the MWKF; (b) with the RBF kernel; (c) with the wavelet kernel

The figure of merit for each approximation is computed on a uniformly sampled test set of 63 points. Approximation results and parameters are presented in Table 1. The MWKF gives the best approximation in this comparison.

Table 1. Approximation results and parameters for the one-dimensional function

Kernel          Parameters    δ(f)
RBF             σ = 1         0.038
Wavelet Kernel  a = 1         0.168
MWKF            J = [−1, 1]   0.036
5.2 Approximation of a Two-Variable Function

Table 2 shows the approximation results for the function

    f(x) = (x_1^2 − x_2^2) sin(x_1 / 2)

over the domain D = [−10, 10] × [−10, 10]. The figure of merit for each approximation is computed on a uniformly sampled test set of 441 points. From the comparison we can see that the MWKF gives the best approximation. Fig. 2 illustrates the original function over D and its approximation by the MWKF.

Table 2. Approximation results and parameters for the two-dimensional function

Kernel          Parameters                     δ(f)
RBF             σ = 1                          0.0005
Wavelet Kernel  a = 3                          0.000197
MWKF            J = {−5}, u(x) = 30x_1 + x_2   0.000055

Fig. 2. Original function (a) and its approximation by the MWKF (b)
6 Conclusion

In this paper a new wavelet kernel has been introduced for support vector regression, inspired by both support vector regression and one-dimensional wavelet multiresolution analysis. The differences between this method and traditional SVM are the dimensionality reduction of the input space S, the explicit expression of the nonlinear mapping φ(x), and the absence of parameter selection in the kernel function. The dimension-reduction process is simple and fast, avoiding the high complexity of multidimensional wavelets or tensor products of wavelets. The wavelet multiresolution representation guarantees that the quadratic program of support vector regression has a feasible solution.
Acknowledgement

This work is supported by the Chongqing Science and Technology Project (030826).
References

1. Vapnik, V. N.: The Nature of Statistical Learning Theory. Springer-Verlag, Berlin Heidelberg New York (1995)
2. Mangasarian, O. L., Musicant, D. R.: Robust Linear and Support Vector Regression. IEEE Trans. on Pattern Analysis and Machine Intelligence 22(9) (2000) 950–955
3. Shevade, S.K., Keerthi, S.S., Bhattacharyya, C., Murthy, K. R. K.: Improvements to the SMO Algorithm for SVM Regression. IEEE Trans. on Neural Networks 11(5) (2000) 1188–1193
4. Brailovsky, V. L.: On Global, Local, Mixed and Neighborhood Kernels for Support Vector Machines. Pattern Recognition Letters 20(11-13) (1999) 1183–1190
5. Chan, W. C., Chan, C. W., Cheung, K. C., Harris, C. J.: On the Modelling of Nonlinear Dynamic Systems Using Support Vector Neural Networks. Engineering Applications of Artificial Intelligence 14 (2001) 105–113
6. Guo, G. D.: Support Vector Machines for Face Recognition. Image and Vision Computing 19(9-10) (2001) 631–638
7. Jonsson, K.: Support Vector Machines for Face Authentication. Image and Vision Computing 20(5-6) (2002) 369–375
8. Daubechies, I.: Ten Lectures on Wavelets. CBMS-NSF Regional Conf. Series in Appl. Math., Vol. 61. Society for Industrial and Applied Mathematics, Philadelphia (1992)
9. Zhang, L., Zhou, W. D., Jiao, L. C.: Wavelet Kernel Function Network. Journal of Infrared and Millimeter Waves 20(3) (2001) 223–227
10. Li, Y. C., Fang, T. J.: Wavelet Support Vector Machines. Pattern Recognition & Artificial Intelligence 17(2) (2004) 167–172
Multi-scale Support Vector Machine for Regression Estimation

Zhen Yang, Jun Guo, Weiran Xu, Xiangfei Nie, Jian Wang, and Jianjun Lei

PRIS Lab, School of Information Engineering, Beijing University of Posts and Telecommunications, Beijing, 100876, China
[email protected]
Abstract. Recently, SVMs have been widely applied to regression estimation, but the existing algorithms leave the choice of kernel type and kernel parameters to the user. This is the main reason for regression performance degradation, especially for complicated data, even nonlinear and non-stationary data. By introducing the empirical mode decomposition (EMD) method, with which any complicated data set can be decomposed into a finite and often small number of 'intrinsic mode functions' (IMFs) based on the local characteristic time scale of the data, this paper proposes an important extension to the SVM method: a multi-scale support vector machine based on EMD, in which several kernels of different scales can be used simultaneously to approximate the target function at different scales. Experimental results demonstrate the effectiveness of the proposed method.
1 Introduction

The support vector machine (SVM) is a universal learning method proposed by Vapnik et al. [1], developed from the idea of structural risk minimization (SRM) [1][3][6]. Recently, SVMs have been widely applied to regression estimation and pattern recognition [2][4][7][9] due to their superior performance compared with traditional classifiers [1][8]. SVMs use a kernel operator to map the data in the input space to a high-dimensional feature space in which the problem becomes linearly separable [1][3][6]. There exist many kinds of kernels that satisfy Mercer's condition, such as the polynomial, Gaussian, and neural network (NN) kernels [1][6]. Once the kernel and the user-chosen parameters (e.g. the error penalty) are fixed, the SVM can be trained automatically according to the chosen algorithm [1]. It remains to specify how the kernel and its parameters are chosen. This is very important, especially for regression estimation: various real-valued function estimation problems need various sets of approximating functions, so it is important to construct special kernels that reflect special properties of the approximating functions. Unfortunately, the data sets existing in the real world are usually complicated, even very short, non-stationary, and highly nonlinear [5]. An SVM with a single kernel cannot estimate such complicated data sets very well [2][4][7][13][14], because the data have multi-scale components. In order to improve regression estimation performance, on
one hand, hybrid kernels have been used: [4] uses two kernels ('wide' and 'narrow') at different scales to estimate the function, and [2] presents a mechanism to train SVM with a hybrid kernel and minimal Vapnik-Chervonenkis (VC) dimension. On the other hand, more complicated kernels have been used to obtain better performance: [7] constructs a wavelet support vector machine using a wavelet kernel, as the wavelet technique shows superior performance for both non-stationary signal approximation and classification. But the hybrid kernel and the wavelet kernel introduce additional complexity of kernel and parameter selection into the SVM, and estimating the detail components with a more complicated or 'narrowed' kernel carries a high generalization risk.

In this paper, based on a new signal processing technique, 'empirical mode decomposition (EMD)', a systematic kernel construction method for SVM regression estimation is proposed. By using EMD, any nonlinear and non-stationary complicated data set can be decomposed into a finite and often small number of 'intrinsic mode functions' (IMFs) based on the local characteristic time scale of the data, which show the local properties of the signal; this decomposition method is adaptive and highly efficient [5]. Exploiting the superior properties of IMFs, this paper gives an important extension to the SVM method: a multi-scale SVM (MSVM) based on EMD, in which several kernels of different scales can be used simultaneously to approximate the target function at different scales. Differing from earlier training mechanisms [1][2][4][7], we estimate each IMF of a given data set with a different kernel, instead of estimating the data set with a single complex kernel or hybrid kernels. Because the IMFs show excellent locality and the local characteristic time scale of the data set, our method can estimate the target function at multiple scales, and can therefore achieve higher estimation accuracy. Experimental results demonstrate the effectiveness of the proposed method.
2 Multi-scale SVM Function Regression Based on EMD

2.1 Multi-scale Signal Analysis

Differently from the above methods, we reduce the complexity using data decomposition. Multi-scale decomposition is a typical way to process complicated data sets; for example, [14] uses wavelet decomposition to process complicated and non-stationary data. If the target function can be decomposed, as well as reconstructed, at different scales, we can use kernel sets with different scales to approximate the target function at multiple scales, and then a better performance can be expected. Suppose y ∈ R can be decomposed with the following formula:

    y = f(x) = Σ_{i=1}^n C_i                                        (1)
Then we can use different kernels to approximate the components C_i at different scales, instead of using a single complex kernel or hybrid kernels to estimate the function y itself. SVM regression estimation introduces a new type of loss function, the ε-insensitive loss function, which makes the function estimate not only
1032
Z. Yang et al.
robust but also sparse. The sparsity of the solution is very important for estimating dependencies in high-dimensional space using a large number of data. Formula (1) can then be rewritten as:

y = f(x) = \sum_{i=1}^{n} C_i = \sum_{m=1}^{n} \left\{ \sum_{i=1}^{l} (\alpha_i^{m*} - \alpha_i^{m}) K_m(x, x_i) + b_m \right\}     (2)

= \left\{ \sum_{i=1}^{l} (\alpha_i^{1*} - \alpha_i^{1}) K_1(x, x_i) + b_1 \right\} + \left\{ \sum_{i=1}^{l} (\alpha_i^{2*} - \alpha_i^{2}) K_2(x, x_i) + b_2 \right\} + \cdots + \left\{ \sum_{i=1}^{l} (\alpha_i^{n*} - \alpha_i^{n}) K_n(x, x_i) + b_n \right\}
By using Lagrange multiplier techniques [1][6], we then maximize

W_m = -\varepsilon_m \sum_{i=1}^{l} (\alpha_i^{m*} + \alpha_i^{m}) + \sum_{i=1}^{l} y_i (\alpha_i^{m*} - \alpha_i^{m}) - \frac{1}{2} \sum_{i,j=1}^{l} (\alpha_i^{m*} - \alpha_i^{m})(\alpha_j^{m*} - \alpha_j^{m}) K_m(x_i, x_j)     (3)
for m = 1, ..., n, subject to the constraint

\sum_{i=1}^{l} \alpha_i^{m*} = \sum_{i=1}^{l} \alpha_i^{m}     (4)
and to the constraints

0 \le \alpha_i^{m} \le e_m, \quad 0 \le \alpha_i^{m*} \le e_m, \quad i = 1, ..., l     (5)
where K(·,·) is a given function satisfying Mercer's conditions. The key problems in this method are whether and how the target function can be decomposed and reconstructed at different scales, and how different kernels can be used to approximate the components C_i at those scales. An unsuitable decomposition would introduce additional complexity and degrade performance, and could even render the decomposition meaningless [2][4].

2.2 EMD and IMFs
The Hilbert-Huang transform (HHT) was proposed by Norden E. Huang et al. [5]. Its key part is the 'empirical mode decomposition (EMD)' method, with which any complicated data set can be decomposed into a finite and often small number of 'intrinsic mode functions' (IMFs) that admit well-behaved Hilbert transforms. Through a 'sifting process' [5], EMD decomposes a time series X(t) as

X(t) = \sum_{i=1}^{n} C_i + r_n     (6)
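To illustrate the identity in Eq. (6), the sketch below splits a signal into one oscillatory component and a local trend. It is not the actual sifting process of [5], which uses cubic-spline envelopes of the local extrema; the moving-average "trend" here is a crude stand-in, used only to show that the decomposition reconstructs the signal exactly by construction:

```python
import math

def moving_average(x, w):
    # Crude local-mean estimate: a stand-in for the spline mean of the
    # upper and lower envelopes used in real EMD sifting
    n = len(x)
    out = []
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out

def one_mode(x, w=10):
    # Split the signal into one oscillatory component and a local trend:
    # C1 = x - trend, r1 = trend, so C1 + r1 reproduces x exactly
    trend = moving_average(x, w)
    imf = [a - b for a, b in zip(x, trend)]
    return imf, trend

signal = [math.sin(2 * math.pi * 0.2 * t) + 0.05 * t for t in range(100)]
c1, r1 = one_mode(signal)
```

Repeating the same split on the residue would yield further components, mimicking the structure (though not the quality) of the EMD expansion.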
Though the orthogonality of the EMD decomposition is satisfied in a practical sense, it is not guaranteed theoretically [5], and there has been a long-standing argument about the
Multi-scale Support Vector Machine for Regression Estimation
1033
HHT: it has been pointed out repeatedly that the orthogonality among different IMFs is not guaranteed, and that direct application of the HHT to real data analysis without resolving this issue can lead to spurious features in HHT spectra [14]. The HHT may thus still be far from spectral-analysis applications, but, as described in [5], the completeness of EMD is given, since equation (6) is an identity. Therefore we can estimate each IMF of a given data set with a different kernel, instead of estimating the whole data set with a single complex kernel or hybrid kernels. An example decomposition by EMD and by wavelet is given in Fig. 1. The time series X(t) = 3sin(2π·0.2t) + 2sin(2π·0.4t) + cos(2π·0.8t) is shown in Fig. 1(a). Fig. 1(b) shows the resulting empirical mode decomposition components, and Fig. 1(c) shows the reconstruction using the 'db2' wavelet. By EMD, IMF1-IMF3 separate the three frequency components (0.2, 0.4, 0.8) very well based on the time scale of the data, which the components reconstructed by the 'db2' wavelet cannot achieve.
Fig. 1. The Example decomposition by EMD and wavelet: (a) the original data; (b) the components C1-C9 decomposed by EMD; (c) the reconstructed components C1-C9 by ‘db2’ wavelet
2.3 Multi-scale SVM Regression Framework Based on EMD
Based on the EMD decomposition, formula (1) can be rewritten as:

y = f(x) = \sum_{i=1}^{n} C_i + r_n     (7)
Thus, we achieve a decomposition of the data into n empirical modes and a residue r_n. For the EMD method, completeness and orthogonality are approximately satisfied (in most cases encountered, the leakage is small). We can therefore optimize formula (3) subject to constraints (4) and (5) independently for each scale m. Importantly, this avoids much complex computation (e.g., the interaction of different components) [2][4][12].
Fig. 2. Multi-scale SVM Regression Model
The structure of the multi-scale SVM regression model is shown in Fig. 2. In this model, each IMF model is obtained by an SVM and runs autonomously with its own kernel parameters. Because each IMF involves only one mode of oscillation, we can choose the kernel and parameters for each IMF according to its scale. With the original methods, complex or hybrid kernels had to be used to estimate some partial detail of the target function accurately, which risks overfitting the remaining parts. Multi-scale SVM avoids overfitting by choosing a suitable kernel and parameters for each scale, which embodies the structural risk minimization (SRM) principle [1]. In practice, for simplicity, the original data are decomposed only into the first several IMFs plus the residue obtained by subtracting those IMFs from the original data.
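The multi-scale idea can be sketched numerically. The code below is not the authors' MATLAB implementation: it uses a simple Nadaraya-Watson kernel smoother as a stand-in for the per-component SVR, and hand-built fast/slow components as stand-ins for the EMD output. Each component is fitted with a Gaussian kernel width matched to its scale, and the fitted models are summed:

```python
import math

def kernel_smooth(xs, ys, sigma):
    # Nadaraya-Watson smoother with a Gaussian kernel; a stand-in here for
    # training one SVR per component with an RBF kernel of width sigma
    def f(x):
        w = [math.exp(-(x - xi) ** 2 / sigma ** 2) for xi in xs]
        return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    return f

n = 200
xs = [i / (n - 1) for i in range(n)]
detail = [0.3 * math.sin(2 * math.pi * 20 * x) for x in xs]  # fast component ('IMF1')
trend = [math.exp(-(x - 0.5) ** 2 / 0.1) for x in xs]        # slow component (residue)
ys = [d + t for d, t in zip(detail, trend)]

# Narrow kernel for the detail, wide kernel for the trend, then sum the models
f_detail = kernel_smooth(xs, detail, sigma=0.004)
f_trend = kernel_smooth(xs, trend, sigma=0.05)
mse = sum((y - (f_detail(x) + f_trend(x))) ** 2 for x, y in zip(xs, ys)) / n
```

A single kernel width would have to compromise between the two scales; fitting each component at its own scale keeps the combined error small.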
3 Experimental Results

To evaluate the performance of the multi-scale SVM (MSVM), we first perform regression on artificial data sets and compare the results; we then validate the performance on practical data sets. For comparison, all experiments in this paper use the RBF kernel for SVM training. Since the SVM itself cannot optimize the kernel parameters, and the ways to choose an appropriate kernel function for a specific application are far from fully understood, kernel selection in practice depends on the user's experience and understanding of the data, or on trial and error. In this work, the RBF kernel parameters are selected by the widely used cross-validation [8][10][11]. We adopt the mean square error (MSE) as the approximation error. The experiments are implemented in MATLAB with a MATLAB support vector machine toolbox.1 Furthermore, for simplicity, our MSVM model uses only the first IMF (IMF1) and the residue obtained by subtracting IMF1 from the original data. The kernel parameters (σ, ε) and the numbers of support vectors
1 http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox/
(SVs) of MSVM and SVM are compared, and the parameters of the two SVM models within MSVM (the model for IMF1 and the model for the residue) are also given.

3.1 Regression Estimation of Artificial Data Sets
In this experiment, we approximate the following 'Two Bumps' function:

f(x) = 0.8 \exp\left(-\frac{(x - 0.2)^2}{0.0025}\right) + \exp\left(-\frac{(x - 0.6)^2}{0.25}\right)     (8)
Table 1. Regression estimation of the artificial data set

Kernel           σ      ε      SVs   MSE
RBF              0.03   0.1    28    0.0070
MSVM (IMF1)      0.06   0.08   17
MSVM (residue)   0.9    0.1    6
MSVM (total)                   23    0.0023
Fig. 3. (a) The original data; (b) The component IMF1; (c) The component residual; (d) The original (solid line) and reconstructed IMF1 (dashed line); (e) The original (solid line) and reconstructed residual (dashed line); (f) The original (solid line) and reconstructed data (dashed line) by SVM; (g) The original (solid line) and reconstructed data (dashed line) by multi-scale SVM
We uniformly sampled 300 points from [0, 1]. The results are shown in Table 1 and Fig. 3. Multi-scale SVM achieves a more compact representation with fewer support vectors and higher estimation accuracy, while the RBF SVM performs worse due to the mismatch between its single kernel and the multiple scales of the data.

3.2 Regression Estimation of High-Dimensional Real-World Data Sets
Essentially, the above work focuses on one-dimensional data sets, but since the SVM formulation is 'blind' to the dimensionality of the input signal, it extends easily to high-dimensional data sets [1][2][3]. Moreover, many statistical methods that perform well on artificial data have been found to perform badly on natural data. In this section, we therefore apply multi-scale SVM to high-dimensional natural data. The real-world data sets come from UCI2, Statlib3, and the Delve Datasets4. The dimension (Dim.) and size of each data set are given in Table 2; 80% of the examples are randomly taken for training and the rest for testing. As shown in Table 2, multi-scale SVM achieves better accuracy than SVM, though more support vectors are needed.

Table 2. Regression estimation of high-dimensional real-world data sets
Data        Dim.   Size   Kernel           σ       ε      SVs    MSE
housing     13     506    RBF              0.1     24     395    30.0881
                          MSVM (IMF1)      0.1     24     394
                          MSVM (residue)   0.1     30     397
                          MSVM (total)                    791    24.9368
mpg         7      392    RBF              0.1     70     302    14.3026
                          MSVM (IMF1)      0.1     72     309
                          MSVM (residue)   1       78     255
                          MSVM (total)                    564    12.3591
pyrim       27     74     RBF              0.001   1      59     0.0031
                          MSVM (IMF1)      0.001   1      54
                          MSVM (residue)   0.1     1.05   59
                          MSVM (total)                    113    0.0028
triazines   60     186    RBF              0.01    0.9    144    0.0282
                          MSVM (IMF1)      0.01    0.9    109
                          MSVM (residue)   0.01    1.4    137
                          MSVM (total)                    246    0.0268
4 Discussions and Conclusion

This paper gives an important extension to the SVM method: the multi-scale SVM based on EMD, in which several kernels of different scales are used simultaneously to approximate the target function. Multi-scale SVM can thus achieve a more compact representation and higher estimation accuracy. However, its performance varies from application to application because of the instability of the EMD decomposition, and it adds complexity to the model and parameter selection associated with the SVM. Future work will focus on a stable EMD decomposition method and on more effective parameter selection.
Acknowledgment

This work was supported by the National Natural Science Foundation of China, Beijing, China (Grant No. 60475007), the Key Project of the Foundation of the Ministry of Education of China (Grant No. 02029), and the Cross-Century Talents Foundation.

2 http://www.ics.uci.edu/~mlearn/MLRepository.html
3 http://lib.stat.cmu.edu/datasets/
4 http://www.cs.toronto.edu/~delve/data/datasets.htm
References

1. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1999)
2. Tan, Y., Wang, J.: A Support Vector Machine with a Hybrid Kernel. IEEE Transactions on Knowledge and Data Engineering 16 (2004) 385–395
3. Schoelkopf, B., Smola, A., Mueller, K.R.: Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation 10 (1998) 1299–1319
4. Shao, X., Cherkassky, V.: Multi-Resolution Support Vector Machine. International Joint Conference on Neural Networks, Vol. 2 (1999) 1065–1070
5. Huang, N.E., Shen, Z., Long, S.R., et al.: The Empirical Mode Decomposition and the Hilbert Spectrum for Nonlinear and Non-Stationary Time Series Analysis. Proc. Royal Society of London A (1998) 903–995
6. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery (1998) 121–167
7. Zhang, L., Zhou, W., Jiao, L.: Wavelet Support Vector Machine. IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics 34 (2004) 34–39
8. Poggio, T., Rifkin, R., Mukherjee, S., Niyogi, P.: General Conditions for Predictivity in Learning Theory. Nature 428 (2004) 419–422
9. Chen, S., Hong, X., Harris, C.J.: Sparse Kernel Density Construction Using Orthogonal Forward Regression with Leave-One-Out Test Score and Local Regularization. IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics 34 (2004) 1708–1717
10. Joachims, T.: Estimating the Generalization Performance of a SVM Efficiently. Proc. 17th Int. Conf. Machine Learning, San Francisco (2000)
11. Kearns, M., Ron, D.: Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross Validation. Proc. Tenth Conf. Computational Learning Theory (1997) 152–162
12. Zheng, C., Jiao, L.: Automatic Parameters Selection for SVM Based on GA. Proceedings of the Fifth World Congress on Intelligent Control and Automation (2004) 1869–1872
13. Liao, S.P., Lin, H.T., Lin, C.J.: A Note on the Decomposition Methods for Support Vector Regression. Neural Networks (2001) 1474–1479
14. Li, J., Zhang, B., Lin, F.: Nonlinear Speech Model Based on Support Vector Machine and Wavelet Transform. Tools with Artificial Intelligence (2003) 259–263
15. Chen, C.H., Li, C.P., Teng, T.L.: Surface-Wave Dispersion Measurements Using Hilbert-Huang Transform. Terrestrial, Atmospheric and Oceanic Sciences (TAO) 13 (2002) 171–184
Gradient Based Fuzzy C-Means Algorithm with a Mercer Kernel

Dong-Chul Park1, Chung Nguyen Tran1, and Sancho Park2

1 Dept. of Information Engineering, Myong Ji University, Korea
{parkd, tnchung}@mju.ac.kr
2 Davan Tech Co., Seongnam, Korea
[email protected]
Abstract. In this paper, a clustering algorithm based on Gradient Based Fuzzy C-Means with a Mercer Kernel, called GBFCM (MK), is proposed. The kernel method adopted in this paper implicitly performs nonlinear mapping of the input data into a high-dimensional feature space. The proposed GBFCM(MK) algorithm is capable of dealing with nonlinear separation boundaries among clusters. Experiments on a synthetic data set and several real MPEG data sets show that the proposed algorithm gives better classification accuracies than both the conventional k-means algorithm and the GBFCM.
1 Introduction
In fuzzy clustering algorithms, an object can belong to several classes simultaneously, with different degrees of certainty measured by the membership function [1]. Bezdek proposed the fuzzy c-means (FCM) algorithm, a powerful tool for clustering applications. To improve the convergence speed and computational complexity of FCM, Park and Dagher [2] introduced the Gradient Based Fuzzy C-Means (GBFCM) algorithm, which combines FCM with characteristics of Kohonen's Self-Organizing Map [3]. Although the GBFCM has proven very efficient in clustering problems [2, 4], it often fails when the separation boundaries among clusters are nonlinear. An alternative approach is to transform the input data into a higher-dimensional feature space and perform the clustering there: complex nonlinear classification boundaries in the original input space are more likely to become linearly treatable in the expanded feature space [5]. Kernel methods have been proposed to deal with the possibly infinite dimension of the feature space: they implicitly perform the nonlinear mapping of the input data into a higher-dimensional feature space [6, 7] using a kernel function. In this paper, we propose a new algorithm that combines the GBFCM and kernel methods to form a robust clustering algorithm.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1038–1043, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Kernel Methods
In this section, we present the basic concept of the positive definite kernel (or Mercer kernel) and its mathematical foundation. To understand how kernel methods work, first consider a simple example. Let Φ: (x1, x2) → (x1², x2², √2 x1x2), where Φ is a nonlinear mapping function. The inner product of two vectors x and y in the feature space R³ can be calculated as

Φ(x)·Φ(y) = (x1², x2², √2 x1x2)·(y1², y2², √2 y1y2) = (x·y)² = K(x, y)

where (x·y) denotes the dot product of x and y in the input space R². Thus, to compute the dot product in the feature space, we can use the kernel function without explicitly using the mapping function Φ. Here K(x, y) = (x·y)², called the polynomial kernel function. A natural question at this point is what kind of function K(x, y) admits this representation as a dot product in feature space. To answer it, we consider the Mercer kernel [6].

Definition. Let X be a nonempty set. A function K: X × X → R is called a Mercer kernel function iff

\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j K(x_i, x_j) \ge 0     (1)

for all n ∈ N, x_1, ..., x_n ∈ X and c_1, ..., c_n ∈ R.

Mercer theorem. Let K be a symmetric function such that for all x, y ∈ X, X ⊆ R,

K(x, y) = Φ(x)·Φ(y)     (2)

where Φ: X → F is a nonlinear mapping function into the feature space F. The function K can be represented in terms of Eq. (2) iff K is a Mercer kernel. In this paper, the following Gaussian kernel function is used to substitute for the inner product:

K(x, y) = e^{-\|x - y\|^2 / \sigma^2}, where σ > 0
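The identity Φ(x)·Φ(y) = (x·y)² from the example above can be verified numerically; a minimal sketch (the test vectors are arbitrary):

```python
import math

def phi(x1, x2):
    # The explicit feature map of the example: (x1^2, x2^2, sqrt(2) x1 x2)
    return (x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2)

def poly_kernel(x, y):
    # K(x, y) = (x . y)^2, evaluated directly in the input space R^2
    dot = x[0] * y[0] + x[1] * y[1]
    return dot * dot

x, y = (1.0, 2.0), (3.0, -1.0)
feature_dot = sum(a * b for a, b in zip(phi(*x), phi(*y)))
```

The kernel evaluation costs one 2-D dot product, while the explicit map works in three dimensions; for higher-degree polynomials the saving grows quickly.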
3 Gradient Based Fuzzy C-Means with a Mercer Kernel

3.1 Updating Cluster Prototypes
The objective function to be minimized in the FCM or the GBFCM is:

J_i = \sum_{k=1}^{c} \mu_{ki}^2 \|v_k - x_i\|^2     (3)
with the constraint

\sum_{k=1}^{c} \mu_{ki} = 1     (4)
The objective function with a kernel can be rewritten in feature space with the mapping function Φ:

J_i^\Phi = \sum_{k=1}^{c} \mu_{ki}^2 \|\Phi(v_k) - \Phi(x_i)\|^2     (5)
Through the kernel substitution in Eq. (2), we obtain

\|\Phi(v_k) - \Phi(x_i)\|^2 = (\Phi(v_k) - \Phi(x_i))(\Phi(v_k) - \Phi(x_i))^T = K(v_k, v_k) + K(x_i, x_i) - 2K(v_k, x_i)     (6)

For the Gaussian kernel function we have K(v_k, v_k) = 1 and K(x_i, x_i) = 1, so the objective function becomes:

J_i^\Phi = 2 \sum_{k=1}^{c} \mu_{ki}^2 (1 - K(v_k, x_i))     (7)
In order to minimize the objective function with a kernel, we use the steepest gradient descent algorithm. The learning rule can be summarized as:

\Delta v_k = \eta \frac{\partial J_i^\Phi}{\partial v_k}     (8)
In the case of the Gaussian kernel function, the objective function can be rewritten as:

J_i^\Phi = 2 \sum_{k=1}^{c} \mu_{ki}^2 \left(1 - e^{-\|v_k - x_i\|^2 / \sigma^2}\right)     (9)
Substituting Eq. (9) into Eq. (8), we obtain:

\Delta v_k = 4\eta \mu_{ki}^2 \frac{e^{-\|v_k - x_i\|^2 / \sigma^2}}{\sigma^2} (v_k - x_i) = 4\eta \mu_{ki}^2 \sigma^{-2} K(v_k, x_i)(v_k - x_i)

Finally, the cluster prototypes are updated with a learning gain η as follows:

v_{k+1} = v_k - \eta \mu_{ki}^2 \sigma^{-2} K(v_k, x_i)(v_k - x_i)     (10)
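A one-step sketch of the update rule (10); the sample, prototype, and parameter values are arbitrary illustrations:

```python
import math

def update_prototype(v, x, mu, eta, sigma):
    # Eq. (10): v <- v - eta * mu^2 * sigma^-2 * K(v, x) * (v - x)
    d2 = sum((a - b) ** 2 for a, b in zip(v, x))
    k = math.exp(-d2 / sigma ** 2)          # Gaussian kernel K(v, x)
    step = eta * mu * mu * k / sigma ** 2
    return tuple(a - step * (a - b) for a, b in zip(v, x))

v0, x0 = (0.0, 0.0), (1.0, 2.0)
v1 = update_prototype(v0, x0, mu=0.8, eta=0.5, sigma=2.0)
```

Note how the kernel factor K(v, x) shrinks the step for samples far from the prototype, which damps the influence of outliers compared with the plain GBFCM update.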
3.2 Evaluating Membership Grades
The constrained optimization in Eq. (4) can be solved by using a Lagrange multiplier:

F_m = 2 \sum_{k=1}^{c} \mu_{ki}^2 (1 - K(v_k, x_i)) - \lambda \left( \sum_{k=1}^{c} \mu_{ki} - 1 \right)     (11)

Taking the first derivative of F_m with respect to μ_ki and setting the result to zero, we have

\mu_{ki} = \frac{\lambda}{4(1 - K(v_k, x_i))}     (12)

Given the constraint in Eq. (4), we have

\lambda = \frac{1}{\sum_{j=1}^{c} \frac{1}{4(1 - K(v_j, x_i))}}     (13)

Substituting Eq. (13) into Eq. (12), the membership grades are defined as:

\mu_{ki} = \frac{1}{\sum_{j=1}^{c} \frac{1 - K(v_k, x_i)}{1 - K(v_j, x_i)}}     (14)
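Eq. (14) for one sample can be sketched as follows (the kernel values are made-up illustrations, and the small guard against exact K = 1 is an added assumption):

```python
def memberships(k_row, eps=1e-12):
    # Membership grades of sample x_i for all c prototypes, Eq. (14):
    # mu_ki = 1 / sum_j [(1 - K(v_k, x_i)) / (1 - K(v_j, x_i))]
    dissim = [max(1.0 - k, eps) for k in k_row]  # 1 - K(v_k, x_i), guarded
    return [1.0 / sum(dk / dj for dj in dissim) for dk in dissim]

# Hypothetical kernel values K(v_k, x_i) for c = 3 prototypes
mu = memberships([0.9, 0.5, 0.1])
```

The grades sum to one by construction, and the prototype with the largest kernel value (smallest feature-space distance) receives the largest grade.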
4 Experiments and Results
A synthetic Ring data set is first used to evaluate the performance of the proposed GBFCM(MK). This is a 2-D data set that consists of two nonlinearly separable clusters with 100 data points each as shown in Fig. 1(a). Clustering is then performed by the conventional k-means [8], the GBFCM, and the proposed GBFCM(MK). In order to obtain fair comparison results, all the algorithms are initialized with the same cluster prototypes and membership grades.
Fig. 1. Ring data set and classification results (a)Ring data set (b)k-means result (c)GBFCM result (d)GBFCM(MK) result
Table 1. Video traces used for experiments

MOVIE                       SPORTS
Jurassic Park               ATP Tennis Final
The Silence of the Lambs    Formula 1 Race
Star Wars                   Super Bowl Final 1995
Terminator 2                Two 1993 Soccer World Cup Matches
A 1994 Movie Preview        Two 1993 Soccer World Cup Matches
Fig. 2. Accuracy vs. number of code vectors
Experiments are performed with 5 random trials. An overall classification accuracy of 94.5% is achieved by the proposed algorithm, while the conventional k-means and the GBFCM yield 65.6% and 64.5%, respectively. Fig. 1(b), 1(c), and 1(d) show the classification results of the conventional k-means, the GBFCM, and the GBFCM(MK), respectively. In the next experiments, we investigate and evaluate the proposed GBFCM(MK) algorithm on the MPEG data classification problem. To model and classify the MPEG video data, we treat them as Gaussian probability density function (GPDF) data [9]. The classification method is similar to the approach in [10] and involves two steps, a modeling step and a classification step. In the modeling step, we use the proposed GBFCM(MK) to model the probability distribution of the log-value of the frame size. In the classification step, a Bayesian classifier decides the genre, 'movie' or 'sport', to which a video sequence belongs. Table 1 shows the 10 MPEG video sequences used in our experiments. These traces are from http://www3.informatik.uni-wuerzburg.de/MPEG. Each trace consists of 40,000 frames, which result in 3,333 GOPs; each GOP is the 12-frame sequence IBBPBBPBBPBB. From each sequence, we used the first 24,000 frames (2,000 GOPs) for training and the remaining frames for testing. More detailed information on data preparation for these experiments can be found in [4].
To evaluate the performance, the classification accuracy is measured as the number of correctly classified GOPs over the total number of GOPs. Fig. 2 shows the classification accuracy of the conventional k-means, the GBFCM, and the proposed GBFCM(MK) using 3 to 10 code vectors. An overall classification accuracy of 88.12% is achieved by the proposed GBFCM(MK), while the conventional k-means and the GBFCM yield 83.49% and 84.33%, respectively. That is, the proposed GBFCM(MK) improves the accuracy by 4.63% over the conventional k-means and by 3.79% over the GBFCM.
5 Conclusions

In this paper, a new clustering algorithm, Gradient Based Fuzzy C-Means with a Mercer kernel, is proposed. It incorporates kernel methods into the GBFCM to deal with nonlinear separation boundaries among clusters. The effectiveness of the proposed method is supported by experiments on both a synthetic data set and real MPEG data sets: improvements of 4.63% and 3.79% over the conventional k-means and the GBFCM, respectively, are achieved. These encouraging results motivate further investigation as well as application of GBFCM(MK) to other practical problems.
References

1. Bezdek, J.C.: A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithms. IEEE Trans. on Pattern Anal. Mach. Int. 2(1) (1980) 1-8
2. Park, D.C., Dagher, I.: Gradient Based Fuzzy c-means (GBFCM) Algorithm. IEEE Int. Conf. on Neural Networks, ICNN-94, Vol. 3 (1994) 1626-1631
3. Kohonen, T.: The Self-Organizing Map. Proc. IEEE, Vol. 78 (1990) 1464-1480
4. Park, D.C.: Classification of MPEG VBR Video Data Using Gradient-Based FCM with Divergence Measure. In: Wang, L., Jin, Y. (eds.): FSKD 2005, LNAI Vol. 3613. Springer-Verlag, Berlin Heidelberg New York (2005) 475-483
5. Cover, T.M.: Geometrical and Statistical Properties of Systems of Linear Inequalities in Pattern Recognition. IEEE Trans. Electron. Comput., Vol. EC-14 (1965) 326-334
6. Girolami, M.: Mercer Kernel-Based Clustering in Feature Space. IEEE Trans. on Neural Networks 13(3) (2002) 780-784
7. Chen, J.H., Chen, C.S.: Fuzzy Kernel Perceptron. IEEE Trans. on Neural Networks 13(6) (2002) 1364-1373
8. Hartigan, J.: Clustering Algorithms. Wiley, New York (1975)
9. Rose, O.: Statistical Properties of MPEG Video Traffic and Their Impact on Traffic Modeling in ATM Systems. IEEE Conf. on Local Computer Networks (1995) 397-406
10. Liang, Q., Mendel, J.M.: MPEG VBR Video Traffic Modeling and Classification Using Fuzzy Technique. IEEE Trans. on Fuzzy Systems 9(1) (2001) 183-193
An Efficient Similarity-Based Validity Index for Kernel Clustering Algorithm

Yun-Wei Pu1,2,3, Ming Zhu1,2, Wei-Dong Jin1, and Lai-Zhao Hu2

1 School of Information Science and Tech., Southwest Jiaotong University, Chengdu 610031, Sichuan, China
[email protected]
2 National EW Laboratory, Chengdu 610036, Sichuan, China
3 Computer Center, Kunming University of Science & Technology, Kunming 650093, Yunnan, China
Abstract. The quality of a clustering, including one obtained by kernel-based methods, should be assessed. In this paper, by investigating the inherent pairwise similarities in the kernel matrix implicitly defined by the kernel function, we define two statistical similarity coefficients that describe the within-cluster and between-cluster similarities between data items, respectively. An efficient cluster validity index and a self-adaptive kernel clustering (SAKC) algorithm are then proposed based on these two similarity coefficients. The performance and effectiveness of the proposed validity index and the SAKC algorithm are demonstrated, in comparison with some existing methods, on two synthetic datasets and four UCI real databases. The robustness of the new index with respect to the Gaussian kernel width is also explored tentatively.
1 Introduction

Clustering is a widely used unsupervised learning method. Generally, clustering algorithms partition the data items into different clusters according to a certain similarity measure and a reasonable cluster criterion [1]. A significant advance in clustering analysis is the incorporation of the kernel trick into classical algorithms. By performing a nonlinear mapping, kernel-based methods transform a linearly inseparable dataset into a high-dimensional feature space in which the feature differences of the data items become more obvious, so the partitioning is easier [2,3]. However, nothing is known a priori about the internal structure of the dataset, so any clustering result should be assessed. For the classical K-means and FCM algorithms in the original space, many validity indices have been proposed [4,5], but these indices cannot be applied directly to kernel-based clustering algorithms. In [3], a simple means of estimating the optimal number of clusters by eigenvalue decomposition is introduced. It has two disadvantages, however: firstly, the method is only effective on well-clustered data; secondly, the distribution of the eigenvalues of the kernel matrix is sensitive to the kernel-function parameter, which means that only the number of dominant eigenvalues obtained with the optimal parameter is close to the actual number of clusters. To avoid these drawbacks, this paper proposes an effective validity index based on the pairwise similarities and develops a self-adaptive algorithm to detect the optimal number of clusters

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1044 – 1049, 2006. © Springer-Verlag Berlin Heidelberg 2006
automatically. The benchmark results indicate that our method is superior to FCM and to the method in [3], and that the proposed index is more feasible and, for some cases, works robustly within a range of Gaussian kernel widths.
2 Clustering in Kernel-Defined Feature Space

Given a finite dataset X = {x_i | i = 1, ..., N}, the goal of clustering is to seek a map f that partitions all data items into c groups. In the kernel-defined feature space, let W be an indicator matrix, U the matrix of mapped points, and V a diagonal c × c matrix whose diagonal entries are the inverses of the column sums of W. Then the distance of any mapped data point from the centroids can be computed by [2]

\|\Phi(x) - \mu_m\|^2 = \kappa(x, x) + (V W^T K W V)_{mm} - 2(m^T W V)_m,     (1)
where m is the vector of inner products between the mapped point and those assigned to cluster m, and κ(·,·) and K are the kernel function and the kernel matrix, respectively. The basic kernel clustering algorithm can thus be summarized as below [2].

Algorithm 1. Kernel Clustering (KC)

Step 1 Standardize and unitize the dataset X in the original space.
Step 2 Choose a kernel function with appropriate parameters, and calculate K.
Step 3 Choose c empirically and initialize W randomly.
Step 4 Assign each point to the cluster m minimizing (V W^T K W V)_{mm} - 2(m^T W V)_m, then update W.
Step 5 If W remains invariant, exit; otherwise, go to step 4.
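For intuition, the centroid distances of Eq. (1) can be written directly in terms of cluster index sets rather than the W, V matrices; a minimal sketch on a hypothetical 4-point kernel matrix with two obvious blocks:

```python
def kernel_dists(K, labels, i, c):
    # ||Phi(x_i) - mu_m||^2 expanded via the kernel:
    # K_ii + (1/|m|^2) sum_{j,l in m} K_jl - (2/|m|) sum_{j in m} K_ij
    n = len(labels)
    dists = []
    for m in range(c):
        idx = [j for j in range(n) if labels[j] == m]
        within = sum(K[j][l] for j in idx for l in idx) / (len(idx) ** 2)
        cross = sum(K[i][j] for j in idx) / len(idx)
        dists.append(K[i][i] + within - 2.0 * cross)
    return dists

K = [[1.0, 0.9, 0.2, 0.1],
     [0.9, 1.0, 0.1, 0.2],
     [0.2, 0.1, 1.0, 0.8],
     [0.1, 0.2, 0.8, 1.0]]
d = kernel_dists(K, [0, 0, 1, 1], 0, 2)
```

Step 4 of Algorithm 1 amounts to reassigning each point to the cluster with the smallest such distance.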
3 Similarity-Based Validity Index and SAKC Algorithm

From Algorithm 1 we can see that the kernel matrix plays an important role in kernel-based clustering. Every element of K defines the pairwise similarity between two data items, which gives kernel algorithms a natural similarity measure. Take the Gaussian kernel κ(x_i, x_j) = exp(-||x_i - x_j||² / 2σ²) as an example: when σ → 0, κ(x_i, x_j) → δ_ij, so the diagonal entries of K equal 1 while the non-diagonal entries equal 0. This implies that any point is similar only to itself and dissimilar to all others. In the other extreme case, σ → ∞, κ(x_i, x_j) → 1, and all data items are entirely similar. In general, the values of the non-diagonal entries lie between 0 and 1 and can be used to measure the similarity between x_i and x_j: when x_i and x_j are assigned to the same cluster, K_ij is relatively large; otherwise, it is relatively small. In summary, the value of K_ij measures the pairwise similarity between two mapped points, and the structure of K mirrors the inherent organization of the dataset. Therefore, K provides all the information needed to assess the quality of a clustering.
3.1 Within-Cluster and Between-Cluster Average Similarity Coefficients

In order to describe the degree of average similarity between one sample and the other samples ascribed to the same cluster, or between this sample and those allocated to different groups, we define the within-cluster and between-cluster average similarity coefficients S_within^i and S_between^i as

S_{within}^{i} = \frac{1}{|f^{-1}(m)| - 1} \sum_{j: f(x_j) = f(x_i) = m, \, j \neq i} K_{ij}, \quad \forall i = 1, 2, ..., N,     (2)

S_{between}^{i} = \frac{1}{N - |f^{-1}(m)|} \sum_{j: f(x_j) \neq f(x_i) = m} K_{ij}, \quad \forall i = 1, 2, ..., N     (3)
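Eqs. (2) and (3) in code form, on a hypothetical 4-sample kernel matrix with two obvious blocks:

```python
def avg_similarities(K, labels, i):
    # Within- and between-cluster average similarity of sample i, Eqs. (2)-(3)
    m = labels[i]
    within = [K[i][j] for j in range(len(labels)) if labels[j] == m and j != i]
    between = [K[i][j] for j in range(len(labels)) if labels[j] != m]
    return sum(within) / len(within), sum(between) / len(between)

K = [[1.0, 0.9, 0.2, 0.1],
     [0.9, 1.0, 0.1, 0.2],
     [0.2, 0.1, 1.0, 0.8],
     [0.1, 0.2, 0.8, 1.0]]
s_w, s_b = avg_similarities(K, [0, 0, 1, 1], 0)
```

For a labeling matching the block structure of K, the within coefficient is large and the between coefficient small, as the next subsection exploits.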
where f^{-1}(m) is the inverse mapping of f (and |f^{-1}(m)| the number of samples assigned to cluster m). We can see that S_within^i and S_between^i are statistical measures. For the Gaussian kernel, since 0 ≤ K_ij ≤ 1, we have S_within^i, S_between^i ∈ [0, 1]; therefore they can be viewed as measures of probability. S_within^i and (1 - S_between^i) correspond to the conditional probabilities p_w(x_i | C_m) and p_b(x_i | C_m), respectively, and both measure, from different points of view, how reasonable it is to assign x_i to cluster m.

3.2 Similarity-Based Clustering Validity Index

Intuitively, samples divided into the same cluster should be very similar to each other, and those assigned to different groups should be very distinctive. That is, in a good partitioning, S_within^i should be large and S_between^i should be small. Therefore, a clustering validity index combining these two aspects is defined as
V(K; W, c) = \frac{\sum_{m=1}^{c} \sum_{i:\, f(x_i) = m} S_{between}^{i}}{\sum_{m=1}^{c} \sum_{i:\, f(x_i) = m} S_{within}^{i}} = \frac{\sum_{i=1}^{N} S_{between}^{i}}{\sum_{i=1}^{N} S_{within}^{i}}.   (4)
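The coefficients of Eqs. (2)-(3) and the index of Eq. (4) can be computed directly from a precomputed kernel matrix; a minimal sketch (illustrative code, not the authors' implementation):

```python
import numpy as np

def validity_index(K, labels):
    """V(K; W, c): ratio of summed between-cluster to summed
    within-cluster average similarity coefficients (Eqs. 2-4).

    K: (N, N) kernel matrix; labels: integer cluster assignments f(x_i).
    """
    N = K.shape[0]
    s_within = np.ones(N)       # guard value for singleton clusters
    s_between = np.zeros(N)
    for i in range(N):
        same = labels == labels[i]
        same[i] = False                      # Eq. (2) excludes j = i
        if same.any():
            s_within[i] = K[i, same].mean()
        other = labels != labels[i]
        if other.any():
            s_between[i] = K[i, other].mean()
    return s_between.sum() / s_within.sum()
```

A good partitioning drives the numerator down and the denominator up, so smaller values of the index indicate better clusterings.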
Clearly, the greater the within-cluster similarity and the smaller the between-cluster similarity, the smaller V(K; W, c) is, and hence the better the quality of the clustering. In essence, V(K; W, c) evaluates how reasonable the clustering result is.

3.3 Self-Adaptive Kernel Clustering Algorithm

We are now in a position to present an adaptive algorithm which detects the optimal number of clusters c* and the corresponding best partitioning W* automatically, as below:
Algorithm 2. Self-Adaptive Kernel Clustering (SAKC)
Step 1. Set c = 2 and initialize V_1(K; W, 1) to an arbitrarily large positive number.
Step 2. For the fixed c, choose an integer d and perform Algorithm 1 d times to obtain W_cd; the index cd denotes the d-th clustering run for the fixed c.
Step 3. Compute V_cd(K; W, c) using K and W_cd, and choose the minimum as the best partitioning for the fixed c, i.e. V_c(K; W, c) = min_{1 ≤ s ≤ d} (V_cs).
Step 4. If V_c(K; W, c) ≥ V_{c-1}(K; W, c-1), exit; otherwise set c = c + 1 and go to Step 2.
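Algorithm 2 reduces to a small driver loop once the base clusterer and the index are available. In this sketch, `cluster_fn` stands in for Algorithm 1 (not reproduced in this excerpt) and `index_fn` for V(K; W, c); both names are assumptions for illustration:

```python
def sakc(K, cluster_fn, index_fn, d=20, c_max=50):
    """Self-Adaptive Kernel Clustering (Algorithm 2, sketch).

    Returns (c*, W*): the detected number of clusters and the best partitioning.
    """
    best_c, best_labels = None, None
    prev_v = float("inf")                    # Step 1: V_1 "arbitrarily large"
    for c in range(2, c_max + 1):
        trials = [cluster_fn(K, c) for _ in range(d)]        # Step 2
        scores = [index_fn(K, labels) for labels in trials]  # Step 3
        v_c = min(scores)
        if v_c >= prev_v:                    # Step 4: index stopped improving
            break
        best_c = c
        best_labels = trials[scores.index(v_c)]
        prev_v = v_c
    return best_c, best_labels
```

The loop stops at the first c whose best index value fails to improve on the previous one, which is exactly the Step 4 exit condition.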
4 Experimental Results and Discussions

In this section, a series of benchmark experiments is carried out to verify the performance of the proposed index and SAKC, and the experimental results are compared with the popular FCM algorithm and Girolami's method [3]. Throughout all experiments, we use the Gaussian kernel and take d = 20 in Algorithm 2.

Experiment 1 comprises two toy simulations on synthetic datasets. The first toy dataset is the two-dimensional "Five Gaussian Data", and the second is the so-called "Ring Data"; detailed information on these two datasets can be found in [3]. In this experiment, for the convenience of comparison with FCM, we slightly modify SAKC so that it computes the values of V(K; W, c) over the whole range 2 ≤ c ≤ c_max, and take c_max = 10. The clustering results are shown in Fig. 1a, c, d. The values of the proposed validity index, compared with two well-known indices for FCM, namely the partition coefficient Vpc [4] and the Xie-Beni index VXB [5], are drawn in Fig. 1b for "Five Gaussian Data" and summarized in Table 1 for "Ring Data". From Fig. 1 and Table 1 it is clear that, for the spherical dataset, the performances of FCM and SAKC are almost the same, and all of these validity indices detect the number of clusters accurately. As for the non-spherical "Ring Data", FCM, together with the noted Xie-Beni index, fails, whereas SAKC and the proposed index work effectively. The results also show that our index works robustly within a range of σ in different scenarios, although this robust range differs among scenarios. Perhaps this implies that, for "Ring Data", clustering in feature space is also sensitive to the selection of the kernel parameter, just as model selection is for support vector machines [6].

Experiment 2 consists of four standard UCI real databases: Iris, Glass, Wine and BUPA [7]. In this experiment we use the unmodified SAKC, and besides FCM, the results are also compared with those of Girolami's method [3].
For the Iris data, the optimal number of clusters detected by all validity indices, including ours, is 2, and the actual number 3 is the second-best number of clusters. This may imply that, purely from the point of view of creating distinct clusters, a two-cluster solution is perhaps a better answer than a three-cluster one. For the other three databases, the results are summarized in Table 2. We can see from this table that FCM and its associated indices are sometimes ineffective, for example on Glass and Wine, while our
algorithm and index work well. Compared with Girolami's method, the proposed index is more feasible and works robustly within a range of Gaussian kernel widths, except for some complicated distributions, for example the Glass dataset.

Table 1. Validity indices for "Ring Data". The number of clusters corresponding to the underlined value is the optimal number detected by the experiment.
Cluster Number   Vpc [4]   VXB [5]   V(K;W,c) [this paper]
                                     σ=0.1    σ=0.2    σ=0.3
2                0.6579    0.3851    0.0989   0.1939   0.4182
3                0.6231    0.2406    0.1078   0.1978   0.3803
4                0.5913    0.3792    0.1051   0.2028   0.3560
5                0.5894    0.1145    0.1099   0.2142   0.3335
6                0.5400    0.4902    0.1057   0.2235   0.3242
7                0.5368    0.4762    0.1063   0.2147   0.3331
8                0.5338    0.3564    0.1028   0.2077   0.3808
9                0.5277    0.3532    0.1071   0.2095   0.3735
10               0.5259    0.2964    0.1053   0.2141   0.3774
[Figure 1 appeared here; plot data omitted. Legend of panel (b): σ = 0.2 (V + 0.45), σ = 0.5 (V), σ = 0.8 (V - 0.20), σ = 1.0 (V - 0.25), σ = 2.0 (V - 0.25).]

Fig. 1. (a) Result of the SAKC algorithm on "Five Gaussian Data", almost the same as that of FCM. (b) Variation of V(K; W, c) with c and σ for "Five Gaussian Data". (c) and (d) Results of FCM and SAKC on "Ring Data", respectively. In (a) and (d), σ = 0.2.
Table 2. Results of Experiment 2. In the last column, the range of values enclosed in parentheses is the effective range of kernel width, which is also used in Girolami's method.

Database  Samples  Attributes  Real classes  Vpc  VXB  Girolami  This paper
Iris      150      4           3             2,3  2,3  2,3~6     2,3 (0.2~1.0)
Glass     214      9           6             2    5    3~9       6 (0.1~0.2)
Wine      178      13          3             2    3    3         3 (0.3~0.6)
BUPA      345      6           2             2    2    2,3       2 (0.3~1.0)
5 Conclusions

In summary, two statistical similarity coefficients, which can serve as similarity measures between mapped points, have been defined. Based on them, an efficient validity index, which has a distinct physical meaning and is easy to compute, and a self-adaptive kernel clustering algorithm, SAKC, have been developed. Benchmark results have shown that SAKC and the proposed index perform better than FCM and some well-known indices for it; compared with the method in [3], the proposed method is more feasible and works robustly within a range of Gaussian kernel widths in most cases. Extending the experimental validation to more datasets to investigate this robustness is the subject of future work.
Acknowledgments This work is supported by the National Natural Science Foundation of China (60572143) and the National EW Laboratory Foundation (NEWL51435QT220401).
References
1. Xu, R., Wunsch II, D.C.: Survey of Clustering Algorithms. IEEE Trans. Neural Networks 16(3) (2005) 645-678
2. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, England (2004)
3. Girolami, M.: Mercer Kernel-Based Clustering in Feature Space. IEEE Trans. Neural Networks 13(3) (2002) 780-784
4. Bezdek, J.C., Pal, N.R.: Some New Indexes of Cluster Validity. IEEE Trans. Systems, Man, and Cybernetics-Part B: Cybernetics 28(3) (1998) 301-315
5. Xie, X.L., Beni, G.: A Validity Measure for Fuzzy Clustering. IEEE Trans. Pattern Analysis and Machine Intelligence 13(8) (1991) 841-847
6. Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing Multiple Parameters for Support Vector Machines. Machine Learning 46(1) (2002) 131-159
7. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. Available at: ftp://ftp.ics.uci.edu/pub/machine-learning-databases
Fuzzy Support Vector Clustering

En-Hui Zheng 1,2, Min Yang 3, Ping Li 2, and Zhi-Huan Song 2

1 College of Information and Electronic Engineering, Zhejiang Gongshang University, Hangzhou, 310035, P.R. China
[email protected]
2 National Lab. of Industrial Control Technology, Institute of Industrial Process Control, Zhejiang University, Hangzhou, 310027, P.R. China
{ehzheng, pli, zhsong}@iipc.zju.edu.cn
3 Computer School, Hangzhou Dianzi University, Hangzhou, 310018, P.R. China
[email protected]
Abstract. Support vector clustering (SVC) faces the same over-fitting problem as the support vector machine (SVM), caused by outliers or noise. A fuzzy support vector clustering (FSVC) algorithm is presented to deal with this problem. A membership model based on k-NN is used to determine the membership values of the training samples. The proposed fuzzy support vector clustering algorithm is applied to determine the clusters of several benchmark data sets. Experimental results indicate that the proposed algorithm effectively reduces the influence of outliers and yields better clustering quality than both the SVC and the traditional centroid-based hierarchical clustering algorithms.
1 Introduction

Support vector machines (SVMs) are a new class of machine learning algorithms motivated by results of statistical learning theory [1]. SVMs capture the main insight of statistical learning theory: in order to obtain a small risk, one needs to control both the training error and the model complexity.

Recently, a novel support vector clustering (SVC) algorithm was proposed by Ben-Hur [2]. It generates cluster boundaries with arbitrary shape and uses an a priori maximal allowed rejection rate of error to control the number of clusters. However, SVC faces the same over-fitting problem as SVM, caused by outliers or noise, owing to the fixed slack penalty used during training [3, 4]. In this paper, a fuzzy support vector clustering (FSVC) algorithm is presented to solve the over-fitting problem. A fuzzy membership model based on k-NN is used to set the membership values.

This paper is organized as follows: Section 2 briefly introduces the basic theory of support vector data description (SVDD) and the SVC algorithm. In Section 3, the FSVC algorithm and the fuzzy membership model are formulated. Section 4 describes experimental results. Finally, some conclusions are given in Section 5.
2 Support Vector Data Description (SVDD) and Clusters

The SVC algorithm is built on the basis of SVDD. Given a training sample set T = {x_1, ..., x_i, ..., x_l}, x_i ∈ R^n, i = 1, ..., l, the basic idea of SVDD is to obtain the minimal enclosing hypersphere of T with center a and radius R. Once a and R are determined, we can predict whether a new sample belongs to the data description according to its distance from the center a. When the data samples are not spherically distributed in input space, they can be mapped into a high-dimensional feature space via a mapping Φ: R^n → F, where F denotes the feature space. Then, in F, the smallest enclosing hypersphere is obtained by solving the following QP problem [2]:

\min\, W(\xi_i, R, a) = R^2 + C \sum_{i=1}^{l} \xi_i,
\quad \text{s.t.}\quad \|\Phi(x_i) - a\|^2 \le R^2 + \xi_i, \quad \xi_i \ge 0,\ i = 1, 2, \ldots, l,   (1)

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1050-1056, 2006. © Springer-Verlag Berlin Heidelberg 2006
where C is the penalty factor, which determines the trade-off between the volume of the sphere and the number of target objects rejected. Introducing Lagrange multipliers β_i, β_j and a kernel mapping, we finally obtain the Wolfe dual:

\max\, Q(\beta) = \sum_{i=1}^{l} K(x_i, x_i)\beta_i - \sum_{i,j=1}^{l} \beta_i \beta_j K(x_i, x_j),
\quad \text{s.t.}\quad 0 \le \beta_i \le C,\ i = 1, 2, \ldots, l.   (2)
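The dual (2) is a box-constrained quadratic program. As a hedged numerical sketch: the standard SVDD dual also carries the equality constraint Σ_i β_i = 1 (it follows from stationarity with respect to R and is implicit in the excerpt), so it is included here, and a general-purpose solver is used in place of a dedicated QP routine:

```python
import numpy as np
from scipy.optimize import minimize

def svdd_dual(K, C):
    """Solve max Q(beta) of Eq. (2) subject to 0 <= beta_i <= C and sum(beta) = 1."""
    l = K.shape[0]
    # Q(beta) = beta . diag(K) - beta^T K beta; maximise Q by minimising -Q
    neg_q = lambda b: float(b @ K @ b - b @ np.diag(K))
    res = minimize(neg_q, np.full(l, 1.0 / l),
                   bounds=[(0.0, C)] * l,
                   constraints=({"type": "eq", "fun": lambda b: b.sum() - 1.0},),
                   method="SLSQP")
    return res.x
```

Note that C must satisfy C ≥ 1/l for the constraints to be feasible; smaller C forces more points to become bound support vectors.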
After solving the quadratic programming problem, we obtain the optimal Lagrange multipliers β_i. The data points associated with non-zero β_i are called support vectors (SVs). Furthermore, data points with 0 < β_i < C are called non-bound support vectors, and those with β_i = C are called bound support vectors. The distance between a data sample and the center of the feature-space hypersphere is computed by

D(x_i) = \sqrt{\sum_{i,j=1}^{l} \beta_i \beta_j K(x_i, x_j) + K(x_i, x_i) - 2\sum_{j=1}^{l} K(x_i, x_j)\beta_j}.   (3)
The radius of the smallest enclosing hypersphere is determined by R = D(x_i) for any i with 0 < β_i < C, and the contours that enclose the points in input space are defined by the set {x | D(x) = R}. Once D(x_i) and R are determined, the contours enclosing the samples are obtained when the samples are mapped back into input space [3]. Because the contours do not distinguish points that belong to different clusters, an adjacency matrix and a depth-first search (DFS) mechanism are used to determine the clusters in the input space. The adjacency matrix between pairs of points x_i and x_j whose images lie in or on the hypersphere in feature space is defined as follows [2]:

A_{i,j} = \begin{cases} 1 & \text{if } D(z) \le R \text{ for all } z \text{ on the segment}, \\ 0 & \text{otherwise}, \end{cases}   (4)
where z denotes any point on the line segment connecting x_i and x_j. Clusters are now defined as the connected components of the graph induced by the adjacency matrix A. Checking the line segment is implemented by sampling a number of points: for each pair of points in the sample set, we take 10 points lying on the line connecting them and check whether they are within the hypersphere. If all the points on that line are within the hypersphere, the two sample points are assumed to belong to the same cluster; otherwise, they belong to different clusters.

In the SVC algorithm, the most commonly used kernel function is the Gaussian kernel K(x_i, x_j) = e^{-p||x_i - x_j||^2}. The SVC algorithm can generate cluster boundaries with arbitrary shape, and it is easy to control the number of clusters by adjusting two parameters: the penalty factor C and the Gaussian kernel parameter p. Even when the clusters strongly overlap, the SVC algorithm can still work very well. This is in contrast with most clustering algorithms found in the literature, which have no mechanism for dealing with overlapping clusters.
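The segment-sampling connectivity test described above can be sketched as follows (illustrative names; a practical implementation would precompute the kernel matrix and the constant quadratic term instead of recomputing them per call):

```python
import numpy as np

def rbf(a, b, p):
    # Gaussian kernel K(a, b) = exp(-p ||a - b||^2)
    return np.exp(-p * np.sum((a - b) ** 2))

def dist_to_center(z, X, beta, p):
    # Eq. (3): distance from Phi(z) to the sphere centre a = sum_i beta_i Phi(x_i)
    k_zx = np.array([rbf(z, x, p) for x in X])
    quad = sum(beta[i] * beta[j] * rbf(X[i], X[j], p)
               for i in range(len(X)) for j in range(len(X)))
    return np.sqrt(rbf(z, z, p) - 2.0 * beta @ k_zx + quad)

def same_cluster(xi, xj, X, beta, p, R, n_points=10):
    # Eq. (4): connected iff every sampled segment point stays inside the sphere
    for t in np.linspace(0.0, 1.0, n_points):
        z = (1.0 - t) * xi + t * xj
        if dist_to_center(z, X, beta, p) > R:
            return False
    return True
```

Running `same_cluster` over all pairs of in-sphere points yields the adjacency matrix A, whose connected components are the clusters.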
3 Fuzzy Support Vector Clustering Algorithm

In SVC, setting the free parameters is very important. The penalty factor C can be determined by setting an a priori maximal allowed rejection rate of error on the clusters. A larger C assigns a higher penalty to the outliers and thus reduces the error rate; conversely, a smaller C ignores more plausible errors and thus yields a smaller hypersphere radius [2]. However, whether the value of C is large or small, this parameter is fixed during the training process of SVC. That is to say, all data points are treated equally during training. This leads to high sensitivity to special cases such as outliers and noise. To deal with this problem, a fuzzy membership model is introduced into SVC, and the fuzzy support vector clustering algorithm is presented.

In FSVC, each data point has a membership value μ_i, 0 < μ_i < 1, and the training set becomes (x_1, μ_1), ..., (x_i, μ_i), ..., (x_l, μ_l). The primal problem (1) can be rewritten as

\min\, W(\xi_i, R, a) = R^2 + C \sum_{i=1}^{l} \mu_i \xi_i,
\quad \text{s.t.}\quad \|\Phi(x_i) - a\|^2 \le R^2 + \xi_i, \quad \xi_i \ge 0,\ i = 1, 2, \ldots, l.   (5)

The dual problem becomes

\max\, Q(\beta) = \sum_{i=1}^{l} K(x_i, x_i)\beta_i - \sum_{i,j=1}^{l} \beta_i \beta_j K(x_i, x_j),
\quad \text{s.t.}\quad 0 \le \beta_i \le C\mu_i,\ i = 1, 2, \ldots, l.   (6)
It is clear that the only difference between SVC and FSVC is the upper bound on the Lagrange multipliers β_i in the dual problem. In SVC, the Lagrange multipliers are bounded above by a constant C, while in FSVC they are bounded by dynamic limits that are functions of the membership values.

The fuzzy membership function is very important, since it determines the performance of FSVC. For the sequential learning problem, Lin proposed a membership model [4]. Huang proposed a membership model based on an outlier detection method [3]; in that model, outliers are treated equally and are all assigned the same membership value. In this paper, a strategy based on the k nearest neighbor (k-NN) method is used to set the fuzzy memberships of the data points. Given two samples x_1, x_2 ∈ R^n, the distance between them in the feature space is defined by

d(x_1, x_2) = \|\varphi(x_1) - \varphi(x_2)\|^2 = K(x_1, x_1) - 2K(x_1, x_2) + K(x_2, x_2).   (7)

If the kernel is the Gaussian RBF function, then since K(x_i, x_i) = e^{-p\|x_i - x_i\|^2} = 1, Eq. (7) can be rewritten as

d(x_1, x_2) = 2 - 2K(x_1, x_2).   (8)
For each data point x_i, we can find a set S_i^k that consists of the k nearest neighbors of x_i according to the distance definition (8). The average distance between x_i and the elements x_j of S_i^k is defined as

d_i = \frac{1}{k} \sum_{j=1}^{k} \sqrt{2 - 2K(x_i, x_j)}.   (9)
We assume that a data point with a larger value of d_i can be considered an outlier and should contribute less to the hypersphere in feature space. Under this assumption, we can build a relationship between the fuzzy membership μ_i and the value of d_i. The maximal and minimal values of d_i are defined as

d_{max} = \max(d_i \mid x_i \in \chi), \qquad d_{min} = \min(d_i \mid x_i \in \chi).   (10)

Then the fuzzy membership function is defined as

\mu_i = 1 - (1 - \sigma)\left(\frac{d_i - d_{min}}{d_{max} - d_{min}}\right)^{f},   (11)
where σ < 1 and f is a parameter that controls the shape of the mapping function. When d_i is close to d_min, the data point x_i lies in a dense region of the data and its fuzzy membership μ_i is close to 1. As d_i increases, x_i moves away from the dense region and its fuzzy membership decreases. When d_i is close to d_max, x_i can be regarded as an outlier and μ_i is close to a sufficiently small positive number σ. This effectively reduces the influence of the outliers.

From the above analysis we can see that both the k-NN computation and the SVC clustering algorithm spend most of their time on calculating the kernel matrix, whose dimension equals the number of training examples, so the training process of FSVC is time-consuming. However, if the most frequently used kernel values are cached in memory during training, the training time of the FSVC algorithm is significantly reduced.
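Eqs. (8)-(11) combine into a single routine that assigns a membership value to every training point; a minimal sketch assuming the Gaussian kernel K(x_i, x_j) = exp(-p||x_i - x_j||²) from Section 2 (function and parameter names are illustrative):

```python
import numpy as np

def knn_memberships(X, p, k=5, sigma=0.1, f=2.0):
    # pairwise Gaussian kernel, then feature-space distances of Eq. (8)
    K = np.exp(-p * ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    D = np.sqrt(np.maximum(2.0 - 2.0 * K, 0.0))
    np.fill_diagonal(D, np.inf)                   # a point is not its own neighbour
    d = np.sort(D, axis=1)[:, :k].mean(axis=1)    # Eq. (9): mean distance to k-NN
    d_min, d_max = d.min(), d.max()               # Eq. (10)
    return 1.0 - (1.0 - sigma) * ((d - d_min) / (d_max - d_min)) ** f   # Eq. (11)
```

Points in dense regions get memberships near 1, while the most isolated point gets exactly σ, which shrinks its slack penalty Cμ_i in the dual (6).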
4 Experimental Results

Before considering experiments on real-life data, we first run the algorithm on an artificial data set with two dimensions and 146 samples. The results in input space using the FSVC algorithm with k = 5 and different values of p are shown in Figures 1-4. We can see that, as the scale parameter p of the Gaussian kernel increases, the shape of the boundary fits the data points more tightly and the enclosing contour splits into three clusters.

To get a better feel for how FSVC performs in practice, we ran FSVC on real-life as well as synthetic data sets. We use real-life data sets to compare the clustering quality of FSVC with that of the traditional centroid-based hierarchical clustering (CHC) algorithm and the SVC algorithm. The characteristics of the two real-life datasets, Congressional Votes and Mushroom, are taken from [5], and the parameters of FSVC are p = 80, C = 0.5 and k = 10.

The Congressional Voting data set is the United States Congressional Voting Record of 1984. There are 435 records, and each record corresponds to one Congressman's votes on 16 issues (e.g., education spending, crime). The class label of each record is Republican or Democrat. The Mushroom data set contains information describing physical characteristics of mushrooms (e.g., color, odor, size, shape). There are 8124 records, and each record has 22 attributes. The class label of each record is poisonous or edible. In the Mushroom data set, the clusters are not well separated and contain many outliers. We run the three algorithms on both data sets and show the results in Tables 1 and 2.

As Table 1 illustrates, the FSVC, SVC and CHC algorithms all identify two clusters, one containing a large number of Republicans and the other containing a majority of Democrats. However, in the cluster of Republicans found by the traditional algorithm, around 25% of the members are Democrats, while with SVC 14% are Democrats and with FSVC only about 4% are Democrats.
Similar results can be obtained from Table 2. Because the clusters of the Mushroom data set overlap each other, the quality of the clusters generated by the traditional algorithm is very poor: every cluster contains a sizable number of both poisonous and edible mushrooms. The quality of the clusters generated by SVC and FSVC is better than that of CHC, and there are two clusters with large sizes, clusters 1 and 2. Furthermore, owing to the effect of outliers, the clusters generated by SVC still contain around 4% samples of the other class, while the clusters (1 and 2) generated by FSVC contain almost no samples of the other class.
[Figures 1-4 appeared here; plot data omitted.]

Fig. 1. Cluster generated by FSVC with parameter p = 50
Fig. 2. Cluster generated by FSVC with parameter p = 150
Fig. 3. Clusters generated by FSVC with parameter p = 300
Fig. 4. Clusters generated by FSVC with parameter p = 900
Table 1. Clustering result for the Congressional Voting data set

             Republicans            Democrats
Cluster No.  CHC    SVC    FSVC     CHC    SVC    FSVC
1            157    140    154      52     22     6
2            11     6      2        215    219    221

Table 2. Clustering result for the Mushroom data set

             Edible                 Poisonous
Cluster No.  CHC    SVC    FSVC     CHC    SVC    FSVC
1            1257   3108   3274     754    122    7
2            1742   173    0        412    2794   2901
3            847    385    376      1163   12     0
4            1247   45     10       702    461    164
5 Conclusions

In this paper, a fuzzy support vector clustering (FSVC) algorithm has been presented to deal with the over-fitting problem caused by outliers. A membership model based on k-NN is used to determine the membership values of the training samples. The proposed algorithm is applied to determine the clusters of several benchmark data sets. The results indicate that the proposed method effectively reduces the influence of outliers and yields better clustering quality than the SVC and CHC algorithms do.
Acknowledgment

This work is supported by the 973 Program of China (No. 2002CB312200) and by the 863 Program of China (No. 2002AA412010-12).
References
1. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
2. Ben-Hur, A., Horn, D., Siegelmann, H.T.: Support Vector Clustering. Journal of Machine Learning Research 2(2) (2001) 125-137
3. Huang, H.-P., Liu, Y.-H.: Fuzzy Support Vector Machines for Pattern Recognition and Data Mining. International Journal of Fuzzy Systems 4(3) (2002) 826-835
4. Lin, C.-F., Wang, S.-D.: Fuzzy Support Vector Machines. IEEE Trans. on Neural Networks 13(2) (2002) 464-471
5. Merz, C.J., Murphy, P.M.: UCI Repository for Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science, http://www.ics.uci.edu/~mlearn/MLRepository.html (1998)
An SVM Classification Algorithm with Error Correction Ability Applied to Face Recognition

Chengbo Wang and Chengan Guo

School of Electronic and Information Engineering, Dalian University of Technology, Dalian, Liaoning, 116023, China
[email protected], [email protected]
Abstract. This paper presents an SVM classification algorithm with pre-designed error correction ability, obtained by incorporating the error control coding schemes used in digital communications into the classification algorithm. The algorithm is applied to face recognition problems. Simulation experiments are conducted for different SVM-based classification algorithms, using both PCA and Fisherface features as input vectors to represent the images with dimensionality reduction, and a performance analysis is made among the different approaches. Experimental results show that the error correction SVM classifier of this paper outperforms other commonly used SVM-based classifiers in both recognition rate and error tolerance.
1 Introduction

Research interest and activity in face recognition have increased significantly over the past few years [1, 2]. Two issues are essential in face recognition. The first is what features should be used to represent a face: a face image is subject to changes of viewpoint, illumination and expression, and an effective representation should be able to deal with such changes. The second is how to classify a new face image using the chosen representation. There are two important kinds of methods for obtaining features to represent face images: one is based on geometric features and the other on statistical features [1, 2]. In this paper we adopt the two most successful statistical-feature-based methods for face representation and recognition: the eigenface method [3], which is based on the Karhunen-Loeve transform or Principal Component Analysis (PCA), and the Linear Discriminant Analysis based Fisherface method [4]. Using these two methods, dimensionality reduction is performed to obtain smaller vectors for representing the face images.

The Support Vector Machine (SVM), proposed by Vapnik et al. [5], is an effective method for general-purpose pattern recognition. A two-class problem can be solved by a binary SVM efficiently. For K-class problems, a number of binary SVMs must be used. Several approaches to multi-class learning using binary SVMs have been proposed, including the M-ary algorithm [6], the One-Against-One method [7], the One-Against-All method [6], and the Error-Correcting Output Codes (ECOC) algorithm [8, 9]. Two issues are usually taken into account for a learning classification algorithm: its computational complexity and its generalization ability

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1057-1062, 2006. © Springer-Verlag Berlin Heidelberg 2006
(or error tolerance) of the algorithm. The M-ary algorithm is the fastest, needing only ⌈log₂ K⌉ SVMs. However, this algorithm has no error tolerance: once an SVM gives a wrong intermediate classification, it leads to a final mistake. The One-Against-All and One-Against-One algorithms need O(K) and O(K²) SVMs, respectively. These two algorithms have some error tolerance, obtained by introducing a certain number of redundant SVMs. In comparison, the ECOC algorithm has stronger error tolerance: it can correct l intermediate misclassifications with a preset 2l + 1 redundant classifiers, and it usually needs fewer SVMs than the One-Against-All and One-Against-One algorithms do.

In this paper we present an SVM multi-classification algorithm based on the ECOC scheme and apply it to face recognition problems. Simulation results on face recognition experiments show that this algorithm outperforms the other algorithms mentioned above. In the sequel of the paper, the SVM multi-classification algorithm based on the ECOC scheme is given in Section 2. The application of the algorithm to face recognition problems and related discussions are given in Section 3. Section 4 gives the summary and further research directions.
2 An SVM Multi-classification Algorithm Based on ECOC

2.1 ECOC Classification Approach

It is well known that there are many excellent error control coding schemes in digital communication that can correct a certain number of bit errors occurring during transmission over a channel [10]. Dietterich and Bakiri [8] proposed an approach for solving multi-class learning problems via error-correcting output codes (ECOC), in which learning classification is viewed as a digital communication problem and the classification errors made by some binary classifiers are viewed as transmission errors over a channel. The errors may therefore be corrected by using an error control coding scheme.

The main idea of the ECOC learning classification method is as follows. For a K-class problem, the first step is to generate an n-bit code with minimum Hamming distance d, where n ≥ ⌈log₂ K⌉ + d. The second step is to assign a unique codeword of the code to each class of the training samples to represent that class. Let C(k) denote the codeword for a training sample x of class k and express the training set as S_x = {(x, C(k)), k = 1, ..., K}. Then for the training set there are n binary functions f_i(x) (i = 1, ..., n), with f_i corresponding to the i-th bit of C(k). The third step of the ECOC approach is to learn the n binary functions from S_x using a learning algorithm. After the training stage, the n learned functions f̂_i(x) (i = 1, ..., n) are used to classify a new input sample x_new by first obtaining n binary output values ŷ_i = u[f̂_i(x_new)] (i = 1, ..., n), where u(x) = 1 if x ≥ 0 and u(x) = 0 otherwise. The n binary outputs are combined into a codeword, the codeword is passed through an error-correcting algorithm, and finally the corrected codeword is decoded to get the classification result. According to the theory of error control coding [10], this ECOC classification approach is able to correct ⌊(d - 1)/2⌋ errors occurring in ŷ_i (i = 1, ..., n).

2.2 An SVM Classification Algorithm Based on the ECOC Method

In this subsection we present an algorithm, based on the ECOC approach, for learning n binary SVMs and solving K-class problems using the learned SVMs with pre-designed error correction ability. The training algorithm for learning the n binary SVMs is as follows:

(i) Choose proper values of n and l for a K-class problem such that n ≥ ⌈log₂ K⌉ + 2l + 1.
(ii) Generate an n-bit code whose minimum Hamming distance is 2l + 1; BCH codes [10] are recommended here.
(iii) Assign a unique codeword C(k) of the code to each training sample x of class k, expressed as (x, C(k)).
(iv) Construct n training sets S^i = {(S_0^i, S_1^i)} (i = 1, ..., n) from the training set S_x = {(x, C(k)), k = 1, ..., K} in such a way that each (x, C(k)) ∈ S_x is put into the subset S_1^i if the i-th bit of C(k) is 1, and into the subset S_0^i otherwise.
(v) Use each S^i to train a binary SVM for learning the binary function f_i(x), whose desired output y_i is set to 1 if x ∈ S_1^i and to -1 if x ∈ S_0^i.

In this way, n binary SVMs can be obtained, each associated with a learned binary function f̂_i. A classification algorithm can then be realized using the n trained binary SVMs, in which the key procedure is to implement error correction on the outputs of the SVMs. Many excellent decoding algorithms for BCH codes, implemented in digital hardware or software, are available (e.g., the decoding algorithm given in [10]) and can be used in the classification algorithm to correct intermediate misclassified results made by some SVMs. It is well known that the error correction principle of BCH codes is to choose the codeword with the minimum Hamming distance. For simplicity, this paper proposes the following classification algorithm, which directly applies this principle using the n trained SVMs:

(i) Input the sample x to be classified.
(ii) Use each SVM to classify x and get the output ŷ_i = u[f̂_i(x)], for i = 1, 2, ..., n.
(iii) Construct an output codeword Ĉ = (ŷ_1, ŷ_2, ..., ŷ_n).
(iv) Compute the Hamming distances between Ĉ and each C(k):

HD(k) = HammingDistance(Ĉ, C(k)) = \sum_{i=1}^{n} |ŷ_i - C_i(k)|, for k = 1, 2, ..., K.

(v) Classify x as class k̂, where k̂ = arg min_k {HD(k)}.

The above algorithm is simple and efficient when K (the number of classes) is not large. When K is large, more efficient decoding algorithms, such as Berlekamp's algorithm [10], can be applied in the classification process.
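Steps (i)-(v) amount to nearest-codeword (minimum Hamming distance) decoding; a compact sketch, with the trained binary decision functions stubbed as plain callables (names are illustrative):

```python
import numpy as np

def ecoc_classify(x, svms, codewords):
    """Classify x by minimum-Hamming-distance decoding (Steps (ii)-(v)).

    svms: list of n binary decision functions f_i (sign of f_i(x) gives the bit);
    codewords: (K, n) 0/1 array whose row k is C(k).
    """
    # Steps (ii)-(iii): binary outputs y_i = u[f_i(x)] form the received codeword
    c_hat = np.array([1 if f(x) >= 0 else 0 for f in svms])
    # Steps (iv)-(v): HD(k) = sum_i |y_i - C_i(k)|; pick the nearest codeword
    hd = np.abs(codewords - c_hat).sum(axis=1)
    return int(np.argmin(hd))
```

Because the codewords are at least d apart, up to ⌊(d - 1)/2⌋ flipped bits (intermediate SVM mistakes) still decode to the correct class.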
3 Applications in Face Recognition and Experimental Results In this paper the error correction SVM algorithm presented in Section 2 is applied to face recognition problems. The simulation experiments are conducted on the Cambridge ORL face database that contains 400 face images of 40 people in total with10 images for each person. In the application, the eigenfaces based on PCA [3] and Fisherfaces [4] are used respectively as feature templates to represent face images and the input vector of the SVMs for an image is obtained by projecting the image onto the eigenfaces or Fisherfaces. In each simulation experiment, 200 samples (5 samples selected randomly for each person) are used as the training set to train the SVMs and the remaining 200 samples are used as the test set, each experiment with 10 runs. For the 40 classes of face images in the database, the (31,6) BCH code is chosen that has 31 bits in total with 6 information bits and the minimum Hamming distance 15. Correspondingly, there are 31 SVMs in total for the face recognition problem and up to 7 errors can be corrected by the error correction SVM classification algorithm. In order to make comparison, simulation experiments with the same data have been conducted in the paper by using the M-ary classifier [5] and the One-AgainstOne voting algorithm [6]. Both PCA and Fisherface features which represent the face images with dimensional reduction are used as inputs vectors of the SVMs. Table 1 shows the experiment results, in which the first row is the number of features or the dimension of feature vectors, and the other rows are the average recognition rates of 10 experiments obtained by different algorithms using this number of features. The last column shows the number of SVMs used by different methods. It can be seen from Table 1 that for the 40-person face recognition problem, the method of the paper (i.e. 
Error Correction SVM Classifier) gives the highest recognition rates for the same number of features. As to computational complexity, although the M-ary classifier uses the fewest SVMs, its recognition rates are much lower than the others'. The One-Against-One voting method improves significantly over the M-ary classifier in recognition rate, but at a much higher computational cost, since many SVMs are required. Compared with the One-Against-One voting method, the algorithm of this paper achieves even better results with far fewer SVMs. It has been reported in [11] that for the One-Against-One binary tree algorithm, the highest recognition result is 97% using about 42 eigenfaces [3], whereas the best result given in this paper is 97.33% using 35 Fisherfaces.
Table 1. Recognition rates (%) and numbers of SVMs used by different SVM-based multi-classifiers

Features     Method                    20     25     30     35     39     Number of SVMs
PCA          M-ary Classifier          89.20  89.25  89.50  89.75  89.50  6
PCA          One-Against-One Voting    94.50  95.35  95.70  95.90  95.94  780
PCA          Method of the paper       95.00  96.25  96.70  97.00  97.05  31
Fisherface   M-ary Classifier          90.90  91.30  91.45  91.38  90.62  6
Fisherface   One-Against-One Voting    96.38  96.63  96.75  96.50  96.85  780
Fisherface   Method of the paper       96.88  97.00  97.25  97.33  97.00  31
4 Summary and Further Direction

In this paper an SVM classification algorithm with error correction ability based on the ECOC approach, and its application to face recognition, are presented. The trade-off between computational complexity and error tolerance has been a challenge for multi-classification methods. The simulation experiments on face recognition show that this problem can be addressed by incorporating an error control coding scheme into the SVM classification method. Compared with the most compact M-ary method, the error correction SVM classifier obtains a significant improvement by adding a small fixed number of redundant SVMs. And it achieves even better performance than the most computationally costly One-Against-One voting method, with much lower computational complexity. A question raised here is that by using more SVMs, this classification approach can correct more errors on the one hand, but more binary classifier errors may also occur on the other. This means that the overall performance of the error correction approach will not improve without bound as more and more SVMs are added. What, then, is the balance point for the approach? This is a problem for further study.
Acknowledgement This work was supported by Liaoning Province of China Science and Technology Foundation Grant (20022139).
References
1. Chellappa, R., Wilson, C. L., Sirohey, S.: Human and Machine Recognition of Faces: A Survey. Proceedings of the IEEE 83(5) (1995) 705-741
2. Yang, M. H., Kriegman, D. J., Ahuja, N.: Detecting Faces in Images: A Survey. IEEE Trans. on Pattern Analysis and Machine Intelligence 24(1) (2002) 34-58
3. Turk, M. A., Pentland, A.: Eigenfaces for Recognition. Journal of Cognitive Neuroscience 3(1) (1991) 71-86
C. Wang and C. Guo
4. Belhumeur, P. N., Hespanha, J. P., Kriegman, D. J.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. on PAMI 19(7) (1997) 711-720
5. Vapnik, V.: Statistical Learning Theory. John Wiley and Sons Inc., New York (1998)
6. Sebald, D. J., Bucklew, J. A.: Support Vector Machines and the Multiple Hypothesis Test Problem. IEEE Trans. on Signal Processing 49(11) (2001) 2865-2872
7. Kreßel, U.: Pairwise Classification and Support Vector Machines. In: Schölkopf, B., Burges, C. J. C., Smola, A. J. (eds): Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA (1999) 255-268
8. Dietterich, T., Bakiri, G.: Solving Multiclass Learning Problems via Error-Correcting Output Codes. Journal of Artificial Intelligence Research 2 (1995) 263-286
9. Allwein, E. L., Singer, Y.: Reducing Multiclass to Binary: A Unifying Approach to Margin Classifiers. Journal of Machine Learning Research 1 (2000) 113-141
10. Lin, S., Costello, D. J.: Error Control Coding: Fundamentals and Applications. Prentice-Hall, Inc., Englewood Cliffs, New Jersey (1983) 141-180
11. Guo, G. D., Li, S. Z., Chan, K.: Face Recognition by Support Vector Machines. In: Titsworth, F. M. (ed): Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (2000) 196-201
A Boosting SVM Chain Learning for Visual Information Retrieval Zejian Yuan1 , Lei Yang1 , Yanyun Qu2 , Yuehu Liu1 , and Xinchun Jia3 1
Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, 710049, P.R. China {zjyuan, liuyh}@aiar.xjtu.edu.cn 2 Department of Computer Science, Xiamen University, Xiamen, P.R. China [email protected] 3 Department of Mathematics, Shanxi University, Taiyuan, P.R. China [email protected]
Abstract. A training strategy for negative sample collection and a robust learning algorithm for large-scale sample sets are critical issues for visual information retrieval. In this paper, an improved one class support vector classifier (SVC) and its boosting chain learning algorithm are proposed. Different from the one class SVC, this algorithm considers negative sample information, and integrates bootstrap training and boosting into its learning procedure. The performance of the SVC can be successively boosted by repeated importance sampling of a large negative sample set. Compared with traditional methods, it has the merits of a higher detection rate and a lower false positive rate, and is suitable for object detection and information retrieval. Experimental results show that the proposed boosting SVM chain learning method is efficient and effective.
1 Introduction
The support vector machine (SVM) has been successful as a pattern recognition method. It has strong theoretical foundations and good generalization capability [1]. In general, the existing methods are either two-class or one-class classifiers. In fact, neither of these two kinds of classifiers alone is capable of solving the object detection and information retrieval problem well. Information retrieval is different from the traditional pattern classification problem, and requires discriminant analysis between the object class and its background. Although negative samples (non-objects) from the background are richer and more abundant than positive samples (objects), it is very difficult to describe or define negative samples. Typical negative samples are usually unavailable for building a training set because of the large variance of the non-object class. So a training strategy for negative sample collection and a robust learning algorithm for large-scale sample sets are critical issues for information retrieval and object detection. Recently, active learning has become a popular approach for reducing the sample complexity of large-scale learning tasks. In active learning, the learner
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1063–1069, 2006. © Springer-Verlag Berlin Heidelberg 2006
has the ability to select its own training data. Poggio and Sung [2] proposed a training scheme, called bootstrap, for collecting negative samples. During the bootstrap procedure, false detections are collected iteratively into the training set, and a very low false positive rate can be obtained after several iterations of learning. Mitra et al. [3] proposed a probabilistic active support vector learning algorithm for SVM design in large-data applications. Classification-error-driven methods for incremental support vector learning with multiple samples are also described in [4]. More recently, Xiao et al. [5] proposed a general classification framework, called boosting chain, for object detection; the chain structure is introduced to integrate historical knowledge into successive boosting learning. Cohen et al. [6] discussed training probabilistic classifiers with labeled and unlabeled data, known as semi-supervised learning, for human-computer interaction and pattern recognition tasks such as face detection and facial expression recognition. Also, various one class learning algorithms using only positive samples have been applied to detection problems and information retrieval. Schölkopf et al. [7] proposed the one class SVM for estimating the support of a high-dimensional distribution. The one class support vector classifier (1SVC) for object detection differs from traditional reconstruction based methods such as PCA and other probability distribution models [8], and can find a decision boundary to decide whether or not a test sample is an object. Rätsch et al. [9] proposed boosting algorithms applied to the one-class classification problem. Tax [10] has also proposed a similar 1SVC method named support vector data description (SVDD), and has given good results in image retrieval. These one class classifiers consider only positive samples and ignore negative sample information, so detectors based on one class classifiers have not obtained good performance for retrieving information against cluttered backgrounds.
In this paper, an improved one class support vector classifier and its boosting chain learning algorithm are proposed. Different from the one class SVC, this algorithm considers negative sample information, and integrates bootstrap training and boosting into its learning procedure. Experimental results illustrate the performance of the proposed boosting SVM chain learning method. Section 2 introduces the improved one class SVC model, Section 3 describes the boosting chain learning method in detail, and the last section presents experimental results.
2 Boosting Improved One Class SVC
The one class SVC algorithm constructs a hyperplane in feature space that separates an unlabeled data set from the origin with maximum margin [7]. Different from the one class SVC, the improved one class SVC algorithm considers both positive samples and negative samples collected from a large-scale negative source. Each new negative sample carries a weight determined by the result of the previous classifier. Using all positive samples and the selected weighted negative samples to train the improved one class SVC, the corresponding optimization problem is given as follows:
$$\min_{w_k,\,\rho_k,\,\xi^k}\ \frac{1}{2}\|w_k\|_2^2 - \rho_k + \frac{1}{\nu N}\sum_{i=1}^{N} d_i^k \xi_i^k \qquad \text{s.t.}\ \ y_i\left(\langle w_k, \Phi(x_i)\rangle - \rho_k\right) \ge -\xi_i^k \qquad (1)$$
To solve the QP problem (1), the $k$-th hypothesis hyperplane of the improved one class SVC is given as

$$f_k(x) = \sum_{j \in sv} y_j \alpha_j^k K(x_j, x) + \rho_k \qquad (2)$$
The class probability of each training sample can be obtained from the logistic regression function $P_k(y|x) = \frac{1}{1+\exp(-2yf_k(x))}$. Similar to the real boosting algorithm [11], the weak classifier can be described as

$$h_k(x) = \frac{1}{2}\log\frac{P_k(y=1|x)}{1 - P_k(y=1|x)} \qquad (3)$$
The weight of each training sample is updated as

$$d_i^k \leftarrow d_i^{k-1}\exp\{-y_i h_k(x_i)\} \qquad (4)$$
The output of the final classifier is the composite of all the weak classifiers $h_k(x)$:

$$H(x) = \sum_{k=1}^{M} h_k(x) \qquad (5)$$
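Equations (3)-(5) can be checked numerically. The sketch below assumes raw SVC outputs f (toy values, not trained classifiers); for the logistic probability above, the weak classifier reduces to h_k(x) = f_k(x), and the update (4) gives misclassified samples larger weights:

```python
import numpy as np

def weak_classifier(f):
    """Eq. (3): turn SVC outputs f_k(x) into a real-valued weak classifier via
    the logistic class probability P_k(y=1|x) = 1 / (1 + exp(-2 f_k(x)))."""
    p = 1.0 / (1.0 + np.exp(-2.0 * f))       # P_k(y = 1 | x)
    return 0.5 * np.log(p / (1.0 - p))       # h_k(x); algebraically equal to f

def update_weights(d_prev, y, h):
    """Eq. (4): d_i^k = d_i^{k-1} exp(-y_i h_k(x_i)); misclassified points grow."""
    return d_prev * np.exp(-y * h)

# Toy run: three samples with labels y and raw SVC outputs f.
y = np.array([1.0, 1.0, -1.0])
f = np.array([0.8, -0.2, -0.5])              # sample 2 (index 1) is misclassified
h = weak_classifier(f)
d = update_weights(np.ones(3), y, h)
print(d)                                     # largest weight on the misclassified sample
```

The final decision (5) is then simply the sum of the h_k over all boosting rounds.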
According to the real boosting algorithm, the true positive rate of the classifier is held fixed, while the false positive rate (FPR) decreases successively as new effective negative samples arrive; thus the performance of the improved one class SVC is boosted successively.

2.1 Boosting Chain Learning with Bootstrap and Active SVs
Similar to the boosting cascade [12], the background of the target class is regarded as the source of negative samples. First, a one class SVC is trained using only the positive samples to construct an initial decision boundary, and the boosting chain then uses this 1SVC to evaluate as many negative samples from the negative source as possible. The previous negative SVs and any false positive samples are collected as the negative sample set used to train and boost the improved one class classifier. The whole training procedure for the improved one class SVC is illustrated in Figure 1.

Fig. 1. Boosting improved 1SVC chain learning with bootstrap

Based on the boosting chain structure, the performance of the previous classifier is boosted because the information from new negative samples is considered in the next round of training. The last classifier contains all the positive SVs and the important negative SVs used in the boosting chain. The boosting procedure is very similar to the standard AdaBoost algorithm [13]. Different from AdaBoost, the improved one class SVC boosting chain learning uses one positive sample set and multiple different negative sample sets. In fact, the chain learning strategy with bootstrap can be interpreted as a process in which a large-scale negative sample set is decomposed into many small negative sample sets by adaptive sampling. Most samples in the first stage are classified correctly and receive small weights, while samples that cannot be classified correctly receive large weights. Extending this process to the whole negative training set, the next negative training set is collected by importance sampling according to the weights of the negative samples. Based on the new negative training set and the previous negative SVs, training is continued, and a new classifier is learned after several iterations. With the same strategy, the final classifier is learned as well. The learning algorithm can be described as follows:

1. Initialization: k = 0, initial false positive rate (FPR) F_0 = 1, d_i = 1/p for all positive samples, initial negative support vector set SV_0^- = [], target FPR F for the final detector.
2. Train an initial classifier to meet a high true positive rate (TPR) using only the positive samples with weights d_i, and construct an initial decision boundary.
3. While F_k > F: set k = k + 1, and
   - Evaluate the boosting chain on the negative sample source, and add the false detections to the set N_k.
   - For each sample x_i in N_k, update the weight d_i^k for the next classifier according to equation (4).
   - Collect the negative samples with higher weights, and add the last negative support vector set SV_{k-1}^- to them to construct a negative sample set N.
   - Optimize the k-th classifier according to equation (1) using the full positive sample set P and the new negative sample set N, and obtain a new negative support vector set SV_k^-.
   - Evaluate the FPR F_k of the new classifier.
4. Output the final classifier.
The weights of missed positive samples are set to zero, and the weights of the remaining positive samples are kept unchanged. With this strategy, the boosting chain can be regarded as a variant of the AdaBoost learning algorithm in which each stage adopts an SVM, so the final classifier has good generalization performance.
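The chain learning loop can be sketched as follows. Here train_classifier is a toy stand-in (a mean-centered ball whose radius tightens as negatives accumulate) for the weighted one-class SVC of Eq. (1); it exists only to make the bootstrap mechanics concrete, and all data and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for training the weighted SVC of Eq. (1): the decision
# boundary is a ball around the positive mean whose radius shrinks toward the
# accumulated negative support vectors.
def train_classifier(pos, neg_svs):
    center = pos.mean(axis=0)
    if len(neg_svs) == 0:
        radius = 3.0                                # loose initial boundary
    else:
        radius = np.linalg.norm(neg_svs - center, axis=1).mean()
    return lambda X: np.linalg.norm(X - center, axis=1) < radius

pos = rng.normal(0.0, 1.0, size=(200, 2))           # positive (object) samples
source = rng.normal(0.0, 4.0, size=(2000, 2))       # large negative sample source

clf = train_classifier(pos, np.empty((0, 2)))
neg_svs = np.empty((0, 2))
fpr_history = []
for k in range(8):                                  # boosting chain rounds
    detections = clf(source)
    fpr_history.append(detections.mean())           # FPR on the negative source
    false_pos = source[detections]                  # bootstrap: false detections
    if len(false_pos) == 0:
        break
    neg_svs = np.vstack([neg_svs, false_pos])       # keep previous negative SVs
    clf = train_classifier(pos, neg_svs)

print([round(float(f), 3) for f in fpr_history])    # FPR decreases round by round
```

Each round feeds the false detections back as negatives, which is exactly the bootstrap mechanism described above; in the real algorithm, the reweighting of Eq. (4) and the SVC optimization of Eq. (1) replace the toy radius update.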
3 Experimental Results
In the experiment, the kernel function of the SVM is the radial basis function $K(x_i, x_j) = \exp\left(-\|x_i - x_j\|_2^2 / s\right)$. The kernel parameter s is selected by five-fold cross-validation. We used a training set consisting of 1000 faces, with 20000 non-faces selected randomly from the MIT CBCL face database [14] as the negative sample source, and a test set consisting of 5000 samples from the CBCL test set. In fact, the face class can be viewed as a closed set and the non-face class as an open set, which differs from the traditional classification problem. When the two classes are very imbalanced, a common two-class SVM classifier can be ineffective in determining the class boundary. But a one class support vector classifier trained with the boosting chain learning method can detect faces and reject huge amounts of background information without the need to label the large negative sample sets in advance. We compare with the two class support vector classifier (2SVC) and the one class support vector classifier (1SVC); all experiments are run on a 2.2 GHz Pentium 4 computer. To compare the results of the classifiers, we use receiver operating characteristic (ROC) curves. The ROC curves show, for classification thresholds ranging from 0 to 1, the probability of detecting a face in a face image against the probability of falsely detecting a face in a non-face image. Face detection results are given in Table 1, where BN is the boosting number of the improved one class SVM; when BN is zero, the detector is constructed by the one class SVM alone. The results show that the performance of the detector can be successively improved by boosting chain learning because negative sample information is integrated.

Table 1. To improve the classifier's performance using boosting chain learning

BN   TPR(%)   FPR(%)   TrTim(s)   TsTim(s)
0    78.1     23.0     0.60       0.15
1    77.2      1.5     2.19       2.33
2    76.4      1.0     5.53       2.45
3    76.2      0.27    7.84       2.52
4    75.7      0.17    9.34       2.24

The contrast experimental results of the 1SVC, the two class support
vector classifier (2SVC) and the boosting chain learning SVC (BSVC) are shown in Table 2, and the ROC curves of the different classifiers are given in Figure 2. The face detection results show that the detector based on boosting chain learning has a higher object detection rate and, especially, a lower false positive rate. It is more advantageous when the distribution of the object samples is complex or the background information is rich, because of its feedback mechanism for collecting environmental information. The experimental results show that our boosting chain algorithm can be used to learn good face detectors.
Fig. 2. The ROC of the classifiers (TPR vs. FPR curves for 1SVC, 2SVC, and BSVC)
Table 2. The classifiers' performance contrast on the CBCL data set

Classifier   TPR(%)   FPR(%)   TrTim(s)   TsTim(s)
1SVC         78.1     23.0     0.60       0.15
2SVC         74.9     0.20     5.39       2.25
BSVC         75.7     0.17     9.34       2.24

4 Conclusions
We have proposed an improved one class support vector classifier (SVC) and its boosting chain learning algorithm. In particular, this algorithm considers negative sample information, and integrates bootstrap training and boosting into its learning procedure. The performance of the SVC can be successively boosted by repeatedly sampling a large negative set. Compared with traditional methods, it has the merits of a higher detection rate and a lower false positive rate, and is suitable for object detection and information retrieval. Experimental results show that the proposed boosting SVM chain learning method is efficient and effective. In the future, we will consider integrating other object features (color, texture, filter-based, etc.) and feature selection methods into the boosting chain learning.
Acknowledgments This research was supported by National Innovation Group Foundation of China under grant No.60021302 and the NSF of China No.60205001.
References
1. Vapnik, V.: The Nature of Statistical Learning Theory. 2nd edn. Springer, New York (2000)
2. Poggio, T., Sung, K.K.: Example-Based Learning for View-Based Human Face Detection. Proceedings of the ARPA Image Understanding Workshop (II) (1994) 843-850
3. Mitra, P., Murthy, C.A., Pal, S.K.: A Probabilistic Active Support Vector Learning Algorithm. IEEE Trans. on Pattern Analysis and Machine Intelligence 26(3) (2004) 413-418
4. Mitra, P., Murthy, C.A., Pal, S.K.: Data Condensation in Large Databases by Incremental Learning with Support Vector Machines. Proceedings of ICPR 2000 (2000) 712-715
5. Xiao, R., Zhu, L., Zhang, H.J.: Boosting Chain Learning for Object Detection. Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV) (2003)
6. Cohen, I., Cozman, F.G., Sebe, N., Cirelo, M.C., Huang, T.S.: Semisupervised Learning of Classifiers: Theory, Algorithms, and Their Application to Human-Computer Interaction. IEEE Trans. on Pattern Analysis and Machine Intelligence 26(12) (2004) 1553-1567
7. Schölkopf, B., Platt, J., Taylor, J.S.: Estimating the Support of a High-Dimensional Distribution. Neural Computation 13(7) (2001) 1443-1471
8. Liu, C.J.: A Bayesian Discriminating Features Method for Face Detection. IEEE Trans. on Pattern Analysis and Machine Intelligence 25(6) (2003) 725-740
9. Rätsch, G., Mika, S., Schölkopf, B., Müller, K.-R.: Constructing Boosting Algorithms from SVMs: An Application to One-Class Classification. IEEE Trans. on Pattern Analysis and Machine Intelligence 24(9) (2002) 1184-1198
10. Tax, M.J., Robert, P.W., Kieron, M.: Image Database Retrieval with Support Vector Data Descriptions. Technical Report (1999)
11. Friedman, J., Hastie, T., Tibshirani, R.: Additive Logistic Regression: A Statistical View of Boosting (1999)
12. Viola, P., Jones, M.: Robust Real-Time Object Detection. The 2nd Statistical and Computational Theory of Vision: Modeling, Learning, Computing and Sampling, Vancouver, Canada, July 13 (2001)
13. Schapire, R.E.: A Brief Introduction to Boosting. Proceedings of the Sixteenth International Joint Conference on AI (1999)
14. MIT CBCL Face Database #1. MIT Center for Biological and Computational Learning (2002)
Nonlinear Estimation of Hyperspectral Mixture Pixel Proportion Based on Kernel Orthogonal Subspace Projection Bo Wu1,2, Liangpei Zhang1, Pingxiang Li1, and Jinmu Zhang3 1
State Key Lab of Information Engineering in Surveying, Mapping & Remote Sensing Wuhan University, Wuhan, 430079, China 2 Spatial Information Research Center, Fuzhou University, Fuzhou, 350002, China 3 School of Civil Engineering, East China Institute of Technology, Fuzhou, 344000, China [email protected]
Abstract. A kernel orthogonal subspace projection (KOSP) algorithm is developed in this paper for nonlinear estimation of subpixel proportions. The algorithm applies a linear regression model in the feature space induced by a Mercer kernel, and can therefore be used to recursively construct the minimum mean squared-error regressor. The algorithm includes two steps: the first step selects the feature vectors by defining a global criterion that characterizes the image data structure in the feature space; the second step projects onto the feature vectors and then applies the classical linear regression algorithm. Experiments using synthetic data degraded from an AVIRIS image have been carried out, and the results demonstrate that the proposed method provides excellent proportion estimation for hyperspectral images. Comparisons with support vector regression (SVR) and the radial basis function neural network (RBF) are also given, and the experiments show that the proposed algorithm slightly outperforms RBF and SVR.
1 Introduction

Spectral unmixing is a quantitative analysis procedure used to recognize constituent ground cover materials (or endmembers) and obtain their mixing proportions (or abundances) from a mixed pixel. By modeling the pixel signature in different ways, unmixing methods can generally be grouped into two categories: linear mixture models (LMM) and nonlinear mixture models (NLMM). Both have been widely studied [1-3]. LMM has been widely employed for spectral unmixing analysis because it allows the application of mature mathematical methods, such as least squares estimation (LSE); NLMM is popular for its higher accuracy, although no simple and generic NLMM exists that can be used in all spectral unmixing applications. A natural question is whether we can exploit the nonlinear characteristics of spectral mixing to obtain higher unmixing accuracy while keeping the simplicity of LMM. To meet this requirement, we present a kernel orthogonal subspace projection (KOSP) approach to unmix hyperspectral images.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1070 – 1075, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Linear Mixture Model Analysis

Let $r$ be an $L \times 1$ column image pixel vector in a multispectral or hyperspectral image, where $L$ is the number of spectral bands. Assume that $M$ is an $L \times p$ signature matrix, denoted by $M = [m_1, m_2, \cdots, m_p]$, where $m_i$ is the column vector of the $i$-th image endmember signature resident in the pixel vector $r$, and $p$ is the number of signatures of interest. Let $\alpha = (\alpha_1, \alpha_2, \cdots, \alpha_p)^T$ be the $p \times 1$ abundance column vector whose $i$-th entry is the fraction of the $i$-th signature in the pixel vector $r$. A linear mixture model assumes that the spectral signature of a pixel vector is linearly superimposed by the spectral signatures of the image endmembers $m_1, m_2, \cdots, m_p$ present in the pixel vector $r$, and can be described by

$$r = \sum_{i=1}^{p} \alpha_i m_i + n \qquad (1)$$

where $n$ is an $L \times 1$ column additive noise vector representing a measurement or model error. If $\hat{\alpha}$ is an estimate of $\alpha$, then $\hat{r} = M\hat{\alpha}$ is an estimate of $r$ with a corresponding error $e = r - \hat{r} = r - M\hat{\alpha}$. The model's goodness-of-fit is assessed by the length of $e$, using the sum-of-squared errors (SSE)

$$SSE(\hat{\alpha}) = \|e\|^2 = r^T r - \hat{r}^T \hat{r} = r^T\left(I - M(M^T M)^{-1} M^T\right) r \qquad (2)$$
3 Nonlinear Spectral Mixture Analysis by KOSP

Suppose an L-dimensional input pixel vector $r \in R^L$ is mapped into a high dimensional space $\Re$ by a mapping function $\phi$:

$$\phi : R^L \to \Re, \quad r \mapsto \phi(r) \qquad (3)$$

This high dimensional space $\Re$ is often called the feature space. Thus eq. (1) is typically written in the following form in the feature space:

$$\phi(r_i) = \sum_{i=1}^{p} \alpha_i \phi(m_i) + \phi(n_i) \qquad (4)$$

Consider $M_S = [m_1, m_2, \cdots, m_s]$ and define the kernel matrix $K$ of dot products as

$$K = (k_{ij}), \quad 1 \le i \le s, \ 1 \le j \le s \qquad (5)$$

where $k_{ij} = \phi(m_i)^T \phi(m_j)$. If the inner product $\langle \cdot, \cdot \rangle$ form of eq. (2) is replaced by the kernel $k(x, y)$, it becomes

$$SSE(\hat{\alpha}) = \|e\|^2 = r^T r - \hat{r}^T \hat{r} = k(r, r) - k^T(r, M_S) K^{-1} k(r, M_S) \qquad (6)$$
where $k(r, M_S) = [k(r, m_1), k(r, m_2), \cdots, k(r, m_s)]^T$. To evaluate the error of eq. (6), $M_S$ must be obtained first. An iterative projection method has been developed for selecting these feature vectors in the $\Re$ space so as
to capture the nonlinear structure of the image. Our method is as follows. Initially, we choose as $m_1$ the pixel vector with maximum length, i.e. the brightest pixel in the scene. Then the orthogonal subspace projector specified by eq. (6) with $M_S = m_1$ is applied to all image pixel vectors, and the spectral feature signature $m_2$ with the maximum projection in the orthogonal complement space $\langle m_1 \rangle^{\perp}$, i.e. the space orthogonal to the span of $m_1$, is found. A second spectral feature signature $m_3$ can be found by applying the orthogonal subspace projector $P^{\perp}_{[m_1, m_2]}$ with $M_S = [m_1, m_2]$ to the original image and selecting as $m_3$ the spectral signature with the maximum projection. The above procedure is repeated until all the spectral feature signatures are found or a stopping rule is met. A similar algorithm was proposed by Ren and Chang [4], but their algorithm operates on pixel vectors while ours works on kernel feature vectors. Once the feature vectors are selected, they define a subspace $M_S$ in $\Re$. We can find suitable coefficients
$w = (w_1, w_2, \cdots, w_p)^T$, satisfying the following approximate linear dependence condition for any given pixel vector $r_i$ in the scene:

$$\Big\| \sum_{i=1}^{p} w_i \phi(m_i) - \phi(r_i) \Big\| \le \varepsilon \qquad (7)$$

The selection of $M_S$ can also be viewed as the definition of the hidden layer of a
multi-layer neural network [5]. The number of hidden neurons onto which the data are projected corresponds to the number of feature vectors selected in $\Re$. For a given pixel vector $r_i$, we apply the dot product projection to obtain the coefficients $w = (w_1, w_2, \cdots, w_p)^T$:

$$w = (\Phi_S^T \Phi_S)^{-1} \Phi_S^T \phi(r_i) = K^{-1} k(r_i, M_S) \qquad (8)$$

where $\Phi_S = [\phi(m_1), \cdots, \phi(m_s)]$. An additional regression on $w$ is required to obtain the proportion values. Given a set of training data $(w_i, \alpha_i)$ for the samples $r_i$, the classic LSE technique can be used to estimate the vectors $\hat{\alpha}_i$.
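The whole KOSP pipeline, greedy feature-vector selection by the SSE criterion of Eq. (6), projection coefficients from Eq. (8), and the final LSE regression, can be sketched on toy data. All names and data here are illustrative (a random linear mixture, not the AVIRIS experiment), and the polynomial kernel follows the later experimental setup:

```python
import numpy as np

def kernel(X, Y, d=3):
    """Polynomial kernel of rank d (the kernel family used in the experiments)."""
    return (X @ Y.T + 1.0) ** d

def select_feature_vectors(R, s, d=3):
    """Greedy KOSP-style selection: start from the brightest pixel, then keep
    adding the pixel with the largest kernel-space residual SSE of Eq. (6)."""
    idx = [int(np.argmax(np.linalg.norm(R, axis=1)))]     # brightest pixel first
    for _ in range(s - 1):
        MS = R[idx]
        Kinv = np.linalg.inv(kernel(MS, MS, d))
        kRM = kernel(R, MS, d)                            # k(r, M_S) for all pixels
        sse = np.diag(kernel(R, R, d)) - np.einsum('ij,jk,ik->i', kRM, Kinv, kRM)
        idx.append(int(np.argmax(sse)))                   # max residual pixel
    return idx

rng = np.random.default_rng(0)
# Toy scene: 300 pixels linearly mixed from 3 random endmember spectra (10 bands)
E = rng.random((3, 10))
abund = rng.dirichlet(np.ones(3), size=300)
R = abund @ E + rng.normal(0, 1e-3, (300, 10))

idx = select_feature_vectors(R, 6)
MS = R[idx]
Kinv = np.linalg.inv(kernel(MS, MS))
W = kernel(R, MS) @ Kinv                  # Eq. (8): w = K^{-1} k(r, M_S), per pixel

train = slice(0, 150)                     # LSE regression from w to abundances
B, *_ = np.linalg.lstsq(W[train], abund[train], rcond=None)
pred = W @ B
rmse = np.sqrt(np.mean((pred[150:] - abund[150:]) ** 2, axis=0))
print(np.round(rmse, 4))                  # per-endmember RMSE on held-out pixels
```

The greedy selection plays the role of the hidden-layer definition discussed above, and the final linear map B is the classic LSE step applied to the projection coefficients.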
4 Experiments

An Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) dataset of Moffett Field, California, was selected to synthesize imagery and test the algorithm. The dataset consists of 12-bit digital numbers with 512 × 614 pixels and 224 spectral bands per scene. Figure 1 shows a sub-scene (350 × 350) with 90 bands extracted from the original scene; its wavelengths range from 0.468 μm to 1.284 μm.
Fig. 1. False color scene of AVIRIS. Bands 35, 24, and 5 are displayed as RGB.
Synthetic images (70 × 70) were produced by degrading the scene to a 5:1 scale using an average filter. Synthetic imagery has the advantage of lacking co-registration and radiometric correction errors between the lower and higher resolution images. Degrading a hard classification yields fractional images for each class in the classification. The resulting fractional image does not contain any uncertainty, as it originates from degradation rather than from a classification process; consequently, the sub-pixel proportion estimation error solely reflects the performance of the proposed algorithm. Validation is facilitated, since the original hard classifications can be used as reference material. Three measures of performance were used: root mean square error (RMSE), bivariate distribution functions (BDF) between the real

Table 1. The relationship between the multinomial rank, the number of feature vectors, and the corresponding root mean square errors
Rank d   Feature Vectors (FVs)   RMSE (water)   RMSE (soil)   RMSE (vegetation)
1        9                       0.0772         0.1359        0.1346
2        20                      0.0305         0.0830        0.0839
3        33                      0.0291         0.0793        0.0805
4        52                      0.0296         0.0835        0.0881
5        67                      0.0632         0.0968        0.0980
6        90                      0.0808         0.1303        0.1323
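The 5:1 degradation that produces the reference fractions can be sketched as block averaging of a hard classification map. The labels below are random toy data and the function name is illustrative:

```python
import numpy as np

def degrade(labels, scale=5, n_classes=3):
    """Block-average a hard classification map by `scale`:1 to obtain
    per-class reference fractions for each low-resolution pixel."""
    H, W = labels.shape
    h, w = H // scale, W // scale
    blocks = labels[:h * scale, :w * scale].reshape(h, scale, w, scale)
    # Fraction of each class inside each scale-by-scale block
    fractions = np.stack([(blocks == c).mean(axis=(1, 3)) for c in range(n_classes)])
    return fractions                              # shape (n_classes, h, w)

rng = np.random.default_rng(0)
hard = rng.integers(0, 3, size=(350, 350))        # toy 3-class hard classification
frac = degrade(hard)                              # -> (3, 70, 70) fraction maps
print(frac.shape)
```

By construction, the three fraction maps sum to one at every low-resolution pixel, which is why the degraded reference contains no classification uncertainty.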
Fig. 2. Graph shows the percentage of pixels lying within a given bound of actual class proportion for KOSP model
and estimated subpixel proportions, and an error bound. The BDF helps visualize the prediction accuracies of the mixture models, while the RMSE evaluates the total accuracy. In this experiment, the kernel function was chosen as the polynomial kernel and the parameter ε was set to 0.001. 500 pixels (about 10%) were randomly selected from the degraded image as training samples. It can be seen from Table 1 that the number of feature vectors grows as the multinomial rank d increases. When d equals 3, the algorithm obtains the best results, with RMSEs for water, soil and vegetation of 0.0291, 0.0793 and 0.0805 respectively.
Fig. 3. BDF of the synthetic test data, from left to right is water, vegetation and soil; from top to bottom is RBF, SVR and KOSP algorithm. The RMSE of each material is listed on the upper left corner of each chart, respectively.
These error distribution patterns are also shown in Figure 2 in a different format: for each material, the graph indicates how many predictions fall within a given percentage of the field measurement. Comparisons with other widely used kernel based methods, specifically support vector regression (SVR) and the radial basis function neural network (RBF), have also been carried out on this dataset. Prior research has shown that RBF is a useful tool for abundance estimation in hyperspectral imagery [6], and SVR is newly developed and one of the most effective methods for regression [7]. Figure 3 demonstrates that all three algorithms perform very well; the RMSEs of all materials are less than 0.09. Among them, the water and soil fractions of our algorithm have the smallest values, but RBF has the smallest RMSE for the vegetation fraction. As a whole, this experiment shows that the proposed algorithm slightly outperforms RBF and SVR.
5 Conclusions

This paper has presented a kernel-based least squares mixture model that exploits the nonlinear characteristics of spectral mixing to obtain higher unmixing accuracy. On the simulated data, the proposed method achieves good results for all three materials, with RMS errors below 0.09. The comparative experiments show that the proposed algorithm slightly outperforms RBFNN and SVR.
References
1. Cross, A.M., Settle, J.J., Drake, N.A., Paivinen, R.T.: Subpixel Measurement of Tropical Forest Cover Using AVHRR Data. Int. J. Remote Sensing 12(5) (1991) 1119-1129
2. Hu, Y.H., Lee, H.B., Scarpace, F.L.: Optimal Linear Spectral Unmixing. IEEE Trans. Geosci. Remote Sensing 37(1) (1999) 639-644
3. Mustard, J.F., Li, L., He, G.: Nonlinear Spectral Mixture Modeling of Lunar Multispectral Data: Implications for Lateral Transport. J. Geophys. Res. 103(E8) (1998) 19419-19425
4. Ren, H., Chang, C.-I.: A Generalized Orthogonal Subspace Projection Approach to Unsupervised Multispectral Image Classification. IEEE Trans. Geosci. Remote Sensing 39(8) (2000) 2515-2528
5. Anouar, F., Badran, F., Thiria, S.: Probabilistic Self Organizing Map and Radial Basis Function. Neurocomput. 20(8) (1998) 83-96
6. Guilfoyle, K.J., Althouse, M.L., Chang, C.-I.: A Quantitative and Comparative Analysis of Linear and Nonlinear Spectral Mixture Models Using Radial Basis Function Neural Networks. IEEE Trans. Geosci. Remote Sensing 39(8) (2001) 2314-2318
7. Vapnik, V.N.: The Nature of Statistical Learning Theory. 2nd edn. Springer-Verlag, Berlin Heidelberg New York (1999)
A New Proximal Support Vector Machine for Semi-supervised Classification

Li Sun1, Ling Jing1, and Xiaodong Xia2

1 College of Science, China Agricultural University, Beijing 100083, P.R. China
{Slsally, Jingling_aaa}@163.com
2 Institute of Nonlinear Science, Academy of Armored Force Engineering, Beijing 100072, P.R. China
[email protected]
Abstract. The proximal support vector machine (PSVM) was proposed as an alternative to SVM; it leads to an extremely fast and simple algorithm that solves a single system of linear equations. However, the result of PSVM is sometimes inaccurate, especially when the training set is small and inadequate. In this paper, a new PSVM for semi-supervised classification (PS3VM) is introduced that constructs the classifier using both the training set and the working set. PS3VM utilizes the additional information of the unlabeled samples from the working set and achieves better classification performance than PSVM when insufficient training information is available. The proposed PS3VM model is no longer a quadratic programming (QP) problem, so a new algorithm is derived for it. Our experimental results show that PS3VM yields better performance.
1 Introduction The support vector machine (SVM), based on the statistical learning theory of Vapnik [1-2], is a relatively new and powerful classification technique that has drawn much attention in recent years [3-6]. The optimal classifier is obtained by solving a quadratic programming (QP) problem. The proximal support vector machine (PSVM) [7] was proposed as an alternative to SVM; it leads to an extremely fast and simple algorithm that solves a single system of linear equations. The PSVM formulation greatly simplifies the problem, with considerably faster computation than SVM. Despite these computationally attractive features, however, PSVM sometimes exhibits poor generalization, especially when the training set is small or when there is a significant distribution deviation between the training and working sets. This paper deals with that problem. We briefly outline the contents of the paper. In Section 2 we first state the PSVM problem and then formulate the proposed PS3VM; an algorithm for it is derived. In Section 3, experimental results confirm that PS3VM yields better performance than PSVM when insufficient training information is available. Finally, conclusions are given in Section 4. J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1076 – 1082, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 A New PSVM for Semi-supervised Classification

2.1 Proximal Support Vector Machine

Assume that a training set S is given as

S = {(x_1, y_1), ..., (x_l, y_l)}    (1)

where x_i ∈ R^n and y_i ∈ {−1, +1}. The goal of SVM is to find an optimal separating hyperplane

w′x − b = 0    (2)
that classifies the training samples correctly or nearly so, where w ∈ R^n and the scalar b ∈ R. Finding the optimal separating hyperplane amounts to solving the following constrained optimization problem:

min_{(w,b,ξ) ∈ R^{n+1+l}}  (1/2) w′w + C e′ξ
s.t.  D(Aw − eb) + ξ ≥ e,  ξ ≥ 0    (3)

where ξ = (ξ_1, ξ_2, ..., ξ_l)′, ξ_i is a slack variable (i = 1, 2, ..., l), e denotes a column vector of ones of arbitrary dimension, A = (x_1, x_2, ..., x_l)′, D is a diagonal matrix whose entries are given by D_ii = y_i, and C > 0 is a fixed penalty parameter for the labeled samples; it controls the tradeoff between the complexity of the machine and its generalization capacity.
PSVM modifies the SVM formulation by maximizing the separating margin 1/√(w′w + b²) in the space R^{n+1} and by measuring the slack variables ξ with the L2 norm instead of the L1 norm. Note that the nonnegativity constraint on ξ in (3) is then no longer needed. Furthermore, a fundamental change is the replacement of the inequality constraint by an equality constraint. This leads to the following optimization problem:

min_{(w,b,ξ) ∈ R^{n+1+l}}  (1/2)(w′w + b²) + (C/2)‖ξ‖²
s.t.  D(Aw − eb) + ξ = e    (4)
This modification not only adds advantages such as strong convexity of the objective function, but also changes the nature of the optimization problem significantly. The planes w′x − b = ±1 are no longer bounding planes; instead, they can be thought of as "proximal" planes, around which the points of the corresponding class are clustered. The PSVM formulation greatly simplifies the problem and generates a classifier by merely solving a single system of linear equations. However, the result of PSVM is sometimes inaccurate when the training set is inadequate or when there is a significant deviation between the overall distributions of the training and working sets.
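Because the equality-constrained problem (4) has a closed-form solution (the paper derives it below as Eq. (9), which reduces to plain PSVM when there are no unlabeled samples), a linear PSVM amounts to one linear solve. A minimal sketch under that reading; the function name and toy data are ours:

```python
import numpy as np

def train_psvm(A, y, C=1.0):
    """Linear PSVM: solve a single linear system for (w, b).

    A: (l, n) sample matrix, y: (l,) labels in {-1, +1}.
    Decision function: sign(w @ x - b).
    """
    l = A.shape[0]
    D = np.diag(y.astype(float))
    e = np.ones(l)
    H = D @ np.hstack([A, -e[:, None]])              # H = D [A  -e]
    alpha = np.linalg.solve(np.eye(l) / C + H @ H.T, e)
    w = A.T @ D @ alpha
    b = -e @ D @ alpha
    return w, b

# Toy linearly separable data (illustrative only)
rng = np.random.default_rng(0)
A = np.vstack([rng.normal(2.0, 0.3, size=(20, 2)),
               rng.normal(-2.0, 0.3, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)
w, b = train_psvm(A, y, C=10.0)
pred = np.sign(A @ w - b)
print((pred == y).mean())  # should be high on well-separated clusters
```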
1078
L. Sun, L. Jing, and X. Xia
2.2 A New PSVM for Semi-supervised Classification
In many real applications, the number of labeled samples in the training set is relatively small, and they cannot describe the global distribution well. As mentioned above, despite its computationally attractive features, the accuracy of PSVM is sometimes low and unsatisfactory. The main reason is that it neglects the additional information in the unlabeled samples of the working set. Hence the central idea of the proposed PS3VM is to construct a classifier using both the training and working sets, so that the helpful information concealed in the unlabeled samples can be transferred into the final classifier during the semi-supervised learning process. Assume that a training set S of labeled samples and a working set S* of unlabeled samples are given as follows:

S = {(x_1, y_1), ..., (x_l, y_l)},  S* = {x*_1, x*_2, ..., x*_m}    (5)
The learning process of PS3VM can be formulated as the following optimization problem:

min over (y*_1, ..., y*_m, w, b, ξ_1, ..., ξ_l, ξ*_1, ..., ξ*_m):
  (1/2)(w′w + b²) + (C/2)‖ξ‖² + (C*/2)‖ξ*‖²
s.t.  D(Aw − eb) + ξ = e
      D*(A*w − eb) + ξ* = e    (6)

where ξ* = (ξ*_1, ξ*_2, ..., ξ*_m)′, A* = (x*_1, x*_2, ..., x*_m)′, D* is a diagonal matrix whose entries are given by D*_ii = y*_i, and C* > 0. Solving this problem means finding (w, b) together with labels y*_1, y*_2, ..., y*_m for the unlabeled samples of the working set, so that the hyperplane w′x − b = 0 separates both the training and working samples with maximum margin. Obviously, problem (6) is no longer a QP problem, since the second equality constraint is nonlinear. To simplify it, we first fix the values of y*_1, y*_2, ..., y*_m (how to assign proper values is described below); the problem is then converted into a QP problem with a strongly convex objective function. Supposing the values of y*_1, y*_2, ..., y*_m are given, we construct the Lagrangian

L(w, b, ξ, ξ*, α, α*) = (1/2)(w′w + b²) + (C/2)‖ξ‖² + (C*/2)‖ξ*‖²
  − α′[D(Aw − eb) + ξ − e] − α*′[D*(A*w − eb) + ξ* − e]    (7)

where α = (α_1, α_2, ..., α_l)′ and α* = (α*_1, α*_2, ..., α*_m)′ are the vectors of nonnegative Lagrange multipliers of problem (6). The KKT necessary and sufficient optimality conditions are obtained by setting the gradients of L to zero. We thus obtain
∂L/∂w = 0 ⇒ w − A′Dα − A*′D*α* = 0
∂L/∂b = 0 ⇒ b + e′Dα + e′D*α* = 0
∂L/∂ξ = 0 ⇒ Cξ − α = 0
∂L/∂ξ* = 0 ⇒ C*ξ* − α* = 0
∂L/∂α = 0 ⇒ D(Aw − eb) + ξ − e = 0
∂L/∂α* = 0 ⇒ D*(A*w − eb) + ξ* − e = 0    (8)
Solving (8), we obtain

α = [C + HH′]^{−1} e,  w = A′Dα,  b = −e′Dα,  and  ξ = Cα    (9)

where, with a slight abuse of notation, α, ξ, A, C, D and H now denote the stacked quantities

α = (α′, α*′)′,  ξ = (ξ′, ξ*′)′,  A = (A′, A*′)′,
C = diag((1/C)I, (1/C*)I),  D = diag(D, D*)

with C and D matrices of dimension (l+m) × (l+m), and H = D[A  −e]. Therefore any input x can easily be classified according to the decision function sgn(f(x)) = sgn(w′x − b). By using a kernel function, PS3VM can obtain an optimal nonlinear classifier in the same way. It is easy to prove that the proposed PS3VM model reduces to PSVM when the number of unlabeled samples equals zero.
The following theorem can be used to judge whether the value of y*_j assigned to the unlabeled sample x*_j is proper.

Theorem 1. If an unlabeled sample x*_j is labeled positive or negative and its slack variable satisfies ξ*_j > 1, then changing the label of this sample decreases the objective function.

Proof. The penalty term for the unlabeled samples can be written as
(C*/2)‖ξ*‖² = (C*/2) Σ_{i=1}^{m} ξ*_i²    (10)

Assume that a positively labeled sample x*_j is changed to a negatively labeled one. Then the penalty term (10) becomes

(C*/2) Σ_{i≠j} ξ*_i² + (C*/2)(2 − ξ*_j)²    (11)
Combining ξ*_j > 1 with Eq. (11), we get

(C*/2) Σ_{i≠j} ξ*_i² + (C*/2)(2 − ξ*_j)² < (C*/2) Σ_{i≠j} ξ*_i² + (C*/2) ξ*_j² = (C*/2) Σ_{i=1}^{m} ξ*_i²    (12)
This result means that the objective function decreases after the positive label is changed. The case in which an unlabeled sample x*_j labeled negative is changed to a positive label can be proved similarly. This completes the proof.

Based on Theorem 1, we outline the new PS3VM algorithm as follows:
Step 1: Specify the parameters C and C*. Train an initial PS3VM classifier using all labeled samples from the training set and use it to label the unlabeled samples of the working set. Set the temporary penalty factor C*_temp = 10^{-5}.
Step 2: If C*_temp < C*, repeat Steps 3-5; otherwise go to Step 6.
Step 3: Retrain the PS3VM classifier with all available samples from the training and working sets. If there exists a sample x*_j satisfying ξ*_j > 1, select the sample with the largest ξ*_j among them; otherwise go to Step 5.
Step 4: Change the label of the sample x*_j selected in Step 3 and return to Step 3.
Step 5: Set C*_temp = min{2C*_temp, C*} and return to Step 2.
Step 6: Output the labels of the samples from the working set and terminate.
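The steps above can be sketched in Python. This is a hypothetical sketch, not the authors' code: `train_weighted_psvm` solves Eq. (9) with per-sample penalties, all names and toy parameters are ours, and a flip-count guard is added against numerical cycling:

```python
import numpy as np

def train_weighted_psvm(X, y, costs):
    """Solve Eq. (9) with per-sample penalties (labeled C, unlabeled C*_temp).

    Returns (w, b) and the slacks xi = C alpha (block form), here alpha/costs.
    """
    n = X.shape[0]
    D = np.diag(y.astype(float))
    e = np.ones(n)
    H = D @ np.hstack([X, -e[:, None]])
    alpha = np.linalg.solve(np.diag(1.0 / costs) + H @ H.T, e)
    return X.T @ D @ alpha, -e @ D @ alpha, alpha / costs

def ps3vm(X_l, y_l, X_u, C=10.0, C_star=10.0):
    """Sketch of Steps 1-6 of the PS3VM algorithm above."""
    w, b, _ = train_weighted_psvm(X_l, y_l, np.full(len(y_l), C))
    y_u = np.sign(X_u @ w - b)                  # Step 1: initial pseudo-labels
    y_u[y_u == 0] = 1.0
    c_temp = 1e-5
    X = np.vstack([X_l, X_u])
    while c_temp < C_star:                      # Step 2
        for _ in range(100):                    # Steps 3-4 (guard vs. cycling)
            y_all = np.concatenate([y_l, y_u])
            costs = np.concatenate([np.full(len(y_l), C),
                                    np.full(len(y_u), c_temp)])
            w, b, xi = train_weighted_psvm(X, y_all, costs)
            xi_u = xi[len(y_l):]
            j = int(np.argmax(xi_u))
            if xi_u[j] <= 1.0:                  # no sample violates Theorem 1
                break
            y_u[j] = -y_u[j]                    # Step 4: flip the worst label
        c_temp = min(2.0 * c_temp, C_star)      # Step 5
    return w, b, y_u                            # Step 6

# Toy demo: 4 labeled points, 40 unlabeled points in two separated clusters
rng = np.random.default_rng(1)
X_l = np.array([[2.0, 2.0], [2.2, 1.8], [-2.0, -2.0], [-1.8, -2.2]])
y_l = np.array([1.0, 1.0, -1.0, -1.0])
X_u = np.vstack([rng.normal(2.0, 0.3, (20, 2)),
                 rng.normal(-2.0, 0.3, (20, 2))])
w, b, y_u = ps3vm(X_l, y_l, X_u)
```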
3 Experimental Results Two experiments, on an artificial dataset and a real-world dataset, were carried out; the results are given in Sections 3.1 and 3.2, respectively.

3.1 Artificial Dataset
A simple, linearly separable two-dimensional point set, generated artificially, is shown in Fig. 1. The goal of this dataset is to demonstrate the superiority of PS3VM more intuitively. There are 10 labeled samples in the training set: the positive samples are represented by "o" and the negative samples by "△". The 90 unlabeled samples in the working set are marked by "+". In this experiment, accuracy is used as the measure of performance. The classification result of PSVM is shown in Fig. 1, where the dashed line represents the separating hyperplane and the two parallel "proximal" hyperplanes are denoted by dotted lines. Because the number of training samples is small, they do not describe the global distribution well. The classification accuracy over all available samples is only 93%, with seven misclassified samples highlighted in the figure. Fig. 2 illustrates the training process of PS3VM. First, PS3VM trains an initial classifier, represented by a dashed line (the same one as in Fig. 1). It then retrains the classifier with all the samples according to the PS3VM algorithm above, rectifies the
improper label values of unlabeled samples, and dynamically adjusts the classifiers toward the true one. Finally, the classifier found by PS3VM is represented by a solid line, with two parallel "proximal" hyperplanes denoted by dotted lines. As depicted in Fig. 2, the classification accuracy of PS3VM is 100%, much higher than that of PSVM.

Fig. 1. Training with PSVM

Fig. 2. Training with PS3VM
3.2 Iris Dataset
In the second experiment, we use the well-known Iris dataset introduced by R.A. Fisher, a collection of 150 Iris flowers of 3 kinds with four attributes: sepal and petal width and length in cm. The three classes are Iris-setosa, Iris-versicolor and Iris-virginica. We choose 5 different proportions of labeled samples in the dataset as the training set, with the remaining samples as the working set, and classify each Iris class vs. the others. Table 1 shows that PS3VM yields better performance than PSVM. When the proportion of labeled samples in the dataset is relatively low, the accuracy of PSVM is unsatisfactory. When the proportion increases to 100%, there is no difference between PSVM and PS3VM. Note, however, that the performance of PS3VM decreases as the proportion of overlapping samples from different classes increases.
4 Conclusions In this paper, we propose a new PS3VM to deal with the inaccuracy of PSVM when the training set is small or inadequate. PS3VM constructs a classifier using both the training and working sets, utilizing the additional information of the unlabeled samples. A theoretical proof of the label-change criterion is given. By gradually rectifying misclassified samples and adjusting the classifier toward the true one, the final classifier is found using all the available information via the PS3VM algorithm. Furthermore, an artificial problem and a real-world problem are solved by PS3VM. The experimental results show that the new PS3VM achieves better performance and greatly improves classification accuracy compared with PSVM.
Table 1. The classification accuracy (%) comparison of PSVM with PS3VM

Proportion of          Iris-setosa          Iris-versicolor      Iris-virginica
labeled samples (%)    vs. others           vs. others           vs. others
                       PSVM     PS3VM       PSVM     PS3VM       PSVM     PS3VM
20                     76.52    95.61       70.58    94.98       63.21    78.84
40                     82.17    98.2        78.25    95.59       70.26    82.01
60                     86.69    100         80.36    95.61       77.63    85.53
80                     90.32    100         86.21    95.86       82.15    87.62
100                    100      100         95.86    95.86       90.92    90.92
Acknowledgement This work is supported by the NNSF of China (No.10371131).
References
1. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)
2. Vapnik, V.: Statistical Learning Theory. John Wiley and Sons, New York (1998)
3. Cortes, C., Vapnik, V.: Support Vector Networks. Machine Learning 20(3) (1995) 273–297
4. Joachims, T.: Transductive Inference for Text Classification Using Support Vector Machines. In: Bratko, I., Dzeroski, S. (eds.): Proceedings of the 16th International Conference on Machine Learning. Morgan Kaufmann Publishers, San Francisco, US (1999) 200–209
5. Bennett, K.: Combining Support Vector and Mathematical Programming Methods for Classification. In: Scholkopf, B., Burges, C., Smola, A. (eds.): Advances in Kernel Methods—Support Vector Learning. MIT Press, Cambridge, MA (1998)
6. Blum, A., Mitchell, T.: Combining Labeled and Unlabeled Data with Co-training. In: Bartlett, P., Mansour, Y. (eds.): Annual Conference on Computational Learning Theory (COLT'98). ACM Press, New York, USA (1998) 92–100
7. Fung, G., Mangasarian, O.L.: Proximal Support Vector Machine Classifiers. In: Lee, F.P.D., Srikant, R. (eds.): Knowledge Discovery and Data Mining. Association for Computing Machinery, San Francisco, CA, New York (2001) 77–86
Sparse Gaussian Processes Using Backward Elimination

Liefeng Bo, Ling Wang, and Licheng Jiao

Institute of Intelligent Information Processing and National Key Laboratory for Radar Signal Processing, Xidian University, Xi'an 710071, China
{blf0218, wliiip}@163.com
Abstract. Gaussian processes (GPs) offer state-of-the-art performance in regression. In GPs, all the basis functions are required for prediction; hence their test speed is slower than that of other learning algorithms such as support vector machines (SVMs), the relevance vector machine (RVM), adaptive sparseness (AS), etc. To overcome this limitation, we present a backward elimination algorithm, called GPs-BE, that recursively selects the basis functions for GPs until some stopping criterion is satisfied. By integrating rank-1 updates, GPs-BE can be implemented at a reasonable cost. Extensive empirical comparisons confirm the feasibility and validity of the proposed algorithm.
1 Introduction Covariance functions have a great effect on the performance of GPs. The experiments performed by Williams [1] and Rasmussen [2] have shown that the following covariance function works well in practice:

C(x_i, x_j) = exp( −Σ_{p=1}^{d} θ_p (x_ip − x_jp)² )    (1.1)
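A sketch of how (1.1) can be evaluated over a batch of inputs; the broadcasting approach, function name, and toy values are ours:

```python
import numpy as np

def ard_covariance(X1, X2, theta):
    """Covariance (1.1): C(x_i, x_j) = exp(-sum_p theta_p (x_ip - x_jp)^2).

    theta holds one scaling factor per input dimension, so an irrelevant
    input with a small theta_p is effectively ignored.
    """
    diff = X1[:, None, :] - X2[None, :, :]            # shape (n1, n2, d)
    return np.exp(-np.einsum('ijk,k->ij', diff ** 2, theta))

X = np.array([[0.0, 0.0], [1.0, 5.0]])
theta = np.array([1.0, 0.0])   # second input irrelevant (theta_2 = 0)
K = ard_covariance(X, X, theta)
print(K)  # off-diagonal entries equal exp(-1): only the first coordinate counts
```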
where θ_p is a scaling factor. If some variable is unimportant or irrelevant for regression, the associated scaling factor will be made small; otherwise it will be made large. The key advantage of GPs is that the hyperparameters of the covariance function can be optimized by maximizing the evidence. This is not available in other kernel-based learning methods such as support vector machines (SVMs) [3]. In SVMs, an extra model selection criterion, e.g. a cross-validation score, is required for choosing hyperparameters, which is intractable when a large number of hyperparameters are involved. Though GPs are very successful, they also have some shortcomings: (1) the computational cost of GPs is O(l³), where l is the number of training samples, which seems to prohibit the application of GPs to large datasets; (2) all the basis functions are required for prediction, hence the test speed is slower than that of other learning algorithms such as SVMs, the relevance vector machine (RVM) [4], adaptive sparseness (AS) [5], etc. Some researchers have tried to address these shortcomings. In 2000, Smola et al. [6] presented sparse greedy Gaussian processes (SGGPs), whose computational J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1083 – 1088, 2006. © Springer-Verlag Berlin Heidelberg 2006
cost is O(kn²l), where n is the number of basis functions and k is a constant factor. In 2002, Csató et al. proposed sparse online Gaussian processes (SOGPs) [7], which achieve good sparseness and low complexity simultaneously. However, both SGGPs and SOGPs throw away the key advantage of GPs; as a result, they have difficulty handling the hyperparameters. This paper focuses on the second shortcoming of GPs above. We propose a backward elimination algorithm (GPs-BE) that recursively removes the basis function with the smallest leave-one-out score at the current step until some stopping criterion is satisfied. GPs-BE has reasonable computational complexity thanks to a rank-1 update formula. GPs-BE is performed after the GP is trained; hence all the advantages of GPs are preserved. Extensive empirical comparisons show that our method greatly reduces the number of basis functions of GPs almost without sacrificing performance.
2 Gaussian Processes

Let Z = {(x_i, y_i)}_{i=1}^{l} be a set of l empirical samples drawn from

y_i = f(x_i, w) + ε_i,  i = 1, 2, ..., l    (2.1)

where ε_i is an independent sample from some noise process, further assumed to be zero-mean Gaussian with variance σ². We further assume

f(x, w) = Σ_{i=1}^{l} w_i C(x, x_i)    (2.2)
According to Bayesian inference, the posterior probability of w can be expressed as

P(w | Z) = P(Z | w) P(w) / P(Z)    (2.3)
Maximizing the log-posterior is equivalent to minimizing the following objective function:

ŵ = arg min_w L(w, λ) = arg min_w ( w^T (C^T C + σ² I) w − 2 w^T C^T y )    (2.4)
where I is the identity matrix. Hyperparameters are chosen by maximizing the following evidence:

P(θ, σ² | Z) = (2π)^{−l/2} |σ²I + CC^T|^{−1/2} exp( −(1/2) y^T (σ²I + CC^T)^{−1} y )    (2.5)

In related Bayesian models, this quantity is known as the marginal likelihood, and its maximization is known as the type-II maximum likelihood method [8]. Williams [9] has demonstrated that this model is equivalent to a Gaussian process (GP) with covariance (σ²I + CC^T); hence we call it GPs in this paper.
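In practice one maximizes the log of (2.5). A sketch that evaluates it via a Cholesky factorization for numerical stability (the Cholesky route and all names are our implementation choices, not specified in the paper):

```python
import numpy as np

def log_evidence(C, y, sigma2):
    """Log of Eq. (2.5): log marginal likelihood with K = sigma^2 I + C C^T."""
    l = len(y)
    K = sigma2 * np.eye(l) + C @ C.T
    L = np.linalg.cholesky(K)                        # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))           # log |K|
    return -0.5 * (l * np.log(2 * np.pi) + log_det + y @ alpha)

# Sanity check against the direct (less stable) formula on a toy case
rng = np.random.default_rng(1)
C = rng.normal(size=(5, 5))
y = rng.normal(size=5)
K = 0.1 * np.eye(5) + C @ C.T
direct = -0.5 * (5 * np.log(2 * np.pi) + np.log(np.linalg.det(K))
                 + y @ np.linalg.solve(K, y))
print(np.isclose(log_evidence(C, y, 0.1), direct))  # → True
```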
3 Backward Elimination for Gaussian Processes

In GPs, all the basis functions are used for prediction; GPs are therefore inferior to neural networks, SVMs and RVM in testing speed, which seems to prohibit their application in some fields. Here, GPs-BE is proposed to overcome this problem: it selects the basis functions by a backward elimination technique after the training procedure. GPs-BE is a backward greedy algorithm that recursively removes the basis function with the smallest leave-one-out score at the current step until some stopping criterion is satisfied. For convenience of derivation, we rewrite the solution of (2.4) as

w = H^{−1} b    (3.1)

where H = (C^T C + σ²I) and b = C^T y. Let Δf^(k) be the increment of L when the k-th basis function (training sample) is deleted; then the following theorem holds.

Theorem 3.1: Δf^(k) = (w_k)² / R_kk, where R = H^{−1} and R_kk denotes the k-th diagonal element of H^{−1}.

We call Δf^(k) the leave-one-out score. At each step, we remove the basis function with the smallest leave-one-out score. The index of the basis function to be deleted is obtained by

s = arg min_{k ∈ P} ( Δf^(k) )    (3.2)

where P is the set of indices of the remaining basis functions. Note that the (l+1)-th variable, i.e. the bias, is preserved during the backward elimination process. When one basis function is deleted, we must update the matrix R and the vector w. In terms of a rank-1 update, R^(s) and w^(s) can be formulated as

(R^(s))_ij = R_ij − R_is R_sj / R_ss,  i, j ≠ s    (3.3)

(w^(s))_i = Σ_{j≠s} ( R_ij − R_is R_sj / R_ss ) b_j,  i ≠ s    (3.4)

Together with w = Rb, (3.4) simplifies to

(w^(s))_i = w_i − w_s R_is / R_ss,  i ≠ s    (3.5)

Suppose that Δ_t is the increment of f at the t-th iteration; we terminate the backward elimination procedure if

Δ_t ≤ ε f    (3.6)

where we set ε = 0.01. The detailed backward elimination procedure is summarized in Fig. 3.1.
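The rank-1 update (3.3) can be checked numerically: after applying it and dropping row and column s, R matches the inverse of H with row and column s removed. A small sketch (toy matrices, names ours):

```python
import numpy as np

# Eq. (3.3): deleting basis function s updates R = H^{-1} by a rank-1
# correction; the result matches inverting H with row/column s removed.
rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
H = M @ M.T + 0.5 * np.eye(6)      # symmetric positive definite, like C^T C + sigma^2 I
R = np.linalg.inv(H)

s = 2
R_s = R - np.outer(R[:, s], R[s, :]) / R[s, s]         # rank-1 update
R_s = np.delete(np.delete(R_s, s, axis=0), s, axis=1)  # drop row/column s

keep = [i for i in range(6) if i != s]
R_direct = np.linalg.inv(H[np.ix_(keep, keep)])
print(np.allclose(R_s, R_direct))  # → True
```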
Algorithm 1: GPs-BE
1. Compute the index of the basis function to be removed by (3.2);
2. Update the matrix R and the vector w by (3.3) and (3.5);
3. Remove the index resulting from Step 1;
4. If (3.6) is satisfied, stop; otherwise, go to Step 1.

Fig. 3.1. Flow chart of backward elimination
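Algorithm 1 can be sketched as follows. This is our hypothetical sketch, not the authors' code: since the inequality in (3.6) appears garbled in the text, we assume the intended test is to stop once the smallest leave-one-out score exceeds ε times the current objective magnitude, and all names and toy data are ours:

```python
import numpy as np

def gps_be(C, y, sigma2, eps=0.01):
    """Backward elimination of basis functions via leave-one-out scores
    (Theorem 3.1) and the rank-1 updates (3.3) and (3.5)."""
    l = C.shape[1]
    H = C.T @ C + sigma2 * np.eye(l)
    b = C.T @ y
    R = np.linalg.inv(H)
    w = R @ b
    f_val = w @ H @ w - 2 * w @ b          # objective (2.4) at the optimum
    keep = list(range(l))
    while len(keep) > 1:
        scores = w ** 2 / np.diag(R)       # leave-one-out scores (Theorem 3.1)
        k = int(np.argmin(scores))         # Eq. (3.2)
        if scores[k] > eps * abs(f_val):   # assumed reading of criterion (3.6)
            break
        w = w - w[k] * R[:, k] / R[k, k]              # update (3.5)
        R = R - np.outer(R[:, k], R[k, :]) / R[k, k]  # update (3.3)
        w = np.delete(w, k)
        R = np.delete(np.delete(R, k, axis=0), k, axis=1)
        f_val += scores[k]                 # objective grows by the score
        keep.pop(k)
    return keep, w

# Toy demo: Gaussian basis functions centered at 25 one-dimensional points
rng = np.random.default_rng(3)
X = rng.normal(size=(25, 1))
Cmat = np.exp(-(X - X.T) ** 2)
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=25)
keep, w = gps_be(Cmat, y, sigma2=0.1)
```

The invariant maintained by the rank-1 updates is that, at every step, w equals the exact solution of the reduced system restricted to the kept indices; this is what makes the elimination cheap compared with refactorizing H each time.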
4 Empirical Study In order to evaluate the performance of GPs-BE, we compare it with GPs, GPs-U, SVMs, RVM and AS on four benchmark datasets: Friedman1 [10], Boston Housing, Abalone and Computer Activity [11]. GPs-U denotes GPs whose covariance function has the same scaling factor for all inputs. Before the experiments, all the training data are scaled to [-1, 1] and the testing data are adjusted using the same linear transformation. For the Friedman1 and Boston Housing datasets, the results are averaged over 100 random splits of the full datasets. For the Abalone and Computer Activity datasets, the results are averaged over 10 random splits of the full datasets. The free parameters in GPs, GPs-BE and GPs-U are optimized by maximizing the evidence. The free parameters in RVM, SVMs and AS are selected by a 10-fold cross-validation procedure.

Table 4.1. Characteristics of four benchmark datasets

Abbr.  Problem            Attributes  Total Size  Training Size  Testing Size
FRI    Friedman1          10          5240        240            5000
BOH    Boston Housing     13          506         481            25
ABA    Abalone            8           4117        1000           3117
COA    Computer Activity  21          8192        1000           7192
Table 4.2. Mean of the testing errors of six algorithms

Algorithm  FRI   BOH    ABA   COA
GPs        0.46  9.23   4.61  6.61
GPs-BE     0.47  9.31   4.65  6.68
GPs-U      2.62  9.67   4.63  11.65
RVM        2.84  9.92   4.62  11.41
SVMs       2.68  10.66  4.72  11.34
AS         2.80  10.02  4.68  12.09
Table 4.3. Mean of the number of basis functions of six algorithms on the benchmark datasets

Algorithm  FRI     BOH     ABA      COA
GPs        240.00  481.00  1000.00  1000.00
GPs-BE     63.00   116.00  15.70    86.70
GPs-U      240.00  481.00  1000.00  1000.00
RVM        70.90   52.88   11.00    47.80
SVMs       178.30  165.73  470.70   357.90
AS         78.70   60.9    11.1     43.6
Table 4.4. Runtime of six algorithms on the benchmark datasets

Algorithm  FRI      BOH       ABA       COA
GPs        217.53   8012.57   4457.84   6591.48
GPs-BE     228.78   8439.69   4687.31   6849.81
GPs-U      453.70   9946.76   6922.70   7373.28
RVM        1652.05  26922.98  13093.14  15150.28
SVMs       1246.32  47005.36  39631.06  67127.36
AS         290.35   7639.03   4440.03   5183.14
From Table 4.2 we see that GPs-BE and GPs obtain similar generalization performance and are significantly better than GPs-U, RVM, SVMs and AS on two of the regression tasks, Friedman1 and Computer Activity. On the remaining two tasks, all six approaches perform similarly. Since GPs-U is often superior to SGGPs and SOGPs in generalization performance, GPs-BE can be expected to generalize better than SGGPs and SOGPs as well. Table 4.3 shows that the number of basis functions of GPs-BE approaches that of RVM and AS, and is significantly smaller than that of GPs, GPs-U and SVMs. Table 4.4 shows that the runtime of GPs-BE approaches that of GPs, GPs-U and AS, and is significantly smaller than that of RVM and SVMs. An alternative is to select the basis functions by the forward selection proposed in [12-13]. Table 4.5 compares our method with forward selection under the same stopping criterion.

Table 4.5. Comparison of backward elimination and forward selection (testing error / number of basis functions)

Problem          Backward Elimination   Forward Selection
FRI              0.47 / 63.00           0.47 / 69.30
BOH              9.31 / 116.00          9.33 / 131.72
ABA              4.65 / 15.70           4.66 / 20.20
COA              6.68 / 86.70           6.71 / 97.40
Normalized Mean  0.998 / 0.864          1.000 / 1.000
Table 4.5 shows that backward elimination outperforms forward selection in both performance and number of basis functions under the same stopping criterion. In summary, GPs-BE greatly reduces the number of basis functions of GPs almost without sacrificing performance or increasing the runtime. Moreover, GPs-BE is better than GPs-U in performance, which further indicates that GPs-BE performs better than SGGPs and SOGPs. GPs-BE is better than SVMs in all three aspects, and better than RVM and AS in performance with a similar number of basis functions and runtime. Finally, backward elimination outperforms forward selection under the same stopping criterion.
5 Conclusion This paper presents a backward elimination algorithm to select the basis functions for GPs. By integrating rank-1 update, we can implement GPs-BE at a reasonable cost. The results show that GPs-BE greatly reduces the number of basis functions of GPs
almost without sacrificing performance or increasing the runtime. Comparisons with forward selection show that GPs-BE obtains better performance with fewer basis functions under the same stopping criterion. This research is supported by the National Natural Science Foundation of China under grants 60372050 and 60133010 and National "973" Project grant 2001CB1309403.
References
1. Williams, C.K.I., Rasmussen, C.E.: Gaussian Processes for Regression. Advances in Neural Information Processing Systems 8 (1996) 514-520
2. Rasmussen, C.E.: Evaluation of Gaussian Processes and Other Methods for Non-linear Regression. Ph.D. thesis, Dept. of Computer Science, University of Toronto. Available from http://www.cs.utoronto.ca/~carl/
3. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)
4. Tipping, M.E.: Sparse Bayesian Learning and the Relevance Vector Machine. Journal of Machine Learning Research 1 (2001) 211-244
5. Figueiredo, M.A.T.: Adaptive Sparseness for Supervised Learning. IEEE Trans. Pattern Analysis and Machine Intelligence 25 (2003) 1150-1159
6. Smola, A.J., Bartlett, P.L.: Sparse Greedy Gaussian Process Regression. Advances in Neural Information Processing Systems 13 (2000) 619-625
7. Csato, L., Opper, M.: Sparse Online Gaussian Processes. Neural Computation 14 (2002) 641-669
8. Berger, J.O.: Statistical Decision Theory and Bayesian Analysis. 2nd edn. Springer (1985)
9. Williams, C.K.I.: Prediction with Gaussian Processes: From Linear Regression to Linear Prediction and Beyond. Learning and Inference in Graphical Models (1998) 1-17
10. Friedman, J.H.: Multivariate Adaptive Regression Splines. Annals of Statistics 19 (1991) 1-141
11. Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. Technical Report, University of California, Department of Information and Computer Science, Irvine, CA (1998). Data available at http://www.ics.uci.edu/~mlearn/MLRepository.html
12. Chen, S., Cowan, C.F.N., Grant, P.M.: Orthogonal Least Squares Learning Algorithm for Radial Basis Function Networks. IEEE Trans. Neural Networks 2 (1991) 302-309
13. Bo, L.F., Wang, L., Jiao, L.C.: Sparse Bayesian Learning Based on an Efficient Subset Selection. Lecture Notes in Computer Science 3173 (2004) 264-269
Comparative Study of Extreme Learning Machine and Support Vector Machine

Xun-Kai Wei, Ying-Hong Li, and Yue Feng

School of Engineering, Air Force Engineering University, Xi'an 710038 Shaanxi, China
{skyhawkf119, yinghong_li, fengy1228}@163.com
Abstract. A comparative study of the extreme learning machine (ELM) and the support vector machine (SVM) is presented in this paper. A cross-validation method for determining the appropriate number of neurons in the hidden layer is also proposed. ELM, proposed by Huang et al. [3], is a novel machine-learning algorithm for the single hidden-layer feedforward neural network (SLFN), which randomly chooses the input weights and hidden-layer biases and analytically determines the output weights instead of tuning them. The algorithm tends to produce good generalization ability while obtaining the least empirical risk, with solid foundations. Benchmark tests on a real Tennessee Eastman Process (TEP) are carried out to validate its superiority. Compared with SVM, the algorithm is much faster and has better generalization performance in the case studied in this paper.
1 Introduction The support vector machine (SVM) has become an increasingly popular technique in machine learning due to demonstrated superiority over many other learning methods. However, the main obstacle to the practical use of SVM is that a rather complex quadratic programming problem must be solved, which is demanding because of its high computational complexity and computing time; the larger the dataset, the more obvious this defect. Furthermore, proper selection of the kernel and its parameters remains an open problem worth further research [1-2]. Compared with SVM, neural networks, especially feedforward neural networks, have their own advantages. It has been realized that the real problem with neural networks is that the model complexity may not match the sample complexity arising from finite samples. Neural networks have attractive virtues such as flexibility, parallel computation, easy implementation and more cognitive, human-like processing, etc. However, some problems remain unsolved, for example, the lack of efficient and fast learning algorithms for large-scale problems and the selection of the number of hidden-layer neurons. In order to overcome these problems, a novel improved machine-learning algorithm for the single-hidden-layer feedforward neural network (SLFN) has been proposed by Huang et al. [3], named the Extreme Learning Machine (ELM) because of its extremely fast learning speed, extremely easy learning method and good J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1089 – 1095, 2006. © Springer-Verlag Berlin Heidelberg 2006
X.-K. Wei, Y.-H. Li, and Y. Feng
generalization performance. ELM randomly chooses the input weights and hidden-layer biases and analytically determines the output weights instead of tuning them, thereby avoiding difficulties such as local minima and the tuning of parameters (learning epochs and learning rate). Compared with SVM and other feedforward neural networks, ELM has only one parameter to be tuned: the number of hidden neurons. In this paper, we propose a simple cross-validation method to select the appropriate number of hidden neurons for ELM. The paper is organized as follows. In Section 2, the basic concept of ELM is reviewed and the cross-validation method for neuron selection is proposed. In Section 3, comparative studies of ELM and SVM are presented. In Section 4, benchmark tests are reported. In Section 5, conclusions and discussion are given.
2 Brief of Extreme Learning Machine

The algorithm was first proposed by Huang et al. [3]. ELM works for almost all nonlinear activation functions used in practical applications. Following Huang et al. [3], the function approximation problem for SLFNs and the ELM algorithm are briefed below.

2.1 Approximation Problem Description of SLFN

For $N$ arbitrary distinct samples $(x_i, t_i)$, where $x_i = [x_{i1}, x_{i2}, \dots, x_{in}]^T \in R^n$ and $t_i = [t_{i1}, t_{i2}, \dots, t_{im}]^T \in R^m$, a standard SLFN with $\tilde{N}$ hidden neurons and activation function $g(x)$ approximates these $N$ distinct samples with zero error if

$\sum_{i=1}^{\tilde{N}} \beta_i\, g(w_i \cdot x_j + b_i) = t_j, \quad j = 1, \dots, N \qquad (1)$

where $w_i = [w_{i1}, w_{i2}, \dots, w_{in}]^T$ is the input weight vector connecting the $i$th hidden neuron to the input neurons, $\beta_i = [\beta_{i1}, \beta_{i2}, \dots, \beta_{im}]^T$ is the output weight vector connecting the $i$th hidden neuron to the output neurons, and $b_i$ is the threshold of the $i$th hidden neuron; $w_i \cdot x_j$ denotes the inner product of $w_i$ and $x_j$. The above $N$ equations can be written compactly as

$H\beta = T \qquad (2)$

where

$H(w, b) = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_1 + b_{\tilde{N}}) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_N + b_{\tilde{N}}) \end{bmatrix}_{N \times \tilde{N}}$
Comparative Study of Extreme Learning Machine and Support Vector Machine
$\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_{\tilde{N}}^T \end{bmatrix}_{\tilde{N} \times m}, \qquad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m}$

For (2), if the number of hidden neurons equals the number of distinct training samples, the matrix $H$ is square and invertible, and the SLFN can approximate these training samples with zero error [4]. In most cases, however, the number of hidden neurons is much smaller than the number of distinct training samples, $\tilde{N} \ll N$; then $H$ is no longer square (and may be singular), and there may exist no $\hat{w}_i, \hat{b}_i, \hat{\beta}$ $(i = 1, \dots, \tilde{N})$ such that $H\beta = T$ holds. Instead, one may seek specific $\hat{w}_i, \hat{b}_i, \hat{\beta}$ such that

$\| H(\hat{w}, \hat{b})\,\hat{\beta} - T \| = \min_{w_i, b_i, \beta} \| H(w, b)\,\beta - T \| \qquad (3)$
Several parameter optimization methods, such as gradient-based learning algorithms, can be adopted to solve (3). However, these methods minimize only the training error, regardless of model complexity, so they cannot obtain rational structural parameters and suffer from local minima, the selection of learning epochs and learning rate, and over-learning.

2.2 Minimum Least Squares Solution with Good Generalization Performance
Reference [5] points out and rigorously proves that feedforward neural networks with arbitrarily assigned input weights and hidden-layer biases, with almost any nonzero activation function, can universally approximate any continuous function on any compact input set. From the function approximation point of view, the input weights and hidden-layer biases can therefore be chosen randomly. Training an SLFN then reduces to finding a least squares solution of the linear system $H\beta = T$, and the minimum norm least squares solution is
$\hat{\beta} = H^{+} T \qquad (4)$
where $H^{+}$ is the Moore-Penrose generalized inverse of $H$. As analyzed theoretically and demonstrated empirically by Huang et al. [3], given the number $\tilde{N}$ of hidden neurons, this solution tends to generalize well.

2.3 ELM Algorithm [3]

Given a training set $\aleph = \{(x_i, t_i) \mid x_i \in R^n, t_i \in R^m, i = 1, \dots, N\}$, an activation function $g(x)$, and a hidden-neuron number $\tilde{N}$:
(1) Randomly assign input weights $w_i$ and hidden-layer biases $b_i$, $i = 1, \dots, \tilde{N}$.
(2) Calculate the hidden-layer output matrix $H$.
(3) Calculate the output weights $\beta = H^{+} T$.
1092
X.-K. Wei, Y.-H. Li, and Y. Feng
2.4 Proposed Cross-Validation (CV) Algorithm for ELM

This accompanying CV algorithm selects the number of hidden-layer neurons for ELM. It can be summarized as follows:
(1) Assign the maximum number of iterations $i_{\max}$.
(2) Set $i = 0$ and construct the initial candidate tuning set $\mathrm{N}_0 = \{5, 10, 15, 25, 50, 100, 250, 500\}$. Partition the training data into 10 disjoint folds and, for each candidate, compute the average validation error over the 10 folds (misclassification rate for classification, root mean square error for regression).
(3) Choose from the tuning set $\mathrm{N}_i$ the optimum $\tilde{N}$, i.e., the candidate with the minimum validation error.
(4) If $i = i_{\max}$, go to step (5); otherwise set $i := i + 1$, construct a locally refined one-dimensional grid $\mathrm{N}_i$ around the optimum $\tilde{N}$, evaluate its candidates by the same 10-fold procedure, and go to step (3).
(5) Output the optimum hidden-layer neuron number $\tilde{N}$.
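Steps (2)-(3) amount to a 10-fold grid search over the candidate counts. A self-contained sketch for the regression case follows; the compact ELM trainer inside and all names are illustrative, and the default candidate set is the paper's initial tuning set:

```python
import numpy as np

def _elm_fit_predict(Xtr, Ttr, Xte, m, seed=0):
    # compact ELM (Section 2.3): random sigmoid hidden layer, pinv output weights
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, Xtr.shape[1]))
    b = rng.standard_normal(m)
    sig = lambda Z: 1.0 / (1.0 + np.exp(-Z))
    beta = np.linalg.pinv(sig(Xtr @ W.T + b)) @ Ttr
    return sig(Xte @ W.T + b) @ beta

def cv_select_neurons(X, T, candidates=(5, 10, 15, 25, 50, 100, 250, 500),
                      k=10, seed=0):
    """10-fold CV over candidate hidden-neuron counts; returns the count with
    the lowest average RMSE (steps (2)-(3) of the proposed algorithm)."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(X)), k)
    best, best_err = None, np.inf
    for m in candidates:
        errs = []
        for i in range(k):
            val = folds[i]
            trn = np.concatenate([folds[j] for j in range(k) if j != i])
            if m >= len(trn):          # skip counts exceeding a training split
                errs = None
                break
            r = _elm_fit_predict(X[trn], T[trn], X[val], m, seed) - T[val]
            errs.append(np.sqrt(np.mean(r ** 2)))  # RMSE on the held-out fold
        if errs and np.mean(errs) < best_err:
            best, best_err = m, float(np.mean(errs))
    return best, best_err
```

The local grid refinement of step (4) would simply call `cv_select_neurons` again with a finer candidate tuple around the returned optimum.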
3 Comparative Studies of ELM and SVM

It is interesting that both neural networks and SVM have their origins outside computer science: neural networks draw on the neurological discoveries of the 20th century, while SVM builds on its advanced mathematical developments. Given these different backgrounds, both differences and similarities are worth emphasizing.

Dataset handling. SVM and ELM handle the training data in a similar way. SVM processes the data set in vector form, owing to its mathematical nature, without losing accuracy; this helps reduce the burden of analyzing data sets with many features. Unlike traditional neural networks, ELM also treats the training data as vectors, which greatly speeds up learning. ELM may degrade when the data are high dimensional; conversely, when the dimension is small but the data set is large, ELM is more efficient, while SVM converges slowly because of large kernel computations.

Global solution. Both SVM and ELM obtain global solutions: SVM through a QP optimum, ELM through a minimum least squares solution. Both tend to provide satisfactory performance with solid foundations.

Computation units. Both SVM and ELM have effective computation units. For SVM they are the support vectors, which are generated automatically; for ELM they are the hidden neurons, which can be selected heuristically. The tests below show that SVM often generates many support vectors, while ELM needs only a few neurons. This matters especially for large datasets, since more computation units mean higher computational complexity.

Model selection. For SVM, a proper kernel type and kernel parameters must be determined first to obtain a low VC dimension, which assures good generalization. Correspondingly, ELM has to determine the proper number of neurons to obtain proper model complexity. ELM seems more natural and simpler for practical use, because it has only one adjustable parameter (the number of hidden neurons), whereas SVM has to determine several parameters together.
4 Benchmark Tests

To compare ELM and SVM more concretely, a typical TEP fault diagnosis task is studied (Table 1). Note that ELM is implemented in MATLAB while SVM is implemented in C.

4.1 Benchmark Datasets
Here we give only a brief introduction to the benchmark datasets; for details, please visit the TEP simulator homepage. The selected TEP fault data overlap strongly, which makes them well suited for testing classification performance; each fault consists of 52 variables. Fault data with variable selection are also investigated.

Table 1. TEP fault diagnosis datasets. Only fault 4, fault 9, and fault 11 are investigated.

Class     Description                                                      Training data  Testing data
Fault 4   Step change in the reactor cooling water inlet temperature       480            960
Fault 9   Random variation in D feed temperature                           480            960
Fault 11  Random variation in the reactor cooling water inlet temperature  480            960
4.2 Experiment Settings

To obtain optimal results, 10-fold cross-validation is performed first for all experiments. The finally selected parameters of ELM and SVM are listed in Table 2. All tests are executed on an IBM PC server with a P4 1.7 GHz processor and 512 MB memory.

Table 2. Parameter settings. The RBF kernel is used for all tests via LIBSVM.

Dataset name                  ELM (no. of neurons)  SVM (RBF kernel)
TEP with all variables        196                   C = 96e4, C-SVC, γ = 0.25
TEP with variable selection   67                    C = 1, C-SVC, γ = 0.05
4.3 Results

The testing results are listed in Tables 3-5. The correctness rate and the confusion matrix (whose ij-th element is the number of samples from class i classified as class j) are used as evaluation criteria; the confusion matrix shows the separation performance of ELM and SVM. Learning and prediction speeds are reported to show how efficient the algorithms are. To guarantee the reliability of the results, each benchmark test is executed 100 times and all performance figures are averaged. The numbers of support vectors are also reported for comparison with the numbers of ELM neurons. (TEP simulator homepage: http://brahms.scs.uiuc.edu)

Table 3. TEP fault diagnosis performance results

Classifier  Performance        All variables  Selected variables
ELM         Training time (s)  1.722          0.581
            Testing time (s)   0.450          0.140
            Training rate      0.784          0.931
            Testing rate       0.517          0.938
            Neuron number      196            67
SVM         Training time (s)  398.120        27.780
            Testing time (s)   8.662          0.852
            Training rate      1.000          0.940
            Testing rate       0.480          0.835
            Support vectors    1154           980
Table 4. TEP fault diagnosis confusion matrix with all variables (52 total)

Classifier  Fault class  Fault 4  Fault 9  Fault 11
ELM         Fault 4      673      136      136
            Fault 9      114      491      491
            Fault 11     284      326      326
SVM         Fault 4      564      186      210
            Fault 9      85       257      618
            Fault 11     242      156      562
Table 5. TEP fault diagnosis confusion matrix with variable selection (variables 9 and 51)

Classifier  Fault class  Fault 4  Fault 9  Fault 11
ELM         Fault 4      795      157      8
            Fault 9      0        949      11
            Fault 11     3        1        956
SVM         Fault 4      797      158      5
            Fault 9      0        948      12
            Fault 11     47       252      661
5 Conclusions and Discussion

This work demonstrates the merits of ELM. In the benchmark tests, ELM learns faster than SVM and is extremely fast at predicting unknown patterns. These results show that ELM has fast learning speed and better generalization.
Compared with SVM, ELM has several good features. ELM has only one adjustable parameter, the number of hidden-layer neurons, whereas SVM must select a proper kernel and its parameters to obtain low model complexity and promising generalization performance; from this point of view, ELM is much simpler and more flexible. ELM handles multi-class classification directly, whereas SVM must resort to special schemes for it. Finally, it should be pointed out that SVM shows that intelligent behavior can also be imitated with advanced statistical techniques. The progress of neural network research is somewhat limited by the available knowledge of the brain, while statistical computation, riding on fast progress in mathematics, keeps producing attractive results. It may therefore be interesting to integrate neural networks with advanced mathematical theory, which could yield more rational learning algorithms. Matching the model complexity of ELM to the sample complexity could be one focus of its further development.
References
1. Chapelle, O., Vapnik, V.N.: Model Selection for Support Vector Machines. In: Solla, S.A., Leen, T.K., Muller, K.-R. (eds.): Advances in Neural Information Processing Systems, Vol. 12. MIT Press, Cambridge, MA (2000) 230-236
2. Vapnik, V.N.: Statistical Learning Theory. 1st edn. Wiley, New York (1998)
3. Huang, G.-B., Zhu, Q.-Y., Siew, C.-K.: Extreme Learning Machine: A New Learning Scheme of Feedforward Neural Networks. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN 2004), Budapest, Hungary (2004) 25-29
4. Huang, G.-B.: Learning Capability and Storage Capacity of Two-Hidden-Layer Feedforward Networks. IEEE Transactions on Neural Networks 14(2) (2003) 274-281
Multi-level Independent Component Analysis

Woong Myung Kim1, Chan Ho Park2, and Hyon Soo Lee1
1 Dept. of Computer Engineering, Kyunghee University, 1, Seocheon-dong, Giheung-gu, Yongin-si, Gyeonggi-do 449-701, Republic of Korea
[email protected], [email protected]
2 Dept. of Internet Information Science, Bucheon College, 424, Simgok-dong, Wonmi-gu, Bucheon-si, Gyeonggi-do, Republic of Korea
[email protected]
Abstract. This paper presents a new method that uses a multi-level density estimation technique to generate the score function in ICA (Independent Component Analysis). In information-theoretic ICA, the score function is closely related to the density function. We try to resolve the mismatch of marginal densities by controlling the number of kernels, and we add a constraint that satisfies a sufficient condition guaranteeing asymptotic stability. Multi-level ICA uses kernel density estimation to derive the score function adaptively from the recovered signals. To speed up kernel density estimation, we apply the FFT after rewriting the density formula in convolution form. The proposed multi-level score function generation method reduces the estimation error, i.e., the density difference between the recovered signals and the original signals. Compared with other existing algorithms on blind source separation problems, the method estimates densities closer to those of the original signals and achieves improved performance in the SNR measure.
1 Introduction
Independent component analysis is a useful method for recovering original signals from linear mixtures. It is used to solve the blind source separation problem, i.e., finding the original signals from observed signals. Typically, source separation minimizes a contrast function related to the source densities; such contrast functions are based on maximum likelihood, information maximization, and mutual information [1][2][3][4]. Despite the many ICA algorithms, accurate separation remains difficult because the source distributions are unknown [8][9][10]. Generally, the density of a source is related to the score function, and ICA learning by maximum likelihood estimation assumes a fixed density function. It is therefore important to generate a flexible score function based on density estimation, so that the estimated signals are similar to the original signals. In this study we focus on two points: score function generation and control of the number of kernels in kernel density estimation. For the latter, multi-level ICA can generate the score function adaptively during ICA learning and can find a higher resolution of density estimation.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1096–1102, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Maximum Likelihood Estimation Setting in ICA
In the ICA model, s denotes the independent sources of size m × n, and x denotes the mixture signals produced by the mixing matrix A, where A is a nonsingular m × m matrix of full rank [2][5]:

$x = As \qquad (1)$
The matrix W is the separating matrix that recovers the original sources:

$u = Wx \qquad (2)$

where u is the recovered signals. Eq. (3) defines the objective (contrast) function as the negative log-likelihood with the unknown matrix W; it attains its global minimum when $W = A^{-1}$ [4]:

$L(W) = -\sum_{i=1}^{n} \log p(x) = -n \log|\det W| - \sum_{i=1}^{n} \sum_{j=1}^{m} \log f_j(u_i^{(j)}) \qquad (3)$

Here $p(x) = |\det W|\, f(Wx)$, and $f_j(u_i^{(j)})$ denotes the marginal density. The natural gradient (or relative gradient) method is adopted to minimize the global objective function L(W) when updating the weights [2][3]. The updating equation is

$W \leftarrow W - \eta\left(E[\varphi(u)u^T] - I\right)W \qquad (4)$
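The update in Eq. (4) can be sketched directly, using a sample average for the expectation. In this illustration tanh stands in for the score nonlinearity (the paper's method instead estimates φ from the data by KDE); all names are ours:

```python
import numpy as np

def natural_gradient_step(W, X, eta=0.1, phi=np.tanh):
    """One natural-gradient ICA step: W <- W - eta*(E[phi(u)u^T] - I)W.
    X: (n_samples, m) mixtures; the expectation is a batch average."""
    U = X @ W.T                               # recovered signals u = W x
    C = (phi(U).T @ U) / len(U)               # sample estimate of E[phi(u) u^T]
    return W - eta * (C - np.eye(W.shape[0])) @ W
```

Iterating this step on mixtures of super-Gaussian sources drives the combined matrix WA toward a scaled permutation, i.e., toward separation.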
where η is a constant learning rate, E is the expectation operator, and I is the m × m identity matrix. In Eq. (6), φ(u) is the score function, a nonlinear function associated with the minimized objective; it is an m × 1 vector of nonlinear functions

$\varphi(u) = [\varphi_1(u), \dots, \varphi_m(u)]^T \qquad (5)$

If the density function f(u) can be estimated from a finite data set, the score function is defined through the negative log-derivative:

$\varphi(u) = -[\log f(u)]' = -\frac{f'(u)}{f(u)} \qquad (6)$

3 SFG (Score Function Generation) Algorithm
Kernel density estimation (KDE) is a nonparametric method for calculating the probability density function [7][8][10] that admits differentiation [11]. We therefore generate the score function using KDE, defined as

$f(u) \approx \tilde{f}(c) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{c - u_i}{h}\right), \qquad K(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2} \qquad (7)$
To use Eq. (7) we differentiate with respect to the vector c, since the output dimension of the density equals the size of c; c is called the reference vector, and its dimension equals the number of kernels. The constants n and $\sqrt{2\pi}$ in Eq. (8) can be omitted. Here n is the number of samples, c holds the kernel centers, and $u_i$ are the recovered signals during ICA learning:

$\varphi(u) \approx \tilde{\varphi}(c) = -\frac{\tilde{f}'(c)}{\tilde{f}(c)} = \frac{\frac{1}{h^2}\sum_{i=1}^{n} (c - u_i)\exp\!\left(-\frac{(c - u_i)^2}{2h^2}\right)}{\sum_{i=1}^{n} \exp\!\left(-\frac{(c - u_i)^2}{2h^2}\right)} \qquad (8)$
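Eq. (8) can be evaluated directly with a pairwise distance matrix; a small NumPy sketch (names and the bandwidth value are illustrative):

```python
import numpy as np

def kde_score(c, u, h=0.3):
    """Score estimate of Eq. (8) with Gaussian kernels:
    phi(c) = (1/h^2) * sum_i (c-u_i) K_i / sum_i K_i,
    K_i = exp(-(c-u_i)^2 / (2h^2)).
    c: evaluation points (reference vector), u: observed samples, h: bandwidth."""
    d = c[:, None] - u[None, :]                 # pairwise differences c - u_i
    K = np.exp(-d ** 2 / (2.0 * h ** 2))
    return (d * K).sum(axis=1) / (h ** 2 * K.sum(axis=1))
```

As a sanity check, for standard normal samples the estimate approaches the true score φ(c) = c, up to the kernel-smoothing bias.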
Usually h is a positive value, and the width of each kernel is called the bandwidth or smoothing parameter. Direct kernel density estimation is time consuming, so we rewrite the estimate in convolution form. Since convolution becomes a product in the frequency domain, we use the FFT (Fast Fourier Transform) to reduce the computation time. From Eq. (8) we obtain Eqs. (9) and (10), where G is the Gaussian kernel and G′ its derivative, F and F′ correspond to f(u) and f′(u), and H is the measured histogram:

$F \equiv \mathrm{ifft}(\mathrm{fft}(H) \cdot \mathrm{fft}(G)) \qquad (9)$

$F' \equiv \mathrm{ifft}(\mathrm{fft}(H) \cdot \mathrm{fft}(G')) \qquad (10)$

The computational complexity of generating the score function with FFT convolution drops to O(n log n); the SFG algorithm thus reduces the time needed to create the score function [11].
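A sketch of Eqs. (9)-(10) on a fixed grid, assuming a histogram of the samples and a circularly aligned sampled kernel; the grid size and bandwidth are illustrative choices, not values from the paper:

```python
import numpy as np

def fft_kde(u, n_grid=256, h=0.3):
    """Grid-based KDE score via FFT: bin the samples into a histogram H, then
    f = ifft(fft(H)*fft(G)) and f' = ifft(fft(H)*fft(G')) with a sampled
    Gaussian kernel G. Cost is O(n_grid log n_grid) instead of O(n * n_grid)."""
    lo, hi = u.min() - 3 * h, u.max() + 3 * h
    grid = np.linspace(lo, hi, n_grid)
    dx = grid[1] - grid[0]
    H, _ = np.histogram(u, bins=n_grid, range=(lo, hi))
    # kernel sampled on a symmetric grid; ifftshift puts its center at index 0
    t = (np.arange(n_grid) - n_grid // 2) * dx
    G = np.exp(-t ** 2 / (2 * h ** 2))
    Gp = -t / h ** 2 * G                      # derivative of the Gaussian kernel
    f = np.real(np.fft.ifft(np.fft.fft(H) * np.fft.fft(np.fft.ifftshift(G))))
    fp = np.real(np.fft.ifft(np.fft.fft(H) * np.fft.fft(np.fft.ifftshift(Gp))))
    score = -fp / np.maximum(f, 1e-12)        # phi = -f'/f on the grid
    return grid, score
```

Normalization constants (sample count, bin width) cancel in the ratio −f′/f, so the histogram counts can be convolved directly.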
4 Multi-level ICA
Let $H(y_i)$ denote the marginal entropies and H(y) the joint entropy, the sum of the marginal entropies. Each marginal entropy can be written as

$H(y_i) = -E[\log p(y_i)] \qquad (11)$

The nonlinear mapping between the output density $p(y_i)$ and the source-estimate density $p(u_i)$ is described by the absolute value of the derivative with respect to $u_i$:

$p(y_i) = p(u_i) \Big/ \left|\frac{\partial y_i}{\partial u_i}\right| \qquad (12)$

In vector form, p(y) is related to the input density p(u) through the Jacobian matrix [9]:

$p(y) = p(u)/J(u), \qquad J = \left|\partial \varphi(u)/\partial u\right| \qquad (13)$
The entropy H(y) to be maximized with respect to the density function is

$H(y) = H(u) + E[\ln J] \qquad (14)$
The main idea of this work is to determine the number of kernels adaptively so as to estimate a suitable score function. For this purpose, multi-level ICA selects the number of kernels from {4, 8, 16, 32, 64, 128}; these numbers form the kernel index. Increasing the number of kernels provides a more accurate estimate of the densities. For the stability analysis, we consider likelihood-contrast minimization without the whiteness constraint. A sufficient condition guaranteeing asymptotic stability of the global and local minima can be derived as follows [4]:

$K_i > 0, \quad 1 \le i \le m \qquad (15)$

where

$K_i = E\{\varphi_i'(u_i)\}\, E\{u_i^2\} - E\{\varphi_i(u_i)\, u_i\} \qquad (16)$
ϕi (ui ) is equal to Eq.9. ϕi (ui ) is derived from Jacobian matrix, it is expressed as follows ∂(ϕ(u))/∂u ∼ ˜ = ∂(ϕ(c))/∂c n n n n n 2 1 = K1− h12 K3 K1 − K2 − h2 K2 / K1 i=1
i=1
i=1
i=1
(17)
i=1
In Eq.17, it has to be differentiated by vector c, so we may simplify some of the notation for easier computation. 1 K1 = exp(− 2h (c − uj )2 ) 1 K2 = (c − uj ) exp(− 2h (c − uj )2 ) 1 2 K3 = (c − uj ) exp(− 2h (c − uj )2 )
(18)
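The switching term $K_i$ of Eq. (16) can be estimated from the samples of one recovered component, with the KDE score of Eq. (8) and its derivative from Eqs. (17)-(18) evaluated at the samples themselves. A sketch, with an illustrative bandwidth:

```python
import numpy as np

def stability_term(u, h=0.3):
    """Estimate K_i = E{phi'(u)}E{u^2} - E{phi(u)u} (Eq. 16) for one
    component, with phi and phi' from Eqs. (8) and (17)-(18)."""
    d = u[:, None] - u[None, :]               # evaluate at the samples
    K1 = np.exp(-d ** 2 / (2 * h ** 2))
    K2 = d * K1
    K3 = d ** 2 * K1
    S1, S2, S3 = K1.sum(1), K2.sum(1), K3.sum(1)
    phi = S2 / (h ** 2 * S1)                                      # Eq. (8)
    dphi = ((S1 - S3 / h ** 2) * S1 + S2 ** 2 / h ** 2) / (h ** 2 * S1 ** 2)  # Eq. (17)
    return np.mean(dphi) * np.mean(u ** 2) - np.mean(phi * u)
```

For the exact score of a density, this quantity is zero for a Gaussian and positive otherwise, which is what makes it usable as a switching criterion.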
$K_i$ is the switching term for selecting the number of kernels: if $K_i \ge 0$, a larger number than the current number of kernels is chosen; if $K_i < 0$, a smaller one. Multi-level ICA searches all candidate numbers when computing the stability guarantee, because the measured marginal entropies are unequal to each other. In ICA, after learning, the matrix W has the property that the log-likelihood is stabilized with respect to the objective function. We can therefore derive the following condition, where $L(W)^{(t)}$ is the log-likelihood observed at the current time and $L(W)^{(t-1)}$ the one observed at time t − 1:

$\| L(W)^{(t)} - L(W)^{(t-1)} \| \le \varepsilon \qquad (19)$
If the difference of L(W) is smaller than ε, we interpret the learning as stable; multi-level ICA then performs the optimization step that determines the new number of kernels. The learning scheme of the multi-level algorithm is as follows:

1. Initialize the matrix W and set the kernel index KI = [4 8 16 32 64 128].
2. Compute $K_i$ using Eq. (16) for each kernel number in KI.
3. If $K_i \ge 0$, select a larger number than the current number of kernels; otherwise select a smaller one.
4. Compute the score function with the SFG algorithm using the selected number.
5. Update the matrix W using Eq. (4).
6. If $\| L(W)^{(t)} - L(W)^{(t-1)} \| \le \varepsilon$, go to step 2; otherwise go to step 4.
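The switching rule in steps 2-3 just moves up or down the kernel-index list; a tiny sketch (names are ours):

```python
def next_kernel_count(K_i, current, KI=(4, 8, 16, 32, 64, 128)):
    """Step 3 of the learning scheme: move to a larger entry of the kernel
    index when K_i >= 0, to a smaller one otherwise, clamped at the ends."""
    idx = KI.index(current)
    idx = min(idx + 1, len(KI) - 1) if K_i >= 0 else max(idx - 1, 0)
    return KI[idx]
```

Clamping at the ends of the list keeps the selection inside the predefined kernel index.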
5 Computer Simulations
To show the validity of multi-level ICA in separating mixed signals, we first carried out an experiment involving eight real-life signals. The distribution of these signals is shown in Fig. 1.
Fig.1. Distribution of eight source signals
The performance of the algorithm during learning was monitored by the SNR error measure. Fig. 2 shows the SNR between estimated and original signals: the proposed multi-level ICA performs better than Fixed-Point ICA [6] and the Extended Infomax algorithm [5]. Each algorithm ran about 200 iterations for updating the matrix W, with a constant learning rate η = 0.2. The gap arises because the other algorithms use fixed score functions matched to super-Gaussian or sub-Gaussian distributions rather than adapted to the original signals.
Fig.2. SNR of multi-level ICA, Extended Infomax algorithm and Fixed point ICA
6 Conclusion
Because MLE-based ICA computes the score function from fixed densities, errors, i.e., differences between the source densities and the estimated densities, will exist, and a new algorithm is required to reduce them. We proposed multi-level ICA using KDE to solve this problem: a new density estimator, focused on controlling the number of kernels, with a constraint that guarantees asymptotic stability. In the computer simulations, multi-level ICA displayed better performance than the existing algorithms.
References
1. Comon, P.: Independent Component Analysis, A New Concept? Signal Processing 36(3) (1994) 287-314
2. Bell, A.J., Sejnowski, T.J.: An Information Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation 7(6) (1995) 1129-1159
3. Amari, S., Cichocki, A., Yang, H.H.: A New Learning Algorithm for Blind Signal Separation. In: Touretzky, D., Mozer, M. (eds.): Advances in Neural Information Processing Systems, Vol. 8. (1996) 757-763
4. Cardoso, J.F.: Blind Signal Separation: Statistical Principles. Proceedings of the IEEE, Special Issue on Blind Identification and Estimation 9(10) (1998) 2009-2025
5. Lee, T.-W., Girolami, M., Sejnowski, T.J.: Independent Component Analysis Using an Extended Infomax Algorithm for Mixed Sub-Gaussian and Super-Gaussian Sources. Neural Computation 11(2) (1999) 417-441
6. Hyvarinen, A.: Survey on Independent Component Analysis. Neural Computing Surveys 2 (1999) 94-128
7. Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall, New York (1995)
8. Vlassis, N., Motomura, Y.: Efficient Source Adaptivity in Independent Component Analysis. IEEE Trans. Neural Networks 12(3) (2001) 559-566
9. Fiori, S.: Blind Signal Processing by the Adaptive Activation Function Neurons. Neural Networks 13(6) (2000) 597-611
10. Boscolo, R., Pan, H.: Independent Component Analysis Based on Nonparametric Density Estimation. IEEE Trans. on Neural Networks 15(1) (2004) 55-65
11. Kim, W.-M., Lee, H.-S.: An Efficient Score Function Generation Algorithm with Information Maximization. In: Wang, L., Chen, K., Ong, Y.S. (eds.): Advances in Natural Computation. Lecture Notes in Computer Science, Vol. 3610. Springer-Verlag, Berlin Heidelberg New York (2005) 760-768
An ICA Learning Algorithm Utilizing Geodesic Approach

Tao Yu1,2, Huai-Zong Shao1, and Qi-Cong Peng1
1 UESTC-Texas Instrument DSPs Laboratory, University of Electronic Science and Technology of China, Chengdu 610054, China
[email protected]
2 Blind Source Separation Research Group, University of Electronic Science and Technology of China, Chengdu 610054, China
Abstract. This paper presents a novel independent component analysis algorithm that separates mixtures using a serially updated geodesic method. The geodesic method is derived on the Stiefel manifold, and an on-line version that can operate directly on unwhitened observations is obtained. Simulations on artificial data as well as real biological data show that the proposed method converges quickly.
1 Introduction

In its simplest form, ICA assumes that the mixtures are linear instantaneous mixtures of the sources. The problem was formalized by Jutten and Herault [1] as

$x = As \qquad (1)$

where $s = (s_1, \dots, s_n)^T$ is a source vector of real-valued random variables, A is a nonsingular n × n mixing matrix with $a_{ij}$ representing the amount of source j that appears in observation i, and $x = (x_1, \dots, x_n)^T$ is the random vector of mixtures from which our observations are generated.

Commonly, the ICA problem is tackled in two stages [2]. The first is whitening, which transforms the mixtures into $z = WAs = Bs$ with $E(zz^T) = I$, where W is called the whitening matrix. Whitening is useful as a preprocessing step because the new mixing matrix B is orthogonal [2]. In the second stage, we can therefore restrict the search for the demixing matrix U to the space of orthogonal matrices, with $y = UBs$ required to have independent components. That is, we develop algorithms that minimize the ICA loss function, denoted J(U), under orthogonality constraints. Many algorithms have been proposed for adjusting an orthogonal demixing matrix to separate prewhitened mixtures, such as Douglas [3] and Manton [4].

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1103 – 1108, 2006. © Springer-Verlag Berlin Heidelberg 2006
Recently, several authors have considered orthogonally constrained problems from the perspective of the Stiefel manifold [5-7]. Traditional "Euclidean" learning algorithms must apply corrections to bring the updated demixing matrix back to the constraint surface of orthogonality. In contrast, once the special geometry of the Stiefel manifold is taken into account, an update rule can be constructed whose small steps stay on (or close to) the surface of the manifold, reducing the need for such corrections as the algorithm proceeds. The natural approach is thus to supplant the Euclidean gradient with the geodesic (the gradient on the Riemannian manifold). In this paper, we develop a serial adaptive algorithm that adjusts U with the geodesic method combined with a serial whitening procedure, making it suitable for on-line real-time applications. The simulations in Section 3 verify that this method works as desired, blindly separating instantaneous mixtures of random sources in a numerically stable fashion.
2 Geodesic Method

On a general manifold, the notion of geodesic extends the concept of a straight line in flat space to a curved space: just as a straight line is the shortest path between two points, a geodesic is the shortest path between two points on a manifold [8]. In this paper we assume that U is a square matrix, i.e., the number of signal sources equals the number of observed signals. In this case the geodesic equation takes the simple form

$U(\tau) = e^{\tau P}\, U(0) \qquad (2)$

where P is a skew-symmetric matrix ($P^T = -P$) and τ is a scalar parameter determining the position along the geodesic. Hence

$\frac{dU}{d\tau} = \frac{d}{d\tau}\left(e^{\tau P} U(0)\right) = P\, e^{\tau P} U(0) = PU \qquad (3)$
For the ICA problem, maximum likelihood or mutual information suggests the following gradient of the loss function J(U) for whitened data:

$\nabla J(U) = -U^{-T} + f(y)\, x^T \qquad (4)$

where f(y) is the nonlinear function usually used in ICA algorithms. Hence

$\frac{dJ}{d\tau} = \left\langle \frac{dJ}{dU}, \frac{dU}{d\tau} \right\rangle = \left\langle -U^{-T} + f(y)x^T,\; PU \right\rangle = \mathrm{trace}\!\left( \left(-U^{-T} + f(y)x^T\right) U^T P^T \right) \qquad (5)$
where $\langle C, D \rangle = \mathrm{trace}(CD^T)$ is the inner product between matrices C and D. Following the method provided by Plumbley [9], we arrive at the iteration

$U(t+1) = e^{-\eta\,(f(y)y^T - y f(y)^T)}\, U(t) \qquad (6)$

where η is a small positive scalar; this is the same as Nishimori's approach [10-11]. However, this method performs well only for whitened data. For real-time on-line processing, a serial update of the whitening matrix should be considered. Thanks to Cardoso [12], we have the serial update equation for the whitening matrix:

$W(t+1) = W(t) - \eta\,(zz^T - I)\, W(t) \qquad (7)$

For the global update of the matrix UW, the corresponding equation is

$U(t+1)W(t+1) = e^{-\eta\,(f(y)y^T - y f(y)^T)}\, U(t)\left( W(t) - \eta\,(zz^T - I) W(t) \right) = e^{-\eta\,(f(y)y^T - y f(y)^T)}\left( I - \eta\,(yy^T - I) \right) U(t)\, W(t) \qquad (8)$

These are our final updating equations for the on-line geodesic ICA approach.
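Eqs. (6)-(8) can be sketched directly. This minimal version uses a truncated-series matrix exponential and tanh as a stand-in nonlinearity f; the names and step size are illustrative, not from the paper:

```python
import numpy as np

def _expm(M, terms=30):
    """Truncated power series for the matrix exponential; adequate for the
    small skew-symmetric steps -eta*(f(y)y^T - y f(y)^T) used here."""
    E, term = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        term = term @ M / k
        E = E + term
    return E

def geodesic_step(U, W, Z, eta=0.05, f=np.tanh):
    """One serial update of Eqs. (6)-(7): rotate U along a Stiefel-manifold
    geodesic and adapt the whitening matrix W on the same batch.
    Z: (n, m) current observations z = W x."""
    Y = Z @ U.T                                     # y = U z
    n = len(Y)
    Cfy = (f(Y).T @ Y) / n                          # estimate of E[f(y) y^T]
    P = Cfy - Cfy.T                                 # skew part: f(y)y^T - y f(y)^T
    U_new = _expm(-eta * P) @ U                     # Eq. (6): geodesic rotation
    Czz = (Z.T @ Z) / n
    W_new = W - eta * (Czz - np.eye(len(W))) @ W    # Eq. (7): serial whitening
    return U_new, W_new
```

Because the exponent is skew-symmetric, the update rotates U exactly, so orthogonality is preserved at every step rather than restored by projection.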
3 Computer Simulations

In this section, two experiments illustrate the effectiveness of the proposed serially updated algorithm, compared against another on-line algorithm, Flexible ICA [13]. In all experiments, the elements of the mixing matrix A are randomly generated. To evaluate performance we use the cross-talking error as the performance index [14]:

$PI_{ct} = \sum_{i=1}^{L}\left( \sum_{j=1}^{d} \frac{|c_{ij}|}{\max_k |c_{ik}|} - 1 \right) + \sum_{j=1}^{d}\left( \sum_{i=1}^{L} \frac{|c_{ij}|}{\max_k |c_{kj}|} - 1 \right)$

where the L × d matrix $C = UWA = \{c_{ij}\}$ is the combined mixing-separating matrix. The smaller the value of $PI_{ct}$, the better the performance. When C is square (L = d) and the observations are noise-free, perfect separation is achieved if and only if $PI_{ct}$ attains its minimum of zero.

3.1 Experiment 1: Artificial Sources with Various Distributions
In this experiment, 4000 realizations of six different sources, distributed as shown in Table 1, were independently generated.
Table 1. Distributions of the synthetic sources used in the first experiment

Source  Source type                                                   Skewness  Kurtosis
1       Power exponential distribution (α = 2.0)                      0.0       -0.8
2       Power exponential distribution (α = 0.6)                      0.0       2.2
3       Power method distribution (b = 1.112, c = 0.174, d = 0.050)   0.75      0.0
4       Power method distribution (b = 0.936, c = 0.268, d = 0.004)   1.5       3.0
5       Normal distribution                                           0.0       0.0
6       Rayleigh distribution (β = 1)                                 0.63      0.245
Fig. 1. Average performance indexes of our proposed algorithm and Flexible ICA algorithm over 100 independent runs. The nonlinear function ϕ ( y ) is the generalized Gaussian PDF proposed in Flexible ICA.
The results of this first experiment are shown in Fig. 1, and they clearly show the performance gain obtained with our algorithm. On average, for the various source distributions, our on-line geodesic ICA algorithm converges faster than Flexible ICA, although both eventually attain the same cross-talking error (after about 2500 samples of iteration).

3.2 Experiment 2: Fetal ECG Sources
This experiment treats the fetal ECG sources shown in Fig. 2, which were recorded in an 8-channel experiment with the electrodes placed on the abdomen and the cervix of the mother. As the figure shows, the raw ECG data measured through the 8 channels are dominated by the mother's ECG and by noise. Fig. 3 shows the sources separated by our algorithm. Clearly, the fetal ECG source is y7, and the mother's ECG sources are y2, y6, and y8. Moreover, some noise sources are also separated, such as y3, the mother's breathing noise.
Fig. 2. The raw ECG data of the 8-channel experiment

Fig. 3. The separated sources of the ECG mixture
4 Conclusion

In this paper, we have presented an on-line geodesic ICA algorithm from the perspective of the Stiefel manifold. The method can separate the mixtures serially, without prewhitening them first. Compared with Flexible ICA, our method converges faster. Two problems shall be pursued in the near future: one is the real-time processing of speech signals; the other is the stability analysis of our algorithm. In a forthcoming paper, we shall analyze the
ODE equation of our algorithm. Although the stability of the geodesic method is hard to analyze, geodesic flow on the Stiefel manifold is a promising tool for ICA applications.
Acknowledgement

This work is supported by the Science & Technology Foundation of Sichuan Province, China, under Grant No. 04GG21-020-02.
References

1. Jutten, C., Herault, J.: Blind Separation of Sources: An Adaptive Algorithm Based on Neuromimetic Architecture. Signal Processing 24 (1991) 1-10
2. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, New York (2001)
3. Douglas, S.C.: Self-stabilized Gradient Algorithms for Blind Source Separation with Orthogonality Constraints. IEEE Transactions on Signal Processing 11 (2000) 1490-1497
4. Manton, J.H.: Optimization Algorithms Exploiting Unitary Constraints. IEEE Trans. on Signal Processing 50 (2002) 635-650
5. Edelman, A., Arias, T.A., Smith, S.T.: The Geometry of Algorithms with Orthogonality Constraints. SIAM J. Matrix Anal. Appl. 20 (1998) 783-787
6. Douglas, S.C., Amari, S.: Natural Gradient Adaptation. In: Haykin, S. (ed.): Unsupervised Adaptive Filtering, Vol. I: Blind Source Separation. Wiley, New York (2000) 13-61
7. Fiori, S.: Quasi-geodesic Neural Learning Algorithms over the Orthogonal Group: A Tutorial. Journal of Machine Learning Research 1 (2005) 1-42
8. Edelman, A., Arias, T.A., Smith, S.T.: The Geometry of Algorithms with Orthogonality Constraints. SIAM J. Matrix Anal. Appl. 20 (1998) 303-353
9. Plumbley, M.D.: Algorithms for Nonnegative Independent Component Analysis. IEEE Transactions on Neural Networks 14 (2003) 534-543
10. Nishimori, Y.: Learning Algorithm for ICA by Geodesic Flows on Orthogonal Group. Proc. Int. Joint Conf. on Neural Networks, Vol. 2. Washington, DC (1999) 933-988
11. Nishimori, Y., Akaho, S.: Learning Algorithms Utilizing Quasi-geodesic Flows on the Stiefel Manifold. Neurocomputing 67 (2005) 106-135
12. Cardoso, J.F., Laheld, B.H.: Equivariant Adaptive Source Separation. IEEE Trans. on Signal Processing 44 (1996) 3017-3030
13. Choi, S., Cichocki, A., Amari, S.: Flexible Independent Component Analysis. Neural Networks for Signal Processing VIII (1998) 83-92
14. Girolami, M.: Self-Organising Neural Networks: Independent Component Analysis and Blind Source Separation. Springer-Verlag, London (1999)
An Extended Online Fast-ICA Algorithm

Gang Wang1,2, Ni-ni Rao1, Zhi-lin Zhang2, Quanyi Mo2, and Pu Wang2

1 School of Life Science and Technology, 2 Blind Source Separation Group,
University of Electronic Science and Technology of China, Chengdu 610054, China
[email protected]
Abstract. Hyvärinen and Oja have proposed an offline Fast-ICA algorithm, but it converges slowly in online form. By using an online whitening algorithm and applying the natural Riemannian gradient on the Stiefel manifold, we present in this paper an extended online Fast-ICA algorithm, which can perform online blind source separation (BSS) directly on unwhitened observations. Computer simulation results are given to demonstrate the effectiveness and validity of our algorithm.
1 Introduction

Independent component analysis (ICA) [1, 2] aims to extract independent signals from their linear mixtures without knowing the mixing parameters. The main applications of ICA are blind source separation (BSS) [1, 8], blind deconvolution [10], and so on. In the basic form of ICA, let us denote by x = (x1, x2, ..., xn)^T an n-dimensional observation vector, and by s = (s1, s2, ..., sn)^T an n-dimensional vector of unknown independent components (ICs). The ICs are mutually statistically independent and zero-mean. The linear relationship is then given by
x = As
(1)
where A is an n × n unknown matrix of full rank, called the mixing matrix. The basic problem of ICA is then to find an n × n full-rank separating matrix B given only the observation sequence, under the restriction that the source elements si are non-Gaussian independent components. The output vector
y = Bx,  y = (y1, y2, ..., yn)^T    (2)
provides the estimates of the source signals. Our research is an extension of Hyvärinen and Oja's Fast-ICA algorithm. Though the Fast-ICA algorithm works efficiently in offline form, our experimental results show that it converges slowly in online form. By using the online whitening algorithm and applying the natural Riemannian gradient on the Stiefel manifold, we present in this paper a new online Fast-ICA algorithm.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1109 – 1114, 2006. © Springer-Verlag Berlin Heidelberg 2006
1110
G. Wang et al.
This paper is outlined as follows. In Section 2, the extended online Fast-ICA algorithm is introduced; we also discuss how to separate sub- and super-Gaussian signals. Computer simulation results with artificial data are presented in Section 3. Conclusions are drawn in Section 4.
2 The Extended Online Fast-ICA Algorithm

The online form of Fast-ICA is given in [4]; it is implemented with an ordinary gradient learning rule, and the observations must be prewhitened. In this paper, a new online algorithm for ICA is introduced using the natural gradient learning rule, and it can perform BSS directly on the observations. Let us begin with the online (serial) update of the whitening matrix.

2.1 The Online Whitening Method

For an online ICA algorithm, we can apply the serial whitening algorithm with least mean squares (LMS) [9]:

U+ = U + η[U − x̃x̃^T U]    (3)

or the recursive least-squares (RLS) whitening algorithm [11]:

U+ = (1/λ)[U − x̃x̃^T U / (π + x̃^T x̃)],  π = λ/(1 − λ)    (4)
where η (0 < η ≪ 1) is a step size (learning rate) parameter, and λ (0 ≪ λ < 1) is a forgetting factor.

2.2 Natural Gradient for the Objective Function

Suppose the mixed data x has been centered to have zero mean. If it is preprocessed by whitening, we obtain a new vector x̃:

x̃ = Ux = UAs = Vs    (5)

where U is an n × n whitening matrix and V becomes an orthogonal matrix. We then need to find an orthonormal matrix W such that WUA is a permutation matrix. So we obtain:
y = Bx = WUAs = Wx̃;  W = [w1, w2, ..., wn]^T
y_i = w_i^T x̃,  i = 1, 2, ..., n    (6)
Combining the information-theoretic and projection pursuit approaches, Hyvärinen and Oja [3] introduced a family of new contrast functions for ICA and gave some choices for the contrast functions (denoted by G) and their derivatives (denoted by g), such as:

G(u) = (1/a) log(cosh(au));  g(u) = tanh(au),  1 ≤ a ≤ 2    (7)

where a is a constant; G is a general-purpose contrast function and will be applied in our experiments. The optimization problem for ICA can then be accomplished by finding all the maxima or minima of E{G(w_i^T x̃)}, i = 1, 2, ..., n.
Suppose that the estimates of the independent components y_i = w_i^T x̃ (i = 1, 2, ..., n) are uncorrelated and have unit variance. So we have

E{yy^T} = W E{x̃x̃^T} W^T = WW^T = I    (8)
where I is the identity matrix. The matrix W is then restricted to the space of orthogonal matrices: WW^T = I. The set of orthonormal n-frames in R^m (m ≥ n) is called the Stiefel manifold. Amari [6] gave the natural Riemannian gradient on the Stiefel manifold as follows:

∇̃J(W) = ∇J(W) − W {∇J(W)}^T W    (9)
2.3 The Update Learning Rule

The extrema of the contrast function can be obtained by the ordinary gradient learning rule (or Hebbian-like learning rule) as follows:

Δw_i ∝ σ x̃ g(w_i^T x̃), normalize w_i,  i = 1, 2, ..., n    (10)
where σ = ±1 is a sign determining whether we maximize or minimize E{G(w_i^T x̃)}. To find n ICs, some kind of feedback is necessary to prevent the w_i from converging to the same point. Using the symmetric bigradient feedback [5], Hyvärinen and Oja [4] gave the ordinary gradient descent learning rule in online form as follows:

W+ = W + μ x̃ g(x̃^T W) diag(sign(c_i)) + αW(I − W^T W)
c_i = E{w_i^T x̃ g(w_i^T x̃) − g'(w_i^T x̃)},  i = 1, 2, ..., n    (11)
where μ is the learning rate (step size) and α is a constant (such as 0.5). The ordinary gradient for the contrast function G(w_i^T x̃) (i = 1, 2, ..., n) can be written as follows:

∇G(w_i^T x̃) = +x̃ g(w_i^T x̃) if c_i > 0;  ∇G(w_i^T x̃) = −x̃ g(w_i^T x̃) if c_i < 0
∇G(Wx̃) = x̃ g(x̃^T W) = [∇G(w_1^T x̃), ∇G(w_2^T x̃), ..., ∇G(w_n^T x̃)]    (12)
Then its natural Riemannian gradient on the Stiefel manifold can be drawn from (9) and (12):

∇̃G(Wx̃) = ∇G(Wx̃) − W {∇G(Wx̃)}^T W    (13)
The above method (13) can extract n ICs simultaneously, but it cannot prevent different neurons from extracting the same source. One reason is that even if WW^T = I is fulfilled, after one update we have

W+ W+^T = I + μ² ∇̃G(Wx̃) ∇̃G^T(Wx̃)    (14)
It is clear that the constraint condition is violated gradually. Therefore, we must decorrelate the output signals at every iteration such that WW^T = I holds, or nearly holds, at every step. The following term, which has already been used in expression (11) and is derived from the symmetric bigradient feedback [5], achieves this:

ΔW = αW(I − W^T W)    (15)
Finally, the natural gradient algorithm

W+ = W + μ ∇̃G(Wx̃) + αW(I − W^T W)    (16)

is straightforward. In conclusion, our extended online Fast-ICA algorithm is composed of expression (3) plus (16), or of expression (4) plus (16). Expression (3) or (4) performs the online whitening, and (16) performs the orthogonal rotation that finds the independent components.
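A minimal sketch of one step of the combined scheme ((3) plus (16)) follows, with tanh as g and the signs sign(c_i) assumed given rather than estimated online — both assumptions of this illustration, not of the paper:

```python
import numpy as np

def extended_fastica_step(U, W, x, signs, mu=0.01, eta=0.01, alpha=0.5,
                          g=np.tanh):
    """One online update: LMS whitening (3) followed by rotation (16)."""
    xt = U @ x                                     # whitened sample x~
    U_new = U + eta * (U - np.outer(xt, xt) @ U)   # (3)
    # ordinary gradient: row i is sign(c_i) * g(w_i^T x~) x~^T
    grad = np.outer(signs * g(W @ xt), xt)
    nat = grad - W @ grad.T @ W                    # (9)/(13): natural gradient
    W_new = W + mu * nat + alpha * W @ (np.eye(len(W)) - W.T @ W)  # (16)
    return U_new, W_new
```

Note that when W is orthogonal the natural gradient is skew-symmetric times W, so the bigradient term αW(I − W^T W) only has to correct second-order drift.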
3 Computer Simulations

In this section, experiments are performed to demonstrate the validity of our algorithms. In the experiments, the elements of the mixing matrix A are randomly generated according to the Gaussian distribution with zero mean and unit variance. The matrices U and W are initialized with the identity matrix. To evaluate the performance of the ICA algorithm, we use the cross-talking error as the performance index (PI):

PI = Σ_{i=1}^{n} { ( Σ_{k=1}^{n} (c_ik / max_j |c_ij|)² − 1 ) + ( Σ_{k=1}^{n} (c_ki / max_j |c_ji|)² − 1 ) }    (17)

where c_ij is the (i,j)-th element of the global system matrix C = BA = WUA. When a perfect ICA is performed, C is an n × n generalized permutation matrix and PI is zero. In practice, the smaller PI is, the better the algorithm.
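For reference, the performance index (17) can be computed as in the following sketch (the squared-ratio form with maxima over absolute values is an assumption where the printed notation is ambiguous):

```python
import numpy as np

def cross_talking_error(C):
    """Cross-talking error PI of the square global matrix C = WUA."""
    C2 = np.asarray(C, dtype=float) ** 2
    # each row/column is normalized by its largest squared entry
    rows = (C2 / C2.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    cols = (C2 / C2.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    return float(rows.sum() + cols.sum())
```

For a generalized permutation matrix every row and column has exactly one nonzero entry, so each bracketed term vanishes and PI = 0.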
Fig. 1. Original Periodic Source Signals
The experiment is performed with two sub-Gaussian and three super-Gaussian sources. Figure 1 shows 200 samples of the sources. The kurtoses of the sources are -2.0, -1.50, 2.26, 4.01, and 0.20. After prewhitening the mixtures (x = As), we execute expressions (11) and (16) on them to compare the ordinary gradient algorithm and the natural gradient algorithm. The two algorithms employ the same step size (μ = 0.01) and the same constant (α = 0.5).
Fig. 2. Comparison of the OG and NG learning rules. 'OG' denotes ordinary gradient; 'NG' denotes natural gradient. The average performance indexes are over 200 independent runs. The two algorithms take the same step sizes and forgetting factors.

Fig. 3. Comparison with the Flexible ICA algorithm. The average performance indexes are over 200 independent runs. In our proposed algorithm, α = 0.5, μ = 0.01, and η = 0.01.
The convergence curves, averaged over 200 independent runs, of the ordinary gradient algorithm and the natural gradient algorithm are shown in Figure 2. From Figure 2, our algorithm requires about 1500 iterations to reach equilibrium, while the ordinary gradient algorithm requires about 3000. Thus, our algorithm converges much faster than the ordinary gradient one does. Further, we test our algorithm when the mixtures are not prewhitened; that is, our two algorithms and Flexible ICA are used to extract the ICs directly. Figure 3 shows the convergence curves of the three algorithms: our proposed algorithms have better performance and faster convergence than the Flexible ICA algorithm.
4 Conclusions

In this paper, an online algorithm for ICA is proposed. Since the natural gradient learning rule is applied, the convergence speed of the algorithm is greatly improved. Theoretical analysis and experimental results demonstrate that the novel algorithm is valid for ICA and can extract sub- or super-Gaussian signals directly from the observations. The novel algorithm can be considered an extension of the Fast-ICA algorithm and can be used more widely.
Acknowledgement

This work is supported by National Natural Science Foundation Grant 60571047.
References

1. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, New York (2001)
2. Hyvärinen, A., Oja, E.: Independent Component Analysis: Algorithms and Applications. Neural Networks 13 (2000) 411-430
3. Hyvärinen, A.: Fast and Robust Fixed-Point Algorithms for Independent Component Analysis. IEEE Trans. Neural Networks 10(3) (1999) 626-634
4. Hyvärinen, A., Oja, E.: Independent Component Analysis by General Nonlinear Hebbian-Like Learning Rules. Signal Processing 64 (1998) 301-313
5. Karhunen, J., Oja, E., Wang, L., Vigario, R., Joutsensalo, J.: A Class of Neural Networks for Independent Component Analysis. IEEE Trans. Neural Networks 8(3) (1997) 486-504
6. Amari, S.: Natural Gradient Learning for Over- and Under-Complete Bases in ICA. Neural Computation 11(8) (1999) 1875-1883
7. Choi, S., Cichocki, A., Amari, S.: Flexible Independent Component Analysis. Journal of VLSI Signal Processing 26 (2000) 25-38
8. Zhang, L.Q., Cichocki, A., Amari, S.: Self-Adaptive Blind Source Separation Based on Activation Function Adaptation. IEEE Trans. Neural Networks 15(2) (2004) 233-244
9. Cardoso, J.F., Laheld, B.H.: Equivariant Adaptive Source Separation. IEEE Trans. Signal Processing 44 (1996) 3017-3030
10. Donoho, D.: On Minimum Entropy Deconvolution. Academic Press, New York (1981)
11. Zhu, X.L., Zhang, X.D., Ye, J.M.: Natural Gradient-Based Recursive Least-Squares Algorithm for Blind Source Separation. Science in China Ser. F 47 (2004) 55-65
Gradient Algorithm for Nonnegative Independent Component Analysis

Shangming Yang

Computational Intelligence Laboratory, School of Computer Science and Engineering,
University of Electronic Science and Technology of China, Chengdu 610054, China
[email protected]
http://cilab.uestc.edu.cn
Abstract. A novel algorithm is proposed for nonnegative independent component analysis. The algorithm employs the gradient algorithm, with some modifications, to separate nonnegative independent sources from their mixtures. Since the local convergence of the gradient algorithm has already been proved, the result in this paper can be considered one of the convergent nonnegative ICA algorithms. Simulations show that the proposed algorithm separates mixtures of nonnegative signals very successfully.
1 Introduction
Independent component analysis (ICA) [1,2] is one of the most important methods used to solve blind source separation (or blind signal separation) problems. ICA makes the following three basic assumptions [3-5]: (1) the observations are deterministic mixtures of the original sources; (2) the original signals (also called components) are mutually statistically independent; (3) the number of observed signals is equal to or greater than the number of original signals. Under these assumptions, ICA is usually considered a method for extracting the original signals from their mixtures. The algorithm suggested by Jutten and Herault in [2] is based on a linear neural network with backward connections. However, for this algorithm the convergence of the network is only weakly guaranteed, and the extraction of the original sources depends significantly on the input data and the initialization of the network. To improve on this, Hyvarinen and Oja [6] introduced a fast fixed-point algorithm based on the fourth-order cumulant, kurtosis; it successfully extracts the original signals by finding the extrema of the kurtosis. The process is simple and the calculation is fast, since there is no need to tune a learning rate. For many applications of ICA, however, nonnegativity is a natural condition in the real world, for example in the analysis of text documents [7], images [8,9], and pollutant emissions [10]. The combination of the nonnegativity and independence assumptions on the original sources is referred to as nonnegative independent component analysis [11].

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1115–1120, 2006. © Springer-Verlag Berlin Heidelberg 2006
1116
S. Yang
In this paper, a nonnegative ICA algorithm based on the gradient algorithm for maximizing negentropy is proposed. By introducing the nonnegativity constraint into the gradient algorithm, the existence of the local extremum for the new algorithm can be assured.
2 ICA Algorithms

2.1 Gradient Algorithm Using Negentropy
For a given observation vector z = As, ICA is to find the right matrix W such that y = Wz = WAs is the eventual estimation of the original source vector s. The gradient-based algorithm for maximizing negentropy was proposed by Hyvarinen and Oja [12]. The algorithm is derived from the Newton method, i.e., it is equivalent to finding the maximum of the non-Gaussianity by Newton iteration on the Lagrangian. By approximating some terms and under the constraint E{(w^T z)²} = ||w||² = 1, the algorithm is defined as follows:

w(t+1) = w(t) + γ E{z g(w(t)^T z)}    (1)
w ← w/||w||    (2)
where E denotes expectation, ν is a standardized Gaussian random variable, and w is any one of the row vectors of W. The parameter γ = E{G(w^T z)} − E{G(ν)} has a kind of "self-adaptation" quality; γ can be replaced by its sign without affecting the behavior of the learning rule. The normalization maps w onto the unit sphere to keep the variance of w^T z constant. The function g is the derivative of the function G used in the approximation of negentropy. G can be chosen from one of the following three functions suggested by Hyvarinen in [12]: G1 = (1/a1) log(cosh(a1 y)), G2 = −exp(−y²/2), and G3 = y⁴. The convergence of the gradient algorithm to the direction of one of the independent components was analyzed in [5].

2.2 Nonnegative ICA Algorithm
Nonnegativity of the sources provides an alternative way of dealing with the ICA problem. In [13,14], a source si is called nonnegative if Pr(si < 0) = 0, and such a source is called well grounded if Pr(si < δ) > 0 for any δ > 0, i.e., si has nonzero pdf all the way down to zero. From the results in [14], we have the following lemma for our new algorithm:

Lemma 1. Suppose that s is a vector of real, nonnegative, well-grounded, independent, unit-variance sources and y = Us is an orthonormal rotation of s, where UU^T = U^T U = I_n. Then U is a permutation matrix if and only if y is nonnegative with probability 1.

Now, for a given observation vector z of the original source vector s, the problem is to find the permutation matrix U such that y is nonnegative. To find
the permutation matrix U, the first step is prewhitening. For the prewhitened observation z, based on the nonlinear MSE criterion of k-dimensional principal component subspace analysis [15]:

e_MSE = E ||z − W^T g(Wz)||²    (3)

Plumbley and Oja [14] suggested a nonnegative PCA algorithm; the update rule is W ← W + ΔW, with ΔW given by

ΔW = μ g+(y)(z − W^T g+(y))^T    (4)

where μ is a small update factor and z = Vx is the prewhitened input data. This is a special case of the nonlinear PCA rule.
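Update (4) is simple to state in code. The following is a sketch under the assumption that g+ denotes half-wave rectification max(y, 0), the usual choice in nonnegative PCA; the function name is illustrative:

```python
import numpy as np

def nonnegative_pca_step(W, z, mu=0.01):
    """One update of the nonnegative PCA rule (4) for a prewhitened sample z."""
    g_plus = np.maximum(W @ z, 0.0)                   # rectified output g+(y)
    return W + mu * np.outer(g_plus, z - W.T @ g_plus)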
3 Proposed Nonnegative Algorithm

3.1 Prewhitening

The original signals are mutually independent, but the separated signals may not have this property, so prewhitening is necessary to make the components uncorrelated [16]. Whiteness of a zero-mean random vector y means that its components are uncorrelated and their variances equal unity. Whitening is only half of ICA: it obtains only uncorrelatedness; the components are not yet mutually independent. According to the discussion in [16], after whitening we need to find an orthonormal matrix W such that y = Wz = WVAs. Since VA is orthonormal, U = WVA is also orthonormal, and y = Us will be the estimation of the original sources. For the nonnegative sources si (i = 1, 2, ..., n), the problem is to find nonnegative estimations yi ≥ 0 (i = 1, 2, ..., n) in y such that y = Wz = WVAs.

3.2 Gradient-Based Nonnegative Algorithm

To obtain the nonnegative estimations, we introduce the iteration algorithm [3], which is derived from the gradient algorithm for maximizing negentropy with some additional constraints:

w(t+1) = w(t) + γ0 E{z g(w(t)^T z)}    (5)
w ← w/||w||    (6)
where γ0 = ±1, with the sign depending on the choice of the function G; the other variables are the same as in equations (1) and (2).

Theorem 1. For the gradient algorithm above, there exists a group of initial components wi (i = 1, 2, ..., n) for the vector w such that in each iteration y = w^T z ≥ 0, until convergence.

In Theorem 1, we can change w into a matrix W; then we have a vector y = (y1, y2, ..., yn) of n estimations of the original signals such that y = Wz.
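Steps (5)-(6), estimated over a batch of prewhitened samples, can be sketched as follows. Here tanh stands in for g, and the fixed gamma0 = +1 is the simplification that the proof justifies; both choices are assumptions of this illustration:

```python
import numpy as np

def gradient_step(w, Z, gamma0=1.0, g=np.tanh):
    """One iteration of (5)-(6). Z holds prewhitened samples as rows (T, n)."""
    w = w + gamma0 * np.mean(Z * g(Z @ w)[:, None], axis=0)  # (5): E{z g(w^T z)}
    return w / np.linalg.norm(w)                             # (6): renormalize
```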
Proof. For the first iteration, we initialize w as a vector in which each component wi has the same sign as zi in z (i = 1, 2, 3, ..., n), except that some components may equal zero; then we have y = w^T z ≥ 0. If w^T z ≥ 0, then since g(y) is an odd function we always have g(w^T z) ≥ 0, so the vector z g(w^T z) and z have the same sign in each corresponding component. In practice, the expectation E{z g(w^T z)} is estimated as an average over the available data samples; therefore, after each iteration, the sign of each component of E{z g(w^T z)} remains unchanged, and γ0 can simply be replaced by +1 since it is eliminated by the normalization [5]. This means that in each new iteration the non-zero components wi and zi (i = 1, 2, ..., n) of w and z always have the same sign (positive or negative at the same time). Therefore y = w^T z remains nonnegative in each iteration, until convergence. We have thus obtained independent components yi ≥ 0 (i = 1, 2, ..., n), which are estimations of the original nonnegative sources. The remaining problem is that, by Lemma 1, to obtain a permutation of the original sources, W needs to be orthogonal. For each row vector wp (p = 1, 2, ..., n) of W, using the Gram-Schmidt orthogonalization method, we add the following operations in each iteration:

w_p ← w_p − Σ_{j=1}^{p−1} (w_p^T w_j) w_j    (7)
w_p ← w_p / ||w_p||    (8)
If w_p z ≥ 0 for all p, then y = Wz = WVAs = Us, and U is a permutation matrix. If there exists some yi = w_p z < 0, we perform the orthogonal rotation I·W = Ut, where I is an orthogonal diagonal matrix with ±1 as its nonzero elements, the sign of each nonzero element aii of I being the same as the sign of yi. Since both W and I are orthogonal, Ut = I·W is also orthogonal, and U = Ut VA will be a permutation matrix. The use of I is reasonable because it is just a mirror reflection of some row vectors w_p of W, so the row vectors of I·W span the same subspace as the row vectors of W. After the above transformation, we have a convergent nonnegative estimation y = Us with U orthogonal. According to Lemma 1, U is a permutation matrix for the nonnegative ICA estimation.
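The row-wise orthonormalization (7)-(8) is classical Gram-Schmidt; a self-contained sketch:

```python
import numpy as np

def gram_schmidt_rows(W):
    """Orthonormalize the rows of W by steps (7)-(8)."""
    W = np.array(W, dtype=float)
    for p in range(W.shape[0]):
        for j in range(p):
            W[p] -= (W[p] @ W[j]) * W[j]     # (7): remove earlier-row components
        W[p] /= np.linalg.norm(W[p])         # (8): unit norm
    return W
```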
4 Simulations
To test the proposed nonnegative algorithm, we used a simulator to generate four groups of data for the experiments. Figure 1 shows the nonnegative original signals (left) and four observations of their mixtures (right). For comparison, we also employed the FastICA algorithm to separate the same groups of data (see Figure 2, left). From the simulation results in Figure 2, we can see that the new algorithm is very effective; it has the same performance as FastICA. The difference is that, when the original sources are nonnegative, our algorithm achieves nonnegative separations that are a permutation of the original signals, whereas FastICA cannot recover the nonnegativity.
Fig. 1. The original nonnegative signals (left) and their observations of mixtures (right)
Fig. 2. The FastICA separations (left) and the nonnegative ICA separations (right)
5 Conclusions
From the observations of the mixtures, nonnegative ICA can recover a permutation of the original sources. For such mixtures, nonnegative ICA will certainly have more efficient applications. Simulation shows that the proposed model works very effectively: for the given data sources in Figure 1, each nonnegative estimation is extremely similar to its original signal. Compared with other algorithms, one of the most important results is that the convergence of our new algorithm is guaranteed. A remaining problem in Figure 2 is that the estimation of the nonnegative sources is not exactly a permutation of the original sources; this may be caused by the normalization of the row vectors w, which deserves further discussion.
References

1. Jutten, C., Herault, J.: Blind Separation of Sources, Part I: An Adaptive Algorithm Based on Neuromimetic Architecture. Signal Processing 24 (1991) 1-10
2. Comon, P.: Independent Component Analysis – A New Concept? Signal Processing 36 (1994) 287-314
3. Pope, K.J., Bogner, R.E.: Blind Signal Separation I: Linear, Instantaneous Combinations. Digital Signal Processing 6 (1996) 5-16
4. Pope, K.J., Bogner, R.E.: Blind Signal Separation II: Linear, Convolutive Combinations. Digital Signal Processing 6 (1996) 17-28
5. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley-Interscience, New York (2001)
6. Hyvärinen, A.: A Fast Fixed-point Algorithm for Independent Component Analysis. Neural Computation 9 (1997) 1483-1492
7. Tsuge, S., Shishibori, M., Kuroiwa, S., Kita, K.: Dimensionality Reduction Using Non-negative Matrix Factorization for Information Retrieval. In: Proc. IEEE Int. Conf. Systems, Man, and Cybernetics, Tucson, AZ, Vol. 2 (2001) 960-965
8. Parra, L., Spence, C., Ziehe, A., Muller, K.R.: Unmixing Hyperspectral Data. In: Proc. Advances in Neural Information Processing Systems 12, Denver, CO (2000) 942-948
9. Lee, J.S., Lee, D.D., Choi, S., Lee, D.S.: Application of Nonnegative Matrix Factorization to Dynamic Positron Emission Tomography. In: Proc. Int. Conf. Independent Component Analysis and Signal Separation, San Diego, CA, December (2000) 629-632
10. Paatero, P., Tapper, U.: Positive Matrix Factorization: A Non-negative Factor Model with Optimal Utilization of Error Estimates of Data Values. Environmetrics 5 (1994) 111-126
11. Plumbley, M., Oja, E.: A "Nonnegative PCA" Algorithm for Independent Component Analysis. IEEE Transactions on Neural Networks 15 (2004)
12. Hyvärinen, A.: Fast and Robust Fixed-Point Algorithms for Independent Component Analysis. IEEE Transactions on Neural Networks 10 (1999) 626-634
13. Plumbley, M.: Conditions for Nonnegative Independent Component Analysis. IEEE Signal Processing Letters 9 (2002)
14. Plumbley, M.: Algorithms for Nonnegative Independent Component Analysis. IEEE Transactions on Neural Networks 14 (2003) 534-543
15. Xu, L.: Least Mean Square Error Reconstruction Principle for Self-organizing Neural-nets. Neural Networks 6 (1993) 627-648
16. Hyvärinen, A., Oja, E.: Independent Component Analysis: Algorithms and Applications. Neural Networks 13 (2000) 411-430
Unified Parametric and Non-parametric ICA Algorithm for Arbitrary Sources

Fasong Wang1,2, Hongwei Li1, Rui Li3, and Shaoquan Yu1

1 School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, P.R. China
[email protected], [email protected]
2 China Electronics Technology Group Corporation, the 27th Research Institute, Zhengzhou 450005, P.R. China
3 School of Mathematics and Physics, Henan University of Technology, Zhengzhou 450052, P.R. China
Abstract. The purpose of this paper is to develop two novel unified parametric and non-parametric Independent Component Analysis (ICA) algorithms, which can separate arbitrary sources, including those with symmetric and asymmetric distributions, using self-adaptive score functions. They are derived from the parameterized asymmetric generalized Gaussian density (AGGD) model and from GGD-kernel-based k-nearest neighbor (KNN) non-parametric estimation. The parameters of the score functions in the algorithms are chosen adaptively, by estimating the higher-order statistics of the observed signals and by the GGD-kernel non-parametric estimation method. Compared with conventional ICA algorithms, the two given methods can separate a wide range of source signals using only one unified density model. Simulations confirm the effectiveness and performance of the proposed algorithms.
1 Introduction
ICA has attracted considerable attention in the signal processing and neural network fields, since it has rapidly growing applications in various subjects, such as speech processing, image enhancement, and biomedical signal processing. Several efficient algorithms have been developed for ICA [1]. Here we consider the linear, instantaneous, noiseless ICA model with the form x(t) = As(t),
(1)
where s(t) = [s1(t), s2(t), ..., sN(t)]^T is the unknown source vector, and the source signals si(t) are stationary processes. The matrix A ∈ R^{N×N} is an unknown, real-valued, non-singular mixing matrix. The observed mixtures x(t) are sometimes called the sensor outputs. The following assumptions are needed for the model to be identifiable [2]: a) the sources are statistically mutually independent; b) at most one of the sources has a Gaussian distribution; c) A is invertible.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1121–1126, 2006. © Springer-Verlag Berlin Heidelberg 2006
1122
F. Wang et al.
Let us consider a linear feed-forward memoryless neural network which maps the observation x(t) to y(t) by the following linear transform:

y(t) = Wx(t) = WAs(t),    (2)
where W ∈ R^{N×N} is a separating matrix and y(t) is an estimate of s(t). Inherently there are two indeterminacies in ICA: the scaling and permutation ambiguities [2]. We define the performance matrix P = WA. Due to the important contribution of Amari et al. in [3], we can multiply the ordinary gradient by W^T W and thereby obtain a natural gradient updating rule for W. The general learning rule for the update can be written as

W_{k+1} = W_k + η[I − ϕ(y)y^T]W_k.    (3)
In (3), η > 0 is the learning rate, which can be chosen by the method in [4], and ϕ(·) is the vector of score functions whose optimal components are

ϕ_i(y_i) = −(d/dy_i) log p_i(y_i) = −p_i'(y_i)/p_i(y_i).    (4)
As shown by (4), the original source distributions must be known, but in most real-world applications this knowledge does not exist. However, as indicated in [5], ICA can be realized without knowing the source distributions, since these distributions may be learnt from the sample data together with the linear transformation. In this paper, we propose the parameterized AGGD model and a GGD-kernel-based KNN non-parametric estimation method to estimate the probability density functions (PDFs) of the source signals.
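Before specializing the score function, rule (3) itself can be sketched in batch form. Here tanh is only a placeholder for the adaptive score ϕ developed in the sequel, and the function name is illustrative:

```python
import numpy as np

def natural_gradient_step(W, X, eta=0.01, phi=np.tanh):
    """One batch update of the natural gradient rule (3).

    W : current separating matrix, X : (T, N) observed mixtures,
    phi : score function applied component-wise (a stand-in here).
    """
    Y = X @ W.T                               # outputs y(t) as rows
    C = (phi(Y).T @ Y) / len(Y)               # estimate of E[phi(y) y^T]
    return W + eta * (np.eye(W.shape[0]) - C) @ W
```

The update stops, on average, when E[ϕ(y)y^T] = I, which is the nonlinear decorrelation condition the score function is meant to enforce.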
2 Unified Parametric ICA Algorithm
The PDF of the GGD [6] with shaping factor c and scaling factor γ is

pgg(y) = (cγ / (2Γ(1/c))) exp[−|γ(y − μy)|^c],

where Γ(·) is the Gamma function, Γ(x) = ∫₀^∞ t^{x−1} e^{−t} dt. By changing the value of c (c > 0), a family of distributions with different sharpness is obtained. The score function based on the GGD PDF in algorithm (3) can then be derived:

ϕi,gg(yi) = −(d/dyi) log pi(yi) = cγ^c |yi − μyi|^{c−1} sgn(yi − μyi). (5)
Unified Parametric and Non-parametric ICA Algorithm    1123

Because the skewness of the GGD equals zero, for symmetric source signals we can substitute (5) into (3) to obtain the GGD-based ICA algorithm. Cao et al. [7] presented a flexible ICA algorithm in which the self-adaptive score function was derived from the GGD and t-distribution models, but they did not consider the case where the source signals have asymmetric distributions. We therefore introduce an asymmetric PDF model, the AGGD model, which depends on two second-order parameters called the left and right variance, defined as [6]:

σ̂l² = (1/(Nl − 1)) Σ_{i: yi < m̂y} (yi − m̂y)²,   σ̂r² = (1/(Nr − 1)) Σ_{i: yi > m̂y} (yi − m̂y)²,

where m̂y is the estimated mode (which does not coincide with the mean in the asymmetric case) and Nl (Nr) is the number of samples yi less than m̂y (respectively greater than m̂y). Hence we obtain the AGGD model

pagg(y) = { (cγ/Γ(1/c)) exp{−γl^c [−(y − my)]^c},  y < my
          { (cγ/Γ(1/c)) exp{−γr^c (y − my)^c},     y ≥ my   (6)
where γ = (1/(σl + σr)) √(Γ(3/c)/Γ(1/c)), γl = (1/σl) √(Γ(3/c)/Γ(1/c)), γr = (1/σr) √(Γ(3/c)/Γ(1/c)), and my is the mode.
From the definition (6) we can verify that the AGGD satisfies the required properties of a density function, namely pagg(y) > 0 and ∫_{−∞}^{∞} pagg(y) dy = 1; moreover, pagg(y) is continuous at y = my. The second- and fourth-order moments of the AGGD are

m2 = (σr − σl)² [1 − (Γ(2/c))² / (Γ(1/c)Γ(3/c))] + σl σr; (7)

m4 = [Γ(5/c)Γ(1/c)/(Γ(3/c))²] (σr⁴ − σr³σl + σr²σl² − σrσl³ + σl⁴)
   − 4 [Γ(2/c)Γ(4/c)/(Γ(3/c))²] (σr² + σl²)(σr − σl)²
   + 6 [(Γ(2/c))²/(Γ(3/c)Γ(1/c))] (σr² − σrσl + σl²)(σr − σl)²
   − 3 [(Γ(2/c))⁴/((Γ(3/c)Γ(1/c))²)] (σr − σl)⁴. (8)
The score function of the AGGD PDF follows easily:

ϕi,agg(yi) = −(d/dyi) log pi(yi) = { −cγl^c [−(yi − myi)]^{c−1},  yi < myi
                                   {  cγr^c (yi − myi)^{c−1},     yi ≥ myi   (9)
Substituting (9) into (3), we obtain the unified ICA algorithm based on the AGGD model, called AGGD-ICA. The stability of the AGGD-ICA algorithm is discussed in [8].
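A direct implementation of the AGGD score (9) needs only the mode and the left/right deviations estimated from the current outputs. The sketch below is our own; in particular, the histogram-peak mode estimate is an implementation choice, not something the paper prescribes.

```python
import numpy as np
from math import gamma, sqrt

def aggd_score(y, c):
    """AGGD score function (9) for one estimated signal y with shape factor c."""
    hist, edges = np.histogram(y, bins=50)
    i = int(np.argmax(hist))
    m = 0.5 * (edges[i] + edges[i + 1])           # crude mode estimate
    left, right = y[y < m], y[y > m]
    sl = sqrt(np.sum((left - m) ** 2) / max(len(left) - 1, 1))    # left deviation
    sr = sqrt(np.sum((right - m) ** 2) / max(len(right) - 1, 1))  # right deviation
    k = sqrt(gamma(3.0 / c) / gamma(1.0 / c))
    gl, gr = k / sl, k / sr                       # gamma_l, gamma_r
    lo = np.clip(-(y - m), 0.0, None)             # active where y < mode
    hi = np.clip(y - m, 0.0, None)                # active where y >= mode
    return np.where(y < m, -c * gl**c * lo**(c - 1), c * gr**c * hi**(c - 1))
```

For a symmetric Gaussian signal and c = 2, the score collapses to roughly yi − my, as expected from (9).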
3 Unified Non-parametric ICA Algorithm
The basic idea of the KNN method is to control the degree of smoothing in the density estimate through the size of a box required to contain a given number of observations. The size of this box is controlled by an integer k that is considerably smaller than the sample size. Suppose we have the ordered data sample x(1), · · · , x(n). For any point x on the line, define the distances di(x) = |x(i) − x|, so that d1(x) ≤ · · · ≤ dn(x). The KNN density estimate is then

fˆ(x) = (k − 1) / (2n dk(x)).

By definition we expect k − 1 observations in [x − dk(x), x + dk(x)]; near the center of the distribution dk will be smaller than in the tails. In applications one often uses the generalized KNN estimate

fˆ(x) = (1/(n dk(x))) Σ_{j=1}^{n} K((x − Xj)/dk(x)),

where K(·) is a window or kernel function. The most common choice of kernel is the Gaussian kernel [9]. As discussed above, the performance of an ICA algorithm also depends crucially on the choice of kernel, and with a Gaussian kernel the optimum is sometimes not an acceptable ICA solution. For this reason we propose to choose a suitable kernel from a set of candidate functions derived from the family of GGD functions, with the Gaussian and Laplacian densities as special cases; moreover, the kernel width (scaling parameter) γ changes as the kernel shape parameter α varies, defined as in Section 2. Thereby the kernel width is specified by γ while the shape is determined by α.

Suppose we are given a batch of sample data of size M. The marginal distribution of an arbitrary reconstructed signal is approximated as

pˆi(yi) = (1/(M dk(yi))) Σ_{j=1}^{M} K((yi − Yij)/dk(yi)),   i = 1, 2, · · · , N,

where N is the number of source signals, K(u) = (cγ/(2Γ(1/c))) exp[−|γu|^c], and Yij = wi(x^T)j = Σ_{n=1}^{N} win xnj, with wi the ith row of the separating matrix W and (x^T)j the jth column of the mixture x^T. This estimator pˆi(yi) is asymptotically unbiased and mean-square consistent for pi(yi), so under the mean-square convergence measure it converges to the true PDF. Moreover, it is a continuous and differentiable function of the unmixing matrix W:

pˆi(yi) = pˆi(wi(x^T)n) = (1/(M dk(wi(x^T)n))) Σ_{j=1}^{M} K( wi((x^T)n − (x^T)j) / dk(wi(x^T)n) ). (10)
From (4), obtaining the score function requires the derivative of (10):

pˆ′i(yi) = −(1/(M (dk(yi))²)) Σ_{j=1}^{M} ((yi − Yij)/dk(yi)) K((yi − Yij)/dk(yi)).

We use the derivative of the estimated PDF as the estimate of the PDF's derivative. From (4) the score function then follows:

ϕ(yi) = −(d/dyi) log pi(yi) = [ Σ_{j=1}^{M} ((yi − Yij)/dk(yi)) K((yi − Yij)/dk(yi)) ] / [ dk(yi) Σ_{j=1}^{M} K((yi − Yij)/dk(yi)) ]. (11)

Substituting (11) into (3) yields the GGD-kernel-based generalized KNN non-parametric ICA algorithm.
4 Simulation Results
To confirm the validity of the proposed AGGD-based parametric ICA algorithm and the GGD-kernel-based KNN non-parametric ICA algorithm, several simulations are given below. We use four source signals with different waveforms, including super-Gaussian, sub-Gaussian, and asymmetric sources. Their waveforms are shown in Fig. 1(a) and their properties in Table 1. The mixing matrix A is randomly generated and fixed in each run. The mixed signals x(t) are plotted in Fig. 1(b). The separating matrix W is initialized to the identity matrix. Here we only show the GGD-kernel-based KNN ICA algorithm; a typical choice of the integer k is used: k = √n. The recovered signals y(t) are plotted in Fig. 1(c). Next, for comparison, we run the mixed signals through different ICA algorithms: the JADE algorithm [10], the FastICA algorithm [11], the Nonparametric ICA
Table 1. Higher-order statistics of the source signals

            source 1   source 2   source 3   source 4
Kurtosis    -0.7407     4.3273     0.3208     1.0531
Skewness    -0.0018    -0.0316    -0.6597     1.0424
[Fig. 1 waveform panels: (a) source signals s1–s4, (b) mixed signals x1–x4, (c) recovered signals y1–y4, each plotted over 3000 samples.]
Fig. 1. Experimental results showing the separation of symmetric and asymmetric sources using the proposed GGD-kernel-based KNN algorithm

Table 2. Separation results (error index E) for various ICA algorithms

Algorithm   JADE     FastICA   Kernel   NonpICA   KNN      AGGD
E           0.1051   0.0192    0.0203   0.0102    0.0086   0.0101
algorithm [5], and the Kernel algorithm [12]. Under the same convergence conditions, the algorithms were compared using a performance index called the cross-talking error E, defined as

E = Σ_{i=1}^{N} ( Σ_{j=1}^{N} |pij| / max_k |pik| − 1 ) + Σ_{j=1}^{N} ( Σ_{i=1}^{N} |pij| / max_k |pkj| − 1 ),

where P (with entries pij) is the performance matrix. The separation results for the four different sources are shown in Table 2 for the various ICA algorithms (averaged over 50 Monte Carlo simulations).
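The cross-talking error is easy to compute from the performance matrix; a sketch (with our own function name) follows.

```python
import numpy as np

def crosstalk_error(P):
    """Cross-talking error index E for a performance matrix P = W A.

    E = 0 exactly when P is a scaled permutation matrix, i.e. perfect
    separation up to ICA's scaling and permutation ambiguities.
    """
    P = np.abs(np.asarray(P, dtype=float))
    rows = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    cols = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    return float(rows.sum() + cols.sum())
```

Any scaled permutation (e.g. swapping the two sources and rescaling one of them) scores exactly zero; off-permutation leakage increases E.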
5 Conclusion
In this paper, two unified parametric and non-parametric ICA algorithms are given. They are derived from the parameterized AGGD model and from GGD-kernel-based KNN non-parametric estimation, respectively. The proposed algorithms can separate arbitrary sources, including those with symmetric and asymmetric distributions, using self-adaptive score functions. The parameters of the score functions are chosen adaptively, by estimating higher-order statistics of the observed signals in the parametric case and by the GGD-kernel-based non-parametric estimate in the other, all within a natural gradient framework. Finally, simulation results compared with other ICA algorithms indicate that the proposed algorithms perform well even when some parametric ICA methods fail to give good results.
Acknowledgment. This work is partially supported by the National Natural Science Foundation of China (No. 60472062) and the Natural Science Foundation of Hubei Province, China (No. 2004ABA038). The authors thank the anonymous reviewers for their valuable comments and suggestions, which helped to improve this paper.
References
1. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing. John Wiley & Sons (2002)
2. Comon, P.: Independent Component Analysis: A New Concept? Signal Processing 36(3) (1994) 287-314
3. Amari, S.: Natural Gradient Works Efficiently in Learning. Neural Computation 10(1) (1998) 251-276
4. Zhang, X.D., Zhu, X.L., Bao, Z.: Grading Learning for Blind Source Separation. Science in China (Series F) 46(1) (2003) 31-44
5. Boscolo, R., Pan, H., Roychowdhury, V.P.: Independent Component Analysis Based on Nonparametric Density Estimation. IEEE Trans. on Neural Networks 15(1) (2004) 55-65
6. Tesei, A., Regazzoni, C.S.: HOS-based Generalized Noise PDF Models for Signal Detection Optimization. Signal Processing 65(2) (1998) 267-281
7. Cao, J., Murata, N., Amari, S., Cichocki, A., Takeda, T.: A Robust Approach to Independent Component Analysis of Signals with High-Level Noise Measurements. IEEE Trans. on Neural Networks 14(3) (2003) 631-645
8. Wang, F.S., Li, H.W., Li, R.: Unified Parametric ICA Algorithm for Hybrid Sources and Its Stability Analysis. In: Wang, Ch., et al. (eds.), Proceedings of SPIE International Conference on Space Information Technology. SPIE Proceedings, Vol. 5958. China (2005) 59582E1-5
9. Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall, New York (1985)
10. Cardoso, J.F.: High-order Contrasts for Independent Component Analysis. Neural Computation 11(1) (1999) 157-192
11. Hyvarinen, A., Oja, E.: A Fast Fixed-point Algorithm for Independent Component Analysis. Neural Computation 10(7) (1998) 1483-1492
12. Bach, F.R., Jordan, M.I.: Kernel Independent Component Analysis. Journal of Machine Learning Research 3(1) (2002) 1-48
A Novel Kurtosis-Dependent Parameterized Independent Component Analysis Algorithm

Xiao-fei Shi1, Ji-dong Suo1, Chang Liu1, and Li Li2

1 Information Engineering College, Dalian Maritime University, Dalian 116026, China
2 Information Engineering College, Dalian University, Dalian 116622, China
[email protected]
Abstract. In the natural gradient framework, a novel Kurtosis-Dependent Parameterized Independent Component Analysis (KDPICA) algorithm is proposed that can separate mixtures of super- and sub-Gaussian sources. Two new probability density models are proposed that provide wider kurtosis ranges, especially for sub-Gaussian sources. Based on the kurtosis of each source and on whitening, the model parameters are calculated adaptively and used to estimate the super- and sub-Gaussian source distributions and the corresponding score functions directly. A stability analysis fixes the ranges of the model parameters that keep the KDPICA algorithm stable. Experiments show that the proposed algorithm performs better than several existing algorithms.
1 Introduction

Independent Component Analysis (ICA) [1-10] aims to recover unknown sources from multi-sensor mixture data. The N-dimensional mixture data Y(t) is assumed to be generated in the simple form

Y(t) = AS(t), (1)
where Y(t) = [y1(t), . . . , yN(t)]^T is the known observation vector, A is an N × N full-rank unknown mixing matrix, T stands for transpose, and S(t) = [s1(t), . . . , sN(t)]^T is an N-dimensional zero-mean source vector. If we denote the estimated sources by X(t), then

X(t) = WY(t) = WAS(t), (2)

where W is a separation matrix, and WA is the product of a permutation matrix and a diagonal matrix. For super- and sub-Gaussian sources, [5] proposed different nonlinear functions to estimate the score functions; [6] proposed the generalized Gaussian model to estimate their probability density functions (PDFs); [9] proposed two kinds of hyperbolic models; [10] proposed a kurtosis-controlled model whose parameter is controlled by the source kurtosis; and [3] proposed a double-Gaussian model to estimate sub-Gaussian PDFs. Based on [3], we construct two novel probability density models that provide a wider kurtosis range, especially for sub-Gaussian sources.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1127–1132, 2006. © Springer-Verlag Berlin Heidelberg 2006
1128
X.-f. Shi et al.
The stability analysis provides the ranges of the parameters that keep the proposed algorithm stable.
2 Natural Gradient Framework

A natural gradient method with good convergence was proposed in [2]:

ΔW ∝ (I − E{ϕ(X)X^T})W, (3)

where ϕ(X) = [ϕ1(x1), . . . , ϕN(xN)]^T are the score functions, defined as

ϕi = −(log fi)′ = −fi′/fi, (4)

where fi and fi′ are the probability density function of source Xi and its derivative.
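Definition (4) can be checked numerically for any smooth density; the finite-difference sketch below is our own illustration.

```python
from math import exp, log, pi, sqrt

def score_from_pdf(f, x, h=1e-5):
    """phi(x) = -(log f)'(x) = -f'(x)/f(x), eq. (4), by central difference."""
    return -(log(f(x + h)) - log(f(x - h))) / (2.0 * h)

def gauss(x):
    """Unit-variance, zero-mean Gaussian pdf."""
    return exp(-x * x / 2.0) / sqrt(2.0 * pi)
```

For the Gaussian, ϕ(x) = x, so the numerical score at x = 1.3 should come out close to 1.3.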
3 Proposed KDPICA Model

For super-Gaussian and sub-Gaussian sources, we propose two probability density models to estimate their distributions. First, we propose a model to estimate super-Gaussian distributions:

f(x) = ω g(x, 0, σ1) + (1 − ω) g(x, 0, σ2), (5)

where ω is a weighting coefficient satisfying 0 < ω < 1, g(x, μ, σ) is a Gaussian density with mean μ and standard deviation σ, and σ1 and σ2 are the standard deviations of the two Gaussian components. f(x) is the probability density function of the source x. We then calculate the first- to fourth-order moments of x. If we define
σ1² = lσ2², (6)

where l is a positive real number with l > 0 and l ≠ 1, then using whitening [1] we can obtain σ1 and σ2 from (6) and the second-order moment of x:

σ2 = 1 / √(ωl + (1 − ω)), (7)

σ1 = √( l / (ωl + (1 − ω)) ). (8)

Then the probability density function of the super-Gaussian source x follows directly from (5), (7) and (8). From the kurtosis calculation for x, the kurtosis is always positive for 0 < ω < 1, l > 0 and l ≠ 1, so model (5) can be used to estimate super-Gaussian distributions.
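The positivity of the kurtosis is easy to verify in closed form, since the fourth moment of a zero-mean Gaussian is 3σ⁴; the sketch below (our own) evaluates (7)-(8) and the resulting excess kurtosis.

```python
from math import sqrt

def supergauss_params(w, l):
    """sigma1, sigma2 from (7)-(8), making model (5) zero-mean with unit variance."""
    s2 = 1.0 / sqrt(w * l + (1.0 - w))
    return sqrt(l) * s2, s2

def supergauss_kurtosis(w, l):
    """Excess kurtosis of f(x) = w*N(0, s1^2) + (1-w)*N(0, s2^2).

    E[x^4] = 3*(w*s1^4 + (1-w)*s2^4), and E[x^2] = 1 by construction.
    """
    s1, s2 = supergauss_params(w, l)
    return 3.0 * (w * s1 ** 4 + (1.0 - w) * s2 ** 4) - 3.0
```

By Jensen's inequality, ω·σ1⁴ + (1−ω)·σ2⁴ ≥ (ω·σ1² + (1−ω)·σ2²)² = 1, with equality only at l = 1, so the excess kurtosis is positive for every admissible (ω, l).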
Second, we propose a model to estimate sub-Gaussian distributions:

f(x) = (1/4)[g(x, −μ, σ) + g(x, −μ − L, σ) + g(x, μ, σ) + g(x, μ + L, σ)], (9)

where g is a Gaussian function, μ is the (positive) mean of one Gaussian component, σ is the common standard deviation of the components, and L is a positive real number with 0 < L ≤ μ. f(x) is the probability density function of the source x. We then calculate the first- to fourth-order moments of x. If we define
σ² = kμ² (10)

and

L = pμ, (11)

where k is a positive real number and p satisfies 0 < p ≤ 1, then using whitening [1] we obtain σ, L and μ from (10), (11) and the second-order moment of x:

σ = √( 2k / (2k + 1 + (1 + p)²) ), (12)

μ = √( 2 / (2k + 1 + (1 + p)²) ), (13)

L = p √( 2 / (2k + 1 + (1 + p)²) ). (14)
Then the probability density function of the sub-Gaussian source x follows directly from (9), (12), (13) and (14). From the kurtosis calculation for x, the kurtosis is always negative, so (9) can be used to estimate sub-Gaussian distributions. The kurtosis ranges of most super-Gaussian models in [3][6][10] are wide and adequate, but the kurtosis ranges of the sub-Gaussian models in [3][6][10] are limited: those of [3] and [6] are (−2, 0), and that of [10] is (−3, 0). The proposed sub-Gaussian model provides a much wider kurtosis range, (−90, 0).
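The negativity of the kurtosis can likewise be checked in closed form from the moments of the four-component mixture (9); the sketch below is our own.

```python
from math import sqrt

def subgauss_params(k, p):
    """mu, L, sigma from (12)-(14), making model (9) have unit variance."""
    mu = sqrt(2.0 / (2.0 * k + 1.0 + (1.0 + p) ** 2))
    return mu, p * mu, sqrt(k) * mu

def subgauss_kurtosis(k, p):
    """Excess kurtosis of the four-Gaussian mixture (9).

    With component means +-mu, +-(mu + L) and common deviation sigma:
      E[x^2] = sigma^2 + (mu^2 + (mu+L)^2)/2                (= 1 by design)
      E[x^4] = 3*sigma^4 + 3*sigma^2*(mu^2 + (mu+L)^2) + (mu^4 + (mu+L)^4)/2
    """
    mu, L, s = subgauss_params(k, p)
    a, b = mu, mu + L
    m2 = s**2 + (a**2 + b**2) / 2.0
    m4 = 3.0 * s**4 + 3.0 * s**2 * (a**2 + b**2) + (a**4 + b**4) / 2.0
    return m4 / m2**2 - 3.0
```

At the recommended operating point k = 4.35, p = 0.1 the variance is exactly one and the excess kurtosis is negative, consistent with the claim above.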
4 Stability Analysis

The proposed algorithm is stable provided that the following condition is satisfied [4]:

β = E[ϕ′(x)] − E[ϕ(x)x] > 0. (15)
For a super-Gaussian source x, the parameter calculation shows that the model parameter lies in the region of a U-shaped curve that keeps the algorithm stable; l ≥ 100 is a sufficient condition for most ω from 0.1 to 0.9. For a sub-Gaussian source x, the parameter calculation shows that k should be larger than approximately 4.35, and p should be selected approximately between 0.01 and 0.4. If we set k > 4.35, the KDPICA algorithm remains stable for any p in this range. Commonly, we set p = 0.1.
5 Experiment

5.1 Experiment 1

In this experiment, two super-Gaussian images and two sub-Gaussian images are mixed. Their kurtoses are 4.9468, 2.7903, −0.8681 and −1.0042. The learning rate is 0.001. The mixing matrix is randomly generated and full rank. The error measure is given in [5]:

Error = Σ_{i=1}^{N} ( Σ_{j=1}^{N} |pij| / max_k |pik| − 1 ) + Σ_{j=1}^{N} ( Σ_{i=1}^{N} |pij| / max_k |pkj| − 1 ). (16)
[Fig. 1: average error (y-axis, 0–10) versus iterations (x-axis, 100–700) for the Lee, Choi, Waheed, Nakayama and KDPICA algorithms.]
Fig. 1. Average error versus iterations over 30 Monte Carlo trials, for the mixture of super- and sub-Gaussian images
After 30 Monte Carlo trials, the average error versus iterations is shown in Figure 1. The blue curve is the average error of Lee's algorithm [5], the red curve of Choi's algorithm [6], the magenta curve of Waheed's algorithm [9], the green curve of Nakayama's algorithm [10], and the black curve of the KDPICA algorithm. From Figure 1 we can see that the KDPICA algorithm performs almost the same as Waheed's algorithm; their performance is best. Next comes Choi's algorithm, then Nakayama's, and last Lee's.

5.2 Experiment 2
In this experiment, four super-Gaussian speech signals are mixed. Their kurtoses are 5.8242, 4.2833, 9.7394 and 9.3490. The learning rate is 0.001. The mixing matrix is randomly generated and full rank. The error measure is the same as (16). After 30 Monte Carlo trials, the average error versus iterations is shown in Figure 2, from which we can see that the KDPICA algorithm performs best, followed by Waheed's algorithm and then Choi's, Nakayama's and Lee's algorithms.
[Fig. 2: average error (y-axis, 0–8) versus iterations (x-axis, 0–400) for the Lee, Choi, Waheed, Nakayama and KDPICA algorithms.]
Fig. 2. Average error versus iterations over 30 Monte Carlo trials, for the mixture of super-Gaussian speech signals
6 Conclusion

In this paper, a Kurtosis-Dependent Parameterized Independent Component Analysis (KDPICA) algorithm is proposed that can separate mixtures of super- and sub-Gaussian sources. Two novel probability density models are proposed to estimate the source distributions. For sub-Gaussian kurtosis, the sub-Gaussian model can
provide a much wider kurtosis range than previously proposed models. The model parameters are kurtosis-dependent. According to the stability analysis, the model parameters must satisfy 0 < ω < 1, l ≥ 100, k > 4.35 and 0.01 < p < 0.4, which keeps the KDPICA algorithm stable. Applied to image mixtures and speech mixtures, the experiments show that the KDPICA algorithm performs better than several existing algorithms.
References
1. Hyvarinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley-Interscience (2001)
2. Amari, S.I.: Natural Gradient Works Efficiently in Learning. Neural Computation 10(2) (1998) 251-276
3. Girolami, M.: An Alternative Perspective on Adaptive Independent Component Analysis Algorithms. Neural Computation 10 (1998) 2103-2114
4. Cardoso, J.F.: Blind Signal Separation: Statistical Principles. Proceedings of the IEEE 86(10) (1998) 2009-2025
5. Lee, T.W., Girolami, M., Sejnowski, T.J.: Independent Component Analysis Using an Extended Infomax Algorithm for Mixed Sub-Gaussian and Super-Gaussian Sources. Neural Computation 11 (1999) 417-441
6. Choi, S., Cichocki, A., Amari, S.I.: Flexible Independent Component Analysis. In: Neural Networks for Signal Processing VIII, Proceedings of the 1998 IEEE Signal Processing Society Workshop (1998) 83-92
7. Boscolo, R., Pan, H., Roychowdhury, V.P.: Independent Component Analysis Based on Nonparametric Density Estimation. IEEE Transactions on Neural Networks 15(1) (2004) 55-65
8. Vlassis, N., Motomura, Y.: Efficient Source Adaptivity in Independent Component Analysis. IEEE Transactions on Neural Networks 12(3) (2001) 559-565
9. Waheed, K., Salem, F.M.: New Hyperbolic Source Density Models for Blind Source Recovery Score Functions. In: Proceedings of the 2003 International Symposium on Circuits and Systems, Vol. 3 (2003) III-32 - III-35
10. Nakayama, K., Hirano, A., Sakai, T.: An Adaptive Nonlinear Function Controlled by Kurtosis for Blind Source Separation. In: Proceedings of the 2002 International Joint Conference on Neural Networks, Vol. 2 (2002) 1234-1239
Local Stability Analysis of Maximum Nongaussianity Estimation in Independent Component Analysis

Gang Wang1,2, Xin Xu2, and Dewen Hu2

1 Telecommunication Engineering Institute, Air Force Engineering University, Xi'an, Shanxi, 710077, P.R.C.
2 College of Mechatronics and Automation, National University of Defense Technology, Changsha, Hunan, 410073, P.R.C.
[email protected]
Abstract. The local stability of maximum nongaussianity estimation (MNE) with nonquadratic functions in independent component analysis (ICA) is investigated. Using trigonometric functions, we first derive the local stability condition of MNE for nonquadratic functions without the approximations made in previous work. The analysis shows that the condition is essentially the generalization of Xu's one-bit-matching ICA theorem to MNE. Second, based on the generalized Gaussian model (GGM), the validity of the local stability condition and the robustness to outliers are examined for three typical nonquadratic functions and for variously distributed independent components.
1 Introduction
As a popular signal processing method originating from blind source separation, independent component analysis (ICA) has been widely applied in telecommunication systems, image enhancement, biomedical signal processing, etc. For the various ICA algorithms, one crucial problem is the stability of the estimation, and many results on this topic have been reported [1, 2]. For multi-unit approaches, e.g., maximum likelihood estimation (MLE) and Infomax [3], Cheung et al. dwelled upon the stability of the equilibrium points in the global and local convergence analysis of information-theoretic ICA [1], and a systematic stability analysis was offered in [2] for general blind source separation methods. Recently, from the perspective of convex-concave programming and combinatorial optimization, Xu et al. came up with the one-bit-matching ICA theorem for the validity of multi-unit approaches [4]. As for maximum nongaussianity estimation (MNE), the systematic theory and algorithms were first elucidated in [5], together with performance analysis, e.g., consistency, asymptotic variance and robustness. In [6], a local stability analysis was provided, and three typical nonquadratic functions were
Corresponding author.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1133–1139, 2006. © Springer-Verlag Berlin Heidelberg 2006
1134
G. Wang, X. Xu, and D. Hu
recommended for the negentropy approximation. For fixed-point algorithms, Regalia et al. gave a monotonic convergence analysis and derived step-size bounds ensuring monotonic convergence to a local extremum from any initial condition [7], and for kurtosis-based FastICA a global convergence analysis of the deflation scheme was provided in [8]. However, most previous work offered global or local stability analysis based on the cubic nonlinear function or kurtosis, not on a general nonquadratic form. Even for the general form, the local stability analysis was given only approximately [5, 6], and the validity of the recommended nonquadratic functions has not been established except for some isolated experimental demonstrations in [6]. In this paper, with the help of trigonometric functions, the local stability condition of MNE is derived for nonquadratic functions. It shows that the condition is essentially the generalization of Xu's one-bit-matching ICA theorem to MNE. Second, based on the generalized Gaussian model (GGM), the validity of three typical nonquadratic functions and their robustness to outliers are demonstrated for different distributions. These two contributions are presented in Sections 3 and 4, respectively. For simplicity, all sources and mixtures are assumed zero-mean and unit-variance.
2 Preliminaries
Negentropy approximation. As a classical nongaussianity measure, negentropy is used as the cost function in MNE, and in practice it is approximated using nonquadratic moments [6]. When only one sufficiently smooth even nonquadratic function G(·) is employed, the popular nongaussianity measure is

J(y) = [E{G(y)} − E{G(v)}]², (1)

where v is a Gaussian variable with the same variance as y. Three typical nonquadratic functions have been proposed [5, 6]:

G1(y) = log cosh(a1 y)/a1,  G2(y) = −exp(−y²/2),  G3(y) = y⁴/4. (2)
It was also recommended that G1(·) is a good general-purpose function, G2(·) is better for highly super-Gaussian sources, and G3(·) for sub-Gaussian sources when no outliers are present [5].

Generalized Gaussian model (GGM). The GGM is a conventional distribution model; for zero-mean, unit-variance variables it can be simplified as [9]

pg(y, θ) = exp( −|y/A(θ)|^θ − log[2A(θ)Γ(1 + 1/θ)] ), (3)

where A(θ) = √(Γ(1/θ)/Γ(3/θ)) and Γ(·) is the standard Gamma function. The parameter θ describes the sharpness of the distribution; as θ changes, a variety of pdfs are obtained, e.g., the Laplacian distribution (pg(y, 1)), the Gaussian (pg(y, 2)) and a nearly uniform distribution (pg(y, 4)). For simplicity, the GGM is preferred to other pdf models, e.g., the Gaussian mixture model and the Pearson model [9].
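The GGM (3) is simple to evaluate; the sketch below is our own, with A(θ) = √(Γ(1/θ)/Γ(3/θ)) giving unit variance. At θ = 2 it reduces exactly to the standard normal density.

```python
from math import exp, gamma, sqrt

def ggm_pdf(y, theta):
    """Zero-mean, unit-variance generalized Gaussian density, eq. (3)."""
    a = sqrt(gamma(1.0 / theta) / gamma(3.0 / theta))  # A(theta)
    return exp(-abs(y / a) ** theta) / (2.0 * a * gamma(1.0 + 1.0 / theta))
```

Numerically integrating the density confirms that both the total mass and the variance are one for any θ, e.g. for the Laplacian case θ = 1.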
3 Local Stability Condition for Nonquadratic Function
Local stability condition. The aim of MNE is to discover an independent component y, of locally maximum nongaussianity, from the linear mixtures of the independent sources s = (s1, s2, · · ·, sm) by a linear transformation w. Denote by A the mixing matrix and by w the demixing vector; thus y = wAs = q^T s under the constraint ‖q‖2 = 1, where q = (q1, q2, · · ·, qm)^T is the relation vector. To enforce the constraint ‖q‖2 = 1, let q1 = cos(α1), · · ·, qm−1 = sin(α1)sin(α2) · · · cos(αm−1), qm = sin(α1)sin(α2) · · · sin(αm−1) (0 < α1, α2, · · ·, αm−1 < 2π), so that the estimated component can be expressed as

y = s1 cos(α1) + s2 sin(α1)cos(α2) + · · · + sm sin(α1)sin(α2) · · · sin(αm−1). (4)

Let ŝ2 = s2 cos(α2) + · · · + sm sin(α2) · · · sin(αm−1). Since the basic assumptions still hold for s1 and ŝ2, a new two-dimensional system is obtained. For simplicity, also denote ŝ2 by s2, and α1 by α. Thus the estimated component is

y = s1 cos(α) + s2 sin(α), (5)
and the investigation of w can be transferred to the α-parameter space [1]. Since y is zero-mean and unit-variance, maximizing J(y) amounts to maximizing E{G(y)} when E{G(y)} > 0, or minimizing E{G(y)} when E{G(y)} < 0. Denote J1(y) = E{G(y)}. To find the equilibria, set dJ1(y)/dα = 0, that is,

E{[−s1 sin(α) + s2 cos(α)] g[s1 cos(α) + s2 sin(α)]} = 0, (6)
where g(·) is the derivative of G(·). Obviously, α = 0 and α = π/2 are two equilibria, the former corresponding to y = s1 and the latter to y = s2; only the case y = s1 is considered hereafter. The nature of the local extremum is evaluated by the second-order derivative of J1(y) at α = 0:

d²J1/dα² |α=0 = E{[−sin(α)s1 + cos(α)s2]² g′[cos(α)s1 + sin(α)s2] + [−cos(α)s1 − sin(α)s2] g[cos(α)s1 + sin(α)s2]} |α=0 = E{g′(s1) − s1 g(s1)}, (7)

where g′(·) is the derivative of g(·). Denote φ(s1) = E{s1 g(s1) − g′(s1)}. Then if φ(s1) > 0, α = 0 is a local maximum of J1(y), and if φ(s1) < 0 a local minimum. Since the above results generalize to the multidimensional system, and s1 can be replaced by any other source, we obtain the local stability condition theorem for nonquadratic functions as follows.

Theorem. Assume x is the prewhitened mixing data, w the demixing vector, and G(·) a sufficiently smooth even function. Then a local maximum (resp. minimum) of E{G(w^T x)} is obtained when the following inequality is satisfied:

φ(si) = E{si g(si) − g′(si)} > 0 (resp. < 0), (8)

where g(·) is the derivative of G(·), and g′(·) is the derivative of g(·).
Since the type of the source to be estimated is decided by E{G(y) − G(v)}, a concise expression of (8) is

ψ(si) = E{si g(si) − g′(si)}[E{G(si)} − E{G(v)}] > 0. (9)
The above results are similar to Theorems 8.1 and 8.2 of [6], which were derived for the consistency of MNE from a Taylor series expansion of J1(y) in which some approximation was made. Here, the theorem is stated as an exact condition on nonquadratic functions in MNE.

Relation with Xu's one-bit-matching ICA theorem. For the validity of multi-unit approaches, the one-bit-matching ICA theorem was recently presented by Xu in [4]. It was first stated as a conjecture that "all the sources can be separated as long as there is one-to-one same-sign-correspondence between the kurtosis signs of all source probability density functions (pdf's) and the kurtosis signs of all model pdf's" [10]. In [4] the conjecture was proved from the perspective of convex-concave programming and combinatorial optimization, and generalized: even if there is only a partial matching between the sources' kurtosis signs and those of the model pdf's, the corresponding sources can still be estimated. In the corollaries, Xu also stated that the one-to-one same-sign correspondence can be replaced by using duality, i.e., super-Gaussian sources can be separated via maximization and sub-Gaussian sources via minimization.

In MNE, the kurtosis sign of the source to be estimated is directly expressed by that of E{G(si)} − E{G(v)} in the case G3(y) = y⁴/4, and the kurtosis sign of the model pdf corresponds to that of E{si g(si) − g′(si)}. As we know, J(y) is the negentropy approximation of the estimated component based on a nonquadratic function, and once G(·) is selected, φ(si), corresponding to the pdf model, is also determined. Moreover, the corollaries in [4] are the special case of the presented theorem in which G3(·) is employed. These are the similarities between Xu's one-bit-matching ICA theorem and the above local stability theorem in MNE.

However, it should be noted that the nonquadratic function is not limited to the kurtosis form; many others are acceptable, e.g., the exponential function Gopt(y) = |y|^β (β < 2), the complex form of (16) in [5], and the other two typical functions in (2) [5, 6]. The local stability condition theorem is essentially the generalization of Xu's one-bit-matching ICA theorem to MNE. Like the pdf model in multi-unit approaches, the nonquadratic function is crucial in MNE, and the related performance studies, e.g., consistency, robustness and asymptotic variance, given by Hyvärinen [5] are valuable.
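The G3 special case can be made concrete: with g(y) = y³ and g′(y) = 3y², φ(s) = E{s⁴} − 3E{s²} is just the excess kurtosis (for unit variance), and E{G3(s)} − E{G3(v)} = φ(s)/4, so ψ3(s) = φ(s)²/4 ≥ 0 and one-bit matching holds automatically. A sampling sketch (our own illustration):

```python
import random
from math import sqrt

def phi_g3(s):
    """phi(s) = E{s g(s) - g'(s)} for g(y) = y^3, i.e. E[s^4] - 3 E[s^2]."""
    n = len(s)
    return sum(x ** 4 for x in s) / n - 3.0 * sum(x * x for x in s) / n

random.seed(0)
n = 100_000
# unit-variance Laplacian (super-Gaussian): random sign times Exp(sqrt(2))
lap = [random.choice((-1.0, 1.0)) * random.expovariate(sqrt(2.0)) for _ in range(n)]
# unit-variance uniform (sub-Gaussian)
uni = [random.uniform(-sqrt(3.0), sqrt(3.0)) for _ in range(n)]
```

φ3 comes out near +3 for the Laplacian and near −1.2 for the uniform, matching the kurtosis signs, while ψ3 = φ3²/4 is nonnegative for both.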
4 The Stability and Robustness Analysis Based on GGM
Local stability analysis. For unit-variance sources, (8) and (9) can be written in integral form as (10) and (11), respectively:

φ(θ) = ∫ [ξg(ξ) − g′(ξ)] pg(ξ, θ) dξ,  (10)
Fig. 1. Function φ(θ) for three typical nonquadratic functions
Fig. 2. Function ψ(θ) for three typical nonquadratic functions
ψ(θ) = ∫ [ξg(ξ) − g′(ξ)] pg(ξ, θ) dξ · ∫ [G(ξ) − G(v)] pg(ξ, θ) dξ.  (11)
When θ varies from 1 to 2 and further to 4, pg(ξ, θ) is a pdf varying accordingly from super-Gaussian (the Laplacian distribution) to Gaussian and on to a nearly uniform distribution, and φ(θ) and ψ(θ) give the corresponding local stability behavior of the nonquadratic function G(·) for the different distributions. For G1(·), G2(·) and G3(·), the corresponding φ(θ) and ψ(θ) are denoted by φ1(θ), φ2(θ), φ3(θ) and ψ1(θ), ψ2(θ), ψ3(θ), respectively. The curves of φ1(θ), φ2(θ), φ3(θ) are depicted in Fig. 1, and those of ψ1(θ), ψ2(θ), ψ3(θ) in Fig. 2. Even allowing for errors in the numerical computation, φ(2) = ψ(2) = 0 holds for all three nonquadratic functions, so θ = 2 is the critical point deciding the type of the independent component to be estimated. In Fig. 1, φ1(θ), φ2(θ) and φ3(θ) divide the space of probability distributions into two half-spaces at θ = 2, and an independent component whose distribution lies in one of the two half-spaces can be estimated by maximization or minimization, respectively. And Fig. 2 shows that the theorem in
Fig. 3. Function φ+ (θ) for G1 (y) and G2 (y)
Section 3, i.e., the local stability condition ψ(θ) > 0, is always satisfied except for the Gaussian distribution, where ψ(2) = 0. It indicates that φ(θ) and the corresponding nonquadratic moment E{G(si)} − E{G(v)} always have the same sign; the special case is that the kurtosis sign of the independent component and that of the pre-designed model (or of the function based on the nonquadratic moment) are the same [5, 6].

Robustness to outliers. To test the robustness of the three typical functions to an outlier, a large value 50 is introduced into the unit-variance independent component with pdf pg(ξ, θ). Since the iterations in MNE algorithms (e.g., the fast fixed-point iteration in FastICA [5, 6]) are based on E{G(y)}, only the function φ(θ) needs to be considered. To distinguish it from the above, φ(θ) with outliers is denoted by φ+(θ). Here φ+(θ) consists of two discrete parts:

φ+(θ) = Σ_ξ [ξg(ξ) − g′(ξ)] pg(ξ, θ) + δ,  (12)

where δ = [ξg(ξ) − g′(ξ)]|ξ=50 / n and n is the number of discrete samples. For the three nonquadratic functions, φ+(θ) is denoted by φ+1(θ), φ+2(θ) and φ+3(θ), respectively. For G3 = y^4/4, φ+(θ) ≈ 3.1 × 10^3. This means that when θ > 2 (where the source is sub-Gaussian), the local stability condition fails, and MNE is then suitable only for discovering super-Gaussian independent components. This result agrees with the conclusion in [5] that G3(y) can be used for estimating sub-Gaussian components only when there are no outliers, or in the special case where it is important to first find the super-Gaussian components. In Fig. 3, φ+1(θ) and φ+2(θ) are also depicted. It shows that for G1(y), MNE has better robustness: the local stability condition is destroyed only when 1.73 < θ < 2 (the super-Gaussian case). Comparatively, for G2(y), MNE behaves best against the isolated outlier. These conclusions coincide with Hyvärinen's comments on the three typical nonquadratic functions in [5, 6], where only isolated experimental results were illustrated and differently distributed independent components had not been studied further.
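The curves in Figs. 1-3 can be reproduced numerically. Below is a minimal sketch (our own illustration, not the paper's code), assuming the generalized Gaussian model pg(ξ, θ) ∝ exp(−|ξ/α(θ)|^θ) with α(θ) chosen for unit variance, so that θ = 1 is Laplacian and θ = 2 is Gaussian; only G3(y) = y^4/4 is evaluated here, for which φ(θ) reduces to the model's excess kurtosis:

```python
import numpy as np
from math import gamma

def ggd_pdf(xi, theta):
    # Unit-variance generalized Gaussian density with shape parameter theta.
    alpha = np.sqrt(gamma(1.0 / theta) / gamma(3.0 / theta))  # scale for unit variance
    c = theta / (2.0 * alpha * gamma(1.0 / theta))            # normalizing constant
    return c * np.exp(-np.abs(xi / alpha) ** theta)

def phi(theta, g=lambda y: y ** 3, gp=lambda y: 3 * y ** 2):
    # phi(theta) = integral of [xi g(xi) - g'(xi)] p_g(xi, theta) dxi, Eq. (10),
    # approximated on a dense grid.
    xi = np.linspace(-12.0, 12.0, 200001)
    integrand = (xi * g(xi) - gp(xi)) * ggd_pdf(xi, theta)
    return float(np.sum(integrand) * (xi[1] - xi[0]))

print(phi(1.0) > 0)           # True: Laplacian model, super-Gaussian
print(abs(phi(2.0)) < 1e-3)   # True: phi(2) = 0 at the Gaussian point
print(phi(4.0) < 0)           # True: theta = 4, sub-Gaussian
```

This reproduces the sign pattern of φ3(θ) in Fig. 1, with the zero crossing at θ = 2.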
5 Conclusions
This paper investigates the local stability condition of MNE for nonquadratic functions, and examines the availability of three typical functions as well as their robustness to outliers. Although some of the results have been offered by Hyvärinen, the systematic analysis in our research is more rigorous and a necessary supplement to MNE theory. The global performance of MNE, e.g., global convergence and stability, is left for future study.
Acknowledgement. This work is supported by the National Natural Science Foundation of China under grants 30370416, 60303012, 60225015, and 60234030.
References
[1] Cheung, C.C., Xu, L.: Some Global and Local Convergence Analysis on the Information-Theoretic Independent Component Analysis Approach. Neurocomputing 30 (2000) 79-102
[2] Amari, S., Chen, T.P., Cichocki, A.: Stability Analysis of Adaptive Blind Source Separation. Neural Networks 10 (1997) 1345-1351
[3] Lee, T.-W., Girolami, M., Sejnowski, T.J.: Independent Component Analysis Using an Extended Infomax Algorithm for Mixed Subgaussian and Supergaussian Sources. Neural Computation 11(2) (1999) 417-441
[4] Xu, L.: One-Bit-Matching ICA Theorem, Convex-Concave Programming and Combinatorial Optimization. In: Wang, J., Liao, X., Yi, Z. (Eds.): International Symposium on Neural Networks. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag, Berlin Heidelberg (2005) 5-20
[5] Hyvärinen, A.: Fast and Robust Fixed-Point Algorithms for Independent Component Analysis. IEEE Trans. Neural Networks 10 (1999) 626-634
[6] Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. 1st edn. John Wiley, New York (2001)
[7] Regalia, P.A., Kofidis, E.: Monotonic Convergence of Fixed-Point Algorithms for ICA. IEEE Trans. Neural Networks 14 (2003) 943-949
[8] Wang, G., Xu, X., Hu, D.: Global Convergence of FastICA: Theoretical Analysis and Practical Considerations. In: Wang, L., Chen, K., Ong, Y.S. (Eds.): The First International Conference on Natural Computation. Lecture Notes in Computer Science, Vol. 3610. Springer-Verlag, Berlin Heidelberg (2005) 700-705
[9] Zhang, L., Cichocki, A., Amari, S.: Self-Adaptive Blind Source Separation Based on Activation Function Adaptation. IEEE Trans. Neural Networks 15(2) (2004) 233-244
[10] Liu, Z.Y., Chiu, K.C., Xu, L.: One-Bit-Matching Conjecture for Independent Component Analysis. Neural Computation 16 (2004) 383-399
Convergence Analysis of a Discrete-Time Single-Unit Gradient ICA Algorithm

Mao Ye 1, Xue Li 2, Chengfu Yang 1, and Zengan Gao 3

1 Computational Intelligence Laboratory, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, P.R. China, yem [email protected]
2 School of Information Technology and Electronic Engineering, The University of Queensland, Brisbane, Queensland 4072, Australia, [email protected]
3 School of Economics and Management, Southwest Jiaotong University, Chengdu, Sichuan 610031, P.R. China
Abstract. We revisit the one-unit gradient ICA algorithm derived from the kurtosis function. By carefully studying the properties of the stationary points of the discrete-time one-unit gradient ICA algorithm, convergence can be proved under a suitable condition on the learning rate. The condition on the learning rate helps alleviate the guesswork that accompanies choosing a suitable learning rate in practical computation. These results may be useful for extracting independent source signals on-line.
1 Introduction
In the simplest independent component analysis (ICA) model, there is a neural network with input x(k) = [x1(k), ..., xn(k)]^T, weight matrix W(k) ∈ R^{n×n} and output Y(k) = W(k)x(k) at time k. Here x(k) is a zero-mean discrete-time stochastic process, constructed as a sequence of independent and identically distributed samples of a random vector x. The weight matrix W(k) determines the relationship between the input x(k) and the output Y(k) at time k, and

x(k) = A s(k),  (1)

where s(k) = [s1(k), ..., sn(k)]^T represents samples of the unobserved source signals s = [s1, ..., sn]^T, which are independent random variables, and A ∈ R^{n×n} is an orthogonal, constant but unknown mixing matrix. At most one of these random variables may be Gaussian. Since the source signals are statistically independent of each other, the signals can be recovered from the linear mixture x by finding a matrix W such that the components of Y are statistically as independent of each other as possible. The goal of an ICA design using neural networks is to force W(k) to converge to such a weight matrix W by adjusting W(k) on-line via the ICA learning law.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1140–1146, 2006. © Springer-Verlag Berlin Heidelberg 2006
The problem of independent component analysis has been studied by many authors in recent years, and many algorithms have been proposed from different viewpoints (for a review, see e.g. [1]). In the ICA problem, the independence of signals is usually evaluated by an information-theoretic function such as the minimum mutual information (MI) or the maximum non-gaussianity [2]. The kurtosis of a zero-mean random variable v is defined as

kurt(v) = E{v^4} − 3(E{v^2})^2,  (2)

which is used as a measure of non-gaussianity. As noted in [2, 4], the independent source signals can be found by locating the extrema of kurt(w^T x) sequentially, in which x is the input random vector, w ∈ R^{n×1} denotes a column of the matrix W and E{·} is the expectation operator. The gradient ICA algorithm has been proposed to solve this optimization problem and can be used in a non-stationary environment; see [6]. The literature on the theoretical convergence of this algorithm is sparse. Convergence was first addressed in [2] using principles of stochastic approximation, but how to choose a suitable learning rate in practical computation is still not clear. In [6], monotonic convergence was shown provided the weight vector w is not a stationary point; however, convergence conditions on the initial weight vector and the learning rate remained unknown. Using special properties of the kurtosis function derived in [5], we show global convergence of this discrete-time one-unit gradient ICA algorithm when the learning rate and the initial weight vector satisfy suitable conditions. In [7], a sequential extraction algorithm without error accumulation was proposed. In contrast to their stability analysis, our main contribution is that convergence conditions on the initial weight vector and a bound on the learning rate are derived. The result may alleviate the guesswork that accompanies choosing a suitable learning rate in practical computation. Our approach here is purely mathematical; since there are many experimental results on this algorithm in the ICA literature (see e.g. the book [1] and references therein), source separation experiments are not presented here.
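The kurtosis contrast (2) is easily estimated from samples. A small sanity check of the definition and of the scaling property kurt(αv) = α^4 kurt(v) (our own illustration, not from the paper):

```python
import numpy as np

def kurt(v):
    # Sample version of Eq. (2): kurt(v) = E{v^4} - 3 (E{v^2})^2.
    v = np.asarray(v, dtype=float)
    return float(np.mean(v ** 4) - 3.0 * np.mean(v ** 2) ** 2)

rng = np.random.default_rng(0)
g = rng.standard_normal(500000)           # Gaussian: kurtosis ~ 0
b = np.sign(rng.standard_normal(500000))  # binary +/-1: kurtosis exactly -2

print(abs(kurt(g)) < 0.05)                        # True
print(kurt(b))                                    # -2.0, the minimum kurtosis value
print(np.isclose(kurt(3 * b), 3 ** 4 * kurt(b)))  # True: scaling property
```

The binary signal attains the minimum kurtosis −2 of a sub-Gaussian signal mentioned below.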
2 Problem Statement
Here, we assume the signals are zero-mean with unit variance. The kurtosis function has many useful properties. For a Gaussian random variable, the kurtosis is zero; for any two independent random variables v1 and v2 and any scalar α, it holds that kurt(v1 + v2) = kurt(v1) + kurt(v2) and kurt(αv1) = α^4 kurt(v1). A random variable v with kurt(v) < 0 is said to be subgaussian; if the kurtosis is positive, it is called supergaussian. The kurtosis of a supergaussian signal can have a very large positive value, while the minimum kurtosis value of a subgaussian signal is −2 [1]. Assume y represents a component of the output of the ICA neural network, and y = w^T x is a linear combination of the zero-mean observation x. If y has maximal or minimal kurtosis under the constraint ‖w‖ = 1 and the mixing matrix
is orthogonal, then, as noted in [4], y will be one of the independent source signals si. To find a weight vector w which minimizes or maximizes kurt(w^T x), a neural algorithm based on gradient descent or ascent was proposed in [2] as follows:

w+ = w(k) + δη(k)[h(w(k)) − Θ(w(k))w(k)],  (3)
w(k+1) = w+/‖w+‖,  (4)

where δ = sign(kurt(w^T(0)x)), h(w) = E{x(w^T x)^3}, Θ(w) = h^T(w)w, and η(k) is the learning rate. For simplicity, we call it algorithm k-w in this paper. If δ = 1, we are extracting a supergaussian signal; if δ = −1, the extracted signal is subgaussian. Thus the algorithm is composed of the two steps (3) and (4) in each iteration. The algorithm is in batch mode; to separate the independent source signals on-line, we only need to drop the expectation operator.

Define z^T = w^T A; it follows that y = w^T x = w^T As = z^T s. Multiplying both sides of the gradient algorithm (3) from the left by A^T and recalling that s = A^T x since AA^T = I, we have

z+ = z(k) + δη(k)[h(z(k)) − Θ(z(k))z(k)].  (5)

Since AA^T = I, ‖z+‖ = sqrt(w+^T A A^T w+) = sqrt(w+^T w+) = ‖w+‖. Multiplying both sides of (4) from the left by A^T, it follows that

z(k+1) = z+/‖z+‖.  (6)

We call (5) and (6) algorithm k-z. The convergence of algorithm k-w can be analyzed by transforming it into algorithm k-z. If we can show that z(k) converges to a unit basis vector, meaning only one component of the vector is one (in absolute value) and the others are zero, then an independent source signal is extracted.
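A toy run of algorithm k-w can make the transformation concrete (our own sketch; the sources, mixing matrix and learning rate are illustrative assumptions, with A taken orthogonal as the model requires):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two zero-mean, unit-variance sources: one sub-Gaussian, one super-Gaussian.
n, T = 2, 20000
s = np.vstack([np.sign(rng.standard_normal(T)),        # binary +/-1, kurt = -2
               rng.laplace(size=T) / np.sqrt(2)])      # Laplacian, kurt = +3
A, _ = np.linalg.qr(rng.standard_normal((n, n)))       # orthogonal mixing matrix
x = A @ s

w = rng.standard_normal(n)
w /= np.linalg.norm(w)
delta = np.sign(np.mean((w @ x) ** 4) - 3.0)           # sign(kurt(w^T(0) x))
eta = 0.01                                             # assumed learning rate

for k in range(500):
    y = w @ x
    h = (x * y ** 3).mean(axis=1)                      # h(w) = E{x (w^T x)^3}
    theta = h @ w                                      # Theta(w) = h(w)^T w
    w_plus = w + delta * eta * (h - theta * w)         # step (3)
    w = w_plus / np.linalg.norm(w_plus)                # step (4)

z = A.T @ w                                            # z^T = w^T A
print(np.round(np.abs(z), 2))                          # one entry ~1, the other ~0
```

As the analysis below predicts, |z| approaches a unit basis vector, i.e., one source is extracted.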
3 Convergence Analysis
Let us first establish some properties of the stationary point set of algorithm k-z. Assume s(k) contains source signals that are statistically independent with zero means and unit variances; then the following lemma holds (see [5] for details).

Lemma 1. Since y(k) = z^T(k)s(k), then

E{y^3(k)s(k)} = Kf(z(k)) + 3‖z(k)‖^2 z(k),  (7)

where K is a diagonal matrix whose (i, i)th element is κi and the ith element of f(z(k)) is zi^3(k).

Assume without loss of generality that the |κi| values are ordered and bounded above by a constant c, such that c > |κ1| ≥ |κ2| ≥ ... ≥ |κm| > |κm+1| = ... = |κn| = 0. Moreover, defining Γ = {1, ..., m}, Γ1 = {i | i ∈ Γ, κi > 0} and Γ2 = {i | i ∈ Γ, κi < 0}, we have the following lemma.
Lemma 2. The stationary point set of algorithm k-z is

E1 = { z | Σ_{j∉Γ} zj^2 = 1 and zj = 0 for j ∈ Γ }

or

E2 = { z | |zi| = sqrt( κi^{-1} / Σ_{j∈J} κj^{-1} ) for i ∈ J, and zj = 0 for j ∉ J }

for any subset J ⊆ Γ1 or J ⊆ Γ2.

Proof. The stationary points of algorithm k-z solve the equation

h(z) − (z^T h(z)) · z = 0.

By Lemma 1, this gives

κi zi^3 − (Σ_{j∈Γ} κj zj^4) · zi = 0  (8)

for i ∈ Γ. If Σ_{j∈Γ} κj zj^4 = 0, we have zi = 0 for i ∈ Γ. Since ‖z‖ = 1, it follows that Σ_{j∉Γ} zj^2 = 1. These stationary points make up the set E1.

When Σ_{j∈Γ} κj zj^4 ≠ 0, it is easy to see that zj = 0 if j ∉ Γ. Assume zj ≠ 0 exactly for the indices j ∈ J ⊆ Γ; from Eq. (8), we have

κi zi^2 = κj zj^2 = Σ_{k∈J} κk zk^4

for all i, j ∈ J. From the above equation, κi and κj must have the same sign for all i, j ∈ J, so J ⊆ Γ1 or J ⊆ Γ2. Since ‖z‖ = 1, we have

zi^2 Σ_{j∈J} (κi/κj) = 1

for all i ∈ J. Solving these equations yields the stationary point set E2. The proof is completed.

To extract true independent source signals, the separating solutions should be |zi| = 1, zj = 0 for j ≠ i, which belong to the stationary point set E2. To prevent the algorithm from stopping at the other stationary points, some additional conditions need to be satisfied. Next, we show that the vector z(k) of algorithm k-z converges to a separating solution if suitable conditions on the initial vector z(0) and the learning rate η(k) are imposed.
Theorem 1. Assume there exists an index p such that |κp zp^2(0)| > |κi zi^2(0)| for i ≠ p, where i, p belong to Γ1 or Γ2, and that

η(k) < 1 / ( max_j |κj zj^2(k)| + |β(k)| ),

in which j ∈ Γ2 if δ = 1 and j ∈ Γ1 if δ = −1, and β(k) = Σ_{i∈Γ} κi zi^4(k). Then z(k) converges to the separating solution z with |zp| = 1 and zi = 0 for i ≠ p.

Proof. By Eq. (5), if δ = 1, we have

zi+ = zi(k) + η(k)[κi zi^3(k) − β(k)zi(k)]  (9)

for all i ∈ Γ. On the other hand, if δ = −1, the algorithm changes into

zi+ = zi(k) − η(k)[κi zi^3(k) − β(k)zi(k)]  (10)

for all i ∈ Γ. When δ = 1, from Eq. (6) it follows that

zi(k+1)/zp(k+1) = θ(k) · zi(k)/zp(k)  (11)

for all i ≠ p, where

θ(k) = [1 + η(k)(κi zi^2(k) − β(k))] / [1 + η(k)(κp zp^2(k) − β(k))].  (12)

Since η(k) < 1 / ( max_{i∈Γ2} |κi zi^2(k)| + |β(k)| ), ‖z(k)‖ = 1 and by the initial condition on z(0), it follows that

1 + η(0)(κp zp^2(0) − β(0)) > 1 + η(0)(κi zi^2(0) − β(0)) > 0  (13)

for all i ≠ p. By Eq. (11), we have κp zp^2(1) > κi zi^2(1) for all i ≠ p with i ∈ Γ1. Similarly, |κp zp^2(k)| > |κi zi^2(k)| for all k, where i ≠ p and i ∈ Γ1. By the condition on the learning rate η(k), we have θ(k) < 1 for all k. So, with a suitable condition on the initial vector z(0), z(k) converges to a stationary point in E2, i.e.,

|zi(k)| = (|zi(k)| / |zp(k)|) · |zp(k)| → 0  (14)

for all i ≠ p as k → +∞, and |zp(k)| → 1 because ‖z(k)‖ = 1 as k → +∞. This means that z(k) converges to a separating solution.

For the case δ = −1, Eq. (10) should be used. By the same method as in the first case, with the condition η(k) < 1 / ( max_{i∈Γ1} |κi zi^2(k)| + |β(k)| ), one can show that z(k) converges to a separating solution z as k → +∞, with |zp| = 1 and zi = 0 for all i ≠ p.
Proposition 1. Let δ(k) = sign(kurt(w^T(k)x)); then δ(k) has the same sign as δ.

Proof. When δ = 1, we have Σ_{i∈Γ1} κi zi^4(0) > Σ_{i∈Γ2} |κi| zi^4(0). By Theorem 1, for i ∈ Γ2, |κi| zi^4(k) / (|κp| zp^4(k)) converges to zero faster than for i ∈ Γ1. This means that δ(k) keeps the sign of δ for k > 0. The case δ = −1 can be shown similarly.
Remark 1. Since κi is unknown beforehand, an approximated η(k) should be used in computation. When δ = 1, since the minimum value of kurtosis is −2 and |kurt(y)| = |kurt(z^T s)| = |Σ_{i=1}^n zi^4 κi|, we can let η(k) < 1 / (2 + |kurt(y(k))|), where kurt(y(k)) can be approximated by E{(w^T(k)x)^4} − 3. When δ = −1, it means that |Σ_{i∈Γ2} κi zi^4(k)| > |Σ_{j∈Γ1} κj zj^4(k)|. It follows that max_{j∈Γ1} |κj zj^4(k)| < 2n, so roughly we can let η(k) < 1 / (2n + |kurt(y(k))|).
Remark 2. From Theorem 1, if there are two or more indices i ∈ Γ for which the value of κi zi^2(0) is the same, the algorithm will not converge to a separating solution. Choosing an initial condition z(0) that is guaranteed to avoid this is very difficult. However, Theorem 2 below shows that the separating solutions are the only stable stationary points. So we can insert a few check steps to escape unstable stationary points: when the algorithm stops iterating, we give the stationary weight vector a small perturbation; running the algorithm again, the weight vector leaves the unstable stationary point and moves toward the stable ones. Repeating this procedure several times, the true separating solution will be found. Traditional perturbation analysis is used in the stability analysis.

Theorem 2. Only separating solutions are stable stationary points.

Proof. As in [5], assume z(k) = z + Δ where z ∈ E1 or z ∈ E2, and the norm of the vector Δ is very small. To satisfy the unit-norm constraint, the perturbation Δ should be chosen such that

Σ_{j∈Γ} zj Δj = 0.  (15)

If there are two or more indices i ∈ Γ such that zi ≠ 0, assume these indices form the set Ψ, which means that the value of κi zi^2 is the same for all i ∈ Ψ.
A small perturbation will make these values unequal. Without loss of generality, let p, i ∈ Ψ with |κp zp^2(k)| > |κi zi^2(k)|; from Theorem 1, we have zi(k) → 0 as k → +∞, which means that the weight vector is leaving this stationary point. This behavior is clearly unstable. For the stationary points belonging to E1, the unstable phenomenon can be shown similarly. For the separating solutions defined in Theorem 1, there exists only one index i ∈ Γ such that zi ≠ 0, so it must hold that Δi = 0. Since the perturbation Δ is very small, by Theorem 1 we have zj(k) → 0 as k → +∞ and |zi(k)| → 1 as k → +∞, which means that the separating solutions are stable.

Remark 3. In practical computation, since we use algorithm k-w, we do not know the value of the stationary point z exactly. However, since z^T = w^T A, we can let the perturbation Δ satisfy the equality w^T Δ = 0.
4 Conclusions
We have derived convergence conditions on the learning rate and the initial weight vector of the discrete-time one-unit gradient ICA algorithm developed in [2]. A rigorous mathematical proof is given that does not convert the discrete-time algorithm into a corresponding differential equation. The convergence condition on the learning rate helps alleviate the guesswork that accompanies choosing a suitable learning rate in practical computation. We do not provide additional simulation results here; instead, some theory for practical computation has been proposed and analyzed.
References
1. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley and Sons (2001)
2. Hyvärinen, A., Oja, E.: A Fast Fixed-Point Algorithm for Independent Component Analysis. Neural Computation 9(7) (1997) 1483-1492
3. Hyvärinen, A.: Fast and Robust Fixed-Point Algorithms for Independent Component Analysis. IEEE Trans. Neural Networks 10(3) (1999) 626-634
4. Delfosse, N., Loubaton, P.: Adaptive Blind Separation of Independent Sources: A Deflation Approach. Signal Processing 45 (1995) 59-83
5. Douglas, S.C.: On the Convergence Behavior of the FastICA Algorithm. Proc. of the 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003) (2003) 409-414
6. Regalia, P.A., Kofidis, E.: Monotonic Convergence of Fixed-Point Algorithms for ICA. IEEE Trans. Neural Networks 14(4) (2003) 943-949
7. Liu, Q., Chen, T.: Sequential Extraction Algorithm for BSS Without Error Accumulation. In: Wang, J., Liao, X., Yi, Z. (Eds.): Advances in Neural Networks - ISNN 2005. Lecture Notes in Computer Science, Vol. 3497. Springer, Berlin (2005) 466-471
A Novel Algorithm for Blind Source Separation with Unknown Sources Number

Ji-Min Ye 1, Shun-Tian Lou 1, Hai-Hong Jin 2, and Xian-Da Zhang 3

1 Key Lab for Radar Signal Processing, Xidian University, Xi'an 710071, China, {jmye, shtlou}@mail.xidian.edu.cn
2 School of Science, Xi'an Petroleum University, Xi'an 710071, China, [email protected]
3 Department of Automation, State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing 100084, China, [email protected]
Abstract. The natural gradient blind source separation (BSS) algorithm with an unknown source number proposed by Cichocki in 1999 is justified in this paper. A new method to detect the redundant separated signals, based on the structure of the separating matrix, is proposed; by embedding it into the natural gradient algorithm, a novel BSS algorithm for an unknown source number is developed. The novel algorithm can successfully separate the source signals and converge stably, while Cichocki's algorithm would inevitably diverge. If the parameters are chosen properly, the method embedded in the novel algorithm detects and cancels the redundant separated signals within about 320 iterations, which is far quicker than the method based on decorrelation.
1 Introduction
The blind source separation (BSS) problem has received increasing interest and become an active research area in both the statistical signal processing and the unsupervised neural learning communities. BSS is a fundamental problem in signal processing with a large number of extremely diverse applications, such as array processing, multiuser communications, voice restoration and biomedical engineering [1]. Although many recently developed algorithms ([2], [3], [4], [5]) are able to successfully separate source signals, there are still many problems to be studied, for example, the development of learning algorithms which work when: 1) the number of source signals is unknown; 2) the number of source signals is dynamically changing. In this paper, we focus on the first case and develop a new algorithm which can work under it. The noise-free instantaneous BSS problem is formulated as follows:

xt = A st,  (1)

where A ∈ R^{m×n} is called the "mixing matrix", which is fixed but unknown, xt = [x1(t), ..., xm(t)]^T ∈ R^{m×1} is the vector of observed data, and st = [s1(t), ..., sn(t)]^T ∈ R^{n×1} represents the vector of the independent source signals, all but perhaps one of which are non-Gaussian.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1147–1152, 2006. © Springer-Verlag Berlin Heidelberg 2006

To deal with BSS with an unknown source number, in [6] the natural gradient algorithm, proposed primarily for BSS with m = n, is directly generalized to BSS with an unknown source number by using the following separating model:

yt = W xt,  (2)

where W is an m × m nonsingular square matrix instead of an n × m matrix with full rank; thus the separating model (2) does not depend on the source number at all. It was observed via simulation experiments (see e.g. [6]) that, as the above algorithm converges, there are n independent components among the outputs, while the other m − n components are copies of some independent component(s), and thus are redundant.
2 Justification of the Natural Gradient Algorithm for BSS with Unknown Sources Number

2.1 Contrast Function for BSS with Unknown Source Number

The BSS community has long recognized that the contrast function, i.e., the objective function for source separation, is the key to developing BSS algorithms. The mutual information of yt is a measure of the dependence among the components of yt, and is defined by the Kullback-Leibler (KL) divergence I(y; W) between the joint probability density function (pdf) p(y) of y and its factorized version p̃(y), namely

I(y; W) = D[p(y)||p̃(y)] = ∫_{−∞}^{∞} p(y) log( p(y)/p̃(y) ) dy,  (3)

where p̃(y) = Π_{i=1}^m pi(yi) is the product of the m marginal pdf's of y, and pi(yi) is the marginal pdf of yi, i.e.,

pi(yi) = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} p(y) dy1 ... dyi−1 dyi+1 ... dym.  (4)

As proved by Comon [7], when m = n, I(y; W) is a contrast function for ICA, meaning I(y; W) = 0 iff

W = DPA^{−1},  (5)

where D is an n × n diagonal matrix, P is an n × n permutation matrix, and A^{−1} is the inverse of A. When m ≥ n and the source number is unknown, it is proved in [10] that the mutual information of the output yt of the separating model (2) is still a contrast function for BSS, meaning that the mutual information I(y; W) attains its local minima iff the m × m separating matrix

W = G[(A+)^T, N(A^T) + W̄^T]^T,  (6)

in which G denotes an m × m generalized permutation matrix, A+ = (A^T A)^{−1}A^T the pseudo-inverse of the mixing matrix A, N(A^T) the m × (m − n) matrix constructed from the basis vectors of the null space of A^T, and W̄ an (m − n) × m matrix whose rows are made up of some row(s) of A+, respectively.

2.2 Derivation of the Natural Gradient Algorithm for BSS with Unknown Source Number
Since the mutual information of the output yt of the separating model (2) is a contrast function for BSS with an unknown source number, we can perform BSS with redundant outputs by minimizing the mutual information. By the properties of mutual information, we have

I(y1, ..., ym) = Σ_{k=1}^m H(yk) − H(x1, ..., xm) − log(|det(W)|).  (7)

Using the natural gradient

dWt/dt = −( ∂I(y1, ..., ym)/∂Wt ) Wt^T Wt  (8)

to search for a local minimum of the mutual information, one gets the natural gradient algorithm for BSS with unknown source number:

Wt+1 = Wt + ηt [I − Φ(yt) yt^T] Wt,  (9)
where ηt is the learning step size and Φ(yt) = [φ1(y1), ..., φm(ym)]^T is a vector of nonlinear score functions

φi(yi) = −d log pi(yi)/dyi = −p′i(yi)/pi(yi).

3 A Novel Algorithm with Redundancy Removal
Algorithm (9) can successfully separate the source signals; at convergence, a local minimum of the cost function (3) is achieved, and the outputs of the BSS network yt = Gξt provide the desired source signals. However, the natural gradient algorithm (9) cannot be stable, since the stationarity condition

E{I − Φ(yt) yt^T} = O,  (10)

in which E{·} is the expectation operator, does not hold when yt = Gξt. As pointed out in [10], the natural gradient algorithm (9) would inevitably diverge. To meet the stationarity condition (10), one possible way is to detect the redundant separated signals and cancel them by setting the rows corresponding to the redundant signals to zero. Here, we propose a new practical method for detecting the redundant signals. To this end, we define the correlation coefficients of the separating matrix rows with respect to the mixing matrix A as

ri,j = min{ norm((Wi − Wj)·B), norm((Wi + Wj)·B) },  i, j = 1, 2, ..., m and j < i,  (11)

where norm(·) denotes the 2-norm of a vector, B = [xk−1, ..., xk−2m] is an m × 2m matrix composed of 2m observed signal vectors, and Wi denotes the ith row of W. At convergence, the rows of the separating matrix which yield copies of the same source signal differ only by a basis vector of the null space N(A^T) and/or a sign. For simplicity, let G in (6) be diag(λ1, ..., λm). A simple calculation yields

ri,j = norm(|λi| si − |λj| sj) if Wi and Wj yield copies of different sources,
ri,j = 0 if Wi and Wj yield copies of the same source,  (12)

where si = [si,k−1, ..., si,k−2m]. Because the sources are mutually independent, |λi| si − |λj| sj will not be zero; thus norm(|λi| si − |λj| sj) is a nonzero scalar, and simulation shows it is always far greater than zero. Based on the above analysis, we have the new algorithm for BSS with an unknown source number.

The novel algorithm:
Step 1: Run the natural gradient algorithm (9) for the first 200 samples.
Step 2: After 200 samples, compute ri,j for i, j = 1, 2, ..., m and j < i; detect any ri,j which is sufficiently small (ri,j < a, a positive constant) and cancel yi or yj by setting Wi or Wj to zero.
Step 3: Repeat Step 2 until all ri,j are sufficiently large (ri,j > b, a positive constant).
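The detection rule (11) can be sketched in a few lines (our own illustration; W and B here are synthetic stand-ins, and the thresholds a, b would be tuned as described above):

```python
import numpy as np

def redundancy_coeffs(W, B):
    # r_ij = min{ ||(W_i - W_j) B||, ||(W_i + W_j) B|| }, cf. Eq. (11);
    # a small r_ij flags rows i and j as copies of the same source,
    # possibly with opposite signs.
    m = W.shape[0]
    return {(i, j): min(np.linalg.norm((W[i] - W[j]) @ B),
                        np.linalg.norm((W[i] + W[j]) @ B))
            for i in range(m) for j in range(i)}

rng = np.random.default_rng(1)
m = 4
B = rng.standard_normal((m, 2 * m))  # stand-in for 2m observed signal vectors
W = rng.standard_normal((m, m))
W[2] = -W[0]                         # make row 2 a sign-flipped copy of row 0
r = redundancy_coeffs(W, B)
print(r[(2, 0)] < 1e-9)              # True: the redundant pair is detected
```

In the algorithm above, any pair with ri,j < a would then be cancelled by zeroing one of the two rows.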
4 Simulations
In order to verify the effectiveness of the proposed algorithm, we consider the separation of the following source signals (taken from [1], [3], [6]):
S1) Sign signal: s1(t) = sign(cos(2π155t));
S2) High-frequency sinusoid signal: s2(t) = sin(2π800t);
S3) Amplitude-modulated signal: s3(t) = sin(2π9t) sin(2π300t);
S4) Phase-modulated signal: s4(t) = sin(2π300t + 6 cos(2π60t));
S5) Noise: n(t) uniformly distributed in [−1, +1], and s5(t) = n(t).
In the simulations, 8 sensors are used, and the mixing matrix A is fixed in each run; its elements are randomly generated from the uniform distribution on [−1, +1] such that A has full column rank. The mixed signals are sampled at the rate of 10 kHz, i.e., the sampling period is T = 0.0001 s, and we take ηt = 120 × T. We apply the adaptive score function proposed by Yang et al. [9]:
φi(yi,t) = ( −(1/2)κ3^i + (9/4)κ3^i κ4^i ) yi,t^2 + ( −(1/6)κ4^i + (3/2)(κ3^i)^2 + (3/4)(κ4^i)^2 ) yi,t^3,  (13)

where κ3^i = E{yi,t^3} and κ4^i = E{yi,t^4} − 3 represent the skewness and the kurtosis of yi,t, respectively, and they are updated as follows:

dκ3^i/dt = −μt (κ3^i − yi,t^3),  (14)
dκ4^i/dt = −μt (κ4^i − yi,t^4 + 3),  (15)
Fig. 1. Separated results of the proposed algorithm after 3000 iterations, with a = 0.8, b = 1.5
waveform
20
15
10
5
0 9.8
9.82
9.84
9.86
9.88
9.9
9.92
9.94
9.96
9.98
10 4
iteration number
x 10
Fig. 2. separated results of the proposed algorithm with 100000 iteration and a = 0.8, b = 1.5
in which μt is the step size parameter, taken as μt = 50 × T in the simulations. The separated results of the novel algorithm after 3000 iterations in one run are shown in Figure 1, and the separated results after 100000 iterations in one run are shown in Figure 2. Extensive simulation experiments [10] have shown that the natural gradient algorithm for BSS with an unknown source number can iterate only about 3000-4000 times before divergence, while Figure 2 shows that the novel algorithm converges stably through 100000 iterations. The averaged
1152
J.-M. Ye et al.
iteration numbers needed to remove the three redundant separated signals, over 1000 independent runs with a = 0.9, b = 1.5 and a = 0.8, b = 1.5, are 312 and 425 respectively.
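The adaptive nonlinearity of Eq. (13) and the cumulant updates of Eqs. (14)-(15) can be sketched in NumPy as follows (an illustrative discrete-time version with an explicit step size `mu`; the function names are ours, not from the paper):

```python
import numpy as np

def phi(y, k3, k4):
    """Adaptive activation of Eq. (13): a polynomial in the output y whose
    coefficients depend on the running skewness k3 and kurtosis k4."""
    return ((-0.5 * k3 + 2.25 * k3 * k4) * y**2
            + (-k4 / 6.0 + 1.5 * k3**2 + 0.75 * k4**2) * y**3)

def update_cumulants(y, k3, k4, mu):
    """Discrete-time version of Eqs. (14)-(15): leaky running estimates of
    E{y^3} and E{y^4} - 3, driven by one output sample y."""
    k3 = k3 - mu * (k3 - y**3)
    k4 = k4 - mu * (k4 - y**4 + 3.0)
    return k3, k4
```

Called once per sample, `update_cumulants` tracks the skewness and kurtosis of each output, so `phi` adapts between sub-Gaussian and super-Gaussian shapes as the outputs evolve.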
5 Conclusion
The natural gradient BSS algorithm with unknown source number proposed by Cichocki is justified in this paper. A new method to detect the redundant separated signals, based on the structure of the separating matrix, is proposed. By embedding this detection method into Cichocki's natural gradient BSS algorithm with unknown source number, a novel BSS algorithm with unknown source number is developed, which can successfully separate the source signals and converge stably, whereas Cichocki's algorithm would inevitably diverge. This work is supported by the National Science Foundation of China (Grant No. 60572150).
References

1. Cardoso, J.F.: Blind Signal Separation: Statistical Principles. Proc. IEEE 86(10) (1998) 2009-2025
2. Amari, S., Cichocki, A., Yang, H.H.: A New Learning Algorithm for Blind Signal Separation. In: Advances in Neural Information Processing Systems, Vol. 8. MIT Press, Cambridge, MA (1996) 752-763
3. Amari, S.: Natural Gradient Works Efficiently in Learning. Neural Computation 10(2) (1998) 251-276
4. Bell, A.J., Sejnowski, T.J.: An Information Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation 7(6) (1995) 1004-1034
5. Cardoso, J.F., Laheld, B.H.: Equivariant Adaptive Source Separation. IEEE Trans. Signal Processing 44(12) (1996) 3017-3029
6. Cichocki, A., Karhunen, J., Kasprzak, W., Vigario, R.: Neural Networks for Blind Separation with Unknown Number of Sources. Neurocomputing 24(2) (1999) 55-93
7. Comon, P.: Independent Component Analysis - A New Concept? Signal Processing 36(3) (1994) 287-314
8. Karhunen, J., Cichocki, A., Kasprzak, W., Pajunen, P.: On Neural Blind Separation with Noise Suppression and Redundancy Reduction. Int. J. Neural Systems 8(2) (1997) 219-237
9. Yang, H.H., Amari, S.: Adaptive On-line Learning Algorithms for Blind Separation: Maximum Entropy and Minimum Mutual Information. Neural Computation 9(9) (1997) 1457-1482
10. Ye, J.M., Zhu, X.L., Zhang, X.D.: Adaptive Blind Separation with an Unknown Number of Sources. Neural Computation 16(8) (2004) 1641-1660
Blind Source Separation Based on Generalized Variance

Gaoming Huang1,2, Luxi Yang2, and Zhenya He2

1 Naval University of Engineering, Wuhan 430033, China
2 Department of Radio Engineering, Southeast University, 210096, China
[email protected], [email protected], [email protected]
Abstract. In this paper, a novel blind source separation (BSS) algorithm based on generalized variance is proposed, drawing on a property from multivariate statistical analysis. The separation contrast function of this algorithm is based on second-order moments. It can blindly separate super-Gaussian and sub-Gaussian signals at the same time without adjusting the learning function. The algorithm imposes few restrictions and its computational burden is light. Simulation results confirm that the algorithm is statistically efficient for all practical purposes and that the separation performance is good.
1 Introduction

In recent years, particularly after Herault and Jutten's work [1], many BSS algorithms have been proposed [2,3,4]. BSS algorithms can be classified into two kinds: iterative estimation methods based on information theory, and algebraic methods based on higher-order cumulants. They all rely on the independence or non-Gaussianity of the source signals. In the information-theoretic line of research, many effective algorithms have been proposed based on Infomax and minimal mutual information; the contrast functions of these algorithms are equivalent under specific conditions. The other class of BSS methods extracts the source signals sequentially, applying simpler higher-order statistics as cost functions. At the first ICA conference in 1999, Cardoso [5] pointed out that improving the practicability of BSS methods is the important current task. Some applications based on BSS have appeared in our recent work [6,7,8]. The aforementioned BSS algorithms still have many limitations in practice. By analyzing the statistical properties of the received signals, this paper proposes a novel BSS algorithm based on generalized variance. The method uses only the second-order moment information of the mixed signals; the computational burden is light and the flexibility is high. The algorithm can adapt to both super-Gaussian and sub-Gaussian signals, a universal property required in the development of blind signal processing. The details of the new method are described in the following sections.
2 Problem Formulation

The linear mixing model can be described as follows:

X(t) = AS(t) + N(t)   (1)

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1153-1158, 2006. © Springer-Verlag Berlin Heidelberg 2006
where X(t) = [x_1(t), x_2(t), …, x_p(t)]^T is the vector of observed signals, S(t) = [s_1(t), s_2(t), …, s_q(t)]^T is the vector of source signals, A is the mixing matrix, and N(t) = [n_1(t), n_2(t), …, n_p(t)]^T is the noise vector. We suppose that each component of the source signals has zero mean. In the following analysis the sources are assumed to be mutually independent and independent of the noise, and p ≥ q. The task is to recover the source signals from the received signals alone.
3 Blind Source Separation Algorithm Analysis

3.1 Basic Model

Suppose the variables are x_1, x_2, …, x_p; their second-order central moments form a p × p covariance matrix C:

C = [ c_11 … c_1p ; … ; c_p1 … c_pp ],  c_ij = Cov(x_i, x_j),  i, j = 1, 2, …, p   (2)
The generalized variance of C is the determinant |C|. The dispersion is largest when x_1, x_2, …, x_p are mutually independent; when they are correlated, they restrict each other and the dispersion is smaller. The principal diagonal element c_ii of C denotes the variance of x_i. When x_1, x_2, …, x_p are mutually independent, C is diagonal:

C = diag(c_11, …, c_pp)   (3)

and the corresponding generalized variance is |C| = ∏_{i=1}^p c_ii. Whether or not x_1, x_2, …, x_p are mutually independent, C is always a non-negative definite matrix. A famous inequality in linear algebra, Hadamard's inequality, states that |C| ≤ ∏_{i=1}^p c_ii. It indicates that the generalized variance reaches its maximum when x_1, x_2, …, x_p are mutually independent, which is a very distinct statistical property. If x_1, x_2, …, x_p are linearly related, then |C| = 0, so

0 ≤ |C| ≤ ∏_{i=1}^p c_ii   (4)

Since every c_ii is larger than zero, from Eq. (4) we get

0 ≤ |C| / ∏_{i=1}^p c_ii ≤ 1   (5)

From the definition of the correlation coefficient we get r_ij = Cov(x_i, x_j) / √(c_ii c_jj), and the correlation matrix R = (r_ij) can be written as

R = diag(1/√c_11, …, 1/√c_pp) · C · diag(1/√c_11, …, 1/√c_pp)   (6)

Then |R| = |C| / ∏_{i=1}^p c_ii, so Eq. (5) can be written as 0 ≤ |R| ≤ 1. When |R| is close to zero, the correlation among x_1, x_2, …, x_p is relatively high; when |R| is close to 1, x_1, x_2, …, x_p are close to independent. Based on this property, the independent components can be obtained by adjusting |R|.

3.2 Separation Algorithm
Supposing W is the separating matrix, the estimated signals can be described as:

Y(t) = WX(t)   (7)
The separation method based on generalized variance adjusts the weight matrix W so that |R_Y| approaches 1, which makes y_1, y_2, …, y_p close to independent. By the above analysis, and since C is non-negative definite, this requires log( |C| / ∏_{i=1}^p c_ii ) → 0. The objective function can therefore be defined as:

J = log( |C| / ∏_{i=1}^p c_ii ) = log(det(C)) − ∑_{i=1}^p log(c_ii)   (8)

The learning rule for the separating matrix W can be described as:

W(k+1) = W(k) + η ΔW(k),  ΔW(k) = −∂J/∂W(k)   (9)
where η is the learning step size. From Eq. (8) and Eq. (9), the following equation can be established:
ΔW(k) = −∂J/∂W(k) = (I − W^T)^{-1} { [diag(Y(t)Y^T(t))]^{-1} · Y(t)Y^T(t) − I }   (10)

So the weight update equation can be written as:

W(k+1) = W(k) + η ΔW(k) = W(k) + η (I − W^T)^{-1} { [diag(Y(t)Y^T(t))]^{-1} · Y(t)Y^T(t) − I }   (11)
By the foregoing analysis, the separation algorithm can be described as follows:

Step 1: Initialize the separating matrix W and define a learning threshold ε.
Step 2: Compute the objective function J. If |J| ≤ ε, go to Step 5; otherwise go to the next step.
Step 3: Compute the new separating weights by Eq. (11).
Step 4: Estimate the separated signals y_1, y_2, …, y_p and go to Step 2.
Step 5: Learning is complete; output the separated signals Y.
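The objective (8) and the update (11) can be sketched in NumPy as follows (an illustrative transcription rather than the authors' code; `objective` and `update_W` are our names, and one update here uses a whole block of separated samples Y, with rows as signals):

```python
import numpy as np

def objective(Y):
    """Objective J of Eq. (8): log det(C) - sum_i log(c_ii), where C is the
    sample covariance matrix of the separated signals Y (rows = signals)."""
    C = np.cov(Y)
    sign, logdet = np.linalg.slogdet(C)
    return logdet - np.sum(np.log(np.diag(C)))

def update_W(W, Y, eta):
    """One step of the learning rule of Eq. (11)."""
    p = W.shape[0]
    C = Y @ Y.T / Y.shape[1]              # averaged second-moment matrix of Y
    D_inv = np.diag(1.0 / np.diag(C))     # [diag(Y Y^T)]^{-1}
    delta = np.linalg.inv(np.eye(p) - W.T) @ (D_inv @ C - np.eye(p))
    return W + eta * delta
```

When the separated signals are already decorrelated, D_inv @ C equals the identity and the update vanishes, consistent with |R| = 1 at independence.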
4 Simulations

To verify the validity of the blind source separation algorithm proposed in this paper, a series of experiments has been conducted. To evaluate the separation effect quantitatively, we use the resemblance coefficient ξ_ij to measure the degree of resemblance between a separated signal y and an expected signal s. The resemblance coefficient ξ_ij can be described as:

ξ_ij = ξ(y_i, s_j) = ∑_{k=1}^n y_i(k) s_j(k) / √( ∑_{k=1}^n y_i^2(k) · ∑_{k=1}^n s_j^2(k) )   (12)
When y_i = c·s_j (c > 0), ξ_ij = 1; the coefficient thus tolerates amplitude differences between the separated and source signals. When y_i and s_j are mutually independent, ξ_ij = 0. When ξ_ij < 0, the separated signal has the opposite sign to the source. The first simulation uses super-Gaussian signals: two speech signals. The mixing matrix is a random matrix A = [0.7408 0.1241; 1.0804 -0.2901], there are 50000 sampling points, and the computation takes 0.422 s of CPU time. The separation results are shown in Fig. 1. The resemblance coefficient matrix of the separated and source signals is [0.9923 0.0128; 0.029 0.9875]. The second simulation uses sub-Gaussian signals: one FM signal and one AM signal. The mixing matrix is a random matrix A = [0.2652 -1.3438; 0.9411 1.2710], the
sampling points number 50000 and the computation takes 0.422 s of CPU time. The resemblance coefficient matrix of the separated and source signals is [-0.9836 0.0133; 0.016 0.9956]. The separation results are shown in Fig. 2.

Fig. 1. Blind source separation of super-Gaussian sources (amplitude vs. sampling interval)

Fig. 2. Blind source separation of sub-Gaussian sources (amplitude vs. sampling interval)
From the two simulation results, we can see that the separation performance of the new algorithm is very good. It effectively completes the separation and performs well in real time.
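The resemblance coefficient of Eq. (12) used in the evaluations above is a normalized inner product and can be computed directly (the function name is ours):

```python
import numpy as np

def resemblance(y, s):
    """Resemblance coefficient of Eq. (12) between a separated signal y
    and a source signal s (1-D arrays of equal length)."""
    return np.dot(y, s) / np.sqrt(np.dot(y, y) * np.dot(s, s))
```

It equals 1 for y = c·s with c > 0, -1 when the separated signal has reversed sign, and 0 for orthogonal (e.g., independent zero-mean) signals.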
5 Conclusion

This paper proposed a novel blind source separation algorithm based on generalized variance, which exploits the statistical independence of the source signals. The algorithm uses only second-order statistics, so its computational burden is relatively light. Its main contribution is that it can blindly separate super-Gaussian and sub-Gaussian signals at the same time without adjusting the learning function, which gives it broad application prospects.
Acknowledgement

This work was supported by NSFC (60496310, 60272046), the National High Technology Project (2002AA123031) of China, NSFJS (BK2002051) and the Grant of Ph.D. Programs of Chinese MOE (20020286014).
References

1. Jutten, C., Herault, J.: Blind Separation of Sources, Part I: An Adaptive Algorithm Based on Neuromimetic Architecture. Signal Processing 24 (1991) 1-10
2. David, V., Sánchez, A.: Frontiers of Research in BSS/ICA. Neurocomputing 49 (2002) 7-23
3. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. J. Wiley (2001)
4. Mansour, A., Barros, A.K., Ohnishi, N.: Blind Separation of Sources: Methods, Assumptions and Applications. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E83-A(8) (2000) 1498-1512
5. Cardoso, J.F., Jutten, C., Loubaton, P.: Preface of the ICA Proceedings. First Int. Workshop on Independent Component Analysis and Signal Separation (ICA99), Aussois, France (1999) 11-15
6. Huang, G.M., Yang, L.X., He, Z.Y.: Blind Source Separation Used for Time-Delay Direction Finding. ISNN 2004, LNCS, Vol. 3173 (2004) 660-665
7. Huang, G.M., Yang, L.X., He, Z.Y.: Application of Blind Source Separation to a Novel Passive Location. ICA 2004, LNCS, Vol. 3195 (2004) 1134-1141
8. Huang, G.M., Yang, L.X., He, Z.Y.: Application of Blind Source Separation to Time Delay Estimation in Interference Environments. ISNN 2005, LNCS, Vol. 3497 (2005) 496-501
Blind Source Separation with Pattern Expression NMF

Junying Zhang1,2, Zhang Hongyi1, Le Wei1, and Yue Joseph Wang3

1 School of Computer Science and Engineering, Xidian University, Xi'an 710071, P.R. China
[email protected]
2 Research Institute of Electronics Engineering, Xidian University, Xi'an 710071, P.R. China
3 Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Alexandria, VA 22314, USA
[email protected]
Abstract. Independent component analysis (ICA) is a widely applicable and effective approach to blind source separation (BSS) under the basic ICA model, but it is limited by the requirement that the sources be statistically independent. A more common situation is BSS under the non-negative linear (NNL) model, where observations are linear combinations of non-negative sources with non-negative coefficients and the sources may be statistically dependent. Recognizing that BSS under the basic ICA model corresponds to a matrix factorization problem, this paper proposes that BSS under the NNL model corresponds to a non-negative matrix factorization problem, and the non-negative matrix factorization (NMF) technique is utilized. For better expression of the patterns of the sources, NMF is further extended to pattern expression NMF (PE-NMF) and its algorithm is presented. Finally, experimental results are presented that show the effectiveness and efficiency of PE-NMF for BSS in a variety of applications that follow the NNL model.
1 Introduction

Blind source separation (BSS) has recently been a very active topic in the signal processing and neural network fields [1,2]. It is an approach to recover sources from their combinations (observations) without any knowledge of how the sources are mixed [3]. For the basic linear model, the observations are linear combinations of the sources, i.e., X = AS, where S is an r × n matrix indicating r sources in n-dimensional space, X is an m × n matrix showing m observations in n-dimensional space, and A is an m × r mixing matrix. The BSS problem is therefore a matrix factorization, i.e., factorizing the observation matrix X into a mixing matrix A and a source matrix S. Independent component analysis (ICA) has proved very effective for BSS in the basic linear model when the sources are statistically independent. It performs the factorization by searching for the most non-Gaussian directions in the scatter plot of the observations, and therefore estimates the recovered sources very well when the sources are statistically independent. This is based on the Central Limit Theorem, which states that the distribution of a sum (observations) of

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1159 - 1164, 2006. © Springer-Verlag Berlin Heidelberg 2006
independent random variables (sources) tends toward a Gaussian distribution under certain conditions. This induces the two main constraints of ICA in BSS applications [3]: (1) the sources should be statistically independent; (2) the sources should not be Gaussian distributed. The performance of the recovered sources with the ICA approach depends on how well these two constraints are satisfied, and decreases very rapidly when either of them is violated. However, in the real world there are many BSS applications where observations are non-negative linear combinations of non-negative sources and the sources are not guaranteed to be statistically independent [4]. This is the model referred to as the non-negative linear (NNL) model, i.e., X = AS with the elements of both A and S non-negative, where the rows of S (the sources) may not be statistically independent. One application of this model is gene expression profiling: an expression profile takes only non-negative values and represents a non-negative linear composite of more than one distinct but partially dependent source, namely the profiles from normal tissue and from cancer tissue. What needs to be developed is an algorithm to recover the dependent sources from the composite observations. It is easy to recognize that BSS under the NNL model is a non-negative matrix factorization, i.e., factorizing X into non-negative A and non-negative S, so the non-negative matrix factorization (NMF) technique seems applicable. NMF leads to a part-based representation because only additive, not subtractive, combinations of the original data are allowed [1,2]. The technique has been used in many applications including face classification [5], dynamic positron emission tomography [6], image processing [7], and language model adaptation [11].
NMF was also applied to BSS under the NNL model for the famous bar problem [13] in reference [4], but it fails on the extended bar problem presented in this paper, in which the sources are more difficult to separate. In this paper, we extend NMF to pattern expression NMF (PE-NMF) to reach much better BSS performance than the conventional NMF method. Our experimental results show the effectiveness and efficiency of PE-NMF for BSS on the extended bar problem and in a variety of applications that follow the NNL model.
2 Pattern Expression NMF and BSS for NNL Model

The NMF problem is: given a non-negative n × m matrix V, find non-negative n × r and r × m matrix factors W and H such that the difference measure between V and WH is minimal according to some cost function, i.e.,

V ≈ WH   (1)
From Equation (1), the i-th observation (the i-th column of V) is a non-negative linear combination of the columns of W, i.e., v_i = W h_i, which leads to a part-based representation because only additive, not subtractive, combinations of the original data are allowed [1,2]. Therefore, when the observation matrix V is optimally estimated by its factors, the r columns of the optimally estimated factor W can be viewed as r bases of the observations.
2.1 Pattern Basis

Let W_1, W_2, …, W_r be n-dimensional vectors. We refer to the space spanned by arbitrary non-negative linear combinations of these r vectors as the non-negative subspace spanned by W_1, W_2, …, W_r, in which W_1, W_2, …, W_r are the bases of the subspace. Evidently, a basis W_1, W_2, …, W_r derived by NMF from the observation data V can be viewed as a pattern expression of the observation data, but this basis may not be unique. Fig. 1(a) shows an example of data V that have both {W_1, W_2} and {W_1', W_2'} as bases. Which basis, then, expresses the pattern of the observations in V more effectively? For an effective expression of the pattern in V, it is believed that the following three points should be satisfied: (1) the angles between basis vectors should be as large as possible, such that each data point in V is a non-negative combination of the vectors; (2) the angles between the basis vectors should be as small as possible, such that they clamp the observation data as tightly as possible; (3) each basis vector should be of nearly the same importance as the others in expressing the observation data, such that all are equally used to express the data in the subspace they span.
Fig. 1. The basis {W_1, W_2} / {W_1', W_2'} of the observation data V obeys/violates the definition of the pattern basis. The basis {W_1', W_2'} has too large a between-angle (a), too small a between-angle (b), and is not of equal importance (c).
The vectors defined by the above three points are what we call the pattern basis of the data, and the number of vectors in the basis, r, is called the pattern dimension of the data. Fig. 1(a), (b) and (c) show, respectively, the too-large between-angle, too-small between-angle, and unequal-importance situations with {W_1', W_2'} as the basis, where the data in Fig. 1(a) and (b) are uniformly distributed in the grey area while those in Fig. 1(c) are non-uniformly distributed (the data in the dark grey area are denser than those in the light grey area). In these three cases, {W_1, W_2} is the better basis to express the data. Notice that the second point in the definition of the pattern basis readily holds from the NMF constraint that the elements of H are non-negative. The demand that the angles between pattern basis vectors be as large as possible requires W_i^T W_j → min for i ≠ j; the same importance of each pattern basis vector in expressing the
whole observation data with the pattern basis requires ∑_{ij} h_ij → min. Therefore, we have the following objective function for our pattern expression NMF (PE-NMF):

min_{W,H}  E(W, H) = (1/2) ||V − WH||^2 + α ∑_{i≠j} w_i^T w_j + β ∑_{ij} h_ij
s.t.  w_ij ≥ 0,  h_ij ≥ 0   (2)
where α and β are weight parameters for the above constraints.

2.2 PE-NMF Algorithm

We derived the following iterative algorithm for minimizing the above objective function; it has been proved to be convergent (the proof is omitted due to space limitations).

Step 1: Generate non-negative matrices W and H at random.
Step 2: For the a-th element h_a of each column h of H, a = 1, …, r, calculate

h_a^{t+1} = h_a^t (W^T v)_a / ( (W^T W h^t)_a + β )   (3a)

where v is the column of V corresponding to the column h of H. For the a-th element w_a of each row w of W, a = 1, …, r, calculate

w_a^{t+1} = w_a^t (v H^T)_a / ( (w^t H H^T + α w^t M)_a )   (3b)

where v is the row of V corresponding to the row w of W, and M is an r × r matrix whose diagonal elements are 0s and all other elements are 1s.
Step 3: Repeat Step 2 until both H^{t+1} and W^{t+1} converge.

2.3 Blind Source Separation for NNL Model
the sources such that Si S jT → min for i ≠ j , meaning that the first-order dependence between the i th and j th source is the least, but may not be statistically independent, hence applicable to BSS for NNL model for dependent sources. If we compare PENMF and ICA for BSS for NNL model, it is indicated that ICA searches the most nongausianity directions in the scatter plot of the observations as the rows of the
mixing matrix A and obtains the recovered sources indirectly from the directions in A, rather than searching for the sources (the basis) directly in the observation data subspace. Hence, PE-NMF does not readily induce instability in the recovered sources, compared with ICA.
Fig. 2. BSS experimental result from the proposed PE-NMF for NNL model. (a) source images, (b) mixed images, (c) recovered images from PE-NMF, and (d) convergence of the PE-NMF algorithm.
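The per-element updates (3a)-(3b) of Sect. 2.2 can be collected into whole-matrix multiplicative updates, sketched below in NumPy (an illustrative sketch, not the authors' code: the random initialization, the fixed iteration count in place of a convergence test, and the small `eps` guarding against division by zero are our choices):

```python
import numpy as np

def pe_nmf(V, r, alpha=4.0, beta=1.0, n_iter=200, eps=1e-9, seed=0):
    """PE-NMF multiplicative updates of Eqs. (3a)-(3b) in matrix form.
    V: non-negative n x m data matrix; returns W (n x r) and H (r x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r))
    H = rng.random((r, m))
    M = np.ones((r, r)) - np.eye(r)   # 0s on the diagonal, 1s elsewhere
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + beta + eps)               # Eq. (3a)
        W *= (V @ H.T) / (W @ (H @ H.T) + alpha * W @ M + eps)    # Eq. (3b)
    return W, H
```

The updates keep W and H non-negative because every factor is non-negative; alpha penalizes overlap between basis vectors and beta penalizes the total magnitude of H, matching the two extra terms of Eq. (2).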
3 Experimental Results

The famous bar problem in reference [13] is extended for performance evaluation of the proposed algorithm. Fourteen 4 by 4 source images with four thin vertical bars, four thin horizontal bars, three wide vertical bars and three wide horizontal bars, shown in Fig. 2(a), are evidently statistically dependent. These source images were randomly mixed with mixing matrices whose elements were arbitrarily chosen in [0,1] to form 20 mixed images, shown in Fig. 2(b). The PE-NMF with parameters α = 4 and β = 1 was performed on these 20 mixed images with r = 14, and the 14 recovered images are shown in Fig. 2(c). Fig. 2(d) shows the convergence of the iteration process of the PE-NMF algorithm. For comparison, the images recovered directly by NMF with r = 14 and by ICA are shown in Fig. 3(a) and Fig. 3(b) respectively. Notice that the number of images recovered by ICA [8] is only 6, and both the 6 images obtained by ICA and the 14 obtained by NMF are far from the real source images shown in Fig. 2(a).

Fig. 3. Images obtained from direct application of ICA (a) and from direct application of NMF (b) to the mixed images shown in Fig. 2(b)

This paper developed the pattern expression non-negative matrix factorization (PE-NMF) approach for pattern expression. The PE-NMF algorithm was presented and applied to blind source separation for the non-negative linear (NNL) model. Its successful application to blind source separation in the extended bar problem shows its great potential for BSS problems under the NNL model.
Acknowledgement

This work was supported by the Chinese National Science Foundation under Grant Nos. 60371044, 60071026 and 60574039.
References

1. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley, New York (2001)
2. Hoyer, P.O., Hyvärinen, A.: Independent Component Analysis Applied to Feature Extraction from Colour and Stereo Images. Network: Computation in Neural Systems 11(3) (2000) 191-210
3. Haykin, S.: Neural Networks: A Comprehensive Foundation. 2nd edn. Prentice-Hall, Inc. (1999)
4. Zhang, J.Y., Wei, L., Wang, Y.: Computational Decomposition of Molecular Signatures Based on Blind Source Separation of Non-negative Dependent Sources with NMF. 2003 IEEE International Workshop on Neural Networks for Signal Processing, September 17-19, 2003, Toulouse, France (2003)
5. Guillamet, D., Vitria, J.: Classifying Faces with Non-negative Matrix Factorization. In: Proceedings of the 5th Catalan Conference for Artificial Intelligence (CCIA 2002). Castello de la Plana (2002) 24-31
6. Guillamet, D., Vitria, J.: Application of Non-negative Matrix Factorization to Dynamic Positron Emission Tomography. In: Proceedings of the International Conference on Independent Component Analysis and Signal Separation (ICA 2001). San Diego, California (December 9-13, 2001) 629-632
7. Guillamet, D., Vitria, J.: Unsupervised Learning of Part-based Representations. In: Proceedings of the 9th International Conference on Computer Analysis of Images and Patterns (September 05-07, 2001) 700-708
8. Hesse, C.W., James, C.J.: The FastICA Algorithm with Spatial Constraints. IEEE Signal Processing Letters 12(11) (2005) 792-795
9. Lee, D., Seung, H.S.: Learning the Parts of Objects by Non-negative Matrix Factorization. Nature 401 (1999) 788-791
10. Lee, D., Seung, H.S.: Algorithms for Non-negative Matrix Factorization. Advances in Neural Information Processing Systems 13 (2001) 556-562
11. Novak, M., Mammone, R.: Use of Non-negative Matrix Factorization for Language Model Adaptation in a Lecture Transcription Task. In: Proceedings of the 2001 IEEE Conference on Acoustics, Speech and Signal Processing, Vol. 1. Salt Lake City, UT (May 2001) 541-544
12. Guillamet, D., Bressan, M., Vitria, J.: Weighted Non-negative Matrix Factorization for Local Representations. In: Proc. of Computer Vision and Pattern Recognition (2001)
13. Foldiak, P.: Forming Sparse Representations by Local Anti-Hebbian Learning. Biological Cybernetics 64(2) (1990) 165-170
14. Khan, J., et al.: Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks. Nature Medicine 7(6) (2001) 673-679
Nonlinear Blind Source Separation Using Hybrid Neural Networks*

Chun-Hou Zheng1,2, Zhi-Kai Huang1,2, Michael R. Lyu3, and Tat-Ming Lok4

1 Intelligent Computing Lab, Institute of Intelligent Machines, Chinese Academy of Sciences, P.O. Box 1130, Hefei, Anhui, China
2 Department of Automation, University of Science and Technology of China
3 Computer Science & Engineering Dept., The Chinese University of Hong Kong, Hong Kong
4 Information Engineering Dept., The Chinese University of Hong Kong, Shatin, Hong Kong
[email protected]
Abstract. This paper proposes a novel algorithm, based on minimizing mutual information, for a special case of nonlinear blind source separation: post-nonlinear blind source separation. A network composed of a set of radial basis function (RBF) networks, a set of multilayer perceptrons and a linear network is used as the demixing system to separate sources in post-nonlinear mixtures. The experimental results show that the proposed method is effective, and that the local character of the RBF network's units allows a significant speedup in training the system.
1 Introduction

Blind source separation (BSS) in instantaneous and convolutive linear mixtures has been intensively studied over the last decade. Most blind separation algorithms are based on the theory of independent component analysis (ICA) when the mixture model is linear [1,2]. However, in real-world situations, nonlinear mixtures of signals are generally more prevalent. For nonlinear demixing [6,7], many difficulties arise and linear ICA is no longer applicable because of the complexity of the nonlinear parameters. In this paper, we investigate in depth a special but important instance of nonlinear mixtures, namely post-nonlinear (PNL) mixtures, and present a novel algorithm.
2 Post-nonlinear Mixtures

An important special case of the general nonlinear mixing model consists of the so-called post-nonlinear mixtures introduced by Taleb and Jutten [5], which can be seen as a hybrid of a linear stage followed by a nonlinear stage.
* This work was supported by the National Science Foundation of China (Nos. 60472111, 30570368 and 60405002).
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1165 – 1170, 2006. © Springer-Verlag Berlin Heidelberg 2006
Fig. 1. The mixing-separating system for PNL
In the post-nonlinear mixtures model, the observations x = (x_1, x_2, …, x_n)^T have the following specific form (as shown in Fig. 1):

x_i = f_i( ∑_{j=1}^n a_ij s_j ),  i = 1, …, n   (1)

The corresponding vector-matrix form can be written as:

x = f(As)   (2)
Contrary to general nonlinear mixtures, PNL mixtures have a favorable separability property. In fact, if the corresponding separating model for post-nonlinear mixtures, as shown in Fig. 1, is written as

y_i = ∑_{j=1}^n b_ij g_j(x_j)   (3)

then it can be demonstrated [5] that, under weak conditions on the mixing matrix A and on the source distributions, output independence is obtained if and only if every h_i = g_i ∘ f_i, i = 1, …, n, is linear. For more details, please refer to [5].
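The mixing model (1)-(2) and the separating system (3) can be sketched as follows (an illustrative NumPy transcription; the function names and the particular componentwise nonlinearities used in testing are our choices, not the paper's):

```python
import numpy as np

def pnl_mix(S, A, f):
    """Post-nonlinear mixing, Eq. (2): x = f(A s), with the i-th scalar
    nonlinearity f[i] applied to the i-th row of A @ S."""
    U = A @ S
    return np.vstack([f[i](U[i]) for i in range(U.shape[0])])

def pnl_separate(X, B, g):
    """Separating system, Eq. (3): y = B g(x), with g applied row-wise."""
    V = np.vstack([g[i](X[i]) for i in range(X.shape[0])])
    return B @ V
```

If each g_i inverts f_i and B inverts A, i.e., each h_i = g_i ∘ f_i is linear, the sources are recovered exactly; the blind problem is to find such B and g_i without knowing A or f.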
3 Contrast Function

In this paper, we use Shannon's mutual information as the measure of mutual dependence. It is defined as

$$I(y) = \sum_i H(y_i) - H(y) \qquad (4)$$

where $H(y_i) = -\int p(y_i)\log p(y_i)\,dy_i$ denotes Shannon's differential entropy. According to the theory given above, the separating system for PNL mixtures proposed in this paper is shown in Fig. 2, where $B$ and the $g_i$ form the unmixing structure for PNL, the $y_i$ are the extracted independent components, and the $\psi_i$ are nonlinear mappings used only for the optimization of the network.
Nonlinear Blind Source Separation Using Hybrid Neural Networks
Assume that each function $\psi_i(\phi_i, y_i)$ is the cumulative probability function (CPF) of the corresponding component $y_i$; then the $z_i$ are uniformly distributed in [0, 1] and consequently $H(z_i) = 0$ [7]. Moreover, because the $\psi_i(\phi_i, y_i)$ are all continuous and monotonically increasing transformations (thus also invertible), it can easily be shown that $I(z) = I(y)$ [7]. Consequently, we obtain
$$I(y) = I(z) = \sum_i H(z_i) - H(z) = -H(z) \qquad (5)$$
Therefore, maximizing H ( z ) is equivalent to minimizing I ( y ) .
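The step from (4) to (5) rests on the probability integral transform: passing a continuous variable through its own CPF yields a variable that is uniform on [0, 1], whose differential entropy is zero. A minimal numerical illustration of this fact (not taken from the paper; the Laplacian component and the sample size are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.laplace(size=100_000)            # one non-uniform component y_i

# Empirical CPF via a rank transform: z approximates F(y).
z = (np.argsort(np.argsort(y)) + 0.5) / y.size

# z is uniform on [0, 1]: every tenth of the interval holds a tenth of the mass.
hist, _ = np.histogram(z, bins=10, range=(0.0, 1.0))
print(hist)  # [10000 10000 10000 10000 10000 10000 10000 10000 10000 10000]
```

Here the rank transform plays the role of the parametric $\psi_i(\phi_i, y_i)$ blocks: when $\psi_i$ matches the CPF of $y_i$, the output $z_i$ is uniform and $H(z_i) = 0$.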
Fig. 2. The particular structure of the unmixing network
It has been proved in [7] that, given the constraints placed on $\psi_i(\phi_i, y_i)$, $z_i$ is bounded to [0, 1]; and given that $\psi_i(\phi_i, y_i)$ is also constrained to be a continuous increasing function, maximizing $H(z)$ will lead the $\psi_i(\phi_i, y_i)$ to become estimates of the CPFs of the $y_i$. Consequently, each $y_i$ should be a duplicate of a source $s_i$ up to sign and scale ambiguity. The fundamental problem we must solve is thus to optimize the network (formed by the $g_i$, $B$ and $\psi_i$ blocks) by maximizing $H(z)$.
4 Unsupervised Learning of the Separating System

With respect to the separation structure of this paper, the joint probability density function (PDF) of the output vector $z$ can be calculated as

$$p(z) = \frac{p(x)}{\det(B)\, \prod_{i=1}^{n} g'_i(\theta_i, x_i)\, \prod_{i=1}^{n} \psi'_i(\phi_i, y_i)} \qquad (6)$$
which leads to the following expression for the joint entropy:

$$H(z) = H(x) + \log\det(B) + \sum_{i=1}^{n} E\left(\log g'_i(\theta_i, x_i)\right) + \sum_{i=1}^{n} E\left(\log \psi'_i(\phi_i, y_i)\right) \qquad (7)$$
The minimization of $I(y)$, which here is equivalent to the maximization of $H(z)$, requires the computation of its gradient with respect to the separation structure parameters $B$, $\theta$ and $\phi$. In this paper, we use an RBF network [3,4] to model the nonlinear parametric functions $g_k(\theta_k, x_k)$, choosing the Gaussian kernel as the activation function of the hidden neurons. To implement the constraints on the $\psi$ functions easily, we use a multilayer perceptron to model the nonlinear parametric functions $\psi_k(\phi_k, y_k)$.
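As an illustration of how an RBF network with Gaussian kernels can parameterize one compensating nonlinearity $g_k(\theta_k, x_k)$, here is a sketch; the number of centers, their placement, and the parameter values are our assumptions, not details from the paper:

```python
import numpy as np

def rbf_g(x, centers, widths, weights):
    """RBF model of one compensating nonlinearity g_k.

    x: (T,) samples of the k-th observation.
    centers, widths, weights: (J,) RBF parameters (the theta_k of the text).
    """
    # Gaussian kernel activations, shape (T, J)
    phi = np.exp(-((x[:, None] - centers[None, :]) ** 2)
                 / (2.0 * widths[None, :] ** 2))
    return phi @ weights

# Hypothetical usage: 7 centers spread over the observed range.
x = np.linspace(-1.0, 1.0, 200)
centers = np.linspace(-1.0, 1.0, 7)
widths = np.full(7, 0.4)
weights = 0.1 * np.ones(7)
y = rbf_g(x, centers, widths, weights)
print(y.shape)
```

In training, the gradient of $H(z)$ in (7) with respect to `centers`, `widths` and `weights` would be used to adapt $g_k$; the localized kernels are what the paper credits for the faster, more stable training reported in Section 5.3.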
5 Experimental Results

5.1 Extracting Sources from Mixtures of Simulated Signals

In the first experiment, the source signals consist of a sinusoid signal and a "funny curve" signal [1], i.e. $s(t) = [(\mathrm{rem}(t,27)-13)/9,\ ((\mathrm{rem}(t,23)-11)/9)^5]^T$, shown in Fig. 3(a). The two source signals are first linearly mixed with the (randomly chosen) mixing matrix

$$A = \begin{pmatrix} -0.1389 & 0.3810\\ 0.4164 & -0.1221 \end{pmatrix} \qquad (8)$$
Then, the two nonlinear distortion functions

$$f_1(u) = f_2(u) = \tanh(u) \qquad (9)$$

are applied to each mixture to produce a PNL mixture. Fig. 3(b) shows the separated signals. To compare the performance of the proposed method with others, we also ran the MISEP method [7] on the same data. The correlations between the two recovered signals of each method and the two original sources are reported in Table 1. Clearly, the signals separated by the method proposed in this paper are more similar to the original signals.
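The PNL mixture of (8)–(9) is straightforward to reproduce. A sketch follows; the sample count is our choice, and we read the garbled exponent of the second source as 5, the usual "funny curve" demo signal:

```python
import numpy as np

t = np.arange(1, 1001, dtype=float)
s = np.vstack([
    (np.remainder(t, 27) - 13) / 9,            # sawtooth-like source
    ((np.remainder(t, 23) - 11) / 9) ** 5,     # "funny curve" source
])
A = np.array([[-0.1389, 0.3810],
              [0.4164, -0.1221]])              # mixing matrix of Eq. (8)
x = np.tanh(A @ s)                             # post-nonlinearity f_i = tanh, Eq. (9)
print(x.shape)
```

The bounded `tanh` stage is what makes the mixture genuinely post-nonlinear: no linear unmixing of `x` alone can recover the sources without first compensating it.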
Fig. 3. The two sets of signals. (a) Source signals. (b) Separated signals.
Table 1. Correlations between the two original sources and the two recovered signals

                              simulated signals        speech signals
Experiment                      y1        y2             y1        y2
MISEP                   S1    0.9805    0.0275         0.9114    0.0280
                        S2    0.0130    0.9879         0.0829    0.9639
Method in this paper    S1    0.9935    0.0184         0.9957    0.0713
                        S2    0.0083    0.9905         0.0712    0.9711
5.2 Extracting Sources from Mixtures of Speech Signals

To further test the validity of the proposed algorithm, we also experimented with real-life speech signals. In this experiment two speech signals (3000 samples, sampling rate 8 kHz, obtained from http://www.ece.mcmaster.ca/~reilly/kamran/id18.htm) are post-nonlinearly mixed by

$$A = \begin{pmatrix} -0.1412 & 0.4513\\ 0.5864 & -0.2015 \end{pmatrix} \qquad (10)$$

$$f_1(u) = \frac{1}{2}(u + u^3), \qquad f_2(u) = \frac{1}{6}u + \frac{3}{5}\tanh(u) \qquad (11)$$
The experimental results are shown in Fig. 4 and Table 1, which confirms the conclusion drawn from the first experiment.
Fig. 4. The two sets of speech signals. (a) Source signals. (b) Separated signals.
5.3 Training Speed

We also performed tests comparing, on the same post-nonlinear BSS problems, networks whose $g$ blocks had RBF structures with networks whose $g$ blocks had MLP structures. Table 2 shows the means and standard deviations of the number of epochs required to reach the stopping criterion, which was based on the value of the objective function $H(z)$, for MLP-based and RBF-based networks.
Table 2. Comparison of training speeds between MLP-based and RBF-based networks

            Two supergaussians      Superg. and subg.
              RBF       MLP           RBF       MLP
Mean          315       508           369       618
St. dev.      141       255           180       308
From the two tables we can see that the separating results of the two methods are very similar, but the RBF-based implementations trained faster and showed a smaller spread of training times (one epoch took approximately the same time in both kinds of network). This is mainly caused by the local character of RBF networks.
6 Conclusions

We proposed in this paper a novel algorithm for post-nonlinear blind source separation. The new method optimizes a network with a specialized architecture, using the output entropy as the objective function, which is equivalent to the mutual information criterion but does not require calculating the marginal entropies of the outputs. The experimental results showed that this method is competitive with existing ones.
References
1. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. J. Wiley, New York (2001)
2. Hyvärinen, A., Pajunen, P.: Nonlinear Independent Component Analysis: Existence and Uniqueness Results. Neural Networks 12(3) (1999) 429–439
3. Huang, D.S.: Systematic Theory of Neural Networks for Pattern Recognition. Publishing House of Electronic Industry of China, Beijing (1996)
4. Huang, D.S.: The United Adaptive Learning Algorithm for the Link Weights and the Shape Parameters in RBFN for Pattern Recognition. International Journal of Pattern Recognition and Artificial Intelligence 11(6) (1997) 873–888
5. Taleb, A., Jutten, C.: Source Separation in Post-nonlinear Mixtures. IEEE Trans. Signal Processing 47 (1999) 2807–2820
6. Martinez, D., Bray, A.: Nonlinear Blind Source Separation Using Kernels. IEEE Trans. Neural Networks 14(1) (2003) 228–235
7. Almeida, L.B.: MISEP – Linear and Nonlinear ICA Based on Mutual Information. Journal of Machine Learning Research 4 (2003) 1297–1318
Identification of Mixing Matrix in Blind Source Separation

Xiaolu Li and Zhaoshui He

School of Electronics and Information Engineering, South China University of Technology, Guangzhou 510640, China
[email protected], [email protected]
Abstract. An approach for blind identification of the mixing matrix and a corresponding algorithm are proposed in this paper. Many conventional Blind Source Separation (BSS) methods separate the source signals by estimating the separating matrix. In contrast, the approach presented here achieves BSS by directly identifying the mixing matrix, which applies in particular to the underdetermined case. Experiments are conducted to check the validity of the theory and the effectiveness of the algorithm.
1 Introduction

Blind Source Separation (BSS) attempts to recover the source signals from the observed signals without any information about the source signals or the mixing channel. BSS has potential applications such as the cocktail party problem, sonar, image processing/enhancement, and biomedical signal processing (e.g., EEG, ECG, MEG and fMRI). It has become one of the most active topics in the signal processing and neural network fields; for an overview of contributions, see [15, 18–20]. The following mathematical formulae describe BSS.
$$x = As \qquad (1)$$

$$y = Wx \qquad (2)$$

where $x = (x_1, \ldots, x_m)^T \in R^{m\times 1}$ is the vector of mixed signals, $A \in R^{m\times n}$ is the mixing matrix, and $s = (s_1, \ldots, s_n)^T$ is the vector of source signals. When $m < n$, model (1) is underdetermined. Many existing BSS methods require that the number of mixed signals be at least the number of source signals, i.e., $m \ge n$; in previous work, BSS is achieved by solving for the separating matrix $W$ in model (2) (see, e.g., [2, 3], [6, 7], [9] and [16]). For the so-called backward BSS model (2), the separating matrix $W$ can be estimated by the natural gradient method (see [1, 2] and [9]). However, if $A$ is singular or $m < n$, the backward BSS model (2) is unavailable. In the case of $m < n$, a so-called forward BSS method (see [11]) can be employed to identify $A$ and then to estimate the source signals via model (1). Under appropriate conditions the forward BSS method can deal with underdetermined BSS

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1171 – 1176, 2006. © Springer-Verlag Berlin Heidelberg 2006
problems ($m < n$). Some results are reported in [10], [12–14], [5], [8]. This paper focuses on identifying the matrix $A$ iteratively using the forward BSS method.

Assumptions. (A1) The $n$ components $s_1, \ldots, s_n$ of the source vector $s$ are mutually independent. (A2) If $m \le n$, any $m \times m$ submatrix $A_{l_1,\ldots,l_m} = [a_{l_1}, \ldots, a_{l_m}]$ of $A$ is invertible.
2 BSS Problem and Blind Identification Algorithm

By the maximum likelihood criterion, we have the cost function

$$L = \log(P(x)) = \log(P(x_1, \ldots, x_m)) \qquad (3)$$
The backward BSS can be implemented by solving the following optimization problem:

$$\max_W L = \max_W \log(P(x)), \quad \text{s.t.}\ Wx = s. \qquad (4)$$
Lee, Girolami and Sejnowski (see [9]) solved the above problem using the natural gradient method, where the natural gradient with respect to $W$ is

$$\Delta W = \frac{\partial L}{\partial W} W^T W = \left[I_{n\times n} + \varphi(s)\, s^T\right] W \qquad (5)$$

The separating matrix $W$ is updated by $W_{k+1} = W_k + \mu \Delta W_k$, where $\varphi(s) = (\varphi(s_1), \ldots, \varphi(s_n))^T$ and $\varphi(s_i) = \partial \log P(s_i)/\partial s_i$, $i = 1, \ldots, n$.
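One natural-gradient step of (5), $W_{k+1} = W_k + \mu\Delta W_k$, can be sketched as follows. We take $\varphi(s) = -\tanh(s)$, the score $\partial\log P(s)/\partial s$ of a super-Gaussian (log-cosh) prior; this choice, and the two-channel demo data, are our assumptions:

```python
import numpy as np

def natural_gradient_step(W, x, mu):
    """One update W <- W + mu * dW with dW = [I + phi(s) s^T] W  (Eq. (5)).

    phi(s) = -tanh(s), a score matching phi(s_i) = d log P(s_i)/ds_i
    for a super-Gaussian (log-cosh) prior -- our assumption.
    """
    s = W @ x                                 # current source estimate, (n, T)
    T = x.shape[1]
    phi = -np.tanh(s)
    dW = (np.eye(W.shape[0]) + (phi @ s.T) / T) @ W
    return W + mu * dW

# Hypothetical two-channel demo with Laplacian (super-Gaussian) sources.
rng = np.random.default_rng(1)
src = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
x = A @ src
W = np.eye(2)
for _ in range(300):
    W = natural_gradient_step(W, x, mu=0.05)
P = W @ A          # approaches a scaled permutation as separation succeeds
print(np.round(P, 2))
```

At convergence the off-diagonal entries of `P` shrink toward zero, reflecting the equivariance property of the natural gradient noted in [1].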
In addition, we can achieve forward BSS by solving the optimization problem

$$\max_A L = \max_A \log(P(x)), \quad \text{s.t.}\ As = x. \qquad (6)$$
For well-determined BSS, we propose an algorithm for identifying the mixing matrix $A$ in model (1), with the iteration formula

$$A \leftarrow A + \mu\,\Delta A, \qquad \Delta A = -A\left[I_{n\times n} + \varphi(s)\,s^T\right] \qquad (7)$$

where the constant $\mu$ is the learning step size. In fact, the iteration formula (7) is valid not only for well-determined BSS but also in the underdetermined case. Next we discuss this for underdetermined BSS. From formula (7), the natural gradient can also be expressed row-wise as

$$\Delta a_{i\cdot}^T = -a_{i\cdot}^T\left[I_{n\times n} + \varphi(s)\,s^T\right], \quad i = 1, \ldots, n \qquad (8)$$
where $\Delta A = [\Delta a_{1\cdot}, \ldots, \Delta a_{n\cdot}]^T$. Expression (8) shows that $\Delta a_{i\cdot}^T$ can be obtained if the source vector $s$ and $a_{i\cdot}$ are given. Thus, we have
Theorem 1. For $m < n$, $\mathrm{rank}(A) = m$ (underdetermined BSS), the iteration formula (7) can still be applied to estimate the mixing matrix.

Proof. Although there are only $m$ observed signals, we can add $n - m$ extra virtual observed signals denoted $x_{m+1}, \ldots, x_n$. Suppose that the virtual observed signal $x_i$ ($m+1 \le i \le n$) is obtained by $x_i = a_{i\cdot}^T s$. So we have

$$\begin{pmatrix} x\\ x_{m+1}\\ \vdots\\ x_n \end{pmatrix} = \begin{pmatrix} A\\ a_{m+1\cdot}^T\\ \vdots\\ a_{n\cdot}^T \end{pmatrix} s = A'\, s \qquad (9)$$

From (8), $\Delta a_{i\cdot}^T$ involves $a_{i\cdot}$ and the source signal vector $s$. If $a_{i\cdot}$ is known and $s$ can be estimated, then $\Delta a_{i\cdot}^T$ can be estimated from equation (8). For (9), we can choose appropriate virtual observed signals $x_{m+1}, \ldots, x_n$ such that $A'$ is invertible. Then from equation (7) we have

$$\Delta a_{i\cdot}^T = -a_{i\cdot}^T\left[I_{n\times n} + \varphi(s)\,s^T\right], \quad i = 1, \ldots, m, m+1, \ldots, n \qquad (10)$$

From equation (10) we derive $\Delta A = [\Delta a_{1\cdot}, \ldots, \Delta a_{m\cdot}]^T = -A\left[I_{n\times n} + \varphi(s)\,s^T\right]$. Thus the theorem is proved.
Remark. For the underdetermined problem (1), the source signals can be estimated after the mixing matrix has been estimated. It has been shown that if the source signals are sufficiently sparse, they can definitely be blindly separated (see, e.g., [5], [17], [12], [4], [9] and [11]). For the over-determined problem, the discussion is analogous.
3 Procedure of the Algorithm

Summing up the above discussion, the forward BSS algorithm can be stated as follows.
(1) Initialize the mixing matrix $A$ randomly as $A^{(0)}$. Set a step size $\mu$ and let $k = 0$.
(2) For the case $m < n$, if the source signals are sparse in some sense, the source vector $s^{(k)}$ can be estimated by the shortest-path algorithm (cf. [5]) or the maximum posterior probability method (cf. [10]).
(3) Estimate $\Delta A^{(k)}$ by substituting $s^{(k)}(t)$ into $\Delta A^{(k)} = -A^{(k)}\left[I_{n\times n} + \varphi(s^{(k)})(s^{(k)})^T\right]$.
(4) Update the mixing matrix by $A^{(k+1)} = A^{(k)} + \mu\,\Delta A^{(k)}$.
(5) If the procedure has converged, stop the iteration and go to (6); otherwise set $k = k + 1$ and go to (2).
(6) Output the estimate $\hat{s} = s^{(k)}$ of the source vector.
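Steps (1)–(5) can be sketched as below. We take $\varphi(s) = -\mathrm{sign}(s)$, the Laplacian score $\partial\log P(s)/\partial s$, so that the update of (7) has the fixed point $\mathrm{mean}[\mathrm{sign}(s)\,s^T] = I$ (the simulation section writes $\varphi(s) = \mathrm{sign}(s)$; we read it with this sign convention). The source-recovery step (2) is left abstract and replaced here, for a square demo only, by a simple inverse:

```python
import numpy as np

def identify_mixing_matrix(x, estimate_sources, mu=0.05, max_iter=300, tol=1e-5):
    """Forward BSS: iterate A <- A + mu * dA, dA = -A [I + phi(s) s^T]  (Eq. (7))."""
    m, T = x.shape
    A = np.eye(m)                       # square demo; the paper initializes randomly
    for _ in range(max_iter):
        s = estimate_sources(A, x)      # step (2): shortest-path / MAP in general
        phi = -np.sign(s)               # Laplacian score (our assumption)
        dA = -A @ (np.eye(s.shape[0]) + (phi @ s.T) / T)   # step (3)
        A = A + mu * dA                 # step (4)
        if np.linalg.norm(dA) < tol:    # step (5)
            break
    return A

def inv_sources(A, x):
    """Square-case stand-in for the sparse source-recovery step."""
    return np.linalg.solve(A, x)

rng = np.random.default_rng(0)
src = rng.laplace(size=(2, 4000))       # unit-scale Laplacian: E|s_i| = 1
A_true = np.array([[0.9659, 0.2588],
                   [0.2588, 0.9659]])
x = A_true @ src
A_hat = identify_mixing_matrix(x, inv_sources)
print(np.round(A_hat, 3))   # columns align with A_true up to order, sign and scale
```

In the underdetermined case the only change is that `estimate_sources` must exploit sparsity, exactly as steps (2) and the Remark describe.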
4 Simulation

To illustrate the effectiveness of our algorithm, we performed an experiment using voice signals. Three voices were employed as the source signals¹. They are super-Gaussian; their kurtoses are 16.9792, 29.0338 and 8.5586, respectively.
Fig. 1. 3 voice signals
In our simulation, the nonlinear function in the algorithm is selected as $\varphi(s) = \mathrm{sign}(s)$. To check the performance of the algorithm, the Signal-to-Interference Ratio (SIR) is employed:

$$\mathrm{SIR}(s, \hat{s}) = -10\log\frac{\|\hat{s} - s\|^2}{\|s\|^2} \qquad (11)$$

where $\hat{s}$ is the estimate of $s$. Usually, if the SIR is not less than 20 dB, the estimation is reliable. The three source signals are shown in Fig. 1. The mixing matrix is
$$A = \begin{pmatrix} \cos(\pi/12) & \cos(5\pi/12) & \cos(9\pi/12)\\ \sin(\pi/12) & \sin(5\pi/12) & \sin(9\pi/12) \end{pmatrix} = \begin{pmatrix} 0.9659 & 0.2588 & -0.7071\\ 0.2588 & 0.9659 & 0.7071 \end{pmatrix}.$$
By the MATLAB command randn('state',1); A=randn(2,3), the matrix is initialized as

$$A^{(0)} = \begin{pmatrix} 0.8644 & -0.8519 & -0.4380\\ 0.0942 & 0.8735 & -0.4297 \end{pmatrix}.$$

¹ The voice signals were downloaded from http://eleceng.ucd.ie/~srickard/bss.html; each has 5000 sampling points.
Take $\mu = 0.1$. After 33 iterations the algorithm converges. The SIRs between the separated signals and the source signals are 41.2324 dB, 24.0013 dB and 40.0427 dB, respectively, so the BSS result is valid. The estimate of the mixing matrix is

$$\hat{A} = \begin{pmatrix} 0.1129 & -0.0779 & -0.0136\\ 0.0283 & 0.0723 & -0.0546 \end{pmatrix}.$$
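For reference, the SIR of (11), reading the logarithm as base 10 (the usual dB convention), is a one-liner; the noisy estimate in the demo below is hypothetical:

```python
import numpy as np

def sir_db(s, s_hat):
    """Signal-to-Interference Ratio of Eq. (11) in dB."""
    return -10.0 * np.log10(np.sum((s_hat - s) ** 2) / np.sum(s ** 2))

rng = np.random.default_rng(0)
s = np.sin(np.linspace(0.0, 20.0, 5000))
s_hat = s + 0.01 * rng.standard_normal(5000)   # hypothetical estimate
print(round(sir_db(s, s_hat), 1))              # roughly 37 dB at this noise level
```

By the 20 dB rule of thumb quoted above, all three separated voices in the experiment count as reliable estimates.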
5 Conclusions

In this paper the forward BSS method was used to estimate the source signals: BSS is achieved by identifying the mixing matrix, and an algorithm was proposed to identify the mixing matrix for BSS problems. Theoretical analysis and simulations were presented, and the experiment showed good performance in the underdetermined case.
Acknowledgements The work is supported by the National Natural Science Foundation of China for Excellent Youth (Grant 60325310), the Guangdong Province Science Foundation for Program of Research Team (grant 04205783), the National Natural Science Foundation of China (Grant 60505005), the Natural Science Fund of Guangdong Province, China (Grant 05103553) and (Grant 05006508), the Specialized Prophasic Basic Research Projects of Ministry of Science and Technology, China (Grant 2005CCA04100).
References
1. Amari, S.: Natural Gradient Works Efficiently in Learning. Neural Computation 10(2) (1998) 251–276
2. Amari, S., Cichocki, A., Yang, H.: A New Learning Algorithm for Blind Signal Separation. In: Touretzky, D.S., Mozer, C.M., Hasselmo, M.E. (eds.): Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA (1996) 757–763
3. Bell, A.J., Sejnowski, T.J.: An Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation 7(6) (1995) 1129–1159
4. Belouchrani, A., Cardoso, J.F.: Maximum Likelihood Source Separation for Discrete Sources. In: Proc. EUSIPCO (1994) 768–771
5. Bofill, P., Zibulevsky, M.: Underdetermined Source Separation Using Sparse Representations. Signal Processing 81 (2001) 2353–2362
6. Cardoso, J.F., Laheld, B.H.: Equivariant Adaptive Source Separation. IEEE Trans. Signal Processing 44(12) (1996) 3017–3030
7. Comon, P.: Independent Component Analysis, a New Concept? Signal Processing 36 (1994) 287–314
8. Girolami, M.: A Variational Method for Learning Sparse and Overcomplete Representations. Neural Computation 13(11) (2001) 2517–2532
9. Lee, T.W., Girolami, M., Sejnowski, T.J.: Independent Component Analysis Using an Extended Infomax Algorithm for Mixed Subgaussian and Supergaussian Sources. Neural Computation 11(2) (1999) 417–441
10. Lee, T.W., Lewicki, M.S., et al.: Blind Source Separation of More Sources than Mixtures Using Overcomplete Representations. IEEE Signal Processing Letters 4 (1999) 87–90
11. Lewicki, M.S., Sejnowski, T.J.: Learning Overcomplete Representations. Neural Computation 12(2) (2000) 337–365
12. Li, Y., Cichocki, A., Amari, S.: Analysis of Sparse Representation and Blind Source Separation. Neural Computation 16(6) (2004) 1193–1234
13. Li, Y., Cichocki, A., Zhang, L.: Blind Source Estimation of FIR Channels for Binary Sources: a Grouping Decision Approach. Signal Processing 84(12) (2004) 2245–2263
14. Li, Y., Wang, J., Cichocki, A.: Blind Source Extraction from Convolutive Mixtures in Ill-conditioned Multi-input Multi-output Channels. IEEE Transactions on Circuits and Systems I 51(9) (2004) 1814–1822
15. Sanches, A.V.D.: Frontiers of Research in BSS/ICA. Neurocomputing 49 (2002) 7–23
16. Tong, L., Liu, R., Soon, V.C.: Indeterminacy and Identifiability of Blind Identification. IEEE Trans. Circuits Syst. 38(5) (1991) 499–509
17. Zibulevsky, M., Pearlmutter, B.A.: Blind Source Separation by Sparse Decomposition in a Signal Dictionary. Neural Computation 13(4) (2001) 863–882
18. Xie, S., Zhang, J.: Blind Separation Algorithm of Minimal Mutual Information Based on Rotating Transform. Tien Tzu Hsueh Pao/Acta Electronica Sinica 30(5) (2002) 628–631
19. Xiao, M., Xie, S., Fu, Y.: A Novel Approach for Underdetermined Blind Source Separation in the Frequency Domain. Lecture Notes in Computer Science 3497 (2005) 484–489
20. Xie, S., Xiao, M., Fu, Y.: A Novel Approach for Underdetermined Blind Speech Signal Separation. DCDIS Proceedings 3: Impulsive Dynamical Systems and Applications (2005) 1846–1853
Identification of Independent Components Based on Borel Measure for Under-Determined Mixtures

Wenqiang Guo¹,², Tianshuang Qiu¹, Yuzhang Zhao², and Daifeng Zha¹

¹ Dalian University of Technology, Dalian 116024, China
[email protected], [email protected], [email protected]
² Xinjiang Institute of Finance and Economics, Urumchi 830012, China
[email protected]
1 Introduction In some applications, such as underwater acoustic signal processing, radio astronomy, communications, radar system, etc., and for most conventional and linear-theorybased methods, it is reasonable to assume that the additive noise is usually assumed to be Gaussian distributed with finite second-order statistics. However in some scenarios, it is inappropriate to model the noise as Gaussian noise. The availability of such methods would make it possible to operate in the environments that differ from Gaussian in significant ways. Recent studies [1][2][3] showed that the class of alpha stable distributions is better for modeling impulsive noise than Gaussian distribution in signal processing, which has some important characteristics and makes it very attractive. Many types of noise are well modeled as alpha-stable distributed, including underwater acoustic, low-frequency atmospheric, and many man-made noises. Recent advances in blind source separation by independent have many applications including speech recognition systems, telecommunications, and medical signal processing. Independent Component Analysis addresses the problem of reconstruction of sources from the observation of instantaneous linear combinations. The goal of ICA is to recover independent sources given only sensor observations and linear mixtures of the unobserved independent source signals are unknown. The standard formulation of ICA requires at least as many sensors as sources. Lewicki and Sejnowski [4] have proposed a generalized ICA method for learning underdetermined representations of the data that allows for more basis vectors than dimensions in the input. The work of this paper considers the ICA problem on alpha stable source signal vectors. J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1177 – 1182, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Alpha-Stable Vectors with Discrete Borel Measure

Stable distributions are suitable for modeling random variables whose probability density tails are heavier than those of the Gaussian density. They have found applications in signal processing, including the processing of audio signals [2][3]. Real-world physical processes with sudden, short-duration high impulses have no second- or higher-order statistics of this kind. A stable distribution has no closed-form probability density function, so it can only be described by its characteristic function (ch.f.):

$$\Phi(t) = \exp\{j\mu t - \gamma|t|^\alpha [1 + j\beta\,\mathrm{sgn}(t)\,\omega(t,\alpha)]\} \qquad (1)$$

where $\omega(t,\alpha) = \tan(\alpha\pi/2)$ if $\alpha \ne 1$ and $\omega(t,\alpha) = (2/\pi)\log|t|$ if $\alpha = 1$; $-\infty < \mu < \infty$, $\gamma > 0$, $0 < \alpha \le 2$, $-1 \le \beta \le 1$. Here $\alpha$ is the characteristic exponent, $\gamma$ the dispersion parameter and $\beta$ the symmetry parameter. The ch.f. of a real random vector $x$ is defined as $E\exp(jt^T x)$. A random vector is an alpha-stable vector if a finite measure $\Gamma$ exists on the unit sphere $S_M$ of $R^M$ such that the ch.f. can be written in the form
$$\Phi(t) = \begin{cases} \exp\{jt^T\mu - t^T D t\}, & \alpha = 2,\\ \exp\left\{-\int_{S_M} \Psi_\alpha(t^T s)\, d\Gamma(s) + jt^T\mu + j\beta_\alpha(t)\right\}, & 0 < \alpha < 2, \end{cases} \qquad (2)$$

where

$$\Psi_\alpha(u) = \begin{cases} |u|^\alpha \left(1 - j\tan(\alpha\pi/2)\,\mathrm{sign}(u)\right), & \alpha \ne 1,\\ |u| \left(1 + j(2/\pi)\,\mathrm{sign}(u)\ln|u|\right), & \alpha = 1, \end{cases}$$

$$\beta_\alpha(t) = \begin{cases} \tan\dfrac{\alpha\pi}{2}\displaystyle\int_{S_M} |t^Ts|^\alpha\, \mathrm{sign}(t^Ts)\, d\Gamma(s), & \alpha \ne 1,\\ \displaystyle\int_{S_M} t^Ts \ln|t^Ts|\, d\Gamma(s), & \alpha = 1, \end{cases}$$

and $\mu, t \in R^M$, with $D$ a positive semi-definite symmetric matrix. The measure $\Gamma$ is called the Borel measure [1] of the $S\alpha S$ random vector, and $\mu$ is called the shift vector. The Borel measure is a real function, and the Borel measure representation $(\Gamma, \mu)$ is unique [1][5]. Consider the neural network ICA mixing model of Fig. 1 with the real vector $v = [v_1, v_2, \ldots, v_N]^T$, where $v_1, \ldots, v_N$ are independent $S\alpha S(\alpha, \beta_n, \gamma_n, \mu_n)$ random variables. Let $x = [x_1, \ldots, x_M]^T$ be a random vector and $A = [a_1, a_2, \ldots, a_n, \ldots, a_N]$ an $M \times N$ matrix.
Fig. 1. ICA mixing model
The vector $x = Av$ is then an $S\alpha S$ random vector with ch.f.

$$E\exp(jt^T Av) = \prod_{n=1}^{N} E\exp(jt^T a_n v_n) \qquad (3)$$

where $a_n$ is the $n$th column of $A$. This is essentially the product of all the ch.f.s of the $v_n$, and can be written as

$$E\exp(jt^T Av) = \exp\left(-\sum_{n=1}^{N} \gamma_n \Psi_\alpha(t^T a_n) + j\sum_{n=1}^{N} \mu_n t^T a_n\right) \qquad (4)$$
Rewriting (4) in the form of (2) yields [5]

$$\Gamma(s) = \sum_{n=1}^{N} \frac{1+\beta_n}{2}\gamma_n (a_n^T a_n)^{\alpha/2}\, \delta(s - s_n) + \sum_{n=1}^{N} \frac{1-\beta_n}{2}\gamma_n (a_n^T a_n)^{\alpha/2}\, \delta(s + s_n) \qquad (5)$$

$$\mu = \begin{cases} \sum_{n=1}^{N} a_n \mu_n, & \alpha \ne 1,\\[4pt] \sum_{n=1}^{N} a_n\left(\mu_n - \beta_n\gamma_n \ln(a_n^T a_n)/\pi\right), & \alpha = 1, \end{cases} \qquad (6)$$

where $s_n = a_n/\sqrt{a_n^T a_n}$. Hence the Borel measure $\Gamma(s)$ of the random vector $x$ is discrete and concentrated on $N$ symmetric pairs of points $(s_n, -s_n)$, $n = 1, 2, \ldots, N$. This result holds in general; thus the Borel measure $\Gamma(s)$ of an $S\alpha S$ random vector $x$ is discrete on the unit sphere $S_N$ if, and only if, $x$ can be expressed as a linear transformation of independent $S\alpha S$ random variables.
3 Estimating the Borel Measure $\Gamma(s)$ and Identifying the Basis Vectors $a_n$

From [6] we have that for a d-dimensional $S\alpha S$ variable with Borel measure $\Gamma(s)$ and density function $p(x)$, there is a discrete Borel measure $\Gamma_a(s)$ with corresponding density function $p_a(x)$ satisfying $\sup_x |p(x) - p_a(x)| \le \varepsilon$, $x \in R^d$, $\varepsilon > 0$. Thus an arbitrary Borel measure $\Gamma(s)$ can be approximated by a discrete measure $\Gamma_a(s)$ such that the corresponding densities are arbitrarily close; the only requirement is that the sampling of $s$ be sufficiently dense. The ch.f. corresponding to the approximated discrete Borel measure $\Gamma_a(s)$, sampled at $L$ points, can be written as

$$\hat{\Phi}(t) = \exp\left(-\sum_{n=1}^{L} \Psi_\alpha(t^T s_n)\,\Gamma_a(s_n)\right) \qquad (7)$$
Now define a vector $H = [\Gamma_a(s_1), \Gamma_a(s_2), \ldots, \Gamma_a(s_L)]^T$ containing the $L$ values of the approximated Borel measure. If we evaluate the approximated ch.f. at $L$ values of $t$, we can formulate the set of linear equations

$$\begin{bmatrix} -\ln\Phi_a(t_1)\\ -\ln\Phi_a(t_2)\\ \vdots\\ -\ln\Phi_a(t_L) \end{bmatrix} = \begin{bmatrix} \Psi_\alpha(t_1^Ts_1) & \Psi_\alpha(t_1^Ts_2) & \cdots & \Psi_\alpha(t_1^Ts_L)\\ \Psi_\alpha(t_2^Ts_1) & \Psi_\alpha(t_2^Ts_2) & \cdots & \Psi_\alpha(t_2^Ts_L)\\ \vdots & \vdots & & \vdots\\ \Psi_\alpha(t_L^Ts_1) & \Psi_\alpha(t_L^Ts_2) & \cdots & \Psi_\alpha(t_L^Ts_L) \end{bmatrix} \begin{bmatrix} \Gamma_a(s_1)\\ \Gamma_a(s_2)\\ \vdots\\ \Gamma_a(s_L) \end{bmatrix} \qquad (8)$$
Then the approximated Borel measure is given exactly by the solution of (8). From this point, without loss of generality and to simplify the presentation, we will assume that $x$ is $S\alpha S$. In the case of a symmetric density function the Borel measure and the ch.f. are real valued and symmetric. From the definition of the ch.f., an estimate based on $K$ samples of the random vector $x$ can be obtained as

$$\hat{\Phi}(t) = \frac{1}{K}\sum_{k=1}^{K}\exp(jt^T x_k) \qquad (9)$$

An estimate of the approximate discrete Borel measure is then directly obtained from (8). The Borel measure is defined on the d-dimensional unit sphere; if no a priori knowledge about the Borel measure is available and all directions are of equal importance, it is natural to sample uniformly on the unit sphere. The natural choice of the sampling grid of the ch.f. is to sample symmetrically on a d-dimensional sphere; in the $S\alpha S$ case it suffices to sample on a half d-dimensional sphere. Thus the $n$th sample point in $t$ is $t_n = r s_n$, $s_n = a_n/\sqrt{a_n^T a_n}$, where $r$ is the sampling radius. To avoid negative values in the estimate of the Borel measure, the solution of (8) is restated as the constrained least-squares problem

$$\min_{H}\ \left\| b - \mathrm{Re}\{\Psi H\} \right\|_2, \quad H \ge 0, \qquad (10)$$

where $b = [-\ln\Phi_a(t_1), \ldots, -\ln\Phi_a(t_L)]^T$ and $\Psi = [\Psi_\alpha(t_i^T s_j)]_{i,j=1}^{L}$.

The estimation procedure for the Borel measure is now: (i) determine the sampling radius $r$; a natural choice is $r = \gamma^{-1/\alpha}$; (ii) calculate $\Psi_\alpha(t_i^T s_j)$, $t_i = r s_i$, $s_j = a_j/\sqrt{a_j^T a_j}$; (iii) estimate $\Phi_a(t_n)$ according to (9); (iv) solve the constrained least-squares problem (10) to obtain the estimated Borel measure $\Gamma_a(s_1), \Gamma_a(s_2), \ldots, \Gamma_a(s_L)$. Moreover, blind identification provides no ordering of the basis vectors, and for symmetric densities there is a sign ambiguity in the basis vectors. Equation (5) then leads to the conclusion that identifying the basis vectors $a_n$ amounts to determining the directions in which the Borel measure has maximum peaks; in general the Borel measure has maximum peaks in $2N$ directions. Due to the finite number of samples, observation noise and possible deviations from the theoretical distributions, there will be some noise in the estimated Borel measure. The number of
sample points $L$ determines the angular resolution of the basis vectors. In the d = 2 case, with sampling on the half 2-dimensional sphere, the resolution is $\pm\pi/(2L)$.
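Steps (i)–(iv) can be sketched as follows for the d = 2, $S\alpha S$ case, where $\Psi_\alpha(u)$ reduces to $|u|^\alpha$ and the ch.f. is real. The sample generator (Chambers–Mallows–Stuck), the grid size `L` and the demo directions are our choices, not details fixed by the paper:

```python
import numpy as np
from scipy.optimize import nnls

def sas_samples(alpha, size, rng):
    """Standard SaS(alpha) samples (dispersion gamma = 1),
    Chambers-Mallows-Stuck method."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))

def estimate_borel_measure(x, alpha, L=90, r=1.0):
    """Steps (i)-(iv) for a 2-D SaS vector x of shape (2, K);
    r = gamma**(-1/alpha) = 1 for unit dispersion."""
    angles = np.pi * (np.arange(L) + 0.5) / L           # half unit circle
    S = np.vstack([np.cos(angles), np.sin(angles)])     # directions s_j
    t = r * S                                           # sample points t_i
    phi_hat = np.cos(t.T @ x).mean(axis=1)              # empirical ch.f., Eq. (9)
    phi_hat = np.clip(phi_hat, 1e-12, None)
    Psi = np.abs(t.T @ S) ** alpha                      # Psi_alpha(t_i^T s_j)
    gamma, _ = nnls(Psi, -np.log(phi_hat))              # constrained LS, Eq. (10)
    return angles, gamma

# Hypothetical demo: two independent SaS(1.5) components mixed at 0.2*pi, 0.8*pi.
rng = np.random.default_rng(0)
v = np.vstack([sas_samples(1.5, 20_000, rng) for _ in range(2)])
th = np.array([0.2 * np.pi, 0.8 * np.pi])
A = np.vstack([np.cos(th), np.sin(th)])
x = A @ v
angles, gamma = estimate_borel_measure(x, alpha=1.5)
# The estimated measure concentrates its mass near 0.2*pi and 0.8*pi.
```

`scipy.optimize.nnls` plays the role of the nonnegativity-constrained solve of (10); the peak directions of `gamma` are the basis-vector estimates, up to the order and sign ambiguities noted above.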
4 Experimental Results

In our simulations, the basis vectors $a_n$ (assuming $\|a_n\|_2 = 1$) of the underdetermined mixing matrix $A$ are identified both for synthetic $S\alpha S$ signals and for speech signals, which can be modeled by alpha-stable distributions.
Experiment 1: Let $v$ be a random vector with four independent $S\alpha S(1.5, 1)$ random variables and $x = Av$ the observable random vector:

$$\begin{bmatrix} x_1\\ x_2 \end{bmatrix} = \begin{bmatrix} \cos\theta_1 & \cos\theta_2 & \cdots & \cos\theta_4\\ \sin\theta_1 & \sin\theta_2 & \cdots & \sin\theta_4 \end{bmatrix} \begin{bmatrix} v_1\\ v_2\\ \vdots\\ v_4 \end{bmatrix} \qquad (11)$$

where $\theta_1 = 0.2\pi$, $\theta_2 = 0.4\pi$, $\theta_3 = 0.6\pi$, $\theta_4 = 0.8\pi$. The scatter plot of the observations $x$ and the estimated Borel measure are depicted in Fig. 2. The basis vectors are identified as the directions in which the estimated Borel measure has significant peaks. Observe that the peaks of the Borel measure are concentrated in the four directions $\theta_1 = 0.2\pi$, $\theta_2 = 0.4\pi$, $\theta_3 = 0.6\pi$, $\theta_4 = 0.8\pi$, matching the four directions visible in the scatter plot of $x$.
Fig. 2. Scatter plot of x and the estimated Borel measure
Experiment 2: In [3] it is demonstrated that $S\alpha S$ distributions are suitable for modeling a broad class of acoustical signals, including speech. The proposed method for identifying the basis vectors is applied to a mixture of two independent speech signals. Let $v$ be a random vector with two speech signals and $x = Av$ the observable random vector:

$$\begin{bmatrix} x_1\\ x_2 \end{bmatrix} = \begin{bmatrix} \cos\theta_1 & \cos\theta_2\\ \sin\theta_1 & \sin\theta_2 \end{bmatrix}\begin{bmatrix} v_1\\ v_2 \end{bmatrix} \qquad (12)$$
Here $\theta_1 = 0.2\pi$, $\theta_2 = 0.8\pi$. The scatter plot of the observations $x$ and the estimated Borel measure are depicted in Fig. 3. We observe that the peaks of the Borel measure are concentrated in the two directions $\theta_1 = 0.2\pi$, $\theta_2 = 0.8\pi$, and the scatter plot of $x$ shows two directions matching the two peaks of the Borel measure.
Fig. 3. Scatter plot of x and the estimated Borel measure
According to (12), we can separate the two speech signals. Experiments with different speech signals and different mixing matrices yielded similar results. Although the temporal structure of the speech signal was not taken into consideration in the model, the separation quality was good.
5 Conclusion

In this paper, we propose an ICA method based on the observation that the Borel measure is discrete for stable random vectors with independent components. The method identifies the number of independent components and the non-orthogonal bases of the mixture. Simulations on four alpha-stable signals and on two speech signals demonstrate that the method can identify the number of independent components and the bases of under-determined mixtures.
References
1. Nikias, C.L., Shao, M.: Signal Processing with Alpha-Stable Distributions and Applications. John Wiley & Sons, New York (1995)
2. Georgiou, P., Tsakalides, P., Kyriakakis, C.: Alpha-Stable Modeling of Noise and Robust Time-Delay Estimation in the Presence of Impulsive Noise. IEEE Transactions on Multimedia 1 (1999) 291–301
3. Kidmose, P.: Alpha-Stable Distributions in Signal Processing of Audio Signals. In: 41st Conference on Simulation and Modelling. Scandinavian Simulation Society (2000) 87–94
4. Lee, T.W., Lewicki, M.S., Girolami, M., Sejnowski, T.J.: Blind Source Separation of More Sources Than Mixtures Using Overcomplete Representations. IEEE Signal Processing Letters 6 (1999) 87–90
5. Samorodnitsky, G., Taqqu, M.S.: Stable Non-Gaussian Random Processes. Chapman & Hall (1994)
6. Byczkowski, T., Nolan, J.P., Rajput, B.: Approximation of Multidimensional Stable Densities. Journal of Multivariate Analysis 46 (1993) 13–31
Estimation of Delays and Attenuations for Underdetermined BSS in Frequency Domain
Ronghua Li1 and Ming Xiao1,2
1 School of Electronic & Information Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China
[email protected], [email protected]
2 School of Computer & Electronic Information Engineering, Maoming College, Maoming 525000, Guangdong, China
Abstract. The underdetermined blind delayed-source separation problem is studied in this paper. Firstly, on the basis of the searching-and-averaging-based method in the frequency domain, the algorithm is extended to the blind delayed-source model. Secondly, a new cost function for estimating the delays of the observed signals is presented; the delays are inferred in the single-signal intervals. Finally, experiments with delayed sound sources demonstrate its performance.
1 Introduction
Blind source separation (BSS) based on independent component analysis (ICA) has many potential applications, including speech recognition systems, telecommunications, and medical signal processing. The standard formulation of ICA requires at least as many sensors as sources [1]-[6]; when the sensors are fewer than the sources, the underlying system is underdetermined. Sparse representation has been used in blind source separation [7]-[14]. If the sources are sparse, the underdetermined BSS problem can be solved by the two-step approach proposed recently by Bofill and Zibulevsky [7]: the first step estimates the mixing matrix, and the second estimates the sources. In the matrix-recovery step, one usually uses k-means clustering, a potential-function-based method [7], etc.; the goal of these approaches is to estimate the basis lines either by external optimization or by clustering. In the source-recovery step, one often uses a shortest-path algorithm [7][8] or solves a low-dimensional linear programming problem for each data point [10]. Bofill has studied blind separation of underdetermined delayed sound sources [11], i.e., observed signals with only attenuations and delays (no reverberation). However, Bofill's estimation of the delays has high complexity and is not completely exact. Recently, we proposed the searching-and-averaging-based method (SABM) [12]-[14], whose results in the frequency domain are better than Bofill's [7][12]. In this paper, in order to estimate the delays precisely, a new algorithm is proposed: the SABM is extended to the underdetermined blind delayed-source model, and the delays are inferred in single-signal intervals. Finally, several experimental results show that the algorithm is effective in estimating the attenuations and delays.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1183 – 1188, 2006. © Springer-Verlag Berlin Heidelberg 2006
1184
R. Li and M. Xiao
2 Problem Formulation
This paper deals with the inference, or blind separation, of n sources from m sensors (n > m), assuming that only attenuation and delay are present in the mixture (i.e., no reverberation). The assumed mixture model is the following [11]:
x_1(t) = a_11 s_1(t − τ_11) + a_12 s_2(t − τ_12) + ... + a_1n s_n(t − τ_1n)
⋮
x_m(t) = a_m1 s_1(t − τ_m1) + a_m2 s_2(t − τ_m2) + ... + a_mn s_n(t − τ_mn)    (1)
where t = 1,...,P is the discrete time, x_i(t) are the sensor signals, s_l(t) are the sources, and a_il and τ_il are the attenuation and delay, respectively, from source l to sensor i. The problem of blind source separation, in this setting, consists of inferring the sources from the sensor signals when the sources, attenuations, and delays are all unknown. The scope of this paper is the underdetermined case (i.e., m < n). Applying a K-point discrete Fourier transform to the sensor signals gives

X_i^k = Σ_{l=1..n} a_il e^{−j2πkτ_il/K} S_l^k    (2)

with k = 0,...,K/2 the discrete frequency. In matrix notation we have
X_k = Z_k S_k    (3)
where, for each frequency bin k, the sensor data point X_k ∈ C^m is a column vector, the source vector S_k ∈ C^n has n unknown components, and the m × n mixing matrix Z_k is
Z_k = [a_il e^{−j2πkτ_il/K}] = A : e^{−j2πkT/K}    (4)
with A = [a_il] the unknown attenuation matrix, T = [τ_il] the unknown delay matrix, and : denoting the component-by-component product. Notice that although Z_k takes a different value for each k, all the Z_k depend on the same A and T, so the number of unknown coefficients is 2 × m × n. Given X_k for k = 0,...,K/2, this paper describes procedures for inferring first A and then T; it does not describe procedures for inferring S_k and s(t). In reference [11], Bofill used second-order cone programming to infer S_k and s(t).
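As a quick numerical illustration of (4), the per-bin mixing matrix Z_k can be assembled from A and T by an elementwise product. The values of A, T, and K below are illustrative assumptions, not the paper's data:

```python
import numpy as np

K = 8192                      # FFT length (assumed)
A = np.array([[0.9, 0.6],     # attenuations a_il (illustrative)
              [0.3, 0.8]])
T = np.array([[0.0, 55.0],    # delays tau_il in samples (illustrative)
              [33.0, 0.0]])

def mixing_matrix(k):
    """Z_k = A : exp(-j 2 pi k T / K), ':' being the elementwise product."""
    return A * np.exp(-2j * np.pi * k * T / K)

Z0 = mixing_matrix(0)         # at k = 0 the delays contribute no phase
```

Note that, as the text says, every Z_k is determined by the same 2 × m × n real coefficients in A and T; only the phases vary with k, so |Z_k| = A for all k.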
3 Inferring the Absolute Values of the Attenuations
When m = 2, the attenuation matrix A can be decomposed into n column vectors a_l, where a_l = (|a_1l| e^{jθ_1l}, |a_2l| e^{jθ_2l})^T and θ_il ∈ {0, π}. For simplicity, we set B = |A|, B = [b_l], and b_l = (|a_1l|, |a_2l|)^T, so that b_il ≥ 0.
On the basis of the searching-and-averaging-based method (SABM) in the frequency domain [12], we first find some single-signal frequency intervals and then estimate all the basis vectors b_l. The procedure for searching the single-signal frequency intervals is described in reference [12]. By the sources' sparsity, in a single-signal frequency interval the following expression holds:

|X^k| = b_l |S_lk|    (5)
Assume a group of frequency bins k_1, k_2,...,k_h lies in the single-signal frequency intervals corresponding to the basis vector b_l; then b_l is easily estimated using the SABM algorithm in the frequency domain [12]. Given N_0, if h ≥ N_0,

b_l = (1/h) Σ_{i=1..h} X̄^{k_i}    (6)
where the vector X̄^{k_i} denotes the normalized |X^{k_i}| for k_i = k_1, k_2,...,k_h.
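The averaging in (6) can be sketched as follows. The synthetic single-source bins and the threshold value are assumptions for illustration only:

```python
import numpy as np

def estimate_basis(X_bins, N0=3):
    """Average the normalized magnitude spectra |X^k| over the h bins
    attributed to one source (eq. (6)); return None if h < N0."""
    if len(X_bins) < N0:
        return None
    mags = [np.abs(x) / np.linalg.norm(np.abs(x)) for x in X_bins]
    return np.mean(mags, axis=0)

# In a single-source bin, |X^k| = b_l * |S_lk| for some scalar |S_lk|,
# so every normalized magnitude equals the unit-norm basis vector b_l.
b_true = np.array([0.6, 0.8])
bins = [b_true * s for s in (1.0, 2.5, 0.7, 1.3)]
b_hat = estimate_basis(bins)
```

Because each normalized column is exactly b_l in the noise-free case, the average recovers b_l; with noisy data the averaging over h ≥ N_0 bins is what stabilizes the estimate.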
4 Inferring the Attenuations and Delays
When m = 2, let X_k = (|x_1k| e^{jϕ_1k}, |x_2k| e^{jϕ_2k})^T and S_lk = |S_lk| e^{jφ_lk}. If frequency bin k = k_1, k_2,...,k_h lies in the single-signal frequency intervals corresponding to the basis vector b_l, we can obtain

X_k ≈ (b_1l e^{jθ_1l − j2πkτ_1l/K}, b_2l e^{jθ_2l − j2πkτ_2l/K})^T |S_lk| e^{jφ_lk}    (7)
Therefore, the phases of the vector X_k satisfy

ϕ_1k − φ_lk = θ_1l − 2πkτ_1l/K + 2n_1π
ϕ_2k − φ_lk = θ_2l − 2πkτ_2l/K + 2n_2π    (8)
where n_1 and n_2 are arbitrary integers. Let δ_l = τ_2l − τ_1l and θ_l = θ_2l − θ_1l; then

(ϕ_2k − ϕ_1k) − θ_l + 2πkδ_l/K = 2π(n_2 − n_1)    (9)
Therefore, a cost function for estimating the delay is obtained:

(δ_l, θ_l) = arg min_{δ_l, θ_l} Σ_k |1 − exp{−j[(ϕ_2k − ϕ_1k) − θ_l + 2πkδ_l/K]}|    (10)
where k = k_1, k_2,...,k_h and δ_l ∈ {0, 1,...,K − 1}. Since θ_il ∈ {0, π}, we have θ_l ∈ {−π, 0, π}; and since a_l and −a_l are the same vector in BSS, θ_l ∈ {0, π}. We first obtain δ_l and θ_l by optimizing the cost function; then the delays and attenuations are

τ_1l = 0, τ_2l = δ_l          (δ_l < K/2)
τ_1l = K − δ_l, τ_2l = 0      (δ_l ≥ K/2)    (11)
a_l = b_l              (θ_l = 0)
a_l = b_l ⋅ [−1, 1]    (θ_l = π)    (12)
where k = k_1, k_2,...,k_h. Compared with Bofill's algorithm [11], we use only the data in the single-signal frequency intervals to infer the delays, whereas Bofill uses all the data; hence our algorithm has higher precision and lower complexity than Bofill's.
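Since δ_l ranges over a finite set and θ_l ∈ {0, π}, the minimization of (10) can be done by exhaustive search over the single-signal bins. The sketch below uses synthetic phases (an assumption) consistent with a known delay:

```python
import numpy as np

def infer_delay(phi1, phi2, ks, K):
    """Grid-search the cost (10) over delta in {0,...,K-1} and theta in {0, pi},
    using only the phases at the single-source bins ks."""
    best = None
    for theta in (0.0, np.pi):
        for delta in range(K):
            cost = np.sum(np.abs(
                1 - np.exp(-1j * ((phi2 - phi1) - theta
                                  + 2 * np.pi * ks * delta / K))))
            if best is None or cost < best[0]:
                best = (cost, delta, theta)
    return best[1], best[2]

# Phases consistent with delta = 5, theta = 0 (noise-free toy case)
K = 64
ks = np.arange(10, 20)
phi1 = np.zeros(len(ks))
phi2 = -2 * np.pi * ks * 5 / K
delta_hat, theta_hat = infer_delay(phi1, phi2, ks, K)
```

At the true (δ_l, θ_l) every summand of (10) vanishes, so the cost is exactly zero there; restricting the sum to the single-signal bins is what keeps this search cheap compared with using all the data.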
5 Experiments and Results
The experiments were conducted on the signal set SixFluteMelodies: six signals of 5.7 s duration with different flute melodies, recorded with high-quality equipment at 44,100 Hz and 16 bits and down-sampled to 22,050 Hz. It originated from http://www.ac.upc.es/homes/pau/. The mixtures were processed in frames of length L samples multiplied by a Hanning window. A "hop" distance d was used between the starting points of successive frames, leading to an overlap of L − d samples between consecutive frames. Each frame was transformed with a standard FFT of length L. We set L = 8192 and d = 3276.

The attenuations are
A = [ 0.9848   0.7660   0.3420  −0.0872  −0.6428  −0.9063
      0.1736   0.6428   0.9397   0.9962   0.7660   0.4226 ],
and the delay matrix is
T = [  0     0    0     0    0    0
      55  −150   33  −140   68  −80 ].

After the simulation of the SABM in the frequency domain, the absolute values of the attenuations were obtained as
B̂ = [ 0.98441  0.64262  0.091297  0.90571  0.76792  0.33031
      0.17587  0.76618  0.99582   0.42389  0.64054  0.94387 ].

In inferring the delays and attenuations, the delay matrix is
T̂ = [  0   0     0     0     0   0
      55  68  8052  8042  8112  33 ],
and [θ_1, θ_2, θ_3, θ_4, θ_5, θ_6] = [0, π, π, π, 0, 0]. For L = 8192, subtracting 8192 from 8052, 8042, and 8112 gives −140, −150, and −80, so the delays T̂ are the same as T up to column permutation. Finally, the estimated attenuation matrix is
Â = [ 0.98441  −0.64262  −0.091297  −0.90571  0.76792  0.33031
      0.17587   0.76618   0.99582    0.42389  0.64054  0.94387 ].

The vectors a_l and â_l are normalized to unit length, and the error can be computed as

error(l) = cos^{−1}(â_l ⋅ a_l)    (13)

Then error = [0.1184°, 1.4418°, 0.7755°, 0.0839°, 0.2028°, 0.0624°].
In Bofill's SixFluteMelodies experiment, the average differential error in T̂ was zero when A was given, but it was 3.5 or 21 when Â had to be estimated, so the delays could not be estimated correctly [11]. On this basis, our experimental results are better than Bofill's.
6 Conclusions
In this paper, we discussed the recovery of attenuations and delays for underdetermined blind delayed-source separation. Firstly, we proposed a new algorithm for estimating the delays on the basis of the SABM in the frequency domain. The algorithm is an extension of the SABM to underdetermined delayed-source separation; when the delays are zero, it solves the ordinary underdetermined blind source separation problem. Secondly, the algorithm improves on Bofill's in both complexity and precision. Finally, the SixFluteMelodies experiment demonstrates the algorithm's performance.
Acknowledgement The work is supported by the National Natural Science Foundation of China for Excellent Youth (Grant 60325310), the Guangdong Province Science Foundation for Program of Research Team (grant 04205783), the National Natural Science Foundation of China (Grant 60505005), the Natural Science Fund of Guangdong Province, China (Grant 05103553) and (Grant 05006508), the Specialized Prophasic Basic Research Projects of Ministry of Science and Technology, China (Grant 2005CCA04100).
References
1. Li, Y., Wang, J.: Sequential Blind Extraction of Instantaneously Mixed Sources. IEEE Trans. on Signal Processing 50 (5) (2002) 997-1006
2. Xie, S., He, Z., and Fu, Y.: A Note on Stone's Conjecture of Blind Signal Separation. Neural Computation 17 (2) (2005) 321-330
3. Cardoso, J.F.: Blind Signal Separation: Statistical Principles. Proc. IEEE 86 (10) (1998) 1129-1159
4. Zhang, J., Xie, S., and He, Z.: Separability Theory for Blind Signal Separation. Zidonghua Xuebao/Acta Automatica Sinica 30 (5) (2004) 337-344
5. He, Z., Xie, S., Zhang, J.: Geometrical Algorithm of Blind Source Separation Based on QR Decomposition. Kongzhi Lilun Yu Yinyong/Control Theory and Applications 22 (2) (2005) 17-22
6. Li, Y., Wang, J., and Zurada, J.: Blind Extraction of Singularly Mixed Source Signals. IEEE Trans. on Neural Networks 11 (6) (2000) 1413-1422
7. Bofill, P., Zibulevsky, M.: Underdetermined Blind Source Separation Using Sparse Representations. Signal Processing 81 (2001) 2353-2362
8. Theis, F.J., Lang, W.E., and Puntonet, C.G.: A Geometric Algorithm for Overcomplete Linear ICA. Neurocomputing 56 (2004) 381-398
1188
R. Li and M. Xiao
9. He, Z., Xie, S., Fu, Y.: FIR Convolutive BSS Based on Sparse Representation. Lecture Notes in Computer Science 3497 (2005) 532-537
10. Li, Y., Cichocki, A., and Amari, S.: Analysis of Sparse Representation and Blind Source Separation. Neural Computation 16 (6) (2004) 1193-1234
11. Bofill, P.: Underdetermined Blind Separation of Delayed Sound Sources in the Frequency Domain. Neurocomputing 55 (2003) 627-641
12. Xiao, M., Xie, S., and Fu, Y.: A Novel Approach for Underdetermined Blind Source in the Frequency Domain. Lecture Notes in Computer Science 3497 (2005) 484-489
13. Xiao, M., Xie, S., and Fu, Y.: A Searching-and-Averaging-Based Method for Underdetermined Blind Speech Signals Separation in Time Domain. Submitted to Science in China Series E: Technological Sciences, Nov. 2004
14. Xie, S., Xiao, M., and Fu, Y.: A Novel Approach for Underdetermined Blind Speech Signal Separation. DCDIS Proceedings 3: Impulsive Dynamical Systems and Applications (2005) 1846-1853
Application of Blind Source Separation to Five-Element Cross Array Passive Location
Gaoming Huang1,2, Yang Gao2, Luxi Yang2, and Zhenya He2
1 Naval University of Engineering, Wuhan 430033, China
2 Department of Radio Engineering, Southeast University, Nanjing 210096, China
[email protected], {lxyang, zyhe}@seu.edu.cn
Abstract. Passive acoustic source location and tracking can be implemented with a microphone array, which gives the approach broad application in target detection, fault diagnosis, etc. A five-element planar cross array is an effective structure for acoustic source location, but conventional relative-delay location methods often give poor results in complicated environments. This paper proposes a novel passive acoustic source location method based on blind source separation and analyzes the new location algorithm thoroughly. Experiments show that more stable and accurate location results can be obtained with the new method. Its implementation is relatively simple and effective and can satisfy the demands of engineering applications.
1 Introduction
Acoustic source location based on a microphone array is a general method that can be applied to the detection of shipboard targets, hearing aids, machine diagnosis, etc. As a planar cross array has widely dispersed elements and little redundancy, it is an array well adapted to the requirements of location. The location precision of four-element and five-element planar cross arrays has been analyzed in [1]; simulation results indicate that the location performance of the five-element planar cross array is better. Location based on time delay estimation is relatively easy to implement in engineering and has better application prospects than other methods. The first step of this location method is time delay estimation, the second is DOA (direction of arrival) estimation, and the last is location computation. Time delay estimation is the key problem. The basic methods of time delay estimation include GCC, generalized bispectrum estimation, adaptive estimation, etc. [2], which attracted much attention early on [3]; many new estimation methods have also appeared in recent years [4][5]. Conventional time delay estimation methods have many limitations in complicated environments. In recent years, particularly after Herault and Jutten's work [6], many BSS algorithms have been proposed [7][8]. This paper proposes applying blind source separation (BSS), based on canonical correlation analysis (CCA), to improve the performance of time delay estimation. Stable and accurate time delay estimates can be obtained [9,10,11], and the location error may be reduced [12][13]. The structure of this paper is as follows: In Section 2, we analyze related problems of acoustic
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1189 – 1194, 2006. © Springer-Verlag Berlin Heidelberg 2006
source location. In Section 3, the location estimation algorithm based on BSS is analyzed. Some experiments with the proposed algorithm are then conducted in Section 4. Finally, a conclusion is given.
2 Problem Formulation The location structure of five-element planar cross array is shown in Fig. 1.
Fig. 1. The five-element planar cross array location model
A rectangular coordinate system is established with array element S_0 at the origin. The two mutually orthogonal linear arrays are S_1, S_3 and S_2, S_4, and each array element includes two inner sensors. Supposing the length of each orthogonal baseline is D, the center coordinates of the array elements are S_0(0,0,0), S_1(D/2, 0, 0), S_2(0, D/2, 0), S_3(−D/2, 0, 0), and S_4(0, −D/2, 0). The coordinate of the acoustic source target T_j is (x_j, y_j, z_j); in spherical coordinates it is (r, ϕ, θ), where r is the distance between T_j and S_0, ϕ is the azimuth angle, and θ is the pitch angle. The mathematical model of the signal received at each array element can be described as:
x(t) = A ⋅ s(t − τ_ij) + n(t),   i = 1,...,m; j = 1,...,n    (1)
where x(t) are the received signals, A is an m × n mixing matrix, s(t) are the source signals, τ_ij is the time delay between the j-th source and the i-th sensor, and n(t) is an m × 1 noise vector. The sources are assumed to be mutually independent and independent of the noise, with m ≥ n. For the algorithm, the baseline length is supposed to be much greater than the inner sensor distance of each array element.
3 Passive Location Algorithm
3.1 BSS Based on CCA

With the structure defined in Fig. 1, we can separate the receiving sensors into four groups: (S_0, S_1), (S_0, S_2), (S_0, S_3), (S_0, S_4). Consider each group's received signals as two sets of input data x_1, x_2,...,x_p and y_1, y_2,...,y_q, p ≤ q. The two sets of data can be combined into pairs of variables ξ_i and η_i through the linear combinations ξ = a^T x and η = b^T y. Suppose the means of x, y are μ_x, μ_y, and the covariance matrices are Σ_x, Σ_y, Σ_xy, Σ_yx. The correlation between ξ and η can be written as:
ρ(a, b) = a^T Σ_xy b / (a^T Σ_x a ⋅ b^T Σ_y b)^{1/2}    (2)
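The pairs of vectors maximizing (2) can be found through an eigenvalue decomposition. The sketch below (with synthetic zero-mean data, an assumption) recovers canonical correlations close to one when the two data sets are linear mixtures of the same sources:

```python
import numpy as np

def cca(x, y):
    """Canonical correlation analysis for data matrices x (p, T) and y (q, T).
    Builds L = Sxx^-1 Sxy Syy^-1 Syx; its eigenvalues are the squared
    canonical correlations, its eigenvectors the projection directions a."""
    x = x - x.mean(axis=1, keepdims=True)
    y = y - y.mean(axis=1, keepdims=True)
    T = x.shape[1]
    Sxx, Syy, Sxy = x @ x.T / T, y @ y.T / T, x @ y.T / T
    L = np.linalg.inv(Sxx) @ Sxy @ np.linalg.inv(Syy) @ Sxy.T
    lam, A = np.linalg.eig(L)
    order = np.argsort(-lam.real)
    r = np.sqrt(np.clip(lam.real[order], 0, 1))   # canonical correlations r_i
    return r, A[:, order]

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 4000))                # two latent sources
x = np.array([[1.0, 0.4], [0.2, 1.0]]) @ s        # two mixtures of s
y = np.array([[0.7, 1.0], [1.0, -0.3]]) @ s       # two other mixtures of s
r, A = cca(x, y)
```

Because y here is an exact invertible linear transform of x, both canonical correlations come out at (numerically) one; with noisy sensor data they drop below one but the leading canonical variables still capture the common source.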
CCA finds the vectors a and b maximizing ρ(a, b); mutually independent variables can then be obtained by the method of canonical correlation. By the theorems introduced in [14,15,16], the canonical correlation variables and coefficients can be obtained by the following steps:

Step 1: Compute the joint covariance of the two sets of variables:

Σ = [ Σ_xx  Σ_xy
      Σ_yx  Σ_yy ]    (3)

Step 2: Compute the canonical correlation coefficients r_i. First compute the two matrices L = Σ_xx^{−1} Σ_xy Σ_yy^{−1} Σ_yx and M = Σ_yy^{−1} Σ_yx Σ_xx^{−1} Σ_xy; then compute the eigenvalues λ_i of L and M.

Step 3: Compute the canonical variables ξ_i and η_i. Computing the eigenvectors of L and of M associated with the λ_i gives the coefficient matrices A and B, from which the canonical variables ξ_i and η_i are obtained.

3.2 Time Delay Estimation
From one receiving group, we can obtain four separated signals s_x1(n), s_x2(n) and s_y1(n), s_y2(n) by CCA. Here the target source is assumed to be s_1. For the same source, the receiving times differ with the sensor positions. Within the observation time T, assuming the total number of samples is N, the correlation function can be described as:

R̂(τ) = (1/(N − τ)) Σ_{k=τ..N−1} s_x1(k) s_y1(k − τ)    (4)
Then the time difference for the pair can be obtained as τ_1 = arg max_τ [R̂(τ)]. Supposing the time delays from the target source to array elements S_1, S_2, S_3, S_4 relative to S_0 are τ_01, τ_02, τ_03, τ_04, these can be obtained by the aforementioned estimation processing.
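The peak-picking step τ_1 = arg max_τ R̂(τ) can be sketched as follows; the white-noise test signal and the circular shift standing in for a pure delay are assumptions for illustration:

```python
import numpy as np

def delay_by_correlation(sx, sy, max_lag):
    """Return the lag maximizing the correlation R(tau) of eq. (4),
    searched over tau in [-max_lag, max_lag]."""
    N = len(sx)
    def R(tau):
        if tau >= 0:
            a, b = sx[tau:], sy[:N - tau]      # sum of sx(k) * sy(k - tau)
        else:
            a, b = sx[:N + tau], sy[-tau:]
        return np.dot(a, b) / len(a)
    return max(range(-max_lag, max_lag + 1), key=R)

rng = np.random.default_rng(1)
sx = rng.standard_normal(2000)
sy = np.roll(sx, -7)       # sy(n) = sx(n + 7): a 7-sample relative delay
tau_hat = delay_by_correlation(sx, sy, max_lag=20)
```

For broadband (noise-like) signals the correlation has a single sharp peak at the true lag; the point of the CCA preprocessing above is that the correlation is computed between separated signals, so interfering sources no longer blur this peak.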
3.3 Location Computation

The location equations can be established as:

r ≈ [c²(τ_01² + τ_02² + τ_03² + τ_04²) − 2D²] / [4c(τ_01 + τ_02 + τ_03 + τ_04)],
ϕ ≈ arctan[(τ_02 − τ_04) / (τ_01 − τ_03)],
θ ≈ arcsin[(c/D) √((τ_01 − τ_03)² + (τ_02 − τ_04)²)]    (5)

where c = 340.29 m/s is the velocity of sound. The coordinates (x, y, z) of target T can be obtained from the spherical coordinates (r, ϕ, θ) as:

x = r sinθ cosϕ,  y = r sinθ sinϕ,  z = r cosθ    (6)

where 0° ≤ θ ≤ 90° and 0° ≤ ϕ ≤ 360°. Supposing the time delay estimation precision is dτ and taking partial derivatives with respect to τ_01, τ_02, τ_03, τ_04 in Eq. (5), the standard deviations of the estimates can be described as:

dr ≈ [4rc√(D² + r²) / (D²(4 − sin²θ))] dτ,  dϕ ≈ [2c / (D sinθ)] dτ,  dθ ≈ [2c / (D cosθ)] dτ    (7)
The following equations can be obtained from Eq. (6):

dx = sinθ cosϕ dr + r cosθ cosϕ dθ − r sinθ sinϕ dϕ
dy = sinθ sinϕ dr + r cosθ sinϕ dθ + r sinθ cosϕ dϕ
dz = cosθ dr − r sinθ dθ    (8)

The passive location error can be described as: σ = √(dx² + dy² + dz²).
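Eqs. (5) and (6) can be exercised numerically as below. The delays are simulated from the array geometry; the sign convention τ_0i = (d_0 − d_i)/c and the target position are assumptions of this sketch, not values from the paper:

```python
import numpy as np

c = 340.29   # speed of sound, m/s
D = 20.0     # baseline, as in the experiments of Sect. 4

def locate(t01, t02, t03, t04):
    """Transcription of eqs. (5) and (6): approximate (r, phi, theta)
    from the four relative delays, then Cartesian coordinates."""
    r = (c**2 * (t01**2 + t02**2 + t03**2 + t04**2) - 2 * D**2) \
        / (4 * c * (t01 + t02 + t03 + t04))
    phi = np.arctan2(t02 - t04, t01 - t03)
    theta = np.arcsin((c / D) * np.hypot(t01 - t03, t02 - t04))
    x = r * np.sin(theta) * np.cos(phi)
    y = r * np.sin(theta) * np.sin(phi)
    z = r * np.cos(theta)
    return r, phi, theta, (x, y, z)

# Geometry of Fig. 1 and a far-field target at (300, 200, 400) m (assumed)
S = [np.zeros(3), np.array([D/2, 0, 0]), np.array([0, D/2, 0]),
     np.array([-D/2, 0, 0]), np.array([0, -D/2, 0])]
tgt = np.array([300.0, 200.0, 400.0])
d = [np.linalg.norm(tgt - s) for s in S]
taus = [(d[0] - di) / c for di in d[1:]]        # tau_0i = (d0 - di) / c
r, phi, theta, xyz = locate(*taus)
```

In the far field the angle estimates are very accurate (the baseline differences are nearly exact direction cosines), while the range estimate from (5) is only approximate, which is why the error analysis in (7) and (8) matters for the range coordinate.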
4 Simulations
In order to verify the validity of the location estimation algorithm, a series of experiments was conducted. The experimental setting is as follows: the distance between the phase centers of the subarrays is 20 m, two audio sources arrive from different directions, and the sampling frequency is 8000 Hz. BSS based on CCA was conducted first. The received signals are shown in Fig. 2 and the separation results by CCA in Fig. 3. Then cross-correlations between the directly received signals and the corresponding separated signals were computed; one of the correlation results is shown in Fig. 4. Many experiments were conducted while varying the additive noise. The results in Fig. 5 show that time delay estimation after CCA processing is more accurate and stable than direct correlation and GCC. To illustrate the location performance of the algorithm, we choose two cases, supposing the azimuth angle ϕ = 30° and the pitch angle θ = 45°. The experimental results are shown in Fig. 6 and Fig. 7.
Fig. 2. Mixing signals

Fig. 3. Separate signals by CCA

Fig. 4. One of the correlation results of different processing

Fig. 5. Performance comparison of different time delay estimation methods

Fig. 6. Location error with the changing of array element distance and time delay estimation precision

Fig. 7. Location error with the changing of target distance and time delay estimation precision
5 Conclusion
This paper proposed a novel passive acoustic source location method based on a five-element planar cross array. The key technique of the method is time delay estimation, which is based on BSS. The advantage of the proposed method is pronounced. The location precision depends not only on the time delay estimation precision but also on the array size, target location, pitch angle, sound field, etc.
Acknowledgement. This work was supported by NSFC (60496310, 60272046), National High Technology Project (2002AA123031) of China, NSFJS (BK2002051) and the Grant of PhD Programmers of Chinese MOE (20020286014).
References
1. Wang, Z.: Little Array High Precision Acoustic Source Passive Location. Ph.D. Thesis (1999)
2. Bjorklund, S. and Ljung, L.: A Review of Time-Delay Estimation Techniques. Proceedings of the 42nd IEEE Conference on Decision and Control, Vol. 3 (2003) 2502-2507
3. Knapp, C.H. and Carter, G.C.: The Generalized Correlation Method for Estimation of Time Delay. IEEE Trans. on Acoust., Speech, Signal Processing 24 (8) (1976) 320-327
4. Cheng, Z. and Tjhung, T.T.: A New Time Delay Estimator Based on ETDE. IEEE Trans. on Acoust., Speech, Signal Processing 51 (7) (2003) 1859-1869
5. Yi, M., Wei, P., Xiao, X.C. and Tai, H.M.: Efficient EM Initialisation Method for Time-delay Estimation. Electronics Letters 39 (6) (2003) 935-936
6. Jutten, C. and Herault, J.: Blind Separation of Sources, Part I: An Adaptive Algorithm Based on Neuromimetic Architecture. Signal Processing 24 (1) (1991) 1-10
7. Sánchez A., V. David: Frontiers of Research in BSS/ICA. Neurocomputing 49 (2002) 7-23
8. Hyvärinen, A., Karhunen, J. and Oja, E.: Independent Component Analysis. Wiley (2001)
9. Huang, G.M., Yang, L.X. and He, Z.Y.: Blind Source Separation Using for Time-Delay Direction Finding. ISNN 2004, LNCS, Vol. 3173 (2004) 660-665
10. Huang, G.M., Yang, L.X. and He, Z.Y.: Application of Blind Source Separation to Time Delay Estimation in Interference Environments. ISNN 2005, LNCS, Vol. 3497 (2005) 496-501
11. Huang, G.M., Yang, L.X. and He, Z.Y.: Time-Delay Direction Finding Based on Canonical Correlation Analysis. ISCAS 2005 (2005) 5409-5412
12. Huang, G.M., Yang, L.X. and He, Z.Y.: Application of Blind Source Separation to a Novel Passive Location. ICA 2004, LNCS, Vol. 3195 (2004) 1134-1141
13. Huang, G.M., Yang, L.X. and He, Z.Y.: Multiple Acoustic Sources Location Based on Blind Source Separation. ICNC 2005, LNCS, Vol. 3610 (2005) 683-687
14. Anderson, T.W.: An Introduction to Multivariate Statistical Analysis. 2nd edn. John Wiley & Sons (1984)
15. Zhang, R.T. and Fang, K.T.: An Introduction to Multivariate Statistical Analysis. Science House (1982)
16. Huang, G.M., Yang, L.X. and He, Z.Y.: Canonical Correlation Analysis Using for DOA Estimation of Multiple Audio Sources. IWANN 2005, LNCS, Vol. 3512 (2005) 857-864
Convolutive Blind Separation of Non-white Broadband Signals Based on a Double-Iteration Method
Hua Zhang1 and Dazhang Feng2
1 National Laboratory of Radar Signal Processing, Xidian University, 710071, Xi'an, P.R. China
[email protected]
2 National Laboratory of Radar Signal Processing, Xidian University, 710071, Xi'an, P.R. China
[email protected]
Abstract. In this paper, a convolutive blind source separation (BSS) algorithm based on a double-iteration method is proposed to process convolutively mixed non-white broadband signals. By the sliding Fourier transform (SFT), the convolutive mixture problem is converted into an instantaneous one in the time-frequency domain, which can be solved by applying an instantaneous separation method in every frequency bin. A novel cost function for each frequency bin, based on joint diagonalization of a set of correlation matrices with multiple time-lags, is constructed. By combining the proposed double-iteration method with a restriction on the length of the inverse filter in the time domain, the inverse of the transfer channel, i.e., a separation matrix with consistent permutations across all frequencies, can be estimated, after which it is easy to compute the recovered source signals. Simulation results illustrate that the algorithm has not only fast convergence but also high recovery accuracy and output SER (Signal to Error of reconstruction Ratio).
1 Introduction
In signal processing, the source signals must frequently be recovered only from the observed signals, a mixture of multiple unknown stochastic signals. Since it needs no knowledge of the source signals or the mixing model, BSS has been widely applied in many fields, such as speech enhancement with multiple microphones, antenna array processing for wireless communication, separation of multiple mobile signals, and so on. In the last twenty years, more and more researchers have paid attention to this area and achieved many remarkable results. The available algorithms can be mainly divided into two categories: 1) instantaneous BSS [3-5], and 2) convolutive BSS. Since convolutive BSS is close to the actual environment, it has received more and more attention in recent years [6,7]. Compared with blind separation of instantaneously mixed signals, the convolutive case is a challenging problem, since a great number of parameters are needed to describe the transfer channels of the source signals. In this paper, the notations A and W denote the mixing channel and the separation matrix, respectively. The M received signals can be modeled as
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1195 – 1201, 2006. © Springer-Verlag Berlin Heidelberg 2006
x(t) = Σ_{τ=1..P} A(τ) s(t − τ + 1) + n(t)    (1)
where the vectors s(t) and n(t) denote the N independent source signals and the M-dimensional additive noise, which is spatially and temporally white with identical variance σ_n² and independent of the source signals. The positive constant P denotes the order of the mixing channel. One approach is to treat the problem as a backward model and construct an FIR inverse filter W such that the filter output
y(t) = Σ_{τ=1..Q} W(τ) x(t − τ + 1)    (2)
is a close estimate of the source signals. According to the T-point SFT [4,11], which converts a discrete-time signal into its time-frequency domain representation, the received and recovered signals can be written in the following forms:

x̃(jϖ_k, t) = A(jϖ_k) s̃(jϖ_k, t) + ñ(jϖ_k, t)    (3)

ỹ(jϖ_k, t) = W(jϖ_k) x̃(jϖ_k, t) = W(jϖ_k) A(jϖ_k) s̃(jϖ_k, t) + W(jϖ_k) ñ(jϖ_k, t)    (4)
The autocorrelation matrices of the received signals with zero delay and with time-lag τ can be expressed as (5) and (6), respectively:

R(jϖ_k, 0) = < x̃(jϖ_k, t) x̃^H(jϖ_k, t) > = A(jϖ_k) Λ_{s,k}(0) A^H(jϖ_k) + σ_n² I    (5)

R(jϖ_k, τ) = < x̃(jϖ_k, t) x̃^H(jϖ_k, t + τ) > = A(jϖ_k) Λ_{s,k}(τ) A^H(jϖ_k)    (6)
where < ⋅ > denotes the statistical expectation operator, and the diagonal matrices Λ_{s,k}(0) and Λ_{s,k}(τ) are autocorrelation matrices of the source signals. By the definition of a non-mixing matrix and the Darmois theorem presented in [2] by J. F. Cardoso, the source signals can be completely recovered up to an arbitrary scaling factor and an arbitrary permutation. Many methods have been proposed to solve the permutation problem, such as the IFC (Inter-Frequency Coherency) proposed by Futoshi Asano in [3]. In [6], a restriction on the length of the inverse filter of the mixing channel is used to solve the problem, and a further explanation is given in [7]. This paper focuses on convolutive blind separation of non-white broadband signals based on a double-iteration method. The organization of this paper is as follows. Part 2 presents a double-iteration method based on a novel cost function to estimate the inverse filter and the source signals. Part 3 displays experimental results. Part 4 concludes the proposed algorithm.
2 Double-Iteration Method for the Convolutive BSS Problem
2.1 The Whitening Matrix

The whitening matrix can be computed by separating the signal subspace from the noise subspace, which is achieved by the eigenvalue decomposition (EVD) of R(jϖ_k, 0), i.e.
R(jϖ_k, 0) = U_k Λ_k(0) U_k^H    (7)
where Λ_k(0) = diag[λ_1(0),...,λ_M(0)] with λ_1(0) > ... > λ_M(0), and U_k = [u_{k,1},...,u_{k,M}], in which the column u_{k,i} is the eigenvector of R(jϖ_k, 0) associated with the eigenvalue λ_{k,i}(0). According to the properties of the eigenvalues [9], formulation (7) can be rewritten as:

R(jϖ_k, 0) = U_{s,k} Λ̂_{s,k}(0) U_{s,k}^H + σ_{n,k}² I    (8)

where Λ̂_{s,k}(0) ≡ diag[λ_{k,1}(0) − σ_{n,k}²,...,λ_{k,N}(0) − σ_{n,k}²], U_{s,k} = [u_{k,1},...,u_{k,N}], and σ_{n,k}² = (1/(M − N)) Σ_{i=N+1..M} λ_{k,i}. From formulas (5) and (8), the expression below
ˆ (0) PkH Pk ≡ ( Λ s,k
−
1 2
ˆ (0) U sH, k ) H ( Λ s,k
−
1 2
U sH, k ) = A ( jϖ k ) Λ s , k ( 0 ) A H ( jϖ k )
(9)
is held, where Pk is a so-called whitening matrix that converts mixed matrix into a unitary matrix. Since there is an uncertainty of scalar, without loss of generality, assume that the matrix Λ s , k ( 0 ) ≡ I . The formulation (10) is achieved by pre- and post-multiplying the formula (9) by the whitening matrix Pk and PkH , respectively. I = Pk A ( jϖ k ) Λ s , k ( 0 ) A H ( jϖ k )PkH = ( Pk A ( jϖ k ) )( Pk A ( jϖ k ) ) ≡ QQ H . H
(10)
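As a concrete illustration, the whitening step of (7)–(10) can be sketched numerically as follows. This is a minimal NumPy sketch, not the authors' code; the function name and interface are our own, and the noise-variance estimate follows the averaging formula above.

```python
import numpy as np

def whitening_matrix(R0, n_sources):
    """Whitening matrix P_k from the zero-lag correlation R(jw_k, 0), cf. (7)-(9).

    R0        : M x M Hermitian correlation matrix at one frequency bin
    n_sources : N, the number of sources (N < M)
    """
    # EVD of R0 with eigenvalues sorted in descending order, cf. (7)
    lam, U = np.linalg.eigh(R0)
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    # noise variance: average of the M - N smallest eigenvalues
    sigma2 = lam[n_sources:].mean()
    # signal-subspace eigenvalues with the noise floor removed, cf. (8)
    lam_s = lam[:n_sources] - sigma2
    U_s = U[:, :n_sources]
    # P_k = Lambda_hat(0)^(-1/2) U_s^H, cf. (9)
    return np.diag(lam_s ** -0.5) @ U_s.conj().T
```

With this P_k, the whitened mixing matrix Q = P_k A(jϖ_k) is unitary up to estimation error, as stated in (10).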
2.2 Double-Iteration Method for Single Sub-band Signals

The N × N matrix R(jϖ_k, τ) can be constructed by pre- and post-multiplying formulation (6) by P_k and P_k^H, respectively, i.e.
\[ R(j\varpi_k, \tau) \equiv Q_k \Lambda_{s,k}(\tau) Q_k^H. \tag{11} \]
Since formulation (11) has the same form as the EVD of R(jϖ_k, τ), a novel cost function for a single component of the non-white signals, based on the joint diagonalization of a set of L autocorrelation matrices with multiple time lags, can be constructed according to the least-squares (LS) criterion, i.e.
\[ \min_{q_n,\, \lambda_{n,1},\dots,\lambda_{n,L}} J(q_n, \lambda_{n,1},\dots,\lambda_{n,L}) = \sum_{l=1}^{L} \big\| R_l q_n - \lambda_{n,l} q_n \big\|_F^2 \quad \text{s.t. } \|q_n\|^2 = 1, \qquad n = 1,\dots,N, \tag{12} \]
where R_l is short for R(jϖ_k, τ_l), the frequency bin being omitted, and the vector q_n denotes the eigenvector of R_l associated with the eigenvalue λ_{n,l}. In order to determine the minimum point of this cost function, a double-iteration method is used.
H. Zhang and D. Feng
Firstly, at the k-th iteration step, a set of parameters λ_{n,l}(k) (l = 1,…,L) can be estimated by letting the gradient of J(q_n(k−1), λ_{n,1}(k−1), …, λ_{n,L}(k−1)) with respect to λ_{n,l} (l = 1,…,L) be zero, which yields λ_{n,l}(k) = q_n^H(k−1) R_l q_n(k−1).
Second, substitute the estimated parameters λ_{n,l}(k) (l = 1,…,L) into (12) and let
\[ C(k) \equiv \sum_{l=1}^{L} \big( R_l - \lambda_{n,l}(k) I \big)^H \big( R_l - \lambda_{n,l}(k) I \big). \tag{13} \]
It is easily shown that the estimate q_n(k) minimizing the cost function (12) is the eigenvector of C(k) associated with its smallest eigenvalue. Third, repeat the first two steps until ‖q_n(k) − q_n(k−1)‖_F^2 < ε_1, where ε_1 is a small positive constant; then take q̂_n = q_n(k) as the estimate of q_n.
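The two alternating steps above can be sketched for a single component as follows. This is a minimal NumPy sketch with our own names; the phase alignment of the eigenvector and the iteration cap are implementation details we add, not part of the paper.

```python
import numpy as np

def double_iteration(R_list, q0, eps=1e-9, max_iter=200):
    """Alternating estimation of (q_n, lambda_{n,l}) for cost function (12).

    R_list : list of L Hermitian (whitened) correlation matrices R_l
    q0     : initial vector q_n(0), normalized internally
    """
    q = q0 / np.linalg.norm(q0)
    for _ in range(max_iter):
        # step 1: zeroing the gradient w.r.t. lambda_{n,l} gives Rayleigh quotients
        lam = [np.real(q.conj() @ R @ q) for R in R_list]
        # step 2: build C(k) of (13); its smallest-eigenvalue eigenvector minimizes (12)
        C = sum((R - l * np.eye(len(q))).conj().T @ (R - l * np.eye(len(q)))
                for R, l in zip(R_list, lam))
        w, V = np.linalg.eigh(C)
        q_new = V[:, 0]                      # eigenvector of the smallest eigenvalue
        # remove the arbitrary sign/phase of the eigenvector for the stopping test
        ip = np.vdot(q_new, q)
        if abs(ip) > 1e-12:
            q_new = q_new * (ip / abs(ip))
        if np.linalg.norm(q_new - q) ** 2 < eps:
            return q_new
        q = q_new
    return q
```

For exactly jointly diagonalizable R_l, the iteration reaches a common eigenvector within a couple of steps, since C(k) is then diagonal in the shared eigenbasis.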
If all the components of Q_k have been calculated, denote Q̃_k ≡ [q̂_1, …, q̂_N]. In order to guarantee that the estimate of Q_k is a unitary matrix, perform a QR decomposition of Q̃_k. Moreover, the global convergence of the proposed cost function for a single component can be proved by using the definition and properties of a Lyapunov (energy) function [10, 11].

2.3 Permutation Problem

However, there is a permutation problem in the broadband blind separation problem, since an arbitrary permutation of the coordinates at each frequency leads to the same value of J(q_n, λ_{n,1}, …, λ_{n,L}); this is a severe obstacle to recovering the source signals correctly. In this paper, the problem is solved by setting the coefficient matrix w(τ) of the time-domain inverse filter to zero for τ > Q, where Q denotes the order of the inverse filter. This measure effectively restricts the solution to be continuous or 'smooth' in the frequency domain, so that the separation matrix has a consistent permutation [6, 7]. The i-th iteration of the method used to compute the inverse filter with consistent permutation is summarized as follows:
(a) Calculate W_k(i) by the following formulation:
\[ W_k(i) = \hat{Q}_k^H(i)\, P_k, \qquad k = 1, \dots, T. \tag{14} \]
Then compute the time-domain expression w(i) of W_k(i) by performing the inverse fast Fourier transform (IFFT) of W_k(i), and set the last T − Q parameters to zero.
(b) If the condition ‖w(i) − w(i−1)‖_F^2 < ε_2 is satisfied, regard w(i) as the estimate of the inverse filter in the time domain.
(c) If the condition ‖w(i) − w(i−1)‖_F^2 < ε_2 is not satisfied, convert the inverse filter w(i) into its frequency-domain expression, again denoted W_k(i), and compute Q̂_k(i+1) by the following formulation:
\[ \hat{Q}_k(i+1) = \hat{\Lambda}_{s,k}(0)^{\frac{1}{2}}\, U_{s,k}^H\, W_k^H(i), \qquad k = 1, \dots, T. \tag{15} \]
Then let i = i + 1 and return to (a). Once the inverse filter has been estimated by the above iteration, it is straightforward to compute the recovered source signals.
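The length restriction applied in step (a) amounts to an IFFT over the frequency bins, a truncation to Q taps, and an FFT back. A minimal sketch (NumPy; the function name and the T × N × M storage convention for the bins are our own assumptions):

```python
import numpy as np

def smooth_inverse_filter(W_freq, Q):
    """One smoothing pass: constrain the inverse filter to Q time-domain taps.

    W_freq : T x N x M array holding W_k(i) for the T frequency bins
    Q      : order of the time-domain inverse filter (Q < T)
    Returns the constrained filter in the time and frequency domains.
    """
    # time-domain coefficients w(tau) via the IFFT over the frequency axis
    w_time = np.fft.ifft(W_freq, axis=0)
    # zero the last T - Q coefficient matrices: this enforces the 'smooth',
    # consistently permuted solution across frequency described in Sect. 2.3
    w_time[Q:] = 0
    # back to the frequency domain for the next pass of the iteration
    return w_time, np.fft.fft(w_time, axis=0)
```

A second application of the pass leaves the filter unchanged, which is what the convergence test on ‖w(i) − w(i−1)‖ exploits.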
3 Experiment Results

In the simulations, there are M = 6 received signals, obtained by passing two speech signals of identical variance through a convolutive mixing system [13]. Concerning the additive white noise, the signal-to-noise ratio (SNR) is 15 dB. Since speech is a short-term stationary source and an excessive DFT length would violate this stationarity, we set the DFT length to 256. We also set the order of the inverse filter to 32 and the two thresholds to ε_1 = 10^{−3} and ε_2 = 10^{−12}, respectively. We use L = 5 autocorrelation matrices with time lags τ_l = 3l (l = 1,…,5) to construct the cost function. Once the algorithm converges, we obtain the estimate of the inverse filter; passing the received signals through the inverse filter then yields the recovered source signals.
[Figure: waveforms of source signals 1 and 2, the signals recovered by Parra's method, and the signals recovered by the proposed method; (a) signal 1, (b) signal 2.]
Fig. 1. Recovered source signals using two methods
Fig. 1 shows the recovered source signals. The waveforms of the source signals and of the signals recovered by Parra's method are also shown for convenience of comparison. Since we take no account of the amplitude of the waveforms, the recovered source signals are multiplied by an adjusting factor computed by the following formulation:
\[ \beta_i = s_i y_i^H \big( y_i y_i^H \big)^{-1}, \qquad i = 1, 2, \dots, N, \tag{16} \]
where
s_i denotes the source signal and y_i the recovered one. Fig. 2 shows the convergence curves; the first is generated by the proposed method and the other by Parra's method. In Fig. 2, the signal-to-error ratio (SER) is calculated by formulations (17) and (18):
\[ e_i = y_i - s_i, \qquad i = 1, 2, \dots, N. \tag{17} \]
\[ \mathrm{SER} = 10 \log \big( \|s_i\|^2 / \|e_i\|^2 \big). \tag{18} \]
[Figure: SER in dB versus iteration number for the proposed method (top) and Parra's method (bottom).]
Fig. 2. Convergence curves of the two methods
Table 1 lists some performance figures of the two methods. These parameters indicate clear improvements in convergence speed, reconstruction error, and SER, as computed by formulas (17) and (18).
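The amplitude correction (16) and the SER of (17)–(18) can be computed per channel as in the following small sketch (the function name is ours; the error is measured after applying the amplitude correction to the recovered signal):

```python
import numpy as np

def scale_and_ser(s, y):
    """Amplitude correction (16) and signal-to-error ratio (17)-(18).

    s, y : 1-D arrays holding one source signal and its recovered estimate
    """
    # adjusting factor beta_i = s y^H (y y^H)^(-1): least-squares fit of y to s
    beta = (s @ y.conj()) / (y @ y.conj())
    # reconstruction error after the amplitude correction, cf. (17)
    e = beta * y - s
    # SER in dB, cf. (18)
    ser = 10 * np.log10(np.linalg.norm(s) ** 2 / np.linalg.norm(e) ** 2)
    return beta, ser
```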
Table 1. Some parameters of the two methods

Method    | Iteration time | E{e1}  | E{e2}  | D{e1}  | D{e2}  | SER{e1}    | SER{e2}
proposed  | 4.6570         | 0.0937 | 0.3288 | 0.0880 | 0.2671 | 23.3065 dB | 13.2032 dB
Parra's   | 22.9370        | 0.3388 | 0.3088 | 0.2903 | 0.2676 | 12.3667 dB | 13.1828 dB
4 Conclusion

In sum, the proposed algorithm, based on a novel cost function, is an efficient method for the convolutive BSS problem. The convolutive mixtures are changed into instantaneous ones by using the DFT. The whitening matrix for each frequency is then calculated by the EVD of the zero-lag autocorrelation matrix of the received signals, and the cost function is constructed from the joint diagonalization of a set of autocorrelation matrices with multiple time lags. A double-iteration method is used to minimize the cost function, and a restriction on the length of the inverse filter guarantees that the estimated separation matrix has consistent permutations for all frequencies. Finally, the simulations illustrate that the proposed method has not only faster convergence but also higher recovery accuracy and SER than Parra's method, so it can be regarded as an effective approach to the convolutive BSS of non-white broadband signals.
References

[1] Comon, P.: Blind Separation of Sources: A New Concept. Signal Processing 36(3) (1994) 287–314
[2] Cardoso, J.F.: Blind Signal Separation: Statistical Principles. Proc. of the IEEE 86(10) (1998) 2009–2025
[3] Futoshi, A., Shiro, I., Michiaki, O., Hideki, A., Nobuhiko, K.: A Combined Approach of Array Processing and Independent Component Analysis for Blind Separation of Acoustic Signals. IEEE (2001)
[4] Belouchrani, A., Amin, M.G.: Blind Source Separation Based on Time-Frequency Signal Representations. IEEE Transactions on Signal Processing 46(11) (1998) 2888–2897
[5] Peterson, M.: Simulating the Response of Multiple Microphones to a Single Acoustic Source in a Reverberant Room. J. Acoust. Soc. Amer. 80(11) (1986) 1527–1529
[6] Parra, L., Spence, C.: Convolutive Blind Separation of Non-Stationary Sources. IEEE Transactions on Speech and Audio Processing 8(5) (2000) 320–327
[7] Buchner, H., Aichner, R., Kellermann, W.: A Generalization of Blind Source Separation Algorithms for Convolutive Mixtures Based on Second-Order Statistics. IEEE Transactions on Speech and Audio Processing 13(1) (2005) 120–134
[8] Asano, F., Hayamizu, S., Yamada, T., Nakamura, S.: Speech Enhancement Based on the Subspace Method. IEEE Transactions on Speech and Audio Processing 8(9) (2000) 497–507
[9] Hopfield, J.J.: Neural Networks and Physical Systems with Emergent Collective Computational Abilities. Proc. Natl. Acad. Sci. USA 79 (1982) 2554–2558
[10] LaSalle, J.P.: The Stability of Dynamical Systems. SIAM Press, Philadelphia, PA (1976)
[11] Dapena, A., Serviere, C., Castedo, L.: Inversion of the Sliding Fourier Transform Using Only Two Frequency Bins and Its Application to Source Separation. Signal Processing 83 (2003) 453–457
[12] Feng, D.Z., Zhang, X.D., Bao, Z.: An Efficient Multistage Decomposition Approach for Independent Components. Signal Processing 83(1) (2003) 181–197
[13] All the data and mixing filters used in this paper can be obtained at http://www2.elc.tur.nl//ica99
Multichannel Blind Deconvolution Using a Novel Filter Decomposition Method

Bin Xia and Liqing Zhang
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200030, China
[email protected]; [email protected]
Abstract. In our previous work [11], we introduced a filter decomposition method for blind deconvolution in non-minimum phase systems. To simplify the deconvolution procedure, we further study the demixing filter and modify its cascade structure. In this paper, we introduce a novel two-stage algorithm for blind deconvolution. In the first stage, we present a permutable cascade structure constructed from a causal filter and an anti-causal scalar filter; we develop an SOS-based algorithm for the causal filter and derive a natural gradient algorithm for the anti-causal scalar filter. At the second stage, we apply an instantaneous ICA algorithm to eliminate the residual instantaneous mixtures. Computer simulations show the validity and effectiveness of this approach.
1 Introduction

Blind deconvolution has attracted considerable attention recently. The main idea of blind deconvolution is to retrieve the independent source signals from sensor outputs by using only the sensor signals and certain knowledge of the statistics of the source signals. In the last decade, a number of algorithms [3, 4, 7, 8] have been developed for blind deconvolution. When the mixing model is a minimum phase system, we can use a causal demixing system to recover the source signals, and many algorithms work well in this situation. In the real world, however, the mixing model is generally a non-minimum phase system. Researchers have attempted to use doubly infinite impulse response (IIR) filters to approximate the demixing model [2], but it is difficult to develop a practicable algorithm for a doubly infinite demixing filter in a non-minimum phase system. On the other hand, some researchers use doubly finite impulse response (FIR) filters as demixing models; still, for blind deconvolution in non-minimum phase, it is not easy to develop an effective algorithm for a doubly FIR filter. To deal with this problem, the filter decomposition method was introduced. Labat et al. [6] presented a cascade structure for single-channel blind equalization by decomposing the demixing model. Zhang et al. [13] provided a cascade structure for multichannel blind deconvolution. Waheed et al. [9] discussed several cascade structures for the blind deconvolution problem. In our previous work [11], we presented a permutable cascade structure for the demixing filter. We decomposed a doubly FIR filter into a causal
To whom all correspondence should be addressed.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1202–1207, 2006. c Springer-Verlag Berlin Heidelberg 2006
FIR filter and an anti-causal scalar FIR filter, and derived natural gradient algorithms for both sub-filters. The natural gradient algorithm is implemented in an iterative way, and its convergence in the learning process is not efficient. To overcome the drawback of slow convergence, we further study the structure of the demixing filter in [11], in which the main computational cost is due to the procedure of learning the coefficients of the causal filter. In this paper, we introduce a two-stage blind deconvolution algorithm. First, a cascade algorithm, which combines an SOS-based decorrelation algorithm and a natural gradient algorithm, deconvolves the mixtures. Then, we use an instantaneous ICA algorithm to demix the residual instantaneous mixtures.
2 Problem Formulation

For the blind deconvolution problem, we assume mutual statistical independence of the different source signals. For simplicity, we assume that the number of source signals is equal to the number of sensor signals. We use a linear time-invariant (LTI) system to describe the mixing model as follows:
\[ x(k) = \sum_{p=-\infty}^{\infty} H_p s(k-p), \tag{1} \]
where H_p is an n × n-dimensional matrix of mixing coefficients at time-lag p, which is called the impulse response at time p, s(k) = [s_1(k), …, s_n(k)]^T is an n-dimensional vector of source signals with mutually independent components, and x(k) = [x_1(k), …, x_n(k)]^T is the vector of the sensor signals. For practical implementation, the demixing model is described in finite impulse response (FIR) form:
\[ y(k) = \sum_{p=-L}^{L} W_p x(k-p) = W(z) x(k), \tag{2} \]
where L is the length of the demixing filter and W(z) is the demixing filter. The objective of blind deconvolution is to find a W(z) that makes the global function G(z) = W(z)H(z) satisfy the following condition:
\[ G(z) = P \Lambda B(z), \tag{3} \]
where P ∈ R^{n×n} is a permutation matrix, Λ ∈ R^{n×n} is a nonsingular diagonal scaling matrix, and B(z) = diag{B_1(z), …, B_n(z)}. In this paper we assume that the sources are i.i.d., so condition (3) can be relaxed to
\[ G(z) = P \Lambda B_0(z), \tag{4} \]
where B_0(z) = diag{z^{−b_1}, …, z^{−b_n}}. That is, we can obtain arbitrarily scaled, reordered, and delayed estimates.
3 Learning Algorithm

In the separation–deconvolution criterion [5], the convolutive mixtures can be recovered by using a two-stage strategy: in [5], a recurrent neural network decorrelation algorithm and an instantaneous blind separation algorithm were combined in this scheme. To reduce the computational cost, we introduced an SOS-based decorrelation algorithm into the two-stage deconvolution algorithm [10]. We first applied the decorrelation algorithm to the mixtures; after decorrelation, an instantaneous ICA was introduced to demix the residual instantaneous mixtures. The two-stage algorithm obtains good separation performance and works effectively for minimum phase systems. For the blind deconvolution problem in non-minimum phase, we introduced a permutable cascade structure for the demixing filter in our previous work [11]. Natural gradient algorithms were developed for both filters, but their convergence is slow due to the iteration procedure. In this paper, we combine the two-stage algorithm with the filter decomposition method to develop an effective deconvolution algorithm.

Fig. 1 shows the novel demixing structure. We still use the two-stage scheme in the new structure. In the first stage, we use a cascade structure combining a causal filter and an anti-causal scalar filter. This permutable structure was first introduced in [11], where we presented natural gradient algorithms for both filters:
\[ \Delta D_p = \eta \sum_{q=0}^{p} \big( \delta_{0,q} I - \varphi(y(k)) y^T(k-q) \big) D_{p-q}, \tag{5} \]
\[ \Delta a_p = \eta \sum_{q=0}^{p} \big( \varphi^T(y(k)) y(k+q) \big) a_{p-q}, \tag{6} \]
where ϕ(y) = (ϕ_1(y_1), …, ϕ_n(y_n))^T is the vector of non-linear activation functions, defined by ϕ_i(y_i) = −p′_i(y_i)/p_i(y_i). To simplify the computational process, we develop an SOS-based decorrelation algorithm for D(z). When p = 0, from Eq. (5) we obtain
\[ \Delta D_0 = \eta \big( I - E[\varphi(y(k)) y(k)^T] \big) D_0. \tag{7} \]
[Figure: block diagram of the system model — the unknown mixing model H(z) maps s(k) to x(k); the demixing model consists of the causal filter D(z), the anti-causal scalar filter a(z⁻¹), and the instantaneous blind separation matrix W, with intermediate signals u(k), y(k), and output r(k).]
Fig. 1. Illustration of system model
If we assume ϕ(y(k)) = y(k), Eq. (7) can be written as
\[ \Delta D_0 = \eta \big( I - E[y(k) y(k)^T] \big) D_0. \tag{8} \]
Using the PCA method [5], we can calculate D_0:
\[ D_0 = \Lambda^{-1/2} V_x^T = \mathrm{diag}\Big\{ \frac{1}{\sqrt{\lambda_1}}, \dots, \frac{1}{\sqrt{\lambda_n}} \Big\} V_x^T, \tag{9} \]
where V_x = [v_1, v_2, …, v_n] ∈ R^{n×m}. Now we use D_0 to calculate the other D_p, p = 1, …, L:
\[ [D_1 \cdots D_L] = -\frac{1}{M} \sum_{k=1}^{M} \big[ D_0 x(k) x(k-1)^T \cdots D_0 x(k) x(k-L)^T \big] \delta_{xx}^{-1}, \tag{10} \]
where δ_xx is defined as
\[ \delta_{xx} = \frac{1}{M} \sum_{k=1}^{M} \begin{bmatrix} x(k-1)x^T(k-1) & \cdots & x(k-1)x^T(k-L) \\ \vdots & \ddots & \vdots \\ x(k-L)x^T(k-1) & \cdots & x(k-L)x^T(k-L) \end{bmatrix}, \tag{11} \]
and M is the length of the batch window. In the first stage we have decorrelated the mixtures, so only an instantaneous mixture remains in the sensor signals. At the second stage, we apply an instantaneous ICA algorithm to demix the residual mixtures; here we use the instantaneous natural gradient algorithm [1]:
\[ \Delta W = \eta \big( I - \varphi(y(k)) y^T(k) \big) W. \tag{12} \]
Due to limited space, we do not provide the details of the instantaneous ICA and the natural gradient algorithms for the two sub-filters; the reader can refer to [1, 11] for more information.
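The closed-form decorrelation stage (8)–(11) can be sketched as follows. This is a minimal NumPy sketch with our own batch interface and names, assuming real-valued signals; it is not the authors' implementation.

```python
import numpy as np

def sos_decorrelation(x, L):
    """Closed-form estimate of the causal decorrelation filter D_0, ..., D_L.

    x : n x K array of sensor signals
    L : order of the causal decorrelation filter
    Returns the list [D_0, D_1, ..., D_L], cf. (9)-(11).
    """
    n, K = x.shape
    # D_0 whitens the zero-lag covariance via PCA, cf. (9)
    lam, V = np.linalg.eigh(x @ x.T / K)
    D0 = np.diag(lam ** -0.5) @ V.T
    # stacked delayed snapshots [x(k-1); ...; x(k-L)] for k = L, ..., K-1
    k = np.arange(L, K)
    Xlag = np.vstack([x[:, k - tau] for tau in range(1, L + 1)])   # (nL) x |k|
    # delta_xx from (11) and the lagged cross-moments from (10)
    delta = Xlag @ Xlag.T / len(k)
    cross = (D0 @ x[:, k]) @ Xlag.T / len(k)                       # n x (nL)
    Dstack = -cross @ np.linalg.inv(delta)
    return [D0] + [Dstack[:, i * n:(i + 1) * n] for i in range(L)]
```

By construction, D_0 applied to the raw mixtures yields signals with identity sample covariance, which is the whitening property the second stage relies on.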
4 Simulation

We now present simulation examples to illustrate the performance of the proposed blind deconvolution algorithm. The first simulation illustrates the separation performance of the proposed algorithm in a non-minimum phase system. Generally, if the global function is close to an identity filter, we obtain good estimates of the sources. The coefficients of the global function are shown in Fig. 2; it is obvious that the proposed algorithm achieves good separation performance. In the second simulation experiment, we compare the proposed algorithm with the filter decomposition (FD) algorithm [12] and the permutable filter decomposition (PFD) algorithm of [11], using the inter-symbol interference (ISI) as a performance criterion. In Fig. 3, the MISI measurement shows that the performance of PFD and SOS-PFD is similar, and both are better than FD's. The convergence speed of SOS-PFD is the best of the three algorithms because it applies an SOS-based algorithm.
[Figure: coefficients of the nine sub-channels G(z)_{i,j}, i, j = 1, 2, 3, of the global function after convergence.]
Fig. 2. The coefficients of the global function after convergence
[Figure: MISI versus iteration number for the SOS-PFD, FD, and PFD algorithms.]
Fig. 3. Comparison results of MISI in non-minimum phase system
5 Conclusions

In this paper, we present a two-stage algorithm for blind deconvolution in non-minimum phase systems. At the first stage, we use a cascade structure consisting of a causal filter and an anti-causal scalar filter; we develop an SOS-based closed-form decorrelation algorithm for the causal filter, and use a natural gradient algorithm for the anti-causal scalar filter. At the second stage, we apply an instantaneous ICA algorithm to demix the residual instantaneous mixtures. The simulation results show that the proposed algorithms are stable and efficient.
Acknowledgment The work was supported by the National Basic Research Program of China (Grant No. 2005CB724301) and National Natural Science Foundation of China (Grant No.60375015).
References

1. Amari, S.: Natural Gradient Works Efficiently in Learning. Neural Computation 10(2) (1998) 251–276
2. Amari, S., Cichocki, A., Yang, H.H.: A New Learning Algorithm for Blind Signal Separation. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (eds.): Advances in Neural Information Processing Systems 8 (NIPS 95), MIT Press, Cambridge, MA (1996) 757–763
3. Amari, S., Douglas, S., Cichocki, A., Yang, H.H.: Novel On-line Algorithms for Blind Deconvolution Using Natural Gradient Approach. In: Proc. 11th IFAC Symposium on System Identification, SYSID'97, Kitakyushu, Japan, July (1997) 1057–1062
4. Bell, A.J., Sejnowski, T.J.: An Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation 7(6) (1995) 1129–1159
5. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing. Wiley, May (2002)
6. Labat, J., Macchi, O., Laot, C.: Adaptive Decision Feedback Equalization: Can You Skip the Training Period? IEEE Trans. on Communications 46 (1998) 921–930
7. Tong, L., Xu, G., Kailath, T.: Blind Identification and Equalization Based on Second-Order Statistics: A Time Domain Approach. IEEE Trans. Information Theory 40 (1994) 340–349
8. Tugnait, J.K., Huang, B.: Multistep Linear Predictors-Based Blind Identification and Equalization of Multiple-Input Multiple-Output Channels. IEEE Trans. on Signal Processing 48 (2000) 26–28
9. Waheed, K., Salam, F.M.: Cascaded Structures for Blind Source Recovery. In: 45th IEEE Int'l Midwest Symposium on Circuits and Systems, Tulsa, Oklahoma 1 (2002) 656–659
10. Xia, B., Zhang, L.Q.: Combining Second and High Order Statistics Methods in Multichannel Blind Deconvolution. Submitted to ICA2006.
11. Xia, B., Zhang, L.Q.: Multichannel Blind Deconvolution of Non-minimum Phase Systems Using Cascade Structure. In: Pal, N.R., Kasabov, N., Mudi, R.K., Pal, S., Parui, S.K. (eds.): Neural Information Processing, 11th International Conference. Lecture Notes in Computer Science 3316, Springer-Verlag, Calcutta, India (2004) 1186–1191
12. Zhang, L.Q., Amari, S., Cichocki, A.: Multichannel Blind Deconvolution of Non-minimum Phase Systems Using Filter Decomposition. IEEE Trans. Signal Processing 52(5) (2004) 1430–1442
13. Zhang, L.Q., Cichocki, A., Amari, S.: Multichannel Blind Deconvolution of Non-minimum Phase Systems Using Information Backpropagation. In: Proceedings of ICONIP'99, Perth, Australia (1999) 210–216
Two-Stage Blind Deconvolution for V-BLAST OFDM System

Feng Jiang, Liqing Zhang, and Bin Xia
Department of Computer Science and Engineering, Shanghai Jiao Tong University
[email protected], [email protected]

Abstract. In this paper, we integrate the orthogonal frequency-division multiplexing (OFDM) technique with the vertical Bell Labs layered space-time (V-BLAST) architecture as a promising solution for enhancing the data rates of wireless communication systems, and propose a new blind deconvolution method. A two-stage algorithm is developed to estimate the channel parameters. At the first stage, we propose an algorithm based on second-order statistics to decorrelate the sensor signals. After decorrelation, we apply an instantaneous demixing algorithm to separate the signals at the second stage. Simulation results demonstrate the validity and the performance of the proposed algorithms.
1 Introduction
High link throughput (bit rate) is one of the main goals in wireless communication systems. Currently, there exist two series of standards, IEEE 802.11 and HiperLAN, proposed by the Institute of Electrical and Electronics Engineers (IEEE) and the European Telecommunications Standards Institute (ETSI), respectively, which supply a maximum data transmission rate of 54 Mb/s. However, this is not sufficient for the increasing demands of services, so it is crucial to improve the data transmission rate of communication systems. We consider this problem from two aspects. First, we introduce a MIMO scheme and orthogonal frequency-division multiplexing (OFDM) [6] to remarkably increase the bandwidth efficiency, which results in an increase in the data transmission rate. Vertical Bell Labs layered space-time (V-BLAST) [3] is such a MIMO scheme, proposed by Bell Laboratories, and has been implemented in a lab environment. OFDM is an integration of Multi-Carrier Modulation (MCM) and Frequency Division Multiplexing (FDM) that operates with specific orthogonality constraints between the subcarriers; due to these constraints, it achieves a very high spectral efficiency. Second, we present a two-stage algorithm for the blind deconvolution problem. As we know, second-order statistics alone are not sufficient for separating symbol-spaced systems, but they can be used to decorrelate the sensor signals. By using a decorrelation algorithm as a preprocessing step, the mixing model reduces to an instantaneous mixture, which becomes easier to demix by a general ICA algorithm.
Corresponding Author.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1208–1213, 2006. c Springer-Verlag Berlin Heidelberg 2006
2 System Model
Now we consider a V-BLAST OFDM system with M transmit (TX) and N receive (RX) antennas. Fig. 1 shows the block diagram of the V-BLAST OFDM system scheme considered in this paper. Basically, the MIMO OFDM transmitter consists of M OFDM transmitters, where the source signal is spatially multiplexed; each branch then in parallel performs QAM modulation (implemented in the V-BLAST encoder), OFDM frame assembly, the Nc-point IDFT, and the P/S transform, and the signals are finally transmitted.
[Figure: at each TX branch, source data passes through the V-BLAST encoder, OFDM frame assembly, IDFT, and P/S conversion; the RX antennas feed the blind deconvolution and channel estimation block.]
Fig. 1. V-BLAST OFDM system scheme
In the receiver, the sensor signals are used to perform blind deconvolution. Subsequently, the Nc-point DFT is performed in each receiver, the OFDM frames are disassembled, and finally the data in each receiver branch undergo V-BLAST decoding, which is the inverse process of V-BLAST encoding.

2.1 V-BLAST Model
Vertical Bell Labs layered space-time (V-BLAST) is a simplified version of BLAST, which was proposed in [3]. V-BLAST is a kind of Space Division Multiplexing (SDM) approach [5]; SDM achieves a higher throughput by transmitting independent data streams on different transmit antennas simultaneously, and the same 4-point QAM modulator constellation is used for each substream.

2.2 OFDM Model
In a normal FDM system, the subcarriers are spaced apart in such a way that the signals can be received using conventional filters and demodulators, so guard bands have to be introduced between the different subcarriers, which results in a lower spectral efficiency. However, it is possible to arrange the subcarriers in an OFDM signal so that the signals can still be received without adjacent-carrier interference; the only requirement is to keep the subcarriers mathematically orthogonal. OFDM can be implemented using modern digital signal processing techniques, such as the discrete Fourier transform (DFT).
2.3 Mixing and Demixing Model
We consider a convolutive mixing model. For blind deconvolution problems, we assume mutual statistical independence of the different source signals. For the sake of simplicity, we assume that the number of source signals equals the number of sensor signals. The mixing model can be described as follows:
\[ x(k) = \sum_{p=0}^{L} H_p s(k-p), \tag{1} \]
or in compact form x(k) = H(z)s(k), where H_p is an n × n-dimensional matrix of mixing coefficients at time-lag p, s(k) = [s_1(k), s_2(k), …, s_n(k)]^T is an n-dimensional vector of source signals (corresponding to the n transmit antennas), and x(k) = [x_1(k), x_2(k), …, x_n(k)]^T is the n-dimensional vector of the sensor signals. In the blind deconvolution problem, we are interested in finding a demixing filter W(z). The demixing model is described as
\[ y(k) = \sum_{p=0}^{L} W_p x(k-p), \tag{2} \]
where \(W(z) = \sum_{p=0}^{L} W_p z^{-p}\). Our goal is to find a W(z) that can separate the mixed signals, so we define the global function
\[ G(z) = W(z) H(z). \tag{3} \]
Our objective is to find the demixing filter such that the following relation holds:
\[ G(z) = P \Lambda D(z), \tag{4} \]
where P ∈ R^{n×n} is a permutation matrix, D(z) = diag{z^{−d_1}, …, z^{−d_n}}, and Λ ∈ R^{n×n} is a nonsingular diagonal scaling matrix. Such a W(z) is a solution of blind deconvolution. In the following section, we develop a two-stage algorithm for the filter W(z).
3 Algorithm
In this paper, we develop a two-stage algorithm. At the first stage, we derive a closed-form SOS-based algorithm that avoids iterative processing. Then we use a natural gradient ICA algorithm to demix the sensor signals. The deconvolution filter is composed of one decorrelation filter W(z) and a demixing matrix W(d). In order to develop an SOS-based algorithm for decorrelation, we use the cost function [8]
\[ l(y, W(z)) = -\log |\det(W_0)| - \sum_{i=1}^{n} \log p_i(y_i(k)), \tag{5} \]
where p_i(y_i(k)) is the probability density function of the random variable y_i(k). The natural gradient algorithm for W(z) can be obtained as follows [2][8]:
\[ \Delta W_p = \eta \sum_{q=0}^{p} \big( \delta_{0,q} I - \varphi(y(k)) y(k-q)^T \big) W_{p-q}, \tag{6} \]
where ϕ(y) = (ϕ_1(y_1), …, ϕ_n(y_n))^T is the vector of nonlinear activation functions, defined by ϕ_i(y_i) = −p′_i(y_i)/p_i(y_i). So we have
\[ \Delta W_0 = \eta \big( I - E[\varphi(y(k)) y(k)^T] \big) W_0. \tag{7} \]
Assume that the activation function is linear; then Eq. (7) can be written as
\[ \Delta W_0 = \eta \big( I - E[y(k) y(k)^T] \big) W_0. \tag{8} \]
Using the PCA method, W_0 can be calculated as
\[ W_0 = U \Lambda^{-1/2} V_x^T, \tag{9} \]
where U and V_x are orthogonal matrices and Λ = diag(λ_1, …, λ_n). Then we can use W_0 to calculate W_1, …, W_L:
\[ [W_1 \cdots W_L] = -\frac{1}{M} \sum_{k=1}^{M} \big[ W_0 x(k) x(k-1)^T \cdots W_0 x(k) x(k-L)^T \big] \delta_{xx}^{-1}, \tag{10} \]
where δ_xx is defined as
\[ \delta_{xx} = \frac{1}{M} \sum_{k=1}^{M} \begin{bmatrix} x(k-1)x^T(k-1) & \cdots & x(k-1)x^T(k-L) \\ \vdots & \ddots & \vdots \\ x(k-L)x^T(k-1) & \cdots & x(k-L)x^T(k-L) \end{bmatrix}. \tag{11} \]
After decorrelation, the mixing model is reduced to an instantaneous mixture. For instantaneous blind source separation problems, many algorithms work well; here we apply the natural gradient algorithm [1]:
\[ \Delta W(d) = \eta \big( I - \varphi(y(k)) y^T(k) \big) W(d). \tag{12} \]
Due to limited space, we do not present the details of the decorrelation and instantaneous algorithms here; the reader is referred to [1][9] for more information.
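Update (12) can be sketched as a batch iteration in NumPy as follows. This is a minimal illustrative sketch, not the authors' code; tanh is our choice for the activation φ (suitable for super-Gaussian sources), since the paper leaves φ general.

```python
import numpy as np

def natural_gradient_ica(x, eta=0.01, n_iter=300):
    """Instantaneous natural gradient ICA: W <- W + eta (I - phi(y) y^T) W, cf. (12).

    x : n x K array of (decorrelated) sensor signals
    """
    n, K = x.shape
    W = np.eye(n)
    for _ in range(n_iter):
        y = W @ x
        phi = np.tanh(y)                 # assumed activation for super-Gaussian sources
        # batch estimate of E[phi(y) y^T]
        W = W + eta * (np.eye(n) - phi @ y.T / K) @ W
    return W
```

After convergence, the global matrix W A should approximate a permuted, scaled identity, as required by condition (4).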
4 Simulation Results
A number of computer simulations have been performed to investigate the validity and performance of the proposed blind deconvolution algorithm. In order to remove the effect of a single numerical trial on evaluating the performance of algorithms, we use the ensemble average approach, that is, for each trial, we
[Figure: sensor-signal constellation (left) and output constellation (right).]
Fig. 2. Sensor signal constellation and output constellation
[Figure: ISI performance versus iteration number.]
Fig. 3. Performance of algorithms
obtain a value of the performance measure, and then take the average of the values over the different trials. We illustrate the simulation results in the following sections. In this simulation, the sensor signals are generated by the multichannel convolutive model [4]; the mathematical description is given by a linear state-space discrete-time system:
\[ u(k+1) = A u(k) + B s(k), \tag{13} \]
\[ x(k) = C u(k) + D s(k). \tag{14} \]
In this example, we set up a 3 × 3 channel mixing system; the size of the source signals is 3 × 12000, the SNR is 20 dB, and the number of OFDM subcarriers is 4. In Fig. 2, it is worth noting that the output signals converge to the characteristic QAM constellation, which can then be demodulated with low BER. Fig. 4(a) and Fig. 4(b) plot the decorrelation function and the global transfer function, respectively. To evaluate the performance of the proposed algorithms, we employ the multichannel intersymbol interference, denoted MISI [7], as a criterion. Fig. 3 illustrates the performance of the two-stage algorithm. It is obvious that this algorithm needs fewer than 100 iterations to obtain satisfactory results.
Fig. 4. The coefficients of the partial (decorrelation) function (a) and the global function (b)
Two-Stage Blind Deconvolution for V-BLAST OFDM System
5 Conclusion
In this paper we have presented a new two-stage blind deconvolution method for V-BLAST OFDM systems. For the decorrelation filter, we derive an SOS-based closed-form solution; after this, we apply an HOS-based instantaneous ICA algorithm for signal separation. Compared with other HOS blind deconvolution algorithms, it converges very quickly and reduces the computational cost. A number of computer simulations have been performed to demonstrate the efficiency and performance of the proposed algorithms.
Acknowledgment

The work was supported by the National Basic Research Program of China (Grant No. 2005CB724301) and the National Natural Science Foundation of China (Grant No. 60375015).
References

1. S. Amari. Natural Gradient Works Efficiently in Learning. Neural Computation, Vol. 10 (1998) 251-276
2. S. Amari and A. Cichocki. Adaptive Blind Signal Processing - Neural Network Approaches. Proceedings of the IEEE, 86(10) (1998) 2026-2048
3. P. W. Wolniansky, G. J. Foschini, G. D. Golden, R. A. Valenzuela. V-BLAST: An Architecture for Realizing Very High Data Rates Over the Rich-Scattering Wireless Channel. Proc. IEEE ISSSE'98 (1998)
4. L. Q. Zhang and A. Cichocki. Blind Deconvolution of Dynamical Systems: A State-Space Approach. Signal Processing, 4(2) (2000) 111-130
5. A. van Zelst. Space Division Multiplexing Algorithms. 10th Mediterranean Electrotech. Conf., Vol. 3 (2000) 1218-1221
6. Z. Wang and G. B. Giannakis. Wireless Multicarrier Communication. IEEE Signal Processing Mag., 17(3) (2000) 29-49
7. Y. Inouye and S. Ohno. Adaptive Algorithms for Implementing the Single-Stage Criterion for Multichannel Blind Deconvolution. Proc. of 5th Int. Conf. on Neural Information Processing, ICONIP'98, pp. 733-736, Kitakyushu, Japan, October 21-23 (1998)
8. L. Q. Zhang, S. Amari, and A. Cichocki. Multichannel Blind Deconvolution of Non-Minimum Phase Systems Using Filter Decomposition. IEEE Trans. Signal Processing, 52(5) (2004) 1430-1442
9. B. Xia and L. Q. Zhang. Multichannel Blind Deconvolution Using a Novel Filter Decomposition Method. Accepted by ISNN 2006
A Comparative Study on Selection of Cluster Number and Local Subspace Dimension in the Mixture PCA Models

Xuelei Hu1 and Lei Xu2

1 Department of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, China, [email protected]
2 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, NT, Hong Kong, [email protected]
Abstract. How to determine the number of clusters and the dimensions of local principal subspaces is an important and challenging problem in various applications. Based on a probabilistic model of local PCA, this problem can be solved by one of the existing statistical model selection criteria in a two-phase procedure. However, such a two-phase procedure is too time-consuming, especially when there is no prior knowledge. The BYY harmony learning has provided a promising mechanism to make automatic model selection in parallel with parameter learning. This paper investigates the BYY harmony learning with automatic model selection on a mixture PCA model in comparison with three typical model selection criteria: AIC, CAIC, and MDL. This comparative study is made by experiments for different model selection tasks on simulated data sets under different conditions. Experiments have shown that automatic model selection by the BYY harmony learning is not only as good as or even better than the conventional methods in terms of performance, but also considerably superior in terms of its much lower computational cost.
1 Introduction
Local principal component analysis (PCA) is a useful tool to explore not only clusters but also some local subspace structures in multivariate data [1, 2]. Two important questions for local PCA are how many clusters there are and what the dimension of the local subspace is for each cluster. In recent years, probabilistic models have been studied for local PCA [3, 4, 5], based on which the two questions can be solved by using one of the existing statistical model selection criteria in a two-phase procedure. That is, first estimate the unknown parameters in candidate models according to a learning principle such as the well-known maximum likelihood (ML) learning principle, and then select the optimal model according to a certain model selection criterion. Typical statistical model selection criteria include Akaike's information criterion (AIC) [6], the consistent Akaike's
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1214-1221, 2006. © Springer-Verlag Berlin Heidelberg 2006
information criterion (CAIC) [7], and the minimum description length (MDL) criterion [8, 9]. However, these two-phase methods have a very high computational cost because a whole set of candidate models must be computed. Recently, the Bayesian Ying Yang (BYY) harmony learning has provided a promising tool to make model selection implementable not only subsequently after parameter learning but also automatically during parameter learning. The BYY harmony learning was first proposed in 1995 [3] and then systematically developed in the past years [10, 11, 12]. Applying the BYY harmony learning to an alternative probabilistic mixture PCA model will automatically determine the number of clusters as well as the dimensions of the local subspaces [5]. This paper makes a further study on automatic model selection in the mixture PCA model by the BYY harmony learning via a comparative study with other popular model selection methods including AIC, CAIC, and MDL. We investigate these methods on different simulated data sets under different conditions. The comparative study has shown that the results of the BYY harmony learning are as good as or even better than those of the conventional methods. A most significant advantage of the BYY harmony learning is that it can select the cluster number and the subspace dimension of each cluster automatically and efficiently, with much less computational cost than the conventional methods. In Section 2, we briefly review the background for the mixture PCA model and the maximum likelihood (ML) learning. In Section 3, we introduce conventional two-phase model selection via three popular criteria. In Section 4, the BYY harmony learning on an alternative mixture PCA model and the algorithm with automatic model selection ability are introduced. Experimental results, comparative analysis, and discussions are given in Section 5. Finally, we draw a conclusion in Section 6.
2 Mixture PCA Model and ML Learning
Provided that x is a d-dimensional random vector of observable variables, the mixture model assumes that x is distributed according to a mixture of k underlying probability distributions

p(x) = Σ_{l=1}^{k} α_l p_l(x),    (1)
where p_l(x) is the density of the lth component in the mixture, and α_l is the proportion that an observation belongs to the lth component, with α_l ≥ 0, l = 1, ..., k, and Σ_{l=1}^{k} α_l = 1. For a mixture PCA, it is further assumed that each p_l(x) is a single probabilistic model of PCA [13, 4, 3], formed by p_l(x|y) = G(x|A_l y + c_l, σ_l^2 I_d), p_l(y) = G(y|0, I_{m_l}), and p_l(x) = ∫ p_l(x|y) p_l(y) dy = G(x|c_l, A_l A_l^T + σ_l^2 I_d), where y is an m_l-dimensional random vector of unobservable latent variables, A_l is a d × m_l loading matrix, c_l is a d-dimensional mean vector, σ_l^2 is a scalar noise variance, and G(x|μ, Σ) denotes a multivariate normal (Gaussian) distribution with mean μ and covariance matrix Σ.
Given a set of observations {x_t}, t = 1, ..., n, we can explore clusters as well as local subspace structures of the data by estimating the above mixture PCA model. Suppose the number of clusters k and the subspace dimensions m_l, l = 1, ..., k are given; one widely used method to estimate the unknown parameters θ = {α_l, A_l, c_l, σ_l^2}, l = 1, ..., k is maximum likelihood learning, i.e., θ̂ = arg max_θ L(θ), subject to α_l ≥ 0, l = 1, ..., k and Σ_{l=1}^{k} α_l = 1, where L(θ) is the log likelihood function L(θ) = Σ_{t=1}^{n} ln p(x_t), and p(x_t) takes the form of Eq. 1. The EM algorithm and an iterative generalized EM algorithm to implement the ML learning on the above mixture PCA model were given in [4].
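The log likelihood above can be evaluated directly from the component marginals G(x|c_l, A_l A_l^T + σ_l^2 I_d). The following is a minimal sketch (our own helper, not the authors' code; names are hypothetical):

```python
import numpy as np

def mixture_ppca_loglik(x, alphas, As, cs, sigma2s):
    """L(theta) = sum_t ln sum_l alpha_l G(x_t | c_l, A_l A_l^T + s_l^2 I).
    x: (n, d); As[l]: (d, m_l); cs[l]: (d,); sigma2s[l]: scalar."""
    n, d = x.shape
    densities = []
    for alpha, A, c, s2 in zip(alphas, As, cs, sigma2s):
        cov = A @ A.T + s2 * np.eye(d)       # marginal covariance of component l
        diff = x - c
        sign, logdet = np.linalg.slogdet(cov)
        maha = np.einsum('nd,dk,nk->n', diff, np.linalg.inv(cov), diff)
        logpdf = -0.5 * (d * np.log(2 * np.pi) + logdet + maha)
        densities.append(alpha * np.exp(logpdf))
    return float(np.sum(np.log(np.sum(densities, axis=0))))
```

This is the quantity the EM algorithm of [4] monotonically increases; a production implementation would work in log space to avoid underflow.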
3 Model Selection Via ML Based Model Selection Criteria
The problem that remains is how to select the number of clusters k and the subspace dimensions m_l, l = 1, ..., k, simply denoted by {m_l}. This task can be performed via several existing statistical model selection criteria in a two-phase procedure. In the first phase, we assume a range of values of k from k_min to k_max and a range of values of m_l from m_min to m_max to design a domain D which is assumed to contain the optimal k*, {m_l*}. At each specific k, {m_l} in D, we estimate the parameters θ under the ML learning principle. In the second phase, we make the following selection based on all candidate models obtained in the first phase:

{k̂, {m̂_l}} = arg min_{k, {m_l}} {J(θ̂, k, {m_l}), {k, {m_l}} ∈ D},    (2)

where J(θ̂, k, {m_l}) is a given model selection criterion. Typical model selection criteria include AIC, CAIC, and MDL. These three criteria can be summarized into the general form [14] J(θ̂, k, {m_l}) = −2L(θ̂) + B(n)C(k, {m_l}), where L(θ̂) is the log likelihood based on the ML estimate θ̂ under a given k and {m_l}, C(k, {m_l}) is the number of free parameters, C(k, {m_l}) = k − 1 + kd + k + Σ_{l=1}^{k} (d m_l − m_l(m_l − 1)/2), and B(n) is a function of the number of observations: B(n) = 2 for AIC [15, 6], B(n) = ln(n) + 1 for CAIC [7], and B(n) = ln(n) for MDL [8, 9]. However, most studies assume a known k and {m_l} or determine them heuristically, especially in engineering fields, because model selection in a two-phase procedure is too time-consuming: we have to perform the EM algorithm at least Σ_{k=k_min}^{k_max} (m_max − m_min + 1)^k times, along with certain extra manual operations.
4 Automatic Model Selection Via the BYY Harmony Learning
Instead of selecting k and {m_l} from a set of candidate models, the Bayesian Ying-Yang (BYY) harmony learning provides a promising tool for determining the number of clusters as well as the local subspace dimensions in parallel with parameter learning. To automatically determine the local subspace dimensions, an alternative probabilistic model of PCA for the component distributions p_l(x) was used in [5], which can be described by p_l(x|y) = G(x|U_l y + c_l, σ_l^2 I_d), p_l(y) = G(y|0, Λ_l), and p_l(x) = ∫ p_l(x|y) p_l(y) dy = G(x|c_l, U_l Λ_l U_l^T + σ_l^2 I_d), where U_l is a d × m_l matrix subject to U_l^T U_l = I_{m_l} and Λ_l is an m_l × m_l diagonal covariance matrix of y. The estimates of the unknown parameters θ = {α_l, U_l, Λ_l, c_l, σ_l^2}, l = 1, ..., k obtained by the BYY harmony learning in B-architecture without regularization can be described as follows [16, 5]: θ̂ = arg max_θ H(θ), subject to U_l^T U_l = I_{m_l}, α_l ≥ 0, and Σ_{l=1}^{k} α_l = 1, l = 1, ..., k, where

H(θ) = Σ_{t=1}^{n} Σ_{l=1}^{k} P(l|x_t) ln[α_l p_l(x_t|y_l(x_t)) p_l(y_l(x_t))],

P(l|x_t) = 1 if l = l_t and 0 otherwise, with

l_t = arg max_l ln[α_l p_l(x_t|y_l(x_t)) p_l(y_l(x_t))],
y_l(x_t) = arg max_y ln[p_l(x_t|y) p_l(y)].    (3)
Performing the above BYY learning results in maximizing ln α_l, ln p_l(x|y), and ln p_l(y). Maximizing ln α_l and ln p_l(x|y) will push α_l or σ_l^2 towards zero if component l is extra; thus we can delete a component l with α_l ≈ 0 or σ_l^2 ≈ 0. On the other hand, maximizing ln p_l(y) will push the variance Λ_l^(j) towards zero if the corresponding unobservable latent variable y^(j) is extra; consequently, we can discard the subspace dimension j. Therefore, as long as k and m_l are initialized at values large enough, an appropriate k and a set of appropriate m_l will be determined automatically in parallel with parameter learning. Please refer to [10] for a detailed explanation of this nature. This automatic model selection has promising advantages in comparison with the conventional two-phase style model selection: we only need to implement parameter learning once to obtain both the number of clusters and the local subspace dimensions. In implementation, we can obtain θ̂ by an adaptive algorithm combining the algorithm for the general mixture model in [16] with the algorithm for the automatic PCA model in [5]. Since the EM algorithm used in this paper works in batch mode, to facilitate comparison we give a batch algorithm to implement the BYY learning. That is, initially we set k = k_max and m_l = m_max, and then iterate the following steps:

Yang-step: Get y_l(x_t) and P(l|x_t) by Eq. 3 for l = 1, ..., k and t = 1, ..., n.

Ying-step: By using a Lagrange multiplier λ and setting the derivatives of the Lagrangian H(θ) + λ(Σ_{l=1}^{k} α_l − 1) with respect to λ, α_l, c_l, Λ_l, and σ_l^2 to zero, we get

α_l^new = (1/n) Σ_{t=1}^{n} P(l|x_t),
Λ_l^new = (1 / Σ_{t=1}^{n} P(l|x_t)) Σ_{t=1}^{n} P(l|x_t) y_l(x_t) y_l(x_t)^T,
c_l^new = (1 / Σ_{t=1}^{n} P(l|x_t)) Σ_{t=1}^{n} P(l|x_t)(x_t − U_l^old y_l(x_t)),
σ_l^{2,new} = (1 / (d Σ_{t=1}^{n} P(l|x_t))) Σ_{t=1}^{n} P(l|x_t) ||x_t − U_l^old y_l(x_t) − c_l^old||^2.    (4)

Update U_l by gradient ascent on the Stiefel manifold, that is,

g_{U_l} = (1/(n σ_l^{2,old})) Σ_{t=1}^{n} P(l|x_t)(x_t − c_l^old) y_l(x_t)^T − (U_l^old/(n σ_l^{2,old})) Σ_{t=1}^{n} P(l|x_t) y_l(x_t) y_l(x_t)^T,
U_l^new = U_l^old + η_0 (g_{U_l} − U_l^old g_{U_l}^T U_l^old).    (5)
Discard the jth principal subspace of the lth component if the jth element of Λ_l approximately equals zero. Delete the lth component if α_l ≈ 0 or σ_l^2 ≈ 0.
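The Ying-step of Eq. 4 can be sketched as follows. This is our own simplified code, not the authors' implementation: `P` holds the hard 0/1 assignments P(l|x_t) from the Yang-step, and all names are hypothetical.

```python
import numpy as np

def ying_step(x, y, P, U_old, c_old):
    """Eq. 4 updates. x: (n, d); y[l]: (n, m_l); P: (n, k) 0/1 indicators;
    U_old[l]: (d, m_l); c_old[l]: (d,). Returns updated alpha, Lambda, c, sigma^2."""
    n, d = x.shape
    k = P.shape[1]
    alphas, Lams, cs, sig2s = [], [], [], []
    for l in range(k):
        w = P[:, l]
        nl = w.sum()                                  # sum_t P(l|x_t)
        alphas.append(nl / n)
        yl = y[l]
        Lams.append((yl * w[:, None]).T @ yl / nl)    # weighted E[y y^T]
        resid = x - yl @ U_old[l].T                   # x_t - U_l^old y_l(x_t)
        cs.append((w[:, None] * resid).sum(axis=0) / nl)
        err = resid - c_old[l]
        sig2s.append(float((w * (err ** 2).sum(axis=1)).sum() / (d * nl)))
    return alphas, Lams, cs, sig2s
```

After each iteration, components with α_l ≈ 0 or σ_l^2 ≈ 0 and dimensions with Λ_l^(j) ≈ 0 would be pruned, which is what realizes the automatic model selection.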
5 Empirical Comparative Studies
In this section, we investigate the performance of the BYY harmony learning with automatic model selection in comparison with the performance of the model selection criteria AIC, CAIC, and MDL on synthetic data sets. First, we use these methods to select the number of clusters and one same dimension of local subspaces for all clusters, with different sample sizes. Second, we show an example of determining the number of clusters and different local subspace dimensions. In implementation, we use the batch algorithm to implement the BYY harmony learning with initial values k = k_max and m_l = m_max. For the two-phase style approaches, we implement the iterative EM algorithm [4] to obtain the ML estimates of the whole set of candidate models, and then calculate the model selection criteria to make model selection. We also use the k-means algorithm to set the initial partitions. The observations x_t, t = 1, ..., n are randomly generated from a mixture of Gaussian distributions Σ_{l=1}^{k} α_l G(x|c_l, U_l Λ_l U_l^T + σ_l^2 I_d).

5.1 Selection on the Number of Clusters and One Same Local Subspace Dimension
We use the BYY harmony learning and the model selection criteria AIC, CAIC, and MDL to select the number of clusters and the local subspace dimension under the assumption that each cluster has one same local subspace dimension m. We investigate the performance of each method on data sets with different sample sizes n = 100 and n = 1000. The dimension of x is d = 5, the number of clusters is k* = 3, and the subspace dimension of each cluster is m* = 2. The noise variances σ_l^2 are equal to 0.2ψ_l, where ψ_l denotes the smallest value in Λ_l. Experiments are repeated 100 times to facilitate observation of statistical behavior. Tab. 1 illustrates the results of AIC, CAIC, MDL, and BYY on data sets with a sample size of 1000 in 100 experiments and a sample size of 100 in 100 experiments. When the sample size is only 100, we see that MDL and BYY have the highest correct rates. AIC has a risk of overestimating both the number of clusters and the local subspace dimension. CAIC has a risk of underestimating the number of clusters. When the sample size is 1000, all methods have high success rates except AIC, which has a slight risk of overestimating.
5.2 Selection on Both the Number of Clusters and Different Local Subspace Dimensions
In the above studies we assumed that the local subspace dimensions are identical. In this experiment, we try to determine the subspace dimension of each cluster without this assumption. The data points, of size 800, are randomly generated from a 4-component bivariate Gaussian mixture distribution, that is, d = 2, k* = 4 and {m_l*} = {1, 1, 0, 0}. The original model and the data set are shown in Fig. 1(a). We set m_max = 1 and k_max = 7. The initial model with k = k_max and m_l = m_max is shown in Fig. 1(b). Shown in Fig. 1(c) and (d) are the results of the BYY algorithm and the EM algorithm, respectively, with the same initial model shown in Fig. 1(b). We observe that both extra clusters and extra subspace dimensions are automatically discarded by the BYY algorithm, which pushes their corresponding α_l and Λ_l^(j) to zero. The BYY algorithm converges to the correct model. AIC selects k = 6 and {m_l} = {1, 1, 1, 0, 0, 0}. MDL selects k = 5 and {m_l} = {1, 1, 0, 0, 0}. CAIC selects k = 4 and {m_l} = {1, 1, 1, 0}. In other words, even after a very complex and time-consuming procedure, none of these criteria selects the correct model.

Table 1. Rates of selecting the model with a number of clusters k and one same local subspace dimension m by AIC, CAIC, MDL, and BYY in 100 experiments
n = 1000:
            AIC                  CAIC                 MDL                  BYY
m \ k   1  2  3*  4  5  |   1  2  3*  4  5  |   1  2  3*  4  5  |   1  2  3*  4  5
0       0  0   0  0  0  |   0  0   0  0  0  |   0  0   0  0  0  |   0  0   0  0  0
1       0  0   0  0  0  |   0  0   3  0  0  |   0  0   0  1  0  |   0  0   0  1  0
2*      0  0  68 15  3  |   0  3  91  0  0  |   0  2  94  1  0  |   0  0  92  1  0
3       0  0   9  2  1  |   2  0   0  0  0  |   1  0   0  1  0  |   0  1   2  2  1
4       0  0   0  0  2  |   1  0   0  0  0  |   0  0   0  0  0  |   0  0   0  0  0

n = 100:
            AIC                  CAIC                 MDL                  BYY
m \ k   1  2  3*  4  5  |   1  2  3*  4  5  |   1  2  3*  4  5  |   1  2  3*  4  5
0       0  0   0  0  0  |   0  0   1  0  0  |   0  0   0  0  0  |   0  0   1  0  0
1       0  0   0  1  0  |   0  0   2  0  0  |   0  0   2  0  0  |   0  0   1  1  0
2*      0  0  41 20  9  |   0  6  76  0  0  |   0  5  85  2  0  |   0  1  87  4  0
3       0  0   4  2  8  |  13  0   0  0  0  |   4  0   0  1  0  |   0  1   2  0  2
4       0  0   0  0 16  |   2  0   0  0  0  |   1  0   0  0  0  |   0  0   0  0  1

5.3 Discussion on Computational Cost
All the experiments are carried out using MATLAB R12.1 v.6.1 on a P4 1.4GHz 512KB RAM PC. We illustrate the computational results in Tab. 2 for all data sets. The time of manual operation for the two-phase style methods is not
Table 2. CPU time results of different methods in 100 experiments

data set                   method     runs of a single algorithm   total CPU time (s)
Subsec. 5.1, n = 1000,     criteria   25                           33.12
k and m selection          BYY        1                            1.36
Subsec. 5.2,               criteria   254                          730.47
k and {m_l} selection      BYY        1                            1.89
Fig. 1. (a) The original model with d = 2, k* = 4 and {m_l*} = {1, 1, 0, 0}, and the data set of size 800. (b) The initial model with {m_l} = {1, 1, 1, 1} and k = 7. (c) The results of the BYY algorithm, with the number of clusters and the different local subspace dimensions of each cluster automatically determined. (d) The results of the EM algorithm ({m_l} = {1, 1, 1, 1}, k = 7).
included. Automatic model selection via the BYY algorithm is the fastest. We need to implement the BYY algorithm only once, while making model selection by AIC, CAIC, and MDL takes much more computation because we have to implement the EM algorithm at least Σ_{k=k_min}^{k_max} (m_max − m_min + 1)^k times.
6 Conclusion
We have made a comparative study on automatic model selection via the BYY harmony learning, implemented by a Ying-Yang iterative algorithm, and the conventional ML learning based model selection criteria AIC, CAIC, and MDL, with the help of a mixture PCA model. We have observed that the BYY algorithm achieves performance as good as CAIC and MDL, but is more advantageous in terms of consuming much less time.
References 1. Xu, L.: Beyond PCA Learning: From Linear to Nonlinear and From Global Representation to Local Representation. In: Proc. Intl. Conf. on Neural Information Processing (ICONIP94). Volume 2., Seoul, Korea (1994) 943–949 2. Xu, L.: Theories for Unsupervised Learning: PCA and Its Nonlinear Extensions. In: Proc. of IEEE ICNN94. Volume II., Orlando, Florida (1994) 1252–1257 3. Xu, L.: Bayesian-Kullback Coupled Ying-Yang Machines: Unified Leanings and New Results on Vector Quantization. In: Proc. Intl. Conf. on Neural Information Processing (ICONIP95), Beijing, China (1995) 977–988 4. Tipping, M.E., Bishop, C.M.: Mixtures of Probabilistic Principal Component Analysers. Neural Computation 11 (1999) 443–482 5. Xu, L.: Independent Component Analysis and Extensions with Noise and Time: A Bayesian Ying-Yang Learning Perspective. Neural Information Processing Letters and Reviews 1 (2003) 1–52 6. Akaike, H.: A New Look at Statistical Model Identification. IEEE Transactions on Automatic Control 19 (1974) 716–723 7. Bozdogan, H.: Model Selection and Akaike’s Information Criterion (AIC): The General Theory and Its Analytical Extensions. Psychometrika 52 (1987) 345–370 8. Rissanen, J.: Modeling by Shortest Data Description. Automatica 14 (1978) 465– 471 9. Barron, A., Rissanen, J.: The Minimum Description Length Principle in Coding and Modeling. IEEE Trans. Information Theory 44 (1998) 2743–2760 10. Xu, L.: Advances on BYY Harmony Learning: Information Theoretic Perspective, Generalized Projection Geometry, and Independent Factor Auto-determination. IEEE Trans on Neural Networks 15 (2004) 885–902 11. Xu, L.: Bayesian Ying Yang Learning (I): A Unified Perspective for Statistical Modeling. In Zhong, N., Liu, J., eds.: Intelligent Technologies for Information Analysis, Springer (2004) 615–659 12. Xu, L.: Bayesian Ying Yang Learning (II): A New Mechanism for Model Selection and Regularization. 
In Zhong, N., Liu, J., eds.: Intelligent Technologies for Information Analysis, Springer (2004) 661–706 13. Roweis, S.: EM Algorithms for PCA and SPCA. In Jordan, M.I., Kearns, M.J., Solla, S.A., eds.: Advances in Neural Information Processing Systems. Volume 10., The MIT Press (1998) 626–632 14. Sclove, S.L.: Some Aspects of Model-Selection Criteria. In Bozdogan, H., ed.: Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach. Volume 2., Dordrecht, the Netherlands, Kluwer Academic Publishers (1994) 37–67 15. Akaike, H.: Factor Analysis and AIC. Psychometrika 52 (1987) 317 –332 16. Xu, L.: BYY Harmony Learning, Structural RPCL, and Topological SelfOrganizing on Mixture Models. Neural Networks 15 (2002) 1125–1151
Adaptive Support Vector Clustering for Multi-relational Data Mining

Ping Ling1,2 and Chun-Guang Zhou1

1 College of Computer Science, Jilin University, Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, Changchun 130012, China
2 School of Computer Science, Xuzhou Normal University, Xuzhou 221116, China
[email protected]
Abstract. A novel self-Adaptive Support Vector Clustering algorithm (ASVC) is proposed in this paper to cluster data sets with diverse dispersions, and a Kernel function is defined to measure the affinity between multi-relational data. The task of clustering multi-relational data is addressed by integrating the designed Kernel into ASVC. Experimental results indicate that the designed Kernel captures structured features well and that ASVC performs well.
1 Introduction

Multi-Relational Data Mining (MRDM) [1] extends mining tasks from the traditional single-table case to the Multi-Relational (MR) environment. It aims to look for patterns that involve several connected tables of a relational database. A straightforward solution to MRDM is to integrate all involved tables into one comprehensive table, so that traditional mining algorithms can be performed on it. This, however, would cause a loss of structured information, and, what is worse, such a table is unwieldy to operate on and store. Naturally, MRDM requires mining from multiple tables directly. The four algorithmic branches of MRDM are: inductive logic programming, probabilistic logic, relational learning, and distance-based approaches [1]. Kernels are often used in the fourth branch due to their fine quality of implicitly defining a nonlinear map from the original space to a feature space. A suitable Kernel creates a linear version of a problem that would otherwise be nonlinear in the input space, while not introducing extra computation. This has made Kernel-based methods a hot research area [2]. For example, Support Vector Clustering (SVC) exhibits impressive performance in the concerned application areas [3]. This paper presents an Adaptive Support Vector Clustering algorithm (ASVC) and a Kernel function of multi-relational version. ASVC adapts to data sets of various dispersions by tuning the width parameter according to data neighborhood information. It employs spectral analysis to assign cluster membership in a simple way, without suffering the expensive cost involved in traditional SVC. The designed Kernel captures structural features of relational data, and it is integrated into ASVC to accomplish the clustering task of MRDM.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1222-1230, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Related Work

2.1 Traditional SVC

Let x_i ∈ χ, χ ⊆ ℜ^n, where χ is the input space, and let Φ be the nonlinear transformation from χ to some high-dimensional feature space. The optimal objective function is

min_{R, ξ} (R^2 + C Σ_i ξ_i)   s.t. ||Φ(x_i) − a||^2 ≤ R^2 + ξ_i.    (1)
Here the ξ_i ≥ 0 are slack variables, a is the center of the sphere, R is the radius of the sphere, and C is the penalty parameter. Transforming to the Lagrangian and then to the Wolfe dual gives

max W = Σ_i β_i ||Φ(x_i)||^2 − Σ_{i,j} β_i β_j Φ(x_i)·Φ(x_j).    (2)
Following (2), points with ξ_i = 0 and β_i = 0 are mapped to the inside of the feature sphere and lie within clusters. Points with ξ_i = 0 and 0 < β_i < C are mapped to the surface of the sphere and are referred to as non-bounded Support Vectors (nbSVs). Points with ξ_i > 0, γ_i = 0 and β_i = C are located outside the sphere; they are called outliers or bounded Support Vectors (bSVs). With the Kernel trick, this results in

max W = Σ_i β_i K(x_i, x_i) − Σ_{i,j} β_i β_j K(x_i, x_j),   s.t. Σ_i β_i = 1, 0 ≤ β_i ≤ C.    (3)

The Gaussian Kernel k(x_i, x_j) = exp(−||x_i − x_j||^2 / σ^2) is adopted in (3), whose
width parameter σ governs the sharpness or smoothness of the data contours. In (3), C decides the number of bSVs. Cluster assignment is based on the observation that, given a pair of data points belonging to different components (clusters), any path that connects them must exit the sphere in feature space. Therefore, such a path contains a segment of points y such that R(y) > R, where R(y) is the distance between point y and the hypersphere center. So an adjacency matrix A is computed:

A_ij = 1 if R(y) ≤ R for all y ∈ path(x_i, x_j), and A_ij = 0 otherwise.    (4)
Clusters are defined as the connected components of the graph induced by A. Clearly, both the expensive computation of this matrix and the sampling manner of choosing the points y deteriorate the rigor and accuracy of traditional SVC.

2.2 Spectral Clustering
The Spectral Clustering algorithm (SC) obtains the spectral projections of the data by eigen-decomposing the affinity matrix and consequently groups these projections by a simple method [4]. It assigns a cluster label to each point according to its spectral coordinate. There are many variants of SC, such as NJW [4], SM [5], and KVV [6]. In spite of differing
in the normalization manner of the affinity matrix, they share the same framework, as follows: a) Compute the affinity matrix H; b) Normalize H into H' in a desired fashion; c) Conduct the eigen-decomposition of H' and select the top M eigenvectors, forming the spectral embedding matrix S by stacking the M eigenvectors in columns; d) Perform K-means on the rows of S, which are the data's spectral coordinates, and assign the ith point the cluster label of the ith row.
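Steps a) through d) can be sketched compactly. The code below is our own minimal illustration (not from the paper); it uses the NJW normalization H' = D^{-1/2} H D^{-1/2} described later in the text, and the row renormalization of the embedding is an NJW detail assumed here:

```python
import numpy as np

def njw_spectral_embed(H, M):
    """H: (n, n) symmetric affinity matrix. Returns the (n, M) embedding S
    whose rows are the spectral coordinates fed to K-means in step d)."""
    d = H.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    Hn = D_inv_sqrt @ H @ D_inv_sqrt          # H' = D^{-1/2} H D^{-1/2}
    vals, vecs = np.linalg.eigh(Hn)           # eigenvalues in ascending order
    S = vecs[:, -M:]                          # top-M eigenvectors, in columns
    return S / np.linalg.norm(S, axis=1, keepdims=True)  # NJW row normalization
```

For a block-structured affinity matrix, points from the same block receive identical rows in S, which is why a simple K-means on the rows recovers the clusters.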
3 ASVC Algorithm

3.1 ASVC Procedure
The idea of ASVC is to first perform tuning SVC, then conduct a spectral clustering procedure on the nbSVs to group them into clusters. Cluster assignment is done by checking the affinities between each point and the nbSVs. The steps are as follows:

Step 1: Perform tuning SVC to produce nbSVs.
Step 2: Do spectral analysis on the nbSVs to group them into clusters.
Step 3: Each point shares the cluster membership of its most similar nbSV.

The spectral clustering method used in Step 2 is the NJW version, whose normalization is H' = D^{-1/2} H D^{-1/2}, where D is diagonal with D_ii = Σ_{j=1}^{n} H_ij. The number of clusters is discovered according to the maximum gap in the magnitudes of the eigenvalues [7]. In Step 3, because of the self-tuning Kernel defined below, SVC comes in an adaptive version and will produce nbSVs that express both the data boundaries and the important locations of the distribution structure. This, consequently, provides a good reason for labeling a point with the same membership as its nearest nbSV.

3.2 Tuning Approach
Essentially, the width parameter σ of the Gaussian Kernel plays a critical role by scaling affinity at a certain depth. Traditionally, σ is set to a fixed global value, which works well on data sets whose density lies in a narrow range. But for data sets of multiscale dispersions, a single global σ cannot present precise, data-specific affinities. Thus we make σ self-adaptive by integrating local density information; that is, σ is learned from each point's local context. For a point x we set the density factor σ_x = ||x − x_r|| to provide local distribution information about its neighborhood. Here x_r is the rth nearest point to x; in other words, x_r is the rth point in the ascending list of distances from x to the other points. Intuitively, given r, if ||x − x_r|| < ||y − y_r||, the dispersion around x is denser than that around y. So when we measure the affinity between x and y, their density factors serve as the width parameter via σ^2 = σ_x · σ_y = ||x − x_r|| · ||y − y_r||. This adopts local context information from both sides and leads to the tuning Kernel:
k(x, y) = exp(−||x − y||^2 / (||x − x_r|| · ||y − y_r||)).    (5)
Here r can be regarded as the size of the neighborhood that is probed when we draw out the local density information. The tuning approach actually conducts an uneven adjustment of the data distribution: it strengthens the similarity between highly related points while highlighting the difference (namely, degrading the similarity) between points that are far from each other. It is expected that the corresponding Gram matrix will hold more distinguishing ability for cluster detection. Here the original width parameter σ is replaced by r. The former is difficult to control due to its abstract meaning, while the latter has a clear geometric meaning, namely the neighborhood size. Again, we give a heuristic to explore its suitable range: try values around r', where r' is set by the following steps. 1) Sort the rows of the distance matrix d_ij in ascending order, giving D. 2) Find gap(i) = arg max_j {D(i, j) − D(i, j − 1)}, the position of the largest jump in the ith sorted row. 3) Take r' as the average of {gap(i)}. Thus gap(i) gives a natural size estimate for the inherent cluster containing point x_i, and the average of all these estimates can be viewed as the size of the neighborhood to be investigated. Local context information learned from this neighborhood is expected to be of good confidence.
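The tuning Kernel of Eq. 5 is straightforward to vectorize; the following is our own sketch (not the authors' code), where `sigma[i]` is the density factor ||x_i − x_r||:

```python
import numpy as np

def self_tuning_kernel(X, r):
    """X: (n, d); r: neighborhood size (rank of the probed neighbor).
    Returns the (n, n) Gram matrix of the tuning Kernel k(x, y)."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))     # pairwise Euclidean distances
    # distance to the r-th nearest *other* point: column r of each sorted row,
    # since column 0 is the point itself at distance 0
    sigma = np.sort(D, axis=1)[:, r]
    return np.exp(-D ** 2 / (sigma[:, None] * sigma[None, :]))
```

Note the diagonal is always 1, and a pair in a dense region (small σ_x σ_y) is penalized more steeply for the same Euclidean distance than a pair in a sparse region, which is exactly the uneven adjustment described above.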
[Figures: three scatter plots (X vs. Y) of the same toy dataset, showing inner-cluster points and nbSVs.]
Fig. 1. SVC: 1/σ² = 0.526
Fig. 2. SVC: 1/σ² = 0.926
Fig. 3. Tuning SVC: r = 30
Figures 1-3 compare traditional SVC with tuning SVC. The test dataset consists of three Gaussian distributions with means and covariances [(-1, 0), 8], [(3.5, 3.5), 3], and [(11, 10), 6]. Following [3], 1/σ² starts from 1/max‖x_i − x_j‖² and increases gradually, so we run SVC over the series of scales {0.126, 0.226, …, 0.926}. The results under the two settings 1/σ² = 0.526 and 1/σ² = 0.926 are shown in Fig. 1 and Fig. 2, with Csvc set to 0.8. At the first width, too many nbSVs are generated; when the width reaches the second value, the number of nbSVs decreases, but their quality is still too poor to reveal cluster boundaries. In Fig. 3, tuning SVC does a good job of delineating the borderlines under various dispersions. Note that a few extra nbSVs are generated besides those on the boundaries; they are located where sharp density changes occur. All nbSVs thus sketch the inherent cluster structure and can be regarded as data representatives, which justifies assigning each point to the cluster indicated by its most similar nbSV. (Experiments run on a PC with a 2G P4 CPU, 256M memory, WinXP, and MatLab 7.0.)
P. Ling and C.-G. Zhou
4 Kernel of MR Version

4.1 Basic Concepts

Key: the attribute that exclusively identifies a unique record.
Foreign Key (FK): an attribute that references some Key of another table.
Referenced Key (RK): an attribute that corresponds to some FK of another table is considered a referenced key in the original table. A RK can correspond to FKs of multiple tables. Naturally, an FK is not necessarily a Key, but a RK is bound to be a Key. FKs and RKs are collectively named association keys (AK).
Main Table: the table that is mainly investigated, where the class of each instance is stated. Denote a record of the Main table by R^i, and denote its corresponding record in another table, reached via an AK a, by R_a^{i+}; R_a^{i+} is also named an expanding instance.
Main Attribute: the attribute on which the data mining task is focused.
Common Attribute: in the Main table, any attribute that is neither a Key nor an AK is a common attribute. The restriction of object R^i to the common attributes is denoted R_c^i.

4.2 Kernel Definition
The precondition for applying ASVC to MRDM is a well-defined kernel function that incorporates MR data. Some papers have proposed kernel definitions for structured data based on convolution or summing ideas [8, 9]; our kernel combines both. In the MR environment, relational information is carried by the association keys. Each key stands for one expanding direction, and in this direction it produces a local affinity, obtained by an elementary kernel. The global affinity is developed by collecting all local affinities together:

K(R^i, R^j) = k(R_c^i, R_c^j) · ( k_FK(R^i, R^j) + k_RK(R^i, R^j) ) / ( |FK| + |RK| )    (6)

where

k_FK(R^i, R^j) = Σ_{u=1}^{|FK|} k(R_u^{i+}, R_u^{j+})    (7)

k_RK(R^i, R^j) = Σ_{v=1}^{|RK|} k(R_v^{i+}, R_v^{j+})    (8)

In (8), the affinity generated by a single RK v is

k(R_v^{i+}, R_v^{j+}) = Π_{m=1}^{|H_v|} k_set(R_{v_m}^{i+}, R_{v_m}^{j+})    (9)

and in (9) the set kernel is

k_set(S_1, S_2) = Σ_{x ∈ S_1, y ∈ S_2} k(x, y) / |S_1 × S_2|    (10)
The compound kernel in (6) is the product of the affinity coming from the Main table's common attributes and the affinity from the expanding information taken as a whole. There are two cases of local affinity computation. For the map from FK to RK, an object corresponds to a single expanding instance, so to measure the local affinity between two objects in this expanding direction their respective expanding instances can be fed into the elementary kernel directly; the affinities brought by all FKs are then summed, as in (7). For the map from RK to FK, although we likewise add up the local affinities from all RKs as in (8), computing the single local affinity generated by one RK is a different case: one expanding direction may now involve multiple expanding tables, because a RK may correspond to FKs of multiple tables. Denoting by H_v the set of tables that a RK v corresponds to, the affinity produced by v is the product of the local affinities generated by each table in H_v, as shown in (9), where k_set(R_{v_m}^{i+}, R_{v_m}^{j+}) computes the local affinity from table m and the index m runs over H_v. Moreover, under the map from a RK to a FK, one object may correspond to multiple instances, so the local affinity of two objects in an expanding table must be extracted from their sets of expanding instances; formula (10) defines the affinity between two such sets. In essence, the summing operation is a soft approach that preserves the original differences among local affinities, taking a moderate attitude to affinity measurement. The convolution idea, by contrast, amplifies differences in local aspects and sets a high criterion for the whole affinity. In (6) we take a tolerant attitude to local affinities along the expanding directions but a strict one when combining the Main-table affinity with the expanding descriptions, which gives more weight to the properties of the Main table. Some notes ensure the global kernel works in both the single-table and multi-table cases: if there are no common attributes, let k(R_c^i, R_c^j) = 1; if there is no FK or no RK, let k_FK(R^i, R^j) = 0 or k_RK(R^i, R^j) = 0, respectively. The tuning Gaussian kernel serves as the elementary kernel. By the kernel construction theorem [9], formula (6) is positive semi-definite. The computation of the global kernel reduces directly to local affinity computations on the expanding tables.
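As a concrete illustration, the compound kernel (6)-(10) can be sketched on a toy in-memory representation of relational objects. The record layout (dicts with "common", "fk", and "rk" fields) and the plain Gaussian elementary kernel are assumptions of this sketch, not part of the paper (which uses the tuning kernel as the elementary kernel).

```python
import numpy as np

def gauss(x, y, sigma=1.0):
    # elementary kernel on attribute vectors (stand-in for the tuning kernel)
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(y))**2) / sigma**2))

def k_set(S1, S2, k):
    # (10): average elementary affinity over all cross pairs of two instance sets
    return sum(k(x, y) for x in S1 for y in S2) / (len(S1) * len(S2))

def compound_kernel(Ri, Rj, k=gauss):
    # (6): product of common-attribute affinity and averaged key affinities.
    # Ri["fk"]: one expanding instance per FK; Ri["rk"]: per RK, a list of
    # instance sets, one set per table in H_v.
    k_common = k(Ri["common"], Rj["common"]) if Ri["common"] is not None else 1.0
    k_fk = sum(k(a, b) for a, b in zip(Ri["fk"], Rj["fk"]))              # (7)
    k_rk = 0.0
    for Hi, Hj in zip(Ri["rk"], Rj["rk"]):                               # (8)
        prod = 1.0
        for Si, Sj in zip(Hi, Hj):                                       # (9)
            prod *= k_set(Si, Sj, k)
        k_rk += prod
    n_keys = len(Ri["fk"]) + len(Ri["rk"])
    if n_keys == 0:                      # single-table fallback
        return k_common
    return k_common * (k_fk + k_rk) / n_keys
```

Comparing an object with itself gives k_common = 1 and k_fk equal to the number of FKs, so the value stays bounded by 1 only up to the set-kernel terms, which average below 1 whenever an instance set contains distinct instances.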
5 Experimental Results

First, ASVC is applied to the 303 points shown in Fig. 4. The variation in density is quite sharp, and the clusters are not reflected by Euclidean distance directly. We present clustering results produced by K-means, NJW, and ASVC. In Fig. 5, K-means fails because of its rigid partition based only on distance. NJW also uses a Gaussian kernel affinity, so a series of width parameters starting from 1/max‖x_i − x_j‖² is tried to obtain its best result; Fig. 6 and Fig. 7 show its results under two widths, neither of which is satisfactory owing to the fixed scale. The result of traditional SVC, shown in Fig. 8, is far from the correct structure. ASVC clusters the dataset perfectly, producing 61 nbSVs under the setting r = 36. The running times of ASVC and SVC are 87 seconds and 178 seconds, respectively; the improvement in time effectiveness by ASVC is remarkable. Next, ASVC is applied to real datasets: IRIS [10], WINE [10], and Breast Cancer (BC) [10]. Table 1 compares the errors of ASVC with several other clustering algorithms: K-means; traditional SVC; the Girolami method [11], a more sophisticated kernel-based expectation-maximization method; and NJW. For each algorithm, the minimum number of incorrectly clustered points is recorded. We also report another ASVC variant equipped with a width-searching approach, which finds the best clustering result by searching 1/σ² over a parameter space; this method is named search-ASVC. From Table 1, ASVC outperforms the other algorithms on IRIS and BC. On the WINE data, the fact that 178 points span 13 dimensions leads to widely spread information and weak connections in the neighboring context.
[Figures: six scatter plots (X vs. Y) of the multi-density dataset, showing Cluster1-Cluster3 assignments and, for the SVC-based methods, inner-cluster points and nbSVs.]
Fig. 4. Multi-Density Dataset
Fig. 5. K-means Result
Fig. 6. SC: 1/σ² = 19.9
Fig. 7. SC: 1/σ² = 2.79
Fig. 8. SVC: 1/σ² = 0.5612
Fig. 9. nbSVs of Tuning SVC
Table 1. Error comparison on real datasets (D is the dimensionality)

Dataset        K-means  Girolami  SVC  NJW  ASVC        search-ASVC
IRIS (D = 4)   16       8         14   14   5 (r=49)    4 (1/σ²=0.53)
WINE (D = 13)  5        3         12   3    49 (r=29)   5 (1/σ²=0.258)
BC (D = 9)     26       20        32   22   19 (r=68)   29 (1/σ²=0.267)
Now ASVC is applied to a relational problem: MUSK [12]; we use the MUSK1 version. We preprocess the data with some normalization and build a two-level relational frame for the dataset, as shown in Fig. 10. To test the performance of the designed kernel, we plug it into a classical SVM procedure and compare the result with other classification learners: TILDE [13], SVM-MM, and SVM-MI [14]. The accuracy of each algorithm under ten-fold cross-validation is shown in Table 2; our SVM procedure achieves the best result, which shows that the designed kernel can grasp relational features and express the relational schema effectively. We then run ASVC and search-ASVC on MUSK1, with the accuracy comparison in Table 3: both methods find satisfying results and are competitive.
[Figure: two-level relational schema of MUSK, linking RK Molecule-Name to FK M-Name, with no common attributes and Feature1 … Feature166.]
Fig. 10. Relationship of MUSK

Table 2. Classification comparison on MUSK1

Dataset  TILDE  SVM-MM  SVM-MI  Our SVM
MUSK1    0.87   0.916   0.864   0.9201 (Csvm = 2.3)

Table 3. Clustering accuracy on MUSK1 (Csvc = 0.8)

ASVC             search-ASVC
0.9112 (r = 39)  0.9312 (1/σ² = 3.2517)

Table 4. Clustering on Student Database (Nclu is the number of clusters, NSV is the numbers of nbSVs, Ngroup is cluster size, and Csvc = 0.8)

             Settings       Nclu  NSV        Ngroup
ASVC         r = 26         4     5,12,10,9  21,75,109,18
search-ASVC  1/σ² = 0.362   4     7,13,12,7  25,81,101,16
[Figures: the relational schema of the Student Database, and a bar chart of weighted statistics for group1-group4 over score bins 40~50, 50~60, 60~70, 70~80, 80~90, 90~100.]
Fig. 11. Relationship of Student Database
Fig. 12. Weighted Statistics
Finally, ASVC is run on the document data of 223 students from one grade of the Software College, Jilin University. This database contains six tables (Student, Rank, Classtype, Agegroup, Work, and Activities), where Student is the Main table; its relationships are shown in Fig. 11. Table 4 gives the clustering results of ASVC and search-ASVC. To examine the effect of ASVC, Fig. 12 shows the statistics of the common attributes of the Main table after a weighted normalization. Based on these statistics, all students can be divided into four groups, Group1 to Group4, which coincides with ASVC forming four clusters, Cluster1 to Cluster4. The contents of corresponding clusters and groups differ, however. For example, a student may fall into Group1 according to the statistics while being placed in Cluster3 by ASVC: ASVC integrates information from multiple aspects to analyze features, and thanks to the practical abilities and other qualities recorded in the expanding tables, which compensate for his or her weaker scores, such a student is promoted to Cluster3. This clustering result agrees better with the true assessments of the students.
6 Conclusion and Prospect

A novel ASVC algorithm adapted to datasets of multiple densities is presented in this paper. Another novelty is a kernel definition incorporating MRDM. Mining in the MR environment is the most typical aspect of MRDM; indeed, the definition of MR data has been enlarged to the whole realm of structured data. Kernels and the kernel trick are extremely promising research tools. Future work will explore kernel usage further in structured schemata and develop more potent algorithms.
Acknowledgement This work is supported by the National Natural Science Foundation of China under Grant No. 60433020; 985 Project: Technological Creation Support of Computation and Software Science; and the Key Laboratory for Symbol Computation and Knowledge Engineering of the National Education Ministry of China.
References

1. Dzeroski, S.: Multi-Relational Data Mining: An Introduction. ACM SIGKDD Explorations Newsletter 5(1) (2003)
2. Woznica, A. (ed.): Kernel-Based Distances for Relational Learning. 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2004)
3. Ben-Hur, A., Horn, D., Siegelmann, H.T.: Support Vector Clustering. Journal of Machine Learning Research 2 (2001) 125-137
4. Ng, A., Jordan, M., Weiss, Y.: On Spectral Clustering: Analysis and an Algorithm. Advances in Neural Information Processing Systems. MIT Press, Cambridge (2002)
5. Shi, J., Malik, J.: Normalized Cuts and Image Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (1997) 731-737
6. Kannan, R., Vempala, S., Vetta, A.: On Clusterings: Good, Bad and Spectral. Proceedings of the 41st Annual Symposium on the Foundations of Computer Science (2000) 367-377
7. Polito, M., Perona, P.: Grouping and Dimensionality Reduction by Locally Linear Embedding. In: Dietterich, T.G., Becker, S., Ghahramani, Z. (eds.): Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA (2002)
8. Haussler, D.: Convolution Kernels on Discrete Structures. Technical Report, Department of Computer Science, University of California, Santa Cruz (1999)
9. Renders, J.: Kernel Methods in Natural Language Processing. Learning Methods for Text Understanding and Mining, Conference Tutorial, France (2004)
10. http://www.ics.uci.edu/~mlearn/MLSummary.html
11. Girolami, M.: Mercer Kernel-Based Clustering in Feature Space. IEEE Trans. on Neural Networks 13(3) (2002) 780-784
12. Dietterich, T.G., Lathrop, R.H., Lozano-Perez, T.: Solving the Multiple Instance Problem with Axis-Parallel Rectangles. Artificial Intelligence 89(1-2) (1997) 31-71
13. Bloedorn, E., Michalski, R.: Data-Driven Constructive Induction. IEEE Intelligent Systems 13(2) (1998) 30-37
14. Gaertner, T., Flach, P., Kowalczyk, A.: Multi-instance Kernels. Proceedings of the 19th International Conference on Machine Learning (2002) 179-186
Robust Data Clustering in Mercer Kernel-Induced Feature Space

Xulei Yang, Qing Song, and Meng-Joo Er

EEE School, Nanyang Technological University, Singapore
{yangxulei, eqsong, emjer}@pmail.ntu.edu.sg
Abstract. In this paper, we develop a new clustering method, the robust kernel-based deterministic annealing (RKDA) algorithm, for data clustering in Mercer kernel-induced feature space. A nonlinear version of the standard deterministic annealing (DA) algorithm is first constructed by means of a Gaussian kernel, which can reveal structure in the data that may go unnoticed if DA is performed in the original input space. After that, a robust pruning method, the maximization of the mutual information against the constrained input data points, is performed to phase out noise and outliers. The merits of the proposed method for data clustering are supported by experimental results.
1 Introduction
Interest in clustering has increased recently because of new areas of application, such as data mining, image and speech processing, and bio-informatics. A very rich literature on clustering has developed over the past three decades. In particular, the deterministic annealing (DA) algorithm, in which the annealing process with its phase transitions leads to a natural hierarchical clustering, is independent of the choice of the initial data configuration and has the ability to avoid poor local optima [4]. As reviewed in [5], the DA approach to clustering has demonstrated substantial performance improvement over standard unsupervised learning methods. However, the standard DA algorithm for data clustering is not robust against noisy points and can only handle linearly separable data structure. This paper aims to solve these problems by developing a robust kernel-based deterministic annealing (RKDA) algorithm in terms of kernel methods [2] [6] and mutual information [1] [7] techniques. The proposed method is basically a two-step approach. By mapping the observed data from the input space into a high-dimensional kernel space by means of a Gaussian kernel in the clustering step, the probability of linear separability of the data is increased, so that RKDA can reveal structure in the data which may go unnoticed if DA is performed in the standard Euclidean space. In the robust pruning step, the RKDA algorithm maximizes the mutual information against the constrained input data points, which provides a robust estimation of the probability soft margin to phase out noise and outliers. The rest of this paper is organized as follows. Section 2 contains the derivation of the proposed RKDA clustering algorithm. Experimental results are reported in Section 3. Finally, conclusions are given in Section 4.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1231-1237, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 The Proposed Clustering Algorithm
In this section, we first recast the standard deterministic annealing (DA) approach into the kernel space induced by a Gaussian kernel function, which leads to a kernel-based DA (KDA) algorithm; we then study outlier (noise) rejection techniques based on the maximization of mutual information (MI), which leads to the proposed robust KDA (RKDA) algorithm.
2.1 Data Clustering in Kernel Space: KDA
Let the input data set be X = {x_1, x_2, …, x_l} ⊂ R^n, where l is the number of input data points and n is the dimensionality of the input space. Suppose that the input data are mapped into a Hilbert space F through a nonlinear mapping function φ, and let φ(x_j) be the transformed point of x_j. Assume all the points are partitioned into c clusters in the kernel space; then the set of cluster centers, denoted φ(w_1), φ(w_2), …, φ(w_c) ⊂ F, can be expressed by

φ(w_k) = Σ_{j=1}^{l} γ_{kj} φ(x_j),   k = 1, 2, …, c    (1)
where γ_{kj} are the parameters to be decided by the KDA algorithm as shown below. Using (1), the squared Euclidean distance between a cluster center φ(w_k) and a mapped point φ(x) (called the kernel-induced distance in this paper) can be stated as follows:

D_k(x) = ‖φ(x) − Σ_{i=1}^{l} γ_{ki} φ(x_i)‖²
       = φ(x)·φ(x) − 2 Σ_{i=1}^{l} γ_{ki} φ(x)·φ(x_i) + Σ_{i,j=1}^{l} γ_{ki} γ_{kj} φ(x_i)·φ(x_j)    (2)
Using the kernel trick k(x, y) = φ(x)·φ(y), the kernel-induced distance (2) can be rewritten as

D_k(x) = k(x, x) − 2 Σ_{i=1}^{l} γ_{ki} k(x, x_i) + Σ_{i,j=1}^{l} γ_{ki} γ_{kj} k(x_i, x_j)    (3)

where k(x, y) is a Mercer kernel function [3]. One of the most popularly used kernel functions is the Gaussian kernel

k(x, y) = exp( −‖x − y‖² / (2σ²) )    (4)
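As a small sketch (helper names are mine), the kernel-induced distance (3) with the Gaussian kernel (4) can be computed as:

```python
import numpy as np

def gaussian_gram(X, Y, sigma):
    # eq. (4) evaluated pairwise between the rows of X and Y
    d2 = np.sum((X[:, None, :] - Y[None, :, :])**2, axis=2)
    return np.exp(-d2 / (2.0 * sigma**2))

def kernel_induced_distance(x, X, gamma_k, sigma):
    # eq. (3): squared feature-space distance between phi(x) and the
    # center phi(w_k) = sum_i gamma_k[i] * phi(x_i)
    x = np.atleast_2d(np.asarray(x, dtype=float))
    k_xx = 1.0                                  # Gaussian kernel: k(x, x) = 1
    k_xX = gaussian_gram(x, X, sigma)[0]
    K = gaussian_gram(X, X, sigma)
    return k_xx - 2.0 * gamma_k @ k_xX + gamma_k @ K @ gamma_k
```

When γ_k places all its weight on a single training point, the center coincides with that point's image, so the distance from that point to the center is zero.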
where σ is the Gaussian kernel parameter. The Gaussian kernel function is used in the remainder of this paper, and the selection of the parameter σ for given data is discussed later.
Similar to the notation used in [5], here we let p(φ(w_k)|φ(x_j)) be the association probability relating mapped point φ(x_j) to cluster center φ(w_k). Using the kernel-induced distance measure (3), the objective function of KDA can be formulated as

F = J_φ − T H_φ = Σ_{j=1}^{l} Σ_{k=1}^{c} p(φ(x_j)) p(φ(w_k)|φ(x_j)) D_k(x_j)
    − T ( − Σ_{j=1}^{l} Σ_{k=1}^{c} p(φ(x_j)) p(φ(w_k)|φ(x_j)) log p(φ(w_k)|φ(x_j)) )    (5)
where J_φ is the distortion function in the kernel space, H_φ is the conditional Shannon entropy in the kernel space, and T is the Lagrange multiplier, analogous to the temperature in statistical physics. It turns out [5] (by the maximum entropy principle) that the resultant distribution is the tilted distribution

p(φ(w_k)|φ(x_j)) = p(φ(w_k)) e^{−D_k(x_j)/T} / Σ_{i=1}^{c} p(φ(w_i)) e^{−D_i(x_j)/T}    (6)

where p(φ(w_k)) = Σ_{j=1}^{l} p(φ(x_j)) p(φ(w_k)|φ(x_j)) is the mass probability of the kth cluster in the kernel space. To estimate the free parameter φ(w_k), the effective cost to be minimized turns out to be the free energy (a well-known concept in statistical mechanics [4]):
F*_φ = min_{p(φ(w_k)|φ(x_j))} (J_φ − T H_φ) = −T Σ_{j=1}^{l} p(φ(x_j)) log Σ_{k=1}^{c} p(φ(w_k)) e^{−D_k(x_j)/T}    (7)
Minimizing F*_φ with respect to φ(w_k) yields (the detailed derivation is omitted)

φ(w_k) = Σ_{j=1}^{l} p(φ(x_j)) p(φ(w_k)|φ(x_j)) φ(x_j) / Σ_{j=1}^{l} p(φ(x_j)) p(φ(w_k)|φ(x_j))    (8)
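To illustrate how (6) and (8) interact, here is a minimal fixed-temperature sketch in NumPy. The centers are represented by their weight vectors γ from (1); the function name, the random initialization, and the small stabilising floors are my own additions, not the authors' implementation.

```python
import numpy as np

def kda(K, c, T, n_iter=60, seed=0):
    """Fixed-temperature KDA sketch: alternate the tilted assignments (6)
    and the center update (8), with centers kept as weight vectors gamma
    so that phi(w_k) = sum_j gamma[k, j] * phi(x_j). K is the Gram matrix."""
    rng = np.random.default_rng(seed)
    l = K.shape[0]
    px = np.full(l, 1.0 / l)                       # p(phi(x_j)), uniform
    gamma = rng.dirichlet(np.ones(l), size=c)      # random initial centers
    pw = np.full(c, 1.0 / c)                       # p(phi(w_k))
    for _ in range(n_iter):
        # kernel-induced distances D_k(x_j), eq. (3), for all k and j at once
        GK = gamma @ K
        D = np.diag(K)[None, :] - 2.0 * GK + np.sum(GK * gamma, axis=1)[:, None]
        # tilted association probabilities, eq. (6), in log space for stability
        logits = np.log(np.maximum(pw, 1e-300))[:, None] - D / T
        logits -= logits.max(axis=0)
        P = np.exp(logits)
        P /= P.sum(axis=0)
        pw = P @ px                                # cluster mass probabilities
        # center update, eq. (8), rewritten in terms of gamma
        gamma = (P * px[None, :]) / np.maximum(P @ px, 1e-300)[:, None]
    return P, gamma
```

A full DA run would additionally anneal T downward through its phase transitions, as described in [5]; this sketch runs at a single temperature.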
Alternately updating (6) and (8) with phase transitions (refer to [5]) leads to the KDA algorithm. We now turn to the selection of the kernel parameter(s), an important issue for implementing data clustering in the kernel space because, together with the given kernel function, it affects the clustering results. Unfortunately, it remains an open problem for kernel methods in the machine learning literature. We here put forward a simple but efficient method for Gaussian kernel parameter selection as follows.
The proposed selection method is based on the statement in [8]. The authors concluded that for data with well-separated clusters (large data covariance), the kernel-induced distance tends to its maximum, so a large value of σ is suitable for separating the clusters, while for data with indistinct clusters (small data covariance), the kernel-induced distance tends to its minimum, so a small value of σ is effective. As a result, the authors in [8] select the optimal σ_opt as the root of the sample data covariance. However, this selection is sensitive to the dimensionality of the sample data. To overcome this problem, we use the root of the scaled sample data covariance:

σ_opt = ( Σ_{j=1}^{l} ‖x_j − x̄‖² / (2nl) )^{1/2}    (9)

where n is the dimensionality of the input data and x̄ = (1/l) Σ_{j=1}^{l} x_j is the mean of the input data
points. The data covariance is divided by n so that the dot products in the Gaussian kernel function are scaled with the dimensionality of the input space, making σ independent of it. From the experimental results we find that this method provides a reasonable estimate of the Gaussian kernel parameter; in other words, the KDA (or, later, RKDA) algorithm with the estimated value σ_opt from (9) can generate a reasonable partition for a given data set.
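Equation (9) is straightforward to implement; a minimal sketch (function name mine):

```python
import numpy as np

def sigma_opt(X):
    # eq. (9): square root of the sample covariance, scaled by the
    # dimensionality n so that sigma is independent of it
    l, n = X.shape
    xbar = X.mean(axis=0)
    return float(np.sqrt(np.sum((X - xbar)**2) / (2.0 * n * l)))
```

Duplicating every coordinate of a dataset doubles both the summed squared deviations and n, so the estimate is unchanged, which is exactly the dimensionality-independence the scaling is meant to provide.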
2.2 Robust Pruning in Kernel Space: RKDA
Assume the given data have been partitioned by the KDA algorithm in the clustering step as discussed in the last subsection, giving the optimal conditional probabilities p̄(φ(w_k)|φ(x_j)) and output cluster probabilities p̄(φ(w_k)). It has been shown in [1] [7] that the maximization of mutual information against the prior probability can be used to phase out the outliers in a noisy channel, which is implemented by maximizing the following objective function (all discussion below is in the kernel space):

F_MI = max_{p(Φ(X))} ( I(Φ(X), Φ(W)) − λ Σ_{j=1}^{l} p(φ(x_j)) e_j )    (10)
where λ ∈ [0, +∞) is a separate parameter to control the degree of robustness against outliers, and I(Φ(X), Φ(W)) is the mutual information given by

I(Φ(X), Φ(W)) = Σ_{j=1}^{l} Σ_{k=1}^{c} p(φ(x_j)) p̄(φ(w_k)|φ(x_j)) log [ p(φ(x_j)|φ(w_k)) / p(φ(x_j)) ]    (11)
with p(φ(x_j)|φ(w_k)) obtained by Bayes' formula

p(φ(x_j)|φ(w_k)) = p(φ(x_j)) p̄(φ(w_k)|φ(x_j)) / p̄(φ(w_k))    (12)
and e_j is the expense of using the jth input point, which is given by

e_j = Σ_{k=1}^{c} p̄(φ(w_k)|φ(x_j)) D_k(x_j)    (13)
The resultant prior that maximizes F_MI (maximizing F_MI with respect to p(φ(x_j))) is given by

p(φ(x_j)) = p(φ(x_j)) c_j / Σ_{j=1}^{l} p(φ(x_j)) c_j    (14)

where c_j is the capacity of the jth input point, defined by

c_j = exp( Σ_{k=1}^{c} p̄(φ(w_k)|φ(x_j)) log [ p̄(φ(w_k)|φ(x_j)) / p̄(φ(w_k)) ] − λ e_j )    (15)
As discussed in [1] [7], the average mutual information (11) is non-negative; however, an individual item in the sum defining the capacity can be negative. If the jth point φ(x_j) is taken into account and

p̄(φ(w_k)|φ(x_j)) < Σ_{i=1}^{l} p(φ(x_i)) p̄(φ(w_k)|φ(x_i)),

then the probability of the kth prototype (cluster center) is decreased by the observed point, which gives negative information about φ(x_j). Such an input point may be considered unreliable (an outlier), and its negative effect must be offset by other input points. Therefore, the maximization of the mutual information (10) provides a good robust estimation of the noisy points (outliers or noise), where the degree of robustness is controlled by the value of λ: the higher λ is, the more robustness RKDA achieves, as demonstrated by Example 1 in the section on experimental results.
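A sketch of one robust-pruning update implementing (13)-(15) followed by (14); the array layout and function name are assumptions of this sketch, not the authors' code.

```python
import numpy as np

def prune_priors(P_bar, px, D, lam):
    """One robust-pruning update, eqs. (13)-(15) then (14).
    P_bar: (c, l) fixed association probabilities from the clustering step;
    px: (l,) current prior over input points; D: (c, l) kernel-induced
    distances D_k(x_j); lam: robustness parameter lambda."""
    pw = P_bar @ px                                    # cluster mass p(w_k)
    e = np.sum(P_bar * D, axis=0)                      # (13) expense per point
    with np.errstate(divide="ignore"):
        log_ratio = np.log(P_bar) - np.log(pw)[:, None]
    log_ratio = np.where(P_bar > 0.0, log_ratio, 0.0)  # 0 * log 0 := 0
    cap = np.exp(np.sum(P_bar * log_ratio, axis=0) - lam * e)   # (15) capacities
    new_px = px * cap
    return new_px / new_px.sum()                       # (14) renormalised prior
```

A point with a flat association profile (low capacity) and a large expense sees its prior driven toward zero, which is exactly how the outliers are phased out.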
3 Experimental Results
This section presents a few experimental examples to show the superiority of the proposed RKDA clustering algorithm over the standard DA clustering algorithm. This is in fact a self comparison, since RKDA is a robust version of KDA obtained through the robust pruning technique, while KDA is a generalized version of DA obtained through kernel methods. Several artificial data sets are considered in our experiments. The first example shows the robustness of RKDA against noisy points, and the second shows the effectiveness of RKDA for clustering a diversity of data structures. The determination of the optimal cluster number is not the focus of this paper and is assumed to be known a priori in all examples.
3.1 Example 1: Robustness Against Noisy Points
This example demonstrates the robustness of RKDA against noisy points (outliers), where the degree of robustness is controlled by the separate parameter λ.
[Figure: four scatter plots of two concentric rings with outliers, panels (a) DA, c = 2; (b) RKDA, λ = 0; (c) RKDA, λ = 0.2; (d) RKDA, λ = 0.5.]
Fig. 1. Simple demonstration of RKDA against noisy points, see Example 1
[Figure: four scatter plots of the two data sets in Example 2, panels (a) DA, c = 2; (b) RKDA, c = 2; (c) DA, c = 3; (d) RKDA, c = 3.]
Fig. 2. Clustering results of RKDA and DA on the two data sets in Example 2
The higher λ is, the more data points become outliers. The given data, shown in Fig. 1a, consist of two concentric rings (containing 50 and 30 data points, respectively) contaminated with some outliers. The Gaussian kernel parameter calculated by (9) for this data is σ_opt = 0.8383, which is used for RKDA in this example. The clustering result of DA with the prior c = 2 is shown in Fig. 1a; DA obviously fails to partition this data set. The proposed RKDA performs much better, and its results for different values of λ are shown in Fig. 1b-d: when λ = 0, only the outliers between the two rings are eliminated through the MI maximization, as shown in Fig. 1b; as λ is gradually increased, the outliers located outside the outer ring are eliminated as well, as shown in Fig. 1c (λ = 0.2) and Fig. 1d (λ = 0.5).
3.2 Example 2: A Variety of Data Structure
The purpose of this example is to show the effectiveness of RKDA for clustering different data sets, which may differ in shape, distribution, and size. Example 1 has already demonstrated the superiority of RKDA over DA for clustering linearly nonseparable data; two more 2D linearly nonseparable data sets are investigated here. The first data set, shown in Fig. 2a-b, consists of two well-separated groups of data, one drawn from an isotropic Gaussian distribution and the other from a uniform spherical shell distribution. The second data set, shown in Fig. 2c-d, contains three well-separated groups of data, two of which are compact spherical-shaped structures and one of which is a loose ellipsoidal-shaped structure. The Gaussian kernel parameter σ_opt calculated from (9) is 1.7244 for data set one and 0.5589 for data set two. Note that we set λ to a constant 0.0 for both data sets, since outlier rejection is not the focus of this example. The clustering results of the standard DA algorithm (with prior c = 2 for data set one and c = 3 for data set two) are shown in Fig. 2a and Fig. 2c, respectively; it is obvious that standard DA cannot deal with these linearly nonseparable data structures. The proposed RKDA algorithm, in contrast, succeeds in revealing the original data structure, as shown in Fig. 2b and Fig. 2d. This example accords with the discussion above: RKDA is a generalization of DA using a kernel-induced distance measure, which can reveal structure in the data that may go unnoticed if DA is performed in the original input space.
4 Conclusion
A new clustering method, the robust kernel-based deterministic annealing (RKDA) algorithm, has been proposed for data clustering in Mercer kernel-induced feature space in terms of kernel methods and mutual information techniques. The proposed approach exhibits the following robust clustering characteristics: 1) robustness to the initialization (initial guess), 2) robustness to cluster volumes and shapes (the ability to detect clusters of different volumes and shapes), and 3) robustness to noisy data points (noise and outliers). The promising aspects of the proposed method for data clustering have been demonstrated by several experimental results.
References

1. Blahut, R.E.: Computation of Channel Capacity and Rate-Distortion Functions. IEEE Trans. on Information Theory 18(4) (1972) 460-473
2. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge (2000)
3. Muller, K.R., Mika, S., Ratsch, G., Tsuda, K., Scholkopf, B.: An Introduction to Kernel-Based Learning Algorithms. IEEE Trans. on Neural Networks 12(2) (2001) 181-201
4. Rose, K., Gurewitz, E., Fox, G.C.: Statistical Mechanics and Phase Transitions in Clustering. Physical Review Letters 65(8) (1990) 945-948
5. Rose, K.: Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems. Proceedings of the IEEE 86(11) (1998) 2210-2239
6. Scholkopf, B., Smola, A.J.: Learning with Kernels. MIT Press, Cambridge, MA (2002)
7. Song, Q.: A Robust Information Clustering Algorithm. Neural Computation 17(12) (2005) 2672-2698
8. Wu, K.L., Yang, M.S.: Alternative C-Means Clustering Algorithms. Pattern Recognition 35(10) (2002) 2267-2278
Pseudo-density Estimation for Clustering with Gaussian Processes

Hyun-Chul Kim and Jaewook Lee

Department of Industrial and Management Engineering, Pohang University of Science and Technology, Pohang, Kyungbuk 790-784, Korea
[email protected]
Abstract. Gaussian processes (GP) provide a kernel machine framework. They have been mainly applied to regression and classification. We propose a pseudo-density estimation method based on the information of variance functions of GPs, which relates to the density of the data points. We also show how the constructed pseudo-density can be applied to clustering. Through simulation we show that the topology of the pseudo-density represents the clustering information well with promising results.
1 Introduction
Clustering is the task of naturally grouping input patterns [1, 4, 9]. Clustering techniques are useful in several exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification, and many clustering techniques have been developed. In this paper we propose a method to estimate a pseudo-density function with Gaussian process models. The variance of the predictive value is used as a pseudo-density function: it has a small value in areas where the data points are dense and a large value in areas where the data points are sparse. The estimated pseudo-density function is applied to clustering, with illustrations. The organization of the paper is as follows. In Section 2 we describe Gaussian process regression. In Section 3 we describe the pseudo-density in terms of variances of predictive values. In Section 4 we present a way to cluster objects using the constructed pseudo-density, and we show simulation results in Section 5. In Section 6, a conclusion is drawn.
2 Gaussian Process Regression
Let us consider the regression problem. Assume that we have a data set D of data points x_i with continuous target values t_i: D = {(x_i, t_i) | i = 1, 2, ..., n}, X = {x_i | i = 1, 2, ..., n}, T = {t_i | i = 1, 2, ..., n}. Given this data set, we wish to find (the density of) the target value t̃ for a new data point x̃. Let the target function be f(·), and put f = [f_1, f_2, ..., f_n] = [f(x_1), f(x_2), ..., f(x_n)]. Gaussian processes for regression [14, 12] assume that the target function has a Gaussian process prior; this means that the density of any collection of target function values is modeled as a multivariate Gaussian density. Writing the dependence of f on X implicitly, the GP prior over functions can be written

p(f | X, Θ) = 1 / ((2π)^{N/2} |C_Θ|^{1/2}) · exp( −(1/2) (f − μ)^T C_Θ^{−1} (f − μ) ),   (1)

where the mean μ is usually assumed to be zero (μ = 0) and each entry of the covariance matrix, C_{ij} = c(x_i, x_j), is a function of x_i and x_j. This covariance function is the kernel, which defines how data points generalize to nearby data points. A commonly used covariance function is

c(x_i, x_j) = v_0 exp{ −(1/2) Σ_{m=1}^{d} l_m (x_i^m − x_j^m)² } + v_1 + δ_{ij} v_4.   (2)

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1238–1243, 2006. © Springer-Verlag Berlin Heidelberg 2006
The covariance function is parameterized by the hyperparameters Θ, which we can learn from the data. The covariance between the targets at two different points is a decreasing function of their distance in input space, controlled by a small set of hyperparameters that capture interpretable properties of the function, such as the length scale of autocorrelation, the overall scale of the function, and the amount of noise. When we assume that the noise on the target values is Gaussian, the likelihood is

p(t | f, Θ) = Π_{i=1}^{n} p(t_i | f_i) = Π_{i=1}^{n} (1/(√(2π) r)) exp( −(t_i − f_i)² / (2r²) ).   (3)

Then, by Bayes' rule, the posterior of f is

p(f | D, Θ) ∝ p(t | f, Θ) p(f | X, Θ).   (4)

The predictive density of f̃ = f(x̃) for a new data point x̃ is

p(f̃ | D, x̃, Θ) = ∫ p(f̃, f | D, x̃, Θ) df = ∫ p(f̃ | f, x̃, Θ) p(f | D, Θ) df   (5)
= N( k^T C^{−1} t, κ − k^T C^{−1} k ),   (6)

where k = [c(x̃, x_1), c(x̃, x_2), ..., c(x̃, x_n)]^T and κ = c(x̃, x̃). The posterior distributions of the hyperparameters given the data can be inferred in a Bayesian way via Markov chain Monte Carlo (MCMC) methods, or the hyperparameters can be selected by maximizing the marginal likelihood (a.k.a. the evidence) [14, 12]. A Bayesian treatment of multilayer perceptrons for regression has been shown to converge to a Gaussian process as the number of hidden nodes approaches infinity, if the priors on the input-to-hidden weights and hidden-unit biases are independent and identically distributed [11]. Empirically, Gaussian processes have been shown to be an excellent method for nonlinear regression [13].
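As a concrete illustration, the predictive mean and variance of Eq. (6) can be sketched in a few lines of NumPy. The hyperparameter values (v_0, l, v_1, v_4) below are illustrative placeholders, not values learned from data as the paper describes:

```python
import numpy as np

def c(a, b, v0=1.0, l=1.0, v1=0.1):
    # One-dimensional instance of the covariance function of Eq. (2);
    # v0, l, v1 are illustrative hyperparameter values, not learned ones.
    return v0 * np.exp(-0.5 * l * (a - b) ** 2) + v1

def gp_predict(X, t, x_new, v4=0.01):
    # Predictive mean and variance of Eq. (6); v4 plays the role of the
    # diagonal noise term delta_ij * v4 of Eq. (2).
    C = c(X[:, None], X[None, :]) + v4 * np.eye(len(X))
    k = c(x_new, X)
    kappa = c(x_new, x_new)
    Cinv_k = np.linalg.solve(C, k)   # solve a system instead of inverting C
    return Cinv_k @ t, kappa - k @ Cinv_k

X = np.array([-1.0, 0.0, 1.0])
t = np.array([0.5, 1.0, 0.5])
m_near, v_near = gp_predict(X, t, 0.0)    # query inside the data
m_far, v_far = gp_predict(X, t, 10.0)     # query far from the data
assert 0 < v_near < v_far                 # variance grows away from the data
```

`np.linalg.solve` is used rather than forming C^{−1} explicitly, the numerically preferred way to evaluate k^T C^{−1} t and k^T C^{−1} k.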
3 Pseudo-density from GPR
3.1 Variances of Predictive Values
What we mean by a pseudo-density function is a function whose values are highly correlated with the density of the data. A good candidate for a pseudo-density function is the variance of the predictive value, which is smaller in denser areas and larger in sparser ones. Figure 1 shows the means and variances of predictive values in GPR. Let σ²(x) denote the variance of the predictive value in Eq. (6), rewritten as follows:

σ²(x) = κ − k^T C^{−1} k.   (7)

The variances of predictive values in GPR do not depend on the target values. We consider GPR where all target values are zero, with noise allowance. Sample functions from the GP posterior then fluctuate around zero. In a dense area the degree of fluctuation of the sample functions is low with high probability, because sample functions should be smooth and close to the (zero) target value near the data points. Thus, when we consider E[f(x)²], it should be closer to zero in a denser area and larger in a less dense area. Since E[f(x)] = 0 here,

E[f(x)²] = V[f(x)] + {E[f(x)]}² = σ²(x).   (8)
Each basin should be related to a cluster or a sub-cluster. Figure 1-(a) shows the variance of the underlying function f(·) given data points sampled from a bimodal distribution.
Fig. 1. (a) Variance of f(·) for a data set of 500 points sampled from N(−5, 1) and 52 points sampled from N(6, 1). (b) The contour g(x) = g* for a synthetic data set.
3.2 View of Second-Order Parzen Window
Consider the second term of Eq. (7), given by

g(x) = k^T C^{−1} k.   (9)
Then it can be shown that when C = dI, g(x) takes the same form as a Parzen window. When C ≠ dI, g(x) also includes terms whose window centers are not only the data points but also the midpoints of all pairs of data points. These considerations show that the pseudo-density function g describes an asymptotic density function.
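The C = dI case is easy to verify numerically: g(x) then reduces to a sum of squared kernel values, i.e. a Parzen window whose kernel is c(x, x_i)². A small sketch with assumed values of v_0, l, and d:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, 50)
v0, l, d = 1.0, 0.5, 2.0                      # assumed illustrative values

def c(a, b):
    return v0 * np.exp(-0.5 * l * (a - b) ** 2)

def g_diag(x):
    # Eq. (9) with C = dI: g(x) = k^T (dI)^{-1} k
    k = c(x, X)
    return k @ k / d

def parzen(x):
    # Parzen window whose kernel is the squared covariance c(x, x_i)^2
    return float(np.sum(v0 ** 2 * np.exp(-l * (x - X) ** 2))) / d

for x in (-1.0, 0.0, 2.5):
    assert abs(g_diag(x) - parzen(x)) < 1e-12
```

The squaring halves the effective window width, which is why the section speaks of a second-order Parzen window.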
4 Cluster Assignment
For a pseudo-density function g(x) over the data points, let g* = min_{x∈X} g(x).
Then the level curve g(x) = g* makes up the boundaries of separate regions which correspond to clusters, and clustering can be done by assigning the same label to the data points in the same region. To label data points, we observe the following: given a pair of data points belonging to different clusters, any path connecting them must exit the contour boundary, so the path must contain a segment of points y with g(y) < g*. Following this observation we define an adjacency matrix A_ij:

A_ij = 1 if g(y) ≥ g* for all points y on the segment between x_i and x_j; 0 otherwise.   (10)

Then the connected components of the graph induced by A correspond to clusters. The cluster labelling can also be computed by trajectory-based methods [6, 7, 8, 9, 10].
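A sketch of this labelling procedure, with the segment test approximated by sampling a fixed number of points along each segment (our assumption; the paper does not specify a discretization):

```python
import numpy as np
from collections import deque

def cluster_labels(X, g, n_seg=12):
    # Adjacency of Eq. (10): x_i and x_j are connected when the
    # pseudo-density g stays at or above g* = min over data points,
    # checked on n_seg sample points of the connecting segment.
    n = len(X)
    g_star = min(g(x) for x in X)
    A = np.zeros((n, n), dtype=bool)
    ts = np.linspace(0.0, 1.0, n_seg)
    for i in range(n):
        for j in range(i + 1, n):
            ok = all(g((1 - t) * X[i] + t * X[j]) >= g_star for t in ts)
            A[i, j] = A[j, i] = ok
    # Connected components of the graph induced by A are the clusters.
    labels = -np.ones(n, dtype=int)
    cur = 0
    for s in range(n):
        if labels[s] >= 0:
            continue
        labels[s] = cur
        q = deque([s])
        while q:
            u = q.popleft()
            for v in np.nonzero(A[u])[0]:
                if labels[v] < 0:
                    labels[v] = cur
                    q.append(v)
        cur += 1
    return labels

# Toy 1-D example with a Parzen-like pseudo-density g.
X = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
g = lambda x: float(np.sum(np.exp(-(x - X) ** 2 / 0.08)))
labels = cluster_labels(X, g)
assert labels[0] == labels[1] == labels[2]
assert labels[3] == labels[4] == labels[5]
assert labels[0] != labels[3]
```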
5 Simulation Results
We applied the proposed clustering algorithm to some synthetic data sets and compared it with the fuzzy C-means algorithm [5, 2]. To measure the clustering performance we use three measures: the reference error rate RE, the cluster error rate CE, and the FScore. Let n_{r,c} be the number of data points which belong to reference cluster r and are assigned to cluster c by a clustering algorithm. Let n, n_r, and n_c be the number of all data points, the number of data points in reference cluster r, and the number of data points in cluster c obtained by a clustering algorithm, respectively. Then the measures RE and CE are defined as

RE = 1 − (1/n) Σ_c max_r n_{r,c},   CE = 1 − (1/n) Σ_r max_c n_{r,c}.   (11)

The reference error rate RE is zero when all the data points belonging to every cluster obtained by a clustering algorithm belong to one reference cluster. The cluster error rate CE is zero when all the data points belonging to every reference cluster belong to one cluster obtained by a clustering algorithm. When both RE and CE are zero, the clusters exactly match the reference clusters.
Let R_{r,c} and P_{r,c} be the recall and the precision, defined as n_{r,c}/n_r and n_{r,c}/n_c, respectively. Then the FScore of reference cluster r and cluster c is defined as

F_{r,c} = 2 R_{r,c} P_{r,c} / (R_{r,c} + P_{r,c}).   (12)
The FScore of reference cluster r is the maximum FScore value over all the clusters:

F_r = max_c F_{r,c}.   (13)

The overall FScore is defined as

FScore = Σ_r (n_r / n) F_r.   (14)
In general, the higher the FScore, the better the clustering result; the FScore is one when the clusters exactly match the reference clusters. The synthetic data sets are generated as shown in Figure 1-(b). The contour in Figure 1-(b) is the level curve g(x) = g*; it clearly shows that the contour separates the clusters well. Table 1 shows the reference error rate, the cluster error rate, and the FScore of fuzzy C-means and the proposed algorithm for the synthetic data sets. In terms of all the measures, i.e. reference error rate, cluster error rate, and FScore, the proposed algorithm outperforms the fuzzy C-means algorithm.

Table 1. Clustering performances of Gaussian processes and fuzzy C-means clustering for the synthetic data sets

Data size | Ref. Err. Rate       | Clu. Err. Rate       | FScore
          | GP    Fuzzy C-means  | GP    Fuzzy C-means  | GP        Fuzzy C-means
2D-100    | 0     0              | 0     0              | 1         1
2D-200    | 0     2.5            | 0     2.5            | 1         0.974937
2D-400    | 0     2.25           | 0     2.25           | 1         0.977470
3D        | 0     8.67           | 0     8.67           | 1         0.911209
4D        | 0     7.3            | 1.33  7.3            | 0.993197  0.929577
6D        | 6.00  41.00          | 43.00 20.67          | 0.587531  0.567435
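The three measures can be computed directly from the contingency counts n_{r,c}; a sketch following Eqs. (11)-(14) (the helper name `clustering_metrics` is ours):

```python
import numpy as np

def clustering_metrics(ref, clu):
    # n[r, c] = number of points in reference cluster r assigned to cluster c.
    refs, cols = np.unique(ref), np.unique(clu)
    n = np.array([[np.sum((ref == r) & (clu == c)) for c in cols]
                  for r in refs], dtype=float)
    total = n.sum()
    RE = 1.0 - n.max(axis=0).sum() / total   # Eq. (11)
    CE = 1.0 - n.max(axis=1).sum() / total
    nr = n.sum(axis=1, keepdims=True)        # reference cluster sizes n_r
    nc = n.sum(axis=0, keepdims=True)        # obtained cluster sizes n_c
    R, P = n / nr, n / nc                    # recall and precision
    with np.errstate(divide="ignore", invalid="ignore"):
        F = np.where(n > 0, 2 * R * P / (R + P), 0.0)   # Eq. (12)
    Fr = F.max(axis=1)                       # Eq. (13)
    return RE, CE, float((nr.ravel() / total) @ Fr)     # Eq. (14)

ref = np.array([0, 0, 1, 1, 1])
clu = np.array([1, 1, 0, 0, 0])              # perfect clustering, relabeled
RE, CE, F = clustering_metrics(ref, clu)
assert RE == 0.0 and CE == 0.0 and abs(F - 1.0) < 1e-12
```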
6 Conclusion
We have proposed an algorithm that constructs a pseudo-density function with Gaussian process models. The variance of the predictive value is used as the pseudo-density function: it has a small value where the data points are dense and a large value where they are sparse. The topological view of the pseudo-density function tells us how the data points form clusters: we can check the values of points on the line segment between a pair of data points to see whether the pair lies in the same cluster. The clusters thus obtained have shown good performance in simulation.
Acknowledgement. This work was supported by the KOSEF under the grant number R01-2005-000-10746-0.
References
1. Ben-Hur, A., Horn, D., Siegelmann, H., Vapnik, V.: Support Vector Clustering. Journal of Machine Learning Research 2 (2001) 125-137
2. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York (1981)
3. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2 (1998) 121-197
4. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley, New York (2001)
5. Dunn, J.C.: A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. Journal of Cybernetics (1973) 32-57
6. Lee, J.: Dynamic Gradient Approaches to Compute the Closest Unstable Equilibrium Point for Stability Region Estimate and Their Computational Limitations. IEEE Trans. on Automatic Control 48(2) (2003) 321-324
7. Lee, J.: An Optimization-Driven Framework for the Computation of the Controlling UEP in Transient Stability Analysis. IEEE Trans. on Automatic Control 49(1) (2004) 115-119
8. Lee, J., Chiang, H.-D.: A Dynamical Trajectory-Based Methodology for Systematically Computing Multiple Optimal Solutions of General Nonlinear Programming Problems. IEEE Trans. on Automatic Control 49(6) (2004) 888-899
9. Lee, J., Lee, D.: An Improved Cluster Labeling Method for Support Vector Clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence 27(3) (2005) 461-464
10. Lee, D., Lee, J.: Trajectory-Based Support Vector Multicategory Classifier. Lecture Notes in Computer Science 3496 (2005) 857-861
11. Neal, R.M.: Bayesian Learning for Neural Networks. Lecture Notes in Statistics, Springer-Verlag, New York (1996)
12. Neal, R.M.: Regression and Classification Using Gaussian Process Priors. Bayesian Statistics 6 (1998) 465-501
13. Rasmussen, C.E.: Evaluation of Gaussian Processes and Other Methods for Non-Linear Regression. PhD Thesis, University of Toronto (1996)
14. Williams, C.K.I., Rasmussen, C.E.: Gaussian Processes for Regression. NIPS 8 (1995) 514-520
Clustering Analysis of Competitive Learning Network for Molecular Data Lin Wang1,3, Minghu Jiang2,3, Yinghua Lu1, Frank Noe3, and Jeremy C. Smith3 1
School of Electronics Eng., Beijing Univ. of Post and Telecom., Beijing 100876, China 2 Lab. of Computational Linguistics, School of Humanities and Social Sciences Tsinghua University, Beijing 100084, China 3 Interdisciplinary Center for Scientific Computing (IWR), University of Heidelberg Im Neuenheimer Feld 368, 69210 Heidelberg, Germany [email protected]
Abstract. In this paper competitive learning is used to cluster molecular data of large size. A competitive learning network clusters the input data by adapting only the winning node; the winning node is more likely to win the competition again when a similar input is presented, so similar inputs are clustered into the same class and dissimilar inputs into different classes. The experimental results show that the competitive learning network has good clustering reproducibility and is effective for clustering molecular data; the conscience learning algorithm can effectively eliminate dead nodes as the number of output nodes increases, indicating the effectiveness of this kind of network for clustering molecular data of large size.
1 Introduction

Competitive learning can cluster the input data and can be realized by a neural network. The aim of such a network is to cluster the input data, gain insight into the data distribution, find the characteristics of each cluster, and assign new examples to clusters. The computational speed of competitive learning (an unsupervised algorithm) is much faster than that of supervised algorithms; it is a fast algorithm for discovering statistical regularities in data by adaptation [1]. The outputs of competitive learning can reflect the correlations of the input data. Ideally, the final state of network convergence places one weight vector over the center of each cluster of the input vectors; the input vectors within a cluster then have the same internal representation and hence the same output. If the data are clustered, the network will classify unseen inputs perfectly, reflected in the input-output mapping [2]. This paper is organized as follows: Section 2 presents the competitive learning network for clustering and discusses the dead-nodes problem of the competitive network and clustering reproducibility. Section 3 describes the experiments and analyzes the virtues and shortcomings of the clustering method. Section 4 summarizes the conclusions. J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1244 – 1249, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Competitive Learning Model

2.1 Competitive Learning Network

A competitive learning network usually consists of an input layer of N fan-out neurons and an output layer of K processing neurons, with the output neurons fully connected to the input layer. The learning algorithm is as follows:

Step 1: Initialize the weights to small random values.
Step 2: Each output neuron calculates the distance between the input vector and its weight vector; the closest competitive neuron i is found and sets its output activation to 1, while all other competitive neurons set theirs to 0.
Step 3: The weights of the winning neuron only are updated so as to move them closer to the current input pattern. Thus the neuron whose weight vector was closest to the input vector is updated to be even closer.
Step 4: Check whether the stopping criterion is satisfied: if the weight vectors no longer change or the maximum number of epochs is reached, stop learning; otherwise go to Step 2.

The winning node is more likely to win the competition again when a similar input is presented, and less likely to win when a very different input is presented. Thus the competitive network clusters the input data, and the output vectors of the resulting network correspond to class indices.

2.2 Dead Nodes Problem

The competitive learning network does not generalize well; it is sensitive to the initialization of the weight vectors. In the ideal situation, the input vectors fall into k classes, where k is the number of output nodes. But this kind of network has two defects. The first is that, because there are no known desired outputs, it is difficult to decide the required number of clusters. The second is dead nodes: neurons whose initial weights are far from any input vector never win the competition, so their weights never get to learn. The number of dead neurons increases as the number of competitive-layer neurons increases.
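Steps 1-4 can be sketched in NumPy as follows. The learning rate and the conscience increment are illustrative values, the stopping criterion is simplified to a fixed epoch budget, and the conscience bias is written here as an additive distance handicap for frequent winners, one common formulation of the mechanism:

```python
import numpy as np

def competitive_train(X, k, epochs=50, lr=0.05, conscience=0.001, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(k, X.shape[1]))   # Step 1: small random weights
    bias = np.zeros(k)
    for _ in range(epochs):
        for x in rng.permutation(X):
            d = np.linalg.norm(W - x, axis=1) + bias  # Step 2: biased distances
            win = int(np.argmin(d))                   # winner takes all
            W[win] += lr * (x - W[win])               # Step 3: move winner closer
            bias[win] += conscience                   # handicap frequent winners
    return W                                          # Step 4: fixed epoch budget

def assign(X, W):
    return np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1)

rng = np.random.default_rng(42)
X = np.vstack([rng.normal([0, 0], 0.1, (30, 2)),
               rng.normal([5, 5], 0.1, (30, 2))])     # two separated blobs
labels = assign(X, competitive_train(X, 2))
assert len(set(labels[:30].tolist())) == 1            # one node per blob
assert len(set(labels[30:].tolist())) == 1
assert labels[0] != labels[30]
```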
To prevent this from happening, a negative bias is added to each neuron, and the bias is decreased each time the neuron wins; this makes it harder for a neuron to win if it has won often (a mechanism called conscience). Finally, when training is finished, the output of the network reflects the cluster distribution of the input vectors.

2.3 Quality of Clustering

2.3.1 Intra- and Inter-class Similarity

High-quality clusters are characterized by high intra-class similarity (small within-cluster distance) and low inter-class similarity (large between-cluster distance), which are evaluated by their scattering matrices [3]. The intra-class scattering matrix is defined as:
S_intra = Σ_{i=1}^{N} P(w_i) E{ (X − M_i)(X − M_i)^T | w_i }.   (1)

where P(w_i) is the a priori probability of class w_i, M_i is the mean vector (centroid) of class w_i, M_0 = Σ_{i=1}^{N} P(w_i) M_i, and N is the number of classes. It reveals the scattering of samples around their respective class centroids. The inter-class scattering matrix is defined as:

S_inter = Σ_{i=1}^{N} P(w_i) (M_i − M_0)(M_i − M_0)^T.   (2)
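Under empirical class priors P(w_i) = n_i/n, the two scattering matrices can be estimated as below; the function name and the use of sample means as centroids are our assumptions:

```python
import numpy as np

def scatter_matrices(X, y):
    # S_intra (Eq. 1): spread of samples around their class centroids.
    # S_inter (Eq. 2): spread of class centroids around the overall mean M0.
    classes, counts = np.unique(y, return_counts=True)
    P = counts / len(y)                          # empirical priors P(w_i)
    M = np.array([X[y == c].mean(axis=0) for c in classes])
    M0 = P @ M                                   # overall mean
    d = X.shape[1]
    S_intra, S_inter = np.zeros((d, d)), np.zeros((d, d))
    for p, c, m in zip(P, classes, M):
        Xc = X[y == c] - m
        S_intra += p * (Xc.T @ Xc) / len(Xc)     # E{(X - M_i)(X - M_i)^T | w_i}
        diff = (m - M0)[:, None]
        S_inter += p * (diff @ diff.T)
    return S_intra, S_inter

X = np.array([[0.0], [2.0], [10.0], [12.0]])
y = np.array([0, 0, 1, 1])
Si, Se = scatter_matrices(X, y)
assert np.isclose(Si[0, 0], 1.0)    # within-class variance around centroids 1, 11
assert np.isclose(Se[0, 0], 25.0)   # centroid spread around the overall mean 6
```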
The diagonal entries of S_intra and S_inter reveal the intra- and inter-class separability of individual features. If a diagonal entry of S_intra is small while the corresponding entry of S_inter is large, then that feature has good class separability. The off-diagonal entries of S_intra and S_inter reveal the correlations between different features [4].

2.3.2 Reproducibility Matrix

Because the a priori probabilities P(w_i) are usually unknown, clustering results can also be evaluated by a reproducibility matrix. If the same data are trained repeatedly with different initial conditions, the similarity of the class-size distributions should be evaluated; that is, a similar assignment of data points to classes should be measured for the different initial conditions. Assuming the input data have R samples, the algorithm is as follows:

Step 1: Repeat the clustering 5 times using random initial conditions each time.
Step 2: Set up an upper (or lower) triangular matrix with R²/2 entries. In each entry (i, j), count the number of times that data points i and j have been assigned to the same cluster.
Step 3: Count the number of (i, j) pairs with values 2–3 and judge whether these intermediate values are rare. If the fraction of (i, j) pairs with values 2–3 is below 10%, the clustering is reproducible to a good degree and is considered successful. Ideally, only the values 0 and 5 should occur, which would mean that the results are perfectly reproducible.
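Steps 1-3 can be sketched as follows, generalized to any number of runs (with 5 runs the intermediate counts are exactly 2-3, as in the text):

```python
import numpy as np

def reproducibility(label_runs):
    # count[i, j] = number of runs assigning points i and j to the same
    # cluster (Step 2); counts of 2..(runs-2) flag unstable pairs (Step 3).
    runs = np.asarray(label_runs)
    n_runs = len(runs)
    same = (runs[:, :, None] == runs[:, None, :]).sum(axis=0)
    i, j = np.triu_indices(runs.shape[1], k=1)    # upper triangle, i < j
    counts = same[i, j]
    unstable = float(np.mean((counts >= 2) & (counts <= n_runs - 2)))
    return counts, unstable

stable = [np.array([0, 0, 1, 1])] * 5             # 5 identical runs (Step 1)
counts, frac = reproducibility(stable)
assert frac == 0.0 and set(counts.tolist()) <= {0, 5}
mixed = [np.array([0, 0, 1, 1])] * 2 + [np.array([0, 1, 0, 1])] * 3
assert reproducibility(mixed)[1] > 0.0            # unstable assignments detected
```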
3 Experimental Results

3.1 Analysis of Molecular Data

1). Data I: low-energy minima samples from transition networks. The low-energy samples are generated as follows [5]: first, the potential energy surface and the minimized reactant and product end-states of the transition are confirmed; second, conformers are uniformly spread over the part of conformational space that is relevant to the transition; third, structures without steric clashes are accepted and energy-minimized; fourth, more low-energy minima are generated by pairwise interpolation between available conformers and minimization, and these minima form the transition network vertices; fifth, pairs of neighboring minima are associated, forming the transition network edges, and the sub-transition barriers of selected edges are computed. This led to
a total number of 6242 minima, which served as the vertices of the transition network. From the coordinates of these 6242 minima, 30 principal-component features are obtained by principal component analysis (PCA).
2). Data II: 5000 "local sampling" samples of 30 principal-component features for the same system; these "local sampling" data fluctuate less than the low-energy samples.

3.2 Hierarchical Clustering

Hierarchical clustering can represent a multi-level hierarchy which shows the tree relation of cluster distances. It creates a cluster tree to investigate grouping in the input data simultaneously over a variety of distance scales. We can thus investigate different scales of grouping in the data, which allows us to decide what scale or level of clustering is most appropriate for our application. The hierarchical clustering in Fig. 1 shows that the between-cluster distance for Data I is far larger than that for Data II, and that redundant features are effectively canceled by PCA; the within-cluster distance of Data I is also far larger than that of Data II, because the standard deviations of the sample features of Data I are larger than those of Data II. The right figure in Fig. 2 shows that Data II nevertheless exhibit some clusters.
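For illustration, average-linkage agglomerative clustering can be sketched in plain NumPy (a naive O(n³) toy, not suitable for the 6242-sample data; real analyses would use a library implementation):

```python
import numpy as np

def agglomerative(X, n_clusters):
    # Naive average-linkage agglomerative clustering: repeatedly merge
    # the pair of clusters with the smallest mean pairwise distance.
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()  # average linkage
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)            # merge the closest pair
    labels = np.empty(len(X), dtype=int)
    for lab, members in enumerate(clusters):
        labels[members] = lab
    return labels

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
labels = agglomerative(X, 2)
assert labels[0] == labels[1] == labels[2]
assert labels[3] == labels[4] and labels[0] != labels[3]
```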
Fig. 1. Hierarchical clustering (cluster distance) of Data I, the 6242 low-energy samples with 30 features (top figure), and Data II, the 5000 "local sampling" samples with 30 features (bottom figure)
Fig. 2. Distribution of classes after convergence for Data I (left figure) and Data II (right figure); 10 output nodes, Kohonen learning rate 0.001, conscience learning rate 0.0005
3.3 Clustering of Competitive Learning Network

The competitive learning network consists of an input layer of 30 fan-out nodes and an output layer of K processing nodes; the results of adaptation and training are shown in Fig. 2 and Fig. 3. These figures show the cluster distributions and convergence of the competitive learning network, where each column is the number of samples in a given class; the number of output nodes ranges from 10 to 15, the number of iterations from 20 to 40, the Kohonen learning rate is 0.001, and the conscience learning rate ranges from 0.0001 to 0.0005. The left figure in Fig. 3 shows that the number of dead neurons increases as the number of competitive-layer nodes increases, so some neurons may never get allocated; the right figure in Fig. 3 shows that the dead neurons are eliminated by the conscience learning algorithm. In order to evaluate the clustering quality of the competitive learning network, the similarity of the class-size distributions is calculated: the network is trained on Data II with 5 different initial conditions, the number of network outputs (classes) is 10, and the number of iterations is
Fig. 3. Distribution of classes after convergence for Data II, 15 output nodes, Kohonen learning rate 0.001. The left figure shows that there are dead nodes (nodes 4 and 6). The right figure shows that the conscience mechanism is strengthened and the dead nodes are eliminated; conscience learning rate 0.0002.
20, the Kohonen learning rate is 0.001, and the conscience learning rate is 0.0005. The numbers of (i, j) pairs with values 0–5 in an upper triangular matrix of 5000²/2 entries are, respectively, 9566706, 1075601, 675585, 467473, 354693, and 357442; the fraction of (i, j) pairs with values 2–3 is therefore 9.15% < 10%. The experiment shows that the competitive learning network has good clustering reproducibility.
4 Conclusion

The competitive learning network can cluster the input data by adapting only the winning node. The winning node is more likely to win the competition again when a similar input is presented, so similar inputs are clustered into the same class and dissimilar inputs into different classes. The competitive learning network has good clustering reproducibility, and the conscience learning algorithm can effectively eliminate dead nodes as the number of output nodes increases, indicating the effectiveness of this kind of network for clustering molecular data of large size; however, this kind of network does not learn the topology of the input data.
References
1. Hsu, D., Figueroa, M., Diorio, C.: Competitive Learning with Floating-Gate Circuits. IEEE Transactions on Neural Networks 13 (2002) 732-744
2. Jiang, M., Cai, H., Zhang, B.: Self-Organizing Map Analysis Consistent with Neuroimaging for Chinese Noun, Verb and Class-Ambiguous Word. In: Wang, J., et al. (eds.): Advances in Neural Networks – ISNN 2005. Lecture Notes in Computer Science, Vol. 3498. Springer-Verlag, Berlin Heidelberg New York (2005) 971-976
3. Dougherty, E.R., Brun, M.: A Probabilistic Theory of Clustering. Pattern Recognition 37 (2004) 917-925
4. Puntonet, C.G., Mansour, A., Bauer, C., et al.: Separation of Sources Using Simulated Annealing and Competitive Learning. Neurocomputing 49 (2002) 39-60
5. Noe, F.: Transition Networks for the Comprehensive Analysis of Complex Rearrangements in Proteins. Ph.D. Dissertation, University of Heidelberg, Germany (2005)
Self-Organizing Map Clustering Analysis for Molecular Data Lin Wang1,3, Minghu Jiang2,3, Yinghua Lu1, Frank Noe3, and Jeremy C. Smith3 1
School of Electronics Eng., Beijing Univ. of Post and Telecom., Beijing, 100876, China 2 Lab. of Computational Linguistics, School of Humanities and Social Sciences, Tsinghua University, Beijing 100084, China 3 Interdisciplinary Center for Scientific Computing (IWR), University of Heidelberg, Im Neuenheimer Feld 368, 69210 Heidelberg, Germany [email protected]
Abstract. In this paper hierarchical clustering and self-organizing map (SOM) clustering are compared using molecular data of large size. Hierarchical clustering can represent a multi-level hierarchy which shows the tree relation of cluster distances. The SOM adapts the winner node and its neighborhood nodes; it can learn the topology of, and represent roughly equally distributed regions of, the input space, so that similar inputs are mapped to neighboring neurons. By calculating the distances between neighboring units and the Davies-Bouldin clustering index, the cluster boundaries of the SOM are decided by the best Davies-Bouldin index. The experimental results show the effectiveness of clustering for molecular data: the between-cluster distance of the low-energy samples from transition networks is far larger than that of the "local sampling" samples, so the former gives a better clustering result, although the "local sampling" data nevertheless exhibit some clusters.
1 Introduction

Clustering analysis is a way to examine similarities and dissimilarities of observations. Data often fall naturally into groups or clusters: similar inputs should be classified into the same cluster and dissimilar inputs into different clusters. Clustering is realized by unsupervised learning with no predefined classes; the objects within each cluster are more closely related to one another than objects assigned to different clusters [1]. Clustering analysis is appropriate for some aspects of biological learning, the human and social sciences, and related areas. At the heart of clustering analysis is the notion of the degree of similarity (or dissimilarity) between the individual objects being clustered. Section 2 of this paper compares hierarchical clustering and Kohonen's SOM clustering; SOM cluster boundaries and methods for determining them are also discussed. Section 3 describes the experimental methods and results, and compares and analyzes the virtues and shortcomings of these clustering methods. Section 4 summarizes the conclusions. J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1250 – 1255, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Hierarchical Clustering and SOM Clustering

2.1 Hierarchical Clustering

Hierarchical clustering creates a cluster tree to investigate clustering in the input data simultaneously over a variety of distance scales [1]. The result of hierarchical clustering can be graphically represented by a multi-level hierarchy (dendrogram), where clusters at one level are joined as clusters at the next higher level. The root is the whole input data set, the leaves are the individual elements of the input data, and the internal nodes are defined as the unions of their children. Each level of the tree represents a partition of the input data into several clusters. We can investigate different scales of clustering in the molecular data; this allows us to decide what scale or level of clustering is most appropriate for our application.

2.2 The Self-Organizing Map (SOM)

The SOM was proposed by Kohonen [2]. During the learning process, with no predefined classes, the input data are sorted automatically, and the weight distribution comes to resemble the input's probability density distribution. The SOM learns to recognize groups of similar input vectors in such a way that neurons physically near each other in the output layer respond to similar input vectors; that is, the smaller the distance, the greater the degree of similarity and the higher the likelihood of emerging as the winner [3]. The SOM can learn to detect regularities and correlations in the input data; its training is based on two principles [2]. Competitive learning: the prototype vector most similar to an input vector is updated so that it is even more similar to it. Cooperative learning: not only the most similar prototype vector but also its neighbors on the map are moved towards the input vector. The SOM thus adapts not only the winner node but also some neighborhood nodes of the winner; it can learn the topology of, and represent roughly equally distributed regions of, the input space, and similar inputs are mapped to neighboring neurons.
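The two principles can be sketched as a minimal SOM training loop; the grid size, decay schedules, and Gaussian neighborhood below are illustrative choices, not the settings used in the experiments:

```python
import numpy as np

def train_som(X, grid=(4, 4), epochs=30, lr0=0.5, sigma0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    gy, gx = grid
    pos = np.array([(i, j) for i in range(gy) for j in range(gx)], float)
    W = rng.normal(size=(gy * gx, X.shape[1]))        # random prototype vectors
    for e in range(epochs):
        lr = lr0 * (1.0 - e / epochs)                 # shrinking learning rate
        sigma = max(sigma0 * (1.0 - e / epochs), 0.5) # shrinking neighborhood
        for x in rng.permutation(X):
            win = np.argmin(np.linalg.norm(W - x, axis=1))   # competition
            h = np.exp(-np.sum((pos - pos[win]) ** 2, axis=1) / (2 * sigma ** 2))
            W += lr * h[:, None] * (x - W)            # cooperation: neighbors move too
    return W

def quantization_error(X, W):
    return float(np.mean(np.min(
        np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1)))

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 0.3, (40, 2)),
               rng.normal([5, 5], 0.3, (40, 2))])
W = train_som(X)
W0 = np.random.default_rng(0).normal(size=W.shape)   # the untrained weights
assert quantization_error(X, W) < quantization_error(X, W0)
```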
Unlike other clustering methods, the SOM does not have distinct cluster boundaries; therefore, some background knowledge is required to determine them. Here we adopt the best Davies-Bouldin index to decide the cluster boundaries. The choice of the best clustering can be determined by the Davies-Bouldin index [1], a function of the ratio of the sum of within-cluster distances to the between-cluster distance. The optimal clustering is determined by [1]:

V_DB = (1/N) Σ_{k=1}^{N} max_{l≠k} [ (S_N(D_k) + S_N(D_l)) / T_N(D_k, D_l) ].   (1)

where N is the number of clusters, D_k is a cluster of the data set X, S_N is the within-cluster distance between the points in a cluster and the centroid of that cluster, and T_N is the between-cluster distance from the centroid of one cluster to that of another. The optimal number of clusters is the one that minimizes V_DB. If the clusters are well separated, then V_DB should decrease monotonically as the number of clusters increases, until the clustering reaches convergence.
3 Experimental Results

3.1 Pre-processing of Molecular Data

There are two different data sets on which training is performed:
1). Data I: low-energy minima samples from transition networks. The generation steps are as follows [4]: first, the potential energy surface and the minimized reactant and product end-states of the transition are confirmed; second, conformers are uniformly spread over the part of conformational space that is relevant to the transition; third, structures without steric clashes are accepted and energy-minimized; fourth, more low-energy minima are generated by pairwise interpolation between available conformers and minimization, and these minima form the transition network vertices; fifth, pairs of neighboring minima are associated, forming the transition network edges, and the sub-transition barriers of selected edges are computed. This led to a total number of 6242 minima, which served as the vertices of the transition network. From the coordinates of these 6242 minima, 30 principal-component features (PCFs) are obtained by principal component analysis (PCA).
2). Data II: 5000 "local sampling" samples of 30 principal-component features for the same system; these "local sampling" data fluctuate less than the low-energy samples.
The heat maps of hierarchical clustering for the two data sets are shown in Fig. 1. It is easy to see that Data I have larger fluctuations and larger standard deviations than Data II. The between-cluster distance of Data I is far larger than that of Data II, and redundant features are effectively canceled by PCA.
Fig. 1. Heat map of hierarchical clustering with correlation as the distance metric and average linkage used to generate the hierarchical tree from Data I (left figure) and Data II (right figure). The maximum clustering distance for 6242 samples is 30, and the maximum clustering distance for 30 PCFs is 720. The maximum clustering distance for 5000 samples is 15, and the maximum clustering distance for 30 PCFs is 190.
Fig. 2 shows the U-matrix and D-matrix of SOM clustering for the two training sets. The SOMs are initialized linearly, and a batch training algorithm is used in two phases: rough training first, then fine-tuning. The U-matrix shows the distances between neighboring units and thus visualizes the cluster structure of the map; it has more hexagons in the visual output plane because each hexagon shows a distance between map units, whereas the D-matrix shows distance values only at the SOM map units. Clusters on the U-matrix are typically uniform areas of low
Self-Organizing Map Clustering Analysis for Molecular Data
1253
Fig. 2. U-matrix (top left) and D-matrix (top right) of the SOM trained on Data I; U-matrix (bottom left) and D-matrix (bottom right) of the SOM trained on Data II
Fig. 3. Davies-Bouldin clustering index (left figure); color-coded distance matrix (middle figure), where small hexagons on the D-matrix indicate cluster borders (corresponding to large distances between neighboring map units on the U-matrix); and SOM clusters by color code, minimized at the best clustering (right figure). The training data are Data I.
values (white), which indicate small distances between neighboring map units; high values (black) indicate large distances between neighboring map units and thus cluster borders. There are more high-value (black) cluster borders in the upper figure, which shows that Data I forms more small clusters; several white zones (uniform areas of low values) are encircled by black or gray cluster borders. This agrees with the hierarchical clustering result that the between-cluster distance of Data I is far larger than that of Data II.
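The U-matrix values discussed here can be computed directly from a trained SOM codebook; the sketch below assumes a rectangular 4-neighbour grid for simplicity (the maps in the figures use hexagonal lattices):

```python
import numpy as np

def u_matrix(weights):
    """Mean distance from each map unit's codebook vector to its grid neighbours.

    weights: array of shape (rows, cols, dim) holding the SOM codebook.
    High values mark cluster borders, low values cluster interiors.
    """
    rows, cols, _ = weights.shape
    U = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(np.linalg.norm(weights[r, c] - weights[rr, cc]))
            U[r, c] = np.mean(dists)
    return U
```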
Fig. 4. Davies-Bouldin clustering index (left figure); color-coded distance matrix (middle figure), where small hexagons on the D-matrix indicate cluster borders; and SOM clusters by color code, minimized at the best clustering (right figure). The training data are Data II.
Fig. 5. The distribution of the input data on the map (left figure), where the number in each hexagon is the number of samples mapped to that unit; the color-coded distance matrix is shown in the right figure. The upper figure is trained on Data I, and the lower figure on Data II.
Because the SOM has no distinct cluster boundaries, we use k-means clustering to find an initial partitioning in order to find and display the borders of the SOM clusters; the experiments show that the values of important variables change very rapidly. We can assign colors to the map units such that similar map units receive similar colors. Figs. 3 and 4 show the SOM clustering results for Data I and II: the left figures show the Davies-Bouldin clustering index [1]; the middle figures show the color-coded distance matrix, where small hexagons on the D-matrix indicate cluster borders;
and the right figures show the SOM clustering by color code, minimized at the best clustering. The many smaller hexagons in Fig. 3 (corresponding to large distances between neighboring map units) indicate more cluster borders, i.e., more small clusters for Data I. The left figures in Figs. 4 and 5 (whose training data are Data I and II, respectively) give the number of samples mapped to each unit, showing the distribution of the input data on the output map plane. Because the significance of the components with respect to the clustering is harder to visualize, we adopt a color-coded distance matrix, minimized at the best clustering, in the right figures of Fig. 5. Comparing the hierarchical clustering results in Fig. 1 with the SOM clustering in Fig. 5 shows clearly that the between-cluster distance of Data I is far larger than that of Data II; the within-cluster distance of Data I is also far larger, because the standard deviation of the sample features of the former is larger than that of the latter. Figs. 2 and 5 show that Data II nevertheless exhibits some clusters.
4 Conclusion

Hierarchical clustering can represent a multi-level hierarchy that shows the tree relation of cluster distances. The SOM adapts the winner node and its neighborhood nodes; it can learn the topology and represent roughly equally distributed regions of the input space, and similar inputs are mapped to neighboring neurons, which eases visual cluster analysis. By calculating distances between neighboring units and the Davies-Bouldin clustering index, the cluster boundaries of the SOM are decided by the best Davies-Bouldin clustering index. The experimental results show that the between-cluster distance of the low-energy samples from the transition network is far larger than that of the "local sampling" samples; the within-cluster distance of the former is also far larger than that of the latter, because the standard deviation of the sample features of the former is larger; the former has the better clustering result. These methods demonstrate the effectiveness of clustering for large molecular data sets.
References

1. Davies, D., Bouldin, D.: A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 1(2) (1979) 224-227
2. Kohonen, T.: The Self-Organizing Map. Proceedings of the IEEE 78 (1990) 1464-1480
3. Jiang, M., Cai, H., Zhang, B.: Self-Organizing Map Analysis Consistent with Neuroimaging for Chinese Noun, Verb and Class-Ambiguous Word. In: Wang, J., et al. (eds.): Advances in Neural Networks – ISNN 2005. Lecture Notes in Computer Science, Vol. 3498. Springer-Verlag, Berlin Heidelberg New York (2005) 971-976
4. Noe, F.: Transition Networks for the Comprehensive Analysis of Complex Rearrangements in Proteins. Ph.D. Dissertation, University of Heidelberg, Germany (2005)
A Conscientious Rival Penalized Competitive Learning Text Clustering Algorithm

Mao-ting Gao1,2 and Zheng-ou Wang1
1 Institute of Systems Engineering, Tianjin University, Tianjin 300072, China
2 Computer Science Department, Shanghai Maritime University, Shanghai 200135, China
[email protected], [email protected]
Abstract. Text features are usually expressed as a huge-dimensional vector in text mining. LSA can reduce the dimensionality of text features effectively and reveals the semantic relations between texts and terms. This paper presents a Conscientious Rival Penalized Competitive Learning (CRPCL) text clustering algorithm, which uses LSA to reduce the dimensionality and improves RPCL by setting a conscientious threshold that restricts a winner that has won too many times, so that every neural unit wins the competition at near the ideal probability. The experiments demonstrate the good performance of this method.
1 Introduction

Clustering is the process of partitioning data in a hyperspace into several subgroups so that the similarity of the data within the same subgroup is as large as possible and between different subgroups as small as possible. Clustering analysis plays an important role in text data mining. Text clustering has been widely used in text data mining, information retrieval, and other fields to improve precision and recall, to find the closest text, to cluster Web texts hierarchically (for example, to automatically classify the Web texts in Yahoo) [1], etc.

Many effective clustering methods have been proposed [2]; a popular one is k-means. It assigns each data object to the cluster whose centroid is nearest, while recomputing and readjusting the centroids until they converge to definite positions. It is simple and effective, but with the serious shortcoming that the number of clusters must be given in advance. In the common case, however, this number is not known, and both the result and the efficiency are affected by the selection of the initial cluster centroids. RPCL (Rival Penalized Competitive Learning), proposed by Lei Xu et al. [3, 4], can automatically determine the cluster number while clustering. Its idea is that, for each input, not only is the weight of the winner modified to adapt to the input value, but the rival (the second-winner unit) is also penalized by moving it away from the input value.

In text clustering, it is worth studying how to reduce the dimensionality of text features effectively while losing no, or as little as possible of, the feature information of the text. LSA (Latent Semantic Analysis), also called LSI (Latent Semantic Indexing), is an effective method [5] that uses singular value decomposition to reduce the dimensionality of text features rapidly.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1256 – 1260, 2006.
© Springer-Verlag Berlin Heidelberg 2006
This paper applies the theory of LSA and RPCL to text clustering and presents a CRPCL (Conscientious Rival Penalized Competitive Learning) text clustering algorithm, which uses LSA to reduce the dimensionality of the text TF.IDF matrix and improves RPCL by setting a conscientious threshold that restricts a winner that has won too many times, so that every neural unit wins the competition at near the ideal probability. It is demonstrated that this method reveals text features distinctly, enhances clustering accuracy, and determines the number of clusters automatically.
2 Theory of LSA

A matrix W of size n×m can be decomposed into the product of three matrices, $W = U A V^T$, such that U and V have orthonormal columns and A is a diagonal matrix of the ordered singular values. This is the SVD (Singular Value Decomposition) of W. Keep the first r singular values to form $A_r$ ($r \ll \min(n, m)$), and the first r columns of U and V to form $U_r$ and $V_r$. The r-order text-term matrix $W_r$ is then computed as

$$W_r = U_r A_r V_r^T \qquad (1)$$

which reduces W to $W_r$.
LSA expresses the meaning of a text through the words in it, but it does not regard every word as a reliable expression of the text's concept; on the contrary, the variety of word choice conceals the semantic structure of texts. By applying SVD and keeping rank r, LSA decreases the "noise" in the original text-term matrix, reveals the semantic relation between texts and terms, and reduces the text-term vector space to improve the efficiency of text clustering.
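The rank-r LSA reduction described in this section can be sketched with a plain SVD (an illustrative sketch; the function and variable names are our own):

```python
import numpy as np

def lsa_reduce(W, r):
    """Rank-r LSA approximation of a text-term matrix W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    Ur, Ar, Vr = U[:, :r], np.diag(s[:r]), Vt[:r, :]
    Wr = Ur @ Ar @ Vr          # W_r = U_r A_r V_r^T
    docs_r = Ar @ Vr           # documents in the r-dimensional semantic space
    return Wr, docs_r
```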
3 CRPCL Text Clustering Based on LSA

Competitive neural networks [8] are a kind of recurrent network based on unsupervised learning. In competitive learning, output units become activated by competing with each other, and only one unit is activated at any time. This characteristic makes competitive learning highly suitable for finding statistically prominent features, which can be used to classify a set of input patterns. The activated unit wins the current competition of the system in winner-takes-all fashion. This competition rule has the overall effect of moving the weight vector w_k of the winner unit k toward the input pattern.

CRPCL is an improved RPCL, a competitive neural network. In the RPCL process, for each input, not only is the weight vector of the winner unit modified to adapt to the input with a certain learning rate, which moves the weight vector w_k of the winner unit k toward the input pattern, but its rival (the second winner) is also penalized by driving it away from the input, while the other
units do nothing. After many learning iterations, each weight vector tends to a clustering centroid, so the RPCL algorithm can automatically determine the number of clusters in the data set. If one unit has won the competition too many times, CRPCL prevents it from winning in the next round: a conscientious threshold can be set for each unit so that every neural unit wins the competition at near the ideal probability.

3.1 CRPCL Clustering Algorithm

Let t be the iteration number of competitive learning, t_max the maximum number allowed, and k the preset cluster number. For every element x of the set to cluster, the output value u_i of each unit has weight vector w_i, i = 1, ..., k. The CRPCL algorithm [3, 4] can be described as follows.

Step 1: Randomly initialize the k weight vectors $\{w_i\}_{i=1}^k$, and let t = 1.

Step 2: Randomly take a sample x from the data set, and for i = 1, ..., k, let

$$u_i = \begin{cases} 1, & \text{if } i = c \text{ such that } \gamma_c \|x - w_c\|^2 = \min_j \gamma_j \|x - w_j\|^2 \\ -1, & \text{if } i = r \text{ such that } \gamma_r \|x - w_r\|^2 = \min_{j \neq c} \gamma_j \|x - w_j\|^2 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where c denotes the winner unit, r denotes the rival unit, and $\gamma_j$ denotes the conscientious threshold; it can be taken as $\gamma_j = n_j / \sum_{i=1}^{k} n_i$, where $n_j$ is the cumulative number of occurrences of $u_j = 1$.

Step 3: Update the weight vector $w_i$ by

$$w_i = w_i + \Delta w_i \qquad (3a)$$

$$\Delta w_i = \begin{cases} a_c(t)(x - w_i), & \text{if } u_i = 1 \\ -a_r(t)(x - w_i), & \text{if } u_i = -1 \\ 0, & \text{otherwise} \end{cases} \qquad (3b)$$

where $0 \le a_c(t), a_r(t) \le 1$ are the learning rates for the winner and its rival unit, respectively; usually $a_r(t) \ll a_c(t)$ at each iteration step t.

Step 4: Increase t by 1. If t < t_max, go to Step 2; otherwise, stop the algorithm.

3.2 CRPCL Text Clustering Based on LSA

CRPCL can be applied to text clustering after LSA dimensionality reduction, forming an efficient text clustering method with the following steps:

Step 1: Preprocess the texts to extract word-frequency feature vectors.
Step 2: Convert the text feature matrix to TF.IDF and normalize it.
Step 3: Reduce the text feature matrix by LSA.
Step 4: Use the CRPCL algorithm to cluster the texts.
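The CRPCL clustering algorithm above can be sketched in Python as follows (a minimal sketch: the learning rates, iteration count, and initialization are illustrative assumptions, not the authors' settings):

```python
import numpy as np

def crpcl(X, k, t_max=1000, a_c=0.05, a_r=0.002, rng=None):
    """Conscientious RPCL: conscience-weighted winner/rival competitive learning."""
    if rng is None:
        rng = np.random.default_rng(0)
    w = X[rng.choice(len(X), k, replace=False)].astype(float)  # Step 1: init
    n = np.ones(k)                        # win counts (conscience statistics)
    for _ in range(t_max):
        x = X[rng.integers(len(X))]       # Step 2: random sample
        gamma = n / n.sum()               # conscientious threshold gamma_j
        d = gamma * np.sum((x - w) ** 2, axis=1)
        c = int(np.argmin(d))             # winner
        d[c] = np.inf
        r = int(np.argmin(d))             # rival (second winner)
        w[c] += a_c * (x - w[c])          # Step 3: attract winner toward x ...
        w[r] -= a_r * (x - w[r])          # ... and push the rival away
        n[c] += 1
    return w
```

Units that win too often accumulate a large conscience weight, handicapping them in later competitions, while the rival penalty drives surplus units away from the data, so the effective cluster number can emerge automatically.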
4 Experiment Result and Analysis

To verify the effectiveness of the algorithm, we test on a text data set obtained from the HaiLiang company, containing 5 classes, 486 texts, and 2722 terms. The clustering quality is measured by AA (Averaged Accuracy) [9], i.e., the effect is evaluated by checking whether the membership relations between any two texts are consistent.

4.1 Experiment Process
k-means and CRPCL clustering are applied to the text data set; CRPCL clustering is run both with and without dimensionality reduction. The results are shown in Table 1.

Table 1. Experiment results

                     k-means    CRPCL    CRPCL+LSA
Dimension            2722       2722     120
Averaged Accuracy    63.2       81.3     92.1
4.2 Results Analysis

After analyzing and comparing the experimental results, we can conclude:

1) The result of CRPCL is more stable and better than that of k-means.
2) The nearer the initially defined cluster number is set to the actual value, the higher the clustering accuracy.
3) After using LSA to reduce dimensionality, the dimension of the text vector space is decreased greatly, the computational expense is also reduced, and the clustering accuracy is increased.
4) LSA is an effective dimensionality reduction method: it reduces the "noise" in the original text-term matrix to reveal the semantic relation between texts and terms, and it reduces the text-term vector space and the computational expense, improving the effectiveness of text clustering.
5 Conclusions

Experiments demonstrate that the RPCL algorithm for text clustering can, to some extent, cope with determining the number of clusters. Combining CRPCL and LSA yields rather sound clustering results.

1) The choice of the competitive learning rate of the winner unit and the rate of its rival may affect the stability and speed of clustering. Generally, the learning rate can be chosen in two ways: a) choose a rather small fixed value; b) start with a large value, then decrease it gradually toward 0.
2) The conscientious threshold mechanism makes every neural unit win the competition at near the ideal probability 1/n.
3) In text clustering, how to reasonably choose text features and handle texts effectively is a problem worth researching, because the relevance between texts is rather complicated and text features are high-dimensional.
Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. 60275020) and the Shanghai Municipal Education Commission Scientific Research Fund (04FB22).
References

1. Steinbach, M., Karypis, G., Kumar, V.: A Comparison of Document Clustering Techniques. KDD-2000 Workshop on Text Mining, Boston MA USA (2000) 109–110
2. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, USA (2001)
3. Xu, L., Krzyzak, A., Oja, E.: Rival Penalized Competitive Learning for Clustering Analysis, RBF Net, and Curve Detection. IEEE Transactions on Neural Networks 4(4) (1993) 636–649
4. Xu, L., Krzyzak, A., Oja, E.: Unsupervised and Supervised Classification by Rival Penalized Competitive Learning. In Proc. 11th International Conference on Pattern Recognition, The Hague The Netherlands (1992) 492–496
5. Dumais, S., Furnas, G., Landauer, T., Deerwester, S., Harshman, R.: Using Latent Semantic Analysis to Improve Access to Textual Information. In Proceedings of Conference on Human Factors in Computing Systems CHI'88, Washington DC USA (1988) 281–285
6. Wang, Z., Zhu, T.: An Efficient Learning Algorithm for Improving Generalization Performance of Radial Basis Function Neural Networks. Neural Networks 13 (2000) 545–553
7. King, I., Lau, T.: Non-Hierarchical Clustering with Rival Penalized Competitive Learning for Information Retrieval. In Proc. of the First International Workshop on Machine Learning and Data Mining in Pattern Recognition, Leipzig Germany (1999) 116–130
8. Theodoris, S., Koutroumbas, K.: Pattern Recognition. 2nd edn. Academic Press, USA (2003)
9. Liu, S., Dong, M., Zhang, H., et al.: An Approach of Multi-hierarchy Text Classification Based on Vector Space Model. Journal of Chinese Information Processing 3(16) (2002) 8–14
Self-Organizing-Map-Based Metamodeling for Massive Text Data Exploration

Kin Keung Lai1,2, Lean Yu2,3, Ligang Zhou2, and Shouyang Wang1,3
1 College of Business Administration, Hunan University, Changsha 410082, China
2 Department of Management Sciences, City University of Hong Kong, Hong Kong
{msyulean, mskklai}@cityu.edu.hk
3 Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100080, China
{yulean, sywang}@amss.ac.cn
Abstract. In this study, we describe the use of the self-organizing map (SOM) as a metamodeling technique to design a parallel text data exploration system. First, the large textual collection is divided into several small data subsets. Based on the different subsets, different unitary SOM models, i.e., base models, are then trained to produce word clustering maps. In this phase, the different SOM models are run in parallel for greater computational efficiency. Finally, a SOM-based metamodel is produced that formulates a text category map by learning from all base models. For illustration, the proposed metamodel is applied to a massive text data collection.
1 Introduction

With the rapid growth of textual collections, there is strong demand for text exploration tools that help people find interesting relationships in vast quantities of text. In the last decades, many statistical and learning techniques have been used to explore textual relationships from training examples. Most methods adopt a supervised learning paradigm that requires many pre-labeled exemplars. However, in many text exploration applications, little or no prior information about pre-labeled samples is available, and it is difficult to formulate a collection of pre-defined samples for a massive text corpus.

In view of the drawbacks of supervised learning, more and more textual data explorations rely on unsupervised learning techniques such as the self-organizing map (SOM) [1], and several studies build on it. For example, Kaski et al. and Kohonen et al. [2-3] developed the WEBSOM system for textual data exploration. Recently, Lee and Yang [4] used the SOM to mine and explore multilingual text documents. Interested readers can refer to [1] for more details about the SOM.

Although many research papers on textual data exploration with the SOM have appeared over the years, one important challenge remains in applying the SOM to text exploration: the textual data may be very large, given the rapid growth in the number of text documents. This makes running a single SOM on a single platform prohibitive due to the capacity limitation of cache memory. To overcome this limitation, a SOM-based metamodeling technique is proposed. The

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1261 – 1266, 2006. © Springer-Verlag Berlin Heidelberg 2006
generic idea is first to use SOM algorithms to extract knowledge from several textual data subsets and then to combine the knowledge of the unitary SOMs into unified knowledge that represents the entire data well. The main motivation of this study is to take full advantage of the parallelism of metamodeling to design a powerful textual data exploration system for large-scale textual collections.

The rest of this study is organized as follows. In the next section, a triple-phase SOM-based metamodeling process for massive textual data exploration is described in detail. For verification, a practical textual exploration experiment is reported in Section 3. Finally, some conclusions are drawn in Section 4.

2 The SOM-Based Metamodeling Process

While the SOM has proven to be a very suitable and efficient tool for detecting structure and relationships in high-dimensional data and organizing it accordingly on a two-dimensional output space, some drawbacks have emerged, including its inefficiency for large-scale textual data collections. To overcome this limitation, a SOM-based metamodel is proposed for massive textual data exploration. The main aim of the SOM metamodel is to improve computational efficiency in a parallel way. For illustration, we present a generic framework of the SOM-based metamodel, in which a general view of the overall text data is obtained on Site Z, starting from the original textual data set DS stored on Site A, as shown in Fig. 1.
2 The SOM-Based Metamodeling Process While the SOM has proven to be a very suitable and efficient tool for detecting structure and relationships in high-dimensional data and organizing it accordingly on a two-dimensional output space, some drawbacks have to be emerged. These include its inefficiency for large scale textual data collection. To overcome the limitation, a SOM-based metamodel is proposed for massive textual data exploration. Generally, the main aims of the SOM metamodel are to improve computational efficiency in a parallel way. For illustration purpose, we present a generic framework of the SOMbased metamodel, in which a general view about overall text data is obtained on Site Z, starting from the original textual data set DS stored on Site A, as shown in Fig. 1.
Fig. 1. The generic framework of the SOM-based metamodeling process
As can be seen from Fig. 1, the generic metamodeling process consists of three phases, which can be described as follows.

Phase 1: on Site A, the original textual data set DS is partitioned into relatively small training subsets TR1, TR2, …, TRn via a partitioning algorithm. Then TR1, TR2, …, TRn are moved from Site A to Site 1, Site 2, …, Site n.
Phase 2: on each Site i (i = 1, 2, …, n), a SOM model Mi, possibly with a different topology, is deployed and trained with TRi. The output of each Mi (i.e., the partial clustering result of its subset) is then moved from Site i to Site Z.

Phase 3: on Site Z, the outputs of the models M1, M2, …, Mn form a new training set, the meta-learning set ML. Another SOM model is trained on this set, and thus the global clustering result is obtained.

Within this generic framework, three main problems need to be addressed: (a) how to divide a large-scale data set DS into relatively small subsets TRi, i.e., data partition; (b) how to deploy individual models Mi to learn each subset in parallel, i.e., parallel learning; and (c) how to integrate the different results produced by the unitary SOM models Mi into a global solution to the specified task, i.e., model integration.

Data partition is the most important step in designing a metamodel. It is necessary and crucial for many reasons, most importantly to determine whether the training set at hand is functionally or structurally divisible into distinct subsets. The main aims of data partition are to reduce the computational load and complexity and to improve computational efficiency and scalability. In this study, a random partition method based on the text data size is adopted, because no prior information about the text collection is available. After data partition, each subset is assigned to a SOM model for learning. In the metamodeling process, the same or different learning algorithms can be deployed at each site, depending on the goals and strategies; in this study, the SOM algorithm is used for textual data exploration on the different subsets.
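The random partition of Phase 1 can be sketched as follows (an illustrative sketch; `data` is assumed to be a NumPy array of feature vectors, one row per document):

```python
import numpy as np

def random_partition(data, n_sites, rng=None):
    """Phase 1: randomly split the data into n roughly equal training subsets."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.permutation(len(data))                       # shuffle row indices
    return [data[chunk] for chunk in np.array_split(idx, n_sites)]
```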
Basically, the textual data analysis and exploration consists of four basic steps: text document preprocessing, text document encoding, vector dimension reduction, and text vector learning and clustering, described as follows. In preprocessing, non-textual information such as ASCII drawings and automatically included signatures is first removed; numerical expressions and special codes are treated with heuristic rules, and punctuation is also removed. Then a bag-of-words representation is used. Finally, stemming and stop-word removal are applied to reduce the word dimensions. In text encoding, a simple method, term frequency (TF), is used to generate the text feature vectors. For vector dimension reduction, we adopt latent semantic indexing (LSI) [5]. In the final step, the SOM algorithm [1] is used to train the text feature vectors so as to find the relationships in the textual data.

When the results of the unitary SOM models of the second phase are output, it is necessary to fuse these individual results into a global result so as to form a general view of the overall textual data set. Model integration is another crucial decision in a metamodeling system. In our study, we adopt a self-organizing strategy to integrate the outputs of the individual models using another independent SOM model. In this approach, the outputs of the individual models form a new learning set, the 'meta-learning (ML) set', which serves as the input data of another SOM model, thus generating a new general clustering result through the integration of the different modules. That is, the small textual data maps (i.e., word maps) are remapped into a general large map (the text category map), forming a general view of the overall textual data set.
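The term-frequency encoding step can be sketched as follows (the tokenizer and the tiny stop-word list are illustrative assumptions, not the system's actual preprocessing rules):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}   # toy stop list

def tf_vector(text, vocab):
    """Normalized term-frequency feature vector over a fixed vocabulary."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    counts = Counter(tokens)
    total = max(sum(counts.values()), 1)    # avoid division by zero
    return [counts[w] / total for w in vocab]
```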
3 Experiment Study

In this study, a textual collection about crude oil price volatility is used as the experimental data set. The collection contains 3352 text documents, kept on a server of the Key Laboratory of Management, Decision and Information Systems (MADIS) of the Chinese Academy of Sciences (CAS) and used for oil price forecasting. Some of the related web text documents can be obtained from the website http://oil.iss.ac.cn/. For the convenience of the experiment, the text documents are preprocessed according to the steps indicated in Section 2, yielding 22,419 words for this experiment. The main goal of the experiment is to explore valuable information; therefore the word clustering map and the text category map are emphasized, and the document map is omitted.

To run the SOM-based metamodeling system, a parallel computing environment is constructed using the PC computers of MADIS in CAS. MADIS has a total of 25 PCs with Windows XP on their office floor. Among them, we select one server and 10 PCs with high system configurations to construct a PC-based parallel SOM system, where all computers are on 100 Mbps Ethernet. The detailed system configuration is shown in Fig. 2.
Fig. 2. The system configuration of the PC-based parallel SOM system
In addition, all results reported in this section use exponentially decreasing schedules for the learning rate η and the width parameter σ of the neighborhood function:

$$\eta(k_e) = \eta_0 (\eta_f / \eta_0)^{k_e / K_e}, \qquad \sigma(k_e) = \sigma_0 (\sigma_f / \sigma_0)^{k_e / K_e} \qquad (1)$$

where $k_e$ is the current epoch and $K_e$ is the total number of epochs, with $\eta_0 = 0.1$, $\eta_f = 0.05\,\eta_0$, $\sigma_0 = N^{0.5}$ (where N is the number of input records), and $\sigma_f = 0.2$. Note that $\eta(k_e)$ and $\sigma(k_e)$ are held constant within epoch $k_e$, and all simulations used 1000 epochs.
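The exponentially decreasing schedule above can be written as a small helper and applied to both the learning rate and the neighborhood width (names are illustrative):

```python
def exp_decay(k_e, K_e, v0, vf):
    """Exponential schedule v(k_e) = v0 * (vf / v0) ** (k_e / K_e)."""
    return v0 * (vf / v0) ** (k_e / K_e)

# learning rate eta with eta_0 = 0.1 and eta_f = 0.05 * eta_0, over 1000 epochs
eta_start = exp_decay(0, 1000, 0.1, 0.005)
eta_end = exp_decay(1000, 1000, 0.1, 0.005)
```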
Fig. 3. A general text category extracted from the SOM-based metamodel
To assess the efficiency of the proposed SOM metamodeling system, a single textual data analysis without any data partition is also performed on an individual computer with a unitary SOM algorithm. In the single-server environment (i.e., the single textual data analysis without data partition is performed on the server only), the total computation took 4 hours and 8 minutes. The breakdown is as follows: 2 minutes for the original data partition, 3 hours and 50 minutes for individual SOM module learning, and 16 minutes for module fusion. This text data analysis task can of course also be finished in a single-PC environment, but the total computation time is then 5 hours and 41 minutes; the difference arises because the performance and configuration of the server are better than those of the PC.

To speed up the textual data analysis and exploration task, we can parallelize the second procedure, because there are 3352 text documents and thousands of learning iterations. The textual data collections are assigned to ten office PCs, and the SOM algorithm is then used to train the different training subsets for word clustering. The parallel procedures run on different PCs, as illustrated in Fig. 2. The speed-up obtained by our parallel modular SOM system is summarized in Fig. 4. The 3352 text documents and thousands of learning iterations, which took 4 hours and 8 minutes in the single-server environment indicated above, now take only 41 minutes,
Fig. 4. The speed-up results for the PC-based modular SOM system
i.e., the speed-up factor is over 6. If the office PCs are replaced by more advanced PCs with high-performance CPUs and memory, the speed-up factor can exceed 8, as shown in Fig. 4. This implies that the proposed SOM-based metamodeling system is very efficient. From Fig. 4, it is not hard to see that the speed-up factor becomes larger as the number of PCs increases. However, the speed-up factor is not unlimited: according to Amdahl's law [6] (i.e., (total computation time) / (non-parallelizable computation time)), the upper bound on the speed-up factor is 13.78 (= 248 minutes / 18 minutes) for the single-server environment.
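The Amdahl-style bound quoted above follows directly from the measured times, with data partition (2 min) and module fusion (16 min) as the serial part:

```python
def amdahl_bound(total_minutes, serial_minutes):
    """Upper bound on speed-up when only the serial part cannot be parallelized."""
    return total_minutes / serial_minutes

# 248 total minutes, of which 2 + 16 = 18 minutes are non-parallelizable
bound = amdahl_bound(248, 18)
```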
4 Conclusions

In this study, we have developed a triple-phase SOM-based metamodeling system for large-scale textual data exploration. Through a practical textual data experiment, we obtained good clustering results and at the same time demonstrated the computational efficiency and scalability of the metamodeling system for massive textual collections. These advantages imply that the proposed SOM-based metamodeling system can provide an effective solution to large-scale text data exploration.
Acknowledgements This work is partially supported by National Natural Science Foundation of China (NSFC No. 70221001); Chinese Academy of Sciences and Strategic Research Grant of City University of Hong Kong (SRG No. 7001806).
References 1. Kohonen, T.: Self-Organizing Maps. Springer-Verlag, Berlin (1995) 2. Kaski, S., Honkela, T., Lagus, K., Kohonen, T.: WEBSOM — Self-organizing Maps of Document Collections. Neurocomputing 21 (1998) 101-117 3. Kohonen, T., Kaski, S., Lagus, K., Salojarvi, J., Honkela, J., Paatero, V., Saarela, A.: Self Organization of a Massive Document Collection. IEEE Transactions on Neural Networks 11 (2000) 574-585 4. Lee, C.H., Yang, H.C.: A Multilingual Text Mining Approach Based on Self-Organizing Maps. Applied Intelligence 18 (2003) 295-310 5. Deerwester, S., Dumais, S., Furnas, G., Landauer, K.: Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science 41 (1990) 391-407 6. Kleinrock, L., Huang, J.H.: On Parallel Processing Systems: Amdahl’s Law Generalized and Some Results on Optimal Design. IEEE Transactions on Software Engineering 18 (1992) 434-447
Ensemble Learning for Keyphrases Extraction from Scientific Document Jiabing Wang1, Hong Peng1, Jing-song Hu1, and Jun Zhang2 1
School of Computer Science and Engineering, South China University of Technology, Guangzhou 510641, China 2 School of Information Science, Guangdong Commerce College, Guangzhou 510320, China [email protected]
Abstract. Keyphrase extraction is a task with many applications in information retrieval, text mining, and natural language processing. In this paper, a keyphrase extraction approach based on a neural network ensemble is proposed. To determine whether a phrase is a keyphrase, the following features of a phrase in a given document are adopted: its term frequency, whether it appears in the title, abstract, or headings (subheadings), and the frequency with which it appears across the paragraphs of the given document. The approach is evaluated with the standard information retrieval metrics of precision and recall. Experimental results show that ensemble learning can significantly increase both precision and recall.
1 Introduction

With the explosive growth of online information, it has become easy to find sources of information; at the same time, the size of the query result sets returned by web search engines and academic databases is often overwhelming, commonly hundreds or thousands of documents. It is infeasible for users to read each document of the result set to determine whether or not it might be useful. Therefore, document surrogates, such as the title, abstract, and keywords, become very important for users when deciding whether to read a whole document. However, not all documents contain keyphrases: even in collections of scientific papers, those with keyphrases are in the minority [10], to say nothing of web pages. Consequently, many automatic keyphrase extraction algorithms have been proposed based on different techniques [1, 5, 7, 10, 13-15], especially on machine learning. A popular method for creating an accurate classifier from a set of training data, called ensemble learning [3-4, 6, 9, 11-12, 16], is to construct several different base classifiers and then combine their predictions. It has been shown in many domains that an ensemble is often more accurate than any of the single classifiers it contains. In this paper, we propose a keyphrase extraction approach based on a neural network ensemble. Experimental results show that neural network ensembles, both Bagging and AdaBoost, can significantly increase precision and recall. The rest of the paper is organized as follows. In Section 2, we briefly review the neural network ensemble and define the base neural network model for keyphrase

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1267 – 1272, 2006. © Springer-Verlag Berlin Heidelberg 2006
1268
J. Wang et al.
extraction. In Section 3, we introduce the dataset used in this research. In Section 4, we apply the single neural network model and the Bagging and AdaBoost algorithms to keyphrase extraction and present experimental results. Section 5 concludes the paper with some comments.
2 The Base Neural Network

A neural network ensemble, which originated from Hansen and Salamon's work [9], is a learning paradigm in which a collection of a finite number of neural networks (we call these the base neural networks) is trained for the same task. It has been shown theoretically and experimentally that the generalization ability of a neural network system can be significantly improved by ensembling a number of neural networks [3-4, 6, 9, 11-12, 16], i.e., training many base neural networks and then combining their predictions. In general, a neural network ensemble is constructed in two steps: training a number of base neural networks and then combining the base predictions. For training the base neural networks, there are two categories of methods. In the first category, the base neural networks are trained using a sample-selection strategy, i.e., each base neural network is trained on different sample data. In this category, the most prevalent approaches are Bagging [3] and Boosting [6, 11]. In the second category, the base neural networks are trained using a feature-subset-selection strategy [12], i.e., each base neural network is trained on a different feature subset. In this research, we only discuss the Bagging and Boosting methods. Before constructing the base neural network and the neural network ensemble, the features (or attributes) of a phrase used to judge whether it is a keyphrase must be defined. We made the following considerations and hypotheses when deciding which features to adopt: TF (term frequency) provides one measure of how well a term describes the document contents; the title, abstract, and headings (subheadings) are the most common summaries in documents and usually describe the main issues discussed in the document; and if a phrase recurs in different paragraphs of a document, then the phrase may be an important concept in the document.
Based on the above discussion, we use the following five features to determine whether a phrase is a keyphrase:

TF: the normalized frequency of the phrase in the given document. For a phrase i and a document d, let n_i be the number of times the phrase i occurs in the document d; then the TF of i is computed by formula (1):

    TF = n_i / max_{l∈d} n_l .                                    (1)
PDF: PDF is a structural measure feature. For a phrase i and a document d, let m_i be the number of paragraphs (of the document d) in which the phrase i appears; then the PDF of i is computed by formula (2):

    PDF = m_i / max_{l∈d} m_l .                                   (2)
TI: existence of the phrase in the title of a given document. If the phrase appears in the title of a given document, TI is equal to 1; otherwise 0.
AB: existence of the phrase in the abstract of a given document. If the phrase appears in the abstract of a given document, AB is equal to 1; otherwise 0. HS: existence of the phrase in headings (or subheadings) of a given document. If the phrase appears in the headings (or subheadings) of a given document, HS is equal to 1; otherwise 0.
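For concreteness, the five features can be computed with elementary text processing. The sketch below is an illustration, not the authors' implementation: the naive substring matching, the `doc` structure, and the helper names are assumptions.

```python
# Hypothetical sketch of the five phrase features (TF, PDF, TI, AB, HS).
# Assumes a document pre-segmented into title/abstract/headings/paragraphs.

def phrase_counts(text, phrases):
    """Count occurrences of each candidate phrase (naive substring matching)."""
    return {p: text.lower().count(p.lower()) for p in phrases}

def features(phrase, doc, phrases):
    counts = phrase_counts(doc["body"], phrases)
    max_n = max(counts.values()) or 1
    tf = counts[phrase] / max_n                          # formula (1)

    para_freq = {p: sum(p.lower() in para.lower() for para in doc["paragraphs"])
                 for p in phrases}
    max_m = max(para_freq.values()) or 1
    pdf = para_freq[phrase] / max_m                      # formula (2)

    ti = int(phrase.lower() in doc["title"].lower())     # in title?
    ab = int(phrase.lower() in doc["abstract"].lower())  # in abstract?
    hs = int(any(phrase.lower() in h.lower() for h in doc["headings"]))
    return [tf, pdf, ti, ab, hs]

doc = {
    "title": "Ensemble learning for keyphrase extraction",
    "abstract": "We study keyphrase extraction with neural network ensembles.",
    "headings": ["Introduction", "The base neural network"],
    "paragraphs": ["Keyphrase extraction is useful.",
                   "Neural network ensembles improve keyphrase extraction."],
    "body": "Keyphrase extraction is useful. "
            "Neural network ensembles improve keyphrase extraction.",
}
print(features("keyphrase extraction", doc,
               ["keyphrase extraction", "neural network"]))
# → [1.0, 1.0, 1, 1, 0]
```

The resulting five-element vector is exactly the input layer of the base network described next.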
Fig. 1. The base neural network for keyphrase extraction
Therefore, we have the base neural network model for keyphrase extraction as shown in Figure 1. In the input layer, there are five neurons corresponding to a phrase’s five features in a given document. In the output layer, there is only one neuron corresponding to whether or not a phrase is a keyphrase.
3 Dataset

Because datasets satisfying the requirements of this research are unavailable in the popular repositories of machine learning databases, e.g., the UCI repository [2], we constructed a dataset ourselves. We selected ten papers from academic databases. To obtain the feature matrix of each document, we manually extracted a set of candidate phrases from each document. For each document, every sequence of one to four words (for English documents) or two to six words (for Chinese documents) is identified. Stems that are equivalent are merged into a single canonical form. Many phrases are discarded at this stage, including those that begin or end with a stopword, those that consist only of a proper noun, etc. After this processing, 572 phrases from the ten papers remained, among which 40 phrases are keyphrases assigned by the authors. In order to conduct an objective evaluation, we randomly selected five of the ten papers as the training set and used the remaining five papers as the test set. After repeating this process 10 times, we obtained the datasets shown in Table 1. In Table 1, Pos denotes the number of phrases with class label +1, i.e., the number of keyphrases assigned by the authors, and Neg denotes the number of phrases with class label -1, i.e., the number of phrases that are not keyphrases.
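The candidate-generation step described above (every sequence of one to four words, with candidates that begin or end with a stopword discarded) can be sketched as follows; the stopword list and the punctuation-stripping tokenizer are illustrative assumptions, and stemming is omitted:

```python
# Candidate phrase generation: every sequence of 1-4 words, dropping
# candidates that begin or end with a stopword (illustrative stopword list).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "is", "and", "to"}

def candidates(text, max_len=4):
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    out = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            out.add(" ".join(gram))
    return out

phrases = candidates("Ensemble learning for keyphrase extraction")
print(sorted(phrases))
```

Note that interior stopwords are allowed ("learning for keyphrase" survives), while "for keyphrase" or "learning for" would be discarded.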
Table 1. The dataset summarization
            Training Set              Test Set
            Pos   Neg   Total         Pos   Neg   Total
Data 1       21   247    268           19   285    304
Data 2       22   278    300           18   254    272
Data 3       24   260    284           16   272    288
Data 4       20   303    323           20   229    249
Data 5       18   312    330           22   220    242
Data 6       20   285    305           20   247    267
Data 7       22   279    301           18   253    271
Data 8       21   242    263           19   290    309
Data 9       19   252    271           21   280    301
Data 10      19   285    304           21   247    268
4 Experiment Results

We compare the following three algorithms: a single neural network, Bagging [3], and AdaBoost [6, 11]. The three algorithms are implemented with the MATLAB neural network toolbox. In our neural network implementation, there are ten neurons in the hidden layer. The network is trained for up to 1000 epochs to a mean squared error goal of 0.00001 for the single neural network and Bagging, and for up to 10 epochs to a mean squared error goal of 0.01 for AdaBoost. The number of base neural networks is set to 5. The networks are trained with the Levenberg-Marquardt training algorithm [8]. There are two basic methods for evaluating automatically generated keyphrases [10]. The first, which we call objective assessment, adopts the standard information retrieval metrics of precision and recall to reflect how well the generated phrases match a set of "relevant" phrases. The second gathers subjective assessments from human readers. In this research, we adopt the first approach to evaluate the three algorithms. Objective assessment requires a set of "relevant" phrases against which extracted phrases can be compared; the keyphrases provided by the authors are usually used as the gold-standard set of keyphrases for a document. Performance is measured using precision and recall. For a given dataset, let N_{+1} be the set of phrases assigned class label +1 by an extraction algorithm, i.e., the phrases extracted as keyphrases, and N^a_{+1} the set of keyphrases assigned by the authors. Precision and recall are then computed by (3) and (4) respectively, where |·| denotes the number of elements in a set, i.e., its cardinality:
    Precision = |N_{+1} ∩ N^a_{+1}| / |N_{+1}| .                  (3)

    Recall = |N_{+1} ∩ N^a_{+1}| / |N^a_{+1}| .                   (4)
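Written over Python sets, (3) and (4) are direct to compute; the example phrases below are made up for illustration:

```python
# Precision (3) and recall (4) over sets:
# extracted = phrases the extractor labeled +1; gold = author keyphrases.
def precision_recall(extracted, gold):
    hits = len(extracted & gold)                  # |N_{+1} ∩ N^a_{+1}|
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

extracted = {"neural network", "ensemble learning", "text mining"}
gold = {"neural network", "ensemble learning", "keyphrase extraction", "bagging"}
print(precision_recall(extracted, gold))   # → (0.6666666666666666, 0.5)
```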
Table 2. Experiment results: precision (abbr. Pr), recall (abbr. Re), and accuracy (abbr. Ac)

           Single Neural Network (%)   Bagging (%)              AdaBoost (%)
           Pr      Re      Ac          Pr      Re      Ac       Pr      Re      Ac
Data 1     35.71   26.32   92.43       64.71   57.89   95.39    41.18   36.84   92.76
Data 2     45.83   61.11   92.65       50.00   55.56   93.38    52.38   61.11   93.75
Data 3      9.38   56.25   67.36       62.50   62.50   95.83    56.25   56.25   95.14
Data 4     78.57   55.00   95.18       61.90   65.00   93.98    54.17   65.00   92.77
Data 5     52.38   50.00   91.32       41.67   45.45   89.26    40.00   54.55   88.43
Data 6     40.63   65.00   90.26       50.00   50.00   92.51    45.83   55.00   91.76
Data 7     36.36   44.44   91.14       72.73   44.44   95.20    68.42   72.22   95.94
Data 8     21.05   21.05   90.29       22.22   63.16   84.14    45.00   47.37   93.20
Data 9     47.83   52.38   92.69       72.73   38.10   94.68    73.68   66.67   96.01
Data 10    80.00   38.10   94.40      100.00   42.86   95.52    84.62   52.38   95.52
Table 3. Experiment results: average precision, recall and accuracy

Single Neural Network (%)   Bagging (%)              AdaBoost (%)
Pr      Re      Ac          Pr      Re      Ac       Pr      Re      Ac
33.09   46.91   89.64       49.75   52.06   92.96    54.19   56.70   93.61
The experimental results are shown in Table 2 and Table 3. Unlike precision and recall, which emphasize the examples with class label +1, accuracy is an evaluation metric that reflects the classification accuracy over all examples in the test data, including both the examples with class label +1 and those with class label -1. In Table 3, the accuracy, precision, and recall are averaged over the 10 datasets in Table 1. From Table 3 we can see that both Bagging and AdaBoost significantly increase precision and recall compared with the single neural network. For Bagging, precision and recall increase by approximately 17 and 5 percentage points respectively; for AdaBoost, by approximately 21 and 10 percentage points respectively. Accuracy also increases, by 3 percentage points for Bagging and 4 percentage points for AdaBoost, compared with the single network.
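For reference, the Bagging procedure compared above, bootstrap resampling of the training set followed by a majority vote over the base learners, can be sketched in pure Python. This is not the MATLAB implementation used in the experiments; a trivial nearest-centroid classifier stands in for the Levenberg-Marquardt-trained networks, and all names and data are illustrative:

```python
import random

# Bagging sketch: bootstrap-resample the training set, train one base
# learner per replicate, and combine predictions by majority vote.
# A nearest-centroid classifier stands in for the base neural networks.

def train_centroid(samples):
    """Return per-class feature centroids."""
    sums, counts = {}, {}
    for x, y in samples:
        counts[y] = counts.get(y, 0) + 1
        sums[y] = [s + v for s, v in zip(sums.get(y, [0.0] * len(x)), x)]
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict_centroid(model, x):
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: dist(model[y]))

def bagging(train, n_models=5, seed=0):
    rng = random.Random(seed)
    models = [train_centroid([rng.choice(train) for _ in train])
              for _ in range(n_models)]
    def predict(x):
        votes = [predict_centroid(m, x) for m in models]
        return max(set(votes), key=votes.count)       # majority vote
    return predict

# Toy data: feature vector -> label (+1 keyphrase, -1 not).
train = [([0.9, 0.8], +1), ([0.8, 0.9], +1), ([0.1, 0.2], -1), ([0.2, 0.1], -1)]
ensemble = bagging(train)
print(ensemble([0.85, 0.85]), ensemble([0.15, 0.15]))
```

AdaBoost differs in that the replicates are reweighted toward previously misclassified examples and the vote is weighted by each learner's training error, but the two-step structure (train base learners, combine predictions) is the same.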
5 Conclusions

In this paper we proposed a keyphrase extraction approach based on a neural network ensemble. To determine whether or not a phrase is a keyphrase, five features of a phrase are adopted: the term frequency TF; whether it appears in the title, abstract, or headings (subheadings); and a structural feature, i.e., the distribution of the phrase over the paragraphs of a given document. Two ensembles are implemented, based on the Bagging and AdaBoost algorithms respectively. The approach is evaluated with the standard information retrieval metrics of precision and recall. Experimental results show that ensemble learning for keyphrase extraction can significantly increase precision and recall compared with a single neural network.
Acknowledgements

This work was supported partially by the Natural Science Foundation of China under Grants 30230350 and 60574078, and by the Natural Science Youth Foundation of South China University of Technology.
References 1. Barker, K., Cornacchia, N.: Using Noun Phrase Heads to Extract Document Keyphrases. In: Hamilton, H., Yang, Q. (eds.): Canadian AI 2000. Lecture Notes in Artificial Intelligence, Vol. 1822. Springer-Verlag, Berlin Heidelberg (2000) 40 – 52 2. Blake, C., Keogh, E., Merz, C.J.: UCI Repository of Machine Learning Databases. Department of Information and Computer Science, University of California, Irvine, CA, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.htm 3. Breiman, L.: Bagging Predictors. Machine Learning 24 (2) (1996) 123 – 140 4. Dietterich, T.G.: An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Machine Learning 40 (2000) 139–157 5. Freitag, D.: Machine Learning for Information Extraction in Informal Domains. Machine Learning 39 (2000) 169 – 202 6. Freund, Y., Schapire, R.E.: A Decision-Theoretic Generalization of On-line Learning and An Application to Boosting. Journal of Computer and System Sciences 55(1) (1997) 119 – 139 7. HaCohen-Kerner, Y., Gross, Z., Masa, A.: Automatic Extraction and Learning of Keyphrases from Scientific Articles. In: Gelbukh, A. (ed.): CICLing 2005. Lecture Notes in Computer Science, Vol. 3406. Springer-Verlag, Berlin Heidelberg (2005) 657 – 669 8. Hagan, M.T., Menhaj, M.: Training Feed-forward Networks with the Marquardt Algorithm. IEEE Transactions on Neural Networks 5(6) (1994) 989 – 993 9. Hansen, L., Salamon, K.P.: Neural Network Ensembles. IEEE Trans. on Pattern Analysis and Machine intelligence 12(10) (1990) 993 – 1001 10. Jones, S., Paynter, G.W.: Automatic Extraction of Document Keyphrases for Use in Digital Libraries: Evaluation and Applications. Journal of the American Society for Information Science and Technology 53(8) (2002) 653 – 677 11. Schapire, R.E., Singer, Y.: Improved Boosting Algorithms Using Confidence-Rated Predictions. Machine Learning 37 (1999) 297 – 336 12. 
Tsymbal, A., Pechenizkiy, M., Cunningham, P.: Diversity in Search Strategies for Ensemble Feature Selection. Information Fusion 6 (2005) 83 – 98 13. Turney, P.D.: Learning Algorithms for Keyphrase Extraction. Information Retrieval 2(4) (2000) 303 – 336 14. Wang, J.B., Peng, H., Hu, J.-S.: Automatic Keyphrases Extraction from Document Using Backpropagation. In: Proceedings of 2005 International Conference on Machine Learning and Cybernetics. IEEE Press, New York (2005) 3770 – 3774 15. Wang, J.B., Peng, H.: Keyphrases Extraction from Web Document by the Least Squares Support Vector Machine. In: Skowron, A., Agrawal, R., Luck, M., et al (eds.): Proceedings of 2005 IEEE/WIC/ACM International Conference on Web Intelligence. IEEE Computer Society Press, Los Alamitos, CA (2005) 293 – 296 16. Zhou, Z.H., Wu, J.X., Tang, W.: Ensembling Neural Networks: Many Could Be Better Than All. Artificial Intelligence 137 (2002) 239 – 263
Grid-Based Fuzzy Support Vector Data Description Yugang Fan1,2, Ping Li1, and Zhihuan Song1 1
Institute of Industrial Process Control, Zhejiang University, Hangzhou 310027, China 2 Kunming University of Science and Technology, Kunming 650051, China {pli, zhsong, ygfan}@iipc.zju.edu.cn
Abstract. Support Vector Data Description (SVDD) concerns the characterization of a data set. A good description covers all target data but includes no superfluous space, and the resulting boundary of the data set can be used to detect outliers. However, SVDD is affected by noise during training. In this paper, Grid-based Fuzzy Support Vector Data Description (G-FSVDD) is presented to deal with this problem. G-FSVDD reduces the effects of noise with a new fuzzy membership model based on grids, where each grid is a hypercube in the data space. After obtaining enough grids, the Apriori algorithm is used to find the grids with high density. In G-FSVDD, different training data make different contributions to the domain description according to their density. The advantage of G-FSVDD is shown in an experiment.
1 Introduction

In data domain description, the aim is to give a description of a data set; this description should cover all (or most of) the target data but include no superfluous space. Data domain description can obviously be used for outlier detection, and different methods for it have been developed. Recently, a new method for data domain description that does not have to make a probability density estimate, called support vector domain description (SVDD), was presented by Tax and Duin [1, 2]. It obtains a spherically shaped boundary around the data set, called a hypersphere. However, SVDD is affected by noise during training. FSVDD, inspired by fuzzy support vector machines (FSVMs) [3], was reformulated from SVDD [4]. The previous work on FSVDD did not address how to set the fuzzy membership from the data points; FSVDD is still affected by noise because the fuzzy membership in [4] is only a function of time. To deal with this problem, Grid-based Fuzzy Support Vector Data Description (G-FSVDD) is presented in this paper. G-FSVDD reduces the effects of noise with a new fuzzy membership model based on grids. A fuzzy membership is associated with each data point, so that different data points can have different effects on the learning of the hypersphere: we can treat the noise points as less important and give them lower fuzzy memberships. The key idea of the approach is to group the data points into a dense region, which is surrounded by a sparse region. We divide the data space into many grids at different scales a number of times; each grid is a hypercube. After obtaining enough grids, the Apriori algorithm is used to find the grids with high density [5]. Unlike in density-based approaches to data domain description, here

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1273 – 1279, 2006. © Springer-Verlag Berlin Heidelberg 2006
1274
Y. Fan, P. Li, and Z. Song
the density notion has only a relative meaning rather than an absolute one. After finding the dense grids, the fuzzy membership values can be set. The approach proposed in this paper is related to the recent approach in [6], which inspired our study.
2 Fuzzy Support Vector Domain Description

In this section, we describe SVDD and FSVDD, and show the difference between them.

2.1 Architecture of SVDD

Suppose that there is a d-dimensional training data set containing l data points {x_1, x_2, …, x_l}. The aim is to enclose all (or most of) the data by a hypersphere of minimum volume, described by a center a ∈ R^d and a radius R. We wish to find the hypersphere ‖x − a‖² = R², defined by the pair (R, a), such that the inequality

    ‖x_i − a‖² ≤ R² + ξ_i ,   i = 1, …, l ,                       (1)

holds, where ξ_i ≥ 0 are slack variables for those data points that cannot be fitted into the optimal hypersphere. The optimal hypersphere is then the solution of the problem

    min    R² + C Σ_{i=1}^{l} ξ_i
    s.t.   ‖x_i − a‖² ≤ R² + ξ_i ,   i = 1, …, l ,                (2)
           ξ_i ≥ 0 ,   i = 1, …, l ,

where C is a constant, called the penalty factor. The parameter C can be regarded as a regularization parameter: it trades the volume of the hypersphere off against the errors, with the nonnegative slack variable ξ_i used as a measure of error. Searching for the optimal hypersphere in (2) is a quadratic programming (QP) problem, which can be transformed into the dual

    max    Σ_{i=1}^{l} α_i (x_i · x_i) − Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j (x_i · x_j)
    s.t.   0 ≤ α_i ≤ C ,   i = 1, …, l ,                           (3)
           Σ_{i=1}^{l} α_i = 1 ,

where α = (α_1, …, α_l) is the vector of nonnegative Lagrange multipliers associated with the constraints (1). According to the Kuhn-Tucker theorem, the solution α_i of problem (3) satisfies the equalities
    α_i ( R² + ξ_i − ‖x_i − a‖² ) = 0 ,
    β_i ξ_i = 0 ,                                                  (4)
where β_i is a nonnegative Lagrange multiplier. From these equalities it follows that the only non-zero values α_i in (4) are those for which the constraints (1) are satisfied with the equality sign; a point x_i with α_i > 0 is called a support vector. For the optimal hypersphere ‖x − a‖² = R² it follows that a = Σ_{i=1}^{l} α_i x_i. The radius of the hypersphere can be obtained by calculating the distance between a support vector lying on the surface of the hypersphere and the center of the hypersphere. Let x_0 be one of the support vectors for which 0 < α_i < C; then we have the decision function

    x · x − 2 Σ_{i=1}^{l} α_i (x · x_i) ≤ x_0 · x_0 − 2 Σ_{i=1}^{l} α_i (x_0 · x_i) .
2.2 Architecture of FSVDD

In SVDD, as is easily checked, every training data point is treated equally during the optimization. To tune the importance of the training data points, we can assign a fuzzy membership to each data point. Suppose we are given a set of labeled training points with associated fuzzy memberships (x_1, s_1), …, (x_l, s_l). Each training point is given a fuzzy membership σ ≤ s_i ≤ 1, i = 1, …, l, for a sufficiently small σ > 0, since a data point with s_i = 0 contributes nothing and can simply be removed from the training set without affecting the result of the optimization. Since the fuzzy membership s_i reflects the importance of the corresponding point x_i, and the parameter ξ_i can be viewed as a measure of error in SVDD, the term s_i ξ_i is a measure of error with different weighting. The optimal hypersphere problem then becomes

    min    R² + C Σ_{i=1}^{l} s_i ξ_i
    s.t.   ‖x_i − a‖² ≤ R² + ξ_i ,   i = 1, …, l ,                (5)
           ξ_i ≥ 0 ,   i = 1, …, l .

Note that a smaller s_i reduces the effect of the parameter ξ_i in problem (5). Problem (5) can be transformed into

    max    Σ_{i=1}^{l} α_i (x_i · x_i) − Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j (x_i · x_j)
    s.t.   0 ≤ α_i ≤ s_i C ,   i = 1, …, l ,                       (6)
           Σ_{i=1}^{l} α_i = 1 .
FSVDD is the same as SVDD if we set all s_i = 1. With different values of s_i we can control the contribution of each training point x_i in FSVDD: a smaller value of s_i makes the corresponding point x_i less important in the training. There is only one free parameter in SVDD, while the number of free parameters in FSVDD equals the number of training points.

2.3 Flexible Descriptions

In (3) and (6) we can replace the inner products by a kernel function K(·, ·) = Φ(·) · Φ(·). This means that the data are implicitly mapped to a new feature space by a mapping Φ(·); we only ever need K(·, ·) and never need to know the mapping Φ(·) explicitly. Throughout this paper we use the Gaussian kernel K(x, y) = exp(−‖x − y‖² / 2σ²). Problem (6) now becomes

    max    Σ_{i=1}^{l} α_i K(x_i, x_i) − Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j K(x_i, x_j)
    s.t.   0 ≤ α_i ≤ s_i C ,   i = 1, …, l ,                       (7)
           Σ_{i=1}^{l} α_i = 1 .

This gives α, which is used in the computation of the center a. Whether a point x lies within the hypersphere can now be checked against a support vector x_0 on the boundary (for the Gaussian kernel K(x, x) = 1, so the quadratic terms cancel):

    Σ_{i=1}^{l} α_i K(x, x_i) ≥ Σ_{i=1}^{l} α_i K(x_0, x_i) .
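Given the multipliers α from (7) and a boundary support vector x0, the membership test reduces to comparing two kernel sums. The sketch below assumes the α values have already been obtained from the QP; the support vectors and multipliers shown are made-up illustrative numbers, not solver output:

```python
import math

# Gaussian-kernel membership test for (F)SVDD: a point x is inside the
# hypersphere iff sum_i alpha_i K(x, x_i) >= sum_i alpha_i K(x0, x_i),
# where x0 is a support vector on the boundary (K(x, x) = 1 cancels).

def gauss_kernel(x, y, sigma=1.0):
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def inside(x, support_vectors, alphas, x0, sigma=1.0):
    score = lambda z: sum(a * gauss_kernel(z, sv, sigma)
                          for a, sv in zip(alphas, support_vectors))
    return score(x) >= score(x0)

# Illustrative (not solver-derived) support vectors and multipliers.
svs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
alphas = [0.4, 0.3, 0.3]           # sum to 1, as required by (7)
x0 = (1.0, 0.0)                    # assumed boundary support vector

print(inside((0.3, 0.3), svs, alphas, x0))   # point near the data → True
print(inside((4.0, 4.0), svs, alphas, x0))   # far-away point → False
```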
3 Fuzzy Membership Model

There are many methods for training FSVDD, depending on how much information is contained in the data set. If the data points are already associated with fuzzy memberships, we can use this information directly when training FSVDD. If a noise distribution model of the data set is given, we can set the fuzzy membership to the probability that a data point is not a noise point, or to a function of it. Since almost all applications lack this information, we need other methods to set the fuzzy membership s_i. Our method consists of the following sub-procedures:

A. Group the data points into a dense region, surrounded by a sparse region: (a) partition the data set to form a transaction set; (b) find frequent items with the Apriori algorithm, separating the data points in the dense region from those in the sparse region.
B. Set the fuzzy membership s_i.

3.1 Group Data Set into Xdense and Xsparse

Given a data set X, in each partition we divide each dimension into equidistant intervals at the same scale to form grids. This partition is performed a
number of times, T, at different scales, rather than once. Denote the resulting grids by Grid{(i1, …, id), t}, where ik is the interval index in the k-th dimension at the t-th partition, k = 1, 2, …, d; t = 1, 2, …, T. Having obtained these grids, each data point belongs to exactly one grid in any single partition, but to a number of grids over the T partitions. Starting from the first, large-scale partition, the subsequent partitions are performed repeatedly in two manners: translation and division. Translation forms different grids by shifting the positions of the initial dividing lines in each dimension; its aim is to recover the inherent proximity cut off by a single partition. Division adds new dividing lines inside each initial grid to obtain grids of smaller scale. Once enough grids have been obtained, the two operations stop. All points in a grid constitute a transactional item, and all transactional items form an item set, denoted by Ω. The number and the size of the grids affect the final result; how to divide the data space and form the grids is explained in [5][6]. After obtaining the item set, we apply the Apriori algorithm to obtain the frequent item set in Ω (details of the Apriori algorithm are given in [7]). The frequent item set corresponds to the dense grids, and the dense grids correspond to the dense region. We can then separate the data points in the dense region from those in the sparse region. Let Xdense denote the data points in the dense region and Xsparse the points in the sparse region; the points in Xsparse are possibly noise.

3.2 Fuzzy Membership

Let D denote Euclidean distance, and let D(x_i) be the shortest distance from a point x_i in Xsparse to the data set Xdense. Define Dmin = min{ D(x_i) | x_i ∈ Xsparse } and Dmax = max{ D(x_i) | x_i ∈ Xsparse }. The fuzzy membership is then defined as

    s_i = 1 ,                                                         if x_i ∈ Xdense,
    s_i = (1 − σ) · ( 1 − (D(x_i) − Dmin) / (Dmax − Dmin) )^k + σ ,   if x_i ∈ Xsparse,   (8)
where σ > 0 is sufficiently small and k > 0 is the parameter that controls the degree of the mapping function.
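The whole membership pipeline of Section 3 can be sketched compactly. For brevity, a single equidistant partition with a simple count threshold stands in for the multi-scale partitions and the Apriori frequent-item search, and the cell size, threshold, σ, and k values are illustrative assumptions:

```python
import math

# Sketch: one equidistant grid partition with a count threshold standing in
# for the multi-scale partitions + Apriori search, then fuzzy membership (8).
# Assumes at least one grid passes the threshold (Xdense is non-empty).

def grid_key(x, cell=1.0):
    return tuple(math.floor(v / cell) for v in x)

def split_dense_sparse(points, cell=1.0, min_count=3):
    counts = {}
    for x in points:
        counts[grid_key(x, cell)] = counts.get(grid_key(x, cell), 0) + 1
    dense = [x for x in points if counts[grid_key(x, cell)] >= min_count]
    sparse = [x for x in points if counts[grid_key(x, cell)] < min_count]
    return dense, sparse

def memberships(points, cell=1.0, min_count=3, sigma=0.05, k=2.0):
    dense, sparse = split_dense_sparse(points, cell, min_count)
    # D(x): shortest distance from a sparse point to the dense set.
    dists = {x: min(math.dist(x, y) for y in dense) for x in sparse}
    dmin = min(dists.values()) if dists else 0.0
    dmax = max(dists.values()) if dists else 0.0
    s = []
    for x in points:
        if x in sparse:
            frac = (dists[x] - dmin) / (dmax - dmin) if dmax > dmin else 1.0
            s.append((1.0 - sigma) * (1.0 - frac) ** k + sigma)  # formula (8)
        else:
            s.append(1.0)
    return s

pts = [(0.1, 0.1), (0.2, 0.3), (0.3, 0.2), (0.4, 0.4),   # dense cell
       (3.5, 3.5), (6.0, 0.5)]                            # isolated points
print([round(v, 3) for v in memberships(pts)])
# → [1.0, 1.0, 1.0, 1.0, 1.0, 0.05]
```

Note how (8) behaves at the extremes: the sparse point closest to Xdense (distance Dmin) receives membership 1, while the farthest (distance Dmax) receives only σ.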
4 Experiment

In this section the advantage of G-FSVDD is investigated. The data set contains 145 points drawn from an (artificially generated) banana-shaped distribution: the first 130 points are normal data and the last 15 points are noise, as shown in Fig. 1(a). In the simulation, the Gaussian kernel is used. We use two procedures to find the parameters, as described in the previous section. In the first procedure, we search for the kernel parameters and C using the original algorithms of SVDD and SVMs; for details see [2, 8]. In the second procedure, we determine the fuzzy membership for each point.
Fig. 1. Banana-shaped data set; (a) normal data and noise ('.' denotes noise; '+' denotes normal data); (b) Xdense and Xsparse (the dense region, denoted by '*', surrounded by the sparse region, denoted by '.')
Fig. 2. Data description trained on a banana-shaped data set; (a) result of SVDD; (b) result of G-FSVDD
To determine the fuzzy memberships, we group the data points into the dense region, which is surrounded by the sparse region. We first divided each dimension at a large scale; subsequent partitions were performed repeatedly in the two manners of translation and division, stopping when every grid formed in the last partition contained fewer than 3 points. After obtaining the item set, we ran the Apriori algorithm to obtain the frequent item set, adjusting the support and confidence degrees until about 80% of the points were selected. The frequent item set corresponds to the dense grids, and the selected points are the data points in the dense region, denoted by Xdense; the remaining data are Xsparse, as shown in Fig. 1(b). The fuzzy memberships were then determined by (8). Fig. 2 shows the results of our simulations. SVDD is affected by the noise in Fig. 2(a): the noise affects SVDD greatly, and it is difficult to obtain a good domain description. Fig. 2(b) shows the result of G-FSVDD, which gives a better description of the banana-shaped data set than SVDD, because most of the noise points in the sparse region are found beforehand and their effects are reduced. These results show that our algorithm can improve the performance of SVDD when the data set contains noisy data.
5 Conclusions

Data domain description is an important tool for outlier detection. In this paper, we propose G-FSVDD with a new fuzzy membership model and describe a detailed strategy for setting the fuzzy membership. This makes G-FSVDD more effective at reducing the effects of noise in applications. The experiment shows that G-FSVDD performs better on noisy data.
Acknowledgment

This work is supported by the 973 program of China (No. 2002CB312200) and by the 863 program of China (No. 2002AA412010-12).
References 1. David, M.J.T., Robert, P.W.D.: Support Vector Data Description. Pattern Recognition Letters 20(11-13) (1999) 1191-1199 2. David, M.J.T., Robert, P.W.D.: Support Vector Data Description. Machine Learning 54(1) (2004) 45-66 3. Lin, C.F., Wang, S.D.: Fuzzy Support Vector Machines. IEEE Transactions on Neural Networks 13 (2) (2002) 464-471 4. Wei, L.L., Long, W.J., Zhang, W.X.: Fuzzy Data Domain Description using Support Vector Machines. Proceedings of the Second International Conference on Machine Learning and Cybernetics, xi'an, 5(2-5) (2003) 3082-3085 5. Yue, S.H.: Several Problems in Data Mining and Fusion. The Graduated Report of Postdoctoral Research, Zhejiang University, Hangzhou, China, (2004) 15-23 6. Eden, M.W.M., Tommy C.-W.S.: A New Shifting Grid Clustering Algorithm. Pattern Recognition 37(3) (2004) 503-514 7. http://fuzzy.cs.uni-magdeburg.de/~borgelt/doc/apriori/apriori.html 8. Lin, C.F., Wang, S.D.: Training Algorithms for Fuzzy Support Vector Machines with Noisy Data. Pattern Recognition Letters 25(14) (2004) 1647-1656
Development of the Hopfield Neural Scheme for Data Association in Multi-target Tracking Yang Weon Lee Department of Information and Communication Engineering, Honam University, Seobongdong, Gwangsangu, Gwangju, 506-714, South Korea [email protected]
Abstract. A neural scheme for data association in a multi-target environment is proposed. The scheme is derived using a Lyapunov energy function and is important in providing a computationally feasible alternative to the complete enumeration of JPDA, which is intractable. Through experiments, we show that the proposed scheme is stable and works well in general environments.
1 Introduction
Generally, there are three approaches to data association for MTT: the non-Bayesian approach based on likelihood functions [1], the Bayesian approach [2, 3, 4], and the neural network approach [5, 6, 7]. The major difference between the first two approaches is how they treat false alarms. The non-Bayesian approach calculates the likelihood functions of all possible tracks given the measurements and selects the track that maximizes the likelihood function. Meanwhile, a tracking filter using the Bayesian approach predicts the location of interest using a posteriori probabilities. These two approaches are inadequate for real-time applications because their computational complexity is tremendous. As an alternative, we suggested a Hopfield neural network probabilistic data association (HNPDA) scheme to approximately compute the a posteriori probabilities β_j^t of the joint probabilistic data association filter (JPDAF) [8] as a constrained minimization problem. This technique based on neural networks was originally inspired by the neural solution of the traveling salesman problem (TSP). In fact, β_j^t is approximated by the output voltage X_j^t of a neuron in an (m + 1) × n array of neurons, where m is the number of measurements and n is the number of targets.
2 Development of the Energy Function

2.1 Relations Between Data Association and Energy Function
Suppose there are n targets and m measurements. The energy function for the data association is written below.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1280–1285, 2006.
© Springer-Verlag Berlin Heidelberg 2006
$$
E_{DAP} = \frac{A}{2}\sum_{t=1}^{n}\sum_{j=0}^{m}\sum_{\tau=1,\,\tau\neq t}^{n} X_j^t X_j^\tau
+ \frac{B}{2}\sum_{t=1}^{n}\sum_{j=0}^{m}\sum_{l=0,\,l\neq j}^{m} X_j^t X_l^t
+ \frac{C}{2}\sum_{t=1}^{n}\Bigl(\sum_{j=0}^{m} X_j^t - 1\Bigr)^2
+ \frac{D}{2}\sum_{j=0}^{m}\sum_{t=1}^{n}\bigl(X_j^t - \rho_j^t\bigr)^2
+ \frac{E}{2}\sum_{j=0}^{m}\sum_{t=1}^{n}\sum_{\tau=1,\,\tau\neq t}^{n}\sum_{l=0,\,l\neq j}^{m}\bigl(X_j^t - \rho_l^\tau\bigr)^2. \qquad (1)
$$
X_j^t is the output voltage of a neuron in an (m + 1) × n array of neurons and is the approximation to the a posteriori probability β_j^t in the JPDAF [8]. This a posteriori probability, in the special case of the PDAF [8] where the probability P_G that the correct measurement falls inside the validation gate is unity, is denoted by ρ_j^t. Actually, P_G is very close to unity when the validation gate size is adequate. In (1), A, B, C, D, and E are constants. To justify the first two terms of E_DAP in (1), we assumed that the dual assumptions of no two returns from the same target and no single return from two targets are consistent with the presence of a dominating X_j^t in each row and each column of the (m + 1) × n array of neurons. The third term of E_DAP is used to constrain the sum of the X_j^t's in each column to unity, i.e. \sum_{j=0}^{m} X_j^t = 1. This constraint is consistent with the requirement that \sum_{j=0}^{m} \beta_j^t = 1 in both the JPDAF and the PDAF [2]. Therefore, this constraint, by itself, does not permit us to infer whether the β_j^t's are from the JPDAF or from the PDAF. The assumption used to set up the fourth term is that this term is small only if X_j^t is close to ρ_j^t, in which case the neural network simulates more closely the PDAF for each target rather than the intended JPDAF in the multitarget scenario. Finally, the fifth term is minimized only if X_j^t is not large unless for each τ ≠ t there is a unique l ≠ j such that ρ_l^τ is large.

2.2 Modification of Energy Equation
In a Hopfield network, as the operation approaches the steady state, at most one neuron in each row and column enters the ON state while the other neurons are in the OFF state. To guarantee this state, we add the following constraints for the Hopfield network:

$$
E_s = \frac{1}{2}\sum_{j=1}^{m}\sum_{t=0}^{n}\sum_{\tau\neq t} \hat{\omega}_j^t \hat{\omega}_j^\tau
+ \frac{1}{2}\sum_{t=1}^{n}\sum_{j=1}^{m}\sum_{l\neq j} \hat{\omega}_j^t \hat{\omega}_l^t. \qquad (2)
$$
By summing (2) and (1) of [4], we get the final energy equation for the Hopfield network:

$$
E_{HDA} = \frac{A}{2}\sum_{j=1}^{m}\sum_{t=0}^{n}\sum_{\tau\neq t} \hat{\omega}_j^t \hat{\omega}_j^\tau
+ \frac{B}{2}\sum_{t=1}^{n}\sum_{j=1}^{m}\sum_{l\neq j} \hat{\omega}_j^t \hat{\omega}_l^t
+ \frac{C}{2}\sum_{t=1}^{n}\sum_{j=1}^{m} \bigl(\hat{\omega}_j^t - \omega_j^t\bigr)^2
+ \frac{D}{2}\sum_{t=1}^{n}\Bigl(\sum_{j=1}^{m} \hat{\omega}_j^t - 1\Bigr)^2
+ \frac{F}{2}\sum_{j=1}^{m}\Bigl(\sum_{t=0}^{n} \hat{\omega}_j^t - 1\Bigr)^2
+ \frac{G}{2}\sum_{t=1}^{n}\sum_{j=1}^{m} r_j^t \hat{\omega}_j^t. \qquad (3)
$$
The first two terms of (3) correspond to row and column inhibition, and the third term suppresses the activation of the uncorrelated part (i.e., if ω_j^t = 0, then
ω̂_j^t = 0). The fourth and fifth terms bias the final solution towards a normalized set of numbers. The last term favors associations that have a nearest neighbor in view of the target velocity.

2.3 Transformation of Energy Function into Hopfield Network
A Hopfield network with m(n + 1) neurons was considered. The neurons are subdivided into n + 1 columns (one per target) of m neurons each. Henceforward we identify each neuron with a double index tl (where the index t = 0, 1, ..., n refers to the target, and the index l = 1, ..., m refers to the neurons in each column), its output with X_l^t, the weight for neurons jt and lτ with W_jl^{tτ}, and the external bias current for neuron tl with I_l^t. According to this convention, we can extend the notation of the Lyapunov energy function [8] to two dimensions. Using the Kronecker delta function δ_ij, the Lyapunov energy function can be written as

$$
\begin{aligned}
E_{HDA} ={}& \frac{A}{2}\sum_{t}\sum_{\tau}\sum_{l}\sum_{j} \delta_{lj}(1-\delta_{t\tau}) X_l^t X_j^\tau
+ \frac{B}{2}\sum_{t}\sum_{\tau}\sum_{l}\sum_{j} \delta_{t\tau}(1-\delta_{lj})(1-\delta_{0t}) X_l^t X_j^\tau \\
&+ \frac{C}{2}\sum_{t}\sum_{\tau}\sum_{l}\sum_{j} \delta_{t\tau}\delta_{lj}(1-\delta_{0t}) X_l^t X_j^\tau
- C\sum_{t}\sum_{l} \omega_l^t (1-\delta_{0t}) X_l^t
+ \frac{C}{2}\sum_{t}\sum_{l} (\omega_l^t)^2 (1-\delta_{0t}) \\
&+ \frac{D}{2}\sum_{t}\sum_{\tau}\sum_{l}\sum_{j} \delta_{t\tau}(1-\delta_{0t}) X_l^t X_j^\tau
- D\sum_{t}\sum_{l} (1-\delta_{0t}) X_l^t + \frac{Dn}{2} \\
&+ \frac{F}{2}\sum_{t}\sum_{\tau}\sum_{l}\sum_{j} \delta_{lj} X_l^t X_j^\tau
- F\sum_{t}\sum_{l} X_l^t + \frac{Fm}{2}
+ \frac{G}{2}\sum_{t}\sum_{l} r_l^t (1-\delta_{0t}) X_l^t. \qquad (4)
\end{aligned}
$$
We can also extend the notation of the standard Lyapunov energy function to two dimensions:

$$
E = -\frac{1}{2}\sum_{t}\sum_{\tau}\sum_{l}\sum_{j} W_{jl}^{t\tau} X_l^t X_j^\tau - \sum_{t}\sum_{l} I_l^t X_l^t. \qquad (5)
$$
By comparing (5) and (4), we get the connection strength matrix and input parameters each :
Wjltτ = − A(1 − δtτ ) + Cδtτ (1 − δot ) + F δlj − B(1 − δlj ) + D δtτ (1 − δot ), Ilt = (Cωjt + D − G r )(1 − δot ) + F. 2 jt
(6)
Here we omit constant terms such as \frac{Dn + Fm}{2} + \frac{C}{2}\sum_{t}\sum_{l} (\omega_l^t)^2 (1-\delta_{0t}). These terms do not affect the neuron outputs, since they merely act as bias terms during the processing.
Using (6), the connection strength W_jl^{tτ} from the neuron at location (τ, l) to the neuron at location (t, j) is

$$
W_{jl}^{t\tau} =
\begin{cases}
-\bigl[(C+D)(1-\delta_{0t}) + F\bigr] & \text{if } t = \tau \text{ and } j = l \quad \text{("self feedback")}, \\
-(A+F) & \text{if } t \neq \tau \text{ and } j = l \quad \text{("row connection")}, \\
-(B+D)(1-\delta_{0t}) & \text{if } t = \tau \text{ and } j \neq l \quad \text{("column connection")}, \\
0 & \text{if } t \neq \tau \text{ and } j \neq l \quad \text{("global connection")}.
\end{cases} \qquad (7)
$$

Fig. 1. Example of the Hopfield network for two targets and three plots (edges labeled with the weights −(C+D+F), −(A+F), −(B+D), and −F from (7))
Fig. 1 sketches the resulting two-dimensional network architecture as a directed graph built from (7). We note that only 39 of the 81 possible connections are realized in this 3 × 3 neuron example. This means that the modified Hopfield network can be represented as a sparse matrix. In Fig. 1, we also note that there are no connections between diagonal neurons. With the specific values from (6), the equation of motion for the MTT becomes

$$
\frac{dS_l^t}{dt} = -\frac{S_l^t}{s_o}
- \sum_{\tau}\sum_{j}\Bigl[\{A(1-\delta_{t\tau}) + C\delta_{t\tau}(1-\delta_{0t}) + F\}\delta_{lj}
+ \{B(1-\delta_{lj}) + D\}\delta_{t\tau}(1-\delta_{0t})\Bigr] X_j^\tau
+ \Bigl(C\omega_l^t + D - \frac{G}{2} r_l^t\Bigr)(1-\delta_{0t}) + F. \qquad (8)
$$

The final equation of data association is

$$
\frac{dS_l^t}{dt} = -\frac{S_l^t}{s_o}
- A\sum_{\tau=1,\,\tau\neq t}^{n} X_l^\tau
- B(1-\delta_{0t})\sum_{j=1,\,j\neq l}^{m} X_j^t
- D(1-\delta_{0t})\Bigl(\sum_{j=1}^{m} X_j^t - 1\Bigr)
- F\Bigl(\sum_{\tau=1}^{n} X_l^\tau - 1\Bigr)
- C(1-\delta_{0t})\bigl(X_l^t - \omega_l^t\bigr)
- \frac{G}{2} r_l^t (1-\delta_{0t}). \qquad (9)
$$
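The dynamics of (9) can be iterated with a simple Euler scheme. The following is an illustrative sketch only: the function names, the constant values, the step size, and the sigmoid gain are our assumptions, and the correlation data ω_l^t and course weights r_l^t are supplied as arrays.

```python
import numpy as np

def sigmoid(s, u0=1.0):
    # neuron activation X = g(S)
    return 1.0 / (1.0 + np.exp(-s / u0))

def mhda_step(S, omega, r, A=1.0, B=1.0, C=5.0, D=1.0, F=1.0, G=0.1,
              s_o=1.0, dt=0.05):
    """One Euler step of the association dynamics in the spirit of (9).
    Arrays are (n+1) x m: axis 0 indexes targets t = 0..n (row 0 plays
    the clutter role), axis 1 indexes measurements l = 1..m."""
    X = sigmoid(S)
    nc = np.ones((S.shape[0], 1))
    nc[0, 0] = 0.0                               # the (1 - delta_0t) factor
    sum_t = X.sum(axis=0, keepdims=True)         # over targets, measurement fixed
    sum_l = X.sum(axis=1, keepdims=True)         # over measurements, target fixed
    dS = (-S / s_o
          - A * (sum_t - X)                      # rival targets, same measurement
          - B * nc * (sum_l - X)                 # rival measurements, same target
          - D * nc * (sum_l - 1.0)               # normalization over measurements
          - F * (sum_t - 1.0)                    # normalization over targets
          - C * nc * (X - omega)                 # pull toward correlation data
          - 0.5 * G * nc * r)                    # course/velocity bias
    return S + dt * dS

# tiny example: 2 targets (plus a clutter row) and 2 measurements,
# with measurement 1 correlated to target 1 and measurement 2 to target 2
omega = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
r = np.zeros_like(omega)
S = np.zeros_like(omega)
for _ in range(500):
    S = mhda_step(S, omega, r)
X = sigmoid(S)
```

With a sufficiently large C, the activations settle so that each measurement's strongest neuron matches its correlated target, mirroring the parameter discussion below.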
The parameters A, B, C, D, F and G can be adjusted to control the emphasis on different constraints and properties. A larger emphasis on A, B, and F will ensure that only one neuron is activated in each column and row. A large value of C will drive X_l^t close to ω_l^t, except for duplicated activations of neurons in the same row and column. A larger emphasis on G will make the neurons activate depending on the target's course-weighted value. Finally, a balanced combination of all six parameters will lead to the most desirable association. In this case, a large number of targets and measurements will only require a larger array of interconnected neurons, instead of an increased load on sequential software computing the association probabilities.

2.4 Computational Complexity
The computational complexity of the modified Hopfield data association (MHDA) scheme, when applied to the target association problem, depends on the number of tracks and measurements and on the number of iterations needed to reach a stable state. Suppose that there are n tracks and m measurements. Then, according to (9), the data association requires O(nm) computations per iteration. If we denote the average iteration number by k̄, the total data association calculation requires O(k̄nm) computations. Therefore, even when the numbers of tracks and measurements increase, the required computation does not increase exponentially. In contrast, the JPDAF, as estimated in [5], has computational complexity O(2^{nm}), which increases exponentially with the number of tracks and measurements.
3 Conclusions
In this paper, we have developed the MHDA scheme for data association. The scheme is important in providing a computationally feasible alternative to the complete enumeration of JPDA, which is intractable. We have proved that, given an artificial measurement and track configuration, the MHDA scheme converges to a proper plot in a finite number of iterations. Also, a proper plot that is not the global solution can be corrected by re-initializing one or more times. In this light, even though the performance is enhanced by using the MHDA, we also note that the difficulty of tuning the parameters of the MHDA is a critical aspect of this scheme. The difficulty can, however, be overcome by developing suitable automatic instruments that iteratively verify convergence as the network parameters vary.
References

1. Alspach, D.L.: A Gaussian Sum Approach to the Multi-Target Identification Tracking Problem. Automatica 11 (1975) 285-296
2. Bar-Shalom, Y.: Extension of the Probabilistic Data Association Filter in Multi-Target Tracking. In: Proceedings of the 5th Symposium on Nonlinear Estimation (1974) 16-21
3. Reid, D.B.: An Algorithm for Tracking Multiple Targets. IEEE Trans. on Automat. Contr. 24 (1979) 843-854
4. Lee, Y.W.: Adaptive Data Association for Multi-target Tracking Using Relaxation. Lecture Notes in Computer Science, Vol. 3644. Springer-Verlag (2005) 552-561
5. Sengupta, D., Iltis, R.A.: Neural Solution to the Multitarget Tracking Data Association Problem. IEEE Trans. on AES 25 (1989) 96-108
6. Kuczewski, R.: Neural Network Approaches to Multitarget Tracking. In: Proceedings of the IEEE ICNN (1987)
7. Hopfield, J.J., Tank, D.W.: Neural Computation of Decisions in Optimization Problems. Biological Cybernetics 52 (1985) 141-152
8. Fortmann, T.E., Bar-Shalom, Y., Scheffe, M.: Sonar Tracking of Multiple Targets Using Joint Probabilistic Data Association. IEEE J. Oceanic Engineering 8 (1983) 173-184
Determine Discounting Coefficient in Data Fusion Based on Fuzzy ART Neural Network Dong Sun and Yong Deng Department 8, Institute of Automatic Detection (810), Shanghai Jiao Tong University, No. 800 Dong Chuan Road, Min Hang, Shanghai, China
Abstract. The method of discounting coefficients is an efficient way to solve the problem of conflicting evidence. In this paper, a new method is proposed to calculate the discounting coefficient of evidence based on evidence clustering by means of a fuzzy ART neural network. The discounted evidence is then used in the belief function combination. A numerical example illustrates the use of the proposed method to handle conflicting evidence.
1 Introduction
The Dempster-Shafer theory of evidence is a mechanism formalized by Shafer [1] for representing and reasoning with uncertain, imprecise and incomplete information. It is based on Dempster's original work [2] on the modeling of uncertainty in terms of upper and lower probabilities induced by a multivalued mapping, rather than as a single probability value. Dempster-Shafer theory can model information more flexibly, without requiring a probability (or a prior) to be assigned to each element of a set, when representing and dealing with ignorance and uncertainty. A crucial role in evidence theory is played by Dempster's combination rule, which has several interesting mathematical properties such as commutativity and associativity [1]. The Dempster-Shafer theory of evidence is widely used in many fields such as artificial intelligence [3], information fusion [4, 5], decision-making [6] and classification [7]. However, it was pointed out by Zadeh [8] that illogical results may be obtained by the classical Dempster combination rule when the collected pieces of evidence highly conflict with each other. Many methods have been proposed to solve this problem. One way is to improve the combination rule itself, and many new combination rules have been proposed [9, 10, 11, 12, 13, 14, 15]; in this way the classical Dempster combination rule is changed greatly or completely given up. Haenni [16], however, argued that combination rules other than Dempster's rule are impractical and unnecessary. He pointed out that the commutativity and associativity of the classical Dempster combination rule are lost under these new combination rules, and that the model of the evidence itself is the source of the conflict. Following this viewpoint, the method of averaging evidence was proposed by Murphy [17], who suggests that,
if all the evidence is available at the same time, one should average the masses and calculate the combined masses by combining the average values multiple times, which can solve the conflict problem to some extent. Deng's weighted averaging approach [18] was developed from Murphy's averaging approach and seems more reasonable than plain averaging, which does not consider the weight of each piece of evidence. Another way to solve the conflict problem by improving the model of the evidence is to use discounting coefficients. How to determine the discounting coefficient is a problem. In [19], a weighted sum operator is used to combine the mass functions, with discounting factors applied when there is reason to believe that some of the samples are in some way suspect; these weights can be looked on as a kind of discounting coefficient, but this method does not really improve the model of the evidence. The method of using discounting coefficients that we apply will be introduced in the next section. The aim of this paper is to propose an appropriate method for calculating the discounting coefficient properly. This paper is organized as follows. In Section 2, we briefly introduce the Dempster-Shafer evidence theory and the method of discounting coefficients for solving the conflict problem. In Section 3, a new method for calculating the discounting coefficients is proposed. In Section 4, a numerical example of handling conflicting evidence is shown to illustrate our procedure. Finally, a conclusion is drawn in Section 5.
2 Preliminaries
In this section, we briefly introduce the Dempster-Shafer theory and the concept of the discounting coefficient.

2.1 Dempster-Shafer Evidence Theory
In Dempster-Shafer evidence theory [2], there is a fixed set of N mutually exclusive and exhaustive elements, called the frame of discernment, symbolized by Θ = {H_1, H_2, ..., H_N}. Let P(Θ) denote the power set composed of the 2^N subsets A of Θ:

P(Θ) = {∅, {H_1}, {H_2}, ..., {H_N}, {H_1 ∪ H_2}, {H_1 ∪ H_3}, ..., Θ}.

A BPA (basic probability assignment) is a function m : P(Θ) → [0, 1], A ↦ m(A), which satisfies the conditions

m(∅) = 0,  \sum_{A ∈ P(Θ)} m(A) = 1.
The mass m(A) represents how strongly the evidence supports A. The elements of P(Θ) that have a non-zero mass are called focal elements. A body of evidence (BOE) is the set of all the focal elements,

(R, m) = {[A, m(A)] ; A ∈ P(Θ) and m(A) > 0},

where R is a subset of P(Θ) and each A ∈ R has a fixed value m(A). Two bodies of evidence m_1 and m_2 can be combined with Dempster's orthogonal rule as follows:

$$
m(A) = \frac{\sum_{B \cap C = A} m_1(B)\, m_2(C)}{1 - K} \qquad (1)
$$
where

$$
K = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C). \qquad (2)
$$

2.2 Open Issue of Dempster-Shafer Combination Rule
Zadeh has underlined that the normalization procedure in the combination rule leads to counter-intuitive behavior when evidence conflicts [20]. To solve this problem, the method of discounting coefficients was proposed [17]. The method is described as follows. Let m_j be a belief mass given by the source S_j and let α_j be a coefficient representing the confidence degree one has in source S_j. Let m_{α,j} denote the belief mass discounted by a coefficient 1 − α_j, defined as:

m_{α,j}(A) = α_j m_j(A)  for all A ⊂ Θ,
m_{α,j}(Θ) = 1 − α_j + α_j m_j(Θ).

α_j is called the discounting coefficient. When α_j = 0, the belief mass from source S_j is not reliable at all; when α_j = 1, we are fully confident in the reliability of S_j. The value of α_j therefore lies between 0 and 1, and if α_i > α_j, then the belief mass from source S_i is more reliable than the one from source S_j. The discounted pieces of evidence are regarded as having no conflicts, and the classical Dempster combination rule can be used to combine them.
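The discounting operation and Dempster's rule (1)-(2) can be sketched compactly, representing focal elements as frozensets of hypotheses; the function names here are illustrative assumptions, not part of the paper.

```python
from itertools import product

def discount(m, alpha, frame):
    """Shafer discounting: scale every mass by alpha and move the
    remaining 1 - alpha onto the whole frame Theta."""
    out = {A: alpha * v for A, v in m.items() if A != frame}
    out[frame] = 1.0 - alpha + alpha * m.get(frame, 0.0)
    return out

def dempster(m1, m2):
    """Dempster's orthogonal rule, eqs. (1)-(2)."""
    combined, K = {}, 0.0
    for (B, v1), (C, v2) in product(m1.items(), m2.items()):
        inter = B & C
        if inter:
            combined[inter] = combined.get(inter, 0.0) + v1 * v2
        else:
            K += v1 * v2            # mass committed to the empty set
    return {A: v / (1.0 - K) for A, v in combined.items()}
```

On Zadeh's classical counter-example (two sources almost fully committed to different singletons), the un-discounted rule assigns all mass to the barely supported hypothesis; this is exactly the behavior that discounting is meant to soften.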
3 Fuzzy ART Neural Networks and the Proposed Method of Determining the Discounting Coefficient
Adaptive Resonance Theory (ART) was introduced as a theory of human cognitive information processing. The theory has produced a series of real-time neural network models for unsupervised category learning and pattern recognition. These models are capable of learning stable recognition categories in response to arbitrary input sequences with either fast or slow learning. ART1, the earliest model of this family, can stably learn to categorize binary input patterns presented in an arbitrary order. The fuzzy ART model generalizes ART1 to cluster arbitrary collections of both analog and binary input patterns. The structure of fuzzy ART is shown in Fig. 1 and is mostly the same as that of ART1. Unlike in ART1, however, some preprocessing is done before a vector is presented to the fuzzy ART network: each input is transformed into a vector a = (a_1, a_2, ..., a_M) in which every component lies in the interval [0, 1]. The F0 layer accepts a = (a_1, a_2, ..., a_M) as input, and its output vector is I = (a, a^c) = (a_1, ..., a_M, 1 − a_1, ..., 1 − a_M). This transformation is called complement coding. It is assumed that there are 2M nodes in the F1 layer and N nodes in the F2 layer, so the number of potential categories is N. The top-down weight vector of the jth category is W_j = (W_{j1}, ..., W_{ji}, ..., W_{j(2M)}), 1 ≤ i ≤ 2M, 1 ≤ j ≤ N, where
Fig. 1. The structure of fuzzy ART: an attentional subsystem with layers F0, F1 and F2, an orienting subsystem with a reset node, and the complement-coded input I = (a, 1 − a)
W_{ji} is the top-down connecting weight from the jth node of the F2 layer to the ith node of the F1 layer. When the vector I of input patterns from the F0 layer is presented, the choice function of the jth node of the F2 layer is defined by

T_j(I) = |I ∧ W_j| / (α + |W_j|),

where α > 0 is the choice parameter, the fuzzy AND operator is defined by (x ∧ y)_i = min(x_i, y_i), and the norm |·| is defined by |x| = \sum_{i=1}^{2M} |x_i|. For notational simplicity, T_j(I) is written as T_j when the input I
is fixed. In the F2 layer, the category choice is indexed by J, where T_J = max{T_j : j = 1, ..., N}. If node J meets the vigilance criterion |I ∧ W_J| / |I| ≥ ρ, where ρ is the vigilance parameter, then resonance occurs; a mismatch reset occurs when |I ∧ W_J| / |I| < ρ, in which case node J is reset and the search process continues. If all the categories existing in the F2 layer are mismatched, a new category is created. When the search process is finished, the weight vector W_J is updated according to the equation

W_J^{(new)} = (1 − β) W_J^{(old)} + β (I ∧ W_J^{(old)}),

where β is the learning rate. This is a brief outline of the fuzzy ART algorithm; please refer to [21] for the details. Here we assume that the input patterns are the vectors of BPA (basic probability assignment) functions over the power set P(Θ) of the frame of discernment. For example, if the frame of discernment is Θ = {A, B, C}, then the power set is P(Θ) = {∅, {A}, {B}, {C}, {AB}, {AC}, {BC}, {ABC}}. One BPA over this power set is m{A} = 0.8, m{B} = 0.1, m{AC} = 0.05, m{ABC} = 0.05, so the input vector of this BPA is [0 0.8 0.1 0 0 0.05 0 0.05]. A number of such vectors can be clustered into different categories by fuzzy ART, and we obtain the number of vectors in each category, {NOV_1, NOV_2, ..., NOV_N},
where N is the number of categories. The clustering coefficient of the bodies of evidence in the ith category is then defined as

CC_i = NOV_i / \sum_{i=1}^{N} NOV_i.
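The fuzzy ART loop described above (complement coding, choice function, vigilance test, learning) can be sketched as follows. This is an assumed minimal implementation: the function name, the parameter defaults, and the fast-learning choice β = 1 are ours.

```python
import numpy as np

def fuzzy_art(patterns, rho=0.9, alpha=0.001, beta=1.0):
    """Cluster vectors with components in [0, 1]; returns one category
    label per pattern."""
    cats, labels = [], []
    for a in patterns:
        I = np.concatenate([a, 1.0 - a])               # complement coding
        # visit categories in decreasing order of the choice function T_j
        order = sorted(range(len(cats)),
                       key=lambda j: -np.minimum(I, cats[j]).sum()
                                      / (alpha + cats[j].sum()))
        chosen = None
        for j in order:
            if np.minimum(I, cats[j]).sum() / I.sum() >= rho:  # vigilance test
                chosen = j
                break
        if chosen is None:                             # all mismatch: new category
            cats.append(I.copy())
            chosen = len(cats) - 1
        else:                                          # resonance: learn
            cats[chosen] = ((1 - beta) * cats[chosen]
                            + beta * np.minimum(I, cats[chosen]))
        labels.append(chosen)
    return labels
```

Counting the labels per category then gives the {NOV_i} from which the clustering coefficients CC_i are computed.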
In a group of evidences, if a category has many pieces of evidence belonging to it, that is, its clustering coefficient is large, the evidence belonging to this category should be considered more important and should have more effect on the final combination result. On the contrary, if few pieces of evidence belong to a category, the evidence in that category should receive little attention. Based on the clustering coefficient, the discounting coefficient of the ith body of evidence is given as

DCE_i = CC_i   (i = 1, 2, ..., k).   (3)
So we can obtain the weighted bodies of evidence as

m_{DCE_i, i}(A) = DCE_i · m_i(A)  for all A ⊂ Θ,
m_{DCE_i, i}(Θ) = 1 − DCE_i + DCE_i · m_i(Θ).   (4)

Finally, the classical Dempster combination rule can be used to combine them.
4 Numerical Example
A fictitious example illustrates the use of the proposed method. In a multisensor-based automatic target recognition system, suppose the real target is A. From 5 different sensors, the system has collected 5 bodies of evidence, shown as follows:

(R_1, m_1) = {[{A}, 0.4], [{B}, 0.3], [{C}, 0.3]}
(R_2, m_2) = {[{A}, 0], [{B}, 0.8], [{B, C}, 0.2]}
(R_3, m_3) = {[{A}, 0.55], [{B}, 0.1], [{C}, 0.3], [{A, C}, 0.05]}
(R_4, m_4) = {[{A}, 0.55], [{B}, 0.1], [{C}, 0.3], [{A, C}, 0.05]}
(R_5, m_5) = {[{A}, 0.6], [{B}, 0.1], [{C}, 0.3]}

If these pieces of evidence are combined directly with the classical Dempster combination rule, the result is (R, m) = {[{A}, 0], [{B}, 0.0982], [{C}, 0.9018]}, which conflicts with the true target. With the proposed method, the fuzzy ART neural network is used to cluster these pieces of evidence (assuming the vigilance ρ = 0.9), and the discounting coefficients of the evidence are

[DCE_1, 0.8], [DCE_2, 0.2], [DCE_3, 0.8], [DCE_4, 0.8], [DCE_5, 0.8].
According to the discounting coefficient of each body of evidence, we obtain the weighted bodies of evidence shown as follows:

(R̃_1, m_1) = {[{A}, 0.3200], [{B}, 0.2400], [{C}, 0.2400], [{A, B, C}, 0.2000]}
(R̃_2, m_2) = {[{A}, 0], [{B}, 0.1600], [{B, C}, 0.0400], [{A, B, C}, 0.8000]}
(R̃_3, m_3) = {[{A}, 0.4400], [{B}, 0.0800], [{C}, 0.2400], [{A, C}, 0.0400], [{A, B, C}, 0.2000]}
(R̃_4, m_4) = {[{A}, 0.4400], [{B}, 0.0800], [{C}, 0.2400], [{A, C}, 0.0400], [{A, B, C}, 0.2000]}
(R̃_5, m_5) = {[{A}, 0.4800], [{B}, 0.0800], [{C}, 0.2400], [{A, B, C}, 0.2000]}

The classical Dempster combination rule can now be used to combine them, and we get (R, m) = {[{A}, 0.7400], [{B}, 0.0500], [{C}, 0.2100]}. It can be seen that, when conflicting evidence is collected, the classical Dempster rule for combining beliefs produces illogical results that do not reflect the actual distribution of beliefs. In this case, because of the "bad" evidence R_2, which may be caused by many factors such as atrocious weather, an enemy jammer, or flaws of the sensor itself, the Dempster combination results show that, even though more pieces of evidence collected later support target A, it is impossible that the target is A, which is just against the truth. With the weighted evidence, the classical Dempster rule is still used, but the combination result is very different. The main reason for this is that, by making use of the fuzzy ART clustering of the evidence, the discounting coefficient decreases the weight of the "bad" evidence, so the "bad" evidence has less effect on the final combination result.
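The discount-then-combine pipeline of this example can be reproduced with a short self-contained script. It is a sketch: the helper names are ours, while the frame {A, B, C}, the masses, and the discounting coefficients are taken from the example above. The combined mass concentrates on A, in line with the result quoted above.

```python
from itertools import product
from functools import reduce

FRAME = frozenset('ABC')

def discount(m, dce):
    # eq. (4): scale masses by DCE_i and move 1 - DCE_i onto Theta
    out = {A: dce * v for A, v in m.items() if A != FRAME}
    out[FRAME] = 1.0 - dce + dce * m.get(FRAME, 0.0)
    return out

def dempster(m1, m2):
    # Dempster's orthogonal rule, eqs. (1)-(2)
    out, K = {}, 0.0
    for (B, v1), (C, v2) in product(m1.items(), m2.items()):
        if B & C:
            out[B & C] = out.get(B & C, 0.0) + v1 * v2
        else:
            K += v1 * v2
    return {A: v / (1.0 - K) for A, v in out.items()}

bodies = [
    {frozenset('A'): 0.4, frozenset('B'): 0.3, frozenset('C'): 0.3},
    {frozenset('B'): 0.8, frozenset('BC'): 0.2},
    {frozenset('A'): 0.55, frozenset('B'): 0.1, frozenset('C'): 0.3,
     frozenset('AC'): 0.05},
    {frozenset('A'): 0.55, frozenset('B'): 0.1, frozenset('C'): 0.3,
     frozenset('AC'): 0.05},
    {frozenset('A'): 0.6, frozenset('B'): 0.1, frozenset('C'): 0.3},
]
dces = [0.8, 0.2, 0.8, 0.8, 0.8]
weighted = [discount(m, d) for m, d in zip(bodies, dces)]
combined = reduce(dempster, weighted)
```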
5 Conclusions
Dempster’s combination operator is a poor solution for the management of the conflict between the various information sources at the normalization step. According to the fuzzy ART neural networks a group of evidences are divided into several deferent categories and the discounting coefficient can be determined by the clustering coefficient,Of the alternative methods that address the problems of discounting coefficient, solves this problem to some extent without change the combination operator itself.
Acknowledgement

Many thanks to the anonymous referees for their constructive comments on improving the paper. The work is partially supported by the National Defence Key Laboratory under contract number 51476040103JW13.
References

1. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press, Princeton (1976)
2. Dempster, A.P.: Upper and Lower Probabilities Induced by a Multivalued Mapping. Annals of Mathematical Statistics 38 (1967) 325-339
3. Liau, C.J., Lin, B.I.: Possibilistic Reasoning - A Mini-survey and Uniform Semantics. Artificial Intelligence 88(1) (1996) 163-193
4. Murphy, R.R.: Dempster-Shafer Theory for Sensor Fusion in Autonomous Mobile Robots. IEEE Transactions on Robotics and Automation 14(1) (1998) 197-206
5. Pohl, C., Van Genderen, J.L.: Multisensor Image Fusion in Remote Sensing: Concepts, Methods and Applications. International Journal of Remote Sensing 19(1) (1998) 823-854
6. Beynon, M., Curry, B., Morgan, P.: The Dempster-Shafer Theory of Evidence: An Alternative Approach to Multicriteria Decision Modeling. Omega 28(1) (2000) 37-50
7. Zouhal, L.M., Denoeux, T.: An Evidence-theoretic k-NN Rule with Parameter Optimization. IEEE Transactions on Systems, Man and Cybernetics 28(1) (1998) 263-271
8. Zadeh, L.A.: A Simple View of the Dempster-Shafer Theory of Evidence and Its Implication for the Rule of Combination. AI Magazine 7(2) (1986) 85-90
9. Yager, R.R.: On the Dempster-Shafer Framework and New Combination Rules. Information Sciences 41(1) (1987) 93-138
10. Dubois, D., Prade, H.: Representation and Combination of Uncertainty with Belief Functions and Possibility Measures. Computational Intelligence 4 (1988) 244-264
11. Smets, P.: The Combination of Evidence in the Transferable Belief Model. IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (1990) 447-458
12. Smets, P., Kennes, R.: The Transferable Belief Model. Artificial Intelligence 66 (1994) 191-234
13. Smets, P.: Resolving Misunderstandings about Belief Functions. International Journal of Approximate Reasoning 6 (1992) 321-344
14. Lefevre, E.: Belief Function Combination and Conflict Management. Information Fusion 3(1) (2002) 149-162
15. Jøsang, A.: The Consensus Operator for Combining Beliefs. Artificial Intelligence 141(1) (2002) 157-170
16. Haenni, R.: Are Alternatives to Dempster's Rule of Combination Real Alternatives? Comments on "About the Belief Function Combination and the Conflict Management Problem". Information Fusion 3(1) (2002) 237-239
17. Murphy, C.K.: Combining Belief Functions When Evidence Conflicts. Decision Support Systems 29 (2000) 1-9
18. Deng, Y., Shi, W.K., Zhu, Z.F.: Combining Belief Functions Based on Distance of Evidence. Decision Support Systems 38(1) (2004) 489-493
19. McClean, S., Scotney, B.: Using Evidence Theory for the Integration of Distributed Databases. International Journal of Intelligent Systems 12(1) (1997) 763-776
20. Zadeh, L.A.: A Simple View of the Dempster-Shafer Theory of Evidence and Its Implication for the Rule of Combination. AI Magazine 7(1) (1986) 85-90
21. Carpenter, G.A., Grossberg, S.: Fuzzy ART: Fast Stable Learning and Categorization of Analog Patterns by an Adaptive Resonance System. Neural Networks 4(1) (1991) 759-771
Scientific Data Lossless Compression Using Fast Neural Network Jun-Lin Zhou and Yan Fu School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China [email protected], [email protected]
Abstract. Scientific computing generates huge loads of data from complex simulations, usually several TB, and general compression methods do not perform well on these data. Neural networks have the potential to extend data compression algorithms beyond the character level (n-gram models) currently in use, but have usually been avoided because they are too slow to be practical. We present a lossless compression method using a fast neural network based on maximum entropy and an arithmetic coder. The compressor is a bit-level predictive arithmetic encoder using a 2-layer fast neural network to predict the probability distribution. In the training phase, an improved adaptive variable learning rate is used for fast convergence. The proposed compressor produces better compression than popular compressors (bzip, zzip, lzo, ucl and dflate) on the lared-p data set, and is also competitive in time and space for practical applications.
1 Introduction
Numerous engineering, biomedical, and other scientific applications produce extremely large data sets through numeric simulations. In a large proportion of cases, the data represent one or more scalar fields sampled over regular grids of dimension three, four, or higher. For example, the fluid dynamics data lared-p were used in our tests. At each space-time sample, the values of several scalar and vector fields are produced. Storing the results of such a simulation and transmitting them to remote visualization clients is expensive. A variety of data compression techniques have been proposed to reduce the storage and transmission cost. We focus here on lossless compression. Instead of a general method, we employ a neural network to estimate the probability distribution of the next character. One of the motivations for using neural networks for data compression is that they excel in complex pattern recognition. Standard compression algorithms, such as bzip, lzo and dflate, are based on simple n-gram models. However, there are other types of learnable redundancy that cannot be modeled using n-gram frequencies. Neural networks possess certain adaptive properties which are useful for enhancing the performance of a predictor: for example, the ability to acquire knowledge even from noisy data through generalization, flexible
internal organization, error tolerance to inconsistencies in the data, and parallel distributed processing of information. We follow the approach of Schmidhuber and Heil [1], who used neural network prediction followed by arithmetic encoding. In Section 2, we describe the maximum-entropy [2] approach to combining probabilistic constraints and show that it is equivalent to a neural network model that predicts the probability distribution. In Section 3, we describe a fast-convergence neural network with an improved adaptive learning rate. In Section 4, we find experimentally that neural network compression performs better than the bzip, zzip, lzo, ucl and dflate compressors on the lared-p data set in compression ratio, with acceptable time consumption for practical use. Section 5 concludes.
2 Maximum Entropy Neural Network Model
In this paper, the maximum entropy method is applied to data compression. The key point is to find the probability distribution with maximum entropy, which ensures that the distribution is the most reasonable one. Solving for the maximum entropy distribution involves a constrained optimization, and by constructing a neural network the algorithm becomes more robust. To construct a stochastic model, we collect a large number of samples $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$, where each component of $x_i$ is 1 if a particular context is present in the input history and 0 otherwise, and $y_i$ is the next input symbol. We predict one bit at a time, so $y$ is either 0 or 1. We can summarize the training sample in terms of its empirical probability distribution $\tilde{p}$, defined by $\tilde{p}(x, y) = \frac{1}{N} \times$ (number of times that $(x, y)$ occurs in the sample). To express the event that $y = 1$ occurs when the $i$-th context bit is present in the input, we introduce the indicator function

$$f(x, y) = \begin{cases} 1 & \text{if } y = 1 \text{ and } x_i = 1 \\ 0 & \text{otherwise} \end{cases}$$

The expected value of $f$ with respect to the empirical distribution $\tilde{p}(x, y)$ is exactly the statistic we are interested in. We denote this empirical value by $\tilde{p}(f) \equiv \sum_{x,y} \tilde{p}(x, y) f(x, y)$, and the expected value of $f$ with respect to the model $p(y|x)$ by $p(f) \equiv \sum_{x,y} \tilde{p}(x)\, p(y|x) f(x, y)$, where $\tilde{p}(x)$ is the empirical distribution of $x$ in the training sample. We constrain this model expectation to be the same as the expected value of $f$ in the training sample; that is, we require the constraint equation $p(f) \equiv \tilde{p}(f)$. Suppose that we are given $n$ feature functions $f_i$; we would like our model to accord with these statistics, that is, we would like $p$ to lie in the subset $C$ of $P$ defined by

$$C \equiv \{p \in P \mid p(f_i) = \tilde{p}(f_i) \text{ for } i \in \{1, 2, \ldots, n\}\} \quad (1)$$
Here $P$ is the space of all (conditional) probability distributions. Among the models $p \in C$, the maximum entropy philosophy dictates that we select the distribution which is most uniform. A mathematical measure of the uniformity of a conditional distribution $p(y|x)$ is provided by the conditional entropy

$$H(p) \equiv -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x) \quad (2)$$
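To make the quantities above concrete, here is a minimal Python sketch that computes the empirical expectation $\tilde{p}(f)$ and the conditional entropy $H(p)$ (evaluated at the empirical conditional $p(y|x) = \tilde{p}(x,y)/\tilde{p}(x)$) on a tiny hand-made sample set; the data and the feature choice are purely illustrative, not from the paper:

```python
import math
from collections import Counter

# Toy sample set (hypothetical data): pairs (x, y) with x a binary context
# vector and y the next bit.
samples = [((1, 0), 1), ((1, 0), 1), ((1, 0), 0), ((0, 1), 0)]
N = len(samples)

# Empirical joint distribution p~(x, y) and marginal p~(x), as counts.
p_xy = Counter(samples)
p_x = Counter(x for x, _ in samples)

def f(x, y):
    """Indicator feature: fires when y = 1 and the first context bit is set."""
    return 1 if y == 1 and x[0] == 1 else 0

# Empirical expectation p~(f) = sum_{x,y} p~(x,y) f(x,y).
p_f = sum(c / N * f(x, y) for (x, y), c in p_xy.items())

# Conditional entropy H(p) = -sum_{x,y} p~(x) p(y|x) log p(y|x),
# here evaluated at the empirical conditional p(y|x) = count(x,y)/count(x).
H = -sum(
    (p_x[x] / N) * (c / p_x[x]) * math.log2(c / p_x[x])
    for (x, y), c in p_xy.items()
)
```

In this toy set the feature fires in 2 of the 4 samples, so $\tilde{p}(f) = 0.5$.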
Scientific Data Lossless Compression Using Fast Neural Network
To select a model from a set $C$ of allowed probability distributions, choose the model $p^* \in C$ with maximum entropy $H(p)$:

$$p^* = \arg\max_{p \in C} H(p) \quad (3)$$
It can be shown that $p^*$ is always unique in any constrained set $C$. The maximum entropy principle thus presents us with a problem in constrained optimization: find the $p^* \in C$ which maximizes $H(p)$. In addition, $p$ must satisfy two further constraints:

$$p(y|x) \ge 0, \quad \forall x, \forall y \quad (4)$$
$$\sum_y p(y|x) = 1, \quad \forall x \quad (5)$$
We apply the method of Lagrange multipliers from the theory of constrained optimization. For each feature $f_i$ we introduce a parameter $\lambda_i$, and define the Lagrangian $\xi(p, \Lambda, \gamma)$ by

$$\xi(p, \Lambda, \gamma) \equiv -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x) + \sum_i \lambda_i \Big( \sum_{x,y} \tilde{p}(x)\, p(y|x) f_i(x, y) - \sum_{x,y} \tilde{p}(x, y) f_i(x, y) \Big) + \gamma \Big( \sum_y p(y|x) - 1 \Big) \quad (6)$$

where $\gamma$ and $\Lambda = \lambda_1, \lambda_2, \ldots, \lambda_n$ are the $n + 1$ Lagrange multipliers. Holding $\gamma$ and $\Lambda$ fixed, the partial derivative of $\xi(p, \Lambda, \gamma)$ is

$$\frac{\partial \xi}{\partial p(y|x)} = -\tilde{p}(x)\big(1 + \log p(y|x)\big) + \sum_i \lambda_i \tilde{p}(x) f_i(x, y) + \gamma \quad (7)$$
Setting the above derivative equal to zero and solving for $p(y|x)$, we obtain

$$p^*(y|x) = \exp\Big(\sum_i \lambda_i f_i(x, y)\Big) \exp\Big(\frac{\gamma}{\tilde{p}(x)} - 1\Big) = \frac{1}{Z_\lambda(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big) \quad (8)$$

Because $\sum_y p(y|x) = 1$ for all $x$, the normalizing factor $Z_\lambda(x)$ is

$$Z_\lambda(x) = \sum_y \exp\Big(\sum_i \lambda_i f_i(x, y)\Big) \quad (9)$$

We define $\Psi(\Lambda) = \xi(p^*, \Lambda, \gamma^*)$ and seek

$$\Lambda^* = \arg\max_\Lambda \Psi(\Lambda) \quad (10)$$

An optimization method specifically tailored to the maximum entropy problem is the GIS algorithm of Darroch and Ratcliff. We can rewrite (8) in the following form:

$$p^*(y|x) = \frac{1}{Z(x)} \exp\Big(\sum_i w_i x_i\Big) \quad (11)$$
where $w_i = \lambda_i$ and $f_i(x, y) = x_i$. Setting $Z$ so that $P(y = 0|x) + P(y = 1|x) = 1$, we obtain

$$P(y = 1|x) = g\Big(\sum_i w_i x_i\Big), \quad \text{where } g(s) = \frac{1}{1 + e^{-s}} \quad (12)$$

which is a 2-layer neural network with $n$ input units $x_i$ and a single output $P(y = 1|x)$. Analogously to GIS, we train the network to satisfy the known probabilities $P(y|x)$ by updating the weights $w_i$ so as to minimize the error $E = y - P(y = 1|x)$, the difference between the expected output $y$ (the next bit to be observed) and the prediction.
The weights are updated by

$$w_i = w_i + \Delta w_i \quad (13)$$
$$\Delta w_i = \eta x_i E, \quad \text{where } \eta \text{ is the learning rate} \quad (14)$$

3 The Fast Convergence Training
The learning constant $\eta$ determines the stability and the convergence rate. In most practical cases, the iteration converges for $0 < \eta < 0.5$. Unfortunately, the convergence rate is quite slow relative to the speed of modern digital computers, and most current "speedup" techniques offer only incremental improvement. Another problem is that the iteration may become trapped in a local minimum and never reach the true global minimum. Using a momentum term can alleviate this problem, so we present a fast learning algorithm that addresses both issues by using an improved adaptive learning rate together with a momentum term. To begin the discussion of the adaptive learning rate, we first propose a measure of how close the previous error $E(n-1)$ comes to the current value $E(n)$. We define this relationship as the relative factor $e_r(n)$:

$$e_r(n) = \frac{\Delta E(n)}{E(n)} = \frac{E(n) - E(n-1)}{E(n)} \quad (15)$$
We then adjust both the learning rate and the momentum term according to this relative factor $e_r(n)$. The adjustments of the learning rate are given as follows:

Case 1: $e_r(n) \ge 0$:
$$\eta(n+1) = \eta(n) \times \mu - \beta \times e_r(n), \quad \mu \in (0, 1),\ \beta \in (0, 1) \quad (16)$$

Case 2: $e_r(n) < 0$:
$$\eta(n+1) = \eta(n) \times \nu - \beta \times e_r(n), \quad \nu \in (1, 1.5),\ \beta \in (0, 1) \quad (17)$$
When $e_r(n) > 0$, the learning error is increasing: the output is far from the expected output and $\Delta W$ was adjusted too much, so $\Delta W$ must be reduced. The Case 1 update reduces $\eta$, and hence $\Delta W$; this accords with our requirements and yields fast convergence. Conversely, Case 2 indicates that the learning error is decreasing and the output is close to the expected output; increasing $\Delta W$ then gives faster convergence. To evaluate the learning performance of the improved adaptive learning rate algorithm, two experimental results are illustrated in Fig. 1: on the example problem, momentum BP requires a large number of iterations (119 training epochs), whereas the improved adaptive learning rate converges much faster (40 epochs). The results were generated using MATLAB.

Fig. 1. Comparison between momentum BP (left, trained for 119 epochs) and the improved algorithm (right, trained for 40 epochs): sum-squared error and learning rate versus training epoch.
4 Experiments
The neural network described above serves as a predictor of the probability of the next bit, which then drives an arithmetic coder. The algorithm is compared with several popular compression programs on the LARED-P data set, the output of a 3-D plasma simulation with particles. Compression ratios are shown below.
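The predict-then-encode loop can be sketched as follows. Instead of a full arithmetic coder, this sketch accumulates the ideal code length $\sum -\log_2 P(\text{bit})$, the size in bits an arithmetic coder would approach; the learning rate and the single always-on context are illustrative assumptions, not the paper's configuration:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def ideal_code_length(bits, contexts, w, eta=0.3):
    """Predict each bit with the 2-layer network P(y=1|x) = g(sum w_i x_i),
    charge the ideal arithmetic-coding cost -log2 P(actual bit), then apply
    the online update w_i += eta * x_i * (y - P) of Eqs. (13)-(14)."""
    total = 0.0
    for y, x in zip(bits, contexts):
        p1 = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        p = p1 if y == 1 else 1.0 - p1
        total += -math.log2(max(p, 1e-12))   # bits an ideal coder would emit
        err = y - p1
        for i, xi in enumerate(x):
            w[i] += eta * xi * err
    return total
```

On a highly regular stream, e.g. `ideal_code_length([1]*200, [(1,)]*200, [0.0])`, the cost falls far below the 200 bits a memoryless coder would need, because the weights adapt to the redundancy.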
Fig. 2. The compression ratio and compression time on the LARED-P data set. Left: compression ratio (%) over simulation time steps, with the ANN compressor highest (around 114-116%), followed by bzip, zzip, lzo, ucl, and deflate. Right: compress and decompress times for each program (deflate, ucl, lzo, zzip, bzip, ANN).
Tests were performed on a 1.7 GHz AMD CPU with 512 MB of RAM under Windows XP. Fig. 2 shows that the neural network compressor gives the best compression ratios among the compared programs. The compression time for each time step's data set (about 10 MB) is usually around 70 seconds, and the memory used during compression is around 48 MB, which is acceptable in practical applications. Universal lossless data compression algorithms are efficient on text but inefficient on typical numerical data from simulations. By exploiting the characteristics of scientific data with a neural network, we can find other types of redundancy in the data, beyond what character-level models can capture.
5 Conclusions
In this paper, a lossless compression algorithm that uses a neural network for scientific data, an application requiring improved compression ratios, is proposed. The improved adaptive learning rate achieves faster convergence, and the resulting fast neural network is used to predict the probability distribution. It was found to give good performance on large-scale numerical data with complex characteristics from simulations.
Acknowledgements
This work is supported by the National Science Foundation of China under Grant 10476006 and the Application Research Foundation of Sichuan Province under Grant 05JY029-067-2.
References
1. Schmidhuber, J., Heil, S.: Sequential Neural Text Compression. IEEE Trans. Neural Networks 7(1) (1996) 142-146
2. Berger, A., Della Pietra, S.: A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics 22(1) (1996) 39-71
3. Chatterjee, A., Nait-Ali, A., Siarry, P.: An Input-delay Neural-network-based Approach for Piecewise ECG Signal Compression. IEEE Trans. Biomedical Engineering 52(5) (2005) 945-947
4. Chung, C., Leung, S.H.: A Backpropagation Algorithm with Adaptive Learning Rate and Momentum Coefficient. IEEE Trans. Neural Networks 15(6) (2004) 1411-1423
5. Petalas, Y.G., Vrahatis, M.N.: Parallel Tangent Methods with Variable Stepsize. In: IEEE International Conference on Neural Networks Proceedings, Vol. 2. IEEE, Piscataway, NJ (2004) 1063-1066
6. Yang, G.W., Tu, X.Y., Pang, J.: Research of Lossless Data Compression Based on a Virtual Information Source. Tien Tzu Hsueh Pao 31(5) (2003) 728-731
HyperSurface Classifiers Ensemble for High Dimensional Data Sets
Xiu-Rong Zhao, Qing He, and Zhong-Zhi Shi
The Key Laboratory of Intelligent Information Processing, Department of Intelligence Software, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
{zhaoxr, heq, shizz}@ics.ict.ac.cn
Abstract. Based on the Jordan Curve Theorem, a universal classification method called the HyperSurface Classifier (HSC) has recently been proposed. Experimental results show that in three-dimensional space this method works fairly well in both accuracy and efficiency, even for data sets of size up to 10^7. However, what we really need is an algorithm that can deal with data not only of massive size but also of high dimensionality. In this paper, an approach based on the idea of a classifiers ensemble obtained by dividing dimensions, without dimension reduction, is proposed for high dimensional data. The most important difference between the HSC ensemble and traditional ensembles is that the sub-datasets are obtained by dividing the features rather than by dividing the sample set. Experimental results show that this method performs well on high dimensional data sets.
1 Introduction
Classification is one of the most important problems in machine learning. In [1][2] a new classification method based on hypersurfaces (HSC) is put forward. In this method, a model of the hypersurface is obtained during the training process and is then used directly to classify large databases according to whether the winding number is odd or even, based on the Jordan Curve Theorem. Experiments show that HSC can efficiently and accurately classify large data sets in two-dimensional and three-dimensional space. Though HSC can in theory classify higher dimensional data according to the Jordan Curve Theorem, it is not as easy to realize HSC in higher dimensional space as in three-dimensional space. However, what we really need is an algorithm that can deal with data not only of massive size but also of high dimensionality. Thus in [3] a simple and effective dimension reduction method that loses no essential information is proposed, which is in nature a dimension changing method. In this paper, based on the idea of ensembles, another solution to the problem of HSC on high dimensional data sets is proposed. An ensemble of classifiers is a set of classifiers whose individual classification decisions are combined in some way, typically by weighted or unweighted voting, to classify new examples. As early as the beginning of the 1990s, Hansen and Salamon pointed out that an ensemble of neural networks performs far better than a single one [4], and Zhou proposed an approach to selecting an optimum subset of individual networks to constitute an ensemble [5]. The ensemble of HSC is constructed by dividing dimensions rather than by dimension reduction for high dimensional data. Experimental results show that this method performs well on high dimensional data sets. The rest of this paper is organized as follows: In Section 2, we give an outline of the hypersurface classifier. In Section 3, after discussing the problems with HSC on high dimensional data sets, we present the ensemble technique used in HSC. Our experimental results are presented in Section 4, followed by our conclusions in Section 5.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1299–1304, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Overview of the Classification Method Based on Hypersurface
HSC is a universal classification method based on the Jordan Curve Theorem in topology. The main difference between the well-known SVM algorithm and HSC is that HSC directly solves nonlinear classification problems in the original space, without mapping the data to a higher dimensional space and thus without the use of kernel functions. Biomimetic Pattern Recognition (BPR) theory, first proposed by Wang as a new model for pattern recognition [6], is based on "matter cognition" instead of "matter classification", and it is closer to the function of human cognition than traditional statistical pattern recognition, which takes "optimal separating" as its main principle. BPR is shown to be superior to several traditional pattern recognition methods in [7]. In [8], Cao applied DBF, which works by covering the high dimensional geometrical distribution of the sample set in the feature space, to object recognition and obtained good results. HSC, also a kind of covering algorithm, has the same goal of feature cognition as BPR, but our classification method is based on the Jordan Curve Theorem.

Jordan Curve Theorem. Let X be a closed set in n-dimensional space R^n. If X is homeomorphic to a sphere of dimension n − 1, then its complement R^n \ X has two connected components, one of which is called the inside, the other the outside.
Classification Theorem. For any given point x ∈ R^n \ X, x is in the inside of X ⇔ the winding number, i.e., the number of intersections between any ray from x and X, is odd; and x is in the outside of X ⇔ the number of intersections between any ray from x and X is even.
How to construct the separating hypersurface is an important problem; an approach is given in [1]. Based on the Jordan Curve Theorem, we have put forward the following classification method HSC [1].
Step1. Let the given samples be distributed in the inside of a rectangular region.
Step2. Divide the region into smaller regions called units. If some units contain samples from two or more classes, divide them repeatedly into a series of smaller units until each unit covers samples from at most one class.
Step3. Label each region according to the class of the samples inside it. The frontier vectors and the class vector then form a string for each region.
Step4. Combine adjacent regions of the same class to obtain a separating hypersurface, and save it as a string.
Step5. Input a new sample and calculate the number of intersections between the sample and the separating hypersurface; this can be done by drawing a ray from the sample. The class of the sample is then decided according to whether the number of intersections between the ray and the separating hypersurface is even or odd.
The classification algorithm based on hypersurfaces is a polynomial algorithm if the samples of the same class are distributed in finitely many connected components. Experiments show that HSC can efficiently and accurately classify large data sets in two-dimensional and three-dimensional space for multi-class problems. Even for three-dimensional data of size up to 10^7, HSC is still very fast [2].
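The classification theorem can be illustrated in two dimensions, where the "hypersurface" is a closed polygon and the intersection count is the classic ray-crossing test. This small sketch (an illustration, not the paper's implementation) classifies a point by the parity of crossings of a horizontal ray:

```python
def crossings(point, polygon):
    """Count intersections between a horizontal ray from `point` (towards
    +x) and the closed curve given by the polygon's vertex list."""
    px, py = point
    n = 0
    for i in range(len(polygon)):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % len(polygon)]
        if (y1 > py) != (y2 > py):              # edge spans the ray's height
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > px:                    # crossing lies on the ray
                n += 1
    return n

def inside(point, polygon):
    """Jordan-curve test: the point is inside iff the winding
    (intersection) number is odd."""
    return crossings(point, polygon) % 2 == 1
```

For the unit square `[(0, 0), (4, 0), (4, 4), (0, 4)]`, a point like `(2, 2)` yields an odd crossing count (inside) while `(5, 2)` yields an even one (outside).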
3 Hypersurface Classifiers Ensemble

3.1 Problems with HSC on High Dimensional Data Sets
According to the Jordan Curve Theorem, HSC can in theory deal with data sets of any dimensionality. In practice, however, doing so directly raises problems in both time and space: it is not as easy to realize HSC in higher dimensional space as in three-dimensional space, and what we really need is an algorithm that can deal with data not only of massive size but also of high dimensionality. Fortunately, many dimensionality reduction methods are available, and an effective dimensionality reduction method that loses no essential information is proposed in [3]. This method rearranges all the numerals of the higher dimensional data into lower dimensional data without changing their values, only their positions, according to some order, and is thus very suitable for HSC. In our work, we instead solve the problem based on the idea of ensembles, and good results are obtained. The method is described in further detail in the next part.

3.2 The Ensemble of HSC
To address the memory shortage problem of HSC on high dimensional data sets, we provide a solution based on the idea of ensembles. Attaching the same importance to each feature, we first group the multiple features of the data according to some rules to form sub-datasets, then start a training process and generate a classifier for each sub-dataset; the final determination is reached by integrating the resulting series of classification results. The detailed steps follow.

The Training Process
Step1. Get the dimension N of the conditional attributes from the data set of training samples.
Step2. Divide the features into ⌈N/3⌉ subsets, where ⌈N/3⌉ is the smallest integer not less than N/3. The i-th subset covers the features 3i − 2, 3i − 1, 3i, for i = 1, 2, 3, ..., ⌈N/3⌉, together with the decision attribute. If N is not divisible by 3, one or two features from the previous subset are added to the last subset. After this, the data set of training samples has been divided into ⌈N/3⌉ sub-datasets, each of the same size, with three conditional attributes and one decision attribute.
Step3. For each sub-dataset, eliminate any inconsistency that may have been caused in Step2. For a sample in a sub-dataset, if there exist other samples whose conditional attribute values are all the same but whose decision attribute value differs, we say an inconsistency occurs; in that case we simply delete these samples from the sub-dataset.
Step4. For each sub-dataset after Step3, start an HSC training process independently and save the training result as a model. We have now obtained ⌈N/3⌉ HSC classifiers, which we define as the classifiers ensemble.

The Classification Process
Step1. Get the dimension N of the conditional attributes from the data set of testing samples.
Step2. Divide the features into ⌈N/3⌉ subsets, where the i-th subset covers the features 3i − 2, 3i − 1, 3i, for i = 1, 2, 3, ..., ⌈N/3⌉. If N is not divisible by 3, one or two features from the previous subset are added to the last subset. After this, the data set of testing samples has been divided into ⌈N/3⌉ sub-datasets, each with three conditional attributes.
Step3. For each sub-dataset, run the corresponding HSC classifier and save the classification result. We now have ⌈N/3⌉ classification results for each testing sample in the original data set.
Step4. The final decision for each testing sample in the original data set is reached by voting. Here we adopt the plurality voting scheme, in which the collective decision is the classification result reached by more classifiers than any other. Note that we attach the same importance to all classifiers, so in the case of a tie between two or more classification results we randomly choose one of them.
HSC Ensemble classifies high dimensional data sets by analyzing multiple slices of the training and testing samples. Furthermore, the ensemble can be far less fallible than any single HSC classifier.
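The feature-splitting step and the plurality vote can be sketched as below. The helper names are hypothetical, and the handling of a short last subset (borrowing features from the previous subset so every subset has three features, so that e.g. 13 dimensions yield the 5 sub-classifiers of Table 1) is one reading of Step2:

```python
from collections import Counter

def split_features(n_features, width=3):
    """Partition feature indices 0..n_features-1 into ceil(n/width) groups
    of `width` consecutive features (Step2 of the training process)."""
    groups = [list(range(i, min(i + width, n_features)))
              for i in range(0, n_features, width)]
    short = width - len(groups[-1])
    if short and len(groups) > 1:
        # Borrow features from the previous subset so the last subset also
        # has `width` features (these features then appear in both subsets).
        groups[-1] = groups[-2][-short:] + groups[-1]
    return groups

def plurality_vote(votes):
    """Step4 of the classification process: the collective decision is the
    class predicted by more sub-classifiers than any other."""
    return Counter(votes).most_common(1)[0][0]
```

For the Wine data set (13 features) this yields the 5 sub-classifiers of Table 1; Iris (4 features) yields 2.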
4 Experiments
As one might expect, the main problem with the HSC ensemble is inconsistency in the sub-datasets; in that case, we have to delete some samples to keep them consistent. Table 1 gives the classification results of the HSC ensemble on high dimensional data sets in which there is no inconsistency in the sub-datasets. From this table, we can see that the HSC ensemble works fairly well on these high dimensional data sets, with very good recall and accuracy. In particular, for the Iris data set there are only two classifiers, and the accuracy is lower than for the other, higher dimensional data sets; evidently, the higher the dimension, the higher the accuracy. We now give the classification results on the Wdbc data set in detail as an example.
Table 1. Classification results of HSC ensemble on high dimensional data sets

Data Set  Dimensions  Training Samples  Testing Samples  Sub-Classifiers  Recall  Accuracy
Iris      4           100               50               2                100%    98.00%
Wine      13          128               50               5                100%    100%
Wdbc      30          369               200              10               100%    100%
Sonar     60          158               50               20               100%    100%
Table 2. Classification results of HSC ensemble on the data set of Wdbc

Classifier No.        Recall  Accuracy
Classifier1           100%    99.00%
Classifier2           100%    98.00%
Classifier3           100%    97.00%
Classifier4           100%    95.00%
Classifier5           100%    94.00%
Classifier6           100%    86.50%
Classifier7           100%    97.50%
Classifier8           100%    99.50%
Classifier9           100%    89.00%
Classifier10          100%    96.50%
Classifiers Ensemble  100%    100%
From the experimental results and analysis above, we can see that the HSC ensemble performs very well on high dimensional data sets and is particularly suitable for data sets in which samples differ in every slice.
5 Conclusion
As a complement to HSC on high dimensional data sets, an approach based on the idea of ensembles is presented. Attaching the same importance to each feature, we first group the multiple features of the data to form sub-datasets, then start a training process and generate a classifier for each sub-dataset; the final determination is reached by integrating the series of classification results through voting. The most important difference between the HSC ensemble and traditional ensembles is that the sub-datasets are obtained by dividing the features rather than by dividing the sample set, so in the case of no inconsistency the size of each sub-dataset equals that of the original sample set, while in total occupying only a little more storage space than the original sample set. Experiments show that this method performs well on high dimensional data sets, especially those in which samples differ in every slice.
Acknowledgements
This work is supported by the National Science Foundation of China No. 60435010, National Basic Research Priorities Programme No. 2003CB317004, and the Nature Science Foundation of Beijing No. 4052025.
References
1. He, Q., Shi, Z.Z., Ren, L.A.: The Classification Method Based on Hyper Surface. In: Proc. 2002 IEEE International Joint Conference on Neural Networks, Las Vegas (2002) 1499-1503
2. He, Q., Shi, Z.Z., Ren, L.A., Lee, E.S.: A Novel Classification Method Based on Hyper Surface. Int. J. of Mathematical and Computer Modeling 38 (2003) 395-407
3. He, Q., Zhao, X.R., Shi, Z.Z.: A Kind of Dimension Reduction Method for Classification Based on Hypersurface. In: Wang, X.Z. (eds.): Proc. of International Conference on Machine Learning and Cybernetics. IEEE Press (2005) 3248-3253
4. Hansen, L.K., Salamon, P.: Neural Network Ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(10) (1990) 993-1001
5. Zhou, Z.H., Wu, J., Jiang, Y., Chen, S.F.: Genetic Algorithm Based Selective Neural Network Ensemble. In: Proc. the 17th International Joint Conference on Artificial Intelligence, Seattle, WA (2001) 797-802
6. Wang, S.J.: Bionic (Topological) Pattern Recognition—A New Model of Pattern Recognition Theory and Its Applications. Acta Electronica Sinica 30(10) (2002) 1417-1420
7. Wang, S.J., Qu, Y.F., Li, W.J., Qin, H.: Face Recognition: Biomimetic Pattern Recognition vs. Traditional Recognition. Acta Electronica Sinica 32(7) (2004) 1057-1061
8. Cao, W.M., Hao, F., Wang, S.J.: The Application of DBF Neural Networks for Object Recognition. Information Sciences—Informatics and Computer Science: An International Journal 160(1-4) (2004) 153-160
Designing a Decompositional Rule Extraction Algorithm for Neural Networks Jen-Cheng Chen1, Jia-Sheng Heh2, and Maiga Chang3 1 Department of Electronic Engineering, Chung Yuan Christian University, Chung Li, Taiwan, R.O.C. 32023 Chung-Li, Taiwan [email protected] 2 Department of Information and Computer Engineering, Chung Yuan Christian University, Chung Li, Taiwan, R.O.C. 32023 Chung-Li, Taiwan [email protected] 3 National Science and Technology Program for e-Learning, P.O. Box 9-187, 32099 Chung-Li, Taiwan [email protected] http://maiga.dnsalias.org/maiga/resume_e.htm
Abstract. Neural networks have been successfully applied in many domains. However, because the results produced by a neural network are difficult to explain, its decision process is often regarded as a black box. Explaining the reasoning is important in applications such as credit approval and medical diagnosis software. Therefore, rule extraction algorithms are becoming increasingly important for explaining neural networks through extracted rules. In this paper, a decompositional algorithm for extracting rules from neural networks is analyzed and designed. The algorithm is simple but efficient; it reduces the number of extracted rules while improving the efficiency of the algorithm at the same time. Moreover, the algorithm is compared with two other algorithms, M-of-N and Garcez, on the MONK's problem.
1 Introduction
Although both expert systems and neural networks are typical systems in the domain of artificial intelligence, their basic components differ. The knowledge base of an expert system is a set of rules stored in symbolic form, while a neural network encodes learned knowledge within an established structure with adjustable weights in numerical form. Hence, it is difficult to transfer the training results of a neural network into the knowledge base of an expert system. The solutions represented by the knowledge base of an expert system can be explained and understood by users [10]; this explainable feature is the main advantage of expert systems. In contrast, neural networks have excellent abilities for classifying data and learning from inputs [7], but it is difficult to describe the decision process of a neural network or to merge more than one trained neural network [1][11]. The key for neural networks to overcome this deficiency is rule extraction. Rule extraction is the process of discovering the knowledge encoded within a neural network and representing these knowledge pieces in symbolic form, just like a rule base in an expert system. Once rules representing the neural network are extracted, an expert system can use those rules as its knowledge base and gain the advantages of both neural networks and expert systems [15]. Although neural-network-based systems perform well, they still cannot make users feel comfortable, due to the 'black box' effect of neural networks. Therefore, several rule extraction methods for neural networks have been presented, classified into two categories: pedagogical and decompositional algorithms [18]. Pedagogical rule extraction algorithms aim to extract rules that map inputs directly to outputs [2]; typical algorithms of this category are "VIA" [17], "rule-extraction-as-learning" [6], and "RULENEG" [12]. Most pedagogical algorithms become ineffective as the size of the neural network increases, as in any real-world problem. To improve efficiency, decompositional algorithms heuristically search for and extract rules from the individual neurons of a neural network. Existing decompositional algorithms include "KT" [8], "Subset" [19], "M-of-N" [20], "RULEX" [3], "Setiono" [14], and "Garcez" [9]. Some decompositional algorithms use weight pruning for more effective processing [19][13][14]; however, such algorithms do not guarantee that the pruned network is equivalent to the original one [9], so the accuracy of the extracted rules may decrease. Most decompositional algorithms also require additional time to retrain the pruned network with the original training data.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1305–1311, 2006. © Springer-Verlag Berlin Heidelberg 2006
This paper develops a decompositional algorithm that extracts rules directly from a trained neural network without retraining, while keeping high accuracy and low complexity at the same time. The algorithm is designed using the concept of bound decomposition of a neural network's weights. The algorithm, the Bound Decomposition Tree (BDT) algorithm, offers an efficient method to extract rules from a neuron. This paper analyzes the hyperspaces and hypercubes induced by the hyperplane of a neuron in order to develop an algorithm that extracts positive, negative, and uncertain Boolean rules from the neural network. After applying Boolean logic, these extracted rules are simplified and stored in the knowledge base. Users of neural-network-based systems can then understand the reasoning of the neural networks through these human-readable rules.
2 Rule-Extraction Mechanism
A typical neuron used in this paper is defined as

$$y = f(\mathbf{w}^T \mathbf{x} + w_0) = f\Big(\sum_{j=1}^{N} w_j x_j + w_0\Big) \quad (1)$$

in which N is the number of neuron inputs, w = [w_1, w_2, ..., w_N]^T is the vector of connection weights, w_0 is the bias, and x = [x_1, x_2, ..., x_N]^T is the input of the neuron. The threshold function f: R → [0, 1] is monotonically increasing and may be linear, piecewise linear, hard-limited, or sigmoid, as defined in a conventional neural network textbook. The weight vector and its threshold can be collected as a set of augmented
weights w* = [w, w_0]^T, corresponding to the augmented input x* = [x, 1]^T. When the neuron inputs are binary, x = [x_j] ∈ Ξ = {0, 1}^N, it is called a binary neuron. In essence, the outputs of a binary neuron are linearly separable and divide the N-dimensional space R^N into two halfspaces, as Fig. 1 shows. Therefore, the input states can be classified into a positive state set Ξ_{p,w*} and a negative state set Ξ_{n,w*}. Take the input states in Fig. 1 as an example:
Ξ_{p,w*} = {110, 111, 100} = {11-, 1-0}   (2)

and

Ξ_{n,w*} = {010, 000, 011, 001, 101} = {0--, -01},   (3)
in which '-' is called don't care state, meaning this state can be either '0' or '1' from the viewpoint of boolean logic. For example, 1-0 corresponds to x1x3' and {11-, 1-0} corresponds to x1x2+x1x3'. Positive Halfspace
[Figure: the eight binary input states ranked by activation w^T x + w0, from 110 (max, 0.75), 100 (0.35), and 111 (0.25) in the positive halfspace, across the hyperplane w^T x + w0 = 0 (the neuron equation), to 010 (-0.15), 101 (-0.15), 000 (-0.55), 011 (-0.65), and 001 (min, -1.05) in the negative halfspace; the cubes x1x2 (11-), x1x3' (1-0), x1' (0--), and x2'x3 (-01) are marked.]
Fig. 1. The outputs of a binary neuron are linearly separable. According to the output, the input states can be divided into two halfspaces.
3 Bound Decomposition Tree Algorithm

For the example in Fig. 1, the initial bounds between the negative halfspace and the positive halfspace are -1.05 (001) and 0.75 (110). These initial bounds can be written as bounds(Ψ = '---') = [-1.05, 0.75] and boundvar(Ψ = '---') = 1.8. Moreover, a hypercube involves several input states, and those states might come from different halfspaces. Therefore, a hypercube can be classified as a positive cube, a negative cube, or an uncertain hypercube. With these definitions, the rule extraction algorithm (extracting Boolean rules for a single neuron) is designed as follows:
1308
J.-C. Chen, J.-S. Heh, and M. Chang
1. Given a set of weights and threshold w* = [w, w0] in Eq. (1) for a single neuron, and a rule margin ∆.
2. Sort the weights w = [wj] by their absolute values to obtain an index sequence IS, largest |wj| first.
3. Set the layer number NL = 0, Ψ = Ξw*, bound(Ψ), and boundvar(Ψ).
4. Let NL = NL + 1, j = the first index in IS, and IS = IS \ {j}.
5. Decompose each uncertain hypercube Ψ = positivecube(·) ∪ negativecube(·) in the deepest layer into a positive hypercube and a negative hypercube.
6. For each decomposed hypercube, calculate the bounds [lbound(·), ubound(·)].
7. Set boundvar(·) = boundvar(Ψ) − |wj|.
8. If all new hypercubes are positive or negative, or the sum of the remaining weights is less than ∆, stop; otherwise go to step 4.
9. Transform all the calculated hypercubes into the positive rule Rp,w*, the negative rule Rn,w*, and the uncertain rule Ru,w*.
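The decomposition can be illustrated with a minimal recursive sketch (not the authors' implementation; the helper names are ours, and the rule-margin ∆ stopping test is omitted for brevity). Fixing variables in order of decreasing |wj| and bounding the activation over the remaining free variables yields positive and negative cubes directly:

```python
def extract_rules(w, w0):
    """Decompose the binary input cube of a neuron y = f(w.x + w0)
    into positive cubes (activation > 0 for every completion) and
    negative cubes, splitting on the largest remaining |w_j| first."""
    order = sorted(range(len(w)), key=lambda j: -abs(w[j]))
    pos, neg = [], []

    def recurse(assigned, acc, depth):
        free = order[depth:]
        lbound = acc + w0 + sum(min(w[j], 0.0) for j in free)
        ubound = acc + w0 + sum(max(w[j], 0.0) for j in free)
        if lbound > 0:            # every completion is positive
            pos.append(dict(assigned))
        elif ubound <= 0:         # every completion is negative
            neg.append(dict(assigned))
        else:                     # uncertain cube: split on next index
            j = order[depth]
            for v in (0, 1):
                recurse(assigned + [(j, v)], acc + w[j] * v, depth + 1)

    recurse([], 0.0, 0)
    return pos, neg
```

With w = [0.9, 0.4, -0.5] and w0 = -0.55 (weights that reproduce the activation values shown in Fig. 1), the positive cubes come out as {x1=1, x3=0} and {x1=1, x2=1, x3=1}; they differ in form from (2) but cover exactly the positive states {110, 111, 100}.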
4 Experiments

In this section, a well-known benchmark, the MONK's problem [16], is used to verify the performance of the BDT algorithm. The MONK's problem has been widely used as a benchmark for machine learning systems and many rule extraction algorithms. Moreover, comparisons between BDT, MofN, and Garcez's algorithm are also made. All three algorithms are implemented in both Matlab and Java, and the test results come from 100 execution cycles of the MONK's problem. The experiment environment is an Intel Celeron 1.7 GHz CPU running Microsoft Windows XP. The experimental data of the MONK's problem describe robots with six attributes: head_shape ∈ {round, square, octagon}; body_shape ∈ {round, square, octagon}; is_smiling ∈ {yes, no}; holding ∈ {sword, balloon, flag}; jacket_color ∈ {red, yellow, green, blue}; and has_tie ∈ {yes, no}. The neural networks used to verify the BDT algorithm and to make comparisons for solving the MONK's problem are Thrun's trained BPNs [17]. The accuracy of these networks on the training data is 100%. With the proposed algorithm, the corresponding rules are extracted with a 99.5% accuracy rate. After obtaining the extracted rules by the BDT algorithm, comparisons between BDT, MofN, and Garcez's algorithm are made and listed in Table 1. Table 1 contains four parts: the first part is the algorithm name and rule format; the second is the number of rules and the related accuracy rates produced by the three algorithms; the third is the complexity of the extracted rules; and the fourth is how much time the three algorithms need to extract rules from the neural networks. Before the experimental results of each part of Table 1 are discussed, several issues need to be clarified first.
The first issue is the rule format: since the rule formats of the three algorithms differ, the numbers of extracted rules are difficult to compare directly. Therefore, in both the second and third parts, rules in M-of-N form are reconstructed into rules in Boolean logic
Table 1. The experiment results for solving the MONK's problem using the BDT, MofN, and Garcez algorithms (*: the number in parentheses is the rule number in Boolean logic format)

                 BDT       MofN (kmean=7)  MofN (kmean=3)  MofN (kmean=2)  Garcez
Rule Format      Boolean   M-of-N          M-of-N          M-of-N          Boolean
Rule Number
  h1             286       11 (415)*       4 (2555)*       3 (2865)*       45369
  h2             1969      45 (2398)*      9 (3475)*       3 (14960)*      50276
  h3             531       12 (327)*       5 (6286)*       2 (9250)*       54168
  o1             3         1 (3)*          1 (3)*          1 (3)*          3
Complexity of the Extracted Rules (antecedent number)
  h1             1909      36 (2980)*      8 (22820)*      5 (21140)*      417131
  h2             15692     166 (18621)*    20 (27543)*     5 (120250)*     454960
  h3             3685      35 (2338)*      10 (46922)*     3 (73950)*      477476
  o1             6         1 (6)*          1 (6)*          1 (6)*          6
Accuracy         99.5%     99.5%           98.6%           98.1%           99.5%
Execution Time (running 1000 times, unit: second)
  Matlab         71.0      76.7            25.16           16.6            35094
  Java           26.1      3.14            1.26            0.59            969.5
form. The reconstructed Boolean logic rule number is listed in parentheses just after the M-of-N rule number. The second issue is the accuracy rate: the accuracy rates come from using the extracted rules to classify the test data set of the MONK's problem. The third issue is the complexity of the extracted rules. Consider, for example, the Boolean logic rule x1x2 + x3x4'. Here the extracted rule number is 2 (rule x1x2 and rule x3x4'), but there are four antecedents (x1, x2, x3, and x4'). The last issue is the execution time: since programming languages differ in running speed, two languages, Java and Matlab, are used to implement the three algorithms. In Table 1, h1-h3 are the three hidden nodes in the hidden layer and o1 is the output neuron. "Rule Number" is the number of rules extracted from a neuron, and "The Complexity of the Extracted Rules" is the number of antecedents of all rules. Obviously, a smaller "Rule Number" means fewer rules, and a smaller "Complexity of the Extracted Rules" means more concise rules. Table 1 shows that the BDT algorithm greatly reduces both the "Rule Number" and the "Complexity of the Extracted Rules" while keeping the accuracy rate high. We can also see that the execution time of BDT is less than that of MofN with k-mean equal to 7. It is acceptable that less time is needed for the MofN algorithm with k-mean equal to 3 and 2, because the smaller the value of k-mean, the lower the accuracy rate. The only anomaly occurs in the execution time of the Java implementations: as the last row of Table 1 shows, the Java version of the BDT algorithm takes much more time than the Java version of the MofN algorithm.
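The rule-complexity measure discussed above can be computed mechanically from a Boolean rule string; a small illustrative helper (the name is ours):

```python
import re

def rule_stats(expr):
    """Count rules (sum-of-products terms) and antecedents (literals)
    in a Boolean expression such as "x1x2+x3x4'"."""
    terms = expr.split('+')
    literals = [re.findall(r"x\d+'?", t) for t in terms]
    return len(terms), sum(len(ls) for ls in literals)
```

`rule_stats("x1x2+x3x4'")` returns (2, 4): two rules with four antecedents in total, matching the example above.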
5 Future Work

Based on an analysis of the hyperplane, this paper proposed an algorithm that extracts the positive, negative, and even uncertain rules of a neuron at the same time. After the
rules are extracted from a well-trained neural network, users can gain a clearer understanding of the neural network-based system. Moreover, the extracted rules could be used as the knowledge base of expert systems [4][5]. Table 1 verifies the BDT algorithm against two well-known rule extraction algorithms, the MofN and Garcez algorithms. The BDT algorithm achieves better performance in the number of extracted rules, the accuracy rate, the rule complexity, and the execution time. Although the BDT algorithm can extract rules from a neural network with low complexity, the system still needs to be enhanced. For example, the extracted rules are currently in Boolean logic format; in the near future, the BDT algorithm should be improved to deal not only with binary inputs but also with multi-level and even continuous inputs.
References
1. Andrews, R., Diederich, J., Tickle, A.B.: Survey and Critique of Techniques for Extracting Rules from Trained Artificial Neural Networks. Knowledge-Based Systems 8(6) (1995) 373-389
2. Andrews, R., Geva, S.: Inserting and Extracting Knowledge from Constrained Error Back Propagation Networks. In Proc. of 6th Australian Conference on Neural Networks, Sydney, NSW (1995)
3. Andrews, R., Geva, S.: Rule Extraction from Local Cluster Neural Nets. Neurocomputing 47 (2002) 1-20
4. Chang, M., Chen, J.C., Chang, J.W., Heh, J.S.: Advanced Process Control Expert System of CVD Membrane Thickness Based on Neural Network. Materials Science Forum 505-507 (2006) 313-318
5. Chen, J.C., Liu, T.S., Weng, C.S., Heh, J.S.: An Expert System of Coronary Artery Disease in Chinese and Western Medicine. In Proc. of 6th Asian-Pacific Conference on Medical and Biological Engineering, Tsukuba, Japan, Apr. 24-27, 2005 (2005) PA-3-06
6. Craven, M.W., Shavlik, J.W.: Using Sampling and Queries to Extract Rules from Trained Neural Networks. In Proc. of 11th International Conference on Machine Learning, New Brunswick, NJ (1994) 37-45
7. Freeman, J.A., Skapura, D.M.: Neural Networks. Addison-Wesley, New York (1992)
8. Fu, L.M.: Rule Generation from Neural Networks. IEEE Transactions on Systems, Man, and Cybernetics 28(8) (1994) 1114-1124
9. Garcez, A.S. d'Avila, Broda, K., Gabbay, D.M.: Symbolic Knowledge Extraction from Trained Neural Networks: A Sound Approach. Artificial Intelligence 125 (2001) 155-207
10. Jackson, P.: Introduction to Expert Systems. Addison-Wesley, New York (1999)
11. Krishnan, R., Sivakumar, G., Bhattacharya, P.: A Search Technique for Rule Extraction from Trained Neural Networks. Pattern Recognition Letters 20 (1999) 273-280
12. Pop, E., Hayward, R., Diederich, J.: RULENEG: Extracting Rules from a Trained Neural Network Using Stepwise Negation. Technical Report, Neurocomputing Research Centre, Queensland University of Technology (1994)
13. Setiono, R.: A Penalty Function Approach for Pruning Feedforward Neural Networks. Neural Computation 9(1) (1997) 185-204
14. Setiono, R.: Extracting Rules from Neural Networks by Pruning and Hidden-unit Splitting. Neural Computation 9(1) (1997) 205-225
15. Taha, I.A., Ghosh, J.: Symbolic Interpretation of Artificial Neural Networks. IEEE Transactions on Knowledge and Data Engineering 11(3) (1999) 448-463
16. Thrun, S.B. (ed.): The MONK's Problems: A Performance Comparison of Different Learning Algorithms. Technical Report CMU-CS-91-197, Carnegie Mellon University (1991)
17. Thrun, S.B.: Extracting Provably Correct Rules from Artificial Neural Networks. Technical Report IAI-TR-93-5, Institut für Informatik III, University of Bonn, Germany (1994)
18. Tickle, A.B., Andrews, R., Golea, M., Diederich, J.: The Truth Will Come to Light: Directions and Challenges in Extracting the Knowledge Embedded within Trained Artificial Neural Networks. IEEE Trans. Neural Networks 9(6) (1998) 1057-1068
19. Towell, G.G., Shavlik, J.W.: The Extraction of Refined Rules from Knowledge-Based Neural Networks. Machine Learning 13 (1993) 71-101
20. Towell, G.G., Shavlik, J.W.: Knowledge-Based Artificial Neural Networks. Artificial Intelligence 70 (1994) 119-165
Estimating Fractal Intrinsic Dimension from the Neighborhood

Qutang Cai and Changshui Zhang
National Laboratory of Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
[email protected], [email protected]
Abstract. This paper proposes a new method for fractal intrinsic dimension (ID) estimation. Local fractal features are extracted and combined to obtain the estimate. Compared with contemporary methods for fractal ID estimation, the proposed method requires less computation, reaches accurate results, and can be flexibly extended. Both theoretical analysis and experimental results show its validity.
1 Introduction
Datasets of differing complexity are often used in knowledge discovery and machine learning. It is known that data complexity can significantly affect modeling efficiency and algorithm performance, and estimating this complexity may be essential for model selection and parameter adaptation. For measuring complexity, the intrinsic dimension has been widely used: it is closely related to storage requirements, computational complexity, and classifier capability [1]. The intrinsic dimension can be estimated by several methods from different viewpoints: PCA takes the number of largest eigenvalues; Isomap [2] and LLE [3] use the minimal dimensionality of the embedding space; the TRN-based method [4] adopts the cross relation number learned by the neurons; and the fractal methods are based on geometric characteristics from fractal theory [5]. The fractal intrinsic dimension has drawn much attention and has been successfully applied in areas such as texture synthesis and data compression [6] since its theoretical foundation by Mandelbrot [7]. There are several types of fractal dimensions, such as the Hausdorff dimension, the box-counting dimension, and modified dimensions, of which the box-counting dimension (written DimB) is the most widely accepted for the intrinsic dimension [8]. One advantage of the box-counting dimension is that its value can take any nonnegative real number, which permits a subtler measurement of complexity, while the other methods mentioned above give only integer estimates. Unfortunately, the computation of the box-counting dimension tends to increase exponentially with the dimension, so the correlation dimension [9] has been proposed as an approximation, which is highly accurate for uniformly distributed data. The correlation dimension is theoretically lower
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1312-1318, 2006. © Springer-Verlag Berlin Heidelberg 2006
than the box-counting dimension, and reference curve methods were proposed in [10] for additional revision. However, the reference curve methods require abundant prior experimental results and assume that the correlation dimension estimate is influenced only by the intrinsic dimension and the size of the sample set, which is not always true, because the estimate is also affected by the data distribution. The packing number method in [11] returns to the box-counting dimension via a statistical technique that is somewhat robust to various sample distributions, but it requires a high computational cost because repeated calculations are needed to obtain the average. In this paper, we develop an estimation method that is based on a new viewpoint, the local behavior of the data, and requires a lower computational cost. The remainder of this paper is organized as follows: in Section 2, necessary definitions are provided and a feature reflecting the box-counting dimension is proposed; in Section 3, the estimation methods are derived; experimental results are shown in Section 4, followed by conclusions in Section 5.
2 Definitions and Theoretical Results
The box-counting dimension is defined as follows:

Definition 1 (Box-counting dimension). Let X ⊆ R^n be a bounded set and δ ∈ R+. Let Ξδ(X) = {γ | γ ⊆ X and X can be covered by the balls with centers in γ and radius δ}, and Nδ(X) = min_{γ∈Ξδ(X)} |γ|, where |·| denotes cardinality. The box-counting dimension of X is defined as

DimB(X) = − lim_{δ→0} log(Nδ(X)) / log(δ).   (1)
In (1), Nδ(X) (written Nδ for short) and δ satisfy the relation Nδ = C(δ)δ^{−s}, where C(δ) is a function of δ with lim_{δ→0} log(C(δ))/log(δ) = 0, and C(δ) is assumed to be a common constant for small δ in practice. For a sample set X̂ = {x1, x2, ..., xN}, the common estimation procedure is to select a set of positive real numbers δ1, δ2, ..., δn, compute N̂δ1, N̂δ2, ..., N̂δn, plot the log(N̂δ) versus log(δ) curve, and use the slope of its major linear part as the estimate. In applications, the power-law assumption is always made: for small r1, r2 ∈ R+ (r2 ≪ r1), (most) points x in X satisfy Nr2(B(x, r1)) ≈ const · r2^{−s}, where B(x, r1) denotes the closed ball centered at x with radius r1. We have two lemmas derived from the power law.

Lemma 1. For small r1, r2 ∈ R+ (r2 ≪ r1) and most x ∈ X, the relation Nr2(B(x, r1)) ≈ const · (r1/r2)^s holds, where the constant lies in [C(r2)/C(r1), 2^s C(r2)/C(2r1)].

Proof. By the power law, for most x ∈ X, Nr2(B(x, r1)) is approximately equal; denote this common value by V. Consider the set composed of the centers in one minimal covering of X with balls of radius 2r1, written O1. According
1314
Q. Cai and C. Zhang
to the above: |O1 | =
C(2r1 ) (2r1 )s .
Then Nr2 ( ∪ (B(x, r1 ) ∩ X)) ≤ Nr2 (X) since x∈O1
∪ (B(x, r1 ) ∩ X) ⊆ X.
x∈O1
Since Nr2 ( ∪ (B(x, r1 ) ∩ X)) ≈ |O1 | · V , it can be seen that x∈O1
Nr2 (X) 2s C (r2 ) V ≤ = |O1 | C (2r1 )
r1 r2
s .
Now consider the set of centers in one minimal covering of X with radius r1 , written by O2 . Since ∪ (B(x, r1 ) ∩ X) = X and Nr2 ((B(x, r1 ) ∩ X)) ≥ x∈O2
x∈O1
Nr2 (X), we have V ≥
Nr2 (X) C (r2 ) = |O2 | C (r1 )
r1 r2
s
.
Lemma 2. Let X̂ = {x1, x2, ..., xN} be a sample set from X satisfying the power law. Assign a weight β(y, ε) to each y in X̂ such that there exist constants a, b > 0 with, for all ε > 0 and all x ∈ X,

Σ_{y∈B(x,aε)∩X̂} β(y, ε) ≥ 1  and  Σ_{y∈B(x,bε)∩X̂} β(y, ε) ≤ 1.

Fix positive numbers δ, ξ (ξ ≪ δ). Then for almost all x ∈ X̂,

c2 (δ/ξ)^s ≤ Σ_{y∈B(x,δ)∩X̂} β(y, ε) ≤ c1 (δ/ξ)^s,

where c1 = 2^s C(bξ)/(b^s C(δ)) and c2 = C(2aξ)/(C(δ)(2a)^s).

Proof. We have the following two inequalities:

Σ_{y∈B(x,δ)∩X̂} β(y, ε) ≥ N2aξ(B(x, δ) ∩ X̂) ≈ N2aξ(B(x, δ) ∩ X) ≥ (C(2aξ)/(C(δ)(2a)^s)) (δ/ξ)^s,

Σ_{y∈B(x,δ)∩X̂} β(y, ε) ≤ Nbξ(B(x, δ) ∩ X̂) ≈ Nbξ(B(x, δ) ∩ X) ≤ (2^s C(bξ)/(b^s C(δ))) (δ/ξ)^s.

Setting c1 = 2^s C(bξ)/(b^s C(δ)) and c2 = C(2aξ)/(C(δ)(2a)^s), the conclusion follows.

Since C(δ) is constant, in Lemma 2, Σ_{y∈B(x,δ)∩X̂} β(y, ε) ∝ (δ/ξ)^s. Thus β is a local feature reflecting s, and the dimension can be estimated by the statistic

ŝ = ∇log(Ex(Σ_{y∈B(x,δ)∩X̂} β(y, ε))) / ∇log(δ),   (2)

where E(·) denotes the expectation.
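The log(N̂δ)-versus-log(δ) procedure described at the start of this section can be illustrated as follows, using occupied grid boxes as the usual computable stand-in for a minimal covering (a simplification, not the paper's own estimator, which is developed in the next section; the function name is ours):

```python
import numpy as np

def box_counting_dim(points, deltas):
    """Estimate DimB by counting occupied grid boxes of side delta
    and fitting the slope of log(N_delta) against log(delta)."""
    points = np.asarray(points, dtype=float)
    counts = []
    for d in deltas:
        # A point falls in the box indexed by floor(coordinate / d).
        boxes = {tuple(idx) for idx in np.floor(points / d).astype(int)}
        counts.append(len(boxes))
    slope = np.polyfit(np.log(deltas), np.log(counts), 1)[0]
    return -slope   # DimB = -lim log(N_delta)/log(delta)
```

For points densely sampled from a line segment the estimate is close to 1, and for a filled planar patch close to 2, as expected.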
3 Estimation Method
To countervail the influence of different sampling methods, we apply a weighting strategy to the local features in (2). Consider the following formula:

W(δ, ξ) = [Σ_{i=1}^N α(xi, δ, ξ) Σ_{j=1}^N β(xj, ξ) I(d(xi, xj) < δ)] / [Σ_{i=1}^N α(xi, δ, ξ)],   (3)

where I(·) is an indicator function and d(xi, xj) denotes the distance between xi and xj. In (3), α(xi, δ, ξ) is subject to the restriction that there exist positive constants c1, c2 such that for all x ∈ X, the inequalities

Σ_{y∈B(x,c1δ)∩X̂} α(y, δ, ε) ≤ 1,  Σ_{y∈B(x,c2δ)∩X̂} α(y, δ, ε) ≥ 1   (4)

hold. Restriction (4) guarantees that the weight of the local feature of each small patch in the sample space is comparable to the others. In a way similar to Section 2, we have

W(δ, ξ) ∝ (δ/ξ)^s.   (5)

The estimation procedure can then be stated as follows:
1) Fix δ1, δ2, ξ and assign weights for each xi: α(xi, δ1, ξ), α(xi, δ2, ξ), β(xi, ξ).
2) Calculate W(δ1, ξ) and W(δ2, ξ) using (3).
3) Estimate the box-counting dimension by

ŝ = (log(W(δ1, ξ)) − log(W(δ2, ξ))) / (log(δ1) − log(δ2)).   (6)
The procedure does not specify how to determine α(·) and β(·). They can be chosen according to prior knowledge, e.g., α(·) = β(·) = 1 for uniformly sampled data. When no prior knowledge is available, the resampling-based method described below can be used. Since α and β are subject to similar restrictions, we simply make α independent of ξ, i.e., α(x, δ, ξ) = α(x, δ), and use the same method for setting α and β. Take α for example. Construct the weighting function from α^(k)(x, δ), k = 1, ..., K, as follows: randomly select a subset S^(k) from X such that the pairwise distances in S^(k) are not less than 2δ and the minimum distance between points in X \ S^(k) and points in S^(k) is less than 2δ. Choose c ∈ [0, 1/4); for any x in X, if B(x, cδ) ∩ S^(k) ≠ ∅ (the intersection then contains exactly one point, denoted x0), set α^(k)(x, δ) = 1/|B(x0, cδ)|; otherwise set α^(k)(x, δ) = 0. Finally, set α(x, δ) = (1/K) Σ_{k=1}^K α^(k)(x, δ). It can be seen that this construction meets the desired restrictions Σ_{y∈B(x,δ/2)∩X̂} α(y, δ, ε) ≤ 1 and Σ_{y∈B(x,2δ)∩X̂} α(y, δ, ε) ≥ 1.
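A minimal sketch of the estimator (3)-(6) under the uniform choice α(·) = β(·) = 1 (the resampling construction above is omitted; the function name is ours). With unit weights, W(δ, ξ) reduces to the average number of sample pairs within distance δ:

```python
import numpy as np

def estimate_dim(X, delta1, delta2):
    """Estimate s via (6): with alpha = beta = 1, W(delta) in (3)
    becomes the mean count of sample pairs closer than delta."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W1 = (D < delta1).sum() / len(X)
    W2 = (D < delta2).sum() / len(X)
    return (np.log(W1) - np.log(W2)) / (np.log(delta1) - np.log(delta2))
```

On points sampled from a one-dimensional segment, the estimate comes out near 1, as the theory predicts for uniform data.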
When the weighting functions are determined, the computation is proportional to the square of the number of points, which is comparable to the correlation dimension method and much lower than the packing number method. Besides, the results of the proposed method are more stable than those of the packing number method, whose multiple trials tend to have high variance.
4 Experimental Results
Three datasets, including fractal graphs, Swiss roll data, and a face image sequence, are tested. For simplicity, in all experiments, let c = 0 and K = 50.

4.1 Experiment 1: Classical Fractal Graphs
Two classical fractal graphs, the Koch curve and the Sierpinsky gasket, with box-counting dimensions of 1.26 and 1.59 respectively, are used to verify the proposed method. The datasets are generated by the iterated function systems of the self-similar sets. Parameters are set as follows: r ranges from 0.005 to 0.15 at intervals of 0.005, and ξ = 0.003, 0.005, 0.007, 0.009. The curves of the logarithm of W(δ, ξ), Corrδ (the correlation integral in [9]), and Packδ (the packing number) versus log(δ) are plotted in Fig. 1(c)(d), in which the curves of our method go from top to bottom as ξ increases. The estimates obtained from the slopes of the major linear parts of the curves are shown in Table 1; all the algorithms achieve reasonable results. In the experiment, multiple estimations are inevitable for the packing number method to get a stable result, since the variance of a single estimation is large. The ratio of CPU times of the correlation dimension method, our method, and the packing number method is approximately 1:1.2:22 for the Koch curve and 1:1.2:27 for the Sierpinsky gasket, which shows that our method has a much lower computational cost than the packing number method.

Table 1. Estimation results for classical fractal graphs

Dataset      Correlation Dimension   Our Method   Packing number method
Koch         1.26                    1.23         1.22
Sierpinsky   1.58                    1.58         1.51
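The reference values 1.26 and 1.59 follow from the similarity dimension of these self-similar sets, which coincides here with the box-counting dimension (a standard calculation, not from the paper):

```latex
% A self-similar set made of m copies of itself scaled by ratio r
% can be covered by on the order of m^k boxes of side r^k, so
\mathrm{Dim}_B = \lim_{k\to\infty} \frac{\log m^k}{\log (1/r)^k}
              = \frac{\log m}{\log (1/r)}.
% Koch curve:        m = 4, r = 1/3:  \log 4 / \log 3 \approx 1.26
% Sierpinsky gasket: m = 3, r = 1/2:  \log 3 / \log 2 \approx 1.59
```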
4.2 Experiment 2: Swiss Roll Data and Face Image Sequence¹
Swiss Roll data (Fig. 2(a)) is sampled from a curled surface in 3-D space. The local behavior of the data is like a 2-D plane, so the intrinsic dimension is expected to be 2. In this experiment, let ξ = 0.5 and take r from 1 to 6 at intervals of 0.3. The face image sequence (Fig. 2(b)) contains 256-grayscale images recorded from the same face under changes of three free parameters: vertical and horizontal orientation, and light direction. Intuitively, the intrinsic dimension is 3. In this experiment, the images are normalized so that the distance between a black and a white image is 1. Let ξ = 0.05, and take r from 0.1 to 0.5 at intervals of 0.05. Using adjacent ri, ri+1 for estimation, the results are plotted with x-coordinate (ri + ri+1)/2 in Fig. 2(c)(d). All the results of our method are close to the intrinsic dimensions: for the Swiss Roll data, the estimate is about 2; for the face images, the estimate is approximately 3 ~ 3.5. The ratio of CPU times of the correlation dimension method, our method, and the packing number method is 1:1.3:29 for the Swiss roll data and 1:1.2:25 for the face image sequence.

¹ http://isomap.stanford.edu/datasets.html

[Fig. 1. Experimental results for fractal graphs: (a) Koch curve; (b) Sierpinsky gasket; (c) results for the Koch curve; (d) results for the Sierpinsky gasket. Panels (c) and (d) plot log(·) versus log(r) for our method, the correlation integral, and the packing number.]

[Fig. 2. Estimation results for experiment 2: (a) Swiss roll surface; (b) face image sequence; (c) results for the Swiss roll data; (d) results for the face image sequence. Panels (c) and (d) plot the estimated dimension versus r for our method, the correlation integral, and the packing number.]
5 Conclusions and Future Work
This paper addresses the problem of estimating the fractal intrinsic dimension. The theoretical analysis reveals that the local behavior of each point has tight connections with the dimension. The proposed method, based on this local behavior, requires little computation and achieves reasonable estimates. In future work, the automatic choice of α(·) and β(·) for different types of data will be considered, and the method will be applied to more comprehensive datasets that contain subsets of different dimensions.
References
1. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
2. Tenenbaum, J.B., Silva, V.D., Langford, J.C.: A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290(5500) (2000) 2319-2323
3. Roweis, S., Saul, L.: Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290(5500) (2000) 2323-2326
4. Bishop, C.M., Svensén, M., Williams, C.K.I.: The Generative Topographic Mapping. Neural Computation 10(1) (1998) 215-235
5. Falconer, K.: Fractal Geometry: Mathematical Foundations and Applications. Wiley, New York (1990)
6. Barnsley, M.: Fractals Everywhere. 2nd edn. Academic Press, Boston (1993)
7. Mandelbrot, B.B.: Fractal: Form, Chance, and Dimension. Freeman, San Francisco (1977)
8. Camastra, F.: Data Dimensionality Estimation Methods: A Survey. Pattern Recognition 36(12) (2003) 2945-2954
9. Grassberger, P., Procaccia, I.: Measuring the Strangeness of Strange Attractors. Physica D 9 (1983) 189-208
10. Camastra, F., Vinciarelli, A.: Estimating the Intrinsic Dimension of Data with a Fractal-Based Approach. IEEE Trans. Pattern Anal. Mach. Intell. 24(10) (2002) 1404-1407
11. Kegl, B.: Intrinsic Dimension Estimation Using Packing Numbers. Advances in Neural Information Processing Systems, Vol. 15. MIT Press (2003) 681-688
Dimensionality Reduction for Evolving RBF Networks with Particle Swarms

Junying Chen¹ and Zheng Qin¹,²
¹ Department of Computer Science, Xian JiaoTong University, Xian 710049, P.R. China
[email protected]
² School of Software, Tsinghua University, Beijing 100084, P.R. China
Abstract. Dimensionality reduction techniques, including both feature selection and feature extraction, are useful for improving the performance of neural networks. In this paper, a particle swarm optimization (PSO) algorithm is proposed for simultaneous feature extraction and feature selection. First, PSO is used for simultaneous feature extraction and selection in conjunction with a k-nearest-neighbor (k-NN) classifier for individual fitness evaluation. With the derived feature set, PSO is then used to evolve RBF networks dynamically. Experimental results on four datasets show that RBF networks evolved with the feature set derived by our proposed algorithm have simpler architectures and stronger generalization ability, with similar classification performance, compared with networks evolved with the full feature set.
1 Introduction

A large number of features can usually be measured in many pattern recognition applications, and some of these features are irrelevant. If no preprocessing is carried out before patterns are fed into RBF network classifiers, the required computation may be very heavy, and the size of an RBF neural network can be very large. This is not suitable for some practical applications due to speed and memory constraints [1]. Dimensionality reduction aims to map high-dimensional patterns to lower-dimensional patterns; feature extraction and feature selection are the techniques usually used to reduce dimensionality. Feature extraction finds a mapping that reduces the dimensionality of the patterns from the N-dimensional data space to an M-dimensional space, where M < N.
1320
J. Chen and Z. Qin
In addition, evolutionary algorithms have been successfully used for feature extraction and selection. A direct approach using binary GAs for feature selection was introduced by Siedlecki and Sklansky [6]; since then, many further studies have been carried out [7, 8]. In [9], an extension to the traditional GA was presented which further enhances feature selection through the simultaneous optimization of feature weights and selection of key features, by including a binary masking vector on the GA chromosome. In [10], binary particle swarm optimization and real-valued particle swarm optimization were used to select features; the experimental results showed that both versions of PSO can achieve nearly the same results as a GA but execute more easily. In [11], PSO was employed to select a feature subset and train RBF neural networks simultaneously. These methods selected the feature subset through feature weighting or other feature selection methods but did not perform feature weighting simultaneously. In this paper, a PSO algorithm is designed to reduce the number of irrelevant attributes and to scale the relevant ones, using the predictive accuracy of a k-NN classifier as the fitness measure of an individual. The selected and scaled feature set is then used to evolve RBF networks. Computational experiments show that RBF networks based on the derived feature set outperform RBF networks based on the original feature set in terms of network size and generalization ability. The rest of the paper is organized as follows. Section 2 presents the architecture of the RBF networks used. In Section 3, the PSO learning algorithm for dimensionality reduction is discussed. Section 4 describes the methods for evolving RBF networks. Parameter selection and experimental results are presented in Section 5. Finally, conclusions are given in Section 6.
2 Radial Basis Function Neural Network

Radial Basis Function neural networks were introduced into the neural network literature by Broomhead and Lowe [12]; they are very attractive compared with multilayer perceptrons because of their approximation ability and simple learning methods. An RBF network is a two-layer network. The first layer is composed of basis functions that perform a non-linear mapping; each basis function has two parameters, the center vector and the width. The second layer gives the results by computing a weighted sum of the outputs of the first layer. The usual RBFNN form is

oi(x) = Σ_{k=1}^{K} wik exp{ −||x − ck||² / (2σk²) },  i = 1, 2, ..., m.   (1)
Here ||·|| denotes the Euclidean norm; ck and σk are the center and width of the k-th neuron in the first layer; wik are the weights of the second layer; and K is the number of neurons in the first layer. A key problem when using RBF networks is that the number of training examples necessary for parameter estimation increases exponentially with the intrinsic
dimensionality of the input space. Therefore, dimensionality reduction as a data preprocessing step is effective and even necessary for this type of neural network [7].
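Equation (1) in vectorized form (a generic illustration; the function and argument names are ours, not the evolved networks of the later sections):

```python
import numpy as np

def rbf_forward(x, centers, widths, W):
    """Evaluate o(x) per (1): Gaussian basis functions in the first
    layer, weighted sum in the second.
    centers: (K, d), widths: (K,), W: (m, K), x: (d,)."""
    dist2 = ((x - centers) ** 2).sum(axis=1)      # ||x - c_k||^2
    phi = np.exp(-dist2 / (2.0 * widths ** 2))    # first-layer outputs
    return W @ phi                                # second-layer sum
```

An input that coincides with a center contributes a basis output of exactly 1 for that neuron, which is why the network's response is strongest near its centers.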
3 PSO Algorithms for Dimensionality Reduction

3.1 Particle Swarm Optimization
PSO is a population-based evolutionary computation technique [13]. Particle swarms explore the search space by iteratively adjusting their positions at diverse velocities. The velocity update equation (2) and position update equation (3) are as follows:

Vi^(t+1) = w · Vi^(t) + c1 · rand() · (Pi − Xi^(t)) + c2 · rand() · (Pg − Xi^(t)),   (2)

Xi^(t+1) = Xi^(t) + Vi^(t+1).   (3)
Here Pi is the position with the best fitness value visited by the i-th particle, Pg is the position with the best fitness found by all particles, w is the inertia weight, which balances the global and local exploration abilities of the particles, c1 and c2 are acceleration constants, and rand() returns random values between 0 and 1.

3.2 Encoding Particles

In the feature weighting and selection problem, the main interest lies in representing the feature weight space and all possible subsets of the given feature set.
In the feature weighting and selection problem, the main interest is in representing the feature-weight space and all possible subsets of the given feature set. As shown in equation (4), each particle is encoded as a real-valued vector with two parts: a weight vector, whose i-th component is the weight assigned to the i-th feature, and a masking vector used to perform simultaneous selection of a feature subset:

[ (w_1, w_2, …, w_n) (f_1, f_2, …, f_n) ] ,  (4)

where w_i is the weight assigned to feature i, n is the number of input features, and f_i (i = 1, 2, …, n) indicates whether or not the i-th feature is selected. If the mask value for a given feature is negative, the feature is not considered for classification; if it is positive, the feature is scaled according to the associated weight value and included in the classifier.

3.3 Fitness Evaluation
The fitness function is designed to optimize feature weighting and selection and to maximize the classification accuracy. To evaluate a particular set of weights and selections, the PSO scales the training and testing set feature values according to the weights, simultaneously selects features according to the masking vector, and then classifies the testing set with a k-NN classifier. The performance of a weight-and-selection configuration is determined by the classification accuracy on
1322
J. Chen and Z. Qin
the testing set. At the same time, the number of selected features is also considered, so as to select as small a feature set as possible. Since the PSO algorithm minimizes the objective function, the following error function is used to evaluate each particle:

f = (Ntest − CorrectedNtest) / Ntest + λ · SelectedNfeature / Nfeature ,  (5)
where Ntest is the total number of testing patterns, CorrectedNtest is the number of testing patterns correctly classified by k-NN, Nfeature is the number of features, SelectedNfeature is the number of selected features, and λ is a parameter that balances classification performance against the size of the selected feature set.

3.4 PSO-DR Algorithm
The particle swarm optimization of dimensionality reduction for pattern classification proceeds as follows:

1. Initialize a swarm of N particles, each encoding feature weights and feature selections. Set the number of iterations MaxIteration and set count = 0.
2. Decode each particle and apply the weighting and selection operations to the patterns to form new patterns. Compute the fitness of each particle with the k-NN algorithm on the new patterns.
3. Update P_i for each particle and P_g for the whole swarm.
4. Update the velocity of each particle according to formula (2). Limit the velocity to [Vmin, Vmax]^D.
5. Update the position according to formula (3).
6. Set count = count + 1; if count < MaxIteration, go to 2; otherwise, terminate the algorithm.
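The loop above, together with the update rules (2)-(3) and the error function (5), can be sketched as follows. This is a minimal illustration and not the authors' code: a 1-NN classifier stands in for the k-NN used in the paper, and the function name, default parameter values, and velocity-clipping details are assumptions:

```python
import numpy as np

def pso_dr(X_train, y_train, X_test, y_test, n_particles=20, iters=100,
           w_inertia=0.7, c1=2.0, c2=2.0, lam=0.1, vmax=0.5, seed=0):
    """Sketch of the PSO-DR loop: each particle holds n feature weights
    followed by n feature masks (Eq. (4)); its fitness is Eq. (5)."""
    rng = np.random.default_rng(seed)
    n = X_train.shape[1]
    dim = 2 * n
    pos = rng.uniform(-1.0, 1.0, (n_particles, dim))
    vel = np.zeros((n_particles, dim))

    def fitness(p):
        weights, mask = p[:n], p[n:]
        sel = mask > 0                       # positive mask -> feature kept
        if not sel.any():
            return 1.0 + lam                 # empty subset: worst fitness
        Tr = X_train[:, sel] * weights[sel]  # scale selected features
        Te = X_test[:, sel] * weights[sel]
        # 1-NN classification of the test set on the re-weighted features
        d = np.linalg.norm(Te[:, None, :] - Tr[None, :, :], axis=2)
        pred = y_train[np.argmin(d, axis=1)]
        err = np.mean(pred != y_test)
        return err + lam * sel.sum() / n     # Eq. (5)

    pbest = pos.copy()
    pbest_f = np.array([fitness(p) for p in pos])
    g = pbest[np.argmin(pbest_f)].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # Eq. (2): inertia + cognitive + social terms
        vel = w_inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (g - pos)
        vel = np.clip(vel, -vmax, vmax)      # limit velocity (step 4)
        pos = pos + vel                      # Eq. (3)
        f = np.array([fitness(p) for p in pos])
        better = f < pbest_f
        pbest[better], pbest_f[better] = pos[better], f[better]
        g = pbest[np.argmin(pbest_f)].copy()
    return g[:n], g[n:] > 0                  # weights, selected-feature mask
```

The returned mask plays the role of the f_i components of Eq. (4): only features with a positive mask survive into the classifier.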
4 Evolving RBF Networks

When using PSO to dynamically evolve RBF networks, two key problems must be tackled: encoding the RBFNN architecture and designing the fitness function. The encoding technique used in this paper is the same as in [14]. The fitness function is designed to balance network performance against network complexity. The fitness of the RBF network is formulated as equation (6).
E = log( (1/Nt) Σ_{n=1}^{Nt} ‖y_n − o_n‖² ) + λ · N_h / N_hmax ,  (6)
where Nt is the number of training patterns, y_n and o_n are the desired output and the network output for pattern n respectively, N_h is the number of hidden units in the network, N_hmax is the predefined maximal number of hidden units, and λ is a parameter
that balances network performance against network complexity. The algorithm using PSO to evolve RBF networks is referred to as PSO-RBF and is similar to that in [14].
5 Experiments

5.1 Experimental Setup
The parameters of the PSO algorithm were set as follows: inertia weight w decreasing linearly from 0.9 to 0.4, and acceleration constants c_1 = c_2 = 2 for all cases. The algorithm stopped when a predefined number of iterations was reached. After an extensive set of preliminary experiments, the selected parameter values are shown in Table 1.

Table 1. Execution parameters for PSO-DR and PSO-RBF
Parameters for PSO-DR            Parameters for PSO-RBF
Parameter        Value           Parameter        Value
Population Size  20              Population Size  20
Iterations       3000            Iterations       5000
k                3               λ                0.4
λ                0.1
In order to prevent any initial bias due to the natural ranges of the original feature values, input feature values were normalized to the range [0, 5] as follows:
x'_{i,j} = ( x_{i,j} − min_{k=1,…,n} x_{k,j} ) / ( max_{k=1,…,n} x_{k,j} − min_{k=1,…,n} x_{k,j} ) · 5 ,  (8)
where x_{i,j} is the j-th feature of the i-th pattern, x'_{i,j} is the corresponding normalized feature, and n is the total number of patterns.

5.2 Experimental Results
PSO-RBF was evaluated on classification tasks using four well-known real-world databases: Wdbc, Thyroid, Ionosphere (radar) and Sonar [15]. Two methods were compared: the PSO-RBF algorithm without dimensionality reduction, and the PSO-RBF algorithm with dimensionality reduction based on the PSO-DR method.

Table 2. Evolving RBF networks without dimensionality reduction
Dataset      Train Accuracy   Test Accuracy   Hidden units
Wdbc         0.9984           0.9854          10
Thyroid      0.9990           0.9416          13
Ionosphere   0.9964           0.9052          20
Sonar        1                0.8816          21
Table 3. Evolving RBF networks with dimensionality reduction
Dataset      Selected features   Train Accuracy   Test Accuracy   Hidden units
Wdbc         8                   0.9917           0.9720          9
Thyroid      2                   0.9932           0.9457          13
Ionosphere   8                   0.9953           0.9195          18
Sonar        22                  1                0.9064          20
Each benchmark was tested with 5-fold cross-validation. The results are listed in Tables 2 and 3. Train Accuracy and Test Accuracy refer to the mean correct classification rate averaged over 5 runs for the training and testing sets respectively. Comparing the results, it is easy to see that PSO-DR found small-sized feature sets, and that the networks evolved by PSO-RBF on these small feature sets have simpler architectures than those evolved on the original feature sets. With the proposed method, classification accuracy remains high despite the many features removed from the original feature set. Moreover, dimensionality reduction improved the generalization ability on the Thyroid, Ionosphere and Sonar datasets.
6 Conclusions

In this paper, dimensionality reduction was carried out to improve classification performance and to reduce the number of features as well as the complexity of the RBF neural network. A PSO algorithm was proposed to perform feature extraction and feature selection simultaneously: the feature set with the lowest classification error rate and the least number of features is selected, and the newly derived feature set is used to evolve an RBF network with the PSO-RBF method. Experimental results show that the proposed methods are effective in reducing the feature size, the size of the RBF neural networks, and the classification error rates.
Acknowledgements This work was supported by the National Grand Fundamental Research 973 Program of China (No. 2004CB719401).
References 1. Fu, X., Wang, L.: Data Dimensionality Reduction with Application to Simplifying RBF Network Structure and Improving Classification Performance. IEEE Transactions on Systems, Man and Cybernetics 33 (3) (2003) 399 - 409 2. Fukunaga, K.: Introduction to Statistical Pattern Recognition. Academic Press, Boston, MA (1990) 3. Pfleger, K., John, G., Kohavi, R.: Irrelevant Features and the Subset Selection Problem. Proceedings of the Eleventh International Conference on Machine Learning (1994) 121-129
4. Pudil, P., Ferri, F.J., Novovicova, J., Kittler, J.: Floating Search Methods for Feature Selection with Nonmonotonic Criterion Functions. Proceedings of the 12th IAPR International Conference on Pattern Recognition 2 (1994) 279-283
5. Somol, P., Pudil, P., Kittler, J.: Fast Branch and Bound Algorithms for Optimal Feature Selection. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (7) (2004) 900-912
6. Siedlecki, W., Sklansky, J.: A Note on Genetic Algorithms for Large Scale Feature Selection. Pattern Recognition Letters 10 (5) (1989) 335-347
7. Scherf, M., Brauer, W.: Feature Selection by Means of a Feature Weighting Approach. Technical Report FKI-221-97, Institut für Informatik, Technische Universität München, Munich (1997)
8. Yang, J., Honavar, V.: Feature Subset Selection Using a Genetic Algorithm. In: Liu, H., Motoda, H. (eds.) Feature Extraction, Construction and Selection: A Data Mining Perspective (1998) 117-136
9. Raymer, M.L., Punch, W.F., Goodman, E.D., Sanschagrin, P.C., Kuhn, L.A.: Simultaneous Feature Extraction and Selection Using a Masking Genetic Algorithm. In: Proceedings of ICGA-97, San Francisco, CA (1997) 561-567
10. Firpi, H.A., Goodman, E.: Swarmed Feature Selection. Applied Imagery Pattern Recognition Workshop (2004) 112-118
11. Liu, Y., Qin, Z., Xu, Z.L., He, X.S.: Feature Selection with Particle Swarms. International Symposium on Computational and Information Science, Shanghai, China (2004) 425-430
12. Broomhead, D.S., Lowe, D.: Multivariable Functional Interpolation and Adaptive Networks. Complex Systems 2 (1988) 321-355
13. Kennedy, J., Eberhart, R.C.: Particle Swarm Optimization. Proceedings of the IEEE International Conference on Neural Networks, Piscataway, NJ (1995) 1942-1948
14. Qin, Z., Chen, J.Y., Liu, Y., Lu, J.: Evolving RBF Neural Networks for Pattern Classification. Proceedings of the International Conference on Computational Intelligence and Security, Xi'an, China (2005) 957-964
15. Blake, C., Merz, C.J.: UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html (1998)
Improved Locally Linear Embedding Through New Distance Computing

Heyong Wang¹, Jie Zheng², Zhengan Yao², and Lei Li¹

¹ Software Research Institute, Sun Yat-sen University, Guangzhou 510275, China
² College of Mathematics and Computational Science, Sun Yat-sen University, Guangzhou 510275, China
Abstract. Locally linear embedding (LLE) is a method for dimensionality reduction whose result depends on the number K of nearest-neighbor points chosen initially. In this paper, we want the parameter K to have little influence on the dimension reduction; that is, K should be choosable over a wide range without affecting the result. We therefore propose an improved LLE, which uses a new distance computation for the weights of the K nearest-neighbor points in LLE. Even when K is small, the improved LLE can obtain good dimension-reduction results, while traditional LLE needs a larger K to get the same results. When K becomes larger, the tests in this paper show that the improved LLE can still get correct results.
1 Introduction
Dimension reduction is an important operation in pattern recognition. Because high-dimensional data contain many redundancies, its purpose is to eliminate those redundancies and lessen the amount of data to be processed. There are many methods for dimension reduction; traditional methods are mainly linear, including Principal Component Analysis (PCA) [1], K-Means Clustering [2], and Fisher Discriminants [3]. A number of techniques have been proposed to perform nonlinear mappings, such as Multi-Dimensional Scaling [4], the Self-Organizing Map [5], and Isomap [6]. Multi-Dimensional Scaling and the Self-Organizing Map are both hard to operate and time-consuming. Isomap tries to preserve the global geometry of the manifold, while LLE [7] aims to preserve local geometry. Traditional LLE is a useful nonlinear method for dimension reduction, but the result it obtains depends on the parameter K, the number of nearest-neighbor points initially chosen; when K is chosen unsuitably, the dimension reduction gives worse results. With the improved LLE given in this paper, the selection of K has less influence on the result than with traditional LLE. This paper is organized as follows. Section 2 presents the traditional LLE. Section 3 proposes the improved LLE. Section 4 first gives the handwritten

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1326-1333, 2006. © Springer-Verlag Berlin Heidelberg 2006
digit experiments; it then describes how texture features are extracted by the multi-window approach and shows the experimental results obtained with the improved LLE when applied to texture feature extraction, together with a comparison against traditional LLE. Conclusions are drawn in Section 5.
2 LLE
Locally linear embedding (LLE) is an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs. It is based on simple geometric intuitions: it computes the coefficients that best reconstruct each data point from its neighbors. These coefficients are unchanged by rotations, rescalings and translations, and hence characterize intrinsic geometric properties of each neighborhood. While preserving the neighborhood coefficients, each high-dimensional point is then mapped to a low-dimensional vector. LLE maps a data set X = {x_1, x_2, …, x_n}, x_i ∈ R^d, to a data set Y = {y_1, y_2, …, y_n}, y_i ∈ R^m (m < d), assuming the data lie on a nonlinear manifold that can be locally approximated linearly. The LLE algorithm contains three steps:

Step 1. Find the K nearest neighbors of each point x_i (i = 1, 2, …, n) in X, using Euclidean distances to measure similarity.

Step 2. Assign a weight to every pair of neighboring points. The weights, representing contributions to the reconstruction of a given point from its nearest neighbors, are found by solving the optimization task

ε(W) = Σ_{i=1}^{n} | x_i − Σ_{j=1}^{K} w_j^(i) x_j |² ,  (1)

subject to the constraints w_j^(i) = 0 if x_i and x_j are not neighbors, and Σ_j w_j^(i) = 1 over the neighbors of x_i. The latter condition enforces rotation, translation, and scaling invariance of x_i and its neighbors.

Step 3. Compute the embedded coordinates y_i for each x_i. The projections are found by minimizing the embedding cost function for the fixed weights:

ε(Y) = Σ_{i=1}^{n} | y_i − Σ_{j=1}^{K} w_j^(i) y_j |² = tr(Y^T M Y) ,  (2)

under the constraints (1/n) Σ_{i=1}^{n} y_i y_i^T = I and Σ_{i=1}^{n} y_i = 0. To find the matrix Y under these constraints, a new matrix is constructed from the matrix W of step 2: M = (I − W)^T (I − W). LLE then computes the bottom m + 1 eigenvectors of M, associated with the m + 1 smallest eigenvalues. The first eigenvector, whose eigenvalue is closest to zero, is excluded; the remaining m eigenvectors yield the final embedding Y.
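The three steps above can be sketched with NumPy. This is an illustrative implementation, not the authors' code; the least-squares solution of the constrained weights via a local Gram matrix and the small regularization term are standard LLE practice and assumptions here, not details given in the text:

```python
import numpy as np

def lle(X, K, m, reg=1e-3):
    """Map X (n, d) to an (n, m) embedding with locally linear embedding."""
    n = X.shape[0]
    # Step 1: K nearest neighbors by Euclidean distance (excluding self).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    idx = np.argsort(D, axis=1)[:, 1:K + 1]
    # Step 2: reconstruction weights minimizing Eq. (1), rows summing to one.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[idx[i]] - X[i]                 # neighbors centered on x_i
        G = Z @ Z.T                          # local Gram matrix
        G += reg * np.trace(G) * np.eye(K)   # regularize (handles K > d)
        w = np.linalg.solve(G, np.ones(K))
        W[i, idx[i]] = w / w.sum()           # enforce the sum-to-one constraint
    # Step 3: embedding from the bottom eigenvectors of M = (I-W)^T (I-W).
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)           # ascending eigenvalues
    return vecs[:, 1:m + 1]                  # skip the near-zero eigenvector
```

The improved LLE of the next section only changes the distance used in Step 1; Steps 2 and 3 stay the same.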
3 Improved LLE
For traditional LLE, the selection of the number K of nearest-neighbor points influences the result, especially when the sample points are not distributed uniformly: for a given K, the nearest neighbors cover more area in a sparse region than in a dense one. Therefore we investigate another method that is less influenced by the sample-point distribution. The improved LLE uses a new distance d_ij in step 1 of LLE:

d_ij = |x_i − x_j| / ( M(i) M(j) ) ,  (3)

where M(i) = (1/K) Σ_{l=1}^{K} |x_i − x_l|², the sum being taken over the K nearest neighbors x_l of x_i. By substituting the improved distance d_ij for the Euclidean distance in LLE, the distance between sample points in dense areas becomes larger and the distance between sample points in sparse areas becomes smaller, so the sample points become more evenly proportioned and the influence of the sample-point distribution is reduced. The test image is a 3-dimensional half cylinder (figure 1), composed of 5 × 5 = 25 points above and 20 × 20 = 400 points below. To demonstrate the performance of LLE versus the improved LLE, the 3-dimensional half cylinder is reduced to 2 dimensions by each method. If the sparse points above keep sparse and the dense points below keep dense after reducing the dimensionality, the dimension-reduction method has a good effect.
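A sketch of the new distance, following Eq. (3) as reconstructed above (the exact normalization in the published formula may differ, e.g. a square root over M(i)M(j)); the function name is illustrative:

```python
import numpy as np

def improved_distances(X, K):
    """Distance matrix of Eq. (3): Euclidean distances rescaled by each
    point's local density term M(i), the mean squared distance to its
    K nearest neighbors."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # M(i) = (1/K) * sum of squared distances to the K nearest neighbors
    nn = np.sort(D, axis=1)[:, 1:K + 1]      # exclude the point itself
    M = np.mean(nn ** 2, axis=1)
    return D / (M[:, None] * M[None, :])     # d_ij = |x_i - x_j| / (M(i)M(j))
```

Because M(i) is small in dense regions and large in sparse ones, the rescaling stretches dense neighborhoods and shrinks sparse ones, which is exactly the equalizing effect described above.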
Fig. 1. 3-dimensional half cylindrical image
Fig. 2. 3-dimensional half cylinder is reduced to 2-dimensional using LLE (left) and the improved LLE (right) with K = 4
Fig. 3. 3-dimensional half cylinder is reduced to 2-dimensional using LLE (left) and the improved LLE (right) with K = 9
Fig. 4. 3-dimensional half cylinder is reduced to 2-dimensional using LLE (left) and the improved LLE (right) with K = 19
When K is too small, neither method gives good results; this is shown for K = 4 in figure 2. As a larger value of K is chosen, the improved LLE gives better results than LLE, as for K = 9 in figure 3. LLE does not obtain a good result until K = 19 (figure 4), where the improved LLE still keeps a good effect. As can be seen, the improved LLE depends less on the choice of K; that is to say, the range of usable K is enlarged.
4 Experiment
This section presents the results of our experiments on the dimension reduction performance of improved LLE versus LLE.
Fig. 5. Experiment image
4.1 Handwritten Digit
The data set used in this set of experiments was extracted from the well-known UCI¹ handwritten digit database. The training set has 3838 samples and the testing set 1797 samples. The dimensionality of every sample is 65, the last component being the class label; every gray-value image has 8 × 8 = 64 dimensions. In our experiment, a training set of 1000 samples and a testing set of 500 samples were extracted from the 3838 training and 1797 testing samples. To verify the superiority of the improved LLE over LLE, we did extensive experiments using different values of d, the reduced dimension, and K, the number of chosen nearest neighbors; both influence the results. The values tested were d = 7, 8, 9, 10, 11 and K = 5, 6, 7, 8. We then compared the performance of our proposed algorithm on the UCI handwritten digit database with the relevance-feedback approach of the K-Nearest-Neighbor algorithm [8]; that is, the K-Nearest-Neighbor algorithm was used to find the true class from UCI for a query image. The results are listed in Table 1; the error in this table is the average percentage of misclassified patterns in the test set. As is clear from Table 1, the average error rate is lower using the improved LLE method than using LLE for the same d and K.

Table 1. Results for improved LLE and LLE on the UCI handwritten digit database
                      LLE                                Improved LLE
      d=7    d=8    d=9    d=10   d=11      d=7    d=8    d=9    d=10   d=11
K=5   0.076  0.074  0.066  0.066  0.068     0.048  0.044  0.046  0.048  0.044
K=6   0.066  0.068  0.078  0.072  0.062     0.054  0.052  0.048  0.046  0.042
K=7   0.070  0.076  0.064  0.058  0.056     0.070  0.058  0.058  0.060  0.060
K=8   0.070  0.072  0.070  0.068  0.066     0.062  0.066  0.056  0.054  0.054

4.2
Texture Image
In the second experiment, we used a texture image (figure 5) as the image database. The image size is 128 × 128.
¹ ftp://ftp.ics.uci.edu/pub/machine-learning-databases/optdigits/
The texture contains important information about the structural arrangement of surfaces and their relationship with the surrounding environment. In order to extract texture features, we used the multi-window approach, which represents the texture inside a two-dimensional multi-window as a point in a higher-dimensional space. The multi-window is made up of Q × Q windows (figure 6), each window containing e × e pixels. A measurement of some kind is taken inside each window, and these measurements are concatenated to produce a point in R^{Q²}. Define the feature vector Z by Z = (m_1, m_2, …, m_{Q²}), where m_f (f = 1, 2, …, Q²) is the measurement from the f-th window. The measurement used is the standard deviation of the intensity values inside each window, because the standard deviation reflects the changes of intensity inside the window:

m_f = sqrt( (Σ_{r=1}^{e²} i_r²) / e² − ( (Σ_{r=1}^{e²} i_r) / e² )² ) ,  (4)

where i_r is the intensity of a pixel, Σ_{r=1}^{e²} i_r is taken over the area of the f-th window, and Σ_{r=1}^{e²} i_r² is the sum of the squared pixel intensities over the same area. In this paper, the multi-window is (Q = 5) × (Q = 5) and the window size is (e = 7) × (e = 7). Every pixel of the image is covered by the multi-window. Using (4), the feature vector of each pixel of a tested image is Z = (m_1, m_2, …, m_25), so the set of feature vectors measured from an image is D = {x_m^1, x_m^2, …, x_m^25} (1 ≤ m ≤ 128 × 128). To test the correctness, figure 7 plots the standard deviation within the center window of the multi-window versus the x location of the multi-window within the texture image. If the extracted features are correct, the curve in figure 7 fits the variation of the gray values along the indicated line of the texture image. We then ran improved LLE and LLE to map the texture image into R^d with K = 12 and d = 10.
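The multi-window measurement of Eq. (4) can be sketched as below. This is an illustration, not the authors' code; the edge-padding boundary handling is an assumption, since the text does not specify how windows near the border are treated:

```python
import numpy as np

def multiwindow_features(image, Q=5, e=7):
    """Per-pixel texture features: a Q x Q grid of e x e windows is
    centered on each pixel, and each feature m_f is the standard
    deviation of the intensities inside one window (Eq. (4))."""
    pad = (Q * e) // 2
    img = np.pad(image.astype(float), pad, mode='edge')  # assumed padding
    H, W = image.shape
    feats = np.zeros((H, W, Q * Q))
    for y in range(H):
        for x in range(W):
            f = 0
            for wy in range(Q):
                for wx in range(Q):
                    win = img[y + wy * e:y + (wy + 1) * e,
                              x + wx * e:x + (wx + 1) * e]
                    # Eq. (4): sqrt(mean of squares - square of mean);
                    # clamp at zero to absorb floating-point round-off.
                    var = np.mean(win ** 2) - np.mean(win) ** 2
                    feats[y, x, f] = np.sqrt(max(var, 0.0))
                    f += 1
    return feats
```

On a constant image every window has zero standard deviation, so all features vanish; textured regions produce nonzero entries that vary with local contrast.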
Therefore, the dimensionality of every vector Z = (m_1, m_2, …, m_25) is reduced to Z = (m_1, m_2, …, m_10), and this reduced Z is saved in the database. In figure 8, the x axis denotes the number of texture images in the database, each saved as a reduced vector Z, and the y axis denotes
Fig. 6. Multi windows
Fig. 7. Shows the standard deviation of the 7th window versus the image width of figure 5
Fig. 8. Accuracy curves of the improved LLE versus LLE: the average retrieval accuracy against the number of test images saved in the database
the average accuracy rate of retrieving texture images of the same class as the test image. We set d = 10 and K = 12 for the LLE and improved LLE methods, separately. As shown in figure 8, for each number of texture images in the database, the accuracy curve using improved LLE is higher than that using LLE; it can therefore be concluded that the improved LLE works better than LLE on the texture image as well.
5 Conclusions
This paper presents the limitation of LLE and gives a new dimension reduction algorithm, improved LLE, which is a significant advancement over LLE. The key difference between improved LLE and LLE is that improved LLE works with the new distance, while LLE uses the Euclidean distance. Finally, the proposed method is tested on handwritten digits and a texture image.
References
1. Jolliffe, I.T.: Principal Component Analysis, 2nd ed. Springer-Verlag, Berlin Heidelberg New York (2002)
2. Schölkopf, B., Smola, A.J., Müller, K.R.: Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation 10(5) (1998) 1299-1319
3. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müller, K.R.: Fisher Discriminant Analysis with Kernels. Neural Networks for Signal Processing IX (1999) 41-48
4. Borg, I., Groenen, P.: Modern Multidimensional Scaling: Theory and Applications. Springer-Verlag, Berlin Heidelberg New York (1997)
5. Kohonen, T.: The Self-Organizing Map, 3rd ed. Springer-Verlag, Berlin Heidelberg New York (2000)
6. Tenenbaum, J.B., Silva, V.D., Langford, J.C.: A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290(5500) (2000) 2319-2323
7. Roweis, S.T., Saul, L.K.: Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290(5500) (2000) 2323-2326
8. Bian, Z.Q., Zhang, X.G.: Pattern Recognition, 2nd ed. Tsinghua University Press (2002)
An Incremental Linear Discriminant Analysis Using Fixed Point Method Dongyue Chen and Liming Zhang Electronic Engineer Department, Fudan University, Shanghai, 200433, China [email protected], [email protected]
Abstract. Linear Discriminant Analysis (LDA) is a very powerful method in pattern recognition, but it is difficult to realize online processing for data streams. In this paper, a new adaptive LDA method is proposed. We decompose the online LDA problem into two adaptive PCA problems and develop a fixed point adaptive PCA to implement adaptive LDA. Online updating of the in-class scatter matrix S_w(t) and the covariance matrix C_x(t) is derived in this paper. Simulation results show that the proposed method has no learning rate, fast convergence, and low time consumption.
1 Introduction

LDA has been widely used in pattern recognition to extract features from data because of its good classification performance. However, it is difficult to realize adaptive LDA because of the computational complexity. Several existing works have been proposed to solve adaptive LDA in online pattern recognition [1-4]. Among these frameworks, the covariance-based adaptive PCA methods are the most attractive. This kind of adaptive PCA was summarized by C. Chatterjee in 2005 [6], including eight covariance-based methods in the gradient descent framework and two new adaptive PCA methods proposed by the authors. These methods are faster and more robust in convergence and are widely used in adaptive LDA applications, but they need to choose the learning rate by experience. In this paper, we propose a new adaptive LDA algorithm: we decompose the LDA problem into two PCA problems and develop a fixed point adaptive PCA algorithm to realize the adaptive LDA algorithm. Our method is parameter-free, more robust, and less time-consuming.
2 How to Decompose the Adaptive LDA Problem

The basic idea of LDA is to find a subspace in which the separability is best. In batch processing, suppose that the set L = {x_ij ∈ R^d} consists of N samples divided into k classes l_1, l_2, …, l_k, where l_i = {x_ij | j = 1, 2, …, n_i} and x̄_i is the mean vector of l_i. The mean vector of all samples is denoted by x̄. S_w and S_b are the in-class and between-class scatter matrices, respectively:

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1334-1339, 2006. © Springer-Verlag Berlin Heidelberg 2006
S_w = (1/N) Σ_{i=1}^{k} Σ_{j=1}^{n_i} (x_ij − x̄_i)(x_ij − x̄_i)^T ,  (1)

S_b = (1/N) Σ_{i=1}^{k} n_i (x̄_i − x̄)(x̄_i − x̄)^T .  (2)
We hope to find a (k−1)-dimensional subspace F = {f_1, f_2, …, f_{k−1}}, f_i ∈ R^d, such that

F* = arg max_F ( |F^T S_b F| / |F^T S_w F| ) .  (3)
It has been proved that if S_w is positive definite and J = S_w^{−1} S_b, then F* consists of the k−1 eigenvectors of J corresponding to the k−1 biggest eigenvalues, i.e. JF = FΛ, where Λ is the eigenvalue matrix of J. A result proved by Fukunaga [8] shows that both J and J' = S_w^{−1}(S_b + S_w) have the same eigenvectors, i.e. J'F = FΛ'. Let S_m = S_w + S_b; then S_w^{−1} S_m F = FΛ', i.e. S_m F = S_w F Λ'. Denote the i-th eigenvalue and eigenvector of S_w by λ_wi and e_wi, and let D_w = diag(λ_w1, λ_w2, …, λ_wd), E_w = [e_w1, e_w2, …, e_wd]. We have

D_w^{−1/2} E_w^T S_m F = D_w^{−1/2} E_w^T S_w F Λ' = D_w^{1/2} E_w^T F Λ' ,  (4)
where E_w is an orthonormal matrix, so E_w D_w^{−1/2} D_w^{1/2} E_w^T = I_{d×d}. Hence we can write

D_w^{−1/2} E_w^T S_m F = ( D_w^{−1/2} E_w^T S_m E_w D_w^{−1/2} )( D_w^{1/2} E_w^T F ) = D_w^{1/2} E_w^T F Λ' .  (5)
Let θ = D_w^{1/2} E_w^T F and S_m' = D_w^{−1/2} E_w^T S_m E_w D_w^{−1/2}. From eq. (5) we get S_m' θ = θ Λ', so θ is the eigenvector matrix of S_m'. From the definition of θ, we have

F = E_w D_w^{−1/2} θ .  (6)
As shown above, the batch LDA problem is transformed into the eigenvalue decompositions of the matrices S_w and S_m'. In the first stage we obtain D_w(t) and E_w(t) by an adaptive PCA algorithm and compute S_m'(t) from D_w(t), E_w(t) and S_m(t). In the second stage, we update the matrix θ(t) by the adaptive PCA algorithm again. Finally, we obtain F(t) by eq. (6). Thus adaptive LDA can be calculated by two adaptive PCAs.
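The two-stage decomposition can be checked in batch form with two symmetric eigendecompositions. This is a sketch under the assumption that S_w is positive definite; the function name is illustrative:

```python
import numpy as np

def lda_by_two_pca(Sw, Sm, k):
    """Discriminant directions F (k-1 columns) via the decomposition
    above: eigendecompose S_w, whiten S_m, eigendecompose the result,
    then map back with Eq. (6)."""
    # Stage 1: eigendecomposition of the within-class scatter S_w.
    Dw, Ew = np.linalg.eigh(Sw)              # Sw = Ew diag(Dw) Ew^T
    Wh = Ew @ np.diag(Dw ** -0.5)            # whitening map Ew Dw^{-1/2}
    # S_m' = Dw^{-1/2} Ew^T Sm Ew Dw^{-1/2}
    Sm_p = Wh.T @ Sm @ Wh
    # Stage 2: eigenvectors theta of S_m' for the k-1 biggest eigenvalues.
    vals, theta = np.linalg.eigh(Sm_p)
    theta = theta[:, ::-1][:, :k - 1]
    # Eq. (6): F = Ew Dw^{-1/2} theta
    return Wh @ theta
```

Each returned column f satisfies S_m f = λ' S_w f, i.e. it is a generalized eigenvector of the pair (S_m, S_w), which is exactly the property derived above.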
3 Fixed Point Adaptive PCA

In Section 2 we need an accurate adaptive PCA algorithm to realize adaptive LDA. Existing accurate covariance-based PCA methods depend too much on the learning step size and are time-consuming. Here we propose a learning-rate-free fixed point adaptive PCA algorithm. Let the input sequence L(t) = {x_i | i = 1, 2, …, t} consist of real random vectors x_i ∈ R^d, with i the sampling index. At index i = t, the instantaneous covariance matrix of the sequence L(t) is denoted by C_x(t); its eigenvalues are assumed to be distinct. At index t, the first PCA basis of C_x(t), corresponding to the biggest eigenvalue of the sequence L(t), maximizes the objective function f(w) = w^T C_x(t) w. Denote L^(1)(t) = L(t), x_i^(1) = x_i, C_x^(1)(t) = C_x(t). Considering the orthogonality of PCA bases, the second PCA basis of C_x(t) is the first PCA basis of the sample set L^(2)(t), which consists of the residual parts x_i^(2) given by

x_i^(m) = x_i^(m−1) − ( x_i^(m−1)T w ) w ,  m = 2, …, k−1;  i = 1, 2, …, t .  (7)
From eq. (7), all residual sample sets L^(m)(t) can be obtained. Applying the gradient method to the objective function f(w) = w^T C_x(t) w gives the traditional adaptive PCA update:

Δw ∝ η d( w^T C_x(t) w )/dw = 2η C_x(t) w ;  w ← (w + Δw) / ‖w + Δw‖ ,  (8)

where Δw is the ascent direction of the objective function and η is the step size. It is obvious that Δw and w have the same direction at the convergence point, and since w is normalized at every iteration, the factor η is useless. A fixed point method can therefore be used to update w, giving the new adaptive equation:
w ← C_x(t) w ;  w ← w / ‖w‖ .  (9)
Assume that w_1 is the first PCA basis of L(t). Starting from any random vector w, iteration (9) converges to w_1. From eq. (7), the residual set L^(2)(t) and its covariance matrix C_x^(2)(t) can be calculated as follows:
C_x^(2)(t) = E{ [ (x_i − (w_1^T x_i) w_1) − (x̄ − (w_1^T x̄) w_1) ] [ (x_i − (w_1^T x_i) w_1) − (x̄ − (w_1^T x̄) w_1) ]^T } .  (10)
Since the matrix C_x(t) is symmetric and w_1 is an eigenvector of C_x(t), we have C_x w_1 w_1^T = w_1 w_1^T C_x = w_1 w_1^T C_x w_1 w_1^T = λ_1 w_1 w_1^T. Substituting this into eq. (10), the iteration formula for the covariance matrices of the successive residual sample sets is

C_x^(i)(t) = C_x^(i−1)(t) − λ_{i−1} w_{i−1} w_{i−1}^T ,  i = 2, 3, …, k−1 ,  (11)
where λ_{i−1} is the norm of w_{i−1} before it is normalized. Since this is an adaptive process, we may update all k−1 bases in every iteration instead of updating w_2 only after w_1 has converged. We thus have the iteration equations for all PCA bases at index t. Denoting the mean vector of all samples at index t by x̄(t), the iteration formulas for the mean and the covariance matrix are eqs. (12) and (13), respectively:

x̄(t) = x̄(t−1) + ( x(t) − x̄(t−1) ) / t ,  (12)

C_x(t) = ( t C_x(t−1) + ( x(t) − x̄(t−1) )( x(t) − x̄(t−1) )^T ) (t−1) / t² .  (13)
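Eqs. (12)-(13) amount to a one-pass running mean and (divisor-t) covariance; the sketch below folds one new sample in, assuming the initialization x̄(1) = x(1) and C_x(1) = 0. The names are illustrative:

```python
import numpy as np

def update_mean_cov(mean, cov, x, t):
    """One step of Eqs. (12)-(13): fold the t-th sample x into the
    running mean and the instantaneous covariance C_x summarizing
    the first t-1 samples. `t` is the 1-based index of the new sample."""
    delta = x - mean
    new_mean = mean + delta / t                                   # Eq. (12)
    new_cov = (t * cov + np.outer(delta, delta)) * (t - 1) / t**2 # Eq. (13)
    return new_mean, new_cov
```

After streaming all samples, the result agrees with the batch mean and the batch covariance computed with divisor t, which is what makes S_m(t) updatable online in Section 4.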
From eq. (9), at index t the fixed point PCA can be rewritten as

w_i(t) ← C_x^(i)(t) w_i(t−1) ;  w_i(t) ← w_i(t) / ‖w_i(t)‖ ,  i = 1, 2, …, d .  (14)
Calculating all the temporary covariance matrices C_x^(i)(t) of L^(i)(t) by eq. (11), all PCA bases of L(t) can be updated by eqs. (11) and (14) at every sampling index.
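One sampling-index update combining the deflation (11) with the fixed point step (14) can be sketched as follows; the function name and the matrix layout (bases as columns of W) are illustrative:

```python
import numpy as np

def fixed_point_pca_step(C, W):
    """One update of Eqs. (11) and (14): for each basis i, apply the
    fixed point iteration w <- C^{(i)} w; w <- w/||w|| on the covariance
    deflated by the previously extracted components. W holds the
    current basis estimates as columns."""
    Ci = C.copy()
    W = W.copy()
    for i in range(W.shape[1]):
        w = Ci @ W[:, i]                 # Eq. (14): w <- C^{(i)}(t) w
        lam = np.linalg.norm(w)          # eigenvalue estimate = pre-norm length
        w = w / lam                      # normalize
        W[:, i] = w
        Ci = Ci - lam * np.outer(w, w)   # Eq. (11): deflate for the next basis
    return W
```

Repeating the step on a fixed covariance matrix reproduces the power method with deflation, so the columns converge to the leading eigenvectors (up to sign).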
4 The New Adaptive LDA Algorithm

According to the analysis in section 2, if we can develop an incremental way to update the temporary matrices S_w(t) and S_m(t), the adaptive LDA algorithm can be realized by the fixed point PCA proposed in section 3. From the definition of the matrix S_m(t), we have

S_m = S_w + S_b = (1/N) Σ_{i=1}^{k} Σ_{j=1}^{n_i} (x_ij − x̄)(x_ij − x̄)^T .  (15)
It can be seen that S_m(t) is equal to the temporary covariance matrix C_x(t) of the sequence L(t), so S_m(t) can be updated by eq. (13). For S_w(t), we can update the mean vector x̄_i(t) of the i-th class using eq. (16), where n_i(t) is the number of samples in the i-th class at index t, x(t) is the new sample, and j is the class of x(t):

x̄_i(t) = (n_i(t − 1) x̄_i(t − 1) + x(t)) / (n_i(t − 1) + 1),   i = j
x̄_i(t) = x̄_i(t − 1),   i ≠ j   (16)
The iteration formula of the within-class scatter matrix S_w(t) is given by eq. (17):

S_w(t) = ((t − 1)/t) S_w(t − 1) + ( n_j(t − 1) / (t (n_j(t − 1) + 1)) ) (x(t) − x̄_j(t − 1))(x(t) − x̄_j(t − 1))^T   (17)
We now describe the steps of our adaptive LDA algorithm:
Step 1: update x̄(t) by eq. (12); update S_w(t) and S_m(t) by eqs. (13), (16) and (17).
Step 2: calculate D_w(t) and E_w(t) adaptively by eq. (9); compute S'_m(t) according to its definition.
Step 3: calculate θ(t) adaptively by eq. (9) again; obtain F(t) by eq. (18):

F(t) = E_w(t) D_w^{−1/2}(t) θ(t)   (18)
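The streaming statistics of Step 1 can be illustrated with a small sketch of eqs. (16) and (17) (function and variable names are ours; a sample x(t) with label j updates only class j's mean):

```python
import numpy as np

def update_lda_stats(x, j, t, means, counts, Sw):
    """One adaptive-LDA bookkeeping step at 1-based index t:
    within-class scatter S_w(t) by eq. (17), class mean by eq. (16)."""
    n_j = counts[j]
    d = x - means[j]                       # x(t) - x_bar_j(t-1)
    Sw = ((t - 1) / t) * Sw + (n_j / (t * (n_j + 1))) * np.outer(d, d)  # eq. (17)
    means[j] = (n_j * means[j] + x) / (n_j + 1)                         # eq. (16)
    counts[j] = n_j + 1
    return Sw

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 3, size=100)
means, counts, Sw = np.zeros((3, 4)), np.zeros(3), np.zeros((4, 4))
for t, (x, lab) in enumerate(zip(X, y), start=1):
    Sw = update_lda_stats(x, lab, t, means, counts, Sw)
```

At the end of the stream, `Sw` matches the batch within-class scatter matrix normalized by the total number of samples.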
5 Experimental Results

There are two experiments in this section. In the first experiment we test the convergence speed and the stability of fixed-point PCA. The comparison of our adaptive LDA method with existing adaptive methods is given in the second experiment.
Exp 1.1: The Iris database is used to test our adaptive LDA method. There are 3 classes in the Iris database, each consisting of 50 four-dimensional samples. The convergence results of S_w(t) and S_m(t) obtained by fixed-point PCA are shown in Fig. 1. Our method traces the data perfectly after t = 40. There is a vibration at t = 30, caused by a break in the covariance of the data stream.
1338
D. Chen and L. Zhang
Fig. 1. The x-axis is the sampling index; the y-axis represents the dot products of the eigenvectors obtained by our method and the eigenvectors calculated by the batch method
Exp 1.2: Select 10 persons from the ORL database, each with 10 pictures. Reduce the dimension to 10 by PCA. Train on the 100 pictures 10 times iteratively using 11 different adaptive methods. The convergence results are shown in Fig. 2.
Fig. 2. The convergence of errors of different adaptive PCA methods for the ORL database. Fixed-point PCA is always the best of the 11 adaptive PCA methods in this case
Time cost analysis: We use the Times of Multiplication and Addition (TMA) and the time cost per 1000 samples to evaluate the computation cost, as shown in Table 1.

Table 1. The comparison of computation cost using 11 different methods
algorithm   TMA            Time / 1000 samples
Fix         (2100, 1100)   1652 ms
OJA         (2900, 2900)   2915 ms
RQ          (3200, 3100)   3371 ms
OJAN        (3300, 3200)   3214 ms
XU          (4300, 4200)   3533 ms
PF          (2100, 2100)   3004 ms
OJA+        (3300, 3200)   3284 ms
LUO         (4500, 4300)   2894 ms
IT          (3400, 3300)   3325 ms
AL1         (3500, 3300)   3440 ms
AL2         (4700, 4400)   3590 ms
Exp 2: There are 10 classes in the training set, each with 1000 10-dimensional random vectors, used to test our LDA method, the gradient descent method (GD) and the steepest descent method (SD) [1, 2]. The errors of the 3 methods are denoted by e_f, e_g and e_s respectively, where:

e = ‖S_w(t) − S_w*(t)‖ / ‖S_w(t)‖   (21)

For GD and SD, S_w*(t) = (S_w^{−1/2}(t)*)^{−2}, where S_w^{−1/2}(t)* is the estimate of S_w^{−1/2} at index t. For our method, S_w*(t) = E_w(t) D_w(t) E_w^T(t). The results are shown in Table 2.
Table 2. The errors of the estimates of S_w in three adaptive LDA methods. The table shows that the errors of our LDA method are always the smallest among the three methods.

samples   2000      4000      6000      8000      10000
e_f       4.19e-4   2.49e-4   1.77e-4   1.35e-4   1.06e-4
e_g       9.30e-3   5.93e-3   4.80e-3   2.70e-3   3.80e-3
e_s       6.47e-3   4.20e-3   3.30e-3   2.50e-3   2.40e-3
Time cost analysis: the TMA of our method, GD and SD are (7600, 7200), (8900, 8600) and (35200, 35400) respectively. The time costs per 1000 iterations of the 3 methods are 6315 ms, 7729 ms and 55478 ms respectively in our simulations.
6 Conclusion

In this paper, a new adaptive LDA algorithm based on the fixed-point adaptive PCA algorithm was proposed. Our method can deal with data streams adaptively and contains no learning-rate factor. Compared with two existing adaptive LDA algorithms (GD and SD), our method converges faster, is more robust and is less time-consuming. Fixed-point PCA also shows better performance than 11 existing adaptive PCA methods.
Acknowledgement. This work was supported by the NSF of China under Grants 60571052 and 30370392, and by the Shanghai Science and Technology Committee under Grant 045115020.
References
1. Chatterjee, C., Roychowdhury, V.P.: On Self-Organizing Algorithms and Networks for Class-Separability Features. IEEE Trans. Neural Networks 8 (1997) 663-678
2. Lin, R., Yang, M., Levinson, S. E.: Object Tracking Using Incremental Fisher Discriminant Analysis. ICPR (2) (2004) 757-760
3. Yan, J., Zhang, B., Yan, S., Yang, Q., Li, H., Chen, Z., Xi, W., Fan, W., Ma, W., Cheng, Q.: IMMC: Incremental Maximum Margin Criterion. KDD (2004) 725-730
4. Pang, S., Ozawa, S., Kasabov, N.: Incremental Linear Discriminant Analysis for Classification of Data Streams. IEEE Trans. Syst., Man, and Cybern. B 35(5) (2005) 905-914
5. Chatterjee, C.: Adaptive Algorithms for First Principal Eigenvector Computation. Neural Networks 18(2) (2005) 145-159
6. Fisher, R.A.: The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics 7 (1936) 179-188
7. Fukunaga, K.: Introduction to Statistical Pattern Recognition. Academic Press (1990)
A Prewhitening RLS Projection Alternated Subspace Tracking (PAST) Algorithm

Junseok Lim1, Joonil Song2, and Yonggook Pyeon3

1 Dept. of Electronics Eng., Sejong University, Kunja, Kwangjin, 98, 143-747, Seoul, Korea, [email protected]
2 S/W Lab., R&D Group 2, Mobile Communication Division, Telecommunication Network Business, Samsung Electronics Co., Ltd., Dong Suwon P.O. Box 105, 442-600, Korea
3 Gangwon Provincial Univ., Gyowhangli, Jumoonjin, Gangneung, GangwonDo, 210-804, Korea
Abstract. In this paper we propose a new principal component extraction algorithm based on the PAST. A prewhitening procedure is introduced, which makes the algorithm numerically robust. The estimation capability of the proposed algorithm is demonstrated by computer simulations of DOA (Direction of Arrival) estimation. The estimation results of the proposed PAST outperform those of the ordinary PAST.
1 Introduction
Since the introduction of a simplified linear neuron with a constrained Hebbian learning rule by Oja [1], which extracts the principal component from stationary input data, a variety of learning algorithms for principal component analysis (PCA) have been proposed [2],[3]. One of the novel subspace estimation algorithms is the projection approximation subspace tracking (PAST) [3]. The basic idea of the PAST is that a projection-like unconstrained criterion is approximated using an RLS-like algorithm to track the signal subspace. However, the PAST algorithm can become unstable. The source of these numerical instabilities in O(N²) RLS schemes is the update of the inverse of the exponentially-windowed input signal autocorrelation matrix,

R_xx(n) = β^n δ I + Σ_{k=1}^{n} β^{n−k} x(k) x^T(k),   (1)

where x(k) = [x_1(k) x_2(k) … x_N(k)]^T, k ≥ 1, is the input signal vector sequence, N is the number of system parameters, δ is a small positive constant, and β, 0 < β < 1, is the forgetting factor. Several authors have shown that the primary source of numerical instability is the loss of symmetry in the numerical representation of R_xx(n) [4]. This fact has spurred the development of alternative
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1340–1345, 2006. © Springer-Verlag Berlin Heidelberg 2006
A Prewhitening RLS Projection Alternated Subspace Tracking
1341
QR-decomposition-based least-squares (QRD-LS) methods [4] that employ the QR decomposition of the weighted data matrix

X^T(n) = [ (β^n δ)^{1/2} I   β^{(n−1)/2} x(1)   β^{(n−2)/2} x(2)   · · ·   x(n) ],   (2)

to calculate an upper-triangular square-root factor, or Cholesky factor, R(n) of R_xx(n) = R^T(n)R(n) = X^T(n)X(n). QRD-LS methods possess good numerical properties [4]. Even so, QRD-LS methods are less popular because they require N divide and N square-root operations at each time step. These calculations are difficult to perform on typical DSP hardware. In [5], a new RLS algorithm has been proposed; it is similar to a recently developed natural-gradient prewhitening procedure [6],[7]. In this paper we propose a new PAST algorithm based on the least-squares prewhitening in [5] and show that it improves the estimation performance.
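For reference, the exponentially-windowed autocorrelation matrix (1) is normally maintained by the rank-one recursion R(n) = βR(n−1) + x(n)x^T(n) with R(0) = δI; a small sketch verifying this against the direct sum of eq. (1) (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, beta, delta, n = 4, 0.9, 1e-2, 60
xs = rng.normal(size=(n, N))

# recursive form, R(0) = delta * I
R = delta * np.eye(N)
for x in xs:
    R = beta * R + np.outer(x, x)

# direct form of eq. (1)
R_direct = beta**n * delta * np.eye(N)
for k, x in enumerate(xs, start=1):
    R_direct += beta**(n - k) * np.outer(x, x)
```

Both forms produce the same matrix; the instability discussed above arises only when the *inverse* of this matrix is propagated in finite precision.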
2 PAST Algorithm
The PAST (Projection Approximation Subspace Tracking) algorithm is based on the minimum property of the unconstrained cost function

J(W(t)) = Σ_{i=1}^{t} β^{t−i} ‖x(i) − W(t)W^H(t)x(i)‖²,   (3)
where x = [x_1, x_2, · · · , x_N]^T are the inputs to the neural networks and W(t) is an N × r matrix representing the signal subspace. To derive a recursive update of W(t) from W(t−1), Yang [3] approximates W^H(t)x(i) by the expression y(i) = W^H(i−1)x(i). Then (3) results in the modified cost function

J(W(t)) = Σ_{i=1}^{t} β^{t−i} ‖x(i) − W(t)y(i)‖².   (4)
This becomes the exponentially weighted least-squares cost function, which is well studied in adaptive filtering. J(W(t)) is minimized if

W(t) = C_xy(t) C_yy^{−1}(t),   (5)

where C_xy(t) = Σ_{i=1}^{t} β^{t−i} x(i) y^H(i) and C_yy(t) = Σ_{i=1}^{t} β^{t−i} y(i) y^H(i).

3 PAST Using Least-Squares Prewhitening
In this section, we apply the least-squares prewhitening procedure to the PAST algorithm cost function given by

J(W(t)) = Σ_{i=1}^{t} β^{t−i} ‖x(i) − W(t)y(i)‖².   (6)
The conventional RLS algorithm in [4] for adjusting W(n) is

e(n) = x(n) − W(n − 1)y(n),   (7)

W(n) = W(n − 1) + e(n)k(n),   (8)

k(n) = C_yy^{−1}(n − 1) y(n) / ( β + y^T(n) C_yy^{−1}(n − 1) y(n) ).   (9)
We can replace C_yy^{−1}(n − 1) in (9) with P^T(n − 1)P(n − 1) to obtain

k(n) = P^T(n − 1)P(n − 1)y(n) / ( β + y^T(n)P^T(n − 1)P(n − 1)y(n) ).   (10)
We apply the least-squares prewhitening algorithm in [5] as follows:

P^T(n)P(n) = C_yy^{−1}(n).   (11)
Consider the update for C_yy^{−1}(n) in the conventional RLS algorithm, given by

C_yy^{−1}(n) = (1/β) [ C_yy^{−1}(n − 1) − C_yy^{−1}(n − 1)y(n)y^T(n)C_yy^{−1}(n − 1) / ( β + y^T(n)C_yy^{−1}(n − 1)y(n) ) ].   (12)

Substituting P^T(n)P(n) for C_yy^{−1}(n) in (12) yields

P^T(n)P(n) = (1/√β) P^T(n − 1) [ I − v(n)v^T(n) / ( β + ‖v(n)‖² ) ] P(n − 1) (1/√β),   (13)

where the prewhitened signal vector v(n) is

v(n) = P(n − 1)y(n).   (14)
The vector v(n) enjoys the following property: if y(n) is a wide-sense stationary sequence, then

lim_{n→∞} E[ v(n)v^T(n) ] ≅ (1 − β) I.   (15)
That is, as P(n) converges, the elements of v(n) are approximately uncorrelated with variance (1 − β). We decompose the matrix in large brackets on the RHS of (13) as in (16):

B^T(n)B(n) = I − v(n)v^T(n) / ( β + ‖v(n)‖² ).   (16)

Then P(n − 1) can be updated using

P(n) = (1/√β) B(n)P(n − 1).   (17)
The matrix on the RHS of (16) has one eigenvalue equal to β/(β + ‖v(n)‖²) in the direction of v(n) and N − 1 unity eigenvalues in the null space of v(n), or

B^T(n)B(n) = I − v(n)v^T(n)/‖v(n)‖² + ( β/(β + ‖v(n)‖²) ) v(n)v^T(n)/‖v(n)‖².   (18)

Due to the symmetry and orthogonality of the two terms in (18), a symmetric square-root factor of B^T(n)B(n) is

B(n) = I − v(n)v^T(n)/‖v(n)‖² + ( β/(β + ‖v(n)‖²) )^{1/2} v(n)v^T(n)/‖v(n)‖².   (19)

Substituting (19) into (17) yields, after some algebraic simplification, the update

P(n) = (1/√β) ( P(n − 1) − ς(n)v(n)u^T(n) ),   (20)

u(n) = P^T(n − 1)v(n),   (21)

ς(n) = (1/‖v(n)‖²) ( 1 − ( β/(β + ‖v(n)‖²) )^{1/2} ).   (22)

Equation (10) along with (14) and (21) constitutes the new adaptive square-root factorization algorithm:

k(n) = u(n) / ( β + ‖v(n)‖² ).   (23)
Therefore, combining (23) with (7) and (8), we obtain a new prewhitened PAST algorithm. The algorithm’s complexity is similar to that of the conventional PAST algorithm; however, the former algorithm’s numerical properties are much improved.
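One time step of the resulting algorithm can be sketched as follows (a sketch for real-valued data; function and variable names are ours, and P is initialized as δ^{−1/2} I so that P^T P = (δI)^{−1}):

```python
import numpy as np

def prewhitened_past_step(W, P, x, beta):
    """One update of the prewhitened PAST sketch:
    eqs. (7), (8), (14), (20)-(23) for real-valued signals."""
    y = W.T @ x                        # projection approximation y(n)
    v = P @ y                          # eq. (14): prewhitened vector
    u = P.T @ v                        # eq. (21)
    nv2 = v @ v
    k = u / (beta + nv2)               # eq. (23): gain vector
    e = x - W @ y                      # eq. (7): residual
    W = W + np.outer(e, k)             # eq. (8): subspace update
    zeta = (1.0 - np.sqrt(beta / (beta + nv2))) / nv2   # eq. (22)
    P = (P - zeta * np.outer(v, u)) / np.sqrt(beta)     # eq. (20)
    return W, P

rng = np.random.default_rng(1)
N, r, beta, delta = 8, 3, 0.95, 0.1
W = np.linalg.qr(rng.normal(size=(N, r)))[0]   # initial orthonormal subspace
P = np.eye(r) / np.sqrt(delta)                 # P^T P = (delta I)^{-1}
Cyy = delta * np.eye(r)                        # reference: explicit C_yy(n)
for _ in range(50):
    x = rng.normal(size=N)
    y = W.T @ x
    Cyy = beta * Cyy + np.outer(y, y)
    W, P = prewhitened_past_step(W, P, x, beta)
```

In exact arithmetic P^T P tracks C_yy^{−1}(n) exactly, which is the factorized property (11) that removes the symmetry-loss failure mode of the direct inverse update.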
4 Simulation Results
In this simulation, we show the robustness of the proposed algorithm. For this purpose, we set two different forgetting factors, which generally cause instability in RLS. In the following, we show some simulation results to demonstrate the applicability of our algorithm to adaptive DOA estimation. For DOA estimation, we track the principal component subspace of a time-varying narrowband far-field source at an SNR of 15 dB using a uniform linear array with 8 sensors, and apply the ESPRIT method to estimate the DOA. For the experiment, we made a scenario in which the DOA of the signal changes suddenly from −30° to −40° at the 150th time step. Figure 1 shows the results of the proposed
Fig. 1. Comparison of DOA estimation results (solid and *: PAST with 0.98; solid and O: proposed method with 0.98; dotted and x: PAST with 0.8; dotted and O: proposed method with 0.8)
Fig. 2. Comparison of estimation variance results (solid and *: PAST with 0.98; solid and O: proposed method with 0.98; dotted: PAST with 0.8; solid: proposed method with 0.8)
Fig. 3. Comparison of subspace distance (dotted: PAST with 0.98; x: proposed method with 0.98; dash-dotted: PAST with 0.8; solid: proposed method with 0.8)
algorithm in DOA estimation. For comparison, we also show the experimental results using the PAST with fixed forgetting factors of 0.98 and 0.8. It shows the
almost the same estimation results for the conventional PAST and the proposed algorithm. Figure 2 shows the estimation variances. The estimation variance can be a measure of the numerical robustness of an algorithm. The results for forgetting factor 0.98 are the same; however, for forgetting factor 0.8 the proposed method provides a better result, because the robustness of the proposed algorithm protects the matrix inversion from the instability that comes from the short data window of forgetting factor 0.8. Figure 3 demonstrates the quality of the estimated subspace by the distance between the estimated subspace and the true subspace, which is defined as [4]

sin θ(S, S̃) ≡ ‖(I − P) P̃‖,   (24)

where S is the true subspace, S̃ is the estimated subspace, P is the projector onto S and P̃ is the projector onto S̃. The figure shows that the proposed algorithm offers the same or more accurate estimation than the RLS-based PAST algorithm. In particular, the better result in the small forgetting factor case comes from the robustness of the prewhitening.
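The subspace distance of (24) is straightforward to compute from orthogonal projectors; a small sketch (function names are ours):

```python
import numpy as np

def subspace_distance(S, S_hat):
    """sin(theta) distance of eq. (24) between the column spans of S, S_hat."""
    P = S @ np.linalg.pinv(S)               # projector onto span(S)
    P_hat = S_hat @ np.linalg.pinv(S_hat)   # projector onto span(S_hat)
    return np.linalg.norm((np.eye(S.shape[0]) - P) @ P_hat, 2)

# identical spans give distance 0; orthogonal spans give distance 1
S = np.eye(4)[:, :2]                        # span{e1, e2}
T = np.eye(4)[:, 2:]                        # span{e3, e4}
```

The spectral norm here equals the sine of the largest principal angle between the two subspaces, so the value lies in [0, 1].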
5 Conclusion
In this paper, we have proposed a new principal component extraction algorithm based on the PAST with a prewhitening procedure, which makes the conventional PAST algorithm numerically robust. The simulation results have shown that the proposed algorithm outperforms the PAST algorithm.
References
1. Oja, E.: A Simplified Neuron Model as a Principal Component Analyzer. J. Math. Biol. 15 (1982) 267-273
2. Diamantaras, K. I., Kung, S. Y.: Principal Component Neural Networks: Theory and Applications. Wiley, New York (1996)
3. Yang, B.: Projection Approximation Subspace Tracking. IEEE Trans. Signal Proc. 43 (1995) 95-107
4. Haykin, S.: Adaptive Filter Theory, 3rd edn. Prentice-Hall, Upper Saddle River, NJ (1996)
5. Douglas, S.C.: Numerically-Robust O(N²) Recursive Least-Squares Estimation Using Least-Squares Prewhitening. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing 1 (2000) 412-415
6. Dasilva, F.M., Almeida, L.B.: A Distributed Decorrelation Algorithm. In: Gelenbe, E. (ed.): Neural Networks: Advances and Applications. Elsevier Science, Amsterdam (1991) 145-163
7. Douglas, S.C., Cichocki, A.: Neural Networks for Blind Decorrelation of Signals. IEEE Trans. Signal Processing 45 (1997) 2829-2842
Classification with the Hybrid of Manifold Learning and Gabor Wavelet

Junping Zhang1,2, Chao Shen1, and Jufu Feng3

1 Shanghai Key Laboratory of Intelligent Information Processing, Department of Computer Science and Engineering, Fudan University, Shanghai 200433, China {jpzhang, 0272395}@fudan.edu.cn
2 The Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, China
3 Center for Information Science, National Key Laboratory for Machine Perception, School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China [email protected]
Abstract. While manifold learning algorithms can discover an intrinsic low-dimensional manifold embedded in a high-dimensional Euclidean space, the discriminant ability of the low-dimensional subspaces obtained by these algorithms is often lower than that of the subspaces obtained by conventional dimensionality reduction approaches. Furthermore, the original feature vectors may include redundancy, such as high-order correlation, which cannot be removed by manifold learning algorithms. To address these two problems, we first employ the Gabor wavelet to remove intrinsic redundancies of images and obtain a set of over-complete feature vectors. Then a supervised manifold learning algorithm (ULLELDA) is applied to project the Gabor-based data and out-of-sample points into a common low-dimensional subspace. Experiments on two FERET face databases indicate that Gabor features indeed help supervised manifold learning to remarkably improve the discriminant ability of the low-dimensional subspaces.
1 Introduction
Recently, many manifold learning algorithms, such as the Locally Linear Embedding (LLE) algorithm [1] and the Isometric Mapping (ISOMAP) algorithm [2], were proposed to recover low-dimensional manifolds embedded in a high-dimensional Euclidean space. However, it is more difficult to obtain through manifold learning than through conventional dimensionality reduction approaches an effective discriminant subspace, one that improves classification accuracy when data are projected into it. One possible reason is that high-order information has not been extracted into the feature vectors. The other reason is that many manifold learning algorithms lay more stress on recovering a global coordinate system and pay less attention to enhancing the discriminant ability of that coordinate system.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1346–1351, 2006. © Springer-Verlag Berlin Heidelberg 2006
Classification with the Hybrid of Manifold Learning
1347
In this paper, we propose the hybrid of manifold learning and the Gabor wavelet (GF+ULLELDA) for face recognition. First, the Gabor wavelet is applied to calculate high-order statistical information of face images and obtain a group of over-complete feature basis sets. Then, a supervised manifold learning approach, the unified locally linear embedding and linear discriminant analysis (ULLELDA) algorithm, is adopted for projecting data and out-of-sample points into a discriminant subspace. Once data are projected into the discriminant subspace, the face images are recognized by some classifiers. Experiments on two face databases show the promise of the proposed GF+ULLELDA algorithm. The remainder of this paper is organized as follows. In Section 2 we propose the combination of Gabor features and the supervised manifold learning approach. We report experiments on two face databases in Section 3. We end with a discussion in Section 4.
2 The Hybrid of Gabor Wavelet and the ULLELDA Algorithm

2.1 Gabor Feature Representation
Gabor wavelets are, in theory, particularly aggressive at capturing the features of local structures corresponding to spatial frequency, spatial localization, and orientation selectivity [3]. The Gabor function is basically a Gaussian modulated by a complex sinusoid. The kernel of the 2D Gabor wavelets commonly used in face recognition (henceforth GF) has the following expression at the spatial position x = (x_1, x_2) [3]:

ψ_{μ,ν}(x) = ( ‖k_{μ,ν}‖²/σ² ) exp( −‖k_{μ,ν}‖²‖x‖² / (2σ²) ) [ exp(i k_{μ,ν}·x) − exp(−σ²/2) ]   (1)
where μ and ν define the orientation and scale of the Gabor filter, which is a product of a Gaussian envelope and a complex oscillation with wave vector k_{μ,ν}. Here k_{μ,ν} = k_ν e^{iφ_μ}, k_ν = k_max/f^ν, φ_μ = πμ/8, k_max is the maximum frequency and f is the spacing factor between wavelets in the frequency domain. We choose the following parameters: σ = 2π, k_max = π/2, and f = √2 [4]. In most cases one uses Gabor wavelets of five different scales, ν ∈ {0, ..., 4}, and eight orientations, μ ∈ {0, ..., 7}. Let I(x, y) be an intensity image and ψ_{μ,ν} be a Gabor kernel. The set S = {O_{μ,ν}(z) : μ ∈ {0, ..., 7}, ν ∈ {0, ..., 4}} then forms the Gabor wavelet representation of the image I(x, y). We can compute O_{μ,ν}(z) using the FFT and inverse FFT:

O_{μ,ν}(z) = F^{−1}{ F{I(z)} F{ψ_{μ,ν}(z)} }   (2)

where F and F^{−1} denote the FFT and inverse FFT respectively. Considering computational cost, we first down-sample each O_{μ,ν}(z) to reduce the size of the image, and then concatenate all row vectors of each image into a single vector. Thus, the Gabor wavelet representation of an image is:
(3)
where the subscripts i, j of O denote scale and orientation, respectively, and the superscript ρ = 2^k denotes the degree of downsampling (k: the number of times the image is halved; in this paper, k = 3).
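The filter-bank construction of eqs. (1)-(3) can be sketched as follows (function names are ours; we take magnitudes of the complex filter outputs before concatenation, a common choice that the paper does not spell out):

```python
import numpy as np

def gabor_kernel(mu, nu, size=31, sigma=2*np.pi, kmax=np.pi/2, f=np.sqrt(2)):
    """2-D Gabor kernel of eq. (1); mu = orientation, nu = scale."""
    k = kmax / f**nu
    phi = np.pi * mu / 8
    half = size // 2
    x1, x2 = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1))
    r2 = x1**2 + x2**2
    envelope = (k**2 / sigma**2) * np.exp(-k**2 * r2 / (2 * sigma**2))
    carrier = np.exp(1j * k * (np.cos(phi) * x1 + np.sin(phi) * x2)) \
              - np.exp(-sigma**2 / 2)
    return envelope * carrier

def gabor_features(I, down_k=3):
    """Eqs. (2)-(3): filter I with the 40-filter bank via FFT,
    downsample by 2**down_k, and concatenate."""
    H, Wd = I.shape
    FI = np.fft.fft2(I)
    step = 2 ** down_k
    feats = []
    for nu in range(5):            # five scales
        for mu in range(8):        # eight orientations
            Fpsi = np.fft.fft2(gabor_kernel(mu, nu), s=(H, Wd))
            O = np.fft.ifft2(FI * Fpsi)                  # eq. (2)
            feats.append(np.abs(O[::step, ::step]).ravel())
    return np.concatenate(feats)                          # eq. (3)

feat = gabor_features(np.random.default_rng(4).normal(size=(32, 32)))
```

For a 32 × 32 input and k = 3, each of the 40 filter responses is downsampled to 4 × 4, giving a 640-dimensional feature vector.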
2.2 The ULLELDA Algorithm
After the high-order information is extracted by the Gabor wavelets, the previously proposed ULLELDA algorithm is adopted for projecting the high-dimensional data into a corresponding low-dimensional discriminant subspace [5]. The basic framework of the ULLELDA algorithm can be summarized as follows:
1. Let the neighborhood factor be K and the reduced dimension be d. Training samples are first projected into a d-dimensional subspace through the LLE algorithm. The first step of the LLE algorithm is to approximate the manifold around each sample {x_i} ⊂ χ^ρ with a linear hyperplane passing through its neighbors {x_{i1}, . . . , x_{iK}} ⊂ χ^ρ. The weights w_ij are calculated by minimizing the cost

ψ(W) = Σ_{i=1}^{n} | x_i − Σ_{j=1}^{K} w_ij x_ij |²,   x_i, x_ij ∈ R^N   (4)

by constrained linear fitting, where n denotes the number of training samples, Σ_j w_ij = 1, and w_ij = 0 if x_ij does not belong to the neighborhood set of x_i.
2. Assume that the weight relationship between a sample and its neighborhood samples is invariant when the data are mapped into the low-dimensional subspace. Therefore, define

Φ(Y) = Σ_{i=1}^{n} | y_i − Σ_j w*_ij y_ij |²   (5)

where W* = arg min_W ψ(W). Subject to the constraints Σ_i y_i = 0 and Σ_i y_i y_i^T = I, the optimal solution of Y in Formula (5) consists of the smallest (bottom) d + 1 eigenvectors of the matrix (I − W*)^T (I − W*), with the eigenvector of the zero eigenvalue discarded.
3. The d-dimensional data are further mapped into a d'-dimensional discriminant subspace through the LDA algorithm (linear discriminant analysis). Suppose that each class occurs with equal probability. The intra-class scatter matrix is S_w = Σ_{i=1}^{L} Σ_{j=1}^{n_i} (y_j − m_i)(y_j − m_i)^T, for n_i samples from class i with class-specific mean m_i, i = 1, 2, . . . , L. For the overall mean m of all the classes, the inter-class scatter matrix is S_b = Σ_{i=1}^{L} (m_i − m)(m_i − m)^T. To maximize the inter-class distances while minimizing the intra-class distances of the face manifolds, the column vectors of the discriminant matrix D are the eigenvectors of S_w^{−1} S_b associated with the largest eigenvalues. The matrix D then projects vectors in the low-dimensional face subspace into the common discriminant subspace, which can be formulated as follows:

Z = DY,   Z ⊂ R^{d'}, Y ⊂ R^{d}, D ∈ R^{d'×d}   (6)
4. To project an out-of-sample point x' into the discriminant subspace, we assume that the neighbor weights between the sample and its neighbor samples from the training set are preserved in the discriminant subspace. Therefore, the weights and the low-dimensional position of the out-of-sample point x' are calculated as follows:

φ(W) = | x' − Σ_{j=1}^{K} w'_j x'_j |²,   x'_j ∈ R^N   (7)

z' = Σ_{j=1}^{K} w'_j z'_j,   z'_j ∈ R^{d'}   (8)
Once all the test images are projected into the discriminant subspace, a classifier can be employed to identify unlabelled samples.
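Step 4 amounts to a small constrained least-squares problem per test image; a sketch of eqs. (7)-(8) follows (function names are ours; the regularization term is a standard LLE stabilization, not in the paper):

```python
import numpy as np

def project_out_of_sample(x_new, X_train, Z_train, K=5, reg=1e-3):
    """Reconstruct x_new from its K nearest training samples (eq. (7))
    and reuse the weights on their embedded coordinates (eq. (8))."""
    d2 = np.sum((X_train - x_new) ** 2, axis=1)
    idx = np.argsort(d2)[:K]                    # K nearest neighbours
    G = X_train[idx] - x_new                    # local difference vectors
    C = G @ G.T
    C = C + reg * np.trace(C) * np.eye(K)       # stabilize the local Gram
    w = np.linalg.solve(C, np.ones(K))
    w = w / w.sum()                             # constraint: sum_j w_j = 1
    return w @ Z_train[idx]                     # eq. (8)

# toy check: Z is a linear image of X, and x_new lies in the affine
# hull of its three nearest training points
X_train = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.],
                    [5., 5., 5.], [6., 5., 5.], [5., 6., 5.]])
M = np.array([[1., 2.], [0., 1.], [1., 0.]])
Z_train = X_train @ M
x_new = np.array([0.25, 0.25, 0.0])
z = project_out_of_sample(x_new, X_train, Z_train, K=3)
```

Because the embedding here is linear and x_new is affinely reconstructible from its neighbors, the projected point lands near `x_new @ M`.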
3 Experiments
To verify whether the proposed hybrid approach can project high-dimensional data into a more effective discriminant subspace, we randomly selected two collections of face images from the FERET database [6] to construct face databases. Both include 43 subjects, each with 10 face images, varying intrinsic features such as illumination and expression. The FERET1 database is mainly composed of frontal images, while in the FERET2 database large pose variation is introduced. The images in FERET1 are first cropped to 128 × 128 pixels to remove the influences of background and hairstyle. For FERET2, images are zoomed to 192 × 128 followed by ellipse masking. Meanwhile, the face images are roughly and manually aligned. Some examples of the two face databases are illustrated in Fig. 1. Second, the two reduced dimensions d and d' of the ULLELDA algorithm are fixed; d is set to 120. For the second mapping (LDA-based reduction), the reduced discriminant dimension L is generally no more than C − 1 (where
Fig. 1. Examples of FERET1 and FERET2 Face Images
Table 1. The error rates (%) and standard deviations (%) of the combination of dimensionality reduction and several classifiers on FERET1 and FERET2. K = 50, L = 50

FERET1        Mean           1-NN           SVM-Linear
PCALDA        19.42 ± 2.30   21.33 ± 2.54   20.84 ± 2.56
ULLELDA       21.81 ± 2.63   22.98 ± 2.51   20.58 ± 2.69
GF+PCALDA     10.02 ± 2.58   14.12 ± 3.82   11.84 ± 3.19
GF+ULLELDA     5.56 ± 1.26    7.47 ± 2.07    6.74 ± 2.08

FERET2        Mean           1-NN           SVM-Linear
PCALDA        28.30 ± 2.39   33.12 ± 3.48   28.63 ± 2.64
ULLELDA       39.14 ± 3.07   38.44 ± 2.83   36.54 ± 3.47
GF+PCALDA     21.77 ± 3.16   27.95 ± 4.90   23.81 ± 3.68
GF+ULLELDA    20.86 ± 2.74   24.88 ± 3.53   22.07 ± 3.31
C denotes the number of face clusters); otherwise the eigenvalues and eigenvectors will have complex values. In practice, we retain the real parts of the complex values when the second reduced dimension is higher than C − 1. In addition, the neighbor factor K of the LLE algorithm is set to 50. Furthermore, each face database is randomly divided into a training set and a test set without overlap. For each person, 5 images are selected as training samples and the remaining ones are test samples. To compare the performance of the proposed dimensionality reduction technique, several combined dimensionality reduction approaches are employed to project the face databases into different discriminant subspaces. The approaches include ULLELDA [5], PCALDA [7] and GF+PCALDA. Finally, three classifiers (the mean algorithm, the 1-nearest-neighbor algorithm and the pairwise SVM-linear algorithm [8]) are adopted to evaluate which dimensionality reduction technique is better for classifying the face databases. The experimental results are the average of 20 repetitions. From Table 1 it is clear that when only the ULLELDA algorithm is employed, its discriminant ability is the worst among the mentioned dimensionality reduction approaches for the two face databases. However, when combining the Gabor wavelets and supervised manifold learning, the discriminant ability of the corresponding subspace is remarkably improved. For example, on FERET1 the average error rate of GF+ULLELDA with the mean algorithm is about 55.49% of that of GF+PCALDA, 28.63% of that of PCALDA, and 25.49% of that of ULLELDA. Furthermore, the combination of GF+ULLELDA and the several classifiers achieves the lowest error rates in Table 1.
4 Conclusion
In this paper, we propose a combination approach of Gabor wavelet and supervised manifold learning. A specific Gabor wavelet, Gabor for face (GF) [4] is
employed to extract high-order and non-orthogonal feature vectors of face images, and the ULLELDA algorithm is adopted to project these vectors into a discriminant subspace. Compared with other dimensionality reduction techniques, the proposed hybrid technique can help classifiers obtain higher recognition performance on face databases. In the future, we will utilize Ada-Boosted Gabor features to further refine the current algorithm [9]. Also, a way to automatically select the neighborhood factor will be studied.
Acknowledgement This research is partially sponsored by NSFC under contract No. 60505002. Part of the research in this paper uses the Gray Level FERET database of facial images collected under the FERET program.
References
1. Roweis, S. T., Saul, L. K.: Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290 (2000) 2323-2326
2. Tenenbaum, J. B., de Silva, V., Langford, J. C.: A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290 (2000) 2319-2323
3. Daugman, J. G.: Complete Discrete 2-D Gabor Transforms by Neural Networks for Image Analysis and Compression. IEEE Transactions on Acoustics, Speech, and Signal Processing 36(7) (1988) 1169-1179
4. Liu, C., Wechsler, H.: A Gabor Feature Classifier for Face Recognition. ICCV (2001) 270-275
5. Zhang, J., Shen, H., Zhou, Z. H.: Unified Locally Linear Embedding and Linear Discriminant Analysis Algorithm (ULLELDA) for Face Recognition. Lecture Notes in Computer Science 3338. Springer-Verlag (2004) 209-307
6. Phillips, P. J., Moon, H., Rizvi, S. A., Rauss, P. J.: The FERET Evaluation Methodology for Face-Recognition Algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 1090-1104
7. Swets, D. L., Weng, J.: Using Discriminant Eigenfeatures for Image Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(8) (1996) 831-836
8. Cawley, G. C.: MATLAB Support Vector Machine Toolbox (v0.50β) [http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox]. University of East Anglia, School of Information Systems, Norwich, Norfolk, U.K. NR4 7TJ (2000)
9. Yang, P., Shan, S., Gao, W., Li, S. Z., Zhang, D.: Face Recognition Using Ada-Boosted Gabor Features. In: Proceedings of the Sixth International Conference on Automatic Face and Gesture Recognition (2004) 356-361
A Novel Input Stochastic Sensitivity Definition of Radial Basis Function Neural Networks and Its Application to Feature Selection

Xi-Zhao Wang and Hui Zhang

Faculty of Mathematics and Computer Science, Hebei University, China [email protected]
Abstract. For a well-trained radial basis function neural network, this paper proposes a novel input stochastic sensitivity definition and gives its computational formula, assuming the inputs are modelled by normally distributed random variables. Based on this formula, one can calculate the magnitude of the sensitivity for each input (i.e., feature), which indicates the degree of importance of the input to the output of the neural network. When there are redundant inputs in the training set, one always wants to remove those redundant features to avoid a large network. This paper shows that removing redundant features, or selecting significant features, can be accomplished by choosing features with sensitivity over a predefined threshold. A numerical experiment shows that the new approach to feature selection performs well.
1 Introduction
Consider a specified model of neural network that is employed to implement supervised learning. What is the appropriate number of inputs? Does each input strongly affect the network output? If the training set has redundant inputs, how could we find them and then delete them to get a reduced network? Our final goal is a reduced network whose generalization capability is higher than that of the original network. For this goal, we turn to sensitivity analysis to give some guidelines for input feature selection. Sensitivity analysis is a fundamental methodology in the field of neural networks. It mainly tries to reveal the relationship between the network output and the perturbation of particular network parameters. The sensitivity analysis of neural networks can be traced to 1962, when Hoff used an n-dimensional geometry approach to investigate the sensitivity of the adaline [1]. According to whether the neural network has been trained, sensitivity analysis can be roughly categorized into two kinds: (A) the sensitivity of a neural network that has been well trained, which aims at specific networks [2, 4]; (B) the sensitivity of a neural network that has not been trained, which aims at ensembles of networks [3, 6, 8, 9]. The former concentrates on the application of a particular neural network, while the latter pays more attention to neural network design. For both kinds of sensitivity analysis, investigators have developed many methods, such as n-dimensional
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1352–1358, 2006. © Springer-Verlag Berlin Heidelberg 2006
A Novel Input Stochastic Sensitivity Definition
geometry [1], partial derivatives [2] and statistical analysis [3, 4, 5, 10]. Each method has its own domain of validity. For example, n-dimensional geometry applies to threshold activation functions, partial derivatives require the activation function to be differentiable with respect to the parameters, and statistical analysis needs a model to simplify the computation [3, 6, 7]. Our motivation for introducing the input stochastic sensitivity definition of radial basis function neural networks (RBFNNs) is twofold. First, we want to build a stochastic model of RBFNNs in which each input is modelled by a random variable. Under this framework, each instance can be regarded as a realization of a random vector; thus network learning is related to the distribution of the inputs, whereas in traditional network learning this relationship is ignored. The normal distribution plays an important role in statistics: under some mild conditions, it can be used to approximate a large variety of distributions in large samples. Thus, the inputs of the network are assumed to be normally and independently distributed in this paper. Second, we want to define an input stochastic sensitivity of RBFNNs under which the order of the features' sensitivities is independent of the input perturbation. In [5], Ng et al. introduced the sensitivity of a radial basis function neural network (RBFNN) as the variance of the difference of the outputs before and after input and weight perturbations. However, the sensitivity defined in this way depends on the magnitude of the input perturbation; this defect seriously downgrades the applicability of the sensitivity formula [14]. This paper proposes a new definition of sensitivity which is independent of the input perturbation and shows that the computational complexity of the new sensitivity formula is O(m² × n), where m is the number of hidden neurons and n is the number of inputs.
This paper focuses on the input sensitivity of RBFNNs with Gaussian kernel functions. The RBFNN is a typical feedforward neural network composed of three layers: an input layer, a hidden layer and an output layer. Because RBFNNs have particular advantages over general multilayer perceptron networks, they have found extensive applications [6, 7, 8]. Without loss of generality, we consider the case where only one output neuron exists in the output layer. Throughout this paper, we adopt the convention that uppercase letters denote random variables and lowercase letters denote real variables; for example, X denotes a random variable while x denotes a real variable.
2
Input Stochastic Sensitivity Definition and Its Computational Formula
The radial basis function neural network realizes a mapping $\mathbb{R}^n \to \mathbb{R}^1$:

$$f(x_1, x_2, \cdots, x_n) = \sum_{j=1}^{m} w_j \exp\left( \sum_{i=1}^{n} \frac{(x_i - u_{ji})^2}{-2v_j^2} \right). \qquad (1)$$
X.-Z. Wang and H. Zhang
where n is the number of features, m the number of centers, $(u_{j1}, u_{j2}, \cdots, u_{jn})$ the j-th center, $v_j$ the spread of the j-th center and $w_j$ the weight of the output layer connected to the j-th hidden neuron, $j = 1, 2, \cdots, m$. Before introducing the input stochastic sensitivity definition of RBFNNs, we first give the definition of the partial derivative of a function of random variables with respect to one random variable.

Definition 1. Suppose $(\Omega, \mathcal{A}, P)$ is a probability space, and $X_i$ is a random variable defined on $(\Omega, \mathcal{A}, P)$ whose first and second moments exist, $i = 1, 2, \cdots, n$. Let the real-valued function $f(x_1, x_2, \cdots, x_n)$ be differentiable with respect to each real variable $x_i$ $(i = 1, 2, \cdots, n)$. Then $f(X_1, X_2, \cdots, X_n)$ is a function of random variables, and the partial derivative of $f(X_1, X_2, \cdots, X_n)$ with respect to the random variable $X_1$ is defined as

$$\frac{\partial f(X_1, X_2, \cdots, X_n)}{\partial X_1} = \lim_{\Delta x \to 0} \frac{f(X_1 + \Delta x, X_2, \cdots, X_n) - f(X_1, X_2, \cdots, X_n)}{\Delta x}, \qquad (2)$$

provided the limit on the right of Equation (2) converges to a random variable. In Definition 1, the limit is in the mean-square sense, that is,

$$\lim_{\Delta x \to 0} f(X, \Delta x) = Y \quad \text{iff} \quad \lim_{\Delta x \to 0} E\left[ (Y - f(X, \Delta x))^2 \right] = 0,$$
where $f(X, \Delta x)$ and $Y$ are two random variables and $E(\cdot)$ denotes the expectation of a random variable. Note that $\Delta x$ is not a random variable. With the support of Definition 1, the input stochastic sensitivity definition of RBFNNs is given below.

Definition 2. The input stochastic sensitivity of the first feature of RBFNNs is defined as

$$E\left[ \left( \frac{\partial f(X_1, X_2, \cdots, X_n)}{\partial X_1} \right)^2 \right]. \qquad (3)$$

It is easy to define the sensitivity of the other features, similar to Equation (3). According to Definition 1 and Lebesgue's Dominated Convergence Theorem [12], we have

$$\frac{\partial f}{\partial X_1} = \sum_{j=1}^{m} w_j \frac{X_1 - u_{j1}}{-v_j^2} \exp\left( \sum_{i=1}^{n} \frac{(X_i - u_{ji})^2}{-2v_j^2} \right).$$

Suppose that $X_1, X_2, \cdots, X_n$ are normally and independently distributed, denoted by $X_i \sim N(\mu_i, \sigma_i^2)$ $(i = 1, 2, \cdots, n)$, that is, $X_i$ has mean $\mu_i$ and variance $\sigma_i^2$. Then the computational formula for Definition 2 can be derived as

$$E\left[ \left( \frac{\partial f}{\partial X_1} \right)^2 \right] = \sum_{j,k=1}^{m} \frac{w_j w_k}{v_j^2 v_k^2} \, A_{jk} B_{jk}, \qquad (4)$$
where

$$A_{jk} = \frac{(v_k^2 \sigma_1^2 u_{j1} + v_j^2 \sigma_1^2 u_{k1} + v_j^2 v_k^2 \mu_1)^2}{(v_k^2 \sigma_1^2 + v_j^2 \sigma_1^2 + v_j^2 v_k^2)^2} + \frac{v_j^2 v_k^2 \sigma_1^2 - (u_{j1} + u_{k1})(v_k^2 \sigma_1^2 u_{j1} + v_j^2 \sigma_1^2 u_{k1} + v_j^2 v_k^2 \mu_1)}{v_k^2 \sigma_1^2 + v_j^2 \sigma_1^2 + v_j^2 v_k^2} + u_{j1} u_{k1},$$

$$B_{jk} = \prod_{i=1}^{n} \frac{v_j v_k}{\sqrt{v_k^2 \sigma_i^2 + v_j^2 \sigma_i^2 + v_j^2 v_k^2}} \exp\left( -\frac{\sigma_i^2 (u_{ji} - u_{ki})^2 + v_j^2 (u_{ki} - \mu_i)^2 + v_k^2 (u_{ji} - \mu_i)^2}{2 (v_k^2 \sigma_i^2 + v_j^2 \sigma_i^2 + v_j^2 v_k^2)} \right).$$
We omit the detailed derivation of Equation (4); it can be obtained by computing several Gaussian integrals according to the definition of the expectation of a random variable. From Equation (4), the sensitivity of the first feature can be calculated once the following parameters are known: the number of hidden neurons m, the spread vj and output weight wj of each center, the centers (uj1, uj2, · · · , ujn), and the mean and variance of the first input. The value of the sensitivity, computed after the network has been trained, depends on the distribution of the inputs; however, it is independent of the input perturbation. The computational complexity of the sensitivity is O(m² × n), where m is the number of hidden neurons and n is the number of inputs.
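Equation (4) as reconstructed above can be sketched directly in code. The following pure-Python sketch (parameter names and test values are illustrative, not the authors' implementation) computes the analytic sensitivity and cross-checks it against a Monte Carlo estimate of E[(∂f/∂X1)²]:

```python
import math
import random

def sensitivity_x1(weights, centers, spreads, mu, var):
    """Analytic E[(df/dX1)^2] per Equation (4); A_jk and B_jk as in the text.

    weights[j], spreads[j] -- w_j and v_j of hidden neuron j
    centers[j][i]          -- center component u_ji
    mu[i], var[i]          -- mean and variance of input X_i
    """
    m, n = len(weights), len(mu)
    total = 0.0
    for j in range(m):
        for k in range(m):
            vj2, vk2 = spreads[j] ** 2, spreads[k] ** 2
            # D and N for the first input (index 0 in 0-based indexing)
            D = vk2 * var[0] + vj2 * var[0] + vj2 * vk2
            N = (vk2 * var[0] * centers[j][0] + vj2 * var[0] * centers[k][0]
                 + vj2 * vk2 * mu[0])
            A = ((N / D) ** 2
                 + (vj2 * vk2 * var[0] - (centers[j][0] + centers[k][0]) * N) / D
                 + centers[j][0] * centers[k][0])
            B = 1.0
            for i in range(n):
                Di = vk2 * var[i] + vj2 * var[i] + vj2 * vk2
                num = (var[i] * (centers[j][i] - centers[k][i]) ** 2
                       + vj2 * (centers[k][i] - mu[i]) ** 2
                       + vk2 * (centers[j][i] - mu[i]) ** 2)
                B *= spreads[j] * spreads[k] / math.sqrt(Di) * math.exp(-num / (2.0 * Di))
            total += weights[j] * weights[k] / (vj2 * vk2) * A * B
    return total

def mc_sensitivity_x1(weights, centers, spreads, mu, var, samples=100000, seed=0):
    """Monte Carlo cross-check: sample X ~ N(mu, var), average (df/dx1)^2."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(samples):
        x = [rng.gauss(m_i, math.sqrt(v_i)) for m_i, v_i in zip(mu, var)]
        d = 0.0
        for u_j, v_j, w_j in zip(centers, spreads, weights):
            e = math.exp(-sum((xi - ui) ** 2 for xi, ui in zip(x, u_j))
                         / (2.0 * v_j ** 2))
            d += w_j * (-(x[0] - u_j[0]) / v_j ** 2) * e
        acc += d * d
    return acc / samples
```

The double loop over j, k with an inner product over the n inputs makes the O(m² × n) complexity stated above explicit.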
3
Application to Feature Selection
According to the sensitivity formula derived in Section 2, the sensitivity of each feature can be computed one by one. By sorting the calculated sensitivity magnitudes, a ranking of the features is obtained. Usually, features with small sensitivity are regarded as redundant and may be deleted to reduce the network size. There are several strategies to decide which features are redundant. In [2], a largest gap method was introduced. In brief, the features are first arranged in descending order of sensitivity magnitude; then the ratio of sensitivities between each pair of adjacent features is calculated; finally, the largest ratio (largest gap) and the corresponding feature are found, and all features after it may be deleted. An additional criterion of this method introduces a constant C, which represents the degree of tolerance for the gap: the smaller the constant C, the larger the gap that will be accepted and the fewer features that will be deleted. A more detailed description of this method can be found in [2]. The largest gap method is efficient and convenient when the largest gap is salient. However, when the ratios of sensitivity between adjacent features are all on the same level, i.e. the difference between the maximum ratio and the minimum one is small or even close to zero, no feature will be deleted by this method, as will be illustrated in the experiment.
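One possible reading of the largest gap rule just described can be sketched as follows. The tolerance parameter `min_gap` stands in for the role of the constant C in [2]; its exact use is paraphrased here, so treat this as an illustrative sketch rather than the method of [2]:

```python
def largest_gap_selection(sensitivities, min_gap=2.0):
    """Rank features by sensitivity and cut at the largest ratio between
    adjacent features.  If no adjacent ratio exceeds the tolerance
    `min_gap`, the gap is not considered salient and nothing is deleted."""
    ranked = sorted(sensitivities.items(), key=lambda kv: kv[1], reverse=True)
    ratios = [ranked[i][1] / ranked[i + 1][1] for i in range(len(ranked) - 1)]
    if not ratios or max(ratios) < min_gap:
        return [f for f, _ in ranked]        # no salient gap: keep everything
    cut = ratios.index(max(ratios)) + 1      # keep the features before the gap
    return [f for f, _ in ranked[:cut]]
```

With the sensitivity values of Table 1 the salient gap (ratio 2.5578) removes feature 3, while with the Table 2 values, where all adjacent ratios stay below 1.45, no feature is deleted, illustrating the failure mode noted above.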
Motivated by the cross-validation method, which is usually adopted for model selection [11], a similar strategy is taken to select the features that are significant to the network. The value of the sensitivity depends on the network parameters and the input distribution; nevertheless, it is reasonable to expect that the partition dividing the feature set into important and unimportant subsets will be independent of the network as long as the network approximates the input-output mapping within accuracy. Thus, we first train several RBFNNs within accuracy. Then, for each RBFNN, we rank the features and choose the significant ones whose sensitivities are over a predefined threshold, obtaining a candidate set of significant features. Finally, the intersection of these sets is the set of important features. The predefined threshold can be determined as the maximum sensitivity multiplied by a constant C (0 < C < 1): the larger the constant C, the fewer features will be preserved. When one prefers a relatively conservative deleting mechanism, a small value of C will work. Notice that C is problem-dependent and can be determined from prior information or chosen arbitrarily within a reasonable range. A simulation has been done to verify the proposed method; comparative results between our method and other methods are also given at the end of the paper. Firstly, we generate random numbers from the standard normal distribution: 5 columns of independently and identically distributed random numbers, where each column denotes the realization of an input random variable and each row denotes an instance. Secondly, we construct a mapping that is related only to some columns of the training data and has no relationship with the others, and generate the target values of the instances according to this mapping. The orthogonal least squares learning algorithm [13] is employed to train the network. After the neural network is well trained (i.e. the sum of squares error (SSE) of the training set is lower than our predetermined goal), we obtain the parameters needed to compute the sensitivity of each input. According to the sensitivity formula in Equation (4), the sensitivity value of each input can be calculated one by one. Here we compute the sensitivities for 3 RBFNNs. For each RBFNN obtained, the order of the features' sensitivities is shown in Tables 1 to 3 (O denotes the order, S the sensitivity). The ratio of each pair of adjacent features' sensitivities is also given; note that the largest gap method is not effective in Table 2. From Tables 1 and 2, one can see that inputs 1, 2, 4 and 5 are more important than input 3 (here the predefined threshold is the largest sensitivity multiplied by the constant 0.45), while in Table 3, inputs 1, 2 and 5 are more important than inputs 4 and 3. Thus the important inputs are the intersection of the two sets: {1, 2, 4, 5} ∩ {1, 2, 5} = {1, 2, 5}. These three inputs are chosen to re-train the network. After training, the sum of squares test error is 19.52. The test error obviously descends dramatically, which reflects that the network has a higher generalization capability after the redundant inputs are deleted. Comparative results between our method and the methods described in [2, 6] are illustrated in Table 4; these two methods cannot delete the redundant inputs. In [2], the sensitivity definition of input also
adopts the partial derivative of the output with respect to the input; however, there the inputs are modelled as real-valued variables without considering the input distribution, while in this paper the inputs are modelled as random variables, which is an improvement of the sensitivity definition. In [6], the order of the sensitivities depends on the magnitude of the input perturbation, which we have proved in [14]. Thus, the redundant inputs can be efficiently deleted by adopting the novel sensitivity definition.

Table 1. SSE = 197.79

O  S        Ratio
2  1.3225   1.0378
1  1.2743   1.5490
5  0.82268  1.0085
4  0.81577  2.5578
3  0.31893  –

Table 2. SSE = 465.51

O  S        Ratio
2  1.1228   1.2253
1  0.91638  1.0288
5  0.89073  1.4478
4  0.61524  1.2323
3  0.49926  –

Table 3. SSE = 132.37

O  S        Ratio
1  2.2934   1.6809
2  1.3644   1.4902
5  1.1032   1.2368
4  0.85809  2.1486
3  0.39938  –
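The threshold-and-intersection selection described in the text can be reproduced from the sensitivity values in Tables 1-3 (with C = 0.45, as stated above). A minimal sketch:

```python
def significant_features(sensitivities, C=0.45):
    """Keep the features whose sensitivity is at least C times the maximum."""
    threshold = C * max(sensitivities.values())
    return {f for f, s in sensitivities.items() if s >= threshold}

# Sensitivity values from Tables 1-3 (feature -> sensitivity)
nets = [
    {2: 1.3225, 1: 1.2743, 5: 0.82268, 4: 0.81577, 3: 0.31893},   # Table 1
    {2: 1.1228, 1: 0.91638, 5: 0.89073, 4: 0.61524, 3: 0.49926},  # Table 2
    {1: 2.2934, 2: 1.3644, 5: 1.1032, 4: 0.85809, 3: 0.39938},    # Table 3
]
selected = set.intersection(*(significant_features(s) for s in nets))
```

Tables 1 and 2 each yield {1, 2, 4, 5}, Table 3 yields {1, 2, 5}, and the intersection gives the important inputs {1, 2, 5}, matching the result reported above.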
Table 4. Comparative Results with Other Feature Selection Methods

Method         Features Selected  Test Error (SSE)
Our Method     1, 2, 5            19.52
Method in [2]  1, 2, 3, 4, 5      132.37
Method in [6]  1, 2, 3, 4, 5      132.37

4
Conclusion
A novel input stochastic sensitivity definition of RBFNNs and its computational formula are given in this paper, assuming the inputs are modelled by normally distributed random variables. Based on this sensitivity definition, the significance of each feature can be measured. Because the order of the features' sensitivities is independent of the input perturbation under this definition, it is convenient to apply it to feature selection. Comparative results between our method and other methods are given, which show the advantage of this sensitivity definition.
Acknowledgement This research is supported by the National Natural Science Foundation of China (NSFC.60473045); The Natural Science Foundation of Hebei Province (Project No.603137); The Doctoral Foundation of Hebei Educational Committee (Project No.B2003117). The second author would like to thank Zhang Yongchao for his comments.
References

1. Hoff, M.E.: Learning Phenomena in Networks of Adaptive Switching Circuits. Technical Report 1554-1, Stanford Electronic Laboratory, Stanford, CA (1962)
2. Zurada, J.M., Malinowski, A., Usui, S.: Perturbation Method for Deleting Redundant Inputs of Perceptron Networks. Neurocomputing 14 (1997) 177–193
3. Piché, S.: The Selection of Weight Accuracies for Madalines. IEEE Transactions on Neural Networks 6 (1995) 432–445
4. Choi, J.Y., Choi, C.H.: Sensitivity Analysis of Multilayer Perceptron with Differentiable Activation Functions. IEEE Transactions on Neural Networks 3 (1992) 101–107
5. Ng, W.W.Y., Yeung, D.S., Ran, Q., Tsang, E.C.C.: Statistical Output Sensitivity to Input and Weight Perturbations of Radial Basis Function Neural Networks. Proceedings of IEEE International Conference on Systems, Man and Cybernetics, Vol. 2 (2002) 503–508
6. Ng, W.W.Y., Yeung, D.S.: Input Dimensionality Reduction for Radial Basis Neural Network Classification Problems Using Sensitivity Measure. Proceedings of the First International Conference on Machine Learning and Cybernetics, Beijing, China, Vol. 4 (2002) 2214–2219
7. Ng, W.W.Y., Yeung, D.S., Cloete, I.: Input Sample Selection for RBF Neural Network Classification Problems Using Sensitivity Measure. Proceedings of IEEE International Conference on Systems, Man and Cybernetics, Vol. 3 (2003) 2593–2598
8. Ng, W.W.Y., Yeung, D.S.: Selection of Weight Quantisation Accuracy for Radial Basis Function Neural Network Using Stochastic Sensitivity Measure. Electronics Letters 39(10) (2003) 787–789
9. Yeung, D.S., Sun, X.: Using Function Approximation to Analyze the Sensitivity of MLP with Antisymmetric Squashing Activation Function. IEEE Transactions on Neural Networks 13(1) (2002) 34–44
10. Zeng, X., Yeung, D.S.: Sensitivity Analysis of Multilayer Perceptron to Input and Weight Perturbations. IEEE Transactions on Neural Networks 12(6) (2001) 1358–1366
11. Forster, M.R.: Key Concepts in Model Selection: Performance and Generalizability. Journal of Mathematical Psychology 44 (2000) 205–231
12. Halmos, P.R.: Measure Theory. Van Nostrand, New York (1950)
13. Chen, S., Cowan, C.F.N., Grant, P.M.: Orthogonal Least Squares Learning Algorithm for Radial Basis Function Networks. IEEE Transactions on Neural Networks 2(2) (1991) 302–309
14. Wang, X.Z., Zhang, H.: An Upper Bound of Input Perturbation for RBFNNs Sensitivity Analysis. Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Vol. 8 (2005) 4704–4709
Using Ensemble Feature Selection Approach in Selecting Subset with Relevant Features

Mohammed Attik

LORIA/INRIA-Lorraine, Campus Scientifique - BP 239 - 54506 Vandœuvre-lès-Nancy Cedex, France
[email protected]
http://www.loria.fr/~attik
Abstract. This study discusses the problem of feature selection, one of the most fundamental problems in the field of machine learning. Two novel approaches for selecting a subset with relevant features are proposed. These approaches can be considered a direct extension of the ensemble feature selection approach. The first one identifies relevant features by using a single feature selection method, while the second one uses different feature selection methods in order to identify the relevant features more correctly. An illustration shows the effectiveness of the proposed methods on artificial databases where we have a priori information about the relevant features.
1
Introduction
Feature selection can be viewed as one of the most fundamental problems in the field of machine learning. It is defined as the process of selecting relevant features out of a larger set of candidate features, where the relevant features are those that describe the target concept. Two degrees of relevance are defined: weak and strong relevance. Strong relevance implies that the feature is indispensable, in the sense that it cannot be removed without loss of classification accuracy, while weak relevance implies that the feature can sometimes contribute to classification accuracy. For a discussion of relevance vs. usefulness and definitions of the various notions of relevance, see the review articles [1] and [2]. The feature selection task offers many potential benefits, which can be summarized as facilitating data visualization and data understanding, reducing the measurement and storage requirements, reducing training and utilization times, and defying the curse of dimensionality to improve prediction performance. Feature selection methods can be divided into two main groups: filter and wrapper approaches. The main difference is that the wrapper method makes use of the classifier, while the filter method does not. The wrapper approach is clearly more accurate, but it tends to be more computationally expensive than the filter model. In this work, the accuracy and stability of the feature selection results are more important than the computational complexity, so wrapper methods are selected in order to understand the data.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1359–1366, 2006.
© Springer-Verlag Berlin Heidelberg 2006
Traditional feature selection methods are used to find the relevant subset of the original feature set which influences classification performance or describes the target concept. Kohavi and John [3] proved that the efficiency of a set of features for learning depends on the learning algorithm itself: the appropriate feature subset for one learning algorithm may not be the appropriate feature subset for another one. Hence, many works have proposed to combine feature selection techniques and classifier ensembles in order to improve classification performance: random subspace methods, input decimation methods, feature subspace methods performed by partitioning the set of features, and other methods for combining different feature sets using genetic algorithms [4]. As a result of this combining approach, which is called ensemble feature selection, the selected subset of features is defined as a subset of features, out of the set of relevant candidate subsets given by the ensemble feature selection approach, which maximizes the classification performance. Furthermore, the optimal feature subset is defined as a small subset of features, out of the set of subsets given by the ensemble feature selection approach, that ideally is necessary and sufficient to describe the target concept [5]. In the following we present some important remarks which motivated our work: (1) All previous research works are interested only in selecting a relevant subset of features which describes the target concept, but not a relevant subset of relevant features; unfortunately, the process of selecting relevant features remains a serious challenge. (2) Research works have shown that the best individual features do not necessarily constitute the best set of features; the relevance of a feature does not imply that it is in the optimal feature subset and, somewhat surprisingly, irrelevance does not imply that it should not be in the optimal feature subset [6].
(3) Generally, the measures used to evaluate feature selection methods are classification performance and subset size; unfortunately, no research work proposes measures to evaluate the result of a feature selection method by comparing it to correct solutions which contain only the relevant features. According to the above motivating remarks, we propose to: (1) determine the relevant features; roughly, we propose to determine all relevant features which describe the target concept, using the ensemble feature selection approach to achieve this task; (2) select a relevant subset with only relevant features, using the results obtained by the process of determining relevant features; (3) define other measures to evaluate feature selection methods in determining the relevant features. The next section presents our solutions for feature selection in detail. In Section 3, an illustration of our solutions using Optimal Brain Surgeon (OBS) variants for feature selection is given; these techniques are considered wrapper methods for supervised neural networks. Finally, some conclusions are drawn in Section 4.
2
Our Propositions for Feature Selection
In this section we present our propositions to select relevant features.
2.1
Description of Solutions
We present in detail the following tasks: (1) building a set of relevant subsets, (2) selecting relevant features, (3) estimating the number of relevant features, (4) using multiple feature selection methods and (5) selecting a subset with relevant features.

Building a set of relevant subsets task. We use a simple variant of the ensemble feature selection approach (see Figure 1), which is the base of all our solutions. It is defined by: (1) Training separately multiple component classifiers. These classifiers are defined by homogeneous main properties: error function, error minimization algorithm, etc. Diversity is obtained by varying the training/validation/testing data subsets, the data presentation order, etc. In this step we propose to fix a minimum classification performance threshold for all classifiers. (2) Applying an adapted feature selection method to these classifiers to obtain the relevant subsets of features.

Selecting relevant features task. We propose a method to select relevant features (see Figure 2). This method exploits the set of relevant subsets given by the previous task and requires a priori information about the number of relevant features. It is defined by: (1) Building the distribution of pruned/preserved features by exploiting the set of relevant subsets. This distribution informs us about the importance of one feature compared to another; in this case we can make a ranking list of preserved features. (2) Determining the relevant features by using a threshold which represents the number of relevant features. If we know a priori the number of relevant features (Nbr), it is easy to select all relevant features. The information about the distribution characteristics of the used feature selection method can provide new evaluation measures, i.e. the best feature selection method is the one which informs us sufficiently about the relevant features.

Estimating the number of relevant features task.
We propose a method (see Figure 2) to estimate the number of relevant features (Nbr), which is useful for our method of relevant feature selection described above. This estimating method exploits the set of relevant subsets given by ensemble feature selection. It is defined by: (1) Building the distribution of the number of pruned/preserved features. This distribution informs us about the different numbers of features which give the same classification performance. (2) Determining the number of relevant features. This step exploits the information about the characteristics of the used feature selection method. The simple proposed solutions to estimate the number of relevant features Nbr are given by:

Nbr ∈ [Min, Max],  or  Nbr ∈ [Min, Peak],  or  Nbr ∈ [Peak, Max],  or  Nbr ≈ (Min + Max)/2,
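The two distribution-building steps above can be sketched compactly. The subsets below are hypothetical stand-ins for the relevant subsets returned by a feature selection method applied to several trained classifiers:

```python
from collections import Counter

def preserved_feature_distribution(selected_subsets):
    """Build (a) the distribution of preserved features, (b) the distribution
    of the number of preserved features, and (c) a ranking list of features
    by how often they are preserved across the ensemble."""
    feature_counts = Counter(f for subset in selected_subsets for f in subset)
    size_counts = Counter(len(subset) for subset in selected_subsets)
    ranking = [f for f, _ in feature_counts.most_common()]
    return feature_counts, size_counts, ranking

# Illustrative subsets from three classifiers (hypothetical values)
subsets = [(1, 2, 5), (1, 2, 4, 5), (1, 2, 5, 6)]
counts, sizes, ranking = preserved_feature_distribution(subsets)
```

Here `counts` shows features 1, 2 and 5 preserved by every classifier, `sizes` gives Min = 3, Peak = Max = 4, and, once Nbr is known (say 3), the top Nbr entries of `ranking` give the relevant features.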
where Peak represents the number of preserved features with the highest frequency, Min is the minimum number of preserved features and Max is the maximum number of preserved features. The information about the distribution characteristics of the used feature selection method can provide new evaluation measures, i.e. the best feature selection method is the one which informs us sufficiently about the number of relevant features.

Using multiple feature selection methods task. We propose to use multiple feature selection methods in order to improve the estimate of the number of relevant features (see Figure 3). Each feature selection method presents some characteristics which can be used to determine the number of relevant features, so the different solutions can together contribute to estimating Nbr more reliably. A simple solution for the number of relevant features Nbr is given by:

Nbr ∈ [min(Peak_i), max(Peak_i)]  or  Nbr ≈ (1/N_m) Σ_{i=1}^{N_m} Peak_i,

where Peak_i represents the number of preserved features with the highest frequency, Min_i is the minimum number of preserved features and Max_i is the maximum number of preserved features for method i, and N_m is the number of feature selection methods used.

Selecting subset with relevant features. After determining the set of relevant features, we can select a subset of relevant features (see Figure 4). This selection depends on the evaluation criteria, the decision maker's preferences and the application's needs. We can apply: (1) the feature selection method in order to select a relevant subset with relevant features; (2) the ensemble feature selection approach in order to select the best or the optimal subset with relevant features.

2.2
Our Algorithms
According to the different tasks given previously, we propose two algorithms to select relevant features. The first algorithm uses One Feature Selection Method to Select Relevant Features (OFSM-SRF); for more details see Algorithm 1. The second algorithm uses Multiple Feature Selection Methods to Select Relevant Features (MFSM-SRF); for more details see Algorithm 2.
3
Illustration
Feature selection methods for artificial neural networks were first designed for topology optimization, dealing with the over-training problem. Several heuristic methods based on computing the saliency (also termed sensitivity) have been proposed: Optimal Brain Damage (OBD) [7] and Optimal Brain Surgeon (OBS) [8]. These methods are known as pruning methods. The principle of these techniques is that the weight with the smallest saliency will generate the smallest error
Fig. 1. The ensemble feature selection scheme based on a single feature selection method

Fig. 2. The scheme used for selecting the relevant features and estimating the number of relevant features

Fig. 3. The ensemble feature selection scheme based on multiple feature selection methods

Fig. 4. The scheme used for selecting a subset with relevant features
variation if it is removed. OBS has inspired several methods for feature selection, such as Generalized Optimal Brain Surgeon (G-OBS), Unit-Optimal Brain Surgeon (Unit-OBS), Flexible-Optimal Brain Surgeon (F-OBS) and Generalized Flexible-Optimal Brain Surgeon (GF-OBS). G-OBS [9, 10] proposes to delete a subset of m weights in a single step. Unit-OBS [9] is based on the G-OBS calculation to select the input unit that will generate the smallest increase of error if it is removed. The particularity of F-OBS [10] is that it removes connections only between the input layer and the hidden layer. GF-OBS is a combination of F-OBS and G-OBS [10]: it removes in one stage a subset of connections only between the input layer and the hidden layer. In this work we use the first Monk's problem. This well-known problem requires the learning agent to identify (true or false) friendly robots based on six nominal attributes: head shape (round, square, octagon), body shape (round, square, octagon), is smiling (yes, no), holding (sword, balloon, flag), jacket color (red, yellow, green, blue) and has tie (yes, no). The "true" concept for this problem is (head shape = body shape) or (jacket color = red).
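The 17-input ±1 encoding of the Monk-1 attributes and the target concept described above can be sketched as follows (the ordering of attributes and of their nominal values follows the listing in the text; this is an illustrative reconstruction, not the authors' code):

```python
# Nominal attributes and their values, in the order listed in the text
ATTRIBUTES = [
    ("head_shape", ["round", "square", "octagon"]),        # inputs 1-3
    ("body_shape", ["round", "square", "octagon"]),        # inputs 4-6
    ("is_smiling", ["yes", "no"]),                         # inputs 7-8
    ("holding", ["sword", "balloon", "flag"]),             # inputs 9-11
    ("jacket_color", ["red", "yellow", "green", "blue"]),  # inputs 12-15
    ("has_tie", ["yes", "no"]),                            # inputs 16-17
]

def encode(example):
    """Map a dict of nominal values to the 17-dimensional +1/-1 input vector
    (one input per nominal value, 1 if the characteristic holds, else -1)."""
    vec = []
    for name, values in ATTRIBUTES:
        vec.extend(1 if example[name] == v else -1 for v in values)
    return vec

def monk1_target(example):
    """True iff (head shape = body shape) or (jacket color = red)."""
    return (example["head_shape"] == example["body_shape"]
            or example["jacket_color"] == "red")
```

Under this ordering, input 12 (index 11) is "jacket color = red", which is why feature 12 appears in the discussion of the ideal preserved set below.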
Algorithm 1. OFSM-SRF
1 - design a set of supervised classifiers according to the selected model.
2 - give a threshold which represents a minimum classification performance for all classifiers.
3 - train the supervised algorithms with different initial conditions.
4 - apply a feature selection method fs adapted to the classification algorithm.
5 - build the distribution of the number of pruned/preserved features and the distribution of pruned/preserved features.
6 - study these distributions and exploit the information regarding the used feature selection method needed to determine the relevant features.

Algorithm 2. MFSM-SRF
1 - select a set of different classification models Mi (not necessarily homogeneous).
2 - associate with each Mi a set of adapted feature selection methods fs_ij.
3 - apply OFSM-SRF for each Mi and fs_ij.
4 - combine the obtained results to determine the relevant features.
The training dataset contains 124 examples and the validation and test datasets contain 432 examples. To forecast the class according to the 17 input values (one per nominal value, coded as 1 or -1 if the characteristic is true or false), the multilayer perceptron (MLP) starts with 3 hidden neurons with a hyperbolic tangent activation function. This number of hidden neurons allows a satisfactory representation able to solve this discrimination problem. The classification performance is equal to 100% according to the confusion matrix. In this study, GF-OBS and G-OBS remove three weights at a time. For each method we build 300 MLPs by varying the training/validating subsets, the initialization of the weights and the order of the data presentation. Figures 5 to 8 give the results of applying the Unit-OBS, F-OBS, GF-OBS, OBS and G-OBS methods; they represent the distribution of pruned features and the distribution of the number of pruned features. For every algorithm, the number of pruned features varies from 2 to 12 depending on the initialization, but the frequencies depend on the algorithm. Unit-OBS presents the highest frequencies for
Fig. 5. OBS, G-OBS: number of pruned variables distribution

Fig. 6. OBS, G-OBS: pruned variables distribution

Fig. 7. Unit-OBS, F-OBS, FG-OBS: number of pruned variables distribution

Fig. 8. Unit-OBS, F-OBS, FG-OBS: pruned variables distribution
a large number of pruned features compared to the other methods. According to the first Monk rule, the concept is true if the head shape equals the body shape, i.e. if feature 1 = feature 4, feature 2 = feature 5 and feature 3 = feature 6. Thus, a good method should preserve these features together with feature 12 (indeed, the concept is also true if feature 12 is true). Nevertheless, it is possible to obtain the desired performance while eliminating some of the first six features; in this case, a clever algorithm will prune couples of features. For example, pruning feature 1 (head shape = round) allows pruning feature 4 (body shape = round). Otherwise, the ideal method eliminates 10 features and keeps only 7. Analyzing the distributions of the number of pruned features, we notice that the first property is not really preserved by these techniques. For the second property, each method presents a peak peak_i which indicates the number of features to keep: according to the obtained results, the peaks are at 5, 7 and 9 variables for the Unit-OBS, F-OBS and OBS methods respectively. These peaks can be used to estimate the number of pertinent features Nbr: Unit-OBS gives the minimum number of relevant features, F-OBS almost exactly estimates it, and OBS gives more than the relevant feature number. We can thus determine a confidence interval for the number of relevant features, Nbr ∈ [min_i(peak_i), max_i(peak_i)] = [5, 9], and estimate Nbr by Nbr ≈ peak_{F-OBS} = 7, or by Nbr ≈ (1/N_m) \sum_i peak_i = (1/3)(5 + 7 + 9) = 7, where N_m is the number of feature selection methods used.
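The interval and averaged estimate described above can be computed directly from the per-method peaks (Unit-OBS: 5, F-OBS: 7, OBS: 9 in this experiment); the function name is ours:

```python
def estimate_relevant_feature_number(peaks):
    """Return (confidence interval, averaged estimate) from the per-method peaks."""
    interval = (min(peaks), max(peaks))       # Nbr in [min(peak_i), max(peak_i)]
    nbr = sum(peaks) / len(peaks)             # Nbr ~ (1/N_m) * sum_i peak_i
    return interval, nbr

interval, nbr = estimate_relevant_feature_number([5, 7, 9])
print(interval, nbr)  # (5, 9) 7.0
```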
4
Conclusion
In this paper, we have focused on selecting the subset of relevant features in order to understand data. Understanding data is helpful or even necessary for decision making in many domains (e.g. marketing, biology, medicine). We have presented novel solutions to achieve this task; they are a direct extension of the ensemble feature selection approach. The first solution is based on using a single feature selection method. Conversely, the second approach is based on using multiple feature selection methods in order to more correctly determine all
1366
M. Attik
relevant features. These two solutions require a priori information about the number of relevant features, and an estimation of this number is given. We have shown that the best feature selection method is the one that informs us sufficiently about the relevant features or their number. An overall experimental evaluation of our solutions has been carried out on artificial data, where the information about the relevant features and the rules between these features is available and the data dimensionality is manageable for the feature selection methods of the optimal brain surgeon family that were used. It clearly highlights that these solutions represent an important step towards data comprehension.
References
1. Kohavi, R., John, G.: Wrappers for feature selection. Artificial Intelligence 97 (1997) 273–324
2. Blum, A., Langley, P.: Selection of relevant features and examples in machine learning. Artificial Intelligence 97 (1997) 245–271
3. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97 (1997) 273–324
4. Valentini, G., Masulli, F.: Ensembles of learning machines (2002)
5. Kira, K., Rendell, L.A.: The feature selection problem: Traditional methods and a new algorithm. In: Proc. of AAAI-92, San Jose, CA (1992) 129–134
6. Piramuthu, S.: Evaluating feature selection methods for learning in data mining applications. In: HICSS (5) (1998) 294–301
7. Le Cun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In Touretzky, D.S., ed.: Advances in Neural Information Processing Systems: Proceedings of the 1989 Conference, San Mateo, CA, Morgan-Kaufmann (1990) 598–605
8. Hassibi, B., Stork, D.G.: Second order derivatives for network pruning: Optimal brain surgeon. In Hanson, S.J., Cowan, J.D., Giles, C.L., eds.: Advances in Neural Information Processing Systems. Volume 5, Morgan Kaufmann, San Mateo, CA (1993) 164–171
9. Stahlberger, A., Riedmiller, M.: Fast network pruning and feature extraction by using the unit-OBS algorithm. In Mozer, M.C., Jordan, M.I., Petsche, T., eds.: Advances in Neural Information Processing Systems. Volume 9, The MIT Press (1997) 655
10. Attik, M., Bougrain, L., Alexandre, F.: Optimal brain surgeon variants for feature selection. In: International Joint Conference on Neural Networks - IJCNN'04, Budapest, Hungary (2004)
A New Method for Feature Selection Yan Wu and Yang Yang Department of Computer Science and Engineering, Tongji University, Shanghai 200092, China [email protected]
Abstract. We present a new approach based on discriminant analysis and regularization neural network for salient feature selection. Using the discriminant analysis based feature ranking, an ordered feature queue can be obtained according to the saliency of features. The neural network is trained by minimizing an augmented cross-entropy error function in the method. Feature selection is based on the reaction of the cross-validation data set classification error due to the removal of the individual features. The approach proposed is compared with four other feature selection methods, each of which banks on a different concept. The algorithm proposed outperforms the other methods by achieving higher classification accuracy on all the problems tested.
1 Introduction A large number of features can usually be measured in many pattern recognition applications. Some of these variables may be redundant or even irrelevant, and better performance may usually be achieved by discarding them. Therefore, in many practical applications we need to reduce the dimensionality of the data. There are many feature selection methods, which can be divided into those based on statistical pattern recognition (SPR) techniques and those using artificial neural networks (ANN) [1]. Yang and Honavar [2] introduced the use of genetic algorithms (GA) for feature selection. A large variety of feature selection techniques based on ANN have been proposed. The neural-network feature selector (NNFS, proposed in [3]), the signal-to-noise ratio based technique (SNR, proposed in [4]), the neural network output sensitivity based feature ranking (FQI, proposed in [5]) and the fuzzy entropy based feature ranking (OFEI, proposed in [6]) are traditional techniques of feature selection. In this paper, we propose a new approach to feature selection. First, the discriminant analysis based feature ranking (DA, proposed in [7]) is used to rank the features according to importance. Then salient features are chosen by a regularized BP network [7]. In the rest of the paper the proposed method is called DA&NN.
2 DA&NN In this paper, we get the feature ranking through DA. Then salient features are selected through a regularized BP neural network, which is trained by minimizing an augmented cross-entropy error function. J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1367 – 1372, 2006. © Springer-Verlag Berlin Heidelberg 2006
1368
Y. Wu and Y. Yang
2.1 DA
Let m_j denote the sample mean vector of the jth class,

    m_j = (1/N_j) \sum_{k=1}^{N_j} X_{jk} ,    (1)

where N_j is the number of samples in the jth class. Similarly, m denotes the mixture sample mean,

    m = \sum_{j=1}^{n_L} P_j m_j ,    (2)

with P_j being the a priori probability of the jth class. We can now define the within-class and between-class covariance matrices S_w and S_b, respectively:

    S_w = \sum_{j=1}^{n_L} P_j (1/N_j) \sum_{k=1}^{N_j} (X_{jk} - m_j)(X_{jk} - m_j)^t ,    (3)

    S_b = \sum_{j=1}^{n_L} P_j (m_j - m)(m_j - m)^t .    (4)

Using (3) and (4), we form the following criterion function J_i(X) for feature ranking, where X expresses the dependence of the criterion function on the data set and tr_{X|i}(S_b) stands for the trace of S_b with the ith diagonal element excluded:

    J_i(X) = tr(S_b)/tr(S_w) - tr_{X|i}(S_b)/tr_{X|i}(S_w) .    (5)
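As an illustration, the ranking criterion (1)-(5) can be computed with a few lines of numpy. Only the diagonals of S_w and S_b are kept, since (5) uses traces only; estimating the priors P_j from class frequencies is our assumption.

```python
import numpy as np

def da_criterion(X, y):
    """Return J_i(X) of (5) for every feature of X (samples x features)."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)                 # P_j estimated from class frequencies
    n_feat = X.shape[1]
    Sw_diag = np.zeros(n_feat)               # diagonal of S_w, eq. (3)
    class_means = []
    for c, P in zip(classes, priors):
        Xc = X[y == c]
        mj = Xc.mean(axis=0)                 # eq. (1)
        class_means.append((P, mj))
        Sw_diag += P * ((Xc - mj) ** 2).mean(axis=0)
    m = sum(P * mj for P, mj in class_means)                    # eq. (2)
    Sb_diag = sum(P * (mj - m) ** 2 for P, mj in class_means)   # diagonal of S_b, eq. (4)
    J = np.empty(n_feat)
    for i in range(n_feat):
        keep = np.arange(n_feat) != i        # trace with the ith diagonal element excluded
        J[i] = Sb_diag.sum() / Sw_diag.sum() - Sb_diag[keep].sum() / Sw_diag[keep].sum()
    return J
```

Features with larger J_i(X) carry higher individual discrimination power, so ranking is a single `np.argsort` away.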
The larger the value of J_i(X), the higher the individual discrimination power the ith feature possesses.

2.2 The Regularized BP Neural Network
Let us consider a feedforward neural network with three layers. The vector (x_1, x_2, …, x_N) is given as the input layer, where N is the number of original features. We use the sigmoid function in the hidden and output layers. First, we use the backpropagation training algorithm to train the network by minimizing the augmented cross-entropy error cost function given by (6),

    E = E_0 + \alpha_1 (1/(P n_h)) \sum_{p=1}^{P} \sum_{k=1}^{n_h} f'(a^h_{kp}) + \alpha_2 (1/(P n_L)) \sum_{p=1}^{P} \sum_{j=1}^{n_L} f'(a^L_{jp}) ,    (6)

where α1 and α2 are parameters to be chosen experimentally, P is the number of training samples, n_L is the number of output-layer nodes, and n_h is the number of hidden-layer nodes. In (6), a^h_{kp} and a^L_{jp} are the input values for the pth data point at the kth neuron in the hidden layer and the jth neuron in the output layer, f'(a^h_{kp}) and f'(a^L_{jp}) are the derivatives of the transfer functions of the kth hidden and jth output nodes for the pth data point, respectively, and the cross-entropy error cost function E_0 is defined by (7), where d_{jp} is the desired output for the pth data point at the jth output node:

    E_0 = -(1/(2P)) \sum_{p=1}^{P} \sum_{j=1}^{n_L} [ d_{jp} \log o^L_{jp} + (1 - d_{jp}) \log(1 - o^L_{jp}) ] .    (7)
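The augmented error (6)-(7) can be sketched in numpy, assuming logistic sigmoid units so that f'(a) = f(a)(1 - f(a)); the α defaults shown are illustrative, not the paper's tuned values.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def augmented_error(a_hidden, a_out, d, alpha1=0.01, alpha2=0.01):
    """a_hidden: (P, nh) hidden-layer inputs; a_out: (P, nL) output-layer inputs;
    d: (P, nL) desired outputs in {0, 1}. Alpha values are illustrative."""
    P, nh = a_hidden.shape
    nL = a_out.shape[1]
    o = sigmoid(a_out)                                    # output activations o^L
    E0 = -np.sum(d * np.log(o) + (1 - d) * np.log(1 - o)) / (2 * P)   # eq. (7)
    deriv = lambda a: sigmoid(a) * (1 - sigmoid(a))       # f'(a) for the sigmoid
    return (E0
            + alpha1 * deriv(a_hidden).sum() / (P * nh)   # hidden-layer penalty term
            + alpha2 * deriv(a_out).sum() / (P * nL))     # output-layer penalty term
```

Since f'(a) peaks at a = 0 and vanishes for saturated units, the two penalty terms shrink as the neurons move into the saturation region, which is exactly the pressure the augmented error is designed to apply.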
From (6) it can be seen that two additional terms constrain the derivatives and force the neurons of the hidden and output layers to work in the saturation region. In [7], it was demonstrated that neural networks possess good generalization properties. We can expect that different sensitivity of the hidden and output nodes could be required for solving a task with the lowest generalization error. Therefore, two hyperparameters α1 and α2 are used in the error function. 2.3 Feature Selection Procedure
1) Calculate the values of J_i(X) (i = 1, 2, …, N) and rank all features according to the values of J_i(X).
2) Randomly initialize the weights of the neural network and divide the available data set into Training (50%), Cross-Validation (5%), and Test (45%) data sets.
3) Train the neural network by minimizing the error function given by (6) and validate the network at each epoch on the Cross-Validation data set. Equip the network with the weights yielding the minimum Cross-Validation error.
4) Compute the classification accuracy AT for the Test data set.
5) Eliminate the least salient feature and execute Step 3).
6) Compute the Test data set classification accuracy Ai and the drop in accuracy ∆A compared with AT.
7) If ∆A < ∆A0, where ∆A0 is the acceptable drop in classification accuracy, go to Step 5).
8) Retain all the remaining features and the last removed feature.

Classification accuracy obtained from a trained neural network depends on the randomly selected training data set and on the initial weight values. To reduce this dependence, we ran each experiment 30 times with different initial weight values and different training, cross-validation and testing samples in all the tests; the mean values and standard deviations of the correct classification rate presented in this paper were calculated over these runs. Three parameters have to be chosen: the regularization constants α1 and α2, and the acceptable drop in classification accuracy ∆A0 when eliminating a feature. The values of α1 and α2 have been found by cross-validation within the ranges α1 ∈ [0.001, 0.02] and α2 ∈ [0.001, 0.02]. The value of ∆A0 has been set to 3%. To make our results comparable with those presented above, we also used 12 nodes in the hidden layer. To train the network, we used the backpropagation training algorithm with momentum implemented in the Matlab software package.
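Steps 4)-8) above amount to a greedy backward-elimination loop. A minimal sketch follows, with the training routine (steps 2-3, including cross-validation early stopping) left as an abstract callable; the function and parameter names are ours.

```python
def backward_eliminate(ranked_features, train_and_score, delta_A0=0.03):
    """ranked_features: most salient first; train_and_score(features) -> test accuracy.

    Drops the least salient feature while the accuracy drop relative to the
    full-set accuracy A_T stays below delta_A0 (step 8 keeps the last removed
    feature by simply not committing the failed removal)."""
    features = list(ranked_features)
    A_T = train_and_score(features)          # step 4: accuracy with all features
    while len(features) > 1:
        candidate = features[:-1]            # step 5: drop the least salient feature
        A_i = train_and_score(candidate)     # step 6: re-evaluate
        if A_T - A_i >= delta_A0:            # step 7 fails: unacceptable drop
            break                            # step 8: retain the last removed feature
        features = candidate
    return features
```

Plugging in a scorer that trains the regularized network of Section 2.2 reproduces the full procedure.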
3 Experiments and Results To test the proposed approach we used two real-world problems: the US Congressional Voting Records problem and the diabetes diagnosis problem. The data used in the experiments are available at www.ics.uci.edu. In the following sections we compare the proposed approach with those presented above. 3.1 US Congressional Voting Records Problem
The United States Congressional Voting Records data set consists of the voting records of 435 congressmen on 16 major issues in the 98th Congress. The votes are categorized into one of the three types of votes. The task is to predict the correct political party affiliation of each congressman. The 98th Congress consisted of 267 Democrats and 168 Republicans. According to the learning and testing conditions mentioned in Section 2.3, 197 samples were randomly selected for training, 21 samples were selected for cross-validation, and 217 for testing. From Fig.1 it is easy to find that the J4(X) is maximal, namely the feature <4> is the most salient among the original features.
Fig. 1. Criterion Ji(X) values for the individual features of the Congressional Voting Records data set
Table 1 presents the correct classification rates for the test data sets obtained with the proposed method. The method selected feature <4> and achieved a classification accuracy of 97.52%. Although NNFS achieved 100% with all the features, it only reaches 94.79% with two selected features. We do not provide test results for the OFEI and FQI approaches, since those methods and our technique selected the same feature <4>. The proposed method achieved the best performance on the test data set. Table 1. Correct classification rate for the Congressional Voting Records data set (%)
                                   DA&NN   SNR     NNFS    GA
All features: training sets        99.52   98.92   100.0   96.87
All features: testing sets         98.04   95.42   92.0    95.47
Number of selected features        1       1       2.03    9
Selected features: training sets   98.24   96.62   95.63   98.73
Selected features: testing sets    97.52   94.69   94.79   96.85
3.2 The Diabetes Diagnosis Problem
The Pima Indians Diabetes data set contains 768 samples taken from patients who may show signs of diabetes. Each sample is described by eight features. There are 500 samples from patients who do not have diabetes and 268 samples from patients who are known to have diabetes. From the data set, we have randomly selected 345 samples for training, 39 samples for cross-validation, and 384 samples for testing. Fig.2 presents the criterion Ji(X) values for the eight features. From Fig.2 it can be seen that the J2(X) is maximal among all the values.
Fig. 2. Criterion Ji(X) values for the individual features of the Pima Indians Diabetes data set
Table 2 provides the test data set correct classification rate achieved using feature subsets selected by the different techniques. Using the method proposed two features <2, 6> have been selected for solving the task. Again the method proposed achieved the best performance on the test data set. (“-” stands for null.) Table 2. Correct classification rate for the Pima Indians Diabetes data set (%)
                                   DA&NN   SNR     NNFS    OFEI    FQI     GA
All features: training sets        97.68   80.35   95.39   -       -       96.53
All features: testing sets         95.81   95.42   71.03   -       -       96.16
Number of selected features        2       1       2.03    2       2       6
Selected features: training sets   90.53   75.53   74.02   75.74   75.68   98.57
Selected features: testing sets    88.44   73.53   74.29   75.85   75.28   97.46
4 Conclusions In this paper, we presented a neural network based feature selection technique. A network is trained with an augmented cross-entropy error function, which forces the neural network to keep low derivatives of the transfer functions of neurons when learning a classification task. The technique developed outperformed the other methods on the problems tested.
Acknowledgements This research was supported by the National Natural Science Foundation of P. R. China under grant No. 60475019.
References
1. Jain, A., Zongker, D.: Feature Selection: Evaluation, Application, and Small Sample Performance. IEEE Trans. on Pattern Analysis and Machine Intelligence 19(2) (1997) 153–158
2. Yang, J., Honavar, V.: Feature Subset Selection Using a Genetic Algorithm. IEEE Intelligent Systems 13(2) (1998) 44–49
3. Setiono, R., Liu, H.: Neural-network Feature Selector. IEEE Trans. Neural Networks 8(3) (1997) 654–662
4. Bauer, K.W., Alsing, S.G., Greene, K.A.: Feature Screening Using Signal-to-noise Ratios. Neurocomputing 31(1) (2000) 29–44
5. De, R.K., Pal, N.R., Pal, S.K.: Feature Analysis: Neural Network and Fuzzy Set Theoretic Approaches. Pattern Recognition 30(10) (1997) 1579–1590
6. Pal, S.K., De, R.K., Basak, J.: Unsupervised Feature Evaluation: a Neuro-fuzzy Approach. IEEE Trans. Neural Networks 11(2) (2000) 366–376
7. Verikas, A., Bacauskiene, M.: Feature Selection with Neural Networks. Pattern Recognition Letters 23(11) (2002) 1323–1335
Improved Feature Selection Algorithm Based on SVM and Correlation Zong-Xia Xie, Qing-Hua Hu, and Da-Ren Yu Harbin Institute of Technology, Harbin, P.R. China [email protected]
Abstract. As a feature selection method, support vector machines recursive feature elimination (SVM-RFE) can remove irrelevant features but does not take redundant features into consideration. In this paper, it is shown why this method cannot remove redundant features, and an improved technique is presented. The correlation coefficient is introduced to measure the redundancy in the subset selected with SVM-RFE; features that have a large correlation coefficient with some important feature are removed. Experimental results show that there actually are several strongly redundant features in the subsets selected by SVM-RFE, with coefficients as high as 0.99. The proposed method not only reduces the number of features but also keeps the classification accuracy.
1
Introduction
In recent years, data has become increasingly large not only in the number of instances but also in the number of features in many applications, such as gene selection from micro-array data and automatic text categorization, where the number of features in the raw data ranges from hundreds to tens of thousands. High dimensionality brings great difficulty in understanding the data and in constructing classifier systems for pattern recognition and machine learning. Therefore, feature selection has attracted much attention from the pattern recognition and machine learning community [1, 2, 3, 4]. The aim of feature selection (FS) is to find a subset of the original features of a given dataset by removing irrelevant and/or redundant features [5]. FS can improve the prediction performance of the predictors, provide faster and more cost-effective predictors, and give a better insight into the underlying process that generated the data [3, 4]. Traditional statistical methods for clustering and classification have been extensively used for feature selection. Support Vector Machines (SVMs), proposed in the 1990s by Vapnik, are already known as an efficient tool for discovering informative attributes [6]. SVMs minimize the upper bound of the true error by maximizing the margin with which they separate the data into distinct classes. In the framework of statistical learning theory, the upper bound of the error of a classifier depends on its VC dimension and the number of training data. Feature subset selection may reduce the VC dimension of the trained classifier and thereby improve the classification performance; a lot of empirical analysis reported in the literature supports this argument. J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1373–1380, 2006. © Springer-Verlag Berlin Heidelberg 2006
1374
Z.-X. Xie, Q.-H. Hu, and D.-R. Yu
A series of feature selection algorithms for SVMs or using SVMs have been proposed [7, 8, 9]. In 2002, SVMs were first used as a method of feature selection in gene selection for cancer classification; the method is called SVM recursive feature elimination (SVM-RFE) [7]. In 2003, Rakotomamonjy presented variable selection algorithms based on different SVM criteria, such as the weight vector, the radius/margin bound, the span estimate and so on [8]. Li et al. extended these techniques to multi-class classification in 2004 [9]. In 2005, Duan generalized the SVM-RFE method to multi-class problems by computing the feature ranking score from a statistical analysis of the weight vectors of multiple linear SVMs trained on subsamples of the original training data [10]. However, all of the above techniques mainly focus on finding relevant features and deleting features that are irrelevant to the classification task; they do not try to delete redundant features from the data. The weight vector represents the relation between the attribute set and the decision; by deleting the features with little weight, we can cut down the irrelevant information. Just as Yu and Liu pointed out [11], an entire feature set can be conceptually divided into four basic disjoint parts: irrelevant features (I), redundant features (II, part of the weakly relevant features), weakly relevant but non-redundant features (III), and strongly relevant features (IV). An optimal subset should contain the features in parts III and IV. Therefore, feature selection should eliminate not only the irrelevant features but also the redundant ones [12]. In this paper, we first show that the SVM-RFE algorithm cannot remove redundant features, and then present an improved feature selection method, which further deletes redundant features by computing the correlation coefficients of the selected feature subset.
Some datasets from the UCI machine learning data repository are selected to show that the proposed method is efficient.
2
Some Related Work
Recursive feature elimination (RFE) is a method which performs backward feature elimination: it starts with all features and then removes irrelevant features according to a ranking criterion until a stop criterion is satisfied. SVM-RFE is an application of RFE using SVM criteria for ranking. The stop criterion is to find the m features which lead to the largest margin of class separation, or to stop when the empirical risk begins to go down [7]. In what follows, the latter stop criterion is used. Consider the following two-category classification problem: (x_1, y_1), (x_2, y_2), …, (x_l, y_l), x_i ∈ R^n, y_i ∈ {−1, +1}, where x_i is a training example and y_i its class label. First, map x into a high-dimensional space via a function φ, then construct the optimal hyperplane (w · φ(x)) + b = 0; we have to solve

    min φ(w, ε) = (1/2)(w · w) + C \sum_{i=1}^{l} ε_i
    s.t. y_i (w · φ(x_i) + b) − 1 + ε_i ≥ 0 ,  ε_i ≥ 0 ,  i = 1, …, l ,    (1)
where ε_i are slack variables and the constant C determines the trade-off between margin maximization (minimizing (w · w)/2) and training error minimization (minimizing \sum_{i=1}^{l} ε_i). The hyperplane can then generalize well based on SRM. From the Karush-Kuhn-Tucker conditions of optimization theory, optimization problem (1) is equivalent to

    min (1/2) \sum_{i=1}^{l} \sum_{j=1}^{l} a_i a_j y_i y_j K(x_i, x_j) − \sum_{i=1}^{l} a_i
    s.t. \sum_{i=1}^{l} a_i y_i = 0 ,  0 ≤ a_i ≤ C ,    (2)

where K(x_i, x_j) = φ(x_i) · φ(x_j) is an inner product in the feature space, which allows mapping the data points into the feature space without computing φ(x). Through the kernel trick, SVMs can deal with the nonlinear classification case easily. We then obtain the following decision function:

    f(x, w, b) = sgn(w · φ(x) + b) = sgn( \sum_{i=1}^{l} a_i y_i K(x_i, x) + b ) ,    (3)

where a is the Lagrange multiplier vector and w = \sum_{SVs} y_i a_i x_i is the weight vector; ||w||^2 is a measure of predictive ability, which is inversely proportional to the margin. In order to remove the irrelevant features, the SVM-RFE algorithm uses w_t^2 as the ranking criterion at each step. It starts with all the features and removes features based on the ranking criterion until the empirical risk begins to go down [13]. This is done by performing the following iterative procedure:
1) Train an SVM, and get the original empirical risk and the weight of every feature, w_t = \sum_{SVs} y_i a_i x_{i(t)}, where x_{i(t)} is the value of the tth feature of the ith sample;
2) Rank the features by the values of w_t^2;
3) Remove the feature with the smallest value of w_t^2, retrain the SVM with the reduced set of features, and get the new empirical risk and the weight of every feature;
4) If the new empirical risk is equal to or larger than the original one, go to step 2); else break;
5) The current feature subset is the selected subset.
The SVM-RFE algorithm was originally designed to solve 2-class problems, and multi-class versions have been proposed in [9, 10]. In a word, RFE computes a feature ranking using w_t^2, which is a measure of predictive ability, and even the features with small values of w_t^2 have some effect on the SVM decision. The ranking criterion above measures the relevance between features and the decision; by removing the attributes with little weight, some irrelevant information can be deleted. The question, however, is whether this technique can remove the redundant features from the data set.
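The iterative procedure above can be sketched in numpy. A primal linear SVM trained by hinge-loss subgradient descent stands in for a full SVM solver, and a fixed `n_keep` replaces the empirical-risk stop criterion for brevity — both are simplifying assumptions of this sketch.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.05, epochs=300):
    """Primal linear SVM via hinge-loss subgradient descent; y in {-1, +1}.
    The bias term is omitted for brevity."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        viol = y * (X @ w) < 1                               # samples violating the margin
        grad = w - (C / n) * (y[viol, None] * X[viol]).sum(axis=0)
        w -= lr * grad
    return w

def svm_rfe(X, y, n_keep):
    """Repeatedly drop the feature with the smallest w_t^2 until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        w = train_linear_svm(X[:, remaining], y)
        remaining.pop(int(np.argmin(w ** 2)))                # steps 2)-3): drop weakest feature
    return remaining
```

Restoring the paper's stop criterion only requires comparing the empirical risk before and after each removal instead of counting features.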
As w = \sum_{SVs} y_i a_i x_i and the Lagrange multipliers a_i of the samples which are not support vectors are equal to 0, we have w = \sum_{SVs} y_i a_i x_i = \sum_{i=1}^{l} y_i a_i x_i, where l is the number of training samples. This can be expanded as

    [w_1, w_2, …, w_f] = [a_1 y_1, a_2 y_2, …, a_l y_l] · [ x_11 x_12 … x_1f ;
                                                            x_21 x_22 … x_2f ;
                                                            …               ;
                                                            x_l1 x_l2 … x_lf ] ,

where f is the number of features and x_ij stands for the value of the jth feature of the ith sample. Let us discuss the relation between the weights and the correlation between the attributes. Without loss of generality, consider two arbitrary weights w_m and w_n:

    w_m = a_1 y_1 x_1m + a_2 y_2 x_2m + · · · + a_l y_l x_lm ,
    w_n = a_1 y_1 x_1n + a_2 y_2 x_2n + · · · + a_l y_l x_ln ,

where f_m = [x_1m, x_2m, …, x_lm] and f_n = [x_1n, x_2n, …, x_ln] are two features of the data. If f_m = k · f_n + b, then x_im = k · x_in + b, and therefore w_m = k · w_n + b \sum_{i=1}^{l} a_i y_i = k · w_n, using the constraint \sum_i a_i y_i = 0 from (2). The above analysis shows that the weights are proportional if the features are linearly correlated; in particular, for k = 1 we have w_m = w_n. As we know, f_m and f_n are then redundant, and one of them should be removed from the training data. However, if one of the features has a large weight, the redundant feature will also get a large weight, and according to the recursive feature elimination algorithm only the features with small weights are deleted. Certainly, the above situation is the ideal one: in the real world, there are few data sets with exactly linearly correlated features. However, in many data sets the correlation coefficients among features are above 0.9, sometimes as high as 0.99; such features have almost the same or proportional values, with very small differences. In this case, the two features will be removed or retained simultaneously if SVM-RFE is used. Therefore, linearly redundant features remain in the feature subset selected by SVM-RFE.
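The argument can be checked numerically: with arbitrary stand-in multipliers a_i, a duplicated feature column (k = 1, b = 0) receives exactly the same weight, so RFE ranks the two redundant copies identically.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(8)                      # stand-in Lagrange multipliers (illustrative)
y = np.where(rng.random(8) < 0.5, -1.0, 1.0)
X = rng.normal(size=(8, 3))
X = np.column_stack([X, X[:, 1]])      # feature 3 is an exact copy of feature 1
w = (a * y) @ X                        # w_t = sum_i a_i y_i x_it
print(np.isclose(w[1], w[3]))  # True
```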
3
The Improved Algorithm
From the analysis of SVM-RFE in Section 2, we know that it only considers the relation between the features and the decision, while it does not take the relations among the features into consideration. In this section, we introduce the improved algorithm, which uses a correlation coefficient matrix to further delete redundant features selected by SVM-RFE. In order to remove the redundant features which are linearly correlated with each other, we need a measure of the relation between features. The correlation coefficient, information gain and the Gini index have been used to calculate the similarity among features in the literature [14]; in this paper, the correlation coefficient is chosen. Suppose that the feature subset selected with RFE is [f_1, f_2, …, f_r]. We compute the similarity between features f_i and f_j with the following formula:

    ρ_ij = \sum_{m=1}^{l} (f_im − \bar{f}_i)(f_jm − \bar{f}_j) / \sqrt{ \sum_{m=1}^{l} (f_im − \bar{f}_i)^2 \sum_{m=1}^{l} (f_jm − \bar{f}_j)^2 } ,    (4)

where \bar{f}_i is the averaged value of the ith feature and f_im is the value of the mth sample of the ith feature. We then obtain the following coefficient matrix for the feature subset:

    ρ = [ ρ_11 ρ_12 … ρ_1r ;
          ρ_21 ρ_22 … ρ_2r ;
          …                ;
          ρ_r1 ρ_r2 … ρ_rr ] .

We have ρ_ii = 1, ρ_ij = ρ_ji and |ρ_ij| ≤ 1. The diagonal reflects the autocorrelation of the features, and the other elements of the matrix reflect the cross-correlation degrees, i.e. the overlapping information in the feature set. The aim is to find a feature subset which keeps the useful information and removes the superfluous. Having computed the correlation coefficients of the selected feature subset, we use these values to remove the redundant features. Considering that the features with large weights are important for the decision function, we take the following steps [11] to remove the redundant features:
1) Input the original feature set F = [f_1, f_2, …, f_K];
2) Perform the SVM-RFE method on F and get a feature subset F1 = [f_1, f_2, …, f_k1], where k1 < K;
3) Sort the features by the values of w_t^2 in descending order;
4) Find the features whose correlation coefficients with the first feature are larger than λ and remove them, where λ is a prespecified threshold; this gives a new feature subset F2 = [f_1, f_2, …, f_k2];
5) Find the features whose correlation coefficients with the 2nd feature in F2 are larger than λ and remove them; this gives a new feature subset F3 = [f_1, f_2, …, f_k3];
6) Repeat the above process until the last feature.
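Steps 3)-6) can be sketched with numpy's `corrcoef`, which computes the matrix of (4) directly. Using the absolute correlation is a slight generalization of the text, which compares ρ itself to λ; the function name is ours.

```python
import numpy as np

def remove_redundant(X, ranked, lam=0.9):
    """X: samples x features; ranked: feature indices sorted by w_t^2, largest first.

    Walk the ranking and drop any later feature whose absolute correlation
    with an already-kept feature exceeds the threshold lam."""
    rho = np.corrcoef(X, rowvar=False)      # correlation matrix of eq. (4)
    kept = []
    for f in ranked:
        if all(abs(rho[f, g]) < lam for g in kept):
            kept.append(f)
    return kept
```

Because the walk starts from the largest-weight feature, each redundant group is represented by its most decision-relevant member, matching the rationale stated above.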
4
Empirical Analysis
In order to compare the proposed method with SVM-RFE, we choose 6 data sets from the UCI data repository [15]. The description of the data sets is shown in Table 1. For all the data sets, we first normalize all of the attributes into the interval [-1, 1]. Let us begin by studying the correlation among the features, taking the data set WPBC as an example. There are 33 features in the original data. SVM-RFE
Table 1. The description of data sets

Data set   Number of instances   Number of attributes
WDBC       569                   30
WPBC       198                   33
Clean1     476                   166
Dermat     366                   34
Sonar      208                   60
Ionos      351                   34

[Two scatter plots omitted: feature 4 and feature 5 plotted against feature 2]
Fig. 1. The relation of features 2 and 4 or 5
removes 6 irrelevant features and selects 27 features. We find that the 2nd feature is highly correlated with features 4 and 5, as shown in Fig. 1. By computing the correlation coefficients, 7 redundant features are removed. Table 2 shows the maximal correlation coefficient of the SVM-RFE feature subsets and the thresholds we use in removing the redundant features, together with the numbers of the original features, the SVM-RFE subsets and the subsets without redundancy. We can see that several features in the data sets WDBC, WPBC, Clean1 and Dermat have been deleted with the proposed technique. The classification accuracies with the three kinds of feature sets are shown in Table 3.

Table 2. The number of selected features and the accuracy of the original, SVM-RFE and new methods

Data set   Max coefficient   Threshold   Number of features
                                         ALL    RFE    New
WDBC       0.9979            0.9         30     30     21
WPBC       0.9929            0.9         33     27     20
Clean1     0.9909            0.95        166    79     58
Dermat     0.9417            0.9         34     23     18
Sonar      0.9258            0.9         60     60     57
Ionos      0.7483            0.9         34     31     31
Table 3. Comparison of classification performance with different feature selection

Data set   ALL             RFE-SVM         New
WDBC       0.9877±0.0275   0.9877±0.0275   0.9912±0.0124
WPBC       0.8744±0.0711   0.8528±0.0708   0.8694±0.0817
Clean1     0.9625±0.1186   0.9729±0.0856   0.9667±0.0775
Dermat     1.000±0.0000    0.9944±0.0176   0.9944±0.0176
Sonar      0.9762±0.0753   0.9762±0.0753   0.9619±0.1048
Ionos      0.9088±0.0877   0.9117±0.0775   0.9117±0.0775
The max coefficient is the maximal value in the correlation coefficient matrix of the subset selected by SVM-RFE. The max coefficients of 5 data sets are larger than 0.9; just one is about 0.75. This fact shows that there actually are some linearly redundant features after SVM-RFE based feature selection. On 2 of the 6 selected data sets, the accuracy is higher than with SVM-RFE: the number of features is reduced from 30 to 21 and from 27 to 20 respectively, while the accuracy improves.
5 Conclusion
Support vector machines are widely used to select informative feature subsets. In this paper we show that the traditional SVM-RFE algorithm for feature selection only considers the relation between the condition attributes and the decision, but does not analyze the relations among the condition attributes. Therefore, the SVM-RFE algorithm can only remove the irrelevant features from the data. We present a technique to delete the redundant information in data based on the correlation coefficient. There are two main steps in the proposed technique. First we remove the irrelevant features with the SVM-RFE technique; then we compute the correlation coefficients between the attributes selected in the first step and delete the redundant features from the selected subset. The proposed improved feature selection method can remove some linearly redundant features from the feature subset obtained by SVM-RFE. From the experimental analysis, we find that the number of features selected with the proposed technique is much smaller than with SVM-RFE, while the accuracy based on 10-fold cross validation is comparable.
References
1. Dash, M., Liu, H.: Feature Selection for Classification. Intelligent Data Analysis 1 (1997) 131–156
2. Kohavi, R., George, J.: Wrappers for Feature Subset Selection. Artificial Intelligence 97 (1997) 273–324
3. Guyon, I., Elisseeff, A.: An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3 (2003) 1157–1182
1380
Z.-X. Xie, Q.-H. Hu, and D.-R. Yu
4. Hu, Q., Yu, D., Xie, Z.: Information Preserving Hybrid Data Reduction Based on Fuzzy Rough Techniques. Pattern Recognition Letters (in press)
5. Liu, H., Yu, L., Dash, M., Motoda, H.: Active Feature Selection Using Classes. Pacific-Asia Conference on Knowledge Discovery and Data Mining (2003)
6. Guyon, I., Matic, N., Vapnik, V.: Discovering Informative Patterns and Data Cleaning. Advances in Knowledge Discovery and Data Mining (1996) 181–203
7. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene Selection for Cancer Classification Using Support Vector Machines. Machine Learning 46 (2002) 389–422
8. Rakotomamonjy, A.: Variable Selection Using SVM-based Criteria. Journal of Machine Learning Research 3 (2003) 1357–1370
9. Li, G., Yang, J., Liu, G., Li, X.: Feature Selection for Multi-class Problems Using Support Vector Machines. Proceedings of 8th Pacific Rim International Conference on Artificial Intelligence (PRICAI-04), LNCS 3157 (2004) 292–300
10. Duan, K., Rajapakse, J.C., Wang, H., Francisco, A.: Multiple SVM-RFE for Gene Selection in Cancer Classification with Expression Data. IEEE Transactions on Nanobioscience (2005) 228–234
11. Yu, L., Liu, H.: Efficient Feature Selection via Analysis of Relevance and Redundancy. Journal of Machine Learning Research 5 (2004) 1205–1224
12. Hsing, T., Liu, L., Brun, M., et al.: The Coefficient of Intrinsic Dependence (Feature Selection Using el CID). Pattern Recognition (2005) 623–636
13. Yao, K., Lu, W., Zhang, S., et al.: Feature Expansion and Feature Selection for General Pattern Recognition Problems. IEEE Int. Conf. Neural Networks and Signal Processing (2003) 29–32
14. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers (2000)
15. Blake, C., Keogh, E., Merz, C.: UCI Repository of Machine Learning Databases. Technical Report, Department of Information and Computer Science, University of California, Irvine, CA (1998)
Feature Selection in Text Classification Via SVM and LSI Ziqiang Wang and Dexian Zhang School of Information Science and Engineering, Henan University of Technology, Zheng Zhou 450052, China [email protected]
Abstract. Text classification is the problem of assigning a document to one or more predefined classes. One of the most interesting issues in text categorization is feature selection. This paper proposes a novel approach to feature selection based on the support vector machine (SVM) and latent semantic indexing (LSI), which can identify an LSI subspace that is suited for classification. Experimental results show that the proposed method can achieve higher classification accuracies and requires less training and prediction time.
1 Introduction
With the ever-increasing growth of on-line information and the permeation of the Internet into daily life, methods that assist users in organizing large volumes of documents are in huge demand. In particular, automatic text classification has received much attention [1]. The goal of automatic text categorization is to learn a classification scheme from training examples so that the scheme can be used to categorize unseen text documents. Text classification (TC) is an important tool for organizing documents into classes by applying statistical methods or artificial intelligence techniques. By this, the utilization of the documents can be expected to be more effective, and the situation of information overload may be alleviated. So far, a variety of learning techniques for TC have been developed [2], including nearest neighbor classification, Bayesian approaches, decision trees, linear classifiers, neural networks, and support vector machines [3]. An important consideration in text classification performance is the choice of feature sets for text representation [4]. A popular approach to text representation is the vector space model. It represents the "units of content" of a document as a vector. In most situations, a single word is used as a content unit. However, such a representation, called the "bag of words" approach, has the following drawbacks. Firstly, a large number of features are required for document representation. Secondly, it totally neglects semantic similarities between two words, which could be important from a classification viewpoint. For example, it would be desirable if words that describe a similar concept were represented as a single unit. Latent semantic indexing (LSI) [5,6] is an approach that tries to allow this via the use of the singular value decomposition (SVD) technique.
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1381–1386, 2006. © Springer-Verlag Berlin Heidelberg 2006
1382
Z. Wang and D. Zhang
LSI has been used to improve retrieval performance and has shown favorable results in information retrieval. The usefulness of LSI has been attributed to the fact that the LSI subspace captures essential associative semantic relationships better than the original document space. Mathematically, LSI with a truncated SVD is the best approximation of X in the reduced k-dimensional subspace in the L2 norm. However, the subspace selected by this method may not be the most appropriate one for classifying documents. In order to select LSI features that are suited for classification, we propose a more compact LSI subspace selection method based on the support vector machine (SVM) [7]. Experimental results show that the proposed approach generally demonstrates promising performance on text categorization.
2 Latent Semantic Indexing
The latent semantic indexing (LSI) [5,6] information retrieval model uses the singular value decomposition (SVD) to reduce the dimensions of the term-document space. LSI explicitly represents terms and documents in a rich, high-dimensional space, allowing the underlying ("latent") semantic relationships between terms and documents to be exploited during searching. Briefly, vectors representing the documents are projected into a new, low-dimensional space obtained by singular value decomposition (SVD) of the term-document matrix A. This low-dimensional space is spanned by the eigenvectors of AA^T that correspond to the few largest eigenvalues and thus, presumably, to the few most striking correlations between terms. LSI differs from previous attempts at using reduced-space models for information retrieval in several ways. Most notably, LSI represents documents in a high-dimensional space. Secondly, both terms and documents are explicitly represented in the same space. Thirdly, each dimension is merely assumed to represent one or more semantic relationships in the term-document space. Finally, LSI is able to represent and manipulate large data sets, making it viable for real-world applications. Let A be an n × m matrix of rank r whose rows represent terms and columns represent documents. Let the singular values of A (the square roots of the eigenvalues of AA^T) be σ1 ≥ σ2 ≥ … ≥ σr. The singular value decomposition of A expresses A as the product of three matrices, A = U D V^T, where D = diag(σ1, …, σr) is an r × r matrix, U = (u1, …, ur) is an n × r matrix whose columns are orthonormal, and V = (v1, …, vr) is an m × r matrix which is also column-orthonormal. LSI works by omitting all but the k largest singular values in the above decomposition, for some appropriate k; here k is the dimension of the low-dimensional space.
It should be small enough to enable fast retrieval and large enough to adequately capture the structure of the corpus (in practice, k is in the few hundreds, compared with r in the many thousands). Let Dk = diag(σ1, …, σk), Uk = (u1, …, uk), and Vk = (v1, …, vk). Then

    Ak = Uk Dk Vk^T        (1)

is a matrix of rank k, which is our approximation of A. The rows of Vk Dk are then used to represent the documents. In other words, the column vectors of
Feature Selection in Text Classification via SVM and LSI
1383
A (documents) are projected into the k-dimensional space spanned by the column vectors of Uk; we sometimes call this space the LSI space of A. Therefore, LSI preserves the relative distances in the term-document matrix while projecting it to a lower-dimensional space [5,6]. Experimental results and theoretical analysis show that LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance.
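As a sketch of the construction above, the rank-k approximation Ak = Uk Dk Vk^T of Eq. (1) and the k-dimensional document representations (rows of Vk Dk) can be computed with a plain SVD; the helper name `lsi_project` is ours.

```python
import numpy as np

def lsi_project(A, k):
    """A: n-by-m term-document matrix (rows = terms, columns = documents).
    Returns the rank-k approximation Ak (Eq. 1) and the k-dimensional
    document representations (the rows of Vk Dk)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Dk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
    Ak = Uk @ Dk @ Vk.T       # best rank-k approximation in the L2 norm
    docs_k = Vk @ Dk          # one row per document, k columns
    return Ak, docs_k
```

A new document vector q can then be folded into the same space as Uk^T q, so retrieval and classification operate on k-dimensional vectors.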
3 Feature Selection Method Based on SVM
Machine learning has been applied to text classification extensively and has enjoyed a great deal of success. Among the many learning algorithms, the Support Vector Machine (SVM) [7] appears to be the most promising. Indeed, since Joachims's pioneering work on applying SVMs to text classification [8], a number of studies have shown that SVMs outperform many other text classification methods. In [8], a theoretical learning model of text classification for SVMs is developed. In particular, it suggests that text classification tasks are generally linearly separable with a large margin, which, along with a low training error, can lead to a low expected prediction error. The support vector machine (SVM) is a learning method based on margin maximization introduced by Vapnik et al. [7]. SVM performs binary pattern classification by finding an optimal linear discriminant function f(x) = w^T x + b; the parameters w and b are determined by learning from instances. In the following, we will only overview the linear SVM because many of the text datasets are linearly separable [8]. Suppose that linearly separable data {xi} and their class labels {yi} (yi ∈ {−1, +1}) are given. SVM is defined as the function that separates the two classes by a hyperplane such that the distance to the closest data points (i.e., the margin) is maximized. These closest points to the hyperplane are called support vectors. The resultant hyperplane is called the optimal separating hyperplane. The decision function for an input vector x is defined as follows:

    f(x) = w^T x + b        (2)
where w = Σ_{i=1}^{m} αi yi xi and b is a bias term. The coefficients αi are determined by solving a quadratic programming problem:

    maximize  Σ_{i=1}^{m} αi − (1/2) Σ_{i,j=1}^{m} αi αj yi yj xi^T xj        (3)

αi is non-zero only for support vectors. Then, the parameter w of a linear SVM is given by a linear combination of support vectors. Therefore, we can decompose w into a sum of items corresponding to the original features. Then, we can derive the following equation:

    wk^2 = Σ_{i,j=1}^{m} αi αj yi yj xi^k xj^k        (4)
1384
Z. Wang and D. Zhang
where xi^k is the kth component of xi. So, we can evaluate the importance of features in terms of their values of wk^2 [9]. In addition, because we adopt the LSI model in the text classifier, w^2 must be decomposed into the corresponding LSI feature items. Let uk be the proportion of variance to the total sum of variances and pk = wk^2 uk. Let Û, V̂ and D̂ be the matrices corresponding to the columns of U and V and the diagonal components of D sorted by their pk values. Then the LSI representation Â can be described as follows:

    Â = Û D̂ V̂^T        (5)

where Â is a matrix in which all but the first k diagonal elements were set to zero. Therefore, LSI preserves the relative distances in the term-document matrix while projecting it to a lower-dimensional space. The later experimental results show that LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance.
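Given a solved dual (the αi), the double sum in Eq. (4) collapses to squaring the components of w = Σ_i αi yi xi; a minimal sketch (the helper name `feature_scores` is ours):

```python
import numpy as np

def feature_scores(alpha, y, X):
    """Per-feature importance wk^2 from the SVM dual solution (Eq. 4):
    wk^2 = sum_{i,j} alpha_i alpha_j y_i y_j x_i^k x_j^k
         = (sum_i alpha_i y_i x_i^k)^2."""
    w = (alpha * y) @ X        # w_k = sum_i alpha_i y_i x_i^k
    return w ** 2
```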
4 Experimental Methodology and Results
For our experiments we used a variety of datasets, most of which are frequently used in information retrieval research [10]. Table 1 summarizes the characteristics of the datasets. Texts were represented using a vector space model, where documents are represented by feature vectors of terms. We represent documents as vectors of terms that are derived as follows: we replace all numerical and punctuation characters by spaces and eliminate stop-words such as articles and prepositions; each term is associated with a TF-IDF weight, where TF denotes the frequency of a term in a document, and IDF is calculated based on the distribution of the term in the training corpus [11]. In all experiments the document vectors were normalized to unit length. In all experiments, we randomly chose 70% of the documents for training and assigned the rest for testing. It is suggested in [4] that information gain is effective for term removal and that it can remove up to 90% or more of the unique terms without performance degradation. So, we first selected the top 1,000 words by information gain with class labels. In this paper, we use classification accuracy as well as the F1 measure [2] for evaluation. However, since the datasets used in our experiments are relatively balanced and single-labeled [10], and the goal in text categorization is to achieve low

Table 1. Data Sets Descriptions

Datasets        #Document   #Class
20NG            20,000      20
WebKB           8,280       7
Reuters-Top10   2,900       10
TDT2            7,980       96
Table 2. Classification Performance Comparison

Datasets   SVM     NB      KNN     ME
20NG       96.24   85.61   50.73   89.16
WebKB      79.95   61.21   44.89   71.38
Reuters    84.68   83.35   74.12   81.75
TDT2       94.57   91.62   86.73   89.34

Table 3. Running Time of SVM

Datasets   Training Time (s)   Prediction Time (s)
20NG       98.53               6.23
WebKB      90.26               0.41
Reuters    58.37               0.14
TDT2       20.58               4.98
misclassification rates, we consider that accuracy is the best measure of performance. All of our experiments were carried out on a P4 1.7 GHz machine with 512 MB memory running Windows XP. For linear SVMs, the only parameter that needs to be specified beforehand is the constant C. When a dataset is separable, performance is rather insensitive to the choice of C as long as its value is sufficiently large [12]. Since most datasets in the text categorization domain are linearly separable, we simply used a fixed value of C = 1000 throughout the experiments. Now we present and discuss the experimental results. Here we compare our proposed method (SVM for short) against Naive Bayes (NB for short), K-Nearest Neighbor (KNN for short), and Maximum Entropy (ME for short) on the same datasets with the same training and testing data. For experiments involving SVM we used SVMlight. For KNN we set k to 30. Table 2 shows the classification performance comparison; the experimental results show that SVM outperformed all the other three methods on 20NG, WebKB, Reuters-Top10 and TDT2. SVM is also very efficient, and most experiments are done in several seconds. Table 3 summarizes the running time for all the SVM experiments. In summary, these experiments have shown that SVM provides an alternative choice for fast and efficient text categorization.
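The document representation described earlier (raw term frequencies, IDF weighting, unit-length normalization) might be sketched as follows. This is our own sketch: the exact IDF formula log(N/df) and the helper name are assumptions, since the paper defers the weighting details to [11].

```python
import numpy as np

def tfidf_unit_vectors(docs):
    """docs: list of token lists. Returns unit-length TF-IDF row vectors
    and the vocabulary (our sketch; IDF = log(N / df))."""
    vocab = sorted({t for d in docs for t in d})
    idx = {t: k for k, t in enumerate(vocab)}
    N = len(docs)
    df = np.zeros(len(vocab))
    for d in docs:
        for t in set(d):
            df[idx[t]] += 1          # document frequency of each term
    idf = np.log(N / df)
    M = np.zeros((N, len(vocab)))
    for i, d in enumerate(docs):
        for t in d:
            M[i, idx[t]] += 1.0      # raw term frequency
    M *= idf                         # TF-IDF weighting
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # guard all-zero documents
    return M / norms, vocab
```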
5 Conclusions
In this paper, we proposed applying an SVM-based feature selection method in order to identify an LSI subspace that is suited for classification. Experimental results show that the proposed method achieves higher classification accuracies and requires less running time.
References
1. Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys 34(1) (2002) 1-47
2. Yang, Y., Liu, X.: A Re-examination of Text Categorization Methods. In: Gey, F. (ed.): Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York (1999) 42-49
3. Joachims, T.: A Statistical Learning Model of Text Classification with Support Vector Machines. In: Croft, W.B., Harper, D.J., Kraft, D.H., Zobel, J. (eds.): Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York (2001) 128-136
4. Yang, Y., Pederson, J.O.: A Comparative Study on Feature Selection in Text Categorization. In: Fisher, D.H. (ed.): Proceedings of the Fourteenth International Conference on Machine Learning. Morgan Kaufmann, San Diego, CA (1997) 412-420
5. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K.: Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science 41(6) (1990) 391-407
6. Papadimitriou, C.H., et al.: Latent Semantic Indexing: A Probabilistic Analysis. Journal of Computer and System Sciences 61(2) (2000) 217-235
7. Vapnik, V.: Statistical Learning Theory. John Wiley & Sons, New York (1998)
8. Joachims, T.: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In: Nedellec, C., Rouveirol, C. (eds.): Machine Learning. Lecture Notes in Computer Science, Vol. 1398. Springer-Verlag, Berlin Heidelberg New York (1998) 137-142
9. Guyon, I., Weston, J., Barnhill, S.: Gene Selection for Cancer Classification Using Support Vector Machines. Machine Learning 46(3) (2002) 389-422
10. Li, T., Zhu, S., Ogihara, M.: Efficient Multi-way Text Categorization via Generalized Discriminant Analysis. In: Kraft, D. (ed.): Proceedings of the Twelfth International Conference on Information and Knowledge Management. ACM Press, New York (2003) 317-324
11. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)
12. Drucker, H., Wu, D., Vapnik, V.N.: Support Vector Machines for Spam Categorization. IEEE Trans. Neural Networks 10(5) (1999) 1048-1054
Parsimonious Feature Extraction Based on Genetic Algorithms and Support Vector Machines
Qijun Zhao1, Hongtao Lu1, and David Zhang2
1 Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200030, China {zhaoqjatsjtu, htlu}@sjtu.edu.cn
2 Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong [email protected]
Abstract. Most existing feature extraction algorithms aim at best preserving information in the original data or at improving the separability of data, but fail to consider the possibility of further reducing the number of used features. In this paper, we propose a parsimonious feature extraction algorithm. Its motivation is using as few features as possible to achieve the same or even better classification performance. It searches for the optimal features using a genetic algorithm and evaluates the features referring to Support Vector Machines. We tested the proposed algorithm by face recognition on the Yale and FERET databases. The experimental results proved its effectiveness and demonstrated that parsimoniousness should be a significant factor in developing efficient feature extraction algorithms.
1 Introduction

An important issue in pattern recognition is how to cope with high-dimensional data. A direct answer is to reduce the dimensionality. This is called, in the pattern recognition literature, feature selection or feature extraction. Standard feature extraction methods, such as Principal Component Analysis (PCA) [1] and Linear Discriminant Analysis (LDA) [2], attempt to preserve the information residing in the original data as much as possible, or to make the data more separable. Although they have been successfully applied in many classification applications, the question of how many features should be chosen is still not well solved. In real-world applications, the number of chosen features is usually determined by experience [3][4]. Evolutionary Pursuit (EP) [5] has recently been proposed to search for the projection axes, also known as the basis, in feature extraction. Its searching process is done by a Genetic Algorithm (GA) [6] and it evaluates the basis candidates according to the classification performance on both the training and the testing samples. It is reported that EP achieves better results using fewer features than Eigenfaces [3] in training and overwhelms both Eigenfaces and Fisherfaces [4] in testing. However, EP does not show whether the number of used features can be reduced further. More recently, Zheng et al. [7] presented another feature selection and extraction algorithm, GA-Fisher, for face recognition. They employed GA to select a subset of principal
J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1387 – 1393, 2006. © Springer-Verlag Berlin Heidelberg 2006
1388
Q. Zhao, H. Lu, and D. Zhang
components to represent the original data (this step is called GA-PCA) and then used normal LDA to calculate the final projection axes. In GA-Fisher, the number of principal components chosen by GA is set to the rank of the within-class scatter matrix of the samples so that the LDA following the GA-PCA is relieved from the small sample size problem. GA-Fisher, however, does not explore potentially smaller bases either. In this paper, we present a preliminary study on searching for smaller bases which not only have satisfactory classification accuracy, but also use fewer features. We call the proposed algorithm a parsimonious feature extraction (PFE) algorithm to highlight its advantage in using fewer features. The remainder of this paper is organized as follows. In Section 2, we present in detail the proposed parsimonious feature extraction algorithm. We then show our experiments in Section 3. Finally, we conclude this paper in Section 4.
2 Parsimonious Feature Extraction

Given N n-dimensional samples x1, x2, …, xN from L clusters, suppose there are Ni samples in the ith cluster and Ii denotes the set of indexes of the samples belonging to the ith cluster, i = 1, 2, …, L. We first, for efficiency's sake, project the original sample data into a lower-dimensional space using Whitened Principal Component Analysis (WPCA). Due to space limitations, we will not introduce WPCA in detail but refer the readers to [1]. We denote the WPCA-preprocessed data simply by xi for notational simplicity. We then propose a GA to search for the optimal projection axes in the WPCA-transformed space. Suppose WPFE = [w1 w2 … wm] are the optimal projection axes; after feature extraction the samples become y1, y2, …, yN ∈ R^m (0 < m ≪ n):

    yi = WPFE^T xi,   i = 1, 2, …, N.        (1)
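WPCA itself is only referenced above ([1]); a standard construction — project onto the top-d principal axes and rescale each to unit variance — might look like this. This is our sketch with a hypothetical helper name, not the paper's implementation.

```python
import numpy as np

def wpca(X, d):
    """Whitened PCA: rows of X are samples. The projected data has
    identity covariance in the retained d-dimensional subspace."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1][:d]        # top-d eigenvalues
    W = vecs[:, order] / np.sqrt(vals[order]) # scale axes to unit variance
    return Xc @ W
```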
In the following, we will present the proposed GA for parsimonious feature extraction according to the three fundamental problems in applying GA: how to encode solutions in individuals, how to generate new individuals, and how to evaluate individuals.

2.1 Definition of Individuals
We use binary individuals, i.e. sequences of '0' and '1', and we take a method similar to that used in EP. However, we generate new basis vectors by linearly combining the standard basis vectors, rather than by rotating them. This works because the projection axes lie in the original space and thus should be linear combinations of any basis of the space. Since the original searching space is n-dimensional, we record n projection axes in each individual, and for every projection axis, n combination coefficients are needed. We use 11 bits to represent each coefficient, with the leftmost bit denoting its sign ('0' means negative and '1' positive) and the remaining
Parsimonious Feature Extraction Based on Genetic Algorithms
1389
10 bits giving its value as a binary decimal. Assume the combination coefficients of the ith projection axis wi are ai1, ai2, …, ain; then

    wi = Σ_{j=1}^{n} aij ej,   i = 1, 2, …, n,        (2)
where e1, e2, …, en are the standard basis of the original searching space. An advantage of using the standard basis is that neither multiplication nor summation is needed at all for retrieving wi. In contrast, O(n^2) trigonometric operations are needed in EP. From this viewpoint, the linear-combination-based method is supposed to be more efficient. In order to choose a subset of the n projection axes, we also define n flag bits in each individual to randomly choose m of the n projection axes ('0' means discarding and '1' choosing).

2.2 Genetic Operators
We employ the following three genetic operators to generate new individuals: selection, crossover, and mutation. The selection is conducted according to the relative fitness of individuals, i.e. the proportion of an individual's fitness to the total fitness of the population determines how many times it will be employed to generate individuals for the next generation. The crossover is similar to that in EP. However, EP does not treat the bits denoting rotation angles and the flag bits differently, though they should be treated differently. Instead, we deal with the coefficient and flag bits separately. More specifically, to recombine two individuals, a crossover point is randomly chosen within the coefficient bits, and this point divides the coefficient bits in both individuals into two parts. One of these parts is exchanged between them. The same type of operation is then applied to the flag bits. This results in two new individuals. As for mutation, each bit in an individual is subject to mutation from '0' to '1' or the reverse with a specific probability.

2.3 Fitness Function
We evaluate the generalization ability of individuals by drawing on the idea of maximizing the margins of separating hyper-planes proposed in SVMs [8]. We assume that the separating hyper-plane of two clusters is the plane which is perpendicular to the straight line between the two cluster centers and passes through the midpoint of the two cluster centers. The margin of a separating hyper-plane is defined as the sum of its distances to the closest samples in the two clusters separated by it. When there are more than two clusters, we first calculate the margins for every pair of clusters and then take their average as the generalization term. In order to calculate the margins, we construct an N × L matrix, which we call the Performance Matrix (PM). The element in the ith row and jth column of PM, PMij, is defined as the distance from the sample xi to the separating hyper-plane of the clusters cj and ck (suppose xi ∈ ck). Let cj denote the center of the jth cluster, Ojk be the midpoint of cj and ck, and wjk be the direction vector from cj to ck; then
    PMij = wjk^T (xi − Ojk),        (3)

where Ojk = (cj + ck)/2, wjk = (ck − cj)/‖ck − cj‖, and ‖v‖ denotes the length of the vector v. The distance from the cluster ck to the separating hyper-plane of the clusters cj and ck can be calculated by

    dkj = min{PMij | PMij ≥ 0, i ∈ Ik}.        (4)

Similarly, the distance from cj to the separating hyper-plane of cj and ck is

    djk = min{PMik | PMik ≥ 0, i ∈ Ij}.        (5)

The margin of the hyper-plane of cj and ck is

    mkj = dkj + djk.        (6)

The generalization term is defined as the average margin over the whole set of samples:

    ζg(F) = (2 / (L(L − 1))) Σ_{k=1}^{L−1} Σ_{j=k+1}^{L} mkj,        (7)
where F is an individual. As in EP, we also use the recognition rate as the measurement of performance on the training samples. Let ei ∈ {0, 1} denote whether the sample xi is classified correctly (ei = 0) or not (ei = 1):

    ei = 1 if ∃ k ∈ {1, 2, …, L}: PMik < 0;  ei = 0 otherwise.        (8)

The performance term is then defined as

    ζa(F) = 1 − Σ_{i=1}^{N} ei / N.        (9)
Different from EP and GA-Fisher, we define a penalty term for the fitness function as follows:

    ζp(F) = e^{−m/n},        (10)

where n is the dimensionality of the original space, and m is the number of chosen features. In this way, when two individuals have the same generalization ability and the same performance on the training samples, the one which uses fewer features will be assigned a larger fitness value. All in all, the fitness of an individual F is given as the product of these three terms:

    ζ(F) = ζg(F) × ζa(F) × ζp(F).        (11)
3 Experiments

We evaluated our proposed parsimonious feature extraction (PFE) algorithm in comparison to EP and GA-Fisher using the Yale and FERET databases. In all the experiments, the parameters of the GA were set by experience as follows: the size of the population was 100; the probabilities of crossover and mutation were 0.8 and 0.01; and the number of GA iterations was 50. A minimum distance classifier was used to recognize faces. The experiments were divided into two parts.

Table 1. Comparative testing performance for EP and PFE on Yale database

Method   n    m    Top 1     Top 3     Top 5
EP       30   27   85.45%    95.76%    96.36%
PFE      30   16   90.91%    96.36%    98.18%
EP       60   44   93.33%    96.97%    98.79%
PFE      60   34   93.94%    96.97%    98.79%
3.1 Part I: Experiment on Yale Database
In Part I, we compared PFE and EP in terms of the recognition rate and the number of chosen features. The Yale database was used in this part, and the images were first cropped such that only the face area was retained. They were preprocessed to have the same inter-ocular distance and an average face shape, and were resized to 24 by 24. In each session, the images taken under the same conditions were used for testing (15 images in total) and the rest for training (150 in all). We considered the top one, top three and top five recognition rates [5]. Table 1 shows the results, where n is the dimensionality of the WPCA space and m is the number of chosen features. Obviously, PFE has better classification performance than EP while using fewer features (on average, about 10 fewer features were used by PFE than by EP in the experiments). This verifies that it is possible to further reduce the number of features used in classification, at least on a small-scale face database such as the Yale database.
In this part, we assessed the performance of PFE using a relatively large database, the FERET database. We selected a subset of 1004 frontal images corresponding to 251 subjects (four images per subject). These images were also preprocessed such that the area outside the face was cropped and the face area was aligned by the centers of the eyes and mouth. This preprocessing step is similar to that in the experiments of GA-Fisher [7] except that we normalized the images to a resolution of 20 by 20 for the sake of computation cost. The dimension of the WPCA-transformed space was set to 60 in this part. We randomly selected one image of each person to form a testing set of 251 images, and the rest were used for training. We ran the experiment ten times, and the
Fig. 1. Recognition rate of PFE on a subset of FERET database
average of the ten best recognition rates is reported in Fig. 1. These results are comparable with those reported for GA-Fisher [7]. However, our proposed PFE used fewer than 40 features to obtain this classification performance, whereas the number of features chosen by normal LDA was generally larger than 100 in the experiment. With such a parsimonious subset of features, the total efficiency of recognition systems can be improved greatly.
4 Conclusions

In this paper, we propose a parsimonious feature extraction (PFE) algorithm. We use a genetic algorithm to search for the optimal subset of features and evaluate the candidate subsets of features according to both their classification performance and the number of features they choose. The final optimal subset of features found by PFE not only has classification performance as good as Evolutionary Pursuit and GA-Fisher, but also uses far fewer features and thus improves the total efficiency of recognition systems. The experiments on the Yale and FERET databases verify the effectiveness of our proposed algorithm and demonstrate that parsimoniousness should be an important factor in developing efficient feature extraction algorithms. Our future research will focus on enhancing the stability of parsimonious feature extraction algorithms and further improving their efficiency.
Acknowledgment This work is supported by NSFC under project No. 60573033.
Parsimonious Feature Extraction Based on Genetic Algorithms
References

1. Jolliffe, I.T.: Principal Component Analysis. 2nd edn. Springer, New York (2002)
2. Fukunaga, K.: Introduction to Statistical Pattern Recognition. 2nd edn. Academic Press (1990)
3. Turk, M., Pentland, A.: Eigenfaces for Recognition. J. Cognitive Neuroscience 3(1) (1991) 71-86
4. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. PAMI 19(7) (1997) 711-720
5. Liu, C., Wechsler, H.: Evolutionary Pursuit and Its Application to Face Recognition. IEEE Trans. PAMI 22(6) (2000) 570-582
6. Goldberg, D.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley (1989)
7. Zheng, W., Lai, J., Yuen, P.C.: GA-Fisher: A New LDA-Based Face Recognition Algorithm With Selection of Principal Components. IEEE Trans. Systems, Man, and Cybernetics - Part B: Cybernetics 35(5) (2005) 1065-1078
8. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
Feature Extraction for Time Series Classification Using Discriminating Wavelet Coefficients

Hui Zhang¹,², Tu Bao Ho¹, Mao-Song Lin², and Xuefeng Liang¹

¹ School of Knowledge Science, Japan Advanced Institute of Science and Technology, Nomi, Ishikawa 923-1292, Japan
² School of Computer Science, Southwest University of Science and Technology, Mianyang, Sichuan 621000, China
[email protected], [email protected], {bao, xliang}@jaist.ac.jp
Abstract. Many feature extraction algorithms have been proposed for time series classification. However, most of the proposed algorithms in time series data mining community belong to the unsupervised approach, without considering the class separability capability of features that is important for classification. In this paper we propose a supervised feature extraction approach by selecting discriminating wavelet coefficients to improve the time series classification accuracy. After wavelet transformation, few wavelet coefficients with higher class separability capability are selected as features. We apply three feature evaluation criteria, i.e., Fisher’s discriminant ratio, divergence, and Bhattacharyya distance. Experiments performed on several benchmark time series datasets demonstrate the effectiveness of the proposed approach.
1 Introduction
Time series classification is an important direction of time series mining that has attracted increasing interest in the last decades. One of the main difficulties of time series classification is the so-called curse-of-dimensionality problem, which comes from the high dimensionality of time series. The curse of dimensionality refers to the dependency of the sample size required by a classifier or density estimator on the dimensionality of the feature vector to guarantee a given accuracy [1]. It means we need a huge number of samples for high-dimensional data, which cannot always be satisfied for real data. Thus most of the proposed classification algorithms work with lower-dimensional features extracted from the time series. Wavelets, which can decompose a time series into a multi-scale representation [2], have been extensively used in time series feature extraction. The typical method is to use the first few wavelet coefficients, corresponding to the low-frequency part of the time series, as the features [3], or to keep a few coefficients with higher energy, corresponding to lower reconstruction
error [4]. Recently, approaches taking wavelet coefficients within a specific scale chosen by entropy or by an energy-error tradeoff have also been proposed [5][6]. To our knowledge, most wavelet-based time series feature extraction algorithms are unsupervised: they do not use the information in the training data. For classification problems, supervised feature extraction that utilizes the information in the training data should be better than unsupervised feature extraction. In this paper, we propose a novel feature extraction approach that selects a few wavelet coefficients with high class discriminating ability. The wavelet coefficients decomposed down to scale zero are defined as features. We evaluate the class discriminating capability of each feature for a time series dataset by three criteria, namely Fisher's discriminant ratio, divergence, and the Bhattacharyya distance. The feature extraction algorithm selects the wavelet coefficients with the highest class discriminating capability as features. The rest of the paper is organized as follows. The proposed feature extraction algorithm is presented in Section 2. Experimental evaluation of the proposed algorithm on several benchmark time series datasets is described in Section 3. Finally, we conclude the paper by summarizing the main contributions in Section 4.
2 Feature Extraction Algorithm
Suppose that the dimensionality of the transformed wavelet coefficients is n (the original time series will be zero-padded to a length that is an integer power of two). We define the transformed wavelet coefficients down to scale zero as features:

Definition 1. The features are the transformed wavelet coefficients down to scale zero,

    F = {a_0, d_0, d_1^1, d_1^2, ..., d_{J−1}^{n/2−1}, d_{J−1}^{n/2}}    (1)

where J = log_2(n) is the highest scale at which wavelet coefficients are located. Note that for normalized time series, a_0 is proportional to the mean of the time series, which equals zero, so we do not need to keep it. The purpose of feature extraction is to choose k features (k ≪ n) out of F that are most significant for classification. For computational reasons, we take the widely used sequential method that treats each feature individually. Suppose we have m time series; each feature f_i is then a one-dimensional vector of length m. Features with high class separability are likely to be good for classification. The intuition behind the criteria is that if, in terms of a feature f_i, the data belonging to different classes are far from each other, then the classes are easy to separate by any classification algorithm. We apply the three widely used class separability criteria [7] shown below.
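As an illustration, the feature set F of Eq. (1) for a short series can be computed with a plain Haar transform. This is a sketch that uses the unnormalised averaging/differencing form of the Haar DWT (the exact normalisation is not specified in the text) and assumes the input length is already a power of two.

```python
import numpy as np

def haar_features(x):
    """Full Haar DWT of a series down to scale zero, i.e. the feature set F
    of Eq. (1). Assumes len(x) is a power of two (the paper zero-pads
    otherwise); uses plain pairwise averaging/differencing."""
    approx = np.asarray(x, dtype=float)
    details = []
    while len(approx) > 1:
        pairs = approx.reshape(-1, 2)
        details.append((pairs[:, 0] - pairs[:, 1]) / 2.0)  # d's at this scale
        approx = (pairs[:, 0] + pairs[:, 1]) / 2.0         # next approximation
    # a0 first, then detail coefficients from scale zero upward
    return np.concatenate([approx] + details[::-1])

f = haar_features([4.0, 2.0, 5.0, 5.0])
print(f)  # a0 = 4.0 (the series mean), d0 = -1.0, then scale-1 details 1.0, 0.0
```

Note that a_0 comes out as the series mean, matching the remark above that it can be dropped for normalized (zero-mean) series.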
1. Fisher's discriminant ratio: Fisher's Linear Discriminant Analysis (LDA) tries to find a linear discriminant that generates optimal discrimination between two classes X and Y, where X and Y are two multivariate random variables. The discrimination capability is measured by Fisher's discriminant ratio

    FDR = [w^T (μ_x − μ_y)(μ_x − μ_y)^T w] / [w^T (Σ_x + Σ_y) w]    (2)

where μ_x and μ_y are the means of X and Y respectively, and Σ_x and Σ_y are the covariance matrices of X and Y respectively. FDR takes a special form for the one-dimensional problem (when treating each feature individually):

    FDR = (μ_x − μ_y)² / (σ_x² + σ_y²)    (3)

For the multiclass case, suppose that there are c classes; the multiclass FDR can be written as

    FDR = Σ_i^c Σ_{j≠i}^c (μ_i − μ_j)² / (σ_i² + σ_j²)    (4)
where the subscripts i, j refer to the mean and variance corresponding to the feature under investigation for the classes c_i, c_j, respectively.

2. Divergence: Divergence is a distance measure between two random variables X and Y, which is the symmetric version of the Kullback-Leibler distance. When the two density functions are normal, the divergence becomes

    Div_{X,Y} = (1/2) trace{Σ_x^{−1} Σ_y + Σ_y^{−1} Σ_x − 2I} + (1/2)(μ_x − μ_y)^T (Σ_x^{−1} + Σ_y^{−1})(μ_x − μ_y)    (5)

Similar to the FDR, for the one-dimensional problem and the multi-class case, the averaging form of the divergence is

    Div = (1/2) Σ_i^c Σ_{j≠i}^c [ (σ_i²/σ_j² + σ_j²/σ_i² − 2) + (μ_i − μ_j)² (1/σ_i² + 1/σ_j²) ]    (6)
3. Bhattacharyya distance: The Bhattacharyya distance, which measures the distance between two normally distributed random variables, has the form

    Bha = (1/8)(μ_x − μ_y)^T ((Σ_x + Σ_y)/2)^{−1} (μ_x − μ_y) + (1/2) ln [ |(Σ_x + Σ_y)/2| / sqrt(|Σ_x||Σ_y|) ]    (7)

where |·| denotes the determinant of the respective matrix. In the multi-class situation for one-dimensional data, the averaging form of the Bhattacharyya distance is

    Bha = Σ_i^c Σ_{j≠i}^c [ (1/4)(μ_i − μ_j)²/(σ_i² + σ_j²) + (1/2) ln{ (1/4)(σ_i²/σ_j² + σ_j²/σ_i² + 2) } ]    (8)
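The one-dimensional forms of the three criteria translate directly into code. A minimal sketch of the pairwise terms (Eqs. (3), (6), and (8) as reconstructed above); summing them over all class pairs gives the multiclass measures:

```python
import math

def fdr(mu_i, var_i, mu_j, var_j):
    """Fisher's discriminant ratio for one feature and one class pair, Eq. (3)."""
    return (mu_i - mu_j) ** 2 / (var_i + var_j)

def div_pair(mu_i, var_i, mu_j, var_j):
    """One pairwise term of the averaged divergence, Eq. (6)."""
    return 0.5 * ((var_i / var_j + var_j / var_i - 2)
                  + (mu_i - mu_j) ** 2 * (1 / var_i + 1 / var_j))

def bha_pair(mu_i, var_i, mu_j, var_j):
    """One pairwise term of the averaged Bhattacharyya distance, Eq. (8)."""
    return (0.25 * (mu_i - mu_j) ** 2 / (var_i + var_j)
            + 0.5 * math.log(0.25 * (var_i / var_j + var_j / var_i + 2)))

# Identical class statistics give zero separability under all three criteria,
# and each score grows as the class means move apart.
print(fdr(0, 1, 0, 1), div_pair(0, 1, 0, 1), bha_pair(0, 1, 0, 1))
print(fdr(0, 1, 3, 1), div_pair(0, 1, 3, 1), bha_pair(0, 1, 3, 1))
```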
Feature Extraction for Time Series Classification
1397
The feature extraction problem in terms of discriminating wavelet coefficients is shown in Algorithm 1. The input of the algorithm is the set of transformed wavelet coefficients defined in Eq. 1 and the desired number of features k. The algorithm first orders the features with respect to a class discriminating measure I(f_i), which can be FDR, Div, or Bha. Then the k features with the highest I(f_i) are selected out of F.
Algorithm 1. The feature selection algorithm
Input: a feature set F = {f_1, f_2, ..., f_n} and the desired feature dimensionality k.
  S ← ∅
  for all f_i ∈ F, compute the class discriminant measure I(f_i)
  sort F with respect to I(f_i) in descending order
  for i = 1 to k do
      S ← S ∪ {f_i}
  end for
Output: the feature set S.
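A minimal Python rendering of Algorithm 1, assuming the features are columns of a matrix X; the two-class FDR scorer follows Eq. (3), and the small epsilon is an added numerical guard, not part of the paper's formula.

```python
import numpy as np

def select_features(X, y, k, criterion):
    """Sketch of Algorithm 1: score each feature (column of X) with a class
    discriminant measure I(f_i), sort in descending order, keep the k best."""
    scores = [criterion(X[:, i], y) for i in range(X.shape[1])]
    order = np.argsort(scores)[::-1]      # descending order of I(f_i)
    return order[:k]                      # indices of the selected features

def fdr_two_class(f, y):
    """Two-class Fisher's discriminant ratio of Eq. (3)."""
    a, b = f[y == 0], f[y == 1]
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var() + 1e-12)

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 5))
X[:, 2] += 3 * y                          # only feature 2 separates the classes
print(select_features(X, y, 1, fdr_two_class))  # → [2]
```

Any of the three criteria above can be passed as `criterion`; only the per-feature scoring changes, the ranking-and-truncation step is the same.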
3 Experimental Evaluation
We used the Haar wavelet in our experiments, which has the fastest transformation algorithm. We used two classification algorithms to evaluate the proposed approach: the One-Nearest-Neighbour algorithm (1NN) and the RBF probabilistic neural network (PNN) [8]. The performance of a classification algorithm is evaluated by its classification error rate under Leave-One-Out validation. We evaluated five feature extraction methods. LowK uses the first k wavelet coefficients, which correspond to the low-frequency part of the time series, as features [3]. HighEnerK selects the k wavelet coefficients with the highest energy, corresponding to the lowest reconstruction error [4]. FDR, Div, and Bha are our proposed feature extraction algorithms using Fisher's discriminant ratio, divergence, and the Bhattacharyya distance as the feature evaluation criterion, respectively. As the feature dimensionality k can range from 1 to n, we select the k with the highest classification accuracy for all the feature extraction algorithms used. We used four labeled datasets (CBF, CC, Trace, Gun) from the UCR Time Series Data Mining Archive [9] and one dataset (ECG) downloaded from the Internet [10].

3.1 Experimental Results
Table 1 and Table 2 give the classification error rates produced by the 1NN and PNN classification algorithms for the original data and for the extracted features. The classification results for the original time series are labeled Time series. Classification with extracted features outperforms classification with the original data for both classification algorithms.
Table 1. The classification error rate (%) produced by the 1NN algorithm for all the used datasets

             CBF    CC    Trace   Gun    ECG
Time series  0.26   1.33  11.00   5.50   38.57
LowK         0.00   0.33   5.50   5.50   37.14
HighEnerK    0.00   0.33   5.00   5.00   37.14
FDR          0.00   0.17   3.00   3.00   34.29
Div          0.00   0.33   0.50   0.50   37.14
Bha          0.00   0.17   1.50   3.00   37.14
Then we compared the five feature extraction algorithms under the 1NN classification algorithm, as shown in Table 1. The performance of HighEnerK is similar to that of LowK; HighEnerK performs a little better only on the Trace and Gun data. The three feature extraction algorithms based on class separability, FDR, Div, and Bha, outperform LowK and HighEnerK on average. On the CBF data, all five feature extraction algorithms achieve the same classification accuracy. On the other four datasets, FDR outperforms LowK and HighEnerK on all four, while Div and Bha outperform LowK and HighEnerK on three. For the PNN classification algorithm, the three feature extraction algorithms that use discriminating wavelet coefficients, FDR, Div, and Bha, also achieve higher classification accuracy on average than the two unsupervised feature extraction algorithms, LowK and HighEnerK. As shown in Table 2, FDR, Div, and Bha outperform LowK and HighEnerK on the Trace and Gun datasets, and are outperformed by LowK and HighEnerK only on the CBF dataset. For the CC dataset, both FDR and HighEnerK achieve the best classification accuracy. For the ECG dataset, FDR and Bha perform better than LowK and HighEnerK; Div is outperformed by LowK but is still better than HighEnerK. Among the three proposed feature extraction algorithms using discriminating wavelet coefficients, FDR achieves the best performance on average: it outperforms LowK and HighEnerK on four datasets out of five with both the 1NN and PNN algorithms, and performs worse than LowK and HighEnerK only on the CBF dataset with the PNN classification algorithm. Div and Bha achieve competitive performance.

Table 2. The classification error rate (%) produced by the PNN algorithm for all the used datasets

             CBF    CC     Trace   Gun    ECG
Time series  66.67  83.33  10.00   6.00   74.28
LowK         15.36  29.83   9.00   6.00   42.86
HighEnerK    15.36  24.50   9.00   5.50   51.43
FDR          21.09  24.50   1.50   4.50   41.43
Div          26.56  33.33   2.50   4.50   48.57
Bha          26.56  30.83   1.50   4.50   40.00
4 Conclusion
In this paper, feature extraction is carried out in order to improve time series classification accuracy. Although many feature extraction algorithms have been proposed for time series classification, most of them do not consider the discriminating ability of features. We proposed a feature extraction algorithm that selects a few wavelet coefficients with high class discriminating capability for time series classification. We applied three class separability measures to evaluate the features, namely Fisher's discriminant ratio, divergence, and the Bhattacharyya distance. Experiments with several benchmark time series datasets demonstrate that classification with the extracted features achieves higher accuracy than classification with the original time series data, and that the proposed feature extraction algorithms outperform two unsupervised feature extraction algorithms.
References

1. Raudys, S.J., Jain, A.K.: Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners. IEEE Trans. on Pattern Analysis and Machine Intelligence 13 (1991) 252-264
2. Mallat, S.: A Wavelet Tour of Signal Processing. 2nd edn. Academic Press, San Diego (1999)
3. Chan, K.P., Fu, A.W., Clement, T.Y.: Haar Wavelets for Efficient Similarity Search of Time-Series: with and without Time Warping. IEEE Trans. on Knowledge and Data Engineering 15 (2003) 686-705
4. Mörchen, F.: Time Series Feature Extraction for Data Mining Using DWT and DFT. Technical Report No. 33, Dept. of Mathematics and Computer Science, University of Marburg, Germany (2003)
5. Zhang, H., Ho, T.B., Lin, M.S.: A Non-Parametric Wavelet Feature Extractor for Time Series Classification. In: Dai, H., Srikant, R., Zhang, C. (eds.): Lecture Notes in Artificial Intelligence, Vol. 3056. Springer-Verlag, Berlin Heidelberg New York (2004) 595-603
6. Zhang, H., Ho, T.B., Huang, W.: Blind Feature Extraction for Time-Series Classification Using Haar Wavelet Transform. In: Wang, J., Liao, X., Yi, Z. (eds.): Lecture Notes in Computer Science, Vol. 3497. Springer-Verlag, Berlin Heidelberg New York (2005) 605-610
7. Fukunaga, K.: Introduction to Statistical Pattern Recognition. 2nd edn. Academic Press, San Diego, CA (1990)
8. Specht, D.F.: Probabilistic Neural Networks. Neural Networks 3 (1990) 109-118
9. Keogh, E., Folias, T.: The UCR Time Series Data Mining Archive. http://www.cs.ucr.edu/~eamonn/TSDMA/index.html (2002)
10. Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.K., Stanley, H.E.: PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 101 (2000) e215-e220. http://www.physionet.org/physiobank/database
Feature Extraction of Underground Nuclear Explosions Based on NMF and KNMF

Gang Liu, Xi-Hai Li, Dai-Zhi Liu, and Wei-Gang Zhai

Xi'an Research Institute of High Technology, Hongqing Town, Baqiao, Xi'an 710025, People's Republic of China
xihai [email protected]
Abstract. Non-negative matrix factorization (NMF) is a recently proposed parts-based representation method, and because of its nonnegativity constraints, it is mostly used to learn parts of faces and semantic features of text. In this paper, non-negative matrix factorization is first applied to extract features of underground nuclear explosion signals and natural earthquake signals, then a novel kernel-based non-negative matrix factorization (KNMF) method is proposed and also applied to extract features. To compare practical classification ability of these features extracted by NMF and KNMF, linear support vector machine (LSVM) is applied to distinguish nuclear explosions from natural earthquakes. Theoretical analysis and practical experimental results indicate that kernel-based non-negative matrix factorization is more appropriate for the feature extraction of underground nuclear explosions and natural earthquakes.
1
Introduction
In July and August 1958, a committee of experts met in Geneva to consider the design of an inspection system for nuclear tests [1]. In September 1996, a treaty banning underground nuclear tests was signed in Geneva. Seismic methods were the first detection methods chosen to monitor compliance with this treaty, and this led immediately to the problem of distinguishing between the signals from underground nuclear explosions and those of natural earthquakes. In order to distinguish the two kinds of signals, the first step is feature extraction. There are different feature extraction methods for this identification problem, such as principal components analysis (PCA) and vector quantization (VQ), but these methods mostly learn holistic, not parts-based, feature representations. However, such holistic features are not always very effective for the identification of underground nuclear explosions and natural earthquakes; for example, the correct recognition rate is no more than 70% for features extracted by PCA. Therefore, we should try to extract features by learning parts of these signals. Non-negative matrix factorization is precisely such a parts-based representation method [2], so feature extraction based on NMF is discussed in this paper.
The remainder of this paper is organized as follows. Section 2 discusses feature extraction based on non-negative matrix factorization. In Section 3, we propose a kernel-based non-negative matrix factorization and carry out feature extraction based on it. The feature extraction algorithms are experimentally evaluated and discussed in Section 4. Finally, concluding remarks are given in Section 5.
2 Feature Extraction Based on Non-negative Matrix Factorization

2.1 Non-negative Matrix Factorization
In 1999, Lee and Seung [2] proposed a new technique, called non-negative matrix factorization (NMF), to obtain a reduced representation of data. NMF differs from other methods in its use of non-negativity constraints; these constraints lead to a parts-based representation because they allow only additive, not subtractive, combinations of the original data [2]. Suppose an image database is an n × m matrix V, each column of which contains the n non-negative pixel values of one of the m facial images. Then NMF constructs an approximate factorization of the form V ≈ WH, or

    V_{iu} ≈ (WH)_{iu} = Σ_{a=1}^{r} W_{ia} H_{au}    (1)
The r columns of W are called basis images. Each column of H is called an encoding and is in one-to-one correspondence with a face in V. The dimensions of the matrix factors W and H are n × r and r × m respectively. Usually r is chosen to be smaller than n or m, so that W and H are smaller than the original matrix V. To find an approximate factorization of Equation 1, Lee and Seung [3] proposed two multiplicative update rules. First, two cost functions are defined. One is simply the square of the Euclidean distance between A and B,

    ||A − B||² = Σ_{ij} (A_{ij} − B_{ij})²    (2)
This is lower bounded by zero, and clearly vanishes if and only if A = B. The other is the divergence of A and B, i.e.

    D(A||B) = Σ_{ij} ( A_{ij} log(A_{ij}/B_{ij}) − A_{ij} + B_{ij} )    (3)

Using the above cost functions, we can consider two alternative formulations of NMF as optimization problems:

    min ||V − WH||²,  s.t. W, H ≥ 0    (4)
    min D(V||WH),  s.t. W, H ≥ 0    (5)
The optimization of Equation 4 can be done by using the following multiplicative update rules:

    H_{au} ← H_{au} (W^T V)_{au} / (W^T W H)_{au},   W_{ia} ← W_{ia} (V H^T)_{ia} / (W H H^T)_{ia}    (6)

The Euclidean distance is invariant under these updates if and only if W and H are at a stationary point of the distance. The optimization of Equation 5 can be done by using the following multiplicative update rules:

    H_{au} ← H_{au} [Σ_i W_{ia} V_{iu}/(WH)_{iu}] / [Σ_k W_{ka}],   W_{ia} ← W_{ia} [Σ_u H_{au} V_{iu}/(WH)_{iu}] / [Σ_ν H_{aν}]    (7)
The divergence is invariant under these updates if and only if W and H are at a stationary point of the divergence.

2.2 Feature Extraction Based on NMF

In non-negative matrix factorization, V ≈ WH, W is called the basis matrix and H the coefficient matrix. Each column of H corresponds to a column of the original matrix V, and each signal (or image) is the combination of the corresponding column of H with the basis matrix W, i.e. v_j = W h_j, j = 1, 2, ..., m, so that h_j = W^+ v_j (where W^+ denotes the pseudo-inverse, since W is in general not square). The coefficient matrix H can therefore be regarded as the features of the original matrix V.
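The multiplicative updates for the Euclidean cost are a few lines of NumPy. A sketch, where the iteration count, initialisation scale, and the epsilon guard against division by zero are choices made for the example, not taken from the paper:

```python
import numpy as np

def nmf(V, r, iters=200, seed=0):
    """Multiplicative-update NMF minimising ||V - WH||^2 (Lee-Seung rules).
    Random positive initialisation; eps avoids division by zero."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + 1e-4
    H = rng.random((r, m)) + 1e-4
    eps = 1e-12
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update for H
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update for W
    return W, H

V = np.random.default_rng(1).random((8, 6))    # toy non-negative data matrix
W, H = nmf(V, r=3)
# Columns of H are the extracted features of the corresponding columns of V.
print(np.linalg.norm(V - W @ H) < np.linalg.norm(V - V.mean()))
```

The updates preserve non-negativity by construction, since every factor in them is non-negative.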
3 Feature Extraction Based on Kernel-Based Non-negative Matrix Factorization
In recent years, a number of powerful kernel-based learning machines, e.g. Support Vector Machines (SVMs) [4][5], Kernel Principal Component Analysis (KPCA) [6], and Kernel Fisher Discriminant (KFD) [7], have been proposed. All these nonlinear information processing algorithms are designed by means of linear techniques in an implicit feature space induced by kernel functions. Reproducing kernels are functions k : χ² → R which for all pattern sets {x_1, ..., x_l} ⊂ χ give rise to positive matrices K_{ij} = k(x_i, x_j) [8]. Here, χ is some compact set in which the data lives, typically (but not necessarily) a subset of R^N. Kernel functions provide an elegant way of dealing with nonlinear algorithms by reducing them to linear ones in some feature space F nonlinearly related to the input space: using k instead of a dot product in R^N corresponds to mapping the data into a possibly high-dimensional dot product space F by a (usually nonlinear) map Φ : R^N → F, and taking the dot product there [4]:

    k(x, y) = (Φ(x) · Φ(y))    (8)
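For a kernel with a known finite-dimensional feature map, Eq. (8) can be checked directly. A small sketch using the homogeneous quadratic kernel on R², chosen purely for illustration; the paper later uses a Gaussian kernel, whose feature map is infinite-dimensional:

```python
import numpy as np

def phi(x):
    """Explicit feature map of the homogeneous quadratic kernel on R^2:
    k(x, y) = (x . y)^2 corresponds to Phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k(x, y):
    return float(np.dot(x, y)) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.isclose(k(x, y), np.dot(phi(x), phi(y))))  # → True
```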
By virtue of this property, any linear algorithm which can be carried out in terms of dot products can be made nonlinear by substituting an a priori chosen kernel. We therefore generalize non-negative matrix factorization to kernel-based non-negative matrix factorization. However, in the computation steps of NMF, dot products do not appear explicitly, so we construct a vector mapping that realizes the dot product. The basic idea is as follows. First, we choose a kernel function to map samples in the input space to samples in the feature space implicitly; the benefit of this is to reduce the correlation between samples and find better discriminability in the feature space. For each sample (X_i, y) in the input space, Φ maps it to the high-dimensional feature space as (Φ(X_i), y). If the dot product of two samples from different classes is smaller in the feature space than in the input space, i.e. (Φ(X_i) · Φ(X_j)) < (X_i · X_j), then the discriminability in the feature space is better than that in the input space. We define a new sample vector p_i: for the ith sample Φ(X_i) of the feature space, we compute the dot products of Φ(X_i) with all samples Φ(X_j) and form the vector

    p_i = [(Φ(X_i) · Φ(X_1)), (Φ(X_i) · Φ(X_2)), ..., (Φ(X_i) · Φ(X_m))]^T    (9)
Substituting Equation 8 into the above equation, we obtain the sample vector based on the kernel function:

    p_i = [k(X_i, X_1), k(X_i, X_2), ..., k(X_i, X_m)]^T    (10)
Through the above operations, each n-dimensional sample in the input space becomes an m-dimensional sample in the feature space. That is, in the feature space, the sample set is ((p_1, y_1), (p_2, y_2), ..., (p_m, y_m)) ∈ F × {±1}; we denote it by V_p. Second, we use NMF to factorize the matrix V_p, that is, V_p ≈ W_p H_p. Each unknown sample z is first mapped to the feature space as Φ(z); we then compute all dot products between Φ(z) and the training samples Φ(X_i). Substituting Equation 8, we obtain a new vector P_z = [k(z, X_1), k(z, X_2), ..., k(z, X_m)]^T. All k unknown samples are treated in the same way as sample z, forming a new m × k non-negative matrix V_z. The extracted features are then H_z = W_p^+ V_z (using the pseudo-inverse of W_p, which is not square).
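The two-step construction (kernel-space sample matrix, then NMF) can be sketched as follows. The data sizes, σ, r, and the use of a pseudo-inverse for projecting samples are assumptions made for the example rather than the paper's settings:

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Kernel-space sample matrix of Eq. (10): columns of X are samples and
    entry (i, j) is k(X_i, X_j) under the Gaussian kernel."""
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-d2 / (2 * sigma ** 2))

def knmf_features(Vp, r, iters=200, seed=0):
    """KNMF step: factor Vp ~ Wp Hp with plain multiplicative NMF updates.
    Columns of Hp are the training features; unseen samples are projected
    with the pseudo-inverse of Wp."""
    rng = np.random.default_rng(seed)
    Wp = rng.random((Vp.shape[0], r)) + 1e-4
    Hp = rng.random((r, Vp.shape[1])) + 1e-4
    for _ in range(iters):
        Hp *= (Wp.T @ Vp) / (Wp.T @ Wp @ Hp + 1e-12)
        Wp *= (Vp @ Hp.T) / (Wp @ Hp @ Hp.T + 1e-12)
    return Wp, Hp

X = np.random.default_rng(2).normal(size=(16, 10))  # 10 toy samples of length 16
Vp = gaussian_gram(X, sigma=4.0)                    # 10 x 10, strictly positive
Wp, Hp = knmf_features(Vp, r=4)
Hz = np.linalg.pinv(Wp) @ Vp                        # projected features
print(Vp.min() > 0, Hp.shape)
```

Note that the Gaussian Gram matrix is automatically non-negative, so the kernel mapping also removes the need for the vertical-shift preprocessing applied to the raw signals below.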
4 Classification Based on Linear Support Vector Machine
The nuclear explosion sample set and the natural earthquake sample set each contain 100 samples. We select 50 samples from each class to form the training set; the others form the testing set. Because the length of each sample is n = 1024, the training set matrix V_t and the testing set matrix V_e are both 1024 × 50 matrices. Here, we should note that the training set matrix and the testing set matrix do not satisfy the non-negativity constraints of NMF, so preprocessing must be done to make these two matrices satisfy the constraints. Because each sample signal just indicates the waveform of underground nuclear
explosions and natural earthquakes, a simple preprocessing step is to shift each signal vertically so that its minimum value is non-negative. After that, the two matrices can undergo non-negative matrix factorization and kernel-based non-negative matrix factorization. For the training set matrix V_t, we choose different values of r (r < nm/(n+m) ≈ 91) and then carry out NMF and KNMF. For both NMF and KNMF, the maximum iteration number is set to 500 and the initial matrices are chosen randomly. For KNMF, the Gaussian kernel is chosen as the kernel function, i.e. k(X, Y) = exp(−||X − Y||²/2σ²), and different values of σ are tried. The experimental results (average of 10 experiments) are listed in Table 1 (T: training recognition rate, V: testing recognition rate) and Table 2.

Table 1. Identification results of NMF

r    T%     V%        r    T%     V%
1    70     66.41     49   83.80  67.73
9    77     66.41     59   84.90  67.73
19   79.60  68.28     69   86.60  66.95
25   81.60  68.52     79   87.00  66.02
29   81.80  67.89     89   86.40  67.73
39   82.10  68.05     99   58.50  55.16
Table 2. Identification results of KNMF

σ     r=69    r=79    r=89    r=99
500   86.41%  85.55%  86.64%  85.55%
550   84.06%  86.09%  85.39%  85.78%
600   85.39%  86.02%  87.34%  87.34%
650   86.33%  88.44%  87.11%  87.42%
700   84.53%  86.25%  86.33%  88.05%
750   85.39%  85.55%  87.58%  87.89%
800   84.14%  84.14%  86.17%  86.17%
900   84.23%  83.91%  85.16%  86.02%
From Table 1, we can see that the overall recognition rates are not very good; the best testing recognition rate reaches only 68.52%. The reason may be that the features (coefficient matrix) extracted by NMF capture only the waveform characteristics of each signal and do not reflect the properties relevant for classification effectively. From Table 2, we can see that the overall recognition rates of KNMF are very good and clearly superior to those of NMF. At the same time, we also find that the parameters σ and r noticeably affect the classification results, so to obtain better classification results, we must choose proper σ and r. Besides, because each recognition rate in Table 2 is the average of 10 recognition rates (the initializations are chosen randomly), we list the 10 individual recognition rates in Table 3 for σ = 650 and r = 99.
Table 3. Ten identification results of KNMF when σ = 650 and r = 99

88.28%  87.50%  89.06%  89.06%  66.44%
89.80%  89.84%  88.28%  88.28%  88.28%
From Table 3, we can see that a random initialization of W and H can lead to a bad recognition rate even when all other conditions are the same. The initialization of W and H is therefore a crucial factor for the classification of underground nuclear explosions and natural earthquakes. We may be able to use a clustering method to initialize the matrices W and H; this needs to be discussed in detail in further research.
5 Concluding Remarks
In this paper, NMF is first applied to extract features of underground nuclear explosions and natural earthquakes, and experimental results show that these features are not very effective for classification. A new kernel-based NMF is then proposed and applied to extract features. The experimental results and the corresponding analysis suggest that feature extraction based on KNMF is more effective than that based on NMF.
Acknowledgment This work is supported by National Natural Science Foundation (No. 40274044).
References

1. Douglas, A.: Seismic Source Identification: A Review of Past and Present Research Efforts. Lecture Notes of Dr. David Booth, Zhang Jiajie (2001)
2. Lee, D.D., Seung, H.S.: Learning Parts of Objects by Non-negative Matrix Factorization. Nature 401 (1999) 788-791
3. Lee, D.D., Seung, H.S.: Algorithms for Non-negative Matrix Factorization. Advances in Neural Information Processing Systems 13 (2001) 556-562
4. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A Training Algorithm for Optimal Margin Classifiers. In: Haussler, D. (ed.): Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (1992) 144-152
5. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)
6. Schölkopf, B., Smola, A.J., Müller, K.R.: Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation 10 (1998) 1299-1319
7. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müller, K.R.: Fisher Discriminant Analysis with Kernels. In: Hu, Y.H., Larsen, J., Wilson, E., Douglas, S. (eds.): Neural Networks for Signal Processing IX (1999) 41-48
8. Saitoh, S.: Theory of Reproducing Kernels and Its Applications. Longman, Harlow, U.K. (1998)
Hidden Markov Model Networks for Multiaspect Discriminative Features Extraction from Radar Targets

Feng Zhu, Yafeng Hu, Xianda Zhang, and Deguang Xie

Dept. of Automation, Tsinghua University, Beijing 100084, China
[email protected]
Abstract. This paper presents a new target recognition scheme via a neural network based on the Hidden Markov Model (HMM), which processes multiaspect features. The target features are extracted by the adaptive Gaussian representation (AGR) from a physical point of view. Discrimination results are presented for ISAR radar return signals.
1 Introduction
The radar echo, including the radar range profile, is sensitive to the aspect of the target, so the signal is nonstationary. Therefore, the HMM (Hidden Markov Model) is employed for multiaspect target classification [1],[2]. However, one HMM corresponds to only one class of targets, which results in the loss of the discriminative features among different targets. Neural networks (NN), although they have difficulty dealing with nonstationary signals such as acoustic signals, possess discriminative power and high parallel processing capacity. Recently, HMMs and NNs have been combined for speech recognition [3]. In this paper, an HMM network is proposed for feature extraction and classification. The features extracted by the adaptive Gaussian representation (AGR) are efficient time-frequency features of the target [4]. Furthermore, the target's multiaspect feature, as a spatial feature, is extracted via the HMM. Consequently, the main attribute of our work is space-time-frequency (STF) feature extraction and classification.
2 Feature Extraction Via AGR
The adaptive Gaussian representation (AGR) can decompose the backscattered signal into time-frequency (T-F) centers corresponding to scattering centers and local resonances. AGR expands a backscattered field in the time domain, r(t), in terms of normalized Gaussian elementary functions h_p(t) with an adjustable T-F center (t_p, f_p) and a variance α_p:

    r(t) = Σ_{p=0}^{∞} r_p(t) = Σ_{p=0}^{∞} B_p h_p(t)    (1)
where

h_p(t) = (π α_p)^{−1/4} exp( −(t − t_p)^2 / (2 α_p) ) exp(j 2π f_p t)   (2)

The adjustable parameters t_p, f_p, and α_p of the Gaussian basis functions, and the coefficient B_p, are obtained such that h_p(t) is most similar to r_p(t):

|B_p|^2 = max_{t_p, f_p, α_p} | ∫ r_p(t) h_p^*(t) dt |^2   (3)
where r_0(t) = r(t). r_{p+1}(t) is the remainder after orthogonal projection of r_p(t) onto h_p(t), and this iterative procedure is described as

r_{p+1}(t) = r_p(t) − B_p h_p(t)   (4)
In the actual implementation of the above AGR algorithm, the upper limit of the summation in (1) is limited to P, since the iteration stops once the extracted Gaussian basis functions approximate the time-domain signal r(t) with sufficiently high accuracy. After P stages of AGR decomposition, the following relationships hold:

r(t) = Σ_{p=0}^{P} B_p h_p(t) + r_{P+1}(t)   (5)

and

||r(t)||^2 = Σ_{p=0}^{P} |B_p|^2 + ||r_{P+1}(t)||^2   (6)
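The iteration in (1)–(6) is a greedy, matching-pursuit-style decomposition. A minimal sketch follows; the paper does not specify how the maximization in (3) is carried out, so this sketch simply searches a discrete grid of candidate (t_p, f_p, α_p) values, and all function names are illustrative:

```python
import numpy as np

def gaussian_atom(t, tp, fp, alpha):
    # Normalized Gaussian elementary function h_p(t), as in Eq. (2)
    return ((np.pi * alpha) ** -0.25
            * np.exp(-(t - tp) ** 2 / (2 * alpha))
            * np.exp(2j * np.pi * fp * t))

def agr_decompose(r, t, tp_grid, fp_grid, alpha_grid, n_stages):
    """Greedy AGR decomposition: at each stage pick the atom with the
    largest |<residual, h_p>|^2 over a discrete parameter grid (Eq. (3)),
    then subtract its projection from the residual (Eq. (4))."""
    dt = t[1] - t[0]
    residual = r.astype(complex).copy()
    params, coeffs = [], []
    for _ in range(n_stages):
        best = None
        for tp in tp_grid:
            for fp in fp_grid:
                for alpha in alpha_grid:
                    h = gaussian_atom(t, tp, fp, alpha)
                    h = h / np.sqrt(np.sum(np.abs(h) ** 2) * dt)  # renormalize on the grid
                    B = np.sum(residual * np.conj(h)) * dt        # projection coefficient B_p
                    if best is None or abs(B) > abs(best[0]):
                        best = (B, tp, fp, alpha, h)
        B, tp, fp, alpha, h = best
        residual -= B * h                                         # Eq. (4)
        params.append((tp, fp, alpha))
        coeffs.append(B)
    return params, coeffs, residual
```

When the signal itself is a single Gaussian atom whose parameters lie on the grid, one stage removes essentially all of its energy, consistent with the energy balance in (6).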
Therefore, the AGR iteration in (4) continues until the reconstruction error ||r_{P+1}(t)||^2 is small enough. Finally, the feature vector for target recognition is formed by concatenating the obtained AGR parameters as follows:

y = [t_1, ..., t_P, f_1, ..., f_P, α_1, ..., α_P]   (7)

3 HMM Network
It is assumed that T scattered signals are measured at T distinct target-radar orientations, and that the features associated with the t-th scattered signal are denoted by y_t as in (7). Maximum-likelihood target identification is effected by choosing the target C_i for which

p(y_1, y_2, ..., y_T | C_i) ≥ p(y_1, y_2, ..., y_T | C_k)  for all C_k   (8)

Note that the problem is treated statistically, since the absolute target-radar orientation is hidden and modeled as a stochastic parameter. An HMM is used to evaluate p(y_1, y_2, ..., y_T | C_k). In particular, it is well known that signal scattering from most targets is characterized by angular sectors over which the angle-dependent scattered fields are slowly varying. Each such sector is here termed a "state", and the number of states characteristic of a given target is dictated by the target complexity. In this paper, the continuous HMM is considered, because the discrete HMM suffers from the distortion of feature vector quantization [5].

3.1 The Whole Architecture of the HMM Network
The HMM network that implements Bayes' decision rule to discriminate among K classes, using K HMM models, is shown in Fig. 1. The outputs of the K HMMs, taken after t = T when the entire sequence has been input, are combined by an additional layer. This layer acts as a maximum selector, giving outputs X_k between 0 and 1. Ideally, one of the X_k would be 1 and all the others would be 0, indicating the classification of the sequence. In practice this will not happen exactly, so the decision rule is to pick the largest X_k. In particular, the HMMs in Fig. 1 are transformed into a neural network (NN), which is detailed in the following two subsections.
Fig. 1. The whole architecture of the HMM network. Filled circles indicate summing units; "÷" units divide one input by the other, and Λ is the log-probability of the input sequence; all connections have weight 1.
3.2 The NN Implementation of the HMM
Assume that target C_k is composed of the N states q_1, ..., q_N with transition probabilities P{q_i → q_j} = a_ij. Let s(t) denote the state at time t. At each t = 1, ..., T, one of the observations y_1, y_2, ..., y_T from state q_i is generated with probability P{y_t | s(t) = q_i} = d_{i y_t}, which has a Gaussian mixture distribution:

d_{i y_t} = P{y_t | s(t) = q_i} = Σ_{l=1}^{L} b_li g_l(y_t),   Σ_{l=1}^{L} b_li = 1   (9)
where g_l(y_t) is a Gaussian distribution. Given an HMM model m with initial distribution π_i = P{s(0) = q_i}, the probability of a sequence of observations y_1, y_2, ..., y_T can be calculated by the forward-backward algorithm. For 1 ≤ t ≤ T, let α_i(t) be the "forward probability" p(y_1, ..., y_t, s(t) = q_i), and β_i(t) be the "backward probability" p(y_{t+1}, ..., y_T | s(t) = q_i). Then α_i(t) and β_i(t) can be calculated recursively,
Fig. 2. The HMM net, the neural network implementation of an HMM. g_l(y_t) is one of the Gaussian mixture components in (9); "⊗" units multiply two inputs together. The A and B matrices represent connection weights; all other connections have weight 1. The network is initialized by α_i(0) = π_i. The mean vector and covariance matrix of every Gaussian component can be calculated by the K-means algorithm.
α_i(t) = Σ_{j=1}^{N} α_j(t−1) a_ji d_{i y_t},   α_i(0) = π_i,   (10)

β_i(t) = Σ_{j=1}^{N} a_ij d_{j y_{t+1}} β_j(t+1),   β_i(T) = 1,

and the probability of the entire observation sequence under HMM model m is

P(y_1, ..., y_T | C_m) = Σ_{i=1}^{N} α_i(t) β_i(t)   (11)
for any t between 1 and T . The straightforward implemention of (10) and (11) suffers from numerical underflow of αi (t) and βi (t). A standard solution is to compute normalized variables α i (t) = αi (t)/ αi (t) and βi (t) = βi (t)/ αi (t). Then Pm (y1 , · · · , yT ) is i i computed recursively, as a logarithm Λm = log Pm to avoid underflow. After some manipulation, (10) and (11) become α j (t − 1)aji diyt j α i (t) = , αi (0) = πi , (12) α j (t − 1)aji diyt i,j aij diyt+1 βj (t + 1) j βi (t) = , βi (T ) = 1, aij diyt+1 βj (t + 1) i,j
and Λm (y1 , · · · , yt ) = P (y1 , · · · , yT |Cm ) = Λm (y1 , · · · , yt−1 ) + log
i,j
(13) α j (t − 1)aji diyt
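The scaled recursion (12)–(13) can be sketched as follows. This is a hedged illustration, not the paper's exact network: the Gaussian mixture emissions of (9) are replaced by single Gaussians, and the classifier layer of Fig. 1 is modeled as a simple normalization of the per-class likelihoods (with a max-shift for numerical safety); all names are illustrative:

```python
import numpy as np

def log_likelihood(obs, pi, A, means, variances):
    """Scaled forward recursion, Eqs. (12)-(13): propagate normalized
    forward variables and accumulate Lambda = log P(y_1..y_T | model).
    Emissions are single Gaussians standing in for the mixture in (9)."""
    alpha = pi.copy()                # alpha_i(0) = pi_i
    Lam = 0.0
    for y in obs:
        d = np.exp(-(y - means) ** 2 / (2 * variances)) / np.sqrt(2 * np.pi * variances)
        a = (alpha @ A) * d          # unnormalized: sum_j alpha_j(t-1) a_ji d_{i y_t}
        s = a.sum()
        Lam += np.log(s)             # Eq. (13): add log of the normalizer
        alpha = a / s                # Eq. (12): renormalize to avoid underflow
    return Lam

def classify(obs, models):
    """Fig. 1 style combination: per-class log-likelihoods turned into
    normalized outputs X_k that sum to 1."""
    lams = np.array([log_likelihood(obs, *m) for m in models])
    lik = np.exp(lams - lams.max())  # shift by the max for numerical safety
    return lik / lik.sum()
```

Multiplying the per-step normalizers reproduces exactly the unscaled total probability of (11), which is easy to verify on a short sequence.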
An architecture for a recurrent neural network that implements (12) and (13) to compute Λ_m(y_1, ..., y_t) is shown in Fig. 2. It should be emphasized that the L × N matrix B in Fig. 2 represents the mixture coefficients b_li in (9), where L is the number of Gaussian mixture components.

3.3 The Training of the HMM Network
It is assumed that there is a set of training data {Y_1, ..., Y_{N_Train}}, where each Y_n is a sequence y_1, ..., y_{T_n} from class C_n. The training algorithm minimizes the MSE criterion function G with respect to the HMM network parameters, i.e., it minimizes the mean squared error of the network outputs from the ideal targets X_{C_n}(n) = 1 and X_{k≠C_n}(n) = 0, where X_k(n) is output k of the classifier network in Fig. 1 for input sequence Y_n. Let λ_k(n) and Λ_k(n) be the probability and log-probability, respectively, of training sequence Y_n given the k-th class model. Then the criterion function can be written as

G = (1/2) Σ_n Σ_k ( X_k(n) − δ_{k,C_n} )^2   (14)
where δ_{k,C_n} is the Kronecker delta. Gradient descent is used to search for the parameters of the HMM network. Taking a_ij as an example of one of the parameters (it could just as well be a b_li or π_i), the derivative can be expanded in terms of the individual class log-likelihoods,

∂G/∂a_ij = Σ_{n=1}^{N_Train} Σ_{k=1}^{K} φ_{n,k} ∂Λ_k(n)/∂a_ij   (15)
where

φ_{n,k} = ∂G/∂Λ_k(n) = X_{C_n}(n) ( δ_{k,C_n} − X_k(n) ) − Σ_{k'} X_{k'}(n) ( δ_{k,k'} − X_k(n) )   (16)
and the derivative ∂Λ_k(n)/∂a_ij can be computed recursively by using (12) and (13). Most important, the sum-to-one constraints on the HMM net parameters are relaxed even though they are probabilities [3]. Furthermore, a negative parameter value corresponds to allowing inhibition in the neural network, although such a value is meaningless as a probability.
4 Experimental Results
In this example, the range profiles of three kinds of actual airplanes (Yak-42, An-26, and Cessna S/II), measured by an experimental ISAR system in the field, are used. Each aircraft has 1000 range profiles in total. The range profile set for each aircraft is subdivided into ten subsets by enlarging the sampling interval by a factor of 10.
Table 1. Recognition rates (%) of the HMM and the HMM network (HN)

Classification Method | Yak-42 | An-26 | Cessna S/II
HMM                   | 94.2   | 92.1  | 91.5
HN                    | 96.6   | 95.1  | 95.4
Namely, every target has 10 subsets, each containing 100 range profiles with a sampling interval of 10. The first subset is selected as the training set, and all 10 subsets are used as test sets. After feature extraction as in Section 2 and training of the HMM network as in Section 3, the test sets are classified; the classification results are shown in Table 1. From Table 1, we can see that the recognition rate achieved by the HMM network is higher than that of the HMM.
5 Conclusion
An HMM network has been developed for multiaspect target classification, which extracts discriminative space-time-frequency features by the adaptive Gaussian representation. The HMM network, combining the HMM with a neural network, has the advantages of processing nonstationary signals as well as strong discriminative power. The experiments with radar data from three aircraft show that satisfactory recognition can be achieved by the proposed scheme.
Acknowledgement. This work was supported by the Aerospace Supporting Technology Foundation of China under Grant 514770501. Feng Zhu and Yafeng Hu are the first co-authors.
References
1. Paul, R.R., Priya, K.B.: Hidden Markov Models for Multiaspect Target Classification. IEEE Trans. Signal Processing 47(7) (1999) 2035-2040
2. Bingnan, P., Zheng, B.: Radar Target Recognition Based on Peak Location of HRR Profile and HMMs Classifiers. In: Proc. Radar (2002) 414-418
3. Edmondo, T., Marco, G.: Robust Combination of Neural Networks and Hidden Markov Models for Speech Recognition. IEEE Trans. Neural Networks 14(11) (2003) 1519-1531
4. Chen, V.C., Ling, H.: Joint Time-Frequency Analysis for Radar Signal and Image Processing. IEEE Signal Processing Mag. 16(3) (1999) 81-93
5. Yanting, D., Lawrence, C.: Rate-Distortion Analysis of Discrete-HMM Pose Estimation via Multiaspect Scattering Data. IEEE Trans. Pattern Analysis and Machine Intelligence 25(7) (2003) 872-883
Application of Self-organizing Feature Neural Network for Target Feature Extraction Dong-hong Liu1, Zhi-jie Chen2, Wen-long Hu2, and Yong-shun Zhang1 1
Air Force Engineering University, Sanyuan, Shanxi 713800, China {ldhwrite, zhyshwrite}@163.com 2 Equipment Academy of Air Force, Beijing 100085, China {chzhjwrite, hwlwrite}@163.com
Abstract. An effective method, an extended Prony algorithm based on the self-organizing feature map neural network, is introduced for radar target feature extraction. The method is a modification of the classical Prony method based on singular value decomposition and the excellent classification capability of the self-organizing feature map neural network, which can improve the robustness of spectrum estimation. Simulation results show that the poles and residues of a target echo can be extracted effectively using this method, while random noise is suppressed to some degree. It is applicable to target feature extraction for UWB radar and other high-resolution radars.
1 Introduction

There are various methods of target recognition using radar. Invariant features of the target need to be extracted during target recognition. The poles of a target are regarded as invariant features that can reflect the size and shape of the target. The Prony algorithm [1-3], the pencil-of-function (POF) method [4], and the E-pulse method [5] are common methods for extracting target poles. Unfortunately, when the radar works in low-SNR conditions, these methods cannot be used because their anti-jamming capability is very poor; misjudgments frequently occur when extracting poles. Therefore, improving anti-jamming capability is an important problem that must be solved first in the process of pole extraction. The Prony algorithm is described in reference [6] by Guo Gui-rong, where the following conclusion is drawn: the Prony algorithm is sensitive to noise, its accuracy is low when the SNR is less than 20 dB, and the SNR must be more than 30 dB to achieve satisfactory accuracy. Neural networks are extraordinary at serial data processing because of their many advantages, such as self-learning, self-adaptation, and self-organization. The rapid development of computer technology has made neural networks increasingly common in target recognition [7] and pattern recognition [8]. An SOFM (self-organizing feature map) extended Prony algorithm is put forward in this paper, which can extract radar target echo poles exactly even when SNR = 5 dB, and greatly reduces the sensitivity of the Prony algorithm to noise.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1412 – 1420, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Self-organizing Feature Map Neural Network

The self-organizing feature map (SOFM) neural network is one of the most fascinating topics in the neural network field. Such networks can learn to detect regularities and correlations in their input and adapt their future responses to that input accordingly. The neurons of competitive networks learn to recognize groups of similar input vectors. Self-organizing maps learn to recognize groups of similar input vectors in such a way that neurons physically near each other in the neuron layer respond to similar input vectors. A block diagram is shown in Fig. 1(a), and the structure of an SOFM with a one-dimensional input vector is shown in Fig. 1(b). The number of input nodes n is equal to the number of sample data, the number of output nodes m is equal to the number of classes, and w_ij is a weight value. When the poles and their number J are extracted using this network, the singular values forming the input vector are classified into two parts at the output: one part represents signal, and the other part represents noise. Thus, the serial number of the singular value located at the boundary is the value J.

Fig. 1. The structure of the SOFM network: (a) two-dimensional output; (b) one-dimensional output
The steps of the SOFM algorithm [9] are:

1) Initialize the weight values and select the size of the neighborhoods;
2) Present an input pattern;
3) Calculate the spatial distance d_j = Σ_{i=0}^{N−1} [x_i(t) − w_ij(t)]^2, where x_i(t) is the input of node i at time t, w_ij(t) is the connection weight between nodes i and j, and N is the number of input nodes;
4) Select the node j* that satisfies min_j d_j;
5) Adjust j* and w_ij(t) according to Eq. (1):

w_ij(t+1) = w_ij(t) + η(t)[x_i(t) − w_ij(t)]   (1)

where j ranges over the neighborhood of j*, 0 ≤ i ≤ N − 1, and η(t) is a decay factor;
6) Return to step 2) until [x_i(t) − w_ij(t)]^2 < ε, where ε is the error tolerance.
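The steps above can be sketched as a one-dimensional SOFM with a shrinking neighborhood, followed by the signal/noise split used later in Section 3. The epoch schedule, learning-rate decay, and function names below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def train_sofm(samples, n_out, epochs=50, eta0=0.5, seed=0):
    """1-D SOFM: competitive learning with a shrinking neighborhood,
    following steps 1)-6). `samples` are scalar inputs (e.g. singular values)."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(samples.min(), samples.max(), n_out)   # step 1: init weights
    for epoch in range(epochs):
        eta = eta0 * (1 - epoch / epochs)                  # decaying learning rate eta(t)
        radius = max(0, int(round((n_out // 2) * (1 - epoch / epochs))))
        for x in samples:                                  # step 2: present input
            j_star = np.argmin((x - w) ** 2)               # steps 3-4: winner by distance
            lo, hi = max(0, j_star - radius), min(n_out, j_star + radius + 1)
            w[lo:hi] += eta * (x - w[lo:hi])               # step 5: Eq. (1) update
    return w

def split_signal_noise(singular_values, w):
    """Assign each singular value (given in descending order) to its nearest
    neuron; the boundary index J separates the 'signal' cluster from the
    'noise' cluster, as described in Section 2."""
    labels = np.array([np.argmin((s - w) ** 2) for s in singular_values])
    signal_label = labels[0]            # sigma_1 is the largest -> signal cluster
    return int(np.sum(labels == signal_label))
```

With two output neurons, the network settles into one neuron near the large (signal) singular values and one near the small (noise) ones, and the count of values assigned to the signal neuron gives J.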
3 Derivation of the SOFM Extended Prony Algorithm

The SOFM extended Prony algorithm assumes a signal model in the form of an exponential function consisting of p amplitudes, p phases, p frequencies, and p decay factors, where the order p is greater than the true value M. The discrete form of the model can be written as

ŷ(k) = Σ_{m=1}^{p} c_m z_m^k,  k = 0, 1, 2, ..., N − 1   (2)

Generally, c_m and z_m are complex numbers, and

c_m = A_m exp(j θ_m)
z_m = exp[(α_m + j 2π f_m) Δt]   (3)
where A_m, θ_m, α_m, f_m, and Δt are the amplitude, phase (in radians), decay factor, frequency, and sampling interval. In order to calculate {A_m, θ_m, α_m, f_m}, the squared error must be minimized:

ε = Σ_{k=0}^{N−1} |y(k) − ŷ(k)|^2   (4)
Eq. (2) is a homogeneous solution of a constant-coefficient linear difference equation. In order to solve this linear difference equation, we define

ψ(z) = Π_{k=1}^{p} (z − z_k) = Σ_{i=0}^{p} a_i z^{p−i} = 0,  a_0 = 1   (5)
From (2), we obtain

ŷ(k − m) = Σ_{l=1}^{p} b_l z_l^{k−m},  0 ≤ k − m ≤ N − 1   (6)
The sum of the p + 1 products of a_m and (6) is given by

Σ_{m=0}^{p} a_m ŷ(k − m) = Σ_{l=1}^{p} b_l Σ_{m=0}^{p} a_m z_l^{k−m} = Σ_{l=1}^{p} b_l z_l^{k−p} Σ_{m=0}^{p} a_m z_l^{p−m} = 0   (7)
The last factor in (7) is equal to ψ(z_l) in (5), which makes the result of (7) zero, so (7) yields the recursive discrete equation

ŷ(k) = − Σ_{m=1}^{p} a_m ŷ(k − m),  p ≤ k ≤ N − 1   (8)
e(k) is defined as the error between the measured data y(k) and the approximate data ŷ(k); thus

y(k) = ŷ(k) + e(k),  p ≤ k ≤ N − 1   (9)
Substituting (8) into (9) produces

y(k) = − Σ_{m=1}^{p} a_m ŷ(k − m) + e(k) = − Σ_{m=1}^{p} a_m y(k − m) + Σ_{m=0}^{p} a_m e(k − m)   (10)
Eq. (10) shows that the exponential function in white noise is an ARMA(p, p) model with identical AR and MA parameters. The least-squares parameter estimate minimizes Σ_{k=p}^{N−1} |e(k)|^2; thus, (10) is modified into

y(k) = − Σ_{m=1}^{p} a_m y(k − m)   (11)

The matrix form of (11) is

Y a = − y   (12)
where

Y = [ y(p−1)   y(p−2)   ...  y(0)
      y(p)     y(p−1)   ...  y(1)
      ...      ...      ...  ...
      y(N−2)   y(N−3)   ...  y(N−p−1) ]

a = [a_1, a_2, ..., a_p]^T,  y = [y(p), y(p+1), ..., y(N−1)]^T
Compute the singular value decomposition of the matrix Y:

Y = U Σ V^H   (13)
where (⋅)^H denotes the complex conjugate transpose of a matrix, U and V are unitary matrices, and Σ is a diagonal matrix:

Σ = diag(σ_1, σ_2, ..., σ_J, σ_{J+1}, ..., σ_p),  σ_1 ≥ σ_2 ≥ σ_3 ≥ ... ≥ σ_p ≥ 0   (14)
Fig. 2. The model of the SOFM network: input p (R × 1) feeds a self-organizing map layer with weight matrix W (S × R), net input n_i = −||w_i − p||, and competitive output a = compet(n)
If the measured data were obtained under noise-free conditions, then σ_{J+1} = ... = σ_p = 0, and the matrix Y would contain M nonzero singular values. Since noise interference exists in actual measurements, σ_{J+1}, ..., σ_p ≠ 0. Because different targets are affected by noise during measurement, and the order of the model (that is, the number of scattering centers) is not known in advance, the value of J must be judged exactly, so that the matrix Σ' contains only J nonzero singular values. In order to obtain these nonzero singular values σ_1, σ_2, ..., σ_J, the p singular values of (14) are input into the SOFM network shown in Fig. 2. Nerve cells with the same function are assembled together after the input data are arranged and modulated by the SOFM network, so the singular values forming the input vector are classified into two parts at the output: one part represents signal, and the other part represents noise. Thus, the serial number of the singular value located at the boundary is the value J. Then, setting σ_{J+1} = ... = σ_p = 0, we obtain

Σ' = diag(σ_1, σ_2, ..., σ_J, 0, ..., 0)   (15)
Thus, the processed measurement matrix Y' can be written as

Y' = U Σ' V^H   (16)

Combining (12) and (16), we obtain

a = −(Y')^+ y   (17)
where (⋅)^+ denotes the Moore-Penrose pseudoinverse of a matrix. The estimated values â_m of a_m can be obtained according to (17). Build the polynomial
Â(z) = 1 + â_1 z^{−1} + ... + â_J z^{−J} = 0   (18)

The poles z_m, m = 1, 2, ..., J, can be obtained from (18). Thus, (2) is simplified to a linear equation in c_m, whose matrix form is

P c = ŷ   (19)

where

P = [ 1          1          ...  1
      z_1        z_2        ...  z_J
      ...        ...        ...  ...
      z_1^{N−1}  z_2^{N−1}  ...  z_J^{N−1} ]

c = [c_1, c_2, ..., c_J]^T,  ŷ = [ŷ(0), ŷ(1), ..., ŷ(N−1)]^T
where P is an N-by-J Vandermonde matrix of full column rank (because the z_m are distinct). So the vector c is

c = [P^H P]^{−1} P^H ŷ   (20)

or

c = P^+ ŷ   (21)
According to the above approach, c_m and z_m can be estimated.
4 Computer Simulation

The SOFM extended Prony algorithm described above has been analyzed in theory; the validity of extracting target poles and residues using this algorithm is now verified via computer simulation. Assume the signal model is

y(n) = Σ_{i=1}^{8} a_i exp(p_i n),  n = 0, 1, ..., N − 1   (22)

where

p_{1,2} = −0.03 ± 0.2π i,  p_{3,4} = −0.04 ± 0.3π i,  p_{5,6} = −0.05 ± 0.4π i,  p_{7,8} = −0.06 ± 0.5π i   (23)

a_i = 1,  i = 1, 2, ..., 8
When SNR = 5 dB, the original signal y(n) is submerged in noise (see Fig. 3(a)). In order to extract the target echo features and eliminate the effect of noise, the target poles and residues are first extracted using the SOFM extended Prony algorithm, and the signal waveform is then rebuilt from those poles and residues. The rebuilt signal waveform is shown in Fig. 3(b).

Fig. 3. Rebuilding a signal using the SOFM extended Prony algorithm: (a) original signal and signal with noise (SNR = 5 dB); (b) original signal and rebuilt signal

Through 20 Monte Carlo trials, the poles estimated using the SOFM extended Prony algorithm and the true poles are shown in Fig. 4 for SNR = 20 dB. Figs. 3 and 4 show that the poles of the target can be exactly extracted using the SOFM extended Prony algorithm against a strong noise background, and the echo can also be rebuilt. In order to compare the degree of approximation between the poles and residues extracted via the SOFM extended Prony algorithm and their true values, the extracted values are shown in Table 1 for SNRs of 25 dB, 15 dB, 5 dB, and 0 dB (the imaginary parts of the poles in the table should be multiplied by π; this factor is omitted for ease of comparison). Table 1 and Fig. 4 show that the estimated values converge to their true values under high-SNR conditions (e.g., SNR = 20 dB). The estimated values depart slightly from their true values under low-SNR conditions, but their frequency values remain close to the true values; only the decay factors show error. So the target's natural resonance frequencies can be determined from these estimates, and the radar target can also be recognized using the estimated poles at low SNR.

Table 1. Extracted poles and residues
         25 dB                15 dB                5 dB                 0 dB
p_1,2    −0.0301 ± 0.2001i    −0.0304 ± 0.2004i    −0.0317 ± 0.2012i    −0.0337 ± 0.2018i
         1.0015 ∓ 0.0039i     1.0005 ∓ 0.0118i     1.0237 ∓ 0.0319i     1.0548 ∓ 0.0450i
p_3,4    −0.0401 ± 0.3000i    −0.0403 ± 0.2999i    −0.0417 ± 0.2998i    −0.0443 ± 0.2999i
         0.9989 ± 0.0039i     0.9974 ± 0.0119i     0.9976 ± 0.0349i     1.0051 ± 0.0558i
p_5,6    −0.0497 ± 0.4002i    −0.0493 ± 0.4005i    −0.0491 ± 0.4015i    −0.0508 ± 0.4026i
         0.9967 ∓ 0.0088i     0.9909 ∓ 0.0272i     0.9830 ∓ 0.0802i     0.9915 ∓ 0.1354i
p_7,8    −0.0595 ± 0.4999i    −0.0584 ± 0.4998i    −0.0556 ± 0.4995i    −0.0538 ± 0.4996i
         0.9982 ∓ 0.0022i     0.9943 ∓ 0.0068i     0.9858 ∓ 0.0218i     0.9896 ∓ 0.0397i

(Each pole row is followed by the corresponding residue row.)
Fig. 4. Pole distribution diagram

Fig. 5. Chart of MSE at different SNRs
In order to describe the degree of approximation between the rebuilt waveforms and the original signal waveforms at different SNRs, and to weigh the validity of radar target recognition based on the SOFM extended Prony algorithm, the mean square error (MSE) is defined as

MSE = (1/N) Σ_{n=0}^{N−1} |y(n) − ŷ(n)|^2   (24)

where ŷ(n) is rebuilt from the natural resonance frequencies and residues of the estimated signal, and P_s is the signal energy. From (24), one sees that the MSE embodies the degree of approximation between the rebuilt waveforms and the original signal waveforms. The MSE at different SNRs is shown in Fig. 5.
5 Conclusion

A method for extracting radar target features based on a modified Prony algorithm, the SOFM extended Prony algorithm, has been put forward, and computer simulation has been carried out on the foundation of the theoretical analysis. The simulation results show that the SOFM extended Prony algorithm greatly reduces the sensitivity of the Prony algorithm to noise, breaking through the limit that makes the accuracy of the classical Prony algorithm very low when the SNR is less than 20 dB. The algorithm put forward in this paper can estimate signal parameter values and rebuild signal waveforms accurately; the mean square error satisfies MSE < 0.2% when SNR > 9 dB (see Fig. 5). This shows that the algorithm can effectively extract the inherent features of UWB radar target echoes and can suppress Gaussian random noise to a great extent.
References
1. Wang, J.: High Resolution Parametric Modeling for Two-Dimensional Radar Target Using Prony Algorithm. Journal of Electronics 17(1) (2000) 38-45
2. Nam, S., Kang, S., Park, J.: An Analytic Method for Measuring Accurate Fundamental Frequency Components. IEEE Trans. Power Delivery 17(1) (2002) 405-411
3. Kim, K.T., Seo, D.K., Kim, H.T.: Radar Target Identification Using One-Dimensional Scattering Centres. IEE Proc. Radar, Sonar Navig. 148(5) (2001) 285-296
4. Wang, Y., Longstaff, I., Leat, C., Shuley, N.: Complex Natural Resonances of Conducting Planar Objects Buried in a Dielectric Half-Space. IEEE Trans. Geoscience and Remote Sensing 39(6) (2001) 1183-1189
5. Han, H., Chen, Z., Zhuang, Z.: Unconstrained E-Pulse Synthesis Method. Journal of Systems Engineering and Electronics 21(10) (2002) 23-26
6. Guo, G., Zhuang, Z., Chen, Z.: Electromagnetic Feature Extraction and Target Recognition. National University of Defense Technology, Changsha, China (1996)
7. Wang, L., Sandor, Z., Nasrabadi, N.: Automatic Target Recognition Using a Feature-Decomposition and Data-Decomposition Modular Neural Network. IEEE Trans. Image Processing 7(8) (1998) 1113-1121
8. McConaghy, T., Leung, H., Bossé, E., Varadan, V.: Classification of Audio Radar Signals Using Radial Basis Function Neural Networks. IEEE Trans. Instrumentation and Measurement 52(6) (2003) 1771-1779
9. Howard, D., Beale, M.: Neural Network Toolbox User's Guide. The MathWorks, Inc. (2001)
Divergence-Based Supervised Information Feature Compression Algorithm Shi-Fei Ding1,2 and Zhong-Zhi Shi2 1 College of Information Science and Engineering, Shandong Agricultural University, Taian 271018, P.R. China [email protected] 2 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, P.R. China [email protected]
Abstract. In this paper, a novel supervised information feature compression algorithm based on divergence is set up. Firstly, according to information theory, the concept and properties of the divergence, i.e., average separability information (ASI), are studied, and the concept of symmetric average separability information (SASI) is proposed; it is proved that the SASI is a kind of distance measure, i.e., the SASI satisfies the three requirements of the distance axioms, and can be used to measure the degree of difference in a two-class problem. Secondly, based on the SASI, a compression theorem is given, which can be used to design an information feature compression algorithm. Based on these discussions, we design a novel supervised information feature compression algorithm based on the SASI. Finally, experimental results demonstrate that the algorithm is valid and reliable.
1 Introduction

Pattern recognition (PR) is the scientific discipline whose goal is the classification of objects into a number of categories or classes. It includes four stages: information acquisition; feature compression, or feature extraction and selection; classifier design; and system evaluation. Feature compression plays an important part in the PR system and affects several aspects of PR, such as accuracy, required learning time, and the necessary number of samples [1,2]. One might expect that the inclusion of increasing numbers of features would increase the likelihood of including enough information to separate the class volumes. Unfortunately, this is not true if the size of the training data set does not also increase rapidly with each additional feature included. This is the so-called "curse of dimensionality" [3,4]. In order to choose a subset of the original features by reducing irrelevant and redundant ones, many feature selection algorithms have been studied. The literature contains several studies on feature selection for unsupervised learning, in which the objective is to search for a subset of features that best uncovers "natural" groupings (clusters) in data according to some criterion. Principal components analysis (PCA) is an unsupervised feature extraction method that has been successfully applied in the areas of face recognition, feature extraction, and feature analysis [5]. The PCA method is effective for dealing with small-sample, high-dimensional problems, and has found

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1421 – 1426, 2006. © Springer-Verlag Berlin Heidelberg 2006
extensive application in Eigenfaces and feature extraction [6]. In high-dimensional cases, it is very difficult to compute the principal components directly. Fortunately, the Eigenfaces algorithm artfully avoids this difficulty by virtue of the singular value decomposition technique: the problem of calculating the eigenvectors of the total covariance matrix, a high-dimensional matrix, is transformed into the problem of calculating the eigenvectors of a much lower-dimensional matrix. In this paper, based on information theory [7], the authors continue to study the supervised information feature compression problem. In Section 2, we study and discuss divergence theory and provide the definitions of average separability information (ASI) and symmetric average separability information (SASI). In Section 3, we give and prove a compression theorem; on the basis of this theorem, we design an algorithm for supervised information feature compression based on the SASI. A computer experiment is given in Section 4, and the experimental results indicate that the proposed algorithm is efficient and reliable.
2 Divergence Theory

Let ω_i, ω_j be the two classes to which our patterns belong. In the sequel, we assume that the prior probabilities P(ω_i), P(ω_j) are known. This is a very reasonable assumption, because even if they are not known, they can easily be estimated from the available training feature vectors. Indeed, if N is the total number of available training patterns, and N_1, N_2 of them belong to ω_i and ω_j, respectively, then P(ω_i) ≈ N_1/N, P(ω_j) ≈ N_2/N. The other statistical quantities assumed to be known are the class-conditional probability density functions p(x|ω_i), p(x|ω_j), describing the distribution of the feature vectors in each of the classes. Then the log-likelihood ratio function is defined as

D_ij(x) = log [ p(x|ω_i) / p(x|ω_j) ]   (1)
This can be used as a measure of the separability information of class ω_i with respect to ω_j. Clearly, for completely overlapped classes we get D_ij(x) = 0. Since x takes different values, it is natural to consider the average value over class ω_i; the definition of the average separability information (ASI) is
D_ij = E[D_ij(x)] = ∫ p(x|ω_i) D_ij(x) dx = ∫ p(x|ω_i) log [ p(x|ω_i) / p(x|ω_j) ] dx   (2)
where E denotes mathematical expectation. It is not difficult to see that D_ij, i.e., the ASI, is always non-negative, and is zero if and only if p(x|ω_i) = p(x|ω_j). However, it is not a true distance measure, since it is not symmetric and does not satisfy the
triangle inequality. Nonetheless, it is often useful to think of the ASI as a separability measure for class ω_i. Similar arguments hold for class ω_j, and we define

D_ji = E[D_ji(x)] = ∫ p(x|ω_j) D_ji(x) dx = ∫ p(x|ω_j) log [ p(x|ω_j) / p(x|ω_i) ] dx   (3)
In order to make the ASI a true distance measure between the distributions of classes ω_i and ω_j with respect to the adopted feature vector x, we improve it to the symmetric average separability information (SASI), denoted by S(i,j), i.e.,

S(i,j) = D_ij + D_ji = ∫ [p(x|ω_i) − p(x|ω_j)] log [ p(x|ω_i) / p(x|ω_j) ] dx   (4)
About the SASI, we give the following theorem.

Theorem 1. The SASI S(i,j) satisfies the following basic properties:

1) Non-negativity: S(i,j) ≥ 0, with S(i,j) = 0 if and only if p(x|ω_i) = p(x|ω_j);
2) Symmetry: S(i,j) = S(j,i);
3) Triangle inequality: suppose that ω_k is another class with class-conditional probability density function p(x|ω_k), with respect to the adopted feature vector x, describing the distribution of the feature vectors in class ω_k; then

S(i,j) ≤ S(i,k) + S(k,j)   (5)
Proof. According to the definition of the ASI, properties 1) and 2) hold obviously. We now prove property 3). Based on formulae (2), (3) and (4), we have

\[
\begin{aligned}
S(i,k) + S(k,j) - S(i,j)
&= \int_x \bigl[p(x\mid\omega_i) - p(x\mid\omega_k)\bigr]\log\frac{p(x\mid\omega_i)}{p(x\mid\omega_k)}\,dx
 + \int_x \bigl[p(x\mid\omega_k) - p(x\mid\omega_j)\bigr]\log\frac{p(x\mid\omega_k)}{p(x\mid\omega_j)}\,dx \\
&\quad - \int_x \bigl[p(x\mid\omega_i) - p(x\mid\omega_j)\bigr]\log\frac{p(x\mid\omega_i)}{p(x\mid\omega_j)}\,dx \\
&= \int_x p(x\mid\omega_i)\log\frac{p(x\mid\omega_j)}{p(x\mid\omega_k)}\,dx
 + \int_x p(x\mid\omega_j)\log\frac{p(x\mid\omega_i)}{p(x\mid\omega_k)}\,dx \\
&\quad + \int_x p(x\mid\omega_k)\log\frac{p(x\mid\omega_k)}{p(x\mid\omega_j)}\,dx
 + \int_x p(x\mid\omega_k)\log\frac{p(x\mid\omega_k)}{p(x\mid\omega_i)}\,dx \;\ge\; 0
\end{aligned}
\]
which is the triangle inequality. Therefore, the SASI is a true distance measure, which can be used to measure the degree of difference between two random variables. We take the SASI as the separability criterion of the classes for information feature compression. The smaller the SASI is, the smaller the difference between the two groups of data; in particular, when the SASI is zero, the two groups of data are completely identical.

1424

S.-F. Ding and Z.-Z. Shi

For information feature compression, given a reduced dimensionality d, we should select d features so that the SASI is as large as possible. For convenience, we can use the following function H(i, j) in place of S(i, j), to which it is equivalent as a separability criterion:
\[
H(i,j) = \int_x \bigl[p(x\mid\omega_i) - p(x\mid\omega_j)\bigr]^2\,dx \tag{6}
\]
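Discretized, the criterion of Eq. (6) is just the integrated squared difference between the two class-conditional histograms. A minimal sketch (names illustrative, not from the paper):

```python
import numpy as np

def h_criterion(p, q):
    """Separability criterion H(i, j) of Eq. (6): integrated squared
    difference of two discretized class-conditional densities."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum((p - q) ** 2))

# Like the SASI, H is zero for identical class distributions and grows
# as the two class distributions separate.
print(h_criterion([0.1, 0.4, 0.5], [0.1, 0.4, 0.5]))  # 0.0
print(h_criterion([0.1, 0.4, 0.5], [0.5, 0.4, 0.1]))  # close to 0.32
```

H avoids the logarithm of the SASI, which is what makes the eigen-analysis of the next section tractable.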
3 Supervised Information Feature Compression Algorithm

3.1 Compression Theorem
Based on the discussion above, and in order to construct the divergence-based supervised information feature compression algorithm, a compression theorem is given as follows [8].

Theorem 2. Suppose {X_j^(1)} (j = 1, 2, …, N_1) and {X_j^(2)} (j = 1, 2, …, N_2) are square-normalized feature vectors belonging to classes C1 and C2, with covariance matrices G^(1) and G^(2) respectively. Then the SASI criterion H(i, j) is maximal if and only if the coordinate system is composed of the d eigenvectors corresponding to the first d eigenvalues of the matrix A = G^(1) − G^(2).

3.2 Design of Algorithm
According to Theorem 2, a supervised information feature compression algorithm based on the SASI is derived as follows.

Step 1: Carry out square-normalization processing on the original data of the two classes, obtaining the data matrices X^(1) and X^(2).

Step 2: Centralize the data matrices X^(1) and X^(2), then calculate the symmetric matrices G^(1), G^(2) and the difference matrix A = G^(1) − G^(2).

Step 3: Calculate all eigenvalues and corresponding eigenvectors of the matrix A by the Jacobi method.

Step 4: Arrange the squared eigenvalues λ_i² (i = 1, 2, …, n) of A as λ_1² ≥ λ_2² ≥ … ≥ λ_d² ≥ … ≥ λ_n². Denote the total sum of variance squares by V_n = Σ_{k=1}^{n} λ_k² and the partial sum by V_d = Σ_{k=1}^{d} λ_k²; the variance square ratio (VSR) is then VSR = V_d / V_n. The VSR value can be used to measure the degree of information compression. Generally speaking, once VSR ≥ 80% the purpose of feature compression is reached; we may then select the d eigenvectors u_1, u_2, …, u_d corresponding to the first d eigenvalues and construct the information compression matrix T = (u_1, u_2, …, u_d).

Step 5: Compress the data matrices X^(1), X^(2) by the transformation Y^(i) = T′X^(i) (i = 1, 2). Thus the purpose of compressing the data information is attained.
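The five steps above can be sketched in NumPy as follows. This is a minimal sketch under my reading of the text: "square normalization" is taken to mean scaling each sample vector to unit length, NumPy's symmetric eigensolver stands in for the Jacobi method named in Step 3, and all function and variable names are illustrative rather than the paper's.

```python
import numpy as np

def compress(X1, X2, vsr_threshold=0.80):
    """Supervised information feature compression, Steps 1-5.
    X1, X2: (n_features, n_samples) data matrices for classes C1 and C2."""
    # Step 1: square normalization -- scale each sample (column) to unit L2 norm.
    X1 = X1 / np.linalg.norm(X1, axis=0, keepdims=True)
    X2 = X2 / np.linalg.norm(X2, axis=0, keepdims=True)
    # Step 2: centralize each class; form G(1), G(2) and A = G(1) - G(2).
    G1 = np.cov(X1 - X1.mean(axis=1, keepdims=True))
    G2 = np.cov(X2 - X2.mean(axis=1, keepdims=True))
    A = G1 - G2
    # Step 3: eigen-decompose the symmetric matrix A.
    lam, U = np.linalg.eigh(A)
    # Step 4: sort by squared eigenvalue; keep the smallest d whose
    # variance square ratio VSR = V_d / V_n reaches the threshold.
    order = np.argsort(lam ** 2)[::-1]
    lam2 = lam[order] ** 2
    vsr = np.cumsum(lam2) / lam2.sum()
    d = int(np.searchsorted(vsr, vsr_threshold) + 1)
    T = U[:, order[:d]]                  # compression matrix T = (u_1,...,u_d)
    # Step 5: project both classes, Y(i) = T' X(i).
    return T.T @ X1, T.T @ X2, T, vsr[d - 1]
```

Applied to twelve-dimensional monthly-temperature vectors as in Section 4, such a routine would return low-dimensional compressed patterns together with the retained VSR.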
4 Experimental Results

The original data set is the monthly mean temperatures of Spanish cities in 1989 (from the Spanish statistics yearbook, 1990) [9]. The first class C1 contains the monthly mean temperatures of the inland cities, corresponding to the lower-temperature inland district; the second class C2 contains the monthly mean temperatures of the coastal cities, corresponding to the higher-temperature coastal district. The compressed results of classes C1 and C2 obtained with the information feature compression algorithm above are shown in Fig. 1.
Fig. 1. Compressed results of class C1 and class C2
Fig. 1 shows that the distributions of the compressed feature vectors of class C1 (denoted by "*") and class C2 (denoted by "+") are clearly concentrated; for the two classes, the within-class distance is small, the between-class distance is large, and the SASI is maximal. By compressing the original twelve-dimensional pattern vectors, we obtain two-dimensional pattern vectors that carry 99.14% of the information content of the original data. The experimental results demonstrate that the algorithm presented here is valid and reliable. The method is a supervised algorithm and takes full advantage of the class-label information of the training samples.
5 Conclusions

In this paper we study the information feature compression problem from the standpoint of information theory. Based on the discussion of divergence theory, we give the definitions of the ASI and the SASI and prove that the SASI is a distance measure. Based on the SASI, we construct a new supervised compression algorithm for information features. The experimental results show that the algorithm presented here is valid and that its compression effect is significant.
Acknowledgements

This study is supported by the National Natural Science Foundation of China under Grant No. 60435010, the China Postdoctoral Science Foundation under Grant No. 2005037439, and the Doctoral Science Foundation of Shandong Agricultural University under Grant No. 23241.
References

1. Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. Wiley, New York (1973)
2. Devroye, L., Gyorfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York (1996)
3. Fukunaga, K.: Introduction to Statistical Pattern Recognition. 2nd edn. Academic Press, New York (1990)
4. Hand, D.J.: Discrimination and Classification. Wiley, New York (1981)
5. Turk, M., Pentland, A.: Eigenfaces for Recognition. Journal of Cognitive Neuroscience 3(1) (1991) 71-86
6. Yang, J., Yang, J.Y.: A Generalized K-L Expansion Method That Can Deal with Small Sample Size and High-Dimensional Problems. Pattern Analysis and Applications 6(6) (2003) 47-54
7. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York (1991)
8. Ding, S.F., Shi, Z.Z.: Symmetric Cross Entropy and Information Feature Compression Algorithm. Journal of Computational Information Systems 1(2) (2005) 247-252
9. Chen, S.H., Lu, C.L.: An Entropic Approach to Dimensionality Reduction in the Representation Space of Discrete Processes. Journal of Nanjing Institute of Meteorology 24(1) (2001) 74-82
Author Index
Abraham, Ajith I-873, III-518, III-891 Acu˜ na, Gonzalo II-808 Aguilar, Jose I-623 Ahmad, Zainal II-943 Ahn, Sang-Ho II-669 Ahn, Tae-Chon I-768, I-780 Alonso, Graciela Ramirez III-1105 Amari, Shun-ichi I-1 An, Ji-yao III-231 Arafat, Yasir III-660 Araz, Ceyhun III-898 ¨ Araz, Ozlem Uzun III-898 Asari, Vijayan K. I-584 Attakitmongcol, Kitti II-1378 Attik, Mohammed I-1359 Auer, Dorothee III-606 Ayhan, Bulent III-1216 Bacauskiene, Marija I-837 Bae, Hyeon III-420, III-471, III-991, III-1261 Bae, Joing-Il III-471 Baek, Seong-Joon II-419 Bagchi, Kallol III-1181 Bai, Jie III-255 Bai, Qiuguo III-1248 Bannister, Peter R. III-828 Bao, Xiao-Hua III-914 Bao, Zheng I-915 Beadle, Patch J. III-560 Ben Amor, Nadia II-293 Bhattacharya, Arijit III-891 Bi, Jianning III-654 Bi, Rongshan III-1065 Bian, Ning-Yan II-719 Bo, Liefeng I-1083 Brot´ ons, Francisco III-208 Brusic, Vladimir III-716 Buiu, Octavian III-1366 Bundzel, Marek I-955 Cai, Qutang I-1312 Cai, Xiushan II-747, II-777 Cai, Zixing II-1227, III-267
Cao, Aize II-504 Cao, Bin I-442 Cao, Bing-gang II-1370 Cao, Feilong I-72 Cao, Heng III-876 Cao, Jianting III-531 Cao, Jinde I-153, I-285, I-369 Cao, Wenming I-669 Cao, Yang II-719, III-600, III-1313 Cao, Yi-Jia II-1246, II-1332, II-1416, II-1422 Cao, Yong-Yan I-179 Cao, Yu II-158, II-569 Celebi, Fatih V. III-997 Cengiz, Yavuz III-815 Cerrada, Mariela I-623 Cha, Eui-Young II-1 Chacon, Mario I. III-1105 Chai, Tianyou III-852, III-858 Chai, Yi II-1052, II-1348 Chaiyaratana, Nachol II-998 Chang, Guoliang III-560 Chang, Maiga I-1305 Chang, Ming III-1091 Chang, Ming-Chang III-370 Chang, Roy Kwang Yang I-830 Chang, Thoon-Khin III-358 Chang, Yeon-Pun II-1104, III-370 Che, Haijun III-864 Chen, Ai-ling III-885 Chen, Bo I-915 Chen, Boshan I-230 Chen, Chunjun III-957 Chen, Dan II-20 Chen, Dingguo II-875 Chen, Dongyue I-1334 Chen, Fang-Fang I-590 Chen, Feng III-777 Chen, Guo III-326 Chen, Guoliang III-682 Chen, Guo-Ping I-810 Chen, Haixia I-545 Chen, Huafu I-21 Chen, Huawei I-509
Chen, Hung-Cheng II-1296, II-1324 Chen, Jen-Cheng I-1305, III-1091 Chen, Joseph C. III-970 Chen, Jun III-332 Chen, Junying I-1319 Chen, Lei-Ting II-188, III-1053 Chen, Ling I-385 Chen, Peng II-1178 Chen, Po-Hung II-1296, II-1324 Chen, Qi-an III-464 Chen, Qingzhan II-55 Chen, Rong I-515 Chen, Shuyan III-1 Chen, Songcan II-128 Chen, Tianping I-192, I-204, I-303 Chen, Wei-Hua II-1416, II-1422 Chen, Weijun II-259 Chen, Weimin II-1046 Chen, Wen-Ping III-1005 Chen, Xi III-1304 Chen, Xiaoming I-551 Chen, Xiaoying II-55 Chen, Xin I-564 Chen, Xing-lin II-1131 Chen, Yajie III-1366 Chen, Yan-Qiu II-8 Chen, Yen-Wei II-517 Chen, Ying III-1380 Chen, Ying-Wu III-927 Chen, Yixin I-385 Chen, Yong II-545 Chen, Yuebin II-442 Chen, Yuehui I-873, III-518 Chen, Yu-Jen I-599 Chen, Yunping II-1052, II-1348 Chen, Zhi-jie I-1412 Cheng, Hong III-46 Cheng, Jian II-783 Cheng, Peng III-156 Cheng, Qiyun II-1252 Cheng, Yufang II-1019 Cheng, Yuhu I-607 Chi, Zheru II-331 Cho, Seongwon I-448, I-456, II-26 Choi, Jeoung-Nae I-774 Choi, Jin Young I-991 Choi, Nakjin III-382 Choi, Seung Ho II-419 Choi, Young Joon II-1239 Chou, Chien-Ming II-1324
Chouchourelou, Arieta I-41 Chow, Tommy W.S. I-80 Chu, Ming-Hui II-1104 Chu, Si-Zhen II-911, II-916, II-922, II-928 Chun, Myung-Geun III-406, III-426 Chun, Seung-Pyo III-420 Chung, Duck-Jin I-723, III-1340 Chung, Fu-Lai II-610, II-623 Chung, Sun-Tae I-448 Chung, Wooyong I-659 Cibulskis, Vladas I-837 Cichocki, Andrzej III-531 Clifton, David A. III-828 Clifton, Lei A. III-836 Coessens, Bert III-635 Cubillos, Francisco II-808 Cui, Baotong I-165 Cui, Deng-Zhi II-991 Cyganek, Boguslaw III-52 Dai, Ming-Wei II-765 Dai, Qiong-Hai III-156, III-165 Dai, Xianhua III-722 Dai, Xianzhong III-1085 Dai, Xue-Feng II-991 Dai, Yuewei III-273 Dai, Zhiming III-722 Dalkiran, Ilker III-997 Dang, Zhi II-474 Danisman, Kenan III-997 Deguchi, Toshinori I-502 de Jes´ us Rubio, Jos´e II-956 De Moor, Bart III-635 Deng, Ke III-102 Deng, Xian-Rui III-1071 Deng, Yong I-1286, III-728 Ding, Chunmei I-72 Ding, Mingxiao I-27, I-489 Ding, Shi-Fei I-1421 Ding, Weijun III-777 Ding, Xiaoqing II-429 Ding, Xiaoyan III-306 Ding, Yongshan III-414 Ding, Yuhan III-1085 Djurdjanovic, Dragan III-396 Dlay, Satnam S. III-760 Do, Yongtae II-557 Dong, Fu-guo II-511 Dong, Guang-jun II-337
Author Index Dong, Sung Soo III-120 Dong, Xiaomin II-1046 Dong, Yongkang III-1237 Dou, Fuping III-864 Du, Dang-Yong III-477 Du, Ding III-876 Du, Ji-Xiang I-747, II-216, II-355 Du, Xiu-Ping II-150 Du, Zhi-Ye I-590 Du, Zifang I-344 Duan, Guangren I-379, II-1131, II-1153 Duan, Zhuohua II-1227 Egidio Pazienza, Giovanni I-558 Ejnarsson, Marcus III-1111 Elliman, Dave III-606 Emmert-Streib, Frank I-414 Eom, Il Kyu II-652, II-661 Eom, Jae-Hong III-642, III-690, III-710 Er, Meng-Joo I-1231, II-498 ¨ ur III-898 Eski, Ozg¨ Essoukri Ben Amara, Najoua II-271, II-293 Fan, Hui II-511 Fan, Min III-485 Fan, Mingming II-741 Fan, Quanyi III-844 Fan, Wei-zhong III-285 Fan, Yugang I-1273 Fan, Youping II-1052, II-1348 Fang, Rui-Ming II-1068 Fang, Yong I-80 Fei, Shu-Min II-842, II-881 Feng, Boqin I-577 Feng, Chun II-1096 Feng, Dazhang I-1195 Feng, Deng-chao II-539 Feng, Fuye I-359 Feng, Jianfeng I-1 Feng, Jufu I-1346 Feng, Song-He II-448, II-589 Feng, Xiao-yi II-474 Feng, Yong I-866 Feng, Yue I-1089 Frankoviˇc, Baltaz´ ar I-955 Freeman, Walter J. II-93, II-343, III-554 Fu, Chaojin I-230 Fu, Jun II-343 Fu, Pan III-964
Fu, Si-Yao II-1218 Fu, Yan I-1293 Fu, Zhumu II-842 Gan, Liang-zhi I-1016 Gan, Woonseng II-1033 Gao, Haichang I-577 Gao, Hongli III-957 Gao, Mao-ting I-1256 Gao, Peng III-346 Gao, Qing I-21 Gao, Tianliang III-23 Gao, Xiao Zhi I-616 Gao, Xinbo II-436 Gao, Xun III-1313 Gao, Yang I-1189 Gao, Zengan I-1140 Garc´ıa-C´ ordova, Francisco II-1188, II-1198 Garc´ıa, Federico III-208 Gatton, Thomas M. II-1239 Gavrilov, Andrey I-707 Gazzah, Sami II-271 Ge, Qiang III-1237 Ge, Shuzhi Sam III-82 Ge, Yu II-253 Ge, Yunjian II-253 Geng, Hui III-1059 Geng, Yu-Liang II-448, II-589 Geng, Zhi III-777 Georgakis, Apostolos II-595 G¨ oksu, H¨ useyin III-815 Gong, Qingwu II-1052 Gong, Tao III-267 Gonzalez-Olvera, Marcos A. II-796 G´ orriz, Juan Manuel II-676 Gouton, Pierre II-349 ´ Grediaga, Angel III-208 Griffin, Tim III-1216 Grosan, Crina III-891 Grosenick, Logan III-541 Gu, Hong I-322 Gu, Xiao-Dong II-616, III-740 Gu, Xuemai III-88 Guan, Gen-Zhi I-590 Guan, Xiaohong I-350, III-346 Guan, Xinping III-202, III-573 Guimaraes, Marcos Perreau III-541 G¨ une¸s, Filiz III-815 Guo, Bing III-1283
Guo, Chengan I-1057 Guo, Chenlei II-404 Guo, Chuang-Xin II-1332 Guo, Hai-xiang III-1173 Guo, Jing-Tao III-1254 Guo, Jun I-1030 Guo, Lei I-496, II-962 Guo, Ping II-486 Guo, Qianjin III-1144 Guo, Wei II-741 Guo, Wenqiang I-1177 Guo, Xiao-Ping III-1138 Guo, Yi-Nan II-783 Guo, Ying III-108 Guo, Zun-Hua II-369
Hall, Steve III-1366 Han, Bing II-949 Han, Bo III-1210 Han, Chang-Wook I-798 Han, Fei I-631 Han, Feng-Qing I-1022 Han, Jianghong II-55 Han, Jiu-qiang II-34 Han, Li II-1039 Han, Min II-741, II-949 Han, Qing-Kai III-982 Han, Sangyong III-891 Han, Seung-Soo III-285, III-1014, III-1020, III-1028, III-1043 Han, Xin-Yang II-1277 Handoko, Stephanus Daniel III-716 Hao, Jiang I-7 Hao, Zhi-Feng I-981, I-997, I-1010, II-1084 Hasegawa, Yoshizo II-1178 Havukkala, Ilkka III-629 Haynes, Leonard III-352 He, Bei II-1039 He, Bin III-566 He, Hanlin I-147 He, Min II-1062 He, Pi-Lian II-150 He, Qing I-1299 He, Xiaofei II-104 He, Xiao-Xian I-570 He, Yongfeng III-844 He, Yu-Lin III-1270 He, Zhaoshui I-1171 He, Zhenya I-1153, I-1189
Heh, Jia-Sheng I-1305, III-1091 Ho, Daniel W.C. I-524 Ho, Tu Bao I-1394 Hong, Kwang-Seok II-222 Hong, Sang J. III-376 Hong, Sang Jeen III-1014, III-1036, III-1043 Hong, Taehwa II-69 Hongyi, Zhang I-1159 Hope, Anthony D. III-964 Hou, Yun I-577 Hou, Zeng-Guang II-523, II-1218 Hsueh, Chien-Ching III-370 Hsueh, Ming-Hsien III-821 Hsueh, Yao-Wen III-821, III-1005 Hu, Bo III-94 Hu, Dewen I-1133, III-620 Hu, Guang-da I-211 Hu, Guang-min III-190 Hu, Hai-feng II-375 Hu, Jianming III-23 Hu, Jing-song I-1267 Hu, Lai-Zhao I-1044 Hu, Meng III-554 Hu, Qing-Hua I-1373 Hu, Shan-Shan III-1270 Hu, Tingliang II-849 Hu, Wen-long I-1412 Hu, Xuelei I-1214 Hu, Yafeng I-1406, II-287, III-144 Hu, Yun-An II-349 Huang, De-Shuang II-216 Huang, Fenggang II-238, II-411 Huang, Gaoming I-1153, I-1189 Huang, Guang-Bin III-114 Huang, Houkuan III-261 Huang, Hua I-732 Huang, Jian I-843, III-792 Huang, Lin-Lin II-116 Huang, Panfeng II-1208 Huang, Qi III-58 Huang, Qiao III-23 Huang, Shanglian II-1046 Huang, Song-Ling III-1254 Huang, Sunan II-1007, III-364 Huang, Tingwen I-243 Huang, Yan I-316 Huang, Wei III-512 Huang, Xianlin I-616 Huang, Xinbo III-346
Author Index Huang, Xi-Yue II-759 Huang, Yanxin III-674 Huang, Yi II-158, II-569 Huang, Yumei I-93, I-141 Huang, Yu-Ying III-1117 Huang, Zhi-Kai I-1165, II-355 Hwang, Kao-Shing I-599 Ibarra, Francisco III-208 Im, Ki Hong I-991 Ishii, Naohiro I-502 Iwasaki, Yuuta II-517 Jeon, Jaejin III-128 Jeong, Hye-Jin II-1239 Jeong, Min-Chang III-1099 Ji, Ce I-198 Ji, Hai-Yan III-1296 Ji, Liang I-895 Jia, Ming-Xing III-1138 Jia, Xinchun I-1063 Jiang, Dongxiang III-414 Jiang, Feng I-1208 Jiang, Jing-Qing I-1010 Jiang, Kai I-545 Jiang, Minghu I-1244, I-1250 Jiang, Minghui I-273 Jiang, Ping I-334 Jiang, Quan-Yuan II-1332, II-1416, II-1422 Jiang, Zhao-Yuan III-8 Jiao, Licheng I-909, I-1083 Jin, Bo I-922 Jin, Fan I-509 Jin, Hai-Hong I-1147 Jin, Hong II-1277 Jin, Huihong I-942 Jin, Huiyu III-921 Jin, Long III-1202 Jin, Ya-Qiu II-8 Jin, Yihui III-165 Jin, Yingxiong III-734 Jin, Wei-Dong I-1044 Jing, Chunguo III-1248 Jing, Cuining II-232 Jing, Ling I-1076 Jing, Wen-Feng II-765 Jordaan, Jaco II-1311 Jun, Byong-Hee III-426 Jun, Byung Doo III-382
Jung, Keechul III-1350 Jung, Kyu-Hwan III-491 Kacalak, Wojciech III-1155, III-1161 Kaewarsa, Suriya II-1378 Kaizoji, Taisei III-432 Kalra, Prem K. I-424 Kamruzzaman, Joarder III-660 Kanae, Shunshoku III-746 Kang, Byoung-Doo II-322, III-246 Kang, Hoon II-581, III-991 Kang, Hyun Min II-652 Kang, Lishan III-1210 Kang, Long-yun II-1370 Kang, Min-Jae I-100, II-1338, III-1357 Kang, Woo-Sung I-991 Kang, Yuan II-1104, III-370 Kang, Yun-Feng II-867 Karmakar, Gour III-660 Kasabov, Nikola II-134, III-629 Kasanick´ y, Tom´ aˇs I-955 Kelly, Peter III-1366 Keong, Kwoh Chee III-716 Keum, Ji-Soo II-165 Khan, Farrukh A. I-100, III-1357 Khan, Muhammad Khurram III-214 Khashman, Adnan II-98 Khor, Li C. III-760 Ki, Myungseok II-140 Kim, Bo-Hyun III-524 Kim, Byungwhan III-1020, III-1028, III-1036 Kim, ChangKug III-1374 Kim, Changwon III-1261 Kim, Do-Hyeon II-1 Kim, Donghwan III-1028 Kim, Dong Seong III-224 Kim, Dong-Sun I-723, III-1340 Kim, Euntai I-659 Kim, Hagbae II-69 Kim, Ho-Chan I-100, II-1338 Kim, Hyun-Chul I-1238, III-491 Kim, Hyun Dong III-586 Kim, HyungJun II-460 Kim, Hyun-Ki II-821 Kim, Hyun-Sik I-723, III-1340 Kim, Intaek III-770 Kim, Jaemin I-448, I-456, II-26 Kim, Jin Young II-419 Kim, Jong-Ho II-322, III-246
Kim, Jongrack III-1261 Kim, Jung-Hyun II-222 Kim, Kwang-Baek II-1, II-299, II-669, III-991 Kim, Kyeong-Seop II-40 Kim, Moon S. III-770 Kim, Myounghwan II-557 Kim, Sang-Kyoon II-178, II-322, III-246 Kim, Sooyoun III-1036 Kim, Sun III-710 Kim, Sung-Ill II-172 Kim, Sungshin II-299, III-420, III-471, III-991, III-1261 Kim, Tae Hyung II-652, II-661 Kim, Taekyung II-277 Kim, Tae Seon III-120, III-586, III-976 Kim, Wook-Hyun II-244, II-602 Kim, Woong Myung I-1096 Kim, Yejin III-1261 Kim, Yong-Guk II-69 Kim, Yong-Kab III-1374 Kim, Yoo Shin II-652, II-661 Kim, Yountae III-991 Kinane, Andrew III-1319 Kirk, James S. III-1167 Ko, Hee-Sang II-1338 Ko, Hyun-Woo III-382 Ko, Young-Don III-1099 Kong, Wei II-712 Krongkitsiri, Wichai II-1378 Kuang, Yujun II-694 Kuan, Yean-Der III-1005 Kuntanapreeda, Suwat II-998 Kuo, Cheng Chien II-1317 Kwan, Chiman III-352, III-1216 Kwon, O-Hwa II-322, III-246 Kyoung, Dongwuk III-1350 Lai, Kin Keung I-1261, III-498 Lan, Hai-Lin III-477 Larkin, Daniel III-1319 Ledesma, Bernardo III-208 Lee, Chong Ho III-120 Lee, Dae-Jong III-406 Lee, Daewon III-524 Lee, Dong-Wook II-390 Lee, Dukwoo III-1020 Lee, Hyo Jong III-66 Lee, Hyon-Soo I-1096, II-165 Lee, In-Tae I-774
Lee, Jaehun I-659 Lee, Jae-Won II-178, II-322, III-246 Lee, Jaewook I-1238, III-491, III-524 Lee, Joohun II-419 Lee, Jung Hwan III-1099 Lee, Ka Keung III-73 Lee, Kwang Y. II-1338 Lee, Mal rey II-1239 Lee, Sang-Ill III-426 Lee, Sang Min III-224 Lee, Sangwook III-128 Lee, Seongwon II-277 Lee, SeungGwan I-476 Lee, Song-Jae I-689, III-376 Lee, Sungyoung I-707 Lee, Tong Heng III-364 Lee, Woo-Beom II-244, II-602 Lee, Yang-Bok II-69 Lee, Yang Weon I-1280 Lee, Young-Koo I-707 Lee, Yunsik I-689, III-196 Lei, Jianhe II-253 Lei, Jianjun I-1030 Lei, Lin II-1019 Leong, Jern-Lin II-480 Leprand, Aurelie J.A. I-676 Leung, Caroline W. III-1181 Li, Aiguo II-316 Li, Ang III-1328 Li, Bai III-606 Li, Bailin II-1096 Li, Bao-Sen II-1277 Li, Bin II-1146, II-1233 Li, Bo-Yu II-8 Li, Chao-feng II-468 Li, Chaoyi III-548 Li, Chuan II-1290 Li, Chuan-Dong I-1022, III-464 Li, Chuandong I-279 Li, Deyu III-754 Li, Dong III-734 Li, Feng-jun I-66 Li, Guang I-15, II-93, II-343, III-554 Li, Guocheng I-344, I-350 Li, Guo-Zheng III-1231 Li, Haibo II-595 Li, Hai-Jun I-903 Li, Hongyu I-430, I-436 Li, Hongwei I-1121 Li, Houqiang III-702
Author Index Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li,
Huaqing II-646 Hui II-1039 Jianyu I-695 Jiaojie III-554 Jie II-436 Jin-ling III-1173 Jun II-934, III-957 Kai III-414 Li I-1127 Lei I-1326 Le-min III-190 Linfeng I-135 Luoqing I-928 Maoqing III-921 Ming-Bin III-114 Ping I-309, I-1050, I-1273 Pingxiang I-1070 Qunzhan I-968 Ren-fa III-231 Rui I-1121 Ronghua I-1183 Shang-Ping III-1270 Shangsheng II-1052, II-1348 Shao-Hong II-369 Shujiang III-858 Shuyong I-141 Songsong I-551 Tao I-889 Tieshan II-888 Wenwei III-184 Xiaodong II-1033 Xiaofeng II-442 Xiaohong III-1237 Xiaokun III-1216 Xiao-Li II-867 Xiaoli III-573 Xiaolu I-1171 Xiaoou II-1110 Xi-Hai I-1400 Xue I-1140 Yanda III-654 Yangmin I-564 Yanlai I-889 Yao II-1233 Ying-Hong I-1089 Yixing III-1277 Yugang III-1065 Zaiming II-442 Zhancai III-1328 Zhihui II-411
Li, Zhi-quan III-1380 Lian, Hui-Cheng II-202 Lian, Shiguo III-273 Liang, Bin II-1208 Liang, Hua-Lou III-318 Liang, Shi III-1270 Liang, Xuefeng I-1394 Liang, Xun III-442 Liang, Yanchun I-981, II-1084 Liang, Yang-Chun I-1010 Liao, Changrong II-1046 Liao, Ke II-735 Liao, Wentong I-297 Liao, Wudai I-159, I-224 Liao, Xiang III-548 Liao, Xiao-Feng I-279, I-1022 Liao, Xiaohong II-1233 Liao, Xiaoxin I-115, I-147, I-159, I-224, I-249, I-273, I-328 Lien, Hsin-Chung III-821, III-1005 Lim, Jong-Seok II-244, II-602 Lim, Junseok I-637, I-1340, III-128 Lim, Soonja III-1374 Lin, David III-1053 Lin, Haisheng I-616 Lin, Hui II-63, II-1409 Lin, Jiangli III-754 Lin, Kai-Biao II-306 Lin, Mao-Song I-1394 Lin, Qiu-Hua III-318 Lin, Tsu-Wei II-1104 Lin, Tzung-Feng I-599 Lin, Yang-Cheng III-799 Ling, Hefei II-638 Ling, Ping I-1222 Ling, Wei-Xin III-477 Liu, Ben-yu III-1223 Liu, Bin III-1248 Liu, Bo I-997 Liu, Chang I-1127 Liu, Dai-Zhi I-1400 Liu, Derong I-804 Liu, Dong II-1348 Liu, Dong-hong I-1412 Liu, Fang I-880 Liu, Fan-yong III-485 Liu, Fei II-975, III-939, III-945, III-951 Liu, Gang I-1400 Liu, Guangjie III-273 Liu, Guisong III-240
Liu, Guixia III-696 Liu, Hai-kuan I-1016 Liu, Hai Tao III-58 Liu, Han II-1124 Liu, Han-bo III-1242 Liu, Hong-Bing I-949 Liu, Hongwei I-915 Liu, Huaping II-533 Liu, Jianbo III-396 Liu, Jian-Guo III-1187 Liu, Jilin I-172 Liu, Jin II-682 Liu, Jinguo II-1146 Liu, Ji-Zhen II-1027 Liu, Ju III-293, III-312 Liu, Jun II-128 Liu, Jun I-15 Liu, Jun-Hua II-934 Liu, Liang I-489 Liu, Lijun I-399 Liu, Meichun I-303 Liu, Meiqin I-122, I-683 Liu, Mei-Qin II-904 Liu, Ming III-777 Liu, Mingzhe III-1202 Liu, Qingshan I-369 Liu, Ruixin III-279 Liu, Shan III-792 Liu, Shirong II-771, II-1172 Liu, Shu-Dong II-991 Liu, Singsing I-669 Liu, Tianming I-172 Liu, Tian-Yu III-1231 Liu, Xiangdong III-88 Liu, Xiang-Jie II-1027 Liu, Xinggao III-1126 Liu, Xuelian II-1252 Liu, Xuhua III-1192 Liu, Yan II-981 Liu, Yang III-620 Liu, Yan-Jun II-836 Liu, Yan-Peng I-714 Liu, Yiguang I-405 Liu, Yihui III-606 Liu, Yi-Jun III-592 Liu, Yue III-1231 Liu, Yuehu I-1063 Liu, Yunhui I-732 Liu, Zhen II-1409 Liu, Zhigang II-1402
Liu, Zhixin III-202 Liu, Zhong III-1290 Lo, King Chuen II-331 Lok, Tat-Ming I-631, I-1165 Long, Fei II-842 Loo, Chu Kiong I-830 L´ opez-Coronado, Juan II-1188, II-1198 L´ opez-Y´ an ˜ez, Itzam´ a I-818 Lou, Shun-Tian I-1147 Lou, Xuyang I-165 Lou, Zhengguo I-15, II-343 Lu, Bao-Liang I-537, II-202, II-210, III-667 Lu, Guodong I-172 Lu, Hongtao I-1387, II-610, II-623 Lu, Huchuan II-63 Lu, Jiwen II-232 Lu, Jun-Wei II-349 Lu, Ke II-104 Lu, Wei II-610, II-623 Lu, Wenlian I-192 Lu, Xiao-feng III-40 Lu, Yinghua I-1244, I-1250 Lu, Zhengding II-638 Lu, Zhiwu I-464, II-492, II-753 Luan, Xiao-Li II-975 Luo, Ali II-361 Luo, Chuanjiang II-383 Luo, Dayong III-33 Luo, Haigeng I-328 Luo, Jun II-210 Luo, Siwei I-482, I-695, I-732 Luo, Xin II-1409 Luo, Yan III-754 Luo, Yanhong I-804 Luo, Zhengguo II-343 Lv, Hairong II-287 Lv, Jian Cheng I-701 Lv, Wenzhi III-390 Lv, Ziang I-732 Lyu, Michael R. I-631, I-1165 Ma, Ma, Ma, Ma, Ma, Ma, Ma, Ma,
Fang-Lan III-1270 Guangfu I-974 Hongliang I-482 Jie I-752 Jie-Ming III-600 Jinghua II-253 Jinwen I-442, I-464 Libo II-454
Author Index Ma, Ming I-752 Ma, Run-Nian I-267 Ma, Shugen II-1146 Ma, Xiaohong II-682, III-306 Ma, Yu III-740 Maeda, Kouji III-746 Majewski, Maciej III-1155, III-1161 Mandal, Ajit K. I-762 Mao, Shuai III-1059 Marko, Kenneth III-396 Matsuka, Toshihiko I-34, I-41 McDaid, Liam III-1366 Meng, De-Yu I-60, II-765 Meng, Haitao II-688 Meng, Xiaoning III-73 Meng, Zhiqing I-942 Mi, Bo II-545 Miao, Sheng III-1223 Min, Chul Hong III-586, III-976 Ming, Cong II-1007 Mirza, Anwar M. III-1357 Mishra, Deepak I-424 Mo, Quanyi I-1109, II-700 Moon, Taesup III-1261 Moreno-Armendariz, Marco A. I-558 Morgan, Paul Simon III-606 Mu, Shaomin I-962 Mukhopadhyay, Sumitra I-762 Mulero-Mart´ınez, Juan Ignacio II-1188, II-1198 Muresan, Valentin III-1319 Muuniz, Dan III-1053 Myoung, Jae-Min III-1099 Na, Seung You II-419 Namkung, Jaechan II-140 Nguyen, Duc-Hoai I-689 Ng, Wan Sing II-480 Ni, He III-504 Ni, Jun III-396 Nie, Xiangfei I-1030 Nie, Yinling I-909 Nie, Yiyong III-1144 Nie, Weike I-909 Nilsson, Carl Magnus III-1111 Ning, Xuanxi I-942 Niu, Ben I-570 Niu, Dong-Xiao II-1269 Niu, Xiao-Li III-293 Niu, Xiao-Xiao I-949
1435
Noe, Frank I-1244, I-1250 Noh, Sun-Kuk III-135 Nwoye, Ephram III-760 O’Connor, Noel III-1319 Oh, Sung-Kwun I-768, I-774, I-780, I-786, II-815, II-821 Okada, Toshimi I-551 Ou, Yongsheng III-73 Ou, Zongying II-20 ¨ calık, H. Rıza II-1075 Oz¸ Pai, Ping-Feng III-1117 Paik, Joonki II-140, II-277 Pang, Shaoning II-134, III-629 Pao, Hsiao-Tien II-1284 Park, Byoung-Jun I-786 Park, Changhan II-140 Park, Chang-Hyun II-390 Park, Chan-Ho I-1096, II-165 Park, Dong-Chul I-689, I-1038, III-196, III-376 Park, Ho-Sung I-780 Park, Jaehyun III-376 Park, Jang-Hwan III-406, III-426 Park, Jong Sou III-224 Park, Jung-Il I-798 Park, Keon-Jun II-815, II-821 Park, Sancho I-1038 Park, Se-Myung II-322, III-246 Park, Soo-Beom II-178 Pedrycz, Witold I-786, II-815 Pei, Run II-1153 Peng, Chang-Gen I-739 Peng, Hong I-1267, II-790, II-802 Peng, Jing II-8 Peng, Lizhi I-873, III-518 Peng, Qi-Cong I-1103 Peng, Xiuyan II-859 Peng, Yunhua III-876 Peng, Zhen-Rui III-8 Pi, Dao-Ying I-179, II-911, II-916, II-922, II-928 Pian, Jinxiang III-852 Piao, Cheng-Ri III-285 Pichl, Luk´ aˇs III-432 Pietik¨ ainen, Matti II-474 Pogrebnyak, Oleksiy III-933 Poo, Kyungmin III-1261 Poo, Mu-Ming I-7
1436
Author Index
Pu, Qiang II-265 Pu, Xiaorong II-49, II-110 Pu, Yi-Fei II-735 Pu, Yun-Wei I-1044 Puntonet, Carlos G. II-676 Pyeon, Yonggook I-1340 Pyun, Jae-Young III-135 Qi, Feihu II-646 Qi, Xin-Zheng III-906 Qian, Feng III-190 Qian, Fucai II-1124 Qian, Jian-Sheng II-783 Qian, Ji-Xin I-714 Qian, Li III-1237 Qian, Tao III-1216 Qiang, Wei III-1313 Qiang, Wen-yi III-1242 Qiao, Jian-Ping III-293 Qiao, Xiao-Jun II-539 Qin, Ling I-385 Qin, Zhao-Ye III-982 Qin, Zheng I-1319 Qiu, Fang-Peng I-880 Qiu, Jianlong I-153 Qiu, Tianshuang I-1177, III-108 Qiu, Wenbin III-255 Qu, Hong I-291 Qu, Yanyun I-1063 Ram´ırez, Javier II-676 Rao, Machavaram V.C. I-830 Rao, Ni-ni I-1109 Rao, Zhen-Hong III-1296 Ren, Changlei I-204 Ren, Dianbo I-236 Ren, Guanghui III-171 Ren, Quansheng III-178 Roh, Seok-Beom I-768 Roque-Sol, Marco I-243 Ruan, Xiaogang I-27, I-489 Ryu, DaeHyun II-460 Ryu, Kyung-Han III-376 S´ anchez-Fern´ andez, Luis P. I-818, III-933 Sandoval, Alejandro Cruz I-86 Saratchandran, Paramasivan III-114 Scheer, S´ergio I-391 Schumann, Johann II-981
Segura, Jos´e C. II-676 Semega, Kenneth III-352 Seo, Dong Sun III-1014 Seok, Jinwuk I-456 Seong, Chi-Young II-322, III-246 Seow, Ming-Jung I-584 Shang, Li II-216 Shao, Chao I-482 Shao, Huai-Zong I-1103 Shen, Chao I-1346 Shen, Guo-Jiang III-15 Shen, I-Fan I-430, I-436 Shen, Minfen III-560 Shen, Minghui III-390 Shen, XianHua III-754 Shen, Xue-Shi III-927 Shen, Yan III-1283 Shen, Yi I-122, I-273, I-683 Shen, Yifei III-682 Sheng, Zhi-Lin III-906 Shenouda, Emad A.M. Andrews I-849 Shi, Chuan III-452 Shi, Daming II-480 Shi, Wei-Feng II-1354 Shi, Xiao-fei I-1127 Shi, Zelin II-383 Shi, Zhang-Song III-1290 Shi, Zhong-Zhi I-1299, I-1421, III-452 Shimizu, Akinobu II-116 Shin, Seung-Won II-40 Shu, Yang I-644 Sim, Kwee-Bo II-390, II-669 Siqueira, Paulo Henrique I-391 Siti, Willy II-1311 Sitte, Joaquin III-1334 Smith, Jeremy C. I-1244, I-1250 Song, Chong-Hui I-129, I-198 Song, Chu-Yi I-1010 Song, Huazhu III-1210 Song, Jingyan III-23 Song, Joonil I-637, I-1340 Song, Min-Gyu II-419 Song, Qing I-1231, II-504 Song, Quanjun II-253 Song, Shen-min II-1131, II-1153 Song, Shiji I-344, I-350 Song, Wang-Cheol I-100, III-1357 Song, Xue-Lei II-1386 Song, Yongduan II-1233 Song, Zhi-Huan I-1050, I-1273
Author Index Song, Zhuo-yi II-1131 Song, Zhuoyue I-616 Soon, Ong Yew III-716 Spinko, Vladimir II-480 Srikasam, Wasan II-998 Steiner, Maria Teresinha Arns I-391 Su, Han II-238 Su, Jing-yu III-1223 Su, Liancheng II-383 Sun, Cai-Xin II-1259 Sun, Changyin I-135, II-77, II-194 Sun, Chong III-46 Sun, Dong I-1286 Sun, Fuchun II-533 Sun, Hongjin III-614 Sun, Jian II-563 Sun, Jian-De III-293, III-312 Sun, Jun II-706 Sun, Li I-1076 Sun, Ming I-379 Sun, Ning II-77, II-194 Sun, Pu III-396 Sun, Qian II-1068 Sun, Qindong III-346 Sun, Wei I-607, II-1166 Sun, Wen-Lei II-306 Sun, Yaming II-827 Sun, Youxian I-1016, II-916, II-922, II-928 Sun, Zengqi II-533, II-849 Sun, Zhanquan III-786 Sun, Zhao II-1233 Sun, Zhao-Hua III-914 Sundararajan, Narasimhan III-114 Sung, Koeng Mo I-637, III-382 Suo, Ji-dong I-1127 Suppes, Patrick III-541 Tai, Heng-Ming II-188 Tan, Ah-Hwee I-470 Tan, Feng III-326 Tan, Kay Chen I-530, I-858 Tan, Kok Kiong III-364 Tan, Min II-523, II-1218 Tan, Minghao III-852, III-858 Tan, Xiaoyang II-128 Tan, Ying II-14, III-340 Tang, He-Sheng I-515 Tang, Huajin II-498 Tang, Min II-790
1437
Tang, Xiaowei III-554 Tang, Xusheng II-20 Tang, Yu II-796 Tang, Zaiyong III-1181 Tang, Zheng I-551 Tao, Jun III-870 Tarassenko, Lionel III-828 Tee, Keng-Peng III-82 Teoh, Eu Jin I-530, I-858 Tian, Hongbo III-102 Tian, Mei I-482 Tian, Shengfeng I-962, III-261 Tian, Zengshan II-694 Titli, Andr´e I-623 Tong, Wei-Ming II-1386 Toxqui, Rigoberto Toxqui II-1110 Tran, Chung Nguyen I-1038, III-196 Tu, Xin I-739 Ukil, Abhisek II-1311 Uppal, Momin III-1357 Uy, E. Timothy III-541 van de Wetering, Huub III-46 Van Vooren, Steven III-635 Vasant, Pandian III-891 Venayagamoorthy, Ganesh III-648 Verikas, Antanas I-837, III-1111 Wada, Kiyoshi III-746 Wan, Chenggao I-928 Wan, Lichun II-790 Wan, Yong III-1328 Wan, Yuan-yuan II-355 Wang, Bin II-719, III-600 Wang, Chengbo I-1057 Wang, Chen-Xi III-906 Wang, Chia-Jiu II-188, III-1053 Wang, Chong II-398, III-306 Wang, Chun-Chieh III-370 Wang, Da-Cheng I-1022 Wang, Dan II-898 Wang, Dan-Li II-1277 Wang, Dongyun III-805 Wang, Fasong I-1121 Wang, Fu-Li III-1138 Wang, Gang I-129, I-1109, I-1133, II-700 Wang, Gao-Feng III-1304 Wang, Guang-Hong I-334
1438
Author Index
Wang, Guoyin III-255 Wang, Guoyou I-792 Wang, Haiqi III-1192 Wang, Heyong I-1326 Wang, Hong I-496, I-676 Wang, Hongwei I-322 Wang, Houjun II-1019 Wang, Hui I-279, II-922, II-1277 Wang, Jiabing I-1267 Wang, Jian I-1030, II-1370 Wang, Jiang III-722 Wang, Jianjun I-60 Wang, Jie-Sheng III-1078 Wang, Jin II-898 Wang, Jinfeng III-1192 Wang, Jinshan II-747, II-777 Wang, Jinwei III-273 Wang, Jun I-369, I-824, II-790, II-802 Wang, Kuanquan I-889 Wang, Lei I-909, III-184 Wang, Lili I-204 Wang, Li-Min I-903 Wang, Lin I-1244, I-1250 Wang, Ling I-1083, III-614, III-754 Wang, Linshan I-297 Wang, Liyu II-343 Wang, Lu II-398 Wang, Lunwen II-14 Wang, Meng-hui II-1395 Wang, Mingfeng III-777 Wang, Ming-Jen II-188 Wang, Miqu III-777 Wang, Pu I-1109, II-700 Wang, Qin III-1328 Wang, Qun I-255 Wang, Rongxing III-696 Wang, Ru-bin I-50 Wang, Saiyi III-144 Wang, Shaoyu II-646 Wang, Shengjin II-429 Wang, Shilong II-1290 Wang, Shi-tong II-468 Wang, Shoujue I-669, II-158, II-569 Wang, Shouyang I-1261, III-498, III-512 Wang, Shuwang III-560 Wang, Tianfu III-754 Wang, Wancheng III-1085 Wang, Wei II-836, II-867, III-1, III-1231 Wang, Wentao I-792 Wang, Wenyuan II-398
Wang, Xiang-yang III-299 Wang, Xiaodong II-551, II-747, II-777 Wang, Xiaofeng II-888 Wang, Xiaohong I-968 Wang, Xiaoru I-968 Wang, Xihuai II-1363 Wang, Xin III-870, III-1242 Wang, Xinfei III-1237 Wang, Xi-Zhao I-1352 Wang, Xu-Dong I-7 Wang, Xuesong I-607 Wang, Xuexia III-88 Wang, Yan III-674 Wang, Yaonan II-1166 Wang, Yea-Ping II-1104 Wang, Ying III-921 Wang, Yong II-34 Wang, Yonghong III-1277 Wang, Yongji III-792 Wang, You II-93 Wang, Yuan-Yuan III-740 Wang, Yuanyuan II-616 Wang, Yuechao II-1146 Wang, Yue Joseph I-1159 Wang, Yuzhang III-1277 Wang, Zeng-Fu I-747 Wang, Zheng-ou I-1256 Wang, Zheng-Xuan I-903 Wang, Zhiquan III-273 Wang, Zhi-Yong II-1246, II-1422 Wang, Zhong-Jie III-870 Wang, Zhongsheng I-159 Wang, Zhou-Jing II-306 Wang, Ziqiang I-1381 Wee, Jae Woo III-120 Wei, Le I-1159 Wei, Peng III-144 Wei, Pengcheng II-545, III-332 Wei, Xiaotao III-261 Wei, Xun-Kai I-1089 Wen, Bang-Chun III-982 Wen, Jianwei III-165 Wen, Ming II-398 Wen, Peizhi II-629 Wen, Xiao-Tong III-579 Weng, Liguo II-1233 Weng, Shilie III-1277 Wong, Dik Kin III-541 Wong, Stephen T.C. I-172, III-702 Woo, Dong-Min III-285
Author Index Woo, Wai L. III-760 Wu, Bo I-1070 Wu, Cheng I-344 Wu, Chengke II-629 Wu, Chun-Guo I-1010 Wu, Fangfang I-936 Wu, Fu III-8 Wu, Fu-chao II-361 Wu, Geng-Feng III-1231 Wu, Haiyan II-1124 Wu, Honggang II-442 Wu, Huihua III-202, III-573 Wu, Jian I-15 Wu, Jiansheng III-1202 Wu, Jun III-299 Wu, Ming-Guang I-714 Wu, Pei II-63 Wu, Qiu-Feng III-156 Wu, Si I-1 Wu, Sitao I-968 Wu, Wei I-399 Wu, Wei-Ping III-1187 Wu, Xiao-jun II-629, III-1242 Wu, Yan I-1367, II-122 Wu, Yanhua II-14 Wu, Yu III-255 Wu, Yue II-49, II-361 Wu, Zhilu III-88, III-171 Wu, Zhi-ming III-885 Wu, Zhong Fu I-866 Wunsch II, Donald C. II-1140, III-648 Xi, Guangcheng III-786 Xi, Ning II-20 Xia, Bin I-1202, I-1208 Xia, Huiyu III-654 Xia, Xiaodong I-1076 Xia, Yong I-359 Xian, Xiaodong I-843 Xiang, Chang-Cheng II-759 Xiang, Cheng I-530, I-858 Xiang, Lan I-303 Xiao, Jian II-802 Xiao, Jianmei II-1363 Xiao, Mei-ling III-1223 Xiao, Min I-285 Xiao, Ming I-1183 Xiao, Qian I-843 Xiao, Renbin III-512 Xiao, Zhi II-1259
Xie, Li I-172 Xie, Deguang I-1406 Xie, Gaogang III-184 Xie, Keming III-150 Xie, Zong-Xia I-1373 Xie, Zhong-Ke I-810 Xing, Guangzhong III-1248 Xing, Li-Ning III-927 Xing, Mian II-1269 Xing, Tao III-58 Xing, Yan-min III-1173 Xiong, Kai III-414 Xiong, Qingyu I-843 Xiong, Sheng-Wu I-949 Xiong, Zhihua III-1059, III-1132 Xu, Aidong III-1144 Xu, Anbang II-486 Xu, Bingji I-255 Xu, Chengzhe III-770 Xu, Chi II-911, II-916, II-922, II-928 Xu, Daoyi I-93, I-141, I-261 Xu, De II-448, II-589 Xu, Fei I-249 Xu, Jiang-Feng III-1304 Xu, Jianhua I-1004 Xu, Jinhua I-524 Xu, Jun I-179 Xu, Lei I-1214, II-468 Xu, Mingheng III-957 Xu, Roger III-352, III-1216 Xu, Rui III-648 Xu, Sheng III-914 Xu, Wei-Hong I-810 Xu, Weiran I-1030 Xu, Wenfu II-1208 Xu, Xiaodong I-122, I-328, I-683 Xu, Xiao-Yan II-1062 Xu, Xin I-1133 Xu, Xu II-1084 Xu, Yajie III-864 Xu, Yangsheng II-1208, III-73 Xu, Yao-qun I-379 Xu, Yongmao III-1059, III-1132 Xu, You I-644, I-653 Xu, Yulin I-224, III-805 Xu, Yun III-682 Xu, Zongben I-60, I-66, I-72, II-563, II-765 Xue, Chunyu III-150 Xue, Feng III-1290
Xue, Song-Tao I-515
Yadav, Abhishek I-424 Yamano, Takuya III-432 Yamashita, Mitushi II-1178 Yan, Genting I-974 Yan, Hua III-312 Yan, Rui II-498 Yan, Shaoze II-1090 Yan, Yuan-yuan III-1313 Yáñez-Márquez, Cornelio I-818, III-933 Yang, Bin II-712 Yang, Chengfu I-1140 Yang, Chengzhi II-968 Yang, Degang I-279, III-326 Yang, Deli II-63 Yang, Gen-ke III-885 Yang, Guo-Wei II-265 Yang, Huaqian III-332 Yang, Jiaben II-875 Yang, Jianyu II-700 Yang, Jin-fu II-361 Yang, Jingming III-864 Yang, Jinmin III-184 Yang, Jun-Jie I-880 Yang, Lei I-1063 Yang, Li-ping II-474 Yang, Luxi I-1153, I-1189 Yang, Min I-1050 Yang, Shangming I-1115 Yang, Shuzhong I-695 Yang, Simon X. III-255 Yang, Tao III-94 Yang, Wenlu II-454 Yang, Xia III-1065 Yang, Xiaodong III-702 Yang, Xiao-Song I-316 Yang, Xiao-Wei I-981, I-997, I-1010, II-1084 Yang, Xinling II-343 Yang, Xu II-589 Yang, Xulei I-1231, II-504 Yang, Yang I-1367, III-667 Yang, Yansheng II-888 Yang, Yongqing I-185 Yang, Zhao-Xuan II-539 Yang, Zhen I-1030 Yang, Zhiguo I-93 Yang, Zhou I-895 Yang, Zhuo III-734
Yang, Zi-Jiang III-746 Yang, Zu-Yuan II-759 Yao, Dezhong I-21, III-548, III-614 Yao, Li III-579, III-592 Yao, Xing-miao III-190 Yao, Yuan I-385 Yao, Zhengan I-1326 Ye, Bang-Yan II-727 Ye, Chun Xiao I-866 Ye, Ji-Min I-1147 Ye, Liao-yuan III-1223 Ye, Mao I-1140 Ye, Meiying II-747, II-777 Ye, Shi-Jie II-1259 Yeh, Chung-Hsing III-799 Yélamos, Pablo II-676 Yi, Jian-qiang II-1160 Yi, Jianqiang III-786 Yi, Zhang I-291, I-701, II-49, III-240 Yin, Chuanhuan I-962 Yin, Fu-Liang II-682, III-306, III-318 Yin, Hujun III-504, III-836 Yin, Li-Sheng II-759 Yin, Qinye III-102 Yin, Yu III-548 Yin, Yuhai III-876 Yoon, Tae-Ho II-40 You, Zhisheng I-405 Yu, Daoheng III-178 Yu, Da-Ren I-1373 Yu, Ding-Li II-1013, III-358 Yu, Ding-Wen II-1013 Yu, Fei III-231 Yu, Haibin III-1144 Yu, Jing I-50 Yu, Jinshou II-771, II-1172 Yu, Jinxia II-1227 Yu, Lean I-1261, III-498 Yu, Miao II-1046 Yu, Naigong I-27, I-489 Yu, Pei I-249 Yu, Qijiang II-771, II-1172 Yu, Shaoquan I-1121 Yu, Shi III-635 Yu, Shu I-981 Yu, Tao I-1103, III-982 Yu, Wan-Neng II-1062 Yu, Wen I-86, I-558, II-956, II-1110 Yu, Yanwei II-638 Yu, Yong II-253
Yu, Zhandong II-859 Yu, Zhi-gang II-1153 Yuan, Da II-511 Yuan, Dong-Feng II-706 Yuan, Jia-Zheng II-448 Yuan, Sen-Miao I-903 Yuan, Senmiao I-545 Yuan, Xiao II-735 Yuan, Zejian I-1063 Yue, Guangxue III-231 Yue, Heng III-852 Yue, Hong I-676 Yun, Ilgu III-1099 Zeng, Fanzi III-458 Zeng, Ling I-21 Zeng, Zhi II-429 Zeng, Zhigang I-115, I-824 Zha, Daifeng I-1177 Zhai, Chuan-Min I-747 Zhai, Jun-Yong II-881 Zhai, Wei-Gang I-1400 Zhai, Xiaohua II-1166 Zhang, Bing-Da II-1304 Zhang, Bo II-259, II-1269 Zhang, Byoung Tak III-642 Zhang, Changjiang II-551, II-747, II-777 Zhang, Changshui I-1312 Zhang, Chao II-575 Zhang, Dabo II-1402 Zhang, Dafang III-184 Zhang, David I-1387 Zhang, Dexian I-1381 Zhang, Dong-Zhong II-1386 Zhang, Erhu II-232 Zhang, Fan III-279 Zhang, Gang III-150 Zhang, Guangfan III-352 Zhang, Guang Lan III-716 Zhang, Guo-Jun I-747 Zhang, Hang III-33 Zhang, Haoran II-551, II-747, II-777 Zhang, Honghui II-1046 Zhang, Hua I-1195 Zhang, Hua-Guang I-129, I-198, I-804 Zhang, Huanshui I-109 Zhang, Hui I-1352, I-1394, III-512 Zhang, Jia-Cai III-592 Zhang, Jiashu III-214
Zhang, Jie II-943, II-1084, III-1132 Zhang, Jin II-93 Zhang, Jinmu I-1070 Zhang, Jiye I-217, I-236 Zhang, Julie Z. III-970 Zhang, Jun I-1267 Zhang, Junping I-1346 Zhang, Junying I-1159 Zhang, Kai I-590 Zhang, Kan-Jian II-881 Zhang, Ke II-1116, II-1304 Zhang, Kunpeng III-682 Zhang, Lei I-792 Zhang, Liang III-1334 Zhang, Liangpei I-1070 Zhang, Li-Biao I-752 Zhang, Li-Ming III-600 Zhang, Liming I-1334, II-404, II-616, II-719 Zhang, Lin I-590 Zhang, Ling II-14 Zhang, Liqing I-1202, I-1208, II-454 Zhang, Na II-436 Zhang, Nian II-1140 Zhang, Qing-Lin III-1304 Zhang, Qizhi II-1033 Zhang, Quanju I-359 Zhang, Sheng-Rui I-267 Zhang, Shiying II-827 Zhang, Stones Lei I-701 Zhang, Su-lan III-452 Zhang, Tao III-734 Zhang, Tengfei II-1363 Zhang, Tianqi II-694 Zhang, Wei III-332, III-1254 Zhang, Weihua I-217, I-236 Zhang, Weiyi II-55 Zhang, Xian-Da I-1406, I-1147, II-287 Zhang, Xiaodong III-352 Zhang, Xiao-Hui I-7 Zhang, Xiao-Li III-1187 Zhang, Xinhong III-279 Zhang, Yang III-836 Zhang, Yan-Qing I-922 Zhang, Yi II-1096, III-40 Zhang, Yi-Bo II-928 Zhang, Yong III-1078 Zhang, Yonghua III-458 Zhang, Yong-Qian II-523 Zhang, Yong-sheng II-337
Zhang, Yong-shun I-1412 Zhang, Zengke II-575 Zhang, Zhi-Kang I-50 Zhang, Zhi-Lin I-1109, II-688, II-700 Zhang, Zhisheng II-827 Zhang, Zhi-yong III-452 Zhang, Zhousuo III-390 Zhang, Zhu-Hong I-739 Zhao, Dong-bin II-1160 Zhao, Hai I-537 Zhao, Hui-Fang III-914 Zhao, Jianye III-178 Zhao, Jidong II-104 Zhao, Li II-77, II-194 Zhao, Liangyu III-531 Zhao, Lianwei I-482 Zhao, Mei-fang II-361 Zhao, Nan III-171 Zhao, Qiang II-1090 Zhao, Qijun I-1387 Zhao, Song-Zheng III-906 Zhao, Weirui I-109 Zhao, Xiao-Jie III-579, III-592 Zhao, Xiren II-859 Zhao, Xiu-Rong I-1299 Zhao, Xue-Zhi II-727 Zhao, Yaqin III-171 Zhao, Yinliang I-936 Zhao, Yuzhang I-1177 Zhao, Zeng-Shun II-523 Zhao, Zhefeng III-150 Zhao, Zhong-Gai III-939, III-945, III-951 Zheng, Chun-Hou I-1165, II-355 Zheng, En-Hui I-1050 Zheng, Hua-Yao II-1062 Zheng, Jie I-1326 Zheng, Nanning III-46 Zheng, Shiqing III-1065 Zheng, Shiyou II-842 Zheng, Weifan I-217 Zheng, Wenming II-77, II-85, II-194 Zheng, Ziming II-110 Zhong, Bo II-1259
Zhong, Jiang I-866 Zhong, Weimin II-911, II-916 Zhou, Chun-Guang I-752, I-1222, III-674, III-696 Zhou, Jian-Zhong I-880 Zhou, Ji-Liu II-735 Zhou, Jin I-303 Zhou, Jun-Lin I-1293 Zhou, Ligang I-1261 Zhou, Li-qun I-211 Zhou, Qiang III-1132 Zhou, Wei II-110, III-1304 Zhou, Wengang III-696 Zhou, Xiaobo III-702 Zhou, Yali II-1033 Zhou, Yiming II-575 Zhou, Ying I-866 Zhou, Zhengzhong II-694 Zhou, Zongtan III-620 Zhu, Chao-jie II-337 Zhu, Feng I-1406, II-287, II-383, III-144 Zhu, Jihong II-849 Zhu, Ke-jun III-1173 Zhu, Li I-577 Zhu, Liangkuan I-974 Zhu, Ming I-1044 Zhu, Qi-guang III-1380 Zhu, Shan-An III-566 Zhu, Shuang-dong III-40 Zhu, Shunzhi III-921 Zhu, Wei I-261 Zhu, Wenjie I-895 Zhu, Yuanxian III-696 Zhu, Yun-Long I-570 Zhuang, Yin-Fang II-122 Zong, Xuejun III-852 Zou, An-Min II-1218 Zou, Cairong II-77, II-194 Zou, Fuhao II-638 Zou, Ling III-566 Zou, Shuxue III-674 Zou, Weibao II-331 Zurada, Jacek M. I-100