Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
6064
Liqing Zhang Bao-Liang Lu James Kwok (Eds.)
Advances in Neural Networks – ISNN 2010 7th International Symposium on Neural Networks, ISNN 2010 Shanghai, China, June 6-9, 2010 Proceedings, Part II
Volume Editors Liqing Zhang Bao-Liang Lu Department of Computer Science and Engineering Shanghai Jiao Tong University 800, Dongchuan Road Shanghai 200240, China E-mail: {zhang-lq; blu}@cs.sjtu.edu.cn James Kwok Department of Computer Science and Engineering The Hong Kong University of Science and Technology Clear Water Bay, Kowloon, Hong Kong, China E-mail:
[email protected]
Library of Congress Control Number: 2010927009
CR Subject Classification (1998): I.4, F.1, I.2, I.5, H.3, J.3
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN: 0302-9743
ISBN-10: 3-642-13317-7 Springer Berlin Heidelberg New York
ISBN-13: 978-3-642-13317-6 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
This book and its sister volume collect refereed papers presented at the 7th International Symposium on Neural Networks (ISNN 2010), held in Shanghai, China, June 6-9, 2010. Building on the success of the previous six ISNN symposiums, ISNN has become a well-established series of popular and high-quality conferences on neural computation and its applications. ISNN aims at providing a platform for scientists, researchers, engineers, and students to gather together to present and discuss the latest progress in neural networks and their applications in diverse areas. Nowadays, the field of neural networks has developed far beyond traditional artificial neural networks. This year, ISNN 2010 received 591 submissions from more than 40 countries and regions. Based on rigorous reviews, 170 papers were selected for publication in the proceedings. The papers collected in the proceedings cover a broad spectrum of fields, ranging from neurophysiological experiments and neural modeling to extensions and applications of neural networks. We have organized the papers into two volumes based on their topics. The first volume, entitled “Advances in Neural Networks – ISNN 2010, Part I,” covers the following topics: neurophysiological foundation, theory and models, learning and inference, and neurodynamics. The second volume, entitled “Advances in Neural Networks – ISNN 2010, Part II,” covers the following five topics: SVM and kernel methods, vision and image, data mining and text analysis, BCI and brain imaging, and applications. In addition to the contributed papers, four distinguished scholars (Andrzej Cichocki, Chin-Teng Lin, DeLiang Wang, and Gary G. Yen) were invited to give plenary talks, providing us with recent hot topics, the latest developments, and novel applications of neural networks. ISNN 2010 was organized by Shanghai Jiao Tong University, Shanghai, China, and The Chinese University of Hong Kong, China. Sponsorship was obtained from Shanghai Jiao Tong University and The Chinese University of Hong Kong. The symposium was also co-sponsored by the National Natural Science Foundation of China. We would like to acknowledge technical support from the IEEE Shanghai Section, International Neural Network Society, IEEE Computational Intelligence Society, Asia Pacific Neural Network Assembly, International Association for Mathematics and Computers in Simulation, and European Neural Network Society. We would like to express our sincere gratitude to the members of the Advisory Committee, Organizing Committee and Program Committee, in particular to Jun Wang and Zhigang Zeng, and to the reviewers and the organizers of special sessions for their contributions during the preparation of this conference. We would also like to thank the invited speakers for their valuable plenary talks at the conference.
Acknowledgement is also given to Springer for the continuous support and fruitful collaboration from the first ISNN to this seventh one.
March 2010
Liqing Zhang James Kwok Bao-Liang Lu
ISNN 2010 Organization
ISNN 2010 was organized and sponsored by Shanghai Jiao Tong University and The Chinese University of Hong Kong, and it was technically cosponsored by the IEEE Shanghai Section, International Neural Network Society, IEEE Computational Intelligence Society, Asia Pacific Neural Network Assembly, International Association for Mathematics and Computers in Simulation, and European Neural Network Society. It was financially supported by the National Natural Science Foundation of China.
General Chairs Jun Wang Bao-Liang Lu
Hong Kong, China Shanghai, China
Organizing Committee Chair Jianbo Su
Shanghai, China
Program Committee Chairs Liqing Zhang Zhigang Zeng James T.Y. Kwok
Shanghai, China Wuhan, China Hong Kong, China
Special Sessions Chairs Si Wu Qing Ma Paul S. Pang
Shanghai, China Kyoto, Japan Auckland, New Zealand
Publications Chairs Hongtao Lu Yinling Wang Wenlian Lu
Shanghai, China Shanghai, China Shanghai, China
Publicity Chairs Bo Yuan Xiaolin Hu Qingshan Liu
Shanghai, China Beijing, China Nanjing, China
Finance Chairs Xinping Guan Xiangyang Zhu
Shanghai, China Shanghai, China
Registration Chairs Fang Li Gui-Rong Xue Daniel W.C. Ho
Shanghai, China Shanghai, China Hong Kong, China
Local Arrangements Chairs Qingsheng Ren Xiaodong Gu
Shanghai, China Shanghai, China
Advisory Committee Chairs Xiaowei Tang Bo Zhang Aike Guo
Hangzhou, China Beijing, China Shanghai, China
Advisory Committee Members Cesare Alippi, Milan, Italy Shun-ichi Amari, Tokyo, Japan Zheng Bao, Xi'an, China Dimitri P. Bertsekas, Cambridge, MA, USA Tianyou Chai, Shenyang, China Guanrong Chen, Hong Kong Andrzej Cichocki, Tokyo, Japan Ruwei Dai, Beijing, China Jay Farrell, Riverside, CA, USA Chunbo Feng, Nanjing, China Russell Eberhart, Indianapolis, IN, USA David Fogel, San Diego, CA, USA Walter J. Freeman, Berkeley, CA, USA Kunihiko Fukushima, Osaka, Japan Xingui He, Beijing, China Zhenya He, Nanjing, China Janusz Kacprzyk, Warsaw, Poland Nikola Kasabov, Auckland, New Zealand Okyay Kaynak, Istanbul, Turkey
Anthony Kuh, Honolulu, HI, USA Frank L. Lewis, Fort Worth, TX, USA Deyi Li, Beijing, China Yanda Li, Beijing, China Chin-Teng Lin, Hsinchu, Taiwan Robert J. Marks II, Waco, TX, USA Erkki Oja, Helsinki, Finland Nikhil R. Pal, Calcutta, India Marios M. Polycarpou, Nicosia, Cyprus José C. Príncipe, Gainesville, FL, USA Leszek Rutkowski, Czestochowa, Poland Jennie Si, Tempe, AZ, USA Youxian Sun, Hangzhou, China DeLiang Wang, Columbus, OH, USA Fei-Yue Wang, Beijing, China Shoujue Wang, Beijing, China Paul J. Werbos, Washington, DC, USA Cheng Wu, Beijing, China Donald C. Wunsch II, Rolla, MO, USA Youlun Xiong, Wuhan, China
Lei Xu, Hong Kong Shuzi Yang, Wuhan, China Xin Yao, Birmingham, UK Gary G. Yen, Stillwater, OK, USA
Nanning Zheng, Xi'an, China Yongchuan Zhang, Wuhan, China Jacek M. Zurada, Louisville, KY, USA
Program Committee Members Haydar Akca Alma Y. Alanis Bruno Apolloni Sabri Arik Vijayan Asari Tao Ban Peter Baranyi Salim Bouzerdoum Martin Brown Xindi Cai Jianting Cao Yu Cao Jonathan Chan Chu-Song Chen Liang Chen Sheng Chen Songcan Chen YangQuan Chen Yen-Wei Chen Zengqiang Chen Jianlin Cheng Li Cheng Long Cheng Zheru Chi Sung-Bae Cho Emilio Corchado Jose Alfredo F. Costa Ruxandra Liana Costea Sergio Cruces Baotong Cui Chuanyin Dang Mingcong Deng Ming Dong Jixiang Du Andries Engelbrecht
Meng Joo Er Jufu Feng Chaojin Fu Wai-Keung Fung John Gan Junbin Gao Xiao-Zhi Gao Xinping Guan Chen Guo Chengan Guo Ping Guo Abdenour Hadid Honggui Han Qing-Long Han Haibo He Hanlin He Zhaoshui He Akira Hirose Daniel Ho Noriyasu Homma Zhongsheng Hou Chun-Fei Hsu Huosheng Hu Jinglu Hu Junhao Hu Sanqing Hu Guang-Bin Huang Tingwen Huang Wei Hui Amir Hussain Jayadeva Minghui Jiang Tianzi Jiang Yaochu Jin Joarder Kamruzzaman
Shunshoku Kanae Qi Kang Nik Kasabov Okyay Kaynak Rhee Man Kil Kwang-Baek Kim Sungshin Kim Mario Koeppen Rakhesh Singh Kshetrimayum Edmund Lai Heung Fai Lam Minho Lee Chi-Sing Leung Henry Leung Chuandong Li Fang Li Guang Li Kang Li Li Li Shaoyuan Li Shutao Li Xiaoli Li Xiaoou Li Xuelong Li Yangmin Li Yuanqing Li Yun Li Zhong Li Jinling Liang Ming Liang Pei-Ji Liang Yanchun Liang Li-Zhi Liao Wudai Liao Longnian Lin Guoping Liu Ju Liu Meiqin Liu Yan Liu Hongtao Lu Jianquan Lu Jinhu Lu Wenlian Lu Jian Cheng Lv Jinwen Ma Malik Magdon Ismail Danilo Mandic
Tiemin Mei Dan Meng Yan Meng Duoqian Miao Martin Middendorf Valeri Mladenov Marco Antonio Moreno-Armendáriz Ikuko Nishkawa Stanislaw Osowski Seiichi Ozawa Shaoning Pang Jaakko Peltonen Vir V. Phoha Branimir Reljin Qingsheng Ren Tomasz Rutkowski Sattar B. Sadkhan Toshimichi Saito Gerald Schaefer Furao Shen Daming Shi Hideaki Shimazaki Michael Small Qiankun Song Jochen J. Steil John Sum Roberto Tagliaferri Norikazu Takahashi Ah-hwee Tan Ying Tan Toshihisa Tanaka Dacheng Tao Ruck Thawonmas Xin Tian Christos Tjortjis Ivor Tsang Masao Utiyama Marc Vanhulle Bin Wang Dan Wang Dianhui Wang Lei Wang Liang Wang Rubin Wang Wenjia Wang Wenwu Wang Xiaoping Wang
Xin Wang Yinglin Wang Yiwen Wang Zhanzhan Wang Zhongsheng Wang Zidong Wang Hau-San Wong Kevin Wong Wei Wu Cheng Xiang Hong Xie Songyun Xie Rui Xu Xin Xu Guirong Xue Yang Yang Yingjie Yang Yongqing Yang Jianqiang Yi
Dingli Yu Jian Yu Xiao-Hua Yu Bo Yuan Kun Yuan Pong C Yuen Xiaoqin Zeng Changshui Zhang Jie Zhang Junping Zhang Kai Zhang Lei Zhang Nian Zhang Dongbin Zhao Hai Zhao Liang Zhao Qibin Zhao Mingjun Zhong Weihang Zhu
Reviewers Ajith Abraham Alma Y. Alanis N.G. Alex Jing An Sung Jun An Claudia Angelini Nancy Arana-Daniel Nancy Arana-Daniel Kiran Balagani Tao Ban Simone Bassis Anna Belardinelli Joao Roberto Bertini Junior Amit Bhaya Shuhui Bi Xuhui Bo Salim Bouzerdoum N. Bu Qiao Cai Xindi Cai Hongfei Cao Yuan Cao Jonathan Chan
Wenge Chang Benhui Chen Bo-Chiuan Chen Chao-Jung Chen Chu-Song Chen Cunbao Chen Fei Chen Gang Chen Guici Chen Junfei Chen Lei Chen Min Chen Pin-Cheng Chen Sheng Chen Shuwei Chen Tao Chen Xiaofen Chen Xiaofeng Chen Yanhua Chen Yao Chen Zengqiang Chen Zhihao Chen Jianlin Cheng K. H. Cheng
Lei Cheng Yu Cheng Yuhu Cheng Seong-Pyo Cheon Zheru Chi Seungjin Choi Angelo Ciaramella Matthew Conforth Paul Christopher Conilione Paleologu Constantin Jose Alfredo F. Costa Ruxandra Liana Costea Fangshu Cui Zhihua Cui James Curry Qun Dai Xinyu Dai Spiros Denaxas Jing Deng Xin Deng Zhijian Diao Ke Ding Jan Dolinsky
Yongsheng Dong Adriao Duarte Doria Neto Dajun Du Jun Du Shengzhi Du Wei Du Qiguo Duan Zhansheng Duan Julian Eggert Yong Fan Chonglun Fang Italia De Feis G.C. Feng Qinrong Feng Simone Fiori Chaojin Fu Jun Fu Zhengyong Fu Zhernyong Fu Sheng Gan Shenghua Gao Fei Ge Vanessa Goh Dawei Gong Weifeng Gu Wenfei Gu Renchu Guan Chengan Guo Jianmei Guo Jun Guo Ping Guo Xin Guo Yi Guo Juan Carlos Gutierrez Caceres Osamu Hasegawa Aurelien Hazart Hanlin He Huiguang He Lianghua He Lin He Wangli He Xiangnan He Zhaoshui He Sc Ramon Hernandez Esteban Hernandez-Vargas
Kevin Ho Xia Hong Chenping Hou Hui-Huang Hsu Enliang Hu Jinglu Hu Junhao Hu Meng Hu Sanqing Hu Tianjiang Hu Xiaolin Hu Zhaohui Hu Bonan Huang Chun-Rong Huang Dan Huang J. Huang Kaizhu Huang Shujian Huang Xiaodi Huang Xiaolin Huang Zhenkun Huang Cong Hui GuoTao Hui Khan M. Iftekharuddin Tasadduq Imam Teijiro Isokawa Mingjun Ji Zheng Ji Aimin Jiang Changan Jiang Feng Jiang Lihua Jiang Xinwei Jiang Gang Jin Ning Jin Yaochu Jin Krzysztof Siwek Yiannis Kanellopoulos Enam Karim Jia Ke Salman Khan Sung Shin Kim Tae-Hyung Kim Mitsunaga Kinjo Arto Klami Mario Koeppen Adam Kong
Hui Kong Qi Kong Adam Krzyzak Jayanta Kumar Debnath Kandarpa Kumar Sarma Franz Kurfess Paul Kwan Darong Lai Jiajun Lai Jianhuang Lai Wei Lai Heung Fai Lam Paul Lam Yuan Lan Ngai-Fong Law N. K. Lee Chi SingLeung Bing Li Boyang Li C. Li Chaojie Li Chuandong Li Dazi Li Guang Li Junhua Li Kang Li Kelin Li Li Li Liping Li Lulu Li Manli Li Peng Li Ping Li Ruijiang Li Tianrui Li Tieshan Li Xiaochen Li Xiaocheng Li Xuelong Li Yan Li Yun Li Yunxia Li Zhenguo Li Allan Liang Jinling Liang Pei-Ji Liang Li-Zhi Liao
Wudai Liao Hongfei Lin Qing Lin Tran Hoai Lin Bo Liu Chang Liu Chao Liu Fei Liu Hongbo Liu Jindong Liu Lei Liu Lingqiao Liu Nianjun Liu Qingshan Liu Wei Liu Xiangyang Liu Xiwei Liu Yan Liu Yanjun Liu Yu Liu Zhaobing Liu Zhenwei Liu Jinyi Long Jinyi Long Carlos Lopez-Franco Shengqiang Lou Mingyu Lu Ning Lu S.F. Lu Bei Lv Jun Lv Fali Ma Libo Ma Singo Mabu Danilo Mandic Qi Mao Tomasz Markiewicz Radoslaw Mazur Tiemin Mei Bo Meng Zhaohui Meng Marna van der Merwe Martin Middendorf N. Mitianoudis Valeri Mladenov Alex Moopenn Marco Moreno
Loredana Murino Francesco Napolitano Ikuko Nishkawa Tohru Nitta Qiu Niu Qun Niu Chakarida Nukoolkit Sang-Hoon Oh Floriberto Ortiz Stanislaw Osowski Antonio de Padua Braga Antonio Paiva Shaoning Pang Woon Jeung Park Juuso Parkkinen Michael Paul Anne Magály de Paula Canuto Zheng Pei Jaakko Peltonen Ce Peng Hanchuan Peng Jau-Woei Perng Son Lam Phung Xiong Ping Kriengkrai Porkaew Santitham Prom-on Dianwei Qian Lishan Qiao Keyun Qin Meikang Qiu Li Qu Marcos G. Quiles Mihai Rebican Luis J. Ricalde Jorge Rivera Haijun Rong Zhihai Rong Tomasz Rutkowski Jose A. Ruz Edgar N. Sanchez Sergio P. Santos Renato José Sassi Chunwei Seah Nariman Sepehri Caifeng Shan Shiguang Shan
Chunhua Shen Furao Shen Jun Shen Yi Shen Jiuh-Biing Sheu Licheng Shi Qinfeng Shi Xiaohu Shi Si Si Leandro Augusto da Silva Angela Slavova Sunantha Sodsee Dandan Song Dongjin Song Doo Heon Song Mingli Song Qiang Song Qiankun Song Kingkarn Sookhanaphibarn Gustavo Fontoura de Souza Antonino Staiano Jochen Steil Pui-Fai Sum Jian Sun Jian-Tao Sun Junfeng Sun Liang Sun Liming Sun Ning Sun Yi Sun Shigeru Takano Mingkui Tan Ke Tang Kecheng Tang Y. Tang Liang Tao Yin Tao Sarwar Tapan Ruck Thawonmas Tuan Hue Thi Le Tian Fok Hing Chi Tivive Christos Tjortjis Rutkowski Tomasz Julio Tovar
Jianjun Tu Zhengwen Tu Goergi Tzenov Lorenzo Valerio Rodrigo Verschae Liang Wan Min Wan Aihui Wang Bin Wang Bo Hyun Wang Chao Wang Chengyou Wang Dianhui Wang Guanjun Wang Haixian Wang Hongyan Wang Huidong Wang Huiwei Wang Jingguo Wang Jinghua Wang Lan Wang Li Wang Lili Wang Lizhi Wang Min Wang Ming Wang Pei Wang Ruizhi Wang Xiaolin Wang Xiaowei Wang Xin Wang Xu Wang Yang Wang Ying Wang You Wang Yunyun Wang Zhanshan Wang Zhengxia Wang Zhenxing Wang Zhongsheng Wang Bunthit Watanapa Hua-Liang Wei Qinglai Wei Shengjun Wen Young-Woon Woo Ailong Wu Chunguo Wu
Jun Wu Qiang Wu Si Wu Xiangjun Wu Yili Xia Zeyang Xia Cheng Xiang Linying Xiang Shiming Xiang Xiaoliang Xie Ping Xiong Zhihua Xiong Fang Xu Feifei Xu Heming Xu Jie Xu LinLi Xu Rui Xu Weihong Xu Xianyun Xu Xin Xu Hui Xue Jing Yang Liu Yang Qingshan Yang Rongni Yang Shangming Yang Wen-Jie Yang Wenlu Yang Wenyun Yang Xubing Yang Yan Yang Yongqing Yang Zi-Jiang Yang John Yao Jun Yao Yingtao Yao Keiji Yasuda Ming-Feng Yeh Xiao Yi Chenkun Yin Kaori Yoshida WenwuYu Xiao-Hua Yu Kun Yuan Weisu Yuan Xiaofang Yuan
Zhuzhi Yuan Zhuzhu Yuan P.C. Yuen Masahiro Yukawa Lianyin Zhai Biao Zhang Changshui Zhang Chen Zhang Dapeng Zhang Jason Zhang Jian Zhang Jianbao Zhang Jianhai Zhang Jianhua Zhang Jin Zhang Junqi Zhang Junying Zhang Kai Zhang Leihong Zhang Liming Zhang Nengsheng Zhang Nian Zhang Pu-Ming Zhang Qing Zhang Shaohong Zhang Tao Zhang Teng-Fei Zhang Ting Zhang Xian-Ming Zhang Yuyang Zhang Hai Zhao Qibin Zhao Xiaoyu Zhao Yi Zhao Yongping Zhao Yongqing Zhao Ziyang Zhen Chengde Zheng Lihong Zheng Yuhua Zheng Caiming Zhong Mingjun Zhong Shuiming Zhong Bo Zhou Jun Zhou Luping Zhou Rong Zhou
Xiuling Zhou Haojin Zhu Song Zhu
Wenjun Zhu Xunlin Zhu Yuanming Zhu
Wei-Wen Zou Xin Zou Pavel Zuñiga
Qiang Wang Qiang Wu
Rong Zhou Tianqi Zhang
Secretariat Jin Gang Kan Hong
Table of Contents – Part II
SVM and Kernel Methods Support Vector Regression and Ant Colony Optimization for Grid Resources Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guosheng Hu, Liang Hu, Jing Song, Pengchao Li, Xilong Che, and Hongwei Li An Improved Kernel Principal Component Analysis for Large-Scale Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Weiya Shi and Dexian Zhang
1
9
Software Defect Prediction Using Fuzzy Support Vector Regression . . . . . Zhen Yan, Xinyu Chen, and Ping Guo
17
Refining Kernel Matching Pursuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jianwu Li and Yao Lu
25
Optimization of Training Samples with Affinity Propagation Algorithm for Multi-class SVM Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guangjun Lv, Qian Yin, Bingxin Xu, and Ping Guo
33
An Effective Support Vector Data Description with Relevant Metric Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhe Wang, Daqi Gao, and Zhisong Pan
42
A Support Vector Machine (SVM) Classification Approach to Heart Murmur Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Samuel Rud and Jiann-Shiou Yang
52
Genetic Algorithms with Improved Simulated Binary Crossover and Support Vector Regression for Grid Resources Prediction . . . . . . . . . . . . . Guosheng Hu, Liang Hu, Qinghai Bai, Guangyu Zhao, and Hongwei Li Temporal Gene Expression Profiles Reconstruction by Support Vector Regression and Framelet Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei-Feng Zhang, Chao-Chun Liu, and Hong Yan Linear Replicator in Kernel Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei-Chen Cheng and Cheng-Yuan Liou Coincidence of the Solutions of the Modified Problem with the Original Problem of v-MC-SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xin Xue, Taian Liu, Xianming Kong, and Wei Zhang
60
68 75
83
Vision and Image Frequency Spectrum Modification: A New Model for Visual Saliency Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dongyue Chen, Peng Han, and Chengdong Wu 3D Modeling from Multiple Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei Zhang, Jian Yao, and Wai-Kuen Cham
90 97
Infrared Face Recognition Based on Histogram and K-Nearest Neighbor Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shangfei Wang and Zhilei Liu
104
Palmprint Recognition Using 2D-Gabor Wavelet Based Sparse Coding and RBPNN Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Li Shang, Wenjun Huai, Guiping Dai, Jie Chen, and Jixiang Du
112
Global Face Super Resolution and Contour Region Constraints . . . . . . . . Chengdong Lan, Ruimin Hu, Tao Lu, Ding Luo, and Zhen Han
120
An Approach to Texture Segmentation Analysis Based on Sparse Coding Model and EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lijuan Duan, Jicai Ma, Zhen Yang, and Jun Miao
128
A Novel Object Categorization Model with Implicit Local Spatial Relationship . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lina Wu, Siwei Luo, and Wei Sun
136
Facial Expression Recognition Method Based on Gabor Wavelet Features and Fractional Power Polynomial Kernel PCA . . . . . . . . . . . . . . . Shuai-shi Liu and Yan-tao Tian
144
Affine Invariant Topic Model for Generic Object Recognition . . . . . . . . . . Zhenxiao Li and Liqing Zhang
152
Liver Segmentation from Low Contrast Open MR Scans Using K-Means Clustering and Graph-Cuts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yen-Wei Chen, Katsumi Tsubokawa, and Amir H. Foruzan
162
A Biologically-Inspired Automatic Matting Method Based on Visual Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei Sun, Siwei Luo, and Lina Wu
170
Palmprint Classification Using Wavelets and AdaBoost . . . . . . . . . . . . . . . Guangyi Chen, Wei-ping Zhu, Bal´ azs K´egl, and R´ obert Busa- Fekete Face Recognition Based on Gabor-Enhanced Manifold Learning and SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chao Wang and Chengan Guo
178
184
Gradient-based Local Descriptor and Centroid Neural Network for Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nguyen Thi Bich Huyen, Dong-Chul Park, and Dong-Min Woo
192
Mean Shift Segmentation Method Based on Hybridized Particle Swarm Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yanling Li and Gang Li
200
Palmprint Recognition Using Polynomial Neural Network . . . . . . . . . . . . . LinLin Huang and Na Li
208
Motion Detection Based on Biological Correlation Model . . . . . . . . . . . . . . Bin Sun, Nong Sang, Yuehuan Wang, and Qingqing Zheng
214
Research on a Novel Image Encryption Scheme Based on the Hybrid of Chaotic Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhengqiang Guan, Jun Peng, and Shangzhu Jin Computational and Neural Mechanisms for Visual Suppression . . . . . . . . Charles Q. Wu Visual Selection and Attention Shifting Based on FitzHugh-Nagumo Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Haili Wang, Yuanhua Qiao, Lijuan Duan, Faming Fang, Jun Miao, and Bingpeng Ma
222 230
240
Data Mining and Text Analysis Pruning Training Samples Using a Supervised Clustering Algorithm . . . . Minzhang Huang, Hai Zhao, and Bao-Liang Lu
250
An Extended Validity Index for Identifying Community Structure in Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jian Liu
258
Selected Problems of Intelligent Corpus Analysis through Probabilistic Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Keith Douglas Stuart, Maciej Majewski, and Ana Botella Trelis
268
A Novel Chinese Text Feature Selection Method Based on Probability Latent Semantic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jiang Zhong, Xiongbing Deng, Jie Liu, Xue Li, and Chuanwei Liang
276
A New Closeness Metric for Social Networks Based on the k Shortest Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chun Shang, Yuexian Hou, Shuo Zhang, and Zhaopeng Meng
282
A Location Based Text Mining Method Using ANN for Geospatial KDD Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chung-Hong Lee, Hsin-Chang Yang, and Shih-Hao Wang
292
Modeling Topical Trends over Continuous Time with Priors . . . . . . . . . . . Tomonari Masada, Daiji Fukagawa, Atsuhiro Takasu, Yuichiro Shibata, and Kiyoshi Oguri
302
Improving Sequence Alignment Based Gene Functional Annotation with Natural Language Processing and Associative Clustering . . . . . . . . . Ji He
312
Acquire Job Opportunities for Chinese Disabled Persons Based on Improved Text Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ShiLin Zhang and Mei Gu
322
Research and Application to Automatic Indexing . . . . . . . . . . . . . . . . . . . . Lei Wang, Shui-cai Shi, Xue-qiang Lv, and Yu-qin Li
330
Hybrid Clustering of Multiple Information Sources via HOSVD . . . . . . . . Xinhai Liu, Lieven De Lathauwer, Frizo Janssens, and Bart De Moor
337
A Novel Hybrid Data Mining Method Based on the RS and BP . . . . . . . . Kaiyu Tao
346
BCI and Brain Imaging Dynamic Extension of Approximate Entropy Measure for Brain-Death EEG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qiwei Shi, Jianting Cao, Wei Zhou, Toshihisa Tanaka, and Rubin Wang Multi-modal EEG Online Visualization and Neuro-Feedback . . . . . . . . . . . Kan Hong, Liqing Zhang, Jie Li, and Junhua Li Applications of Second Order Blind Identification to High-Density EEG-Based Brain Imaging: A Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Akaysha Tang A Method for MRI Segmentation of Brain Tissue . . . . . . . . . . . . . . . . . . . . Bochuan Zheng and Zhang Yi
353
360
368
378
Extract Mismatch Negativity and P3a through Two-Dimensional Nonnegative Decomposition on Time-Frequency Represented Event-Related Potentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fengyu Cong, Igor Kalyakin, Anh-Huy Phan, Andrzej Cichocki, Tiina Huttunen-Scott, Heikki Lyytinen, and Tapani Ristaniemi
385
The Coherence Changes in the Depressed Patients in Response to Different Facial Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wenqi Mao, Yingjie Li, Yingying Tang, Hui Li, and Jijun Wang
392
Estimation of Event Related Potentials Using Wavelet Denoising Based Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ling Zou, Cailin Tao, Xiaoming Zhang, and Renlai Zhou
400
Applications Adaptive Fit Parameters Tuning with Data Density Changes in Locally Weighted Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Han Lei, Xie Kun Qing, and Song Guo Jie
408
Structure Analysis of Email Networks by Information-Theoretic Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yinghu Huang and Guoyin Wang
416
Recognizing Mixture Control Chart Patterns with Independent Component Analysis and Support Vector Machine . . . . . . . . . . . . . . . . . . . Chi-Jie Lu, Yuehjen E. Shao, Po-Hsun Li, and Yu-Chiun Wang
426
Application of Rough Fuzzy Neural Network in Iron Ore Import Risk Early-Warning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . YunBing Hou and Juan Yang
432
Emotion Recognition and Communication for Reducing Second-Language Speaking Anxiety in a Web-Based One-to-One Synchronous Learning Environment . . . . . . . . . . . . . . . . . . . . . Chih-Ming Chen and Chin-Ming Hong A New Short-Term Load Forecasting Model of Power System Based on HHT and ANN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhigang Liu, Weili Bai, and Gang Chen Sensitivity Analysis of CRM Indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Virgilijus Sakalauskas and Dalia Kriksciuniene Endpoint Detection of SiO2 Plasma Etching Using Expanded Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sung-Ik Jeon, Seung-Gyun Kim, Sang-Jeen Hong, and Seung-Soo Han
439
448 455
464
Kernel Independent Component Analysis and Dynamic Selective Neural Network Ensemble for Fault Diagnosis of Steam Turbine . . . . . . . Dongfeng Wang, Baohai Huang, Yan Li, and Pu Han
472
A Neural Network Model for Evaluating Mobile Ad Hoc Wireless Network Survivability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tong Wang and ChuanHe Huang
481
Ultra High Frequency Sine and Sine Higher Order Neural Networks . . . . Ming Zhang
489
Robust Adaptive Control Scheme Using Hopfield Dynamic Neural Network for Nonlinear Nonaffine Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . Pin-Cheng Chen, Ping-Zing Lin, Chi-Hsu Wang, and Tsu-Tian Lee A New Intelligent Prediction Method for Grade Estimation . . . . . . . . . . . . Xiaoli Li, Yuling Xie, and Qianjin Guo
497 507
Kernel-Based Lip Shape Clustering with Phoneme Recognition for Real-Time Voice Driven Talking Face . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Po-Yi Shih, Jhing-Fa Wang, and Zong-You Chen
516
Dynamic Fixed-Point Arithmetic Design of Embedded SVM-Based Speaker Identification System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jhing-Fa Wang, Ta-Wen Kuan, Jia-Ching Wang, and Ta-Wei Sun
524
A Neural Network Based Model for Project Risk and Talent Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nadee Goonawardene, Shashikala Subashini, Nilupa Boralessa, and Lalith Premaratne Harnessing ANN for a Secure Environment . . . . . . . . . . . . . . . . . . . . . . . . . . Mee H. Ling and Wan H. Hassan
532
540
Facility Power Usage Modeling and Short Term Prediction with Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sunny Wan and Xiao-Hua Yu
548
Classification of Malicious Software Behaviour Detection with Hybrid Set Based Feed Forward Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yong Wang, Dawu Gu, Mi Wen, Haiming Li, and Jianping Xu
556
MULP: A Multi-Layer Perceptron Application to Long-Term, Out-of-Sample Time Series Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Eros Pasero, Giovanni Raimondo, and Suela Ruffa
566
Denial of Service Detection with Hybrid Fuzzy Set Based Feed Forward Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yong Wang, Dawu Gu, Mi Wen, Jianping Xu, and Haiming Li
576
Learning to Believe by Feeling: An Agent Model for an Emergent Effect of Feelings on Beliefs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zulfiqar A. Memon and Jan Treur
586
Soft Set Theoretic Approach for Discovering Attributes Dependency in Information Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tutut Herawan, Ahmad Nazari Mohd Rose, and Mustafa Mat Deris
596
An Application of Optimization Model to Multi-agent Conflict Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yu-Teng Chang, Chen-Feng Wu, and Chih-Yao Lo
606
Using TOPSIS Approach for Solving the Problem of Optimal Competence Set Adjustment with Multiple Target Solutions . . . . . . . . . . . Tsung-Chih Lai
615
About the End-User for Discovering Knowledge . . . . . . . . . . . . . . . . . . . . . . Amel Grissa Touzi
625
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
637
Table of Contents – Part I
Neurophysiological Foundation Stimulus-Dependent Noise Facilitates Tracking Performances of Neuronal Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Longwen Huang and Si Wu
1
Range Parameter Induced Bifurcation in a Single Neuron Model with Delay-Dependent Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Min Xiao and Jinde Cao
9
Messenger RNA Polyadenylation Site Recognition in Green Alga Chlamydomonas Reinhardtii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guoli Ji, Xiaohui Wu, Qingshun Quinn Li, and Jianti Zheng
17
A Study to Neuron Ensemble of Cognitive Cortex ISI Coding Represent Stimulus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hu Yi and Xin Tian
27
STDP within NDS Neurons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mario Antoine Aoun Synchronized Activities among Retinal Ganglion Cells in Response to External Stimuli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lei Xiao, Ying-Ying Zhang, and Pei-Ji Liang Novel Method to Discriminate Awaking and Sleep Status in Light of the Power Spectral Density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lengshi Dai, You Wang, Haigang Zhu, Walter J. Freeman, and Guang Li
33
44
51
Current Perception Threshold Measurement via Single Channel Electroencephalogram Based on Confidence Algorithm . . . . . . . . . . . . . . . . You Wang, Yi Qiu, Yuping Miao, Guiping Dai, and Guang Li
58
Electroantennogram Obtained from Honeybee Antennae for Odor Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . You Wang, Yuanzhe Zheng, Zhiyuan Luo, and Guang Li
63
A Possible Mechanism for Controlling Timing Representation in the Cerebellar Cortex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Takeru Honda, Tadashi Yamazaki, Shigeru Tanaka, and Tetsuro Nishino
67
Theory and Models Parametric Sensitivity and Scalability of k-Winners-Take-All Networks with a Single State Variable and Infinity-Gain Activation Functions . . . . Jun Wang and Zhishan Guo
77
Extension of the Generalization Complexity Measure to Real Valued Input Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Iv´ an G´ omez, Leonardo Franco, Jos´e M. Jerez, and Jos´e L. Subirats
86
A New Two-Step Gradient-Based Backpropagation Training Method for Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xuewen Mu and Yaling Zhang
95
A Large-Update Primal-Dual Interior-Point Method for Second-Order Cone Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Liang Fang, Guoping He, Zengzhe Feng, and Yongli Wang
102
A One-Step Smoothing Newton Method Based on a New Class of One-Parametric Nonlinear Complementarity Functions for P0 -NCP . . . . . Liang Fang, Xianming Kong, Xiaoyan Ma, Han Li, and Wei Zhang
110
A Neural Network Algorithm for Solving Quadratic Programming Based on Fibonacci Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jingli Yang and Tingsong Du
118
A Hybrid Particle Swarm Optimization Algorithm Based on Nonlinear Simplex Method and Tabu Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhanchao Li, Dongjian Zheng, and Huijing Hou
126
Fourier Series Chaotic Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jia-hai Zhang, Chen-zhi Sun, and Yao-qun Xu
136
Multi-objective Optimization of Grades Based on Soft Computing . . . . . . Yong He
144
Connectivity Control Methods and Decision Algorithms Using Neural Network in Decentralized Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Demin Li, Jie Zhou, Jiacun Wang, and Chunjie Chen
152
A Quantum-Inspired Artificial Immune System for Multiobjective 0-1 Knapsack Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jiaquan Gao, Lei Fang, and Guixia He
161
RBF Neural Network Based on Particle Swarm Optimization . . . . . . . . . . Yuxiang Shao, Qing Chen, and Hong Jiang
169
Genetic-Based Granular Radial Basis Function Neural Network . . . . . . . . Ho-Sung Park, Sung-Kwun Oh, and Hyun-Ki Kim
177
A Closed-Form Solution to the Problem of Averaging over the Lie Group of Special Orthogonal Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simone Fiori
185
A Lower Order Discrete-Time Recurrent Neural Network for Solving High Order Quadratic Problems with Equality Constraints . . . . . . . . . . . . Wudai Liao, Jiangfeng Wang, and Junyan Wang
193
A Experimental Study on Space Search Algorithm in ANFIS-Based Fuzzy Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei Huang, Lixin Ding, and Sung-Kwun Oh
199
Optimized FCM-Based Radial Basis Function Neural Networks: A Comparative Analysis of LSE and WLSE Method . . . . . . . . . . . . . . . . . . Wook-Dong Kim, Sung-Kwun Oh, and Wei Huang
207
Design of Information Granulation-Based Fuzzy Radial Basis Function Neural Networks Using NSGA-II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jeoung-Nae Choi, Sung-Kwun Oh, and Hyun-Ki Kim
215
Practical Criss-Cross Method for Linear Programming . . . . . . . . . . . . . . . . Wei Li
223
Calculating the Shortest Paths by Matrix Approach . . . . . . . . . . . . . . . . . . Huilin Yuan and Dingwei Wang
230
A Particle Swarm Optimization Heuristic for the Index Tacking Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hanhong Zhu, Yun Chen, and Kesheng Wang
238
Structural Design of Optimized Polynomial Radial Basis Function Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Young-Hoon Kim, Hyun-Ki Kim, and Sung-Kwun Oh
246
Convergence of the Projection-Based Generalized Neural Network and the Application to Nonsmooth Optimization Problems . . . . . . . . . . . . . . . . Jiao Liu, Yongqing Yang, and Xianyun Xu
254
Two-Dimensional Adaptive Growing CMAC Network . . . . . . . . . . . . . . . . . Ming-Feng Yeh
262
A Global Inferior-Elimination Thermodynamics Selection Strategy for Evolutionary Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fahong Yu, Yuanxiang Li, and Weiqin Ying
272
Particle Swarm Optimization Based Learning Method for Process Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kun Liu, Ying Tan, and Xingui He
280
Interval Fitness Interactive Genetic Algorithms with Variational Population Size Based on Semi-supervised Learning . . . . . . . . . . . . . . . . . . Xiaoyan Sun, Jie Ren, and Dunwei Gong
288
Research on One-Dimensional Chaos Maps for Fuzzy Optimal Selection Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tao Ding, Hongfei Xiao, and Jinbao Liu
296
Edited Nearest Neighbor Rule for Improving Neural Networks Classifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . R. Alejo, J.M. Sotoca, R.M. Valdovinos, and P. Toribio
303
A New Algorithm for Generalized Wavelet Transform . . . . . . . . . . . . . . . . . Feng-Qing Han, Li-He Guan, and Zheng-Xia Wang
311
Neural Networks Algorithm Based on Factor Analysis . . . . . . . . . . . . . . . . Shifei Ding, Weikuan Jia, Xinzheng Xu, and Hong Zhu
319
IterativeSOMSO: An Iterative Self-organizing Map for Spatial Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qiao Cai, Haibo He, Hong Man, and Jianlong Qiu
325
A Novel Method of Neural Network Optimized Design Based on Biologic Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ding Xiaoling, Shen Jin, and Fei Luo
331
Research on a Novel Ant Colony Optimization Algorithm . . . . . . . . . . . . . Gang Yi, Ming Jin, and Zhi Zhou A Sparse Infrastructure of Wavelet Network for Nonparametric Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jun Zhang, Zhenghui Gu, Yuanqing Li, and Xieping Gao Information Distances over Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maxime Houllier and Yuan Luo
339
347 355
Learning and Inference Regression Transfer Learning Based on Principal Curve . . . . . . . . . . . . . . . Wentao Mao, Guirong Yan, Junqing Bai, and Hao Li
365
Semivariance Criteria for Quantifying the Choice among Uncertain Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yankui Liu and Xiaoqing Wang
373
Enhanced Extreme Learning Machine with Modified Gram-Schmidt Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jianchuan Yin and Nini Wang
381
Solving Large N-Bit Parity Problems with the Evolutionary ANN Ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lin-Yu Tseng and Wen-Ching Chen
389
Multiattribute Bayesian Preference Elicitation with Pairwise Comparison Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shengbo Guo and Scott Sanner
396
Local Bayesian Based Rejection Method for HSC Ensemble . . . . . . . . . . . Qing He, Wenjuan Luo, Fuzhen Zhuang, and Zhongzhi Shi
404
Orthogonal Least Squares Based on Singular Value Decomposition for Spare Basis Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Min Han and De-cai Li
413
Spectral Clustering on Manifolds with Statistical and Geometrical Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yong Cheng and Qiang Tong
422
A Supervised Fuzzy Adaptive Resonance Theory with Distributed Weight Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Aisha Yousuf and Yi Lu Murphey
430
A Hybrid Neural Network Model Based Reinforcement Learning Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pengyi Gao, Chuanbo Chen, Kui Zhang, Yingsong Hu, and Dan Li
436
A Multi-view Regularization Method for Semi-supervised Learning . . . . . Jiao Wang, Siwei Luo, and Yan Li
444
Multi-reservoir Echo State Network with Sparse Bayesian Learning . . . . . Min Han and Dayun Mu
450
Leave-One-Out Cross-Validation Based Model Selection for Manifold Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jin Yuan, Yan-Ming Li, Cheng-Liang Liu, and Xuan F. Zha
457
Probability Density Estimation Based on Nonparametric Local Kernel Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Min Han and Zhi-ping Liang
465
A Framework of Decision Making Based on Maximal Supported Sets . . . Ahmad Nazari Mohd Rose, Tutut Herawan, and Mustafa Mat Deris
473
Neurodynamics Dynamics of Competitive Neural Networks with Inverse Lipschitz Neuron Activations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaobing Nie and Jinde Cao
483
Stability and Hopf Bifurcation of a BAM Neural Network with Delayed Self-feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shifang Kuang, Feiqi Deng, and Xuemei Li
493
Stability Analysis of Recurrent Neural Networks with Distributed Delays Satisfying Lebesgue-Stieljies Measures . . . . . . . . . . . . . . . . . . . . . . . . Zhanshan Wang, Huaguang Zhang, and Jian Feng
504
Stability of Genetic Regulatory Networks with Multiple Delays via a New Functional . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhenwei Liu and Huaguang Zhang
512
The Impulsive Control of the Projective Synchronization in the Drive-Response Dynamical Networks with Coupling Delay . . . . . . . . . . . . Xianyun Xu, Yun Gao, Yanhong Zhao, and Yongqing Yang
520
Novel LMI Stability Criteria for Interval Hopfield Neural Networks with Time Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaolin Li and Jia Jia
528
Memetic Evolutionary Learning for Local Unit Networks . . . . . . . . . . . . . . Roman Neruda and Petra Vidnerov´ a
534
Synchronization for a Class of Uncertain Chaotic Cellular Neural Networks with Time-Varying Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jianjun Tu and Hanlin He
542
Global Exponential Stability of Equilibrium Point of Hopfield Neural Network with Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaolin Liu and Kun Yuan
548
Stability of Impulsive Cohen-Grossberg Neural Networks with Delays . . . Jianfu Yang, Wensi Ding, Fengjian Yang, Lishi Liang, and Qun Hong
554
P-Moment Asymptotic Behavior of Nonautonomous Stochastic Differential Equation with Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bing Li, Yafei Zhou, and Qiankun Song
561
Exponential Stability of the Neural Networks with Discrete and Distributed Time-Varying Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qingbo Li, Peixu Xing, and Yuanyuan Wu
569
Mean Square Stability in the Numerical Simulation of Stochastic Delayed Hopfield Neural Networks with Markovian Switching . . . . . . . . . . Hua Yang, Feng Jiang, and Jiangrong Liu
577
The Existence of Anti-periodic Solutions for High-Order Cohen-Grossberg Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhouhong Li, Kaihong Zhao, and Chenxi Yang
585
Global Exponential Stability of BAM Type Cohen-Grossberg Neural Network with Delays on Time Scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chaolong Zhang, Wensi Ding, Fengjian Yang, and Wei Li
595
Multistability of Delayed Neural Networks with Discontinuous Activations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaofeng Chen, Yafei Zhou, and Qiankun Song
603
Finite-Time Boundedness Analysis of Uncertain CGNNs with Multiple Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaohong Wang, Minghui Jiang, Chuntao Jiang, and Shengrong Li
611
Dissipativity Analysis of Stochastic Neural Networks with Time-Varying Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jianting Zhou, Qiankun Song, and Jianxi Yang
619
Multistability Analysis: High-Order Networks Do Not Imply Greater Storage Capacity Than First-Order Ones . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhenkun Huang
627
Properties of Periodic Solutions for Common Logistic Model with Discrete and Distributed Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ting Zhang, Minghui Jiang, and Zhengwen Tu
635
New Results of Globally Exponentially Attractive Set and Synchronization Controlling of the Qi Chaotic System . . . . . . . . . . . . . . . . Jigui Jian, Xiaolian Deng, and Zhengwen Tu
643
Stability and Attractive Basin of Delayed Cohen-Grossberg Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ailong Wu, Chaojin Fu, and Xian Fu
651
Exponential Stability Analysis for Discrete-Time Stochastic BAM Neural Networks with Time-Varying Delays . . . . . . . . . . . . . . . . . . . . . . . . . Tiheng Qin, Quanxiang Pan, and Yonggang Chen
659
Invariant and Globally Exponentially Attractive Sets of Separated Variables Systems with Time-Varying Delays . . . . . . . . . . . . . . . . . . . . . . . . Zhengwen Tu, Jigui Jian, and Baoxian Wang
667
Delay-Dependent Stability of Nonlinear Uncertain Stochastic Systems with Time-Varying Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cheng Wang
675
Stability Analysis of Fuzzy Cohen-Grossberg Neural Networks with Distributed Delays and Reaction-Diffusion Terms . . . . . . . . . . . . . . . . . . . . Weifan Zheng and Jiye Zhang
684
Global Exponential Robust Stability of Delayed Hopfield Neural Networks with Reaction-Diffusion Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaohui Xu, Jiye Zhang, and Weihua Zhang
693
Stability and Bifurcation of a Three-Dimension Discrete Neural Network Model with Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei Yang and Chunrui Zhang
702
Globally Exponential Stability of a Class of Neural Networks with Impulses and Variable Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jianfu Yang, Hongying Sun, Fengjian Yang, Wei Li, and Dongqing Wu Discrete Time Nonlinear Identification via Recurrent High Order Neural Networks for a Three Phase Induction Motor . . . . . . . . . . . . . . . . . Alma Y. Alanis, Edgar N. Sanchez, Alexander G. Loukianov, and Marco A. Perez-Cisneros
711
719
Stability Analysis for Stochastic BAM Neural Networks with Distributed Time Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guanjun Wang
727
Dissipativity in Mean Square of Non-autonomous Impulsive Stochastic Neural Networks with Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhiguo Yang and Zhichun Yang
735
Stability Analysis of Discrete Hopfield Neural Networks Combined with Small Ones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Weigen Wu, Jimin Yuan, Jun Li, Qianrong Tan, and Xing Yin
745
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
753
Support Vector Regression and Ant Colony Optimization for Grid Resources Prediction

Guosheng Hu, Liang Hu, Jing Song, Pengchao Li, Xilong Che, and Hongwei Li

College of Computer Science and Technology, Jilin University, Changchun 130012, China
[email protected]
Abstract. Accurate grid resources prediction is crucial for a grid scheduler. In this study, support vector regression (SVR), which is an effective regression algorithm, is applied to grid resources prediction. In order to build an effective SVR model, SVR's parameters must be selected carefully. Therefore, we develop an ant colony optimization-based SVR (ACO-SVR) model that can automatically determine the optimal parameters of SVR with higher predictive accuracy and generalization ability simultaneously. The proposed model was tested with a grid resources benchmark data set. Experimental results demonstrated that ACO-SVR worked better than SVR optimized by a trial-and-error procedure (T-SVR) and a back-propagation neural network (BPNN).

Keywords: Grid resources prediction, Support vector regression, Ant colony optimization.
1 Introduction

Grid computing tries to enable all kinds of resources and services to be shared across the Internet. In the grid environment, the availability of grid resources varies over time, and such changes will affect the performance of the tasks running on the grid. If we can predict the future information of grid resources, the scheduler will be able to manage the grid resources more effectively. In grid resources prediction, many relevant research models [1-4] have been developed and have generated accurate predictions in practice. The Network Weather Service (NWS) [1] uses a combination of several models for the prediction of one resource. NWS allows some adaptation by dynamically choosing the model that has performed best recently for the next prediction, but its adaptation is limited to the selection of a model from several candidates that are conventional statistical models. Resource Prediction System (RPS) [2] is a project in which grid resources are modeled as linear time series processes. Multiple conventional linear models are evaluated, including AR, MA, ARMA, ARIMA and ARFIMA models. Their results show that the simple AR model is the best model of this class because of its good predictive power and low overhead. With the development of artificial neural networks (ANNs), ANNs have been successfully employed for modeling time series. Liu et al. [3] and Eswaradass et al. [4]
applied ANNs to grid resources prediction successfully. Experimental results showed that the ANN approach provided improved predictions over those of NWS. However, ANNs have some drawbacks, such as the difficulty of pre-selecting the system architecture, long training times, and a lack of knowledge representation facilities. In 1995, the support vector machine (SVM) was developed by Vapnik [5] to provide better solutions than ANNs. SVM can solve classification problems (SVC) and regression problems (SVR) successfully and effectively. However, the determination of SVR's parameters is an open problem and no general guidelines are available to select these parameters [5]. Ant Colony Optimization (ACO) [6] is a new evolutionary algorithm, and it has been successfully applied to various NP-hard combinatorial optimization problems. Therefore, in this study, ACO was adopted to automatically determine the optimal hyper-parameters of SVR.
2 Support Vector Regression

In order to solve regression problems, we are given training data (x_i, y_i) (i = 1, ..., l), where x is a d-dimensional input with x ∈ R^d and the output is y ∈ R. The linear regression model can be written as follows [7]:

$$ f(x) = \langle \omega, x \rangle + b, \quad \omega, x \in \mathbb{R}^{d}, \; b \in \mathbb{R} \qquad (1) $$
where f(x) is a target function and ⟨·,·⟩ denotes the dot product in R^d. The ε-insensitive loss function (Eq. (2)) proposed by Vapnik is specified to measure the empirical risk [7]:

$$ L_{\varepsilon}(y) = \begin{cases} 0 & \text{for } |f(x) - y| \le \varepsilon \\ |f(x) - y| - \varepsilon & \text{otherwise} \end{cases} \qquad (2) $$
Besides, the optimal parameters ω and b in Eq. (1) are found by solving the primal optimization problem [7]:

$$ \min \; \frac{1}{2}\|\omega\|^{2} + C \sum_{i=1}^{l} \left( \xi_i^{-} + \xi_i^{+} \right) \qquad (3) $$

with constraints:

$$ y_i - \langle \omega, x_i \rangle - b \le \varepsilon + \xi_i^{+}, \quad \langle \omega, x_i \rangle + b - y_i \le \varepsilon + \xi_i^{-}, \quad \xi_i^{-}, \xi_i^{+} \ge 0, \quad i = 1, \ldots, l \qquad (4) $$
where C is a pre-specified value that determines the trade-off between the flatness of f(x) and the amount up to which deviations larger than the precision ε are tolerated. The slack variables ξ⁺ and ξ⁻ represent the deviations from the constraints of the ε-tube. This primal optimization problem can be reformulated as a dual problem defined as follows:
$$ \max_{a, a^{*}} \; -\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \left( a_i^{*} - a_i \right)\left( a_j^{*} - a_j \right) \langle x_i, x_j \rangle + \sum_{i=1}^{l} y_i \left( a_i^{*} - a_i \right) - \varepsilon \sum_{i=1}^{l} \left( a_i^{*} + a_i \right) \qquad (5) $$

with constraints:

$$ 0 \le a_i, a_i^{*} \le C, \quad i = 1, \ldots, l, \qquad \sum_{i=1}^{l} \left( a_i - a_i^{*} \right) = 0. \qquad (6) $$
Solving the optimization problem defined by Eq.(5) and (6) gives the optimal Lagrange multipliers α and α*, while ω and b are given by
$$ \omega = \sum_{i=1}^{l} \left( a_i^{*} - a_i \right) x_i, \qquad b = -\frac{1}{2} \left\langle \omega, \left( x_r + x_s \right) \right\rangle, \qquad (7) $$
where xr and xs are support vectors. Sometimes nonlinear functions should be optimized, so this approach has to be extended. This is done by replacing xi by a mapping into feature space [7], φ(xi), which linearizes the relation between xi and yi. According to the computed value of ω , the f(x) in Eq.(1) can be written as:
f(x) = Σ_{i=1}^{N} (a_i* − a_i) · K(x_i, x) + b                             (8)
K(xi , x)=< φ(xi), φ(x)> is the so-called kernel function [7]. Any symmetric positive semi-definite function that satisfies Mercer’s Conditions [7] can be used as a kernel function. Our work is based on the RBF kernel [7].
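To make this formulation concrete, the following minimal Python sketch trains an ε-SVR with an RBF kernel for given hyper-parameters. It assumes scikit-learn's SVR (which solves the same ε-insensitive dual), and the mapping γ = 1/(2σ²) between the RBF width σ and scikit-learn's γ is an illustrative assumption, not part of the original study, which used LIBSVM.

```python
import numpy as np
from sklearn.svm import SVR

def train_svr(X_train, y_train, C=8.533, sigma=25.6, epsilon=0.05):
    # Hedged sketch of epsilon-SVR with an RBF kernel (Eqs. (3)-(8)).
    gamma = 1.0 / (2.0 * sigma ** 2)   # assumed parameterization of the RBF width
    model = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=epsilon)
    model.fit(X_train, y_train)
    return model

# Example usage on toy data:
# X = np.random.rand(150, 4); y = np.random.rand(150)
# y_pred = train_svr(X, y).predict(X)
```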
3 ACO-SVR Model This study proposed a new method, ACO-SVR, which optimized all SVR’s parameters simultaneously through ACO evolutionary process. Then, the acquired parameters were used to construct optimized SVR model. The details of ACO-SVR model are described as follows: (1) Path Selection: Each influencing factor in current system is regarded as a city node. An ant n in city a chooses the next city b to move to by applying the following probabilistic formula:
T(a, b) = { argmax{ τ_ab^t },   if q < Q0
          { Sr,                 otherwise                                  (9)
where q is a variable which is chosen randomly with uniform probability [0,1], Q0 ∈ (0,1) is a parameter and τ represents pheromone. Sr means that a standard roulette wheel selection is employed to determine the next city.
(2) Pheromone update: pheromone update is defined by:
τ_ij^{t+1} = (1 − ρ) × τ_ij^t + ρ Δτ_ij                                     (10)

Δτ_ij = { Q / Lb,   if (i, j) ∈ G
        { 0,        otherwise                                              (11)

where τ_ij^t signifies the amount of pheromone trail between city i and city j at time t; ρ ∈ (0,1) is a coefficient such that (1 − ρ) represents the evaporation of the pheromone level; Q is a constant and Lb is the tour length of the iteration-best solution; G is the factor-city set belonging to the iteration-best solution; and Δτ_ij is the pheromone trail accumulated between city i and city j in this iteration.
(3) Fitness evaluation: when all the ants have completed their paths, the parameter value corresponding to each path is calculated. To overcome the over-fitting phenomenon, the cross-validation technique successfully adopted by Duan [8] is used in the ACO-SVR model. In this study, the fitness function is defined as the mean squared error (MSE) between actual and predicted values using five-fold cross-validation. (4) Stopping criterion: the maximal number of iterations serves as the stopping criterion. It is selected as a trade-off between convergence time and accuracy; in this study, the maximal number of iterations is 100.
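The following Python sketch shows how steps (1)-(4) could fit together: ants sample candidate (C, σ, ε) values from discretized axes, each candidate is scored by five-fold cross-validated MSE, and pheromone is evaporated and reinforced from the iteration-best solution. The discretization, the helper cv_mse, and all constants are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVR

def cv_mse(X, y, C, gamma, epsilon, folds=5):
    """Five-fold cross-validated MSE used as the fitness value (MSEcv)."""
    errors = []
    for tr, va in KFold(n_splits=folds, shuffle=True, random_state=0).split(X):
        model = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=epsilon)
        model.fit(X[tr], y[tr])
        errors.append(np.mean((model.predict(X[va]) - y[va]) ** 2))
    return float(np.mean(errors))

def aco_search(X, y, n_ants=20, n_iter=100, q0=0.6, rho=0.8, grid_size=30):
    # One "city axis" per hyper-parameter, discretized over the search space.
    grids = [np.linspace(1e-3, 256, grid_size),    # C
             np.linspace(1e-4, 10, grid_size),     # RBF width parameter
             np.linspace(1e-3, 1, grid_size)]      # epsilon
    tau = [np.ones(grid_size) for _ in grids]      # pheromone per discrete value
    best, best_idx, best_fit = None, None, np.inf
    for _ in range(n_iter):
        for _ in range(n_ants):
            # Path selection (Eq. (9)): exploit argmax or roulette-wheel explore.
            idx = [np.argmax(t) if np.random.rand() < q0
                   else np.random.choice(len(t), p=t / t.sum()) for t in tau]
            params = [g[i] for g, i in zip(grids, idx)]
            fit = cv_mse(X, y, *params)
            if fit < best_fit:
                best, best_idx, best_fit = params, idx, fit
        # Pheromone update (Eqs. (10)-(11)) from the iteration-best solution.
        for axis, i in enumerate(best_idx):
            tau[axis] *= (1 - rho)
            tau[axis][i] += rho * (1.0 / best_fit)
    return best, best_fit
```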
4 Performance Evaluation 4.1 Data Preprocessing Strategy
In our experiment, we chose host load, one kind of typical grid resource, as the prediction object. For host load prediction, we chose "mystere10000.dat" as the benchmark data set [9] and took the last 204 items of the data set for our experiment. It is important to scale the data before applying the SVR method. Before the SVR was trained, all the data were linearly scaled into the interval (0, 1). When artificial intelligence technology is applied to the prediction of time series, the number of input nodes critically affects the prediction performance. Following Kuan [10], this study used 4 for the order of the autoregressive terms; thus, the 204 observation values became 200 input patterns. The first 150 input patterns were employed as the training set to build the models, and the remaining 50 input patterns were employed as the test set to estimate the generalization ability of the prediction models. The simulation of the SVR model was carried out using 'Libsvm', a toolbox for support vector machines originally designed by Chang and Lin [11]. The experimental results were obtained on a personal computer with an Intel Core 2 Duo processor @ 2.8 GHz and 2 GB RAM.
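A minimal sketch of this preprocessing, assuming the host-load series is already loaded into a one-dimensional NumPy array: values are linearly scaled into (0, 1), a sliding window of 4 lagged values forms each input pattern (200 patterns from 204 observations), and the patterns are split 150/50 into training and test sets. Function and variable names are illustrative.

```python
import numpy as np

def make_patterns(series, order=4, train_size=150):
    # Linearly scale the raw host-load values into (0, 1).
    lo, hi = series.min(), series.max()
    scaled = (series - lo) / (hi - lo + 1e-12)
    # Build autoregressive input patterns: 4 lagged values -> next value.
    X = np.array([scaled[i:i + order] for i in range(len(scaled) - order)])
    y = scaled[order:]
    return (X[:train_size], y[:train_size]), (X[train_size:], y[train_size:])

# series = np.loadtxt("mystere10000.dat")[-204:]   # last 204 items (file format is an assumption)
# (X_tr, y_tr), (X_te, y_te) = make_patterns(series)
```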
Some statistical metrics, such as NMSE and R, were used to evaluate the prediction performance of models [12].
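For reference, the standard definitions of these two criteria can be computed as below; the exact formulas follow [12] and are stated here as an assumption, since the paper does not reproduce them.

```python
import numpy as np

def nmse(y_true, y_pred):
    # Normalized MSE: mean squared error divided by the variance of the observations.
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def correlation_r(y_true, y_pred):
    # Linear correlation coefficient R between predictions and targets.
    return np.corrcoef(y_true, y_pred)[0, 1]
```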
4.2 Parameters Determination for Three Models
1) ACO-SVR model. The choices of ACO's parameters were based on numerous experiments, as those values provided the smallest MSEcv on the training data set. Table 1 gives an overview of the ACO parameter settings.

Table 1. ACO parameter settings
Ant number:                                20
Iteration number:                          100
Constant Q0 in Eq. (9):                    0.6
Evaporation coefficient of pheromone:      0.8
According to Wang [7], and for convenience of computation, we set the parameter search space to C ∈ (0, 256), σ ∈ (0, 256) and ε ∈ (0, 1). 2) T-SVR model. The traditional parameter selection procedure for SVR is trial and error, referred to here as the T-SVR model. The T-SVR model used the same training and test sets as ACO-SVR and the same parameter search space: C ∈ (0, 256), σ ∈ (0, 256) and ε ∈ (0, 1). Considering precision and computing time, we picked 30 equally spaced points from the search space of C, 30 from σ and 20 from ε, giving 18,000 (30 × 30 × 20) groups of parameters. The cross-validation technique was also applied to the trial-and-error procedure. The optimal parameters providing the smallest MSEcv on the training set were obtained after each group of parameters was tried. 3) BPNN model. In the area of time series prediction, the most popular ANN model is the BPNN, due to its simple architecture yet powerful problem-solving ability. The parameters of the BPNN in our experiment were set as follows. Hornik et al. [13] suggested that a one-hidden-layer network is sufficient to model any complex system with any desired accuracy; hence, a standard three-layer network, including one hidden layer, was used in our experiment. The number of nodes was set to 10 for the input layer, 4 for the hidden layer and 1 for the output layer. Rumelhart et al. [14] suggested using a small learning rate to set the network parameters, so the learning rate was set to 0.1. The hidden nodes used the tanh transfer function (Eq. (12)), and the output node used the linear transfer function.
tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})                                   (12)
Considering both the accuracy and time-consuming of BPNN model, the convergence criteria used for the training set was a maximum of 500 iterations.
4.3 Experimental Results
First, the results of parameter selection are shown. Fig. 1 plots the optimal fitness of the ACO-SVR model against the generation number. It is clear from Fig. 1 that the MSEcv of the optimal fitness decreased as the generation number increased. When the evolution reached generation 62, the MSEcv of five-fold cross-validation converged, indicating that the ACO search was highly efficient.
Fig. 1. Convergence during evolutionary process
Table 2 compares the parameter selection results. Compared with the T-SVR model, the ACO-SVR model spent less time yet obtained more precise parameters with a smaller MSE during the parameter selection procedure. This means that the ACO-SVR model outperforms the T-SVR model.

Table 2. Comparison of the parameter selection procedure
Model      Time (s)   Optimal (C, σ, ε)             MSE
ACO-SVR    191        (87.0543, 0.0531, 0.0508)     0.01026
T-SVR      514        (8.533, 25.6, 0.05)           0.01854
Thereafter, the prediction results of the different models were compared. From Table 3, the NMSE achieved by the ACO-SVR model is the smallest. According to Lewis [15], the predictions made by the ACO-SVR model can be rated as highly precise.

Table 3. Comparison of prediction results
Model      NMSE      R
ACO-SVR    0.2470    0.9709
T-SVR      0.6187    0.9308
BPNN       0.3022    0.9671
Moreover, the correlation coefficient R of the ACO-SVR model was the highest, indicating an extremely high correlation between the predicted and actual values. It can also be observed that the NMSE of BPNN is smaller than that of T-SVR and the R of BPNN is larger than that of T-SVR, meaning that BPNN worked better than T-SVR under the parameter settings in our experiment. From Fig. 2, it can be observed that the smallest deviations between predicted and actual values were produced by the ACO-SVR model among the three models. It can also be observed that T-SVR produced smaller deviations than BPNN most of the time; however, there were several points where the deviations made by T-SVR were very large. These points caused the large average errors observed in Table 3.
Fig. 2. Graphical presentation of different models
5 Conclusions In this study, an effective SVR model with ACO was applied to predict grid resources. Compared with the T-SVR model, the ACO-SVR model provided higher prediction precision and spent less time on parameter selection, which means that ACO was applied to SVR parameter selection successfully. In this study, ACO-SVR worked better than BPNN, and BPNN outperformed T-SVR; hence, parameter selection is very important for SVR's performance, and the trial-and-error method indeed requires some luck. On the other hand, the superior performance of the ACO-SVR model over the BPNN approach is mainly due to the following causes. Firstly, the SVR model has nonlinear mapping capabilities and can more easily capture the data patterns of grid resources (host load in this study) than the BPNN model. Secondly, improper determination of SVR's parameters causes either over-fitting or under-fitting of an SVR model; in this study, ACO determines suitable parameters of SVR and
improves the prediction performance of the proposed model. Finally, the ACO-SVR model follows the structural risk minimization (SRM) principle rather than merely minimizing the training errors; minimizing the upper bound on the generalization error improves generalization performance compared with the BPNN model. The promising results obtained in this study reveal the potential of the ACO-SVR model for predicting grid resources. In the future, we will study other advanced search techniques for parameter selection. Acknowledgments. This project is supported by the National 973 Plan of China (No. 2009CB320706), the National Natural Science Foundation of China (No. 60873235 & 60473099), and the Program for New Century Excellent Talents in University of China (No. NCET-06-0300).
References 1. Wolski, R., Spring, N.T., Hayes, J.: The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing. The Journal of Future Generation Computing Systems (1999) 2. Dinda, P.A.: Design, Implementation, and Performance of an Extensible Toolkit for Resource Prediction in Distributed Systems. IEEE Trans. Parallel Distrib. Syst., 160–173 (2006) 3. Liu, Z.X., Guan, X.P., Wu, H.H.: Bandwidth Prediction and Congestion Control for ABR Traffic based on Neural Networks. In: Wang, J., et al. (eds.) ISNN 2006, Part II. LNCS, vol. 3973, pp. 202–207. Springer, Heidelberg (2006) 4. Eswaradass, A., Sun, X.H., Wu, M.: A Neural Network based Predictive Mechanism for Available Bandwidth. In: 19th International Parallel and Distributed Processing Symposium (2005) 5. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998) 6. Dorigo, M., Stützle, T.: Ant Colony Optimization. MIT Press, Cambridge (2004) 7. Wang, L.P.: Support Vector Machines: Theory and Application. Springer, Berlin (2005) 8. Duan, K., Keerthi, S., Poo, A.: Evaluation of Simple Performance Measures for Tuning SVM Hyper Parameters. Technical Report, National University of Singapore, Singapore (2001) 9. Host Load Data Set, http://cs.uchicago.edu/lyang/Load/ 10. Chen, K.Y.: Forecasting Systems Reliability based on Support Vector Regression with Genetic Algorithms. Reliability Engineering and System Safety, 423–432 (2007) 11. Chang, C.C., Lin, C.J.: LIBSVM: A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm/ 12. Hu, L., Hu, G., Tang, K., Che, X.: Grid Resource Prediction based on Support Vector Regression and Genetic Algorithms. In: The 5th International Conference on Natural Computation (2009) 13. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximations. Neural Networks, 336–359 (1989) 14. Rumelhart, E., Hinton, G.E., Williams, R.J.: Learning Internal Representations by Error Propagation in Parallel Distributed Processing. MIT Press, Cambridge (1986) 15. Lewis, C.D.: International and Business Forecasting Methods. Butterworths, London (1982)
An Improved Kernel Principal Component Analysis for Large-Scale Data Set Weiya Shi and Dexian Zhang School of Information Science and Engineering Henan University of Technology, Zhengzhou, China
[email protected]
Abstract. To deal with the computational and storage problem for the large-scale data set, an improved Kernel Principal Component Analysis based on 1-order and 2-order statistical quantity, is proposed. By dividing the large scale data set into small subsets, we could treat 1-order and 2-order statistical quantity (mean and autocorrelation matrix) of each subset as the special computational unit. A novel polynomial-matrix kernel function is also adopted to compute the similarity between the data matrices in place of vectors. The proposed method can greatly reduce the size of kernel matrix, which makes its computation possible. Its effectiveness is demonstrated by the experimental results on the artificial and real data set.
1 Introduction Kernel Principal Component Analysis (KPCA) [1] is the nonlinear generalization of Principal Component Analysis (PCA) [2]. The standard KPCA generally needs to eigendecompose the Gram matrix [3], which is acquired using the kernel function. It must first store the Gram matrix of all data, which takes a space complexity of O(m²), where m is the number of data samples. In addition, it needs a time complexity of O(m³) to extract the kernel principal components. Because the traditional kernel function is based on the inner product of data vectors, the size of the kernel matrix scales with the number of data points. When faced with a large-scale data set, it is infeasible to store and compute the kernel matrix because of the limited storage capacity. Consequently, some approaches must be adopted to account for this inconvenience. In order to solve the problem of large-scale data sets, some methods have been proposed to compute kernel principal components. Zheng [4] proposed to partition the data set into several small-scale data sets and handle them respectively. Some approximation algorithms [5][6][7] are proposed to extract some representative data, and these data are chosen to approximate the original data set; the major difference between these methods lies in the sampling strategy. An iterative procedure has been proposed to estimate the kernel principal components by kernelizing the generalized Hebbian algorithm [8], but its convergence is slow and cannot be guaranteed. Recently, we have given a new framework, matrix-based kernel principal component analysis (M-KPCA) [9], which can effectively solve the problem of large-scale data sets, but it was only a fundamental result and did not give much illustration and comparison. In this paper, we will extend that idea and use 1-order and 2-order statistical quantity to
deal with large-scale data set. First, we divide the large scale data set into small subsets, each of which can produce the 1-order and 2-order statistical quantity (mean and autocorrelation matrix). For the 1-order statistical quantity, the traditional kernel method can be used to compute the kernel matrix. Because the 2-order statistical quantity is a matrix, the kernel function based on vectors can not be used. A novel polynomial-matrix kernel function was proposed to compute the similarity between matrices. Because the number of subsets is less than the number of samples, the size of kernel matrix can be greatly reduced. The small size of kernel matrix makes the computation and storage of large-scale data set possible. The effectiveness of the proposed methods is demonstrated by the experimental results on the artificial and real data set. The rest of this paper is organized as follows: section 2 describes the proposed methods in detail. The experimental evaluation of the proposed methods is given in the section 3. Finally we conclude with a discussion.
2 Proposed Method Let X = (x1, x2, ..., xm) be the data matrix in input space, where xi, i = 1, 2, ..., m, is an n-dimensional vector and m is the number of data samples. 2.1 The Formation of Subsets The data set X = (x1, x2, ..., xm) is first divided into M subsets Xi (i = 1, ..., M), each of which consists of about k = m/M samples. Without loss of generality, it is denoted:
X1 = (x1, ..., xk), ..., XM = (x_{(M−1)k+1}, ..., xm)                       (1)
Accordingly, X = ∪_{i=1}^{M} Xi = (X1, X2, ..., XM). 2.2 Kernel Principal Component Analysis Based on 1-Order Statistical Quantity The 1-order statistical quantity (mean) of each subset is given as follows:
X1^center = (1/k) Σ_{i=1}^{k} xi,  ...,  XM^center = (1/(m − (M−1)k)) Σ_{i=(M−1)k+1}^{m} xi,        (2)
Having computed the mean of each subset, it is still a vector. The standard kernel method can be used to map it into a high-dimensional space, and the same derivation for computing the nonlinear features applies, except that xi is replaced by Xi^center. 2.3 Computing the Autocorrelation Matrix of Subset Similarly, the 2-order statistical quantity (autocorrelation matrix) of each subset can be computed. The autocorrelation matrix is defined as follows:
Σ1 = X1 X1^T,  Σ2 = X2 X2^T,  ...,  ΣM = XM XM^T,                           (3)
The data set is then transformed into Σ = (Σ1, Σ2, ..., ΣM), where each Σi is an n × n matrix. Because the traditional kernel method is based on vectors while the 2-order statistical quantity is a matrix, some way of approaching the problem must be found. In this circumstance, we can treat the autocorrelation matrices as special computational units in input space. It is shown in [2] that the autocorrelation matrix contains the statistical information between samples. Thus, a special computational unit can be projected into a high-dimensional (even infinite-dimensional) Reproducing Kernel Hilbert Space (RKHS) using a mapping function φ:
φ: R^{n×n} → F,   Σi → φ(Σi)                                                (4)
After having been projected into feature space, the data set can be represented as Φ(Σ) = (φ(Σ1), φ(Σ2), ..., φ(ΣM)). 2.4 A Novel Polynomial-Matrix Kernel Function Because the special computational unit is now a matrix, the traditional vector-based kernel function cannot be used. In order to compute the similarity between the mapped computational units in feature space, a positive definite kernel function needs to be defined. Similar to the traditional polynomial kernel function, a novel polynomial-matrix kernel function is defined as:
κ(·,·) = κ(Σi, Σj) = (φ(Σi) · φ(Σj)) = ||Σi .∗ Σj||_B^D                     (5)
where ||·||_B = ||(·)^{.1/2}||_F (||·||_F is the Frobenius norm of a matrix), .∗ denotes the component-wise multiplication of matrices, (·)^{.1/2} means the component-wise square root of a matrix, and D is the degree of the polynomial-matrix kernel function.
Theorem. When each subset has one sample, the polynomial kernel function based on the data vector equals the polynomial-matrix one based on the autocorrelation matrix with twice the degree; that is, the degree d is twice the degree D.
Proof. When each subset contains one sample, the autocorrelation matrix is Σi = xi xi^T. Using the polynomial-matrix kernel function, it follows that
κ(Σi, Σj) = ||Σi .∗ Σj||_B^D = ||xi xi^T .∗ xj xj^T||_B^D
          = || [x_{ik} x_{il}]_{k,l=1..n} .∗ [x_{jk} x_{jl}]_{k,l=1..n} ||_B^D                  (6)
          = (x_{i1}² x_{j1}² + x_{i1} x_{i2} x_{j1} x_{j2} + ... + x_{in}² x_{jn}²)^D
          = ((Σ_{k=1}^{n} x_{ik} x_{jk})²)^D = (xi^T xj)^{2D} = κ(xi, xj)^{2D} = κ(xi, xj)^d,
and the theorem is derived. In other words, the polynomial kernel function is the extreme case of the polynomial-matrix one, when each subset comprises only one sample.
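As a concrete illustration, the following NumPy sketch evaluates the polynomial-matrix kernel between two autocorrelation matrices in the way the proof of the theorem computes it, i.e. as the sum of the entries of the component-wise product raised to the power D; this reading of Eqs. (5)-(6), and the function names, are assumptions made for illustration.

```python
import numpy as np

def autocorrelation(Xs):
    """2-order statistical quantity of a subset Xs of shape (n, k): Sigma = Xs Xs^T."""
    return Xs @ Xs.T

def poly_matrix_kernel(Sig_i, Sig_j, D=1):
    """kappa(Sigma_i, Sigma_j), following the computation in the proof of the theorem."""
    return float(np.sum(Sig_i * Sig_j)) ** D

# Sanity check of the theorem: with one sample per subset and D = 1,
# kappa(Sigma_i, Sigma_j) equals the ordinary polynomial kernel (x_i . x_j)^(2D).
# x_i, x_j = np.random.rand(5), np.random.rand(5)
# S_i, S_j = np.outer(x_i, x_i), np.outer(x_j, x_j)
# assert np.isclose(poly_matrix_kernel(S_i, S_j, D=1), (x_i @ x_j) ** 2)
```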
2.5 Kernel Principal Component Analysis Based on 2-Order Statistical Quantity Because the data set is divided into many subsets, the number of subsets is less than the number of samples in the original data set. As a result, the large-scale data set is compressed by down-sampling the data. The size of the kernel matrix can be greatly reduced from m × m to M × M by the novel polynomial-matrix kernel function; thus, the small size of the kernel matrix makes the computation and storage possible. At present, the mapped data set is Φ(Σ) = (φ(Σ1), φ(Σ2), ..., φ(ΣM)) in feature space. The covariance matrix is given as follows:
C = (1/M) Σ_{i=1}^{M} φ(Σi) φ(Σi)^T,                                        (7)
It also accords with the eigen-equation:
C ν = λ ν,                                                                  (8)
where ν and λ are the corresponding eigenvector and eigenvalue of the covariance matrix. The eigenvector is expanded using all the projected matrices Φ(Σ) as:
ν = Σ_{i=1}^{M} αi φ(Σi),                                                   (9)
By substituting Eq. (7) and Eq. (9) into Eq. (8), we get the following formula:
K α = M λ α,                                                                (10)
where α is the expansion coefficient vector and K is the Gram matrix, denoted K = Φ(Σ)^T Φ(Σ) = (κij)_{1≤i≤M, 1≤j≤M}, with entries κij = κ(Σi, Σj). After obtaining the eigenvector α, the kernel principal components ν can be obtained using Eq. (9). For a test sample x, its autocorrelation matrix is Σx = x x^T. The nonlinear feature is then given by:
(ν, φ(Σx)) = Σ_{i=1}^{M} αi (φ(Σi) · φ(Σx)) = Σ_{i=1}^{M} αi κ(Σi, Σx),      (11)
Throughout the derivation, it is assumed that the mapped data have zero mean; otherwise, it is easy to derive the centered kernel matrix:
κ̃(Σi, Σj) = ||(Σi − (1/M)Σ) .∗ (Σj − (1/M)Σ)||_B^D
           = ||Σi .∗ Σj − (1/M) Σi .∗ Σ − (1/M) Σ .∗ Σj + (1/M²) Σ .∗ Σ||_B^D            (12)
           = (K − I_M K − K I_M + I_M K I_M)_{ij}
Therefore, the centered kernel matrix is K̃ = K − I_M K − K I_M + I_M K I_M, where I_M denotes the M × M matrix with all entries equal to 1/M.
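Putting Sections 2.1-2.5 together, a compact sketch of the 2order-KPCA training and projection steps might look as follows (NumPy-based; the centering step follows Eq. (12), the eigenvector normalization needed to make ||ν|| = 1 is omitted for brevity, and the function names, subset size and component sorting are illustrative assumptions):

```python
import numpy as np

def two_order_kpca(X, k=6, D=1, n_components=64):
    """X has shape (n, m): m samples of dimension n. Returns subset matrices and eigenvectors."""
    n, m = X.shape
    subsets = [X[:, s:s + k] for s in range(0, m, k)]            # Section 2.1
    sigmas = [S @ S.T for S in subsets]                          # Eq. (3)
    M = len(sigmas)
    # M x M Gram matrix with the polynomial-matrix kernel (Eq. (5)).
    K = np.array([[np.sum(si * sj) ** D for sj in sigmas] for si in sigmas])
    # Centering (Eq. (12)).
    I_M = np.full((M, M), 1.0 / M)
    Kc = K - I_M @ K - K @ I_M + I_M @ K @ I_M
    # Eigen-decomposition of the small M x M matrix (Eq. (10)).
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    return sigmas, eigvecs[:, order], eigvals[order]

def project(x, sigmas, alphas, D=1):
    """Nonlinear features of a test sample x (Eq. (11))."""
    sigma_x = np.outer(x, x)
    k_vec = np.array([np.sum(si * sigma_x) ** D for si in sigmas])
    return alphas.T @ k_vec
```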
3 Experimental Results and Discussion Some experiments were performed to demonstrate the effectiveness of the proposed method. To differentiate them from the standard KPCA, we abbreviate the method based on the 1-order statistical quantity as 1order-KPCA and the method based on the 2-order statistical quantity as 2order-KPCA. The polynomial kernel κ(x, y) = (x^T y)^d (where d is the degree) is used in the standard KPCA and 1order-KPCA. The polynomial-matrix kernel function κ(Σi, Σj) = ||Σi .∗ Σj||_B^D (where D is the degree) is used in 2order-KPCA. 3.1 Toy Examples We first perform experiments on a 2-dimensional toy problem. 200 two-dimensional data samples are generated, where the x-values are uniformly distributed in [−1, 1] and the y-values are given by y = x² + η (η is normal noise with standard deviation 0.2). In 1order-KPCA and 2order-KPCA, the data set is divided into 100 subsets, each of which contains 2 samples. The degrees d and D of the two kernel functions are 2 and 1, respectively. The 1-order and 2-order statistical quantities were first computed, and the nonlinear features were then extracted. The experimental results are given in Fig. 1, which shows contour lines of constant value of the first 3 principal components, where the gray values represent the feature value. From the results, 1order-KPCA and 2order-KPCA achieve performance almost identical to that of the standard KPCA [1]. The results show the effectiveness of the proposed methods, which can successfully extract the nonlinear components. 3.2 USPS Examples We also test the proposed methods on real-world data. The US Postal Service (USPS) data set1 consists of 256-dimensional handwritten digits 0-9, with 7291 training samples and 2007 testing samples. First, we randomly select 3000 training samples to extract the nonlinear features. The nearest neighbor classifier is used to classify the projected testing samples.
Fig. 1. Contour image of first 3 principal components obtained from the standard KPCA (the first row), 1order-KPCA (the second row) and 2order-KPCA (the third row) 1
Available at http://www.kernel-machines.org
The 3000 training samples were divided into subsets of different sizes (1 ≤ k ≤ 5 samples per subset). For each k, 10 independent runs are performed, where the data samples are randomly reordered, and the classification results on the 2007 testing samples are averaged. For the sake of limited space, we only give the results under degrees D = 1 and 2 for 2order-KPCA. Table 1 gives the error rate on the testing samples using the standard KPCA and the proposed methods with different numbers of samples in each subset; it also gives the corresponding results under degrees d = 2 and 4 for the standard KPCA and 1order-KPCA, respectively. It can be found that the results of 1order-KPCA and 2order-KPCA with k = 1 equal the results of the standard KPCA, which corresponds to the aforementioned theorem. It also shows that 1order-KPCA and 2order-KPCA with different numbers of samples in each subset can generally achieve classification results competitive with the standard KPCA. To visualize the results more clearly, we plot the error rate under different numbers of kernel principal components in Fig. 2 and Fig. 3.
Fig. 2. Performance of proposed methods using different number samples (k) in each subset under varying number of kernel principal components (using log scale) corresponding to Table 1a and Table 1b
Fig. 3. Performance of proposed methods using different number samples (k) in each subset under varying number of kernel principal components (using log scale) corresponding to Table 1c and Table 1d
In addition, we also use all the training samples to extract the nonlinear feature. Because the size of Gram matrix is 7291 × 7291, it is impossible for standard KPCA algorithm to run in the standard hardware. Using the proposed methods, we firstly divide 7291 training samples into 1216 subsets, each of which consists of 6 samples (The last subset contains only 1 sample). Table 2 is the result of proposed method with 6 samples in each subset trained with all training samples. Here, the size of kernel matrix drops from 7291 × 7291 to 1216 × 1216, which can be easily stored and computed. As shown in Table 2, we can also see that 1order-KPCA and 2order-KPCA can achieve the right
Table 1. Error rate (%) of the 2007 testing samples using the proposed methods (D=1 and D=2) and the standard KPCA (d=2 and d=4) with 3000 training samples

(a) Result of 1order-KPCA (d=2)
Number of components   KPCA    k=1     k=2     k=3     k=4     k=5
32                      7.17    7.17    6.88    7.13    7.08    7.22
64                      6.98    6.98    6.98    7.08    7.13    7.13
128                     7.17    7.17    7.47    7.22    7.47    7.52
256                     7.22    7.22    6.93    7.13    7.22    7.13

(b) Result of 2order-KPCA (D=1)
Number of components   KPCA    k=1     k=2     k=3     k=4     k=5
32                      7.17    7.17    7.27    7.03    7.17    7.42
64                      6.98    6.98    6.98    6.93    6.78    6.78
128                     7.17    7.17    7.37    6.88    7.08    7.03
256                     7.22    7.22    7.32    7.03    7.22    6.93

(c) Result of 1order-KPCA (d=4)
Number of components   KPCA    k=1     k=2     k=3     k=4     k=5
32                     10.26   10.26    9.52    8.92    8.87    9.02
64                      8.62    8.62    8.27    7.67    8.62    8.17
128                     7.97    7.97    7.57    7.77    8.72    7.97
256                     8.07    8.07    7.92    7.97    8.37    8.47

(d) Result of 2order-KPCA (D=2)
Number of components   KPCA    k=1     k=2     k=3     k=4     k=5
32                     10.26   10.26    9.27    9.77    8.87    9.27
64                      8.62    8.62    8.27    7.77    8.02    8.02
128                     7.97    7.97    7.62    7.97    7.77    7.72
256                     8.07    8.07    7.72    8.32    8.32    8.02
Table 2. Error rate (%) of the 2007 testing samples using 1order-KPCA and 2order-KPCA (with different degrees) with all training samples

(a) Result of 1order-KPCA
Number of components   d=2     d=3     d=4     d=5
32                     5.93    5.63    5.73    5.93
64                     6.48    5.93    5.98    5.63
128                    7.42    6.73    6.33    6.44
256                    7.82    7.52    7.13    6.58

(b) Result of 2order-KPCA
Number of components   D=1     D=1.5   D=2     D=2.5
32                     6.08    5.78    5.73    5.48
64                     6.88    6.13    6.18    6.13
128                    7.87    7.37    6.93    6.48
256                    9.42    8.17    7.47    7.37
classification performance even when the eigen-decomposition technique cannot work on the full large-scale data set. The results show that the proposed methods are more effective and efficient than the standard KPCA.
4 Conclusions An efficient Kernel Principal Component Analysis for large-scale data sets is proposed. The method divides the large-scale data set into small subsets, each of which produces a mean and an autocorrelation matrix. The resulting matrices are then treated as special computational units, and the similarity between matrices is computed using a novel polynomial-matrix kernel function. This greatly reduces the size of the kernel matrix, which effectively solves the large-scale problem.
Acknowledgment This work was supported in part by Natural Science Foundation of Henan Educational Committee under contract 2010B520005, Innovation Scientists and Technicians Troop Construction Projects of Henan Province under contract 094200510009 and Doctor Fund of Henan University of Technology under contract 2009BS013.
References 1. Scholkopf, B., Smola, A., Muller, K.R.: Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation 10, 1299–1319 (1998) 2. Fukunaga, K.: Introduction to Statistical Pattern Recognition. Academic Press, London (1990) 3. Shawe-Taylor, J., Scholkopf, B.: Kernel Methods for Pattern Analysis, 3rd edn. Cambridge University Press, Cambridge (2004) 4. Zheng, W.M., Zou, C.R., Zhao, L.: An Improved Algorithm for Kernel Principal Components Analysis. Neural Processing Letters 22, 49–56 (2005) 5. France, V., Hlavac, V.: Greedy Algorithm for a Training Set Reduction in the Kernel Methods. In: Petkov, N., Westenberg, M.A. (eds.) CAIP 2003. LNCS, vol. 2756, pp. 426–433. Springer, Heidelberg (2003) 6. Achlioptas, D., McSherry, M., Scholkopf, B.: Sampling techniques for kernel methods. In: Advances in Neural Information Processing Systems (2002) 7. Smola, A., Cristianini, N.: Sparse Greefy Matrix Approximation for Machine Learning. In: International Conference on Machine Learning (2000) 8. Kim, K.I., Franz, M.O., Scholkopf, B.: Iterative Kernel Principal Component Analysis for image modeling. IEEE Trans. Pattern Anal. Mach. Intell. 27(9), 1351–1366 (2005) 9. Shi, W.Y., Guo, Y.F., Xue, X.Y.: Matrix-based Kernel Principal Component Analysis for Large-scale Data Set. In: International Joint Conference on Neural Networks, USA
Software Defect Prediction Using Fuzzy Support Vector Regression Zhen Yan1 , Xinyu Chen2 , and Ping Guo1 1
School of Computer, Beijing Institute of Technology, Beijing 100081, China 2 The State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
Abstract. Regression techniques have been applied to improve software quality by using software metrics to predict defect numbers in software modules. This can help developers allocate limited developing resources to modules containing more defects. In this paper, we propose a novel method of using Fuzzy Support Vector Regression (FSVR) in predicting software defect numbers. Fuzzification input of regressor can handle unbalanced software metrics dataset. Compared with the approach of support vector regression, the experiment results with the MIS and RSDIMU datasets indicate that FSVR can get lower mean squared error and higher accuracy of total number of defects for modules containing large number of defects. Keywords: Fuzzy support vector regression, Software defect prediction, Software metrics.
1
Introduction
Software defect, commonly defined as a deviation from expectation that might lead to software failures [1], is one of the most important problems in software engineering. Software engineers always want to identify which software modules contain more defects, so that those modules can be paid more attention during the testing period. Knowing how many defects are in each module is also pressing for telling whether a project is on schedule. However, predicting defects in software modules is a difficult problem, because many factors may impact the results, such as software functional complexity, quality of code, etc. Much research work has been done over nearly 40 years to try to solve this problem, and such work has proved helpful during the software development process. Software defect prediction techniques consider that the number and distribution of software defects are closely related to static software metrics. According to [2], many software defect prediction models based on statistical theories, especially classification and regression methods, have been proven successful in estimating the number and location of defects in software by using static software metrics, such as Halstead's software volume metrics [3] and McCabe's cyclomatic complexity metrics [4]. These prediction models, for example, Akiyama's linear
regression model [5], assume that the number and distribution of defects in software depend on certain metrics of the software; however, many of them choose different software metrics. Empirical studies show that software defects are not distributed uniformly across software modules: a few modules contain a large number of defects while most modules contain only a few defects or even no perceivable defects at all. This indicates that software metrics datasets are unbalanced. In recent years, fuzzy support vector regression (FSVR) has emerged as a technique for solving regression problems; it reduces the effect of outliers and noise in data fitting and can achieve better results on unbalanced datasets. Therefore, we intend to apply FSVR to software defect prediction, expecting that, with a proper fuzzy membership function, it can improve the regression on software metrics datasets for modules containing a large number of defects.
2
Backgrounds
In this section we briefly review the fundamental applications about FSVR and software defect prediction. 2.1
Fuzzy Support Vector Regression
The support vector machine (SVM) was developed for classification problems by Vapnik in the late 1960s. SVM is known for its good generalization and easy adaptation to modeling non-linear functional relationships. With the introduction of Vapnik's ε-insensitive loss function [6], SVM was extended to solving linear and non-linear regression estimation problems, in which case it is named support vector regression (SVR). A further extension with a fuzzy membership function turns SVR into fuzzy support vector regression (FSVR) [7,8]. By applying fuzzy logic in SVR, different input data points contribute differently to the optimization of the regression function [9]. Bao et al. employed FSVR in financial time series forecasting and achieved high performance on a stock composite index [10]. 2.2
Software Defect Prediction Techniques
A software metric is a function whose input is software code and whose output is a single numerical value that can be interpreted as the degree to which the software possesses a given attribute affecting its quality; it is also called a software complexity or quality metric. For decades, researchers have been trying to find the relationship between software complexity metrics and software quality. Many software metrics have been developed, for example, Halstead's software science metrics [3], McCabe's cyclomatic metric [4], etc. Some of these metrics can only be obtained in the late stages of the software life cycle; however, other metrics can be extracted in the very early stages. With these metrics, we can derive statistical criteria to predict the number of defects in software modules and their fault-proneness.
Software metrics related studies mainly consist of time series prediction, defect number, and defect distribution in software modules. Xing and Guo proposed to apply SVR to build software reliability growth model [11], and they also studied some techniques on classifying fault-prone and fault-free modules using SVM [12]. Jin et al. proposed to engage SVR for software prediction and proved that it is a promising technique through comparison with multivariate linear regression, conjunctive rule, and locally weighted regression [13]. Ostrand et al. applied a negative binomial regression model to predict the expected number of defects in each file of the next release of a system, and they found that 20% files with the highest predicted number of defects contained 71% ∼ 92% of the defects that are actually detected [14,15]. Bibi et al. applied regression via classification to estimate the number of software defects by exploiting symbolic learning algorithms [16], and the representation of the fault knowledge can be in the form of rules and decision trees. The logistic regression model was employed to predict whether files or packages have post-release defects using the data collected from the Eclipse project [17].
3
Software Defect Prediction Using FSVR
We investigate software metrics, extracted from the datasets of the MIS (Medical Imaging System) and RSDIMU (Redundant Strapped-Down Inertial Measurement Unit) projects, to predict the number of defects in software modules. These software metrics are shown in Table 1. The MIS has been in wide commercial use; it contains nearly 4,500 modules and about 400,000 lines of code written in Pascal, Fortran, assembly language, and PL/M. The MIS dataset we used in experiments is a subset of 390 modules, which can be obtained from the CD attached to [18]. The RSDIMU dataset was developed at the Chinese University of Hong Kong in the C language [19]. Unlike MIS, the RSDIMU dataset is file-based and contains data for 223 files. We aim to find a hyperplane that best fits the datasets to predict the exact number of defects in each module or file. In addition, the hyperplane should be as flat as possible. 3.1
Dataset Preprocessing
When using software complexity metrics as input of a regression model to estimate the number of defects in software modules, it is assumed that these metrics are uncorrelated. As shown in Table 1, the original metrics dataset does not meet this assumption. Principal Components Analysis (PCA) is a method to perform de-correlation and reduce data dimensionality; moreover, previous research [13,20] has verified that using the first few principal components performs well in static-metrics-based software defect prediction. We choose the first two principal components (PCA2) as input of regression training. 3.2
Fuzzy Membership Function
There are 308 modules out of MIS whose defect numbers are not greater than 10, and the other modules’ defect numbers range from 11 to 98, and in RSDIMU,
Table 1. Detailed description of metrics in MIS and RSDIMU

Dataset   Metrics            Detailed Description
Both      LOC                Number of lines of code
          COM LOC            Number of lines of comments, named TComm in MIS
          SLOC               Number of lines of source code, named CL in MIS
          N1, N2, n1, n2     Halstead's software metrics, corresponding to N and N' in MIS
MIS       TChar              Number of characters
          MChar              Number of comment characters
          DChar              Number of code characters
          NF                 Jensen's estimate of program length metric
          V(G)               McCabe's cyclomatic complexity metric
          BW                 Belady's bandwidth measure
RSDIMU    COM RAT            The ratio of COM LOC to LOC
          NSC                Number of sub-level classes
          NTM                Number of top-level classes
          TCOM RAT           The ratio of COM LOC to SLOC
167 files have defect numbers less than 4, and the other files range from 4 to 14. According to [21], we have much freedom in selecting an appropriate fuzzy membership function, as long as it meets the following two constraints:
– A membership function must be bounded in [0, 1];
– An element of the dataset cannot map to different degrees of membership for one fuzzy function.
Therefore, we employ the following equation as our fuzzy membership function:
si = (yi − ymin) · ((1 − σ) − σ) / (ymax − ymin) + σ,   σ = 0.01,            (1)
in which ymax is the maximum and ymin the minimum value of the target value set, and σ ensures that si will not be zero. From Equation (1) we can see that si ∈ (σ, 1 − σ), and the larger yi is, the larger si is. This fuzzy membership function means that the more defects a module contains, the more that sample contributes to the regression problem.
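A minimal sketch of Equation (1), and of how the resulting membership values would typically be handed to a fuzzy SVR as per-sample weights (the weighting mechanism and the helper name are assumptions for illustration, not the paper's exact implementation):

```python
import numpy as np

def fuzzy_membership(y, sigma=0.01):
    """Equation (1): map each target value to a membership value in (sigma, 1 - sigma)."""
    y = np.asarray(y, dtype=float)
    y_min, y_max = y.min(), y.max()
    return (y - y_min) * ((1 - sigma) - sigma) / (y_max - y_min) + sigma

# Example: modules with more defects receive larger memberships, so they
# contribute more to the regression fit.
# y = np.array([0, 1, 2, 10, 98])
# s = fuzzy_membership(y)          # approximately [0.01, 0.02, 0.03, 0.11, 0.99]
```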
3.3 Kernel Function
The linear kernel was chosen for defect prediction with SVR in [13]; for us, however, a reasonable first choice of kernel function is the radial basis function (RBF), because the mean absolute error of SVR using the linear kernel and PCA2 is 4.3 according to [13], while using the RBF kernel in SVR the result is 3.76. Equation (2) is the RBF kernel we used:
k(xi, xj) = exp(−γ |xi − xj|²),   γ > 0.                                     (2)
The RBF kernel maps data into a higher-dimensional space and has fewer numerical difficulties, with values 0 < k(xi, xj) ≤ 1. In fact, the linear kernel is a special case of the RBF kernel, as shown in [8]; however, it has more parameters, which makes the regression more complex. 3.4
Cross-Validation and Grid-Search
Two parameters need to be identified when using the RBF kernel function: C and γ. C is a cost parameter that controls the trade-off between allowing training errors and forcing rigid margins. Here, cross-validation is used to prevent the over-fitting problem. In ν-fold cross-validation, the training set is divided into ν subsets, and sequentially one subset is used as a test set while the other subsets are merged into a training set. We use 10-fold (i.e., ν = 10) cross-validation in FSVR. Pairs of (C, γ) are tried using grid search, and the one with the best cross-validation performance is picked (C = 2^0, 2^1, ..., 2^15 and γ = 2^−5, 2^−4, ..., 2^5).
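The grid search described above can be sketched as follows; scikit-learn's SVR and KFold are used here as stand-ins for the authors' actual FSVR implementation, and the scoring function is plain MSE, so the whole snippet is illustrative rather than the exact setup of the paper.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVR

def grid_search_rbf(X, y, folds=10):
    """Pick (C, gamma) minimizing 10-fold cross-validated MSE over the stated grid."""
    best = (None, None, np.inf)
    for C in [2.0 ** p for p in range(0, 16)]:            # C = 2^0 ... 2^15
        for gamma in [2.0 ** p for p in range(-5, 6)]:    # gamma = 2^-5 ... 2^5
            errs = []
            for tr, va in KFold(n_splits=folds, shuffle=True, random_state=0).split(X):
                model = SVR(kernel="rbf", C=C, gamma=gamma).fit(X[tr], y[tr])
                errs.append(np.mean((model.predict(X[va]) - y[va]) ** 2))
            mse = float(np.mean(errs))
            if mse < best[2]:
                best = (C, gamma, mse)
    return best
```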
3.5 Performance Criteria
The two performance criteria engaged in this paper to evaluate the effects of software defect prediction are the mean squared error (MSE) and the squared correlation coefficient (denoted as r²), as follows:
MSE = (1/n) Σ_{i=1}^{n} (f(xi) − yi)²,                                       (3)
r² = [n Σ_{i=1}^{n} f(xi) yi − Σ_{i=1}^{n} f(xi) Σ_{i=1}^{n} yi]² / ([n Σ_{i=1}^{n} f(xi)² − (Σ_{i=1}^{n} f(xi))²][n Σ_{i=1}^{n} yi² − (Σ_{i=1}^{n} yi)²]).     (4)
4
Experiments and Discussion
Table 2 shows the experimental results when we treat the whole set as the training set. We can observe that SVR with the RBF kernel gets MSE 48.09 and 5.03 in MIS and RSDIMU, respectively. The results are better than FSVR's 63.64 and 7.62, and SVR also has better r² than FSVR in both sets.

Table 2. Experimental results of the whole dataset after cross-validation
Model   MIS-MSE   MIS-r²   RSDIMU-MSE   RSDIMU-r²
SVR     48.09     0.590    5.03         0.174
FSVR    63.64     0.510    7.62         0.137
However, empirical statistical studies show that a minority of modules contain most of the defects in software, and software testing engineers need to pay much more attention to modules that possess more defects. Furthermore, our fuzzy membership function (see Equation (1)) dictates that samples with higher defect numbers contribute more to regression training. So we sort the dataset in ascending order of the number of defects and divide it into two parts: the first-80% modules and the last-20% modules are merged into two subsets, respectively. Table 3 shows the total number of defects in the last-20% modules of the training datasets of MIS and RSDIMU. We find that FSVR detects a number of defects much closer to the target value in both datasets.

Table 3. Total number of defects in the last-20% modules
              Whole training set   SVR    FSVR
MIS-20%       1821                 1064   1475
RSDIMU-20%    302                  106    254
Table 4 shows the MSE results on different subsets. SVR does a better job on the first-80% modules or files in the sorted datasets: the MSE values are 21.12 and 2.27 in MIS and RSDIMU, respectively, whereas the MSE values for FSVR are 45.07 and 7.95. On the other hand, FSVR performs much better on the last-20% modules, whose MSE are 144.56 and 1.66, relative to 307.13 and 9.31 for SVR.

Table 4. MSE of first-80% and last-20% subsets of MIS and RSDIMU
Model   MIS-80%   MIS-20%   RSDIMU-80%   RSDIMU-20%
SVR     21.12     307.13    2.27         17.48
FSVR    45.02     145.66    7.95         1.66
There are 78 modules in the last-20% subset of the sorted MIS dataset. We randomly draw 10 samples from these 78 modules as a test dataset; the other modules, together with the first-80% modules, are used as the training set. This process is repeated ten times and the mean MSE is calculated. We do the same experiment on the RSDIMU dataset. As shown in Table 5, the MSEs on the training sets are 43.53 and 2.92 for SVR and 64.93 and 8.71 for FSVR, respectively, while the MSEs on the test sets are 352.69 and 21.20 for SVR and 220.09 and 5.23 for FSVR, respectively. We explore the reason behind this by sorting the dataset in descending order. The modules that contain a few
Table 5. MSE for training and test subsets
Model   MIS-Training   MIS-Test   RSDIMU-Training   RSDIMU-Test
SVR     43.53          352.69     2.92              21.20
FSVR    64.93          220.09     8.71              5.23
defects account for a large number of samples, while samples for modules with a high number of defects are few. The unbalanced training set results in a low training MSE and a high test MSE.
5
Conclusions
In this paper, we propose a novel method of using FSVR to predict software defect numbers. This regressor performs quite well for modules that contain a large number of defects. SVR is used for comparison, and it achieves a better MSE when the whole dataset is used for regression training; but when we randomly draw high-defect-number modules for testing, FSVR clearly outperforms SVR. As further work, we consider employing SVC first to classify software modules as fault-prone or fault-free; after that, by considering the characteristics of the different categories of software modules, we can take advantage of SVR and FSVR respectively to predict software defect numbers more precisely. Acknowledgments. The work described in this paper is partially supported by the grants from the National High Technology Research and Development Program of China (863 Program) (Project No. 2009AA010314), the National Natural Science Foundation of China (Project No. 60675011, 90820010), and the State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences (Project No. SYSKF0906). Dr. Chen and Prof. Guo are the authors to whom all correspondence should be addressed.
References 1. Fenton, N.E., Neil, M.: A critique of software defect prediction models. IEEE Transactions on Software Engineering 25(5), 675–689 (1999) 2. Wang, Q., Wu, S., Li, M.: Software defect prediction technologies. Journal of Software 19(7), 1560–1580 (2007) (in Chinese) 3. Halstead, M.H.: Elements of Software Science. Elsevier, North-Holland (1975) 4. McCabe, T.J.: A complexity measures. IEEE Transations on Software Engineering 2(4), 308–320 (1976) 5. Akiyama, F.: An example of software system debugging. Information Processing 71, 353–379 (1971) 6. Drucker, H., Burges, C., Kaufman, L., Smola, A., Vapnik, V.: Support vector regression machines. In: Advances in Neural Information Processing Systems (NIPS), December 1996, vol. 9, pp. 155–161. MIT Press, Cambridge (1996)
7. Hong, D.H., Hwang, C.: Support vector fuzzy regression machines. Fuzzy Sets and Systems 138(2), 271–281 (2003) 8. Lin, C.F., Wang, S.D.: Fuzzy support vector machine. IEEE Transactions on Neural Networks 13(2), 464–471 (2002) 9. Sun, Z., Sun, Y.: Fuzzy support vector machine for regression estimation. In: Proc. of IEEE International Conference on Systems, Man and Cybernetics., vol. 4, pp. 3336–3341 (2003) 10. Bao, Y.K., Liu, Z.T., Guo, L., Wang, W.: Forecasting stock composite index by fuzzy support vector machines regression. In: Proc. of International Conference on Machine Learning and Cybernetics, August 2005, vol. 6, pp. 3535–3540 (2005) 11. Xing, F., Guo, P.: Support vector regression for software reliability growth modeling and prediction. In: Wang, J., Liao, X., Yi, Z. (eds.) ISNN 2005. LNCS, vol. 3496, pp. 925–930. Springer, Heidelberg (2005) 12. Xing, F., Guo, P., Lyu, M.R.: A novel method for early software quality prediction based on support vector machine. In: Proc. of the 16th IEEE International Symposium on Software Reliability Engineering (ISSRE 2005), November 2005, pp. 213–222 (2005) 13. Jin, X., Liu, Z., Bie, R., Zhao, G., Ma, J.: Support vector machines for regression and applications to software quality prediction. In: Alexandrov, V. (ed.) ICCS 2006. LNCS, vol. 3994, pp. 781–788. Springer, Heidelberg (2006) 14. Ostrand, T.J., Weyuker, E.J., Bell, R.M.: Automating algorithms for the identification of fault-prone files. In: Proc. of International Symposium on Software Testing and Analysis, July 2007, pp. 219–227 (2007) 15. Ostrand, T., Weyuke, E., Bell, R.: Predicting the location and number of faults in large software systems. IEEE Transactions on Software Engineering 31(4), 340–355 (2005) 16. Bibi, S., Tsoumakas, G., Stamelos, I., Vlahavas, I.: Regression via classification applied on software defect estimation. Expert Systems with Applications 34, 2091–2101 (2008) 17. Zimmermann, T., Premraj, R., Zeller, A.: Predicting defects for Eclipse. In: Proc. of the 3rd International Workshop on Predicator Models in Software Engineering (May 2007) 18. Lyu, M.R. (ed.): Handbook of Software Reliability Engineering. IEEE Computer Society Press and McGraw-Hill Book Company (1996) 19. Lyu, M.R., Huang, Z., Sze, K.S., Cai, X.: An empirical study on testing and fault tolerance for software reliability engineering. In: Proc. of the 14th IEEE International Symposium on Software Reliability Engineering (ISSRE 2003), November 2003, pp. 119–130 (2003) 20. Yang, B., Chen, X., Xu, S., Guo, P.: Software metrics analysis with genetic algorithm and affinity propagation clustering. In: Proc. of the 2008 International Conference on Data Mining (DMIN 2008), July 2008, vol. II, pp. 590–596 (2008) 21. Engelbrecht, A.P.: Computational Intelligence: An Introduction, 2nd edn. Wiley, New Jersey (2007)
Refining Kernel Matching Pursuit Jianwu Li and Yao Lu Beijing Key Lab of Intelligent Information Technology, School of Computer, Beijing Institute of Technology, Beijing 100081, China
[email protected]
Abstract. Kernel matching pursuit (KMP), as a greedy machine learning algorithm, appends iteratively functions from a kernel-based dictionary to its solution. An obvious problem is that all kernel functions in dictionary will keep unchanged during the whole process of appending. It is difficult, however, to determine the optimal dictionary of kernel functions ahead of training, without enough prior knowledge. This paper proposes to further refine the results obtained by KMP, through adjusting all parameters simultaneously in the solutions. Three optimization methods including gradient descent (GD), simulated annealing (SA), and particle swarm optimization (PSO), are used to perform the refining procedure. Their performances are also analyzed and evaluated, according to experimental results based on UCI benchmark datasets. Keywords: Kernel matching pursuit, Gradient descent, Simulated annealing, Particle swarm optimization.
1 Introduction Kernel matching pursuit (KMP), recently proposed by Vincent and Bengio [1], appends functions from a redundant dictionary to an initially empty basis sequentially, using a certain loss criterion, to build a discriminant function for a classification problem. KMP can achieve classification performance comparable to the support vector machine (SVM), but typically with sparser expressions [1]. The basic KMP algorithm, as well as its two improved variants, back-fitting and pre-fitting, are described in detail in [1]. To make KMP practical for large datasets, a stochastic version was introduced as an approximation of the original KMP [2]. Additionally, Li and Jiao proposed to pre-select base vectors from the original data in terms of the vector correlation principle, and this method can greatly reduce the scale of the optimization problems and achieve much sparser solutions [3]. To further improve the classification performance of KMP, Jiao and Li attempted to perform KMP ensembles and addressed two ensemble strategies: random replicating sampling and average interval sampling [4]. Also, Popovici and Thiran introduced an adaptive KMP, which can adapt the parameters of the kernel functions in the dictionary to a given classification task [5]. Although many improved versions of KMP have been developed, we are still confronted with an evident problem: all kernel functions in the dictionary remain unchanged during the whole procedure of appending. In fact, it is very difficult, ahead of
training, to determine the optimal dictionary of kernel functions, without enough prior knowledge. For instance, given the dictionary including Gaussian kernel functions, KMP only considers two tasks: choosing which Gaussian kernel functions from this dictionary to append, and determining the coefficients in front of kernel functions. However, during training, the centers and widths of all Gaussian kernel functions keep invariable, and hence the performance of KMP may be affected. Thus, inspired is a spontaneous thinking on how to optimize all the parameters to boost the KMP. This paper proposes a two-stage modeling method to refine KMP. KMP is first trained to build a preliminary solution, then all parameters in the solution are further optimized. During the second stage, three optimization techniques, including gradient descent (GD), simulated annealing (SA), and particle swarm optimization (PSO), are tested respectively. The rest of this paper is organized as follows. Basic KMP is reviewed in Section 2. In Section 3, described are the processes of applying GD, SA, and PSO respectively, to refining KMP. Experimental results are presented in Section 4, and then some conclusions are given in the last section.
2 Basic Matching Pursuit and Kernel Matching Pursuit In the following, basic matching pursuit and kernel matching pursuit are addressed. The former was first proposed by Mallat and Zhang in 1993 [6], in the field of signal processing, but the latter was introduced by Vincent and Bengio in 2002 [1], from a perspective of machine learning. 2.1 Basic Matching Pursuit (BMP) [1, 6] Given l observations {y1, ..., yl} of an unknown target function f ∈ H at data points {x1, ..., xl}, as well as a finite dictionary D = {d1, ..., dm} including m functions in a Hilbert space H, the aim of BMP is to find a sparse approximation of f, of the form
fN = Σ_{i=1}^{N} αi gi,                                                      (1)
which minimizes the squared norm of the residual RN = y − fN, where y = (y1, ..., yl), fN = (fN(x1), ..., fN(xl)), and gi ∈ D. Equation (1) is formed in a greedy, constructive fashion: starting at stage 0 with f0 = 0, then recursively appending functions in D to an initially empty basis. At stage n + 1, fn+1 = fn + αn+1 gn+1 is built by searching for gn+1 ∈ D and αn+1 ∈ R which minimize ||Rn+1||² = ||y − fn+1||² = ||Rn − αn+1 gn+1||². The optimization problem can be formulated as
(gn+1, αn+1) = argmin_{g∈D, α∈R} ||Rn − α g||².                              (2)
Finally, the solution of (2) can be expressed as
gn+1 = argmax_{gn+1 ∈ D} ⟨gn+1, Rn⟩ / ||gn+1||,                              (3)
αn+1 = ⟨gn+1, Rn⟩ / ||gn+1||².                                               (4)
The algorithm terminates when the error $\|R_n\|^2$ goes below a given threshold, or when the iteration reaches a predefined maximum number [1].

2.2 Kernel Matching Pursuit
Kernel matching pursuit (KMP) applies the principle of BMP to machine learning problems by adopting a kernel-based dictionary [1]. The dictionary of KMP can be denoted as $D = \{d_i = K(\cdot, x_i) \mid i = 1, \ldots, m\}$, where $K : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ is a kernel function and the $x_i$ are taken from the training examples. Kernel-based machine learning algorithms have been widely developed in the recent decade, mainly due to the success of the support vector machine (SVM). Though both SVM and KMP use kernel functions, the former requires that the kernel functions satisfy Mercer's theorem, while the latter places no strict restriction on the shape of the kernel functions [1]. In practice, KMP usually adopts the following Gaussian kernel functions:

$$K(x, x_i) = \exp\left(-\gamma \|x - x_i\|^2\right). \tag{5}$$
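To make the selection rules concrete, the following is a minimal NumPy sketch of the basic greedy loop (2)–(4) over a Gaussian-kernel dictionary (5); it omits the back-fitting and pre-fitting variants, and all function and variable names are illustrative rather than taken from [1].

```python
import numpy as np

def gaussian_dictionary(X, centers, gamma):
    """Columns g_i = K(., c_i) evaluated on the points x_1..x_l (Eq. 5)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # l x m squared distances
    return np.exp(-gamma * d2)                                     # l x m dictionary matrix

def kmp_fit(X, y, centers, gamma, n_terms):
    """Greedy KMP: at each stage pick (g, alpha) minimizing ||R_n - alpha*g||^2 (Eqs. 2-4)."""
    G = gaussian_dictionary(X, centers, gamma)
    residual = y.astype(float).copy()            # R_0 = y, since f_0 = 0
    chosen, alphas = [], []
    for _ in range(n_terms):
        scores = np.abs(G.T @ residual) / np.linalg.norm(G, axis=0)   # Eq. (3)
        i = int(np.argmax(scores))
        alpha = (G[:, i] @ residual) / (G[:, i] @ G[:, i])            # Eq. (4)
        residual -= alpha * G[:, i]                                   # update R_{n+1}
        chosen.append(i)
        alphas.append(alpha)
    return np.array(chosen), np.array(alphas)

def kmp_predict(Xnew, centers, gamma, chosen, alphas):
    Gnew = gaussian_dictionary(Xnew, centers[chosen], gamma)
    return Gnew @ alphas                         # f_N(x); its sign gives the class label
```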
3 Three Approaches to Refining Kernel Matching Pursuit
Without loss of generality, we consider a dictionary consisting of only Gaussian kernel functions, $D = \{d_i = \exp(-\gamma_i \|x - c_i\|^2) \mid i = 1, \ldots, m\}$, for binary classification problems. Let $S = \{(x_1, y_1), \ldots, (x_l, y_l)\}$ be a set of training examples, where $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$, and $y_i$ represents the class label of $x_i$, $i = 1, \ldots, l$. The discriminant function built by KMP is of the form

$$f_z(x) = \sum_{i=1}^{N} w_i \exp\left(-\gamma_i \|x - c_i\|^2\right), \tag{6}$$

where the vector $z = (w_1, \gamma_1, c_1, \ldots, w_N, \gamma_N, c_N)$ is a condensed expression of the parameters of (6). Given $x$, if $f_z(x) > 0$ then $x$ is classified as a positive example, otherwise as a negative one. We propose to further refine the discriminant function (6) by searching for the $z_{opt}$ which minimizes

$$E(z) = \frac{1}{2}\sum_{j=1}^{l}\left(y_j - f_z(x_j)\right)^2. \tag{7}$$
Three optimization techniques, gradient descent (GD), simulated annealing (SA), and particle swarm optimization (PSO), are used to minimize (7) and search for $z_{opt}$. The three algorithms have distinct characteristics. GD is fast, yet is easily trapped at a local optimum. SA improves on GD by accepting, to a limited extent, deteriorations during the search, and so can escape from local extreme points. However, SA is very slow, since it produces and tests each solution sequentially. Further, PSO is also chosen to minimize (7), considering its ability to perform parallel search and global optimization.

3.1 Gradient Descent (GD)
GD carries out the following iterative procedure to find $z_{opt}$ along the direction of the negative gradient $-\nabla E$ of (7):

$$z_{t+1} = z_t - \eta\,\nabla E(z_t), \tag{8}$$

where $\nabla E(z_t) = \left(\frac{\partial E}{\partial w_1^t}, \frac{\partial E}{\partial \gamma_1^t}, \frac{\partial E}{\partial c_1^t}, \ldots, \frac{\partial E}{\partial w_N^t}, \frac{\partial E}{\partial \gamma_N^t}, \frac{\partial E}{\partial c_N^t}\right)$ and $\eta$ is a learning rate. For the Gaussian kernel functions in (5), we can obtain the following expressions:

$$\frac{\partial E}{\partial w_i} = -\sum_{j=1}^{l} k(x_j, c_i)\left(y_j - f(x_j)\right), \tag{9}$$

$$\frac{\partial E}{\partial \gamma_i} = \sum_{j=1}^{l} w_i\, k(x_j, c_i)\,\|x_j - c_i\|^2\left(y_j - f(x_j)\right), \tag{10}$$

$$\frac{\partial E}{\partial c_{ik}} = -2\sum_{j=1}^{l} w_i\, \gamma_i\left(x_{jk} - c_{ik}\right) k(x_j, c_i)\left(y_j - f(x_j)\right). \tag{11}$$
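The gradient step (8) with the derivatives (9)–(11) can be written compactly in NumPy as in the sketch below; the learning rate and iteration count are placeholder values, not settings reported in this paper.

```python
import numpy as np

def refine_kmp_gd(X, y, w, gamma, centers, eta=1e-3, n_iter=200):
    """Second-stage refinement: jointly update the weights w_i, widths gamma_i
    and centers c_i (rows of `centers`) by the gradient step (8)."""
    w, gamma, centers = w.copy(), gamma.copy(), centers.copy()
    for _ in range(n_iter):
        diff = X[:, None, :] - centers[None, :, :]      # (l, N, d): x_j - c_i
        d2 = (diff ** 2).sum(axis=2)                    # ||x_j - c_i||^2
        K = np.exp(-gamma[None, :] * d2)                # k(x_j, c_i), Eq. (5)
        err = y - K @ w                                 # y_j - f_z(x_j), f_z as in Eq. (6)
        grad_w = -(K * err[:, None]).sum(axis=0)                            # Eq. (9)
        grad_gamma = (w[None, :] * K * d2 * err[:, None]).sum(axis=0)       # Eq. (10)
        grad_c = -2.0 * (w[None, :, None] * gamma[None, :, None] * diff
                         * K[:, :, None] * err[:, None, None]).sum(axis=0)  # Eq. (11)
        w -= eta * grad_w
        gamma -= eta * grad_gamma
        centers -= eta * grad_c
    return w, gamma, centers
```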
3.2 Simulated Annealing (SA)
The concept of simulated annealing (SA) is based on a strong analogy between the physical annealing process of solids and the problem of solving complex optimization problems [7]. We perform the following procedure to find the optimal solution $z_{opt}$:
Step 1: set the solution found by KMP as the initial point of SA, and determine a reasonable annealing strategy (i.e., set the initial temperature $T_0$, the annealing schedule, etc.).
Step 2: let $z_{t+1} = z_t + \Delta z$, where $\Delta z$ is a small random disturbance with uniform distribution, and compute $\Delta E = E(z_{t+1}) - E(z_t)$.
Step 3: if $\Delta E < 0$, directly accept $z_{t+1}$ as the new solution; otherwise, accept $z_{t+1}$ only with probability $P = \exp(-\Delta E/(k T_t))$, where $k$ is the Boltzmann constant.
Step 4: repeat Steps 2 and 3 until an equilibrium state is reached under the current temperature $T_t$.
Step 5: cool the temperature, $T_{t+1} = \alpha T_t$, then perform Steps 2 to 4 repeatedly, until $T_{t+1} = 0$ or a predefined low temperature is reached.
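A minimal sketch of this annealing loop for minimizing (7) is given below; the perturbation scale, initial temperature and stopping temperature are illustrative placeholders, and the Boltzmann constant is folded into the temperature.

```python
import numpy as np

def refine_kmp_sa(E, z0, T0=1.0, alpha=0.9, T_min=1e-4, step=0.01, moves_per_T=50, rng=None):
    """Simulated annealing on the objective E(z) of Eq. (7), started from the KMP solution z0."""
    if rng is None:
        rng = np.random.default_rng(0)
    z, Ez = z0.copy(), E(z0)
    T = T0
    while T > T_min:
        for _ in range(moves_per_T):                        # Steps 2-4: equilibrate at T
            cand = z + rng.uniform(-step, step, size=z.shape)
            dE = E(cand) - Ez
            if dE < 0 or rng.random() < np.exp(-dE / T):    # Step 3: Metropolis acceptance
                z, Ez = cand, Ez + dE
        T *= alpha                                          # Step 5: cooling, T_{t+1} = alpha * T_t
    return z, Ez
```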
3.3 Particle Swarm Optimization (PSO)
Particle swarm optimization (PSO) was originally proposed by Kennedy and Eberhart in 1995, through simulating the social behavior of bird flocks [8]. PSO first constructs an initial group, whose particles represent candidate solutions of the optimized problem. Each particle has a fitness value, as well as two special characteristics: position and velocity. The position of the $i$th particle in the swarm can be denoted as $x_i = (x_{i1}, \ldots, x_{id})$, and its velocity as $v_i = (v_{i1}, \ldots, v_{id})$. The best previously visited position of the $i$th particle is recorded and expressed as $p_i = (p_{i1}, \ldots, p_{id})$. The best position among all particles is also saved and written as $p_g = (p_{g1}, \ldots, p_{gd})$. The $i$th particle updates its velocity and position iteratively by

$$v_{ij} = w\,v_{ij} + c_1 q_1 (p_{ij} - x_{ij}) + c_2 q_2 (p_{gj} - x_{ij}), \tag{12}$$
$$x_{ij} = x_{ij} + v_{ij}, \tag{13}$$

where $w$ is called the inertia weight, $c_1$ and $c_2$ are two positive constants, and $q_1$ and $q_2$ are two random numbers in the range [0, 1]. The first part in (12), $w v_{ij}$, integrates previous velocities of the $i$th particle; the second, $c_1 q_1(p_{ij} - x_{ij})$, considers self-cognition; the third, $c_2 q_2(p_{gj} - x_{ij})$, is the social part representing the shared information and mutual cooperation among the particles. Through combining these factors, PSO balances its "exploitation" and "exploration" abilities smoothly. When applying PSO to minimizing (7), we first train basic KMP $M$ times to obtain $M$ solutions of (7), which constitute the initial swarm of PSO. Subsequently, the PSO algorithm is run iteratively to search for the optimal representation.
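The updates (12)–(13), with the swarm seeded by the M independently trained KMP solutions described above, can be sketched as follows; the constants and the linearly decreasing inertia schedule are only loosely modeled on the settings quoted in Section 4 and should be read as placeholders.

```python
import numpy as np

def refine_kmp_pso(E, init_solutions, n_iter=100, c1=1.0, c2=1.0, w_max=0.9, w_min=0.1, rng=None):
    """PSO over parameter vectors z; init_solutions is an (M, dim) array of KMP solutions."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = init_solutions.copy()
    v = np.zeros_like(x)
    p = x.copy()                                    # personal best positions
    p_val = np.array([E(z) for z in x])
    g = p[np.argmin(p_val)].copy()                  # global best position
    g_val = p_val.min()
    for t in range(n_iter):
        w = w_max - (w_max - w_min) * t / max(n_iter - 1, 1)    # inertia decreases with iterations
        q1, q2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * q1 * (p - x) + c2 * q2 * (g - x)       # Eq. (12)
        x = x + v                                               # Eq. (13)
        vals = np.array([E(z) for z in x])
        better = vals < p_val
        p[better], p_val[better] = x[better], vals[better]
        if vals.min() < g_val:
            g, g_val = x[np.argmin(vals)].copy(), vals.min()
    return g, g_val
```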
4 Experiments 4.1 Description on Data Sets and Parameter Settings We compared KMP + GD, KMP + SA, KMP + PSO with KMP and SVM, based on four datasets from the UCI machine learning repository: Heart, Pima Indians diabetes, Sona, and Ionosphere [9]. The LIBSVM software package [10] was directly used to implement SVM. Throughout the experiments, 1) All training data and test data were normalized to [-1, 1]; 2) Two-thirds of examples were randomly selected as training examples, and the remaining one-third as test examples; 3) Gaussian kernel functions were chosen for SVM, in which kernel width γ and penalty parameter C were decided by 10-fold cross validation on training sets;
4) The dictionary of KMP is composed of Gaussian kernel functions, whose widths were set the same as those of SVM, and whose centers consist of a randomly chosen one-third of the training examples;
5) The annealing schedule of SA was set as $T_{t+1} = 0.9\,T_t$;
6) In equation (12), $c_1 = c_2 = 1$, and $w$ decreased gradually as the iterations proceeded, within the range 0.1 to 0.9;
7) The size of the PSO swarm was set to 20;
8) Each algorithm was run 10 times for 10 different divisions of training and test examples, and the averages of their accuracies were computed.
4.2 Experimental Results
For the four data sets, we illustrate the results of the five algorithms in Fig. 1. In each panel, the horizontal axis represents the number of support vectors (i.e., the length of the solution), and the vertical axis denotes the accuracy of the classifiers. We use a straight line to express the accuracy of SVM, since the number of its support vectors is decided automatically by training. We also list, in Table 1, the numbers of the least support vectors for KMP, KMP+GD, KMP+SA, and KMP+PSO, respectively, when their performances reach or approach that of SVM. The sparsity of the different methods can thus be compared when they obtain their respective optimal performances.
Fig. 1. Experimental results with the number of support vectors increasing (panels: heart, diabetes, sona, ionosphere; x-axis: support vectors; y-axis: accuracy; curves: KMP+PSO, KMP+SA, KMP+GD, KMP, SVM)
According to Fig. 1, we find:
1) For the Heart dataset, KMP+PSO performs better than SVM, even when the number of support vectors is less than twenty. KMP+SA and KMP+GD have accuracies comparable to SVM, but need fewer support vectors. Additionally, the single KMP is worse than SVM.
2) For each of the other three datasets, Diabetes, Sona and Ionosphere, SVM shows the best accuracy, yet the number of its support vectors is far larger than that of the other methods.
3) KMP+PSO is uniformly better than the other three KMP-based methods: KMP+SA, KMP+GD, and single KMP.
4) On the whole, KMP+SA is not superior to KMP+GD, though in theory SA can realize global optimization.
5) The single KMP shows the worst performance, which also confirms that GD, SA, and PSO can indeed improve the results of KMP.
Table 1. The numbers of the least support vectors for the 5 methods, with comparable accuracies
Dataset        Heart              Diabetes           Sona               Ionosphere
               #SVs   Accuracy    #SVs   Accuracy    #SVs   Accuracy    #SVs   Accuracy
KMP+PSO        11     0.8433      168    0.7307      42     0.8830      52     0.9459
KMP+SA         13     0.8356      178    0.7228      40     0.8457      49     0.8938
KMP+GD         34     0.8301      172    0.7244      45     0.8310      51     0.9067
KMP            36     0.8156      187    0.7236      35     0.7667      59     0.8776
SVM            99     0.8300      280    0.7461      111    0.9130      130    0.9744
From Table 1, KMP+PSO, KMP+SA, and KMP+GD outperform basic KMP in accuracy as well as the sparsity of solutions.
5 Conclusions and Further Thoughts
This paper proposes to simultaneously optimize all parameters in the solutions of KMP via three different methods: GD, SA, and PSO, respectively. Thus, on the one hand, the fast training speed of KMP is exploited to obtain good initial solutions; on the other hand, applying further optimization methods overcomes the drawback that the kernel functions in the dictionary remain unchanged during training. Experimental results show that GD, SA, and PSO can refine basic KMP to different extents, and that PSO exhibits the best performance. Additionally, an interesting observation is that many machine learning algorithms have the same shape of solution as KMP, such as SVMs, radial basis function neural networks, and Gaussian mixture models. So the attempt to apply the idea of this paper to refining these algorithms may be of some significance.
Acknowledgments. The work was supported by the foundation of Beijing Key Lab of Intelligent Information Technology.
References 1. Vincent, P., Bengio, Y.: Kernel Matching Pursuit. Mach. Learn. 48(1), 165–187 (2002) 2. Popovici, V., Bengio, S., Thiran, J.P.: Kernel Matching Pursuit for Large Datasets. Pattern Recogn. 38, 2385–2390 (2005) 3. Li, Q., Jiao, L.: Base Vector Selection for Kernel Matching Pursuit. In: Li, X., Zaiane, O.R., Li, Z. (eds.) ADMA 2006. LNCS (LNAI), vol. 4093, pp. 967–976. Springer, Heidelberg (2006) 4. Jiao, L., Li, Q.: Kernel Matching Pursuit Classifier Ensemble. Pattern Recogn. 39, 587–594 (2006) 5. Popovici, V., Thiran, J.P.: Adaptive Kernel Matching Pursuit for Pattern Classification. In: Proceedings of the Lasted International Conference on Artificial Intelligence and Applications, Innsbruck, Austria, pp. 235–239 (2004) 6. Mallat, S., Zhang, Z.: Matching Pursuit with Time-Frequency Dictionaries. IEEE Trans. Signal Proc. 41(12), 3397–3415 (1993) 7. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by Simulated Annealing. Science 220(4), 671–680 (1983) 8. Kennedy, J., Eberhart, R.: Particle Swarm Optimization. In: Proc. IEEE International Conference on Neural Networks, Perth, WA, vol. 4, pp. 1942–1948 (1995) 9. Blake, C., Keogh, E., Merz, C.J.: UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/~mlearn/MLRepository.html 10. Chang, C.C., Lin, C.J.: LIBSVM: a Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libSVM
Optimization of Training Samples with Affinity Propagation Algorithm for Multi-class SVM Classification Guangjun Lv, Qian Yin, Bingxin Xu, and Ping Guo Image Processing and Pattern Recognition Laboratory Beijing Normal University, Beijing 100875, China
[email protected],
[email protected]
Abstract. This paper presents a novel optimization method of training samples with Affinity Propagation (AP) clustering algorithm for multi-class Support Vector Machine (SVM) classification problem. The method of optimizing training samples is based on region clustering with affinity propagation algorithm. Then the multi-class support vector machines are trained for natural image classification with AP optimized samples. The feature space constructed in this paper is a composition of combined histogram with color, texture and edge descriptor of images. Experimental results show that better classification accuracy can be obtained by using the proposed method. Keywords: Affinity Propagation Clustering, multi-class Support Vector Machine, natural image classification, training sample optimization.
1 Introduction
Most proposed systems for content-based image retrieval (CBIR) are based on low-level visual features of images, such as color, texture and shape statistics [1]. One of the main challenges for CBIR approaches is to bridge the semantic gap between low-level features and high-level contents [2][3]. Automatic image annotation at the semantic level employs keywords to represent images, which is often a more practical choice compared with a query-by-example approach [4], because people are willing to describe an image with keywords rather than with low-level features. Cusano et al. [1] used a Support Vector Machine (SVM) to annotate images, classifying image regions into one of seven classes. Shao et al. [2] also adopted SVM to realize automatic image annotation for semantic image retrieval; they applied SVM to classify visual descriptors into different image categories such as landscape, cityscape, vehicle or portrait. As we know, the most important step of automatic annotation is image classification. SVM has been applied to image semantic classification intensively by researchers; for example, Wan et al. [5] employed one-versus-all SVM to group images into semantic classes. However, these approaches only employ SVM to classify images and do not consider the local region information of the image, so it becomes very difficult to obtain the best classification precision. In order to investigate the possibility of developing algorithms with better classification accuracy, we should study the currently most popular algorithms.
The Affinity Propagation (AP) algorithm [6] is a powerful clustering algorithm that can be applied to identify a relatively small number of exemplars to represent a whole set of feature vectors [6]. Yang et al. [6] adopted the AP algorithm to improve image modeling for semantic annotation. SVM classification belongs to supervised learning; the performance of the classifier strongly depends on the learning algorithm as well as on the quality of the training samples. If we first obtain the representative pieces of each category by the AP clustering algorithm, the redundant information of the training set can be removed. Therefore, the training samples for SVM can be said to be optimized in this way, and the classification precision should improve when classifying images with the multi-class SVM. In this paper, we propose a new method which can obtain a high-quality training set for SVM classification in order to increase the classification precision. As we know, SVM is a well-known machine learning method which is used extensively in small training sample cases. It is well known that the training samples have a great influence on the results of classification. The most commonly used methods choose training samples by hand, and usually it is not considered whether these samples are representative of the image content or not. Therefore, if we use a clustering method to find the representative images from each semantic class as training samples, the classification accuracy can be expected to improve. The AP algorithm is a new clustering algorithm whose computation is fast when handling problems with a large number of classes, and it can also determine the center of each cluster automatically. During AP clustering, real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges [7]. Therefore, the AP algorithm is well suited to this experiment. The paper is organized as follows. Section 2 describes related knowledge including the feature descriptors, the AP algorithm and the multi-class SVM. Section 3 presents our method of optimizing the training set for SVM in detail. In Section 4, we report and analyze the experimental results obtained by the method presented in this paper. The conclusions are given in Section 5, where we also present a discussion of future work.
2 Backgrounds
In this section, we briefly review the relevant knowledge applied in this work. Section 2.1 reviews the color, texture and edge descriptors used in this work, which are the basis of the entire experiment. The AP algorithm and the multi-class SVM are described in Sections 2.2 and 2.3, respectively.
2.1 Feature Description
Color Descriptor. In this research work, a generic Color Histogram Descriptor (SCD) is used [15]. In SCD, the color image is first transformed from the RGB to the HSV color space, and then the triple-color components (H, S, V) are quantized into non-equal intervals according to human visual perception of color. We uniformly quantize the HSV space into a total of 256 bins, which includes 16 levels in H, 4 levels in S, and 4 levels in V, respectively.
Based on the quantization level aforementioned, the triple-color components are mapped into a one-dimension vector using formula (1)
L = H ∗ Qs ∗ Qv + S ∗ Qv + V ,
(1)
where Qs and Qv are the numbers of quantization levels for the color components S and V respectively. Therefore H, S, V can be represented by a vector according to formula (1), with the value range of L = [0, 1, 2, …, 255]. Then we can get the image color spectrum {h[k]} (k = 0, 1, 2, …, 255) according to formula (2):

$$h[k] = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1} f(i, j, k). \tag{2}$$
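As an illustration of (1)–(2), the following sketch computes the 256-bin color histogram from an HSV image whose channels have already been quantized to the 16/4/4 levels described above; it reflects our reading of the descriptor, not the authors' implementation.

```python
import numpy as np

def color_histogram(hsv, q_h=16, q_s=4, q_v=4):
    """Combined HSV histogram: L = H*Qs*Qv + S*Qv + V (Eq. 1), counted per Eq. (2).

    `hsv` is an (m, n, 3) array whose channels are already quantized to
    integer levels H in [0, q_h), S in [0, q_s), V in [0, q_v).
    """
    H, S, V = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    L = H * q_s * q_v + S * q_v + V                    # Eq. (1): one label per pixel
    hist = np.bincount(L.ravel().astype(int), minlength=q_h * q_s * q_v)  # Eq. (2)
    return hist
```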
Texture Descriptor. Image texture refers to a kind of change of pixel intensity in some neighborhood, which is spatially a statistical relationship [5]. In our experiment, we use Pass's method to describe the image texture histogram, that is, the texture feature at pixel (j, k) is defined to be the number of neighboring pixels whose intensities differ by more than a fixed value [16]. A detailed description of the method can be found in reference [5].
Edge Histogram Descriptor (EHD). Edge is a basic feature of images. The edge histogram descriptor captures the spatial distribution of edges, somewhat in the same spirit as the color layout descriptor. The distribution of edges is a good texture signature that is useful for image matching even when the underlying texture is not homogeneous [15]. The extraction process of the EHD is presented explicitly in references [15] and [16].
2.2 Affinity Propagation Algorithm
The affinity propagation algorithm was proposed by Frey et al. in 2007 [7]; it simultaneously considers all data points as potential exemplars. By viewing each data point as a node in a network, the AP algorithm recursively transmits real-valued messages along edges of the network until a good set of exemplars and corresponding clusters emerges. The AP algorithm takes as input a collection of real-valued similarities between data points, where the similarity s(i, k) indicates how well the data point with index k is suited to be the exemplar for data point i [7]. It can be briefly described as follows [8]:
$$s(i, k) = -\|x_i - x_k\|^2. \tag{3}$$
$$r(i, k) \leftarrow s(i, k) - \max_{k' \ne k}\{a(i, k') + s(i, k')\}. \tag{4}$$
$$a(i, k) \leftarrow \min\Big\{0,\; r(k, k) + \sum_{i' \notin \{i, k\}} \max\{0, r(i', k)\}\Big\}. \tag{5}$$
The responsibility r(i, k), sent from data point i to candidate exemplar point k, reflects the accumulated evidence for how well-suited point k is to serve as the exemplar for point i, taking into account other potential exemplars for point i[7]. Availability a(i, k)
reflects the accumulated evidence for how appropriate it would be for feature i to choose feature k as its exemplar, considering the support from other feature vectors that feature k should be an exemplar. When the preference s(k, k) grows big, each node tends to select itself as the exemplar, then the number of clusters will increase consequently [8]. For k = i, the responsibility r(k, k) is set to the input preference that point k be chosen as an exemplar, s(k, k), minus the largest of the similarities between point i and all other candidate exemplars. This “self-responsibility” reflects accumulated evidence that point k is an exemplar, based on its input preference tempered by how ill-suited it is to be assigned to another exemplar[7]. The “self-availability” a(k, k) is updated differently:
$$a(k, k) \leftarrow \sum_{i' \ne k} \max\{0, r(i', k)\}. \tag{6}$$
This message reflects the accumulated evidence that point k is an exemplar, based on the positive responsibilities sent to candidate exemplar k from other points.
2.3 Multi-class SVM
In the research fields of machine learning and pattern classification, support vector machines are categorized as a supervised learning approach that has been demonstrated to perform well in numerous practical applications [9][10][11]. The SVM methodology comes from the application of statistical learning theory to separating hyperplanes for binary classification problems [12][13]. For pattern classification, SVM has a very good generalization performance without domain knowledge of the problems [5]. It is particularly suitable for problems with a small number of samples of high dimension. SVM classifiers are two-class classifiers in nature. However, we can obtain a multi-class SVM classifier by training several classifiers and combining their results. Two common methods of multi-class SVM are "one per class" and "pairwise coupling" [1]. Considering the running time of the SVM classifier, we select the former, which trains one classifier for each class to discriminate between that class and the others. It means that with a discrimination function f(i), we can classify positive samples to class i and negative ones to the other classes.
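To make the message-passing rules (3)–(6) concrete, the sketch below gives a compact, dense NumPy version of the responsibility and availability iterations; the damping factor and the median preference are common illustrative choices rather than values taken from [7] or from this paper.

```python
import numpy as np

def affinity_propagation(X, preference=None, damping=0.5, n_iter=200):
    """Dense AP clustering on the rows of X, with s(i,k) = -||x_i - x_k||^2 (Eq. 3)."""
    n = X.shape[0]
    S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)          # Eq. (3)
    np.fill_diagonal(S, np.median(S) if preference is None else preference)
    R, A = np.zeros((n, n)), np.zeros((n, n))
    for _ in range(n_iter):
        # Eq. (4): r(i,k) <- s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        max_excl = np.repeat(first[:, None], n, axis=1)
        max_excl[np.arange(n), idx] = second
        R = damping * R + (1 - damping) * (S - max_excl)
        # Eq. (5): a(i,k) <- min{0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k))}
        # Eq. (6): a(k,k) <- sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        cand = Rp.sum(axis=0)[None, :] - Rp
        diag = cand.diagonal().copy()
        cand = np.minimum(cand, 0)
        cand[np.arange(n), np.arange(n)] = diag
        A = damping * A + (1 - damping) * cand
    exemplars = np.argmax(A + R, axis=1)        # exemplar index chosen by each point
    return exemplars
```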
3 Optimizing Training Samples for the SVM Classifier
An image representation method based on image partition and region clustering by the AP algorithm is proposed, which can optimize the training samples for the SVM classifier. The mapping of an image to its representative feature vector does not depend on that image alone but also on the entire collection of images from which the region groups have been built [14]. Therefore image partition is necessary to construct representative models for image classes automatically. Compared with image segmentation, image partition is easy to realize and the spatial information is embedded into every representative piece. This is the reason for using image partition rather than image segmentation in this work. As is known, a single type of low-level feature does not work well in image retrieval, while it is difficult to incorporate color, texture and edge features seamlessly,
because they belong to different metric systems, and these features are not directly comparable [5]. However, if we use texture and edge descriptors in histogram form, just like the color histogram, the problem can be solved, because histogram descriptors lie in the same feature space. In this research work, the color, texture and edge histograms are combined together to form the image feature vector. Figure 1 shows the proposed classification framework for natural images in this paper. Firstly, the images of the training set need to be split evenly. The size of the partition window influences the final result because pieces of different sizes capture information at different scales. Generally speaking, a smaller piece reveals local image content, while a bigger one describes the relevant information of adjacent pieces [14]. Taking both aspects and the running time into consideration, the size of the partition window is set to 50×50 pixels. Using this partition window, we decompose an image into a set of non-overlapping blocks of 50×50 pixels. After image partition, a combined 512-dimensional feature vector including the color and texture descriptors of each region is extracted. In order to optimize the feature vectors of the training samples for SVM, the feature vectors of all regions of each category are clustered by the AP algorithm. Then we obtain the representative images for each category. There exist several clustering methods, such as nearest neighbor clustering, k-means, AP and so on. Compared with other clustering algorithms, AP finds clusters with much lower error and costs less than one-hundredth the amount of time [7]. Another merit of AP is that it is suited to problems with high-dimensional samples. Therefore AP is suitable for the problem we address. Secondly, the cluster centers obtained by the AP algorithm compose the training data for each semantic class. All images from the training set are processed in the same way, and the number of representative image pieces of every clustered class is determined by the AP algorithm adaptively, which exploits another advantage of the AP algorithm. When the training samples for SVM are optimized by the AP algorithm, they can produce better classification hyperplanes than the original training samples. Thirdly, the SVM classifiers for each class are trained with the corresponding training samples of that class as positive ones and samples from the other classes as negative ones. In the testing stage, we first extract the aforementioned 517-dimensional feature vectors of all images, including color, texture and edge descriptors, and integrate them as the inputs for testing. Then we count the number of images of each class that are classified correctly to compute the classification precision.
Fig. 1. Proposed natural image classification system framework
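A compact sketch of the training-sample optimization pipeline of Fig. 1 is given below (image partition, region features, per-class AP clustering, one-per-class SVMs); it reuses the affinity_propagation and color_histogram sketches above, takes the region feature extractor as a callable, and uses scikit-learn's SVC as a stand-in classifier, so the names and library choice are ours rather than the authors'.

```python
import numpy as np
from sklearn.svm import SVC

def partition(image, block=50):
    """Split an image into non-overlapping block x block pieces (edge remainders dropped)."""
    h, w = image.shape[:2]
    return [image[i:i + block, j:j + block]
            for i in range(0, h - block + 1, block)
            for j in range(0, w - block + 1, block)]

def optimize_training_set(images_by_class, feature):
    """Per class: extract region features and keep only the AP exemplars as training samples."""
    X, y = [], []
    for label, images in images_by_class.items():
        feats = np.array([feature(p) for img in images for p in partition(img)])
        exemplars = np.unique(affinity_propagation(feats))   # AP-selected representatives
        X.append(feats[exemplars])
        y.extend([label] * len(exemplars))
    return np.vstack(X), np.array(y)

def train_one_per_class(X, y):
    """One classifier per class: positive = that class, negative = all the others."""
    return {c: SVC(kernel="rbf").fit(X, (y == c).astype(int)) for c in np.unique(y)}
```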
4 Experiments and Analysis
In the experiments, we gathered 5 classes of image samples from the internet. Table 1 shows the number of training and testing samples of each class. We conducted two groups of experiments: experiment 1 uses the one-per-class SVM classifier without optimizing the training samples, and experiment 2 uses our method.
Table 1. The number of training and testing samples
Class     Training sample number   Testing sample number
grass     12                       25
sky       10                       40
woods     14                       37
water     13                       24
sunset    10                       25
Table 2 shows that the increments in classification precision for grass, sky and woods are 24%, 2.5% and 4.17%, respectively, while the precision remains unchanged for water and sunset; on average the precision is increased by 6.134%. The classification precision of sunset is always high, no matter which method is used. After analysis, we find that its feature distribution is even, which means any piece extracted from a sunset image may represent the image itself well. Figure 2 shows some representative pieces of the sunset class obtained by our method. Figure 3 is the feature distribution map of all representative pieces of the sunset class, with the original features reduced to 3 dimensions by the principal component analysis (PCA) algorithm. From Figure 3 we can see that the distribution of most nodes is centralized, with only a few scattered, which is the reason for the high classification precision in the sunset category.
Table 2. The classification precision of the 5 categories

Category   Experiment 1   Experiment 2
Grass      64%            88%
Sky        87.5%          90%
Woods      87.5%          91.67%
Water      92%            92%
Sunset     100%           100%
Fig. 2. Some representative pieces of sunset image
Fig. 3. The combined feature distribution in 3-D space
Images that could not be classified correctly when the training data were not optimized can now be classified into the correct classes after using the proposed method, for example in the grass, sky and woods categories. Figure 4 is a sample of grass that is classified correctly with our method but incorrectly when the training samples are not preprocessed. The reason is that the redundant information of the original images is removed when the representative pieces of grass images obtained by our method are applied for training. Therefore, using representative pieces as training samples brings about finer classification hyperplanes. Through the two groups of experiments, we find that when the training samples of a category cover as many cases as possible, the precision of that category is high, as for sunset, sky, water and woods in Table 2. Moreover, the similarity among all categories
Fig. 4. Grass image I from testing set
Fig. 5. Grass image II from testing set
affects the final classification precision. When it is small, the probability of misclassification is low, and the discrimination function can differentiate one class of images from the others well; therefore the classification precision is high. For example, Figure 5 shows one image from the grass class, and the experimental result shows that it is classified into the woods class, which suggests that its similarity with the woods class is high.
5 Conclusion and Future Work
The method of optimizing the training samples for SVM with the AP algorithm is shown to be feasible and improves the classification precision considerably compared with using SVM alone. The optimized training samples represent most of the image content and reduce the redundant information. We can use a small number of training images to generate more training samples with the image partition method. The proposed method can also be applied to realize fast automatic image semantic classification. Further research will develop an automatic feature selection method for semantic image annotation.
Acknowledgments. The research work described in this paper was fully supported by grants from the National Natural Science Foundation of China (Project No. 60675011, 90820010). Prof. Qian Yin and Ping Guo are the authors to whom all correspondence should be addressed.
References 1. Cusano, C., Ciocca, G., Schettini, R.: Image annotation using SVM. In: Proc SPIE, vol. 5304, pp. 330–338 (2004) 2. Shao, W.B., Naghdy, G., Phung, S.L.: Automatic Image Annotation for Semantic Image Retrieval. In: Qiu, G., Leung, C., Xue, X.-Y., Laurini, R. (eds.) VISUAL 2007. LNCS, vol. 4781, pp. 369–378. Springer, Heidelberg (2007) 3. Lokesh, S., Hans, B.: Feature Selection for Automatic Image Annotation. In: Franke, K., Müller, K.-R., Nickolay, B., Schäfer, R. (eds.) DAGM 2006. LNCS, vol. 4174, pp. 294–303. Springer, Heidelberg (2006) 4. Wang, L., Latifur, K.: Automatic Image Annotation and Retrieval Using Weighted Feature Selection. Multimedia Tools and Applications 29, 55–71 (2006)
5. Wan, H.L., Chowdhury, M.U.: Image Semantic Classification by Using SVM. Journal of Software 14, 1891–1899 (2003) (in Chinese) 6. Yang, D., Guo, P.: Improvement of Image Modeling with Affinity Propagation Algorithm for Semantic Image Annotation. In: Leung, C.S., Lee, M., Chan, J.H. (eds.) ICONIP 2009, Part I. LNCS, vol. 5863, pp. 778–787. Springer, Heidelberg (2009) 7. Frey, B.J., Dueck, D.: Clustering by Passing Messages between Data Points. Science 315(5814), 972–976 (2007), (Epub. January 11, 2007) 8. Frey, B.J., Dueck, D.: Mixture Modeling by Affinity Propagation. In: Advances in Neural Information processing Systems, vol. 18, pp. 379–386 (2006) 9. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2, 121–167 (1998) 10. Cristianini, N., Shawe-Taylor, J.: An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge (2001) 11. Abe, S.: Support vector machines for pattern classification. Springer, New York (2005) 12. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995) 13. Cortes, C., Vapnik, V.: Support-Vector Networks. Machine Learning 20(3), 273–297 (1995) 14. Xing, H.Q., Wang, G.Y.: Partition- Based Image Classification Using SVM. Control & Automation 22(13) (2006) 15. Manjunath, B.S., Ohm, J.-R., Vasudevan, V.V., Yamada, A.: Color and Texture Descriptors. IEEE Trans. Circuits Syst. Video Technol. 11, 703–715 (2001) 16. Pass, G., Zabih, R.: Comparing images using joint histograms. Multimedia Syst. 7, 234–240 (1999)
An Effective Support Vector Data Description with Relevant Metric Learning
Zhe Wang¹, Daqi Gao¹, and Zhisong Pan²
¹ Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, 200237, P.R. China
² Institute of Command Automation, PLA University of Science & Technology, Nanjing, 210007, P.R. China
Abstract. Support Vector Data Description (SVDD) as a one-class classifier was developed to construct the minimum hypersphere that encloses all the data of the target class in a high dimensional feature space. However, SVDD treats the features of all data equivalently in constructing the minimum hypersphere, since it adopts the Euclidean distance metric and lacks the incorporation of prior knowledge. In this paper, we propose an improved SVDD through introducing relevant metric learning. The presented method, named RSVDD here, assigns large weights to the relevant features and tightens the similar data through incorporating the positive equivalence information in a natural way. In practice, we introduce relevant metric learning into the original SVDD model with the covariance matrices of the positive equivalence data. The experimental results on both synthetic and real data sets show that the proposed method can bring a more accurate description for all the tested target cases than the conventional SVDD. Keywords: Support vector data description; Relevant metric learning; One-class classification.
1 Introduction
One-class classification [1,2,3,4] has recently become an active research topic in machine learning. Since only one certain class, named the target class, is generally available, one-class classification differs from traditional binary or multi-class classification. Support Vector Domain Description (SVDD), one popular one-class classifier, was proposed by Tax and Duin [2]. SVDD constructs a hypersphere that encloses as many of the target objects as possible, while minimizing the chance of accepting the non-target data, named the outlier objects. Since SVDD is motivated by the support vector classifier, it inherits the advantages of both solution sparseness and kernel-induced flexibility. It is known that the original SVDD model adopts the Euclidean distance metric. An important problem in learning algorithms based on the Euclidean distance metric is the scale of the input variables. In the Euclidean case, SVDD treats all the features of the target class data equivalently in training. As a result,
those irrelevant features of the data might be considered in training and would mislead the data description of the SVDD model into an irrelevant hypersphere. Simultaneously, SVDD with the Euclidean distance metric fails to consider the prior relationships among the target data. In this paper, we introduce Relevant Metric Learning [5], rather than the original Euclidean distance metric, into SVDD and therefore propose an improved SVDD classifier named RSVDD, whose underlying motivations and contributions are as follows:
• Relevant metric learning was developed for unsupervised learning through using positive equivalence relations. Its representative algorithm is Relevant Component Analysis (RCA) [5]. RCA is an effective linear transformation algorithm that constructs a Mahalanobis distance metric based on the covariance matrices of the positive equivalence data. In RCA, the positive equivalence data are selected from the same chunklet. Each chunklet is a set in which data come from the same class but without specific class labels. Through the transformation based on a group of chunklets, RCA can assign large weights to relevant features and low weights to irrelevant ones [5]. Here, we introduce RCA into the original SVDD model so that the improved SVDD, named RSVDD, can inherit the advantages of RCA. Concretely, the proposed RSVDD reduces the scale influence of the input variables due to the use of the Mahalanobis distance metric from RCA. Simultaneously, the proposed RSVDD can easily incorporate a priori knowledge by considering the positive equivalence data from the same chunklets rather than the whole target class.
• The original RCA model is a linear transformation algorithm in the input space. Thus, RCA fails on nonlinear problems. Meanwhile, since the number of parameters of RCA depends on the dimensionality of the feature vectors, RCA suffers from the curse of dimensionality. The proposed RSVDD adopts the linear RCA and thus inherits these shortcomings. To this end, we further propose a kernelized RSVDD that can deal with nonlinear classification cases.
• The proposed RSVDD is a single classification process rather than two separate steps of preprocessing and classifying.
The rest of this paper is organized as follows. Section 2 gives the structure of the proposed RSVDD in terms of both linearization and kernelization. Section 3 experimentally shows that the proposed RSVDD can bring a more accurate description for all the tested target cases than the conventional SVDD. Finally, both the conclusion and future work are given in Section 4.
2 Relevant Support Vector Data Description (RSVDD)
SVDD is proposed to construct a hypersphere that can contain all target objects and minimize the probability of accepting outliers. SVDD adopts the kernel technique and therefore obtains a more flexible description for the target class. In order to integrate more prior knowledge, we propose an improved SVDD named RSVDD with relevant metric learning instead of the original Euclidean distance metric. This section gives both the linear and kernel RSVDD algorithms.
2.1 Linear RSVDD
Suppose that there is a set of one-class training samples $\{x_i\}_{i=1}^{N} \subseteq \mathbb{R}^n$. SVDD seeks such a hypersphere that can contain all the samples $\{x_i\}_{i=1}^{N}$ and minimize the volume of the hypersphere through the following optimization formulation:

$$\min \; J = R^2 + C\sum_i \xi_i \tag{1}$$
$$\text{s.t.}\quad (x_i - a)^T M^{-1} (x_i - a) \le R^2 + \xi_i, \tag{2}$$
$$\xi_i \ge 0, \quad i = 1, \ldots, N, \tag{3}$$

where the parameters $R \in \mathbb{R}$ and $a \in \mathbb{R}^n$ are the radius and the center of the optimized hypersphere respectively, the regularization parameter $C \in \mathbb{R}$ gives the tradeoff between the volume of the hypersphere and the errors, and the $\xi_i \in \mathbb{R}$ are slack variables. Since SVDD adopts the Euclidean distance metric, the matrix $M \in \mathbb{R}^{n\times n}$ is an identity matrix with all the diagonal elements 1 and the others 0. It can be found that SVDD views all the features of the samples equivalently. In contrast, our proposed RSVDD framework assigns large weights to the relevant features and small weights to the irrelevant ones through introducing relevant metric learning instead of the Euclidean metric. The relevant metric learning matrix $M \in \mathbb{R}^{n\times n}$ is defined as follows [5]:

$$M = \frac{1}{N}\sum_{d=1}^{D}\sum_{j=1}^{n_d} (x_{dj} - \bar{x}_d)(x_{dj} - \bar{x}_d)^T, \tag{4}$$

where $D$ is the number of chunklets, $n_d$ is the number of samples in the $d$th chunklet, and $\bar{x}_d$ is the mean of the $d$th chunklet. Here, the sample set $\{x_i\}_{i=1}^{N}$ is divided into $D$ chunklets without replacement, i.e., $N = \sum_{d=1}^{D} n_d$. The positive equivalence information is provided in the form of chunklets, where the samples in the same chunklet belong to the same class, though the exact class label is not known. Through substituting (4) into (2), the objective function of the proposed RSVDD is obtained. In order to optimize the parameters $R$, $a$, $\xi_i$, we construct the Lagrangian function through introducing Lagrangian multipliers $\alpha_i$, $\gamma_i$ and taking (2), (3), (4) into (1). Thus, we get

$$L = R^2 + C\sum_i \xi_i - \sum_{i=1}^{N} \alpha_i\left[R^2 + \xi_i - (x_i - a)^T M^{-1}(x_i - a)\right] - \sum_{i=1}^{N} \gamma_i \xi_i, \tag{5}$$
where $\alpha_i \ge 0$, $\gamma_i \ge 0$. Setting the partial derivatives of $L$ with respect to $R$, $a$, $\xi_i$ to 0, we can get

$$\frac{\partial L}{\partial R} = 0 \;\Rightarrow\; \sum_{i=1}^{N}\alpha_i = 1, \tag{6}$$
$$\frac{\partial L}{\partial a} = 0 \;\Rightarrow\; a = \sum_{i=1}^{N}\alpha_i x_i, \tag{7}$$
$$\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \gamma_i = C - \alpha_i. \tag{8}$$
Further, we take the constraints (6), (7), (8) into the Lagrangian function (5) and obtain the maximized criterion as follows:

$$\max \; L(\alpha_i) = \sum_i \alpha_i x_i^T M^{-1} x_i - \sum_{i,j}\alpha_i\alpha_j x_i^T M^{-1} x_j \tag{9}$$
$$\text{s.t.}\quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, N, \tag{10}$$
$$M = \frac{1}{N}\sum_{d=1}^{D}\sum_{j=1}^{n_d} (x_{dj} - \bar{x}_d)(x_{dj} - \bar{x}_d)^T. \tag{11}$$

The maximization of (9) can be solved through Quadratic Programming (QP) [6]. Then a test sample $z \in \mathbb{R}^n$ is classified as the target class when the relevant distance $\|z - a\|_M$ between the sample $z$ and the center $a$ of the hypersphere is smaller than or equal to the radius $R$, i.e.,

$$\|z - a\|_M^2 = (z - a)^T M^{-1}(z - a) \le R^2. \tag{12}$$

The radius $R$ can be calculated from the center $a$ of the hypersphere to any sample on the boundary. In mathematics, the radius $R$ is given as follows:

$$R^2 = (x_i - a)^T M^{-1}(x_i - a), \tag{13}$$

where $x_i$ is a sample from the set of support vectors, i.e., its Lagrangian multiplier satisfies $0 < \alpha_i < C$.
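A small sketch of the quantities needed by the linear RSVDD decision rule is given below: the chunklet covariance matrix of (4)/(11) and the relevant distance test (12)–(13); solving the dual (9)–(10) itself is left to a generic QP solver and is omitted, and all names are illustrative.

```python
import numpy as np

def chunklet_covariance(chunklets):
    """M = (1/N) sum_d sum_j (x_dj - mean_d)(x_dj - mean_d)^T over all chunklets (Eq. 4)."""
    N = sum(len(c) for c in chunklets)
    dim = chunklets[0].shape[1]
    M = np.zeros((dim, dim))
    for C in chunklets:                      # C is an (n_d, dim) array of same-class samples
        centered = C - C.mean(axis=0)
        M += centered.T @ centered
    return M / N

def relevant_distance2(z, a, M_inv):
    """Squared relevant distance ||z - a||_M^2 = (z - a)^T M^{-1} (z - a) (Eq. 12)."""
    d = z - a
    return float(d @ M_inv @ d)

def classify(z, a, M_inv, R2):
    """Accept z as target iff its relevant distance to the centre a is within the radius (Eq. 12)."""
    return relevant_distance2(z, a, M_inv) <= R2
```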
2.2 Kernel RSVDD
The kernel-based methods map the data from an input space to a feature space through kernel functions and have been successfully applied to classification problems [7]. It should be stated that the kernel-based methods achieve the mapping implicitly without large computations. They only depend on the inner product defined in the feature space, which can be calculated from a kernel function. This subsection shows how to achieve a kernelized RSVDD. In doing so, the kernelized RSVDD can work on non-linear classification problems and overcome the curse of dimensionality problem. From (9), the key problem of kernelizing RSVDD is to obtain the inner product form of $x_i^T M^{-1} x_j$. Here, we achieve this with the technique shown in [8]. Firstly, we give the chunklet covariance matrix $M \in \mathbb{R}^{n\times n}$ defined in the literature [8] as follows:

$$M = \frac{1}{N} X H X^T, \tag{14}$$

where $X = [x_{1,1}, x_{1,2}, \ldots, x_{1,n_1}, \ldots, x_{d,1}, x_{d,2}, \ldots, x_{d,n_d}, \ldots, x_{D,1}, x_{D,2}, \ldots, x_{D,n_D}] \in \mathbb{R}^{n\times N}$; $H = \sum_{d=1}^{D}\left(I_d - \frac{1}{n_d}\mathbf{1}_d\mathbf{1}_d^T\right) \in \mathbb{R}^{N\times N}$; $x_{d,i}$ denotes the $i$th sample of the $d$th chunklet set; $\mathbf{1}_d \in \mathbb{R}^N$ with $[\mathbf{1}_d]_i = 1$ if the $i$th sample belongs to the $d$th chunklet set, and $[\mathbf{1}_d]_i = 0$ otherwise; and $I_d \in \mathbb{R}^{N\times N}$ is diagonal with its diagonal equal to $\mathbf{1}_d$ and the other entries 0. To prevent $M$ from becoming singular, we use a regularized matrix

$$\tilde{M} = \epsilon I + M = \epsilon I + \frac{1}{N} X H X^T, \quad \tilde{M} \in \mathbb{R}^{n\times n},$$

where $\epsilon > 0$ is a small positive value. Further, from the Woodbury formula [9], the inverse of $\tilde{M}$ can be given as follows:

$$\tilde{M}^{-1} = \left(\epsilon I + \frac{1}{N}XHX^T\right)^{-1} = \frac{1}{\epsilon}I - \frac{1}{\epsilon^2 N} XH\left(I + \frac{1}{\epsilon N}X^T X H\right)^{-1}X^T. \tag{15}$$

Then $x_i^T \tilde{M}^{-1} x_j$ can be converted into

$$x_i^T \tilde{M}^{-1} x_j = x_i^T\left[\frac{1}{\epsilon}I - \frac{1}{\epsilon^2 N} XH\left(I + \frac{1}{\epsilon N}X^T X H\right)^{-1}X^T\right]x_j. \tag{16}$$

It can be found that the inner product $x_i^T x_j$ appears in (16). Through taking (16) into (9) and using the kernel function $k(x_i, x_j)$ instead of the inner product $x_i^T x_j$, we can further obtain the dual problem of the kernelized RSVDD:

$$\max \; L = \sum_i \alpha_i \hat{k}(x_i, x_i) - \sum_{i,j}\alpha_i\alpha_j\,\hat{k}(x_i, x_j) \tag{17}$$
$$\text{s.t.}\quad 0 \le \alpha_i \le C, \tag{18}$$

where $\hat{k}(x_i, x_j) = \frac{1}{\epsilon}k(x_i, x_j) - k_{x_i}^T\left[\frac{1}{\epsilon^2 N}H\left(I + \frac{1}{\epsilon N} K H\right)^{-1}\right]k_{x_j}$, $K = [k(x_i, x_j)] \in \mathbb{R}^{N\times N}$ is the kernel matrix defined on $X$, $k_{x_i} = [k(x_{1,1}, x_i), \ldots, k(x_{D,n_D}, x_i)]^T \in \mathbb{R}^N$, and $k_{x_j} = [k(x_{1,1}, x_j), \ldots, k(x_{D,n_D}, x_j)]^T \in \mathbb{R}^N$. The optimization problem (17) of the kernelized RSVDD can be solved through QP. Through taking (16) into (12) and (13), a test sample $z \in \mathbb{R}^n$ can be classified in the same way as in the linear RSVDD case.
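To illustrate how (14)–(16) yield the modified kernel used in (17), the sketch below builds H from a chunklet labeling of the N training points and computes the full matrix of modified kernel values from a base kernel matrix; it follows our reading of the equations above rather than any released code, and the regularization constant is an input.

```python
import numpy as np

def chunklet_H(labels):
    """H = sum_d (I_d - (1/n_d) 1_d 1_d^T), built from chunklet labels of the N training points."""
    N = len(labels)
    H = np.zeros((N, N))
    for d in np.unique(labels):
        idx = np.where(labels == d)[0]
        Id = np.zeros((N, N))
        Id[idx, idx] = 1.0
        one = np.zeros(N)
        one[idx] = 1.0
        H += Id - np.outer(one, one) / len(idx)
    return H

def modified_kernel(K, labels, eps):
    """k_hat = (1/eps) K - (1/(eps^2 N)) K H (I + (1/(eps N)) K H)^{-1} K, per (16)-(17)."""
    N = K.shape[0]
    H = chunklet_H(labels)
    inner = np.linalg.inv(np.eye(N) + (K @ H) / (eps * N))
    return K / eps - (K @ H @ inner @ K) / (eps ** 2 * N)
```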
3 Experiments
In our experiments, we compare the proposed RSVDD algorithm with the other classical one-class classifiers, SVDD and k-Nearest Neighbor Data Description (k-NNDD) [2]. Both RSVDD and SVDD adopt the linear kernel $k(x_i, x_j) = x_i^T x_j$, the polynomial kernel (Poly) $k(x_i, x_j) = (x_i^T x_j + 1)^p$ where $p$ is set to 3, and the radial basis kernel (RBF) $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2/\sigma^2)$ where $\sigma = \nu\bar{\sigma}$, $\nu = 0.1$, 1, or 10, and $\bar{\sigma}$ is set to the average of all the $l_2$-norm distances between the training samples. The k-NNDD is based on the k-nearest-neighbor method, where k is set to 1 and 3 here. All computations were run on a Pentium IV 2.10-GHz processor running Windows XP Professional and the MATLAB environment.
3.1 Synthetic Data Set
In order to clearly validate the effectiveness of the proposed RSVDD algorithm, we first implement two groups of experiments on synthetic data. For the one-class classification problems here, we adopt the vectors $e, f \in \mathbb{R}^2$ to measure the performance of a one-class classifier, where $e(1)$ gives the False Negative (FN) rate (the error on the target class), $e(2)$ gives the False Positive (FP) rate (the error on the outlier class), $f(1)$ gives the ratio between the number of correct target predictions and the number of target predictions, and $f(2)$ gives the ratio between the number of correct target predictions and the number of target samples. The first group of experiments was run on the Iris data [10]. For visualization, we only use the third and fourth features of Iris in the experiments, where the 50 samples of the 2nd class are used as the target class and the 50 samples of the 3rd class as the outlier class. In the proposed RSVDD, the size D of the chunklets is set to the number of classes, i.e., D = 2. Figure 1 gives the classification boundaries of SVDD and RSVDD with linear and RBF kernels on Iris with the 3rd and 4th features. From Figure 1, it can be found that 1) both the proposed linear and kernelized RSVDD have lower target classification error than SVDD; 2) the decision boundary of RSVDD gives a clearer separation between the target and outlier classes than that of SVDD at the same coordinate scale; 3) the kernelized RSVDD and SVDD are superior to the linear RSVDD and SVDD, respectively. The second group of experiments was run on a two-dimensional two-class data set, where each class, with a banana-shaped distribution, has 50 samples. The data are uniformly distributed along the bananas and are superimposed with a normal distribution. Figure 2 gives the classification boundaries of the kernelized SVDD and the kernelized RSVDD with chunklet sizes D = 2, 4, 10, 20, 50, respectively. From Figure 2, we can find that 1) RSVDD has a significantly superior performance to SVDD in terms of FN; 2) the parameter D plays an important role in RSVDD; 3) the higher the value of D, the lower the values of both FN and FP here.
3.2 UCI Data Set
In this subsection, we report the experimental results of the proposed RSVDD, SVDD and k-NNDD on the real data sets TAE (3 classes/151 samples/5 features), WATER (2 classes/116 samples/38 features) and WINE (3 classes/178 samples/13 features) from the UCI machine learning repository [10]. The size D of the chunklets in each classification problem is set to the number of classes. Here, we adopt the average value of the Area Under the Receiver Operating Characteristic Curve (AUC) as the measure of the performance of the one-class classifiers [11]. It is known that a good one-class classifier should have a small FP and a high True Positive (TP). A classifier with a higher AUC might be preferred over another classifier with a lower AUC: it means that for a specific FP threshold, the TP is higher for the first classifier than for the second. Thus the larger
Fig. 1. The left and right sub-figures in the first row give the decision boundaries of SVDD with linear and RBF kernels, respectively. The left and right sub-figures in the second row give the decision boundaries of RSVDD with linear and RBF kernels, respectively. The values of 'e' and 'f' are given in each sub-figure. (Axes: Feature 1 vs. Feature 2.)
the value of the AUC, the better the corresponding one-class classifier. In our experiments, the value of the AUC belongs to the range [0, 1]. Table 1 gives the average AUC values and their corresponding standard deviations of the proposed RSVDD, SVDD and k-NNDD of ten independent runs for the data sets. The value of k is set to 1 and 3 for k-NNDD. Both RSVDD and SVDD adopt linear, polynomial and radial basis kernels. The label of a target data class is indicated in the first column. In each classification, we take one class as the target class and the other classes as the outlier data. From this table, it can be found that the proposed RSVDD has a significantly superior performance to the other one-class classifiers k-NNDD and SVDD in all the tested cases.
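As a side note on this evaluation protocol, the AUC of a one-class classifier can be computed from its real-valued decision scores on a mixed target/outlier test set; the sketch below uses scikit-learn's roc_auc_score as a stand-in and is not part of the original experimental code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def one_class_auc(scores_target, scores_outlier):
    """AUC from decision scores, where larger scores mean 'more target-like'
    (e.g. R^2 minus the relevant distance of Eq. 12)."""
    y_true = np.concatenate([np.ones_like(scores_target), np.zeros_like(scores_outlier)])
    y_score = np.concatenate([scores_target, scores_outlier])
    return roc_auc_score(y_true, y_score)
```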
Fig. 2. This figure gives the classification boundaries of the kernelized SVDD and the kernelized RSVDD with D = 2, 4, 10, 20, 50, respectively. The left and right sub-figures of the first row correspond to SVDD and RSVDD with D = 2, respectively. The left and right sub-figures of the second row correspond to RSVDD with D = 4, 10, respectively. The left and right sub-figures of the third row correspond to RSVDD with D = 20, 50, respectively. (Axes: Feature 1 vs. Feature 2.)
Table 1. The average AUC values and their corresponding standard deviations of ten independent runs for TAE, WATER and WINE. The larger the value of the AUC, the better the performance of the corresponding one-class classifier.

              k-NNDD                   SVDD                                   RSVDD
Class No.     k=1         k=3          Linear      Poly        RBF            Linear      Poly        RBF
TAE
1             0.79±0.22   0.50±0.13    0.61±0.17   0.60±0.17   0.69±0.20      1.00±0      0.98±0.06   1.00±0
2             0.78±0.28   0.55±0.18    0.45±0.19   0.47±0.17   0.54±0.14      0.63±0.13   0.61±0.15   0.62±0.12
3             0.78±0.28   0.66±0.13    0.47±0.17   0.43±0.17   0.55±0.15      0.99±0.01   0.97±0.06   0.99±0.01
Total         0.7833      0.5700       0.5100      0.5000      0.5933         0.8733      0.8533      0.8700
WATER
1             0.85±0.10   0.78±0.19    0.52±0.29   0.63±0.34   0.88±0.11      0.97±0.04   1.00±0      0.97±0.04
2             0.89±0.08   0.92±0.09    0.81±0.16   0.65±0.27   0.89±0.07      0.90±0.10   0.91±0.09   0.90±0.10
Total         0.8700      0.8500       0.6650      0.6400      0.8850         0.9350      0.9550      0.9350
WINE
1             0.90±0.09   0.94±0.06    0.97±0.04   0.97±0.07   0.86±0.14      0.99±0.02   0.96±0.08
2             0.85±0.12   0.85±0.14    0.62±0.34   0.52±0.35   0.83±0.11      0.89±0.07   0.87±0.08
3             0.86±0.10   0.81±0.12    0.84±0.11   0.82±0.11   0.84±0.11      0.99±0.03   0.98±0.06
Total         0.8700      0.8667       0.8100      0.7700      0.8433         0.9567      0.9367
4 Conclusion and Future Work
In this paper, we propose an improved SVDD named RSVDD. RSVDD adopts relevant metric learning instead of the original Euclidean distance metric. In doing so, the proposed RSVDD assigns large weights to the relevant features and tightens the similar data through incorporating the positive equivalence information within the same chunklet. The experimental results validate that the proposed RSVDD significantly improves the effectiveness of the one-class classifier. In the future, we plan to integrate both the positive and negative equivalences into the one-class classifier model and extend our work to large-scale classification cases.
Acknowledgment The authors thank Natural Science Foundations of China under Grant No. 60675027 and 60903091, the Specialized Research Fund for the Doctoral Program of Higher Education under Grant No.20090074120003 for support. This work is also supported by the Open Projects Program of National Laboratory of Pattern Recognition.
References 1. Scholkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., Williamson, R.: Estimating the support of a high dimensional distribution. Neural Computation 13(7), 1443–1471 (2001)
2. Tax, D., Duin, R.: Support vector domain description. Pattern Recognition Letters 20(14), 1191–1199 (1999) 3. Tax, D., Duin, R.: Support vector data description. Machine Learning 54, 45–66 (2004) 4. Tax, D., Juszczak, P.: Kernel whitening for one-class classification. International Journal of Pattern Recognition and Artificial Intelligence 17(3), 333–347 (2003) 5. Shental, N., Hertz, T., Weinshall, D., Pavel, M.: Adjustment learning and relevant component analysis. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2353, pp. 776–790. Springer, Heidelberg (2002) 6. Alizadeh, F., Goldfarb, D.: Second-order cone programming. Mathematical Programming 95, 3–51 (2003) 7. Scholkopf, B., Smola, A.: Learning with Kernels. MIT Press, Cambridge (2002) 8. Tsang, I., Cheung, P., Kwok, J.: Kernel relevant component analysis for distance metric learning. In: Proceeding of the International Joint Conference on Neural Networks (2005) 9. Golub, G., Van Loan, C.: Matrix Computations, 3rd edn. Johns Hopkins, Baltimore (1996) 10. Asuncion, A., Newman, D.: Uci machine learning repository. University of California, School of Information and Computer Science, Irvine (2007), http://www.ics.uci.edu/~ mlearn/mlrepository.html 11. Bradley, A.: The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recognition 30(7), 1145–1159 (1997)
A Support Vector Machine (SVM) Classification Approach to Heart Murmur Detection Samuel Rud and Jiann-Shiou Yang Department of Electrical and Computer Engineering University of Minnesota, Duluth, MN 55811, USA
Abstract. This paper focuses on the study of detecting low frequency vibrations from the human chest and correlating them to cardiac conditions using new devices and techniques, custom software, and the Support Vector Machine (SVM) classification technique. Several new devices and techniques for detecting a human heart murmur have been developed through the extraction of vibrations primarily in the range of 10 – 150 Hertz (Hz) on the human chest. The devices and techniques have been tested on different types of simulators and through clinical trials. Signals were collected using a Kardiac Infrasound Device (KID) and accelerometers integrated with a custom MATLAB software interface and a data acquisition system. Using the interface, the data were analyzed and classified by an SVM approach. Results show that the SVM was able to classify signals under different testing environments. For clinical trials, the SVM distinguished between normal and abnormal cardiac conditions and between pathological and non-pathological cardiac conditions. Finally, using the various devices, a correlation between heart murmurs and normal hearts was observed from human chest vibrations. Keywords: Heart murmur detection, support vector machine.
1 Introduction
Heart murmurs are sounds caused by turbulent blood flow through a heart's valve. Turbulence is present when the flow across the valve is excessive for the area of the open valve. It can be due to normal flow across a diseased valve, abnormally high flow across a normal valve, or a combination. Murmurs are classified as systolic, diastolic, or innocent to describe the location of the turbulent blood flow in the heart [1]. Noninvasive heart murmur detection is the process of diagnosing a patient's heart condition without intrusion into the body. The process has evolved in many different directions, stemming from listening to the human chest with a stethoscope to using computers to detect heart conditions. The simplest and most formal way of diagnosing heart murmurs is via a primary care physician's judgment on what he or she heard. Another technique is to use an electronic stethoscope so that a physician can view the phonocardiogram of the heart sound signal. Nowadays, computers are being trained to distinguish different heart conditions. One study from the University of Colorado Health Sciences Center trained a computer to distinguish innocent murmurs
from unhealthy murmurs [2]. Another study, by Zargus Medical Corporation, successfully classified systolic murmurs as unhealthy murmurs [3]. Heart murmur detection systems using various types of neural networks and fuzzy logic approaches can also be found in the literature. The interest in this area is sparked by the low diagnosis rates of primary care physicians. In one study, it is reported that primary care physicians accurately associate a murmur diagnosis with its pathology as little as 20% of the time [4]. Providing a physician with effective and economical tools to aid diagnosis could decrease misdiagnosis rates. This paper focuses on the study of detecting low frequency vibrations from the human chest and correlating them to cardiac conditions using new devices and techniques, custom software, and the Support Vector Machine (SVM) classification technique. The new devices and techniques target vibrations with frequencies ranging from 10 to 150 Hz. Another aspect of the study is to use an SVM algorithm to classify the data received from the instruments. SVM classification is used to devise a computationally efficient way of learning "good" separating hyperplanes in a high dimensional feature space. Since the invention of SVMs by Vapnik [5, 6], there have been intensive studies on the SVM for classification and regression (e.g., [7]). Recently, applications of SVM to various research fields have also been reported in the literature. This paper describes the methodology and results for detecting and classifying heart murmurs via non-invasive infrasound means.
2 Methods The primary devices used for this study were Kardiac Infrasound Devices (KID) and accelerometers. The KID contains a Honeywell DCN001NDC4 ultra low pressure transducer. It can detect a change in pressure of about 0.0361 pounds per square inch. The cup, referred to as BIG KID for big Kardiac Infrasound Device, is like the bell of a stethoscope, except that it is closed to outside air [8]. A small rubber tube connects the pressure transducer to the top of the diaphragm. The closed bell chamber, with the pressure monitored both inside and outside the bell, is placed on a human chest that has a readily palpable thrill or vibration. Two diaphragms, referred to as MED KID and SMALL KID for medium and small Kardiac Infrasound Device, respectively, have an inner chamber covered by a latex membrane as shown in Fig. 1. The latex membrane isolates the chamber from the ambient air in such a way that when a vibration is present on the membrane the pressure inside the chamber will change. System integration of the diaphragm consists of connecting the open end of the rubber tube to the Honeywell ultra low pressure sensor. These KIDs detect changes in pressure created by vibrational movements of the human chest as a result of the heart beating. A patient with a heart murmur emits low frequency vibrations, which correlate to small changes in pressure at the surface of the chest. The second type of transducer is an Analog Devices ADXL203EB accelerometer (ADXL). The transducer is either mounted directly to the testing surface or is placed on a flexible piece of plastic creating wings such that it can be adhered to a human subject [8]. These devices interface with an analog signal conditioning circuit containing a high pass filter to eliminate any DC offset followed by an analog amplifier. The signal
Fig. 1. Diaphragm located on SimMan’s chest
generated by these circuits is then processed by a Measurement Computing PCI-DAS6036 data acquisition card which interfaces with a custom MATLAB software application to record the signals. The software application, dubbed “MurmurPro” [8], allows one to view recorded signals, analyze and process the recorded signals, and classify the signals using an SVM. 2.1 Testing Environments and Procedures Two different testing environments were used to test the devices. The first environment was St. Luke’s One SimMan, an artificial human patient simulator that can simulate various heart sounds and conditions (see Fig. 2). Within SimMan lie several speakers which are used to simulate the heart sound of a real human. These heart sounds range from normal heart sounds to diastolic heart murmurs. While in the lying down position, the KID and the accelerometer were both secured to SimMan’s lower left sternum border of the chest. The recordings of the devices were taken with a length of 5 seconds, with a sampling rate of 10^4 samples per second (S/s), and with various heart rates ranging from 60 beats per minute (bpm) to 80 bpm. The SimMan Universal Patient Simulator command window controls SimMan’s heart [8]. The second environment, with the consent of the University of Minnesota Institutional Review Board (IRB) Human Subjects Committee, was human patient testing, or clinical trials. Each of the devices was secured to the patient at the apex, the right second intercostal space, the left second intercostal space, and the lower left sternum border. The patient was in the lying down position and relaxed. The patient was asked to exhale all air in the lungs and not to breathe until a recording was finished. Before the recordings took place, a physician diagnosed the patient with the type of heart condition present. Recordings of the patient were taken at 5 second intervals with a sampling rate of 10^4 S/s. 2.2 Support Vector Machine Training and Testing Support Vector Machine, used primarily in classification and regression, is an algorithm for creating nonlinear hyper-planes in a hyper space [9]. For this study, the SVM creates an optimized boundary that separates between normal and abnormal cardiac conditions. We used the LS-SVM toolbox, which implements the SVM algorithm for MATLAB and was created by researchers at Katholieke Universiteit Leuven [10].
The toolbox provides in-depth functionality ranging from tuning, optimizing, and validating to training SVMs. It also provides a good multidimensional visual representation of the trained SVM. The toolbox is utilized in both the SVM trainer and the SVM classifier in MurmurPro [8]. MurmurPro is the graphical user interface package that combines data acquisition, signal analysis, and signal detection into one package [8]. SVM training is a multi-step process that includes signal processing, two-dimensional data transformation, SVM tuning and optimization, and finally, the SVM training. Signal processing entails filtering the signal to eliminate excessive noise and cropping the signal to a fixed number of heartbeats. This is done to standardize each signal such that a two-dimensional representation can be achieved. For details about this process, please refer to [8].
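The pipeline just described (filter, crop, map each recording to a two-dimensional representation, train the classifier) was implemented in MATLAB with the LS-SVM toolbox; the following is only a rough illustrative sketch of the same idea in Python, not the authors' MurmurPro code. The sampling rate, band limits, the two hand-picked features (duration and RMS magnitude), and the synthetic recordings and labels are all assumptions made for this example.

# Illustrative Python sketch only (not the authors' MATLAB/LS-SVM MurmurPro code):
# band-pass filter a chest-vibration recording, reduce it to a crude 2-D feature
# vector, and train an SVM. Sampling rate, band limits, feature choice and the
# synthetic `recordings`/`labels` arrays are assumptions made for this example.
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.svm import SVC

FS = 10_000                                                 # assumed sampling rate (S/s)
recordings = [np.random.randn(5 * FS) for _ in range(8)]    # placeholder 5-s signals
labels = [1 if i % 2 else -1 for i in range(8)]             # +1 abnormal, -1 normal

def preprocess(sig, fs=FS, low=10.0, high=150.0):
    """Band-pass filter to the 10-150 Hz range targeted by the devices."""
    b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, sig)

def to_2d(sig, fs=FS):
    """Crude 2-D stand-in for MurmurPro's representation: duration and RMS magnitude."""
    return np.array([len(sig) / fs, np.sqrt(np.mean(sig ** 2))])

X = np.array([to_2d(preprocess(r)) for r in recordings])
clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X, np.array(labels))
print(clf.predict(X[:4]))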
Fig. 2. St. Luke’s One SimMan
3 Results 3.1 SimMan SVM Fig. 3 shows the results obtained from the BIG KID while simulating a systolic murmur. This figure contains four plots. The first plot is the original time-series (TS) plot (i.e., voltage (V) vs. time (s)) while the plot following is the filtered TS plot. The subsequent plot is the frequency response (FR) (i.e., magnitude (dB) vs. frequency (f)) of the original signal and the plot after that is the FR of the filtered phonocardiogram. The frequency response plots were created using the Fast Fourier Transform (FFT) method. In the TS plots, S1 and S2 are not easily distinguishable and separable, indicating extra vibrations have occurred after S1. Also, the FR plots indicate the presence of extra distinct frequency spikes occurring at various frequencies compared to that of a normal heart. (Note: S1 and S2 represent the timings of the sounds of a normal heart cycle, where S1 is the first sound and S2 is the second sound. For this study, it is assumed that the major and minor peaks in the signal occur in synchrony with S1 and S2, respectively.) Similar TS and FR plots were also found using the MED KID, SMALL KID, and accelerometer (ADXL) devices.
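The frequency-response views mentioned above are obtained with an FFT of the recorded phonocardiogram; a minimal sketch follows, where the recording and its sampling rate are placeholders.

# Minimal sketch of the frequency-response view described above; `sig` and FS are
# placeholders for one 5-second recording and its sampling rate.
import numpy as np
from numpy.fft import rfft, rfftfreq

FS = 10_000
sig = np.random.randn(5 * FS)                 # placeholder recording

spectrum_db = 20 * np.log10(np.abs(rfft(sig)) + 1e-12)   # magnitude (dB)
freqs = rfftfreq(len(sig), d=1.0 / FS)                    # frequency axis (Hz)

band = (freqs >= 10) & (freqs <= 150)         # band targeted by the KIDs and ADXL
peak = freqs[band][np.argmax(spectrum_db[band])]
print(f"largest spectral peak in band: {peak:.1f} Hz")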
Fig. 3. BIG KID phonocardiogram and FFT during systolic murmur at 70 bpm
The results in this subsection show SVM plots of signals taken from the BIG KID. The signals from the device include the following heart conditions: (a) aortic stenosis; (b) Austin Flint murmur; (c) diastolic murmur; (d) friction rub; (e) mitral valve prolapse; (f) systolic murmur; and (g) normal heart. Each SVM was first trained with a training set and then tested with a testing set. The testing set did not contain any of the same signals as the training set. Also, heart rates of the signals ranged from 60 bpm to 80 bpm. Fig. 4 shows the tested SVM for the BIG KID on SimMan after training [8]. Notice how normal heart signals tend to cluster linearly in the lower left hand corner while the abnormal signals also cluster linearly as the type differs. The normal signals cluster due to their similar duration time and magnitude. With 16 test cases, varying from normal to abnormal, only 1 misclassification occurred, resulting in 6% misclassification. The test case that was misclassified was abnormal. Trained and tested SVMs were also produced for the SimMan MED KID and ADXL. With 16 test cases, varying from normal to abnormal, no errors occurred, resulting in 0% misclassification for the MED KID and ADXL cases.
3.2 Clinical Trials SVM Detailed results from the BIG KID, MED KID, SMALL KID, and ADXL taken during clinical trials, together with the clinical trials SVM, can be found in [8]. Due to page limitations, we will only provide brief results in this subsection. Fig. 5 shows the results obtained while testing a human subject with a grade 5 systolic murmur. S2 is not apparent in this figure. It is believed this is caused by the severe turbulence during the systole cycle drowning out S2. The FFT shows relatively the same frequency range as the normal heart recording. However, four peaks arise at 1, 7, 17, and 40 Hz, unlike the steady sloping of peaks for a normal patient. The trained SVM with analyzed signals collected at the right second intercostal space using the ADXL is shown in Fig. 6. Again, we found that in the SVM plot the normal hearts clustered in several areas while the abnormal hearts were sporadic. The normal heart samples that were collected clustered in a linear fashion.
Fig. 4. SimMan BIG KID tested SVM
The SVM trained with analyzed data collected at the apex using the ADXL is given in Fig. 7. This SVM was created using pathological and non-pathological patient data. Patients were deemed “pathological” due to diseased valves in the heart. The outcome of this trained SVM yielded interesting results. A separation exists between the pathological patient data and the non-pathological patient data. The previous results of the SVM classifier show that there is an objective separation between normal hearts and abnormal hearts and between pathological hearts and non-pathological hearts. The results, though, are highly dependent on the diagnosis of the physician at the time of the clinical trials. Assuming the diagnosis was correct, the results demonstrate that the detection systems or devices are able to distinguish between abnormal and normal hearts and between pathological and non-pathological hearts. During clinical trials, it was seen that the signal generated by a patient with a normal heart could be distinguished from that of a person with an abnormal heart using the devices. Also, the SVM plots showed that it is possible to classify the signals. A correlation between pathological and non-pathological heart conditions was also seen. Unfortunately, the test data was limited by the number of patients available. Also, there was no electrocardiogram present at the clinical trials to establish the timings of the heart. This is due to the limits established by the IRB. Other sensor
Fig. 5. SMALL KID on patient with a grade 5 systolic murmur, located at the lower left sternum border
Fig. 6. SVM from the right second intercostal space using ADXL
Fig. 7. SVM of pathological vs. non-pathological heart signals using ADXL
systems (e.g., a TekScan FlexiForce sensor system) were implemented during this study. However, those sensor systems proved inadequate. Other test platforms were also used such as the Harman Kardon HK-595 subwoofer [8]. These platforms were used only to develop devices and techniques.
4 Conclusion This paper focuses on the detection of low frequency vibrations from the human chest and correlating them to cardiac conditions using new devices and techniques. Throughout this study, experimental devices and hardware and software interfaces were developed to detect low frequency vibrations using different testing environments (i.e., SimMan and clinical trials). The various types of devices include the KIDs, the ADXL accelerometer, and a FlexiForce sensor based device. In particular, the main focus was on the KIDs and the ADXL. The devices used, excluding the FlexiForce sensor device, can be placed on the chest around the heart and are mobile on the chest. The devices do not penetrate the chest, nor are there any risks to the patient involved. The devices were interfaced with analog hardware in order to acquire the signal with zero DC offset and an adjustable gain. The signals were transmitted to a computer through a data acquisition system and recorded. Once recorded, the custom software developed in MATLAB processed the signals and classified them using an SVM algorithm. Results indicate that an SVM was able to classify signals under different testing environments. For clinical trials, the SVM distinguished between normal and abnormal cardiac conditions and between pathological and non-pathological cardiac conditions. Also, low frequency vibrations from a human chest were detected in the targeted frequency range of 10 to 150 Hz. A more precise frequency range is from 1 to 40 Hz. Finally, using the various devices, a correlation between heart murmurs and normal heart conditions was observed from human chest vibrations. Future developments of this study include improvements on each sensor system’s design and implementation. Also, improvements to the SVM two-dimensional representation algorithm would allow the signals to be classified more effectively. In addition, a real-time classification scheme could be devised to render an immediate diagnosis. Finally, extensive medical trials should be conducted to verify the sensor systems and their respective classification accuracy rates.
References 1. Epstein, O., et al.: Clinical Examination. Gower Medical Publishing, New York (1992) 2. Dr. Computer Check for Dangerous Heart Murmurs. Prevention 54(1), 112 (2002) 3. Watrous, et al.: Computer-Assisted Detection of Systolic Murmurs Associated with Hypertrophy Cardiomyopathy. Texas Heart Institute Journal 31(4), 368 (2004) 4. Mangione, S., et al.: The Teaching and Practice of Cardiac Auscultation during Internal Medicine and Cardiology Training. Annals of Internal Medicine 119, 47–54 (1993) 5. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1995) 6. Vapnik, V.N.: An Overview of Statistical Learning Theory. IEEE Transactions on Neural Networks 10, 988–999 (1999) 7. Gunn, S.R.: Support Vector Machine for Classification and Regression. Technical Report, University of Southampton, Southampton, UK (1998) 8. Rud, S., et al.: Non-Invasive Infrasound Heart Murmur Detection. Senior Project Report, Department of Electrical and Computer Engineering. University of Minnesota, Duluth (2005) 9. Haykin, S.: Neural Networks – A Comprehensive Foundation. Prentice Hall, New York (1999) 10. Pelckmans, K., et al.: LS-SVM Toolbox User’s Guide, Version 1.4. Department of Electrical Engineering, Katholieke Universiteit Leuven (2002)
Genetic Algorithms with Improved Simulated Binary Crossover and Support Vector Regression for Grid Resources Prediction Guosheng Hu, Liang Hu, Qinghai Bai, Guangyu Zhao, and Hongwei Li College of Computer Science and Technology, Jilin University, Changchun 130012, China
[email protected]
Abstract. In order to manage grid resources more effectively, prediction information about grid resources is necessary in the grid system. This study developed a new model, ISGA-SVR, for parameter optimization in support vector regression (SVR), which is then applied to grid resources prediction. In order to build an effective SVR model, SVR’s parameters must be selected carefully. Therefore, we develop genetic algorithms with improved simulated binary crossover (ISBX) that can automatically determine the optimal parameters of SVR with higher predictive accuracy. In ISBX, we propose a new method to deal with the bounded search space. This method can improve the search ability of the original simulated binary crossover (SBX). The proposed model was tested with a grid resources benchmark data set. Experimental results demonstrated that ISGA-SVR worked better than SVR optimized by a genetic algorithm with SBX (SGA-SVR) and a back-propagation neural network (BPNN). Keywords: Grid resources prediction, Support vector regression, Genetic algorithms, Improved Simulated Binary Crossover.
1 Introduction Grid resources prediction is important for the grid scheduler in a grid environment. In grid resources prediction, many relevant research models [1-4] have been developed and have generated accurate predictions in practice. The Network Weather Service (NWS) [1] uses a combination of several models for the prediction of one resource. NWS allows some adaptation by dynamically choosing the model that has performed the best recently for the next prediction, but its adaptation is limited to the selection of a model from several candidates that are conventional statistical models. Resource Prediction System (RPS) [2] is a project in which grid resources are modeled as a linear time series process. Multiple conventional linear models are evaluated, including AR, MA, ARMA, ARIMA and ARFIMA models. Their results show that the simple AR model is the best model of this class because of its good predictive power and low overhead.
With the development of artificial neural networks (ANNs), ANNs have been successfully employed for modeling time series. Liu et al. [3] and Eswaradass et al. [4] have applied ANNs to grid resources prediction successfully. Experimental results showed the ANN approach provided an improved prediction over that of NWS. However, ANNs have some drawbacks, such as the difficulty of pre-selecting the system architecture, long training times, and a lack of knowledge representation facilities. In 1995, the support vector machine (SVM) was developed by Vapnik [5] to provide better solutions than ANNs. SVM can solve classification problems (SVC) and regression problems (SVR) successfully and effectively. However, the determination of SVR’s parameters is an open problem and no general guidelines are available to select these parameters [5]. Recently, Genetic Algorithms (GAs) have been applied extensively to optimize SVR’s parameters. In these studies, a GA with SBX and polynomial mutation (SGA) [6, 7, 8] is usually applied. However, how to deal with a bounded search space is an open problem for SBX, and optimizing SVR’s parameters is exactly such a problem. In this study, an improved SBX is proposed, in which a new method is used to deal with the bounded search space. The performances of SGA-SVR, ISGA-SVR and BPNN are then compared on a grid resources benchmark data set.
2 Support Vector Regression In order to solve regression problems, we are given training data (xi ,yi) (i=1,…,l), where x is a d-dimensional input with x∈Rd and the output is y∈R. The linear regression model can be written as follows [9]:
f(x) = \langle \omega, x \rangle + b, \quad \omega, x \in R^d, \ b \in R    (1)

where f(x) is a target function and \langle\cdot,\cdot\rangle denotes the dot product in R^d. The ε-insensitive loss function proposed by Vapnik is specified to measure the empirical risk [9]:

L_\varepsilon(y) = \begin{cases} 0 & \text{for } |f(x) - y| \le \varepsilon \\ |f(x) - y| - \varepsilon & \text{otherwise} \end{cases}    (2)

The optimal parameters \omega and b in Eq. (1) are found by solving the primal optimization problem [9]:

\min \ \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{l} (\xi_i^- + \xi_i^+)    (3)

with constraints:

y_i - \langle \omega, x_i \rangle - b \le \varepsilon + \xi_i^+, \quad \langle \omega, x_i \rangle + b - y_i \le \varepsilon + \xi_i^-, \quad \xi_i^-, \xi_i^+ \ge 0, \quad i = 1, \ldots, l    (4)
where C is a pre-specified value that determines the trade-off between the flatness of f(x) and the amount up to which deviations larger than the precision ε are tolerated. The slack variables ξ+ and ξ− represent the deviations from the constraints of the ε-tube.
This primal optimization problem can be reformulated as a dual problem defined as follows:

\max_{a, a^*} \ -\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} (a_i^* - a_i)(a_j^* - a_j) \langle x_i, x_j \rangle + \sum_{i=1}^{l} y_i (a_i^* - a_i) - \varepsilon \sum_{i=1}^{l} (a_i^* + a_i)    (5)

with constraints:

0 \le a_i, a_i^* \le C, \quad i = 1, \ldots, l, \qquad \sum_{i=1}^{l} (a_i - a_i^*) = 0.    (6)
Solving the optimization problem defined by Eq. (5) and (6) gives the optimal Lagrange multipliers a_i and a_i^*, while \omega and b are given by
\omega = \sum_{i=1}^{l} (a_i^* - a_i)\, x_i, \qquad b = -\frac{1}{2} \langle \omega, (x_r + x_s) \rangle,    (7)

where x_r and x_s are support vectors. Sometimes nonlinear functions should be optimized, so this approach has to be extended. This is done by replacing x_i by a mapping into feature space [9], \varphi(x_i), which linearizes the relation between x_i and y_i. Then, the f(x) in Eq. (1) can be written as:

f(x) = \sum_{i=1}^{N} (a_i - a_i^*)\, K(x_i, x) + b    (8)
K(x_i, x) = \langle \varphi(x_i), \varphi(x) \rangle is the so-called kernel function [9]. Any symmetric positive semi-definite function that satisfies Mercer’s conditions [9] can be used as a kernel function. Our work is based on the RBF kernel [9].
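In practice the dual problem of Eqs. (5)-(6) with the RBF kernel is solved by a standard SVR package rather than by hand. The sketch below uses scikit-learn's SVR only as an illustration (the study itself uses LIBSVM, see Section 5.1); the data arrays and parameter values are placeholders, and gamma corresponds to 1/(2σ²) for an RBF width σ.

# Illustrative epsilon-SVR with an RBF kernel; a standard package solves the dual
# of Eqs. (5)-(6) internally. Data and parameter values are placeholders.
import numpy as np
from sklearn.svm import SVR

X_train = np.random.rand(150, 4)              # placeholder: 4 autoregressive inputs
y_train = np.random.rand(150)                 # placeholder targets scaled to (0, 1)

sigma, eps, C = 2.0, 0.01, 10.0
model = SVR(kernel="rbf", C=C, gamma=1.0 / (2 * sigma ** 2), epsilon=eps)
model.fit(X_train, y_train)
print(model.predict(X_train[:3]))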
3 Improved Simulated Binary Crossover GAs[10] are mainly composed of selection operator, crossover operator, mutation operator and fitness functions. In this section, we just focus on simulated binary crossover [11,12] and the method that deals with bounded searching space. In the original SBX procedure, the children can be created almost anywhere in the whole real space. However, sometimes the variables have fixed bounds and how to deal with bounded search space is an open problem [11]. Deb gave some complicated but effective suggestions [12] on it; however, he didn’t provide the details of his suggestions. In this study, the details of the method are proposed. We assume that the region of the variable is [LB, UB], and LB and UB are the lower and upper bound respectively. According to [12], the probability distribution function [12] is multiplied by a factor depending on these limits and location of the parents, so as to make a zero probability of creating any solution outside these limits. For the child (C1) that is closer to parent P1, this factor can be computed as 1/(1-v1), where v1 is the cumulative probability of creating solutions from C1= − ∞ to C1=LB. Similarly, a factor v2 for the child (C2) solution closer to P2 can also be calculated. In the original SBX procedure, the relationship between the parents (P1 and P2) and between the children (C1 and C2) could be written as follows [12]:
C_1 = 0.5(P_1 + P_2) + 0.5(P_1 - P_2)\beta \quad \text{and} \quad C_2 = 0.5(P_1 + P_2) + 0.5(P_2 - P_1)\beta    (9)
P1 and P2 can be regarded as constants; hence, C1’s and C2’s probability distribution functions (Eq. (10) and Eq. (11)) can be obtained according to β’s probability distribution function [12]:

f(C_1) = \begin{cases} 0, & C_1 < LB \\ -0.5(n+1)\left(\frac{C_1 - b}{a}\right)^{-(n+2)} \cdot \frac{1}{a}, & LB \le C_1 < P_1 \\ -0.5(n+1)\left(\frac{C_1 - b}{a}\right)^{n} \cdot \frac{1}{a}, & P_1 \le C_1 < b \end{cases}    (10)

f(C_2) = \begin{cases} -0.5(n+1)\left(\frac{C_2 - b}{a}\right)^{n} \cdot \frac{1}{a}, & b \le C_2 < P_2 \\ -0.5(n+1)\left(\frac{C_2 - b}{a}\right)^{-(n+2)} \cdot \frac{1}{a}, & P_2 \le C_2 < UB \\ 0, & C_2 \ge UB \end{cases}    (11)

where a = 0.5(P_1 - P_2) and b = 0.5(P_1 + P_2). We assume that P2 is larger than P1 and C2 is larger than C1. It can be observed that Eq. (10) and Eq. (11) make a zero probability of creating children that are outside the region (LB, UB). In the original SBX procedure, the solutions sometimes fall outside the region (LB, UB), which will never happen in ISBX. After obtaining C1’s and C2’s probability distribution functions, we can calculate the values of v1 and v2. The results of v1 and v2 are shown as follows:

v_1 = 0.5\left(\frac{LB - 0.5(P_1 + P_2)}{0.5(P_1 - P_2)}\right)^{-(n+1)}    (12)

v_2 = 0.5\left(\frac{UB - 0.5(P_1 + P_2)}{0.5(P_2 - P_1)}\right)^{-(n+1)}    (13)

Then, β1 and β2 are obtained according to the same theory as for obtaining β [12]:

\beta_1 = \begin{cases} \left[2u(1 - v_1)\right]^{\frac{1}{n+1}}, & 0 \le u \le 0.5\left(\frac{1}{1 - v_1}\right) \\ \left[\frac{1}{2 - 2(1 - v_1)u}\right]^{\frac{1}{n+1}}, & 0.5\left(\frac{1}{1 - v_1}\right) < u \le 1 \end{cases}    (14)

\beta_2 = \begin{cases} \left[2u(1 - v_2)\right]^{\frac{1}{n+1}}, & 0 \le u \le 0.5\left(\frac{1}{1 - v_2}\right) \\ \left[\frac{1}{2 - 2(1 - v_2)u}\right]^{\frac{1}{n+1}}, & 0.5\left(\frac{1}{1 - v_2}\right) < u \le 1 \end{cases}    (15)
After obtaining β 1 and β 2 from the above probability distribution, the children solutions are calculated as follows:
C_1 = 0.5\left[(1 + \beta_1) P_1 + (1 - \beta_1) P_2\right], \qquad C_2 = 0.5\left[(1 - \beta_2) P_1 + (1 + \beta_2) P_2\right].    (16)
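A direct reading of Eqs. (12)-(16) gives the following sketch of one ISBX crossover step for a single bounded variable; the distribution index n and the uniform random number u are as in the equations, and the guard for nearly identical parents is an added safety check, not part of the method description.

# Sketch of one ISBX crossover step for a single bounded variable, following
# Eqs. (12)-(16). The near-identical-parents guard is an added safety check.
import random

def isbx(p1, p2, lb, ub, n=2):
    if p2 < p1:
        p1, p2 = p2, p1
    if p2 - p1 < 1e-12:
        return p1, p2
    mid, half = 0.5 * (p1 + p2), 0.5 * (p2 - p1)
    # Eqs. (12)-(13): probability mass that plain SBX would put outside [lb, ub]
    v1 = 0.5 * abs((lb - mid) / half) ** -(n + 1)
    v2 = 0.5 * abs((ub - mid) / half) ** -(n + 1)

    def beta(u, v):
        # Eqs. (14)-(15): spread factor with the out-of-bound mass folded back in
        if u <= 0.5 / (1.0 - v):
            return (2.0 * u * (1.0 - v)) ** (1.0 / (n + 1))
        return (1.0 / (2.0 - 2.0 * (1.0 - v) * u)) ** (1.0 / (n + 1))

    u = random.random()
    b1, b2 = beta(u, v1), beta(u, v2)
    c1 = 0.5 * ((1 + b1) * p1 + (1 - b1) * p2)   # Eq. (16)
    c2 = 0.5 * ((1 - b2) * p1 + (1 + b2) * p2)
    return c1, c2

print(isbx(0.3, 0.7, 0.0, 1.0))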
4 GA-SVR Model SVR optimized by different GAs (for example, SGA and ISGA) uses the same model, named GA-SVR. The proposed GA-SVR model dynamically optimizes the SVR’s parameters through the GA’s evolutionary process, and then uses the acquired parameters to construct an optimized SVR model for prediction. Details of the GA-SVR model are described as follows (a compact sketch in code is given after the list):
1) The three SVR parameters are directly coded to generate the chromosome. The chromosome X is represented as X = {p1, p2, p3}, where p1, p2 and p3 denote the parameters C, σ and ε respectively.
2) Initialization: The chromosomes are created by randomly obtaining diverse solutions.
3) Selection: A standard roulette wheel was employed in our model.
4) Mutation: Polynomial mutation and SBX are often used together to optimize SVM [6,7,8]. Hence, polynomial mutation is used in this study.
5) Fitness definition: In order to overcome the over-fitting phenomenon, the cross-validation technique [13] is used in the GA-SVR model. In this study, the fitness function is defined as the Mean Square Error (MSE) of actual values and predicted values using the five-fold cross-validation technique.
6) Stopping criteria: The maximal number of iterations works as the stopping criterion. It is selected as a trade-off between the convergence time and accuracy. In this study, the maximal number of iterations is equal to 100.
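The outline below strings the six steps together, reusing the isbx() helper from the previous sketch and scikit-learn's SVR in place of LIBSVM. Roulette-wheel selection and polynomial mutation are simplified away (truncation selection, no mutation), so this is only an illustration of the loop, not the authors' implementation; the parameter bounds follow the search space given later in Section 5.2.

# Outline of the GA-SVR loop (steps 1-6 above); uses isbx() from the previous
# sketch and scikit-learn SVR in place of LIBSVM. Selection and mutation are
# simplified, so this is an illustration only.
import random
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

BOUNDS = [(1e-3, 256.0), (1e-3, 256.0), (1e-3, 1.0)]    # C, sigma, epsilon

def fitness(chrom, X, y):
    C, sigma, eps = chrom
    svr = SVR(kernel="rbf", C=C, gamma=1.0 / (2 * sigma ** 2), epsilon=eps)
    return -cross_val_score(svr, X, y, cv=5,
                            scoring="neg_mean_squared_error").mean()  # 5-fold CV MSE

def evolve(X, y, pop_size=20, generations=100, pc=0.8):
    pop = [[random.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, X, y))
        parents = pop[: pop_size // 2]                   # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = [isbx(pa, pb, lo, hi)[0] if random.random() < pc else pa
                     for (lo, hi), pa, pb in zip(BOUNDS, a, b)]
            children.append(child)
        pop = parents + children
    return min(pop, key=lambda c: fitness(c, X, y))      # best (C, sigma, epsilon)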
5 Performance Evaluation 5.1 Data Preprocessing Strategy
In our experiment, we chose host load, one typical kind of grid resource, as the prediction object. For host load prediction, we chose “mystere10000.dat” as the benchmark data set [14]. We took the last 204 items of the data set for our experiment. Before the SVR was trained, all the data in the database were linearly scaled to fit within the interval (0, 1). When artificial intelligence technology is applied to the prediction of time series, the number of input nodes (order of autoregressive terms) critically affects the prediction performance. According to Kuan [6], this study experimented with the number 4 for the order of autoregressive terms. Thus, 204 observation values became 200 input patterns. The first 150 input patterns were employed as the training set to build the model; the other 50 input patterns were employed as the test set to estimate the generalization ability of the prediction models. The simulation of the SVR model was carried out using LIBSVM, a toolbox for support vector machines, which was originally designed by Chang and Lin [15].
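A sketch of this preprocessing (scaling to the unit interval and building lag-4 autoregressive patterns with a 150/50 split) is given below; the synthetic series stands in for the last 204 values of the benchmark trace.

# Sketch of the preprocessing described above; `series` is a synthetic placeholder
# for the last 204 host-load values of mystere10000.dat.
import numpy as np

series = np.random.rand(204)                             # placeholder host-load trace
scaled = (series - series.min()) / (series.max() - series.min() + 1e-12)

ORDER = 4                                                # autoregressive order
X = np.array([scaled[i:i + ORDER] for i in range(len(scaled) - ORDER)])
y = scaled[ORDER:]                                       # 200 input patterns in total

X_train, y_train = X[:150], y[:150]                      # training set
X_test, y_test = X[150:], y[150:]                        # test set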
At the same time, some statistical metrics, such as NMSE and R, were used to evaluate the prediction performance of models [16]. 5.2 Parameters Settings
1) According to Chtioui et al. [17], the converged solution is usually affected by parameter settings. The choices of GA’s parameters are based on numerous experiments, as those values provide the smallest MSEcv on the training data set. Table 1 shows the details of the parameter settings of ISGA and SGA.

Table 1. GA parameter settings
Population size: 20
Crossover probability: 0.8
Mutation probability: 0.1
Generations number: 100
According to Wang [9] and for the convenience of computing, we set the parameter search space as C ∈ (0, 256), σ ∈ (0, 256) and ε ∈ (0, 1). 2) The parameters of the BPNN in our experiment were set as follows. Hornik et al. [18] suggested that a one-hidden-layer network is sufficient to model any complex system with any desired accuracy. Hence, a standard three-layer network, including one hidden layer, was used in our experiment. The number of nodes was set to 10 for the input layer, 4 for the hidden layer and 1 for the output layer. Rumelhart et al. [19] suggested using a small learning rate to set the network parameters. Therefore, the learning rate was set to 0.1. The hidden nodes used the tanh [16] transfer function, and the output node used the linear transfer function. Considering both the accuracy and the time consumption of the BPNN model, the convergence criterion used for the training set was a maximum of 500 iterations. 5.3 Experimental Results
Firstly, the results of the GAs’ parameter selection are shown. In order to compare the parameter selection ability of SGA and ISGA fairly, 100 experiments were done for SGA and ISGA respectively. Table 2 compares the average results of parameter selection of ISGA and SGA.

Table 2. Comparison of average parameter selection ability
                              ISGA      SGA
MSE                           0.0105    0.0108
standard deviation (10^-7)    1.3924    2.5322
From Table 2, it is clear that the MSEcv of ISGA is smaller than that of SGA. Hence, compared with SGA, ISGA can find a better solution in limited time (100 iterations in our experiment). It can also be observed that the standard deviation of ISGA is smaller than that of SGA. This means that, on average, the implementation of ISGA is
Table 3. Comparison of prediction results
Model       NMSE      R
BPNN        0.3022    0.9671
SGA-SVR     0.2339    0.9686
ISGA-SVR    0.2007    0.9690
much more stable. Hence, ISGA works better than SGA. There is no difference between ISGA and SGA except for the crossover operator. Hence, the improved SBX outperforms original SBX.
After the parameters of SVR were determined, the SVR prediction models were built. From the above design, 100 groups of SVR parameters were determined for ISGA-SVR and SGA-SVR respectively. Hence, 100 ISGA-SVR and 100 SGA-SVR prediction models were built. In order to estimate the generalization ability and prediction performance of these 200 prediction models, the built models were tested on the test set. Table 3 compares the average prediction performance of the 100 ISGA-SVR models and the 100 SGA-SVR models.
From Table 3, the value of NMSE achieved by the ISGA-SVR model is the smallest. Hence, we can rate the prediction results made by the ISGA-SVR model to be of the highest precision and those of BPNN the lowest. Moreover, the correlation coefficients (R) from the ISGA-SVR model are the highest, and those of BPNN the lowest. The high value of R of ISGA-SVR indicates a very high correlation between the predicted values and the actual values. Hence, GA-SVR (ISGA-SVR and SGA-SVR) works better than BPNN, and ISGA-SVR outperforms SGA-SVR.
6 Conclusions Accurate grid resources prediction is crucial for a grid scheduler. A novel ISGA-SVR has been applied to predict grid resources. The results of this study showed that the parameter selection ability of ISGA is better than that of SGA. There is no difference between ISGA and SGA except the crossover operator; hence, ISBX works better than the original SBX. ISBX makes the probability of creating children solutions outside the bound zero; hence, ISBX can obtain better performance when it deals with a bounded search space. On the other hand, ISGA-SVR and SGA-SVR worked better than the BPNN model. The superior performance of the GA-SVR model is mainly due to the following causes. Firstly, the SVR model has nonlinear mapping capabilities and can easily capture data patterns of grid resources, host load in this study. Secondly, improper determination of SVR’s parameters will cause either over-fitting or under-fitting of an SVR model. In this study, the ISGA can determine suitable parameters of SVR and improve the prediction performance of the proposed model. Acknowledgments. This project is supported by the National 973 Plan of China (No. 2009CB320706), by the National Natural Science Foundation of China (No. 60873235 & 60473099), and by the Program of New Century Excellent Talents in University of China (No. NCET-06-0300).
References 1. Wolski, R., Spring, N.T., Hayes, J.: The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing. The Journal of Future Generation Computing Systems (1999) 2. Dinda, P.A.: Design, Implementation, and Performance of an Extensible Toolkit for Resource Prediction in Distributed Systems. IEEE Trans. Parallel Distrib. Syst., 160–173 (2006) 3. Liu, Z.X., Guan, X.P., Wu, H.H.: Bandwidth Prediction and Congestion Control for ABR Traffic based on Neural Networks. In: Wang, J., Yi, Z., Żurada, J.M., Lu, B.-L., Yin, H. (eds.) ISNN 2006, Part II. LNCS, vol. 3973, pp. 202–207. Springer, Heidelberg (2006) 4. Eswaradass, A., Sun, X.H., Wu, M.: A Neural Network based Predictive Mechanism for Available Bandwidth. In: 19th International Parallel and Distributed Processing Symposium (2005) 5. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998) 6. Chen, K.Y., Wang, C.H.: Support Vector Regression with Genetic Algorithms in Forecasting Tourism Demand. Tourism Management, 215–226 (2007) 7. Wu, F., Zhou, H., Ren, T., Zheng, L., Cen, K.: Combining Support Vector Regression and Cellular Genetic Algorithm for Multi-objective Optimization of Coal-fired Utility Boilers. Fuel (2009) 8. Chen, K.Y.: Forecasting Systems Reliability based on Support Vector Regression with Genetic Algorithms. Reliability Engineering and System Safety, 423–432 (2007) 9. Wang, L.P.: Support Vector Machines: Theory and Application. Springer, Berlin (2005) 10. Holland, J.H.: Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor (1975) 11. Deb, K., Agrawal, R.B.: Simulated Binary Crossover for Continuous Search Space. Complex Systems, 115–148 (1995) 12. Deb, K., Goyal, M.: A Combined Genetic Adaptive Search (geneAS) for Engineering Design. Computer Science and Informatics, 30–45 (1996) 13. Duan, K., Keerthi, S., Poo, A.: Evaluation of Simple Performance Measures for Tuning SVM Hyper Parameters. Technical Report, National University of Singapore, Singapore (2001) 14. Host Load Data Set, http://cs.uchicago.edu/lyang/Load/ 15. Chang, C.C., Lin, C.J.: LIBSVM: A Library for Support Vector Machines., http://www.csie.ntu.edu.tw/~cjlin/libsvm/ 16. Hu, L., Hu, G., Tang, K., Che, X.: Grid Resource Prediction based on Support Vector Regression and Genetic Algorithms. In: The 5th International Conference on Natural Computation (2009) 17. Chtioui, Y., Bertrand, D., Barba, D.: Feature Selection by A Genetic Algorithm Application to Seed Discrimination by Artificial Vision. Journal of Science: Food and Agriculture, 77–86 (1998) 18. Hornik, K., Stinchcombe, M., White, H.: Multilayer Feedforward Networks Are Universal Approximations. Neural Networks, 336–359 (1989) 19. Rumelhart, E., Hinton, G.E., Williams, R.J.: Learning Internal Representations by Error Propagation in Parallel Distributed Processing. MIT Press, Cambridge (1986)
Temporal Gene Expression Profiles Reconstruction by Support Vector Regression and Framelet Kernel
Wei-Feng Zhang (1), Chao-Chun Liu (2), and Hong Yan (2)
(1) Department of Applied Mathematics, South China Agricultural University, 483 Wushan Road, Guangzhou 510642, China; [email protected]
(2) Department of Electronic Engineering, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong
Abstract. Gene time series microarray experiments have been widely used to unravel the genetic machinery of biological processes. However, most temporal gene expression data contain noise, missing data points, and non-uniformly sampled time points, which makes traditional analysis methods inapplicable. One main approach to solving this problem is to reconstruct each gene expression profile as a continuous function of time. The continuous representation then enables us to overcome problems related to sampling rate differences and missing values. In this paper, we introduce a novel reconstruction approach based on the support vector regression method. The proposed approach utilizes a framelet based kernel, which has the ability to approximate functions with multiscale structure and can reduce the influence of noise in data. To compensate for the inadequate information from noisy and short gene expression data, we use each gene's correlated genes as the test set to choose the optimal parameters. We show that this treatment can help to avoid overfitting. Experimental results demonstrate that our method can improve the reconstruction accuracy. Keywords: Support vector regression, Gene time series, Kernel.
1 Introduction
As time series microarray experiments can provide more insight about the dynamic nature of a given biological process, analysis of temporal gene expression data is receiving growing attention from systems biologists [1,2]. A significant challenge in dealing with gene time series data comes from experimental errors or variability in the timing of biological processes, giving rise to noise, missing data points, and non-uniformly sampled time points [3]. In many algorithms for biological studies, such as clustering, a sufficient quantity of data in the appropriate format is a basic requirement, thus current gene expression time series data are often inapplicable. Much work has been done to solve the problem of non-uniformly sampled and noisy data. One main approach is to reconstruct each gene expression profile as a continuous function of
time and then operate directly on the continuous representations [3,4,5,6,7,8,9]. Bar-Joseph et al. and Luan et al. respectively proposed mixed-effects models for time series gene expression data using B-splines, which can estimate the continuous curves of gene profiles and cluster them simultaneously [3,4]. Song et al. also used the B-spline to reconstruct the gene expression functions; they then reduced the dimensionality of the data by functional principal component analysis for clustering [5] and classification [7]. The continuous representation methods are effective for non-uniformly sampled and noisy data. However, the existing B-spline based methods share the same problem setting, in that the number of B-spline bases for each gene profile is fixed to be the same, which forbids a flexible approximation to the gene curve for different sized data sets, where more terms are needed if there are more sampled data points. Furthermore, owing to noise and missing values, the traditional least squares method for estimating these spline coefficients from expression data for each gene could lead to over-fitting of the data. In this paper, we propose a new temporal gene expression profiles reconstruction model based on support vector regression (SVR) with the framelet based kernel. SVR has been successfully applied in many areas such as spectral reflectance estimation [10], due to its capacity for handling nonlinear relations and learning from sparse data [11]. The framelet based kernel has the ability to approximate functions with multiscale structure and can reduce the influence of noise in data [12,13]. To compensate for the inadequate information from noisy and short gene expression data, we use each gene's correlated genes as the test set to choose the optimal parameters. Experimental results demonstrate that our method is robust to noise and can improve the reconstruction accuracy.
2 Methods
The measurement of a time series microarray experiment can be modeled as

Y_{i,j} = f_i(t_j) + e_{i,j},    (1)

where Y_{i,j} is the expression level of the ith gene at time t_j, for i = 1, …, n, j = 1, …, T; n is the number of genes, and T is the number of time points; f_i(t) is the ith gene profile as a function of time; e_{i,j} denotes the measurement noise or error, which is assumed to be uncorrelated and normally distributed with E(e_{i,j}) = 0, Var(e_{i,j}) = σ². The goal here is to reconstruct f_i(t) from gene expression measurements Y_{i,j} by learning from examples. Note that the gene time series are usually sampled non-uniformly and there are a large number of missing values in Y. To better reconstruct gene profiles from such data, we use the framelet kernel based support vector regression to approximate the nonlinear gene expression curves, and use each gene's correlated genes as the test set to select the optimal parameters.
2.1 Gene Profiles Reconstruction by Support Vector Regression and Framelet Kernel
The cubic B-spline model is the most widely used tool for gene expression curve reconstruction, due to its nonlinear approximation capacity and smoothness property [3,4,5,7]. Let N(x) be the cubic B-spline function. Then the ith gene profile can be defined as

f_i(t) = \sum_{l} c_{i,l} N(t - l),    (2)

for i = 1, …, n, where c_{i,l} are the spline coefficients. The solution of c_i can be searched by the least-squares method. However, if the time range is fixed as [t_min, t_max], there will be a fixed number of terms corresponding to the index l in model (2). This will forbid a flexible approximation to the gene curve for different sized data sets, where more terms are needed if there are more sampled data points. In order to enable a flexible approximation to match different kinds of data, we choose to use support vector regression to reconstruct gene profiles. SVR was developed by Vapnik and his coworkers [14,15], and it is a kernel based regularization method. It solves the "over-fitting" problem by using the structural risk minimization principle, which minimizes both the empirical risk and the confidence interval. Another remarkable characteristic of SVR is the sparse representation of the solution, which makes it more robust to outliers in the data and thus gives good generalization performance. Suppose that we have the sample set S_i = {(t_j, Y_{i,j}), j = 1, …, J} of time-measurement pairs for the ith gene, where the value of J may differ between genes. Then the method of SVR corresponds to minimizing the following functional:

\min_{f_i \in H_K} H[f_i] = \sum_{j=1}^{J} |Y_{i,j} - f_i(t_j)|_\varepsilon + \lambda \|f_i\|_K^2,    (3)

where \|f_i\|_K^2 is the norm in a RKHS H_K defined by the kernel function K(x, z), and the first term

|x|_\varepsilon = \begin{cases} 0 & \text{if } |x| < \varepsilon, \\ |x| - \varepsilon & \text{otherwise}, \end{cases}    (4)

is Vapnik's ε-insensitive loss function [14,15]. The parameter ε defines the tube around the regression function within which errors are not penalized. This loss function provides the advantage of using sparse data points to represent the solution. The regularization parameter λ > 0 determines the trade-off between the flatness of f and the amount up to which deviations larger than ε are tolerated. The minimizer of (3) has the general form

f_i(t) = \sum_{l} c_{i,l} K(t_l, t),    (5)
where the coefficients ci can be found by solving a quadratic programming problem.
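As an illustration, the sketch below fits one gene profile by ε-SVR over non-uniform time points; an RBF kernel is used here purely as a stand-in because the framelet kernel (WMFK) discussed next is not reproduced, and the time points, expression values and parameter values are placeholders.

# Illustrative fit of one gene profile f_i(t) by epsilon-SVR as in Eq. (3); an RBF
# kernel is a stand-in for the framelet kernel (WMFK), and all values are placeholders.
import numpy as np
from sklearn.svm import SVR

t = np.array([0.0, 0.7, 1.5, 3.1, 4.0, 6.2, 8.0, 10.0])   # non-uniform time points
y = np.sin(t) + 0.1 * np.random.randn(len(t))              # placeholder expression values

svr = SVR(kernel="rbf", C=10.0, gamma=0.5, epsilon=0.05)
svr.fit(t.reshape(-1, 1), y)

t_dense = np.linspace(0, 10, 200).reshape(-1, 1)
profile = svr.predict(t_dense)       # continuous reconstruction, readable at any time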
Kernel function K(x, z) and the associated RKHS H_K play important roles, as H_K describes the hypothesis space where one looks for the solution. The choice of the kernel K is critical in order to obtain good performance. In order to eliminate the influence of non-uniformly sampled and noisy data, we use a framelet based kernel, called the weighted multiscale framelet kernel (WMFK) [12,13], with the support vector regression algorithm to reconstruct the continuous gene profiles. The WMFK is defined based on the framelet [16], a well-known tool in the field of signal processing which has both the merit of the wavelet and the frame. The authors in [12,13] have proven that WMFK can approximate functions with multiscale structure, can reduce the influence of noise in data, and performs better than traditional kernels.
2.2 Selection of Optimal Parameters
For each gene, the data set S_i is used as the training set in (3) to build the expression curve with certain parameters ε and λ. The optimal choice of the parameters should be such that it helps to minimize over-fitting on the training set, and therefore generalizes well to novel data. As a result a separate test set is needed; the optimal parameters of the individual genes are then chosen to minimize the error on the test set. In our method, we assume that strongly correlated genes have the same expression profiles, and thus we use the m reference genes of a specific gene to form its test set. To search for the test set for each gene, all the missing values in the dataset Y are set to zero initially. Let {y_i^1, …, y_i^m} be the m reference genes that are nearest to the ith gene in Euclidean distance. Assume that the Euclidean distances between the m reference genes and the target gene y_i are {d_i^1, …, d_i^m}. Then the test set \tilde{y}_i for gene y_i is estimated by the weighted average of the m reference genes as

\tilde{y}_i = \sum_{r=1}^{m} w_r \, y_i^r,    (6)

where the weights are given by w_r = \frac{1/d_i^r}{\sum_{s=1}^{m} 1/d_i^s}.
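A sketch of this test-set construction follows: missing entries are zeroed, the m nearest genes in Euclidean distance are selected, and they are combined with normalized inverse-distance weights as in Eq. (6). The matrix shape and the exact weighting are assumptions consistent with the description above.

# Sketch of the test-set construction of Eq. (6) for gene i; Y is an n-by-T
# expression matrix placeholder and the inverse-distance weighting is assumed.
import numpy as np

def reference_profile(Y, i, m=5):
    Y0 = np.nan_to_num(Y)                            # missing values set to zero
    d = np.linalg.norm(Y0 - Y0[i], axis=1)
    d[i] = np.inf                                    # exclude the gene itself
    idx = np.argsort(d)[:m]                          # m nearest reference genes
    w = 1.0 / (d[idx] + 1e-12)
    w /= w.sum()                                     # normalized weights
    return w @ Y0[idx]                               # weighted average, Eq. (6)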
3 Experiments
Experiments
We provide an experiment on the yeast cell cycle gene time series data from Spellman et al. [17], showing results for missing data estimation. The aim is to exemplify the improvement of reconstruction performance by using our support vector regression and the framelet kernel. We compare our method to other two methods: linear interpolation [18] and cubic B-spline model [5,7]. SVR is constructed based on LIBSVM 2.84 [19]. The other methods are implemented by our own Matlab code. We concentrate on the cdc15 dataset in which the yeast cells were sampled non-uniformly in the interval of 10 min to 290 min with a total of 24 time points. The authors identified 800 genes as cell cycle regulated genes. Among
Fig. 1. Averaged NRMSE values plotted versus the number of reference genes (1 to 10), with one curve for each level of removed data: 1%, 5%, 10%, 15%, and 20% missing points
the 800 genes, 633 genes had no missing values. In the following, we use the 633 complete genes for the comparison. To facilitate the numerical implementation of our method, the sampled time interval is linearly scaled from [10, 290] to [0, 10]. Different sized training sets are derived by randomly removing 1%, 5%, 10%, 15%, and 20% of the points of the complete gene data matrix. Then the accuracy of the recovery is assessed by measuring the normalized root-mean-square error (NRMSE) between the original and reconstructed missing values. The NRMSE is expressed as

NRMSE = \sqrt{ \frac{ \sum (Y_{i,j}^{o} - Y_{i,j}^{p})^2 }{ \sum (Y_{i,j}^{o})^2 } },    (7)

where Y_{i,j}^{o} and Y_{i,j}^{p} represent the original and predicted missing values respectively. For each training size, the training set is randomly selected 10 times, and the estimation results are summarized over the 10 trials. The performance of our method depends on the choice of m, which is the number of reference genes for each gene. We test different numbers of m ranging from 1 to 10 on the different sized experiment data. Fig. 1 shows the averaged NRMSE values of our method with different choices of m, in which the averaged NRMSE value is plotted as a function of the number of reference genes. As can be seen, our method with 4 or 5 reference genes gives the lowest averaged NRMSE value. Next we fix the number of reference genes to 5 and compare our method with the other two methods. The detailed result is presented in Table 1. For each method, the NRMSE results are summarized over 10 trials. As can be seen, our method provides a lower averaged NRMSE value than the
Table 1. Summary of the averaged NRMSE values for different methods on the cdc15 data. For each method, the results are summarized over 10 trials. In every row the best result is labeled in bold type.

Percentage of missing values    Linear    B-spline    SVR
 1%                             0.8671    0.7513      0.6832
 5%                             0.8670    0.7864      0.7270
10%                             0.8910    0.7981      0.7240
15%                             0.9562    0.8417      0.7412
20%                             0.9718    0.8535      0.7538

Linear, linear interpolation; B-spline, cubic B-spline model; SVR, support vector regression with framelet kernel.
other methods in all training sizes. The missing-values recovery performance of the linear interpolation is the poorest. We guess that this may be because the gene time expression profiles are essentially nonlinear functions, and the linear interpolation is not suitable here. Note that the cubic B-spline model provides a lower averaged NRMSE than the linear interpolation, but still does worse than our method. This may be because it only uses individual genes for reconstruction, which is prone to over-fitting with fewer training points. Interestingly, we can see that the performance of our method is not very sensitive to the percentage of missing data points. Therefore, our method is more robust to noise and missing values.
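For reference, the NRMSE criterion of Eq. (7) used throughout this comparison can be computed directly as in the short sketch below; y_true and y_pred are placeholders for the original and predicted missing values.

# Direct computation of the NRMSE criterion of Eq. (7) over the held-out entries;
# y_true and y_pred are placeholder arrays of original and predicted missing values.
import numpy as np

def nrmse(y_true, y_pred):
    return np.sqrt(np.sum((y_true - y_pred) ** 2) / np.sum(y_true ** 2))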
4 Conclusion
In this paper, we propose a new approach based on the support vector regression method and the framelet kernel for temporal gene expression profile reconstruction. We perform experiments on real gene time series data with different-sized training sets. The SVR estimation method shows better performance than the previously developed methods. The outstanding estimation ability of this method is partly due to the use of each gene's correlated genes as the test set to choose the optimal parameters. In addition, the solid theoretical foundation of the SVR method and framelet kernels also helps in improving estimation performance.
Acknowledgment This work is supported by the Hong Kong Research Grant Council (Project CityU 122607), the National Natural Science Foundation of China (60903094), the President Foundation of South China Agricultural University (4900-208064).
References 1. Bar-Joseph, Z.: Analyzing time series gene expression data. Bioinformatics 20, 2493–2503 (2004) 2. Wang, X., Wu, M., Li, Z., Chan, C.: Short time-series microarray analysis: methods and challenges. BMC Syst. Biol. 2 (2008) 3. Bar-Joseph, Z., Gerber, G.K., Jaakkola, T.S., Gifford, D.K., Simon, I.: Continuous representations of time series gene expression data. J. Comput. Biol. 10, 341–356 (2003) 4. Luan, Y., Li, H.: Clustering of time-course gene expression data using a mixed-effects model with b-splines. Bioinformatics 19, 474–482 (2003) 5. Song, J.J., Lee, H.J., Morris, J.S., Kang, S.: Clustering of time-course gene expression data using functional data analysis. Comput. Biol. Chem. 31, 265–274 (2007) 6. Leng, X.Y., Müller, H.G.: Classification using functional data analysis for temporal gene expression data. Bioinformatics 22, 68–76 (2006) 7. Song, J.J., Deng, W.G., Lee, H.J., Kwon, D.: Optimal classification for time-course gene expression data using functional data analysis. Comput. Biol. Chem. 32, 426–432 (2008) 8. Bar-Joseph, Z., Gerber, G.K., Simon, I., Gifford, D.K., Jaakkola, T.: Comparing the continuous representation of time-series expression profiles to identify differentially expressed genes. Proc. Nat. Acad. Sci. U.S.A. 100, 10146–10151 (2003) 9. Liu, X.L., Yang, M.C.K.: Identifying temporally differentially expressed genes through functional principal components analysis. Biostatistics 10, 667–679 (2009) 10. Zhang, W.F., Dai, D.Q.: Spectral reflectance estimation from camera responses by support vector regression and a composite model. J. Opt. Soc. Am. A. 25, 2286–2296 (2008) 11. Smola, A., Schölkopf, B.: A tutorial on support vector regression. Stat. Comput. 14, 199–222 (2004) 12. Zhang, W.F., Dai, D.Q., Yan, H.: On a new class of framelet kernels for support vector regression and regularization networks. In: Zhou, Z.-H., Li, H., Yang, Q. (eds.) PAKDD 2007. LNCS (LNAI), vol. 4426, pp. 355–366. Springer, Heidelberg (2007) 13. Zhang, W.F., Dai, D.Q., Yan, H.: Framelet kernels with applications to support vector regression and regularization networks. IEEE Trans. Syst. Man Cybern. Part B Cybern. (2009) (in press) 14. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1995) 15. Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998) 16. Daubechies, I., Han, B., Ron, A., Shen, Z.: Framelets: MRA-based constructions of wavelet frames. Appl. Comput. Harmon. Anal. 124, 44–88 (2003) 17. Spellman, P.T., Sherlock, G., Zhang, M., Iyer, V., Anders, K., Eisen, M., Brown, P., Botstein, D., Futcher, B.: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell. 9, 3273–3297 (1998) 18. Aach, J., Church, G.M.: Aligning gene expression time series with time warping algorithms. Bioinformatics 174, 495–508 (2001) 19. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
Linear Replicator in Kernel Space Wei-Chen Cheng and Cheng-Yuan Liou Department of Computer Science and Information Engineering National Taiwan University Republic of China
[email protected]
Abstract. This paper presents a linear replicator [2][4] based on minimizing the reconstruction error [8][9]. It can be used to study the learning behaviors of the kernel principal component analysis [10], the Hebbian algorithm for the principle component analysis (PCA) [8][9] and the iterative kernel PCA [3]. Keywords: Replicator, Principal component analysis, Generalized Hebbian algorithm, Kernel Hebbian algorithm, Gaussian kernel.
1 Introduction The replicator is constructed from a multilayer perceptron and has many applications [2][4]. This paper presents a linear replicator and its training algorithm based on minimizing the reconstruction error [9][8] in the kernel space. It can facilitate the study of principal component analysis (PCA). The PCA projects data onto several selected orthogonal bases which preserve variational information. Those bases are called principal components. The projection process is a linear transformation. The kernel PCA [10] applies a nonlinear transformation that projects data onto a very high dimensional space, based on Mercer's theorem [6][1], and finds principal components in that high-dimensional space. The space complexity of the kernel PCA is the square of the number of data. This complexity is severe in many large scale applications. An iterative kernel PCA, called the kernel Hebbian algorithm (KHA) [3], is devised for on-line learning to reduce the size of the storage. The technique of the generalized Hebbian algorithm (GHA) [9] is used in the KHA. The replicator is also constructed in the high dimensional space. Its energy function and training algorithm are formulated in the next section.
2 The Linear Replicator The replicator is illustrated in Fig. 1. There are three layers, input layer, output layer and hidden layer. All neurons are linear elements. Both input and output layers have N neurons. The hidden layer has M neurons and M is less than N usually. The weight matrix of the synapses connecting the input layer and the hidden layer is W and the matrix connecting the hidden layer and the output layer is W T , where W T is the transpose
Corresponding author. Supported by National Science Council.
Fig. 1. Illustration of the linear replicator
of the matrix W. W is an N-by-M matrix. This replicator possesses the self-similarity structure in the two weight matrices. According to the kernel PCA, each D-dimensional data, x_p ∈ {x_p; p = 1, …, P}, is mapped to the N-dimensional space using a pre-designed mapping function Φ, Φ(x_p): R^D → R^N, where N ≫ D. Φ(x_p) is an N-dimensional column vector. Let the N-by-P matrix X contain all mapped data, X = [Φ(x_1), …, Φ(x_P)]. We plan to find the M principal components which contain large amounts of variational information in the N-dimensional space, R^N. The weight matrix of the synapses connecting the input layer and the hidden layer is W = [w_1, w_2, …, w_M]. Each column vector w_q contains all weights of the qth hidden neuron. According to the kernel PCA, w_q is a linear combination of all mapped data,

w_q = \sum_{p=1}^{P} a_{pq} \Phi(x_p) = X a_q, \quad q \in \{1, \ldots, M\}.    (1)
Let A be a P-by-M matrix whose elements are the coefficients of the linear combination,

A = [a_1, \ldots, a_M].    (2)

We get W = XA. Let the matrix Y contain the P outputs of the M hidden neurons,

Y = [y_1, y_2, \ldots, y_P] = W^T X.    (3)
Each output y_p is an M-dimensional column vector. Define the kernel matrix K,

K = [k_1, k_2, ..., k_P] =
    [ Φ(x_1)^T Φ(x_1)  ...  Φ(x_1)^T Φ(x_P) ]
    [       ...        ...        ...        ]
    [ Φ(x_P)^T Φ(x_1)  ...  Φ(x_P)^T Φ(x_P) ].   (4)

Rewrite the output Y in (3) using the two matrices in (2) and (4) to get

Y = A^T K.   (5)
In (5), the output Y is a function which depends linearly on the coefficient matrix A. Write the error energy for the reconstruction error [4][5] of the network in Fig. 1,

E = (1/2) Σ_{p=1}^{P} ||Φ(x_p) − W W^T Φ(x_p)||².   (6)
We plan to seek a weight matrix W that minimizes the total error E in (6). Using (5) and (4), we rewrite this energy in terms of A, which contains all unknown coefficients {a_pq; p = 1, ..., P; q = 1, ..., M} explicitly:

E = (1/2) Σ_{p=1}^{P} [ k_pp − 2 (A^T k_p)^T (A^T k_p) + (A y_p)^T K (A y_p) ]
  = (1/2) trace( K − 2 Y^T Y + (AY)^T K (AY) ),   (7)

where trace means the summation of all diagonal elements. Accordingly, the norm of the weight vector of the qth neuron, ||w_q||², in (1) is ||w_q||² = w_q^T w_q = a_q^T K a_q, q ∈ {1, ..., M}. The kernel matrix K must be positive semi-definite; otherwise, certain norms may be negative.
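Since every quantity in (1)-(7) is expressible through the kernel matrix, the energy is easy to evaluate numerically. The sketch below is a minimal illustration only, not the authors' code; the Gaussian kernel, the random data and the random coefficient matrix are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
P, D, M = 20, 3, 2                       # samples, input dimension, hidden neurons
data = rng.normal(size=(P, D))           # raw inputs x_1, ..., x_P (rows)

def gaussian_kernel(A, B, sigma=1.0):
    # K[i, j] = Phi(a_i)^T Phi(b_j) for a Gaussian kernel (an assumed choice of Phi)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

K = gaussian_kernel(data, data)          # P x P kernel matrix of Eq. (4)
A = rng.normal(size=(P, M))              # coefficient matrix of Eq. (2)
Y = A.T @ K                              # hidden outputs, Eq. (5)

# Reconstruction energy of Eq. (7), written purely with kernel quantities
E = 0.5 * np.trace(K - 2.0 * Y.T @ Y + (A @ Y).T @ K @ (A @ Y))
print("reconstruction energy E =", E)
```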
3 Training Algorithm To reduce the storage size, we plan to develop an iterative on-line learning algorithm for each individual data point. Write E = (1/2) Σ_{p=1}^{P} ||Φ(x_p) − W W^T Φ(x_p)||² = (1/2) Σ_{p=1}^{P} E_p = (1/2) Σ_{p=1}^{P} ||e^p||², where e^p = Φ(x_p) − W W^T Φ(x_p). The training algorithm will follow the steepest descent direction to reduce the error E_p. The descent direction is ΔA ≈ −∂E_p/∂A. The individual error for the pth data point inside the energy equation E (6) is

||e^p||² = [Φ(x_p) − W W^T Φ(x_p)]^T [Φ(x_p) − W W^T Φ(x_p)]
         = Φ(x_p)^T Φ(x_p) − 2 Φ(x_p)^T W W^T Φ(x_p) + Φ(x_p)^T W W^T W W^T Φ(x_p).   (8)

When we impose the orthonormal condition among the bases, W^T W = I (the identity matrix), we get ||e^p||² = Φ(x_p)^T Φ(x_p) − Φ(x_p)^T W W^T Φ(x_p). The descent direction for the current data Φ(x_p) is
ΔA_replicator ≈ − ∂||e^p||²/∂A.   (9)

We will keep the orthogonal condition among the base components during the training process, {w_i^T w_j = 0, for i ≠ j; i, j = 1, 2, ..., M}, to fulfill the orthogonal requirement in PCA. In order to have zero-mean data in the N-dimensional space, we use the augmented matrix [3], K̃, in the algorithm, where K̃ = K − (1/L) K 1 1^T − (1/L) 1 1^T K + (1/L²)(1^T K 1) 1 1^T and 1 is a unit vector, 1 = [1, ..., 1]^T. The training algorithm for the linear replicator is as follows:
1. Set t = 0 and assign random values to the coefficients in the initial matrix A(t = 0), where t is the iteration number. Assume that the dataset X has a zero-mean center.
2. Select a current data point Φ(x_p). Compute the descent ΔA_replicator in (9).
3. Update the coefficient matrix A,

   A(t + 1) = A(t) + η ΔA_replicator,   (10)

   where η is the learning rate.
4. Compute W(t + 1) = X A(t + 1).
5. Accomplish a set of orthogonal bases {w_i^T(t + 1) w_j(t + 1) = 0, for i ≠ j; i, j = 1, 2, ..., M} by applying the Gram-Schmidt orthonormalization to the vectors {w_j(t + 1), j = 1, 2, ..., M}.   (11)
6. Compute the matrix A(t + 1) using the orthogonal bases {w_q(t + 1), q = 1, 2, ..., M}.
7. If the energy (7) has not converged to the minimum yet, go back to Step 3. Otherwise, finish the algorithm.
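A compact numerical sketch of Steps 1-7 follows (an illustration, not the authors' implementation). The Gaussian kernel, the centering of K̃, the learning rate, the fixed epoch count and the per-sample gradient 2 k_p k_p^T A (which follows from the simplified error under W^T W = I) are assumptions; the Gram-Schmidt step is carried out in coefficient space using the inner product w_i^T w_j = a_i^T K a_j.

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def center_kernel(K):
    # K_tilde = K - (1/L) K 1 1^T - (1/L) 1 1^T K + (1/L^2)(1^T K 1) 1 1^T
    L = K.shape[0]
    one = np.ones((L, L)) / L
    return K - K @ one - one @ K + one @ K @ one

def gram_schmidt_in_kernel(A, K):
    # orthonormalize w_q = X a_q using <w_i, w_j> = a_i^T K a_j (Steps 4-6)
    A = A.copy()
    for q in range(A.shape[1]):
        for r in range(q):
            A[:, q] -= (A[:, r] @ K @ A[:, q]) * A[:, r]
        A[:, q] /= np.sqrt(A[:, q] @ K @ A[:, q])
    return A

def train_replicator(K, M, eta=0.05, n_epochs=100):
    P = K.shape[0]
    A = np.random.default_rng(0).normal(scale=0.1, size=(P, M))   # Step 1
    A = gram_schmidt_in_kernel(A, K)
    for _ in range(n_epochs):
        for p in range(P):                    # Step 2: pick the current data point
            k_p = K[:, p]
            y_p = A.T @ k_p                   # response of the hidden neurons
            A = A + eta * np.outer(k_p, y_p)  # Step 3 update; gradient of e_p^2 assuming W^T W = I
            A = gram_schmidt_in_kernel(A, K)  # Steps 4-6 in coefficient space
    return A

K = center_kernel(gaussian_kernel(np.random.default_rng(1).normal(size=(30, 4))))
A = train_replicator(K, M=2)
```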
4 Alternative Training Algorithm Since the weight matrices are limited by the self-similarity structure and the neurons are linear elements, we do not expect that the replicator will give any better performance than that of a general multilayer perceptron without such limitations. When we take an alternative approach to solve for the orthogonal bases, we find something interesting and report it in this paper. As an alternative approach, we use the Gram-Schmidt orthonormalization to accomplish the orthogonal bases, w_q, in a similar way as that used in [9]. Assume W W^T has the form Σ_q w_q w_q^T. This means W W^T = Σ_q w_q w_q^T. For each data point x_p, we define the jth residual error by

e_j^p = Φ(x_p) − Σ_{q≤j} w_q w_q^T Φ(x_p),  j ∈ {1, ..., M}.   (12)

For instance, e_1^p = Φ(x_p) − w_1 w_1^T Φ(x_p) and e_2^p = Φ(x_p) − w_1 w_1^T Φ(x_p) − w_2 w_2^T Φ(x_p). The updating formula for the jth hidden neuron is
Δa_ij ≈ − ∂/∂a_ij ( (1/2) Σ_{p=1}^{P} ||e_j^p||² ) ∼ − Σ_{n=1}^{N} Σ_{p=1}^{P} ê_{nj}^p × ∂ê_{nj}^p/∂a_ij
     = − Σ_{n=1}^{N} Σ_{p=1}^{P} ê_{nj}^p ∂( w_{nj} w_j^T Φ(x_p) ) / ∂a_ij
     = − Σ_{n=1}^{N} Σ_{p=1}^{P} ê_{nj}^p [ (Φ(x_i))_n w_j^T Φ(x_p) + w_{nj} Φ(x_i)^T Φ(x_p) ].   (13)
Expanding and simplifying the updating formula (13), we obtain a compact form, named GKHA,

ΔA_GKHA = [Δa_1, ..., Δa_M]
        ≈ KY^T + KY^T − Y^T × UT[YY^T] − KY^T × UT[YA]
        = KY^T − Y^T × UT[YY^T],  when W^T W = I.   (14)
In (14), the operation UT[·] sets all elements below the diagonal of its argument matrix to zero. We can keep W^T W = I in the training algorithm. The GHA method [9] writes ΔW_GHA = XY^T − W × UT[YY^T] as the updating formula. Suppose that this update is derived from a certain energy E_GHA. This means

ΔW_GHA ≈ − ∂E_GHA/∂W = XY^T − W × UT[YY^T].   (15)

According to the chain rule, the update with respect to A is

ΔA_GHA ≈ − ∂E_GHA/∂A = − (∂W/∂A)† ∂E_GHA/∂W = X^T ( XY^T − W × UT[YY^T] )
        = KY^T − Y^T × UT[YY^T],   (16)

where (∂W/∂A)† = X^T. The derivation of the formula ∂E_GHA/∂A = −X^T (XY^T − W × UT[YY^T]) is omitted. (16) is exactly the same as the GKHA updating formula (14). The GHA has a form in terms of the weight matrix W,

W_GHA(t + 1) = X A_GHA(t + 1) = X A_GHA(t) + XY^T − X A_GHA(t) × UT[YY^T].   (17)
The updating formula in KHA [3] is

A_KHA(t + 1) = A_KHA(t) + Y^T − A_KHA(t) × UT[YY^T] = A_KHA(t) + ΔA_KHA.   (18)
From the above equations, ΔA_GHA in (16) is a K-weighted version of the update ΔA_KHA, i.e., ΔA_GHA = K ΔA_KHA. The weighted update K ΔA_KHA is also the same as that of the GKHA (14). Table 1 lists the four updating formulas, including the subspace network learning algorithm given by [7]. Suppose that Φ is an identity function, Φ(x) = x; then GKHA is the same as GHA, ΔA_GKHA ≡ ΔA_GHA. When K is an identity matrix whose diagonal elements are 1 and all other elements are 0, GKHA is the same as KHA, ΔA_GKHA ≡ ΔA_KHA. We expect that GKHA can provide some explanations on the learning behaviors of KHA and GHA. It is expected that all three algorithms, GKHA, GHA and KHA, will have similar performance to that of the linear replicator algorithm. The choice may depend on their complexities. The GKHA does not need the eigen-decomposition of the matrix K and can be implemented in a low-level language. The time complexity of the terms KY^T and KA in (14) is O(P²M), the time complexity of YY^T and Y^T × UT[YY^T] is O(PM²), and the minus operation is O(MP). The overall complexity is O(P²M + PM²). The number of hidden neurons is usually small in many dimension-reduction applications, P ≫ M. We also assume that the memory space is sufficient and the matrix K is precalculated. In such a case, the time complexity of updating A is O(P²M). The time complexity of computing all the inner products of the input data to construct the matrix K is O(P²D). The space complexity for the matrix K is O(P²). In the sequential mode, we calculate the elements of K when necessary. The space complexity to save the data matrix X is O(DP) and to store the matrix A is O(MP). The time complexity of computing YY^T is O(M²), and computing Y is O(MPD). When P ≫ M, the time complexity of each update is O(MPD) and the space complexity is O(DP + MP). Table 2 lists the comparisons in the sequential mode among the three methods. The advantage of GHA is that its space complexity is O(MN) instead of O(MP). When the number P is much larger than N, GHA is applicable. GKHA and KHA do not depend on N, so they are applicable when the mapped data lie in a very high-dimensional space.

Table 1. Comparison of four updating formulas

Method                 Updating formula
GKHA                   ΔA_GKHA = KY^T − KA × UT[YY^T]
GHA [8][9]             ΔW_GHA  = X (Y^T − A × UT[YY^T])
KHA [3]                ΔA_KHA  = Y^T − A × UT[YY^T]
Subspace Network [7]   ΔW_sub  = X (Y^T − A × YY^T)

Table 2. Computational complexity comparison of the algorithms in sequential mode

Method      Time complexity     Space complexity
GKHA (14)   O(MP + MPD)         O(MP + DP)
GHA (15)    O(NM + NM²)         O(NM + DP)
KHA (18)    O(M²P + MPD)        O(MP + DP)
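The four rules of Table 1 differ only in whether K (equivalently X) appears and in the UT[·] operator. The sketch below is illustrative only; the shapes follow the text (X is N-by-P, A is P-by-M, W = XA, Y = A^T K) and the snippet also checks the relation ΔA_GKHA = K ΔA_KHA noted above.

```python
import numpy as np

def UT(M):
    # keep the diagonal and everything above it, zero the elements below the diagonal
    return np.triu(M)

def gkha_update(K, A):
    Y = A.T @ K
    return K @ Y.T - K @ A @ UT(Y @ Y.T)          # Table 1, GKHA

def kha_update(K, A):
    Y = A.T @ K
    return Y.T - A @ UT(Y @ Y.T)                  # Table 1, KHA [3]

def gha_update(X, W):
    Y = W.T @ X                                   # linear case, Phi(x) = x
    return X @ Y.T - W @ UT(Y @ Y.T)              # Table 1, GHA [8][9]

def subspace_update(X, W):
    Y = W.T @ X
    return X @ Y.T - W @ (Y @ Y.T)                # Table 1, subspace network [7]

rng = np.random.default_rng(0)
N, P, M = 6, 10, 2
X = rng.normal(size=(N, P))
A = rng.normal(size=(P, M))
K = X.T @ X                                       # linear kernel, so K = X^T X
assert np.allclose(gkha_update(K, A), K @ kha_update(K, A))   # Delta A_GKHA = K Delta A_KHA
```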
Fig. 2. The locations of the hollow circles are the initial values of the synapse weights. The optimal minimum is at the location of the hollow squares. The contour in the background shows the energy landscape of (7). In this simulation, the input has two dimensions and there is only one hidden neuron; therefore, the two axes represent the values of w_11 and w_21, respectively. (a) GKHA converges to the optimal minimum along the red trajectory, while KHA, GHA and the subspace network converge to the optimal minimum along the same black trajectory. (b) The blue curve shows the KHA trajectory under the constraints a_31 = a_41 = a_51 = 0, and the green curve shows the KHA trajectory under the non-negative restriction {a_i1 ≥ 0 | i = 1, ..., 5}. Under the two restrictions, GKHA (in red) is still capable of converging to the optimal minimum, whereas the green and blue trajectories show that KHA converges to non-optimal results.
Finally, we analyze the learning behavior in the weight space and show the advantage of GKHA. Suppose the hidden layer has a single neuron, M = 1, and there are five input data points, P = 5,

X = [ (−0.6, −1.8)^T, (−0.6, 2.2)^T, (1.4, 0.2)^T, (−0.6, 1.2)^T, (0.4, −1.8)^T ].
The mapping function is an identity function, Φ(x) = x. Therefore the vector a_1 ∈ R^5 has five dimensions (elements). We set random values in the range [−1, +1] as the initial weights in w_1(t = 0). Having the weights w_1(0), we calculate a_1 = X^{-1} w_1(0) for GKHA and KHA to use as the initial value of a_1. During training, it is the weights w_1 that the algorithms GHA and subspace network adjust, whereas it is a_1 that the algorithms GKHA and KHA adjust. After updating a_1, the weights can be calculated, w_1 = X a_1, and recorded. Figure 2(a) plots the trajectories, in the weight space, of the convergence of the four methods in Table 1. We found that all methods except GKHA stepped forward along the same trajectory in weight space and went toward the same minimum; see the black trajectory in Figure 2(a). GKHA followed a different track from the other methods in the weight space because GKHA goes in the direction of gradient descent in the A space. The following experiment suggests that GKHA is capable of learning and adapting under constraints while KHA is not. We add the constraints {a_i1 = 0 | i = 3, 4, 5} to the algorithms GKHA and KHA, and reset a_31 = a_41 = a_51 = 0 right after each update of a_1. The blue trajectory in
Figure 2(b) shows that KHA converged to an incorrect area. In contrast, GKHA went toward the correct location at the hollow squares, and converged along the same trajectory as in Figure 2(a). Without this zero constraint, all elements of a_1 obtained by KHA are nonzero after convergence. GKHA benefits from the zero elements because they save input storage space. This kind of GKHA convergence has not been reported before. Besides the zero constraints, we also examined the non-negative constraints {a_i1 ≥ 0 | i = 1, 2, 3, 4, 5}. The simulation results showed that GKHA can always find a feasible solution for a_1, whereas KHA failed under these constraints on A. The blue trajectories in Figure 2(b) are the results of KHA. For these reasons, we consider GKHA to be a more useful and flexible model for applications than KHA.
References
1. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A Training Algorithm for Optimal Margin Classifiers. In: COLT 1992: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152 (1992)
2. Hecht-Nielsen, R.: Replicator Neural Networks for Universal Optimal Source Coding. Science 269, 1860–1863 (1995)
3. Kim, K.I., Franz, M.O., Scholkopf, B.: Iterative Kernel Principal Component Analysis for Image Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1351–1366 (2005)
4. Liou, C.-Y., Chen, H.-T., Huang, J.-C.: Separation of Internal Representations of the Hidden Layer. In: Proceedings of the International Computer Symposium, Workshop on Artificial Intelligence, pp. 26–34 (2000)
5. Liou, C.-Y., Cheng, W.-C.: Resolving Hidden Representations. In: Ishikawa, M., et al. (eds.) ICONIP 2007, Part II. LNCS, vol. 4985, pp. 254–263. Springer, Heidelberg (2008)
6. Mercer, J.: Functions of Positive and Negative Type, and their Connection with the Theory of Integral Equations. Philosophical Transactions of the Royal Society of London, Series A 209, 415–446 (1909)
7. Oja, E.: Neural Networks, Principal Components, and Subspaces. International Journal of Neural Systems 1, 61–68 (1989)
8. Oja, E.: Simplified Neuron Model as a Principal Component Analyzer. Journal of Mathematical Biology 15, 267–273 (1982)
9. Sanger, T.D.: Optimal Unsupervised Learning in a Single-Layer Linear Feedforward Neural Network. Neural Networks 2, 459–473 (1989)
10. Scholkopf, B., Smola, A., Müller, K.R.: Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation 10, 1299–1319 (1998)
Coincidence of the Solutions of the Modified Problem with the Original Problem of v-MC-SVM Xin Xue1 , Taian Liu2 , Xianming Kong1 , and Wei Zhang1 1 College of Mathematics and System Science, Taishan University, 271021, Tai’an, P.R.China,
[email protected] 2 Department of Information and Engineering, Shandong University of Science and Technology, 271019, Taian, P.R.China
Abstract. The multi-class support vector machine with parameter v (v-MC-SVM) is a kind of machine learning method which is similar to QP-MC-SVM. The constraints of v-MC-SVM and its dual problem are too complicated. By adding the term b_m to the objective function of v-MC-SVM, the original problem is modified. Then, by employing Kesler's construction, the modified problem is simplified, and efficient algorithms can be used to solve the simplified problem. Numerical testing results show that v-MC-SVM has the same accuracy rate as QP-MC-SVM. Based on the Lagrange function and the KKT conditions, this paper proves strictly that the solution of the modified problem is a solution of the original problem, which provides the multi-class SVM with a theoretical basis. Keywords: Multi-class support vector machine; v-MC-SVM; QP-MC-SVM; Coincidence of the Solutions.
1 Introduction
SVMs can well resolve such practical problems as nonlinearity, high dimensionality and local minima. They have attracted more and more attention and become a hot issue in the field of machine learning, with applications such as handwritten numeral recognition, face recognition, texture classification, and so on [1]. The standard support vector machine (C-SVM) [2] is designed for binary classification. The multi-class problem is commonly solved by a decomposition into several binary problems for which the standard SVM can be used [3]. However, the multi-class problem can also be solved directly and effectively by the multi-class SVM (QP-MC-SVM), which is based on C-SVM [4,5]. Because the selection of the value of C in C-SVM is difficult, Scholkopf proposed another support vector machine with parameter v (v-SVM), which is an improved algorithm. The value of v in v-SVM is related to the number of misclassified samples, support vectors and all training samples. A new model of multi-class support vector machine with parameter v (v-MC-SVM) was proposed based on v-SVM [6]. The existence of optimal solutions and the dual problem of v-MC-SVM are also
given. Because the constraints of v-MC-SVM are too complicated, the original problem of v-MC-SVM is modified by adding b_m to the objective function and employing Kesler's construction, which simplifies the original problem. Numerical testing results show that the v-MC-SVM algorithm performs as well as the QP-MC-SVM algorithm. In this paper, the original problem and the modified problem of v-MC-SVM are introduced first. Then, the coincidence of the solution of the modified problem with that of the original problem of v-MC-SVM is proved strictly, which enriches the theory of multi-class SVM.
2 v-MC-SVM
Let us consider that we are given labeled training patterns {(x_i, y_i) | i ∈ I}, where a pattern x_i is from an n-dimensional space X and its label attains a value from a set K = {1, ..., k}. I = {1, ..., l} denotes a set of indices. The linear classification rules f_m(x) = (w_m · x) + b_m, m ∈ K (the dot product is denoted by (·)), can be found directly by solving the v-MC-SVM problem, which is proposed based on v-SVM, as follows:

min_{w,b,ξ,ρ}  (1/2) Σ_{m∈K} ||w_m||² − vρ + 1/(l(k−1)) Σ_{i∈I} Σ_{m∈K\y_i} ξ_i^m
s.t.  (w_{y_i} · x_i) + b_{y_i} − ((w_m · x_i) + b_m) ≥ ρ − ξ_i^m,
      ρ ≥ 0,  ξ_i^m ≥ 0,  i ∈ I,  m ∈ K\y_i,   (1)
where the minimization of the sum of norms ||w_m||² leads to maximization of the margin between classes. For a non-separable case, ξ_i^m denotes the nonnegative slack variable of training sample x_i, i ∈ I. v is a parameter which will be selected in the procedure. The dual problem of problem (1) is given as follows:

min_α  Σ_{i∈I} Σ_{j∈I} Σ_{m∈K} ( (1/2) c_j^{y_i} A_i A_j − α_i^m α_j^{y_i} + (1/2) α_i^m α_j^m ) (x_i, x_j)
s.t.   Σ_{i∈I} α_i^m = Σ_{i∈I} c_i^m A_i,  m ∈ K,
       Σ_{i∈I} Σ_{m∈K\y_i} α_i^m ≥ v,
       0 ≤ α_i^m ≤ C,  α_i^{y_i} = 0,  i ∈ I,  m ∈ K\y_i,   (2)

where A_i = Σ_{m∈K} α_i^m, and

c_j^{y_i} = 1 if y_i = y_j, 0 if y_i ≠ y_j,  i ∈ I, j ∈ I.   (3)
The constraints of problem (1) and (3) are too complicated. We modify problem (1) by adding b_m to the objective function as follows:

min_{w,b,ξ,ρ}  (1/2) Σ_{m∈K} ||(w_m, b_m)||² − vρ + 1/(l(k−1)) Σ_{i∈I} Σ_{m∈K\y_i} ξ_i^m
s.t.  (w_{y_i} · x_i) + b_{y_i} − ((w_m · x_i) + b_m) ≥ ρ − ξ_i^m,
      ρ ≥ 0,  ξ_i^m ≥ 0,  i ∈ I,  m ∈ K\y_i.   (4)
Set

w = ((w_1^T, b_1), ..., (w_k^T, b_k))^T   (5)

and

z_i^m = (z_i^m(1), ..., z_i^m(k))^T.   (6)
This Kesler's construction, (5) and (6), maps the input n-dimensional space X to a new (n + 1)·k-dimensional space Y where the multi-class problem appears as a one-class problem. Each training pattern x_i is mapped to (k − 1) new patterns z_i^m, m ∈ K\y_i, defined as follows. We assume that the coordinates of z_i^m are divided into k slots. Each slot is [7,8]

z_i^m(j) = [x_i, 1]   for j = y_i,
         = −[x_i, 1]  for j = m,
         = 0          otherwise,   j ∈ K.   (7)

By performing the transformation (5)-(7), problem (4) can be equivalently expressed as the following problem:

min_{w,ξ,ρ}  (1/2) ||w||² − vρ + 1/(l(k−1)) Σ_{i∈I} Σ_{m∈K\y_i} ξ_i^m
s.t.  (w · z_i^m) ≥ ρ − ξ_i^m,  i ∈ I,  m ∈ K\y_i,
      ρ ≥ 0,  ξ_i^m ≥ 0,  i ∈ I,  m ∈ K\y_i.   (8)
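The construction (5)-(7) is mechanical. The short sketch below is an illustration only, not taken from the paper (classes are indexed from 0 for convenience); it builds the stacked vector of (5) and a virtual pattern of (7), and checks that (w · z_i^m) reproduces the margin term appearing in the constraint of (4).

```python
import numpy as np

def kesler_z(x, y, m, k):
    """Virtual pattern z_i^m of Eq. (7): k slots, each of length n + 1."""
    n = x.size
    z = np.zeros(k * (n + 1))
    z[y * (n + 1):(y + 1) * (n + 1)] = np.append(x, 1.0)     # slot j = y_i holds [x_i, 1]
    z[m * (n + 1):(m + 1) * (n + 1)] = -np.append(x, 1.0)    # slot j = m holds -[x_i, 1]
    return z

def stack_w(ws, bs):
    """Stacked vector w of Eq. (5): ((w_1^T, b_1), ..., (w_k^T, b_k))^T."""
    return np.concatenate([np.append(w, b) for w, b in zip(ws, bs)])

rng = np.random.default_rng(0)
n, k = 4, 3
ws = [rng.normal(size=n) for _ in range(k)]
bs = list(rng.normal(size=k))
x, y, m = rng.normal(size=n), 0, 2                           # pattern of class y, rival class m
lhs = stack_w(ws, bs) @ kesler_z(x, y, m, k)
rhs = (ws[y] @ x + bs[y]) - (ws[m] @ x + bs[m])              # margin term in the constraint of (4)
assert np.isclose(lhs, rhs)
```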
The dual problem of problem (8) is as follows:

min_α  Σ_{i,j∈I} Σ_{m∈K\y_i} Σ_{n∈K\y_j} α_i^m α_j^n (z_i^m · z_j^n)
s.t.   0 ≤ α_i^m ≤ 1/(l(k−1)),  i ∈ I,  m ∈ K\y_i,
       Σ_{m,i} α_i^m ≥ v,   (9)

where the dot product between z_i^m and z_j^n is

(z_i^m · z_j^n) = (k(x_i, x_j) + 1) · (δ(y_i, y_j) + δ(m, n) − δ(y_j, n) − δ(y_i, m)),   (10)

and δ(i, j) = 1 for i = j, 0 for i ≠ j.   (11)

3 Coincidence of the Solutions of the Modified Problem with the Original Problem of v-MC-SVM
Theorem 3.1. Each solution (w̃, b̃, ξ̃, ρ̃) of (4) is a solution of (1), whenever the following linear system has a set of solutions {v_i^m ≥ 0 | i ∈ I, m ∈ K\y_i}:

Σ_{i∈I} Σ_{m∈K\y_i} v_i^m ( x_i (δ(j, y_i) − δ(j, m)) ) = 0,
Σ_{i∈I} Σ_{m∈K\y_i} v_i^m (δ(j, y_i) − δ(j, m)) = −b̃_j,
Σ_{i∈I} Σ_{m∈K\y_i} v_i^m ξ̃_i^m = 0,

where j ∈ K and δ(i, j) = 1 for i = j, 0 for i ≠ j.
Proof: The Lagrange function of (4) [9] is

L(w, b, ξ, ρ, α, β, δ) = (1/2) Σ_{m∈K} ||(w_m, b_m)||² − vρ + 1/(l(k−1)) Σ_{i∈I} Σ_{m∈K\y_i} ξ_i^m
    − Σ_{i∈I} Σ_{m∈K\y_i} α_i^m ( (w_{y_i} · x_i) + b_{y_i} − ((w_m · x_i) + b_m) − ρ + ξ_i^m )
    − Σ_{i∈I} Σ_{m∈K\y_i} β_i^m ξ_i^m − δρ,   (12)
where α_i^m, β_i^m and δ are Lagrange multipliers. Let

∂L/∂w_j = w_j − Σ_{i∈I} Σ_{m∈K\y_i} x_i α_i^m (δ(j, y_i) − δ(j, m)) = 0,
∂L/∂b_j = b_j − Σ_{i∈I} Σ_{m∈K\y_i} α_i^m (δ(j, y_i) − δ(j, m)) = 0,
∂L/∂ξ_i^m = 1/(l(k−1)) − α_i^m − β_i^m = 0,
∂L/∂ρ = Σ_{i∈I} Σ_{m∈K\y_i} α_i^m − v − δ = 0.   (13)
Accordingly, we have that

w_j = Σ_{i∈I} Σ_{m∈K\y_i} x_i α_i^m (δ(j, y_i) − δ(j, m)),
b_j = Σ_{i∈I} Σ_{m∈K\y_i} α_i^m (δ(j, y_i) − δ(j, m)),
α_i^m + β_i^m = 1/(l(k−1)),
Σ_{i∈I} Σ_{m∈K\y_i} α_i^m − v − δ = 0.   (14)
Suppose (w̃, b̃, ξ̃, ρ̃) is the solution of (4). There are α̃_i^m, β̃_i^m, and δ̃ such that

w̃_j = Σ_{i∈I} Σ_{m∈K\y_i} x_i α̃_i^m (δ(j, y_i) − δ(j, m)),
b̃_j = Σ_{i∈I} Σ_{m∈K\y_i} α̃_i^m (δ(j, y_i) − δ(j, m)),
α̃_i^m + β̃_i^m = 1/(l(k−1)),
Σ_{i∈I} Σ_{m∈K\y_i} α̃_i^m − v − δ̃ = 0,
α̃_i^m ( (w̃_{y_i} · x_i) + b̃_{y_i} − ((w̃_m · x_i) + b̃_m) − ρ̃ + ξ̃_i^m ) = 0,
β̃_i^m ξ̃_i^m = 0,  δ̃ρ̃ = 0,
α̃_i^m ≥ 0,  β̃_i^m ≥ 0,  δ̃ ≥ 0.   (15)

Now, we validate that (w̃, b̃, ξ̃, ρ̃) is the solution of (1). If the parameter v is replaced by v + Σ_{i∈I} Σ_{m∈K\y_i} v_i^m, the Lagrange function of (1) is as follows:
L(w, b, ξ, ρ, ᾱ, β̄, δ̄) = (1/2) Σ_{m∈K} ||w_m||² − ( v + Σ_{i∈I} Σ_{m∈K\y_i} v_i^m ) ρ + 1/(l(k−1)) Σ_{i∈I} Σ_{m∈K\y_i} ξ_i^m
    − Σ_{i∈I} Σ_{m∈K\y_i} ᾱ_i^m ( (w_{y_i} · x_i) + b_{y_i} − ((w_m · x_i) + b_m) − ρ + ξ_i^m )
    − Σ_{i∈I} Σ_{m∈K\y_i} β̄_i^m ξ_i^m − δ̄ρ,   (16)
where ᾱ_i^m, β̄_i^m, and δ̄ are the Lagrange multipliers. Let

∂L/∂w_j = w_j − Σ_{i∈I} Σ_{m∈K\y_i} x_i ᾱ_i^m (δ(j, y_i) − δ(j, m)) = 0,
∂L/∂b_j = Σ_{i∈I} Σ_{m∈K\y_i} ᾱ_i^m (δ(j, y_i) − δ(j, m)) = 0,
∂L/∂ξ_i^m = 1/(l(k−1)) − ᾱ_i^m − β̄_i^m = 0,
∂L/∂ρ = Σ_{i∈I} Σ_{m∈K\y_i} ᾱ_i^m − v − Σ_{i∈I} Σ_{m∈K\y_i} v_i^m − δ̄ = 0.   (17)
Accordingly, we have that

w_j = Σ_{i∈I} Σ_{m∈K\y_i} x_i ᾱ_i^m (δ(j, y_i) − δ(j, m)),
Σ_{i∈I} Σ_{m∈K\y_i} ᾱ_i^m (δ(j, y_i) − δ(j, m)) = 0,
ᾱ_i^m + β̄_i^m = 1/(l(k−1)),
ᾱ_i^m ( (w_{y_i} · x_i) + b_{y_i} − ((w_m · x_i) + b_m) − ρ + ξ_i^m ) = 0,
Σ_{i∈I} Σ_{m∈K\y_i} ᾱ_i^m − v − Σ_{i∈I} Σ_{m∈K\y_i} v_i^m − δ̄ = 0,
β̄_i^m ξ_i^m = 0,  δ̄ρ = 0,
ᾱ_i^m ≥ 0,  β̄_i^m ≥ 0,  δ̄ ≥ 0.   (18)

We should prove that (w̃, b̃, ξ̃, ρ̃) satisfies the KKT conditions (18). Since

Σ_{i∈I} Σ_{m∈K\y_i} v_i^m ( x_i (δ(j, y_i) − δ(j, m)) ) = 0,
Σ_{i∈I} Σ_{m∈K\y_i} v_i^m (δ(j, y_i) − δ(j, m)) = −b̃_j,
Σ_{i∈I} Σ_{m∈K\y_i} v_i^m ξ̃_i^m = 0,   (19)

where j ∈ K and δ(i, j) = 1 for i = j, 0 for i ≠ j, and setting 0 ≤ ξ̃_i^m ≤ ρ, we obviously have that

Σ_{i∈I} Σ_{m∈K\y_i} v_i^m (ξ̃_i^m − ρ) ≤ (b̃_j)².   (20)
Set ᾱ_i^m = α̃_i^m + v_i^m, β̄_i^m = β̃_i^m − v_i^m, and δ̄ = δ̃. We have the following formulas.

1)  Σ_{i∈I} Σ_{m∈K\y_i} x_i ᾱ_i^m (δ(j, y_i) − δ(j, m)) = Σ_{i∈I} Σ_{m∈K\y_i} x_i (α̃_i^m + v_i^m)(δ(j, y_i) − δ(j, m))
    = Σ_{i∈I} Σ_{m∈K\y_i} x_i α̃_i^m (δ(j, y_i) − δ(j, m)) + Σ_{i∈I} Σ_{m∈K\y_i} v_i^m x_i (δ(j, y_i) − δ(j, m)) = w̃_j.   (21)

2)  Σ_{i∈I} Σ_{m∈K\y_i} ᾱ_i^m (δ(j, y_i) − δ(j, m)) = Σ_{i∈I} Σ_{m∈K\y_i} (α̃_i^m + v_i^m)(δ(j, y_i) − δ(j, m))
    = Σ_{i∈I} Σ_{m∈K\y_i} α̃_i^m (δ(j, y_i) − δ(j, m)) + Σ_{i∈I} Σ_{m∈K\y_i} v_i^m (δ(j, y_i) − δ(j, m)) = 0.   (22)
3)  ᾱ_i^m + β̄_i^m = α̃_i^m + v_i^m + β̃_i^m − v_i^m = 1/(l(k−1)).   (23)

4)  Σ_{i∈I} Σ_{m∈K\y_i} ᾱ_i^m − v − Σ_{i∈I} Σ_{m∈K\y_i} v_i^m − δ̄
    = Σ_{i∈I} Σ_{m∈K\y_i} α̃_i^m + Σ_{i∈I} Σ_{m∈K\y_i} v_i^m − v − Σ_{i∈I} Σ_{m∈K\y_i} v_i^m − δ̃ = 0.   (24)
5)  Since

ρ̃ − ( (w̃_{y_i} · x_i) + b̃_{y_i} − ((w̃_m · x_i) + b̃_m) ) − ξ̃_i^m ≤ 0,  v_i^m ≥ 0,   (25)

we have that

(α̃_i^m + v_i^m)( ρ̃ − (w̃_{y_i} · x_i) − b̃_{y_i} + (w̃_m · x_i) + b̃_m − ξ̃_i^m )
    = v_i^m ( ρ̃ − (w̃_{y_i} · x_i) − b̃_{y_i} + (w̃_m · x_i) + b̃_m − ξ̃_i^m ) ≤ 0.   (26)

And since

Σ_{i∈I} Σ_{m∈K\y_i} (α̃_i^m + v_i^m)( ρ̃ − (w̃_{y_i} · x_i) − b̃_{y_i} + (w̃_m · x_i) + b̃_m − ξ̃_i^m )
    = Σ_{i∈I} Σ_{m∈K\y_i} v_i^m ( ρ̃ − (w̃_{y_i} · x_i) − b̃_{y_i} + (w̃_m · x_i) + b̃_m − ξ̃_i^m )
    = Σ_{i∈I} Σ_{m∈K\y_i} v_i^m ( (w̃_m · x_i) + b̃_m − (w̃_{y_i} · x_i) − b̃_{y_i} ) − Σ_{i∈I} Σ_{m∈K\y_i} v_i^m (ξ̃_i^m − ρ̃)
    = Σ_{j∈K} w̃_j Σ_{i∈I} Σ_{m∈K\y_i} v_i^m ( x_i (δ(j, y_i) − δ(j, m)) ) + Σ_{j∈K} b̃_j Σ_{i∈I} Σ_{m∈K\y_i} v_i^m (δ(j, y_i) − δ(j, m)) − Σ_{i∈I} Σ_{m∈K\y_i} v_i^m (ξ̃_i^m − ρ̃)
    = (b̃_j)² − Σ_{i∈I} Σ_{m∈K\y_i} v_i^m (ξ̃_i^m − ρ̃) ≥ 0,   (27)

we therefore have that

(α̃_i^m + v_i^m)( ρ̃ − (w̃_{y_i} · x_i) − b̃_{y_i} + (w̃_m · x_i) + b̃_m − ξ̃_i^m ) = 0.   (28)

6)  Obviously, we have that β̄_i^m ξ̃_i^m = β̃_i^m ξ̃_i^m − v_i^m ξ̃_i^m = 0 and δ̄ρ̃ = δ̃ρ̃ = 0.   (29)
From 1)-6) we can conclude that (w̃, b̃, ξ̃, ρ̃) is a KKT point of problem (1). Obviously, problem (1) is a convex quadratic programming problem. So, we can conclude that (w̃, b̃, ξ̃, ρ̃) is certainly a solution of problem (1).
4 Conclusion
v-MC-SVM is a kind of machine learning algorithm which is similar to QP-MC-SVM. To provide evidence for it, the original problem of v-MC-SVM is modified. This paper studies the property of the solutions of the modified problem and of the original problem of v-MC-SVM. Based on the Lagrange function and the KKT conditions, this paper proves strictly that the solution of the modified problem coincides with the solution of the original problem, which enriches the theory of multi-class SVM. The complexity analysis of problems (1) and (3) and a comparison of their running times remain to be studied further.
Acknowledgements The work is supported by National Natural Science Foundation of China (10571109, 10971122), Natural Science Foundation of Shandong (Y2008A01), Scientific and Technological Project of Shandong Province (2009GG10001012), and Program of Shandong Tai’an Science and Technology (20082025).
References
1. Weston, J., Watkins, C.: Support vector machines for multi-class pattern recognition. In: ESANN, Brussels, pp. 219–224 (1999)
2. Platt, J.C., Cristianini, N., Shawe-Taylor, J.: Large margin DAGs for multiclass classification. In: Solla, S., Leen, T.K., Muller, K.R. (eds.) ANI(12), Cambridge, pp. 547–553 (2000)
3. Takahashi, F., Abe, S.: Decision-Tree-Based Multi-Class Support Vector Machines. In: NIP'9, Singapore, pp. 1418–1422 (2002)
4. Dietterich, T.G., Bakiri, G.: Solving multi-class learning problems via error-correcting output codes. JAIR 2, 263–286 (1995)
5. Zhu, M.L., Liu, X.D., Chen, S.F.: Solving the problem of multiclass pattern recognition with sphere-structured support vector machines. JNU 39(2), 153–158 (2003)
6. Xin, X., Taian, L.: A New Model of Multi-class Support Vector Machine with Parameter v. In: FSKD 2009, vol. 1, pp. 57–61 (2009)
7. Franc, V., Hlavac, V.: Multi-class Support Vector Machine. In: Kasturi, R., Laurendeau, D., Suen, C. (eds.) ICPR 2002. PR'16(2), pp. 236–239. IEEE Computer Society, Los Alamitos (2002)
8. Franc, V., Hlavac, V.: Kernel representation of the Kesler construction for Multi-class SVM classification. In: Wildenauer, H., Kropatsch, W. (eds.) CWWW 2002, pp. 7–15 (2002)
9. Yuan, Y., Sun, W.: Optimization theory and method, pp. 422–482. Science Press, Beijing (1997)
Frequency Spectrum Modification: A New Model for Visual Saliency Detection∗ Dongyue Chen, Peng Han, and Chengdong Wu College of Information Science and Engineering, Northeastern University, China {chendongyue,wuchengdong}@ise.neu.edu.cn
Abstract. Previous research has shown that the Fast-Fourier-Transform-based method is an effective approach for building computational attention models. In this paper, a quantitative analysis is carried out to explore the intrinsic mechanism of the FFT-based approach. Based on it, a unified framework is proposed to summarize all existing FFT-based attention models. A new saliency detection model called Frequency Spectrum Modification (FSM) is also derived from this framework. Multiple feature channels and lateral competition are applied in this model for simulating the human visual system. The comparison between FSM and other FFT-based models is implemented by comparing their responses with real human eye fixation traces. The results lead to the conclusion that FSM is more effective in saliency detection. Keywords: attention selection, saliency detection, Fast Fourier Transform, frequency spectrum, feature integration theory.
1 Introduction Attention selection is a refined biological mechanism that assists the human brain in understanding visual signals effectively, and it has also been applied in image processing and computer vision for decades. In some fundamental works on attention models [1,2], the Winner-Take-All rule (WTA) and Feature Integration Theory (FIT) have been referred to as two basic principles. WTA means that only the neuron with the highest response can attract attention at one moment, which is usually used for locating the Focus of Attention (FOA). As a direct consequence of FIT, the structure of a linear summation of multiple feature channels is frequently used for producing the saliency map [2,3,4]. The saliency map of each single channel is computed by filtering the image pyramid iteratively with some pre-designed or adaptive filters [2,3]. However, iterative filtering consequently causes a huge computation cost. Besides, those pre-designed filter groups are unable to span all scales and orientations of image patches. All these inherent limitations make it difficult to apply these models in online visual machines. ∗
Funded by National Higher-education Institution General Research and Development Project N090404001.
Recently, a novel saliency detector called Spectrum Residual (SR) and its variants, such as PFT and PQFT, have been proposed [5,6]. All these models apply the same strategy of changing the amplitude spectrum of the original image, and they are very similar in their simulation results. These models are fast and convenient, for they are based on the Fast Fourier Transform (FFT) technique and are parameter-free. Until now, however, there has been no quantitative analysis that can explain why these models are capable of producing saliency maps. Besides, how to detect salient regions of different sizes is still a challenge for these models. In this paper, we focus on the FFT-based approach. We provide a quantitative analysis that demonstrates the basic mechanism of the FFT-based approach, and propose a unified framework to summarize all FFT-based models. We also extend a new model that is sensitive to salient proto-objects at different scales. The rest of the paper is organized as follows: Section 2 presents the quantitative analysis; Section 3 introduces the unified framework; the proposed attention model is given in Section 4; Sections 5 and 6 are the simulation results and conclusions.
2 Quantitative Analysis for FFT-Based Attention Models All existing FFT-based attention models can be regarded as different kinds of filters in the frequency domain. The common strategy of these models is balancing the energy of the frequency spectrum, which is basically a kind of whitening in the frequency domain.
Fig. 1. Test images containing different geometrical patterns
A test of SR and PFT helps to understand why whitening in the frequency domain highlights salient patterns. An input image I containing two different geometrical patterns, the rectangle R and the triangle T, is shown in Fig. 1(a), where R is the dominant pattern (there are three rectangles in the image) and T is the unique pattern. That means the energy of pattern R is larger than the energy of pattern T in this image. The frequency spectra of I, R and T are denoted by F_I, F_R and F_T respectively. According to the algorithms of SR and PFT, the amplitude ||F_I(u, v)|| at each point (u, v) in the frequency domain should be normalized to 1. In other words, the modified spectrum F'_I of image I is obtained by filtering F_I with a filter M_I, where M_I is related closely to the inverse of F_I. There is a little difference between SR and PFT in the form of M_I, which will be discussed in Section 3. Filtered by M_I, the energies of the two patterns are suppressed to different degrees. The residual rate P_X is used to measure the
92
D. Chen, P. Han, and C. Wu
loss of the energy, which is defined as the square root of the ratio of the residual energy of pattern X to its original energy, which can be computed by: r P 2 u ;P v kF X ( u ;v ) M I ( u ;v ) k (1) PX = ; X = R or T kF X ( u ;v ) k 2 u ;v
The residual rate is a reliable criterion for estimating the saliency of a pattern. The pattern with the higher residual rate will pop out from the background because it suffers less loss in energy than the other patterns. Table 1 shows the values of the residual rates P1 and P2 when the input images are Fig. 1(a) to Fig. 1(j) respectively, where P1 is the residual rate of the unique pattern and P2 is the residual rate of the dominant pattern. The results indicate that the residual rates of unique patterns are statistically higher than those of the dominant patterns (7 cases out of 10 for SR and 9 cases out of 10 for PFT). This result strongly supports the hypothesis that balancing the energy of the input image in the frequency domain can highlight salient patterns. The four counter examples (cases a, d and j for SR and case j for PFT) in Table 1 arise from a common limitation of SR and PFT that we will discuss in Section 3.

Table 1. Residual rates of ten cases for SR and PFT

Case   SR P1    SR P2    PFT P1   PFT P2
a      0.0197   0.0265   0.0387   0.0360
b      0.0322   0.0174   0.0608   0.0245
c      0.0293   0.0218   0.0479   0.0316
d      0.0232   0.0233   0.0519   0.0282
e      0.0403   0.0394   0.1053   0.0632
f      0.0405   0.0391   0.1054   0.0632
g      0.0344   0.0255   0.1153   0.0670
h      0.0301   0.0277   0.1149   0.0671
i      0.0518   0.0329   0.1676   0.0490
j      0.0377   0.0559   0.0866   0.1067
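The residual rate of Eq. (1) is straightforward to reproduce. The sketch below is only an illustration with synthetic patterns, not the paper's test images, and it uses the PFT-style kernel M_I = 1/||F_I||.

```python
import numpy as np

def residual_rate(pattern, whole_image, eps=1e-8):
    """P_X of Eq. (1) with the PFT kernel M_I(u, v) = 1 / ||F_I(u, v)||."""
    F_X = np.fft.fft2(pattern)
    M_I = 1.0 / (np.abs(np.fft.fft2(whole_image)) + eps)
    return np.sqrt(np.sum(np.abs(F_X * M_I) ** 2) / np.sum(np.abs(F_X) ** 2))

# synthetic example: one unique block versus three repeated blocks on a blank canvas
unique = np.zeros((64, 64)); dominant = np.zeros((64, 64))
unique[10:18, 10:18] = 1.0
for r in (30, 40, 50):
    dominant[r:r + 4, 30:34] = 1.0
image = unique + dominant
print("P_unique   =", residual_rate(unique, image))
print("P_dominant =", residual_rate(dominant, image))
```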
3 A Unified Framework for FFT-Based Attention Models All FFT-based attention models look different from each other, even though their strategies are the same in essence. Thus we propose a unified framework to summarize all these models, such as SR, PFT and PQFT. As mentioned in Section 2, SR and PFT can be regarded as two adaptive filters in the frequency domain. So, the main content of an FFT-based approach is designing the kernel function M_I of the filter. For SR and PFT, we have

M_I(u, v) = 1 / ( ||F_I(u, v)|| ∗ h_n(u, v) )   for SR,
M_I(u, v) = 1 / ||F_I(u, v)||                   for PFT,   (2)

where h_n(u, v) is an averaging filter mask with size n × n [5].   (3)

From Eq. (2) and Eq. (3), it is easy to understand that PFT is a special case of SR when n = 1. According to the famous 1/f law, the amplitude spectrum ||F_I(f)|| of ensembles of natural images obeys the distribution E{||F_I(f)||} ∝ 1/f, where f denotes the frequency [7]. According to Eq. (2) and the 1/f law, E{||M_I(f)||} ∝ f; namely, SR and PFT (k = 0) enhance the energy of high-frequency components and suppress the energy of low-frequency components statistically. In other words, SR
and PFT do not whiten the spectrum of images, but give more weight to small-size patterns. This can give rise to wrong results, as it did in case j in Table 1, where the small "L" had a larger residual rate even though it was the dominant pattern. The other three counter examples in Table 1 were also caused by this limitation of SR and PFT. To overcome this limitation, we extend a unified form of the kernel function that can summarize SR and PFT and also provide an opportunity to balance the energy in different frequency bands, which can be written as
M_I(u, v) = ( (u/l_u)² + (v/l_v)² )^(−k/2) · ( ||F_I(u, v)|| ∗ h_n(u, v) )^(−1),   (4)
where l_u and l_v are the lengths of the spectrum along the u axis and the v axis respectively. It is obvious that Eq. (4) simplifies to SR when k = 0 and simplifies to PFT when k = 0 and n = 1. The factor ((u/l_u)² + (v/l_v)²)^(−k/2) on the right-hand side of Eq. (4) controls the balance of energy in the frequency domain. As a direct result of the 1/f law and Eq. (4), the low-frequency components gain more weight when k > 1 and the high-frequency components gain more weight when k < 1. A reasonable value for saliency detection is k = 1, which assures the whitening of the frequency spectrum. Table 2 displays the corresponding residual rates when the unified framework is used with k = 1 and n = 1. The data listed in Table 2 suggest that the unified framework is capable of defeating the inherent limitations of SR and PFT.

Table 2. Residual rates in the proposed unified framework when k = 1 and n = 1

      a       b       c       d       e       f       g       h       i       j
P1  0.3340  0.3897  0.4594  0.4497  0.1927  0.2008  0.2747  0.2740  0.5953  0.4300
P2  0.2312  0.1984  0.2708  0.2672  0.1164  0.1116  0.1582  0.1586  0.2418  0.4189
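A sketch of the unified kernel (4) applied to a gray-scale image follows (illustrative only; the frequency normalization via fftfreq, the epsilon regularizer at the DC term and the final Gaussian blur are assumptions). Setting k = 0, n = 1 gives PFT, k = 0 with n > 1 gives the SR form of Eq. (2), and k = 1 gives the whitening recommended above.

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def unified_saliency(img, k=1.0, n=1, eps=1e-8):
    """Saliency map obtained by reshaping the amplitude spectrum with Eq. (4)."""
    F = np.fft.fft2(img)
    amp, phase = np.abs(F), np.angle(F)
    smoothed = uniform_filter(amp, size=n) if n > 1 else amp      # ||F_I|| * h_n
    U = np.fft.fftfreq(img.shape[0])[:, None]
    V = np.fft.fftfreq(img.shape[1])[None, :]
    radial = (U ** 2 + V ** 2 + eps) ** (-k / 2.0)                # ((u/l_u)^2 + (v/l_v)^2)^(-k/2)
    modified = amp * radial / (smoothed + eps) * np.exp(1j * phase)
    return gaussian_filter(np.abs(np.fft.ifft2(modified)) ** 2, sigma=2.0)

toy = np.zeros((64, 64)); toy[20:30, 20:30] = 1.0
for name, (k, n) in {"PFT": (0, 1), "SR": (0, 3), "whitened": (1, 1)}.items():
    print(name, unified_saliency(toy, k=k, n=n).max())
```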
4 Frequency Spectrum Modification Attention Model As early FFT-based models, SR and PFT deal only with gray-scale images [5,6]. The later work PQFT is capable of processing color images and dynamic video, for it introduces the concept of the Quaternion Fourier Transform (QFT) [6]. However, the concept of QFT actually goes against Feature Integration Theory (FIT): QFT restricts the model to only four feature channels, and the simulation results of PQFT are not better than applying PFT directly in a multi-channel structure [9]. Based on the proposed framework, we develop a new attention model called Frequency Spectrum Modification (FSM). As plotted in Fig. 2, FSM contains four feature channels, red/green, blue/yellow, intensity and saturation, which are computed as follows:

RG = (r − b + 255)/2,   (5)
BY = (2b − r − g)/4 + 255/2,   (6)
I = (r + g + b)/3,   (7)
S = max(r, g, b) − min(r, g, b),   (8)
where r, g and b are the three primary color components of the input image. Eqs. (5)-(8) limit the responses of each channel to the interval [0, 255]. The filter M_{I_i} for each channel is designed as in Eq. (4), but with a different value of the parameter k. Generally, k ≤ 1 for the intensity channel and 1 ≤ k < 1.2 for the others, because the intensity channel in the human visual system is more sensitive to edges (high-frequency components). According to the analysis in Section 3, it is foreseeable that salient details (edges and textures) will be highlighted in the intensity channel, and proto-objects with salient colors will pop out in the other channels.
Fig. 2. The system structure of FSM
The feature integration in FSM is implemented by lateral competition between the different channels. The winner at each pixel is denoted by W(x, y); we have

W(x, y) = max{ C_i(x, y), i = 1, 2, ..., m },   (9)

where C_i is the feature map of the ith channel. The lateral competition ensures that the salient regions of each channel will not blend together in the final saliency map. At last, lateral excitation is added to smooth the shape of the saliency map. The function of the lateral excitation can be written as

E = ||W||² ∗ G(l, σ),   (10)

where G is a two-dimensional Gaussian filter whose size and covariance are l × l and σ² respectively. The squaring of W in Eq. (10) gives more prominence to the salient regions.
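Putting (5)-(10) together gives a short pipeline. The sketch below is an illustration only; the per-channel k values (chosen within the ranges stated above), the averaging window and the Gaussian parameters of the lateral excitation are assumptions, and the channel formulas follow Eqs. (5)-(8) as printed.

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def channel_saliency(c, k=1.0, n=3, eps=1e-8):
    # per-channel spectrum modification with the unified kernel of Eq. (4)
    F = np.fft.fft2(c)
    amp, phase = np.abs(F), np.angle(F)
    smooth = uniform_filter(amp, size=n) if n > 1 else amp
    U = np.fft.fftfreq(c.shape[0])[:, None]
    V = np.fft.fftfreq(c.shape[1])[None, :]
    radial = (U ** 2 + V ** 2 + eps) ** (-k / 2.0)
    return np.abs(np.fft.ifft2(amp * radial / (smooth + eps) * np.exp(1j * phase))) ** 2

def fsm_saliency(rgb):
    rgb = np.asarray(rgb, dtype=float)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    channels = {
        "RG": (r - b + 255.0) / 2.0,                      # Eq. (5) as printed
        "BY": (2.0 * b - r - g) / 4.0 + 255.0 / 2.0,      # Eq. (6)
        "I":  (r + g + b) / 3.0,                          # Eq. (7)
        "S":  rgb.max(axis=-1) - rgb.min(axis=-1),        # Eq. (8)
    }
    ks = {"RG": 1.1, "BY": 1.1, "I": 1.0, "S": 1.1}       # assumed values in the stated ranges
    maps = np.stack([channel_saliency(c, k=ks[name]) for name, c in channels.items()])
    W = maps.max(axis=0)                                  # lateral competition, Eq. (9)
    return gaussian_filter(W ** 2, sigma=3.0)             # lateral excitation, Eq. (10)
```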
5 Simulation Results A detailed comparison between FFT-based models (SR, PFT and PQFT) and the traditional models (STB [3] and NVT [2]) was drawn by Guo and Zhang [6]. In this paper, only the comparison between FSM and PQFT is considered, for they are both FFT-based approaches and are both capable of dealing with color images. In the first simulation, a group of psychological patterns is introduced to test the performance of the proposed attention model. As shown in Fig. 3, the top line displays
the test psychological images, while the middle line and the bottom line are the saliency maps obtained by FSM and PQFT respectively. For the first test image, PQFT highlights the red bar with the salient orientation while FSM lays more stress on the large rectangle, which shows that FSM is capable of detecting salient proto-objects of larger size and not only small-size details. For the 4th test image, the saliency map of FSM is obviously better than that of PQFT, which shows that FSM is not sensitive to noise. All the results in Fig. 3 indicate that PQFT pays more attention to fine objects with salient orientations but neglects objects with salient scales, and PQFT is also noise-sensitive because it enhances the energy of high-frequency components. In contrast, FSM works well in highlighting all kinds of salient patterns, and it is noise-insensitive.
Fig. 3. The saliency maps for an ensemble of psychological patterns by FSM and PQFT
Fig. 4. The natural image’s saliency maps and corresponding eye’s fixations by FSM and PQFT
In Fig. 4, several natural color images and the corresponding human eye fixations [8] are used to test FSM and PQFT. As shown in Fig. 4, the top-left image in each red box is the input image; the top-right figure is the distribution of 20 observers' eye fixations, where the light areas indicate the locations of the FOA; the bottom-left and bottom-right figures are the saliency maps obtained by FSM and PQFT respectively. The results show that FSM works well even when the input images are complex. By comparing the results, we can see that the saliency maps obtained by FSM resemble the human eye fixations much more closely.
6 Conclusions and Future Works Visual attention selection models have been studied for over two decades. Many researchers pay more attention to biological plausibility and the match between the model's results and real human behavior, but neglect the practicability of the computational model. As a result, most of these models are complicated and time-consuming. The FFT-based approach is a good attempt in this direction: it is fast, simple, effective and easy to apply. By analyzing the algorithms and the results of some FFT-based models, such as SR, PFT and PQFT, we explored the intrinsic mechanism and the inherent limitations of these models. We also developed a unified framework to summarize all FFT-based attention models. Based on it, we established a novel saliency detector called FSM. The simulation results show that the proposed model is more effective in producing saliency maps. Further details of this work are omitted here for brevity and will be discussed in future work.
References 1. Koch, C., Ullman, S.: Shifts in selective visual-attention – towards the underlying neural circuitry. Human Neurobiology 4, 219–227 (1985) 2. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(11), 1254– 1259 (1998) 3. Walther, D., Koch, C.: Modeling attention to salient proto-objects. Neural Networks 19, 1395–1407 (2006) 4. Chen, D.Y., Zhang, L.M., Weng, J.: Spatiotemporal Adaptation in the Unsupervised Development of Networked Visual Neurons. IEEE Trans. on Neural Networks 20(6), 992–1008 (2009) 5. Hou, X., Zhang, L.: Saliency Detection: A Spectral Residual Approach. In: Proc. CVPR (2007) 6. Guo, C.L., Ma, Q., Zhang, L.M.: Spatio-temporal Saliency Detection Using Phase Spectrum of Quaternion Fourier Transform. In: Proc. CVPR (2008) 7. Ruderman, D.: The Statistics of Natural Images. Network: Computation in Neural Systems 5(4), 517–548 (1994) 8. Bruce, N.D., Tsotsos, J.K.: Saliency based on Information Maximization. In: Proc. NIPS (2005) 9. Chen, D.Y., Wu, C.D.: A New Model of Visual Attention Selection Based on Amplitude Modulation Fourier Transform. In: CCDC 2009 (2009)
3D Modeling from Multiple Images Wei Zhang1, Jian Yao2 , and Wai-Kuen Cham1 1
Department of Electronic Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China 2 College of Computer Science and Electronic Information, Guangxi University, Nanning 530004, China
Abstract. Although the visual perception of 3D shape from 2D images is a basic capability of human beings, it remains challenging for computers. Hence, one goal of vision research is to computationally understand and model the latent 3D scene from the captured images, and to provide a human-like visual system for machines. In this paper, we present a method that is capable of building a realistic 3D model of the latent scene from multiple images taken at different viewpoints. Specifically, the reconstruction proceeds in two steps. First, a dense depth map is generated for each input image by a Bayesian-based inference model. Second, a complete 3D model of the latent scene is built by integrating all reliable 3D information embedded in the depth maps. Experiments are conducted to demonstrate the effectiveness of the proposed approach. Keywords: 3D modeling, Depth map, Fusion.
1 Introduction As a popular research topic, image-based 3D scene modeling has attracted much attention in the past decades. In short, the task is to build a realistic 3D representation of the latent scene from a collection of images. Such a technique can be widely applied in various areas such as robot navigation, virtual reality, computer games and art. In this paper, an algorithm is presented which is capable of creating a complete and detailed 3D model from multiple views. The reconstruction proceeds by a two-step process. First, a dense depth map is generated for each view. Then, all reliable 3D information embedded in these input views is integrated into a single model through patch-based fusion. Specifically, a Bayesian-based framework is employed to infer the depth maps of the multiple input images. However, each depth map can only reveal the scene's 3D information at one viewpoint. For a large and complex scene, a single depth map is insufficient to produce the desired detailed and complete structure. Therefore, a patch-based fusion scheme is adopted to integrate all individual modeling structures into a single one. Besides, due to the influence of geometric occlusion, specular reflection and image noise, the resulting depth maps may contain some outlier pixels that have inaccurate depth estimates. Hence, it is necessary to introduce a refinement step to ensure that the tessellated surface patches at each view are derived only from reliable points, and thus avoid fusing these outliers into the final 3D model.
The remainder of this paper is organized as follows. Section 2 reviews some related work. Section 3 introduces a Bayesian-based inference model for depth map generation. Section 4 describes how to build a complete 3D model by patch fusion. Experimental results are shown in Section 5. Section 6 gives some concluding remarks.
2 Related Work Since many efforts such as [1,2,3,4,5,6,7,8,9] have been made to develop new approaches for modeling complex scenes from a single image or multiple images, we only refer to the methods most related to ours. In this work, the depth map recovery problem is formulated in an improved Bayesian-based framework [3], which can be regarded as an extension of [4] and [5]. However, some new contributions have been made. For example, hidden consistency variables are introduced to smooth and integrate the depth maps at the same time. A data-driven regularizer is adopted to preserve the discontinuities at image boundaries. A new visibility prior is defined based on the transformation consistency between different depth maps, which is used to account for the occlusion and noise problems. Also, a bilateral consistency prior is developed to impose spatial smoothness in one depth map and temporal consistency among different depth maps. Moreover, the EM (Expectation Maximization) optimization is implemented in a coarse-to-fine resolution manner. Narayanan et al. [6] presented a technique, Virtualized Reality, to build a complete surface model by merging depth maps into a common volumetric space. They designed a special system, 3D Dome, which consists of 51 cameras, to capture images at multiple viewpoints. Also, a conventional multi-baseline stereo technique was adopted to recover the dense depth maps. Goesele et al. [7] used a robust window-based matching method to produce a depth map for each input image. The depth map result is not dense, since only the pixels that can be matched with high confidence are reconstructed.
3 Depth Map Estimation Given a collection of images taken from different viewpoints, the latent scene will be reconstructed under a Bayesian-based inference model, which is briefly described as follows. More details can be found in [3]. Finally, the latent scene will be represented by a set of depth maps. From a small collection of N input images I = {Ii , i = 1, . . . , N } and a sparse set of Np 3D scene points Z = {Zp , p = 1, . . . , Np } precalculated based on camera self-calibration and stereo feature matching, we intend to estimate the unknown model θ = (D, I ∗ ) where D = {Di , i = 1, . . . , N } and I ∗ = {Ii∗ , i = 1, . . . , N } represent the sets of estimated depth maps and estimated images, respectively. In fact, I ∗ corresponds to the input image set I. The variable τ represents the set of parameters that will be fixed or heuristically updated in our inference system. To efficiently deal with occlusion, specular reflection and image noise, we introduce a set of hidden
Fig. 1. Network representation of the joint probability decomposition. Arrows represent statistical dependencies between variables.
visibility variables V = {Vj,xi |xi ∈ Ii , i, j = 1, · · · , N } based on priors of transformation consistencies in the geometrical sense where Vj,xi is a boolean variable that denotes whether the pixel xi in Ii is visible or not in Ij . In addition, a set of hidden consistency variables C = {Cj,yi ,xi |xi ∈ Ii , yi ∈ N (xi ), i, j = 1, · · · , N } are introduced to smooth and integrate the depth maps while ensuring consistencies among different depth maps and allowing discontinuities based on priors of local gradients of the estimated images. In specific, Cj,yi ,xi is a boolean variable that denotes whether the pixels xi and yi are consistent or not via transformation w.r.t. Ij . After defining all the variables (I, Z, I ∗ , D, V, C, τ ), next step of the Bayesian modeling task is to choose a suitable decomposition of their joint probability p(I, Z, I ∗ , D, V, C, τ ). The decomposition defines the statistical dependencies between the variables involved in our proposed model. Based on the proposed decomposition shown in Fig.1, the joint probability can be written as: p(I, Z, I ∗ , D, V, C, τ ) = p(τ )p(I ∗ |τ )p(V|D, τ )
p(C|I∗, τ) p(D|I∗, C, τ) p(Z|D, τ) p(I|I∗, D, V, τ).   (1)

Each term of the decomposition in (1) will be introduced briefly as follows. p(τ), which defines the prior probability of all involved parameters, is assumed to be uniform and thus is ignored in this work. p(I∗|τ) denotes the prior of the images to be estimated. In general, this term is introduced to enforce that the estimated images I∗ look more like natural images. p(V|D, τ) is the consistent visibility prior that depends on D and τ. p(C|I∗, τ) is the bilateral consistency prior that depends on I∗ and τ. p(D|I∗, C, τ) is the prior on the depth maps given I∗, C and τ. p(Z|D, τ) is the likelihood of the input 3D scene points with known visibility values; it measures the similarity between the model and the input scene points and is used to preserve the correspondences appearing in these precalculated 3D scene points. p(I|I∗, D, V, τ) is the likelihood of the input images,
which measures the similarity between the unknown model θ = (D, I∗) and the input image data. In summary, the Bayesian-based inference problem can be recast as estimating θ = (D, I∗) as below:

θ̂ = arg max_θ p(θ | I, Z, τ) = arg max_θ ∫_V ∫_C p(I, Z, I∗, D, V, C, τ) dV dC.   (2)

In the implementation, the EM optimization strategy is adopted to solve (2) and produce the desired depth maps. In particular, EM is implemented efficiently with a coarse-to-fine resolution scheme.
4 Create 3D Model by Patch Fusion When the depth map is fixed, the 3D structure of each image can be created by triangulation. Next, to seek a more complete and detailed model of the latent scene, we integrate these tessellated structures obtained at different views into a single model. However, although the above Bayesian-based inference model provides a fairly reliable way to estimate depth maps, it is inevitable that some pixels have inaccurate depth estimates due to the influence of geometric occlusion, specular reflection and image noise. To remove the influence of these outlier pixels, the depth map is first refined with the guidance of each pixel's visibility. A binary mask M_i (i = 1, ..., N) is defined for each image I_i, where M_i(x_i) = 1 denotes that the depth estimate d_i(x_i) of pixel x_i in image I_i is reliable; otherwise, it is unreliable. As a typical multi-view stereo method, our Bayesian-based inference system can impose effective constraints when points of the scene are visible in three views or more. Therefore, a criterion can be defined based on the visibility map as in (3): if a pixel x_i in image I_i is visible in at least k neighboring views (k ≥ 3), its depth estimate is regarded as reliable,

M_i(x_i) = 1 if Σ_{j=1}^{N} V_{j,x_i} ≥ k, and 0 otherwise.   (3)

As addressed in [10], the visibility map can be estimated in a straightforward way, while in this work it is formulated into the Bayesian-based inference model by introducing the visibility prior p(V|D, τ) mentioned in the last section. Hence, the visibility map of each image is produced more robustly as a by-product of the above depth map estimation system. However, since the visibility estimates may also contain outliers, an additional refinement step is introduced based on the criterion that if the neighbors of a pixel have reliable depth estimates, this pixel should also have a reliable depth estimate; otherwise, its current depth estimate is probably unreliable. Since the outliers look like salt-and-pepper noise in the binary mask M_i, an adaptive median filter is employed to remove them. The reasons for using a median filter are as follows. First, the visibility mask is a binary image, so the value of each pixel can only be 1 or 0; a median filter substitutes a neighborhood value for the false one, so the filtered mask remains binary. Second, as a non-linear filtering technique, it works particularly well in removing shot and isolated noise while preserving edges.
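A minimal sketch of the reliability criterion (3) and the median-filter clean-up just described follows (illustrative only; V is assumed to be an N-by-H-by-W boolean stack of visibility maps for one reference image, a plain rather than adaptive median filter is used, and the window size is an assumption).

```python
import numpy as np
from scipy.ndimage import median_filter

def reliability_mask(V, k=3, win=5):
    """Eq. (3): reliable if visible in at least k neighboring views, then median-filtered."""
    M = (V.sum(axis=0) >= k).astype(np.uint8)   # raw binary mask from the visibility maps
    return median_filter(M, size=win)           # removes salt-and-pepper outliers, stays binary

V = np.random.default_rng(0).random((6, 100, 120)) > 0.4   # toy visibility stack of 6 views
mask = reliability_mask(V, k=3)
```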
After fixing the mask for each input image, we are able to preserve pixels with reliable depth estimates and discard the outlier ones. For each image, a set of surface patches is created by tessellating the points that have reliable depth estimates. Motivated by the work on range image data [11,12], we adopt the volumetric fusion technique to integrate all the structure patches into a single 3D model, due to some of its desirable properties such as resilience to noise, simplicity of design, and non-iterative operation. As in [11], a weighting function W(p) and a cumulative signed distance function Dis(p) are defined in (4) and (5), respectively, where p denotes a point of the structure. Dis(p) is constructed by combining the signed distance functions d_1(p), ..., d_n(p) with their corresponding weight factors w_1(p), ..., w_n(p):

W(p) = Σ_{i=1}^{n} w_i(p),   (4)

Dis(p) = Σ_{i=1}^{n} w_i(p) d_i(p) / W(p).   (5)
In the implementation, the functions are cast on a discrete voxel grid of a 3D volume. Finally, an isosurface corresponding to Dis(p) = 0 can be extracted by employing Marching Cubes [13]. Please refer to [11] for more technical details.
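The running sums (4)-(5) can be maintained incrementally on the voxel grid, one depth map at a time. The sketch below is a much simplified illustration, not the implementation of [11]; the caller is assumed to supply the per-view signed distances d_i(p) and weights w_i(p) already sampled on the grid, whose resolution is an assumption.

```python
import numpy as np

class VolumetricFusion:
    """Running W(p) of Eq. (4) and Dis(p) of Eq. (5) over a voxel grid."""
    def __init__(self, shape=(64, 64, 64)):
        self.W = np.zeros(shape)          # accumulated weights, Eq. (4)
        self.WD = np.zeros(shape)         # accumulated weighted signed distances

    def integrate(self, d_i, w_i):
        # fold in one view's signed distance field d_i(p) with weight w_i(p)
        self.W += w_i
        self.WD += w_i * d_i

    def distance(self):
        # Dis(p) = sum_i w_i(p) d_i(p) / W(p); the fused surface is the isosurface Dis(p) = 0
        with np.errstate(invalid="ignore", divide="ignore"):
            return np.where(self.W > 0, self.WD / self.W, np.nan)
```

The zero isosurface can then be extracted with a Marching Cubes routine (for example, skimage.measure.marching_cubes), mirroring the extraction step cited as [13].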
Fig. 2. Testing on Cityhall sequence. (a) shows one sample image of the sequence. (b) is the estimated depth map of (a). (c) shows the textured 3D modeling result. (d), (e) and (f) show some close-up comparisons between the fused untextured 3D model and the corresponding image view. In each pair, left shows a sight in the image, right shows the output 3D structure.
5 Experimental Results In this section, the proposed algorithm is tested on different kinds of image sequences to demonstrate its effectiveness. Cityhall shows a complex scene with significant depth discontinuities. Seven images were captured arbitrarily in a wide-baseline setting. Fig. 2(a) shows one sample image. Fig. 2(b) and (c) show the corresponding depth map and textured 3D model respectively. Some parts of the final fused model are enlarged and compared with the image, as shown in Fig. 2(d), (e) and (f). The proposed method produced a good 3D model with abundant details.
Fig. 3. Testing on Dinosaur sequence. (a) shows two sample images of the sequence. (b) shows two views of the fused complete 3D model (untextured). (c), (d) and (e) show some close-up comparisons between the fused untextured 3D model and the corresponding image view. In each pair, left shows a sight in the image, right shows the output 3D structure.
Dinosaur is a Turn-Table sequence which consists of 36 images [14]. This data is used to demonstrate that our method is able to build a complete and detailed 3D model. Two sample images are shown in Fig.3(a). Fig.3(b) shows two shots of the output 3D model reconstructed by fusing 36 structures. As shown in the close-up comparisons in Fig.3(c), (d) and (e), the generated 3D structure is highly faithful to the truth.
6 Conclusions In this paper, we presented an image-based modeling method to create a realistic 3D model for complex scenes. The motivation of this work is as follows. Each image reveals
a certain characteristic of the latent scene at one viewpoint. Hence, we intend to exploit the individual 3D information at each view and then combine the reliable estimates to produce a complete and more detailed 3D model for the latent scene. Experimental results demonstrated the effectiveness of our method. A complete 3D model can be built if enough images which contain all information about the latent scene are given. However, the proposed approach shares the common limitation of most 3D modeling methods. For example, it cannot work well when serious geometric occlusion, specular reflection or noise occurs in the input image sequence. In the future, we would like to fuse the input images to texture the generated 3D model.
References 1. Saxena, A., Sun, M., Ng, A.Y.: Make3D: Learning 3D scene structure from a single still image. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI) 31, 824–840 (2009) 2. Fitzgibbon, A., Wexler, Y., Zisserman, A.: Image-based rendering using image-based priors. In: Proceedings of ICCV, vol. 2, pp. 1176–1183 (2003) 3. Yao, J., Cham, W.K.: Consistent 3D modeling from multiple widely separated images. In: Proceedings of ECCV Workshop on WRUPKV. LNCS. Springer, Heidelberg (2006) 4. Strecha, C., Fransens, R., Gool, L.V.: Wide-baseline stereo from multiple views: a probabilistic account. In: Proceedings of CVPR, vol. 1, pp. 552–559 (2004) 5. Gargallo, P., Sturm, P.: Bayesian 3D modeling from images using multiple depth maps. In: Proceedings of CVPR, vol. 2, pp. 885–891 (2005) 6. Narayanan, P., Rander, P., Kanade, T.: Constructing virtual worlds using dense stereo. In: Proceedings of ICCV, pp. 3–10 (1998) 7. Goesele, M., Curless, B., Seitz, S.M.: Multi-view stereo revisited. In: Proceedings of CVPR, vol. 3, pp. 1278–1285 (2006) 8. Tan, P., Zeng, G., Wang, J., Kang, S., Quan, L.: Image-based tree modeling. In: Proceedings of SIGGRAPH, vol. 26(3) (2007) 9. Sinha, S.N., Steedly, D., Szeliski, R., Agrawala, M., Pollefeys, M.: Interactive 3D architectural modeling from unordered photo collections. ACM Trans. on Graphics (TOG) 27, 1–10 (2008) 10. Szeliski, R.: A multi-view approach to motion and stereo. In: Proceedings of CVPR, vol. 1, pp. 23–25 (1999) 11. Curless, B., Levoy, M.: A volumetric method for building complex models from range images. In: Proceedings of SIGGRAPH, pp. 303–312 (1996), http://grail.cs.washington.edu/software-data/vrip/ 12. Hilton, A., Illingworth, J.: Geometric fusion for a hand-held 3D sensor. Machine Vision and Applications 12(1), 44–51 (2000) 13. Lorensen, W.E., Cline, H.E.: Marching cubes: a high resolution 3D surface construction algorithm. In: Proceedings of SIGGRAPH, vol. 21, pp. 163–169 (1987) 14. Fitzgibbon, A.W., Cross, G., Zisserman, A.: Automatic 3D model construction for turn-table sequences. In: Koch, R., Van Gool, L. (eds.) SMILE 1998. LNCS, vol. 1506, pp. 155–170. Springer, Heidelberg (1998)
Infrared Face Recognition Based on Histogram and K-Nearest Neighbor Classification Shangfei Wang and Zhilei Liu Key Lab of Computing and Communicating Software of Anhui Province, School of Computer Science and Technology, University of Science and Technology of China, HeFei, Anhui, P.R. China
[email protected],
[email protected]
Abstract. Infrared facial images record the temperature-field distribution of facial vein embranchment, which can be regarded as gray features of images. This paper proposes an infrared face recognition algorithm using histogram analysis and K-Nearest Neighbor Classification. Firstly, the irregular facial region of an infrared image is segmented by using the flood-fill algorithm. Secondly, the histogram of this irregular facial region is calculated as the feature of the image. Thirdly, K-Nearest Neighbor is used as a classifier, in which Histogram Matching method and Histogram Intersection method are adopted respectively. Experiments on Equinox Facial Database showed the effectiveness of our approach, which are robust to facial expressions and environment illuminations. Keywords: Infrared face recognition, histogram analysis, K-Nearest Neighbor Classification.
1 Introduction Nowadays, face recognition technology has a wide range of applications related to security and safety industry. The traditional technology of face recognition from visible images is easily affected by the environment illumination or facial expression changes, which is unable to meet the needs of practical application. With the development of infrared technology, thermal infrared face recognition has received more and more attention. It is independent of illumination, since infrared images record the temperature-field distribution of facial vein embranchment [1-4, 9, 11]. There are two kinds of representative methods in the domain of infrared spectrum. One of them is based on physiological or medical knowledge [1-7], such as blood perfusion and its variety, which try to extract the blood vessel distribution as the facial features for recognition. However, the spatial resolution and thermal accuracy of current infrared cameras are not high enough to detect blood vessel properly. The other kind of methods regards thermograms as gray image. Several statistic algorithms usually used in the domain of visible spectrum, like eigenfaces, local feature analysis, linear discriminant analysis and independent component analysis, are adopted for the recognition to demonstrate their potential in the infrared domain [8- 10]. L. Zhang, J. Kwok, and B.-L. Lu (Eds.): ISNN 2010, Part II, LNCS 6064, pp. 104–111, 2010. © Springer-Verlag Berlin Heidelberg 2010
Comparing with the visible images, infrared images reflect the temperature distributions of human face, details on the outline of organs are very blurry. Therefore, the infrared face recognition should focus on gray distribution and texture features, which reflect the pattern of blood vessels on each face [10]. This paper proposes an infrared face recognition algorithm using histogram analysis and K-Nearest Neighbor (KNN) classification. We first use flood-fill algorithm to segment irregular facial region of infrared images. The histogram features are then extracted from the segmented irregular facial region. After that, KNN is adopted as a classifier using histogram matching method and histogram intersection method respectively. Finally, the experiments under several different conditions are conducted on the Equinox facial database, in which the influences of facial expressions, illumination conditions and eyeglasses are taken into considered. Excellent and stable recognition performances are achieved under the multifacial expression and multi-illumination databases, illustrating the robustness of our approach to facial expressions and illumination conditions. Our approach also performs well on person with glasses. However, when we try to identify a person using glasses as camouflage, no such good or stable recognition performances are achieved. This indicates that our system is not robust to eyeglass. The experimental results also demonstrate that histogram intersection method outperforms histogram matching method. Comparing with the related work in infrared face recognition field, our contributions can be summarized as follows: (1) Most studies segment regular face regions like rectangle or ellipse, while our approach uses flood fill algorithm to obtain the irregular facial regions, which means the size and shape of the segmented face region is dependent on the subject. Thus, it is useful for face recognition. (2) To the best of our knowledge, few researches have been reported to recognize face using histogram from infrared thermal images. We introduce histogram as the feature of thermal images. Experiments showed it is simple and effective. The rest of this paper is organized as follows. Section 2 introduces our framework for face recognition based on histogram and K-Nearest Neighbor classification in detail. The experiments and results are represented in Section 3. Section 4 concludes this paper.
2 Face Recognition Using Histogram and KNN Classifier The framework of our face recognition system based on histogram and K- Nearest Neighbor classification is showed in Fig. 1.
[Fig. 1 block diagram: Infrared Face Image Database → Face Segmentation → Histogram Extraction → KNN Classifier → Recognition Result]
Fig. 1. Architecture of Infrared Face Recognition system based on histogram analysis and KNN classification
2.1 Irregular Face Segmentation Using Flood-Fill Algorithm Normally, the human body keeps a constant average temperature, and human skin has an average emissivity between 0.98 and 0.99, which is higher than that of most other substances [3]. Thus, the grey-scale distribution of the human skin region is significantly different from that of other objects. It is therefore possible to segment the face region from the background using the flood-fill algorithm, which determines the area connected to a given node in a multi-dimensional array. Three parameters must be given manually at first: a seed point inside the face region, and a lower and an upper limit of grey-scale value. The pixels around the seed point whose grey-scale values lie between the lower and upper limits are connected to form a region, as shown in Fig. 2(b). We then convert this connected region into a binary mask image, as in Fig. 2(c). After that, we obtain the irregular face region image by multiplying the original image with the mask [14].
Fig. 2. Segmentation of the irregular facial region. (a) Thermal face image (b) Irregular Facial region (c) Facial Mask Image (d) Target Facial Image.
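A minimal sketch of this segmentation step is given below, assuming OpenCV and NumPy; the seed point and the grey-level limits stand in for the manually chosen parameters mentioned above, and their values are purely illustrative.

```python
import cv2
import numpy as np

def segment_face(thermal, seed, lo=10, hi=10):
    """Flood-fill from a manually chosen seed inside the face region,
    then keep only the connected skin region via a binary mask."""
    h, w = thermal.shape
    mask = np.zeros((h + 2, w + 2), np.uint8)          # floodFill mask is 2 px larger than the image
    flags = 4 | cv2.FLOODFILL_MASK_ONLY | (255 << 8)   # 4-connectivity, write 255 into the mask only
    cv2.floodFill(thermal.copy(), mask, seed, 0,
                  loDiff=lo, upDiff=hi, flags=flags)
    face_mask = mask[1:-1, 1:-1]                       # binary mask of the connected face region
    face = (thermal * (face_mask > 0)).astype(thermal.dtype)
    return face, face_mask

# Usage (illustrative): face, mask = segment_face(gray_image, seed=(64, 80))
```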
2.2 Histogram Extraction The geometric property of the infrared images is not clear enough for image analysis. We believe that it is reasonable to analyze the infrared image from the aspect of grey scale distribution because it reflects the facial thermal distribution of the subject in the infrared facial image. Here histogram of the irregular facial region is used. The histogram of an image is a 1-D discrete function which is given as equation (1):
H(k) = \frac{n_k}{N}, \quad k = 0, 1, \ldots, L - 1.  (1)
where N is the total number of pixels, L is the number of grey levels, and n_k is the number of pixels with grey level k. 2.3 K-Nearest Neighbor Classification Rule Here, KNN is utilized as the classifier of our face recognition approach. A face image is identified by a majority vote of its neighbors. Given an image database which contains
images of n different people {W_1, W_2, ..., W_n}, where person W_i has N_i images. For an input image X to be identified, the distances between X and all N = N_1 + N_2 + ... + N_n database images are calculated. Then X is identified as the person who has the most images among the k nearest neighbors. Two kinds of distance measures are used, the histogram matching method and the histogram intersection method, which are described in the following. Histogram matching method (HMM)
The distance between two histograms can be measured by the Euclidean distance. The histogram distance between the histogram H_q of the image Q to be identified and the histogram H_d of an image D in the database is calculated as formula (2):

M_E(Q, D) = \sqrt{\sum_{i=1}^{L} \big(H_q(i) - H_d(i)\big)^2}  (2)
Histogram intersection method (HIM)
The matching value P(Q, D) between the histogram H_q of the image Q to be identified and the histogram H_d of an image D in the database can also be calculated using the histogram intersection method, defined as formula (3):

P(Q, D) = \frac{\sum_{k=0}^{L-1} \min\big(H_q(k), H_d(k)\big)}{\sum_{k=0}^{L-1} H_q(k)}  (3)
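The two distance measures and the majority-vote rule can be prototyped as follows (NumPy assumed; the value of k and the label array are illustrative, not taken from the paper).

```python
import numpy as np

def hist_matching_dist(hq, hd):
    # Euclidean-type distance between normalized histograms, cf. (2)
    return np.sqrt(np.sum((hq - hd) ** 2))

def hist_intersection_score(hq, hd):
    # Histogram intersection matching value, cf. (3); larger means more similar
    return np.sum(np.minimum(hq, hd)) / np.sum(hq)

def knn_identify(hq, db_hists, db_labels, k=3, use_intersection=True):
    if use_intersection:
        scores = np.array([hist_intersection_score(hq, hd) for hd in db_hists])
        nearest = np.argsort(-scores)[:k]     # k largest matching values
    else:
        dists = np.array([hist_matching_dist(hq, hd) for hd in db_hists])
        nearest = np.argsort(dists)[:k]       # k smallest distances
    votes = db_labels[nearest]
    persons, counts = np.unique(votes, return_counts=True)
    return persons[np.argmax(counts)]         # majority vote among the k neighbors
```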
3 Experiments and Discussion 3.1 Experimental Condition
To evaluate the effectiveness of our proposed method, the public available database collected by Equinox Corporation [13] was used in our experiments. The database was collected in several different conditions, consisting of persons with or without glasses, long wavelength and medium wavelength (LW and MW), frontal/leftward/rightward lighting conditions (FL, LL and RL) and frowning /smiling /surprising facial expressions. Three kinds of experiments under different conditions have been designed as
follows to verify the robustness to facial expression, illumination and eyeglasses. Recognition rate is used to evaluate the performance and K is set to 3 in the KNN classifier. 3.2 Experiments on the Robustness to Facial Expressions
Considering the factors of eyeglasses, wavelengths and illumination conditions, 12 sub-databases, as shown in Table 1, were selected from the Equinox database. Every sub-database contains three kinds of facial expressions for each sample. Using the KNN classifier with the histogram matching method and the histogram intersection method respectively, experiments were carried out on each sub-database, and the recognition results are shown in Table 1, in which the number of images in each sub-database is also listed.

Table 1. The recognition results on the database with different facial expressions

Conditions                      HMM     HIM
Glasses on,  LW, FL (111)       1       1
Glasses on,  LW, RL (111)       0.982   0.982
Glasses on,  LW, LL (111)       0.991   1
Glasses on,  MW, FL (60)        0.983   0.983
Glasses on,  MW, RL (60)        1       1
Glasses on,  MW, LL (60)        1       1
Glasses off, LW, FL (225)       0.978   0.996
Glasses off, LW, RL (120)       0.975   0.983
Glasses off, LW, LL (120)       0.958   0.958
Glasses off, MW, FL (123)       0.992   1
Glasses off, MW, RL (120)       0.975   0.992
Glasses off, MW, LL (120)       0.975   0.983
Average recognition rate        0.984   0.990
Table 2. The recognition results on the database with different lighting conditions

Conditions                          HMM     HIM
Glasses on,  LW, Frown (120)        0.975   0.983
Glasses on,  LW, Smile (120)        0.983   1
Glasses on,  LW, Surprise (111)     0.937   0.937
Glasses on,  MW, Frown (69)         0.986   0.986
Glasses on,  MW, Smile (72)         0.958   0.986
Glasses on,  MW, Surprise (60)      0.967   1
Glasses off, LW, Frown (159)        0.955   0.994
Glasses off, LW, Smile (159)        0.974   0.987
Glasses off, LW, Surprise (156)     0.974   0.994
Glasses off, MW, Frown (135)        0.985   1
Glasses off, MW, Smile (135)        0.970   0.985
Glasses off, MW, Surprise (120)     0.983   1
Average recognition rate            0.971   0.990
As we can see from Table 1, good and stable recognition results were achieved on all of the infrared sub-databases, with average recognition rates of 0.984 and 0.990, respectively. Furthermore, the recognition rate reached 100% in 3 cases with the histogram matching method and in 5 cases with the histogram intersection method. All the results in Table 1 demonstrate the robustness of our approach to facial expression. The reason may be that the histogram neglects the spatial differences caused by expression changes. In addition, from the recognition results in Table 1, we can see that the KNN classifier using the histogram intersection method outperforms the one using the histogram matching method. 3.3 Experiments on the Robustness to Illuminations
Considering the factors of eyeglasses, wavelengths and facial expressions, 12 sub-databases, as shown in Table 2, were selected from the Equinox database for this experiment. In each sub-database, every sample has three facial images under different illumination conditions: frontal, leftward and rightward. Experiments were carried out on each sub-database, and the recognition results are shown in Table 2, in which the number of images in each sub-database is also listed. It is observed from Table 2 that excellent and stable recognition results were achieved on all of the infrared sub-databases, with average recognition rates of 0.971 and 0.990, respectively. In addition, the recognition rate reached 100% in 4 cases with the histogram intersection method. All the results in Table 2 demonstrate the robustness of our proposed approach to illumination, which benefits from the illumination insensitivity of infrared images. Furthermore, the KNN classifier using the histogram intersection method again outperforms the one using the histogram matching method. 3.4 Experiments on the Influence of Eyeglasses
To verify the influence of eyeglasses on our approach, experiments were designed as follows. Firstly, experiments were implemented on the sub-databases selected in Section 3.2. Given two sub-databases with the same wavelength and lighting condition, one with glasses on and the other with glasses off, the cross-recognition rate was calculated. The six groups of recognition results are summarized in Table 3. Secondly, similar experiments were implemented on the sub-databases selected in Section 3.3, and the six groups of recognition results are given in Table 4.

Table 3. The recognition results between multi-expression databases with glasses-on and glasses-off

Conditions                 HMM       HIM
LW, FL (96)                0.44792   0.55208
LW, RL (69)                0.53623   0.59420
LW, LL (63)                0.60318   0.68254
MW, FL (60)                0.50000   0.65000
MW, RL (60)                0.63333   0.75000
MW, LL (57)                0.70175   0.77193
Average recognition rate   0.57040   0.66679
Table 4. The recognition results between multi-illumination databases with glasses-on and glasses-off

Conditions                 HMM       HIM
LW, Frowning (54)          0.59259   0.74074
LW, Smiling (54)           0.66667   0.74074
LW, Surprising (54)        0.55556   0.66667
MW, Frowning (69)          0.57971   0.72464
MW, Smiling (69)           0.62319   0.81160
MW, Surprising (57)        0.61404   0.75439
Average recognition rate   0.60529   0.73980
It is observed from Table 1 and Table 2 that, within a sub-database in which all the subjects have their glasses on (or all have them off), good and relatively stable recognition results are achieved. As we know, thermal radiation cannot pass through glasses, so the eyeglass region becomes a distinctive characteristic of a person. This may explain the good recognition performance for a person wearing glasses. It is also the reason behind the poor recognition performance when a person uses glasses as camouflage, as demonstrated in Table 3 and Table 4, in which all the recognition results are poor and not stable enough.
4 Conclusion and Future Work In this paper we have proposed an infrared face recognition algorithm using histogram analysis and K-Nearest Neighbor classification. Given an input thermal infrared facial image, we first obtain the irregular facial region using the flood-fill algorithm. Then the grey-level histogram of this irregular facial region is extracted. After that, the KNN classification rule, based on two different histogram distance measures (the histogram matching method and the histogram intersection method), is utilized as the classifier. Finally, several experiments under different conditions were carried out on the Equinox facial database. The experimental results demonstrate the effectiveness of our face recognition system, which is robust to facial expressions and illumination conditions. Although good results have been achieved, there are still limitations in our approach. For instance, the face segmentation algorithm should be improved to realize automatic face segmentation and to become robust to ill-registered images, which is necessary in practical applications. Furthermore, although the KNN classifier is simple and effective in our face recognition system, it requires all the images in the database to be considered when identifying an input image, so it is time-consuming and computation-intensive. Other classifiers will be investigated and tested in our future work. Acknowledgments. The authors would like to thank the Equinox Corporation for providing the NIST/Equinox Visible and Infrared Face Image Database, available on the web at [13]. This paper is supported by the National 863 Program (2008AA01Z122), the Anhui Provincial Natural Science Foundation (No. 070412056) and the SRF for ROCS, SEM.
References 1. Wu, S., Lin, W., Xie, S.: Skin Heat Transfer Model of Facial Thermograms and Its Application in Face Recognition. Pattern Recognition 41(8), 2718–2729 (2008) 2. Buddharaju, P., Pavlidis, I.T., Tsiamyrtzis, P., Bazakos, M.: Physiology-Based Face Recognition in the Thermal Infrared Spectrum. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(4), 613–626 (2007) 3. Wu, S.Q., Wei, L.Z., Fang, Z.J., Li, R.W., Ye, X.Q.: Infrared face recognition based on blood perfusion and sub-block DCT in wavelet domain. In: Proceedings of the 2007 International Conference on Wavelet Analysis and Pattern Recognition, ICWAPR 2007, pp. 1252–1256 (2007) 4. Wu, S.Q., Gu, Z.H., Chia, K.A., Ong, S.H.: Infrared Facial Recognition using modified blood perfusion. In: 2007 6th International Conference on Information, Communications and Signal Processing, ICICS, Singapore, pp. 1–5 (2007) 5. Pavlidis, I., Tsiamyrtzis, P., Manohar, C., Buddharaju, P.: Biometrics: Face Recognition in Thermal Infrared. In: Biomedical Engineering Handbook, February 2006. CRC Press, Boca Raton (2006) 6. Buddharaju, P., Pavlidis, I., Tsiamyrtzis, P.: Physiology-based face recognition using the vascular network extracted from thermal facial images: a novel approach. In: Proceedings of the IEEE Advanced Video and Signal Based Surveillance, Italy, pp. 354–359 (2005) 7. Socolinsky, D.A., Selinger, A.: Thermal Face Recognition over Time. In: Proceedings of the Pattern Recognition, 17th International Conference on (ICPR 2004), vol. 4, pp. 187– 190 (2004) 8. Socolinsky, D.A., Selinger, A.: A Comparative Analysis of Face Recognition Performance with Visible and Thermal Infrared Imagery. In: Proc. of ICPR 2002, vol. 4, pp. 217–222 (2002) 9. Selinger, A., Socolinsky, D.A.: Appearance-Based Facial Recognition Using Visible and Thermal Imagery: A Comparative Study. Equinox Corporation no. 02–01, Technical Report, 2002 (2001) 10. Prokoski, F.: History, current status, and future of infrared identification. In: Proceedings of IEEE Workshop on Computer Vision beyond the Visible Spectrum: Methods and Applications, pp. 5–14 (2000) 11. Guyton, A.C., Hall, J.E.: Textbook of Medical Physiology, 9th edn. WB Saunders Company, Philadelphia (1996) 12. NIST/Equinox Visible and Infrared Face Image Database, http://www.equinoxsensors.com/products/HID.htm 13. Flood-Fill Algorithm, http://en.wikipedia.org/wiki/Flood_fill
Palmprint Recognition Using 2D-Gabor Wavelet Based Sparse Coding and RBPNN Classifier Li Shang1, Wenjun Huai1, Guiping Dai1, Jie Chen1, and Jixiang Du2,3,4 1
Department of Electronic Information Engineering, Suzhou Vocational University, Suzhou 215104, Jiangsu, China 2 Department of Computer Science and Technology, Huaqiao University, Quanzhou 362021, Fujian, China 3 Department of Automation, University of Science and Technology of China, Anhui 230026, Hefei, China 4 Institute of Intelligent Machines, Chinese Academy of Sciences, Anhui 230031, Hefei, China {sl0930,hwj,dgp,cj}@jssvc.edu.cn,
[email protected]
Abstract. This paper proposed a novel and successful method for recognizing palmprint using 2D-Gabor wavelet filter based sparse coding (SC) algorithm and the radial basis probabilistic neural network (RBPNN) classifier proposed by us. Features of Palmprint images are extracted by this SC algorithm, which exploits feature coefficients’ Kurtosis as the maximum sparse measure criterion and a variance term of sparse coefficients as the fixed information capacity. At the same time, in order to reduce the iteration time, features of 2D-Gabor wavelet filter are also used as the initialization feature matrix. The RBPNN classifier is trained by the orthogonal least square (OLS) algorithm and its structure is optimized by the recursive OLS algorithm (ROLSA). Experimental results show that this SC algorithm is successful in extracting features of palmprint images, and the RBPNN model achieves higher recognition rate and better classification efficiency with other usual classifiers. Keywords: Sparse coding; 2D-Gabor wavelet filter; Palmprint recognition; RBPNN; Classifier.
1 Introduction Currently, many recognition methods, such as the nearest feature line method [1], the cosine measure [2], the Fisher classifier[3] and neural networks method [4,5], Fourier transform[6], wavelets-based transform [7], principal component analysis (PCA), independent component analysis (ICA)[8], and sparse coding[9] and so on, have been proposed. For these algorithms of PCA, ICA and SC, the significant advantage is that they rely only on the statistic property of image data. However, the PCA can only separate pairwise linear dependencies between pixels, in contrary, ICA and SC are very sensitive to these high-order statistics. Particularly, when ICA is applied to natural images, it is just a particular SC. Because of the sparse structures of natural L. Zhang, J. Kwok, and B.-L. Lu (Eds.): ISNN 2010, Part II, LNCS 6064, pp. 112–119, 2010. © Springer-Verlag Berlin Heidelberg 2010
images, SC is more suitable for processing natural images than ICA. Hence, the SC method has been widely used in natural image processing [10]. The contribution of this paper is a novel and successful palmprint recognition method, which utilizes a sparse coding (SC) algorithm based on the maximum-Kurtosis sparseness measure criterion and a deterministic initialization basis function to extract palmprint image features, and the radial basis probabilistic neural network (RBPNN) model to implement the recognition task.
2 Initialization Features Based on 2D-Gabor Wavelet The 2D Gabor wavelet function is defined as follows:
g_{mn}(x, y) = K\, g(x_g, y_g) \cdot \cos\big[-2\pi (U_0 x + V_0 y) - P\big].  (1)
where K is the normalization parameter, m is the orientation index (m = 1, 2, ..., M), and n is the scale index within each orientation (n = 1, 2, ..., N). U_0 and V_0 describe the 2D simple-harmonic wave and denote the spatial frequency of the 2D Gabor function; P is the modulation parameter. The Gabor function can be seen as the modulation of a 2D simple-harmonic wave by a 2D elliptical Gaussian function. The 2D elliptical Gaussian coordinates x_g and y_g must satisfy the following conditions:

x_g = (x - x_0)\cos(\theta_n) + (y - y_0)\sin(\theta_n),
y_g = -(x - x_0)\sin(\theta_n) + (y - y_0)\cos(\theta_n).  (2)
where the coordinate (x_0, y_0) denotes the center of the Gaussian function, and the parameter θ_n, defined as θ_n = nπ/N, is the rotation azimuth.
For an image I ( x, y ) , its 2D Gabor wavelet transform can be written as: A mn ( x , y ) = ∫ I ( x0 , y 0 ) * g mn ( x − x0 , y − y 0 ) dx0 dy 0 .
(3)
The filter energy mean of each magnitude map calculated is defined as follows: φ1mn = μmn = ∫∫ A mn ( x, y ) dxdy .
(4)
where φ1mn is the 2D Gabor wavelet basis and behaves the property of simple cell receptive fields.
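A simplified NumPy sketch of the Gabor kernel of (1)-(2) and of an initialization filter bank is given below; the envelope width, carrier frequencies and kernel size are illustrative assumptions, since the paper does not list its numerical settings.

```python
import numpy as np

def gabor_kernel(size, theta_n, U0, V0, P=0.0, sigma=4.0, K=1.0):
    """2D Gabor kernel following (1)-(2): a Gaussian envelope in coordinates
    rotated by theta_n, modulated by a cosine carrier of frequency (U0, V0)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    xg = x * np.cos(theta_n) + y * np.sin(theta_n)
    yg = -x * np.sin(theta_n) + y * np.cos(theta_n)
    gauss = np.exp(-(xg ** 2 + yg ** 2) / (2.0 * sigma ** 2))   # circular Gaussian for simplicity
    carrier = np.cos(-2.0 * np.pi * (U0 * x + V0 * y) - P)
    return K * gauss * carrier

def init_filter_bank(M=8, N=4, size=15):
    """One kernel per orientation m and scale n; their filter-energy means would
    serve as the initialization features described in this section."""
    return [gabor_kernel(size, theta_n=m * np.pi / M, U0=0.1 * (n + 1), V0=0.0)
            for m in range(M) for n in range(N)]
```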
3 Sparse Coding Based on Kurtosis Measure
Referring to the classical SC algorithm [9], and combining the minimum image reconstruction error with Kurtosis and a fixed variance, we construct the following cost function for the minimization problem:

J(A, S) = \frac{1}{2}\sum_{x,y}\Big[X(x, y) - \sum_i a_i(x, y)\, s_i\Big]^2 - \lambda_1 \sum_i \big|kurt(s_i)\big| + \lambda_2 \sum_i \frac{\langle s_i^2 \rangle}{\sigma_t^2}.  (5)
where the symbol ⋅ denotes the mean, X = ( x 1, x 2,… , x n )T denotes the n-dimensional input data, S = (s1, s 2,…, s m)T denotes the m-dimensional sparse coefficients ( m ≤ n ),
and A = (a 1, a 2,… , a m ) denotes the feature basis vectors. λ 1 and λ 2 are positive constant. σ t2 is the scale of coefficient variance. In (5), the second term is the sparseness measure based on the absolute value of Kurtosis defined as follows:
kurt(s_i) = E\{s_i^4\} - 3\big(E\{s_i^2\}\big)^2 .  (6)
and maximizing |kurt(s_i)| (i.e., minimizing -|kurt(s_i)|) is equivalent to maximizing the sparseness of the coefficient vectors. The last term, a fixed variance term, penalizes the case in which the coefficient variance ⟨s_i²⟩ of the ith vector deviates from its target value σ_t². To ensure convergence and speed up the search for the optimal coefficient weights, we use the modified Amari natural gradient descent algorithm with amnesic factor [11] to update the coefficient weight matrix W; the updating formula is defined as follows:

\frac{dW}{dt} = -\mu_1(t)\left\{\frac{\partial J(A, W)}{\partial W}\, W^{T}(t)\, W(t) + \beta\gamma(t)\, W(t)\right\}.  (7)
subject to μ_1(t) > 0, β > 0, γ(t) > 0, where μ_1 is the learning rate, β is the selected scale, t denotes the sampling time, J is the cost function in (5), ∂J(A,W)/∂W is the gradient with respect to W, and γ(t) is the forgetting factor, defined as follows:

\gamma(t) = -\mathrm{tr}\left(\big(W(t)\big)^{T}\,\frac{\partial J(A, W)}{\partial W}\,\big(W(t)\big)^{T}\, W(t)\right).  (8)
In practice, the well-known real-time, discrete-time form of (7) is given as follows:

W(k+1) = W(k) + \eta_k\left[W(k) - F\big(S(k)\big)\big(S(k)\big)^{T} W(k) - \beta\gamma(k)\, W(k)\right].  (9)

where F(S) = -\big[\partial J(A, W)/\partial W\big]\, W^{T} and \gamma(k) = \mathrm{tr}\big(W(k)\,\Gamma(k)\big). Here, Γ(k) is defined as:

\Gamma(k) = W(k) - F\big(S(k)\big)\big(S(k)\big)^{T} W(k).  (10)
And the gradient with respect to W is written as:

\frac{\partial J(A, W)}{\partial W} = -A^{T}(I - AW)XX^{T} - \lambda_1\alpha\left[\langle S^3 X\rangle - 3\langle S^2\rangle\langle S X\rangle\right] + \frac{2\lambda_2}{\sigma_t^{4}}\langle S^2\rangle\, W\, \langle XX^{T}\rangle.  (11)

where α = sign(kurt(s_i)); for super-Gaussian signals α = 1, and for sub-Gaussian signals α = -1. Otherwise, the feature basis matrix A is updated using the normal gradient descent algorithm, and the updating rule can be written as:

A(k+1) = A(k) + \big[I - A(k)W(k)\big]XX^{T}W^{T}.  (12)
Fig. 1. Basis vectors obtained by applying our sparse coding algorithm to natural scenes
In each iteration of the loop, we update W and A in turn. In addition, for the convenience of computation, A is rescaled in the implementation. Using the above procedure, the results obtained for 64 basis functions extracted from natural scenes are shown in Fig. 1, where gray pixels denote zero weights, black pixels denote negative weights, and brighter pixels denote positive weights.
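The alternating update of W and A can be organized as in the following sketch; it is only a structural illustration of the loop, with simplified gradients and learning rates rather than the exact natural-gradient forms of (9)-(12).

```python
import numpy as np

def sparse_coding(X, k, iters=200, eta=0.01, lam1=0.1):
    """Alternately update the coefficient filter W (k x k) and basis A (k x k)
    for whitened PCA coefficients X of shape (k, num_samples)."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((k, k)) * 0.01
    A = np.eye(k)
    n = X.shape[1]
    for _ in range(iters):
        S = W @ X                                                   # sparse coefficients
        recon_grad = -A.T @ (np.eye(k) - A @ W) @ (X @ X.T)         # reconstruction term, cf. (11)
        kurt_grad = (S ** 3) @ X.T - 3 * np.mean(S ** 2, axis=1, keepdims=True) * (S @ X.T)
        grad = recon_grad - lam1 * kurt_grad / n
        W -= eta * grad                                             # simplified (not the natural-gradient form)
        A += eta * (np.eye(k) - A @ W) @ (X @ X.T) @ W.T / n        # cf. (12)
        A /= np.maximum(np.linalg.norm(A, axis=0, keepdims=True), 1e-8)  # rescale A each loop
    return W, A
```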
4 The RBPNN Model and Training Algorithm The radial basis probabilistic neural network (RBPNN) model [5] is shown in Fig. 2. The first hidden layer is a nonlinear processing layer, generally consisting of centers selected from the training samples. The second hidden layer selectively sums the outputs of the first hidden layer according to the categories to which the hidden centers belong. For pattern recognition problems, the outputs of the second hidden layer need to be normalized. The last layer of the RBPNN is the output layer. Mathematically, for an input vector x, the actual output value of the ith output neuron of the RBPNN, y_i^a, can be expressed as:

y_i^a = \sum_{k=1}^{M} w_{ik}\, h_k(x) = \sum_{k=1}^{M} w_{ik}\left[\sum_{i=1}^{n_k} \phi_i\big(\|x - c_{ki}\|_2\big)\right], \quad i = 1, 2, \ldots, M.  (13)
where h k ( x ) is the kth output value of the second hidden layer of the RBPNN; wik is the synaptic weight between the kth neuron of the second hidden layer and the ith neuron of the output layer of the RBPNN; c ki represents the ith hidden center vector for the kth pattern class of the first hidden layer; n k represents the number of hidden center vector for the kth pattern class of the first hidden layer; ⋅ 2 is Euclidean norm; and M denotes the number of the neurons of the output layer and the second hidden layer, or the pattern class number for the training samples set; φi ( x−cki 2) is the kernel function, and it can be written as:
\phi_i\big(\|x - c_{ki}\|_2\big) = \exp\left[-\frac{\|x - c_{ki}\|_2^2}{\sigma_i^2}\right].  (14)
where σ i is the shape parameter for Gaussian kernel function.
Fig. 2. The structure of radial basis probabilistic neural network
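A minimal sketch of the RBPNN forward pass in (13)-(14) is shown below (NumPy assumed; the arrays of selected centers, their class indices and the trained output weights are placeholders for quantities obtained by the OLS training).

```python
import numpy as np

def rbpnn_forward(x, centers, center_class, W_out, sigma):
    """Forward pass of the RBPNN, cf. (13)-(14).

    centers:      array (n_centers, dim), selected hidden centers c_ki
    center_class: array (n_centers,), pattern class index k of each center
    W_out:        array (n_outputs, M), output weights w_ik
    sigma:        shape parameter of the Gaussian kernel
    """
    # First hidden layer: Gaussian kernels on the distances to the centers, cf. (14)
    d2 = np.sum((centers - x) ** 2, axis=1)
    phi = np.exp(-d2 / sigma ** 2)
    # Second hidden layer: selective summation per pattern class, then normalization
    M = W_out.shape[1]
    h = np.array([phi[center_class == k].sum() for k in range(M)])
    h = h / max(h.sum(), 1e-12)
    # Output layer, cf. (13)
    return W_out @ h
```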
5 Experimental Results and Conclusions 5.1 Data Preprocessing
In the test, the Hong Kong Polytechnic University (PolyU) palmprint database is used to perform palmprint recognition. This database contains 600 palm images of size 128×128 pixels from 100 users, with 6 images per individual. For each person, the first three images were used as training data while the remaining ones were treated as test data. For convenience of calculation, PCA is used to whiten the training data and reduce the dimension from 128² to an appropriate dimension, denoted by k. Namely, let P_k denote the matrix containing the first k principal component axes in its columns, and let X denote the data set of zero-mean images. Then the principal component coefficient matrix R_k is given by R_k = X^T P_k. When k is set to 16, the first 16 principal component axes of the image set are shown in Fig. 3(a). Thus, instead of performing our SC algorithm directly on the 128² image pixels, it was performed on the first k PCA coefficients of the palmprint images. These coefficients R_k^T comprised the columns of the input data matrix, where each coefficient had zero mean. The representation of the training images was therefore contained in the columns of the coefficients U = W * R_k^T, where the weight matrix W is k × k, resulting in k coefficients in U for each palmprint image, consisting of the outputs of each weight filter. The representation of the test images was obtained in the columns of U_test as follows:
U_{test} = W \cdot R_{test}^{T} = W \cdot (X_{test} \cdot P_k)^{T}.  (15)
and the basis vectors were obtained from the columns of P_k · W^{-1}. The first 16 SC basis images are shown in Fig. 3(b). Here, each column of the weight matrix W^{-1} attempts to get close to a cluster of images that look similar across pixels. Thus, this approach tends to generate basis images that look palmprint-like, just as PCA does.
Fig. 3. First 16 basis of the palmprint image set, ordered left to right, top to bottom, by the magnitude of the corresponding eigenvalues. (a) Basis of PCA; (b) Basis of our SC.
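The projection of training and test palmprints described above can be sketched as follows (NumPy assumed; W stands for the filter matrix learned by the sparse coding step).

```python
import numpy as np

def pca_axes(X, k):
    """First k principal component axes P_k (as columns) of the zero-mean
    image matrix X of shape (n_samples, n_pixels)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T                          # shape (n_pixels, k)

def sc_representation(X, P_k, W):
    """Columns of U = W * R_k^T with R_k = X P_k, as in the text;
    applied to test data this is exactly (15)."""
    R = X @ P_k                              # PCA coefficients, (n_samples, k)
    return W @ R.T                           # (k, n_samples)
```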
5.2 Palmprint Recognition Rate
Using our SC architectures, basis vectors (features) of palmprint images are extracted. Three classifiers were tested, i.e., Euclidean distance, RBPNN, and PNN. Euclidean distance is the simplest distance-matching algorithm among all. The RBPNN classifier proposed by us possesses the advantages of the RBFNN and the PNN, and is very suitable for classification problems [5]. First, to determine the appropriate feature length, we used the three types of classifiers to perform the recognition task of PCA with different k principal components. Here, there is a point to be noted that, when using the RBPNN classifier, we selected 300 training samples as the hidden centers of the first hidden layer. The number of the second hidden neurons is set as 100, thus, the number of output layer neurons is also set as 100. According to literature [5], the shape parameter is set to 650. The OLSA is used to train the RBPNN model. Likewise, by using the parameter similar to the one mentioned above, we use the ROLSA to optimize and prune the structure of the RBPNN. As a result, the number of the selected hidden centers of the first hidden layer is greatly reduced from 300 to 64. The recognition rates of PCA with different principal components are still invariant. This shows that the RBPNN model has better performance in classification. By testing, the fact that PCA with 85 principal
Table 1. Recognition rate of our SC algorithm using three types of different classifiers with different principal components

Recognition Methods (k = 85)   RBPNN (%)   PNN (%)   Euclidean distance (%)
PCA                            94.97       93.50     91.33
Classical sparse coding        95.34       94.97     92.32
Our sparse coding              96.72       95.75     94.28
components yields the best performance was confirmed. Therefore, the PCA feature length of 85 is used as the input to our SC algorithm. The recognition rates obtained using our SC algorithm are shown in Table 1. In addition, we compared our SC method with the classical SC algorithm [9-10] and with PCA using 85 principal components; the comparison results are also shown in Table 1. It is clearly seen that the recognition rate of our SC algorithm is better than those of PCA and of Olshausen's SC. At the same time, it can also be observed that the Euclidean distance is the worst among the three classifiers, and that the recognition performance of the RBPNN is higher than those of the PNN and the Euclidean distance. Therefore, from the above experimental results, it can be concluded that our palmprint recognition method based on the modified SC algorithm and the RBPNN not only achieves a higher statistical recognition rate, but also has faster training and testing speed. This method is indeed effective and efficient, which strongly supports the claim that the RBPNN is a very promising neural network model for practical applications.
References 1. Julian, E., Körner, E.: Sparse coding and NMF. In: Proceedings of 2004 IEEE International Joint Conference on Neural Networks, vol. 4, pp. 2529–2533 (2004) 2. Hoyer, P.: Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research 5, 1427–1469 (2004) 3. Lee, D.D., Seung, H.S.: Learning the Parts of Objects by Non-negative Matrix Factorization. Nature 401, 788–791 (1999) 4. Olshausen, B.A., Field, D.J.: Emergence of Simple-cell Receptive Field Properties by Learning a Sparse Code for Natural Images. Nature 381, 607–609 (1996)
5. Shang, L., Cao, F., Chen, J.: Denoising Natural Images Using Sparse Coding Algorithm Based on the Kurtosis Measurement. In: Sun, F., Zhang, J., Tan, Y., Cao, J., Yu, W., et al. (eds.) Euro-Par 2008. LNCS, vol. 5264, pp. 351–358. Springer, Heidelberg (2008) 6. Bell, A., Sejnowski, T.J.: The ‘Independent Components’ of Natural Scenes Are Edge Filters. Vision Research 37, 3327–3338 (1997) 7. Hyvärinen, A., Hoyer, P.O.: Independent Component Analysis Applied to Feature Extraction from Colour and Stereo Images. Network Computation in Neural Systems 11(3), 191– 210 (2000) 8. Hyvärinen, A.: Sparse Coding Shrinkage: Denoising of Nongaussian Data by Maximum Likelihood Estimation. Neural Computation 11, 1739–1768 (1997) 9. Shang, L., Zhang, J., Huai, W., et al.: Image Reconstruction Using NMF with Sparse Constraints Based on Kurtosis Measurement Criterion. In: Huang, D.S., Jo, K.-H., Lee, H.-H., et al. (eds.) Euro-Par 1996. LNCS, vol. 5755, pp. 834–840. Springer, Heidelberg (1996) 10. Hyvärinen, A., Oja, E., Hoyer, P., Horri, J.: Image Feature Extraction by Sparse Coding and Independent Component Analysis. In: 19th Proc. Int. Conf. on Pattern Recognition (ICPR 1998), pp. 1268–1273. IEEE Press, New Work (1998)
Global Face Super Resolution and Contour Region Constraints Chengdong Lan1,3, Ruimin Hu1, Tao Lu2, Ding Luo1, and Zhen Han1 1
National Engineering Research Center on Multimedia Software, Wuhan University, Wuhan 430072, China 2 Hubei Province Key Laboratory of Intelligent Robot, College of Computer Science and Engineering Wuhan Institute of Technology, Wuhan, 430070, China 3 State Key Lab of Software Engineering, Wuhan University, Wuhan 430072, China
[email protected]
Abstract. Principal Component Analysis (PCA) is commonly used for facial images representation in global face super-resolution. But the features extracted by PCA are holistic and difficult to have semantic interpretation. For synthesizing a better super-resolution result, we introduce non-negative matrix factorization (NMF) to extract face features, and enhance semantic (nonnegative) information of basis images. Furthermore, for improving the quality of super-resolution facial image which has been deteriorated by strong noise, we propose a global face super resolution with contour region constraints (CRNMF), which maks use of the differences of face contour region in gray value as face similarity function. Because the contours of the human face contain the structural information, this method preserves face structure similarity and reduces dependence on the pixels. Experimental results show that the NMF-based face super-resolution algorithm performs better than PCA-based algorithms and the CRNMF-based face super-resolution algorithm performs better than NMF-based under the noisy situations. Keywords: Face super-resolution, CRNMF, Structure similarity, Face features.
1 Introduction In most of surveillance scenarios, there is a far distance between cameras and their interesting objects, which leads to these objects having very low resolution. Human face is one of the most familiar objects in surveillance video. Because a lot of details of the facial features are lost in low-resolution facial images, faces are often difficult to be identified. Effectively enhancing the resolution of face images has become a problem which needs to be solved urgently. For the past few years, many superresolution techniques have been proposed. Face super-resolution, also called face hallucination, is a method which reconstructs high-resolution face image from a low resolution face image, with the help of the priori information of a sample dataset. It can effectively enhance the resolution of poor quality face images in surveillance video and restore the detail information of face features. It has a significant role in improving the perceptual quality and the recognition rate of face image. L. Zhang, J. Kwok, and B.-L. Lu (Eds.): ISNN 2010, Part II, LNCS 6064, pp. 120–127, 2010. © Springer-Verlag Berlin Heidelberg 2010
Face super-resolution technologies can be divided into three categories. The first of these is based on the global face parameter model. The data representation method is used to transform the sample images into a subspace, which is used as the priori information to obtain the high resolution image [1-5]. The second is nonparameter model strategy which is based on local patches or pixels. It utilizes the local information of example images as priori information, and estimates the highfrequency details of the input low-resolution image [6-7]. The third is a combination of the previous two [8]. This paper discusses the first one. In 2001, Zisserman and Capel [1] proposed to use the PCA (Principal Component Analysis) eigenface space of the sample images as a prior model constraint, and MAP (maximum a posteriori probability) estimator is combined to reconstruct the super-resolution result from a low-resolution face image. In 2003, Gunturk and Zisserman etc. [2] proposed to perform super-resolution in the low-dimension PCA eigenface space. This method greatly reduced the complexity of the super-resolution, and is applied to the preprocessing of face recognition. In 2005, Wang and Tang [3] used the eigenvalue transformation to improve performance of the face hallucination algorithm. They used the PCA method to construct a linear representation of the input low-resolution image with low-resolution training set images, and the representation coefficients were mapped to high-resolution image space. In 2007, Ayan Chakrabarti etc.[4] proposed face super-resolution method based on the KPCA (kernel PCA) priority. This method also used the PCA subspace of the examples as regularization of maximum a posteriori probability (MAP) framework, and defined the kernel function for projectiong images to the pricipal component and the highdimension feature space. In 2008, Jeong Seon etc. [5] proposed to use recursive error back-projection combined with the deformation face model and PCA method to reconstruct the high-resolution images from a single frame low-resolution face images. This method introduced the deformation face model, and used the shape and texture information of face image simultaneously. In face super-resolution technologies, PCA is the most common representation approach for facial images. It is a kind of dimension reduction methods, and considers the dimension reduction and redundancy decrease. But the features extracted by PCA are holistic, and PCA is not a good factorization method for synthesis and reconstruction. Furthermore, the cost function of the face super-resolution is established on the difference of image gray values. But in the real-world application of strong noise environment, the gray values change greatly. So the global differences between two face images can not reflect their actual similarity well in the cost function. This paper presents a NMF-based face super-resolution algorithm first. It uses the NMF to obtain structural information representation of sample face images, and the target image is regularized by Markov random fields. Finally, the steepest descent method is used to optimize NMF coefficient of high-resolution image. Furthermore, for improving the quality of super-resolution face image which has been deteriorated by strong noise, we propose a global face super resolution with contour region constraints (CRNMF), which takes advantage of the differences of face contour region in gray value as face similarity function.
2 A Global Face Super Resolution Arithmetic Non-negative matrix factorization is a linear, non-negative approximate data representation [9]. Let us assume that our data consist of T measurements of N non-negative scalar variables. Denoting the (N-dimensional) measurement vectors v_t (t = 1, ..., T), a linear approximation of the data is given by

v_t \approx \sum_{i=1}^{M} w_i h_{it} = W h_t ,
where W is an N × M matrix containing the basis vectors wi as its columns. Note that each measurement vector is written in terms of the same basis vectors. Arranging the measurement vectors v t into the columns of an N × T matrix V, we can now write: V ≈ WH , where each column of H contains the coefficient vector h t corresponding to the measurement vector v t . Given a data matrix V, the optimal choice of matrices W and H is defined to be those nonnegative matrices that minimize the reconstruction error between V and WH. Various fidelity functions have been proposed, perhaps the squared error (Euclidean distance) function is most widely used:
E(W, H) = \|V - WH\|^2 = \sum_{i,j}\big(V_{ij} - (WH)_{ij}\big)^2 .
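The squared-error NMF objective can be minimized, for example, with the well-known multiplicative update rules; the paper only states that a gradient algorithm is used, so the scheme below is an illustrative choice (NumPy assumed).

```python
import numpy as np

def nmf(V, M, iters=500, eps=1e-9):
    """Factor a non-negative data matrix V (N x T) as V ~ W H,
    with W (N x M) and H (M x T) kept non-negative."""
    rng = np.random.default_rng(0)
    N, T = V.shape
    W = rng.random((N, M)) + eps
    H = rng.random((M, T)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # multiplicative updates preserve non-negativity
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```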
Although the minimization problem is convex in W and H separately, it is not convex in both simultaneously. A gradient algorithm can be used for this optimization. The algorithm of global face super-resolution based on non-negative matrix factorization is as follows. The target high-resolution face image is defined as Z and the low-resolution face image as Y, and the observed image is assumed to be affected by additive noise, so the imaging model can be expressed as:

y = DBZ + n ,  (1)

where B is the optical blur matrix, D is the sampling matrix determined by the CCD size, and n is the noise matrix. Given the observed low-resolution image, we follow the maximum a posteriori principle and Bayesian theory, so that:

\hat{Z} = \arg\min\{-\log P(Y \mid Z) - \log P(Z)\} ,  (2)

where P(Z) denotes the prior probability of the high-resolution image and P(Y|Z) is the conditional probability. Therefore, to find the optimal solution of the equation, we have to determine P(Z) and P(Y|Z). For P(Z), we use the Huber-Markov random field model [3]:

P(Z) = \frac{1}{Z_c} \exp\Big[-\frac{1}{\lambda}\sum_{c \in C} V_c(Z)\Big] ,  (3)

where Z_c is a normalizing constant and λ is the "temperature" parameter.

\sum_{c \in C} V_c(Z) = \sum_{k=0}^{qN_1 - 1}\,\sum_{l=0}^{qN_2 - 1}\,\sum_{m=0}^{3} \rho_S\big(d_{k,l,m}^{t} Z\big) ,
123
where q is the magnification, N1 and N 2 are the height and width of the lowresolution image. The process of calculating the conditional probability P(Z|Y) is discussed as follows. Additional noise can be regarded as Gaussian white noise, so: 1 1 2 exp[− 2 n ] , P ( n) = (4) N1 N 2 / 2 N1 N 2 (2π ) 2σ σ Combining with equation (1) and equation (4), we have:
P(Y | Z ) = P(( DBZ + n) | Z ) = P (n) =
1
exp[−
1
Y − DBZ ] 2
,
(5)
(2π ) 2σ σ This is the formula of conditional probability. Replacing equation (3) and (5) into equation (2), and ignoring the items nothing to do with Z, we have the optimal solution of target high-resolution image as follows: 1 1 2 Zˆ = arg min( 2 Y − DBZ + ∑ Vc ( Z )) , (6) 2σ λ c∈C We use the NMF to obtain the basis images matrix W, and define: Z = We where e denotes the unknown coefficient vector. Equation (6) can be rewritten as: 1 1 2 eˆ = arg min( 2 Y − DBWe + ∑ Vc (We)) , (7) 2σ λ c∈C The steepest descent method is used for solving e. We can obtain equation like this: eˆn +1 = eˆn + α d n , where α is the constant step size, and d n = −∇Ω[en , S ] − where we define:
λ (W t B t D t DBWen − W t Bt D t Y ) , σ2 Ω(e, S ) = ∑ Vc (We) . c∈C
3 Face Contour Region Constraints In real-world scenarios, low-resolution surveillance video usually contains strong noise, which introduces severe distortion into the image pixels. The cost function of traditional super-resolution is built on the gray values of the whole image, but under strong noise these gray values cannot reflect the similarity of the faces well, which significantly reduces the quality of the reconstructed images. Unlike the traditional method, this paper targets the strong-noise condition and introduces a face contour factor into the similarity criterion. Because the contours of the human face contain its structural information, this method preserves face structure similarity and reduces the dependence on pixel values, which makes it suitable for practical surveillance applications.
The cost function of traditional face super-resolution for the reconstructed image texture (pixel) constraint is generally defined as:

\|Y - DBZ\| ,  (8)
where Y is the low-resolution face image, B is the optical blur matrix, D is the sampling matrix determined by the CCD size, and Z is the target high-resolution face image. In order to improve the robustness of the cost function to noise, we alter formula (8) as follows:

\|Q .* (Y - DBWe)\| ,  (9)
where Q is the face contour weighting factor (applied element-wise, denoted by .*), W is the matrix of face feature images, and e is the unknown synthesis coefficient vector. From formula (9) we can see that the face feature images and the face contour weighting factor are the main parts of the two-dimensional contour semantic model constraint. Section 2 introduced how to obtain the face feature images and calculate the feature coefficients; next we describe how to obtain and use the face contour weighting factor. The super-resolution with a contour template in the image pixel (texture) cost-function constraint proceeds as follows. Firstly, a face contour template is obtained. The face contour area is selected manually: we choose the edges carrying face structure information and the areas with obvious characteristics as the contour area, and transform this area into a binary face contour template, as shown in Fig. 1. Then the contour weighting factor is calculated. The contour weighting factor Q is obtained from the binary face template M. Defining the contour area weight as q, the contour weighting factor can be expressed as Q = (1 - q) .* M + q .* (E - M), where E is a matrix whose elements are all one, and q can be regarded as a constant. Finally, the super-resolution result is reconstructed. We obtain the reconstruction constraint by substituting the calculated weighting factor into formula (9). Then, according to the method of Section 2, we carry out the super-resolution process and obtain the final super-resolution reconstructed image.
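The contour weighting factor and the weighted data term of (9) can be sketched as follows (NumPy assumed; q = 0.8 follows the experimental setting reported later, and the binary contour mask M is assumed given).

```python
import numpy as np

def contour_weight(M, q=0.8):
    """Contour weighting factor Q = (1 - q) .* M + q .* (E - M), as printed in the text,
    for a binary contour mask M (same size as the image)."""
    E = np.ones_like(M, dtype=np.float64)
    return (1.0 - q) * M + q * (E - M)

def weighted_data_term(Y, DBW_e, M, q=0.8):
    """Contour-weighted reconstruction residual used in place of (8), cf. (9)."""
    Q = contour_weight(M, q)
    return np.linalg.norm(Q * (Y - DBW_e))
```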
Fig. 1. (a) Face grayscale image (b) Face contour area
4 Experiments and Results The face dataset FERET of the Massachusetts Institute of Technology (MIT) was used for our experiment. We selected 100 sample faces and used 20 fiducial points for face alignment. The resolution of the sample faces was 256×288. Ten sample images were down-sampled by a factor of 8 (to resolution 32×36) and used as the testing images, as in Figure 2(a). The remaining 90 images were used as the training sample database. The testing images were enlarged 8 times by the Bi-Cubic interpolation method, and the subjective results are shown in Figure 2(b). Their PSNR (Peak Signal-to-Noise Ratio) values were calculated against the original high-resolution images and are listed in the 'Bi-Cubic' column of Table 1. The PCA-based and NMF-based face super-resolution methods were then performed; their subjective results are shown in Figure 2(c) and (d), and their PSNR values against the original high-resolution images are listed in the 'PCA' and 'NMF' columns of Table 1. The original high-resolution images are shown in Figure 2(e).
Fig. 2. Experimental results: (a) Testing images (b) The images obtained by Bi-Cubic interpolation, (c) PCA-based method, and (d) Our proposed arithmetic based on NMF, (e) The original HR images
From the experimental results, the reconstructed images of the PCA-based approach have higher resolution than those of the Bi-Cubic approach, but they show more serious luminance errors and a lower similarity to the original images. The proposed NMF-based face super-resolution algorithm is compared with the PCA-based method and Bi-Cubic interpolation in subjective quality, and the results are significantly improved. On the other hand, Bi-Cubic interpolation is the highest in objective quality, and the objective quality of the PCA-based method is the lowest. Under the same conditions as the PCA-based method, the PSNR values of the NMF-based method are improved, which is consistent with the subjective results. Therefore, the experiments demonstrate that the
NMF-based face super-resolution algorithm performs better than the PCA-based algorithm in both objective and subjective quality. Random noise was then added to the testing images, as in Figure 3(a). The Bi-Cubic interpolation method was used to enlarge the noisy images 8 times; the subjective results are shown in Figure 3(b). The NMF-based face super-resolution reconstruction method was performed, with subjective results shown in Figure 3(c), and the CRNMF-based reconstruction method was performed, with subjective results shown in Figure 3(d), where q was set to 0.8. The original high-resolution images are shown in Figure 3(e). The results demonstrate that, compared with the NMF-based method, the CRNMF-based method obtains better quality under the noisy condition. Table 1. The results of objective data
Testing images   Bi-Cubic (dB)   PCA (dB)   NMF (dB)
Face 1           23.287          18.368     22.309
Face 2           23.067          17.054     22.716
Face 3           22.874          15.78      22.07
Face 4           27.348          16.218     26.344
Face 5           24.458          18.279     23.43
Face 6           25.999          13.114     25.57
Face 7           22.867          16.851     22.288
Face 8           23.38           18.302     22.682
Face 9           27.156          16.111     26.643
Face 10          25.182          16.142     22.814
Fig. 3. Experimental results: (a) Testing images with noise (b) The images obtained by Bi-Cubic interpolation, (c) NMF-based method, and (d) CRNMF-based arithmetic, (e) The original HR images
5 Conclusions This paper resolves many issues of the traditional method based on PCA. For example, the features of traditional PCA-based method can not maintain the local structure information, and have poor ability of representation, and are difficult to have semantic interpretation and so on. We enhance the image semantic (negative) information by introducing the method of NMF to extract features. Furthermore, because the face image similarity is reduced by the gray value changing in the strong noise environment, we obtain the face contour through a two-dimension contour template, and use the gray value differences of the contour as the face similarity function. Because the contour contains the face structural information, the constraint of contour keeps the structural similarity and reduces the dependence on pixel values. The experiments demonstrate that NMF-based face super-resolution algorithm performs better than PCA-based algorithm, and the CRNMF-based face super-resolution algorithm performs better than NMF-based under the noise conditions. Acknowledgments. This research was funded by The National Basic Research Program of China (973 Program) (no. 2009CB320906) and Key Research Project of Ministry of Public Security of China (no. 2008ZDXMHBST011).
References 1. Capel, D.P., Zisserman, A.: Super- Resolution From Multiple Views Using Learnt Image Models. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 627–634 (2001) 2. Gunturk, B.K., Batur, A.U., Altunbasak, Y., Hayes, M.H., Mersereau, R.M.: EigenfaceDomain Super-Resolution for Face Recognition. IEEE Transactions on Image Process. 12(5), 597–606 (2003) 3. Wang, X., Tang, X.: Hallucinating Face by Eigentransform. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 34(3), 425–434 (2005) 4. Chakrabarti, A., Rajagopalan, A.N., Chellappa, R.: Super-Resolution of Face Images Using Kernel PCA-Based Prior. IEEE Transactions on Multimedia 9(4), 888–892 (2007) 5. Park, J.S., Lee, S.W.: An Example-Based Face Hallucination Method for Single-Frame, Low-Resolution Facial Images. IEEE Transactions on Image Processing 17(10) (2008) 6. Freeman, W.T., Pasztor, E.C., Carmichael, O.T.: Learning Low level Vision. International Journal of Computer Vision 40(1), 25–47 (2000) 7. Baker, S., Kanade, T.: Limits on Super-Resolution And How To Break Them. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(9), 1167–1183 (2002) 8. Liu, C., Shum, H., Freeman, W.T.: Face Hallucination: Theory and Practice. International Journal of Computer Vision 75(1), 115–134 (2007) 9. Lee, D., Seung, H.S.: Learning the Parts of Objects By Non-Negative Matrix Factorization. Nature 401(6755), 788–791 (1999)
An Approach to Texture Segmentation Analysis Based on Sparse Coding Model and EM Algorithm

Lijuan Duan (1), Jicai Ma (1), Zhen Yang (1), and Jun Miao (2)

(1) College of Computer Science and Technology, Beijing University of Technology, Beijing 100124, China
[email protected], [email protected], [email protected]
(2) Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
[email protected]
Abstract. Sparse coding theory is a method for finding a reduced representation of multidimensional data. When applied to images, it can derive efficient codes that capture the statistically significant structure intrinsic to the images. In this paper, we discuss its application to texture image analysis by means of Independent Component Analysis. Texture model construction, feature extraction and segmentation approaches are proposed respectively. The experimental results demonstrate that segmentation based on sparse coding theory achieves promising performance. Keywords: Sparse Coding, ICA, EM, Texture Segmentation.
1 Introduction

Texture segmentation plays an important role in both pattern recognition and image processing. It consists of partitioning the input image into connected regions that are homogeneous with respect to a texture property. In recent years, it has been widely applied in content-based image retrieval, medical image processing, remote sensing, scene recognition and so on. Among content-based features, texture is a fundamental property which provides useful information for image classification. In general, feature extraction, feature selection and classification make up the procedure of texture-based image processing [1]. How to extract the inherent information of the texture is a complex and vital task, which directly impacts the performance of the subsequent selection and segmentation. During the past decades, a wide variety of techniques have been proposed to address this problem. These methods can be roughly classified into four categories: statistical methods, structural methods, model-based methods and signal processing (filtering) methods [2]. Each of these methods has its particular merits and is applicable in different settings. In this paper, we discuss a new multi-channel filtering approach based on sparse coding theory, which derives from neurophysiological research [3-4]. It is motivated by psycho-physical phenomena in the early stages of the human visual system. As pointed out by Olshausen et al. [5], in the human visual system there is a
series of cells from the retina to the cerebral cortex characterized by their "receptive fields", which are the basic structural and functional units of information processing in the visual system. These units act as a collection of orientation-, location- and frequency-selective, Gabor-like filters. In short, a single neuron responds strongly only to certain information, such as edges of a specific direction, line segments, stripes and other image characteristics. Intuitively, one can learn a basis set such that only a small fraction of the basis functions is necessary to describe a given image; this is the essence of sparse coding. Therefore, we can apply this theory to extract image features that are more in line with human visual processing and possess better discrimination ability. The remainder of this paper is structured as follows: Section 2 briefly reviews sparse coding theory as well as Independent Component Analysis (ICA), which is a generative model of sparse coding, and its application to image feature extraction. In Section 3, we give a brief overview of the EM algorithm used to cluster the feature images. Section 4 presents the proposed approach. Several experiments are reported in Section 5, and the last section is devoted to the conclusion.
2 Texture Feature Analysis Using Sparse Coding Model

2.1 Sparse Coding and Independent Component Analysis

Sparse coding is a mathematical framework for finding a neural-network representation of multidimensional data. In sparse coding, the data are expressed with a set of basis functions such that only a small portion of the basis is activated at the same time, that is, a given neuron is rarely activated. From a mathematical point of view, sparse coding can be interpreted as a linear decomposition of the multidimensional data. Suppose the random vector $x = (x_1, x_2, \ldots, x_n)^T$ is observed as the neural network input, and $s = (s_1, s_2, \ldots, s_n)^T$ stands for the output of the network. The weight vectors are denoted by $w_i$, $i = 1, \ldots, n$, collected in a matrix $W = (w_1, w_2, \ldots, w_n)^T$, each row of which is one weight vector. The linear transformation can then be expressed as:

$s = Wx .$   (1)
The number of sparse components (the number of neurons) is equal to the number of observed variables. Therefore, sparse coding can be seen as finding the weight matrix $W$ that makes the $s_i$ as sparse as possible, while another property of sparse coding requires that the outputs $s_i$ of the representation be as independent as possible. Independent Component Analysis (ICA) is a data analysis method that aims to estimate a set of latent and generally non-Gaussian sources (sparse components) from a set of observations [6], under the assumption that these sources are mutually independent. It was developed in the field of blind source separation based on higher-order statistics; many researchers have addressed this rapidly emerging research area, and the interested reader is encouraged to consult the literature for details [6, 7]. In order to find the independent components, we can construct a quantitative measure of the sparsity of a random variable and then maximize this measure over the transformation;
thus the sparse coding transformation of the observed data can be estimated. In fact, because sparsity is closely related to non-Gaussianity, classical measures of non-Gaussianity such as kurtosis and approximations of negentropy can be interpreted as measures of sparsity [7]. Therefore, independent component analysis can be viewed as an implementation of sparse coding theory.

2.2 Texture Feature Representation and Extraction Based on Sparse Coding
Textures often refer to homogeneous patterns or spatial arrangements of pixels that regional intensity or color alone does not sufficiently describe. In the structural methods mentioned above, the main interest lies in certain texture units. From this point of view, a texture image is constructed from a set of latent micro-units which appear repeatedly; heuristically, every window in the texture image can be assembled from these units. In essence, the sparse coding method is a kind of generative representation for texture analysis [8]. Therefore, we can apply this method to obtain the latent units and extract image features. For a given texture window, a vector $x$ can be formed by row-scanning the gray values. A collection of these vectors makes up an observed data set $\Omega$; suppose it belongs to a linear space in which every element can be expressed as a linear combination of the basis. In terms of the texture image, these basis vectors are the units which constitute the observed texture. Therefore, the generative model for the texture image can be written as:

$X = \sum_{i=1}^{n} a_i s_i .$   (2)
where $a_i \in R^m$ is the basis function according to row expansion, $\{a_1, a_2, \ldots, a_n\}$ is the basis set of the sample set $\Omega$, and $s_i$ is the combination coefficient. Several methods have been proposed to perform ICA [7]. In this paper we choose the FastICA algorithm [9] presented by Hyvarinen, due to its simplicity and fast convergence. After transforming these basis functions into the frequency domain, we obtain a set of filters with good response characteristics. We then use these filters to generate image features. The basic idea is that a given test image is convolved with these filters, each with a specific frequency and orientation characteristic. If we apply a set of N filters, the resulting feature consists of N filtered images of the same size as the test image.
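Purely as an illustration of the procedure described above (not the authors' code), the basis learning and feature-image generation can be sketched in Python; all parameter values and function names here are assumptions:

```python
import numpy as np
from sklearn.decomposition import FastICA
from scipy.signal import convolve2d

def extract_patches(image, size=12, n_patches=1000, rng=np.random.default_rng(0)):
    """Sample square sub-windows and flatten them by row scanning."""
    h, w = image.shape
    ys = rng.integers(0, h - size, n_patches)
    xs = rng.integers(0, w - size, n_patches)
    return np.array([image[y:y + size, x:x + size].ravel() for y, x in zip(ys, xs)])

def learn_ica_filters(training_images, n_filters=40, size=12):
    """Learn ICA basis functions (filters) from texture patches with FastICA."""
    data = np.vstack([extract_patches(img, size) for img in training_images])
    data = data - data.mean(axis=0)                 # center the observed data
    ica = FastICA(n_components=n_filters, random_state=0)
    ica.fit(data)                                   # PCA-based reduction happens internally
    # Each row of components_ acts as one filter after reshaping to size x size.
    return ica.components_.reshape(n_filters, size, size)

def feature_images(test_image, filters):
    """Convolve the test image with every learned filter to build feature images."""
    return np.array([convolve2d(test_image, f, mode="same") for f in filters])
```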
3 Texture Image Segmentation Using EM

After extracting all the filtered features for an image, we obtain a set of feature vectors, which can be viewed as points characterized by a mixture of Gaussian probability distributions in a multidimensional feature space. A C-component Gaussian mixture can be expressed as:

$P(x_i \mid \Theta) = \sum_{j=1}^{C} p_j N(x_i \mid \mu_j, \Sigma_j) .$   (3)
where $x_i$ is a feature vector; the $p_j$ are the mixing weights, with $0 < p_j < 1$ and $\sum_{j=1}^{C} p_j = 1$; $C$ is the number of clusters, which is assumed known a priori in this paper; and $N(x_i \mid \mu_j, \Sigma_j)$ is the multivariate normal density of class $j$ parameterized by $\mu_j$ and $\Sigma_j$:

$N(x_i \mid \mu_j, \Sigma_j) = \frac{1}{(2\pi)^{d/2} (\det \Sigma_j)^{1/2}} \exp\left( -\frac{1}{2} (x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j) \right) .$   (4)
where $d$ is the dimension of the feature space. In order to perform segmentation, we first use the Expectation Maximization (EM) algorithm to determine the parameters of the mixture probability model in the feature space [10]. The EM algorithm consists of the E-step and the M-step, through which the likelihood function of the samples is maximized by finding the maximum likelihood estimate of the unknown parameters. After the parameters are calculated, the next step is to perform spatial grouping of the pixels by assigning each pixel the group label for which it attains the highest likelihood value $P(x_i \mid \Theta)$. Finally, for a better segmentation result, a Gaussian smoothing filter is adopted as post-processing.
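A compact sketch of this segmentation stage (illustrative only, not the authors' code): scikit-learn's GaussianMixture performs the EM estimation, and smoothing the per-pixel posteriors before the arg-max is one possible reading of the Gaussian post-processing step; the number of clusters and the smoothing width are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.ndimage import gaussian_filter

def segment_features(feature_stack, n_clusters, smooth_sigma=2.0):
    """feature_stack: array of shape (N, H, W) holding the N filtered images."""
    n, h, w = feature_stack.shape
    X = feature_stack.reshape(n, -1).T              # one d-dimensional vector per pixel
    gmm = GaussianMixture(n_components=n_clusters, covariance_type="full",
                          max_iter=200, random_state=0)
    gmm.fit(X)                                      # EM estimation of p_j, mu_j, Sigma_j
    # Per-pixel posterior over the C components; smooth each map, then take the arg-max.
    posteriors = gmm.predict_proba(X).T.reshape(n_clusters, h, w)
    smoothed = np.array([gaussian_filter(p, smooth_sigma) for p in posteriors])
    return smoothed.argmax(axis=0)                  # label image of shape (H, W)
```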
4 Proposed Approach to Texture Segmentation Analysis Based on Sparse Coding Model and EM Algorithm

In summary, the outline of the segmentation approach presented in this paper is illustrated in Fig. 1 and involves the three stages mentioned before: (1) learning ICA filters from the training textures by applying FastICA; (2) generating feature images by convolving the adopted filters with the test image; (3) modeling the probability distribution of those features with a mixture of Gaussians, employing the EM algorithm to determine the parameters. The final segmentation result is produced according to the likelihood of each pixel.
Fig. 1. Block diagram of the proposed approach

5 Experiment

In the experiment, we choose 20 images with 256 gray levels from the Brodatz album. 15,000 sub-windows are extracted from these images using a 12 × 12 sliding window, which makes up a 15000 × 144 training data set x used as the input samples of ICA. Fig. 2 shows the training textures, and Fig. 3 shows the basis functions learned by FastICA; PCA is used to reduce the dimensionality, yielding a total of 40 basis functions. Note that the independent components learned from windows of different scales differ. We can see from Fig. 3 that these basis functions possess location, orientation and frequency selectivity. However, unlike Gabor filters, ICA filters are data dependent and based on higher-order statistics; they reveal the latent structure of the image while maintaining reasonable statistical independence.

Fig. 2. Training textures from the Brodatz album
Fig. 3. Example of ( 12 × 12 ) ICA basis functions for texture
Three multi-texture images are composed manually as the input for segmentation. Fig. 4 presents the test image set. Among them, images (a) and (b) consist of repeating texture units, while the lower right part of image (c) does not show periodicity because there is no stable unit in this texture. The scale of the filter window should be considered when performing the segmentation; it is a critical factor for the success of segmentation or classification [11].
Fig. 4. Test Images
Different segmentation results on images (a) and (b) are reported in Fig. 5, in which the feature dimension is fixed at 20 and 12, respectively. The segmentation errors are reported in Table 1. It can be seen that when the scale of the filter window is proportional to the size of the texture units of the input image, both test cases obtain good segmentation results, although certain misclassifications appear near the boundary.
Fig. 5. Segmentation results on images (a) and (b). From left to right, the scale of the filter window is set to 10, 11, 12, 13, 14 and 15, respectively.

Table 1. Segmentation error (%) for the proposed method in Fig. 5

Scale of filter window   Test image (a)   Test image (b)
10                       5.29             2.97
11                       3.46             3.42
12                       3.36             5.28
13                       3.20             3.25
14                       2.94             3.12
15                       4.88             4.60
Another issue that should be addressed is how many of the filtered images should be used by the EM algorithm; that is, the segmentation performance obviously depends on how many features are used. Fig. 6 depicts this dependency.
Fig. 6. Error (%) vs. dimension of features on test image (c), for filter window scales 8, 16 and 25
In Fig. 6 we plot the error rate as a function of the dimension of features used for segmentation on test image (c); the scale of the filter window in each trial is fixed as indicated. When the scale is chosen appropriately, the dimension of features does not have much influence on the segmentation performance. However, it does impact the result when the scale is unreasonable: Fig. 6 shows that with scale 25 the error rate rises when too many features are used for segmentation. Therefore, it is crucial to choose the right scale of the filter window, while it is also desirable to reduce the redundancy between features and keep the optimal features that provide significant information for the segmentation.
6 Conclusion

In this paper, we introduced the basic principle of sparse coding theory and one of its implementations, independent component analysis. The capability of the sparse coding method in texture analysis was studied, and it was successfully applied to extract the feature basis functions of texture images. These basis functions are data dependent and thus sensitive to the training data. We employed these basis functions to build feature images and then performed segmentation using the EM algorithm. Several simulation results showed that texture analysis based on sparse coding theory achieves good performance in the feature extraction of texture images. Further research will address how to choose the appropriate window scale and feature dimension adaptively.

Acknowledgements. This research is partially sponsored by the Natural Science Foundation of China (No. 60673091, 60702031 and 60970087), the Hi-Tech Research and Development Program of China (No. 2006AA01Z122), the Beijing Municipal Natural Science Foundation (No. 4102013, 4072023 and 4102012), the Beijing Municipal Education Committee (No. KM200610005012), the National Basic Research Program of China (No. 2007CB311100), and the Beijing Municipal Foundation for Excellent Talents (No. 20061D0501500211).
References
1. Reed, T.R., Hans du Buf, J.M.: A review of recent texture segmentation and feature extraction techniques. CVGIP: Image Understanding 57(3), 359–372 (1993)
2. Tuceryan, M., Jain, A.K.: Texture Analysis. In: Chen, C.H., Pau, L.F., Wang, P.S.P. (eds.) The Handbook of Pattern Recognition and Computer Vision, 2nd edn., pp. 207–248. World Scientific Publishing Company, Singapore (1998)
3. van Hateren, J.H., van der Schaaf, A.: Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London B 265, 359–366 (1998)
4. Simoncelli, E.P.: Vision and the statistics of the visual environment. Current Opinion in Neurobiology 13(2), 144–149 (2003)
5. Olshausen, B.A., Field, D.J.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996)
6. Hyvarinen, A., Oja, E.: Independent component analysis: Algorithms and applications. Neural Networks 13(4-5), 411–430 (2000)
7. Hyvarinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley Series on Adaptive and Learning Systems for Signal Processing, Communications, and Control, pp. 165–237. John Wiley and Sons, Inc., Chichester (2001)
8. Peyre, G.: Non-negative sparse modeling of textures. In: Sgallari, F., Murli, A., Paragios, N. (eds.) SSVM 2007. LNCS, vol. 4485, pp. 628–639. Springer, Heidelberg (2007)
9. Hyvarinen, A.: Fast and Robust Fixed-Point Algorithms for Independent Component Analysis. IEEE Transactions on Neural Networks 10(3), 626–634 (1999)
10. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Ser. B 39, 1–38 (1977)
11. Jain, A.K., Farrokhnia, F.: Unsupervised texture segmentation using Gabor filters. Pattern Recognition 24(12), 1167–1186 (1991)
A Novel Object Categorization Model with Implicit Local Spatial Relationship

Lina Wu, Siwei Luo, and Wei Sun

School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
[email protected]
Abstract. Object categorization is an important problem in computer vision. The bag-of-words approach has attracted much research in object categorization and has shown state-of-the-art performance. However, the bag-of-words (BOW) approach ignores the spatial relationships between local features, while in the real world the local features of most classes exhibit spatial dependence. We therefore propose a novel object categorization model with implicit local spatial relationship based on the bag-of-words model (BOW with ILSR). The model uses the neighbor features of a local feature as its implicit local spatial relationship, which is integrated with its appearance feature to form two sources of information for categorization. The model not only preserves some degree of flexibility, but also incorporates the necessary spatial information. The algorithm is applied to the Caltech-101 and Caltech-256 datasets to validate its efficiency, and the experimental results show its good performance. Keywords: Object Categorization, Bag-of-words Model, Implicit local spatial relationship.
1 Introduction

Object categorization can be regarded as the process of assigning a specific object to a certain category, and it has recently become an important problem in computer vision. There are thousands of object categories in daily life, which makes categorization difficult due to pose changes, intra-class variation, occlusion and cluttered backgrounds. Many object categorization algorithms have been proposed in the past years. Basically there are two kinds of methods: those based on global features and those based on local features. Methods based on global features describe an image by extracting global properties such as color, texture and other features [1,2], often combined with subspace techniques such as PCA or ICA. This kind of method is not suitable for large intra-class variation, so methods based on local features have become more popular for visual categorization. Recently, many algorithms based on local features have been proposed and have achieved state-of-the-art performance. In this kind of algorithm, an image is regarded as a set of local regions, obtained by a regular grid [3] or by keypoint detectors [4-6].
Each extracted local region is then described by a vector for categorization; various descriptors are available, such as [5,6]. Different spatial constraints have been proposed, such as the constellation model [7], star-shaped models [8], a hierarchical configuration of local features [9], trees [10], etc. The constellation model proposed by R. Fergus models the joint distribution of the geometry of parts, where the spatial constraint is strongest. This model performs well when the individuals in a category show little variation, but in the real world there is large intra-class variation and the geometry of parts may differ among individuals of the same category, so this strong spatial constraint does not work well. The other extreme is the bag-of-words model, borrowed from text classification. The basic bag-of-words model is geometry-free and ignores the spatial information that is necessary for object categorization. A number of successful categorization systems have been presented over the past years [11-14]. This paper proposes a novel object categorization model with implicit local spatial relationship based on the bag-of-words model. To preserve the flexibility of the basic bag-of-words model without losing the necessary spatial information, the appearance of a local region is integrated with its implicit local spatial relationship, which is represented by the neighbor features of the local feature. The model is built on the basic bag-of-words model; we introduce the related work in Section 2 and present our algorithm in Section 3. The experiments are shown in Section 4, and the last section concludes the paper.
2 Related Work

The visual bag-of-words model is borrowed from text classification; both models ignore the geometric relations between words. We call the cluster centers of patches "visual words". The bag-of-words model is now an important method for image categorization and has achieved state-of-the-art performance. In the early days it was mainly used in texture recognition [15]. In 2003, Blei et al. [16] applied the bag-of-words assumption in hierarchical Bayesian models for text, leading to methods such as pLSA and LDA. When the bag-of-words model is used for object categorization [13] and natural scene classification [12], it performs well. Compared with methods based on the global features of an image, the bag-of-words method is very flexible, so it handles individual variation effectively. But in the real world, most intra-class individuals exhibit some degree of spatial dependency. Several works have proposed methods to add spatial information to the bag-of-words model [17-22]. M. Marszalek [21] proposed an algorithm which employs spatial relationships between features to reduce background clutter. The authors of [22] construct two models: a "bag of individual regions" representation where each region is treated separately, and a "bag of region pairs" representation where regions with particular spatial relationships are considered together; that algorithm, however, only considers the vertical "above-below" relationship. We consider that a global spatial constraint is not suitable for object categorization, and that the relationship between patches in a single direction is not enough to describe the spatial structure, as objects may undergo scale or rotation changes in the real world. To solve this problem, this paper adds an implicit spatial relationship to the basic bag-of-words model. The advantages are: 1. spatial information is important
for object categorization, and the proposed algorithm preserves the spatial relationships of features, providing the necessary spatial information; 2. it easily handles scale or rotation changes: although an object may appear at a different scale or rotated, its neighbor features tend not to change even though the absolute positions of the features change, so the representation is flexible and robust.
3 Object Categorization Model with Implicit Local Spatial Relationship (The Algorithm)

This model is an extension of the basic bag-of-words algorithm. The basic bag-of-words method usually contains four steps: first, a set of local patches is formed from the training images by some sampling method; then a codebook is created from this patch set using a clustering algorithm such as K-means; a histogram over the codebook is computed to represent an image; and in the classification phase, the category label is produced by a classifier that takes the histogram vector as input (see the top row of Fig. 1).

Fig. 1. The structure of the algorithm (top: basic bag-of-words; bottom: our model with implicit local spatial relationship, BOW with ILSR)
We add implicit spatial information to the bag-of-words model as follows; the structure of the algorithm is shown in the bottom row of Fig. 1.

3.1 Feature Extraction

In our model, we extract both appearance and position information for each patch in the feature extraction phase. The training set is $\aleph = \{I_1, \ldots, I_n\}$ if there are $n$ images. We use SIFT [6], a scale-invariant region detector, to detect interest patches in the images. Given an image, a set of 16 × 16 patches $p_1, \ldots, p_m$ is extracted around the resulting keypoints, and their positions $a_1, \ldots, a_m$ are recorded. Here the appearance of each extracted patch is described by a 128-dimensional SIFT vector, and $a_i = (x_i, y_i)$ is its absolute coordinate.
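For illustration only (not the authors' code), the appearance and position information can be obtained with OpenCV's SIFT implementation; here the standard 128-D descriptor plays the role of the patch description:

```python
import cv2
import numpy as np

def extract_appearance_and_position(image_path):
    """Detect SIFT keypoints; return 128-D descriptors and (x, y) positions."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    positions = np.array([kp.pt for kp in keypoints])   # a_i = (x_i, y_i)
    return descriptors, positions
```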
3.2 Representation

The codebook is formed by K-means clustering of the patch set, where the cluster centers are called "visual words". After the visual words are obtained, each patch is assigned to its nearest visual word. We denote by $I = \{p_1, \ldots, p_m\}$ the set of appearance descriptions of an image; each patch $p_i$ is assigned a label $id_i = h$, where $h \in \{1, \ldots, K\}$ and $K$ is the number of visual words $w_1, \ldots, w_K$. We use the corresponding visual word, instead of the patch itself, to describe its appearance. At the same time, we find the $u$ nearest neighbor words around it, which are taken as its implicit spatial information. We denote by $B = \{nb_1, \ldots, nb_m\}$ the corresponding implicit local spatial information, where $nb_i = (p_i^1, \ldots, p_i^u)$ is the set of $u$ nearest neighbor words used, instead of absolute coordinates, to represent the implicit spatial information of $p_i$. It is computed as follows:

$p_i^1 = \arg\min_{p_j,\ j \neq i} \; dist(a_i, a_j)$   (1)

and, for $u \geq 2$,

$p_i^u = \arg\min_{p_j,\ j \neq i,\ p_j \notin \{p_i^1, \ldots, p_i^{u-1}\}} \; dist(a_i, a_j) .$   (2)

Here $dist(\cdot)$ is the Euclidean distance.
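A sketch of this representation step (names and defaults are assumptions): the codebook is built with K-means, and for every patch the visual-word labels of its u spatially nearest patches are recorded as the implicit local spatial information.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_descriptors, n_words=300):
    """Cluster training descriptors; the K cluster centres are the visual words."""
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(all_descriptors)

def image_representation(descriptors, positions, codebook, u=2):
    """Assign each patch to its nearest visual word and record its u nearest
    neighbouring patches by Euclidean distance between positions."""
    word_ids = codebook.predict(descriptors)              # id_i for every patch
    dists = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                       # a patch is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :u]         # indices of the u closest patches
    neighbour_words = word_ids[neighbours]                # nb_i as visual-word labels
    return word_ids, neighbour_words
```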
3.3 The Model

We use a Bayesian probabilistic method to describe this model. We assume that each patch is independent of the others given the class. As an image is regarded as a set of its patches, we use the Bayesian decision rule for object classification. The Bayesian decision rule based on the posterior probabilities is

$c_{j^*} = \arg\max_{j = 1, \ldots, C} \; p(c_j \mid I, B) .$   (3)

where $c_j$ is the $j$-th class and $C$ is the number of classes. The posterior probabilities can be computed as

$p(c_j \mid I, B) = \frac{p(I, B \mid c_j)\, p(c_j)}{p(I, B)} .$   (4)

We assume that the priors are equal for all classes, so computing the posterior probabilities reduces to computing the class-conditional probabilities $p(I, B \mid c_j)$. The probability values are computed using maximum likelihood estimation. With $\theta$ denoting the parameters, the probability can be factored as

$p(I, B \mid c_j; \theta) = p(I \mid B, c_j; \theta)\, p(B \mid c_j; \theta) .$   (5)

The log-likelihood function computed on the whole training set is

$l(\theta) = \ln p(\aleph \mid \theta) ,$   (6)

where

$p(\aleph \mid \theta) = \prod_{i=1}^{n} p(I_i, B_i \mid c_j; \theta_j) .$   (7)

Using maximum likelihood estimation, we obtain $\theta$ and the probability values used for classification.
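The factorization in Eqs. (5)-(7) leaves the exact form of the class-conditional terms open. Purely as an illustration, a naive-Bayes-style reading with smoothed multinomial word and word-pair frequencies could be implemented as follows; all names and the smoothing constant are assumptions, not the authors' model:

```python
import numpy as np

def train_counts(images, n_classes, n_words, alpha=1.0):
    """images: list of (word_ids, neighbour_words, class_label) triples.
    Estimate smoothed per-class word and word/neighbour-pair frequencies."""
    word_counts = np.full((n_classes, n_words), alpha)
    pair_counts = np.full((n_classes, n_words, n_words), alpha)
    for word_ids, neighbour_words, label in images:
        for w, nbs in zip(word_ids, neighbour_words):
            word_counts[label, w] += 1
            for nb in nbs:
                pair_counts[label, w, nb] += 1
    word_log = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    pair_log = np.log(pair_counts / pair_counts.sum(axis=2, keepdims=True))
    return word_log, pair_log

def classify(word_ids, neighbour_words, word_log, pair_log):
    """Equal class priors: pick the class maximizing the summed log-likelihood."""
    scores = word_log[:, word_ids].sum(axis=1)
    for w, nbs in zip(word_ids, neighbour_words):
        for nb in nbs:
            scores += pair_log[:, w, nb]
    return int(np.argmax(scores))
```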
4 Experiments and Results

To properly evaluate the effectiveness and efficiency of our algorithm for object categorization, we run it on diverse datasets. We randomly select half of each dataset as the training set and use the rest as the test set. We first introduce the datasets and then report the experimental results.
Fig. 2. Some examples of the datasets: (a) is from the Caltech-101 dataset and (b) is from the Caltech-256 dataset
4.1 Experiment Set
The Caltech-101 dataset [23] contains 101 classes (including faces, animals, airplanes, flowers, etc.) with high variability; the number of images per category varies from 31 to 800. Some examples are shown in Fig. 2(a). The Caltech-256 dataset [24] is larger than Caltech-101: it contains 256 categories with even higher intra-class variability, each class contains at least 80 images, and some images contain more than one object. Some examples are shown in Fig. 2(b); the images in this dataset are clearly more complicated.
A Novel Object Categorization Model with Implicit Local Spatial Relationship ROC Curves
ROC Curves
1
1
0.9
0.9
0.8
0.8
0.7
0.7 0.6
0.5
d
BOW BOW with ILSR
P
P
d
0.6
0.4
0.3
0.3
0.2
0.2
0.1
BOW BOW with ILSR
0.5
0.4
0 0
0.1 0.2
0.4
0.6
0.8
0 0
1
0.2
0.4
P
1
ROC Curves
1
1
0.9
0.9
0.8
0.8 0.7
0.7
0.6 d
0.6 BOW BOW with ILSR
0.5
P
d
0.8
fa
ROC Curves
P
0.6 P
fa
BOW BOW with ILSR
0.5 0.4
0.4
0.3
0.3
0.2
0.2
0.1
0.1 0 0
141
0.2
0.4
0.6
0.8
1
P
fa
0 0
0.2
0.4
0.6
0.8
1
P
fa
Fig. 3. The ROC curves of two methods. The x-coordinate and y-coordinate are the false positive rate and the true positive rate separately. We record ROC curves for some categories from these two datasets. Other categories have similar results in our experiments. (a) and (c) are for the Caltech-101 dataset, and (b) and (d) are for the Caltech-256 dataset.
To validate our algorithm, we randomly partition each dataset into two parts for training and testing. When clustering the extracted patches to form visual words, we set the vocabulary size K = 300, as in other popular methods; with the same parameter, the experimental results can be compared. We set the other parameter u = 2, i.e., only two neighbor words constitute the local spatial information of a patch. The performance is evaluated by the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate.
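For reference only, such per-category ROC points can be computed with scikit-learn; the variables y_true and scores are placeholders, not outputs reproduced from this paper:

```python
from sklearn.metrics import roc_curve

# y_true: 1 if a test image belongs to the category of interest, else 0.
# scores: the model's likelihood (or posterior) for that category on each image.
def roc_points(y_true, scores):
    fpr, tpr, _ = roc_curve(y_true, scores)
    return fpr, tpr   # plot tpr versus fpr to obtain curves like those in Fig. 3
```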
4.2 Experimental Results and Discussion

In Fig. 3, we show results on the Caltech-101 and Caltech-256 object category datasets. After training the Bayesian model on the training part of each dataset, we perform categorization on the test part. Fig. 3 presents some of the computed ROC curves for the basic bag-of-words algorithm and the proposed object categorization algorithm with implicit local spatial relationship (denoted BOW and BOW with ILSR, respectively). One can see that the ROC curve of BOW with ILSR is above that of BOW in (a), (b), (c) and (d), so the proposed method clearly improves performance. We conclude that the implicit local spatial information is necessary for object categorization. The results on Caltech-101 are better than those on Caltech-256; we infer that this is because the images of Caltech-256 are more complicated, with more cluttered backgrounds and possibly more than one object per image. Although the complexity of the image dataset affects categorization performance, the proposed algorithm improves it in both cases.
5 Conclusion

In this paper we have proposed an extension of the basic bag-of-words model for object categorization that incorporates implicit local spatial relationships. We use the neighbor features of a local feature to represent its implicit local spatial relationship instead of traditional absolute or relative coordinates. The algorithm can deal with large intra-class variation while keeping the necessary spatial information. The experimental evaluation has shown that the proposed algorithm improves performance. To deal with more complicated images, future research may focus on extracting more robust patch features or incorporating other useful information such as priors.

Acknowledgments. This work is supported by the National High Technology Research and Development Program of China (2007AA01Z168), the National Nature Science Foundation of China (60975078, 60902058, 60805041, 60872082, 60773016), the Beijing Natural Science Foundation (4092033) and the Doctoral Foundations of the Ministry of Education of China (200800041049).
References
1. Szummer, M., Picard, R.W.: Indoor-outdoor image classification. In: ICCV Workshop on Content-based Access of Image and Video Databases, Bombay, India, pp. 42–50 (1998)
2. Vailaya, A., Figueiredo, A., Jain, A., Zhang, H.: Image classification for content-based indexing. Transactions on Image Processing, 117–129 (2001)
3. Vogel, J., Schiele, B.: Natural Scene Retrieval Based on a Semantic Modeling Step. In: Enser, P.G.B., Kompatsiaris, Y., O'Connor, N.E., Smeaton, A., Smeulders, A.W.M. (eds.) CIVR 2004. LNCS, vol. 3115, pp. 207–215. Springer, Heidelberg (2004)
4. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proceedings of the 4th Alvey Vision Conference, pp. 147–151 (1988)
5. Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings of the 7th International Conference on Computer Vision, Kerkyra, Greece, pp. 1150–1157 (1999)
6. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
7. Fergus, R., Perona, P., Zisserman, A.: Object class recognition by unsupervised scale-invariant learning. In: Proceedings of the IEEE International Conference on Computer Vision, vol. 2, pp. 264–271 (2003)
8. Crandall, D., Felzenszwalb, P., Huttenlocher, D.: Spatial priors for part-based recognition using statistical models. In: IEEE International Conference on Computer Vision, pp. 10–17 (2005)
9. Bouchard, G., Triggs, B.: Hierarchical part-based visual object categorization. In: IEEE International Conference on Computer Vision, vol. 1, pp. 710–715 (2005)
10. Felzenszwalb, P., Huttenlocher, D.: Pictorial structures for object recognition. International Journal of Computer Vision 61(1), 55–79 (2005)
11. Csurka, G., Bray, C., Dance, C., Fan, L.: Visual categorization with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision, European Conference on Computer Vision, pp. 1–22 (2004)
12. Li, F.F., Perona, P.: A Bayesian hierarchical model for learning natural scene categories. In: IEEE International Conference on Computer Vision, vol. 2, pp. 524–531 (2005)
13. Sivic, J., Russell, B., Efros, A., Zisserman, A., Freeman, W.: Discovering object categories in image collections. Technical Report, Massachusetts Institute of Technology (2005)
14. Ullman, S., Naquet, M.V., Sali, E.: Visual features of intermediate complexity and their use in classification. Nature Neurosci. 5(7), 682–687 (2002)
15. Cula, O.G., Dana, K.J.: Recognition Methods for 3D Textured Surfaces. In: Proceedings of SPIE Conference on Human Vision and Electronic Imaging VI, San Jose, California, pp. 209–220 (2001)
16. Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022 (2003)
17. Aksoy, S., Koperski, K., Tusk, C., Marchisio, G., Tilton, J.C.: Learning Bayesian classifiers for scene classification with a visual grammar. IEEE Transactions on Geoscience and Remote Sensing 43(3), 581–589 (2005)
18. Bloch, I.: Fuzzy spatial relationships for image processing and interpretation: A review. Image and Vision Computing 23(2), 89–110 (2005)
19. Boutell, M.R., Luo, J., Brown, C.M.: Factor graphs for region-based whole-scene classification. In: IEEE Conference on Computer Vision and Pattern Recognition, SLAM Workshop, New York (2006)
20. Kumar, S., Hebert, M.: A hierarchical field framework for unified context-based classification. In: IEEE International Conference on Computer Vision, Beijing, China, vol. 2, pp. 1284–1291 (2005)
21. Marszalek, M., Schmid, C.: Spatial Weighting for Bag-of-Features. In: Proceedings of the IEEE International Conference on Computer Vision, vol. 2, pp. 2118–2125 (2006)
22. Gökalp, D., Aksoy, S.: Scene Classification Using Bag-of-Regions Representations. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1–8 (2007)
23. http://www.vision.caltech.edu/Image_Datasets/Caltech101/
24. http://www.vision.caltech.edu/Image_Datasets/Caltech256/
Facial Expression Recognition Method Based on Gabor Wavelet Features and Fractional Power Polynomial Kernel PCA*

Shuai-shi Liu (1,3) and Yan-tao Tian (1,2,**)

(1) School of Communication Engineering, Jilin University
(2) Key Laboratory of Bionics Engineering, Ministry of Education, Jilin University
(3) School of Electrical and Electronic Engineering, Changchun University of Technology, 130025 Changchun, Jilin, China
[email protected], [email protected]
Abstract. Existing facial expression recognition methods are often affected by illumination variation and individual differences. To address this problem, a facial expression recognition method based on a local Gabor filter bank and fractional power polynomial kernel PCA is presented in this paper. The local Gabor filter bank overcomes the disadvantages of the traditional Gabor filter bank, which needs a lot of time to extract Gabor feature vectors and produces highly redundant high-dimensional features. The KPCA algorithm is capable of deriving low-dimensional features that incorporate higher-order statistics. In addition, SVM is used to classify the features. Experimental results show that this method can effectively reduce the influence of illumination and yields better recognition accuracy with far fewer features. Keywords: local Gabor filter bank, kernel principal component analysis, facial expression recognition.
1 Introduction

Recently, facial expression recognition has become a very active topic in the machine vision community. More and more technical papers address this area, and a brief tutorial overview can be found in [1]. Among the various face recognition algorithms, one of the most successful techniques is the appearance-based method. To cope with the very high dimensionality of the original facial expression images, dimensionality reduction techniques are widely employed. Two of the most popular dimensionality reduction algorithms are Principal Component Analysis (PCA) [2][3] and Linear Discriminant Analysis (LDA) [8]. Scholkopf et al. [4][5] proposed a novel approach called Kernel Principal Component Analysis (KPCA). It can be used to extract nonlinear principal components
* This paper is supported by the Key Project of the Science and Technology Development Plan of Jilin Province (Grant No. 20071152).
** Corresponding author.
efficiently without carrying out the nonlinear mapping explicitly. Zhong [6] discussed the robustness of existing kernel principal component analysis and proposed a new approach to facial expression analysis using KPCA. Yang [7] presented an upper facial action unit recognition method based on KPCA and SVM. A local Gabor filter bank can overcome the disadvantages of the traditional Gabor filter bank, which needs a lot of time to extract Gabor feature vectors and produces highly redundant high-dimensional features, while fractional power polynomial kernel PCA is capable of deriving low-dimensional features that incorporate higher-order statistics. We therefore propose a method that uses a local Gabor filter bank and KPCA to extract facial expression features and uses SVM to classify them.
2 Facial Expression Recognition System

The facial expression recognition system designed in this paper consists of three modules: pre-processing, feature extraction and classification. Figure 1 shows the flow chart.
Fig. 1. Flow chart of the facial expression recognition system
3 Feature Extraction and Dimension Reduction

3.1 Gabor Feature Representation

Compared with the traditional Fourier transform, the Gabor transform has many desirable properties that make it popular: a Gabor filter can easily be tuned to obtain detailed localization in both the spatial and frequency domains, and it has multi-resolution analysis ability and a tunable focus. We can use a group of Gabor filters with different spatial-frequency properties to extract expression features and analyze the image at different granularities. Daugman [9] first applied the 2D Gabor transform in the field of computer vision. The 2D Gabor filter is essentially a two-dimensional Gaussian envelope modulated by a complex plane wave, defined as follows:
$\varphi_{\mu,\nu}(z) = \frac{\| k_{\mu,\nu} \|^2}{\sigma^2} \exp\left( -\frac{\| k_{\mu,\nu} \|^2 \| z \|^2}{2\sigma^2} \right) \left[ \exp( i\, k_{\mu,\nu} \cdot z ) - \exp\left( -\frac{\sigma^2}{2} \right) \right] .$   (1)

Here, $\mu$ and $\nu$ index the frequencies and orientations of the Gabor filters, $z = (x, y)$ is the spatial location, and $k_{\mu,\nu}$ is the plane wave vector, expressed as:

$k_{\mu,\nu} = k_\nu e^{i\phi_\mu} .$   (2)

Here, $k_\nu = k_{max} / f^\nu$, $\phi_\mu = \pi\mu / 8$, and $k_{max} = \pi / 2$ is the maximum frequency.
The Gabor transform of a facial expression image is realized by convolving the facial expression image I(z) with the Gabor filters φμ,ν(z) at multiple frequencies and orientations. Generally, we choose μ = {0, 1, 2, 3, 4} and ν = {0, 1, 2, 3, 4, 5, 6, 7}, giving a total of 40 Gabor filters.
(b)
Fig. 2. (a) The real part of the filters with five frequencies and eight orientations (the row corresponds to different frequency
μm , the column corresponds to different orientation ν n ) (b) The
magnitudes of the Gabor features representation of one face image
3.2 Local Gabor Filter Bank Global filter bank is shown with G ( m × n )
,which is composed of all the filters of
m-scale and n-direction. It can be seen from figure 2 (b) that the eigenvalues extracted by Gabor filters with same direction and different frequencies are very similar especially in the adjacent frequencies, which shows the eigenvalues have great redundancy and relativity. So a novel local filter bank LG ( m × n ) is proposed in this paper, whose filters spread all over m-scale and n-direction of the global filter bank and one or multi-scale (less than m) is selected in the same direction. Local filter bank not only contains the multi-scale and multi-direction feature information of global filter
Facial Expression Recognition Method
(a) G(4h8)
(b) LG1(4h8)
(c) LG2(4h8)
(d) G(3h8)
(e) LG1(3h8)
(f) LG2(3h8)
147
Fig. 3. Examples of several global and local Gabor filter bank
bank but also reduces the redundancy in eigenvalues. Through it feature extracting time can be shortened, feature dimension can be decreased and at the same time recognition rate can be ensured. Examples of several global and local Gabor filter bank are shown as figure 3. 3.3 Fractional Power Polynomial Kernel PCA The Kernel trick is demonstrated efficiently to represent complicated nonlinear relations of the input data into an implicit feature space R F with a non-linear mapping, and then the data are analyzed in R F . KPCA overcomes many limitations of its linear counterpart by nonlinearly mapping the input space to a high dimensional feature space. KPCA is capable of deriving low dimensional features that incorporate higher order statistics. This justification of KPCA comes from the Cover’s theorem on the separability of patterns, which states that non-linearly separable patterns in an input space are linearly separable with high probability if the input space is transformed nonlinearly to a high dimensional feature space. Computationally, KPCA takes advantage of the Mercer equivalence condition and is feasible because the dot products in the high dimensional feature space are replaced by those in the input space. The computational complexity is reduced to the number of training examples rather than the dimensions of the features space. A kernel for two image elements x and y is represented as K ( x , y ) . This is a similarity measure defined by an implicit mapping φ , which is definitely non-linear in nature. This mapping is done from the original space to the vector space such that
K ( x, y ) = φ ( x ) .φ ( y ) .
(3)
According to Mercer, kernel K ( x , y ) is a valid kernel if K is positive definite and symmetric. In other words the Kernel Gram Matrix, which has the elements as K ( x , y ) must also be a positive definite for all the nomial Kernel is defined as
xi , xj of the input space. Poly-
148
S.-s. Liu and Y.-t. Tian
K ( x, y ) = ( x. y + 1) . d
where d is the degree of polynomial. Kernel matrix K ( x, y ) is to form projected samples φ ( x ) in teger and fractional value of d .
(4)
R F . We used in-
4 SVM Classification

SVM is a statistical learning method based on the principle of structural risk minimization. Its basic idea is that nonlinearly separable data are mapped to a higher-dimensional space through a nonlinear transformation, and the optimal separating hyperplane is then found in that space, keeping the training sample points as far from the class boundary as possible, i.e., maximizing the margin between classes. SVM inherently solves two-class classification problems; expression classification is a multi-class problem, so it must be converted into a set of two-class problems. The usual strategies are "one against all" and "one against one"; the latter is used in this paper. For k-class discrimination, k(k-1)/2 SVMs are built, each distinguishing one pair of classes. In the testing phase, each SVM votes for one class, and the final recognition result is the class with the most votes.
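A minimal sketch of this classification stage with scikit-learn, whose SVC natively implements the one-against-one strategy with voting; the kernel and C value are illustrative, not taken from the paper:

```python
from sklearn.svm import SVC

def train_expression_classifier(train_features, train_labels):
    # SVC trains k*(k-1)/2 binary SVMs internally and predicts by voting.
    clf = SVC(kernel="rbf", C=10.0, decision_function_shape="ovo")
    clf.fit(train_features, train_labels)
    return clf

# predicted = train_expression_classifier(X_train, y_train).predict(X_test)
```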
5 Experimental Results

The performance of the proposed facial expression recognition method based on the Gabor wavelet transform and fractional power polynomial kernel PCA is assessed in this section. The image library used in the experiments is the JAFFE (Japanese Female Facial Expression) database from Kyushu University, Japan, which contains 213 facial expression images of 10 Japanese women. Each subject shows seven facial expressions: anger, disgust, fear, happy, neutral, sadness and surprise, with 3 or 4 images per expression and subject. 137 images covering the seven expressions are chosen as training samples, with 20, 18, 20, 19, 20, 20 and 20 samples for the respective expressions. The remaining 76 images are used as testing samples, with 10, 11, 12, 12, 10, 11 and 10 samples, respectively. The original images (256 × 256) are normalized to 128 × 104.

5.1 Recognition Rates Corresponding to Different Gabor Filter Banks
The effectiveness of the KPCA method is compared against the PCA method, and the recognition rates corresponding to different Gabor filter banks are shown in Table 1. The results show that the KPCA method outperforms the PCA method, and that by using a local filter bank the feature dimension can be decreased while the recognition rate is maintained. The detailed recognition results for the seven expressions obtained with the local filter bank LG2(4×8) are shown in Table 2.
Table 1. Recognition rates corresponding to different Gabor filter banks

Gabor Filter Bank   Filters Number   PCA       KPCA
G(5×8)              40               92.11 %   96.05 %
G(4×8)              32               90.79 %   94.74 %
G(3×8)              24               88.16 %   92.11 %
LG1(4×8)            8                90.79 %   94.74 %
LG2(4×8)            16               93.42 %   96.05 %
LG1(3×8)            8                86.84 %   92.11 %
LG2(3×8)            12               88.16 %   94.74 %
Table 2. The recognition results for the 7 expressions

Expression   Testing Samples   Recognition Number   Recognition Rate (%)
Anger        10                9                    90
Disgust      11                11                   100
Fear         12                10                   83.33
Happy        12                12                   100
Neutral      10                10                   100
Sadness      11                11                   100
Surprise     10                10                   100
Average      76                73                   96.05
Analysis of the experimental results: 1) Disgust, happy, neutral, sadness and surprise are recognized 100% correctly. 2) Anger is misidentified as disgust in Fig. 4(a). Reason: the subtle changes in this person's anger and disgust expressions are very similar. Improvement: the description of subtle facial expression changes should be further enhanced. 3) Fear is misidentified as surprise in Fig. 4(b). Reason: the difference between this person's fear and surprise expressions is not distinct. Improvement: the discrimination between different expressions should be strengthened and the description of subtle expression changes enhanced. 4) Fear is misidentified as happy in Fig. 4(c). Reason: the eyes and mouth are not very symmetrical, and the illumination of the image on the right is noticeably stronger. Improvement: the facial expression images should be normalized and the illumination further processed.
Fig. 4. Misidentified testing samples (a) Anger (b) Fear (c) Fear
5.2 Influence of Illumination Normalization on the Recognition Rate
Recognition rates with and without illumination normalization are compared in the experiment, and the results are shown in Table 3.

Table 3. Recognition rates with and without illumination normalization

Illumination Normalization   PCA G(4×8)   PCA LG2(4×8)   KPCA G(4×8)   KPCA LG2(4×8)
NO                           86.84 %      85.53 %        93.42 %       94.74 %
YES                          90.79 %      93.42 %        94.74 %       96.05 %
The results show that PCA features are highly sensitive to illumination: with illumination normalization the PCA recognition rates improve by 4-8 percentage points, whereas the KPCA recognition rate improves by only about 1 percentage point. Clearly, the Gabor + KPCA method can effectively reduce the influence of illumination.
6 Conclusions

A facial expression recognition method based on a local Gabor filter bank and fractional power polynomial kernel PCA has been presented in this paper. Gabor and KPCA algorithms are used to extract the facial expression features. The KPCA algorithm reduces the dimensionality of the image feature matrix, and hence the computational cost, by mapping the image into the feature space, and removes features that mainly reflect illumination variation. The extracted features can effectively mask the effects of individual differences and illumination variation. Finally, SVM is used to train on and recognize the facial expression features. A recognition rate of 96.05% is obtained with a much lower feature dimension using this method. The main contributions of the presented method are as follows: 1) Gabor wavelet features are not hypersensitive to changes of facial expression and tolerate normalization errors of the facial expression images well. 2) Compared with the global Gabor filter bank, the local Gabor filter bank has clear advantages: feature extraction time, feature dimension and memory requirements are reduced, and the recognition rate can even be improved in some settings. 3) Compared with traditional PCA, the KPCA algorithm yields a better recognition rate with far fewer feature dimensions and less CPU time for feature matching. 4) The combination of the Gabor and KPCA algorithms effectively reduces the influence of illumination. 5) A better recognition rate can be obtained by choosing the SVM parameters appropriately.
References
1. Liu, S., Tian, Y., Li, D.: New Research Advances of Facial Expression Recognition. In: International Conference on Machine Learning and Cybernetics, pp. 1150–1155. IEEE Press, New York (2009)
2. Andrew, A., Calder, J., Burton, M.: A principal component analysis of facial expressions. J. Visi. Rese. 41, 1179–1208 (2001)
3. Sun, W., Ruan, Q.: Two-Dimension PCA for Facial Expression Recognition. In: International Conference on Signal Processing Proceedings, pp. 1721–1724. IEEE Press, New York (2006)
4. Scholkopf, B., Smola, A., Muller, K.: Nonlinear Component Analysis as a Kernel Eigenvalue Problem. J. Neur. Comput. 10, 1299–1319 (1998)
5. Muller, K., Mika, S., Ratsch, G., Tsuda, K., Scholkopf, B.: An Introduction to Kernel-based Learning Algorithms. J. Neur. Net. 12, 181–201 (2001)
6. Zhong, J., Franck, D., Zhen, L.: Facial Expression Analysis by Using KPCA. In: International Conference on Robotics, Systems and Signal Processing, pp. 736–741. IEEE Press, New York (2003)
7. Yang, C., Zhan, Y.: Upper Facial Action Units Recognition Based on KPCA and SVM. In: Computer Graphics, Imaging and Visualisation (2007)
8. Martinez, A.M., Kak, A.C.: PCA versus LDA. J. Patt. Anal. 23, 228–233 (2001)
9. Daugman, J.G.: Complete Discrete 2-D Gabor Transforms by Neural Networks for Image Analysis and Compression. J. Acou. Speec. 36, 1169–1179 (1988)
10. Wang, K., Lin, X., Wang, W., Duan, S.: Application of Kernel Method on Face Feature Extraction. In: International Conference on Mechatronics and Automation, pp. 3560–3564. IEEE Press, New York (2007)
11. Reilly, J., Ghent, J., McDonald, J.: Non-Linear Approaches for the Classification of Facial Expressions at Varying Degrees of Intensity. In: International Machine Vision and Image Processing Conference, pp. 125–132. IEEE Press, New York (2007)
Affine Invariant Topic Model for Generic Object Recognition

Zhenxiao Li and Liqing Zhang

MOE-Microsoft Key Laboratory for Intelligent Computing and Intelligent Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University
[email protected], [email protected]
Abstract. This paper presents a novel topic model named the Affine Invariant Topic Model (AITM) for generic object recognition. Abandoning the "bag of words" assumption of traditional topic models, AITM incorporates spatial structure into LDA. It extends LDA by modeling visual words with latent affine transformations as well as latent topics, treating topics as different parts of objects and assuming a common affine transformation of the visual words belonging to a given topic. MCMC is employed for inference over the latent variables, an MCMC-EM algorithm is used for parameter estimation, and the Bayesian decision rule is used to perform classification. Experiments on two challenging data sets demonstrate the efficiency of AITM. Keywords: Topic model, Graphical model, Object recognition.
1 Introduction
Generic object recognition is an essential and challenging problem in computer vision. The crux of this problem is how to form global representations of objects with diverse appearances within cluttered backgrounds, a task at which human visual systems greatly outperform computer systems. Research in computer science, statistics, psychology and neuroscience has explored this problem extensively over several decades. In recent years, with the rapid development of the theory of probabilistic graphical models [1], modeling complex problems and data with graphical models has offered both flexibility and computational feasibility. Topic models [2], as a large class of probabilistic graphical models, have proved quite successful in information retrieval and text mining [3][4]. They automatically cluster co-occurring words into topics, yielding an overall representation of documents. In the computer vision field, topic models have also shown satisfactory results on certain tasks [5][6]. Traditional topic models, such as probabilistic Latent Semantic Analysis (pLSA) [3] and Latent Dirichlet Allocation (LDA) [4], are so-called "bag of words" models, which only focus on the occurrences/co-occurrences of words (local image features). Due to the "bag of words" assumption, the spatial structure
among the local features in the images is neglected, even though it has proved very useful in computer vision tasks, especially in generic object recognition. To incorporate spatial structure into the learning model, several extensions of traditional topic models have been proposed [6][7]. However, most of these works model the spatial relationships among foreground local features and background patches; few attempts have been made to directly model the spatial relationships among the various parts of an object. In this paper, we propose a novel topic model which models the spatial structure of local features using affine transformations while retaining the strength of LDA. We use the SIFT [8] feature detector to extract local features from images. Clustering the local feature descriptors generates a universal codebook, and each descriptor is then assigned to its codeword (visual word). Given all visual words and their locations in the images, our model specifies a full joint generative distribution over these data using latent affine transformations as well as latent topics. We name our topic model the Affine Invariant Topic Model (AITM). The remainder of the paper is organized as follows. In Section 2, we mention some related work. Section 3 briefly reviews the LDA model, and Section 4 gives a detailed description of AITM. Section 5 presents empirical results demonstrating AITM's efficiency. Section 6 gives some discussion.
2
Related Work
LDA [4], as an improvement of pLSA [3], originally aims to find latent topics in texts and perform document classification. The authors provide two different ways of performing classification: one based on the Bayesian decision rule, and another that trains an SVM on the LDA low-dimensional representation in a discriminative way. Inspired by the success of LDA's application to text mining, Fei-Fei et al. [5] introduced LDA into computer vision to learn natural scene categories. They treat local image patches as words in texts, performing LDA in almost the same way as in [4]. Several works attempt to introduce spatial structure into LDA. Spatial LDA [6] models the distances between the locations of local patches and some predefined reference points with a Gaussian distribution, and aims to segment images with respect to the learned topics. Cao et al. [7] propose Spatial-LTM, which characterizes the spatial relationships using segmentation regions, to improve segmentation results and also perform classification tasks. The constellation model [9] directly models the spatial configurations of local features, which are not considered by the above-mentioned works. However, the constellation model enumerates all possible combinations of a certain number of local features, and the number of combinations grows exponentially with the number of features being combined.
3
LDA
In this section, we briefly describe LDA, which serves as the basis of our approach. There are D images in total in the collection. The d-th image has Nd
Fig. 1. Graphical model representation of LDA
local features. A universal codebook of size V is generated by clustering the local feature descriptors. Let K denote the number of latent topics, which can either be predefined by hand or be learned from the data through some model selection approach. Each visual word w_dn (resp. each latent topic z_dn) is represented by w_dn = (w_dn^1, ..., w_dn^V) (resp. z_dn = (z_dn^1, ..., z_dn^K)) with exactly one component equal to one and all other components equal to zero. The topic mixture θ is a K-dimensional random vector sampled from a Dirichlet distribution with hyperparameter α. Depending on the form of the model, θ can be sampled either once per image (denoted θ_d) [4] or once per category (denoted θ_c) [5]. For each visual word w_dn, a latent topic z_dn is a multinomial random variable sampled from a multinomial distribution with mixture θ_d. Given the topic z_dn, the visual word w_dn is sampled from a multinomial distribution over the feature codebook with parameters \prod_{k=1}^{K} \beta_{kv}^{z_{dn}^k}. The full joint probability distribution over the observed visual words w, latent topics z, and latent topic mixtures θ is as follows:

p(\theta_d, z_d, w_d \mid \alpha, \beta) = p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta)   (1)

where

p(\theta_d \mid \alpha) = \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_{dk}^{\alpha_k - 1}   (2)

p(z_{dn} \mid \theta_d) = \prod_{k=1}^{K} \theta_{dk}^{z_{dn}^k}   (3)

p(w_{dn} \mid z_{dn}, \beta) = \prod_{k=1}^{K} \prod_{v=1}^{V} \beta_{kv}^{z_{dn}^k w_{dn}^v}   (4)
Fig. 2. Graphical model representation of AITM

Fig. 1 describes LDA in graphical representation. Marginalizing over the latent variables z and θ gives the marginal distribution of the visual words w. However, this marginalization is intractable, making exact inference of the posterior distribution over the latent variables infeasible, even though this posterior plays an important role in parameter estimation and prediction. Owing to the conjugacy between the Dirichlet distribution and the discrete (multinomial) distribution, variational inference is used for approximation in LDA [4]. Gibbs sampling is also used as a stochastic approximation for inference in LDA [11].
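To make the generative process of Eqs. (1)-(4) concrete, the following is an illustrative sketch (not part of the original work) of sampling the visual words of one image under LDA; all parameter values below are hypothetical.

import numpy as np

def sample_lda_image(alpha, beta, n_words, rng=np.random.default_rng(0)):
    """alpha: (K,) Dirichlet hyperparameter; beta: (K, V) topic-word probabilities."""
    theta = rng.dirichlet(alpha)                                  # theta_d ~ Dir(alpha), Eq. (2)
    topics = rng.choice(len(alpha), size=n_words, p=theta)        # z_dn ~ Mult(theta_d), Eq. (3)
    words = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in topics])  # w_dn, Eq. (4)
    return theta, topics, words

# toy example: K = 3 topics over a codebook of V = 5 visual words
theta, z, w = sample_lda_image(np.ones(3), np.full((3, 5), 0.2), n_words=10)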
4
Affine Invariant Topic Model
The traditional LDA assumes that the local features form a “bag of words”. We abandon this assumption in order to incorporate spatial structure among local features. To this end, we choose the SIFT [8] detector as our local feature detector. In addition to the feature descriptor, SIFT also provides substantial spatial information, including location, scale, and orientation; AITM utilizes only the location. Therefore, besides the visual words {w_dn}, location coordinates l_dn = (l_dn^x, l_dn^y) are available for each visual word. AITM keeps the same generative process for the visual words {w_dn}, while assuming an additional distribution for the locations. The crucial point is how to utilize this spatial information while introducing only a small number of parameters, so as to keep the approach tractable. We assume that each topic in our model represents a part of a certain object, and that the spatial structure of local features from a common topic should be relatively stable across images up to some affine transformation. We therefore employ prior spatial configurations and affine transformations to characterize the observed spatial information. To be more specific, we model the prior spatial configuration of the visual word v with respect to topic k with a pair of random variables h_kv = (h_kv^x, h_kv^y), which is Gaussian distributed with mean g_kv = (g_kv^x, g_kv^y) and a common variance q_kv:
\begin{pmatrix} h_{kv}^x \\ h_{kv}^y \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} g_{kv}^x \\ g_{kv}^y \end{pmatrix}, \begin{pmatrix} q_{kv} & 0 \\ 0 & q_{kv} \end{pmatrix} \right)   (5)
For simplicity, we denote e_kv = {g_kv, q_kv}. Furthermore, for each image d and each topic k, a set of affine transformation random variables λ_dk = (ξ_dk^x, ξ_dk^y, s_dk, φ_dk) is used to describe how the observed visual words w_d assigned to topic k are affinely transformed from the prior configuration, through the following relationship:

\begin{pmatrix} l_{dn}^x \\ l_{dn}^y \end{pmatrix} \sim \mathcal{N}\left( \hat{s}_{dn} \begin{pmatrix} \cos\hat{\varphi}_{dn} & \sin\hat{\varphi}_{dn} \\ -\sin\hat{\varphi}_{dn} & \cos\hat{\varphi}_{dn} \end{pmatrix} \begin{pmatrix} \hat{h}_{dn}^x \\ \hat{h}_{dn}^y \end{pmatrix} + \begin{pmatrix} \hat{\xi}_{dn}^x \\ \hat{\xi}_{dn}^y \end{pmatrix}, \begin{pmatrix} \rho_d & 0 \\ 0 & \rho_d \end{pmatrix} \right)   (6)

where

\hat{\xi}_{dn}^x = \sum_{k=1}^{K} z_{dn}^k\, \xi_{dk}^x, \qquad \hat{\xi}_{dn}^y = \sum_{k=1}^{K} z_{dn}^k\, \xi_{dk}^y   (7)

\hat{s}_{dn} = \prod_{k=1}^{K} s_{dk}^{z_{dn}^k}, \qquad \hat{\varphi}_{dn} = \prod_{k=1}^{K} \varphi_{dk}^{z_{dn}^k}   (8)

\hat{h}_{dn}^x = \prod_{k=1}^{K} \prod_{v=1}^{V} (h_{kv}^x)^{z_{dn}^k w_{dn}^v}, \qquad \hat{h}_{dn}^y = \prod_{k=1}^{K} \prod_{v=1}^{V} (h_{kv}^y)^{z_{dn}^k w_{dn}^v}   (9)
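As an illustration of the location model in Eq. (6), the sketch below samples an observed location from the prior part position of a word's topic under a scale-rotation-translation transform; the parameter values are hypothetical and not taken from the paper.

import numpy as np

def sample_location(h, s, phi, xi, rho, rng=np.random.default_rng(0)):
    """h: prior part position (h^x, h^y); s: scale; phi: rotation angle;
    xi: displacement (xi^x, xi^y); rho: isotropic observation variance."""
    R = np.array([[np.cos(phi),  np.sin(phi)],
                  [-np.sin(phi), np.cos(phi)]])
    mean = s * R @ np.asarray(h) + np.asarray(xi)   # affine transform of Eq. (6)
    return rng.normal(mean, np.sqrt(rho))           # Gaussian observation noise

loc = sample_location(h=(0.3, -0.2), s=1.5, phi=0.1, xi=(64.0, 80.0), rho=4.0)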
To avoid overfitting and render the approach tractable, Bayesian prior distributions are placed over the affine transformation random variables. For the displacements ξ_dk = (ξ_dk^x, ξ_dk^y), we choose a Gaussian prior with means μ_d = (μ_d^x, μ_d^y) and variances σ_d = (σ_d^x, σ_d^y). For the scale s, however, a Gaussian prior is not a reasonable assumption, since s is a positive real number and variations of s_dk yield non-uniform errors when s_dk has different magnitudes. Guided by this concern, a more reasonable assumption is that the logarithm of s_dk is Gaussian distributed with mean μ_d^s and variance σ_d^s. For the orientation φ_dk, a uniform prior is taken:

\begin{pmatrix} \xi_{dk}^x \\ \xi_{dk}^y \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_d^x \\ \mu_d^y \end{pmatrix}, \begin{pmatrix} \sigma_d^x & 0 \\ 0 & \sigma_d^y \end{pmatrix} \right)   (10)

\log s_{dk} \sim \mathcal{N}(\mu_d^s, \sigma_d^s)   (11)

\varphi_{dk} \sim \mathcal{U}([0, 2\pi))   (12)
Note that all parameters of the Bayesian priors are image-level and shared across topics. For simplicity, we denote ω_d = {μ_d^x, μ_d^y, μ_d^s, σ_d^x, σ_d^y, σ_d^s}. AITM has two sets of parameters modeling spatial structure: {g_kv^x, g_kv^y, q_kv}, representing the prior spatial configuration, and {ξ_dk^x, ξ_dk^y, s_dk, φ_dk}, representing the affine transformations. However, these parameters are redundant, rendering the problem underdetermined. To overcome this problem, the following three constraints are placed on {g_kv^x, g_kv^y} for all k:

0 = \sum_{v=1}^{V} g_{kv}^x, \qquad 0 = \sum_{v=1}^{V} g_{kv}^y   (13)

1 = \sum_{v=1}^{V} \left[ (g_{kv}^x)^2 + (g_{kv}^y)^2 \right]   (14)

g_{k,1}^x = g_{k,2}^x   (15)
Fig. 2 describes AITM in graphical representation. We apply Gibbs sampling to perform inference for the latent random variables. For {h_kv^x, h_kv^y, ξ_dk^x, ξ_dk^y}, since both the prior and the likelihood are Gaussian, the conditional probability distributions also take a Gaussian form. For s_dk (resp. φ_dk), the product prior × likelihood takes the form log-normal (resp. uniform) × Gaussian. Since the log-normal (resp. uniform) density is bounded, s_dk (resp. φ_dk) can be sampled by rejection sampling [10]. For z_dn, the Gibbs update has one additional multiplicative factor compared with the LDA Gibbs update equation (please refer to [11] for the LDA Gibbs update); the additional factor is easily derived from (6). All the parameters can be trained by the MCMC-EM algorithm [10]: the Gibbs sampling described above constitutes the E-step, and the M-step substitutes the Gibbs samples into the joint distribution and maximizes the likelihood with respect to the parameters under the constraints (13)-(15).
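The rejection-sampling step mentioned above can be sketched as follows; this is a simplified, hypothetical version in which the Gaussian likelihood term is expressed directly in log s and the prior serves as the proposal distribution.

import numpy as np

def rejection_sample_scale(mu_s, sigma_s, m_lik, s_lik, rng=np.random.default_rng(0)):
    """Sample s from a density proportional to LogNormal(mu_s, sigma_s) prior times a
    Gaussian likelihood N(m_lik, s_lik^2) on log s (illustrative shapes only)."""
    while True:
        s = rng.lognormal(mean=mu_s, sigma=sigma_s)               # propose from the prior
        accept = np.exp(-(np.log(s) - m_lik) ** 2 / (2 * s_lik ** 2))
        if rng.random() < accept:                                 # accept w.p. likelihood / max(likelihood)
            return s

s_sample = rejection_sample_scale(mu_s=0.0, sigma_s=0.5, m_lik=0.2, s_lik=0.3)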
5 Experiment
5.1 Data Set
For evaluation we performed experiments on two publicly available image data sets: Caltech 101 [12] and Caltech 256 [13]. Caltech 101 contains more than 9,000 images belonging to 101 object categories and a background class. Each category contains from 31 to 800 images. The images in Caltech 101 exhibit occlusion, clutter, and intra-class appearance variation to a certain extent. Fig. 3 shows examples from Caltech 101. Caltech 256 improves on Caltech 101, containing 30,607 images belonging to 256 object categories and a clutter class. Each category contains at least 80 images. Caltech 256 increases the diversity of the images, making it more challenging. Fig. 4 shows examples from Caltech 256.
Fig. 3. Sample images in Caltech 101
Fig. 4. Sample images in Caltech 256
Fig. 5. Performance as a function of the number of training samples for Caltech 101 (comparing LDA and AITM)
5.2
Experimental Setup and Empirical Results
We randomly split the data set into training images and testing images. Following the suggestion in [12], for each category the number of training images is set to 5, 10, 15, 20, and 30, and the number of testing images is set to 30 (if a category has fewer than 30 remaining images, all the remaining images are used as testing images). For each image, we extract at most 200 SIFT features. Collecting all SIFT descriptors from the training images, we learn a universal codebook using the k-means clustering algorithm, as in [5]. The codebook size V is chosen to be 1000, and the number of topics is chosen to be 40. Traditional LDA is used as the baseline for comparison. The Bayesian decision rule is employed to perform classification in both LDA and AITM. Fig. 5 illustrates the experimental results on Caltech 101, from which we find that LDA outperforms AITM with few training samples, while AITM outperforms LDA with adequate training samples. A plausible explanation for this phenomenon is that since
Fig. 6. Performance as a function of the number of topics for Caltech 101
Fig. 7. Performance as a function of the number of training samples for Caltech 256. The performance of SPM was reported in [13]
AITM contains more parameters than LDA, it is more prone to overfitting with few training samples. We then fix the number of training samples at 30 and vary the number of topics K over 5, 10, 20, 40, and 80. The experimental results are shown in Fig. 6. The performance does not increase markedly when K is larger than 40, suggesting that 40 is likely to be close to the “true” number of latent topics.
We also perform classification on Caltech 256. To compare AITM's performance with that of spatial pyramid matching (SPM) [14] as reported in [13], the number of training images is set to 10, 20, 30, and 40. The experimental results are shown in Fig. 7. AITM's performance is quite close to that of SPM, one of the most effective algorithms on Caltech 256.
6
Discussion and Future Work
This paper proposes a new topic model, AITM, for modeling the spatial structure of local image features. AITM groups visual words that both co-occur and have a stable spatial configuration into common topics. A notable property is that if the variance parameter q becomes very large, AITM assigns roughly the same probability regardless of the locations; in this case AITM behaves like LDA. The additional flexibility renders AITM more prone to overfitting with few training samples, as shown in the experiments. As for the experimental results, AITM is inferior to SPM on Caltech 256. A possible reason is that SPM is a discriminative method and therefore has more classification power. Hence, modifying AITM into a discriminative model may be our next step.
Acknowledgement The work was supported by the National Basic Research Program of China (Grant No. 2005CB724301), the Science and Technology Commission of Shanghai Municipality (Grant No. 08511501701), and the National Natural Science Foundation of China (Grant No. 60775007).
References 1. Jordan, M.I.: Graphical Models. Statistical Science 19, 140–155 (2004) 2. Blei, D.M., Lafferty, J.D.: Topic Models. In: Srivastava, A., Sahami, M. (eds.) Text Mining: Theory and Applications. Taylor and Francis, London (2009) 3. Hofmann, T.: Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning Journal 42, 177–196 (2001) 4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993–1022 (2003) 5. Fei-Fei, L., Perona, P.: A Bayesian Hierarchical Model for Learning Natural Scene Categories. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 524–531. IEEE Press, New York (2005) 6. Wang, X., Grimson, E.: Spatial Latent Dirichlet Allocation. In: Advances in Neural Information Processing Systems, vol. 20. MIT Press, Cambridge (2007) 7. Cao, L., Fei-Fei, L.: Spatially Coherent Latent Topic Model for Concurrent Object Segmentation and Classification. In: IEEE 11th International Conference on Computer Vision (2007)
8. Lowe, D.G.: Distinctive Image Features from Scale-invariant Keypoints. International Journal of Computer Vision 60(2), 91–110 (2004) 9. Fergus, R., Perona, P., Zisserman, A.: Weakly Supervised Scale-invariant Learning of Models for Visual Recognition. International Journal of Computer Vision 71(3), 273–303 (2004) 10. Andrieu, C., De Freitas, N., Doucet, A., Jordan, M.I.: An Introduction to MCMC for Machine Learning. Machine Learning Journal 50, 5–43 (2003) 11. Griffiths, T., Steyvers, M.: Finding Scientific Topics. Proceedings of the National Academy of Sciences 101, 5228–5235 (2004) 12. Fei-Fei, L., Fergus, R., Perona, P.: Learning Generative Visual Models from Few Training Examples: an Incremental Bayesian Approach Tested on 101 Object Categories. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Workshop on Generative-Model Based Vision (2004) 13. Griffin, G., Holub, A.D., Perona, P.: The Caltech-256, Caltech Technical Report 14. Lazebnik, S., Schmid, C., Ponce, J.: Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), vol. 2, pp. 2169–2178. IEEE Press, New York (2006)
Liver Segmentation from Low Contrast Open MR Scans Using K-Means Clustering and Graph-Cuts
Yen-Wei Chen 1,2, Katsumi Tsubokawa 2, and Amir H. Foruzan 2,3
1 Electronics & Inf. Eng. School, Central South Univ. of Forestry and Tech., China
2 College of Information Science and Eng., Ritsumeikan University, Shiga, Japan
3 College of Engineering, University of Tehran, Tehran, Iran
Abstract. Recently a growing interest has been seen in minimally invasive treatments with open configuration magnetic resonance (Open-MR) scanners. Because of the lower magnetic field (0.5T), the contrast of Open-MR images is very low. In this paper, we address the problem of liver segmentation from lowcontrast Open-MR images. The proposed segmentation method consists of two steps. In the first step, we use K-means clustering and a priori knowledge to find and identify liver and non-liver index pixels, which are used as “object” and “background” seeds, respectively, for graph-cut. In the second step, a graph-cut based method is used to segment the liver from the low-contrast Open MR images. The main contribution of this paper is that the object (liver) and background (non-liver) seeds (regions) in every low-contrast slice of the volume can be obtained automatically by K-means clustering without user interaction. Keywords: Liver segmentation, Low-contrast object segmentation, K-means clustering, Open-MR image, Graph-cut.
1 Introduction Evaluation of liver geometry, its vessel structures, and the sizes and locations of liver tumors is considered a critical step prior to liver treatment planning [1, 2]. The initial stage of any CAD/CAS system that deals with the liver is segmentation. A wide range of image processing techniques have been used by researchers to develop liver segmentation algorithms, such as probabilistic atlases [3], active contours [4], statistical shape models [5], the graph-cut technique [6], and intensity-based approaches [7]. These include both automatic and semi-automatic approaches. On the other hand, a growing interest has recently been seen in minimally invasive treatments with open configuration magnetic resonance (Open-MR) scanners [8]. Figure 1 shows a typical scene of minimally invasive treatment with an Open MR scanner. The doctor can perform the treatment under the guidance of MR images. Because of the lower magnetic field (0.5 T), the contrast of Open-MR images is very low. Liver segmentation from such low-contrast volumes is still considered a challenging task.
Fig. 1. Minimally invasive treatments with Open MR scanners
In this paper, we propose a novel technique for liver segmentation from low contrast Open-MR images. The proposed segmentation method consists of two steps. In the first step, we use K-means clustering and a priori knowledge to find and identify liver and non-liver indexing pixels. In the second step, a graph-cut based method is used to segment the liver from the low-contrast Open MR images. The identified liver and non-liver pixels are used as “object” and “background” seeds, respectively. The main contribution of this paper is that the object (liver) and background (non-liver) seeds (regions) in every low-contrast slice of the volume can be obtained automatically by K-means clustering without user interaction. The paper is organized as follows: In section 2, we describe how clustering and a priori knowledge can be used for identification of liver and non-liver indexing pixels in low-contrast images. The graph-cut based liver segmentation is explained in section 3. Section 4 shows the results of the method and section 5 concludes the paper.
2 K-Means Clustering for Automatic Initial Seeds Finding In the first step, we use K-means clustering and a priori knowledge to find and identify both liver and non-liver index pixels in every low-contrast slice image [9]. We start segmentation from an initial slice. The initial slice is selected from among the middle slices of a dataset, in which the liver has a single large cross-section. This slice currently needs to be segmented manually; finding it automatically is left as future work. One example is shown in Fig. 2. The selected initial slice (Slice No. 16) and its manually segmented liver mask are shown in Fig. 2(a) and 2(b), respectively. We also calculate the intensity mean (μ) and standard deviation (σ) of the segmented liver. The manually segmented liver mask of the initial slice and the liver intensity mean (μ) and standard deviation (σ) are used as a priori knowledge for liver segmentation, so our proposed method can be considered a semi-automatic segmentation method. Using this a priori knowledge of the initial liver slice, we try to automatically find and identify typical liver and non-liver index pixels on its neighbouring slices (e.g., Slice No. 17, shown in Fig. 3(a)).
Fig. 2. The selected initial slice (a) and its manually segmented liver mask (b)
Fig. 3. (a) Neighbour slice to be segmented, (b) Thresholding image in a narrow range round the mean of intensity, (c) Cluster centers (white circles) by K-means clustering
We assume that the intensity distribution of the liver is Gaussian, and we threshold the slice to be segmented in a narrow range [μ − βσ, μ + βσ] to find liver candidate pixels, as shown in Fig. 3(b), where the parameter β is a constant. If we choose very large or very small values for β, the segmentation results may suffer from over-segmentation or under-segmentation, respectively. However, this parameter is not very sensitive to minor changes, and it only has to be tuned separately for low-contrast and high-contrast datasets. For low-contrast images, smaller values of β have to be selected (i.e. β < 0.711σo), while for high-contrast images we have to choose β > 0.711σo [9]. The result of narrow-band thresholding is an image with a number of pixels both inside the object and in the background. In this paper, we call these pixels candidate pixels. The number of candidate pixels inside the object is large and they are dense, whereas candidate pixels in the background are either few and dispersed, or they constitute separate clusters with respect to the clusters inside the object. We employ K-means clustering to group these pixels into several small clusters, as shown in Fig. 3(c). The cluster centers are shown as small white circles in Fig. 3(c), and the number of clusters is 100. We call these cluster centers index pixels. We then use the
initial liver mask to discriminate or identify both liver and non-liver index pixels based on their locations. The identified liver and non-liver index pixels can be used as “object” and “background” seeds, respectively, for graph-cuts based segmentation without any user interaction.
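A minimal sketch of this seed-finding step is given below, assuming 2-D numpy arrays for the slice and the previous liver mask; the function name, cluster count, and use of scikit-learn are our own illustrative choices rather than the authors' implementation.

import numpy as np
from sklearn.cluster import KMeans

def find_seeds(slice_img, prev_mask, beta=0.7, n_clusters=100):
    mu = slice_img[prev_mask > 0].mean()                          # liver intensity statistics
    sigma = slice_img[prev_mask > 0].std()
    cand = np.argwhere(np.abs(slice_img - mu) < beta * sigma)     # narrow-band candidate pixels
    km = KMeans(n_clusters=n_clusters, n_init=5, random_state=0).fit(cand.astype(float))
    centers = km.cluster_centers_.round().astype(int)             # index pixels (cluster centers)
    is_liver = prev_mask[centers[:, 0], centers[:, 1]] > 0        # label by the previous mask
    return centers[is_liver], centers[~is_liver]                  # object seeds, background seeds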
3 Graph-Cuts Based Liver Segmentation Graph-cuts [10] is an interactive segmentation technique and has been applied to organ segmentation from medical CT volumes [6, 10]. The basic idea is to separate an object of interest from the background based on graph cuts. The segmentation problem is formulated on a discrete graph. The graph G = {V, E} is composed using vertices V representing the image pixels, as well as edges E connecting the vertices. There are two special vertices (also called terminals): an “object” terminal (a source) and a “background” terminal (a sink). The source is connected by edges to all vertices identified as object seeds and the sink is connected to all background seeds. An example of a simple 2D graph for a 3x3 image is shown in Fig.4(a).
Fig. 4. A simple 2D graph for a 3x3 image (a) and its minimal cut (b)
Table 1. Weights or costs for each link (edge)
All edges from the terminals are referred to as t-links. Pairs of neighboring pixels are connected by weighted edges that we call n-links (neighborhood links). The weights or costs of the n-links and t-links are given in Table 1, where p and q are neighboring pixels of the image P (I_p and I_q in Eqs. (1)-(3) are their intensities), and O and B represent “object” and “background”, respectively; p ∈ O and p ∈ B denote pixels of the “object” seeds and “background” seeds, respectively, which are given by users.

R_p(\text{"obj"}) = -\ln \Pr(I_p \mid O)   (1)

R_p(\text{"bkg"}) = -\ln \Pr(I_p \mid B)   (2)

B_{\{p,q\}} \propto \exp\left( -\frac{(I_p - I_q)^2}{2\sigma^2} \right) \cdot \frac{1}{\mathrm{dist}(p, q)}   (3)

K = 1 + \max_{p \in P} \sum_{q:\{p,q\} \in N} B_{\{p,q\}}   (4)
The goal of graph-cut based segmentation is to find labels L = {L_1, L_2, ..., L_p, ...} for the pixels that minimize the following energy function, which is the sum of the weights of the edges that are cut:

E(L) = \lambda R(L) + B(L)   (5)

where λ is a weight factor and

R(L) = \sum_{p \in P} R_p(L_p)   (6)

B(L) = \sum_{\{p,q\} \in N} B_{\{p,q\}} \cdot \delta(L_p, L_q)   (7)

\delta(L_p, L_q) = \begin{cases} 1 & \text{if } L_p \neq L_q \\ 0 & \text{otherwise} \end{cases}   (8)
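For illustration, the snippet below evaluates the energy of Eqs. (5)-(8) for a given binary labelling with a 4-neighbourhood; it only computes the objective and does not solve the min-cut itself, and all inputs are placeholders.

import numpy as np

def energy(labels, img, R_obj, R_bkg, lam=1.0, sigma=10.0):
    """labels: 0/1 array (1 = object); img: intensity image;
    R_obj, R_bkg: per-pixel regional costs R_p("obj") and R_p("bkg")."""
    R = np.where(labels == 1, R_obj, R_bkg).sum()                 # regional term, Eq. (6)
    B = 0.0
    for axis in (0, 1):                                           # 4-neighbourhood n-links
        dI = np.diff(img, axis=axis)
        cut = np.diff(labels, axis=axis) != 0                     # delta(L_p, L_q), Eq. (8)
        B += (np.exp(-dI ** 2 / (2 * sigma ** 2)) * cut).sum()    # boundary term, Eq. (7)
    return lam * R + B                                            # Eq. (5)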
A cut on the graph G = {V, E} is a partition of V into two disjoint sets S and T = V − S, as shown in Fig. 4(b). The cost of the cut is the sum of the costs of all edges that are severed by the cut. The minimum cut problem is to find the cut with the smallest cost. There are numerous algorithms that solve this problem, such as max-flow/min-cut algorithms [10]. Graph-cut methods differ from active contour methods in that they are not iterative and achieve global minimization easily. The main contribution of this paper is that the object (liver) and background (non-liver) seeds (regions) in every slice of the volume can be obtained automatically by K-means clustering without user interaction, as shown in Fig. 5(a) and 5(b). The graph-cut based liver segmentation result is shown in Fig. 5(c) (green image). In order to
make a comparison, the segmented liver is overlaid with the Open-MR slice image. It can be seen that the liver is almost perfectly segmented. The segmentation accuracy is about 80% (the manually segmented liver is used as ground truth).
Fig. 5. Automatically estimated “object” seeds (a) and “background” seeds (b); (c) graph-cut based liver segmentation result (green image)
The segmented liver is used as the initial mask for segmenting its neighboring slice. By repeating the K-means clustering based seed estimation and the graph-cut based segmentation, we can segment the whole liver volume slice by slice.
4 Experimental Results The proposed method was applied to real clinical data acquired with an Open-MR scanner. The Open-MR volumes have 28 slices with 5 mm thickness, and their in-plane resolution is 1.17 mm × 1.17 mm with a 300 × 300 mm2 FOV. Several segmentation results are shown in Fig. 6. For comparison, the manual segmentation results, which are used as ground truth, are also shown in Fig. 6. The segmented liver is overlaid on the Open-MR slice image. It can be seen that the liver is almost perfectly segmented. The Dice measure (DSC) is used as a measure of segmentation accuracy, defined as

DSC(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}   (9)

where A is the segmentation result, B is the ground truth (the manual segmentation result), and |·| denotes the number of pixels contained in a set. The mean DSC is about 80%, and the processing time is reduced significantly from 7 min (the processing time of manual segmentation) to 30 s; both are acceptable in real clinical applications.
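A minimal sketch of computing the DSC of Eq. (9) for two binary numpy masks:

import numpy as np

def dice(A, B):
    A, B = A.astype(bool), B.astype(bool)
    return 2.0 * np.logical_and(A, B).sum() / (A.sum() + B.sum())

# toy example: two 3 x 3 masks overlapping on two pixels -> DSC = 4/6
A = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 0]])
B = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]])
print(dice(A, B))   # 0.666...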
Fig. 6. Comparison of automatically segmented results and manually segmented results
5 Conclusions A novel liver segmentation technique has been proposed for minimally invasive treatments with Open-MR scanners, which is based on K-means clustering and graph-cut for low-contrast Open-MR images. The proposed segmentation method consists of two steps. In the first step, we use K-means clustering and a priori knowledge to find and identify liver and non-liver pixels, which are used as “object” and “background” seeds, respectively, for graph cuts. In the second step, a graph cuts based method is used to segment the liver from the low-contrast Open MR images. The effectiveness of the proposed method has been shown.
Acknowledgments This work was supported in part by the Grant-in-Aid for Scientific Research from the Japanese Ministry of Education, Science, Culture and Sports under Grant No. 21300070, and in part by the R-GIRO Research Fund from Ritsumeikan University.
References [1] Meinzer, H.P., Schemmer, P., Schobinger, M., Nolden, M., Heimann, T., Yalcin, B., Richte, G.M., Kraus, T., Buchler, M.W., Thorn, M.: Computer-based Surgery Planning for Living Liver Donation. In: 20th ISPRS Congress, Istanbul 2004, International Archives of Photogrammetry and Remote Sensing, vol. XXXV(B), pp. 291–295 (2004) [2] Nakayama, Y., Li, Q., Katsuragawa, S., Ikeda, R., Hiai, Y., Awai, K., et al.: Automated Hepatic Volumetry for Living Related Liver Transplantation At Multisection CT. Radiology 240(3), 743–748 (2006) [3] Park, H., Bland, P., Meyer, C.: Construction of an abdominal probabilistic atlas and its application in segmentation. IEEE Transactions on Medical Imaging 22(4), 483–492 (2003) [4] Alomari, R.S., Kompalli, S., Chaudhary, V.: Segmentation of the Liver from Abdominal CT Using Markov Random Field Model and GVF Snakes. In: Proceedings of the 2008 International Conference on Complex, Intelligent and Software Intensive Systems, vol. 00, pp. 293–298 (2008) [5] Soler, L., Delingette, H., Malandain, G., Montagnat, J., Ayache, N., Koehl, C., et al.: Fully automatic anatomical, pathological, and functional segmentation from CT scans for hepatic surgery. Computer Aided Surgery 6(3), 131–142 (2001) [6] Massoptier, L., Casciaro, S.: Fully Automatic Liver Segmentation through Graph-Cut Technique. In: Proceedings of the 29th Annual International Conference of the IEEE EMBS Cité Internationale, Lyon, France, August 23-26 (2007) [7] Foruzan, A., Zoroofi, R., Sato, Y., Hori, M., Murakami, T., Nakamura, H., Tamura, S.: Automated segmentation of liver from 3d ct images. International Journal of Computer Assisted Radiology and Surgery 1(7), 71–73 (2006) [8] Morikawa, S., Inubushi, T., Kurumi, Y., Naka, S., Sato, K., Tani, T., Yamamoto, I., Fujimura, M.: MR-Guided microwave thermocoagulation therapy of liver tumors: initial clinical experiences using a 0.5 T open MR system. J. Magn. Reson. Imaging 16, 576– 583 (2002) [9] Foruzan, A.H., Chen, Y.-W., Zoroofi, R.A., Furukawa, A., Sato, Y., Hori, M.: Multimode Narrow-band Thresholding with Application in Liver Segmentation from Lowcontrast CT Images. In: Proc. of 2009 Fifth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, September 2009, pp. 1293–1296 (2009) [10] Boykov, Y., Jolly, M.-P.: Interactive organ segmentation using graph cuts. In: Delp, S.L., DiGoia, A.M., Jaramaz, B. (eds.) MICCAI 2000. LNCS, vol. 1935, pp. 276–286. Springer, Heidelberg (2000)
A Biologically-Inspired Automatic Matting Method Based on Visual Attention Wei Sun, Siwei Luo, and Lina Wu School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
[email protected]
Abstract. Image matting is an important task in image and video editing. In this paper we propose a novel automatic matting approach, which can provide a good set of constraints without human intervention. We use the attention shift trace in a temporal sequence as the useful constraints for matting algorithm instead of user-specified “scribbles”. Then we propose a modified visual selective attention mechanism which considered two Gestalt rules (proximity & similarity) for shifting the processing focus. Experimental results on real-world data show that the constraints are useful. Distinct from previous approaches, the algorithm presents the advantage of being biologically plausible. Keywords: Visual attention, Image matting, Gestalt rules, FOA.
1 Introduction Image matting is an important task for image understanding and has been a topic of much research over the years. Most common matting algorithms are interactive. The aim of interactive image matting is to extract a foreground object from an image based on limited user input. Recently, there has been a lot of interest and impressive results in interactive matting [1,2,3,4,5]. Image matting is an ill-posed problem, because at each pixel we have to estimate the foreground and background colors, as well as the foreground opacity α, from a single color measurement. To overcome this under-constrained problem, most methods use user-specified constraints. Using only a sparse set of user-specified constraints, usually in the form of “scribbles” or a “trimap” (an example is shown in Figure 1), these methods produce a soft matte of the entire image. Hence, one of the key problems in matting is obtaining user-specified constraints. The “scribbles” are white or black lines drawn on the image by users: white scribbles indicate foreground, black scribbles indicate background. Scribble-based methods [4,6] use these sparse constraints to iteratively estimate the unknowns at every pixel in the image. As pointed out by [7], there is a need for methods that enable a matting algorithm to get constraints automatically. A good automatic method should pick the constraints that are best suited to the current image. In [7], the user-specified “scribbles” are replaced by constraints indicated by local occlusion information. In this paper, we propose a new method to automatically detect useful constraints for the image matting problem. We use the visual attention shift trace in a
temporal sequence as the constraints of image matting. Distinct from previous approaches, the algorithm presents the advantage of being biologically plausible.
Fig. 1. Examples of user-specified constraints. (a) Original image. (b) An accurate hand-drawn trimap. (c)An image with hand-drawn scribbles: white scribbles indicate foreground, black scribbles indicate background.
Selective attention plays an important role in visual processing in reducing the problem scale and in actively gathering useful information. A biologically motivated attention system detects regions of interest which “pop-out” automatically due to strong contrasts and the uniqueness of features. Many visual attention models have been suggested. Tsotsos et al.[8] use local winner-take-all networks and top-down mechanisms to selectively tune model neurons at the attended location. Itti et al.[9] introduced a model for bottom-up selective attention based on serially scanning a saliency map. Our attention system is based on the Itti et al.[9] implementation of the saliency-based model. In this paper, we present a new method motivated by human vision which enables a matting algorithm to get constraints automatically. First, we introduce the matting method and then propose a novel visual attention mechanism to get the attention shift traces as the constraints for matting. Then we will use the constraints to guide the process of image matting. Our experiments demonstrate that the proposed approach is indeed useful.
2 Approach The automatic matting method is composed of two sub-modules, as illustrated in Figure 2. One is the attention control module, which generates attention shift trace according to a saliency map. The second sub-module is the matting module, which separate the foreground and background according to constraints generated by attention shift trace. Our approach is closely related to the α -matting method of Levin et al.[4], and the visual attention model of Itti et al.[9].
Fig. 2. The approach is composed of two modules: an attention control module and a matting module
2.1 Image Matting Method In a standard matting approach, one assumes that each observed pixel in an image, I ( x, y ) , is explained by the convex combination of two unknown colors, F and B. A
soft weight, α , controls this combination:
I ( x , y ) = α ( x , y ) F ( x , y ) + (1 − α ( x , y ) ) B ( x , y ) .
(1)
In this formulation, pixels with an α-value near one are likely part of the F “class”, while those with an α-value near zero are likely part of the B “class”. The goal of the method is to solve the compositing equation (1) for the unknowns at every pixel. In our approach, we use the matting method proposed by Levin et al. [4]. This method uses hand-drawn scribbles as user-specified constraints, and first derives a cost function from local smoothness assumptions on the foreground and background colors:

J(\alpha, a, b) = \sum_{j \in I} \left( \sum_{i \in w_j} \Big( \alpha_i - \sum_c a_j^c I_i^c - b_j \Big)^2 + \varepsilon \sum_c (a_j^c)^2 \right)   (2)

where a = \frac{1}{F - B}, b = -\frac{B}{F - B}, and w_j is a small window around pixel j. The authors then show that in the resulting expression it is possible to analytically eliminate the foreground and background colors (a^c and b) to obtain a quadratic cost function in alpha:

J(\alpha) = \alpha^T L \alpha   (3)
k
Where
∑
( i , j )∈ w k
∑
window
k
⎛ ⎜ δ ij − 1 ⎜ wk ⎝
⎛ ⎜1 + ⎜ ⎝
(Ii
− μk
is a 3 × 3 covariance matrix,
⎛
) ⎜⎜ ∑ ⎝
k
+
ε wk
⎞ I 3 ⎟⎟ ⎠
−1
(I
j
− μk
⎞⎞
) ⎟⎟ ⎟⎟
(4)
⎠⎠
μ k is a 3 × 1 mean vector of the colors in a
wk , and I 3 is the 3 × 3 identity matrix.
Thus we can find the globally optimal alpha matte by solving a sparse linear system of equations. We demonstrate this in figure 3.
Fig. 3. Matting examples. (a)(c) Input images with scribbles. (b)(d) Extracted mattes.
2.2 Attention Control Module The starting point for the proposed method is the existing saliency model of Itti et al. [9,10], which is freely available on the World Wide Web. This model is used to select highly salient points and pre-attentive, low-level feature descriptors for these points. Salient points are identified by computing seven center-surround features: image intensity contrast, red/green and blue/yellow double-opponent channels, and four orientation contrasts. The model extracts feature maps for the above seven features and builds up a saliency map using intermediary conspicuity maps. A winner-take-all network of integrate-and-fire neurons selects winning locations, and an inhibition-of-return mechanism allows the model to attend to many locations successively. For a more detailed description of the module see Fig. 4. In order to deal with the matting problem, we use a center-surround priority CS which has high values in the center of the image [11]. We use this priority because objects in the center of the view are much more likely to attract human attention. CS is expressed in the form of a two-dimensional Gaussian function:

CS = \exp\left( -\left[ \frac{(x - x_0)^2}{2\sigma_x^2} + \frac{(y - y_0)^2}{2\sigma_y^2} \right] \right)   (5)
where x_0 and y_0 are the center coordinates of the input image, and σ_x and σ_y are the standard deviations in the horizontal and vertical directions, respectively. The initial saliency map is formed by:

S = \frac{I + O + C + CS}{4}   (6)
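For illustration, Eqs. (5)-(6) can be realized as follows; the conspicuity maps I, O, and C are random placeholders here, and the σ values are assumptions.

import numpy as np

def center_surround(h, w, sigma_x=None, sigma_y=None):
    sigma_x = sigma_x or w / 4.0
    sigma_y = sigma_y or h / 4.0
    y, x = np.mgrid[0:h, 0:w]
    x0, y0 = (w - 1) / 2.0, (h - 1) / 2.0
    return np.exp(-((x - x0) ** 2 / (2 * sigma_x ** 2) +
                    (y - y0) ** 2 / (2 * sigma_y ** 2)))   # Eq. (5)

h, w = 120, 160
I, O, C = (np.random.rand(h, w) for _ in range(3))          # placeholder conspicuity maps
S = (I + O + C + center_surround(h, w)) / 4.0               # initial saliency map, Eq. (6)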
Until now we have only obtained the first focus-of-attention location. We also have to consider how the selection process moves from one location to the next, i.e., how selective attention shifts across the visual field. From psychophysical experiments it is known that it takes some measurable time to shift the focus of attention from one location to another. Shifts may possibly be directed under voluntary control [12], although in this paper we consider only involuntary, automatic aspects of selective attention. If the shifting apparatus is to scan different parts of a given object automatically, it is useful to introduce a bias based on both spatial proximity and similarity. Both mechanisms are related to phenomena in perceptual grouping and “Gestalt effects” which occur as a function of object similarity and spatial proximity. In order to solve the matting problem, we want the foci of attention to explore more features of the foreground object. We introduce an update rule for the saliency map which encourages the next attention target to stay close to the current fixation point (proximity) and to other salient features (similarity). US(t) indicates the possibility for a pixel to be foreground or background. The update rule is implemented by adding a trace of the neighbors of the fixation points over the history of the observation duration:

US(t) = \beta \times US(t-1) + \sum_{p \in f_t} \big[ PX(p, t) + MF(p, t) \big]   (7)
where PX(p, t) is a function that assigns high values to a large neighbourhood around the fixation point p at time t from the trace list f_t, corresponding to the proximity rule, and MF(p, t) is a function that assigns low values to a small neighbourhood around the fixation point p at time t, where this region is obtained by the shape estimator proposed by Walther [13], corresponding to the similarity rule. Each time after an attention shift, the saliency map is updated by:

S'(t) = S(t) \otimes US(t)   (8)
where ⊗ denotes element-by-element multiplication of two matrices. The saliency map update rule helps the model focus on the foreground during the first few attention shifts over an image, and lets it explore as many foreground features as possible.
2.3 Automatic Matting with FOA Shift Trace
Now that we have obtained several FOA shift traces, we use these traces as the “white scribbles” that indicate foreground for image matting. In our experiments, we use the first five shift traces as constraints. Because most foreground objects lie in the center of an image, for simplicity we use the pixels on the left, upper, and right boundaries of the input image as “black scribbles”. With these scribbles, we can use the matting algorithm to compute the matte of an image.
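The attention-control loop that produces these FOA traces can be sketched as follows, following the update rule of Eqs. (7)-(8); the neighbourhood radii, the damping value, and the decay β are illustrative assumptions, not values from the paper.

import numpy as np

def disk(shape, center, radius):
    y, x = np.mgrid[0:shape[0], 0:shape[1]]
    return (y - center[0]) ** 2 + (x - center[1]) ** 2 <= radius ** 2

def update_trace(US, fixations, beta=0.9, near=40, attended=10):
    US = beta * US                                     # decay of the previous trace
    for p in fixations:                                # p in the trace list f_t
        US[disk(US.shape, p, near)] += 1.0             # PX: promote a large neighbourhood
        US[disk(US.shape, p, attended)] = 0.1          # MF: damp the attended region
    return US

S = np.random.rand(120, 160)                           # current saliency map (placeholder)
US = update_trace(np.ones_like(S), fixations=[(60, 80)])
S_new = S * US                                         # Eq. (8): element-wise product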
Fig. 4. Our model combines a saliency-based attention system with an alpha-matting algorithm
3 Experimental Results In this section, we exercise the proposed automatic matting method on natural scene images. We used color images from www.alphamatting.com as test images. For each image, the selective attention module generates five FOA shift traces, and the alpha-matting module uses these traces as user-specified constraints to obtain the alpha matte. Figure 5 shows some results of our method. From the resulting mattes, we can see that our attention-based scribbles are very useful for image matting. For comparison, we give some hand-drawn scribbles as user-specified constraints and use the alpha-matting method [4] to obtain the alpha matte. Figure 6 shows the mattes extracted using our automatic matting method on the test images and compares our results with the interactive matting algorithm [4]. It can be seen that our results on these examples are comparable in quality to those of [4], even though we use purely bottom-up, vision-motivated constraints. To obtain a more quantitative comparison between the algorithms, we performed an experiment with images for which we have the ground-truth alpha matte. We measured the summed absolute error between the extracted matte and the ground truth and obtained the results in Figure 7. The image size is 800 × 678. The y-axis of the histogram is the number of error pixels. When the foreground is smooth, all constraints perform well with the matting algorithm. When the foreground contains more diverse features, matting with few scribbles performs poorly.
Fig. 5. Our method results. (a) input images; (b) sequences of attention shifts in the image; (c) result mattes.
Fig. 6. A comparison of alpha mattes extracted by different constraints. (a) little scribbles and matte by[4]; (b) more scribbles and matte by[4]; (c) our results with attention based scribbles.
Fig. 7. A comparison of our results with ground truth
4 Conclusions In this paper, we have presented a modified saliency map mechanism which enables attention to stay mainly on the foreground for the first several shifts and to explore more features of the foreground. The FOA shift trace is then supplied to an alpha-matting algorithm as user-specified constraints, yielding an automatic matting method that is biologically plausible. Experimental results have demonstrated that this method can deal with many natural scene images. However, our method only considers attention shifts within one object; the multiple-object scene is not considered here. We obtain good foreground constraints from visual attention, but background constraints are not considered in our work. Our future work will concentrate on how to obtain good background constraints motivated by the human vision system, and we will consider more of the scene information. Acknowledgments. This work is supported by National High Technology Research and Development Program of China (2007AA01Z168), National Nature Science Foundation of China (60975078, 60902058, 60805041, 60872082, 60773016), Beijing Natural Science Foundation (4092033) and Doctoral Foundations of Ministry of Education of China (200800041049).
References 1. Apolstoloff, N., Fitzgibbon, A.: Bayesian Video Matting Using Learnt Image Priors. In: 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, vol. 1, pp. 407–414 (2004) 2. Bai, X., Sapiro, G.: A Geodesic Framework for Fast Interactive Image and Video Segmentation and Matting. In: 11th IEEE International Conference on Computer Vision, Rio De Janeiro, pp. 1–8 (2007) 3. Chuang, Y., Curless, B., Salesin, D., Szeliski, R.: A Bayesian Approach to Digital Matting. In: 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Hawaii, vol. II, pp. 264–271 (2001) 4. Levin, A., Lischinski, D., Weiss, Y.: A Closed Form Solution to Natural Image Matting. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, vol. 1, pp. 61–68 (2006)
5. Rhemann, C., Rother, C., Gelautz, M.: Improving Color Modeling for Alpha Matting. In: British Machine Vision Conference 2008, Leeds, pp. 1155–1164 (2008) 6. Wang, J., Cohen, M.: An Iterative Optimization Approach for Unified Image Segmentation and Matting. In: 10th IEEE International Conference on Computer Vision, Beijing, vol. 2, pp. 936–943 (2005) 7. Apostoloff, N., Fitzgibbon, A.: Automatic Video Segmentation Using Spatiotemporal Tjunctions. In: British Machine Vision Conference 2006, Edinburgh, pp. 1–10 (2006) 8. Tsotsos, J.K., Culhane, S.M., Wai, W., Lai, Y.H., Davis, N., Nuflo, F.: Modeling Visual Attention via Selective Tuning. Artificial Intelligence 78, 507–545 (1995) 9. Itti, L., Koch, C.: Computational Modelling of Visual Attention. Nature Reviews Neuroscience (2001) 10. Walther, D., Koch, C.: Modeling Attention to Salient Proto-objects. Neural Networks 19, 1395–1407 (2006) 11. Li, M., Clark, J.J.: Selective Attention in the Learning of Invariant Representation of Objects. In: 2005 IEEE Computer Society International Conference on Computer Vision and Pattern Recognition, San Diego, vol. 3, pp. 93–101 (2005) 12. Posner, M.I.: Orienting of Attention. Quat. J. Exper. Psych. 32, 2–25 (1980) 13. Walther, D., Itti, L., Riesenhuber, M., Poggio, T., Koch, C.: Attentional Selection for Object Recognition – a Gentle Way. In: Bülthoff, H.H., Lee, S.-W., Poggio, T.A., Wallraven, C. (eds.) BMCV 2002. LNCS, vol. 2525, pp. 472–479. Springer, Heidelberg (2002a)
Palmprint Classification Using Wavelets and AdaBoost Guangyi Chen1, Wei-ping Zhu2, Balázs Kégl3, and Róbert Busa- Fekete3 1
Department of Mathematics and Statistics, Concordia University, Montreal, Quebec, Canada H3G 1M8
[email protected] 2 Department of Electrical and Computer Engineering, Concordia University, Montreal, Quebec, Canada H3G 1M8
[email protected] 3 LAL/LRI, University of Paris-Sud, CNRS, 91898 Orsay, France {balazs.kegl,busarobi}@gmail.com
Abstract. A new palmprint classification method is proposed in this paper by using the wavelet features and AdaBoost. The method outperforms all other classification methods for the PolyU palmprint database. The novelty of the method is two-fold. On one hand, the combination of wavelet features with AdaBoost has never been proposed for palmprint classification before. On the other hand, a recently developed base learner (products of base classifiers) is included in this paper. Experiments are conducted in order to show the effectiveness of the proposed method for palmprint classification. Keywords: Palmprint classification, wavelet transform, feature extraction, AdaBoost.
1 Introduction Biometric authentication uses physiological characteristics of a person to recognize that person's identity. These include fingerprints, facial features, iris patterns, speech patterns, hand geometry, palmprints, and so on. Palmprint classification is a new branch of biometric authentication [1]. Unlike other well-developed biometric features, only limited work has been reported on palmprint classification, despite the importance of palmprint features. Palmprint classification offers a number of advantages over other biometric authentication techniques. For example, the principal lines and the wrinkles of a palm can be easily obtained from a low-resolution image. They vary very little over time, and their shape and location are very important features for biometric authentication. A brief overview of some of the existing methods for palmprint classification is given here. Zhang and Shu [2] and You et al. [3] used line-segment matching and half interesting-point matching for palmprint classification, respectively. Dong et al. [4] proposed to use the curvelet transform to extract features for palmprint recognition. Chen et al. [5] used dual-tree complex wavelet features for palmprint classification, and a higher classification rate was reported than with the scalar wavelet
features. Chen and Kégl [6] developed a palmprint classification method by using the contourlet features, which have a multi-orientation representation. Zhang et al. [7] utilized a novel device for online palmprint acquisition and an efficient algorithm for palmprint classification. In addition, a 2D Gabor phase encoding scheme is proposed for palmprint feature extraction and representation. In this paper, a novel method for palmprint classification is proposed by using the 2D wavelet features at different resolution scales and AdaBoost as a classifier. Over the past two decades, the wavelet transform has received a lot of attention from researchers in many different fields. It has already shown great success in such diverse fields as pattern recognition, image compression, signal/image processing, and computer graphics, to name a few. The wavelet transform decomposes a pattern into a multiresolution representation, which exactly mimics the human vision system. This is why the wavelet transform is so successful in pattern recognition. AdaBoost is selected as a classifier to classify the unknown palmprint images by using the extracted 2D wavelet features. Experimental results show that the proposed method achieves state-of-the-art classification rates and it outperforms every other method compared in this paper. The paper is organized as follows. Section 2 reviews the basic concept of the AdaBoost algorithm. Section 3 proposes a novel technique for palmprint classification by using wavelet features and AdaBoost. Section 4 conducts some experiments for classifying unknown palmprint images. Finally Section 5 draws the conclusions and gives future work to be done.
2 AdaBoost Freund and Schapire proposed the AdaBoost algorithm in [8]. It solved many practical difficulties of earlier boosting algorithms. AdaBoost calls a given weak learning algorithm repeatedly and maintains a distribution (a set of weights) over the training set. All weights are set equally at the beginning, but on each round the weights of incorrectly classified examples are increased so that the weak learner is forced to focus on the hard examples in the training set. Schapire and Singer [9] proposed the multi-class classification method for AdaBoost, namely AdaBoost.MH. The pseudo-code for AdaBoost.MH is given below.

AdaBoost.MH(X, Y, W(1), BASE(·, ·, ·), T)
For t = 1, ..., T, repeat:
1) (\alpha_t, v_t, \varphi_t(\cdot)) \leftarrow BASE(X, Y, W_t)
2) h_t(\cdot) \leftarrow \alpha_t v_t \varphi_t(\cdot)
3) For i = 1, ..., n and l = 1, ..., K:
   w_{t+1}^{i,l} \leftarrow w_t^{i,l} \, \frac{e^{-h_t^l(x_i)\, y_{i,l}}}{\sum_{i'=1}^{n} \sum_{l'=1}^{K} w_t^{i',l'}\, e^{-h_t^{l'}(x_{i'})\, y_{i',l'}}}
4) Return f_T(\cdot) = \sum_{t=1}^{T} h_t(\cdot).
where X is the observation matrix, Y is the label matrix (one-hot encoded multi-class labels), W1 is the initial weight matrix, BASE(·, ·, ·) is the base learner algorithm, and T is the number of iterations. αt is the base coefficient, vt is the vote vector, φt(·) is the scalar base classifier, ht(·) is the vector-valued base classifier, and fT(·) is the final (strong) classifier. See [10] for a more detailed description. Kégl and Busa-Fekete [10] described and tested AdaBoost.MH with products of simple base learners. It was found that boosting products outperforms boosting trees, is less prone to overfitting, and is even able to improve on boosting stumps in complex feature spaces where boosting stumps would be expected to be the state of the art. For a more detailed explanation, the reader is directed to [10].
3 Palmprint Classification Using the Wavelet Transform and AdaBoost In this section, a new palmprint classification method is proposed by using the wavelet features and the AdaBoost as a classifier. After scanning the hand, the palm samples contain the fingers and the background, which are undesirable. We extract the central portion of the palm sample and save it to a matrix of size 128 × 128 for later processing. We apply the 2D discrete wavelet transform to the extracted palmprint image for a number of decomposition levels. The wavelet representation provides a coarse-to-fine strategy, called multiresolution matching. The matching starts from the coarsest scale and moves on to the finer scales. The costs for different levels are quite different. Since the coarsest scale has only a small number of coefficients, the cost at this scale is much less than for finer scales. In practice, the majority of patterns can be unambiguously identified during the coarse scale matching, while only few patterns will need information at finer scales to be identified. Therefore, the process of multiresolution matching will be faster compared to the conventional matching techniques. We then use these extracted wavelet features to train and test the palmprint database. Fig. 1 shows a palm image without preprocessing and the extracted palmprint image. The steps of our proposed algorithm for palmprint classification can be given as follows: 1) Extract the central portion of the palm sample image. 2) Perform the 2D discrete wavelet transform on the extracted palmprint image for J decomposition levels. 3) Classify the unknown palmprint image using AdaBoost with the extracted wavelet features. AdaBoost is a very popular boosting algorithm. It assigns each sample of the given training set a weight. All weights are initially set equal, but in every round the weak learner returns a hypothesis, and the weights of all examples classified wrong by that hypothesis are increased. Therefore, the weak learner will focus on the difficult
samples in the training set. The final hypothesis is a combination of the hypotheses of all rounds, in which hypotheses with lower classification error receive higher weight. The novelty of our proposed algorithm is two-fold. On one hand, the wavelet transform decomposes the palmprint image into wavelet features in a multiresolution way. On the other hand, AdaBoost can classify the unknown palmprint sample very efficiently. Both properties are combined in our proposed algorithm, making it a very successful palmprint classification algorithm; an illustrative sketch of the pipeline is given below. In our experiments we find that our proposed algorithm achieves state-of-the-art palmprint classification rates and is very competitive compared with other published methods in the literature.
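The pipeline of steps 1)-3) can be sketched as follows; the use of PyWavelets and scikit-learn's AdaBoostClassifier (a stand-in for AdaBoost.MH with product base learners), the decomposition level, and the toy data are our own illustrative assumptions.

import numpy as np
import pywt
from sklearn.ensemble import AdaBoostClassifier

def wavelet_features(palm_img, levels=3):
    """palm_img: 2-D array already cropped to the 128 x 128 central palm region."""
    coeffs = pywt.wavedec2(palm_img, wavelet='db4', level=levels)   # 2-D Daubechies-4 DWT
    arr, _ = pywt.coeffs_to_array(coeffs)                           # stack all subbands
    return arr.ravel()

# toy stand-in data: 20 "palms" of size 128 x 128 from 4 classes
rng = np.random.default_rng(0)
train_imgs = rng.random((20, 128, 128))
train_labels = np.repeat(np.arange(4), 5)
X = np.stack([wavelet_features(img) for img in train_imgs])
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, train_labels)  # stump base learner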
Fig. 1. The original palm sample and the extracted palmprint image
4 Experimental Results The PolyU palmprint database [11] is used in the experiments conducted in this paper. The database contains 100 different palms, each with six samples collected in two sessions. For each palm, we use four of the six palmprint samples for training and the other two for testing. The size of the original palms without preprocessing is 284 × 384 pixels. We extract the central portion of the palm image for palmprint classification. The extracted palmprint image has a size of 128 × 128 pixels. The 2D Daubechies-4 wavelet transform is used in our experiments. Table 1 lists the palmprint classification rates of those methods in [2]-[7], and the proposed method by using the wavelet features. We use AdaBoost with decision stumps, decision trees, and products of base learners. The results show that the proposed method achieves state-of-the-art classification rates and it outperforms all other methods given in the table for all tested cases. This indicates that our proposed wavelet-AdaBoost combination is a very stable choice for invariant palmprint classification.
Table 1. The classification rates of different palmprint classification methods and the proposed method by using the wavelet features and AdaBoost
Classification Method                  Classification rate
Method [2]                             93.3%
Method [3]                             95%
Method [4]                             95.25%
Method [5]                             97%
Method [6]                             99%
Method [7]                             98%
Proposed Method (decision stump)       99.89%
Proposed Method (product, 4 terms)     99.93%
Proposed Method (tree, 5 leaves)       99.92%
5 Conclusion
In this paper, a novel method has been developed for palmprint classification by using the wavelet features and AdaBoost. The method combines the multiresolution property of the wavelet transform and the strong classification ability of the AdaBoost classifier. For the PolyU palmprint database, our experimental results show the advantages of the proposed method for palmprint classification over existing methods published in the literature. It is possible to combine the proposed palmprint classification method with face recognition, fingerprint recognition, and iris recognition in order to achieve improved security. Future work on palmprint classification will use the next generation of multi-scale, multi-orientational transforms, e.g., the ridgelet transform, the curvelet transform, the contourlet transform, the beamlet transform, etc.
Acknowledgments. This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).
References 1. Zhang, D.: Automated biometrics - technologies and systems. In: Jain, A.K. (ed.). Kluwer, Norwell (2000) 2. Zhang, D., Shu, W.: Two novel characteristics in Palmprint verification: Datum point invariance and line feature matching. Pattern Recognition 32, 691–702 (1999) 3. You, J., Li, W., Zhang, D.: Hierarchical palmprint identification via multiple feature extraction. Pattern Recognition 35, 847–859 (2002) 4. Dong, K., Feng, G., Hu, D.: Digital curvelet transform for palmprint recognition. In: Li, S.Z., et al. (eds.) SINOBIOMETRICS 2004. LNCS, vol. 3338, pp. 639–645. Springer, Heidelberg (2004) 5. Chen, G.Y., Bui, T.D., Krzyzak, A.: Palmprint classification using dual-tree complex wavelets. In: Proc. of IEEE International Conference on Image Processing, Atlanta, GA, USA (2006)
6. Chen, G.Y., Kégl, B.: Palmprint classification using contourlets. In: Proceedings of IEEE International Conference on Systems, Man and Cybernetics (SMC 2007), Montreal, Canada (2007) 7. Zhang, D., Kong, W.-K., You, J., Wong, M.: On-line palmprint identification. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 1041–1050 (2003) 8. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 119–139 (1997) 9. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learning 37, 297–336 (1999) 10. Kégl, B., Busa-Fekete, R.: Boosting products of base classifiers. In: International Conference on machine Learning, Montreal, Canada, vol. 26 (2009) 11. The PolyU Palmprint Database, http://www.comp.polyu.edu.hk/~biometrics/
Face Recognition Based on Gabor-Enhanced Manifold Learning and SVM Chao Wang and Chengan Guo School of Electronic and Information Engineering, Dalian University of Technology, Dalian, Liaoning 116023, China
[email protected],
[email protected]
Abstract. Recently proposed Marginal Fisher Analysis (MFA), as one of the manifold learning methods, has obtained better classification results than the conventional subspace analysis methods and other manifold learning algorithms such as ISOMAP and LLE, because of its ability to find the intrinsic structure of data space and its nature of supervised learning as well. In this paper, we first propose a Gabor-based Marginal Fisher Analysis (GMFA) approach for face feature extraction, which combines MFA with Gabor filtering. The GMFA method, which is robust to variations of illumination and facial expression, applies the MFA to augmented Gabor feature vectors derived from the Gabor wavelet representation of face images. Then, the GMFA method is integrated with the Error Correction SVM classifier to form a novel face recognition system. We performed comparative experiments of various face recognition approaches on ORL database and FERET database. Experimental results show superiority of the GMFA features and the new recognition system presented in the paper. Keywords: Face recognition, Gabor wavelets, Marginal Fisher analysis, Manifold learning, Error Correction SVM.
1 Introduction
Over the past few years, face recognition has become a focus in the pattern recognition and computer vision research fields, and many face recognition methods have been developed. Two tasks are essential in face recognition: the first is which features to use to represent a face so that they carry more discrimination power; the second is how to design an efficient classifier to realize the discrimination ability of the features. A good face recognition methodology should consider classification as well as representation issues, and a proper cooperation of classification and representation methods should give better recognition performance. Among all the feature extraction methods, Principal Component Analysis (PCA) [1] and Linear Discriminant Analysis (LDA) [2] are the most popular ones. PCA projects the original data into a low dimensional space with optimal representation of the input data in the sense of minimizing the mean squared error (MSE). LDA encodes discriminating information by maximizing the between-class scatter matrix, while
minimizing the within-class scatter matrix in the projective subspace. However, both PCA and LDA effectively see only the Euclidean structure. They fail to discover the underlying structure, if the face images lie on a nonlinear sub-manifold hidden in the image space. Some nonlinear techniques have been proposed to discover the nonlinear structure of the manifold, e.g., ISOMAP [3], LLE [4], and Laplacian Eigenmap [5]. These manifold learning-based methods have the ability to find the intrinsic structure. Therefore they are superior and more powerful methods than the traditional ones. However, these manifold learning methods suffer from the difficulty of mapping new data point and the nature of unsupervised learning as well. In [6], a new supervised manifold learning algorithm called Marginal Fisher Analysis was proposed. The MFA realizes dimensionality reduction by designing two graphs that characterize the intra-class compactness and interclass separability, respectively. MFA measures the intra-class compactness with the distance between each data point and its neighboring points of the same class, and measures the inter-class separability with the class margins, using the criteria of pulling close the intra-class data points and pushing away inter-class data points while projecting the high dimensional data points into lower dimensional subspace. Thus, a higher recognition rate can be obtained in the application of face recognition. However, the performance of MFA algorithm is yet to be improved. In recent years, the Gabor filters have been widely used as an effective tool in biological image processing and pattern recognition tasks [7]-[10]. It has been shown that the Gabor wavelets can effectively abstract local and discriminating features that are useful for texture detection and face recognition [7]-[10]. Gabor filtering allows description of spatial frequency structure in the image while preserving information about spatial relations which is known to be robust to some variations, e.g., pose and facial expression changes. In this paper we present a hybrid feature extraction method named Gabor-based Marginal Fisher Analysis (GMFA) for face recognition by combining Gabor filtering and MFA method. The GMFA method applies the MFA to augmented Gabor feature vectors derived from the Gabor wavelet representation of face images. Experiments on the ORL and FERET databases demonstrate the excellent performance of the new method in the paper. In addition, many advanced classification methods have been proposed and their applications in face recognition have been studied in recent years. Among them, Support Vector Machine (SVM) [11] is an effective method. However, SVM is designed for two-class classification. For K-class problems, many methods have been proposed, for example, the One-Against-One [12] and the One-Against-All [13]. An SVM multi-classification algorithm with error correction ability has been proposed in [14], and has been proved effective in face recognition. Based on the GMFA and the error correction SVM classifier, a new face recognition framework is proposed in this paper. Many simulation experiments have been conducted using ORL database and FERET database in the paper. Experimental results show the superiority of the GMFA features and the new recognition method. 
The rest of the paper is organized as follows: Section 2 describes the new face recognition method, including the description of the new recognition framework, the Gabor wavelet filtering algorithm, the Marginal Fisher Analysis and its implementation, and the Error
Correction SVM Classifier. Section 3 shows experimental results for evaluating the new method. And finally, Section 4 gives the summary and further directions of the paper.
2 A Face Recognition Method Using GMFA and SVM
2.1 The New Face Recognition Scheme
In this paper, we propose a new method for face recognition, which is illustrated in Fig. 1. In the method, the images to be recognized are filtered by Gabor wavelets in order to capture salient visual properties such as spatial localization, orientation selectivity, and spatial frequency characteristics. The Gabor wavelets, whose kernels are similar to the response of the two-dimensional receptive field profiles of mammalian simple cortical cells, exhibit the desirable characteristics of capturing spatial locality and orientation selectivity and can present the discriminative features of face images [7,8]. The high dimensional Gabor wavelet representation of the image is then processed by the Marginal Fisher Analysis algorithm to find the underlying structure and extract low dimensional features. Finally, the Gabor filtering based MFA (GMFA) feature vector is input into the Error Correction SVM Classifier to determine the class information of the original images. The Error Correction SVM Classifier is a multi-classifier constructed from a number of support vector machines, which can correct l intermediate misclassifications with preset 2l + 1 redundant SVMs, and it has been proved an excellent classifier for face recognition [14].
Fig. 1. Block diagram of the proposed face recognition scheme
2.2 Gabor Feature Representation
Gabor wavelets were introduced to image analysis due to their biological relevance and computational properties. Gabor wavelet representation of face images can derive desirable features gained by spatial frequency, spatial locality, and orientation selectivity. The Gabor wavelets (kernels, filters) can be defined as follows [7]-[10]:
ψ_{μ,v}(z) = (‖k_{μ,v}‖² / σ²) exp(−‖k_{μ,v}‖² ‖z‖² / (2σ²)) [exp(i k_{μ,v}·z) − exp(−σ²/2)]    (1)
where z = [x, y]^T, k_{μ,v} = [k_v cos φ_μ, k_v sin φ_μ]^T, v and μ define the scale and orientation of the Gabor kernels, k_v = k_max / f^v, φ_μ = μπ/8, and f is the spacing factor between kernels in the frequency domain. We determine the parameters according to [9] and [10]. The Gabor wavelet representation of an image is the convolution of the image with the family of Gabor kernels of equation (1):
G_{μ,v}(z) = I(z) ∗ ψ_{μ,v}(z)    (2)
where I(z) is the gray level distribution of an image, "∗" denotes the convolution operator, and G_{μ,v}(z) is the convolution result corresponding to the Gabor kernel at scale v and orientation μ. The convolution can be computed efficiently by performing the FFT, point-by-point multiplications, and then the IFFT.
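A minimal NumPy sketch of equations (1) and (2) follows; the kernel size, σ = 2π, k_max = π/2, f = √2, and the 5-scale/8-orientation sweep are the common settings from [9,10] and should be read as assumptions, not as the authors' exact parameter choices.

```python
import numpy as np

def gabor_kernel(mu, v, size=33, sigma=2 * np.pi, k_max=np.pi / 2, f=np.sqrt(2)):
    """Gabor kernel of equation (1) at orientation mu and scale v."""
    k = k_max / (f ** v)
    phi = mu * np.pi / 8.0
    kx, ky = k * np.cos(phi), k * np.sin(phi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    k2, z2 = kx ** 2 + ky ** 2, x ** 2 + y ** 2
    envelope = (k2 / sigma ** 2) * np.exp(-k2 * z2 / (2 * sigma ** 2))
    return envelope * (np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2))

def gabor_convolve(image, kernel):
    """Equation (2) via FFT, point-by-point multiplication, then IFFT."""
    h, w = image.shape
    return np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(kernel, s=(h, w)))

# Augmented Gabor feature vector: magnitude responses over 5 scales and 8 orientations.
# feats = np.concatenate([np.abs(gabor_convolve(img, gabor_kernel(mu, v))).ravel()
#                         for v in range(5) for mu in range(8)])
```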
2.3 Marginal Fisher Analysis Algorithm
Given N data points X = [x_1, x_2, ..., x_N], x_i ∈ R^D, well sampled from the underlying nonlinear submanifold M, where D is the dimension of the data points. For supervised learning problems, the class labels of x_i are c_i ∈ {1, 2, ..., N_c}, where N_c is the number of classes of the data points. In real world applications, the dimension of the data points is very high and needs to be reduced to a lower one through some dimensionality reduction method. The essence of dimensionality reduction is to find a mapping function that transforms a data point x ∈ R^D into a lower dimensional feature y ∈ R^d, where D >> d. Marginal Fisher Analysis has been proved to be an effective and desirable method for dimensionality reduction. The algorithmic procedure of the original Marginal Fisher Analysis algorithm is stated as follows [6]:
Step 1: Project the data set into the PCA subspace to reduce the dimension. Let W_PCA denote the transformation matrix of PCA.
Step 2: Construct the intra-class compactness and inter-class separability graphs by setting the intra-class adjacency matrix and the inter-class similarity matrix. The geometrical explanation of the neighborhood relation of MFA is given in Fig. 2(a).
Step 3: Find the optimal projection direction by the Marginal Fisher Criterion:
w* = arg min_w [w^T X (D^c − W^c) X^T w] / [w^T X (D^m − W^m) X^T w]    (3)

where the diagonal matrix D (including D^c and D^m) is defined as

D_ii = Σ_{j≠i} W_ij , ∀i    (4)
Fig. 2. Neighborhood graph for the original MFA algorithm (a) and the proposed neighborhood selection (b)
For each sample x_i ∈ [x_1, x_2, ..., x_N], set W^c_ij = W^c_ji = 1 if x_j is among the k_1 nearest neighbors of x_i of the same class, and set W^c_ij = W^c_ji = 0 otherwise. For each class c, set W^m_ij = W^m_ji = 1 if the pair (x_i, x_j) is among the k_2 shortest pairs between different classes, and set W^m_ij = W^m_ji = 0 otherwise.
Step 4: Project the high dimensional data point x into the lower dimensional space via linear projection:

x_F = P_MFA x    (5)

where P_MFA = W_PCA w*.
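The criterion (3) can be solved as a generalized eigenvalue problem. The sketch below, which assumes the samples have already been projected by PCA (Step 1) and uses the original k_1/k_2 graph construction rather than the p-based modification proposed below, is only one way to realize Steps 2–4; the small diagonal regularizer is a numerical safeguard, not part of the paper.

```python
import numpy as np
from scipy.linalg import eigh

def mfa_projection(X, labels, k1, k2, dim):
    """Steps 2-4 of MFA: build W^c, W^m, D^c, D^m and solve eq. (3) for the
    projection directions. X is D x N (columns are PCA-projected samples),
    labels is a length-N integer array of class labels."""
    D, N = X.shape
    dist = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)  # N x N pairwise distances

    Wc = np.zeros((N, N))  # intra-class adjacency matrix
    Wm = np.zeros((N, N))  # inter-class (marginal) similarity matrix
    for i in range(N):
        same = np.where(labels == labels[i])[0]
        same = same[same != i]
        for j in same[np.argsort(dist[i, same])][:k1]:
            Wc[i, j] = Wc[j, i] = 1.0
    for c in np.unique(labels):
        in_c = np.where(labels == c)[0]
        out_c = np.where(labels != c)[0]
        sub = dist[np.ix_(in_c, out_c)]
        for idx in np.argsort(sub, axis=None)[:k2]:   # k2 shortest inter-class pairs
            a, b = np.unravel_index(idx, sub.shape)
            Wm[in_c[a], out_c[b]] = Wm[out_c[b], in_c[a]] = 1.0

    Dc, Dm = np.diag(Wc.sum(axis=1)), np.diag(Wm.sum(axis=1))   # eq. (4)
    Sc = X @ (Dc - Wc) @ X.T                                    # compactness scatter
    Sm = X @ (Dm - Wm) @ X.T                                    # separability scatter
    # eq. (3): minimize w^T Sc w / w^T Sm w -> smallest generalized eigenvectors
    _, vecs = eigh(Sc, Sm + 1e-6 * np.eye(D))
    return vecs[:, :dim]                                        # columns span the MFA subspace
```

Projecting a (PCA-reduced) sample x with the returned matrix, as in eq. (5), is then simply `vecs.T @ x`.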
For the implementation of the MFA method, two parameters in the algorithm should be determined in advance: the number of intra-class neighbors k_1 and the number of inter-class neighbors k_2. In the application of face recognition, all the samples from the same class are assumed to be neighbors of each other; thus, k_1 is set to the constant value k_1 = i_c − 1, where i_c is the number of training samples per class. The only parameter left is k_2. Determining k_2 can only be achieved by experiments, and it may cause W to be asymmetric, since x_i being among the k_2 neighbors of x_j does not ensure that x_j is among the k_2 neighbors of x_i. We therefore substitute k_2 with another distance parameter p. Here, p is determined by adding a constant const to the maximum distance within the class, md,
p = md + const. The geometrical explanation of this alteration of the MFA neighborhood selection is given in Fig. 2(b). This modification not only makes the parameter more reasonable, but also makes it easier to determine, since there are many ways to estimate p based on the data samples.
2.4 The Error Correction SVM Classifier
SVM is an optimal classifier in terms of structural risk minimization based on VC theory [11]. Since the SVM is originally designed for binary classification, multi-class problems such as face recognition must be realized by a suitable combination of a number of binary SVMs. For an m-class classification problem, k binary SVMs, where k = ⌈log₂ m⌉, are enough in theory for classifying the m classes. The Error Correction SVM algorithm solves the problem of deciding how many SVMs should be used in order to obtain a certain level of error tolerance. Its main idea is as follows: the classification procedure of an m-class problem using binary classifiers can be viewed as a digital communication problem, and the classification errors made by some binary classifiers can be viewed as transmission errors of a binary string over a channel. Therefore, the errors may be corrected by adding some redundant SVMs and using an error control coding scheme. The Error Correction SVM method [14] is such an approach, in which the BCH coding scheme is incorporated into the algorithm for solving the m-class learning problem, so that l intermediate misclassifications can be corrected by using n binary SVMs. Based on coding theory, for an n-bit code with minimum Hamming distance d, it is able to correct l errors, where n ≥ ⌈log₂ m⌉ + d and l ≤ ⌊(d − 1)/2⌋. In order to implement the Error Correction SVM classifier, two stages are included; details of these implementation algorithms can be found in [14] and are omitted here.
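The sketch below illustrates only the general structure of such an error-correcting combination: one binary SVM per codeword bit and nearest-codeword decoding in Hamming distance. The codeword matrix is assumed to be given (in [14] it comes from a BCH code), and the RBF SVMs and helper names are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

def train_code_svms(X, y, codewords):
    """codewords: (m, n) 0/1 matrix giving an n-bit codeword to each of m classes.
    One binary SVM is trained per bit, on samples relabeled by that bit of their class code."""
    return [SVC(kernel="rbf").fit(X, codewords[y, b]) for b in range(codewords.shape[1])]

def decode(svms, codewords, x):
    """Predict the n bits, then pick the class whose codeword is nearest in Hamming
    distance; a code of minimum distance d corrects up to (d - 1) // 2 bit errors."""
    bits = np.array([int(clf.predict(x.reshape(1, -1))[0]) for clf in svms])
    return int(np.argmin(np.abs(codewords - bits).sum(axis=1)))
```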
3 Experiments To verify the effectiveness and discriminating power of the proposed hybrid approach, we conducted experiments for the method on two different face databases: ORL and FERET. The feature extraction method based on Gabor filtering enhanced marginal Fisher analysis (GMFA) is compared with other classic sub-space learning methods, and the Error Correction SVM classifier is compared with the nearest neighbor classifier and One-Against-All SVM method. The ORL database contains images from 40 individuals, each providing 10 different images. For some subjects, the images were taken at different times. The facial expressions and facial details also vary. For the purpose of computation efficiency, all images are resized to 46×56 pixels.
Table 1. Simulation results (recognition rates (%)) tested on ORL database

Feature extractor   Nearest Neighbor   One-Against-All   Error Correction SVM
LDA                 92.75              93.45             95.30
MFA                 94.75              95.30             96.35
GLDA                97.60              97.95             98.15
GMFA                99.00              99.25             99.35
In the experiment results shown in Table 1, the proposed GMFA feature extractor is compared to LDA and the original MFA. The number of training samples for each class is 5. The training samples are randomly selected from the face images, while the remaining 5 samples for each class are used in testing. We conduct all the experiments 20 times, and the recognition rates given in table 1 are the average results. The dimension of the feature vectors is set to 39 for all the methods. From Table 1, it can be seen that, for each classifier, the highest recognition rate can always be obtained using the Gabor filtering based MFA features compared to using other features. It can also be seen that, by using the same kind of the features (in each row of the table), the Error Correction SVM classifier can always achieve the highest recognition rate among the 3 kinds of classifiers. By examining all the results, we can see that the combination of the GMFA feature with the Error Correction SVM classifier outperforms all the other combinations. We also tested the GMFA feature representation and the classification system on a subset of the FERET face database. This subset includes 3360 face images of 560 individuals with 6 images per person. The experiment results are shown in Table 2. Table 2. Simulation results (recognition rates (%)) tested on FERET database
Feature extractor   Nearest Neighbor   One-Against-All   Error Correction SVM
LDA                 47.71              52.24             53.19
MFA                 61.56              62.91             63.81
GLDA                69.60              75.84             76.73
GMFA                73.27              76.28             78.35
In the experiments on the FERET database, 3 images of each person in the database are randomly selected as training samples, while the other 3 images are used as testing samples. It can also be seen that the combination of the GMFA feature with the Error Correction SVM classifier still outperforms all the other combinations although the recognition rates are lower than the results of Table 1, due to the larger scale, more classes and variations of the FERET database than the ORL database.
4 Summary and Further Directions
In this paper, we proposed a new face recognition method using Gabor-enhanced Marginal Fisher Analysis (GMFA) and the Error Correction SVM classifier. In the method, the image to be recognized is filtered by Gabor wavelets, and the high dimensional Gabor representation of the image is then processed by the Marginal Fisher Analysis algorithm to find the underlying structure and extract low dimensional features. Finally, the GMFA feature vector is input into the Error Correction SVM classifier to obtain the class information of the original image. Simulation experiments and result analyses were conducted to evaluate the new method; they show that the GMFA feature always provides a higher recognition rate than the other features, and that the combination of the GMFA feature with the Error Correction SVM outperforms all the other combinations. It is noticed that the computational complexity of the new method is quite high. Therefore, effective algorithms, such as parallel algorithms, need to be developed to improve the computational efficiency. This is a problem for further study.
References 1. Jolliffe, I.: Principal Component Analysis. Springer, New York (1986) 2. Etemad, K., Chellapa, R.: Discriminant analysis for recognition of human face images. J. Opt. Am. A 14(8), 1724–1733 (1997) 3. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000) 4. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000) 5. Saul, L.K., Roweis, S.T.: Think globally, fit locally: unsupervised learning of low dimensional manifolds. J. Mach. Learn. Res. (4), 119–155 (2003) 6. Yan, S., Xu, D., Zhang, B., Zhang, H.J.: Graph embedding: A General Framework for Dimensionality Reduction. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 830–837 (2005) 7. Chui, C.K.: An introduction to wavelets. Academic, Boston (1992) 8. Jones, J., Palmer, L.: An Evaluation of the Two-Dimensional Gabor Filter Model of Simple Receptive Fields in Cat Striate Cortex. J. Neurophysiology 58(6), 1233–1258 (1987) 9. Liu, C., Wechsler, H.: Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition. IEEE Trans. Image Processing 11(4), 467–476 (2002) 10. Liu, C.: Gabor-Based Kernel PCA with Fractional Power Polynomial Models for Face Recognition. IEEE Trans. Pattern Analysis and Machine Intelligence 26(5), 572–581 (2004) 11. Vapnik, V.: Statistical Learning Theory. John Willey and Sons Inc., New York (1998) 12. Kreßel, U.: Pairwise Classification and Support Vector Machines. In: Schölkopr, B., Burges, J.C., Smola, A.J. (eds.) Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge (1999) 13. Sebald, D.J., Bucklew, J.A.: Support Vector Machines and Multiple Hypothesis Test Problem. IEEE Trans. on Signal Processing 49(11), 2865–2872 (2001) 14. Wang, C., Guo, C.: An SVM Classification Algorithm with Error Correction Ability Applied to Face Recognition. In: Wang, J., Yi, Z., Żurada, J.M., Lu, B.-L., Yin, H. (eds.) ISNN 2006. LNCS, vol. 3971, pp. 1057–1062. Springer, Heidelberg (2006)
Gradient-based Local Descriptor and Centroid Neural Network for Face Recognition Nguyen Thi Bich Huyen, Dong-Chul Park, and Dong-Min Woo Dept. of Electronics Engineering, Myong Ji University, Korea {parkd,dmwoo}@mju.ac.kr
Abstract. This paper presents a feature extraction method from facial images and applies it to a face recognition problem. The proposed feature extraction method, called gradient-based local descriptor (GLD), first calculates the gradient information of each pixel and then forms an orientation histogram at a predetermined window for the feature vector of a facial image. The extracted features are combined with a centroid neural network with the Chi square distance measure (CNN-χ²) for a face recognition problem. The proposed face recognition method is evaluated using the Yale face database. The results obtained in experiments imply that the CNN-χ² algorithm combined with the GLD outperforms recent state-of-the-art algorithms including the well-known approaches KFD (Kernel Fisher Discriminant based on eigenfaces), RDA (Regularized Discriminant Analysis), and Sobel faces combined with 2DPCA (two dimensional Principal Component Analysis) in terms of recognition accuracy. Keywords: neural network, face recognition, local descriptor.
1 Introduction
Face recognition is a very interesting topic in computer vision research, because of its scientific challenges and wide range of potential applications [1]. Among various developed algorithms, the three approaches that have seen the widest study and application are Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Elastic Bunch Graph Matching (EBGM) methods. Principal Component Analysis (PCA), a well-known technique introduced by Kirby and Sirovich [2], has been successfully applied in image recognition and data compression. In a face recognition context, Turk and Pentland [3] utilized PCA to represent a large vector of pixel elements built from a facial image into the compact principal components of feature space. Face detection and identification are carried out in the reduced space by measuring the distance between the corresponding feature vectors of the database images and the test image. Linear Discriminant Analysis (LDA) [4] is a statistical method that is popular in pattern recognition and classification. The approach of LDA for classifying samples of unknown classes is to maximize between-class variance and to minimize within-class variance. Many algorithms based on LDA have been successfully
applied in face recognition. The Elastic Bunch Graph Matching (EBGM) [5] method is based on the idea that a face image has many nonlinear characteristics that are not described by the linear analysis methods, such as illumination, pose and expression differences. In addition to the above methods, there are other approaches that have also attracted much attention. Recently, independent component analysis (ICA) [6] and kernel principal component analysis (KPCA) [7], both PCA-related methods, have been proposed for face representation. In addition, in [9] Kernel Fisher Discriminant (KFD) was used together with various similarity measures employed in the standard eigenspaces method. Dao-Quing et al. [10] used a regularized discrimination scheme instead of optimizing Fisher index used in LDA. The Sobel face approach [11], meanwhile, focuses on decreasing the effects of the illumination condition by transferring all the input images into Sobel images and then applying median filters to promote the accuracy of the covariance matrix. Scale invariant feature transform (SIFT) is a local descriptor extraction method, presented by David Lowe [12]. This method has many advantages such as scale invariance, rotation invariance, affine invariance, illumination and viewpoint invariance. In this paper we propose a method for extracting features from a face image derived from the SIFT approach with some assumptions such as a fixed scale on face images and fixed positions for feature extraction. With the extraction of feature vectors on face image data, a classifier is used to determine the distance between a given face image and a certain model image. This is performed by maximizing the margin between different sample classes. From Bayes classifiers to neural networks, there are many possible choices for an appropriate classifier. Among several clustering algorithms such as the k-means algorithm, Self-Organizing Map (SOM), Centroid neural network (CNN), and Fuzzy c-means algorithm, we find CNN to be the most appropriate for the image texture classifier. The use of Chi square distance over Euclidean distance as a dissimilarity measure for feature descriptors was also reported in [13]. The remainder of this paper is organized as follows: Section 2 briefly summarizes the SIFT and the proposed feature extraction method. The CNN, which is used as a clustering algorithm in this work is summarized and the CNN with a Chi square distance measure is proposed in Section 3. Section 4 describes experiments involving the Yale face database and presents the obtained results. Finally, conclusions are given in Section 5.
2 Feature Extraction Method and SIFT
Lowe [12] proposed the SIFT approach for object detection based on its template image. SIFT was first introduced as a local image descriptor, offering advantages such as invariance to scaling, rotation, translation, and illumination. This method includes the following major steps for generating image features. The first stage is to detect locations that are invariant to the scale changes of the image. These positions can be obtained by extracting SIFT features at the local extrema of the scale-space representation of the image. Once the keypoint
candidates are obtained, SIFT then assigns the dominant orientation to each keypoint location based on the local image gradient direction. The keypoint descriptor can then be made relative to this consistent direction, and this provides the property of image rotation invariance. The keypoint descriptor is generated by calculating the magnitudes and orientations of the image gradient within the region around the feature point. A 128-element vector is generated as the descriptor. It is represented by a 3D histogram of gradient locations and orientations. In order to gain the illumination invariance, the feature description should be normalized. Further details on SIFT can be found in [12]. SIFT has some advantageous characteristics for object recognition problems including scaling invariance, rotation invariance, and illumination invariance. However, when applied to a face recognition problem, some of these invariant properties are not necessary. The first step in the original SIFT is to detect the locations for the candidate feature points. They are located at the local extrema of the scale-space image. However, the number and position of the feature points are difficult to control. Furthermore, the initial normalization steps in SIFT might remove information that is useful for recognition when images are not scaled [14]. For this reason, we propose some adjustments to SIFT with the assumptions that face images are not scaled and the feature points should be extracted at some fixed locations. Considering the problems discussed in this paper, the feature is located at each rectangular region obtained by dividing the face image into several regions. In other words, the image is divided into several rectangular regions such as 15×15, 20×20, or 25×25. The feature descriptors are then extracted from these regions in a manner similar to the SIFT approach. Finally, a spatial histogram is generated for the image by concatenating the regional histograms. The following summarizes the proposed gradient-based local descriptor: 1. Initially, the facial image is divided into rectangular regions. The feature position is located in the middle of each rectangular field. Fig. 1 illustrates examples of a facial image divided into 15×15, 20×20, and 25×25 rectangular regions. 2. Once the feature location is determined, the next step is to calculate the feature’s direction; the method for this is adopted from Lowe’s algorithm. – The direction for every pixel within the region around the feature point is first computed: θ(x, y) = arctan
((I(x, y + 1) − I(x, y − 1)) / (I(x + 1, y) − I(x − 1, y)))    (1)

where I(x, y) is the input pixel image located at (x, y).
– A gradient orientation histogram from the obtained directions is then generated. The gradient magnitude is computed as

m(x, y) = sqrt( (I(x + 1, y) − I(x − 1, y))² + (I(x, y + 1) − I(x, y − 1))² )    (2)

– The maximal component of the histogram is then assigned as the direction of the feature.
Fig. 1. Examples of a facial image divided into 15×15, 20×20, and 25×25 pixels regions
Fig. 2. Example of a Gradient-based Local Descriptor for an image block in a 4×4 array
3. Next, the feature descriptor is generated by following the SIFT algorithm. – The orientations and magnitudes of the image gradients are calculated in the region around the feature. – The directions are accumulated into orientation histograms created over rectangular subregions. Each component is weighted by a Gaussian window and the gradient magnitude. Eight orientation bins are selected for each orientation histogram. The descriptor extraction is shown in Fig. 2 using the example of a 4×4 descriptor array computed from a 15×15 neighborhood region. 4. All descriptor vectors are then concatenated into a single extended histogram feature to represent the whole image.
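A compact NumPy sketch of the descriptor is given below. It keeps the block division, the 4×4 subregion grid and the 8-bin magnitude-weighted orientation histograms, but omits the Gaussian window, the dominant-orientation alignment and any normalization; the block size and array layout are assumptions made here for illustration.

```python
import numpy as np

def gld_descriptor(img, block=15, cells=4, n_bins=8):
    """Divide the image into block x block regions, compute 8-bin gradient
    orientation histograms over a cells x cells grid in each region (weighted
    by gradient magnitude), and concatenate them into one feature vector."""
    img = img.astype(float)
    dx = np.zeros_like(img)
    dy = np.zeros_like(img)
    dx[:, 1:-1] = img[:, 2:] - img[:, :-2]        # I(x, y+1) - I(x, y-1)
    dy[1:-1, :] = img[2:, :] - img[:-2, :]        # I(x+1, y) - I(x-1, y)
    mag = np.sqrt(dx ** 2 + dy ** 2)              # eq. (2)
    ang = np.arctan2(dx, dy)                      # eq. (1), range (-pi, pi]

    feats = []
    step = block // cells
    for r in range(0, img.shape[0] - block + 1, block):
        for c in range(0, img.shape[1] - block + 1, block):
            for i in range(cells):
                for j in range(cells):
                    rs, cs = r + i * step, c + j * step
                    h, _ = np.histogram(ang[rs:rs + step, cs:cs + step],
                                        bins=n_bins, range=(-np.pi, np.pi),
                                        weights=mag[rs:rs + step, cs:cs + step])
                    feats.append(h)
    return np.concatenate(feats)
```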
3 Centroid Neural Network with Chi Square Distance
3.1 Centroid Neural Network (CNN)
The CNN algorithm is an unsupervised competitive learning algorithm based on the classical k-means clustering algorithm. It finds the centroids of clusters at each presentation of the data vector. The CNN updates its weights only when the status of the output neuron for the presenting data has changed when compared to the status from the previous epoch.
When an input vector x is presented to the network at epoch n, the weight update equations for the winner neuron j and the loser neuron i in CNN can be summarized as follows:

w_j(n + 1) = w_j(n) + (1 / (N_j + 1)) [x(n) − w_j(n)]    (3)
w_i(n + 1) = w_i(n) − (1 / (N_i − 1)) [x(n) − w_i(n)]    (4)

where w_j(n) and w_i(n) represent the weight vectors of the winner neuron and the loser neuron at the iteration n, respectively. The CNN has several advantages over conventional algorithms such as SOM or the k-means algorithm when used for clustering and unsupervised competitive learning. The CNN requires neither a predetermined schedule for the learning gain nor the total number of iterations for clustering. It always converges to sub-optimal solutions, while conventional algorithms such as SOM may give unstable results depending on the initial learning gains and the total number of iterations. A more detailed description of the CNN can be found in [15][16].
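A minimal sketch of one CNN update according to equations (3)–(4) follows: the winner moves toward the presented vector and, if the sample has changed clusters since the previous epoch, its old winner is updated as a loser. The membership bookkeeping (cluster counts and previous assignments) is deliberately simplified here.

```python
import numpy as np

def cnn_update(weights, counts, x, winner, prev_winner=None):
    """weights: (K, dim) centroids; counts: (K,) cluster sizes before this update."""
    j = winner
    weights[j] += (x - weights[j]) / (counts[j] + 1)       # eq. (3): winner moves toward x
    counts[j] += 1
    if prev_winner is not None and prev_winner != winner:
        i = prev_winner                                    # x leaves its previous cluster
        weights[i] -= (x - weights[i]) / (counts[i] - 1)   # eq. (4): loser moves away from x
        counts[i] -= 1
    return weights, counts
```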
3.2 CNN with Chi Square Distance Measure
Although CNNs have been successfully applied to various clustering problems with deterministic data, they may not be appropriate for high dimensional data such as histograms. In order to measure the similarity of two histograms effectively, the following Chi square distance measure is employed:

χ²(M, S) = Σ_{i=1}^{Q} (M_i − S_i)² / S_i    (5)
where M and S correspond to the model and sample histograms, respectively, and Q represents the dimension of the histograms. For the CNN with the Chi square distance measure, the objective function to be minimized is defined as:

J = Σ_{k=1}^{Q} Σ_{i=1}^{N_k} (w_k − x_i(k))² / x_i(k) ,    x_i(k) ∈ Group k    (6)
where N_k denotes the number of data points in Group k. By applying a necessary condition for the optimal position of the center of each group, the update equations for the winner neuron j and the loser neuron i of the CNN with Chi square distance can be summarized as follows:

1 / w_j(n + 1) = 1 / w_j(n) + (1 / (N_j + 1)) [1 / x(n) − 1 / w_j(n)]    (7)
1 / w_i(n + 1) = 1 / w_i(n) − (1 / (N_i − 1)) [1 / x(n) − 1 / w_i(n)]    (8)

where w_j(n) and w_i(n) represent the weight vectors of the winner neuron and the loser neuron at the iteration n, respectively. The CNN with Chi square distance has also been successfully applied to a texture classification problem [17].
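The reciprocal-form updates (7)–(8) and the Chi square distance (5) can be sketched as follows; the small eps guarding against division by zero in empty histogram bins is an implementation assumption, not part of the paper.

```python
import numpy as np

def chi_square(model, sample, eps=1e-12):
    """Equation (5): Chi square distance between a model and a sample histogram."""
    return np.sum((model - sample) ** 2 / (sample + eps))

def cnn_chi2_update(weights, counts, x, winner, loser=None, eps=1e-12):
    """Winner/loser updates of equations (7)-(8), applied to histogram centroids."""
    j = winner
    inv = 1.0 / (weights[j] + eps)
    inv += (1.0 / (x + eps) - inv) / (counts[j] + 1)       # eq. (7)
    weights[j] = 1.0 / inv
    counts[j] += 1
    if loser is not None:
        i = loser
        inv = 1.0 / (weights[i] + eps)
        inv -= (1.0 / (x + eps) - inv) / (counts[i] - 1)   # eq. (8)
        weights[i] = 1.0 / inv
        counts[i] -= 1
    return weights, counts
```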
Fig. 3. Comparison of recognition accuracy among different algorithms (mean recognition rate versus the number of training images per individual, from 2 to 6, for the proposed method, KFD eigenfaces, RDA, and Sobel face)
4 Experiments and Results
In this section, we present experiments to evaluate the effectiveness of the CNN with Chi square distance in face recognition. We train and test the proposed recognition algorithm on the Yale face database. This database is composed of 15 different subjects with 11 images per subject, for a total of 165 images. The 11 images for each individual show different facial expressions and illumination conditions. In the geometric normalization, the faces are manually cropped to a size of 100×100 pixels to eliminate the background and some parts of the hair and chin. To evaluate the performance of our approach, we compare it with other current state-of-the-art algorithms, including Kernel Fisher Discriminant based on eigenfaces (KFD), Regularized Discriminant Analysis (RDA), and an approach using Sobel as a preprocessing tool and median filtering to decrease the effect of illumination (Sobel face). A series of analyses were undertaken where the training sample sizes are varied. Five tests were performed with a varying number of training samples, ranging from 2 to 6 images, and the remaining data were used for testing. Note that the training samples were selected randomly. The mean recognition rates achieved by the four approaches are shown in Fig. 3. The following important observations were made from a comparison of the performances of these different methods: – When two and three training samples are used, the mean recognition rate of KFD is the highest. However, for the remaining cases, the proposed method with GLD always achieves the best recognition performance among
all four methods for various training sets of Yale faces (98.14% for six training samples per individual). – KFD based on Eigenfaces provides substantially higher recognition accuracy when compared to other methods: RDA, Sobel faces combined with 2DPCA. – Sobel faces combined with 2DPCA yields the worst performance among the evaluated methods likely due to its rather simple nature. The similarity measure employed in this classifier is the simple Euclidean distance. The main advantage of the proposed approach is that the gradient-based local descriptors are invariant with respect to any monotonic gray scale variations. As such, they are not substantially affected by illumination changes and optical lens distortions, which can cause gray scale variations.
5 Conclusion
A new face recognition approach with the gradient-based local descriptor and a CNN with the Chi square distance measure is proposed. The combination of the CNN and the Chi square distance provides an efficient approach to deal with facial histograms in face analysis. In order to evaluate the performance of the proposed method, a number of experiments were conducted on the Yale face database with 165 images in total. Through experiments with different parameters for the proposed method, we noticed a relative sensitivity to the choice of the number of regions and of the distance measure. The recognition rates obtained by the proposed method for various numbers of training samples per individual demonstrate that the proposed classification scheme using the gradient-based local descriptor and a CNN with the Chi square distance is quite accurate. Furthermore, it outperforms the following conventional face recognition methods: KFD eigenfaces, RDA, and Sobel faces combined with 2DPCA. Future research should focus on conducting more experiments with larger data sets and on investigating a method to minimize the computational complexity related to the calculation of the gradient-based local descriptor.
Acknowledgments This work was supported by the Korea Research Foundation Grant funded by the Korean government(MOEHRD, Basic Research Promotion Fund)( Grant No.: R01-2007-000-20330-0).
References 1. Chellappa, R., Wilson, C.L., Sirohey, S.: Human and machine recognition of faces: A survey. Proceedings of IEEE 83(5), 705–740 (1995) 2. Kirby, M., Sirovich, L.: Application of the karhunen-loeve procedure for the characteristic of human faces. IEEE Trans. on Pattern Analysis and Machine Intelligence 12(1), 103–108 (1990)
3. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3, 71–86 (1991) 4. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell. 19, 711–712 (1997) 5. Wiskott, L., Fellous, J.-M., Kuiger, N., von der Malsburg, C.: Face Recognition by Elastic Bunch Graph Matching. IEEE Trans. Pattern Analysis and Machine Intelligence 19, 775–779 (1997) 6. Bartlett, M.S., Movellan, J.R., Sejnowski, T.J.: Face Recognition by Independent Component Analysis. IEEE Trans. on Neural Networks 13(6), 1450–1464 (2002) 7. Yang, J., Jin, Z., Yang, J.Y., Zhang, D., Frangi, A.F.: Essence of kernel fisher discriminant: KPCA plus IDA. Pattern Recognition 10, 2097–2100 (2004) 8. Jian, Y., Zhang, D., Frangi, A., Jing-yu, Y.: Twodimensional pca: a new approach to appearance-based face representation and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 26(1), 131–137 (2004) 9. Ruiz-del Solar, J., Navarrete, P.: Eigenspace-based face recognition: a comparative study of different approaches. IEEE Trans. on Systems, Man, and Cybernetics 35(3), 315–325 (2005) 10. Dai, D.Q., Yuen, P.: Face recognition by regularized discriminant analysis. IEEE Trans. on Systems, Man, and Cybernetics 37(4), 1080–1085 (2007) 11. Lu, Y.-M., Liao, B.-Y., Pan, J.-S.: Face recognition by regularized discriminant analysis. In: Proc. of Int. Conf. on Intelligent Information Hiding and Multimedia Signal Processing, pp. 378–381 (2008) 12. Lowe, D.G.: Distinctive image features from Scale-Invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004) 13. Nagasaka, A., Tanaka, Y.: Automatic video indexing and full-video search for object appearances. In: Proc. IFIP 2nd Working Conf. Visual Database systems, pp. 502–505 (1992) 14. Albiol, A., Monzo, D., Martin, A., Sastre, J., Albiol, A.: Face recognition using HOG-EBGM. Pattern Recognition Letters 29(10), 1537–1543 (2008) 15. Park, D.C.: Centroid neural network for unsupervised competitive learning. IEEE Trans. on Neural Networks 11, 520–528 (2000) 16. Park, D.C., Woo, Y.: Weighted centroid neural network for edge reserving image compression. IEEE Trans. on Neural Networks 12, 1134–1146 (2001) 17. Vu Thi, L., Park, D.-C., Woo, D., Lee, Y.: Centroid neural network with chi square distance measure for texture classification. In: Proc. of IJCNN (2009)
Mean Shift Segmentation Method Based on Hybridized Particle Swarm Optimization Yanling Li1,2 and Gang Li1 1
College of Computer and Information Technology, Xinyang Normal University, Xinyang, 464000, China 2 Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China
[email protected]
Abstract. Mean shift, like other gradient ascent optimization methods, is susceptible to local maxima, and hence often fails to find the desired global maximum. In this paper, mean shift segmentation method based on hybridized particle swarm optimization algorithm is proposed which overcomes the shortcoming of mean shift. The mean shift vector is firstly optimized using hybridized PSO algorithm when performing the new algorithm. Then, the optimal mean shift vector is updated using mean shift procedure. Experimental results show that the proposed algorithm used for image segmentation can segment images more effectively and provide more robust segmentation results. Keywords: image segmentation, mean shift, PSO, chaotic.
1 Introduction
Mean shift is a popular nonparametric density analysis tool introduced in Refs. [1-3]. In essence, it is an iterative local mode detection algorithm in the density distribution space. Cheng [2] notes that mean shift is fundamentally a gradient ascent algorithm with an adaptive step size. It has been used for a wide variety of applications such as robust estimation, clustering, image segmentation and visual tracking [3-9]. Despite its successful application, mean shift can only be used to find local modes. Being trapped in a local maximum/minimum is a common problem for traditional nonlinear optimization algorithms. The particle swarm algorithm is a new evolutionary technique proposed by Kennedy and Eberhart [10-11]. Due to its simple concept, few parameters, and easy implementation, PSO has gained much attention and has been widely used in many areas [12]. However, the performance of simple PSO greatly depends on its parameters, and it often suffers from premature convergence [13]. Many approaches have been proposed to improve the accuracy of the optima. Due to its easy implementation and special ability to avoid being trapped in local optima, chaos has become a novel optimization technique, and chaos-based searching algorithms
have aroused intense interests [14]. In Ref. [15], the authors propose an improved particle swarm optimization algorithm combined with piecewise linear chaotic map (PWLCPSO) which is a powerful strategy to diversify the PSO population and improve the PSO performance in preventing premature convergence to local minima. In PWLCPSO, the piecewise linear chaotic map is introduced to execute chaotic search for obtaining better chaotic behavior as well as higher speed. In this paper, the mean shift vector is firstly searched using PWLCPSO algorithm. Then, the optimal mean shift vector is updated using mean shift procedure. Experimental results on the test patterns are given to demonstrate the robustness and validity of the proposed algorithm used for image segmentation. The rest of this paper is organized as follows. Section 2 is the overview of mean shift segmentation. Section 3 describes in detail the hybridized particle swarm optimization algorithm. The proposed mean shift segmentation method based on hybridized particle swarm optimization is presented in section 4. Section 5 gives the experimental results. Finally, in section 6, we conclude this paper.
2 Overview of Mean Shift Segmentation
Let X = {x_1, x_2, ..., x_n} be a data set in an s-dimensional Euclidean space R^s. Camastra and Verri [16] and Girolami [17] have recently considered kernel-based clustering for X in the feature space, where the data space is transformed to a high-dimensional feature space F and the inner products in F are represented by a kernel function. On the other hand, the kernel density estimation with the modes of the density estimate over X is another kernel-based clustering method based on the data space [18]. The modes of a density estimate are equivalent to the locations of the densest areas of the data set, and these locations can be satisfactory cluster center estimates. In the kernel density estimation, the mean shift is a simple gradient technique used to find the modes of the kernel density estimate.
Mean shift procedures are techniques for finding the modes of a kernel density estimate. Let K : X → R be a kernel with K(x) = k(‖x − x_i‖²). The kernel density estimate is given by

f̂_K(x) = Σ_{i=1}^{n} k(‖x − x_i‖²) w(x_i)    (1)
where w(x_i) is a weight function. Based on a uniform weight, Fukunaga and Hostetler [19] first gave the statistical properties, including the asymptotic unbiasedness, consistency and uniform consistency, of the gradient of the density estimate given by

∇f̂_K(x) = 2 Σ_{i=1}^{n} (x − x_i) k'(‖x − x_i‖²) w(x_i)    (2)
Suppose that there exists a kernel G : X → R with G(x) = g(‖x − x_i‖²) such that g(x) = −k'(x). The kernel K is termed a shadow of the kernel G. Then

∇f̂_K(x) = Σ_{i=1}^{n} g(‖x − x_i‖²) (x_i − x) w(x_i)
        = [ Σ_{i=1}^{n} g(‖x − x_i‖²) w(x_i) ] × [ ( Σ_{i=1}^{n} g(‖x − x_i‖²) w(x_i) x_i ) / ( Σ_{i=1}^{n} g(‖x − x_i‖²) w(x_i) ) − x ]
        = f̂_G(x) [m_G(x) − x]    (3)

The term m_G(x) − x = ∇f̂_K(x) / f̂_G(x) is called the generalized mean shift, which is proportional to the density gradient estimate. Taking the gradient estimator ∇f̂_K(x) to be zero, we derive a mode estimate as

x = m_G(x) = ( Σ_{i=1}^{n} g(‖x − x_i‖²) w(x_i) x_i ) / ( Σ_{i=1}^{n} g(‖x − x_i‖²) w(x_i) )    (4)
Eq. (4) is also called the weighted sample mean with kernel G. The mean shift vector always points toward the direction of increasing density, which makes mean shift clustering a hill climbing procedure: the data points that converge to the same peak are clustered into one local mode. The traditional mean shift segmentation includes the following three steps: bandwidth selection, mode detection and mode merging. In mode detection, the traditional approach searches the positions recursively along the convergent trajectory, and a threshold on m_G(x) must be set to stop the search. This blurs the regions with high density, and the number of detected local modes becomes too large. Too many local modes make it difficult to merge them and to eliminate texture patches; thus, over-segmentation often occurs in the traditional approach. In addition, mode merging is based on local information decisions [20], which makes the segmentation result unstable under various backgrounds.
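For concreteness, the sketch below iterates the weighted sample mean (4) from a starting point until the shift becomes negligible, using a flat kernel g (1 inside the bandwidth, 0 outside) as an illustrative kernel choice; bandwidth selection and mode merging are not shown.

```python
import numpy as np

def mean_shift_mode(x0, data, bandwidth, weights=None, max_iter=100, tol=1e-5):
    """Mode detection by repeatedly moving x to the weighted sample mean m_G(x)."""
    w = np.ones(len(data)) if weights is None else weights
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = (np.sum((data - x) ** 2, axis=1) <= bandwidth ** 2) * w  # g(||x - x_i||^2) w(x_i)
        if g.sum() == 0:
            break
        x_new = (g[:, None] * data).sum(axis=0) / g.sum()            # eq. (4)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```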
3 Hybridized PSO Algorithm
Particle swarm optimization is a population-based stochastic optimization algorithm, first introduced by Kennedy and Eberhart in 1995 [10-11]. It is a metaphor of the social behavior of animals such as bird flocking and fish schooling. Although PSO is often classed as evolutionary computation, it is actually an incarnation of swarm intelligence. In the context of PSO, a swarm refers to a number of potential solutions to the optimization problem, where each potential solution is referred to as a particle, and each of them has its own position and velocity. Firstly, their positions and velocities are
initialized randomly. Then, all particles “fly” through the solution space and update their positions until they find the optimal social cooperation. During this iterative process, each particle’s velocity is adjusted according to its own experience and social cooperation. The piecewise linear chaotic map (PWLCM) has gained increasing attention in chaos research recently due to its simplicity in presentation, efficiency in implementation, as well as good dynamical behavior. It has been known that PWLCMs are ergodic and have uniform invariant density function on their definition intervals [21]. The simplest PWLCM is denoted as
cx_{t+1} = cx_t / p,                 if cx_t ∈ (0, p)
cx_{t+1} = (1 − cx_t) / (1 − p),     if cx_t ∈ [p, 1)        (5)
To enhance the performance of particle swarm optimization, a hybrid particle swarm optimization (PWLCPSO) algorithm is proposed. In the PWLCPSO algorithm, chaotic search is applied only to the global best particle, because the range around it is likely to be the most promising area. Moreover, this saves much time compared to schemes that apply chaotic search to all particles [22]. As chaotic optimization is more effective in a small range, and the most promising area shrinks as the PSO iterations continue, the chaotic search radius r is decreased with a shrinking coefficient ρ (0 < ρ < 1); we set ρ = 0.8 in this paper.
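A minimal sketch of the PWLCM-driven chaotic search around the global best particle follows; the map parameter p = 0.7, the number of chaotic steps, and the assumption of a maximization problem are illustrative choices, since the paper itself only fixes the shrinking coefficient ρ = 0.8.

```python
import numpy as np

def pwlcm(cx, p=0.7):
    """Piecewise linear chaotic map of equation (5), applied elementwise."""
    return np.where(cx < p, cx / p, (1.0 - cx) / (1.0 - p))

def chaotic_search(gbest, fitness, radius, n_steps=20, rho=0.8, p=0.7):
    """Chaotic local search around gbest: PWLCM sequences are mapped into a box of
    half-width `radius` around gbest, and the box shrinks by rho after each step."""
    best, best_fit = gbest.copy(), fitness(gbest)
    cx = np.random.uniform(0.01, 0.99, size=gbest.shape)
    for _ in range(n_steps):
        cx = pwlcm(cx, p)
        cand = gbest + radius * (2.0 * cx - 1.0)   # map chaos into [gbest - r, gbest + r]
        fit = fitness(cand)
        if fit > best_fit:                         # maximization assumed
            best, best_fit = cand.copy(), fit
        radius *= rho                              # shrink the promising area
    return best, best_fit
```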
4 Mean Shift Segmentation Method Based on Hybridized PSO
In essence, mean shift is an iterative local mode detection algorithm in the density distribution space; it can only be used to find local modes, and being trapped in a local maximum/minimum is a common problem for traditional nonlinear optimization algorithms. In this paper, we first propose a PSO based mean shift algorithm to overcome this problem of mean shift. This algorithm first uses the PSO algorithm to optimize the mean shift vector, and then performs the mean shift procedure. However, the traditional PSO algorithm greatly depends on its parameters and often suffers from premature convergence. In order to improve performance, we propose the mean shift segmentation method based on hybridized particle swarm optimization to solve the local search problem, and apply it to image segmentation. The proposed algorithm can efficiently prevent being trapped in a local optimum by use of the PWLCPSO algorithm. The process for implementing the proposed algorithm is as follows:
1. initialization
2. repeat
3. perform the PWLCPSO algorithm to optimize the mean shift vector
4. calculate x_{t+1} with equation (4)
5. until the convergence condition is met
The new algorithm is a two-phase iterative strategy which first optimizes the mean shift vector using the PWLCPSO algorithm, and then uses the output of the PWLCPSO algorithm to perform the mean shift procedure.
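One plausible reading of this two-phase loop, reusing the mean_shift_mode and chaotic_search sketches from the previous sections (both hypothetical helper names introduced in this document, not the authors' code), is shown below; using the in-bandwidth point count as the fitness for the chaotic phase is our assumption, not the paper's formulation.

```python
import numpy as np

def hybrid_mode_seek(x0, data, bandwidth, n_outer=10, tol=1e-5):
    """Alternate a chaotic (PWLCPSO-style) search that improves the current
    estimate with a mean shift refinement of eq. (4), until convergence."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_outer):
        density = lambda y: float(np.sum(np.sum((data - y) ** 2, axis=1) <= bandwidth ** 2))
        x, _ = chaotic_search(x, density, radius=bandwidth)   # phase 1: chaotic global search
        x_new = mean_shift_mode(x, data, bandwidth)           # phase 2: mean shift update
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```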
5 Experimental Results
To show that our proposed mean shift segmentation method based on hybridized particle swarm optimization outperforms the traditional mean shift, we use two test patterns to demonstrate its performance. These test patterns are widely used in the image segmentation literature: the standard gray-level image named Lena and an MR image. The proposed new algorithm is compared with the traditional mean shift algorithm and the PSO based mean shift algorithm. Fig. 1 and Fig. 2 show the experimental results with these three algorithms. Fig. 1(a) and Fig. 2(a) are the original images; Fig. 1(b) and Fig. 2(b) are the experimental results of the mean shift algorithm; Fig. 1(c) and Fig. 2(c) are the results of the PSO based mean shift algorithm; the results of the proposed mean shift segmentation method based on hybridized particle swarm optimization are shown in Fig. 1(d) and Fig. 2(d).
Fig. 1. Comparison of segmentation results on lena image. (a) Original image (b) result of mean shift algorithm (c) result of PSO based mean shift algorithm (d) result of proposed new algorithm
Fig. 2. Comparison of segmentation results on MR image. (a) Original image (b) result of mean shift algorithm (c) result of PSO based mean shift algorithm (d) result of proposed new algorithm
From the point of view of visual analysis, no big differences are observed among the images segmented by the three algorithms. However, the image segmented with our new algorithm looks a little more natural with respect to the original image. Moreover, most details are preserved when using our new algorithm, and its segmentation ability is excellent, with a powerful capability for distinguishing objects in the image. As seen from Fig. 1, the hair and hat are well segmented with our new algorithm, but not when using the other algorithms. Fig. 2(d) is the most distinct among the segmented results of Fig. 2. Table 1 tabulates the running time of these three algorithms on the two test patterns.

Table 1. The running time of the three algorithms for the two test patterns

Image      Method                            Running time (seconds)
lena       Mean shift algorithm              45.711364
lena       PSO based mean shift algorithm    21.597634
lena       Proposed new algorithm            1.870870
MR image   Mean shift algorithm              2.258561
MR image   PSO based mean shift algorithm    40.202861
MR image   Proposed new algorithm            4.882389
From Table 1 we can see that the running time of our new algorithm lies between that of the mean shift algorithm and that of the PSO based mean shift algorithm, and sometimes its running time is the shortest. Although the running time of our new algorithm is sometimes longer than that of the mean shift algorithm, the quality of its image segmentation is better. The proposed new algorithm is a trade-off between segmentation quality and running time.
6 Conclusion
In order to overcome the problem of being trapped in a local maximum/minimum in the traditional mean shift algorithm, we propose a mean shift segmentation method based on a hybridized particle swarm optimization algorithm. The new algorithm first uses the hybridized PSO algorithm to optimize the mean shift vector; then the mean shift procedure is carried out to update the output of the hybridized PSO algorithm. Experimental results on the test patterns are given to demonstrate the robustness and validity of the proposed algorithm used for image segmentation.
Acknowledgments The authors would like to thank the anonymous reviewers for their helpful comments and suggestions to improve the presentation of the paper. This research is supported by the Natural Science Foundation of China (No. 60874031), the Natural Science Foundation of Henan province (2008A520021) and Young Backbone Teachers Assistance Scheme of Xinyang Normal University.
References 1. Fukunaga, K., Hostetler, L.D.: The estimation of the gradient of a density function with applications in pattern recognition. IEEE Trans. on Information Theory 21(1), 32–40 (1975) 2. Cheng, Y.Z.: Mean shift, mode seeking, and clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence 17(8), 790–799 (1995) 3. Comaniciu, D., Meer, P.: Mean shift: A Robust Approach toward Feature Analysis. IEEE Trans. Pattern Analysis and Machine Intelligence 24(5), 603–619 (2002) 4. Georgescu, B., Shimshoni, I., Meer, P.: Mean shift based clustering in high dimensions: A texture classification example. In: Proceeding of the Ninth IEEE International Conference on Computer Vision, France, pp. 456–463 (2003) 5. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(5), 564–577 (2003) 6. Collins, R.: Mean-shift blob tracking through scale space. In: Proceeding of IEEE Conference on Computer Vision and Pattern Recognition, Wisconsin, pp. 234–240 (2003) 7. Elgammal, A., Duraiswami, R., Davis, L.S.: Probabilistic tracking in joint feature-spatial spaces. In: Proceeding of IEEE Conference on Computer on Computer Vision and Pattern Recognition, Wisconsin, pp. 1781–1788 (2003) 8. Hager, G.D., Dewan, M., Stewart, C.V.: Multiple kernel tracking with SSD. In: Proceeding of IEEE Conference on Computer Vision and Pattern Recognition, Washington, pp. 1790– 1797 (2004)
9. Yang, C., Duraiswarni, R., Davis, L.: Efficient spatial-feature tracking via the mean-shift and a new similarity measure. In: Proceeding of IEEE Conference on Computer Vision and Pattern Recognition, San Diego, pp. 176–183 (2005) 10. Kennedy, J., Eberhart, R.C.: Particle swarm optimization. In: Proceeding of IEEE Int. Conf. on Network, Australia, pp. 1942–1948 (1995) 11. Kennedy, J., Eberhart, R.C., Shi, Y.: Swarm intelligence. Morgan Kaufmann Publishers, San Francisco (2001) 12. Eberhart, R.C., Shi, Y.: Particle swarm optimization: developments, applications and resources. In: Proceeding of Congress on evolutionary computation, Seoul, pp. 81–86 (2001) 13. Angeline, P.J.: Evolutionary optimization versus particle swarm optimization: philosophy and performance differences. In: Porto, V.W., Waagen, D. (eds.) EP 1998. LNCS, vol. 1447, pp. 601–610. Springer, Heidelberg (1998) 14. Wang, L., Zheng, D.Z., Lin, Q.S.: Survey on chaotic optimization methods. Comput. Technol. Automat. 20(1), 1–5 (2001) 15. Xiang, T., Liao, X., Wong, K.W.: An improved particle swarm optimization algorithm combined with piecewise linear chaotic map. Applied Mathematics and Computation 190(2), 1637–1645 (2007) 16. Camastra, F., Verri, A.: A novel kernel method for clustering. IEEE Trans. Pattern Anal. Mach. Intell. 27, 801–805 (2005) 17. Girolami, M.: Mercer kernel based clustering in feature space. IEEE Trans. Neural Networks 13(3), 780–784 (2002) 18. Silverman, B.W.: Density estimation for statistics and data analysis. Chapman & Hall, London (1986) 19. Fukunaga, K., Hostetler, L.D.: The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Inf. Theory 21(1), 32–40 (1975) 20. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space analysis. IEEE Transactions on PAMI 24(5), 603–619 (2002) 21. Baranovsky, A., Daems, D.: Design of one-dimensional chaotic maps with prescribed statistical properties. International journal of bifurcation and chaos 5(6), 1585–1598 (1995) 22. Meng, H., Zheng, P., Wu, R., et al.: A Hybrid particle swarm algorithm with embedded chaotic search. In: Proceedings of IEEE Conference on Cybernetics and Intelligence Systems, Singapore, pp. 367–371 (2004)
Palmprint Recognition Using Polynomial Neural Network LinLin Huang and Na Li School of Automation Science and Electrical Engineering Beijing University of Aeronautics and Astronautics, Beijing 100083, China {llhuang,nali}@buaa.edu.cn
Abstract. In this paper, we propose a robust palmprint recognition approach. Firstly, a salient-point based method is applied to segment as well as align the region of interest (ROI) from the palmprint image. Then, a subspace projection technique, namely, independent component analysis (ICA) is performed on the ROI to extract features. Finally, a polynomial neural network (PNN) is used to make classification on reduced feature subspace. The effectiveness of the proposed method has been demonstrated in experiments. Keywords: Palmprint, recognition, polynomial, neural network.
1 Introduction
Automatic and reliable personal identification for effective security control has been in urgent demand with the rapid growth of e-commerce applications [1]. Computer-based personal identification, also known as biometrics, has been considered the most effective solution. Compared with other biometrics, palmprints have several advantages, such as stable line features, rich texture features, low-resolution imaging, and low-cost capturing devices [2] [3]. Therefore, personal identification based on palmprints has become an active research topic. So far, many methods have been proposed for palmprint recognition, which can be roughly divided into two categories: structural feature based and statistical feature based. Structural feature based methods [4] [5] [6] directly extract structural information, such as principal lines, wrinkles, and minutiae points, which can represent the structural features of a palmprint clearly. Although line features can be detected even in low-resolution palmprint images, this kind of method has to spend considerable computation on matching the line segments against the templates stored in the database [7]. In statistical feature based methods, the palmprint image is considered as a whole and palmprint features are extracted by transforming the image. The extracted features are subsequently used for classification. Many feature extraction methods, such as the Fourier transform [8], Gabor filters [9], eigenpalm [10], fisherpalm [11], and independent component analysis (ICA) [4], have been explored. Besides Euclidean distance [10] and Hamming distance [1], radial basis function neural networks [12] and probabilistic neural networks [4] have been used for feature classification. Obviously, the performance of statistical feature based approaches depends heavily on the effectiveness of the feature extraction method as well as the classification
scheme. Compared with the Fourier transform, eigenpalm, etc., ICA depends only on the statistical properties of the data and can be applied to multi-dimensional data. Owing to the discrimination capability obtained from learning samples, neural networks are well suited to classification. In this paper, we propose a robust palmprint recognition approach. Firstly, a salient-point based method is applied to segment as well as align the region of interest (ROI) from the palmprint image. Then, ICA is performed on the ROI to extract features. Finally, a polynomial neural network (PNN) is used to perform classification on the reduced feature subspace. The effectiveness of the proposed method has been demonstrated in experiments.
2 Preprocessing
Before feature extraction and classification, variations of palmprint images in lighting condition, size and orientation induced by the capturing process have to be corrected. Besides, the region of interest (ROI) of the palmprint, which contains the information useful for classification, should be segmented and aligned.
Fig. 1. Salient points detection and ROI segmentation
Fig. 2. ROIs of different palmprints
The points A and B shown in Fig. 1(a) are called salient points; they will be used to locate the ROI. Firstly, histogram equalization is employed to alleviate the variation of lighting conditions. Secondly, morphology operations and a thresholding technique are applied to derive the contour of the palm from the image. Thirdly, the curvature rates of the points constituting the palm contour are computed, and the points with the highest curvature rates within a range are chosen as the salient points, as shown in Fig. 1(b).
The line through points A and B is taken as the y-axis, and the direction perpendicular to the y-axis is the x-axis. The ROI is defined as the rectangular area shown in Fig. 1(c). Finally, the extracted ROI is aligned and normalized to 128x128 pixels (Fig. 1(d)). Fig. 2 gives some resulting ROIs of different palmprint images: the images in the upper row are the original palmprints, and the corresponding ROIs are shown in the lower row.
3 Feature Extraction Method
After the ROI is segmented, the pixel intensities of the ROI can be arranged into a 16,384-dimensional vector for classification. An important issue here is how to extract discriminative features and reduce the dimensionality so as to obtain a compact representation. Two well-known techniques, namely principal component analysis (PCA) and independent component analysis (ICA), can fulfill this task. Basically, PCA considers only the 2nd-order moments, so it lacks information on higher-order statistics. ICA accounts for higher-order statistics and identifies the independent source components from their linear mixtures. ICA thus provides a more powerful data representation than PCA [13]. The model of ICA is defined in Eq. (1), which describes how the observed data x are generated by a process of mixing the components s_i.
x = As    (1)
After the mixing matrix A is estimated, its inverse W can be computed; the independent components s_i are then obtained by:
s = Wx,  W = A^(-1)    (2)
Representing observed data as a linear combination of statistically independent components captures the essential structure of the data, so that it achieves good performance in many applications [14]. When ICA is applied to palmprint recognition, palmprint images are considered as the observed data, which is a mixture of an unknown set of independent source images. The FastICA [13] algorithm can be applied to compute the independent components. The selected components construct a feature subspace. The projection of test palmprint images onto the feature subspace will be used as the input of a polynomial neural network (PNN) for classification.
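A minimal sketch of this feature-extraction stage is given below, assuming the ROIs have already been flattened into 16,384-dimensional row vectors; scikit-learn's FastICA is used as a stand-in for the FastICA implementation cited in the paper, and the variable names and the component count are illustrative.

```python
import numpy as np
from sklearn.decomposition import FastICA

def ica_subspace(rois, n_components=90):
    """Fit FastICA on flattened training ROIs and return the fitted model.

    rois: (n_samples, 128*128) array of preprocessed ROIs (assumed given).
    """
    ica = FastICA(n_components=n_components, max_iter=1000)
    ica.fit(rois)
    return ica

# usage sketch:
# ica = ica_subspace(train_rois, n_components=90)
# train_feats = ica.transform(train_rois)   # projections fed to the PNN
# test_feats  = ica.transform(test_rois)
```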
4 Classification Scheme The PNN can be viewed as a generalized linear classifier which uses as inputs not only the feature measurements of the input pattern but also the polynomials of the measurements. The binomial network is also closely related to the Gaussian quadratic classifier since they both utilize the second-order statistics of the pattern space [15] [16]. However, the PNN (including the binomial network) breaks the constraints of Gaussian density and the parameters are optimized in discriminative learning so as to well separate the patterns of different classes.
Compared to other neural networks, such as the multilayer perceptron (MLP) and the radial basis function (RBF) network, the PNN is faster in learning and less susceptible to local minima because it has a single-layer structure. The PNN had been applied in our previous work on face detection and achieved superior performance [17]. The output of the PNN is computed by

y(x) = g( Σ_{i=1}^{d} w_i x_i + Σ_{j=1}^{d} Σ_{i=j}^{d} w_ij x_i x_j + w_0 ),  x = (x_1, ..., x_d)    (3)

g(a) = 1 / (1 + exp(−a))

where x is the input vector and y(x) is the output of the network. The connecting weights are updated by gradient descent to minimize the mean square error:

E = (1/2) { Σ_{n=1}^{N_x} [y(x_n) − t_n]^2 + λ Σ_{w ∈ W−{w_0}} w^2 } = Σ_{n=1}^{N_x} E_n    (4)

w(n+1) = w(n) − η ∂E_n / ∂w    (5)

where N_x is the total number of samples, t_n is the target output value, λ is the coefficient of weight decay that restricts the size of the connecting weights (excluding the bias), and η is the learning rate, which is small and decreases progressively.
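The sketch below implements this second-order PNN and its gradient-descent update in the spirit of Eqs. (3)–(5). It is a simplified single-output illustration (one network per class, trained on ICA features); the hyperparameter values and names are assumptions rather than the authors' settings.

```python
import numpy as np

def poly_expand(x):
    """Second-order expansion: [x_i] plus [x_i * x_j for i >= j], as in Eq. (3)."""
    d = x.shape[0]
    quad = np.array([x[i] * x[j] for j in range(d) for i in range(j, d)])
    return np.concatenate([x, quad])

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_pnn(X, t, epochs=50, eta=0.05, lam=1e-4, rng=None):
    """Single-output PNN trained by stochastic gradient descent on the
    regularized squared error of Eq. (4); returns (weights, bias)."""
    rng = np.random.default_rng() if rng is None else rng
    Z = np.array([poly_expand(x) for x in X])      # expanded inputs
    w = rng.normal(scale=0.01, size=Z.shape[1])
    w0 = 0.0
    for _ in range(epochs):
        for z, tn in zip(Z, t):
            y = sigmoid(z @ w + w0)
            grad = (y - tn) * y * (1 - y)          # dE_n / da through the sigmoid
            w -= eta * (grad * z + lam * w)        # weight decay excludes the bias
            w0 -= eta * grad
        eta *= 0.95                                # learning rate decreases progressively
    return w, w0

def pnn_output(x, w, w0):
    return sigmoid(poly_expand(x) @ w + w0)
```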
5 Experimental Results
The palmprint database collected by the Hong Kong Polytechnic University is used to verify the proposed method. The database contains 400 palmprint images of size 384x384 from 20 individuals. Among the 20 images of each person, 12 images are used for training while the remaining 8 images are used for testing. After preprocessing, ROIs of size 128x128 are segmented from the 240 training images and then used to compute the independent components with the FastICA algorithm. The weights of the PNN are learned from the training samples. In testing, the projection of a new palmprint image is fed into the PNN for classification. We ran several experiments to compare the performance of PCA and ICA. The influence of the feature subspace dimension on recognition accuracy has also been investigated; the dimension varies from 50 to 100. The results are given in Table 1 and Table 2. From the results, we can see that both PCA and ICA perform well, while ICA gives better results. Besides, as the dimension of the feature subspace increases, the recognition rate goes up; but when the subspace dimension exceeds 90, the recognition rate decreases. This can be explained by the fact that as the number of components increases, they tend to capture non-useful information such as noise, so the performance deteriorates.
Table 1. Recognition results using PCA

Dimension   False positives   Recognition rate
PCA-100     5                 96.87%
PCA-90      3                 98.13%
PCA-70      3                 98.13%
PCA-50      6                 96.25%

Table 2. Recognition results using ICA

Dimension   False positives   Recognition rate
ICA-100     2                 98.75%
ICA-90      1                 99.38%
ICA-70      1                 99.38%
ICA-50      3                 98.13%
6 Conclusions
In this paper, we have proposed a robust palmprint recognition approach. Firstly, a salient-point based method is applied to segment as well as align the region of interest (ROI) from the palmprint image. Then, ICA is performed on the ROI to extract features. Finally, a polynomial neural network (PNN) is used to perform classification on the reduced feature subspace. The effectiveness of the proposed method has been demonstrated in experiments.
References [1] Zhang, D., Kong, W.-K., Jane, Y.: Online Palmprint Identification. IEEE Trans. on Pattern Analysis and Machine Intelligence 25, 1041–1050 (1995) [2] Duda, N., Jain, A.K., Mardia, K.V.: Matching of Palmprint. Pattern Recognition Letters 23, 477–485 (2002) [3] Shu, W., Zhang, D.: Automated Personal Identification by Palmprint. Optical Engineering 37, 2659–2362 (1998) [4] Zhang, D., Shu, W.: Two Novel Characteristics in Palmprint verification: datum point invariance and line feature matching. Pattern Recognition Letters 32, 691–702 (1999) [5] Wu, X., Wang, K., Zhang, D.: Fuzzy directional element energy feature based palmprint identification. In: Proc. of International Conference on Pattern Recognition, vol. 1, pp. 95–98 (2002) [6] Han, C.C., Cheng, H.L., Lin, C.L., Fan, K.C.: Personal Authentication Using Palmprint. Pattern Recognition 36, 281–371 (2003) [7] Connie, T., Jin, A.T.B., Ong, M.G.K., Ling, A.N.: An automated Palmprint Recognition System. Image Vision Computing 23, 501–515 (2005) [8] Li, W., Zhang, D., Xu, Z.: Palmprint Identification by Fourier Transform. International Journal of Pattern Recognition and Artificial Intelligence 16, 417–432 (2003)
[9] Kong, W.K., Zhang, D., Li, W.: Palmprint Feature Extraction using 2-D Gabor Filters. Pattern Recognition 36, 2339–2347 (2003) [10] Lu, G., Zhang, D., Wang, K.: Palmprint Recognition using Eigenpalms Features. Pattern Recognition Letters 24, 1473–1477 (2003) [11] Wu, X., Zhang, D., Wang, K.: Fisherpalm Based Palmprint Recognition. Pattern Recognition Letters 24, 2829–2838 (2003) [12] Shang, L., Zhang, D., Du, J., Zheng, C.: Palmprint Recognition Using FastICA algorithm and Radial Basis Probabilistic Neural Network. Pattern Recognition Letters 69, 1782– 1786 (2006) [13] Hyvarinen, A., Oja, E.: Independent Component Analysis: Algorithms and Applications. Neural Networks 13, 411–432 (2002) [14] Bartlett, M.S., Movellan, J.R., Sejnowski, T.J.: Face Recognition By Independent Component Analysis. IEEE Trans. Neural Network 13, 1450–1464 (2002) [15] Schneiderman, H., Kanade, T.: Probabilistic modeling of local appearance and spatial relationships for object recognition. In: Proc. IEEE International Conf. on Computer Vision and Pattern Recognition, pp. 45–51. IEEE Press, New York (1998) [16] Yau, H.-C., Tanry, T.: Iterative improvement of a Gaussian classifier. Neural Networks 3, 437–443 (1990) [17] Huang, L., Shimizu, A., Kobatake, H.: A Multi-expert Approach for Robust Face Detection. Pattern Recognition 39, 1695–1703 (2006)
Motion Detection Based on Biological Correlation Model Bin Sun, Nong Sang, Yuehuan Wang, and Qingqing Zheng Institute for Pattern Recognition and Artificial Intelligence, Huazhong University of Science and Technology, Wuhan 430074, China
[email protected]
Abstract. Research on the motion perception has received great attention in recent years. In this paper, on the basis of existing biological vision achievement, a computer implementation is carried out to examine the performance of the biologically-motivated method for motion detection. The proposed implementation is validated in both synthetic and real-world image sequences. The experimental comparisons with a representative gradient optical flow solution show that the biological correlation detector has better robustness and anti-noise capability. Keywords: Motion detection, Biological correlation model, Gradient optical flow method.
1 Introduction
Motion detection is one of the fundamental subjects in image sequence analysis. Based on different theories, diverse approaches have been proposed while others continue to appear [1]. So far, derived from studies in computer vision, most optical flow methods rely on the well-known brightness constancy assumption. However, traditional approaches are commonly sensitive to noise in images, and the basic assumption cannot always be satisfied in the real world. During recent decades, the development of biological vision has produced major advances in understanding visual motion perception. Hassenstein and Reichardt [2] proposed the first computational motion model, which is based on the observed behavioral response of insects. This model sets out the basic framework of motion detection. Subsequently, considerable psychophysical and physiological studies have been devoted to revealing the neural implementation of motion perception. In the mid 1980s, Van Santen and Sperling [3] successfully elaborated a model to account for the human visual system, termed the correlation motion detector. It has been demonstrated that this model can explain a wide range of phenomena in human motion perception. While much research has been concentrated on biological vision, few works have addressed computer vision applications. Inspired by the motion perception theory, in this paper an executable implementation is developed in order to examine the performance of the biological correlation detector. The proposed approach
is tested on both synthetic and real image sequences. In comparison with an improved gradient method [4], the experimental results confirm that the biological motion model has better performance, at least in certain circumstances. We believe that this biologically inspired algorithm may have broader applicability in the computer vision field. The organization of the paper is as follows: In Section 2, we give the implementation of the biological correlation model in detail. Section 3 shows experimental results on synthetic and real world motion sequences, and provides the comparison with a gradient optical flow method. The conclusion and discussion are given in Section 4.
2 Biological Correlation Model
The basic principle of the original Reichardt detector [2] is universal, i.e. the comparison of a visual input from one location with the time-delayed input from an adjacent location. Subsequently Van Santen and Sperling [3] successfully extended the original model with some spatial and temporal filters as a theory of human motion perception. In this paper, a modified version of the biological correlation model is proposed, and each step of the implementation will be explained in detail [5]. Fig. 1 illustrates the processing steps in such correlation motion model. According to the delay-and-comparison principle, separable spatial and temporal filters in each subunit collaborate to compute the motion. It should be noted that all machine-computing processes involved are biologically and physiologically plausible, as well feasible for computer programming.
Fig. 1. The processing steps in the correlation motion model. Typical spatial and temporal profiles are sketched to illustrate the filters adopted in implementation.
In each subunit of the correlation model, input signals are firstly processed by pairs of Gabor spatial filters [6], which are cosine (left subunit) and sine (right subunit) weighted by identical Gaussian windows (as listed in (1)). The Gabor
functions are shown in (2) and (3), which fit the receptive-field profiles of simple visual cortex cells quite well. Hierarchical coarse-to-fine Gabor filters are adopted, with suitable selectivity for spatial frequency and orientation.

gauss(x, y, σ_x, σ_y) = (1 / (2π σ_x σ_y)) · exp[ −( x'^2 / (2σ_x^2) + y'^2 / (2σ_y^2) ) ]    (1)

gabor_c(x, y, σ_x, σ_y) = gauss(x, y, σ_x, σ_y) · cos(2π ω_f x')    (2)

gabor_s(x, y, σ_x, σ_y) = gauss(x, y, σ_x, σ_y) · sin(2π ω_f x')    (3)
Here x' = x cos θ_f + y sin θ_f and y' = −x sin θ_f + y cos θ_f. The σ_x and σ_y denote the horizontal and vertical spatial extent of the filter, and ω_f and θ_f indicate the center frequency and orientation of the filter, respectively. In early versions of the correlation models [3,5], a relative delay was introduced by lowpass and bandpass temporal filters, which are analogous to sustained and transient cells, respectively. However, in an implementation more temporal filters require more memory and computation, and even reduce the temporal resolution to some extent. In this paper a 'pure' delay is implemented by a lowpass temporal filter, typically designed as a Gamma filter [7]. Physiologically, neurons' temporal filters have exactly such properties, in which the phase spectrum is roughly a linear function [8].

gamma(n, λ, t) = { t^(n−1) · exp(−t/λ) / (λ^n · (n−1)!)   if t ≥ 0;   0   if t < 0 }    (4)

In the proposed implementation, n is the order of the temporal filter and λ is a time constant. The temporal filter is set with a finite length of t_c. The spatial and temporal filters designed above collaborate to compute the motion in each subunit. The subsequent comparison operation is implemented by multiplication [2,3], which has been observed to underlie the insects' behavior. From the outputs of two neighboring receptive fields (spatially filtered), a product is obtained by multiplying one with a delayed (temporally filtered) version of the other. Such a response is sensitive to a preferred motion direction, i.e. the left subunit normally detects leftward movement and the right one rightward movement. In the last stage, the subunit responses are subtracted from each other [2,3], and the sign of the difference indicates the direction of motion. As a whole, such a biological motion model performs a spatiotemporal auto-correlation analysis.
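A rough, self-contained sketch of these steps (Gabor spatial filtering, Gamma temporal delay, multiplication of delayed and undelayed channels, and opponent subtraction) is given below for a single orientation. The quadrature cosine/sine Gabor pair stands in for the two neighbouring inputs, and the subunit pairing follows the standard elaborated-Reichardt form; the kernel size is an assumption, while the parameter defaults are taken from the experimental setup in Section 3.1. This is a sketch under those assumptions, not the authors' exact configuration.

```python
import numpy as np
from scipy.ndimage import convolve
from scipy.signal import lfilter
from math import factorial

def gabor_pair(size, sigma_x, sigma_y, omega, theta):
    """Cosine/sine Gabor kernels of Eqs. (1)-(3) on a (size x size) grid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-(xr**2 / (2 * sigma_x**2) + yr**2 / (2 * sigma_y**2)))
    g /= 2 * np.pi * sigma_x * sigma_y
    return g * np.cos(2 * np.pi * omega * xr), g * np.sin(2 * np.pi * omega * xr)

def gamma_kernel(n, lam, t_c):
    """Discrete Gamma lowpass kernel of Eq. (4), truncated at t_c samples."""
    t = np.arange(t_c, dtype=float)
    return t**(n - 1) * np.exp(-t / lam) / (lam**n * factorial(n - 1))

def correlation_detector(frames, sigma_x=1.16, sigma_y=1.34, omega=0.338,
                         theta=0.0, n=1, lam=3.6, t_c=16):
    """frames: (T, H, W) array; returns an opponent motion signal per frame/pixel."""
    gc, gs = gabor_pair(15, sigma_x, sigma_y, omega, theta)
    even = np.stack([convolve(f, gc, mode="nearest") for f in frames])
    odd = np.stack([convolve(f, gs, mode="nearest") for f in frames])
    k = gamma_kernel(n, lam, t_c)
    even_d = lfilter(k, [1.0], even, axis=0)   # delayed (temporally filtered) channel
    odd_d = lfilter(k, [1.0], odd, axis=0)
    # each subunit multiplies one spatial channel with the delayed other channel;
    # subtracting the two subunits gives a direction-selective opponent signal
    return even_d * odd - odd_d * even
```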
3 Experiment
3.1 Experimental Setup
The improved biological correlation detector has been presented above. In this section, the proposed approach is tested on both synthetic and real image sequences. In order to assess the anti-noise performance and robustness, we have
chosen typical scenes that suffer from dynamic noise and distortion. In our experiments, only gray-level image sequences (128×128 pixels) are studied, and the original videos are converted into 8-bit gray-scale format. In the implementation of the correlation detector, four Gabors are adopted, whose center frequencies ω_f cover [0.338, 0.169, 0.085, 0.042]; the corresponding σ_x and σ_y are [1.160, 2.319, 4.638, 9.276] and [1.337, 2.674, 5.348, 10.70], and the orientations θ_f are taken from 0 to π in steps of π/8. The parameters of the temporal filter are set as follows: n = 1, λ = 3.6, t_c = 16; see [5] for details. Comparisons with a gradient optical flow method are also made to demonstrate the performance of the biological model. The improved Horn-Schunck method [4] is adopted, which relies on the brightness constancy assumption combined with a smoothness constraint. This approach reduces the effect of noise on the spatial and temporal gradients using a 3D Sobel operator (2D space plus 1D time). Table 1 illustrates the scheme for computing the spatio-temporal gradients at each image pixel using the spatiotemporal Sobel operator.

Table 1. Scheme of the 3D Sobel operator

                 x-derivative        y-derivative        t-derivative
previous frame   -1  0  1             1  2  1            -1 -2 -1
                 -2  0  2             0  0  0            -2 -4 -2
                 -1  0  1            -1 -2 -1            -1 -2 -1
current frame    -2  0  2             2  4  2             0  0  0
                 -4  0  4             0  0  0             0  0  0
                 -2  0  2            -2 -4 -2             0  0  0
next frame       -1  0  1             1  2  1             1  2  1
                 -2  0  2             0  0  0             2  4  2
                 -1  0  1            -1 -2 -1             1  2  1
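A small sketch of how these kernels could be applied is shown below. The 3×3×3 kernels are built as outer products of a [1, 2, 1] smoothing mask and a [-1, 0, 1] derivative mask, which reproduces the entries of Table 1 (ordered previous/current/next along the time axis); the convolution call and function names are illustrative.

```python
import numpy as np
from scipy.ndimage import convolve

smooth = np.array([1.0, 2.0, 1.0])

def sep3(t, y, x):
    """Outer product of three 1D masks into a (t, y, x) kernel."""
    return np.einsum('i,j,k->ijk', t, y, x)

# Derivative masks chosen so the kernels match Table 1 entry by entry
kx = sep3(smooth, smooth, np.array([-1.0, 0.0, 1.0]))
ky = sep3(smooth, np.array([1.0, 0.0, -1.0]), smooth)
kt = sep3(np.array([-1.0, 0.0, 1.0]), smooth, smooth)

def spatiotemporal_gradients(frames):
    """frames: (T, H, W) float array; returns (Ix, Iy, It) per pixel.

    Note: ndimage.convolve flips the kernel; use scipy.ndimage.correlate instead
    if the orientation of Table 1 is to be applied verbatim.
    """
    Ix = convolve(frames, kx, mode="nearest")
    Iy = convolve(frames, ky, mode="nearest")
    It = convolve(frames, kt, mode="nearest")
    return Ix, Iy, It
```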
In this paper, we adopt a color code [9], shown in Fig. 2, to indicate the estimation of 2D flow fields. For the color code scheme, the direction of motion is encoded as hue, and the strength as saturation.
Fig. 2. Color code scheme
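A minimal sketch of such a color coding is shown below: hue is derived from the flow direction and saturation from the normalized strength, with the value channel fixed. The exact angle-to-hue convention of [9] is not reproduced here, so the mapping and names are assumptions.

```python
import numpy as np
from matplotlib.colors import hsv_to_rgb

def flow_to_color(u, v):
    """Encode a 2D flow field (u, v) as an RGB image: direction -> hue, strength -> saturation."""
    mag = np.sqrt(u ** 2 + v ** 2)
    hue = (np.arctan2(-v, -u) / np.pi + 1.0) / 2.0   # angle mapped into [0, 1]
    sat = mag / (mag.max() + 1e-9)                   # normalized strength
    hsv = np.stack([hue, sat, np.ones_like(mag)], axis=-1)
    return hsv_to_rgb(hsv)
```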
3.2 Synthetic Scenes
Sinusoidal gratings are commonly used to study the mechanisms of motion perception. As illustrated in Fig. 3(a), a grating oriented at 45 degrees moves in a −45 degree direction (i.e. towards the bottom-right) at a speed of 1 pixel/frame. In the simulations, to assess the anti-noise performance, the images are corrupted by dynamic salt-and-pepper noise. The experimental results of the gradient method and the biological correlation detector are given as follows:
Fig. 3. Synthetic motion sequence example and corresponding experimental results. (a) A single frame of the moving grating with dynamic salt-and-pepper noise. (b) Result of the 3D Sobel H-S optical flow method. (c) Results of the biological correlation detector.
Fig. 4(a) illustrates another example of a synthetic scene, in which one object (the small gray rectangle) moves in a dynamic noise background. Here the gray rectangle moves to the lower right at a speed of 1 pixel/frame, and the background, i.e. the black and white random dot, varies in each step.
Fig. 4. Synthetic motion sequence example and corresponding experimental results. (a) A single frame of the moving rectangle in a dynamic noise background. (b) Result of the 3D Sobel H-S optical flow method. (c) Results of the biological correlation detector.
3.3 Real World Scenes
With the same parameterization, the proposed approach is also tested on real world sequences. Both of the following video sequences are acquired from a low resolution image device, so that some noise and distortions are induced in
images. Fig. 5(a) illustrates an example of one person playing a football. In the short interval, the rightward moving player is changing his gesture to kick the football. The experiment results of the gradient method and the biological correlation detector are also given for comparison.
Fig. 5. Real world motion sequence example and corresponding experimental results. (a) The 84th frame of the football player sequence. (b) Result of the 84th frame with 3D Sobel H-S optical flow method. (c) Results of the 84th frame with the biological correlation detector.
Fig. 6(a) shows another scene, with several boatmen moving rightward. At the moment illustrated, an independently moving person comes into sight from the left. The corresponding experimental results are given in Fig. 6(b) and Fig. 6(c), respectively.
Fig. 6. Real world motion sequence example and corresponding experimental results. (a) The 90th frame of the boatmen sequence. (b) Result of the 90th frame with 3D Sobel H-S optical flow method. (c) Results of the 90th frame with biological correlation detector.
As shown above, the H-S method for motion detection may not be capable of determining the estimation precisely, especially when affected by noise, or other violations of the constancy assumption. However, the biological correlation model performs better. For the synthetic scenes, the results of the proposed method are hardly affected by the temporal noise. For real world scenes, the proposed implementation gives a much more coherent estimation, even for the finer structure.
4 Conclusion and Discussion
The experimental results suggest that when there is noise, especially dynamic noise, the performance of methods based on the brightness constancy assumption deteriorates. The 3D Sobel operator in the gradient method employs only three successive frames, i.e. a small temporal neighborhood; this is useful for reducing spatial noise to some degree, but it still suffers from temporal noise. For the noisy real world sequences, the distortion induced in low quality images means that the moving objects no longer obey the brightness constancy assumption strictly. Even though the H-S algorithm employs a smoothness constraint, the obtained 2D optical flows are ambiguous. In comparison with the gradient method, the biological correlation detector has better anti-noise capability and robustness. Firstly, the algorithm of the biological correlation model involves some form of noise reduction in each processing step. As mentioned in Section 3, both spatial and temporal filters represent the most general types of filters for removing noise from an image sequence; that is, spatial filters reduce static noise and temporal filters reduce dynamic noise. By taking advantage of the combination, spatiotemporal filters can provide higher noise suppression than 1D temporal or 2D spatial filters alone. Furthermore, the opponent subunits can also cancel independent noise. Thus this computational structure shows promising noise suppression [10]. Secondly, for the real world scenes, such a model can be applied efficiently for the estimation of certain parts of the motion field, even for the new person appearing from the left field of view, e.g. as shown in Fig. 6(c). Rather than computing spatial-temporal gradients as in the 3D Sobel method, the biological correlation model utilizes multiscale spatiotemporal filters to compute the average luminance in local regions. Substantially, this approach relaxes the constancy and smoothness requirements, so that more reliable motion boundaries can be preserved. In addition, an intrinsic problem for motion detection is the well-known aperture problem. One manifestation is that the detected motion is mostly perpendicular to the orientation of the moving contour. The current biological correlation detector provides local estimations of motion and hence gives only a partial solution to the aperture problem. We suggest that the local information should be integrated to produce a more robust estimation [11], which might be implemented in later cortical areas that perform spatial or temporal integration over larger receptive fields. Acknowledgments. This research was supported by the National Natural Science Foundation of China under Contract 60575017 and 60736010, and the Program for New Century Excellent Talents in University (NNCET-05-0641).
References 1. Barron, J.L., Fleet, D.J., Beauchemin, S.S.: Performance of Optical Flow Techniques. International Journal of Computer Vision 12, 43–77 (1994) 2. Reichardt, W.: Autocorrelation, a Principle for the Evaluation of Sensory Information by the Central Nervous System. Rosenblith, W.A., New York (1961)
3. Van Saten, J.P., Spelring, G.: Elaborated Reichardt Detectors. J. Opt. Soc. Am. A 2, 300–321 (1985) 4. Lopez, J., Markel, M., Siddiqi, N., Gebert, G.: Performance of Passive Ranging form Image Flow. In: IEEE 10th International Conference on Image Processing, pp. 929–932. IEEE Press, Barcelona (2003) 5. Sang, N., Xiao, L., Sun, B.: Computational Simulation of First Order Motion Perception. In: 12th Asia-Pacific Workshop on Visual Information Processing, Beijing, pp. 81–86 (2006) 6. Daugman, J.G.: Two-Dimensional Spectral Analysis of Cortical Receptive Field Profiles. Vision Research 20, 847–856 (1980) 7. Ibbotson, M.R., Clifford, C.W.G.: Characterising Temporal Delay Filters in Biological Motion Detectors. Vision Research 41, 2311–2323 (2001) 8. Hamilton, D.B., Albrecht, D.G., Geisler, W.S.: Visual Cortical Receptive Fields in Monkey and Cat: Spatial and Temporal Phase Transfer Function. Vision Research 10, 1285–1308 (1989) 9. Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M.J., Szeliski, R.: A Database and Evaluation Methodology for Optical Flow. In: IEEE 11th International Conference on Computer Vision, pp. 1–8. IEEE Press, Rio de Janeiro (2007) 10. Shi, L., Borst, A.: Propagation of Photon Noise and Information Transfer in Visual Motion Detection. J. Comput. Neurosci. 20, 167–178 (2006) 11. Bradley, D.C., Goyal, M.S.: Velocity Computation in the Primate Visual System. Nature Reviews Neuroscience 9, 686–695 (2008)
Research on a Novel Image Encryption Scheme Based on the Hybrid of Chaotic Maps Zhengqiang Guan1 , Jun Peng2, , and Shangzhu Jin2 1
Office of Academic Affairs, Chongqing University of Science and Technology, Chongqing 401331, China
[email protected] 2 School of Electronic Information Engineering, Chongqing University of Science and Technology, Chongqing 401331, China
[email protected],
[email protected]
Abstract. This paper presents a novel image encryption scheme based on a hybrid of three order chaotic Chen system and piecewise linear chaotic map (PWLCM). The initial conditions and control parameters of chaotic systems are viewed as the secret key, and combined sequences from these two chaotic systems are employed to encrypt original image. The proposed scheme is described in detail, along with experimental example and security analyses including sensitivity analysis, analysis of correlation of adjacent pixels and information entropy analysis. The results show that the suggested scheme has some desirable properties in a good security cryptosystem. Keywords: Image encryption, chaotic chen system, PWLCM, information security.
1 Introduction
As we know, chaos has many significant properties such as ergodicity, unpredictability, and sensitivity to the initial condition and control parameter. The close relationship has been observed in [1][2][3][4] between chaotic maps and cryptographic algorithms. In particular, the following connections would be found between them: (1) Ergodicity in chaos vs. confusion in cryptography; (2) Sensitive dependence on initial conditions and control parameters of chaotic maps vs. diffusion property of a good cryptosystem for a slight change in the plaintext and in the secret key; (3) Random-like behavior of deterministic chaotic-dynamics which can be used for generating pseudorandom sequences as key sequences in cryptography. Hence, chaos has been widely used to design the cryptosystem. For example, the cryptosystems based on chaos synchronization [5][6], chaotic stream ciphers [1][2], image encryption schemes [7][8][9] using chaotic maps, and chaotic Hash function [10][11], etc. In this paper we mainly focus on the image encryption using chaos. A number of chaos based image encryption schemes have been reported in recent years,
Corresponding author.
and we will offer a brief overview here. Yen and Guo [12] proposed a chaotic key-based algorithm (CKBA) in which a binary sequence serving as a key is generated using a chaotic system, the image pixels are arranged according to the generated binary sequence, and the gray-scale values of the pixels are then XORed or XNORed bit-by-bit with one of two predetermined keys. However, the analysis conducted in [13] showed that the algorithm in [12] has some drawbacks: it is vulnerable to the chosen or known-plaintext attack using only one plain-image, and its security against brute-force attack is also questionable. Chen et al. [7] proposed a symmetric image encryption scheme in which a 3D cat map was exploited to shuffle the positions of image pixels and another chaotic map was used to confuse the relationship between the original image and its encrypted image. More recently, Gao et al. [14] exploited an image total shuffling matrix to shuffle the positions of image pixels and then used a hyper-chaotic Chen system to confuse the relationship between the original image and the encrypted image. In this paper, based on the knowledge above, we propose a novel image encryption scheme based on a hybrid of chaotic systems, i.e., the Chen chaotic system and a piecewise linear chaotic map. These two chaotic systems are not too complex to implement, but have sufficiently complex dynamics favorable to a cryptosystem. The rest of the paper is organized as follows. In Section 2, we first introduce the two chaotic systems used in the scheme, and then describe the procedure of image encryption in detail in Section 3. Furthermore, an experimental example is given in Section 4. To demonstrate that its performance can withstand the most common attacks, in Section 5 we analyze the proposed image encryption scheme in terms of sensitivity analysis, analysis of the correlation of adjacent pixels, and information entropy analysis. Finally, we conclude the paper in Section 6 with remarks on future work.
2 Selection of Chaotic Systems
Here we introduce the two chaotic systems. The first one is the Chen system, which is described as follows:

ẋ = a(y − x),
ẏ = (c − a)x − xz + cy,    (1)
ż = xy − bz.

where a, b, c are parameters; when a = 35, b = 3, c = 28, the system is chaotic. The chaotic attractors with initial conditions (0.1, 0.1, 0.1) are shown in Fig. 1. The research in [15][16][17] has shown that the Chen system can be easily implemented by circuits and has excellent dynamical properties of the kind required by cryptographic applications. For the second one, we employ a piecewise linear chaotic map (PWLCM). The research in [18] has shown that the PWLCM has many excellent dynamical properties, including ergodicity, random-like behavior, a large positive Lyapunov exponent, a uniform invariant density function and an exponential
Fig. 1. Chaotic attractor of the Chen system on (a) x − y plane, and (b) y − z plane, respectively
attenuation auto-correlation function. These properties are very useful for cryptographic applications. The PWLCM is defined as follows:

x_{n+1} = f(x_n, μ) = { x_n / μ,                 if x_n ∈ [0, μ)
                        (x_n − μ) / (0.5 − μ),    if x_n ∈ [μ, 0.5]    (2)
                        f(1 − x_n, μ),            if x_n ∈ (0.5, 1]

where x_n ∈ [0, 1] and μ is a control parameter. When μ ∈ (0, 0.5) this map is in a chaotic state. When using this map in practice we should avoid x_n = 0, x_n = 0.5 and x_n = 1, as all of them eventually lead to the same fixed point x_n = 0, because f(0, μ) = 0, f^3(0.5, μ) = 0 and f^2(1, μ) = 0 for any μ ∈ (0, 0.5).
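A direct transcription of Eq. (2) is straightforward; the small sketch below iterates the map and can be used to check the properties mentioned above (the function and variable names are, of course, not from the paper).

```python
def pwlcm(x, mu):
    """One iteration of the piecewise linear chaotic map of Eq. (2)."""
    if x <= 0.5:
        return x / mu if x < mu else (x - mu) / (0.5 - mu)
    return pwlcm(1.0 - x, mu)   # symmetric branch for x in (0.5, 1]

def pwlcm_orbit(x0, mu, n):
    """Iterate the map n times from x0; mu must lie in (0, 0.5), and x0 should
    avoid 0, 0.5 and 1 as noted in the text."""
    xs = [x0]
    for _ in range(n):
        xs.append(pwlcm(xs[-1], mu))
    return xs

# e.g. pwlcm_orbit(0.0371, 0.3, 10)
```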
3 Description of the Encryption Scheme
The proposed image encryption scheme is shown in Fig. 2, in which Chaotic system I denotes the Chen system and Chaotic system II denotes the PWLCM. These two chaotic systems output the sequences C_i^1 and C_i^2 at time i, respectively. The rule for generating the sequences is given as follows:

C_i^1 = floor( K (x_i^2 + y_i^2 + z_i^2)^(1/2) + λ ) mod 256    (3)

C_i^2 = floor( (K θ_i + λ)^2 ) mod 256    (4)

where x_i, y_i and z_i denote the three state values of the Chen system at time i, θ_i denotes the state value of the PWLCM at time i, and K and λ are control parameters. In Fig. 2, S_i = C_i^1 ⊕ C_i^2, and m_i and m̃_i denote the original image pixel and the encrypted image pixel, respectively. The function G can be used for both the encryption and the decryption process, and it is described as:
Encryption: if S_i is even, then m̃_i = C_i^1 ⊕ ((C_i^2 + m_i) mod 256); else m̃_i = C_i^2 ⊕ ((C_i^1 + m_i) mod 256).
Decryption: if S_i is even, then m_i = C_i^1 ⊕ ((m̃_i − C_i^2) mod 256); else m_i = C_i^2 ⊕ ((m̃_i − C_i^1) mod 256).
Fig. 2. Diagram of the proposed encryption scheme
The secret key of the scheme is a seven-tuple (x_0, y_0, z_0, θ_0, μ, K, λ), where θ_0 and μ represent the initial value and the control parameter of the PWLCM, respectively.
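Putting the pieces together, the following sketch generates the two keystreams of Eqs. (3)–(4) and applies the function G to a flattened image. The Chen trajectory is produced with a simple fourth-order Runge–Kutta integrator whose step size (and the exact discretization the authors use) is an assumption, as are all function names.

```python
import numpy as np

def pwlcm(x, mu):
    """Piecewise linear chaotic map of Eq. (2)."""
    if x <= 0.5:
        return x / mu if x < mu else (x - mu) / (0.5 - mu)
    return pwlcm(1.0 - x, mu)

def chen_rk4(state, a=35.0, b=3.0, c=28.0, dt=0.001):
    """One RK4 step of the Chen system, Eq. (1)."""
    def f(s):
        x, y, z = s
        return np.array([a * (y - x), (c - a) * x - x * z + c * y, x * y - b * z])
    k1 = f(state); k2 = f(state + dt / 2 * k1)
    k3 = f(state + dt / 2 * k2); k4 = f(state + dt * k3)
    return state + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def keystreams(key, n):
    """key = (x0, y0, z0, theta0, mu, K, lam); returns keystreams C1, C2 of length n."""
    x0, y0, z0, theta0, mu, K, lam = key
    s, theta = np.array([x0, y0, z0]), theta0
    C1 = np.empty(n, dtype=np.uint16); C2 = np.empty(n, dtype=np.uint16)
    for i in range(n):
        s = chen_rk4(s)
        theta = pwlcm(theta, mu)
        C1[i] = int(K * np.sqrt(np.sum(s ** 2)) + lam) % 256   # Eq. (3)
        C2[i] = int((K * theta + lam) ** 2) % 256               # Eq. (4)
    return C1, C2

def encrypt(pixels, key):
    """pixels: flattened uint8 image; returns cipher pixels via function G."""
    C1, C2 = keystreams(key, pixels.size)
    out = np.empty_like(pixels)
    for i, m in enumerate(pixels):
        if (C1[i] ^ C2[i]) % 2 == 0:                            # S_i even
            out[i] = C1[i] ^ ((int(C2[i]) + int(m)) % 256)
        else:
            out[i] = C2[i] ^ ((int(C1[i]) + int(m)) % 256)
    return out
```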
4 Experimental Example
Some experimental results are given in this section to show the efficiency of the proposed scheme. The original ‘Lena’ image with the size 512 × 512 and its corresponding histogram are shown in Fig. 3(a) and Fig. 3(b), respectively.
Fig. 3. Original image and its histogram. (a) and (b) are original image ‘Lena’ and its corresponding histogram, respectively.
Applying the proposed encryption algorithm to the original image with the secret key (0.2, 0.2, 0.1, 0.0256, 0.4, 500, 128), we obtain the encrypted image and its histogram, as shown in Fig. 4(a) and Fig. 4(b), respectively.
Fig. 4. Encrypted image and its histogram. (a) and (b) are encrypted image and its corresponding histogram, respectively.
5 Security Analysis
From the cryptographic point of view, an effective encryption algorithm should have the properties needed to withstand most kinds of known attacks. In this section, we present a security analysis of the proposed algorithm in terms of sensitivity analysis, analysis of the correlation of adjacent pixels, and information entropy analysis.
5.1 Sensitivity Analysis
Experiments on the sensitivity with respect to the key were performed. The same original image 'Lena' of size 512 × 512 is used, and two secret keys K1 and K2 with a slight difference are selected as follows: K1 = (0.1, 0.1, 0.1, 0.0371, 0.3, 500, 128), K2 = (0.1, 0.1001, 0.1, 0.0371, 0.3, 500, 128). The results are shown in Fig. 5. From the results we find that the encrypted images in Fig. 5(a) and Fig. 5(b) are completely different even though only a slightly different key is used.
5.2 Analysis of Correlation of Adjacent Pixels
In [19], Shannon suggested two basic techniques for guiding the design of practical cryptosystems: diffusion and confusion. These two properties can be demonstrated by a test on the correlations of adjacent pixels in the
Fig. 5. Sensitivity test results of the encryption scheme. (a) and (b) are obtained encrypted images using K1 and K2 , respectively.
encrypted image [7]. To examine the correlation between two adjacent pixels (in the horizontal, vertical, and diagonal directions), we randomly select 2000 pairs of pixels from the original image and the encrypted image respectively, and compute the corresponding correlation coefficients using the following formula:

r = Σ_{m,n} (A_{mn} − Ā)(B_{mn} − B̄) / sqrt( Σ_{m,n} (A_{mn} − Ā)^2 · Σ_{m,n} (B_{mn} − B̄)^2 )    (5)

where Ā and B̄ represent the corresponding mean values. Here r represents the normalized correlation between images A_{mn} and B_{mn} pixel by pixel. The correlation coefficients (in the horizontal, vertical, and diagonal directions) of the original 'Lena' image are 0.9812, 0.9705, and 0.9523, respectively. However, for the encrypted one, they are 0.0012, 0.0017, and -0.0106, respectively. We find that the correlation coefficients of the encrypted images are very small, indicating that an attacker cannot obtain any valuable information by exploiting a statistical attack.
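A sketch of this test is shown below: adjacent-pixel pairs are sampled in a given direction and Eq. (5) is evaluated on them (np.corrcoef computes exactly this normalized correlation); the sampling details and names are assumptions.

```python
import numpy as np

def adjacent_correlation(img, n_pairs=2000, direction="horizontal", rng=None):
    """Correlation coefficient of Eq. (5) over randomly chosen adjacent pixel pairs."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape
    dy, dx = {"horizontal": (0, 1), "vertical": (1, 0), "diagonal": (1, 1)}[direction]
    ys = rng.integers(0, h - dy, n_pairs)
    xs = rng.integers(0, w - dx, n_pairs)
    a = img[ys, xs].astype(float)
    b = img[ys + dy, xs + dx].astype(float)
    return np.corrcoef(a, b)[0, 1]
```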
5.3 Information Entropy
It is well known that the information entropy H(m) of a plaintext message m can be calculated as

H(m) = − Σ_{i=1}^{n} p(m_i) log_2 p(m_i)    (6)
where p(m_i) represents the probability mass function of message m_i and n = 256 for an image. For a 256-gray-scale image, if every gray value has an equal probability, then the information entropy equals the theoretical value 8, indicating that the image
is a purely random one. When the information entropy of an image is less than 8, there exists a certain degree of predictability, which threatens its security. Therefore, we want the entropy of the encrypted image to be close to the ideal value in order to resist the entropy attack effectively. The information entropy of the original image and of its corresponding encrypted image are computed as 7.4455 and 7.9996, respectively. From the results we find that the entropy of the original image is smaller than the ideal one, because practical information sources seldom generate random messages. However, the entropy of the encrypted image is very close to the theoretical value, showing that the suggested encryption algorithm is sufficiently secure against the entropy attack.
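The entropy of Eq. (6) can be estimated directly from the 256-bin gray-level histogram; a brief sketch (variable names assumed):

```python
import numpy as np

def image_entropy(img):
    """Shannon entropy of Eq. (6) estimated from the gray-level histogram."""
    hist = np.bincount(img.ravel().astype(np.uint8), minlength=256)
    p = hist / hist.sum()
    p = p[p > 0]                      # empty bins contribute 0 * log 0 = 0
    return float(-(p * np.log2(p)).sum())
```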
6 Conclusion and Discussion
In this paper, a novel image encryption scheme based on a hybrid of the third-order chaotic Chen system and a PWLCM has been proposed. The initial conditions and control parameters of the chaotic systems are viewed as the secret key, and combined sequences from these two chaotic systems are employed to encrypt the original image. Furthermore, an experimental example and security analyses, including sensitivity analysis, analysis of the correlation of adjacent pixels and information entropy analysis, are conducted. The results of the security analyses indicate that the encryption scheme has desirable properties from a cryptographic point of view. Possible future work can be pursued in the following directions: extending the proposed algorithm to handle color image encryption and examining its security as well as its speed performance. Meanwhile, other measures are still worth further research to improve the proposed scheme's performance. For example, a chaotic cat map can be applied to the original image before the encryption in order to obtain a better confusion.
Acknowledgment The authors would like to thank the anonymous reviewers for their valuable suggestions. The work described in this paper was partially supported by the Natural Science Foundation Project of CQ CSTC under Grant No. 2008BB2360 and the First Batch of Supporting Program for University Excellent Talents in Chongqing.
References 1. Kocarev, L.: Chaos-based cryptography: A brief overview. IEEE Circ. Syst. Maga. 1, 6–21 (2001) 2. Fridrich, J.: Symmetric ciphers based on two-dimensional chaotic maps. Int. J. Bifu. Chaos 8, 1259–1284 (1998) 3. Fridrich, J.: Image encryption based on chaotic maps. In: IEEE Int. Conf. Syst. Man Cybe., Orlando, FL, USA (1997)
4. Alvarez, G., Li, S.: Some basic cryptographic requirements for chaos-based cryptosystems. Int. J. Bifu. Chaos 16, 2129–2151 (2006) 5. Yang, T., Wu, C., Chua, L.: Cryptography based on chaotic systems. IEEE Tran. CAS-I 44, 469–472 (1997) 6. Grzybowski, J., Rafikov, M., Balthaza, J.: Synchronization of the unified chaotic system and application in secure communication. Comm. Nonl. Scie. Nume. Simu. 14, 2793–2806 (2009) 7. Chen, G., Mao, Y., Chui, C.: A symmetric image encryption based on 3D chaotic maps. Chaos Soli. Frac. 21, 749–761 (2004) 8. Tong, X., Cui, M.: Image encryption scheme based on 3D baker with dynamical compound chaotic sequence cipher generator. Sign. Proc. 89, 480–491 (2009) 9. Peng, J., Zhang, D., Liao, X.: A digital image encryption algorithm based on hyperchaotic cellular neural network. Fund. Info. 90, 269–282 (2009) 10. Zhang, J., Wang, X., Zhang, W.: Chaotic keyed hash function based on feedforward - feedback nonlinear digital filter. Phys. Lett. A 362, 439–448 (2007) 11. Peng, J., Zhang, D., Liu, Y., Liao, X.: A double-piped iterated hash function based on a hybrid of chaotic maps. In: 7th IEEE Int. Conf. Cogn. Info., pp. 358–365 (2008) 12. Yen, J., Guo, J.: A new chaotic key-based design for image encryption and decryption. In: Proc. IEEE Int. Conf. Circ. Syst., vol. 4, pp. 49–52 (2000) 13. Li, S., Zheng, X.: Cryptanalysis of a chaotic image encryption method. In: Proc. IEEE Int. Symp. Circ. Syst., vol. 2, pp. 26–29 (2002) 14. Li, P., Li, Z., Halanga, W., Chen, G.: A multiple pseudorandom-bit generator based on a spatiotemporal chaotic map. Phys. Lett. A 349, 467–473 (2006) 15. Chen, G., Dong, X.: From chaos to order methodologies, perspectives, and applications. World Scientific, Singapore (1998) 16. Ueta, T., Chen, G.: Bifurcation analysis of Chen’S equation. Int. J. Bifu. Chaos 10, 1917–1931 (2000) 17. Yassen, M.: Chaos control of Chen chaotic dynamical system. Chaos Soli. Frac. 15, 271–228 (2003) ´ 18. Li, S., Alvarez, G., Chen, G.: Breaking a chaos-based secure communication scheme designed by an improved modulation method. Chaos Soli. Frac. 25, 109–120 (2005) 19. Shannon, C.: Communication theory of secrecy system. Bell Syst. Tech. J. 28, 656–715 (1949)
Computational and Neural Mechanisms for Visual Suppression Charles Q. Wu Stanford Continuing Studies, Stanford University, Stanford, CA 94305, U.S.A.
[email protected]
Abstract. I investigate the computational and neural mechanisms for suppressing the retinal vascular image (RVI) and attempt to generalize some conclusions to other visual suppression phenomena. First I present a new observation demonstrating RVI in negative afterimages. Then I discuss RVI suppression from a computational perspective and suggest: (1) RVI is always there in the retinal stimulation; (2) RVI must be actively suppressed on a moment-by-moment basis; and (3) in order to suppress RVI, there must exist an internal representation of RVI at a monocular stage. Mapping onto the organization of the primate visual system, particularly based on Adams and Horton's neuroanatomical demonstration of a complete representation of retinal blood vessels in layer 4C of V1, I propose that layer 4C is in fact the neural site for RVI suppression. Finally, I suggest that layer 4C is the neural substrate for phenomenal visual consciousness (particularly, color and brightness consciousness). Keywords: Retinal Vascular Image, Primary Visual Cortex, Layer 4C, Visual Suppression, Consciousness.
1 Introduction
In the human retina, as shown in Figure 1, photoreceptors are "incorrectly" placed underneath the retinal blood vessels (here, "incorrectly" is used relatively, in comparison with such animals as Cephalopods, in whose retinae photoreceptors are indeed "correctly" located above the retinal blood vessels; see Gehring, 2005). Anatomically, the diameter of thick retinal blood vessels near the optic disk (see Figure 2) is around 150 µm, while the diameter of a cone photoreceptor is only about 4 µm. In other words, a thick retinal blood vessel casts a shadow on a multitude of photoreceptors underlying it, and such a shadow could cover almost 40 cones in width. Given these anatomical facts, one would certainly expect retinal blood vessels to contaminate visual input information whenever an external visual image is impressed on the retina and then sent to the brain. Why, then, do we not see any traces of our retinal blood vessels in our normal vision? Throughout the present paper, I will use the term "Retinal Vascular Image" (RVI) to refer to the image (or pattern) of the retinal blood vessels. RVI has also been known as the retinal circulation pattern, the angioscotomic image, and the Purkinje tree (as J. E. Purkinje was the first to describe it in a detailed way, though he was certainly not the
first to have observed it; see Wade, 2000, pp. 186-187). Although some early authors had also used the term "entoptic image" exclusively for RVI, nowadays RVI is normally classified as just one type of entoptic image (Tyler, 1978). As Tyler and several other investigators have remarked, studying entoptic images (such as RVI, phosphenes due to various causes, and hallucination patterns) is certainly useful – not only for clinical purposes but also for gaining knowledge about some underlying neural mechanisms of human visual perception. Although we do not see our own retinal blood vessels in normal vision, under some special circumstances it is indeed possible to see them in one's own eyes. Based on the observations by Purkinje and several others, Hermann von Helmholtz described a procedure for seeing one's own RVI: in a dimly-lit environment, with one eye closed and the other open, and by moving a point-source blue light around the sclera of the open eye in a circular motion, one should be able to catch some glimpses of one's own RVI (see Helmholtz, 1962). This procedure was later developed into the earliest form of ophthalmoscope, for clinicians to examine other people's eyes and retinae. Figure 2 shows the image of a human left eye retina, with its main blood vessels clearly visible, as seen through an ophthalmoscope by an examiner. In this paper I will first present an observation demonstrating that one is able to see one's own RVI in negative afterimages – this observation implies that RVI suppression must be active and be achieved on a moment-by-moment basis. Then I will put forward a computational-and-anatomical analysis to show that the neural site for RVI suppression is most likely monocular. Mapping onto the organization of the human visual system, particularly based on Adams and Horton's (2009) elegant demonstration of a complete anatomical map of retinal blood vessels in layer 4C of the primary visual cortex (V1), I propose that layer 4C, which is the sole bi-monocular stage in the whole visual cortex, is indeed the neural site for RVI suppression.
Fig. 1. A section through the human eye with a schematic enlargement of the retina
Fig. 2. A human retina (from the right eye) as seen through an ophthalmoscope by an examiner
2 Seeing RVI in “Positive” and in “Negative” As mentioned above, under special conditions it is possible to see one's own RVI – normally, as illustrated in Figure 3, this RVI is a “positive” image with the retinal blood vessels appearing in dark on a lighter background. While attempting to experience various entoptic images as described by Tyler (1978), I observed that it is possible to see one's own RVI in “negative” – that is, in negative afterimages. Here is the procedure for making such an observation: If one blindfolds his/her left eye and adapts him/herself for 10 minutes in a room dimly and uniformly lit with a blue light source, and then moves his/her right hand to blindfold the right eye, one will observe a RVI of the right eye in a negative afterimage (as shown in Figure 4) just at the moment when the hand touches and closes the right eye.
Fig. 3. Seeing one's own RVI in “positive”
Fig. 4. Seeing one's own RVI in “negative”
As far as I am able to know from the relevant literature, the above procedure for evoking RVI in negative afterimages has not been reported previously. On the other hand, the related phenomenon that one can experience a negative afterimage without consciously seeing the primary image at all had been observed by Bidwell more than a century ago (Bidwell, 1896, 1897, 1901). The phenomenon of regular negative (or complementary) afterimages is well known: If one stares a red colored patch for half a minute or so and then looks at a white background, one will see the complementary color of red, which is cyan (or “greenish-blue” as it is more popularly known). Using a rotating disc, Bidwell demonstrated that by masking (or shutting off) a brief visual stimulus (e.g., a red colored patch) with dark background it is possible to see the negative afterimage (which is cyan) without seeing the primary stimulus (which is red) at all. This phenomenon is very intriguing, very robust, and had been demonstrated by Sperling (1960) in tachistoscopic experimental settings. (What is the neural mechanism underlying Bidwell afterimages? And is it the same mechanism as that underlying regular complementary afterimages? Currently, these issues remain elusive to vision researchers. Although these issues are very intriguing, they are beyond the scope of the present paper. There has not been much research on Bidwell afterimages since Sperling's work, but I think that this phenomenon is worthy to be further studied.) There is a basic commonality between the procedure for showing Bidwell afterimages and that for eliciting RVI in negative afterimages – that is, the necessity of a dark visual field masking. This similarity between the two situations indicates that the afterimage for framing a negative RVI is a Bidwell afterimage. The difference between the two situations is only that in Bidwell's and Sperling's demonstrations the image is from the external world, while in our situation concerning RVI the image is intrinsic in the retina. As Robinson (1972) remarked and as we can see from the procedure for eliciting Bidwell afterimages, the Bidwell afterimage phenomenon is doubtlessly a type of visual masking; hence, the fact that RVI could appear in Bidwell afterimages immediately suggests that RVI is actively masked (or suppressed) on a moment-bymoment basis – it is the complete or a partial failure of this RVI suppression mechanism that makes RVI showing up in Bidwell afterimages.
Using Helmholtz's procedure for eliciting RVI in one's own vision, Coppola and Purves (1996) experimentally demonstrated that the RVI disappears very rapidly after its elicitation: it fades within ~80 ms after appearing. Based on such experimental results, these investigators have already suggested that an “active erasure” mechanism must be in place in our visual system for RVI suppression. The observation presented here that one is able to see one's own RVI in negative afterimages complements Coppola and Purves' experimental results, and strengthens the proposition that in our visual system there must be an active erasing/suppressing mechanism for removing RVI from visual input images on a moment-by-moment basis.
3 How Not to See Retinal Vascular Image
Now we can state the problem raised at the very beginning of the present paper (that is, “why do we not see RVI in our normal viewing?”) in a more specific form: what is the neural mechanism for actively suppressing RVI in our visual system? This question in turn has two parts: anatomically (or structurally), what is the neural substrate where RVI suppression could happen? Physiologically (or functionally), what is the dynamics of RVI suppression? In the present paper, we will primarily concern ourselves with the neural substrate issue. But before moving on to this issue, I will present a computational analysis of what kinds of computational strategies may be utilized by our visual brain for actively removing RVI on a moment-by-moment basis. Faced with a certain computational task in vision, we can follow Marr's (1982) computational approach to vision and ask the question from a designer's perspective: what computational strategies or algorithms may one use to accomplish such a task? Here I use the term “computational strategy” to refer to the kind of “pseudo-algorithm” used in computer science – that is, an algorithm without detailed descriptions of its components. Thinking in this way about RVI suppression, we have now turned the original question of “why not seeing RVI” into the question of “how not to see RVI”. There are two, and only two, possible computational strategies for removing RVI: one is to use an internal representation (or map) of RVI for removing it, and the other is to extract RVI directly from input images without utilizing any internal representation of RVI. Conceptually, as illustrated in Figure 5, the first computational strategy for removing RVI is actually very straightforward. To conceive it in a concrete manner, we can make an analogy between removing RVI in vision and erasing a watermark in image processing. The task of removing a fixed watermark from multiple input images would be greatly simplified when we have the image of the watermark itself: it is simply an image subtraction operation – that is, subtracting the watermark image from each and every one of the input images. On the other hand, as we all know, if we do not have the watermark image itself (e.g., the watermark was added by others), the task of removing such a watermark from multiple images becomes very difficult, or nearly impossible if perfect results are demanded. As to the second computational strategy for removing RVI, due to the possibility that the external visual image could be exactly the same as the internal RVI, there is simply no way to decide whether to keep or remove the image when this possibility actually happens. Again, we can make an analogy to the watermark situation: Given
an input image exactly the same as the watermark, we would simply have no way to know whether it is a blank image with the watermark on it or a non-blank image with the watermark as the real information to be conveyed. (For this special condition, the first computational strategy will still work at the level of gray values.) Of course, in reality, the possibility that the whole external image coincides precisely with RVI may never occur; but there are definitely circumstances where a small part of the external image coincides with a small part of RVI. This mere possibility makes the second computational strategy for removing RVI very implausible. In short, the above computational analysis indicates that our visual system may utilize a representation (or map) of RVI for the purpose of actively removing RVI from the whole retinal image (that is, the image captured by all photoreceptors – including those beneath the retinal blood vessels) on a real-time basis.
Fig. 5. Using an internal representation of RVI for removing RVI from retinal stimulation
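To make the watermark analogy concrete, the following minimal sketch is our own illustration rather than anything from the original study; the array names and the simple neighbour-averaging fill are assumptions. It removes a known, fixed “watermark” from an input image by plain subtraction and optionally fills the occluded pixels from their un-occluded surroundings:

```python
import numpy as np

def remove_known_watermark(img, watermark, mask, fill=True):
    """Illustrative only: subtract a known, fixed watermark from an image.

    img, watermark : 2-D float arrays of the same shape (gray levels).
    mask           : boolean array, True where the watermark occludes the scene.
    If fill=True, occluded pixels are additionally replaced by the mean of
    their un-occluded 4-neighbours (a crude stand-in for "filling-in").
    """
    cleaned = img - watermark          # straightforward image subtraction
    if fill:
        filled = cleaned.copy()
        rows, cols = np.where(mask)
        for r, c in zip(rows, cols):
            neigh = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < img.shape[0] and 0 <= cc < img.shape[1] and not mask[rr, cc]:
                    neigh.append(cleaned[rr, cc])
            if neigh:
                filled[r, c] = np.mean(neigh)
        cleaned = filled
    return cleaned
```

The point of the sketch is only that the operation is trivial once the watermark image itself is available, exactly as argued above for the first strategy.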
4 Neural Substrate for RVI Suppression
Now we turn to the question concerning the neural substrate for RVI suppression in our visual system. From Figure 2, we know that RVI is asymmetrical with respect to the left eye versus the right eye – that is, it is different in each of the two eyes. This fact suggests that RVI suppression is most likely to be found at a monocular stage. This is because at any possible binocular stage, an overlaid (or superimposed) image of the two RVIs from the two eyes would be different from one moment to the next, thanks to various eye movements, as our eyes are almost constantly in motion; therefore, removing such a superimposed binocular image would be much more difficult than removing each eye's RVI separately at a monocular stage. Figure 6 depicts the first few stages of the primate (including human) visual system: when a visual stimulus falls upon the retina of the eye, the photoreceptors there catch photons and convert them into electrical signals; the neural circuitry in the retina then converges these signals and, through the LGN (lateral geniculate nucleus) of the thalamus, conveys them onto layer 4C in the primary visual cortex or visual cortical area V1 (which is also known as striate cortex or Brodmann's area 17 of the human brain).
Fig. 6. Anatomical organization of the early stages of the primate (including human) visual system
An important neuroanatomical, as well as neurophysiological, feature concerning the early parts of the primate visual system is that layer 4C is a cortical stage where cells are predominantly monocular (Hubel, 1988). As illustrated in Figure 6, only the retina, the LGN, and layer 4C are monocular neural stages; beyond layer 4C, visual neurons are predominantly binocular. Among the three monocular neural sites, only layer 4C is cortical. Furthermore, unlike in one retina or in one layer of the LGN where all the neurons are completely monocular (that is, receiving their inputs solely from one eye), in layer 4C those monocular neurons receiving inputs from the left eye and those receiving inputs from the right eye co-exist in the same place, in the so-called “ocular dominance columns” fashion (for such columns in monkeys, see Hubel, 1988; in humans, see Horton, 2006). Because of this co-existence of monocular neurons from the two eyes in layer 4C, I suggest that this layer should be more appropriately referred to as a “bi-monocular” layer. The aforementioned monocular feature of layer 4C strongly indicates that this layer is very likely to be the neural site for RVI suppression. As a matter of fact, very recently, the existence of a complete representation of retinal blood vessels in layer 4C has been beautifully demonstrated by the neuroanatomist Jonathan Horton and his coworkers using the histochemical method of CO (cytochrome oxidase) staining (see Adams & Horton, 2009; specifically, Figure 12 of their paper shows one such neuroanatomical map). Based on their neuroanatomical results, I suggest that layer 4C in V1 is indeed the neural site for RVI suppression. One may argue that what was demonstrated by Adams and Horton is a map of “void (or lack) of neurons” in layer 4C corresponding to angioscotoma, rather than a map of neurons whose activities correspond to RVI. As a matter of fact, RVI actually creates two problems: the necessity of RVI suppression and the necessity of filling the suppressed RVI areas with surrounding signals. More appropriately, we can say that
they are two sides of one and the same problem; that is, by solving one of them the other is also solved. If we conceptualize the RVI suppression problem in this manner, a representation of “void of neurons” is precisely what is needed for both RVI suppression and filling-in of the suppressed RVI regions – this is because the neurons surrounding the “void of neurons” representation can utilize such a representation to accomplish filling-in. As a matter of fact, here we can link RVI suppression to blind-spot filling-in. As can be seen in Figure 2, anatomically, the fundus (namely, the root) of all the blood vessels in a retina is at the center of the optic disk, where all the nerve fibers from the retina converge before going on their way to the brain. Perceptually, the optic disk corresponds to the blind-spot of the corresponding eye. Although we are normally not aware of the two blind-spots in our two eyes at all, they can indeed be mapped out perceptually (Ramachandran, 1992). The computational task concerning RVI is exactly the same as that for blind-spots – in a certain sense, the blind-spot of an eye is just a part of that eye's RVI (see Figures 2, 3, and 4 for the connection between retinal blood vessels and the blind-spot). In vision research, investigators normally refer to the problem concerning blind-spots as a filling-in problem (Ramachandran, 1992), while referring to the problem concerning RVI as a suppression problem. Nonetheless, in reality, both blind-spots and RVIs must have both the suppression and the filling-in aspects. A full discussion of the issue of blind-spot filling-in is beyond the scope of the present paper, but neuroanatomically Adams and Horton's data clearly show that the blind-spot is represented in layer 4C (as shown in Figure 12 in Adams and Horton, 2009); and the neurophysiological results obtained by Komatsu, Kinoshita, and Murakami (2000) also clearly indicate that blind-spot filling-in most probably happens in layer 4C of the visual cortical area V1.
5 Other Visual Suppression Phenomena
In a review article on various visual suppression phenomena such as saccadic suppression, visual masking, suppression of entoptic images, and monocular and binocular rivalry, Martin (1974) noted that there may be some common mechanism(s) shared by these visual suppression phenomena. Here I will discuss the possibility that monocular visual masking and the monocular aspect of binocular rivalry may also occur in layer 4C of V1. As mentioned above, the Bidwell afterimage phenomenon is certainly a special type of visual masking – namely, dark-field masking. Recent research indicates that visual masking may happen at both monocular and binocular stages (Macknik and Martinez-Conde, 2004). There is certainly a possibility that monocular visual masking is just a generalized form of the dark-field masking exhibited in Bidwell afterimages, and that such monocular visual masking therefore happens in layer 4C of V1. Of course, this possibility remains to be experimentally explored. Another visual suppression mechanism that may also occur in layer 4C is the monocular part of the binocular rivalry phenomenon. When the two eyes of an observer are presented with different stimuli (e.g., red to the left eye and green to the right eye), the observer may experience a continuous alternation between the two stimuli
every few seconds (that is, seeing red, green, red, green, and so on) – instead of seeing the combination of the two stimuli (which would be yellow in this case, since directly mixing red and green produces yellow). This phenomenon is known as binocular rivalry and has now become an important tool for studying visual consciousness (Tong, Meng, and Blake, 2006). Over the past 15 years, converging results from brain imaging, psychophysical, and neurophysiological studies on binocular rivalry in both humans and animals indicate that this phenomenon happens at multiple levels in the primate visual system and that the earliest stage of binocular rivalry is monocular (see Tong, Meng, and Blake, 2006). There is now also evidence indicating that binocular rivalry shares some common mechanism with visual masking (van Boxtel, van Ee, and Erkelens, 2007). Therefore, it is reasonable to suggest that the monocular stage of binocular rivalry is in layer 4C of V1. As I have previously suggested (Wu, 2009), human color (including brightness) consciousness is essentially monocular; and therefore, layer 4C of V1 must be the neural substrate for this primary aspect of our visual consciousness.
6 Conclusions
To summarize, empirically, I have demonstrated that it is possible to see one's own RVI in negative afterimages. Computationally, I have suggested that visual input information from our eye to our brain is always tainted with RVI at any moment, that the human visual system must employ an active RVI suppression mechanism on a moment-by-moment basis, and that such an RVI suppression mechanism entails a representation of RVI within our visual brain. Mapping onto the neuroanatomical organization of the human visual system, and particularly based on Adams and Horton's recent neuroanatomical demonstration of a complete representation of retinal blood vessels in layer 4C of V1, I have suggested that this layer is indeed the neural substrate for active RVI suppression. Generalizing to some other visual suppression phenomena, I have further suggested that monocular visual masking and the monocular suppression part of binocular rivalry may share some common neural mechanism with RVI suppression.
Acknowledgments. I am very grateful to Prof. John R. Anderson and the late Prof. Herbert A. Simon for teaching me scientific thinking when I was a graduate student in the Department of Psychology at Carnegie Mellon University. I also wish to express my gratitude to Prof. Jennifer S. Lund for teaching me the relevant neuroanatomy when I was a postdoctoral fellow in her laboratory in the Institute of Ophthalmology, University College London, U.K.
References 1. Adams, D.L., Horton, J.C.: Ocular dominance columns: enigmas and challenges. Neuroscientist, 15, 62–77 (2009) 2. Bidwell, S.: On subjective colour phenomena attending sudden changes of illumination. Proceedings of the Royal Society of London, 60, 368–377 (1896) 3. Bidwell, S.: On the negative after-images following brief retinal excitation. Proceedings of the Royal Society of London, 61, 268–271 (1897)
4. Bidwell, S.: On negative after-images, and their relation to certain other visual phenomena. Proceedings of the Royal Society of London, 65, 262–285 (1901) 5. Coppola, D., Purves, D.: The extraordinarily rapid disappearance of entoptic images. PNAS, 93, 8001–8004 (1996) 6. Gehring, W.J.: New perspectives on eye development and the evolution of eyes and photoreceptors. Journal of Heredity, 96, 171–184 (2005) 7. Horton, J.C.: Ocular integration in the human visual cortex. Canadian Journal of Ophthalmology, 41, 584–593 (2006) 8. Hubel, D.H.: Eye, Brain, and Vision. W.H. Freeman, New York (1988) 9. Marr, D.: Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W.H. Freeman, New York (1982) 10. Tong, F., Meng, M., Blake, R.: Neural bases of binocular rivalry. Trends in Cognitive Sciences, 11, 502–511 (2006) 11. Helmholtz, H.L.F.: Treatise on Physiological Optics (translated by Southall, J.P.C.). The Optical Society of America, New York (1925) 12. Komatsu, H., Kinoshita, M., Murakami, I.: Neural responses in the retinotopic representation of the blind spot in the macaque V1 to stimuli for perceptual filling-in. Journal of Neuroscience, 20, 9310–9319 (2000) 13. Macknik, S.L., Martinez-Conde, S.: Dichoptic visual masking reveals that early binocular neurons exhibit weak interocular suppression. Journal of Cognitive Neuroscience, 16, 1049–1059 (2004) 14. Martin, E.: Visual suppression: a review and an analysis. Psychological Bulletin, 81, 899–917 (1974) 15. Ramachandran, V.S.: Blind spots. Scientific American, 266, 86–91 (1992) 16. Robinson, J.O.: The Psychology of Visual Illusion. Hutchinson, London (1972) 17. Sperling, G.: Negative afterimage without prior positive image. Science, 131, 1613–1614 (1960) 18. Tyler, C.: Some new entoptic phenomena. Vision Res., 18, 1633–1639 (1978) 19. van Boxtel, J.J., van Ee, R., Erkelens, C.J.: Dichoptic masking and binocular rivalry share common perceptual dynamics. Journal of Vision, 21, 1–11 (2007) 20. Wade, N.J.: A Natural History of Vision. The MIT Press, Cambridge, MA (2000) 21. Wu, C.Q.: A multi-stage neural network model for human color vision. In: Yu, W., He, H., Zhang, N. (eds.): ISNN 2009. LNCS, vol. 5553, pp. 502–511. Springer, Heidelberg (2009)
Visual Selection and Attention Shifting Based on FitzHugh-Nagumo Equations
Haili Wang1, Yuanhua Qiao1, Lijuan Duan2, Faming Fang2, Jun Miao3, and Bingpeng Ma3
1 College of Applied Science, Beijing University of Technology, Beijing 100124, China
2 College of Computer Science and Technology, Beijing University of Technology, Beijing 100124, China
3 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
{qiaoyuanhua,ljduan}@bjut.edu.cn, {wanghaili,fmfang}@emails.bjut.edu.cn, {bpma,jmiao}@ict.ac.cn
Abstract. In this paper, we analyze the FitzHugh-Nagumo model and improve it to build a neural network, which is used to implement visual selection and attention shifting. Each group of neurons representing one object of a visual input is synchronized; different groups of neurons representing different objects of a visual input are desynchronized. A cooperation and competition mechanism is also introduced to accelerate the oscillating frequency of the salient object and to slow down the other objects, so that the most salient object jumps to a high-frequency oscillation while all other objects remain silent. The object corresponding to the high-frequency oscillation is selected; the selected object is then inhibited, and the other neurons continue to oscillate to select the next salient object. Keywords: F-N model, Neural Network, Visual Selection, Attention Shifting.
1 Introduction
Due to the limited processing capacity of biological systems, some mechanisms have evolved to permit these systems to perform their tasks. Visual selection and attention shifting are important mechanisms for ensuring that this limited processing capacity is used to perform tasks as well as possible. The ability to extract salient features from images or receptive fields, group them into objects, and then select the salient object is a fundamental task of perception; this ability is the visual selection of visual perception. When the salient object has been selected, attention will, because of the adaptability of the visual system, shift from that object to the next salient object, which is the attention shifting of visual perception. The application of neural network models based on synchronous oscillation to image processing is becoming more and more widespread [1], [2], [3], and pulse coupled neural dynamical models have obtained many good results in image segmentation. But their application to visual selection and attention shifting remains an important issue for current research and applications [4], [5].
Techniques for identifying multiple objects have seen some development. For example, in 1997, Dietmar Heinke et al. proposed the SAIM model [6]. This model took into account the regulating effect of higher-level knowledge networks on visual processing, and used this regulating effect to perform the visual selection task. In 2007, Dietmar Heinke et al. improved the original SAIM model [7] and completed visual selection for color images. In 2006, Itti et al. combined top-down regulation and a bottom-up selection mechanism to perform the visual search task on natural images [8]. They used an extracting network to select features at different scales, while the top-down regulation mainly searched for the characteristics of the object and the background. In recent years, neural network models based on synchronous oscillation have been continually developed, and significant progress has been made in image segmentation. In 2007, Liang Zhao, Fabricio A. Breve et al. improved the Wilson-Cowan network on the basis of earlier work [9]. They first obtained very good results on the visual selection task, and then turned its application to visual selection and attention shifting in [10], interpreting visual selection and attention shifting in terms of the dynamical mechanism. In 2009, M.G. Quiles, D.L. Wang and L. Zhao presented a neurocomputational model of object-based selection in the framework of oscillatory correlation [11]. This object selection system has been applied to natural images. In this paper, we construct a new visual selection and attention shifting model and mainly perform the visual selection and attention shifting task when the input is a gray image. Our model is based on the F-N model [1]. The rest of the paper is organized as follows. In Section 2, we introduce the F-N model and our model. In Section 3, the simulation experiments of the proposed model are given. In Section 4, we give the conclusions.
2 Model Description
2.1 Further Study about the F-N Equations
The FitzHugh-Nagumo equations describe the interaction between the voltage V across the axon membrane, which is driven by an input current I, and a recovery variable R. The FitzHugh-Nagumo equations are given as follows:
$$\begin{cases} \dfrac{dV}{dt} = 10\Big(V - \dfrac{V^3}{3} - \alpha R + I\Big) \\[4pt] \dfrac{dR}{dt} = 0.8(-R + \beta V + 1.5) \end{cases} \qquad (1)$$
Here 10 and 0.8 are the inverses of the time constants for V and R, and α > 0, β > 0 describe the action strength of R on V and of V on R, respectively. The dynamics of V are 12.5 times faster than those of R, which reflects the fact that activation processes in the axon are much more rapid than the recovery processes.
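As a quick illustration of Eq. (1), our own sketch below (not part of the paper; the forward-Euler scheme, step size and duration are arbitrary choices) integrates the system numerically, so that one can see how the choice of α and β decides whether the neuron spikes:

```python
import numpy as np

def simulate_fn(alpha, beta, I=1.5, dt=1e-3, steps=200_000, v0=0.0, r0=0.0):
    """Forward-Euler integration of Eq. (1); returns the V(t) trace."""
    v, r = v0, r0
    trace = np.empty(steps)
    for k in range(steps):
        dv = 10.0 * (v - v**3 / 3.0 - alpha * r + I)
        dr = 0.8 * (-r + beta * v + 1.5)
        v += dt * dv
        r += dt * dr
        trace[k] = v
    return trace

# With alpha=1, beta=2 the trace oscillates (a stable limit cycle exists);
# with alpha=0.3, beta=0.6 it settles to a fixed point and no spikes occur.
spiking = simulate_fn(alpha=1.0, beta=2.0)
quiet = simulate_fn(alpha=0.3, beta=0.6)
```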
Let (V, R) be an equilibrium point. At the equilibrium the Jacobian is:
$$A = \begin{pmatrix} 10 - 10V^2 & -10\alpha \\ 0.8\beta & -0.8 \end{pmatrix} \qquad (2)$$
Then the characteristic equation is:
$$\lambda^2 + (10V^2 - 9.2)\lambda + 8(V^2 + \alpha\beta - 1) = 0 \qquad (3)$$
Let λ1 and λ2 be the characteristic roots of (2). Obviously we have the following conclusions:
(1) If λ1λ2 = 8(V² + αβ − 1) < 0, then the equilibrium point (V, R) is unstable;
(2) If λ1λ2 = 8(V² + αβ − 1) > 0 and λ1 + λ2 = −(10V² − 9.2) > 0, the equilibrium point (V, R) is unstable;
(3) If λ1λ2 = 8(V² + αβ − 1) > 0 and λ1 + λ2 = −(10V² − 9.2) < 0, the equilibrium point (V, R) is stable.
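These conditions can also be checked numerically from the Jacobian (2). The short sketch below is our own illustration (the function name is ours, and the equilibrium value V = 0 used in the example is our own calculation for α = 1, β = 2, I = 1.5):

```python
import numpy as np

def classify_equilibrium(v_bar, alpha, beta):
    """Evaluate the Jacobian (2) at an equilibrium and apply conclusions (1)-(3)."""
    A = np.array([[10.0 - 10.0 * v_bar**2, -10.0 * alpha],
                  [0.8 * beta,             -0.8]])
    eig = np.linalg.eigvals(A)
    prod = eig.prod().real    # equals 8(V^2 + alpha*beta - 1)
    total = eig.sum().real    # equals -(10V^2 - 9.2)
    if prod < 0:
        return "unstable (saddle)"
    return "unstable" if total > 0 else "stable"

# V = 0 is the equilibrium when alpha = 1, beta = 2, I = 1.5:
print(classify_equilibrium(0.0, alpha=1.0, beta=2.0))   # -> unstable, so a limit cycle surrounds it
```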
Fig. 1. The figure shows the phase plane of the FitzHugh-Nagumo equations along with the two isoclines. The red curve describes dV/dt = 0 and the green line describes dR/dt = 0. The blue curve shows the R−V trajectory. When α = 1, β = 2, (1) yields only one limit cycle. Here I = 1.5.
Only when the equilibrium point (V, R) is unstable is there a stable limit cycle around the equilibrium. When the equilibrium is stable, there is either no limit cycle or an unstable limit cycle; but an unstable limit cycle cannot be realized in biological systems and is therefore of no biological relevance. From the three conclusions above, with I = 1.5, (1) has only one stable limit cycle when α = 1, β = 2, and no limit cycle occurs when α = 1, β = 0.6 or α = 0.3, β = 2 or α = 0.3, β = 0.6. The phase plane of (1) with its isoclines is shown in Fig. 1. The spiking frequency of (1) can be controlled by changing the parameters α and β in (1). In Fig. 2 and Fig. 3 we show the time series of (1) obtained by varying α and β respectively. From the two figures we notice that as α and β increase, the frequency of (1) increases. When α or β takes a small value (for example α = 1, β = 0.6 or α = 0.3, β = 2 or α = 0.3, β = 0.6), (1) does not fire spikes. In our model, we take advantage of this to determine visual attention, which means that the synchronized neurons corresponding to the salient object will fire more frequently, while the neurons corresponding to the other objects will fire with low frequency or not fire.
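A crude way to see this numerically (again our own sketch; it reuses the hypothetical simulate_fn from the sketch after Eq. (1), and the spike threshold is an arbitrary choice) is to count upward threshold crossings of the simulated V(t) trace for several values of α:

```python
import numpy as np

def spike_count(trace, threshold=1.0):
    """Count upward crossings of `threshold` in a V(t) trace (a crude spike count)."""
    above = trace > threshold
    return int(np.sum(~above[:-1] & above[1:]))

# Using simulate_fn from the earlier sketch, the count grows with alpha:
# for a in (0.3, 1, 2, 5, 10):
#     print(a, spike_count(simulate_fn(alpha=a, beta=2.0)))
```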
Fig. 2. β = 2, I = 1.5; the spike sequence when α equals 0.3, 1, 2, 5, 10 respectively
Fig. 3. α = 1, I = 1.5; the spike sequence when β equals 0.3, 1, 2, 5, 10 respectively
2.2 Model Construction
Our model is a two dimensional neuron network governed by the following equations:
$$\begin{cases} \dfrac{dV_{i,j}}{dt} = 10\Big((V_{i,j} + \Delta V_{i,j}) - \dfrac{(V_{i,j} + \Delta V_{i,j})^3}{3} - \alpha R_{i,j} + I_{i,j}\Big) \\[4pt] \dfrac{dR_{i,j}}{dt} = 0.8\big(-(R_{i,j} + \Delta R_{i,j}) + \beta V_{i,j} + 1.5\big) \end{cases} \qquad (4)$$
where (i, j) indicates the i-th row, j-th column in the image, 1 ≤ i ≤ M, 1 ≤ j ≤ N (M and N indicate the size of the image), and ΔV_{i,j} and ΔR_{i,j} indicate the impact of peripheral neurons, defined by:
$$\begin{aligned} \Delta x_{i,j} ={}& \gamma_{i-1,j-1;i,j}(x_{i-1,j-1} - x_{i,j}) + \gamma_{i-1,j;i,j}(x_{i-1,j} - x_{i,j}) + \gamma_{i-1,j+1;i,j}(x_{i-1,j+1} - x_{i,j}) \\ &+ \gamma_{i,j-1;i,j}(x_{i,j-1} - x_{i,j}) + \gamma_{i,j+1;i,j}(x_{i,j+1} - x_{i,j}) + \gamma_{i+1,j-1;i,j}(x_{i+1,j-1} - x_{i,j}) \\ &+ \gamma_{i+1,j;i,j}(x_{i+1,j} - x_{i,j}) + \gamma_{i+1,j+1;i,j}(x_{i+1,j+1} - x_{i,j}) \end{aligned} \qquad (5)$$
where
$$\gamma_{i,j;p,q} = \begin{cases} 1, & \text{if neuron } (i,j) \text{ is coupled to } (p,q) \\ 0, & \text{else} \end{cases} \qquad (6)$$
Here x denotes V or R. As the impact of the recovery variable of each neuron on other neurons is very small, we set ΔR_{i,j} = 0. We want to achieve the behavior that the synchronized neurons corresponding to the salient object fire more frequently, while the neurons corresponding to the other objects fire with low frequency or do not fire. Firstly, we let the neurons run with fixed parameters α and β until the neurons corresponding to the same object synchronize, which means that the segmentation task has been performed. Secondly, after the first step, whenever any neuron fires, it produces two types of signals to itself and to other neurons: an excitatory signal to itself and to neurons that fire together with it, and an inhibitory signal to neurons that do not fire together with it. Without the coupling terms, (4) is the same as (1). From the analysis of (1), we know that the parameters α and β can control the activities of the neurons. For example, if the neuron at (i, j) fires, then the excitatory and inhibitory signals can be defined by varying the parameters α and β as follows:
$$\alpha_{p,q}(\tau) = \alpha_{p,q}(\tau-1) + \frac{h_1\big(\alpha_{p,q}(\tau-1)\big)}{M(\tau)} \sum_{(i,j)\in\Delta(\tau)} I_{i,j}\, f_1\big(|V_{i,j} - V_{p,q}|\big) \qquad (7)$$

$$\beta_{p,q}(\tau) = \beta_{p,q}(\tau-1) + \frac{h_2\big(\beta_{p,q}(\tau-1)\big)}{M(\tau)} \sum_{(i,j)\in\Delta(\tau)} I_{i,j}\, f_2\big(|V_{i,j} - V_{p,q}|\big) \qquad (8)$$

where
$$h_1(\alpha) = \begin{cases} \theta_1, & \alpha \ge \theta_\alpha \\ \theta_2, & \alpha < \theta_\alpha \end{cases}, \qquad h_2(\beta) = \begin{cases} \theta_1, & \beta \ge \theta_\beta \\ \theta_2, & \beta < \theta_\beta \end{cases}, \qquad \theta_1 > \theta_2 > 0.$$
f_1(x) = a_1 x + b_1 (a_1 > 0, b_1 < 0) and f_2(x) = a_2 x + b_2 (a_2 < 0, b_2 > 0). Here (p, q) indicates the p-th row, q-th column in the image, τ is a time instant with at least one firing neuron, M(τ) is the number of neurons in the firing state at τ, and Δ(τ) is the set of neurons in the firing state at τ. Setting a_1 > 0, b_1 < 0 and a_2 < 0, b_2 > 0, together with the functions h_1 and h_2, ensures that each firing neuron (i, j) sends excitatory or inhibitory signals to another neuron (p, q). From (7) and (8) we find that when any neuron at (i, j) fires, the parameter α corresponding to it will increase. If α increases, the fixed point of the system moves downward along the isocline dR/dt = 0. In other words, the intersection of the isoclines dV/dt = 0 and dR/dt = 0 moves downward along the isocline dR/dt = 0. As shown in Fig. 4, when α = 1, the fixed point is o1 and the isocline dV/dt = 0 is the green curve, and when α = 2, the fixed point is o2 and the isocline
dV/dt = 0 is the red curve. The part with a single arrow in the figure is the slow jumping area, and that with a double arrow is the quick jumping area. Here we call the part of the curve to the left of the point L the left branch, and the part to the right of the point R the right branch. If the neuron has not fired, the value of the parameter α corresponding to it is smaller. The R−V trajectory moves downward along the left branch of the green curve. When it reaches the point L of the green curve, it jumps from the left branch to the right branch. The jump makes the value of the parameter α increase, and the isocline dV/dt = 0 moves downward to the red curve. The R−V trajectory jumps to the right branch of the red curve and then moves upward along the red curve. When it reaches the point R, it jumps down from the right branch to the left branch. This jump makes the value of the parameter α decrease, and the isocline dV/dt = 0 moves upward to the green curve again. In this way, the firing rate of the neuron increases. Conversely, if the value of the parameter α decreases, the neuron fires more slowly. For the parameter β, the influence on the firing rate is the opposite.
Fig. 4. The trajectory chart when the parameter α changes. The blue curve is dR/dt = 0. The green curve is dV/dt = 0 when α = 1. The red curve is dV/dt = 0 when α = 2. Here o1 and o2 are the fixed points. I = 1.5, β = 2.
When the system is running, we can control the changes of the parameters α and β so that, when the neurons jump, the increasing speed of α is faster than the decreasing speed of β. In this way, the most salient object jumps to a high-frequency periodic oscillating phase, while all other objects remain quite silent. Attention then focuses on the most salient object. After receiving attention, this object is inhibited in order to permit other objects to become salient.
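A rough sketch of the parameter update of Eqs. (7)-(8) is given below. This is our own vectorised reading of the rule, not the authors' code; the array layout and helper names are assumptions, while the constants are those listed in Section 3:

```python
import numpy as np

def update_alpha_beta(alpha, beta, V, intensity, firing,
                      a1=2.0, b1=-4.0, a2=-4.0, b2=2.0,
                      th1=0.1, th2=0.01, th_alpha=0.8, th_beta=4.0):
    """One application of Eqs. (7)-(8) at a time instant tau.

    alpha, beta : M x N parameter arrays for every neuron (p, q).
    V           : M x N membrane potentials at tau.
    intensity   : M x N external inputs I_{i,j}.
    firing      : M x N boolean mask, the set Delta(tau) of firing neurons.
    """
    m_tau = firing.sum()                       # M(tau)
    if m_tau == 0:
        return alpha, beta
    f1 = lambda x: a1 * x + b1                 # excitatory weight (a1 > 0, b1 < 0)
    f2 = lambda x: a2 * x + b2                 # inhibitory weight (a2 < 0, b2 > 0)
    h1 = np.where(alpha >= th_alpha, th1, th2)
    h2 = np.where(beta >= th_beta, th1, th2)
    new_alpha, new_beta = alpha.copy(), beta.copy()
    for p in range(V.shape[0]):
        for q in range(V.shape[1]):
            diff = np.abs(V[firing] - V[p, q])
            new_alpha[p, q] += h1[p, q] / m_tau * np.sum(intensity[firing] * f1(diff))
            new_beta[p, q] += h2[p, q] / m_tau * np.sum(intensity[firing] * f2(diff))
    return new_alpha, new_beta
```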
3 Computer Simulations
This section presents the simulation results obtained on the image in Fig. 5. In the simulations of this paper, the following parameters are held constant: a1 = 2, b1 = −4, a2 = −4, b2 = 2, θ1 = 0.1, θ2 = 0.01, θα = 0.8, θβ = 4.
Fig. 5. Artificial image with four objects [11]
The image used in our experiment is a grayscale image consisting of four objects (Fig. 5), in which the pixel values of the sun are the largest, followed by those of the tree, hill and sky. Figure 6 shows the spike sequences of the four neuron groups corresponding to the four objects in Fig. 5. In our simulations only the intensity of the pixels is used as input. This means that the object with the highest intensity receives attention earlier than objects with lower intensity.
Fig. 6. The spike sequences of the four neuron groups corresponding to the four objects in Fig 5. From the top to bottom they are: sun, tree, hill, sky in turn.
From Fig. 6 we can observe that the firing rate increases gradually after the synchronization of the neurons corresponding to the sun object, and finally reaches the maximum. At the same time, the firing rates of the neurons corresponding to the other objects are at a relatively low level. This shows that the sun is noticed first. After that, the firing rate of the neurons corresponding to the sun object soon drops to a lower level, while the firing rate of the neurons corresponding to the tree object increases to the maximum, and the firing rates of the other two objects remain at a relatively low level. This shows that the tree object is noticed, which implements the attention shift from one object to another. Finally, the hill and sky objects are noticed in turn (Fig. 7).
Fig. 7. Panels (1)–(4) show the objects selected out in turn
4 Conclusion
This paper presents a visual selection and attention shifting model based on the FitzHugh-Nagumo equations. This system can be seen as a part of a visual attention system, responsible for selecting the most salient object from an input image and shifting attention from one object to another. The proposed model includes not only a cooperation mechanism but also a competition mechanism. Computer simulations were performed in order to check our model's viability as a selection and shifting mechanism, and the results show that it is a promising system. As future work we intend to create a new system applied to natural images. In addition, we will also combine top-down and bottom-up attention, adding the effects of prior knowledge to the visual selection.
Acknowledgements This research is partially sponsored by Beijing Municipal Foundation for Excellent Talents(No.20061D0501500211), Beijing Municipal Education Committee (No.KM200610005012), Natural Science Foundation of China (Nos.60673091, 60702031 and 60970087), Hi-Tech Research and Development Program of China (No.2006AA01Z122), Natural Science Foundation of Beijing (Nos.4072023 and 4102013) and National Basic Research Program of China (Nos.2007CB311100 and 2009CB320902).
References 1. Wang, D.L.: Object selection based on oscillatory correlation. Neural Networks 12, 579– 592 (1999) 2. Olshausen, B., Field, D.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996) 3. Campbell, S.R., Wang, D.L.: Synchronization and desynchronization in a network of locally coupled wilson-cowan oscillators. IEEE Transactions on Neural Networks 7, 541– 554 (1996) 4. Navalpakkam, V., Itti, L.: An Integrated Model of Top-down and Bottom-up Attention for Optimizing Detection Speed. In: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2006) 5. Deco, G.: A Neurodynamical Model of Visual Attention: Feedback Enhancement of Spatial Resolution in a Hierarchical System. Journal of Computational Neuroscience 10, 231– 253 (2001) 6. Heinke, D., Humphreys, G.W.: SAIM: A Model of Visual Attention and Neglect. In: Gerstner, W., Hasler, M., Germond, A., Nicoud, J.-D. (eds.) ICANN 1997. LNCS, vol. 1327, pp. 913–918. Springer, Heidelberg (1997) 7. Heinke, D.: Selective Attention for Identification Model: Simulating visual neglect. Computer Vision and Image Understanding 100, 172–197 (2005) 8. Itti, L., Koch, C.: Computational modelling of visual attention. Nature Reviews Neuroscience 2, 194–203 (2001) 9. Zhao, L., Breve, F.A.: Visual Selection and Shifting Mechanisms Based on a Network of Chaotic Wilson-Cowan Oscillators. In: Third International Conference on Natural Computation (2007) 10. Zhao, L., Macau, E.E.N., Omar, N.: Scene segmentation of the chaotic oscillator Network. International Journal of Bifurcation and Chaos in Applied Sciences and Engineering 10(7), 1697–1708 (2000) 11. Quiles, M.G., Wang, D.L., Zhao, L., Romero, R.A.F., Huang, D.: An Oscillatory Correlation Model of Object-based Attention. In: Proc. IJCNN (2009) 12. Zhao, L.: A Dynamically Coupled Chaotic Oscillatory Correlation Network. In: Proc. VI Brazilian Symposium on Neural Networks (2000)
Pruning Training Samples Using a Supervised Clustering Algorithm
Minzhang Huang1, Hai Zhao1,2, and Bao-Liang Lu1,2
1 Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University
2 MOE-Microsoft Key Laboratory for Intelligent Computing and Intelligent Systems, Shanghai Jiao Tong University
800 Dong Chuan Rd., Shanghai 200240, China
{zhaohai,blu}@cs.sjtu.edu.cn
Abstract. As practical pattern classification tasks, such as patent classification, are often very large-scale and seriously imbalanced, using traditional pattern classification techniques in a plain way to deal with these tasks has proven inefficient and ineffective. In this paper, a supervised clustering algorithm based on the min-max modular network with Gaussian-zero-crossing function is adopted to prune training samples in order to reduce training time and improve generalization accuracy. The effectiveness of the proposed training sample pruning method is verified on a group of real patent classification tasks by using support vector machines and the nearest neighbor algorithm. Keywords: Supervised clustering, Min-max modular network, Gaussian-zero-crossing function, Patent classification, Training sample pruning.
1 Introduction
More than one million new patent applications are submitted every year. It is a key problem to automatically classify these incoming patent applications. Currently most patents are handled in a manual way. For a very large-scale patent database, an automatic classification approach may play an important role in effectively reducing the workload. Naive Bayes, k-NN, support vector machines (SVMs) and decision trees have been successfully applied to patent classification, and SVMs have shown the best performance [3]. The patent classification task is often not only very large-scale but also seriously imbalanced. As a result, applying traditional pattern classification techniques becomes unacceptable in terms of both training time and space complexity, or in terms of classification accuracy. Our solution in this work is to reduce redundant or unreliable training samples by introducing a supervised clustering algorithm based on the min-max modular network with Gaussian-zero-crossing function. Clustering is a method for dividing data into several non-overlapping parts, i.e., clusters. A basic hypothesis about clusters is that data in the same cluster are more similar
than data in different clusters. There are three types of clustering methods: supervised, semi-supervised and unsupervised clustering. Unsupervised clustering is a learning framework using a specific objective function, for example, a function that minimizes the distances inside a cluster to keep the cluster tight. Supervised clustering is applied to classified examples with the objective of identifying clusters that have high probability density with respect to a single class. Semi-supervised clustering enhances a clustering algorithm by using side information in the clustering process; it can be subdivided into two major groups: similarity-based methods and search-based methods [2]. Fig. 1 illustrates the differences among these three types of clustering methods. In Fig. 1 (b), all the given data are unlabeled. In Fig. 1 (c), only the data that have an outer circle are labeled. In Fig. 1 (d), all the data are labeled. For supervised clustering, the class information of each data point is available; the most commonly used methods are learning vector quantization (LVQ) [5] and correlation clustering [1]. The simplest clustering action is to add a new data point to the existing cluster closest to it, or to let the data point form a new cluster by itself, but it remains a problem how to discriminate between these two circumstances. Li and Ye have proposed a supervised clustering method to solve this problem [7,8]. They divide the input space into several grids, and only data in the same grid can be grouped into the same cluster. But how to define the size of the grid is still unsolved. In our previous work, we proposed a new supervised clustering method to overcome this difficulty by using the min-max modular network with Gaussian-zero-crossing functions [6].
Fig. 1. Three different clustering methods: (a) given dataset; (b) unsupervised clustering; (c) semisupervised clustering; and (d) supervised clustering
The min-max modular network with Gaussian-zero-crossing function (M3-GZC) [9] is a special case of the min-max modular network (M3-network) [10]. The GZC function is directly adopted to distinguish two samples from different categories. In this paper, a supervised clustering algorithm based on M3-GZC is introduced to prune training samples. The remainder of this paper is organized as follows. In Section 2, the M3-network and the M3-GZC network are briefly introduced and the supervised clustering algorithm based on the M3-GZC network is described. The experimental results are presented in Section 3. Finally, the last section concludes this paper.
2 Supervised Clustering Algorithm
2.1 Min-Max Modular Network
The process of constructing an M3-network consists of three steps: a) divide a complex K-class problem into several small independent two-class problems; b) solve these small problems in parallel; and c) finally integrate them according to two principles, namely the minimization principle and the maximization principle [10]. Let T be the training set for a K-class classification problem:
$$T = \{(X_l, D_l)\}_{l=1}^{L},$$
where X_l ∈ R^n is the input feature vector, D_l is the expected output vector, and L is the number of training samples. The original K-class problem can be divided into K(K−1)/2 two-class subproblems as follows:
$$T_{ij} = \{(X_l^{(i)}, 1-e)\}_{l=1}^{L_i} \cup \{(X_l^{(j)}, e)\}_{l=1}^{L_j} \quad \text{for } i = 1, 2, \ldots, K \text{ and } j = i+1, \ldots, K \qquad (1)$$
All of the two-class problems defined in Eq. (1) can be trained in parallel. Their outputs are combined by the minimization principle:
$$M_i(x) = \min_{j=1}^{K} M_{ij}(x) \qquad (2)$$
where M_{ij}(x) represents the discriminative function of the component classifier trained on T_{ij}. If a two-class problem defined in Eq. (1) is still large and imbalanced, it can be further divided. Assume that X_i in (1) is divided into N_i subsets:
$$X_{ij} = \{X_l^{(ij)}\}_{l=1}^{L_i^{(j)}} \quad \text{for } j = 1, 2, \ldots, N_i \qquad (3)$$
where L_i^{(j)} denotes the number of training samples belonging to subset X_{ij}, and ∪_{j=1}^{N_i} X_{ij} = X_i. According to Eq. (3), each of the two-class problems defined in Eq. (1) can be decomposed into the following relatively smaller two-class subproblems:
$$T_{ij}^{(u,v)} = \{(X_l^{(iu)}, 1-e)\}_{l=1}^{L_i^{(u)}} \cup \{(X_l^{(jv)}, e)\}_{l=1}^{L_j^{(v)}} \quad \text{for } u = 1, \ldots, N_i,\; v = 1, \ldots, N_j,\; i = 1, \ldots, K \text{ and } j \ne i \qquad (4)$$
where X_l^{(iu)} ∈ X_{iu} and X_l^{(jv)} ∈ X_{jv} are the input vectors belonging to C_i and C_j, respectively. The solution to the original problem can be obtained by combining all of the trained component classifiers as follows:
$$M_{ij}^{u}(x) = \min_{v=1}^{N_j} M_{ij}^{(u,v)}(x), \qquad M_{ij}(x) = \max_{u=1}^{N_i} M_{ij}^{u}(x) \qquad (5)$$
2.2 M3-GZC Network
Suppose that there are two samples x_i and x_j belonging to class C_i and class C_j, respectively. The Gaussian-zero-crossing function can be adopted to separate these two samples [9], [11], and it is defined as follows:
$$f_{ij}(x) = \exp\Big(-\Big(\frac{|x - c_i|}{\sigma}\Big)^2\Big) - \exp\Big(-\Big(\frac{|x - c_j|}{\sigma}\Big)^2\Big), \qquad (6)$$
where x is the input vector, σ = λ|c_i − c_j|, and λ is a user-defined constant that determines the shape of the GZC function; λ is empirically set to 0.5 throughout this paper. The output of the M3-GZC network can be precisely defined as follows:
$$g_i(x) = \begin{cases} 1, & y_i(x) > \theta_i, \\ \text{unknown}, & -\theta_j \le y_i(x) \le \theta_i, \\ -1, & y_i(x) < -\theta_j. \end{cases} \qquad (7)$$
where θ_i and θ_j are the thresholds for C_i and C_j, respectively, and y_i(x) denotes the transfer function of the M3 network for C_i.
2.3 Supervised Clustering Algorithm
For a very large-scale pattern classification problem, supervised clustering is useful as it can effectively reduce the number of training samples; we can then efficiently train a pattern classifier on the pruned training set. The idea of the clustering algorithm is as follows. When a new sample comes, group it into the cluster nearest to it or let it form a new cluster by itself. But it is not so easy to distinguish these two situations. Li and Ye [7,8] defined a grid: only those samples in the same grid should be in the same cluster. This brings about another problem: how to determine the size of the grid. Here we follow the concept of the receptive field [9,6,11]. The receptive field is determined by the distribution of the sample data, and it is the local area around the data; so we can treat the receptive field as the grid. When a new sample comes, we use this receptive field to decide whether it should be assigned to an existing cluster or to a new one. In detail, the clustering process is as follows. Let (x, c) be a new sample. The sample closest to it is marked as (x′, c). If (x, c) can be covered by the receptive field of (x′, c), then these two samples are in the same cluster; otherwise, (x, c) will form a new cluster. We denote this process as M3-GZC-C. Since the receptive fields of the samples overlap, only the samples closest to (x, c) are considered in the above clustering procedure. This may result in a situation in which a cluster center is covered by the receptive field of another cluster center. If this case occurs, these two clusters will be merged, so we can further compress the number of samples by using this method. According to the discussion above, this processing can be formally addressed as follows: for each cluster (x, c), find its closest neighbor (x′, c); if (x, c) can be covered by the receptive field of (x′, c), then merge these two clusters. We call this process M3-GZC-CC.
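The sketch below illustrates the GZC function of Eq. (6) and an M3-GZC-C–style assignment step. It is our simplified reading of the procedure, not the authors' implementation: the receptive-field test (comparing the GZC output against a threshold θ) and all function names are assumptions.

```python
import numpy as np

def gzc(x, ci, cj, lam=0.5):
    """Gaussian-zero-crossing discriminant of Eq. (6): ci is the own-class centre,
    cj the nearest opposite-class centre."""
    sigma = lam * np.linalg.norm(ci - cj)
    return (np.exp(-(np.linalg.norm(x - ci) / sigma) ** 2)
            - np.exp(-(np.linalg.norm(x - cj) / sigma) ** 2))

def m3_gzc_c(samples, labels, theta=0.9, lam=0.5):
    """Rough sketch of the pruning step: a new sample joins the cluster of its
    nearest same-class centre if it falls inside that centre's receptive field,
    otherwise it starts a new cluster."""
    centres, centre_labels, assignment = [], [], []
    for x, c in zip(samples, labels):
        x = np.asarray(x, dtype=float)
        same = [k for k, cl in enumerate(centre_labels) if cl == c]
        other = [k for k, cl in enumerate(centre_labels) if cl != c]
        joined = False
        if same and other:
            k_same = min(same, key=lambda k: np.linalg.norm(x - centres[k]))
            k_other = min(other, key=lambda k: np.linalg.norm(x - centres[k]))
            if gzc(x, centres[k_same], centres[k_other], lam) > theta:
                assignment.append(k_same)   # covered by the receptive field: join
                joined = True
        if not joined:
            centres.append(x)               # outside every receptive field: new cluster
            centre_labels.append(c)
            assignment.append(len(centres) - 1)
    return centres, centre_labels, assignment
```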
3 Experiments
In order to verify the effectiveness of our training sample pruning method, we perform two experiments on Japanese patent classification tasks. In our experiments, we use M3-GZC-C to prune the training samples, and M3-GZC-CC to further prune the training samples until the number of training samples remains unchanged. We compare the performance of the patent classifiers with and without our training sample pruning method. All of the experiments are run on a PC with an Intel Core 2 2.83 GHz CPU and 4 GB RAM.
3.1 Experiment Setup
The data set used in our experiments was collected from the NTCIR-5 patent data set [4], which follows the International Patent Classification (IPC) taxonomy. The IPC is a hierarchically structured system including section, class, subclass, group and subgroup layers. The section layer is the top layer, and the subgroup layer is at the bottom. There are about 350,000 new Japanese patents each year, and the patents of the years 2001 and 2002 are used in our experiments. A patent document is generally stored in XML format and usually consists of three main sections: abstract, claim and description. In our experiments, these three sections were weighted equally and indexed into a single vector by using the TF-IDF algorithm. Then the χ²avg [12] feature selection method is used (a minimal sketch of this indexing and feature-selection step is given after Table 1). Traditional SVMs and the nearest-neighbor algorithm are selected as patent classifiers and compared in our experiments.
3.2 A Two-Class Classification Problem in the Subgroup Layer
Here, we choose the data from the subgroup layer with the categories H01L021/027 and H01L021/60. After feature selection, the dimension of the data is 1941. The data distribution is shown in Table 1. SVMs with RBF kernel (C=8, g=0.022) are used.

Table 1. Description of training and test data from the subgroup layer

                        Training                        Test
                  H01L021/027   H01L021/60      H01L021/027   H01L021/60
No. of samples        1256          1003             1045          934
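To make the indexing and feature-selection step concrete, here is a minimal sketch using scikit-learn. It is our own illustration, not the authors' pipeline: the document list and labels are placeholders, standard χ² scoring is used as a stand-in for the χ²avg criterion of [12], and the SVM parameters (RBF kernel, C = 8, g = 0.022, interpreted as gamma) are those quoted in Section 3.2.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC

# Placeholders for patent texts (abstract + claim + description, equally weighted)
# and their category labels.
docs = ["...patent text one...", "...patent text two..."]
y = [0, 1]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)                       # TF-IDF indexing

selector = SelectKBest(chi2, k=min(1941, X.shape[1]))   # 1941 features in Sec. 3.2
X_sel = selector.fit_transform(X, y)                # chi-square feature selection

clf = SVC(kernel="rbf", C=8, gamma=0.022)           # parameters from Sec. 3.2
clf.fit(X_sel, y)
```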
In order to compare the performance of our supervised clustering algorithm under different thresholds, we search for suitable threshold values experimentally. We set θ_i equal to 0.9, change θ_j from 0.0 to 1, and then evaluate the performance with the nearest-neighbor algorithm. The experimental results indicate that θ_i = 0.9 and θ_j = 0.1 give good performance with a relatively small training sample size, so we use these parameters in the following experiments.
The International Patent Classification, which is commonly referred to as the IPC, is based on an international multi-lateral treaty administered by WIPO. It provides a hierarchical system of language independent symbols for the classification of patents and utility models according to the different areas of technology to which they pertain.
The training and classification results with and without our training sample pruning method are presented in Tables 2 and 3. In Table 2, ‘Size ratio’ means the ratio of the number of training samples after and before pruning by the proposed method, and ‘No. of incorrect outputs’ means the number of test samples classified incorrectly. We can see from this table that the classification accuracy can be improved, or at least kept, with a smaller training sample set by using our training sample pruning method. The results of Table 3 indicate that after clustering the number of support vectors is reduced, and the training and test time are thus less than in the case without clustering. Why can the classification accuracy be improved? It may be attributed to the fact that the most important support vectors are kept while the less important ones are pruned.

Table 2. Performance comparison of the patent classifiers in the subgroup layer with and without our training sample pruning method

                                M3-GZC-C        M3-GZC-CC       Without Clustering
                                NN      SVM     NN      SVM     NN      SVM
No. of training samples         2164    2164    2136    2136    2259    2259
Size ratio (%)                  95.79   95.79   94.56   94.50   100     100
Training time (s)               13.8    0.67    13.5    0.67    14.3    0.69
No. of incorrect outputs        92      24      91      24      92      25
Classification accuracy (%)     95.57   98.85   95.62   98.85   95.57   98
Table 3. Comparison of the number of support vectors, training time, and test time of the patent classifiers in the subgroup layer with and without our training sample pruning method

                            M3-GZC-C    M3-GZC-CC   Without Clustering
No. of support vectors         239         239            248
Training time (s)             0.67        0.67           0.69
Test time (s)                 0.56        0.55           0.61
3.3 An Imbalanced Two-Class Classification Problem in the Section Layer
We choose 84490 samples from the section layer with the categories B and E. These two categories are the most imbalanced ones among the eight categories in the section layer. After feature selection, the dimension of the samples is 5000. The data distribution is shown in Table 4. Here we use SVMs with a linear kernel.

Table 4. Description of training and test samples from the section layer

                        Training                Test
                    B           E           B           E
No. of samples      66991       17499       67359       16896
The training and classification results are presented in Tables 5 and 6. We can see from Table 5 that M3-GZC-CC performs best with the smallest training data set, even though the sample distribution is greatly imbalanced. The results from Table 6 indicate that after clustering the number of support vectors is reduced, so the training and test time are less than in the case without clustering.

Table 5. Performance comparison of the patent classifiers in the section layer with and without our training sample pruning method

                                M3-GZC-C        M3-GZC-P        Without Clustering
                                NN      SVM     NN      SVM     NN      SVM
No. of training samples         75280   75280   70019   70019   84490   84490
Size ratio (%)                  89.1    89.1    82.5    82.5    100     100
Training time                   751m    38s     750m    23s     864m    49s
Classification accuracy (%)     92.5    97.2    92.4    97.6    92.5    97.6
Table 6. Comparison of the number of support vectors, training time, and test time of the patent classifiers in the section layer with and without our training sample pruning method

                            M3-GZC-C    M3-GZC-P    Without Clustering
No. of support vectors        11052        6803          11418
Training time (s)                38          23             49
Test time (s)                    16          15             17
4 Conclusions
We proposed a training sample pruning method based on a supervised clustering algorithm to deal with large-scale patent classification problems. This method can be used to preprocess training samples before learning. Our preliminary experimental results demonstrate that our method can reduce the number of training samples while keeping or even improving the classification accuracy. Furthermore, both training and test time are decreased due to the smaller size of the training sample set. We have also shown that this method can be effectively used for dealing with imbalanced pattern classification problems.
Acknowledgements This work was partially supported by the National Natural Science Foundation of China (Grant No. 60903119, Grant No. 60773090 and Grant No. 90820018), the National Basic Research Program of China (Grant No. 2009CB320901), and the National HighTech Research Program of China (Grant No.2008AA02Z315).
References 1. Bansal, N., Blum, A., Chawla, S.: Correlation clustering. Machine Learning 56(1), 89–113 (2004) 2. Eick, C., Zeidat, N., Zhao, Z.: Supervised clustering–algorithms and benefits. In: International Conference on Tools with Artificial Intelligence, pp. 774–776 (2004) 3. Fall, C.J., Benzineb, K.: Literature survey: Issues to be considered in the automatic classification of patents. World Intellectual Property Organization 29 (2002) 4. Fujii, A., Iwayama, M., Kando, N.: Overview of patent retrieval task at NTCIR-5. In: Proceedings of the 5th TCIR Workshop Meeting, pp. 269–277 (2005) 5. Kohonen, T.: Improved versions of learning vector quantization. In: International Joint Conference on Neural Networks, pp. 545–550 (1990) 6. Li, J., Lu, B.: A new supervised clustering algorithm based on min-max modular network with Gaussian-zero-crossing functions. In: International Joint Conference on Neural Networks, pp. 786–793. 7. Li, X., Ye, N.: Grid-and dummy-cluster-based learning of normal and intrusive clusters for computer intrusion detection. Quality and Reliability Engineering International 18(3), 231– 242 (2002) 8. Li, X., Ye, N.: A supervised clustering algorithm for computer intrusion detection. Knowledge and Information Systems 8(4), 498–509 (2005) 9. Lu, B., Ichikawa, M.: A Gaussian zero-crossing discriminant function for min-max modular neural networks. In: Knowledge-based Intelligent Information Engineering Systems and Allied Technologies, pp. 298–302 (2001) 10. Lu, B., Ito, M.: Task decomposition and module combination based on class relations: A modular neural network for pattern classification. IEEE Transactions on Neural Networks 10(5), 1244–1256 (1999) 11. Lu, B., Li, J.: A min-max modular network with Gaussian-zero-crossing function. Trends in Neural Computation, 285–313 (2007) 12. Sebastiani, F.: Machine learning in automated text categorization. ACM computing surveys 34(1), 1–47 (2002)
An Extended Validity Index for Identifying Community Structure in Networks
Jian Liu
LMAM and School of Mathematical Sciences, Peking University, Beijing 100871, P.R. China
[email protected]
Abstract. To find the best partition of a large and complex network into a small number of communities has been addressed in many different ways. In this paper, a new validity index for network partition is proposed, which is motivated by the construction of the Xie-Beni index in Euclidean space. A simulated annealing strategy is used to minimize this extended validity index, associated with a dissimilarity-index-based k-means iterative procedure, under the framework of a random-walker Markovian dynamics on the network. The proposed algorithm (SAEVI) can efficiently and automatically identify the community structure of the network and determine an appropriate number of communities, without any prior knowledge about the community structure, during the cooling process. The computational results on several artificial and real-world networks confirm the capability of the algorithm. Keywords: Validity index, Community structure, Dissimilarity index, Simulated annealing, K-means.
1 Introduction
In recent years we have seen an explosive growth of interest and activity on the structure and dynamics of complex networks [1,2]. This is partly due to the influx of new ideas, particularly ideas from statistical mechanics, to the subject, and partly due to the emergence of interesting and challenging new examples of complex networks such as the internet and wireless communication networks. Network models have also become popular tools in social science, economics, the design of transportation and communication systems, banking systems, power grids, etc., due to our increased capability of analyzing these models. Since these networks are typically very complex, it is of great interest to see whether they can be reduced to much simpler systems. In particular, much effort has gone into partitioning the network into a small number of clusters [3,4,5,6,7,8], which are constructed from different viewing angles in the different proposals in the literature. On a related but different front, recent advances in computer vision and data mining have also relied heavily on the idea of viewing a data set or an image as a graph or a network, in order to extract information about the important features of the images or, more generally, the data sets [9,10].
In [6], a dissimilarity index for each pair of nodes and a corresponding hierarchical algorithm to partition networks are proposed. The basic idea is to associate the network with a random-walker Markovian dynamics [11]. This motivates us to solve the partition problem by an analogy to the traditional k-means algorithm [12] based on this dissimilarity index. In the traditional clustering literature, a function called a validity index [13] is often used to evaluate the quality of clustering results, and the optimal number of clusters can be determined by selecting the minimal value of the index. We construct an extended formulation of the Xie-Beni index [13], whose smaller values indicate stronger community structure in networks. The simulated annealing strategy [14,15] is then utilized to obtain the minimal value of the index, in association with a dissimilarity-index-based k-means procedure. We construct our algorithm, simulated annealing to minimize the extended validity index (SAEVI), for network partition. From the numerical performance on four model problems, the ad hoc network with 128 nodes, sample networks generated from a Gaussian mixture model, the karate club network and the American football team network, we can see that our algorithm can efficiently and automatically determine the optimal number of communities and identify the community structure during the cooling process. The rest of the paper is organized as follows. In Section 2, we briefly introduce the dissimilarity index [6], which signifies to what extent two nodes would like to be in the same community, and then propose the extended validity index for network partition. After reviewing the idea of simulated annealing, we describe our algorithm (SAEVI) and the corresponding strategies in Section 3. In Section 4, we apply the algorithm to the four representative examples mentioned above. Finally we draw conclusions in Section 5.
2 The Framework for Network Partition
2.1 The Dissimilarity Index and the Corresponding Center
In [6], a dissimilarity index between pairs of nodes is defined, which measures the extent of proximity between nodes of a network. Let G(S, E) be a network with n nodes and m edges, where S is the node set, E = {e(x, y)}_{x,y∈S} is the weight matrix and e(x, y) is the weight of the edge connecting nodes x and y. We can relate this network to a discrete-time Markov chain with stochastic matrix P = (p(x, y)) whose entries are given by

p(x, y) = e(x, y) / d(x),   d(x) = \sum_{z∈S} e(x, z),   (1)
where d(x) is the degree of the node x [7,8,11]. Suppose the random walker is located at node x. The mean first passage time t(x, y) is the average number of steps it takes before it reaches node y for the first time, which is given by

t(x, y) = p(x, y) + \sum_{j=1}^{+∞} (j + 1) \sum_{z_1,···,z_j ≠ y} p(x, z_1) p(z_1, z_2) ··· p(z_j, y).   (2)
It has been shown that t(x, y) is the solution of the linear equation

[I − B(y)] (t(1, y), . . . , t(n, y))^T = (1, . . . , 1)^T,   (3)

where B(y) is the matrix formed by replacing the y-th column of matrix P with a column of zeros [6]. The difference in the perspectives of nodes x and y about the network can be quantitatively measured. The dissimilarity index is defined by the following expression

Λ(x, y) = ( \frac{1}{n−2} \sum_{z∈S, z≠x,y} ( t(x, z) − t(y, z) )^2 )^{1/2}.   (4)
We take a partition of S as S = \bigcup_{k=1}^{N} S_k with S_k ∩ S_l = Ø if k ≠ l. If two nodes x and y belong to the same community, then the average distance t(x, z) will be quite similar to t(y, z), and therefore the two nodes' perspectives of the network will be quite similar. Consequently, Λ(x, y) will be small if x and y belong to the same community and large if they belong to different communities. The center m(S_k) of community S_k can be defined as

m(S_k) = arg min_{x∈S_k} \frac{1}{|S_k|} \sum_{y∈S_k, y≠x} Λ(x, y),   k = 1, · · · , N,   (5)

where |S_k| is the number of nodes in community S_k. It is an intuitive and reasonable choice to take as the center of S_k the node that reaches the other nodes in the same community with the minimum average dissimilarity index.
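The quantities in (1)-(5) are straightforward to compute for small networks. The following is a minimal NumPy sketch (not the author's code) that obtains the mean first passage times by solving the linear system (3) for each target node, and then evaluates the dissimilarity index (4) and the community centers (5); it assumes a connected network given as a dense symmetric weight matrix E with more than two nodes.

```python
import numpy as np

def transition_matrix(E):
    """Eq. (1): p(x, y) = e(x, y) / d(x)."""
    d = E.sum(axis=1)
    return E / d[:, None]

def mean_first_passage_times(P):
    """Eq. (3): column y holds t(x, y) for all x."""
    n = P.shape[0]
    T = np.zeros((n, n))
    for y in range(n):
        B = P.copy()
        B[:, y] = 0.0                 # B(y): y-th column of P replaced by zeros
        T[:, y] = np.linalg.solve(np.eye(n) - B, np.ones(n))
    return T

def dissimilarity_index(T):
    """Eq. (4): Lambda(x, y) for every pair of nodes."""
    n = T.shape[0]
    L = np.zeros((n, n))
    for x in range(n):
        for y in range(n):
            if x == y:
                continue
            mask = np.ones(n, dtype=bool)
            mask[[x, y]] = False      # the sum runs over z != x, y
            L[x, y] = np.sqrt(((T[x, mask] - T[y, mask]) ** 2).sum() / (n - 2))
    return L

def community_center(L, members):
    """Eq. (5): node with minimal average dissimilarity to the rest of its community."""
    members = list(members)
    avg = [np.mean([L[x, y] for y in members if y != x]) if len(members) > 1 else 0.0
           for x in members]
    return members[int(np.argmin(avg))]
```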
2.2 The Extended Xie-Beni Index
A well-known validity index for fuzzy clustering, the Xie-Beni index [13], is widely used to evaluate clusterings of overlapping samples in Euclidean space; it is based on the fuzzy c-means (FCM) algorithm [12]. The basic idea of the FCM algorithm is to minimize the following objective function with respect to the fuzzy memberships ρ_k(x) and the cluster centers m_k:

J(ρ, m) = \sum_{k=1}^{N} \sum_{x∈S} ρ_k^b(x) \|x − m_k\|^2,   b ≥ 1,   (6)

where b > 1 is the fuzziness index. For the FCM algorithm with b = 2, the Xie-Beni index V_XB can be explicitly written as

V_XB = \frac{\sum_{k=1}^{N} \sum_{x∈S} ρ_k^2(x) \|x − m_k\|^2}{n \min_{k≠l} \|m_k − m_l\|^2} = \frac{J(ρ, m)}{n K(m)}.   (7)
Here J(ρ, m) measures the compactness of the data set S and K(m) measures the separation. The more separate the clusters, the larger K(m) and the smaller V_XB. We can find an optimal cluster number by solving min_{2≤N≤n−1} V_XB, which produces the best clustering performance for the data set S. We extend the idea of considering both compactness and separation to our formulation, and propose a new validity index for network partition as follows:

V_E = \frac{J_E}{K_E} = \frac{\sum_{k=1}^{N} \sum_{x∈S_k} Λ^2(x, m(S_k))}{\min_{k≠l} Λ^2(m(S_k), m(S_l))},   (8)

where J_E is the objective function constructed for the dissimilarity-index-based k-means, which reflects compactness, and K_E plays the role of separation, like K(m) in (7). An ideal partition corresponds to a more stable state in the space S = {S_1, . . . , S_N}, with smaller J_E and larger K_E. Thus, an optimal partition can be found by solving

\min_N \min_{\{S_1,···,S_N\}} V_E.   (9)
The global optimization problem (9) could in principle be solved by searching over all possible N with the k-means algorithm, but this is extremely costly, since for each fixed N the k-means procedure has to be run 1000 to 5000 times because of its local minima. The simulated annealing strategy [14,15], in contrast, avoids this ineffective repetition and leads to a high degree of efficiency and accuracy.
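Given the dissimilarity matrix and the center rule of the previous sketch, the index (8) itself is only a few lines of code. The sketch below assumes a partition supplied as a list of node lists with at least two communities; it reuses community_center from the snippet above.

```python
def extended_validity_index(L, partition):
    """Eq. (8): V_E = J_E / K_E for a partition given as a list of node lists."""
    centers = [community_center(L, S) for S in partition]
    # J_E: compactness term of the dissimilarity-index-based k-means objective
    J_E = sum(L[x, m] ** 2 for S, m in zip(partition, centers) for x in S)
    # K_E: separation term, the smallest squared dissimilarity between two centers
    K_E = min(L[centers[k], centers[l]] ** 2
              for k in range(len(centers))
              for l in range(len(centers)) if k != l)
    return J_E / K_E
```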
3 The Algorithm
The simulated annealing strategy is utilized here to address (9); it is motivated by simulating the physical process of annealing solids [14]. First, a solid is heated to a high temperature and then cooled slowly, so that the system at any time is approximately in thermodynamic equilibrium. At equilibrium there may be many configurations, each corresponding to a specific energy level, and the chance of accepting a change from the current configuration to a new configuration is related to the difference in energy between the two states. The simulated annealing strategy is widely used in optimization problems [15]. Let E = V_E, and let E^{(n)} and E^{(n+1)} denote the current energy and the new energy, respectively. The new state is always accepted if E^{(n+1)} < E^{(n)}; if E^{(n+1)} > E^{(n)}, it is accepted only with probability exp(−ΔE^{(n)}/T), where ΔE^{(n)} = E^{(n+1)} − E^{(n)} is the energy difference and T is the current temperature. The initial state is generated by randomly forming N clusters with N ∈ [N_min, N_max], and the initial temperature T is set to a high value T_max. A neighbor of the current state is produced by randomly flipping one spin, and the energy of the new state is calculated; the new state is kept if the acceptance requirement is satisfied. This process is repeated R times at the given temperature. A cooling rate 0 < α < 1 then decreases the current temperature until it reaches the bound T_min. The whole procedure of simulated annealing to minimize the extended validity index (SAEVI) with the k-means algorithm is summarized as follows.
(1) Set parameters T_max, T_min, N_min, N_max, α and R. Choose N randomly within the range [N_min, N_max] and initialize the partition {S_k^{(0)}}_{k=1}^{N} randomly; set the current temperature T = T_max.
(2) Compute the centers {m(S_k^{(0)})}_{k=1}^{N} according to (5), then calculate the initial energy E^{(0)} using the definition of V_E (8); set n* = 0.
(3) For n = 0, 1, · · · , R, do the following:
(3.1) Generate a set of centers {m(S_k')}_{k=1}^{N'} according to our proposal below and set N = N';
(3.2) Update the partition {S_k^{(n+1)}}_{k=1}^{N} using

S_k^{(n+1)} = { x : k = arg min_l Λ(x, m(S_l^{(n)})) },   k = 1, · · · , N,   (10)

and the corresponding centers {m(S_k^{(n+1)})}_{k=1}^{N} according to (5), then calculate the new energy E^{(n+1)} using (8);
(3.3) Accept or reject the new state: if E^{(n+1)} < E^{(n)}, or if E^{(n+1)} > E^{(n)} and u < exp{−ΔE^{(n)}/T} with u ∼ U[0, 1], then accept the new solution by setting n = n + 1; else, reject it;
(3.4) Update the optimal state, i.e. if E^{(n)} < E^{(n*)}, set n* = n.
(4) Cool the temperature: T = α · T. If T < T_min, go to Step (5); else, set n = n* and repeat Step (3).
(5) Output the optimal solution {S_k^{(n*)}}_{k=1}^{N} and the minimum energy E^{(n*)} of the whole procedure.

Our proposal for generating a new partition in Step (3.1) comprises three functions: deleting a current community, splitting a current community, and keeping a current community. At each iteration, one of the three functions is chosen at random, and the community strength [16]

M(S_k) = \sum_{x∈S_k} ( d_in(x) − d_out(x) ),   k = 1, · · · , N,   (11)
is used to select a center, where d_in(x) = \sum_{z∈S_k} e(x, z) and d_out(x) = \sum_{z∉S_k} e(x, z). The three functions are described below.
(i) Delete Community. The community S_d with the minimal community strength is identified using (11) and its center is deleted from {m(S_k)}_{k=1}^{N}.
(ii) Split Community. The community with the minimal average community strength,

S_s = arg min_{S_l} M(S_l) / |S_l|,   (12)

is chosen. The new center is obtained by

m(S_{N+1}) = arg min_{x∈S_s, x≠m(S_s)} Λ(x, m(S_s)).   (13)

(iii) Keep Community. We maintain the center set {m(S_k)}_{k=1}^{N}.
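As a rough illustration of how steps (1)-(5) and the delete/split/keep proposal fit together, the following Python sketch reuses the helper functions from the earlier snippets (the dissimilarity matrix L, the weight matrix E, community_center and extended_validity_index). The temperature schedule and all parameter values are placeholders, not the settings used in the experiments reported below.

```python
import math
import random
import numpy as np

def community_strength(E, S):
    """Eq. (11): sum over members of (internal degree minus external degree)."""
    S = set(S)
    return sum(E[x, z] if z in S else -E[x, z]
               for x in S for z in range(E.shape[0]) if z != x)

def propose_centers(L, E, partition):
    """Step (3.1): randomly delete, split or keep a community (Eqs. (11)-(13))."""
    centers = [community_center(L, S) for S in partition]
    move = random.choice(["delete", "split", "keep"])
    if move == "delete" and len(centers) > 2:
        weakest = min(range(len(partition)),
                      key=lambda k: community_strength(E, partition[k]))
        centers = [m for k, m in enumerate(centers) if k != weakest]
    elif move == "split":
        k = min(range(len(partition)),
                key=lambda k: community_strength(E, partition[k]) / len(partition[k]))
        S, m = partition[k], centers[k]
        candidates = [x for x in S if x != m]
        if candidates:                       # Eq. (13): node closest to the old center
            centers.append(min(candidates, key=lambda x: L[x, m]))
    return centers

def assign_to_centers(L, centers):
    """Eq. (10): every node joins the community of its nearest center."""
    partition = [[] for _ in centers]
    for x in range(L.shape[0]):
        partition[int(np.argmin([L[x, m] for m in centers]))].append(x)
    return [S for S in partition if S]

def saevi(L, E, Nmin=2, Nmax=10, Tmax=1.0, Tmin=1e-3, alpha=0.9, R=50):
    centers = random.sample(range(L.shape[0]), random.randint(Nmin, Nmax))
    part = assign_to_centers(L, centers)
    best = (extended_validity_index(L, part), part)
    T = Tmax
    while T > Tmin:
        energy, part = best                  # restart from the best state found so far
        for _ in range(R):
            new_part = assign_to_centers(L, propose_centers(L, E, part))
            if len(new_part) < 2:
                continue
            new_energy = extended_validity_index(L, new_part)
            dE = new_energy - energy
            if dE < 0 or random.random() < math.exp(-dE / T):
                energy, part = new_energy, new_part
                if energy < best[0]:
                    best = (energy, part)
        T *= alpha
    return best
```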
Fig. 1. (a) The fraction of nodes classified correctly for the ad hoc network by SAEVI, compared with the methods in [4]. (b) V_E and J_E as functions of N for the network generated from the given 3-Gaussian mixture model. (c) V_E and J_E as functions of N for the karate club network. (d) V_E and J_E as functions of N for the football team network.
4 Experimental Results
4.1 Ad Hoc Network with 128 Nodes
We apply our method to the ad hoc network with 128 nodes. The ad hoc network is a typical benchmark problem considered in many papers [4,6,7,8]. We choose n = 128 nodes, split into 4 communities containing 32 nodes each. Pairs of nodes belonging to the same community are linked with probability p_in, and pairs belonging to different communities with probability p_out. These values are chosen so that the average node degree d is fixed at d = 16; in other words, p_in and p_out are related by 31 p_in + 96 p_out = 16. Here we naturally choose the node groups S_1 = {1 : 32}, S_2 = {33 : 64}, S_3 = {65 : 96}, S_4 = {97 : 128}. We change z_out from 0.5 to 8 and look at the fraction of nodes that are correctly classified. The fraction of correctly identified nodes is shown in Figure 1(a), compared with the two methods described in [4]. SAEVI performs noticeably better than the two previous methods, especially in the more diffusive cases when z_out is large.
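For reference, a small NumPy sketch of this benchmark generator is given below; the interface and the random number handling are my own choices, with z_out the expected number of links a node has outside its own group (so p_out = z_out / 96 and 31 p_in + 96 p_out = 16).

```python
import numpy as np

def ad_hoc_network(z_out, n_groups=4, group_size=32, avg_degree=16, rng=None):
    """Planted-partition benchmark: 128 nodes, 4 groups of 32, average degree 16."""
    rng = rng or np.random.default_rng()
    n = n_groups * group_size
    p_out = z_out / (n - group_size)                # 96 possible external neighbours per node
    p_in = (avg_degree - z_out) / (group_size - 1)  # 31 possible internal neighbours per node
    labels = np.repeat(np.arange(n_groups), group_size)
    E = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                E[i, j] = E[j, i] = 1.0
    return E, labels
```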
Fig. 2. (a) 400 sample points generated from the given 3-Gaussian mixture distribution. The star symbols represent the centers of each Gaussian component; the circle, square and diamond symbols represent the positions of the sample points in each component, respectively. (b) The partition for the network generated from the sample points in (a) with dist = 0.8. The optimal extended validity index achieved is V_E = 2.1130 and corresponds to the 3 communities represented by the colors.
4.2 Sample Network Generated from Gaussian Mixture Model
To further test the validity of the algorithm, we apply it to a sample network generated from a Gaussian mixture model. This model is closely related to the concept of a random geometric graph proposed by Penrose [17]. We generate n sample points {x_i} in two-dimensional Euclidean space according to a K-Gaussian mixture distribution \sum_{k=1}^{K} q_k G(μ_k, Σ_k), where {q_k} are mixture proportions satisfying 0 < q_k < 1 and \sum_{k=1}^{K} q_k = 1, and μ_k and Σ_k are the mean position and covariance matrix of each component, respectively. We then generate the network as follows: if |x_i − x_j| ≤ dist, we set an edge between the i-th and j-th nodes; otherwise they are not connected. We take n = 400 and K = 3, and generate the sample points with the means and covariance matrices

μ_1 = (1.0, 4.0)^T,  μ_2 = (2.5, 5.5)^T,  μ_3 = (0.5, 6.0)^T,   (14a)

Σ_1 = Σ_2 = Σ_3 = \begin{pmatrix} 0.15 & 0 \\ 0 & 0.15 \end{pmatrix}.   (14b)
Fig. 3. The partition for the karate club network. The optimal extended validity index achieved is VE = 4.0225 and corresponds to the 3 communities represented by the colors.
Here we pick nodes 1:100 in group 1, nodes 101:250 in group 2 and nodes 251:400 in group 3 for simplicity (see Figure 2(a)). With this choice, approximately q_1 = 100/400 and q_2 = q_3 = 150/400. The threshold is chosen as dist = 0.8. The numerical and partitioning results obtained by SAEVI are shown in Figure 1(b) and Figure 2(b). The objective function of k-means, J_E, decreases as N increases, while the extended validity index V_E has a minimum.
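A minimal sketch of this construction, with the parameters of Eq. (14) and the group sizes above, might look as follows; it illustrates the sampling procedure rather than reproducing the script used for the reported experiment.

```python
import numpy as np

def gaussian_mixture_network(dist=0.8, rng=None):
    """400 points from the 3-Gaussian mixture of Eq. (14); edge if two points are within `dist`."""
    rng = rng or np.random.default_rng()
    means = np.array([[1.0, 4.0], [2.5, 5.5], [0.5, 6.0]])
    sizes = [100, 150, 150]                      # groups 1:100, 101:250, 251:400
    cov = 0.15 * np.eye(2)
    points = np.vstack([rng.multivariate_normal(m, cov, s) for m, s in zip(means, sizes)])
    labels = np.repeat(np.arange(3), sizes)
    diff = points[:, None, :] - points[None, :, :]
    E = (np.sqrt((diff ** 2).sum(axis=-1)) <= dist).astype(float)
    np.fill_diagonal(E, 0.0)                     # no self-loops
    return E, points, labels
```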
4.3 The Karate Club Network
This network was constructed by Wayne Zachary after he observed social interactions between members of a karate club at an American university [18]. Soon afterwards, a dispute arose between the club's administrator and its main teacher, and the club split into two smaller clubs. The network has been used in several papers to test algorithms for finding community structure in networks [3,4,5,6,7,8]. The numerical and partitioning results obtained by SAEVI are shown in Figure 1(c) and Figure 3, and they are consistent with the original structure of the network.

4.4 The Football Team Network
The last network we investigated is the college football network, which represents the game schedule of the 2000 season of Division I of the US college football league [3,6]. The nodes in the network represent 115 teams and the edges represent regular-season games between the two teams they connect. The teams are divided into conferences containing around 8 to 12 teams each, and games are more frequent between members of the same conference than between members of different conferences. The numerical and partitioning results obtained by SAEVI are shown in Figure 1(d) and Figure 4. According to the results, almost all of the football teams are correctly clustered with the others in their conference.
Fig. 4. The partition for the American football team network. The optimal extended validity index achieved is VE = 20.4117 and corresponds to the 11 communities represented by the colors.
The teams in the Independents conference do not seem to belong to any community, but they tend to be clustered with the conference with which they are most closely associated. The Sunbelt conference is split into three communities, clustered with less connected teams in the Western Athletic conference, Conference USA and the SEC. Only one member of Conference USA is grouped with most of the teams of the Western Athletic conference. All the other communities coincide with the known structure, and our algorithm performs remarkably well.
5 Conclusions
In this paper, we have proposed a new validity index for network partition and used the simulated annealing strategy to minimize this index in association with a dissimilarity-index-based k-means procedure. The resulting algorithm (SAEVI) succeeds on four representative examples. The experiments demonstrate that our algorithm can identify the community structure with a high degree of accuracy, and that the optimal number of communities can be efficiently determined during the cooling process without any prior knowledge about the community structure. The proposed validity index is competitive with the modularity used for network community structure in the literature [4,5], which likewise leads to a model selection problem. The new validity index and the algorithm considered in this paper are efficient and deserve further investigation.
Acknowledgements. This work is supported by the National Natural Science Foundation of China under Grant 10871010 and the National Basic Research Program of China under Grant 2005CB321704. The author thanks Professor M.E.J. Newman for providing the data of the karate club network and the college football team network.
References 1. Albert, R., Barab´ asi, A.L.: Statistical Mechanics of Complex Networks. Rev. Mod. Phys. 74(1), 47–97 (2002) 2. Newman, M., Barab´ asi, A.L., Watts, D.J.: The Structure and Dynamics of Networks. Princeton University Press, Princeton (2005) 3. Girvan, M., Newman, M.: Community Structure in Social and Biological Networks. Proc. Natl. Acad. Sci. USA 99(12), 7821–7826 (2002) 4. Newman, M., Girvan, M.: Finding and Evaluating Community Structure in Networks. Phys. Rev. E 69(2), 026113 (2004) 5. Newman, M.: Modularity and Community Structure in Networks. Proc. Natl. Acad. Sci. USA 103(23), 8577–8582 (2006) 6. Zhou, H.: Distance, Dissimilarity Index, and Network Community Structure. Phys. Rev. E 67(6), 061901 (2003) 7. Weinan, E., Li, T., Vanden-Eijnden, E.: Optimal Partition and Effective Dynamics of Complex Networks. Proc. Natl. Acad. Sci. USA 105(23), 7907–7912 (2008) 8. Li, T., Liu, J., Weinan, E.: Probabilistic Framework for Network Partition. Phys. Rev. E 80, 026106 (2009) 9. Shi, J., Malik, J.: Normalized Cuts and Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intel. 22(8), 888–905 (2000) 10. Meilˇ a, M., Shi, J.: A Random Walks View of Spectral Segmentation. In: Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics, pp. 92–97 (2001) 11. Lovasz, L.: Random Walks on Graphs: A Survey. Combinatorics, Paul Erdos is Eighty 2, 1–46 (1993) 12. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2001) 13. Xie, X.L., Beni, G.: A Validity Measure for Fuzzy Clustering. IEEE Tran. Pattern Anal. Mach. Intel. 13(8), 841–847 (1991) 14. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 21(6), 1087 (1953) 15. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by Simulated Annealing. Science 220(4598), 671–680 (1983) 16. Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., Parisi, D.: Defining and Identifying Communities in Networks. Proc. Natl. Acad. Sci. USA 101(9), 2658–2663 (2004) 17. Penrose, M.: Random Geometric Graphs. Oxford University Press, Oxford (2003) 18. Zachary, W.: An Information Flow Model for Conflict and Fission in Small Groups. J. Anthrop. Res. 33(4), 452–473 (1977)
Selected Problems of Intelligent Corpus Analysis through Probabilistic Neural Networks
Keith Douglas Stuart (1), Maciej Majewski (2), and Ana Botella Trelis (1)
(1) Polytechnic University of Valencia, Department of Applied Linguistics, Camino de Vera s/n, 46022 Valencia, Spain, {kstuart,apbotell}@idm.upv.es
(2) Koszalin University of Technology, Department of Mechanical Engineering, Raclawicka 15-17, 75-620 Koszalin, Poland
[email protected]
Abstract. The paper describes the application of artificial neural networks for corpus analysis which consists of intelligent mechanisms of analysis and recognition of word clusters and their meaning. The task of analyzing a corpus of academic articles was resolved with probabilistic neural networks. A review of selected issues is carried out with regards to computational approaches to language modeling as well as statistical patterns of language. The paper features recognition algorithms of word clusters of similar meanings but different lexico-grammatical patterns from the established corpus using four-layer neural networks. The paper also presents experimental results of word cluster recognition in the context of phrase meaning analysis. Keywords: corpus analysis, artificial intelligence, probabilistic neural networks, natural language processing, applied computational linguistics.
1 Introduction
The hypothesis that modeling a language involves probabilistic representations of allowable sequences determines two areas of knowledge that might be applied to text analysis. One is word clusters: it is often the case that strings of words are repeated or tend to cluster together for semantic reasons. The other is the fact that, given a sequence of words, one might want to try to predict the next word based on what restrictions exist on the choice of the next word. Another way of putting this is that, given a sequence of possible words, we estimate the probability of that sequence. In a corpus of size N, the assumption is that any combination of n words is a potential n-gram. Each n-gram in our model is a parameter used to estimate the probability of the next possible word. Low-frequency n-grams are the most frequent kind: it is very common to find strings that have low frequency, just as it is very common to find words that occur only once in a corpus (hapax legomena). We have been doing research into word clusters in a corpus of 1376 academic articles and we have found that repetition is constant across many word sequences.
Our corpus comprises 1,376 articles, from specialist leading journals (a total of 6,104,323 tokens, 71,516 types, and 1.17 type/token ratio). The articles have all been published in journals cited in the Science Citation Index (SCI). They have been distributed in 23 knowledge areas, each of which constitutes per se a sub-corpus. They are representative samples of the language of science and technology. The corpus has been tagged with meta-textual information and transferred to an Access database by means of an application in Visual Basic. Once the corpus had been designed and implemented, we proceeded to analyse the data by creating wordlists of technical and semi-technical terms through frequency counts and keyword identification. This process involved initially comparing a general English wordlist (from the 100 million BNC corpus) with a wordlist from our corpus. Frequencies were compared and a keyword list was created from our corpus. Analysis was conducted by processing both the corpus as a whole and each of the subject areas separately. Then, we proceeded to generate 3 to 8 word clusters (n-grams), which were transferred to a database specially designed to carry out queries related to the clusters. Furthermore, we carried out research into collocational structures which are obtained by calculating the total number of times a word is found in the neighbourhood of the node word using as the default collocation horizon 5 words to the left and 5 words to the right of the node word (although it is possible to calculate collocations using much larger horizons). Both clusters and collocational structures provide clues to lexico-grammatical patterns. For this paper, we have mainly used the data from the 3 to 8 word clusters. The aim of the research is intelligent corpus analysis through meaning recognition of word clusters using artificial intelligence methods. We have developed a method which allows for development of possible word cluster components in a corpus for training probabilistic neural networks. The networks are capable of recognizing word clusters with similar meaning but different lexico-grammatical patterns. In other words, we are working with the idea that there is a strong tendency for sense and syntax to be associated. Corpus Linguistics needs computational tools to be able to map the close association between pattern and meaning and neural networks are ideal for pattern recognition and, consequently, semantic meaning.
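As an illustration of the kind of cluster extraction described above, a minimal Python sketch for generating and counting 3-to-8-word clusters (n-grams) is shown below; the tokenizer and the toy corpus are placeholders and do not reproduce the authors' actual processing pipeline or database setup.

```python
import re
from collections import Counter

def word_clusters(text, n_min=3, n_max=8):
    """Count every contiguous n-word sequence (n-gram) for n_min <= n <= n_max."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())   # placeholder tokenizer
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy example: repeated clusters surface immediately.
sample = "the results of the experiment show that the results of the experiment are stable"
for gram, c in word_clusters(sample, 3, 4).most_common(3):
    print(" ".join(gram), c)
```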
2 The State of the Art
Previous work on corpus analysis has faced several limitations: the number of words covered, the number of structures covered, and limits on the amount of data available for low-frequency items imposed by the size of the corpora. While some work has used data from larger corpora, it remains an important goal to develop new, reliable and efficient automatic extraction methods. Towards this goal, various automated tools have been developed during the last few years. However, most of them use old-fashioned methods and lack the more sophisticated capabilities that could be delivered with artificial intelligence methods. This paper proposes an approach that deals with the above problem.
3 Description of the Method
The proposed intelligent corpus analysis system shown in abbreviated form in Fig. 1A, consists of two subsystems: statistical corpus processing and intelligent corpus processing. In the corpus processing subsystem, words are isolated from text extracted from the corpus, which are developed into various combinations of word clusters based on the statistical models of word sequences. The developed word clusters representing appropriate N-gram models are processed further for training probabilistic neural networks with learning patterns of words and clusters. In the intelligent corpus processing subsystem, text is retrieved from the corpus using a parser. In the next step, word clusters are extracted by the parser using lexical and grammar patterns. The separated words are processed for letter strings isolated in segments as possible cluster word components. This analysis has been carried out using Hamming neural networks. The output data of the analysis consists of processed word segments. Individual word segments treated here as isolated possible components of the cluster words are inputs (Fig. 1B) of probabilistic neural networks for recognizing words. The networks use learning files containing words and are trained to recognize words as word cluster components, with words represented by output neurons. The intelligent cluster word recognition method allows for recognition of words with similar meanings but different lexico-grammatical patterns. In the next stage, the words are transferred to the word cluster syntax analysis module. The module creates words in segments as word cluster components properly, which are coded as vectors. Then they are processed by the module for word cluster segment analysis using hybrid binary neural networks. The analyzed word cluster segments become inputs of the word cluster recognition module using probabilistic neural networks (Fig. 1C). The module uses 4-layer probabilistic neural networks, either to recognize the cluster and find its meaning or else it fails to recognize it. The neural networks of this module use learning files containing patterns of possible meaningful word clusters. The intelligent analysis and processing allow for recognition of any combination of meaningful word clusters with similar meanings but different lexico-grammatical patterns. The overall detailed results of the intelligent analysis are subject to processing for corpus characteristics and its linguistic description including: statistical analysis, checking occurrences, and validating linguistic rules. The proposed intelligent system for corpus analysis contains probabilistic neural networks which are pattern classifiers. They can become effective tools for solving classification problems of lexico-grammatical structures in corpus linguistics, where the objective is to assign cases of clusters of letters or words to one of a number of discrete cluster classes. Pattern classifiers place each observed vector of cluster data x in one of the predefined cluster classes ki , i=1, 2, ..., K where K is the number of possible classes in which x can belong. The effectiveness of the cluster classifier is limited by the number of data elements that vector x can have and the number of possible cluster classes K. The Bayes pattern classifier implements the Bayes conditional probability rule that the probability P (ki |x ) of x being in class ki is given by (1):
P(k_i | x) = \frac{P(x | k_i) P(k_i)}{\sum_{j=1}^{K} P(x | k_j) P(k_j)},   (1)
where P(x | k_i) is the conditional probability density function of x given class k_i, and P(k_j) is the probability of drawing data from class k_j.
Fig. 1. (A) Diagram of the proposed system for intelligent corpus analysis, (B) inputs of the word recognition module, (C) inputs of the word cluster recognition module
Vector x is said to belong to a particular class k_i if P(k_i | x) > P(k_j | x) for all j = 1, 2, . . . , K, j ≠ i. This classifier assumes that the probability density function of the population from which the data was drawn is known a priori; this assumption is one of the major limitations of implementing the Bayes classifier. The probabilistic neural network was first introduced by Specht [2,3], who was inspired by the work of Parzen [1]. The network offers a way to interpret the network's structure in the form of a probability density function. The probabilistic neural network simplifies the Bayes classification procedure by using a training set of clusters from which the statistical information needed to implement the Bayes classifier can be drawn. The desired probability density function of the cluster class is approximated by using the Parzen windows approach [1]. The probabilistic neural network learns to approximate the probability density function of the cluster training samples. It should be interpreted as a function that approximates the probability density of the underlying cluster sample distribution, rather than fitting the cluster samples directly. It approximates the probability that vector x belongs to a particular class k_i as a sum of weighted Gaussian distributions centered at each cluster training sample. The output of the model is an estimate of the cluster class membership probabilities. The architecture of the probabilistic neural networks in the proposed system is shown in Fig. 2. The network is composed of many interconnected processing units or neurons organized in successive layers [2]. The probabilistic neural network for recognition of clusters of letters or words consists of four layers: the cluster input, cluster pattern, summation and output layers. The cluster input layer units do not perform any computation and simply distribute the input to the neurons in the second layer. In the pattern layer, there is one pattern neuron for each cluster training sample. Each pattern neuron forms a product of the weight vector w_j^i and the given cluster sample, where the weights entering a neuron come from a particular cluster sample. This product is then passed through the exponential activation function

exp( − (w_j^i − x)^T (w_j^i − x) / (2σ^2) ).   (2)
Selected Problems of Intelligent Corpus Analysis
Fig. 2. (A) The probabilistic neural networks for recognition of clusters of letters or words, (B) neuron of the pattern layer, (C) neuron of the output layer
The summation layer neurons compute the maximum likelihood of cluster pattern x being classified into class k_i by summarizing and averaging the outputs of all the neurons that belong to the same cluster class [2]:

P_i(k_i | x) = \frac{1}{(2π)^{s/2} σ^s} \frac{1}{N_i} \sum_{j=1}^{N_i} exp( − (x − x_j^i)^T (x − x_j^i) / (2σ^2) ),   (4)

where N_i is the number of cluster training patterns in class k_i. Eq. (4) is a sum of small multivariate Gaussian probability distributions centered at each cluster training sample.
This function is used to generalize the classification beyond the given cluster training samples. As the number of cluster training samples and their Gaussians increases, the estimated probability density function approaches the true function of the cluster training set. The classification decision for a cluster of letters or words is taken according to the inequality

\sum_{j=1}^{N_i} exp( − (x − x_j^i)^T (x − x_j^i) / (2σ^2) ) > \sum_{j=1}^{N_k} exp( − (x − x_j^k)^T (x − x_j^k) / (2σ^2) )   for all i and k.   (5)

Before classification, the sums in Eq. (5) are multiplied by their respective prior probabilities (P_i and P_k), calculated as the relative frequency of the cluster samples in each cluster class [2]. The decision layer classifies the cluster pattern x in accordance with Bayes's decision rule based on the outputs of all the summation layer neurons:

Ĉ(x) = arg max_{i=1,2,...,K} { \frac{1}{(2π)^{s/2} σ^s} \frac{1}{N_i} \sum_{j=1}^{N_i} exp( − (x − x_j^i)^T (x − x_j^i) / (2σ^2) ) },   (6)

where Ĉ(x) denotes the estimated class of the cluster pattern x and K is the total number of classes in the cluster training samples [2].
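A compact NumPy sketch of this classifier (Eqs. (3), (4) and (6), with the class priors applied as described above) is given below; the smoothing factor and the interface are illustrative choices, not the configuration used in the system.

```python
import numpy as np

class ProbabilisticNN:
    """Minimal Parzen-window probabilistic neural network."""

    def __init__(self, sigma=0.5):
        self.sigma = sigma            # smoothing factor (placeholder value)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes = np.unique(y)
        self.patterns = {k: X[y == k] for k in self.classes}   # one pattern neuron per sample
        self.priors = {k: len(self.patterns[k]) / len(y) for k in self.classes}
        return self

    def class_densities(self, x):
        """Eq. (4): summation-layer output for every class."""
        x = np.asarray(x, dtype=float)
        s = x.size
        norm = 1.0 / ((2 * np.pi) ** (s / 2) * self.sigma ** s)
        dens = {}
        for k, Xk in self.patterns.items():
            d2 = ((Xk - x) ** 2).sum(axis=1)   # (x - x_j^i)^T (x - x_j^i)
            dens[k] = norm * np.exp(-d2 / (2 * self.sigma ** 2)).mean()
        return dens

    def predict(self, x):
        """Eq. (6) with prior-weighted densities."""
        dens = self.class_densities(x)
        return max(self.classes, key=lambda k: self.priors[k] * dens[k])
```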
4 Experimental Results
Our corpus, comprising 1,376 articles, contains clusters ranging from 3 to 8 words. The experimental results show the numbers of such clusters in the corpus, which are presented in Fig. 3(A).
Fig. 3. (A) Number of clusters of our corpus vs. number of words of the cluster, (B) sensitivity of word cluster meaning recognition: minimum number of words of the cluster being recognized vs. number of cluster component words
The proposed system allowed for recognition of any combination of meaningful word clusters with similar meanings but different lexico-grammatical patterns. The tests measured the performance of cluster meaning recognition, and the system achieved a satisfactory level of effectiveness. As shown in Fig. 3(B), the ability of the probabilistic neural network to recognize a cluster depends on the number of words in that cluster: for best performance, the network requires a minimum number of words of each cluster to be recognized as its input.
5 Conclusions and Perspectives
It is assumed that language processing is closely tied to a user’s experience, and that distributional frequencies of words and structures play an important role in learning. Therefore the interest in the statistical profile of language usage plays an important role in research. This paper has developed a method which allows for extraction of possible word cluster components in a corpus for training probabilistic neural networks. The networks are capable of recognizing word clusters with similar meaning but different lexico-grammatical patterns. It has long been an ambition of corpus linguistics to investigate fully relationships between form and meaning, sense and syntax. The patterns of language have been revealed by corpus linguistics through concordance lines, word clusters, collocation and colligation but there is no automated way of generating these word clusters. It might be useful for corpus linguistics to learn from neural networks how to generate word clusters automatically based on the training of the aforementioned networks with corpus examples and thereby bridge the gap between data-driven Hallidayan approaches to language and the more formalized Chomskyan predictive approach.
References 1. Parzen, E.: On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33(3), 1065–1076 (1962) 2. Specht, D.F.: Probabilistic neural networks. Neural Networks 3(1), 109–118 (1990) 3. Specht, D.F.: Enhancements to probabilistic neural networks. In: Proc. IEEE International Joint Conference on Neural Networks, Baltimore, USA, vol. 1, pp. 761–768 (1992)
A Novel Chinese Text Feature Selection Method Based on Probability Latent Semantic Analysis
Jiang Zhong (1), Xiongbing Deng (1), Jie Liu (1), Xue Li (2), and Chuanwei Liang (3)
(1) College of Computer Science, Chongqing University, Chongqing 400044, China
(2) School of Information Technology and Electrical Engineering, University of Queensland, QLD 4072, Australia
(3) Information Center of State Administration of Taxation, Laizhou, Shandong 261400, China
Abstract. Effective feature selection is essential to make the learning task efficient and more accurate. In this paper, a novel Chinese text feature selection algorithm based on PLSA is presented for text classification, and it is compared with other effective feature selection methods on a benchmark of 8 text classification problem instances gathered from the Sogou Lab corpus. The results show that this method gives the SVM classifier the best classification performance and makes it more robust than the other methods. Keywords: Feature selection, Text classification, PLSA, SVM.
1 Introduction In text classification, one typically uses a 'bag of words' model: each position in the input feature vector corresponds to a given word or phrase. The number of potential words often exceeds the number of training documents by more than an order of magnitude. Feature selection is necessary to make large problems computationally efficient and to improve classification accuracy [1-3]. Because Chinese text does not have a natural delimiter between words, Chinese text feature selection faces different challenges, and its methods have been discussed in the literature [4-7]. Text feature selection methods are usually based on the assumption that unique words in the corpus are independent and orthogonal, and they commonly use the frequencies of words and documents to select features. Such methods have two shortcomings. Firstly, they take no account of the semantics of words and cannot handle polysemy and synonymy; secondly, they may lose feature words that have a strong correlation with the documents of one category but not with all categories [7]. To address these problems, we propose a feature selection method based on PLSA to select feature words for documents. This method can resolve the polysemy and synonymy problems to some extent [8]. The paper is organized as follows: in Section 2, the PLSA algorithm and how it is applied are introduced; the novel feature selection method is described in Section 3; finally, the experimental results and analysis are presented.
2 Probability Latent Semantic Analysis (PLSA) Suppose that there is a collection of text documents D = {d1 ,..., d N } with terms from a vocabulary W = {w1 ,..., wM } . By ignoring the sequential order in which words occur in a document, one may summarize the data in a rectangular N × M co–occurrence table of counts C ( n( d i , w j )) , where n( d i , w j ) denotes the number of times the term
w_j occurred in document d_i. In this particular case, C is also called the term-document matrix, and the rows/columns of C are referred to as document/term vectors, respectively. The starting point for probabilistic latent semantic analysis is a statistical model which has been called the aspect model [9,10]. The aspect model is a latent variable model for co-occurrence data which associates an unobserved class variable z_k ∈ {z_1 ,..., z_K} with each observation, an observation being the occurrence of a word in a particular document. Let us introduce the following probabilities: P(d_i) denotes the probability of a particular document d_i; P(w_j | z_k) denotes the class-conditional probability of a specific word conditioned on the unobserved class variable z_k; P(z_k | d_i) denotes a specific document's probability distribution over the latent variable, i.e. the probability that document d_i belongs to topic z_k, while P(w_j | z_k) gives the probability that topic z_k includes the word w_j. Fig. 1 shows two common probability relationship patterns among documents, topics and words. According to the probability relationships shown in Fig. 1 and using the EM method, we can obtain the following formulas by applying the Bayes formula:
P(z_k, d_i, w_j) = P(d_i) P(z_k | d_i) P(w_j | z_k),   (1)

L_c = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} n(z_k, d_i, w_j) log P(z_k, d_i, w_j).   (2)
The algorithm then alternates between an expectation (E) step and a maximization (M) step. During the E step, the posterior probabilities of the latent variables are computed based on the current estimates of the parameters. In the maximization (M) step, the parameters are updated based on the complete-data log-likelihood, which depends on the posterior probabilities computed in the E-step. The E-step and the M-step equations are alternated until a termination condition is met.
Fig. 1. Relationship among the document, subject and the word
In this paper, the termination condition is that the improvement of L_c is no longer significant. As described above, the greater P(w_j | z_k), the stronger the correlation between the word w_j and the topic z_k. As a result, once we obtain the conditional probabilities P(w_j | z_k), we can select as representatives of each topic the words that have a strong correlation with it.
Fig. 2. Two phases of the feature selection method
A Novel Chinese Text Feature Selection Method
279
Fig. 3. Subset feature selection algorithm
4 Experiment and Result Analysis 4.1 Evaluation Metric Usually, the effectiveness of a feature selection method is evaluated using the performance of classifiers which have applied the feature selection to documents in the preprocessing phase. Since mostly document classifiers score categories on a per-document basis, we use the standard definition of precision as performance measures. Precision= (categories found and correct)/ (total categories found) Because SVM has many merits [1,11], we choose it as classification algorithm to verify the effectiveness of feature selection algorithms. 4.2 The Data Set of Experiment The data set in this experiment comes from Sougou Lab’s corpus (web site: http://www.sogou.com/labs/), and we select 15920 documents from eight categories in our experiment. In each experiment round, we randomly choose fifty percent documents in each category as training data and let the left documents as testing data. In the experiment, we apply a 2-grams word segmentation analyzer to process the corpus, which is based on the Chinese corpus selected from 1998 People’s Daily. Because nouns are usually used to describe concepts in natural language, we construct word-document matrix based on the nouns with relative high appearing frequency. After the feature selection, we employ LIBSVM to training SVM classifier, and 8 SVM classifiers were trained for every category. 4.3 Results of Experiment and Analysis CDW,CHI and IG methods are usually used feature selection methods [1], and in our experiment, the comparisons of these methods with our method have made. During the experiments, we selected from 360 to 12977 feature words to train document classifier. The Fig. 4 shows the performance curves of SVM based on different feature selection
280
J. Zhong et al.
Fig. 4. Performance curves based on different feature selection methods
methods. As it can be seen, CDW, CHI, IG and the new methods have similar effects on the performance of classifier. All of them can eliminate up to 90% or more unique words (terms) with either an improvement or no loss in categorization accuracy. It was found that using CDW method would get the best performance when the number of feature words is less then 1200. On the other side, when the number of feature words is bigger than 1200, we found that our novel method would get the best performance. The performance of SVM is showed in table 1, and it used 1200 feature words selected by different feature selection methods from each category. From the table, we can find that PLSA method has the highest classification precision value on every category, and the average of classification precision value is high to 98.43%. There is another fact that can be gotten from the table that the performance of classifier using PLSA decrease slower than other methods when the number of features increase. These results showed that the new method was less sensitive to the noise words, and this means that it is more robust than other methods. Table 1. The classification precision about choosing 1200 feature words(%)
Type Finance Culture Health Sports Travel Education Hiring Military Average
CHI 93.54 93.12 95.34 96.03 93.39 94.81 95.23 96.18 94.71
IG 91.96 89.14 91.85 93.34 91.69 92.93 88.86 91.86 91.45
CDW 97.51 97.32 97.76 98.17 97.41 98.56 96.65 98.72 97.76
PLSA 97.81 97.63 99.45 99.07 97.51 99.12 97.32 99.32 98.40
5 Conclusion This paper presented a novel Chinese text feature selection algorithm, and the algorithm was based on PLSA. Comparison results with other feature selection methods in statistical learning of text categorization show that it could get the best performance when we eliminate about 90% unique terms. On the other hand, if people
A Novel Chinese Text Feature Selection Method
281
concern more about the number of feature words [12], the best choice of selection method is CDW method, which could get ideal classification accuracy, even eliminated about 98% unique terms. Acknowledgments. This work is supported by National Science and Technology Support Project (2008BAH37B04) and Natural Science Foundation Project of CQ CSTC(2008BB2195), CQ CSTC(2008BB2183) and CQ CSTC(2007BB2372).
References 1. Kim, H., Howland, P., Park, H.: Dimension Reduction in Text Classification with Support Vector Machines. Journal of Machine Learning Research 6(1), 37–53 (2005) 2. Fabrizio, S.: Machine Learning in Automated Text Categorization. ACM Computing Surveys 34(1), 1–47 (2002) 3. Huan, L., Lei, Y.: Toward Integrating Feature Selection Algorithms for Classification and Clustering. IEEE Transactions on Knowledge and Data Engineering 17(4), 491–502 (2005) 4. Yan, X., Jintao, L., Bin, W., Chunming, S., Sen, Z.: A Study on Constraints for Feature Selection in Text Categorization. Journal of Computer Research and Development (04), 596–602 (2008) 5. Ronglu, L., Jianhui, W., Xiaoyun, C., Xiaopeng, T., Yunfa, H.: Using Maximum Entropy Model for Chinese Text Categorization. Journal of Computer Research and Development 42(1), 94–101 (2005) 6. Qian, Z., Ming, S.Z., Min, H.: Study on Feature Selection in Chinese Text Categorization. Journal of Chinese Information Processing 18(3), 17–23 (2004) 7. Jinshu, S., Bofeng, Z., Xin, X.: Advances in Machine Learning Based Text Categorization. Journal of Software 17(9), 1848–1859 (2006) 8. Hofmann, T.: Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning 42, 177–196 (2001) 9. Wenqian, S., Houkuan, H., Yuling, L., Yongmin, L., Youli, Q., Hongbin, D.: Research on the Algorithm of Feature Selection Based on Gini Index for Text Categorization. Journal of Computer Research and Development 43(10), 1668–1694 (2006) 10. Deerweste, S., Dumais, S.T., George, W.F., et al.: Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science 41(6), 391–407 (1990) 11. Fung, G.M., Mangasarian, O.L.: Multicategory Proximal Support Vector Machine Classifier. Machine Learning 59(1-2), 77–97 (2005) 12. Makrehchi, M., Kamei, M.S.: Text Classification Using Small Number of Features. In: Perner, P., Imiya, A. (eds.) MLDM 2005. LNCS (LNAI), vol. 3587, pp. 580–589. Springer, Heidelberg (2005)
A New Closeness Metric for Social Networks Based on the k Shortest Paths Chun Shang, Yuexian Hou, Shuo Zhang, and Zhaopeng Meng Department of Computer Science and Technology, Tianjin University, Tianjin, China {springuper,krete1941}@gmail.com, single
[email protected],
[email protected]
Abstract. We axiomatically develop a metric of personal connection between individuals in social networks, and construct an optimal model to find the best weight of the metric. Our metric optimizes, in some strict-established sense, weighted average of the k shortest paths so that it is able to distinguish the closeness between nodes more relevantly than traditional metrics. The algorithms are implemented and evaluated on random networks and real social networks data. The results demonstrate relevance and correctness of our formalization. Keywords: closeness metric, social network analysis, the k shortest paths, optimal model.
1
Introduction
In recent decades, the notion of a social network and the methods of social network analysis (SNA) have attracted considerable interest from several disciplines. Much of this interest can be attributed to the appealing focus of SNA on relationships among individuals, and on the patterns and implications of these relationships. From the view of SNA, the social environment can be expressed as patterns in relationships among interacting individuals [1]. An in-depth comprehension of the link structure of a social network is necessary to evaluate current practice methods and to design future plans. For example, understanding the structure of social networks might lead to algorithms that can detect trusted or influential users [2]. Furthermore, recent work has proposed the use of social networks to mitigate email spam [3], to improve Internet search[4], and to defend against Sybil attacks [5]. In order to navigate and mine the contents of a social network, it is of fundamental and crucial importance to judge the closeness of personal connection between individuals. Traditionally, this kind of closeness is computed based on the properties of individuals. However, in a social network, a lot of information about closeness is encoded in the link structure of the graph. The length of the shortest path between two nodes is the most popular closeness metric. Another two closeness metrics based on a network flow computation and a comparison of
Corresponding author.
L. Zhang, J. Kwok, and B.-L. Lu (Eds.): ISNN 2010, Part II, LNCS 6064, pp. 282–291, 2010. c Springer-Verlag Berlin Heidelberg 2010
A New Closeness Metric for Social Networks
283
subgraphs based on hub and authority values separately are proposed by Lu [6], and Prasanta K. Pattanaik axiomatically characterizes several closeness metrics based on the shortest benign chains [7]. In our work, we axiomatically develop a metric of closeness between two individuals based on the k shortest paths between them, which is a general extension of metrics determined by the shortest path. The k shortest paths problem is to list the k paths connecting a given source-destination pair in the network with k-minimum length, enabling our defined metric to reveal more useful embedding information to examine the closeness than ordinary metrics. Moreover, our metric assigns each of the k shortest paths an appropriate weight, which is computed by a convex model induced from some fundamental principles. Our analysis proceeds in three stages. In section 2, we first propose a set of definitions and criteria as the principles aiming at formalizing the concept of closeness in a social network and developing a theoretical analysis. Then, we prove that each metric forced by the above definitions and criteria meets the metric axioms. A convex model with linear constraints is constructed to solve the optimal weight parameter of the defined metric for a particular network. In section 3, three improved methods are presented to reduce useless constraints. Finally, experiments in several types of real and simulated networks are demonstrated in section 4.
2 Theoretical Results
Let (V, E) denote a social network in graph form, where V is a finite set with n elements called nodes and E is a finite set with m elements called edges. Note that, although the examples and experiments in this paper are on an undirected graph, all our analysis can also be applied to directed graphs. We provide a small-scale undirected social network as an example (see figure 1). Obviously, A is a central node, so the shortest path between any pair of nodes connected through A is of length 2. However, we consider that the pair (G, H) should have a closer connection because they belong to the community surrounded by the dashed circle, or, in other words, there are more connections between them. Unfortunately, the metrics determined by the shortest path simply ignore this information. Inspired by this intuitive idea, we propose a new metric of closeness between individuals based on the k shortest paths, aiming to explore more potential and useful information.
Fig. 1. A small-scale undirected social network with 11 nodes and 15 edges
Definition 1. For any pair of nodes in (V, E), say x and y, the relation distance between them is defined as

d_xy = Σ_{i=1}^{k} w_i δ_xy^(i) = (w_1, w_2, …, w_k) · (δ_xy^(1), δ_xy^(2), …, δ_xy^(k))^T = w^T · Δ_xy,   (1)

where δ_xy^(i) is the length of the ith shortest path from x to y, and w_i is its corresponding weight, subject to the two constraints w_i ≥ 0 and Σ_{i=1}^{k} w_i = 1.
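As a concrete illustration of Definition 1, the minimal sketch below computes d_xy from the k shortest simple path lengths. It is only a sketch under assumptions: it relies on the networkx library's shortest_simple_paths generator (simple paths, not walks), pads missing paths with length n as is done in the experiments of Section 4, and uses a hypothetical toy graph and weight vector.

```python
from itertools import islice
import networkx as nx

def relation_distance(G, x, y, w):
    """d_xy = sum_i w_i * delta_xy^(i) (Definition 1); missing paths are given length n."""
    k, n = len(w), G.number_of_nodes()
    try:
        paths = islice(nx.shortest_simple_paths(G, x, y), k)   # k shortest simple paths
        lengths = [len(p) - 1 for p in paths]                  # path length = number of edges
    except nx.NetworkXNoPath:
        lengths = []
    lengths += [n] * (k - len(lengths))                        # pad if fewer than k paths exist
    return sum(wi * di for wi, di in zip(w, lengths))

# hypothetical fragment of a network in the style of figure 1
G = nx.Graph([("A", "G"), ("A", "H"), ("G", "H"), ("G", "F"), ("H", "I")])
print(relation_distance(G, "G", "H", w=[0.9, 0.05, 0.05]))
```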
The definition of d_xy is very concise: it is just a linear combination of the lengths of the k shortest paths from x to y. Compared with the metrics determined by the shortest path alone, d_xy involves more of the connecting information, so it is able to distinguish pairs that have the same lengths for the first h (1 ≤ h < k) shortest paths.

Criterion 1. Let δ_xy^(0) = δ_ij^(0) = 0. If δ_xy^(0∼h) = δ_ij^(0∼h) and δ_xy^(h+1) < δ_ij^(h+1) (δ_xy^(h+1) > δ_ij^(h+1)), then d_xy < d_ij (d_xy > d_ij); otherwise, d_xy = d_ij.

Criterion 2. w_1 ≥ w_2 ≥ · · · ≥ w_k ≥ 0.

Criterion 1 states when the metric is well defined, so that it correctly reflects the relation between two pairs of nodes, say (x, y) and (i, j). In fact, this criterion is a general extension of the rule applied in the metric determined by the shortest path. Criterion 2 is even more natural: the shorter a path between a pair of nodes is, the larger its weight should be. In order to simplify our statement, we make the following definition according to criterion 1.

Definition 2. The sign of (Δ_xy − Δ_ij) is defined as

Sign(Δ_xy − Δ_ij) = Sign( (δ_xy^(1) − δ_ij^(1), δ_xy^(2) − δ_ij^(2), …, δ_xy^(k) − δ_ij^(k))^T )
  =  1, if δ_xy^(1∼h) = δ_ij^(1∼h) and δ_xy^(h+1) > δ_ij^(h+1),
  = −1, if δ_xy^(1∼h) = δ_ij^(1∼h) and δ_xy^(h+1) < δ_ij^(h+1),
  =  0, if δ_xy^(1∼k) = δ_ij^(1∼k).   (2)
Given these definitions and criteria, we can deduce an important property of our metric.

Theorem 1. If a metric defined according to definition 1 satisfies criteria 1 and 2, it meets the metric axioms.

Proof. The non-negativity of the metric is obviously true by the definition. If the social network is undirected, the symmetry also holds. We next show that the triangle inequality is satisfied.
We formalize the problem as

d_xy + d_yz − d_xz = (w_1, w_2, …, w_k) · (δ_xy^(1) + δ_yz^(1) − δ_xz^(1), δ_xy^(2) + δ_yz^(2) − δ_xz^(2), …, δ_xy^(k) + δ_yz^(k) − δ_xz^(k))^T.   (3)

It is easy to see that δ_xy^(1) + δ_yz^(1) − δ_xz^(1) ≥ 0. Assume that δ_xy^(h) + δ_yz^(h) − δ_xz^(h) ≥ 0 holds for N = h (h = 1, 2, …, k − 1); for N = h + 1 we analyze two cases:

1. δ_xy^(h+1), δ_yz^(h+1) and δ_xz^(h+1) are all finite. If δ_xy^(h+1) + δ_yz^(h+1) − δ_xz^(h+1) ≥ 0 fails, there is at least one path of length δ̃_xz^(h+1), e.g. δ̃_xz^(h+1) = δ_xy^(h+1) + δ_yz^(h+1), that satisfies

δ̃_xz^(h+1) = δ_xy^(h+1) + δ_yz^(h+1) ≥ δ_xy^(h) + δ_yz^(h) ≥ δ_xz^(h)  ⇒  δ_xz^(h) ≤ δ̃_xz^(h+1) < δ_xz^(h+1),

so δ_xz^(h+1) cannot be the (h+1)th shortest path of the pair (x, z), which contradicts its definition. Hence δ_xy^(h+1) + δ_yz^(h+1) − δ_xz^(h+1) ≥ 0 is satisfied.

2. At least one of δ_xy^(h+1), δ_yz^(h+1) and δ_xz^(h+1) is infinite. If δ_xy^(h+1) or δ_yz^(h+1) is infinite, then δ_xy^(h+1) + δ_yz^(h+1) − δ_xz^(h+1) ≥ 0 holds regardless of whether δ_xz^(h+1) is infinite. If δ_xz^(h+1) is infinite, we prove by contradiction that δ_xy^(h+1) or δ_yz^(h+1) must also be infinite: suppose both δ_xy^(h+1) and δ_yz^(h+1) were finite; then there would exist a path of finite length δ̃_xz^(h+1), e.g. δ̃_xz^(h+1) = δ_xy^(h+1) + δ_yz^(h+1), satisfying δ_xz^(h) ≤ δ̃_xz^(h+1) < δ_xz^(h+1) = ∞, so δ_xz^(h+1) could not be the (h+1)th shortest path between the pair (x, z), which again contradicts its definition. Hence δ_xy^(h+1) + δ_yz^(h+1) − δ_xz^(h+1) ≥ 0 is also satisfied in this case.

As shown above, δ_xy^(h+1) + δ_yz^(h+1) − δ_xz^(h+1) ≥ 0 is true for N = h + 1. Therefore, equation (3) yields d_xy + d_yz − d_xz ≥ (w_1, w_2, …, w_k) · 0 = 0. Thus this metric satisfies the metric axioms.

Theorem 1 gives a theoretical guarantee of the correctness of our metric. However, a whole set of weight vectors w meets these conditions; which one is the best, and how can it be found? To deal with these two questions, we propose another definition and criterion.

Definition 3. The spread level of the solved distances between nodes in (V, E) is defined as ψ = f(d_ij), i, j = 1, 2, …, n. Some example forms of ψ are listed below:

ψ_1 = std(D),  ψ_2 = Σ_{i ∈ δ_·^(1)} std(D_i),  ψ_3 = Σ_{i ∈ δ_·^(1)} (max(D_i) − min(D_i)),

where D is the set of all ordered pairs in (V, E), D_i is the set of all ordered pairs which have the same shortest path length i, and δ_·^(1) represents the set of all shortest path lengths between nodes in (V, E). It is obvious that {D_i | i ∈ δ_·^(1)} is a partition of D. In addition, std(D) is the standard deviation of the solved distances of the pairs in D, max(D_i) is the maximum of the solved distances of the pairs in D_i, and min(D_i) is the minimum. Following this conception, we construct an important criterion by which the performance of a metric is evaluated.

Criterion 3. The larger the spread level ψ derived by a metric, the better the metric.

This criterion assigns a higher score to a metric which discovers more differences between the pairs of nodes. Now we can describe our basic theoretical model, which aims to find the optimal solution w that achieves the largest spread level ψ while satisfying both criterion 1 and criterion 2. The model is formed as:

arg max_w  ψ
s.t.  (d_ij − d_xy) · Sign(Δ_ij − Δ_xy) ≥ 0,  i, j, x, y = 1, 2, …, n,
      Σ_{i=1}^{k} w_i = 1,
      w_i ≥ w_{i+1} ≥ 0,  i = 1, 2, …, k − 1.   (4)
This basic model is a simple convex optimization problem with linear constraints which has been well-studied.
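To make the model concrete, the following sketch solves a small instance of (4) with an off-the-shelf solver. It is a minimal illustration under stated assumptions: SciPy's SLSQP routine is used as a generic stand-in for whatever solver the authors employed, ψ1 (the standard deviation of all solved distances) is taken as the target function, and Delta is a hypothetical dictionary mapping each ordered node pair to its vector of k shortest path lengths.

```python
import numpy as np
from scipy.optimize import minimize

def sign_vec(delta_a, delta_b):
    """Sign(Delta_a - Delta_b) as in Definition 2: compare the first differing entry."""
    for da, db in zip(delta_a, delta_b):
        if da > db:
            return 1
        if da < db:
            return -1
    return 0

def solve_basic_model(Delta, k):
    """Delta: {ordered node pair -> list of its k shortest path lengths}."""
    pairs = list(Delta.keys())

    def spread(w):                         # psi_1: standard deviation of all solved distances
        return np.array([np.dot(w, Delta[p]) for p in pairs]).std()

    cons = [{'type': 'eq', 'fun': lambda w: w.sum() - 1.0}]
    for i in range(k - 1):                 # ordering constraints w_i >= w_{i+1}
        cons.append({'type': 'ineq', 'fun': lambda w, i=i: w[i] - w[i + 1]})
    cons.append({'type': 'ineq', 'fun': lambda w: w[k - 1]})          # w_k >= 0
    # pairwise consistency constraints (quadratic in the number of pairs)
    for a in pairs:
        for b in pairs:
            s = sign_vec(Delta[a], Delta[b])
            if s != 0:
                cons.append({'type': 'ineq',
                             'fun': lambda w, a=a, b=b, s=s:
                                 s * (np.dot(w, Delta[a]) - np.dot(w, Delta[b]))})

    w0 = np.full(k, 1.0 / k)
    return minimize(lambda w: -spread(w), w0, constraints=cons, method='SLSQP').x

# hypothetical toy input: three pairs with their 3 shortest path lengths
print(solve_basic_model({('G', 'H'): [1, 2, 3], ('A', 'B'): [2, 3, 5], ('A', 'I'): [2, 4, 5]}, k=3))
```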
3 Improved Methods
Obviously, our basic model defined by equation (4) has at most C(n^2, 2) + k constraints, which leads to decreased efficiency as the scale of the network increases. Cutting down useless constraints therefore becomes a necessary task.

3.1 Method 1
On the basis of analyzing the first constraint of equation (4), we have

(d_ij − d_xy) · Sign(Δ_ij − Δ_xy) = [ w_q (δ_ij^(q) − δ_xy^(q)) + Σ_{p=q+1}^{k} w_p (δ_ij^(p) − δ_xy^(p)) ] · Sign(δ_ij^(q) − δ_xy^(q)) > 0,   (5)

where, for brevity, q = idx(Δ_ij, Δ_xy) denotes the index of the first different element between Δ_xy and Δ_ij. It is obvious that equation (5) always holds when

(δ_ij^(p) − δ_xy^(p)) · Sign(δ_ij^(q) − δ_xy^(q)) ≥ 0,  p = q + 1, …, k,   (6)
and thus this sort of constraint can be removed. This improvement is quite simple and, in most cases, removes only a small number of constraints, because the constraints satisfying equation (6) are in a tiny minority.

3.2 Method 2
An illuminating method is demonstrated in this subsection. It is supposed that the useless constraints described in method 1 have already been removed. We start with the analysis of a simple case, i.e. k=2. In this case, the second and the third constraints limit the feasible region to the line segment between (0.5, 0.5) and (1, 0) (see figure 2(a)). All elements of the first constraint can be represented in the form a·w_1 + b·w_2 > 0 (a > 0, b < 0). On closer consideration, we see that only the line with the least positive slope, denoted l_min, is effective. As mentioned above, the feasible region is then the segment of the line w_1 + w_2 = 1 between (1, 0) and the minimum of (0.5, 0.5) and the intersection point of w_1 + w_2 = 1 with l_min.

When k=3, the feasible region is limited by the second and the third constraints to a closed area surrounded by (1, 0, 0), (0.5, 0.5, 0) and (1/3, 1/3, 1/3) (see figure 2(b)). After removing the useless constraints described in method 1, all remaining elements of the first constraint can be represented as a·w_1 + b·w_2 + c·w_3 > 0 (a > 0, b < 0 or c < 0) or b·w_2 + c·w_3 > 0 (b > 0, c < 0). We reduce the first constraint so that at most 2τ − 1 elements are left, through the following steps:

Step 1. Denote the intersection points of the planes of the first constraint with the line segment between (1, 0, 0) and (0, 1, 0), in descending order, as ϕ_1, ϕ_2, …, ϕ_num(w1w2), where num(w1w2) is the number of such points. Similarly, denote the intersection points of the first constraint with the line segment between (1, 0, 0) and (0, 0, 1), in descending order, as φ_1, φ_2, …, φ_num(w1w3).

Step 2. If num(w1w2) and num(w1w3) are both bigger than 0, let μ = min{τ, num(w1w2)} and ν = min{τ, num(w1w3)}; then take the planes corresponding to the first μ − 1 points of ϕ_1, …, ϕ_num(w1w2) and the first ν − 1 points of φ_1, …, φ_num(w1w3) as useful constraints, and finally construct an enhanced constraint with the plane determined by the point ϕ_μ, the point φ_ν and the origin. If num(w1w3) = 0 (num(w1w2) = 0), we just consider the planes corresponding to the first μ (ν) points of ϕ_1, …, ϕ_num(w1w2) (φ_1, …, φ_num(w1w3)).

The cases of k>3 can be analyzed with the same idea.
Fig. 2. (a) The feasible region when k=2. (b) The feasible region when k=3.
3.3 Method 3
Another intuitive idea is to directly enhance the constraints. We make the following inference: if Δ_xy < Δ_ij, then

d_xy < d_ij ⇔ Σ_{p=q}^{k} w_p (δ_xy^(p) − δ_ij^(p)) < 0 ⇐ −w_q + Σ_{p=q+1}^{k} w_p (max δ_·^(p) − min δ_·^(p)) < 0,   (7)

where q = idx(Δ_xy, Δ_ij) and δ_·^(i) = { δ_pq^(i) | p, q = 1, 2, …, n }. Similarly, we can get: if Δ_xy > Δ_ij, then

d_xy > d_ij ⇔ Σ_{p=q}^{k} w_p (δ_xy^(p) − δ_ij^(p)) > 0 ⇐ −w_q + Σ_{p=q+1}^{k} w_p (max δ_·^(p) − min δ_·^(p)) < 0.   (8)

After applying these simplifications, we put the basic model into a new form:

arg max_w  ψ
s.t.  −w_q + Σ_{p=q+1}^{k} w_p (max δ_·^(p) − min δ_·^(p)) < 0,  q = 1, 2, …, k,
      Σ_{i=1}^{k} w_i = 1,
      w_i ≥ w_{i+1} ≥ 0,  i = 1, 2, …, k − 1,   (9)

where there are only 2k+2 constraints. Although the feasible region is shrunk by the enhanced constraints, the complexity of our model is greatly simplified.
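For illustration, the sketch below builds the k enhanced constraints of model (9) as a matrix, reusing the hypothetical Delta structure from the earlier sketch; a strict inequality is approximated with a small margin, since standard solvers accept only non-strict constraints. This is a sketch of the idea, not the authors' implementation.

```python
import numpy as np

def enhanced_constraint_matrix(Delta, k):
    """Row q encodes -w_q + sum_{p>q} w_p (max delta^(p) - min delta^(p));
    model (9) requires A @ w < 0 for every row."""
    L = np.array(list(Delta.values()), dtype=float)      # one row of k path lengths per node pair
    spread = L.max(axis=0) - L.min(axis=0)               # max delta^(p) - min delta^(p)
    A = np.zeros((k, k))
    for q in range(k):
        A[q, q] = -1.0
        A[q, q + 1:] = spread[q + 1:]
    return A

# usage with the SLSQP setup above (hypothetical): replace the pairwise constraints by
#   {'type': 'ineq', 'fun': lambda w: -(A @ w) - 1e-9}
```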
4 Experiments
In this section, we investigate the relevance of the proposed metric and the corresponding model on a number of simulated and real networks. The simulated networks include the example network shown in figure 1 and random networks generated by Waxman's method [8]. The real social networks are Zachary's karate club [9] and the dolphins network [10]. Note that method 1 is not adopted in these tests because of its weak performance in cutting down constraints. In all experiments, we set k=3 and τ=n, and if the hth shortest path does not exist, we assign its length as n.
4.1 The Example Network in Figure 1
To begin with, the social network described in figure 1 is tested with our model. The results of the optimal weight and the solved distances are demonstrated in table 1 and table 2 respectively.
Table 1. The solutions of w using different target functions and improved methods

            ψ1                  ψ2                  ψ3
Method 2    (1.00 0.00 0.00)    (0.90 0.10 0.00)    (0.90 0.05 0.05)
Method 3    (1.00 0.00 0.00)    (0.90 0.10 0.00)    (0.90 0.09 0.01)

Table 2. Distances of part of the pairs when using ψ3 and method 3

     A     B     C     D     E     F     G     H     I     J     K
A   0.00  2.00  2.00  2.00  2.00  2.00  1.20  1.20  2.01  2.01  2.01
B         0.00  2.90  2.90  2.90  2.90  2.20  2.20  3.01  3.01  3.01
G                                       0.00  2.00  1.11  1.10  1.11
H                                             0.00  1.11  1.10  1.11
Random Networks
We adapt Waxman’s method [9] to generate three networks with λ=0.6, α=0.6, β=0.2 and [ xmin , xmax , ymin, ymax ] = [0 10 0 10] (see figure 3). We present the results in table 3.
(a)
(b)
(c)
Fig. 3. (a) 54 nodes, 181 edges. (b) 44 nodes, 84 edges. (c) 69 nodes, 250 edges. Table 3. The solutions of w using different improved methods Random network A
Random network B
Random network C
Method 2 (0.7500 0.1250 0.1250) (0.6667 0.1667 0.1667) (0.6667 0.1667 0.1667) Method 3 (0.7500 0.1825 0.0625) (0.6667 0.2222 0.1111) (0.7500 0.1825 0.0625)
290
C. Shang et al.
From table 3, we can intuitively conclude that the results with method 2 are less effective than those with method 3, because the results of method 3 clearly distinguish the difference between weight of the second shortest path and weight of the third. Considering there are too many nodes in every network, the solved distances are not illustrated in our paper. 4.3
Real Social Networks
Two real social networks (see figure 4) are analyzed using our model and the results of the optimal weight is presented in table 4.
(b)
(a)
Fig. 4. (a) Zachary’s karate club. (b) dolphins network. Table 4. The solutions of w using different improved methods Karate club
Dolphins network
Method 2 (0.9697 0.0152 0.0152) (0.9836 0.0082 0.0082) Method 3 (0.9697 0.0294 0.0009) (0.9836 0.0161 0.0003)
From the solutions we realize that the weight of the shortest path is close to one, implying that the shortest path dominates the relation between nodes.
5
Conclusion and Future Work
In our research, we have axiomatically developed a new metric of the personal connection between two individuals and an optimal model to find the best weight of the metric. From the test results, the metric is promising and efficient in distinguishing the degree of personal connection between two individuals of random networks and real social networks. Our work can be viewed as an effort to mine more useful information from the graph structure of a social network. However, several problems remains to be researched. One important problem is to improve the weight w in real social network data aiming to assign weights more properly. Another issue is to consider whether embeddability conditions can be added to our optimal model in order to guarantee the solved metric space can be embedded into a Hilbert space.
A New Closeness Metric for Social Networks
291
Acknowledgments. This work is supported in part by the National Science Foundation project of Tianjin, China (grant 09JCYBJC00200).
References 1. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications. Cambridge Univ. Press, Cambridge (1994) 2. Mislove, A., Marcon, M., Gummadi, K.P., Druschel, P., Bhattacharjee, S.: Measurement and Analysis of Online Social Networks. In: Proceedings of the 7th ACM/USENIX Internet Measurement Conference, pp. 29–42. ACM, New York (2007) 3. Garriss, S., Kaminsky, M., Freedman, M.J., Karp, B., Mazi‘eres, D., Yu, H.: Re: Reliable Email. In: Proceedings of the 3rd Symposium on Networked Systems Design and Implementation (NSDI 2006), San Jose, CA (2006) 4. Mislove, A., Gummadi, K.P., Druschel, P.: Exploiting Social Networks for Internet Search. In: Proceedings of the 5th Workshop on Hot Topics in Networks (HotNetsV), Irvine, CA (2006) 5. Yu, H., Kaminsky, M., Gibbons, P.B., Flaxman, A.: Defending against Sybil Attacks via Social Networks. In: Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM 2006), Pisa, Italy (2006) 6. Wangzhong, L., et al.: Node Similarity in Networked Information Spaces. In: Proceedings of the 2001 Conference of the Centre for Advanced Studies on Collaborative Research. IBM Press, Toronto (2001) 7. Pattanaik, P.K., Xu, Y.: On Measuring Personal Connections and the Extent of Social Networks. Andrew Young School of Policy Studies Working Paper (2003) 8. Waxman, B.M.: Routing of Multipoint Connections. IEEE Journal of Selected Areas in Communication 9, 1617–1622 (1988) 9. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. Physical Review E 69, 026113 (2004) 10. Lusseau, D., Schneider, K., Boisseau, O.J., Haase, P., Slooten, E., Dawson, S.M.: The Bottlenose Dolphin Community of Doubtful Sound Features a Large Proportion of Long-lasting Associations. Behavioral Ecology and Sociobiology 54, 396–405 (2003)
A Location Based Text Mining Method Using ANN for Geospatial KDD Process
Chung-Hong Lee1, Hsin-Chang Yang2, and Shih-Hao Wang1
1 Dept. of Electrical Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan
[email protected], [email protected]
2 Dept. of Information Management, National Kaohsiung University, Kaohsiung, Taiwan
[email protected]
Abstract. With the increasing need for location information, applications of geospatial information have gained a lot of attention in both research and commercial organizations. Extraction of geospatial knowledge from information content has thus become an important process. Among these applications, a typical example is to discover relationships between various geospatial texts/data and specific locations. In this paper, we describe a location based text mining approach using Artificial Neural Networks (ANN) to classify texts into various categories based on their geospatial features, with the aim of discovering relationships between documents and zones. First, the collected documents are mapped to corresponding zones by the Adaptive Affinity Propagation (Adaptive AP) clustering technique; then we frame maximize zones by means of Fuzzy ARTMAP (FAM) and Support Vector Machines (SVM) methods, allowing the relationships between documents and zones to be well presented. Eventually, we compare our experimental results with those of baseline models using Self-organizing maps (SOM) and Learning Vector Quantization (LVQ) methods. The preliminary results show that our platform framework has potential for geospatial knowledge discovery. Keywords: text mining, geospatial data mining, clustering, classification, neural networks.
1 Introduction

Location information provided on the web is a valuable and useful resource for a multitude of applications and is becoming an important element of content in web search. Recently, several studies have reported that 5% to 20% of all user queries express a geographic information need, and an estimated up to 20% of all web pages contain location references. In particular, a huge amount of geographic information is utilized and generated by various technologies, such as Location-based services (LBS) and Location-aware technologies (LAT), and is represented in the form of images or documents on the internet. Under such a circumstance,
geospatial search engines therefore have to identify location-relevant content of web pages and analyze its location semantics and related images. Meanwhile, this can also support the search results of general search engines. For instance, the Universal Search in Google, which added images, videos, and maps to search results that once were all just Web pages, is a radical change. Geospatial knowledge discovery (also known as Geospatial KDD) is emerging as a novel and broad application research topic, covering the fields of geospatial information science, environment management, national security, and applications of location-based services, etc. Examples of geospatial data include data that describe the evolution of natural phenomena, earth science data that describe spatiotemporal phenomena in the atmosphere, or data that describe the location of individuals in geographic space as a function of time. Geographical data represent an embedding into Euclidean space, and therefore give distance information and topological information. In essence, the domain of the attributes is normally the real numbers. Also, geographical data sets can record additional information, such as environmental parameters and location-based tourism information, therefore defining high-dimensional spaces of attributes that are highly correlated. As a result, geospatial data mining methods offer solutions for finding and describing patterns in geospatial data collections that were previously unknown and are not explicitly stored in the database. The application scenario for the attempted solution is a location-based information system for mobile or pedestrian users. We aim to identify location references in the form of texts and images at a fine granularity level of individual buildings or tourist attractions that is directly applicable to a mobile user or to retrieval and analysis tasks at this geographical granularity. The major location references include not only photos (images), but also texts. To process text resources for exploring the document-to-location relationships, text mining techniques with the support of geographic references are often considered. As a result, it is necessary to develop a hybrid solution involving text and image resources to tackle the issue. In this work, we concentrate only on the aspect of text mining for geospatial knowledge discovery. Taking this view, in this work we aim to develop a location-based text mining approach to link the mined regional documents to the developed geographic information platform. For geospatial information applications, the data types being handled normally integrate various multimedia (e.g. text, video, image, and sound) with geospatial features. Geospatial data can be obtained by using mobile devices with LAT techniques to collect individual movement patterns. These data contain massive geospatial features. The rest of the paper is organized as follows: Section 2 discusses related research on technologies and concepts. Section 3 presents the architecture of our approach. Section 4 shows our experimental results, and finally we conclude this paper in Section 5.
2 Related Work

The previous work related to categorizing documents into various zones based on geospatial features can be found in the fields of geographic information retrieval and geospatial/geographic data mining. They are described in detail as follows.
2.1 Geographic Information Retrieval

Geographic Information Retrieval can be regarded as a specialized branch of traditional Information Retrieval. It covers all of the research areas that have traditionally made up the core of research into Information Retrieval, but in addition has an emphasis on spatial and geographic indexing and retrieval. For instance, Sallaberry [16] proposes a core model to formally present geographic information, and also develops the PIV system that combines information extraction, information retrieval and information visualization according to the geographic characteristics of documents. Jones [11] presents an ontology of place that combines limited coordinate data with spatial relationships between places, and can be employed to obtain semantic distance measures in geographically-referenced information retrieval. Buscaldi [3] integrates data from gazetteers (GNS and GNIS), the WordNet general domain ontology, and Wikipedia, to produce an ontology that can be used as a source for the Geographical Information Retrieval task; by an extension of the vector space model, such a model is applied to the Expert Search task. Jones [12] organizes large-scale search engine relevance experiments, using the 12% of queries that contain placenames, matching the placenames to places in the documents, and examining the impact of geographic features on retrieval relevance. Cardoso [4] presents the participation of the University of Lisbon at the 2007 GeoCLEF; they adopt a novel approach for GIR that focuses on handling geographic features and feature types in both queries and documents, and on generating signatures with multiple geographic concepts as a scope of interest. McCurley [15] investigates several approaches to discovering geographic context for web pages, and a navigational tool is described for web browsing by geographic proximity.

2.2 Geospatial/Geographic Data Mining

On the other hand, several related works focus on the research field of geospatial data mining. In the direction of data mining methods, clustering and classification techniques are most relevant to the problem domain. Estivill-Castro [7] develops a genetic search heuristic for solving the medoid-based clustering problem and applies a genetic algorithm to provide a more efficient implementation of Random Assorting Recombination. Zalik [22] proposes an efficient hierarchical spatial clustering algorithm that detects well-separated, possibly nested clusters of arbitrary shapes, and can be turned into a streaming algorithm for clustering large spatial databases without any modification. Tay [18] determines regular patterns by clustering and Hough transformation in the derived hotspots and classifies hot spots as false alarms to reduce false alarms. Vatsavai [19] proposes a hybrid semi-supervised learning method based on the expectation maximization (EM) algorithm and uses it to extract thematic information from geospatial data sources. Also, several teams attempt to discover geographic patterns. Xiao [21] proposes a co-location pattern discovery technique, called density co-location pattern discovery, to improve performance and reduce memory consumption by using density to select areas. Bogorny [2] presents a hybrid method for mining frequent geographic patterns without well-known geographic dependences by the combination
of reducing input spaces and eliminating remaining geographic dependences. In addition, several related works concentrate on dealing with various data types. Gelernter [9] develops a system to find digital maps by organizing maps mined from journal articles into categories within region, time and theme facets. Accordingly, our research will focus on classifying documents into various categories, discovering relationships between documents and zones, and generating various data sets by their geospatial features for more advanced data fusion tasks.
3 System Framework and Methods

3.1 System Overview

In this section, we describe our system framework and proposed approaches. There are three main components in our framework, including modules of geospatial text preprocessing, document-to-zone mapping, and framing maximize zones. The system framework is illustrated in Fig. 1.
Fig. 1. System Framework
As shown in Fig. 1, the proposed system framework includes three major modules (processes). In the first phase, we collect documents with geospatial contents and convert them into a term-document matrix. Subsequently, we employ clustering techniques to map documents to corresponding zones, and select the correct-clustered documents, which are identified as location-related ones, from the resulting clusters. After that, the correct-clustered documents are used as training samples to train a classifier, and we use the trained classifier to classify newly unknown documents. By analyzing maximize zones, the relationships between test documents and geographic locations can be discovered.

3.2 Geospatial Text Collection and Preprocessing

The system development started with a collection of geospatial documents introducing scenic spots, for constructing a term-document matrix. Then we
employed a classic text retrieval method, the vector space model (VSM), to represent text documents in the form of vectors. Each collected document includes several geospatially related proper nouns. For our experiments, geographical nouns and proper nouns were collected to construct a lexicon, including geospatial named entities such as Taipei 101, Love River, Palace Museum, etc. The terms extracted from the documents are then used to establish a feature vector, represented as a term-document matrix. Furthermore, we employed the Latent Semantic Indexing (LSI) method to construct a semantic space with reduced dimensions, and utilized the matrix for document-to-zone mapping, described in the next subsection in detail.

3.3 Mapping Documents to Corresponding Zones

After constructing the term-document matrix, we mapped documents to corresponding zones by means of an effective clustering method called affinity propagation (AP) [8]. The AP method has several advantages:
• AP is capable of clustering data in the case of a large number of clusters.
• AP has good performance in the case of a large number of clusters.
Accordingly, we applied an adaptive method for AP as our clustering approach to enhance the system functionality. This technique, called Adaptive Affinity Propagation, was proposed by Wang [20]. The major concept in mapping documents to corresponding zones is to find a group of documents such that the group is sufficient to represent a geographical region. For a practical geographical place, neighboring places may occur in the same document simultaneously. The fundamental idea is illustrated in Fig. 2. As such, neighboring geographical terms are aggregated closely and overlap among documents. By means of the Affinity Propagation (AP) approach, related documents are aggregated together according to their geospatial features. According to the clustering result, geographic documents can be divided into two parts: correct-clustered documents and incorrect-clustered documents. The correct-clustered documents will be used as our training data for training classifiers.
Fig. 2. Illustration of relationships between documents and zones
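The document-to-zone mapping can be prototyped with standard components; the sketch below uses scikit-learn's TF-IDF vectorizer restricted to a geographic lexicon, truncated SVD as an LSI stand-in, and the plain Affinity Propagation implementation rather than the Adaptive AP variant of Wang et al. used by the authors. The toy corpus and lexicon are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import AffinityPropagation

docs = [  # hypothetical toy corpus
    "taipei 101 towers over the palace museum district",
    "the palace museum and taipei 101 attract many visitors",
    "a night market near taipei 101 and the palace museum",
    "love river flows through kaohsiung near the harbor",
    "kaohsiung harbor and love river at night",
    "ferries cross kaohsiung harbor close to love river",
]
geo_lexicon = ["taipei", "palace", "museum", "kaohsiung", "love", "river", "harbor"]

X = TfidfVectorizer(vocabulary=geo_lexicon).fit_transform(docs)   # term-document matrix
X_lsi = TruncatedSVD(n_components=2).fit_transform(X)             # LSI semantic space

ap = AffinityPropagation(damping=0.5, random_state=0)             # plain AP, not Adaptive AP
labels = ap.fit_predict(X_lsi)
print(labels)                                                      # zone-like document clusters
```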
3.4 Framing Maximize Zones

To classify the processed documents based on locations, we used a supervised learning neural network called Fuzzy ARTMAP (FAM) [5] as our classification approach, employing correct-clustered documents as training samples for framing maximize zones. After categorizing documents according to their geospatial features, the resulting documents can be divided into two parts: correct-classified documents and incorrect-classified documents. The correct-classified documents were used as training samples to frame decision boundaries with Support Vector Machines (SVM) [1][6][10][17], and these boundaries are used in our work as maximize zones.
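As a sketch of this final step, the snippet below frames decision boundaries with scikit-learn's SVM on correct-classified documents; the two-dimensional feature vectors and zone labels are hypothetical placeholders for the geospatial document features, and the FAM stage is omitted because Fuzzy ARTMAP has no standard scikit-learn implementation.

```python
import numpy as np
from sklearn.svm import SVC

# hypothetical correct-classified documents in a 2-D feature space, labelled by zone
X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]])
y_train = np.array([0, 0, 0, 1, 1, 1])            # 0 = Taipei zone, 1 = Kaohsiung zone

svm = SVC(kernel="linear").fit(X_train, y_train)  # the boundary frames the "maximize zone"
print(svm.predict([[0.6, 0.4]]))                  # assign an unseen document to a zone
```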
4 Experiments

4.1 Geospatial Text Collection and Preprocessing

We collected 120 documents from Tourism Bureau and Kaohsiung City Government websites as training corpora, and then built a geographic lexicon from these collected documents. Each document presents an introduction about the city and surrounding landscapes, as shown in Table 1. Finally, the terms in the lexicon were used to convert each document into a feature vector.

Table 1. Numbers of text samples

Varieties of texts                        Number of clustering set    Number of classification set
Samples from Taipei city                  40                          20
Samples from Kaohsiung city               40                          20
Selected samples for clustering           80                          -
Selected samples for classification       -                           40
4.2 Mapping Documents to Corresponding Zones

As mentioned above, we employed the Adaptive Affinity Propagation (Adaptive AP) method to map documents to corresponding zones, and compared the results with those of experiments using Self-organizing maps [14]. The collected 120 documents containing scenic spot information of Taipei and Kaohsiung city are parsed in order to create a term-document matrix. After that, we randomly selected 80 documents as the clustering data set, and the other ones as test data sets. In this stage, geospatial features of texts are extracted and divided into two clusters, which indicate documents related to either Taipei or Kaohsiung City, respectively. Adaptive AP and SOM are then performed with various parameter settings. Detailed parameters and experimental results are presented in Table 2 and Fig. 3.
Table 2. Parameters of the Adaptive AP and SOM approaches in the experiments

Adaptive Affinity Propagation (Adaptive AP)
  Convergence condition (CC) and damping factor            20, 0.5
  Decreasing step of preference (pstep)                    0.005
Self-organizing maps (SOM)
  Topology function and distance function                  hextop, linkdist
  Ordering-phase learning rate (LR) and steps              0.3, 200
  Tuning-phase learning rate and neighborhood distance     0.02, 0.001
[Figure 3 data — numbers of correct-/incorrect-clustered documents per zone: Taipei city: Adaptive AP 37 correct / 3 incorrect, SOM 39 correct / 1 incorrect; Kaohsiung city: Adaptive AP 39 correct / 1 incorrect, SOM 33 correct / 7 incorrect.]
Fig. 3. Results of Adaptive AP and SOM experiments
As the results in Fig. 3 show, documents are mapped to various zones according to their geospatial features. In this case, the accuracies of both clustering techniques on documents related to the Taipei and Kaohsiung zones are nearly the same. However, for Kaohsiung-related documents, the clustering accuracy in the SOM experiments is lower than that of the Adaptive AP method. The reason is perhaps that the semantics of the documents are blurred, and geographic nouns of various locations are likely to be included in the same documents simultaneously. Under such a circumstance, the performance of our proposed technique, Adaptive AP, appears to be superior to that of the SOM method.

4.3 Framing Maximize Zones

Subsequently, we employed the Fuzzy ARTMAP (FAM) neural network as our approach to framing maximize zones and compared the results with those of Learning Vector Quantization (LVQ) [13]. After mapping documents to corresponding zones, the correct-clustered data sets are selected as our training data, and the test data sets are categorized into two classes. Detailed parameters and experimental results are shown in Table 3 and Fig. 4. In the results of the LVQ experiment, the number of correct-classified documents in the Taipei category is slightly higher than that of the FAM method. However, the number of correct-classified documents in the Kaohsiung category is lower than that of the FAM experiment. Reviewing the results of both approaches, the FAM technique is more
stable and accurate for geospatial text classification. Also, the correct-classified documents are further used to frame maximize zones, and the experimental results are shown in Fig. 5. The correct-classified documents were used as training samples to frame decision boundaries with Support Vector Machines (SVM) methods, and these boundaries are used in our work as maximize zones.

Table 3. Parameters of FAM and LVQ

Fuzzy ARTMAP (FAM)
  Vigilance and bias                                       0.75, 1.0e-6
  Learning rate (LR)                                       1
Learning Vector Quantization (LVQ) network
  Number of hidden neurons (S1)                            4
  S2-element vector of typical class percentages (PC)      0.4~0.6
  Learning rate (LR) and learning function                 0.01 and LVQ2
[Figure 4 data — numbers of correct-/incorrect-classified test documents, by test class and by the clustering method that produced the training data: Taipei City (Adaptive AP): LVQ 20 correct / 0 incorrect, FAM 16 correct / 4 incorrect; Kaohsiung City (Adaptive AP): LVQ 14 correct / 6 incorrect, FAM 19 correct / 1 incorrect; Taipei City (SOM): LVQ 20 correct / 0 incorrect, FAM 16 correct / 4 incorrect; Kaohsiung City (SOM): LVQ 16 correct / 4 incorrect, FAM 19 correct / 1 incorrect.]
Fig. 4. Results of classifying geographic documents
Fig. 5. Maximize Zones of classification results
5 Conclusion

Map- and text-based location references have become critical information needs for internet users. Searching for locations with local information is particularly useful for visitors with little experience or knowledge about the areas they intend to visit. In this work, we employed the Adaptive AP clustering approach to map related documents to corresponding zones. Using the resulting correctly clustered data sets as training samples, we further utilized the FAM technique to classify documents, and meanwhile we used the SVM methods to frame maximize zones. Through such clustering and classification experiments, we can categorize the documents into various zone-based categories according to the respective geospatial features extracted from the documents. Furthermore, the resulting data samples can also be used to perform tasks of data fusion from other information sources. Experimental results demonstrated that our approach is sensible for establishing a location-based information and knowledge source. In future work, we will try to add geographical coordinates (e.g. longitude and latitude) to our system to enhance the document-to-zone mapping functions, and involve other geospatial features such as images and trajectories in the system development.
References 1. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A Training Algorithm for Optimal Margin Classification. In: 5th Annual Workshop on Computational Learning Theory, pp. 144–152 (1992) 2. Bogorny, V., Camargo, S., Engel, P.M., Alvares, L.O.: Mining Frequent Geographic Patterns with knowledge Constraints. In: Proc. 14th annual ACM international symposium on Advances in geographic information systems, pp. 139–146 (2009) 3. Buscaldi, D., Peris, P.: Inferring Geographical Ontologies from Multiple Resources for GeoGraphical Information Retrieval. In: Proc. 3rd GIR Workshop, SIGIR 2006 (2006) 4. Cardoso, N., Cruz, D., Chaves, M., Silva, M.J.: Using Geographic Signatures as Query and Document Scopes in Geographic IR. In: Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152, pp. 802–810. Springer, Heidelberg (2008) 5. Carpenter, G.A., Grossberg, S., Markuzon, N., Reynolds, J.H., Rosen, D.B.: Fuzzy ARTMAP: A Neural Network Architecture for Incremental Supervised Learning of Analog Multidimensional Maps. IEEE Trans. on Neural Networks 3(5), 698–713 (1992) 6. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (1995) 7. Estivill-Castro, V., Murray, A.T.: Spatial Clustering for Data Mining With Genetic Algorithms. In: Proc. International ICSC Symposium on Engineering of Intelligent Systems (1997) 8. Frey, B.J., Dueck, B.: Clustering by Passing Messages Between Data Points. Science 315(5814), 972–976 (2007) 9. Gelernter, J.: Data Mining of Maps and their Automatic Region-Time-Theme Classification. SIGSPATIAL 1(1), 39–44 (2009)
10. Gunn, S.R.: Support Vector Machines for Classification and Regression. Technical report, Image Speech and Intelligent Systems Research Group, University of Southampton (1997) 11. Jones, C.B., Alani, H., Tudhope, D.: Geographic Information Retrieval with Ontologies of Place. In: Montello, D.R. (ed.) COSIT 2001. LNCS, vol. 2205, pp. 322–335. Springer, Heidelberg (2001) 12. Jones, R., Hassan, A., Diaz, F.: Geographic Features in Web Search Retrieval. In: Proc. 2nd International Workshop on Geographic Information Retrieval, pp. 57–58 (2008) 13. Kohonen, T.: Learning Vector Quantization. Neural Networks 1(suppl.1), 303 (1988) 14. Kohonen, T.: The self-organizing map. Proc. IEEE 78, 1464–1480 (1990) 15. McCurley, K.S.: Geospatial mapping and navigation of the web. In: Proc. 10th International Conference on World Wide Web, pp. 221–229 (2001) 16. Sallaberry, C., Etcheverry, P., Marquesuzaa, C.: Information Retrieval and Visualization Based on Documents’ Geospatial Semantics. In: Proc. International Conference on Information Technology: Research and Education, pp. 277–282 (2006) 17. Sebald, D.J., Bucklew, J.A.: Support Vector Machine Techniques for Nonlinear Equalization. IEEE Trans. on Signal Processing 48(11), 3217–3226 (2000) 18. Tay, S.C., Hsu, W., Lim, K.H., Yap, L.C.: Spatial Data Mining: Clustering of Hot Spots and Pattern Recognition. In: Proc. IEEE International Conference on Geoscience and Remote Sensing Symposium, pp. 3685–3687 (2003) 19. Vatsavai, R.R., Bhaduri, B.: A Hybrid Classification Scheme for Mining Multisource Geospatial Data. In: Proc. 7th IEEE International Conference on Data Mining Workshops, pp. 673–678 (2007) 20. Wang, K., Zhang, J., Li, D., Zhang, X., Guo, T.: Adaptive Affinity Propagation Clustering. Acta Automatica Sinica 33(12), 1242–1246 (2007) 21. Xiao, X., Xie, X., Luo, Q., Ma, W.Y.: Density Based Co-Location Pattern Discovery. In: Proc. 16th ACM SIGSPATIAL international conference on Advances in geographic information systems (2008) 22. Zalik, K.R., Zalik, B.: A sweep-line algorithm for spatial clustering. In: Advances in Engineering Software, pp. 445–451 (2009)
Modeling Topical Trends over Continuous Time with Priors
Tomonari Masada1, Daiji Fukagawa2, Atsuhiro Takasu2, Yuichiro Shibata1, and Kiyoshi Oguri1
1 Nagasaki University, 1-14 Bunkyo-machi, Nagasaki-shi, Nagasaki, Japan
2 National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan
Abstract. In this paper, we propose a new method for topical trend analysis. We model topical trends by per-topic Beta distributions as in Topics over Time (TOT), proposed as an extension of latent Dirichlet allocation (LDA). However, TOT is likely to overfit to timestamp data in extracting latent topics. Therefore, we apply prior distributions to Beta distributions in TOT. Since Beta distribution has no conjugate prior, we devise a trick, where we set one among the two parameters of each per-topic Beta distribution to one based on a Bernoulli trial and apply Gamma distribution as a conjugate prior. Consequently, we can marginalize out the parameters of Beta distributions and thus treat timestamp data in a Bayesian fashion. In the evaluation experiment, we compare our method with LDA and TOT in link detection task on TDT4 dataset. We use word predictive probabilities as term weights and estimate document similarities by using those weights in a TFIDF-like scheme. The results show that our method achieves a moderate fitting to timestamp data.
1 Introduction
Term weighting is a key component of the applications in text mining such as information retrieval, document clustering, word clustering, etc.1 While TFIDF is a classic term weighting scheme widely used in such applications [8], we can obtain a more well-founded term weighting with probabilistic modeling. In this paper, we propose a new probabilistic model based on latent Dirichlet allocation (LDA) [5] and obtain efficient term weights for text mining applications. We can use LDA to obtain term weights as follows. LDA models each document as a mixture of latent topics. Therefore, we have a multinomial distribution Mult(θj) defined over topics for each document j. From Mult(θj), we draw as many topics as the length of document j. Further, LDA models each topic k by a multinomial Mult(φk) defined over words. By drawing a word from the word multinomial corresponding to each of the topics, which is in turn drawn from
Footnote 1. In this paper, the term “term” is used interchangeably with “word”.
Table 1. The definition of symbols

x                set of observed word tokens
y                set of observed timestamps
z                set of latent topic assignments to word tokens
s                set of latent Bernoulli trials in BTOT
θjk              parameters of per-document topic multinomials
φkw              parameters of per-topic word multinomials
τk1, τk2         parameters of per-topic Beta distributions defined over timestamps
ηk1, ηk2         parameters of per-topic Bernoulli trials in BTOT
α                parameter of a symmetric Dirichlet prior for topic multinomials
β                parameter of a symmetric Dirichlet prior for word multinomials
γ                parameter of a symmetric Beta prior for binomials
a1, b1, a2, b2   parameters of Gamma priors for Beta distributions
nk               # of word tokens which are assigned to topic k
nj               # of word tokens in doc j
njk              # of word tokens in doc j which are assigned to topic k
nkw              # of tokens of word w which are assigned to topic k
nk1, nk2         split of nk according to the results of Bernoulli trials in BTOT
nj1, nj2         split of nj according to the results of Bernoulli trials in BTOT
njk1, njk2       split of njk according to the results of Bernoulli trials in BTOT
Mult(θj), we obtain a set of word tokens composing document j. Based on this modeling, we can estimate the predictive probability of word w given document j as Σ_k (n_jk + α)/(n_j + Kα) · (n_kw + β)/(n_k + Wβ). The symbols are defined in Table 1. This predictive probability can be computed based on a result of collapsed Gibbs sampling (CGS) [7], where each word token is assigned to a topic so that the resulting set of topic assignments is a sample from the true posterior.2 We can regard the above predictive probability as a weight of word w in document j. While both TFIDF and LDA are defined based on the frequencies of words, other types of information may help in weighting terms. For example, we often sort Web search results in chronological order. This is based on an intuition that the similarity of document timestamps improves ranking. Therefore, we propose a new probabilistic model utilizing document timestamps and provide a more efficient term weighting. Our proposed model is a sophistication of Topics over Time (TOT) [15], which was proposed as an extension of LDA. In TOT, the dependency of word token generation on document timestamps is modeled by per-topic Beta distributions defined over continuous timestamps. Intuitively speaking, each Beta density represents a change of popularity over time for the corresponding topic.
Footnote 2. To be precise, this is not the actual predictive probability, which is obtained by taking an average over the posterior probability over all possible topic assignments. However, it is intractable to compute the actual predictive probability. Therefore, in this paper, word predictive probability always means a predictive probability computed based on a result of collapsed Gibbs sampling.
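For concreteness, such a predictive probability can be read directly off the count matrices kept by a collapsed Gibbs sampler. The sketch below assumes such count arrays are available; their names follow Table 1, but the data structures themselves are hypothetical.

```python
import numpy as np

def predictive_prob(w, j, n_jk, n_kw, alpha, beta):
    """p(w | j) from collapsed-Gibbs counts: sum_k (n_jk+a)/(n_j+Ka) * (n_kw+b)/(n_k+Wb).
    n_jk: documents x topics count matrix; n_kw: topics x words count matrix."""
    K, W = n_kw.shape
    n_j = n_jk[j].sum()
    n_k = n_kw.sum(axis=1)
    theta = (n_jk[j] + alpha) / (n_j + K * alpha)
    phi = (n_kw[:, w] + beta) / (n_k + W * beta)
    return float(np.dot(theta, phi))
```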
Fig. 1. Graphical representation of BTOT
The predictive probability of word w given document j in TOT can be obtained as follows:

(1/Z) Σ_{k=1}^{K} (n_jk + α)/(n_j + Kα) · (n_kw + β)/(n_k + Wβ) · Γ(τ_k1 + τ_k2)/(Γ(τ_k1)Γ(τ_k2)) · t_j^{τ_k1 − 1} (1 − t_j)^{τ_k2 − 1},   (1)
where Γ(·) denotes the Gamma function and Z is a normalization constant. In [15], the parameters τk1, τk2 of the Beta distributions are directly estimated by the method of moments. Therefore, the Beta distributions are likely to overfit to timestamps. To be precise, in CGS for TOT, the same topic is likely to be assigned to word tokens only because the tokens appear in documents having similar timestamps. We say that timestamps are similar when the time interval between them is short. Consequently, the topic population at each point on the time axis is dominated by only a few topics, though a wide variety of topics may appear at the same time point. In [15], this problem is solved with a balancing parameter appearing as an exponential power of the Beta density in Eq. (1). In contrast, we propose a more well-founded approach, a Bayesian TOT (BTOT), where we apply Gamma priors to the Beta distributions and marginalize out their parameters. BTOT is a substantial modification of TOT, because we can obtain word predictive probabilities with no reference to a specific estimation of the parameters of the Beta distributions. However, the Gamma distribution is not a conjugate to the Beta distribution. Therefore, we use the following trick: we set one of the two parameters of each Beta distribution to one. This is because the Gamma distribution is a conjugate to a Beta distribution one of whose parameters is equal to one. Further, we determine which parameter is set to one by a Bernoulli trial. Consequently, we can treat document timestamps in a Bayesian manner by marginalizing out the parameters of the Beta distributions. Figure 1 shows the graphical representation of BTOT. In the evaluation experiment, we compare BTOT with LDA and TOT by the link detection task on the TDT4 dataset [1]. Link detection is the task of determining whether a given pair of documents relate to the same topic. Therefore, an efficient estimation of document similarity is a key to success. We use word predictive
probabilities given by the compared methods in a TFIDF-like term weighting scheme and compute the cosine measure of the resulting document vectors. Our evaluation will show that BTOT gives evaluation results lying between LDA, which uses no timestamps, and TOT, which depends too strongly on timestamps. Therefore, we will conclude that BTOT shows a moderate fitting to timestamps. The rest of the paper is organized as follows. Section 2 reviews existing approaches for topical trend analysis. Section 3 describes the details of our method. Section 4 explains how the evaluation is conducted. Section 5 includes evaluation results and discussions. Section 6 concludes the paper with future work.
2 Previous Works
In recent years, probabilistic methods have found an interesting application in modeling topical trends of documents. In this paper, we focus on the applications of multitopic probabilistic models like LDA [5] to topical trend analysis. Dynamic Topic Models (DTM) [4] and their continuous time version (cDTM) [14] model topical trends as transitions of the parameters of per-topic word multinomial distributions. First, a real vector is drawn from a time-dependent Gaussian distribution at each position of the time axis. The time dependency of the Gaussian distributions is modeled as a linear transition in DTM, and as a Brownian motion in cDTM. Second, the drawn vector is mapped to a set of parameters of a multinomial distribution. However, the Gaussian distribution is not a conjugate to the multinomial. Consequently, the inference procedure becomes complicated. Multiscale Topic Tomography Models (MTTM) [10] are based on a completely different idea, where the entire time interval is segmented into two pieces recursively. Consequently, we obtain a binary tree whose root represents the entire interval and whose internal nodes represent subintervals. Each leaf node is associated with a Poisson distribution for generating words. Further, the parameter of the Poisson distribution at each non-leaf node is equal to the sum of the parameters of the Poisson distributions at its two child nodes. Therefore, we can naturally express temporal localization of word counts by this branching at each non-leaf node. However, we cannot use continuous timestamps in MTTM. Compared with the works above, our proposal is remarkable in the following two features:
– BTOT is an extension of LDA. Therefore, the inference can be implemented by introducing a slight modification to that for LDA. In contrast, DTM, cDTM and MTTM require heavily customized implementations. The inference used in our evaluation experiment is actually a slight modification of CGS for LDA [7], as shown later.
– We can use continuous timestamps. Both MTTM and DTM lack this feature. Another important recent approach, dHDP [11], also assumes that timestamps are discretized. While cDTM has this feature, its implementation is complicated, because a special technique is needed to realize efficient memory usage in modeling continuous timestamps [14].
3 Topical Trend Modeling with Priors

3.1 A Bayesian Topics over Time (BTOT)
We propose a new probabilistic model by introducing a sophistication to TOT [15]. The full joint distribution of TOT can be written as follows:

p(x, y, z, θ, φ | α, β, τ) = Π_j [ Γ(Kα)/Γ(α)^K · Π_k θ_jk^{α−1} ] · Π_k [ Γ(Wβ)/Γ(β)^W · Π_w φ_kw^{β−1} ] · Π_j Π_k θ_jk^{n_jk} · Π_k Π_w φ_kw^{n_kw} · Π_j Π_k [ Γ(τ_k1 + τ_k2)/(Γ(τ_k1)Γ(τ_k2)) · t_j^{τ_k1 − 1} (1 − t_j)^{τ_k2 − 1} ]^{n_jk}.   (2)
The symbols are defined in Table 1. Based on TOT, we devise a new probabilistic model by applying Gamma prior distributions to the parameters τk1, k = 1, …, K and τk2, k = 1, …, K in Eq. (2) (see also Figure 1). However, the Gamma distribution is not a conjugate to the Beta distribution. Therefore, we set one of the two parameters of each Beta distribution to one. Then, the Gamma distribution becomes a conjugate. When one of the two parameters is fixed to one, the Beta distribution provides density functions as shown in the left panel of Figure 2 for various values of the other parameter. Further, we determine which of the two Beta parameters is set to one by a Bernoulli trial for each word token separately. To be precise, we choose one of the two Beta distributions Beta(τk1, 1) and Beta(1, τk2) based on a random 0/1 draw from a binomial distribution Bi(ηk1, ηk2) for each word token. We also apply a symmetric Beta prior to these per-topic binomial distributions. Our approach is not the only way to modify TOT in a Bayesian manner. Therefore, we call our approach a Bayesian Topics over Time, though we abbreviate it simply as BTOT in this paper. By marginalizing out the parameters of the Beta distributions and those of the binomial distributions, we obtain the full conditional probability of a topic assignment followed by a Bernoulli trial as below:

p(z_ji = k, s_ji = 0 | x, y, z^¬ji, s^¬ji, α, β, γ, a, b) ∝ (α + n_jk^¬ji) · (β + n_kw^¬ji)/(Wβ + n_k^¬ji) · (γ + n_k1^¬ji)/(2γ + n_k^¬ji) · (a_1 + n_k1^¬ji)/t_j · {b_1 − Σ_j n_jk1^¬ji log t_j}^{a_1 + n_k1^¬ji} / {b_1 − Σ_j n_jk1^¬ji log t_j − log t_j}^{a_1 + n_k1^¬ji + 1},

p(z_ji = k, s_ji = 1 | x, y, z^¬ji, s^¬ji, α, β, γ, a, b) ∝ (α + n_jk^¬ji) · (β + n_kw^¬ji)/(Wβ + n_k^¬ji) · (γ + n_k2^¬ji)/(2γ + n_k^¬ji) · (a_2 + n_k2^¬ji)/(1 − t_j) · {b_2 − Σ_j n_jk2^¬ji log(1 − t_j)}^{a_2 + n_k2^¬ji} / {b_2 − Σ_j n_jk2^¬ji log(1 − t_j) − log(1 − t_j)}^{a_2 + n_k2^¬ji + 1},   (3)

where ¬ji means that the counts are taken after removing the ith word token of document j. The derivation is omitted due to space limitations. We use Eq. (3) in CGS for BTOT.
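The sketch below evaluates the unnormalized full conditional of Eq. (3) for one word token, for every topic k and Bernoulli outcome s. It is only an illustration under assumptions: the count arrays (named after Table 1) and the document timestamps are stored in a hypothetical dictionary cnt, the current token is assumed to have already been decremented from all counts, and for realistic corpora the power terms should be computed in log space to avoid overflow.

```python
import numpy as np

def btot_full_conditional(j, w, t_j, cnt, alpha, beta, gamma, a, b):
    """Unnormalized p(z_ji = k, s_ji = s | ...) of Eq. (3); counts in cnt exclude token (j, i)."""
    K, W = cnt["n_kw"].shape
    probs = np.zeros((K, 2))
    for k in range(K):
        base = (alpha + cnt["n_jk"][j, k]) * (beta + cnt["n_kw"][k, w]) / (W * beta + cnt["n_k"][k])
        for s in range(2):
            n_ks = cnt["n_ks"][k, s]                                  # n_k1 (s=0) or n_k2 (s=1)
            logt = np.log(cnt["t"]) if s == 0 else np.log(1.0 - cnt["t"])
            B = b[s] - np.dot(cnt["n_jks"][:, k, s], logt)            # b_s - sum_j n_jks log(.)
            x = np.log(t_j) if s == 0 else np.log(1.0 - t_j)
            denom_t = t_j if s == 0 else 1.0 - t_j
            beta_term = (a[s] + n_ks) / denom_t * B ** (a[s] + n_ks) / (B - x) ** (a[s] + n_ks + 1)
            probs[k, s] = base * (gamma + n_ks) / (2 * gamma + cnt["n_k"][k]) * beta_term
    return probs   # normalize and sample (k, s) from the flattened array
```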
Fig. 2. Left panel: Beta density functions Γ(τ1 + τ2)/(Γ(τ1)Γ(τ2)) · t^{τ1−1} (1 − t)^{τ2−1} for various values of τ1, while τ2 is fixed to one. Right panel: mixing proportions of the two Beta distributions at each point of the time axis. Each time point corresponds to a timestamp of documents used in our experiment. The solid line (resp. dashed line) shows the proportion of the number of word tokens where the Beta density ∝ (1 − t)^{τ2−1} (resp. t^{τ1−1}) is selected by a Bernoulli trial. In the earlier half of the given time interval, the Beta density ∝ (1 − t)^{τ2−1} is likely to be chosen. The opposite is observed in the later half.
However, the computation of the last term in each of the two cases in Eq. (3) is time consuming. To reduce the execution time, we apply the approximation shown below to the former case in Eq. (3):

{b_1 − Σ_j n_jk1^¬ji log t_j}^{a_1 + n_k1^¬ji} / {b_1 − Σ_j n_jk1^¬ji log t_j − log t_j}^{a_1 + n_k1^¬ji + 1} ≈ {b_1 − Σ_j n_jk1 log t_j}^{a_1 + n_k1} / {b_1 − Σ_j n_jk1 log t_j − log t_j}^{a_1 + n_k1 + 1}.   (4)

A similar approximation is also applied to the latter case in Eq. (3). In the evaluation experiment, we update these two approximated terms once for every 10 samplings of topics in CGS. In the right panel of Figure 2, a line graph presents the proportions of 0/1 draws for each timestamp. This graph is drawn based on a result actually obtained in our evaluation experiment. CGS for BTOT provides a set of 0/1 draws for all word tokens along with a set of topic assignments. Therefore, we can count the number of 0 draws and that of 1 draws at each time point to obtain a proportion of 0/1 draws at each time point. This line graph shows that the Beta density ∝ (1 − t)^{τ2−1} is likely to be chosen in the earlier half of the time axis, and that the density ∝ t^{τ1−1} is likely to be chosen in the latter half. While this example is arbitrarily selected from 50 results obtained in our experiment, other results give almost the same tendency.
4 Experimental Settings
4.1 Evaluation Strategy
We compare the methods on the link detection task over the TDT4 dataset [1]. This dataset consists of 96,259 documents, where machine-translated non-English
documents are also included. 196,131 unique unstemmed words and 17,638,946 word tokens are observed after removing standard stop words. We use document dates, ranging from Oct. 1, 2000 to Jan. 31, 2001, as document timestamps and normalize them to real values in the interval [0.05, 0.95], where values close to both ends of [0, 1] are omitted for numerical stability. We have two sets of evaluation topics for the TDT4 dataset, i.e., the TDT 2002 topic set and the TDT 2003 topic set. These topic sets were prepared for the TDT 2002 competition and the TDT 2003 competition, respectively. Each set consists of 40 topics and the corresponding 40 sets of on-topic documents. To avoid confusion with the "topics" in probabilistic models, we call the evaluation topics prepared for the TDT4 dataset "TDT-topics." With respect to each TDT-topic, we evaluate the effectiveness of document similarity as follows. Let D be the entire TDT4 document set and D0 be the on-topic document set for some TDT-topic. Then, under a similarity threshold λ, we can compute the following two evaluation measures:
– False alarm probability: |{(d0, d) : d0 ∈ D0, d ∈ D \ D0, sim(d0, d) ≥ λ}| / (|D0| × |D|)
– Miss probability: |{(d0, d0′) : d0, d0′ ∈ D0, d0′ ≠ d0, sim(d0, d0′) < λ}| / {|D0| × (|D0| − 1)},
where sim(·, ·) denotes document similarity. For both measures, a lower value means better document similarity. However, there is a trade-off between the two measures. Therefore, we introduce a measure called normalized detection cost (NDC), defined as the sum of the false alarm probability multiplied by 4.9 and the miss probability [2][12]. NDC is based on the intuition that false alarms are more harmful. Based on a preliminary experiment, we set λ = 0.05 for all compared methods so that each method gives near-peak performance over all TDT-topics on average.
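As an illustration only (the similarity function and document collections below are placeholders, not artifacts of the paper), the two probabilities and NDC for a single TDT-topic could be computed as follows.

```python
def ndc_for_topic(sim, on_topic, all_docs, lam=0.05, fa_weight=4.9):
    """Normalized detection cost (NDC) for one TDT-topic under similarity threshold lam."""
    off_topic = [d for d in all_docs if d not in on_topic]
    false_alarms = sum(1 for d0 in on_topic for d in off_topic if sim(d0, d) >= lam)
    p_fa = false_alarms / (len(on_topic) * len(all_docs))        # false alarm probability
    misses = sum(1 for d0 in on_topic for d1 in on_topic
                 if d1 != d0 and sim(d0, d1) < lam)
    p_miss = misses / (len(on_topic) * (len(on_topic) - 1))      # miss probability
    return fa_weight * p_fa + p_miss
```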
4.2 Term Weighting
In this paper, we estimate document similarity by the cosine measure [8] of document vectors whose entries are computed based on a TFIDF-like term weighting scheme. We use the following term weighting scheme:

$$e_j(w) \equiv n_{jw} \times \log\frac{p(w|j)^{\rho} \cdot (n_{jw}/n_j)^{\sigma}}{J_w/J}, \quad (5)$$

where J_w is the document frequency of w, p(w|j) is the predictive probability of word w given document j, and n_{jw} is the term frequency of w in document j. This weighting scheme is also adopted in [13]. In Eq. (5), n_{jw}/n_j is a maximum likelihood estimate of the probability of word w given document j, where we assume a different multinomial for each document. The predictive probability p(w|j) can be computed based on a result of CGS for each of the compared methods. The parameters ρ and σ can be regarded as annealing parameters for p(w|j) and n_{jw}/n_j, respectively. Therefore, we compare the geometric mean of the annealed versions of p(w|j)
and n_{jw}/n_j to J_w/J, which can in turn be regarded as a background probability of word w. In this manner, Eq. (5) defines a term weight based on how much p(w|j) and n_{jw}/n_j deviate from J_w/J. When ρ = σ = 0, Eq. (5) reduces to a standard TFIDF: e_j(w) ≡ n_{jw} log(J/J_w). However, this turns out to be quite ineffective in our evaluation. When ρ = 0 and σ ≠ 0, we define a term weight with no reference to probabilistic methods. We regard this case as our baseline method, simply denoted by TFIDF. We set σ = 0.6 based on a preliminary experiment. When ρ ≠ 0, we obtain a term weight using a probabilistic method. Based on another preliminary experiment, we set ρ = σ = 0.3 for all of LDA, TOT, and BTOT.
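A minimal sketch of the weighting in Eq. (5) (our illustration; variable names are assumptions) — with ρ = σ = 0 it reduces to the standard TFIDF weight n_{jw} log(J/J_w):

```python
import numpy as np

def term_weight(n_jw, n_j, p_w_given_j, J_w, J, rho=0.3, sigma=0.3):
    """TFIDF-like term weight e_j(w) of Eq. (5)."""
    foreground = (p_w_given_j ** rho) * ((n_jw / n_j) ** sigma)   # annealed document-side estimate
    background = J_w / J                                          # background probability of w
    return n_jw * np.log(foreground / background)
```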
4.3 Inference
For each of LDA, TOT, and BTOT, we run 50 instances of CGS, each starting from a random initialization. In CGS, the entire document set is scanned 800 times to achieve good convergence. We fix the number of topics K to 100 for all compared methods. The evaluation results are worse when K = 50 and only comparable when K = 200. We optimize the hyperparameters α, β, and γ by using Minka's fixed-point iterations [9] as presented in [3]. For TOT, we reduce overfitting to the timestamp data as follows: every time one among the 2K Beta parameters τ_{k1}, τ_{k2}, k = 1, …, K becomes larger than a threshold, we rescale all of them by the same constant so that they remain less than or equal to the threshold (a sketch is given below). This rescaling directly suppresses the unbounded increase of the parameters, which causes overfitting. The threshold is set to one, because larger values lead to worse evaluation results, and smaller values make TOT indistinguishable from LDA. The execution time of inference is about five hours for LDA and TOT, and about 11 hours for BTOT, on a PC equipped with an Intel Core2 Quad Q9650.
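The rescaling safeguard could be realized as in the following sketch (ours; when to apply it during sampling is an assumption):

```python
import numpy as np

def rescale_beta_params(tau, threshold=1.0):
    """Shrink all 2K Beta parameters by one common factor whenever any of them exceeds the threshold."""
    peak = tau.max()
    return tau * (threshold / peak) if peak > threshold else tau
```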
5 Evaluation Results
We have 50 NDC values for each of LDA, TOT, and BTOT, because 50 sampling results of CGS are obtained for each of these compared methods. Based on these NDCs, we conduct a series of comparisons among TFIDF (i.e., baseline method), LDA, TOT, and BTOT, as described below. – For TFIDF, we have only one evaluation result, because TFIDF is not a probabilistic method and thus has no corresponding CGS trials. Therefore, we compare each of LDA, TOT, and BTOT with TFIDF by one sample t-test [6], where we regard the NDC of TFIDF as the test mean. – Further, by applying two sample unpooled t-test [6], we conduct a comparison between LDA and TOT, a comparison between LDA and BTOT, and finally a comparison between TOT and BTOT. Table 2 summarizes evaluation results. The six columns tagged with “Improvements ” (resp. “Deteriorations ”) show the numbers of the TDT-topics where a significant improvement (resp. deterioration) is observed. We simply call
Table 2. The numbers of the TDT-topics for TDT 2002 or TDT 2003 where a significant improvement or deterioration is found

               Improvements                     Deteriorations
        TDT 2002        TDT 2003         TDT 2002        TDT 2003
        TFIDF LDA TOT   TFIDF LDA TOT    TFIDF LDA TOT   TFIDF LDA TOT
BTOT     16    6   3     15    9   4      3     0   1     2     0   0
TOT      20   16   —     11   11   —      2     2   —     1     0   —
LDA      15    —   —     11    —   —      3     —   —     5     —   —
an improvement or a deterioration “significant” when it is significant at 99.5% confidence level. With respect to both improvement and deterioration, the three columns tagged with “TDT 2002” (resp. “TDT 2003”) gives the numbers of TDT-topics among the 40 TDT-topics prepared for TDT 2002 competition (resp. TDT 2003 competition). Each column tagged with “TFIDF” gives the results obtained by comparing between TFIDF and the method appearing in the first column. The other two column tags, “LDA” and “TOT”, mean the comparison with LDA and that with TOT, respectively. For example, the number “16” on the second last row in the third column means that when TOT is compared with LDA, a significant improvement is observed for 16 TDT-topics among 40 prepared for TDT 2002 competition. Table 2 gives the following observations: – The number of TDT-topics where LDA significantly improves TFIDF is larger than that of TDT-topics where LDA significantly deteriorates TFIDF. The same result is also observed for TOT and BTOT. Therefore, we can conclude that LDA-like probabilistic models lead to term weighting efficient for document similarity estimation. – The number of TDT-topics where TOT or BTOT significantly improves LDA is larger than that of TDT-topics where TOT or BTOT significantly deteriorates LDA. Therefore, we can conclude that the efficiency of term weighting can be improved by considering document timestamps in LDAlike probabilistic modeling. – The number of TDT-topics where BTOT significantly improves TOT is larger than that of TDT-topics where BTOT significantly deteriorates TOT. Therefore, we can conclude that our Bayesian approach improves TOT. Finally, we point out the following fact. The number of TDT-topics where BTOT significantly improves LDA is smaller than that of TDT-topics where TOT significantly improves LDA. At the same time, the number of TDT-topics where BTOT significantly deteriorates LDA is also smaller than that of TDTtopics where TOT significantly deteriorates LDA. In fact, BTOT deteriorates LDA for no TDT-topics. This means that BTOT behaves more similar to LDA than TOT. Intuitively speaking, BTOT is halfway between LDA and TOT. Therefore, we can conclude that BTOT exhibits a fitting to timestamp data in a more moderate manner than TOT.
6 Conclusions
In this paper, we have proposed a new probabilistic model, Bayesian Topics over Time (BTOT). In BTOT, we model document timestamps with per-topic Beta distributions. Further, we apply Gamma priors to the Beta distributions after introducing a trick that makes the Gamma prior conjugate. We then marginalize out the parameters of the Beta distributions and treat the timestamps in a Bayesian manner. Based on the results of our evaluation experiment, we can conclude that BTOT achieves a more moderate fitting to timestamp data than TOT. When we utilize our methods as a component of the indexing process of a realistic search engine, we would have to conduct collapsed Gibbs sampling on a document set to which new documents frequently arrive, and such new documents will arrive with new timestamps. Therefore, an important piece of future work is to devise a collapsed Gibbs sampling procedure applicable to the situation where a document set dynamically changes not only in observed word frequencies, but also in observed variations of timestamps.
References
1. Topic Detection and Tracking - Phase 4, http://projects.ldc.upenn.edu/TDT4/
2. Allan, J., Lavrenko, V., Nallapati, R.: UMass at TDT 2002. In: Notebook Proceedings of TDT 2002 Workshop (2002)
3. Asuncion, A., Welling, M., Smyth, P., Teh, Y.W.: On Smoothing and Inference for Topic Models. In: UAI 2009 (2009)
4. Blei, D.M., Lafferty, J.D.: Dynamic Topic Models. In: Airoldi, E.M., Blei, D.M., Fienberg, S.E., Goldenberg, A., Xing, E.P., Zheng, A.X. (eds.) ICML 2006. LNCS, vol. 4503, pp. 113–120. Springer, Heidelberg (2007)
5. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. JMLR 3, 993–1022 (2003)
6. Cabilio, P., Masaro, J.: Basic Statistical Procedures and Tables. Department of Mathematics and Statistics, Acadia University (2005)
7. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proc. of Natl. Acad. Sci. 101(suppl. 1), 5228–5235 (2004)
8. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
9. Minka, T.: Estimating a Dirichlet distribution (2000), http://research.microsoft.com/~minka/papers/dirichlet/
10. Nallapati, R.M., Ditmore, S., Lafferty, J.D., Ung, K.: Multiscale Topic Tomography. In: KDD 2007, pp. 520–529 (2007)
11. Ren, L., Dunson, D.B., Carin, L.: The Dynamic Hierarchical Dirichlet Process. In: ICML 2008, pp. 824–831 (2008)
12. Shah, C., Croft, W.B., Jensen, D.: Representing Documents with Named Entities for Story Link Detection. In: CIKM 2006, pp. 868–869 (2006)
13. Masada, T., Fukagawa, D., Takasu, A., Hamada, T., Shibata, Y., Oguri, K.: Dynamic Hyperparameter Optimization for Bayesian Topical Trend Analysis. In: CIKM 2009, pp. 1831–1834 (2009)
14. Wang, C., Blei, D., Heckerman, D.: Continuous Time Dynamic Topic Models. In: UAI 2008, pp. 579–586 (2008)
15. Wang, X.-R., McCallum, A.: Topics over Time: A Non-Markov Continuous-time Model of Topical Trends. In: KDD 2006, pp. 424–433 (2006)
Improving Sequence Alignment Based Gene Functional Annotation with Natural Language Processing and Associative Clustering Ji He Department of Scientific Computing, The Samuel Roberts Noble Foundation 2510 Sam Noble Parkway, Ardmore, OK 73401, USA
[email protected]
Abstract. Sequence alignment has been a commonly adopted technique for annotating gene functions. Biologists typically infer the function of an unknown query gene from the function of the reference subject gene that shows the highest homology (commonly referred to as the "top hit"). BLAST search against the NCBI NR database has been the de facto "golden companion" in many applications. However, the NR database is known to be noisy and to contain significant sequence redundancy, which leads to various complications in the annotation process. This paper proposes an integrative approach that encompasses natural language processing (NLP) for feature representation of functional descriptions and a novel artificial neural network, customized from the Adaptive Resonance Associative Map (ARAM), for clustering subject genes and reducing their redundancy. The proposed approach was evaluated on the model legume species Medicago truncatula and was shown to be highly effective in our experiments. Keywords: Sequence Alignment, Gene Functional Annotation, Natural Language Processing, Artificial Neural Network, Adaptive Resonance Associative Map.
1 Introduction
In biology and bioinformatics studies, sequence alignment has been commonly applied to the computational prediction of gene function. By aligning the sequence of an unknown gene (the "query") with those in a reference library (the "subjects"), one can infer the function of the query gene from the functional descriptions of the subject genes, based on the assumption that genes with a certain level of homology, as reflected by their sequence similarity, typically have conserved functions. A commonly used approach in many automatic programs is to annotate the query gene using the functional description of the subject gene with the highest sequence similarity (often referred to as the "top hit"). This approach is similar to the nearest-neighbor-based categorization algorithm [1] in the machine learning field, being straightforward yet practical in many cases. A notable drawback of the approach, though, is that it may fail to identify the potentially multiple functions of a gene. As far as sequence alignment is concerned, arguably, BLAST search against the NCBI protein (NR) database has been the de facto "golden companion" for many
sequencing projects, not only because the BLAST program is relatively computationally efficient, but also because the NR database is the largest and most frequently updated protein databases on the world. These factors in combination enable researchers to quickly survey their new sequence library on a timely manner, with the potential of achieving highest library coverage. NR however is considered not a best managed database, in the sense that with its focus on rapid library updating, it consists of a large number of genes either not being annotated with meaningful functional description, vaguely annotated, or even with wrong annotations, in addition to its significant redundancy of same genes being submitted from different sources. Very often, the functional description of the “top hit” subject gene may not provide any meaningful indication at all. Typically, a biologist has to manually inspect all subject genes, excludes these meaningless records and eliminate the redundancy before he could more reliably infer the query gene’s function. There exist a number of projects aiming to tackle this deficiency by improving reference databases. Notably, the NCBI Reference Sequence (RefSeq) collection [2] focuses on reducing the sequence redundancy in selected species; the EBI Universal Protein Resource (UniProt) databases [3] aims to provide higher quality annotations to selected reference protein sequences. These data collections, although presenting invaluable resources to the research community and actively growing in size, are heavily dependent on intensive human labors and hence are not practically synchronized with the NR database. In other words, they could not take the full advantage of the rapidly growing sequence information contributed by general public, which sometimes is not desirable to biologists, especially when a new species not in RefSeq or UniProt is concerned. This paper reports our alternative approach from the end-user point to improve the sequence annotation quality. Our approach attempts to tackle the problem in two aspects: Firstly, how to identify whether a subject gene’s functional description is meaningful enough to the biologist or not? And secondly, how to eliminate the redundancy in subject genes and identify representative ones reflecting major gene families / groupings so as to discover the potentially multiple functions of the query sequence? We applied a set of standard natural language processing (NLP) techniques to answer the former question, and applied a novel artificial neural network (ANN) model based on associative clustering to the latter. In the rest of this paper, Section 2 presents our proposed approach in detail, Section 3 reports our benchmark experiments and results, and Section 4 draws concluding remarks and outlines our future work.
2 Methods
2.1 Natural Language Processing (NLP) for Representing Features of Sequence Alignments
Among the various statistics returned by the BLAST program (our studies assume BLAST alignment, but the generalized approach applies to other sequence alignment-based functional annotation scenarios), our proposed approach utilizes three types of data on each alignment (i.e., high-scoring pair (HSP)), namely
i) the significance of alignment, typically indicated with e-value; ii) the start and end positions of the aligned (conserved) region in reference to the query sequence; and iii) the functional description of the subject gene. For each alignment input I, we collate its feature in a vector format of I = {r|P |X}, (1) in which r is an integer value corresponding to ranking of the particular subject in all aligned ones, assuming a lower value corresponds to a higher degree of homology (i.e. r = 1 indicates the “top hit”); P = (p1 , p2 ) is a two-dimensional vector containing the start (p1 ) and end (p2 ) positions of conserved region, presumably p1 ≤ p2 ; and finally X is a multi-dimensional vector that represents the feature of the free text style functional description of the subject, with details given below. To generate the feature vector X, we adopted the “bag of words” (BoW) model being commonly used in information retrieval (IR) studies, which represents free text as an unordered collection of words, disregarding grammar and word order [4]. Briefly speaking, our approach consists of two major phases, namely feature selection and feature extraction. In the feature selection phase, a free text strings is firstly tokenized. That is, terms (words) are extracted based on natural delimiters (space, comma, period, etc.). The terms are then processed with the widely adopted Porter stemming algorithm [5], during which their root formats are identified. In the consequent step, the root words are compared against a stoplist and those stopwords are excluded. A stopword is a highly generic word (e.g. “high”, “very”) in reference to general corpora and is typically considered less important in BoW models. And in the last step, too general and too rare words are identified in inference to our specialized database of interest (NR in our case). Too general words (e.g. “protein”, “function” in our studies) are thought to contain less information content (IC) by IR researchers, whereas too rare words, if not caused by misspelling, unnecessarily increase the dimension of the feature space and hence the computational complexity. They are thus further removed in our studies, whereas the remainder words with mid-range frequencies are selected keywords. In the feature extraction phase, we adopted the popular inverse document frequency weighted term frequency (TF*IDF) method [6] to represent the “bag of words” into vector format, specifically, X = (x1 , x2 , . . . , xM ) = (tf1 · idf1 , tf2 · idf2 , . . . , tfM · idfM ),
(2)
where M is the total number of selected keywords, tf_i is the term frequency (TF), i.e., the number of occurrences, of keyword t_i in the text (in our case, the functional description of the subject gene), and idf_i is the inverse document frequency (IDF) of t_i with reference to the whole data set, defined as idf_i = −log(df_i/N), in which df_i is the document frequency (DF) of t_i, namely the total number of functional descriptions that contain t_i, and N is the total number of functional descriptions in the data set. Furthermore, since our studies involve the estimation of pattern proximity for clustering purposes, to avoid category proliferation [7] and following common practice, we normalize each feature vector based on the second-level (Euclidean) normalization given by

$$x_i \leftarrow \frac{x_i}{\sqrt{\sum_{i=1}^{M} x_i^2}}. \quad (3)$$
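A compact sketch of this feature extraction step (ours; the data structures are assumptions — `df` maps each keyword to its document frequency):

```python
import numpy as np
from collections import Counter

def description_vector(tokens, keywords, df, n_docs):
    """TF*IDF features of Eq. (2) followed by the Euclidean normalization of Eq. (3)."""
    tf = Counter(tokens)
    x = np.array([tf[t] * np.log(n_docs / df[t]) if df.get(t) else 0.0 for t in keywords])
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x   # an all-zero vector flags a meaningless description
```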
It is easily understandable that if a functional description does not contain any meaningful words based on our selection criteria, the corresponding feature vector will be null (i.e. all values are zero). This provides us a quantitative measure to exclude these meaningless records from our downstream analysis. The conversion of above-mentioned three different types of features into numeric values enables us to apply more comprehensive heuristics for further data mining. Specifically, in our studies, we adopted a modified version of Adaptive Resonance Associative Map (ARAM) artificial neural network model to carry out associative clustering based on the conserved regions (P ) and the functional descriptions (X), while kept track of the ranking information (r) for the final annotation purpose. The idea of the associative clustering is to identify subject gene groupings that correspond to the same conserved regions and have similar functional descriptions, as they typically suggest homologue genes from the same or closely related families. Thus consequently, we could use one common annotation to represent their functions and hence reduce the redundancy. Details of the associative clustering approach are given below. 2.2 Adaptive Resonance Associative Map (ARAM) for Clustering of Reference Genes The Adaptive Resonance Associative Map (ARAM) [8] belongs to the family of Adaptive Resonance Theory (ART) self-organizing neural networks [9]. Like another member of the family, ART-MAP [10], ARAM is capable of incrementally learning recognition categories (pattern classes) and multidimensional maps of patterns. Yet compared to ART-MAP, ARAM contains a simplified pattern matching and learning process. The architecture of ARAM (Figure 1) can be understood as an overlap of two ART networks. An ARAM network has two individual short term memory (STM) layers F1a and F1b , responding to independent input signals A and B respectively, but an shared long term memory (LTM) layer F2 that encodes the associated knowledge from these two feature fields. The learning of the network is guided by an orienting subsystem with two logical gates, defined with two vigilance parameters (ρa and ρb respectively). The logical gates conditionally switch and reset the network state according to predefined rules, and hence affect knowledge encoding in the LTM. An unique feature of ARAM network is its capability of learning and associating patterns from two independent domains, which makes ARAM a versatile learning architecture for various data mining tasks including classification, rule induction, and associative clustering. In our application, we used one input domain A to encode the alignment position information (P ), and the other input domain B to encode the functional description information (X). And by carrying out unsupervised learning, the ARAM network will form output clusters that satisfy our predefined associative heuristics – in other words, only subject genes that are aligned to the same conserved region on the query gene and have similar biological functions will be categorized into the same cluster. The learning of the particular ARAM network being used in our studies is further explained below. Inputs and Recognition Categories: ARAM requires inputs A and B represented in vector format, which in our studies, correspond to the two-dimensional position vector
Fig. 1. The architecture of the Adaptive Resonance Associative Map (ARAM) neural network, which is capable of associatively clustering inputs based on their features from two independent domains
P and the M-dimensional gene function vector X, respectively. Each LTM recognition category j in the F2 layer is associated with two adaptive weight templates, i.e. w_j = (w_j^a | w_j^b), with w_j^a and w_j^b having the same dimensionality as P and X respectively. Initially, the F2 recognition field contains a null set (zero categories). Upon incremental presentation of input signals, it is adaptively expanded to encode new knowledge.
Category Competition: In response to an input signal I = (P | X), the similarity between the input and each LTM recognition category j is evaluated according to

$$T(I, \mathbf{w}_j) = \gamma\, T_a(P, \mathbf{w}_j^a) + (1-\gamma)\, T_b(X, \mathbf{w}_j^b), \quad (4)$$

where γ ∈ [0, 1] is an associative contribution parameter and T_a(·) (resp. T_b(·)) is a predefined function, referred to as the choice function, that measures the similarity in domain space a (resp. b). In our studies,

$$T_a(P, \mathbf{w}_j^a) = \frac{|P \wedge \mathbf{w}_j^a|}{|P \vee \mathbf{w}_j^a|} = \begin{cases} \dfrac{\min(p_2, w_{j,2}^a) - \max(p_1, w_{j,1}^a) + 1}{\max(p_2, w_{j,2}^a) - \min(p_1, w_{j,1}^a) + 1} & \text{if } \min(p_2, w_{j,2}^a) - \max(p_1, w_{j,1}^a) > 0, \\ 0 & \text{otherwise}, \end{cases} \quad (5)$$

which essentially measures the percentage of overlap between two conserved regions, and

$$T_b(X, \mathbf{w}_j^b) = \frac{X \cdot \mathbf{w}_j^b}{\|X\| \cdot \|\mathbf{w}_j^b\|} = \frac{\sum_{i=1}^{M} x_i \cdot w_{j,i}^b}{\sqrt{\sum_{i=1}^{M} x_i^2}\, \cdot \sqrt{\sum_{i=1}^{M} (w_{j,i}^b)^2}}, \quad (6)$$

which is based on the cosine similarity between two feature vectors. The linear combination T(·) is referred to as the network's choice function. The category J that
receives the highest choice score, T(I, w_J) = max_j{T(I, w_j)}, is marked as the winner of the competition.
Resonance or Reset: If the competition generates a winner category J, its similarity to the input I is further confirmed in domain spaces a and b individually. The network is said to reach resonance if both choice function scores are over the corresponding vigilance thresholds ρ_a and ρ_b, denoted as

$$T_a(P, \mathbf{w}_J^a) \ge \rho_a \quad\text{and}\quad T_b(X, \mathbf{w}_J^b) \ge \rho_b. \quad (7)$$
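The two domain-wise choice functions, their convex combination, and the resonance test of Eqs. (4)–(7) could be sketched as follows (our illustration; function and variable names are assumptions):

```python
import numpy as np

def choice_a(p, w_a):
    """Eq. (5): fractional overlap between conserved region p = (p1, p2) and template w_a."""
    overlap = min(p[1], w_a[1]) - max(p[0], w_a[0])
    if overlap <= 0:
        return 0.0
    return (overlap + 1) / (max(p[1], w_a[1]) - min(p[0], w_a[0]) + 1)

def choice_b(x, w_b):
    """Eq. (6): cosine similarity between description vectors."""
    denom = np.linalg.norm(x) * np.linalg.norm(w_b)
    return float(np.dot(x, w_b) / denom) if denom > 0 else 0.0

def choice(p, x, w_a, w_b, gamma=0.5):
    """Eq. (4): convex combination used to rank the candidate LTM categories."""
    return gamma * choice_a(p, w_a) + (1 - gamma) * choice_b(x, w_b)

def resonates(p, x, w_a, w_b, rho_a=0.8, rho_b=0.5):
    """Eq. (7): both domain-wise scores must reach their vigilance thresholds."""
    return choice_a(p, w_a) >= rho_a and choice_b(x, w_b) >= rho_b
```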
When both conditions in Eq. (7) hold, network learning ensues, as defined in the next step. A mismatch reset happens when either of the match scores does not reach its vigilance value. During a mismatch reset, the network redoes the winner selection and resonance check iterations with the mismatched categories excluded, until a selected winner causes network resonance or all LTM categories have been reset. Network Learning: Once the search ends and network resonance is achieved, the attentional subsystem updates the weight vector w_J by incorporating the input knowledge from fields a and b, according to two learning functions:
$$\mathbf{w}_J^a = (w_{J,1}^a,\, w_{J,2}^a) = (\max(p_1, w_{J,1}^a),\ \min(p_2, w_{J,2}^a)), \quad (8)$$
which in practice records the overlap between the two alignment regions, and
$$\mathbf{w}_J^b = \eta X + (1 - \eta)\,\mathbf{w}_J^b, \quad (9)$$
which is a typical incremental learning function, where η ∈ [0, 1] is referred to as the learning speed. In case all LTM categories are reset but the network fails to reach a resonance state (or when F2 is null upon the presentation of the first input), the network switches to a fast commitment learning mode, which essentially expands the F2 recognition field by creating a direct copy of the input as a new LTM category, that is, w_new^a = P and w_new^b = X.
2.3 Category Membership Tracking and Category Pruning
Throughout the network's learning process, we track the category-input mapping information to facilitate our post-clustering processing. To do this, each LTM recognition category j is also associated with a collection of input members I{j} containing the original input features of the members clustered to j. Like many incremental clustering algorithms, ARAM may generate dead clusters (clusters with no members) or small-sized outlier clusters once the network converges. For this reason, we further prune the recognition categories of ARAM according to a member count threshold t at the end of learning. During this process, clusters with fewer than t members will be removed from the LTM field, their associated member
inputs will be sent to the network for an extra iteration of learning, and the outlier clusters formed in this last iteration will be permanently removed from the LTM. This process helps remove potentially mistakenly annotated subject genes.
2.4 Representation of Gene Cluster Features for Functional Annotation
Once the associative clusters are formed, one can see that they represent relatively distinct gene groups aligned to different conserved regions. Thus the final step of our process is to identify the best representative functional description, in its original free-text style, as the annotation of each gene family for end-user review. Recognizing that, on the one hand, an end-user typically prefers higher-ranked subject genes (i.e., those with low r values) as they indicate relatively higher confidence, and that, on the other hand, a representative functional description of each cluster should be close to the cluster center, we adopt a relatively straightforward combined ranking strategy: Firstly, for all N subject genes X_j in a cluster, we re-rank them according to their global alignment ranking r using consecutive integers p_j, i.e., p = 1 corresponds to the highest-ranked subject whereas p = N corresponds to the lowest-ranked one. Then, we calculate the distance between each member gene and the cluster center according to D(X_j, X_0) = ||X_j − X_0||, where the cluster center X_0 is calculated as X_0 = (1/N) Σ_j X_j. We then sort the member genes by this distance and assign each of them another integer ranking q_j, where q = 1 indicates the gene closest to the cluster center and q = N the furthest. Lastly, we use the original free-text functional annotation of the winner subject gene J corresponding to the highest overall ranking, that is, J = argmin_j(p_j + q_j). This completes our proposed integrative approach for automatic gene functional annotation.
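The combined ranking could be realized as in the sketch below (ours; the tuple layout of `members` is an assumption).

```python
import numpy as np

def representative_description(members):
    """Pick a cluster's representative annotation via J = argmin(p_j + q_j) (Sect. 2.4).

    `members` is a list of (global_rank_r, feature_vector_X, description) tuples,
    with feature_vector_X a NumPy array.
    """
    idx = range(len(members))
    by_rank = sorted(idx, key=lambda i: members[i][0])
    p = {i: k + 1 for k, i in enumerate(by_rank)}                  # re-rank by alignment rank r
    center = np.mean([m[1] for m in members], axis=0)              # cluster center X0
    by_dist = sorted(idx, key=lambda i: np.linalg.norm(members[i][1] - center))
    q = {i: k + 1 for k, i in enumerate(by_dist)}                  # rank by distance to center
    winner = min(idx, key=lambda i: p[i] + q[i])
    return members[winner][2]
```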
3 Experiments and Discussions
We evaluated our proposed approach on the annotation of a relatively new model legume species, Medicago truncatula ("Medicago" in the rest of this paper). Currently, the genome sequencing of Medicago is near completion, and functional genomics studies in this species are yet to reach their prime. The latest version of the Medicago Gene Index (MtGI) sequence data set (available via ftp://occams.dfci.harvard.edu/pub/bio/tgi/data/ Medicago truncatula/) contains 67,463 non-redundant sequences (unigenes) assembled from 259,642 expressed sequence tags (ESTs) and 25,600 expressed transcripts (ETs). This data set was generated in 2008, when the function of individual TCs was not the focus of the Gene Index project, and there has been no update of the unigenes' functional annotation since then. Our experiment aimed to test the effectiveness of the proposed technique for the re-annotation of this sequence data set.
3.1 Experimental Setup
To ensure the statistical validity of our benchmark, we shuffled the input order of the sequences and randomly divided the whole data set into 20 subsets of even size (in terms
of number of sequences, and regardless of individual sequence length). On each subset of sequences, our experiment was carried out as follows. We first conducted batch BLASTX searches against the latest version (20091107) of the NCBI NR database with an e-value threshold of 1e-5. All functional descriptions carried by the subject genes in the BLAST results were treated as a text corpus for selecting keyword terms and inferring IDF statistics, as summarized in Section 2.1. The stoplist used in our experiment consisted of 722 stopwords collected from a variety of online resources (MySQL stoplist, Wiki stoplist, linguistic stoplist, etc.), which were further stemmed into 617 common word roots, and we set DF thresholds of 20 and 10,000 for filtering too rare and too general terms respectively, which yielded about 2,100 keyword features in each of our replications. Then, we built an ARAM model for each individual query sequence to associatively cluster all its subject genes with non-null text features. We set γ = 0.5, ρa = 0.8, ρb = 0.5, η = 0.1 and t = 3 for all ARAM models.
3.2 Results and Discussions
Table 1 reports key statistics over our 20 replications, corresponding to the three major stages of our proposed approach, i.e., BLAST search, NLP-based feature representation, and associative clustering. Notably, out of the average of 3,373 query sequences in each replicate, only 2,456 (72.8%) showed significant sequence similarity with at least one subject gene (hit) in the NR database. This partially reflects the fact that Medicago, as well as many other legume species, is relatively new to the plant science community. On the other hand, despite Medicago being an under-studied genome, each of these query sequences was aligned to an average of 137.1 subject sequences in the NR database. This suggests that it shares a relatively common "core" biological machinery conserved across other species. Through our NLP-based feature representation process, on average 1,567 "top hit" subject genes of the average of 3,373 query sequences in each replication turned out to have null features. Our inspection identified that these are mostly meaningless descriptions like "unnamed protein product", "hypothetical protein", "predicted protein", etc. In other words, if one uses the conventional "top hit" annotation approach, one is able to roughly understand the functions of only about 26.4% of the Medicago genes. With our proposed approach, however, we were able to remove those null features from the analysis and focus on subject genes with more meaningful functional descriptions. On average, our NLP process identified 2,142 queries aligned to at least one subject with non-null features, thus significantly improving this rate to 63.5%. Additionally, by removing these meaningless descriptions, the average number of aligned subjects per query was significantly reduced from 137.1 to 74.3. In other words, even without the downstream clustering analysis, one can reduce the labor of reviewing the alignment results by about 45.8%. This demonstrates the effectiveness of the NLP-based feature representation process. Furthermore, as also reflected in Table 1, through its associative learning, the ARAM neural network was able to cluster the average of 74.3 subject genes into 7.0 clusters on average, i.e., approximately a 10x redundancy reduction, showing the effectiveness of the clustering process.
Table 1. Statistics of the BLAST search of the Medicago truncatula Gene Index (MtGI) sequence library against the NCBI NR database in our randomized 20 replicates, the effectiveness of our natural language processing-based feature representation process in removing meaningless subject descriptions, and the effectiveness of our associative clustering method in reducing the redundancy of subject descriptions. In the table, Q: total number of queries with at least one NR hit; S: total number of aligned subjects; AvgS/Q: average aligned subjects per query; NVec: total number of subjects with null-vector (meaningless) descriptions; QTSNVec: total number of queries with a null-vector top hit; QAllNVec: total number of queries with all null-vector subjects; QNonNVec: total number of queries with at least one non-null-vector subject; AvgTrn/Q: average number of training data (alignments with non-null description vectors) per query; AvgC/Q: average number of formed clusters (our final annotations) per query.

Fold   Q     S        AvgS/Q   NVec     QTSNVec  QAllNVec  QNonNVec  AvgTrn/Q  AvgC/Q
0      2459  339,344  138.0    180,481  1,591    317       2,142     74.2      6.3
1      2544  346,283  136.1    184,703  1,599    336       2,208     73.2      7.2
2      2439  334,388  137.1    175,864  1,578    296       2,143     74.0      6.5
3      2463  341,079  138.5    178,391  1,600    301       2,162     75.2      7.6
4      2511  345,487  137.6    183,125  1,575    313       2,198     73.9      7.2
5      2547  350,495  137.6    184,480  1,624    311       2,236     74.2      7.3
6      2381  320,647  134.7    168,108  1,545    320       2,061     74.0      6.9
7      2431  327,945  134.9    173,224  1,498    314       2,117     73.1      7.6
8      2368  330,054  139.4    175,979  1,509    296       2,072     74.4      5.9
9      2393  321,495  134.3    170,554  1,558    320       2,073     72.8      6.4
10     2524  350,737  139.0    184,364  1,631    317       2,207     75.4      6.8
11     2478  342,125  138.1    179,939  1,618    322       2,156     75.2      6.8
12     2466  333,840  135.4    173,066  1,545    308       2,158     74.5      7.9
13     2424  330,343  136.3    173,000  1,577    318       2,106     74.7      8.2
14     2417  334,475  138.4    179,159  1,514    326       2,091     74.3      6.8
15     2518  350,134  139.1    182,108  1,564    307       2,211     76.0      7.0
16     2428  328,026  135.1    172,164  1,527    304       2,124     73.4      7.2
17     2496  345,381  138.4    182,146  1,629    332       2,164     75.4      6.5
18     2451  332,836  135.8    176,067  1,544    316       2,135     73.4      6.7
19     2384  330,861  138.8    176,351  1,507    300       2,084     74.1      7.2
Avg.   2456  336,799  137.1    177,664  1,567    314       2,142     74.3      7.0
Std.   54.6  9,348.5  1.64     4,981.6  42.6     11.1      51.9      0.87      0.56
4 Concluding Remarks and Future Work
In this paper, we addressed two major problems biologists face with sequence alignment-based gene functional annotation, particularly when utilizing reference databases with free-text style functional descriptions such as the NCBI NR database: the functional descriptions of reference genes may contain a large degree of noise, and there may exist high redundancy within the aligned homologue genes. We proposed and examined an integrative approach that encompasses natural language processing (NLP) techniques for identifying and eliminating meaningless descriptions and an artificial neural network (ANN) model named the Adaptive Resonance Associative Map (ARAM) for reducing the redundancy of homologue genes. We tested our proposed
approach for the annotation of a relatively new genome, Medicago truncatula, and the proposed approach demonstrated satisfactory effectiveness. Specifically, in our experiments, the NLP process improved the annotation coverage by more than two-fold over the conventional "top hit" annotation method, and the associative clustering process reduced the homologue genes' redundancy by about 20-fold. In future work, we will examine the performance of the proposed approach over more genomes and against more reference databases. We will also study how to integrate the reference genes' free-text style functional descriptions with those in better organized ontology-based databases (such as the Gene Ontology (GO) [11] and the Kyoto Encyclopedia of Genes and Genomes (KEGG) ontology [12]) to further improve the quality of the functional annotation.
References 1. Dasarathy, B.V. (ed.): Nearest neighbor (NN) norms: NN pattern classification techniques. IEEE Computer Society, Los Alamitos (1990) 2. Pruitt, K.D., Tatusova, T., Maglott, D.R.: NCBI reference sequences (RefSeq): a curated nonredundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research Database Issue, D61–D65 (2007) 3. The UniProt Consortium: The universal protein resource (UniProt) 2009. Nucleic Acids Research Database Issue, D169–D174 (2009) 4. Lewis, D.D.: Naive (Bayes) at forty: The independence assumption in information retrieval. In: N´edellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 4–15. Springer, Heidelberg (1998) 5. Porter, M.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980) 6. Robertson, S.: Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation 60(5), 503–520 (2004) 7. Carpenter, G., Grossberg, S.: ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics 26, 4919–4930 (1987) 8. Tan, A.: Adaptive Resonance Associative Map. Neural Networks 8(3), 437–446 (1995) 9. Carpenter, G., Grossberg, S.: A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image processing 34, 54–115 (1987) 10. Carpenter, G.A., Grossberg, S., Reynolds, J.: ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks 4, 565–588 (1991) 11. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G.: Gene ontology: tool for the unification of biology. the gene ontology consortium. Nature Genetics 25(1), 25–29 (2000) 12. Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., Kanehisa, M.: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 27(1), 29–34 (1999)
Acquire Job Opportunities for Chinese Disabled Persons Based on Improved Text Classification ShiLin Zhang and Mei Gu Faculty of Computer Science, Network and Information Management Center North China University of Technology Beijing, China {zhangshilin,gumei}@126.com
Abstract. Text classification is an important field of research, and there are a number of approaches to classifying text documents; however, improving computational efficiency and recall remains an important challenge. In this paper, we propose a novel framework to segment Chinese words, generate word vectors, train on the corpus, and make predictions. Based on this text classification technology, we successfully help Chinese disabled persons to acquire job opportunities efficiently in the real world. The results show that using this method to build the classifier yields better results than traditional methods. We also show experimentally that careful selection of a subset of features to represent the documents can improve the performance of the classifiers. Keywords: Word segmentation; SVM; TFIDF; Word Vector.
1 Introduction
In recent years, we have seen an exponential growth in the volume of text documents available on the Internet. These Web documents contain rich textual information, but they are so numerous that users find it difficult to obtain useful information from them. This has led to a great deal of interest in developing efficient approaches to organizing these huge resources and assisting users in searching the Web. Automatic text classification, which is the task of assigning natural language texts to predefined categories based on their content, is an important research field that can help both in organizing and in finding information in these resources. Text classification presents many unique challenges and difficulties due to the large number of training cases and features present in the data set. This has led to the development of a number of text classification algorithms, which address these challenges to different degrees. These algorithms include k-NN [1], Naïve Bayes [2], decision trees [3], neural networks [4], SVM [5], and Linear Least Squares Fit [6]. In this paper, we aim to build an efficient system to help Chinese disabled persons find job opportunities such as the one shown in Fig. 1. The paper is structured as follows. In Section 2 we discuss related work. In Section 3 we describe our models and methods. Section 4 is devoted to experiments. In Section 5 we conclude, also mentioning some future work.
Fig. 1. One job advertisement from a Chinese web site. To help the disabled persons to acquire valuable information, we will classify it to a predefined category.
2 Related Work
The goal of text categorization is to classify the information on the Internet into a certain number of pre-defined categories. Text categorization is an active research area in information retrieval and machine learning, and several text categorization methods have recently been proposed. Furthermore, a feature selection method using a hybrid case-based architecture has been proposed by Gentili et al. [7] for text categorization, where two multi-layer perceptrons are integrated into a case-based reasoner. Wermeter has used the document title as the vector to be used for document categorization [5]. Ruiz and Srinivasan [8] and Calvo and Ceccatto [9] have used the X2 measure to select the relevant features before classifying the text documents using a neural network. The hybrid architecture consists of four modules, as shown in Fig. 2: (1) the page-preprocessing module is used to extract textual features of a document, (2) the feature-weighting module is designed to rank the importance of features, (3) the feature-selecting module utilizes the PCA neural network to reduce the dimensionality of the feature space, and (4) the page-classifying module employs a neural network or SVM to perform the categorization. In this approach, each web page is represented by the term frequency-weighting scheme in the page-preprocessing module and the feature-weighting module.
Fig. 2. The traditional architectures of text classification
As the dimensionality of a feature vector in the collection set is large, PCA has been used to reduce it to a small number of principal components in the feature-selecting module.
3 Methodology
In this section, we classify the job information by district and by job type, respectively, to help disabled persons easily find jobs of interest.
3.1 Classifying Job Information by District
Word segmentation and part-of-speech (POS) tagging are important tasks in computer processing of Chinese and other Asian languages. Several models have been introduced for these problems, for example, the Hidden Markov Model (HMM) (Rabiner, 1989), the Maximum Entropy Model (ME) (Ratnaparkhi and Adwait, 1996), and Conditional Random Fields (CRFs) (Lafferty et al., 2001). CRFs have the advantage of flexibility in representing features compared to generative models such as the HMM, and usually perform best on the two tasks. Another widely used discriminative method is the perceptron algorithm (Collins, 2002), which achieves performance comparable to CRFs with much faster training, so we base this work on the perceptron. We adopt a cascaded linear model inspired by the log-linear model (Och and Ney, 2004) widely used in statistical machine translation to incorporate different kinds of knowledge sources. As shown in Fig. 3, the cascaded model has a two-layer architecture, with a character-based perceptron as the core combined with other real-valued features such as language models; a sketch of this combination follows Fig. 3.
Fig. 3. Structure of Cascaded Linear Model
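As a rough illustration only (ours; the feature set and weight names are assumptions), the outer layer of such a cascaded linear model combines the perceptron score with the real-valued knowledge sources linearly:

```python
def cascaded_score(candidate, perceptron_score, extra_features, weights):
    """Outer linear layer: weighted sum of the character-based perceptron score
    and real-valued features such as language-model log-probabilities."""
    score = weights["perceptron"] * perceptron_score(candidate)
    for name, feature in extra_features.items():
        score += weights[name] * feature(candidate)
    return score

def best_segmentation(candidates, perceptron_score, extra_features, weights):
    """Pick the candidate segmentation with the highest combined score."""
    return max(candidates,
               key=lambda c: cascaded_score(c, perceptron_score, extra_features, weights))
```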
After segmentation, we obtain a word vector containing only the place names to represent a document. We then use this vector to classify the document it represents into a pre-defined class. Here we adopt the naive Bayes method to achieve an unsupervised classification. First, we use the category names of predefined texts as class labels; every category contains all the place names of that category. All the class texts can then be used as the training set, but we avoid an explicit training procedure. In this way, we divide all Chinese districts into 14 categories, and we use them as training sets to classify the preprocessed word vector into one of the 14 classes.
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model". In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers often work much better in many complex real-world situations than one might expect. Recently, careful analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable efficacy of naive Bayes classifiers. Abstractly, the probability model for a classifier is a conditional model
$$p(C \mid F_1, F_2, \ldots, F_n). \quad (1)$$
Here, F1–Fn represent a word vector (a string of place names), and C represents a predefined class name. The model is defined over a dependent class variable C with a small number of outcomes or classes, conditional on several feature variables F1 through Fn. The problem is that if the number of features n is large, or when a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, we write
$$p(C \mid F_1, F_2, \ldots, F_n) = \frac{p(C)\, p(F_1, F_2, \ldots, F_n \mid C)}{p(F_1, F_2, \ldots, F_n)}. \quad (2)$$
In practice we are only interested in the numerator of that fraction, since the denominator does not depend on C and the values of the features Fi are given, so the denominator is effectively constant. The numerator is equivalent to the joint probability model p(C, F1, …, Fn), which can be rewritten using repeated applications of the definition of conditional probability. Now the "naive" conditional independence assumptions come into play: assume that each feature Fi is conditionally independent of every other feature Fj for j ≠ i. This means that

$$p(F_i \mid C, F_j) = p(F_i \mid C). \quad (3)$$
This means that under the above independence assumptions, the conditional distribution over the class variable C can be expressed like this:
$$p(C \mid F_1, F_2, \ldots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C). \quad (4)$$
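A minimal sketch of how Eq. (4) could be evaluated over the 14 district categories (our illustration; the membership-based estimate of p(F_i | C) and the smoothing constant are assumptions, not part of the paper):

```python
import math

def classify_by_district(place_names, category_places, prior=None, eps=1e-6):
    """Return the district category maximizing the posterior of Eq. (4), in log space."""
    best, best_score = None, -math.inf
    for c, places in category_places.items():
        log_p = math.log(prior[c]) if prior else 0.0      # uniform prior when none is given
        for f in place_names:                              # independence assumption of Eq. (3)
            log_p += math.log(1.0 - eps if f in places else eps)
        if log_p > best_score:
            best, best_score = c, log_p
    return best
```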
For every document, namely a word vector containing a series of place names, we compute the maximum of p(C, F1, F2, …, Fn) and classify the document into the class with the maximum probability.
3.2 Classifying Job Information by Job Types
To classify documents by job type, we use a two-phase algorithm. In the first phase, we use the predefined classes of all known job type names as the class training set.
Fig. 4. Category example of jingjinji district
Since we cannot extract job type names from documents directly, we use regular expressions to match the document's words against the predefined job type names. With this method, we can successfully classify the documents in most cases. But how can we know all job type names? If a document contains a new job type name, the above method fails, so we apply the following second phase. This phase classifies a document into a predefined class whenever the first method fails, and it consists of three steps. In the first step, we segment the document into words; the segmentation method is the same as in the previous section, except that the words now include all types other than stopwords. After segmentation, we obtain word vectors representing each document. In the next step, we compute each word's feature weight, now using an improved TF-IDF method. In the feature-weighting module, the vector obtained from the pre-processing module is weighted using term frequency (TF) and inverted document frequency (IDF). TF-IDF has been well studied in the information retrieval literature. This scheme is based on the assumption that terms that occur in fewer documents are better discriminators. Therefore, if two terms occur with the same frequency in a document, the term occurring less frequently in other documents will be assigned a higher value. But TF-IDF simply counts TF without considering where the term occurs. Each sentence in a document has a different importance for identifying the content of the document. Thus, by assigning a different weight to each term according to the importance of the sentence it occurs in, we can achieve better results. Generally, we believe that a title summarizes the important content of a document, so terms that occur in the title have higher weights. In our approach, we use WTFi in place of TFi in TF-IDF, calculated as follows: (1) each time a word occurs in the title, its WTFi is increased by ten; (2) each time a word occurs in a heading, its WTFi is increased by six; (3) each time a word occurs in boldface type, its WTFi is increased by three; and (4) each time a word occurs, its WTFi is increased by one. Let DFi be the number of documents in the collection that contain term ti. The weight of word ti, denoted by Wi,p, is expressed as follows:
$$W_{i,p} = \frac{WTF_i}{|p|} \cdot \log\frac{n}{DF_i}. \quad (5)$$
Here n is the number of documents in the collection (j = 1, …, n), and |p| is used to normalize the term frequency to [0, 1] in order to avoid favoring long documents over short documents. In the third step, we use an SVM algorithm to train on the word vectors and make predictions for newly arriving documents. In spite of the prominent properties of SVMs, current SVM algorithms cannot easily deal with very large datasets: a standard algorithm requires solving a quadratic or linear problem, so its computational cost is unbearable. In this paper, we describe methods to build an incremental and parallel LS-SVM algorithm for classifying very large sets of word vectors. Consider the linear binary classification task depicted in Fig. 5, with m data points xi (i = 1..m) in the n-dimensional input space Rn. It is represented by the [m×n] matrix A, having corresponding labels yi = ±1, denoted by the [m×m] diagonal matrix D of ±1 (where D[i,i] = 1 if xi is in class +1 and D[i,i] = −1 if xi is in class −1). For this problem, an SVM algorithm tries to find the best separating plane, i.e., the one farthest from both class +1 and class −1. Therefore, SVMs simultaneously maximize the distance between two parallel supporting planes for each class and minimize the errors.
Fig. 5. Linear separation of the data points into two classes
For multi-class SVM, we construct N SVM classifiers. The i-th classifier uses the documents of the i-th class as the positive training set and all other documents as the negative training set. When training is complete, we obtain N classifiers, and we can then make predictions for new documents, as sketched below.
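A one-vs-rest setup of this kind might look like the following sketch (ours; a plain linear SVM from scikit-learn stands in for the incremental, parallel LS-SVM described above).

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_rest(X, y, n_classes, C=1.0):
    """Train one linear SVM per class: class i is positive, all other documents negative."""
    models = []
    for i in range(n_classes):
        clf = LinearSVC(C=C)
        clf.fit(X, np.where(y == i, 1, -1))
        models.append(clf)
    return models

def predict_one_vs_rest(models, X):
    """Assign each document to the class whose SVM gives the largest decision value."""
    scores = np.column_stack([m.decision_function(X) for m in models])
    return scores.argmax(axis=1)
```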
4 Experiment Results
To test the proposed system, we collected a data set of job advertisements obtained from http://www.cdpf.org.cn/, the official web site of the China Disabled Persons' Federation, comprising 5,732 web pages. The job types in the data set are workers (718 documents), designers (116 documents), programmers (953 documents), doctors (1,257 documents), accountants (521 documents), managers (126 documents), and others (962 documents). Among the data set, 4,500 documents (about 80%)
selected randomly from the different classes were used as training data, and the remaining 1,232 documents (about 20%) as test data. The documents come from China's 31 provinces and municipalities. Two methods of measuring effectiveness that are widely used in the information extraction research community were selected to evaluate the extraction performance. The methods are:
Precision: the percentage of the answers given that are correct.
Recall: the percentage of the possible answers that are correctly extracted.
Table 1. Experiment results of classification by district
                  Precision   Recall
KNN                  82%        85%
Traditional NB       89%        90%
Traditional SVM      92%        86%
Our Algorithm        95%        92%
Table 2 presents our classification results by job types.
Table 2. Experiment results of classification by job types
                  Precision   Recall
KNN                  82%        85%
Traditional NB       89%        88%
Traditional SVM      85%        86%
Our Algorithm        90%        90%
5 Conclusion
This paper proposes a method for automatically classifying Chinese job information into several predefined classes by using text mining techniques, for the benefit of Chinese disabled persons. Based on former research and the features of Chinese job information, this paper makes the following major improvements: 1) To help Chinese disabled persons acquire valuable information, we classify the large number of job advertisements by district and by job type, and we use different improved algorithms to accomplish this rather than the traditional text classification methods. The results show that our method beats the traditional methods in speed and effectiveness. According to the characteristics of place names in job advertisements, we propose that documents be classified by two procedures — place name extraction followed by an unsupervised naive Bayes classifier with the predefined place name sets as training sets — in order to improve the accuracy of classification. 2) For word segmentation, we combine a cascaded linear model and an HMM to accomplish the segmentation. We then use a revised TF-IDF to compute the word vectors that represent each document, and finally we adopt a parallel LS-SVM algorithm to train and make predictions for new documents.
Acknowledgments. The work was supported by the Key Projects in the National Science & Technology Pillar Program during the Eleventh Five-year Plan Period of China with Grant No. 2008BAH26B02-3.
Research and Application to Automatic Indexing
Lei Wang 1, Shui-cai Shi 1,2, Xue-qiang Lv 1,2, and Yu-qin Li 1,2
1 Chinese Information Processing Research Center, Beijing Information Science and Technology University, Beijing, China
2 Beijing TRS Information Technology Co., Ltd., Beijing, China
[email protected], [email protected], [email protected], [email protected]
Abstract. Based on a study of TF-IDF, information gain and information entropy, this paper proposes an improved weight calculation method, which combines TF-IDF normalization with information gain, to extract keywords. Indexing words are then obtained by computing the semantic similarity of the keywords, completing the automatic indexing process. Comparative experiments show that the comprehensive assessment value of the indexing words obtained with the modified weight calculation method is higher than that obtained with the traditional TF-IDF method. Keywords: Automatic indexing, SVM classification, Clustering, Method of weight calculation.
1 Introduction
Manual indexing has played a very important role in the history of information retrieval, but it has many shortcomings. Cleverton [1], who documented these shortcomings with experimental data, reported that when two experienced indexers index the same document with the same thesaurus, only about thirty percent of the resulting indexing words are the same. In addition, different people hold different views of the same material because their levels of knowledge and ways of thinking differ, so the indexing result contains many subjective judgments of the indexing staff. Owing to these insurmountable problems of manual indexing, a new indexing method is needed, and computer-aided automatic indexing techniques are a breakthrough toward solving this problem. Automatic indexing overcomes several disadvantages that manual indexing finds difficult to avoid, and it offers a series of advantages such as large capacity, high speed, low cost, high consistency and good stability.
2 Related Work
Automatic indexing [2] is of two types: automatic assignment indexing and automatic keyword extraction. Automatic assignment indexing converts the process of automatic
assignment into a process of subject-word classification, or converts keywords of the text into subject words using external resources such as a post-controlled vocabulary (containing synonyms, broader and narrower terms and related words), thesauri or ontologies. Automatic keyword extraction directly extracts words or phrases from the original text as indexing words to describe its topical content, and it is subdivided into linguistic analysis methods and statistical learning methods [3]. At present neither approach is mature, and many problems remain to be solved. In this paper, after weighing the advantages and disadvantages of linguistic analysis and statistical learning, we introduce an improved method to extract indexing words. Experiments show that this method effectively improves the accuracy of the indexing words.
3 Algorithm Researches
3.1 Analysis of the TF-IDF Algorithm
TF (Term Frequency) [3] is the word frequency or feature frequency. If a word has a high TF value in one text but also has a high TF value in other texts, it is difficult to say which text's characteristics are expressed by this word, so using the TF value alone is very limited. In 1972, Spärck Jones proposed that computing the document frequency helps in computing word weights; since then the inverse document frequency (IDF) formula has played an important role in information retrieval [4]. IDF reduces the importance of high-frequency features that appear in most documents and, at the same time, raises the importance of low-frequency features that appear in only a small part of the documents. IDF is calculated as

    idf(T_k) = log(N / n_k + 0.1).    (1)
N is the total number of documents in the collection and n_k is the number of documents that contain the feature T_k. The combined TF-IDF weight [5] is defined as

    Weight_tf_idf(T_k) = tf(T_k) × idf(T_k).    (2)
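A small sketch of formulas (1) and (2) in Python; the toy corpus and tokenization are assumptions, since the paper works on segmented Chinese text.

```python
import math
from collections import Counter

def tfidf_weights(doc_tokens, corpus_tokens):
    """Weights per formulas (1)-(2): idf(T_k) = log(N/n_k + 0.1),
    weight(T_k) = tf(T_k) * idf(T_k). Inputs are lists of tokens."""
    N = len(corpus_tokens)
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))                 # document frequency n_k
    tf = Counter(doc_tokens)
    return {t: tf[t] * math.log(N / df[t] + 0.1) for t in tf}

corpus = [["automatic", "indexing", "method"],
          ["indexing", "words", "extraction"]]
print(tfidf_weights(corpus[0], corpus))
```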
3.2 The Improvement Method
The length of the text feature vector has a large effect on the weight calculated for a word. We regard the dataset as an information source that follows some probability distribution, and we compute the information gain between the text entropy and the word entropy, which depends on the length of the dataset, in order to estimate how much information a word contributes to the text. The final word weight is obtained by combining this information volume with the normalized TF-IDF value and taking the location of the word into account.
Calculating Entropy and Information Gain. We compute the information entropy of a word T_k from its occurrence probability, which is a crucial indicator of the word's importance:

    ent(T_k) = -∑_{i=1}^{n} p_i · log p_i,    (3)

    p_i = tf(T_ik) / tf.    (4)
Here T_k is the current word and p_i is the ratio of the frequency of the current word in the current document i to its frequency over all documents: tf(T_ik) is the frequency of the current word in document i, and tf is its frequency in all documents. The TF-IDF method ignores the effect of the length of the text feature vector, so the paper introduces text entropy to express the importance of that length:

    ENT = -∑_{i=1}^{n} q_i · log q_i,    (5)

    q_i = wordnum_i / wordnum,    (6)

where ENT is the text entropy and q_i is the ratio of the number of words in the current text to the total number of words in all texts: wordnum_i is the number of distinct words in text i, and wordnum is the sum of all wordnum_i. The effectiveness of keyword extraction is in direct proportion to the text entropy and inversely proportional to the word entropy, which is expressed as the information gain

    G(T_k) = ENT - ent(T_k).    (7)
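The entropies and the information gain of formulas (3)-(7) can be sketched as follows; the per-document counts are hypothetical.

```python
import math

def word_entropy(tf_per_doc):
    """ent(T_k) = -sum_i p_i log p_i with p_i = tf(T_ik) / tf  (formulas (3)-(4))."""
    total = sum(tf_per_doc)
    return -sum((f / total) * math.log(f / total) for f in tf_per_doc if f > 0)

def text_entropy(wordnum_per_doc):
    """ENT = -sum_i q_i log q_i with q_i = wordnum_i / wordnum  (formulas (5)-(6))."""
    total = sum(wordnum_per_doc)
    return -sum((n / total) * math.log(n / total) for n in wordnum_per_doc if n > 0)

# Hypothetical counts: frequency of one word in three documents, and the number
# of distinct words in each of those documents.
tf_per_doc = [5, 1, 0]
wordnum_per_doc = [120, 80, 100]
G = text_entropy(wordnum_per_doc) - word_entropy(tf_per_doc)   # formula (7)
print(G)
```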
Combining TF-IDF Normalization with Information Gain. TF-IDF is first normalized so that its value lies between 0 and 1:

    TF_IDF_normalization = tf(T_ik) × idf(T_k) / sqrt( ∑_{k=1}^{n} (tf(T_ik))^2 × (idf(T_k))^2 ).    (8)
The improved calculation, which combines the traditional TF-IDF formula with the information gain, is more reasonable, and the weight remains between 0 and 1 after normalization:

    weight(T_ik) = tf(T_ik) × idf(T_k) × G(T_k) / sqrt( ∑_{k=1}^{n} (tf(T_ik))^2 × (idf(T_k))^2 × G(T_k) ).    (9)
Weight Calculation Considering Location Information. TF-IDF considers only the term frequency and inverse document frequency factors and ignores the location of words, so the weight calculated in this way cannot accurately reflect the importance of a word in the text. To address this issue, the paper
assigns a different weight coefficient λ to each word location. An investigation by P. E. Baxendale [7] found that the topic of a document is expressed by its first sentence with probability 85% and by its last sentence with probability 7%. The improved term frequency is

    ptf(T_ik) = tf(T_ik) × λ.    (10)
The weight coefficient is assigned based on experience, as shown in Table 1.

Table 1. The different coefficients of weight

Word location       Coefficient λ
Title               2.5
First paragraph     2
Last paragraph      1.5
Other location      1
The weight formula after taking location information into account is

    w(T_ik) = ptf(T_ik) × idf(T_k) × G(T_k) / sqrt( ∑_{k=1}^{n} (ptf(T_ik))^2 × (idf(T_k))^2 × G(T_k) )
            = λ × tf(T_ik) × idf(T_k) × G(T_k) / sqrt( ∑_{k=1}^{n} (λ × tf(T_ik))^2 × (idf(T_k))^2 × G(T_k) ).    (11)
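A sketch of the final location-aware weight of formula (11) under the coefficients of Table 1; the input dictionary (term frequency, idf, information gain and location per word) and the example values are assumptions, and the information gain is assumed non-negative so that the square root is real.

```python
import math

LOCATION_COEFF = {"title": 2.5, "first_paragraph": 2.0, "last_paragraph": 1.5, "other": 1.0}

def location_weights(terms):
    """w(T_ik) per formula (11): ptf = lambda * tf; the product ptf*idf*G is
    normalized by sqrt(sum(ptf^2 * idf^2 * G)) over the document's terms.
    `terms` maps each word to a (tf, idf, G, location) tuple."""
    vals = {w: (LOCATION_COEFF.get(loc, 1.0) * tf, idf, g)
            for w, (tf, idf, g, loc) in terms.items()}
    denom = math.sqrt(sum(ptf ** 2 * idf ** 2 * g for ptf, idf, g in vals.values())) or 1.0
    return {w: ptf * idf * g / denom for w, (ptf, idf, g) in vals.items()}

# Hypothetical inputs: (term frequency, idf, information gain, location).
terms = {"indexing": (6, 1.8, 0.40, "title"), "method": (3, 1.1, 0.20, "other")}
print(location_weights(terms))
```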
4 Brief Introduction of Automatic Indexing Process
4.1 The Flow Chart of Automatic Indexing Process
Fig. 1. Process of automatic indexing (dataset → pretreatment: word segmentation and stop-word removal → weight calculation → keywords extraction → word similarity calculation → indexing term extraction → clustering / classification → indexing vector establishment → evaluation)
4.2 Corpus Preprocessing
(1) Corpus selection: the corpus was downloaded from the Internet. The data cover politics, agriculture, sport and art, with 1,000 documents per class. (2) Word segmentation and stop-word removal: each text is segmented with the ICTCLAS 1.0 segmentation tool and useless words are removed. (3) Removing unimportant words: filtering words by part of speech removes words that are not important.
4.3 Words Extraction Based on Semantics
Using information such as word frequency, location and part of speech, some words are extracted as keywords: the weight of every word in a document is computed with the formulas of Section 3, the words are ranked by weight, and the top 20 are chosen as keywords. The paper then uses HowNet [8] to compute the semantic similarity of the keywords: all words whose pairwise semantic similarity is 1 are merged, and their weights are accumulated as the weight of the combined word. The words obtained after this similarity processing are the indexing words.
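A sketch of this keyword-to-indexing-word step; the `similarity` callback stands in for the HowNet-based similarity, and the weights are assumed to come from the formulas of Section 3.

```python
def extract_index_terms(weights, similarity, top_n=20):
    """Select the top-N words by weight, then merge words whose pairwise
    semantic similarity equals 1, accumulating their weights (Section 4.3)."""
    top = sorted(weights, key=weights.get, reverse=True)[:top_n]
    groups = []
    for word in top:
        for group in groups:
            if similarity(group[0], word) == 1.0:
                group.append(word)
                break
        else:
            groups.append([word])
    return {"/".join(g): sum(weights[w] for w in g) for g in groups}

# Toy usage with a stand-in similarity function.
w = {"computer": 0.9, "calculator": 0.7, "farm": 0.5}
sim = lambda a, b: 1.0 if {a, b} == {"computer", "calculator"} else 0.0
print(extract_index_terms(w, sim, top_n=3))
```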
5 Experimentation and Analysis
The quality of the results is validated with a classification experiment and a clustering experiment. The data were downloaded from the Internet and include four categories of documents (politics, agriculture, sport and art), with 1,000 documents per category. In the SVM classification experiment the training data are four times the size of the testing data, and the two sets are disjoint. The clustering experiment uses three categories (agriculture, sport and politics), with 800 documents each.
5.1 Experimental Result of Classification
Two sets of experimental results were obtained on the same training and testing data with the two weight calculation methods. Five different results were obtained by cross-validation and averaged to give the final result. The performance of SVM classification [9] is evaluated mainly by precision, recall rate and F1 value. The averaged cross-validation result for the traditional TF-IDF method is shown in Table 2, that for the improved method in Table 3, and Fig. 2 shows the difference between the two methods more intuitively.

Table 2. The average result of SVM cross-validation - TF-IDF method

Category     Training  Testing  Correct  Classified  Precision  Recall   F1 value
Art          800       200      196      294         65.46%     98.40%   78.26%
Agriculture  800       200      190      211         96.28%     95.30%   95.77%
Politics     800       200      140      152         92.21%     70.30%   81.00%
Sport        800       200      137      182         82.66%     68.70%   72.41%
Table 3. The average result of SVM cross-validation - improved method

Category     Training  Testing  Correct  Classified  Precision  Recall   F1 value
Art          800       200      195      239         83.22%     97.60%   89.47%
Agriculture  800       200      192      198         97.3%      95.50%   96.68%
Politics     800       200      185      212         89.36%     92.60%   89.87%
Sport        800       200      147      151         97.38%     73.60%   83.30%
Fig. 2. The result comparison of classification
5.2 Experimental Result of Clustering
The clustering experiment yields two FI values, one for each weight calculation method. Clustering performance [10] is evaluated mainly by the FI value: the larger the FI value, the better the clustering quality.

Table 4. The result of clustering

Method            Corpus                                       Final FI value
TF-IDF method     Agriculture(800)+Sport(800)+Politics(800)    0.633665
Improved method   Agriculture(800)+Sport(800)+Politics(800)    0.663815
5.3 Experimental Result Analysis
The two experiments clearly reveal the difference between the two methods. In the classification experiment, cross-validation improves the validity of the results; Fig. 2 shows that the minimum F1 value of the improved method is 10.89 percentage points higher than that of the traditional TF-IDF method. The clustering experiment uses the K-means method. Table 4 shows that the two methods also differ clearly here, although the clustering results are not as good as the classification results: the final FI value of the improved method is 0.03015 higher than that of the traditional TF-IDF method, which shows that the improvement of the clustering effect is also obvious.
6 Conclusion
The improved method overcomes the shortcomings of the traditional TF-IDF method, and the improved weighting of indexing words clearly benefits both classification and clustering. Much remains to be studied in the indexing field. Evaluating the quality of automatic indexing is a very difficult issue: many researchers estimate it by comparing automatic indexing with manual indexing, but with this approach human subjectivity has too great an effect on the evaluation result, so further research on better quality evaluation methods for automatic indexing is still needed.
Acknowledgments. The research work is supported by the 863 Key Program of China (2006AA010105), the National Natural Science Foundation of China (60872133), the Beijing Municipal National Natural Science Foundation (4092015), the Funding Project for Academic Human Resources Development in Institutions of Higher Learning Under the Jurisdiction of Beijing Municipality (PXM2007_014224_044677, PXM2007_014224_044676), and the Scientific Research Common Program of Beijing Municipal Commission of Education (KM200910772022).
References 1. Cleverton, C.: Optimizing Convenient Online Access to Bibliographic Database. Information Services and Use (1984) 2. Chengzhi, Z.: Review and Prospect of Automatic Indexing Research. J. New Technology of Library and Information Service. 11, 33–39 (2007) (in Chinese) 3. Yunzhi, Z.: Improvement of Automatic Indexing by Statistical Analysis. J. Journal of the China Society for Scientific and Technical Information 19, 333–337 (2000) (in Chinese) 4. Liu-ling, D., He-yan, H.: A Comparative Study on Feature Selection in Chinese Text Categorization. J. Journal of Chinese Information Processing 18, 26–32 (2004) (in Chinese) 5. Salton, G., Buckley, B.: Term-weighting Approaches in Automatic Text Retrieval. J. Information Processing and Management 24, 513–523 (1998) 6. Shiyi, S., Zhonghua, W.: Information Theory Fundamentals and Applications. Beijing Higher Education Press, Beijing (2004) 7. Harold, B.: Abstracting Concepts and Methods. Academic Press, New York (1975) 8. Qun, L., Sujian, L.: Word Similarity Computing Based on How-net. J. International Journal of Computational Linguistics & Chinese Language Processing 7, 59–76 (2002) (in Chinese) 9. Jichao, C.: Technology and Application of Support Vector Machine. J. Science & Technology Information 25, 490–491 (2007) (in Chinese) 10. Rui, Z.: Research and Implementation on Chinese document clustering Based on k-means. Northwest University, Xian (2009) (in Chinese) 11. Deyi, T.: Study for Categorization Based on Feature Weighting. Hefei University of Technology, Hefei (2007) (in Chinese)
Hybrid Clustering of Multiple Information Sources via HOSVD
Xinhai Liu 1,3, Lieven De Lathauwer 1,2, Frizo Janssens 1, and Bart De Moor 1
1 K.U. Leuven, ESAT-SCD, Kasteelpark Arenberg 10, Leuven 3001, Belgium
2 K.U. Leuven, Group Science, Engineering and Technology, Campus Kortrijk, Belgium
3 WUST, CISE and ERCMAMT, Wuhan 430081, China
Abstract. We present a hybrid clustering algorithm for multiple information sources via tensor decomposition, which can be regarded as an extension of spectral clustering based on modularity maximization. The hybrid clustering problem can be solved by the truncated higher-order singular value decomposition (HOSVD). Experimental results on synthetic data demonstrate the effectiveness of the approach. Keywords: hybrid clustering, HOSVD, spectral clustering, tensor.
1 Introduction
Hybrid clustering of multiple information sources means the clustering of the same class of entities that can be described by different representations from various information sources. The need for clustering multiple information sources is almost ubiquitous, and applications abound in all fields, including market research, social network analysis and many scientific disciplines. As an example in social network analysis, with the pervasive availability of Web 2.0, people can interact with each other easily through various social media. For instance, popular sites like Del.icio.us, Flickr and YouTube allow users to comment on shared content and to tag their favorite content [1]. These diverse individual activities result in a multi-dimensional network among users. An interesting research problem that arises here is how to unify heterogeneous data sources from different points of view to facilitate clustering. Intuitively, multiple information sources can help infer more accurate latent cluster structures among entities. Nevertheless, due to the heterogeneity of different information sources, it becomes a challenge to identify clusters in multiple information sources, as we have to fuse the information from all data sources for joint analysis. While most clustering algorithms are conceived for clustering data from a single source, the need to develop general theories or frameworks for clustering multiple heterogeneous information sources that share dependency has become more and more crucial. Unleashing the full power of multiple information sources is, however, a very challenging task; for example, the scheme of different data
collections might be very different (data heterogeneity). Although several approaches for utilizing multiple information sources have been proposed [2,3,1], these methods seem ad-hoc. Increasingly, tensors are becoming common in modern applications dealing with complex heterogeneous data, which provide novel tools for joint analysis on multiple information sources. Tensors have been successfully applied to several domains, such as chemometrics [4], signal processing [5,6], Web search [7,8] and data mining [9]. Tensor clustering is a recent generalization to the basic one-dimensional clustering problem, and it seeks to decompose a n-order input tensor into coherent sub-tensors while minimizing some cluster quality measures [10]. Higher-order singular value decomposition (HOSVD) is a convincing generalization of the matrix SVD to tensor decomposition [8]. Meanwhile, multiple information sources can be easily modeled as a tensor and the inner relationship among them can be naturally investigated by tensor decomposition analysis. In this work, we first review modularity maximization, a recently developed measure for clustering. We discuss its application on single information source and then extend it to multiple information sources. Since multiple matrices factorization is involved in our hybrid clustering of multiple information sources, we formulate our problem within the framework of tensor decomposition and propose a novel algorithm: hybrid clustering based on HOSVD (HC-HOSVD). Our experiments on synthetic data validate the superiority of our proposed approach.
2 Related Work
Some hybrid clustering methods to integrate multiple information sources have emerged: clustering ensemble [3], multi-view clustering [2] and kernel fusion [11]. Recently, Tang et al. [1] propose a method named principle modularity maximization (PMM) to detect the cluster in multi-dimensional networks. They also compare PMM with average modularity maximization (AMM), which combines the multiple information sources averagely and then maximizes the modularity. Although above methods are effective for certain application scenarios, they seem to be restricted that they lack an effective scheme to investigate the inner relationship among diverse information sources. Tensor decomposition, more especially HOSVD, is a basic data analysis task with growing importance in the application of data mining. Savas and Eld´en [9] apply the HOSVD to the problem of identifying handwritten digits. J.-T. Sun et al. [12] use HOSVD to analyze web site click-through data. Liu et al. [13] apply HOSVD to create a tensor space model, analogous to the well-known vector space model in text analysis. J. Sun et al. [14,15] have written two papers on dynamically updating a HOSVD approximation, with applications ranging from text analysis to environmental and network modeling. Based on tensor decomposition, Kolda et al. [8] propose TOPHITS algorithm for Web analysis by incorporating text information and link structure. Huang et al. [16] present a kind of HOSVD based clustering and employ it to image recognition. We call that method data vector clustering based on HOSVD (DVC-HOSVD). Our
algorithm has a similar flavor but is formulated on modularity tensors instead of data vectors. Collectively, multiple-information analysis requires a flexible and scalable framework that exploits the inner correlation among different information sources, and tensor decomposition fits this requirement. To the best of our knowledge, our work is the first unified attempt to address modularity-maximization-based hybrid clustering via tensor decomposition.
3 Modularity-Based Spectral Clustering
3.1 Modularity Optimization on Single Data Source
Modularity is a benefit function used in the analysis of networks or graphs. Its most common use is as a basis for optimization methods for detecting cluster structure in networks [17]. Consider a network composed of N nodes or vertices connected by m links or edges. The modularity of this network is defined as

    Q = (1/2m) ∑_{ij} [ A_ij - (k_i k_j)/(2m) ] δ(c_i, c_j),    (1)

where A_ij represents the weight of the edge between i and j, k_i = ∑_j A_ij is the sum of the weights of the edges attached to vertex i, c_i is the cluster to which vertex i belongs, and the function δ(u, v) is 1 if u = v and 0 otherwise. The value of the modularity lies in the range [-1, 1]. It is positive if the number of edges within groups exceeds the number expected on the basis of chance. In general, one aims to find a cluster structure such that Q is maximized. While maximizing the modularity over hard clusterings is proved to be NP-hard, a relaxation of the problem can be solved efficiently [18]. Let d ∈ Z_+^N be the degree vector of the nodes and S ∈ {0,1}^{N×C} (C is the number of clusters in the network) be a cluster indicator matrix defined by

    S_ij = 1 if vertex i belongs to cluster j, and 0 otherwise.    (2)

The modularity matrix is defined as

    B = A - d d^T / (2m).    (3)

Writing tr(·) for the trace operation, the modularity can be reformulated as

    Q = (1/2m) tr(U^T B U).    (4)

Relaxing U to be continuous, it can be inferred that the optimal U is composed of the top k eigenvectors of the modularity matrix [17]. Given a modularity matrix B, the objective function of this spectral clustering is

    max_S tr(S^T B S),  s.t. S^T S = I.    (5)
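A minimal sketch of formula (3) and the relaxed problem (5) with NumPy; taking the top C - 1 eigenvectors follows the convention of Algorithm 1 later in the paper, and the adjacency matrix is assumed symmetric and non-negative.

```python
import numpy as np

def modularity_matrix(A):
    """B = A - d d^T / (2m), formula (3); A is a symmetric weighted adjacency matrix."""
    d = A.sum(axis=1)
    two_m = d.sum()
    return A - np.outer(d, d) / two_m

def relaxed_indicator(B, n_clusters):
    """Relaxed maximizer of tr(U^T B U): top (n_clusters - 1) eigenvectors of B."""
    vals, vecs = np.linalg.eigh(B)
    return vecs[:, np.argsort(vals)[::-1][: n_clusters - 1]]
```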
3.2 Modularity Optimization on Multiple Data Sources
By matrix decomposition we can easily obtain U in (5), whereas it is hard to directly obtain the optimal solution of the multiple-source extension of (5). Therefore, we turn to tensor methods based on Frobenius-norm (F-norm) optimization. Preliminarily, we need to formulate the spectral clustering as an F-norm optimization. The Frobenius norm (or Hilbert-Schmidt norm) of a modularity matrix B can be defined in various ways [19]:

    ||B||_F = sqrt( ∑_{i=1}^{m} ∑_{j=1}^{n} |b_ij|^2 ) = sqrt( tr(B* B) ) = sqrt( ∑_{i=1}^{min{m,n}} σ_i^2 ),    (6)

where B* denotes the conjugate transpose of B and σ_i are the singular values of B. Consider the following F-norm maximization:
    max_U ||U^T B U||_F^2,  s.t. U^T U = I.    (7)
If B is positive (semi)definite, the objective functions in (5) and (7) are different but happen to attain their optima at the same matrix U, whose columns span the dominant eigenspace of B [19]. Regarding the positive (semi)definiteness, we can regularize the modularity matrix to guarantee that it is positive (semi)definite [18]. Thus the spectral clustering defined in (5) can alternatively be formulated as the F-norm optimization in (7). From the various (K types of) information sources we generate multi-dimensional modularity matrices B^(i) (i = 1, 2, ..., K). Then, by linear combination, we formulate the multi-dimensional spectral clustering as

    max_U ∑_{i=1}^{K} ||U^T B^(i) U||_F^2,  s.t. U^T U = I,    (8)

which is also hard to solve directly, so we will represent it by a 3-order tensor method in the next section.
4 Tensor Decomposition for Hybrid Clustering
This section provides notation and minimal background on tensors and tensor decomposition used in this research. We refer readers to [20,21,22,4] for more comprehensive review on tensors. Tensor is a mathematical representation of a multi-way array. The order of a tensor is the number of modes (or ways). A first-order tensor is a vector, a second order tensor is a matrix and a tensor of order three or higher is called a higher-order tensor. We only investigate 3-order tensor decomposition that is relevant to our problem.
4.1 Basic Conceptions of Tensor Decomposition [23,24]
The n-mode matrix unfolding: Matrix unfolding is the process of reordering the elements of a 3-way array into a matrix. The n-mode (n = 1, 2, 3) matrix unfoldings of a tensor A ∈ R^{I×J×K} are denoted by A_(1), A_(2) and A_(3), respectively. For example, the matrix unfolding A_(1) has I rows, and its number of columns is the product of the dimensionalities of all other modes, that is, J × K. The n-mode product: For instance, the 1-mode product of a tensor A ∈ R^{I×J×K} by a matrix H ∈ R^{I×P}, denoted by A ×_1 H, is a (P × J × K)-tensor whose entries are given by

    (A ×_1 H)_{pjk} = ∑_i a_{ijk} h_{ip}.    (9)
The analogous definitions hold for the 2-mode and 3-mode products. Higher-order singular value decomposition (HOSVD): HOSVD is a form of higher-order principal component analysis. It decomposes a tensor into a core tensor multiplied by a matrix along each mode. In the three-way case, where A ∈ R^{I×J×K}, we have

    A = S ×_1 U ×_2 V ×_3 W,    (10)

where U ∈ R^{I×I}, V ∈ R^{J×J} and W ∈ R^{K×K} are called factor matrices or factors and can be thought of as the principal components of the original tensor along each mode. The factor matrices U, V and W are assumed column-wise orthonormal. The tensor S ∈ R^{I×J×K} is called the core tensor, and its elements show the level of interaction between different components. According to [23], given a tensor A, its matrix factors U, V and W as defined in (10) can be calculated as the left singular vectors of its matrix unfoldings A_(1), A_(2) and A_(3), respectively. Truncated HOSVD [23,20]: An approximation of a tensor A can be obtained by truncating the decomposition; for instance, the matrix factors U, V and W can be obtained by considering only the first left singular vectors of the corresponding matrix unfoldings. This approximate decomposition is called the truncated HOSVD.
4.2 Hybrid Clustering via HOSVD (HC-HOSVD)
A tensor A can be built from several modularity matrices {B^(1), B^(2), ..., B^(K)}: the first and second dimensions I and J of the tensor A are equal to the dimensions of the modularity matrices B^(i) (i = 1, ..., K), and its third dimension K equals the number of information sources (different modularity matrices). According to the definition of the F-norm of tensors [23],

    ∑_{i=1}^{K} ||U^T B^(i) U||_F^2 = ||A ×_1 U^T ×_2 U^T||_F^2.    (11)
So the optimization in (8) can be formulated equivalently as

    max_U ||A ×_1 U^T ×_2 U^T||_F^2,  s.t. U^T U = I.    (12)
Since the modularity matrices B^(i) (i = 1, ..., K) are symmetric, the matrix unfoldings A_(1) and A_(2) are identical; consequently, the matrices U and V in (10) are the same. In (11) there is no compression along the third mode, so we may take W equal to any orthogonal matrix without affecting the cost function; hence we take W = I in (11). As explained in [23], projection on the dominant higher-order singular vectors usually gives a good approximation of the given tensor. Consequently, taking the columns of U equal to the dominant 1-mode singular vectors is expected to yield a large value of the objective function in (11). The dominant 1-mode singular vectors are equal to the dominant left singular vectors of A_(1). The truncated HOSVD obtained this way does not maximize (11) in general; however, the result is usually quite good, the algorithm is simple to implement and efficient, and there exists an upper bound on the approximation error [23]. The pseudo code of this hybrid clustering algorithm based on HOSVD (HC-HOSVD) is as follows.
Algorithm 1. HC-HOSVD(B^(1), B^(2), ..., B^(K), C)   (C is the number of clusters)
1. Build a modularity-based tensor A.
2. Compute U from the subspace spanned by the dominant left (C - 1) singular vectors of A_(1).
3. Normalize the rows of U to unit length.
4. Compute the cluster labels idx with k-means on U.
Return idx (the clustering labels).
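A compact sketch of Algorithm 1 in Python; `numpy.linalg.svd` supplies the dominant left singular vectors of the mode-1 unfolding, and scikit-learn's KMeans stands in for the final k-means step. The column ordering of the unfolding does not affect the left singular vectors, so a plain reshape is used.

```python
import numpy as np
from sklearn.cluster import KMeans

def hc_hosvd(modularity_mats, n_clusters):
    """Truncated-HOSVD hybrid clustering (Algorithm 1): stack the K modularity
    matrices into an N x N x K tensor, unfold along mode 1, take the dominant
    left singular vectors, normalize rows, and run k-means."""
    A = np.stack(modularity_mats, axis=2)          # N x N x K tensor
    A1 = A.reshape(A.shape[0], -1)                 # mode-1 unfolding (N x NK)
    U, _, _ = np.linalg.svd(A1, full_matrices=False)
    U = U[:, : n_clusters - 1]
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)
```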
5 Experiment on Synthetic Data [1]
Generally, real-world data does not provide the ground truth information of cluster membership, so we turn to synthetic data with multiple information sources to conduct some controlled experiments. In this section, we evaluate and compare different strategies applied to multi-dimensional networks. The synthetic data1 has 3 clusters, with each having 50, 100, 200 members respectively. There are 4 kinds of interactions among these 350 nodes, that is, we have four different information sources. For each dimension, cluster members connect with each other following a random generated within-cluster interaction probability. The interaction probability differs with respect to groups at distinct dimensions. 1
The data was offered by Lei Tang in Arizona University [1].
After that, we add some noise to the network by randomly connecting any two nodes with low probability. Normalized Mutual Information (NMI) [3] is adopted to measure the clustering performance. We cross-compare four types of hybrid clustering on the multiple information sources with clusterings on each single information source. The four hybrid clustering methods are average modularity maximization (AMM) [1], PMM, DVC-HOSVD and HC-HOSVD. We regenerate 100 different synthetic data sets and report the average performance of each method plus its standard deviation in Table 1. Clearly, hybrid clustering on multiple information sources outperforms clustering on a single information source and has lower variance. Due to the randomness of each run, it is not surprising that the single-source methods show large variance. Among the four hybrid clustering strategies, HC-HOSVD clearly outperforms the other three.

Table 1. Clustering on multiple synthetic networks

Strategy                                      Performance (NMI)
Single Information Source    A1               0.6029 ± 0.1798
                             A2               0.6158 ± 0.1727
                             A3               0.5939 ± 0.1904
                             A4               0.6114 ± 0.2065
Multiple Information Source  AMM              0.8939 ± 0.0945
                             PMM              0.8414 ± 0.1201
                             DVC-HOSVD        0.8975 ± 0.1109
                             HC-HOSVD         0.9264 ± 0.1431
6 Conclusion and Further Direction
Our main contributions are two-fold: Based on tensor decomposition, we proposed a kind of hybrid clustering algorithm named HC-HOSVD to integrate multiple information sources. We applied our method to synthetic data and cross compared our method with other clustering methods. The clustering performance demonstrated that our method is superior to other methods. In later research, we will deeply explore the inner relationship among multiple information sources via tensor decomposition and carry out our algorithm to tackle large-scale and real databases.
Acknowledgments Research supported by (1)China Scholarship Council(CSC, No. 2006153005); (2) Research Council K.U.Leuven: GOA-Ambiorics, GOA-MaNet, CoE EF/05/006 Optimization in Engineering (OPTEC), CIF1, STRT1/08/023; (3) F.W.O.: (a) project G.0321.06, (b) Research Communities ICCoS, ANMMM and MLDM;
(4) the Belgian Federal Science Policy Office: IUAP P6/04 (DYSCO, “Dynamical systems, control and optimization”, 2007–2011); (5) EU: ERNSI; (6) Wuhan university of science and technolgy(WUST), college of information science and engineering (CISE).
References 1. Tang, L., Wang, X., Liu, H.: Uncovering Groups via Heterogeneous Interaction Analysis. In: ICDM 2009: Proceedings of 9th IEEE International Conference on Data Mining, pp. 143–152. ACM Press, New York (2009) 2. Bickel, S., Scheffer, T.: Multi-view Clustering. In: ICDM 2004: Proceedings of the Fourth IEEE International Conference on Data Mining, pp. 19–26 (2004) 3. Strehl, A., Ghosh, J.: Cluster Ensembles-a Knowledge Reuse Framework for Combining Multiple Partitions. JMLR 3, 583–617 (2002) 4. Smilde, A., Bro, R., Geladi, P.: Multi-way Analysis: Applications in the Chemical Sciences. Wiley, West Sussex (2004) 5. De Lathauwer, L., Vandewalle, J.: Dimensionality Reduction in Higher-order Signal Processing and Rank-(r1 , r2 , . . . , rn ) Reduction in Multilinear Algebra. Lin. Alg. Appl. 391, 31–55 (2004) 6. Comon, P.: Independent Component Analysis, a New Concept? Signal Processing 36(3), 287–314 (1994) 7. Dunlavy, D.M., Kolda, T.G., Kegelmeyer, W.P.: Multilinear Algebra for Analyzing data with Multiple Linkages. Technical Report SAND2006-2079, Sandia National Laboratories (2006) 8. Kolda, T., Bader, B.: The TOPHITS Model for Higher-order Web Link Analysis. In: Proceedings of the SIAM Data Mining Conference Workshop on Link Analysis, Counterterrorism and Security (2006) 9. Savas, B., Eld´en, L.: Handwritten Digit Classification using Higher Order Singular Value Decomposition. Pattern Recognition 40(3), 993–1003 (2007) 10. Jegelka, S., Sra, S., Banerjee, A.: Approximation Algorithms for Tensor Clustering. In: Gavald` a, R., Lugosi, G., Zeugmann, T., Zilles, S. (eds.) ALT 2009. LNCS, vol. 5809, pp. 822–833. Springer, Heidelberg (2009) 11. Liu, X., Yu, S., Moreau, Y., De Moor, B., Gl¨ anzel, W., Janssens, F.: Hybrid clustering of text mining and bibliometrics applied to journal sets. In: SDM 2009: Proceedings of the 2009 SIAM International Conference on Data Mining. SIAM, Philadelphia (2009) 12. tao Sun, J., Zeng, H.-J., Liu, H., Lu, Y.: Cubesvd: A Novel Approach to Personalized Web Search. In: WWW 2005: Proceedings of the 14th International World Wide Web Conference, pp. 382–390 (2005) 13. Liu, N., Zhang, B., Yan, J., Chen, Z., Liu, W., Bai, F., Chien, L.: Text Representation: From Vector to Tensor. In: ICDM 2005: Proceedings of the Fifth IEEE International Conference on Data Mining (2005) 14. Sun, J., Tao, D., Faloutsos, C.: Beyond Streams and Graphs: Dynamic Tensor Analysis. In: KDD 2006: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (2006) 15. Sun, J., Papadimitriou, S., Yu, P.S.: Window-based Tensor Analysis on Highdimensional and Multi-aspect Streams. In: ICDM 2006: Proceedings of the Sixth International Conference on Data Mining (2006)
16. Huang, H., Ding, C., Luo, D., Li, T.: Simultaneous Tensor Subspace Selection and Clustering: the Equivalence of High Order SVD and k-means Clustering. In: KDD 2008: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (2008) 17. Newman, M.E.J.: Modularity and Community Structure in Networks. PNAS 103(23), 8577–8582 (2006) 18. Newman, M.E.J.: Finding Community Structure in Networks using the Eigenvectors of Matrices. Physical Review E 74(3), 036104 (2006) 19. Lay, D.C.: Linear Algebra and its Applications, 3rd edn. Addison Wesley, Reading (2003) 20. Kolda, T.G., Bader, B.W.: Tensor Decompositions and Applications. SIAM Review 51(3), 455–500 (2009) 21. Kroonenberg, P.M.: Applied Multiway Data Analysis. Wiley, West Sussex (2008) 22. Cichocki, A., Zdunek, R., Phan, A.-H., Amari, S.: Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. John Wiley, West Sussex (2009) 23. De Lathauwer, L., Moor, B.D., Vandewalle, J.: A Multilinear Singular Value Decomposition. SIAM J. Matrix Anal. Appl. 21(4), 1253–1278 (2000) 24. De Lathauwer, L., Moor, B.D., Vandewalle, J.: On the Best Rank-1 and Rank(r1 , r2 , . . . , rn ) Approximation of Higher-order Tensors. SIAM J. Matrix Anal. Appl. 21(4), 1324–1342 (2000)
A Novel Hybrid Data Mining Method Based on the RS and BP Kaiyu Tao1,2 1
Business School of Central South University, Changsha 410083, P.R. China
[email protected] 2 Hunan University of Commerce, Changsha 410205, P.R. China
Abstract. Since rough set theory and neural networks each have particular advantages and problems in data mining, this paper presents a combined algorithm based on rough set theory and a BP neural network. The algorithm reduces the data drawn from the data warehouse with the reduction function of rough sets and then passes the reduced data to the BP neural network as training data. Through data reduction the training examples become clearer and the scale of the neural network can be simplified; at the same time, the neural network eases the sensitivity of rough sets to noisy data. The paper also presents a cost function that expresses the relationship between the amount of training data and the precision of the neural network, and that supplies a criterion for switching from rough set reduction to neural network training. Keywords: Hybrid Data Mining, Rough Sets, BP Network.
1 Introduction
Both rough sets and BP neural networks perform classification in data mining. The advantages of rough sets are parallel execution, the description of uncertain information and strategies for handling redundant data; their problem is sensitivity to object noise [1]. The BP network is the most popular neural network; its main merits are high precision and insensitivity to noise, but redundant data can easily cause over-training, and the influence of the network scale and the number of training samples on training speed and training time is a persistent headache [2-4]. Considering the merits and demerits of rough set theory and BP neural networks, this paper proposes a new data mining algorithm that combines the two. The algorithm overcomes the sensitivity of rough sets to noisy data; meanwhile, it reduces the training time of the BP neural network [5-8] and improves convergence, which raises efficiency considerably.
2 The Combined Algorithm of Rough Set and BP Network
Accessing the huge data sets in a data warehouse raises many problems. The most conspicuous are caused by data gathering: producing redundant
data makes the data grow rapidly; secondly, large data sets are distorted, for example by noisy data. Almost no data mining process can avoid these two problems. Rough set theory handles redundant data well, but it is sensitive to noisy data; for the BP neural network, training speed and the choice of training samples are the troublesome problems [9]. Usually, the clearer the data representation and the less the redundant data, the easier the neural network is to train, but the number of neurons and weights increases, the training time grows, and the generalization ability of the network weakens [10,11]. One idea is to use rough sets to reduce the data and then use the reduced data set as the design basis and training data of the BP neural network. This makes the training data clearer and absorbs the advantages of both methods: the training time of the BP network is reduced, and the BP network in turn reduces the influence of noise [12]. Based on this, the paper proposes a combined algorithm of rough sets and neural networks. Data mining based on rough set theory always involves attribute reduction, whose steps are: first, find the kernel of the attribute reduction set from the discernibility matrix; then use a reduction algorithm to calculate the reduction set; and finally decide the best reduction set by some criterion. The data mining process based on rough set theory is illustrated in Fig. 1.
Fig. 1. The data mining process based on rough set theory
The difficulty of the algorithm lies in the stopping condition of the rough set reduction; in other words, it is hard to decide how much training data the BP neural network should receive. As noted above, the amount of training data strongly influences the training time of the BP network. At present there is no definite way to decide this amount, only a rough rule of thumb: the number of training samples should be about twice the number of connection weights. For example, if a BP network has n input nodes, n1 hidden nodes and m output nodes, it needs about 2×(n×n1 + n1×m) training samples. The amount of training data is also related to the precision of the neural network. Usually, an error measure is used to reflect the learning capability. The error is defined as:
    e = ( ∑_{i=1}^{m} ∑_{j=1}^{n} (d_ij - y_ij) ) / (m · n)
Here m denotes the number of samples in the training set and n is the number of output units of the network. When the amount of training data increases, the error becomes smaller, so adding training data helps to reduce the error; but at the same time the training time becomes longer. Based on this, a cost function is proposed to describe the relation between the amount of training data and the error, and the error function is modified as:

    e = ( ∑_{i=1}^{m} ∑_{j=1}^{n} (d_ij - y_ij) ) / (X · m · n)
A variable X is added to the formula; when X = 1 it reduces to the original error function. The cost function takes the form

    y = x / (1 - 1/√x),
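A quick numerical check of the cost function, assuming the reconstructed form y = x/(1 - 1/√x) (the root sign is inferred from the values in Table 1); it reproduces the minimum at x = 2.25.

```python
import numpy as np

def cost(x):
    """Cost guideline y = x / (1 - 1/sqrt(x)), defined for x > 1."""
    x = np.asarray(x, dtype=float)
    return x / (1.0 - 1.0 / np.sqrt(x))

xs = np.arange(1.25, 5.001, 0.25)
ys = cost(xs)
print(dict(zip(np.round(xs, 2), np.round(ys, 2))))
print("minimum at x =", xs[np.argmin(ys)])    # 2.25
```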
where x is the coefficient, whose range is x > 1, and y is the cost indicator. Table 1 shows the relation between the cost function, its derivative and the value of x. From the table, when x = 2.25 the first derivative of the cost function is 0, which is the minimum of the cost function. When the derivative of the cost function is smaller than -1 or bigger than 1, the cost changes too fast and the coefficient is considered unsuitable. As a result, the coefficient should lie between 1.93 and 4, and the optimal choice is 2.25.

Table 1.

x    1.25   1.5   1.75   2      2.25   2.5    2.75   3
y    11.84  8.17  7.17   6.83   6.75   6.2    6.80   7.10
y'   -30    -6.7  -2.24  -0.71  0      0.38   0.61   0.75

x    3.25   3.5   3.75   4      4.25   4.5    4.75   5
y    7.30   7.52  7.75   8      8.25   8.51   8.78   9.05
y'   0.85   0.91  0.96   1      1.02   1.05   1.06   1.08
The cost function can be regarded as the selection guideline, i.e., the stopping rule of the rough set reduction. For samples with many attributes the optimal cost coefficient 2.25 is chosen as the guideline; for samples with few attributes a cost higher than 2.25 is chosen. Since data mining mainly deals with very large data, a cost of 2.25 is normally the best answer, but the situation with little data is also taken into consideration. The algorithm is as follows: Step 1: sample the data, specify the mining conditions and decide the goal of mining. Step 2: delete redundant attributes according to rough set theory. Step 3: perform attribute reduction under rough set theory. Step 4: if the minimum attribute set has been obtained, choose the training data set with cost 2.25; otherwise, compute the number of training samples using the highest cost 4 and the reduced attributes; if this number is smaller than the size of the reduced data, return to Step 3, otherwise choose the training data by the definition and the cost function. Step 5: design the neural network from the training data and train it on the training samples. Step 6: output the final results. The flow chart is illustrated in Fig. 2; a brief sketch of the training-sample sizing rule is given below.
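A sketch of the training-sample sizing rule; how the cost coefficient multiplies the "2 × number of weights" rule is inferred from the worked example in Section 3, so treat the function as an assumption.

```python
def required_training_samples(n_inputs, n_hidden, n_outputs, cost_coeff=2.25):
    """Number of training samples: cost coefficient times twice the number
    of connection weights (Section 2); the way the coefficient enters is
    inferred from the worked car example in Section 3."""
    n_weights = n_inputs * n_hidden + n_hidden * n_outputs
    return round(cost_coeff * 2 * n_weights)

# Car example: 4 inputs, 4 hidden units, 3 outputs, highest cost coefficient 4.
print(required_training_samples(4, 4, 3, cost_coeff=4))   # 224
```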
Fig. 2. Flow chart of the combined rough set and BP neural network algorithm
3 Examples
Here, a car data table in reference [2] is used to illustrate the algorithm in table 2.
Table 2.

Plate#   Make-model    Compress  Color   Power   Cyl  Trans   Door  Weight  Displace  Mileage
BCT89C   Ford Escort   High      Silver  High    6    Auto    2     1020    Medium    Medium
RST45W   Dodge         High      Green   Medium  4    Manual  2     1200    Big       Medium
IUTY56   Benz          Medium    Green   Medium  4    Manual  2     1230    Big       High
……       ……            ……        ……      ……      ……   ……      ……    ……      ……        ……
PLMJH9   Honda         Medium    Brown   Medium  4    Manual  2     900     Medium    Medium
DSA321   Toyota Paseo  Medium    Black   Low     4    Manual  2     850     Small     High
The decision attribute is Make-model; the others are condition attributes. Using rough sets to remove the redundant attributes gives Table 3; two attributes are deleted. Performing data reduction on Table 3, and honoring the user's request that the attribute reduction set must contain Displace and Weight, gives Table 4. The neural network is then built and the training samples selected. The network has 4 input neurons, 3 output neurons and 4 hidden neurons; its structure is illustrated in Fig. 3. Following the network structure and the cost coefficient, the number of training samples is 4 × 2 × (4×4 + 4×3) = 224. After training on these samples, the final results are output as Table 5.
Table 3.

Obj  Make-model  Power   Trans   Cyl  Door  Weight  Compress  Mileage
1    USA         High    Auto    6    2     Auto    Medium    Medium
2    Germany     Medium  Manual  4    2     Heavy   Big       High
3    Japan       Medium  Low     4    2     Light   Small     High
……   ……          ……      ……      ……   ……    ……      ……        ……
Table 4.

Make-model  Displace  Trans   Weight  Mileage
USA         Medium    Auto    Medium  Medium
USA         Big       Manual  Heavy   Medium
Germany     Big       Manual  Heavy   High
Japan       Small     Manual  Light   High
……          ……        ……      ……      ……
Fig. 3. Structure of the BP neural network (inputs: Displace, Trans, Weight, Mileage; output: Make-model)
4 Conclusions
When mining a data warehouse with a huge amount of data and many attributes, this algorithm possesses the advantages of both rough set theory and the BP neural network: it overcomes the influence of noise on the data and, at the same time, deletes redundant data, provides clearer training data, reduces the scale of the network and improves mining efficiency. The proposed cost function not only captures the relation between the amount of training data and mining precision but also provides a guideline for the hand-over from rough set reduction to neural network training. Since data mining is aimed at large data warehouses, however, the algorithm is not suitable for small-scale data mining.
References 1. Bazan, J.: Dynamic reducts and statistical inference. In: Sixth International Conference on IPMU, pp. 1147–1152 (1996) 2. Nguyen, T.T., Skowron, A.: Rough set Approach to Domain Knowledge Approximation. In: Wang, G., Liu, Q., Yao, Y., Skowron, A. (eds.) RSFDGrC 2003. LNCS (LNAI), vol. 2639, pp. 92–97. Springer, Heidelberg (2003) 3. Rosado, I.J., Bernal-Agustin, J.L.: Genetic Algorithms in Multistage Distribution Network Planning. IEEE Trans. Power Systems 9(4), 1927–1933 (1994) 4. Maulik, U., Bandyopdhyay, S.: Performance Evaluation of Some Clustering Algorithms and Validity Indices. IEEE Trans. on Pattern Analysis and Machine Intelligence, 1650– 1654 (2002) 5. Dillenbourg, P., Self, J.A.: A Computational Approach to Distributed Cognition. European Journal of Psychology Education 7(4), 252–373 (1992)
6. Jiang, W.J., Wang, P.: Research on Distributed Solution and Correspond Consequence of Complex System Based on MAS. Journal of Computer Research and Development 43(9), 1615–1623 (2006) 7. Pawlak, Z.: Rough Sets. Int. J. Comput. Inform. Sci. 11(5), 341–356 (1982) 8. Polkowski, L.: A Rough Set Paradigm for Unifying Rough Set Theory and Fuzzy Set Theory. In: Wang, G., Liu, Q., Yao, Y., Skowron, A. (eds.) RSFDGrC 2003. LNCS (LNAI), vol. 2639, pp. 467–471. Springer, Heidelberg (2003) 9. Jiang, W.J., Lin, X.H.: Research on Extracting Medical Diagnosis Rules Based on Rough Sets Theory. Computer Science 31(11), 97–101 (2004) 10. Jiang, W.J., Pu, W., Lianmei, Z.: Research on Grid Resource Scheduling Algorithm Based on MAS Cooperative Bidding Game. Chinese Science F 52(8), 1302–1320 (2009) 11. Jiang, W.J.: Research on the Learning Algorithm of BP Neural Networks Embedded in Evolution Strategies. In: WCICA 2005, pp. 222–227 (2005) 12. Chen, I.R.: Effect of Parallel Planning on System Reliability of Real-time Expert Systems. IEEE Trans. on Reliability 46(1), 81–87 (1997)
Dynamic Extension of Approximate Entropy Measure for Brain-Death EEG Qiwei Shi1 , Jianting Cao1,3,4 , Wei Zhou1 , Toshihisa Tanaka2,3, and Rubin Wang4 1
Saitama Institute of Technology 1690 Fusaiji, Fukaya-shi, Saitama 369-0293, Japan 2 Tokyo University of Agriculture and Technology 2-24-16, Nakacho, Koganei-shi, Tokyo 184-8588, Japan 3 Brain Science Institute, RIKEN 2-1 Hirosawa, Wako-shi, Saitama 351-0198, Japan 4 East China University of Science and Technology Meilong Road 130, Shanghai 200237, China
[email protected]
Abstract. In this paper, we propose an electroencephalography (EEG) signal processing method for supporting the clinical diagnosis of brain death. Approximate entropy (ApEn), as a complexity-based measure, appears to have potential application to physiological and clinical time-series data, so we present an ApEn-based statistical measure for brain-death EEG analysis. The measure is applied to all channels and extended along the time-coordinate of the EEG signal to observe the variation of its dynamic complexity. However, high-frequency noise, such as electronic interference from the surroundings, contained in real-life recorded EEG leads to inconsistent ApEn results. To solve this problem, our method includes an EEG denoising step based on empirical mode decomposition (EMD), so that the high-frequency interference components can be discarded from the noisy periods along the time-coordinate of the EEG signals. The experimental results demonstrate the effectiveness of the proposed method, and the accuracy of the dynamic complexity measure is clearly improved. Keywords: Electroencephalography (EEG), Approximate entropy (ApEn), Dynamic complexity measure, Empirical mode decomposition (EMD).
1 Introduction
Brain death is defined as the complete, irreversible and permanent loss of all brain function, including brain stem function [1]. Based on this definition, electroencephalography (EEG) is used to evaluate the absence of cerebral cortex function in the diagnosis of brain death, and the clinical diagnosis process established in most countries involves an EEG criterion. For example, a relatively strict criterion in Japan includes the following major items: (1)
Deep coma test; (2) Pupil test; (3) Brain stem reflexes test; (4) Apnea test; (5) EEG confirmatory test. Considering the standard process of brain death diagnosis usually involves certain risks and takes a long time (e.g., the need of removing the respiratory machine and 30 minutes’ EEG confirmatory test), we have proposed an EEG preliminary examination method to develop a practical yet reliable and rapid way for the determination of brain death [2]. That is, after items (1)–(3) have been verified, an EEG preliminary examination along with real-time recorded data analysis method is applied to detect the brain wave activity at the bedside of patient. On the condition of positive examined result, we suggest to accelerate the brain death diagnosis process and spend more time on the medical care. In order to provide technical support for the EEG preliminary examination in brain death diagnosis, several statistics based signal processing tools have been developed for the signal denoising, brain activity detection or feature extraction and classification. To extract informative features from noisy EEG signals and evaluate their statistical significance, several complexity measures are developed for the quantitative EEG analysis in our previous study [3]. To decompose brain activities with a specific frequency, the time-frequency EEG analysis technique based on EMD has been proposed [4]. High intensity as well as absence of spontaneous brain activities from quasi-brain-death EEG can be obtained through power spectral pattern analysis [5]. In this paper, we present a dynamic complexity measure associating with empirical mode decomposition (EMD) denoising processing approach to analysis the real-life recorded EEG signal. Approximate entropy based complexity measure shows its well performance in evaluating the statistic feature of EEG signal. However, results obtained by extending ApEn in temporal domain indicate that the value is easily influenced by high frequency electronic interference contained in the real-life recorded EEG signal. EMD method is applied to decompose a single-channel EEG signal into a number of components with different frequency. Therefore, high frequency interferences can be discarded and the ApEn result for denoising signal is satisfying. The experimental result illustrate the effectiveness of the proposed method and the accuracy and reliability of the dynamic ApEn measure for EEG preliminary examination can be well improved.
2 ApEn and Extended Dynamic Measure
Approximate entropy (ApEn) is a regularity statistic that quantifies the unpredictability of fluctuations in a time series and appears to have potential application to a wide variety of physiological and clinical time-series data [6,7]. Intuitively, one may reason that the presence of repetitive patterns of fluctuation in a time series renders it more predictable than a time series in which such patterns are absent. Given a time series {x(n)}, n = 1, ..., N, to compute ApEn(x(n), m, r) (m: length of the series of vectors, r: tolerance parameter), the series of vectors of length m, v(k) = [x(k), x(k + 1), ..., x(k + m − 1)], is first
Fig. 1. The ApEn of a sine wave, a random sequence, and a sine wave mixed with a random sequence is 0.1834, 0.9362, and 0.5841, respectively
constructed from the signal samples {x(n)}. Let D(i, j) denote the distance between two vectors v(i) and v(j) (i, j ≤ N − m + 1), defined as the maximum difference between their scalar components:

D(i, j) = \max_{l=1,\dots,m} |v_l(i) - v_l(j)| .   (1)
Then we compute N^{m,r}(i), the total number of vectors v(j) whose distance from the generic vector v(i) is no greater than r, i.e., D(i, j) ≤ r. We then define C^{m,r}(i), the probability of finding a vector that differs from v(i) by less than the distance r, and φ^{m,r}, the average natural logarithm of the C^{m,r}(i) probabilities over all vectors:

C^{m,r}(i) = \frac{N^{m,r}(i)}{N - m + 1} , \qquad \phi^{m,r} = \frac{1}{N - m + 1} \sum_{i=1}^{N - m + 1} \log C^{m,r}(i) .   (2)
For m + 1, the above steps are repeated to compute φ^{m+1,r}. The ApEn statistic is then given by

ApEn(x(n), m, r) = \phi^{m,r} - \phi^{m+1,r} .   (3)
The typical values m = 2 and r between 10% and 25% of the standard deviation of the time series {x(n)} are often used in practice [6]. As illustrated in Fig. 1, a greater likelihood of remaining close (e.g., a sine wave) produces smaller ApEn values, whereas low regularity (e.g., a random sequence) produces higher ApEn values. Furthermore, based on the algorithm for computing the ApEn of one sequence, we extend it in the temporal domain along the time coordinate of the EEG signal. Supposing an EEG data series S_N consists of N sequence intervals {x_i(n)}, the ApEn measure is carried out over each interval. We define the dynamic ApEn measure of a given EEG signal as

ApEn(S_N, m, r) = [ApEn(x_1(n), m, r), \dots, ApEn(x_N(n), m, r)] .   (4)
Consequently, in our experiment, the ApEn(S_N, m, r) statistic measures the variation of the complexity of an EEG data series S_N. In brain-death EEG, the occurrence of an irregular pattern in one interval is expected to be followed by an irregular pattern in the next.
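For concreteness, the following is a minimal NumPy sketch of Eqs. (1)–(4). The function and parameter names, and the choice of deriving the tolerance r from each interval's own standard deviation, are illustrative assumptions of the sketch rather than details taken from the paper.

```python
import numpy as np

def apen(x, m=2, r_factor=0.25):
    """Approximate entropy of one sequence, following Eqs. (1)-(3)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    r = r_factor * np.std(x)              # tolerance from this interval's std (assumed)

    def phi(mm):
        # vectors v(k) = [x(k), ..., x(k+mm-1)] stacked as rows
        V = np.array([x[k:k + mm] for k in range(N - mm + 1)])
        # D(i, j): maximum componentwise difference, Eq. (1)
        D = np.max(np.abs(V[:, None, :] - V[None, :, :]), axis=2)
        # C^{m,r}(i): fraction of vectors within tolerance r, Eq. (2)
        C = np.sum(D <= r, axis=1) / (N - mm + 1)
        return np.mean(np.log(C))

    return phi(m) - phi(m + 1)            # Eq. (3)

def dynamic_apen(signal, fs=1000, m=2, r_factor=0.25):
    """Dynamic ApEn measure of Eq. (4): ApEn of each one-second interval."""
    n_intervals = len(signal) // fs
    return np.array([apen(signal[k * fs:(k + 1) * fs], m, r_factor)
                     for k in range(n_intervals)])
```

With 1000 samples per interval, the pairwise distance computation stays small enough that such a measure can plausibly be evaluated interval by interval at the bedside.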
Fig. 2. ApEn based dynamic complexity measure for a brain death patient’s EEG in channel Fp1. Tolerance parameter r equals 0.25.
3 Experiments and Results
3.1 Brain-Death EEG Signal and Its Dynamic ApEn Complexity
The EEG measurements were carried out in Shanghai Huashan Hospital, affiliated to Fudan University (China). The EEG data were recorded directly by a portable NEUROSCAN ESI system at the bedside of the patients in the ICU, where the level of environmental noise was relatively high since many medical machines were switched on. The EEG electrodes were placed at positions Fp1, Fp2, F3, F4, F7, and F8, together with GND and two reference electrodes placed on the earlobes (A1, A2), based on the standardized 10-20 system. The sampling rate of the EEG was 1000 Hz and the electrode resistances were less than 8 kΩ. Among the total of 35 coma and quasi-brain-death patients examined by EEG from June 2004 to March 2006, one 48-year-old male patient was first in the coma state and then behaved as brain-dead in the second measurement. In our previous research, we demonstrated that regular and predictable brain activities such as θ or α waves exist in the EEG of coma patients, whereas the EEG signals of brain-death cases consist mostly of interfering noise. Since ApEn is suggested as a complexity-based statistic measuring the regularity or predictability of a time series, the ApEn of a sine signal is close to 0 and that of a random signal is close to 1. From this point of view, choosing one second of data from each case, the ApEn values for coma cases are generally lower than those for brain-death ones. Furthermore, the ApEn of each second of the EEG signal is calculated via the dynamic complexity measure along the time coordinate. Here, we focus on the patient who progressed from coma to brain death. Fig. 2 gives the ApEn result (r = 0.25) for his brain-death EEG over 1153 seconds; the average is 0.8703. It should be noted, however, that ApEn has weaknesses. From the result, the ApEn values of the brain-death signal are distributed near 1, except that those in the dotted-line region are relatively low. Comparing the recorded EEG signal of the patient from about 270 to 320 s with the rest, we conjecture that the signal is mixed with high-frequency electrical interference and that these regular components lower the ApEn result. The result of applying the complexity measure with the proposed EMD denoising process to this
Fig. 3. Dynamic ApEn measure with EMD denoising process for the brain-death patient's EEG in channel Fp1. Tolerance parameter r equals 0.25.
patient's EEG is illustrated in Fig. 3. A brief summary of the EMD algorithm and the denoising process is provided below.
3.2 Complexity Measure with EMD Denoising Process for EEG
EMD is an adaptive signal decomposition method applicable to nonlinear, non-stationary processes [8]. Its purpose is to decompose the signal into a finite set of intrinsic mode functions (IMFs), where each IMF component represents an inherent characteristic oscillatory mode embedded in the signal and must meet two conditions: in the whole data set, the number of extrema and the number of zero crossings must either be equal or differ at most by one; and at any point, the mean value of the envelopes defined by the local maxima and local minima must be zero. By means of a procedure called the sifting algorithm, one channel of the real-measured EEG signal x(t) is represented as

x(t) = \sum_{i=1}^{n} c_i(t) + r_n(t) ,

where c_i(t) (i = 1, ..., n) denote the n IMF components and r_n(t) is the trend component of the signal. The EMD result for a single-channel EEG segment that has been mixed with high-frequency interference is illustrated in Fig. 4(a). One second of signal (287 to 288 s) from channel Fp1, whose ApEn is 0.3619, is chosen as an example. By applying the EMD method described above, we obtained six IMF components (C1 to C6) and a residual one (C7) in the time domain. In the right column, the components are displayed in the frequency domain by applying the FFT. In the analysis, the component with the highest frequency, around 150 Hz (C1), is regarded as electrical interference. The remaining six components (C2 to C7) are taken as the desirable ones and synthesized into a new signal, shown in Fig. 4(b). Comparing the synthesized signal with the original one, it is clear that the high-frequency interference is reduced. The ApEn of this one-second EEG segment is then calculated again, and the value goes up to 0.6091. Without loss of generality, the same process is applied to each time sequence {x_i(n)} consisting of 1000 samples from 273 to 319 s, and similar results are obtained. Returning to the comparison between Fig. 2 and Fig. 3, satisfactory results are obtained and the average ApEn of the whole EEG increases to 0.8780. It can be concluded that the denoising process is effective in discarding the high-frequency component from the recorded data.
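The denoising step itself reduces to subtracting the highest-frequency IMF from the recorded segment. A minimal sketch is given below; it assumes the third-party PyEMD package for the sifting, and the variable names, segment boundaries, and sampling rate in the usage comments are illustrative rather than taken from the paper.

```python
import numpy as np
from PyEMD import EMD   # third-party package assumed to be installed

def emd_denoise(segment):
    """Remove the highest-frequency IMF (the interference component C1)
    and keep the synthesis of the remaining components."""
    segment = np.asarray(segment, dtype=float)
    imfs = EMD().emd(segment)       # rows ordered from high to low frequency content
    return segment - imfs[0]        # equivalent to summing C2, C3, ... and the residue

# usage sketch with hypothetical data: re-evaluate ApEn on the denoised seconds
# fs = 1000
# for k in range(273, 319):
#     seg = eeg_fp1[k * fs:(k + 1) * fs]
#     apen_denoised = apen(emd_denoise(seg))
```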
Fig. 4. (a) EMD result for a one-second signal from channel Fp1 in the time and frequency domains. (b) Denoised signal synthesized from components C2 to C7.
4 Conclusions
The value of the ApEn-based complexity measure for brain-death EEG is usually high. Because of the influence of high-frequency electrical interference in the EEG preliminary examination for brain death diagnosis, the ApEn result drops in certain continuous time ranges. To address this problem, this paper proposed a complexity measure combined with an EMD pre-denoising process to discard the possible interference. Theoretical and experimental results indicate that this method is feasible for evaluating brain-death EEG and that the expected result is obtained. Therefore, in terms of our proposed EEG preliminary examination, the accuracy and reliability of the dynamic measure can be clearly improved.
Acknowledgments This work was supported in part by KAKENHI (21360179).
References
1. Taylor, R.M.: Reexamining the Definition and Criteria of Death. Seminars in Neurology 17, 265–270 (1997)
2. Cao, J.: Analysis of the Quasi-Brain-Death EEG Data Based on A Robust ICA Approach. In: Gabrys, B., Howlett, R.J., Jain, L.C. (eds.) KES 2006. LNCS (LNAI), vol. 4253, pp. 1240–1247. Springer, Heidelberg (2006)
3. Chen, Z., Cao, J., Cao, Y., Zhang, Y., Gu, F., Zhu, G., Hong, Z., Wang, B., Cichocki, A.: An Empirical EEG Analysis in Brain Death Diagnosis for Adults. Cognitive Neurodynamics 2, 257–271 (2008)
4. Li, L., Saito, Y., Looney, D., Cao, J., Tanaka, T., Mandic, D.P.: Data Fusion via Fission for the Analysis of Brain Death. In: Evolving Intelligent Systems: Methodology and Applications, pp. 279–320. Springer, Heidelberg (2008)
5. Shi, Q., Yang, J., Cao, J., Tanaka, T., Wang, R., Zhu, H.: EEG Data Analysis Based on EMD for Coma and Quasi-Brain-Death Patients. Journal of Experimental & Theoretical Artificial Intelligence, 10 pages (2009) (in print)
6. Pincus, S.M.: Approximate entropy (ApEn) as a measure of system complexity. Proc. Natl. Acad. Sci. 88, 110–117 (1991)
7. Pincus, S.M., Goldberger, A.L.: Physiological time-series analysis: What does regularity quantify? Am. J. Physiol. 266, 1643–1656 (1994)
8. Huang, N.E., Shen, Z., Long, S.R., Wu, M.C., Shih, H.H., Zheng, Q., Yen, N.-C., Tung, C.C., Liu, H.H.: The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proceedings of the Royal Society of London A 454, 903–995 (1998)
Multi-modal EEG Online Visualization and Neuro-Feedback Kan Hong, Liqing Zhang, Jie Li, and Junhua Li MOE-Microsoft Key Laboratory for Intelligent Computing and Intelligent Systems Department of Computer Science and Engineering Shanghai Jiao Tong University, Shanghai 200240, China
[email protected]
Abstract. A brain computer interface (BCI) is a communication pathway between the brain and peripheral devices, which is promising in the field of rehabilitation and helps to improve the quality of life of physically challenged people. Analysis of the EEG signal is essential in a non-invasive BCI system. However, because of the EEG signal's low signal-to-noise ratio and the huge amount of data, signal analysis functions in current BCI systems are rarely available online, which is inconvenient for system adaptation and calibration, as well as for comprehending the data's characteristics. To address this problem, this paper presents two features that are suitable for online visualization. Rhythm power indicates the active brain region, and filtered ERSP (event-related spectrum power) is a substitute for the original ERSP that provides information in the signal's frequency domain. Moreover, visualization of the CSP (common spatial pattern) feature is also realized, which serves as an indicator of epoch quality. Keywords: BCI, EEG, Visualization, ERSP, CSP.
1 Introduction
A brain computer interface (BCI) is a communication pathway between the brain and peripheral devices that makes use of brain neural processes and is independent of the normal outputs of brain activity, such as movements of muscle tissue. BCI systems can be widely used in the rehabilitation of disease and trauma and help to improve the quality of life of physically challenged people. Electroencephalography (EEG) based non-invasive BCI systems depend on the analysis of EEG signal patterns under particular thinking and imagination tasks. Compared with invasive BCI systems, non-invasive ones are safer and easier to apply. Current BCI systems transform the classification results directly into control commands, and data analysis functions are only available offline, which leads to a poor user interface. Visualization of EEG features provides a more intuitive interpretation of EEG signals. With visualization techniques, BCI users would have a better idea of the characteristics of the current signals and how the classification methods are applied to the data, which is useful for system adaptation and calibration. To establish feedback from the BCI system to the users, it
will be beneficial to have online visualized features on the BCI system's interface. This paper presents visualization techniques that make EEG signals more comprehensible in real time. The rest of the paper is organized as follows: Section 2 introduces two methods for extracting features from signals. Section 3 defines the features of EEG signals to be visualized and describes the realization of an online system. Section 4 describes an experiment with the online visualization and provides interpretations of the visualized features. Section 5 provides the conclusion.
2 Information Extraction for Multi-modal Visualization
Visualization of the EEG signal focuses on extracting the signal's distribution over the frequency domain and the spatial domain. A wavelet transform is performed to obtain the signal's frequency information with respect to time, and the common spatial pattern (CSP) is a method widely used in the BCI field to find a subspace of the brain region for feature extraction and pattern classification.
2.1 Wavelet Transform
Since EEG is a non-stationary signal, frequency analysis using the Fourier transform cannot capture changes in the frequency domain as time elapses. The wavelet transform can be used to resolve this dilemma. The Morlet wavelet is regarded as the best wavelet for EEG time-frequency analysis, for which the relation between the scale and the frequency is

a = \frac{F_c}{f T} ,   (1)

where F_c is the center frequency of the wavelet, f is the frequency of interest, and T is the sampling period [1][2].
2.2 Common Spatial Pattern (CSP)
CSP is a common feature extraction technique in BCI systems. The algorithm finds directions in which the variance of one class's data is maximized while the variance of the other's is minimized; greater variance indicates higher signal power, and vice versa. According to the CSP algorithm [3][4], a spatial filter consisting of the eigenvectors with the largest and smallest eigenvalues of a generalized eigenvalue problem can then be applied to the original data for further classification:

W = [w^{*}_{max,1}, w^{*}_{max,2}, \dots, w^{*}_{max,L/2}, w^{*}_{min,1}, w^{*}_{min,2}, \dots, w^{*}_{min,L/2}]^{T} ,   (2)
where L is the dimension of the subspace. The variances of the filtered data in each dimension are the features of an epoch.
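As an illustration of this construction, the sketch below estimates the spatial filter of Eq. (2) from two sets of labeled epochs by solving the generalized eigenvalue problem of the two class covariance matrices. The array shapes, function names, and the use of scipy.linalg.eigh are assumptions of this sketch, not details given in the paper.

```python
import numpy as np
from scipy.linalg import eigh

def csp(epochs_a, epochs_b, L=4):
    """CSP spatial filter W of Eq. (2).
    epochs_* have shape (n_epochs, n_channels, n_samples)."""
    def mean_cov(epochs):
        return np.mean([np.cov(e) for e in epochs], axis=0)

    Ra, Rb = mean_cov(epochs_a), mean_cov(epochs_b)
    # generalized symmetric eigenproblem  Ra w = lambda (Ra + Rb) w
    vals, vecs = eigh(Ra, Ra + Rb)
    order = np.argsort(vals)                        # ascending eigenvalues
    picks = np.r_[order[-L // 2:], order[:L // 2]]  # largest and smallest eigenvectors
    return vecs[:, picks].T                         # each row is one spatial filter

def csp_feature(epoch, W):
    """Feature of an epoch: variance of the filtered data in each dimension."""
    return np.var(W @ epoch, axis=1)
```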
3 System Realization
3.1 Preprocess
EEG signals have prominent frequency-domain characteristics. Since the EEG signal power in specific frequency ranges varies when different movements are imagined, a band-pass filter is applied to remove irrelevant frequencies. Generally, the alpha (8-13 Hz) and beta (14-30 Hz) ranges are used in movement imagination experiments; in some cases, a wider range helps to achieve a better effect.
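A possible preprocessing sketch is shown below, using a zero-phase Butterworth band-pass filter from SciPy; the filter order and exact band edges are illustrative choices, not values stated in the paper.

```python
from scipy.signal import butter, filtfilt

def bandpass(sig, fs, lo=8.0, hi=30.0, order=4):
    """Zero-phase band-pass covering the alpha (8-13 Hz) and beta (14-30 Hz) ranges."""
    b, a = butter(order, [lo / (fs / 2.0), hi / (fs / 2.0)], btype="band")
    return filtfilt(b, a, sig, axis=-1)
```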
3.2 Characteristic Calculation
Event Related Spectrum Power (ERSP). ERSP is a measure of the signal power with respect to channel, time, and frequency. Let X_e be epoch e's wavelet transform coefficient at channel c, time t, and frequency f. Then ERSP can be defined as [5]

ERSP(c, t, f) = \frac{1}{n} \sum_{e=1}^{n} |X_e(c, f, t)|^2 .   (3)
This measure indicates the average signal power at channel c, time t, and frequency f. When dealing with an online system, only the current epoch is processed and visualized, so the size n of the epoch set reduces to 1. Let the number of time steps of one epoch be N_t and the number of frequency steps be N_f. An N_f-by-N_t matrix ERSP(c) can then be used to represent the ERSP information for each channel, and each channel's ERSP can be visualized by color-scaling the elements of the matrix ERSP(c).

Rhythm Power (RP). Rhythm power is a measure of the signal power with respect to channel and time over a frequency range, defined as

RP(c, t) = \sum_{f} ERSP(c, t, f) .   (4)
Compared with ERSP, rhythm power eliminates detailed information about the frequency steps and considers an entire frequency band as a whole. To display the rhythm power on a head model, the rhythm power RP(c, t) is extracted at one time step t for all of the channels. Making use of knowledge of the electrode positions, the rhythm power for each time step is interpolated and color-scaled. After being mapped onto a head model, the rhythm power reflects the active brain region at each time step.

Common Spatial Pattern (CSP) Feature. CSP features are extracted from the original data as described in Section 2.2. This feature is usually organized in the form of a vector whose dimension equals the number of selected eigenvectors in (2). However, since this is usually a high-dimensional vector, it is difficult to visualize fully. Principal component analysis (PCA) can be employed to map these features into a lower-dimensional space.
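Before turning to the filtered ERSP, the sketch below illustrates how the ERSP of Eq. (3) and the rhythm power of Eq. (4) can be computed for the online case (n = 1). It parameterizes a complex Morlet wavelet directly by its frequency and number of cycles instead of the scale relation in Eq. (1); this parameterization, along with the array layout and function names, is an assumption of the sketch.

```python
import numpy as np

def morlet_coeffs(sig, fs, freqs, n_cycles=7.0):
    """Complex Morlet wavelet coefficients X(f, t) for one channel."""
    out = []
    for f in freqs:
        sigma_t = n_cycles / (2.0 * np.pi * f)                    # temporal width
        t = np.arange(-3.5 * sigma_t, 3.5 * sigma_t, 1.0 / fs)
        w = np.exp(2j * np.pi * f * t) * np.exp(-t ** 2 / (2 * sigma_t ** 2))
        w /= np.sqrt(np.sum(np.abs(w) ** 2))                      # energy normalization
        out.append(np.convolve(sig, w, mode="same"))
    return np.array(out)                                          # (n_freqs, n_times)

def ersp_single_epoch(epoch, fs, freqs):
    """ERSP(c, t, f) of Eq. (3) with n = 1; epoch has shape (n_channels, n_samples)."""
    return np.array([np.abs(morlet_coeffs(ch, fs, freqs)) ** 2 for ch in epoch])

def rhythm_power(ersp):
    """RP(c, t) of Eq. (4): ERSP summed over the frequency band."""
    return ersp.sum(axis=1)
```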
Filtered ERSP. The wavelet coefficient X_e(c, f, t) in (3) is a function of the channel (electrode) c. Thus, combining the wavelet coefficients of all channels at one time step gives a vector over the space of channels. Since the CSP algorithm provides the subspace in which the difference between the signal powers of the two classes is most significant, applying the CSP spatial filter to the wavelet coefficient vectors gives their projection onto a lower-dimensional space where the differences between classes are more pronounced. Given the wavelet coefficients X_e(c, f, t), the vector in the channel space is

V_{ori}(t, f) = [X_e(c_1, f, t), X_e(c_2, f, t), \dots, X_e(c_N, f, t)]^T .   (5)

Applying the CSP projection matrix W in (2) to the original wavelet coefficient vector in (5) gives its projection onto the CSP subspace:

V_{proj}(t, f) = W V_{ori}(t, f) ,   (6)

whose elements are X_e(w_i, t, f), where w_i indicates that X_e(w_i, t, f) corresponds to the ith row vector of the CSP projection matrix W. The filtered ERSP is thus

ERSP(w_i, t, f) = \frac{1}{n} \sum_{e=1}^{n} |X_e(w_i, f, t)|^2 .   (7)

Visualization of the filtered ERSP of each channel is the same as that of ERSP, except that the channels here are the directions of the spatial filter defined by the CSP algorithm, i.e., w_i^* in (2). Moreover, since an online system requires more explicit indicators, an auxiliary chart is also provided to show the average ERSP for each channel:

\overline{ERSP}(c) = \frac{1}{T \times F} \sum_{t} \sum_{f} ERSP(c, t, f) ,   (8)

where T and F are the numbers of time steps and frequency steps.
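The projection of Eqs. (5)–(8) is then a matrix product applied to the per-channel wavelet coefficients; a minimal single-epoch sketch (so n = 1) is shown below, with names and array layout assumed for illustration.

```python
import numpy as np

def filtered_ersp(coeffs, W):
    """coeffs: wavelet coefficients of one epoch, shape (n_channels, n_freqs, n_times);
    W: CSP projection matrix of Eq. (2). Returns ERSP(w_i, t, f) and its average."""
    n_c, n_f, n_t = coeffs.shape
    V_ori = coeffs.reshape(n_c, -1)                   # channel-space vectors, Eq. (5)
    V_proj = W @ V_ori                                # projection onto CSP subspace, Eq. (6)
    power = np.abs(V_proj.reshape(W.shape[0], n_f, n_t)) ** 2   # Eq. (7) with n = 1
    return power, power.mean(axis=(1, 2))             # Eq. (8): average over t and f
```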
4 Experiment and Evaluation of Visualization
To evaluate the features discussed in this paper, we realized an online system and performed a body movement imagination experiment. In the training stage of the experiment, an arrow is displayed and the subject imagines the corresponding movement of the left or right arm; the subject's EEG signal is recorded as the training set. After training the model using the CSP algorithm and an SVM, the subject imagines the movement of the left or right arm while his or her EEG data is visualized on the screen. We then examine how the frequency- and space-domain characteristics are demonstrated in the visualizations.
4.1 ERSP Feature
In Fig. 1 are two visualized ERSPs of 60 one-second-length epochs at C4, which belong to different classes. These two images are the color-scaling result of the
matrix defined by (3) with n = 1. The horizontal axis is time t and the vertical axis is frequency f. Since each image is actually the average ERSP over an epoch set, the difference between them is significant. According to the color difference in Fig. 1, the signals of the two classes have different signal power in the selected frequency band, especially at frequencies around 12 Hz.
Fig. 1. ERSP of 60 one-second-length epochs at C4 channel
Fig. 2. Single epoch’s (one-second-length) ERSP of 21 channels associated with movement imagination area. The first three rows are of an epoch belonging to class 1 and the other three rows are of an epoch belonging to class 2.
However, for an online system, where only the current epoch is available, it is difficult to tell which class the epoch belongs to from the ERSP images, even if more channels are provided, as shown in Fig. 2. Fig. 2 shows the ERSP of 21 channels related to the movement imagination brain region. The first three rows are of an epoch belonging to class 1 and the other three rows are of an epoch belonging to class 2. Each small chart in the figure is the same color-scaling result as in Fig. 1, except that only a single epoch is used. When only one channel is provided, the difference between the classes is too subtle to be intuitive. When all the channels are provided, the information comes in abundance and the observer's capability of perception is easily overwhelmed, especially in an online system where these charts change all the time. This is the motivation for finding methods to extract information from ERSP.
4.2 CSP Feature
The last row of Fig. 3 shows a bi-class model, as well as three epochs classified by the model. Different colors represent data with different labels in the training set, which also indicate the model used for classification. The current unlabeled datum is shown as a green dot in the chart. By observing the position of the current datum, one can get a general idea of the quality of the epoch. When the dot lies in the overlapping region of the classes in the training set, it is hard to determine to which class it should be assigned, as in the third chart of that row. On the contrary, when it lies in a region where almost all the training data belong to one class, the epoch definitely belongs to that class and is of high quality, as in the other two charts of that row.
Fig. 3. Filtered ERSP and the corresponding principle component of CSP
4.3 Filtered ERSP Feature
Filtered ERSP is the solution our system uses to make the frequency-domain characteristics of a single epoch comprehensible. For the same epoch as in Fig. 2, the associated filtered ERSP is shown in Fig. 3. The signal power in the different channels of the filtered ERSP is distinguishable between classes, as demonstrated by the auxiliary chart on the right. For class 1, the signal power in the second and third channels is greater than in the other two, and vice versa. This observation is intuitive when running the online system.
In Fig. 3, a visualization of the corresponding CSP is provided to indicate the quality of this epoch. The left chart is for class 1 and the middle one is for class 2. An additional epoch is also visualized in Fig. 3 to show the overlapping case; this epoch's signal power varies little among the four channels of the filtered ERSP.
4.4 Rhythm Power Feature
The classification of EEG signals is based on the fact that imagining different movements leads to different active brain regions. The rhythm power is provided in the system to give an idea of the active brain regions. Fig. 4 shows the rhythm power of two different classes. The deeper the color, the higher the signal power in that region. According to Fig. 4, the two classes are characterized by different active regions.
Fig. 4. Visualized rhythm power of two epochs belonging to different classes
5 Conclusion
This paper has discussed a general framework for EEG data visualization. A number of methods exploring basic EEG features, such as ERSP, were discussed. We proposed two techniques, rhythm power and filtered ERSP, to extract useful information from the original ERSP. Filtered ERSP provides information in the frequency domain; considering the significant frequency characteristics of EEG signals, it is useful to have an idea of the details of the signal's frequency content. Rhythm power gives an idea of the active brain region. Since the CSP algorithm, which is widely used in BCI systems, aims at finding the brain region with the greatest signal power difference, rhythm power is also useful for understanding the CSP feature. Moreover, with the visualized CSP feature in the system, it is possible to evaluate the current epoch's quality, which helps to improve the experimental results.
Acknowledgement The work was supported by the Science and Technology Commission of Shanghai Municipality (Grant No. 08511501701), the National Basic Research Program of China (Grant No. 2005CB724301), and the National Natural Science Foundation of China (Grant No. 60775007).
References
1. Senhadji, L., Dillenseger, J.L., Wendling, F., Rocha, C., Kinie, A.: Wavelet Analysis of EEG for Three-Dimensional Mapping of Epileptic Events. Ann. Biomed. Eng. 23(5), 543–552 (1995)
2. Zhang, Z., Kawabata, H., Liu, Z.Q.: EEG Analysis using Fast Wavelet Transform. In: IEEE International Conference on Systems, Man, and Cybernetics, vol. 4, pp. 2959–2964 (2000)
3. Wentrup, M.G., Buss, M.: Multi-class Common Spatial Patterns and Information Theoretic Feature Extraction. IEEE Trans. Biomed. Eng. 55(8), 1991–2000 (2008)
4. Ramoser, H., Muller-Gerking, J., Pfurtscheller, G.: Optimal spatial filtering of single trial EEG during imagined hand movement. IEEE Trans. Rehabil. Eng. 20(5), 100–120 (1998)
5. Miwakeichi, F., Martinez-Montes, E., Valdes-Sosa, P.A., Nishiyama, N., Mizuhara, H., Yamaguchi, Y.: Decomposing EEG data into space-time-frequency components using Parallel Factor Analysis. NeuroImage 22(3), 1035–1045 (2004)
Applications of Second Order Blind Identification to High-Density EEG-Based Brain Imaging: A Review Akaysha Tang Department of Psychology and Department of Neurosciences, University of New Mexico, 1 University Blvd, Albuquerque, NM, USA
[email protected]
Abstract. In the context of relating specific brain functions to specific brain structures, second-order blind identification (SOBI) is one of the blind source separation algorithms that have been validated extensively in the data domain of human high-density EEG. Here we provide a review of empirical data that (1) validate the claim that SOBI is capable of separating correlated neuronal sources from each other and from typical noise sources present during an EEG experiment; (2) demonstrate the range of experimental conditions under which SOBI is able to recover functionally and neuroanatomically meaningful sources; (3) demonstrate cross-subject as well as within-subject (cross-time) reliability of SOBI-recovered sources; and (4) demonstrate the efficiency of SOBI separation of neuronal sources. We conclude that SOBI may offer neuroscientists as well as clinicians a cost-effective way to image the dynamics of brain activity in terms of signals originating from specific brain regions using the widely available EEG recording technique. Keywords: BSS, ICA, SOBI, Source modeling, Source localization, Single-trial analysis, human EEG, Multichannel.
1 Introduction
Relating specific brain functions to specific brain structures is a fundamental problem in neuroscience. Of the many sensor modalities that offer measurement of brain signals, EEG is one that is mobile, relatively inexpensive, and has a high temporal resolution on the order of milliseconds. Thus, EEG potentially has the widest applications in both research and clinical settings. However, until recently, EEG data were typically expressed as signals read at particular sensor locations outside of the head, and thus did not readily provide direct answers to the question of the structure-function relationship. To investigate structure-function relations, one needs to separate the mixture of signals recorded at each EEG sensor into signals from functionally and neuroanatomically specific brain sources. InfoMax ICA [1] and SOBI [2, 3] are two frequently used blind source separation algorithms for relating specific brain structures to specific brain functions [4-9]. Since many original and review papers were written solely on the topic of ocular artifact removal, here I present a review of SOBI applications with an exclusive focus on the separation of neuronal sources from high-density EEG data.
2 Validation
An algorithm may take a long time to move from its birth to wide application, or may never reach a wide range of users; for the application of blind source separation (BSS) in general, this seems to be the case. One of the reasons for this slow translation may have to do with how the algorithm is validated. In the field of engineering or mathematics, a source separation algorithm is typically validated initially using simulated data with certain characteristics. As the algorithm is applied to one specific signal domain (e.g. acoustic versus neuroelectrical), the simulated data may or may not capture the critical features that enable source separation within that domain. Hence, the ultimate validation that is meaningful and convincing to the end user has to be validation using data from that specific signal domain. Two examples of such domain-specific validations are presented below.
Fig. 1. SOBI recovery of artificially created known noise sources
2.1 Validation via "Bad" EEG Sensors
To validate, one needs the source signals to be separated to be somehow known in advance. How could one find such known sources when one is trying to separate the mixture of EEG signals? We took advantage of the so-called bad sensors to show that temporally overlapping transient noises injected into adjacent EEG sensors can be
recovered as separate sources, and that the recovered source locations and time courses match the known source locations and time courses [10]. As shown in Fig. 1, three arbitrarily chosen EEG sensors (59, 60, 61) were touched, one, two, or all three simultaneously, during Epochs 1, 2, and 3, to inject noise into specific sensors. Since we know which sensors were touched, these sources are known. We were able to find three SOBI components (6, 2, and 3, respectively) whose time courses match those of the sensors (59, 60, and 61) and whose spatial maps have peak activity centered at the correct sensor locations. Note that these touch-induced noise sources represent a class of commonly present, unknown, and unpredictable noise sources associated with minor head movement and other changes in the physical and electrical contact between the EEG sensor and the scalp. The ability to isolate them from the neuronal sources is critical for correct separation of neuronal sources.
Fig. 2. SOBI recovery of correlated neuronal sources
2.2 Validation via Bench-Mark Neuronal Sources
The ability to separate overlapping noise sources in the presence of neuronal sources does not guarantee that the algorithm will be able to separate neuronal sources among themselves, particularly when the neuronal sources are activated in a correlated fashion, as in the case of simultaneous electrical stimulation of the left (L) and right (R) median nerves. Here we show that the latter can be achieved with SOBI [8, 10].
EEG data were recorded during mixed trials of simultaneous L and R stimulation and unilateral L or R stimulation to generate correlated but “known” activation of the L and R primary somatosensory cortices (SI). If SOBI works well with correlated neuronal sources, at least two SOBI components should have spatial maps of activation that can be well explained by the known extremely focal and superficially located dipole sources at the expected SI locations. Shown in Fig. 2 are two such component sensor space projections (A,B) and the projections of two dipole sources placed at the typical locations of the L and R SIs (C, D). Notice how similar they are and how little residual is left (E,F) if one subtracts the maps of the dipole models (C,D) from the maps of the components (A,B). The locations of the dipoles with the least square fit are typical of SI as established by converging imaging modalities.
3 Robustness and Versatility
The usefulness of a source separation algorithm for basic neuroscience research and clinical diagnosis and monitoring depends, to a large extent, on the robustness of the algorithm across a wide range of data-acquisition conditions. Variations in such conditions may arise from differences in the noise present in the recording environment. Variations may also be associated with the specific brain functions one attempts to study, which require the use of different activation tasks or a lack of any tasks (e.g. in sleep and meditation studies or the study of coma patients). Here we show two examples that expand the limit of what one typically considers possible to obtain from scalp-recorded EEG.
3.1 Separation of Neuronal Sources from Resting EEG
Separation of scalp EEG signals into source signals, if done at all, is typically done for EEG data collected using ERP paradigms, where an average waveform over repeated presentations of a stimulus or repetitions of a response is generated and used in the process of dipole or other model fitting. Sources are fitted for different temporal components of a characteristic waveform. Such an approach excludes the possibility of source modeling when the EEG data were collected without ERPs (e.g. during sleep or meditation). As SOBI uses temporal delays computed over continuous data, there is no reason to require an ERP paradigm. We have shown that SOBI can decompose the scalp-recorded mixed signals into components whose sensor space projections are characteristic of those found to be neuronal sources, and that these projections can be well accounted for by dipoles at known, neuroanatomically meaningful locations [11]. Shown in Fig. 3 are typical examples of neuronal sources recovered from approximately 10 min of resting EEG. On the left are sources believed to correspond to neuronal sources along the ventral visual processing streams, on the right-top are sources of the L and R primary somatosensory cortices, and on the right-bottom are multiple frontal sources. This example suggests that with SOBI, one can monitor fast neuroelectrical activity at specific brain regions without having to make the subjects perform any specific task, thus enabling investigation of brain function during sleep, meditation, coma, and other conditions that render subjects incapable of performing a task.
Fig. 3. SOBI recovery of neuronal sources from resting EEG
3.2 Separating Neuronal Sources from EEG Recorded during Free and Continuous Eye Movement
As the electrical signals associated with eye movement can be 1-2 orders of magnitude larger than neuronal signals, it has become an accepted practice to manually review the entire EEG record channel by channel to identify specific time windows where eye blinks and eye movements have occurred. Subsequently, data from these time windows are "chopped" for the purpose of "artifact removal". This approach would fail completely if one's goal is to investigate brain function while the subject is engaging in activity requiring normal, free, and continuous eye movement. Here we show examples of a neuronal source in the posterior visual cortex and an ocular source, recovered by SOBI from EEG data collected while the subject was playing a video game in front of a computer screen for less than 20 minutes [9]. The sensor space projections of both SOBI-recovered sources (Fig. 4, top row) are characteristic of those found from EEG recordings of an ERP experiment, and their respective spatial origins are provided by the dipole models (Fig. 4, middle row). Most importantly, when an average waveform is generated by averaging signals from multiple epochs surrounding a button press, a waveform resembling the visual evoked potentials (VEPs) emerges for the posterior visual source (Fig. 4, right-bottom). Furthermore, the similarly generated average waveform for the ocular source shows large amplitude variations associated with eye movement even though it overlaps in time with the VEPs of the visual source (Fig. 4, left-bottom). This experiment demonstrates that with SOBI, neuronal sources can be recovered even in the presence of continuous eye movement that generates large amplitude signals
Fig. 4. SOBI recovery of neuronal and ocular sources from data obtained during continuous eye movement
overlapping with all neuronal activity. This capability offers neuroscientists and clinicians a new opportunity to study their chosen phenomena within a normal real-world context.
4 Reliability
The usefulness of a source separation algorithm also depends on the reliability of the algorithm in finding similar neuronal sources across different subjects (cross-subject reliability) and in finding similar neuronal sources across repeated recording sessions (within-subject reliability), particularly across long time delays (days and weeks). Within-subject reliability across longer time delays is particularly critical for addressing questions in developmental neuroscience and in monitoring progression, treatment, and recovery from brain pathology. Here we present descriptive data pertaining to these two forms of reliability [7].
4.1 Cross-Subject Reliability
To evaluate cross-subject reliability in identifying sources corresponding to the same architectonically defined brain regions across multiple subjects, one ideally needs the structural MRI images of the individual subjects, as large variations in individual brain structures exist. Here the structural MRI of a standard brain is used. With this limitation in mind, we show two typical sources: the top row is for a frontal source and the bottom is for a visual source.
These two sources are used as benchmark sources because they are always found in all EEG recordings regardless of what the subjects are doing, whether it is eyes-closed resting, eyes-closed imagining, eyes-open resting, or eyes-open active viewing, and whether it is during a visual or somatosensory activation paradigm. The variation in scalp maps across different subjects (columns) is reasonable because the activation across the map (Fig. 5, top row: voltage map; middle row: current source density) is a function of both brain activity and the relative position of the EEG cap on the head.
Fig. 5. Frontal and posterior sources from 14 subjects: cross-subject variations
4.2 Within-Subject Reliability (Cross-Time)
Variations in source identification across recording sessions separated by days or weeks may arise from multiple causes. These include changes in the EEG cap placement on the head, changes in the subject's state of mind, changes due to maturation if the delay is sufficiently long to cover a window of developmental change, or changes associated with health status and medical treatment. It is important to maintain the ability to separate neuronal sources and match one set of sources at one time to that of another, while simultaneously retaining the ability to compare temporal dynamic changes reflecting the differing circumstances. Shown in Fig. 6 are dipole locations for the two typical sources (left: posterior visual; right: frontal cortex) [7]. The three rows correspond to three sessions of recordings of the same groups of subjects (Week 0, Week 1, and Week 3 or longer). The multiple overlapping dipoles are from different subjects. First, the tight clustering of dipoles within each sub-panel further supports cross-subject reliability. No statistically significant differences in source locations were found across weeks, and neither were there visible differences in dipole clustering. This level of within-subject reliability means that with SOBI, one can investigate long-term changes of a given brain region.
Fig. 6. Frontal and visual sources recovered from 3 sessions up-to one month apart (cross time within-subject reliability)
5 Efficiency
There are two contrasting types of applied problems. The first deals with only one set of enormously complex data, where the data should be handled as carefully as possible and efficiency is not a primary concern. The other deals with a large number of data sets whose processing is time-sensitive and for which efficiency is critical. Brain imaging data in the context of clinical diagnosis and monitoring belongs to the latter category. Here, using the separation of the L and R SIs as benchmark neuronal sources, we show how quickly SOBI can reach a stable solution. Shown in Fig. 7 are results from EEG data collected during median nerve electrical stimulation from four subjects. SOBI is an iterative algorithm, and the separation matrix produced by SOBI is modified at each iteration. The sine of the rotation angle is used as an indicator of whether one should continue the iterative process. We examined how the spatial location of the SOBI-recovered L and R SI sources changes as a function of the number of iterations, as well as the ERP waveforms (not shown here) after each iteration. We found that after fewer than 40 iterations, the resulting SOBI-recovered L and R SI sources for all subjects showed essentially no differences. Though the number of iterations required to reach the stable solution differs across subjects, possibly due to the quality of the data as well as individual differences in the neuronal sources themselves, this experiment showed that as few as 22 iterations could be enough for SOBI to reach stable solutions for certain neuronal sources. This suggests that the SOBI process might, for all practical purposes, be surprisingly fast, particularly in comparison to other algorithms that require randomly set initial conditions and averaging of multiple sets of solutions across a large number of random initial conditions (e.g. as in the case of InfoMax ICA).
Fig. 7. The SOBI process can reach a stable source solution in as few as 22 iterations
6 Conclusions
We presented a mini-review of SOBI applications for addressing the problem of structure-function relations using high-density EEG. This presentation is neither a comprehensive review of all work applying SOBI to EEG data nor a general review of different BSS algorithms applied to brain imaging data. We specifically left out work exclusively focused on artifact removal, a topic for which many excellent reviews exist. The work reviewed here is exclusively empirical and selective, for the purpose of focusing on (1) signal-domain-specific validations; (2) robustness across varying experimental conditions; (3) reliability of source identification across repeated measures; and (4) efficiency. I believe that this review fills a particular gap in knowledge about SOBI that is worth sharing with both the signal processing and the neuroscience communities.
References
1. Bell, A.J., Sejnowski, T.J.: An Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation 7, 1129–1159 (1995)
2. Belouchrani, A., Meraim, K.A., Cardoso, J.F., Moulines, E.: Second-Order Blind Separation of Correlated Sources. In: Proc. Int. Conf. on Digital Sig. Proc., Cyprus (1993)
3. Belouchrani, A., Abed-Meraim, K., Cardoso, J.F., Moulines, E.: A Blind Source Separation Technique Using Second-Order Statistics. IEEE Transactions on Signal Process. 45, 434–444 (1997)
4. Makeig, S., Jung, T.P., Bell, A.J., Ghahremani, D., Sejnowski, T.J.: Blind Separation of Auditory Event-Related Brain Responses into Independent Components. Proceedings of the National Academy of Sciences 94, 10979–10984 (1997)
5. Makeig, S., Westerfield, M., Jung, T., Enghoff, S., Townsend, J., Courchesne, E., Sejnowski, T.: Dynamic Brain Sources of Visual Evoked Responses. Science 295, 690–694 (2002)
6. Holmes, M., Brown, M., Tucker, D.: Dense Array EEG and Source Analysis Reveal Spatiotemporal Dynamics of Epileptiform Discharges. Epilepsia 46, 136 (2005)
7. Tang, A.C., Sutherland, M.T., Peng, S., Zhang, Y., Nakazawa, M., Korzekwa, A.M., Yang, Z., Ding, M.Z.: Top-Down versus Bottom-Up Processing in the Human Brain: Distinct Directional Influences Revealed by Integrating SOBI and Granger Causality. In: Davies, M.E., James, C.J., Abdallah, S.A., Plumbley, M.D. (eds.) ICA 2007. LNCS, vol. 4666, pp. 802–809. Springer, Heidelberg (2007)
8. Sutherland, M.T., Tang, A.C.: Reliable Detection of Bilateral Activation in Human Primary Somatosensory Cortex by Unilateral Median Nerve Stimulation. Neuroimage 33, 1042–1054 (2006)
9. Tang, A.C., Sutherland, M.T., McKinney, C.J., Liu, J.Y., Wang, Y., Parra, L.C., Gerson, A.D., Sajda, P.: Classifying Single-Trial ERPs from Visual and Frontal Cortex during Free Viewing. In: IEEE Proceedings of the 2006 International Joint Conference on Neural Networks, Vancouver, Canada (2006)
10. Tang, A.C., Sutherland, M.T., McKinney, C.J.: Validation of SOBI Components from High-Density EEG. Neuroimage 25, 539–553 (2005)
11. Sutherland, M.T., Tang, A.C.: Blind Source Separation can Recover Systematically Distributed Neuronal Sources from Resting EEG. In: EURASIP Proceedings of the Second International Symposium on Communications, Control, and Signal Processing, Marrakech, Morocco (2006)
A Method for MRI Segmentation of Brain Tissue
Bochuan Zheng1,3 and Zhang Yi2
1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, P.R. China
2 Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu 610065, P.R. China
3 College of Mathematics and Information, China West Normal University, Nanchong 637002, P.R. China
Abstract. The competitive layer model (CLM) of the Lotka-Volterra recurrent neural networks (LV RNNs) is capable of binding similar features into one layer through competition among neurons across layers. In this paper, the CLM of the LV RNNs is used to segment brain MR images. First, the CLM of the LV RNNs is applied to segment each subimage into several regions; second, a similar neighboring region merging algorithm is adopted to merge similar neighboring regions into larger regions, based on the intensity and area ratio of the two neighboring regions; finally, the surviving regions are classified into four classes by region-based fuzzy C-means (RFCM), corresponding to the four tissue classes of the brain. Compared with three other methods, the proposed method shows better performance. Keywords: Image Segmentation, Competitive Layer Model, Magnetic Resonance Imaging, Fuzzy C-means.
1 Introduction
Magnetic resonance (MR) images are widely used in computer-aided diagnosis and therapy due to their virtually noninvasive nature, high spatial resolution, and excellent soft-tissue contrast [1]. Brain MR image segmentation is an important stage for automatically or semiautomatically distinguishing different brain tissues or detecting tumors, edema, and necrotic tissue. However, MR image segmentation is a complex and challenging task due to the convoluted shapes, blurred boundaries, inhomogeneous intensity distribution, background noise, and low intensity contrast between adjacent brain tissues [2]. So far, many techniques for MR image segmentation have been reported, including thresholding, region growing, deformable models, and neural networks [3]. Many clustering algorithms can also be used for the segmentation of brain MR images, e.g. the K-means [4], fuzzy c-means (FCM) [5], and expectation-maximization (EM) [1] algorithms. In [7], Wersing et al. proposed a competitive layer model of the linear threshold recurrent neural networks (LT RNNs), which has the property of spatial feature
binding. The CLM of the LT RNNs has been applied to image segmentation, salient contour extraction, and motion grouping. In this paper, we use the CLM of the LV RNNs proposed in [12] to segment brain MR images, in combination with a region merging algorithm. Our method is compared with three other methods on brain MR images and achieves better performance. The paper is organized as follows. In Section 2, the CLM of the LV RNNs is presented; Section 3 gives our image segmentation method; experiments comparing our method with three other image segmentation methods are given in Section 4; finally, we end with conclusions.
2 The CLM of the LV RNNs
The LV RNN was first proposed by Fukai in [8]. Derived from the conventional membrane dynamics of competing neurons, the LV RNN has been successfully applied to winner-take-all, winner-share-all, and k-winner-take-all problems [9]. Conditions for the convergence of the LV RNNs are reported in [10] [11]. In [12], the CLM of the LV RNNs is given as

\dot{x}_{i\alpha}(t) = x_{i\alpha}(t) \left[ C \left( h_i - \sum_{\beta=1}^{L} x_{i\beta}(t) \right) + \sum_{j=1}^{N} w_{ij} x_{j\alpha}(t) \right] ,   (1)
for i = 1, ..., N and α = 1, ..., L, where x(t) ∈ R^{NL} denotes the state of the network at time t. Fig. 1 shows the CLM architecture. It contains a set of L layers, and in each layer there are N neurons; thus the CLM contains N × L neurons in total. Neurons within each layer are laterally connected to each other through the N × N weight matrix W, which is identical in all layers. Between different layers, only those neurons arranged in the same column are vertically connected to each other, through the constant weight C. The external input to the ith neuron in the αth layer is denoted by h_i > 0 (i = 1, ..., N), which is independent of the layer index. This model implements layer competition through cooperation between neurons within a layer and competition between neurons within a column. If a column is associated with one feature, the CLM of the LV RNNs can be used for feature binding.
3 Segmentation of Brain MR Image
Given a gray image of size X × Y, denote every pixel by the feature f_i = (x_i, y_i, I_i), where (x_i, y_i) is the position of pixel i in the image, i = x × Y + y, x = 1, ..., X, y = 1, ..., Y, and I_i is the intensity value of pixel i. w_ij is the lateral interaction between the two pixel features indexed by i and j; it is assumed that w_ij ≥ 0 if f_i and f_j are similar. Define the compatibility between features f_i and f_j as

\Phi_{ij} = e^{-v/k_1} (e^{-d/k_2} + 1) - \theta ,
Fig. 1. The CLM architecture
where v = |I_i − I_j|, d = (x_i − x_j)^2 + (y_i − y_j)^2, k_1 controls the sharpness of v, k_2 controls the spatial range of d, and θ is the strength of a global inhibition. Φ_{ij} becomes large when k_1 and k_2 become small. Next, Φ_{ij} is normalized into the range [−1, +1] to obtain w_ij:

w_{ij} =
\begin{cases}
\Phi_{ij} / \max(\Phi), & \text{if } \Phi_{ij} \ge 0, \\
\Phi_{ij} / |\min(\Phi)|, & \text{otherwise,}
\end{cases}   (2)

where the functions max(Φ) and min(Φ) give the maximum and minimum of the matrix Φ, respectively. The parameters used in the experiments are k_1 = 90, k_2 = 20, θ = 1.8. The architecture of the CLM consists of L layers with N neurons in each layer. Given a gray image of size X × Y, N = X × Y, so N × L neurons are employed in this model and the size of W is N × N = (X × Y) × (X × Y). It is therefore not easy to segment one large image directly; instead, the image is divided into many subimages and each subimage is segmented by the CLM of the LV RNNs, which improves the segmentation speed and decreases the required memory.
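The construction of W for one subimage can be written compactly, as in the sketch below; the function name, the vectorized layout, and keeping d as the squared distance exactly as printed above are assumptions of this illustration.

```python
import numpy as np

def clm_weights(patch, k1=90.0, k2=20.0, theta=1.8):
    """Lateral weight matrix w_ij of Eq. (2) for a grey-level subimage patch."""
    X, Y = patch.shape
    xs, ys = np.indices((X, Y))
    pos = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    I = patch.ravel().astype(float)

    v = np.abs(I[:, None] - I[None, :])                            # intensity difference
    d = np.sum((pos[:, None, :] - pos[None, :, :]) ** 2, axis=2)   # spatial term, as printed
    phi = np.exp(-v / k1) * (np.exp(-d / k2) + 1.0) - theta        # compatibility Phi_ij

    # piecewise normalization of Eq. (2)
    pos_max = phi.max()
    neg_min = abs(phi.min()) if phi.min() < 0 else 1.0
    return np.where(phi >= 0, phi / pos_max, phi / neg_min)
```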
3.1 Segment Subimage by the LV RNNs
Let the size of each divided subimage be X^s × Y^s. Then there are P = (X/X^s) × (Y/Y^s) subimages in an X × Y image I. Denote I = \bigcup_{k=1}^{P} I_k^s, where I_k^s is a subimage with N^s = X^s × Y^s pixels. Suppose that the size of all subimages
is the same; here X^s = Y^s = 10, and L = 4 is the number of network layers. For each I_k^s, we apply the CLM of the LV RNNs to segment it as follows:
1. calculate the W of the subimage I_k^s using Equation (2);
2. initialize x_{iα}(0) = ε + α/L, h_i = 1, C = 300, where i = 1, ..., N^s and α = 1, ..., L;
3. integrate the continuous neural network (1) until convergence, obtaining the stable equilibrium state x(t) of the LV RNNs;
4. get the segmented subimage D^s of I_k^s by calculating the class label matrix T(x, y) = \arg\max_{\alpha=1,\dots,L} x_{i\alpha}, with i = (x − 1)Y^s + y.
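A simple forward-Euler sketch of steps 1-4 is given below; it reuses the clm_weights helper from the previous sketch, and the step size, iteration count, and non-negativity clipping are pragmatic assumptions made for illustration, since the paper only states that network (1) is integrated until convergence.

```python
import numpy as np

def clm_segment(patch, L=4, C=300.0, eps=0.01, dt=1e-4, n_steps=20000):
    """Segment one subimage with the CLM of the LV RNNs (steps 1-4 above)."""
    W = clm_weights(patch)                          # step 1, Eq. (2)
    N = W.shape[0]
    h = np.ones(N)                                  # external inputs h_i = 1
    alphas = np.arange(1, L + 1)
    x = eps + np.tile(alphas / L, (N, 1))           # step 2: x_ia(0) = eps + a/L

    for _ in range(n_steps):                        # step 3: integrate Eq. (1)
        col_sum = x.sum(axis=1, keepdims=True)      # sum over layers for each neuron
        dx = x * (C * (h[:, None] - col_sum) + W @ x)
        x = np.maximum(x + dt * dx, 0.0)            # Euler step, clipped to stay >= 0

    return x.argmax(axis=1).reshape(patch.shape)    # step 4: winning layer per pixel
```

In practice one would monitor the change in x between iterations and stop when it falls below a tolerance, rather than running a fixed number of steps.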
3.2 Merge Similar Neighboring Regions
After segmenting all subimages using the CLM of the LV RNNs, we obtain a new segmented image D. Assume there are G 4-connected regions in the image D, forming the region set Ω = {Ω_1, Ω_2, ..., Ω_G}. The two most similar regions are merged into a larger region at each step, so the distance between any two neighboring regions needs to be computed before merging. The distance between two neighboring regions, modified from the merging likelihood computation in [13], takes into account not only the homogeneity of the intensity but also the geometric properties of the regions. In this paper, we use the intensity distance and the region area ratio of neighboring regions to form the merging likelihood of two regions. Let F be the similarity matrix of the region set Ω; then F_ij, the similarity of any two regions Ω_i and Ω_j, reads

F_{ij} = |Mean(\Omega_i) - Mean(\Omega_j)| + \rho \, \frac{small(\Omega_i, \Omega_j)}{large(\Omega_i, \Omega_j)} ,   (3)
where the function Mean(Ω_i) computes the mean intensity value of the pixels in Ω_i, and small(Ω_i, Ω_j) and large(Ω_i, Ω_j) give the number of pixels in the smaller and larger region, respectively. ρ is the weight of the region area ratio of neighboring regions. If two regions are not neighbors, then F_ij = inf. Let Threshold be the maximum distance between neighboring regions that are allowed to be merged. Based on the definitions above, the region merging algorithm can be described as:
1. calculate the similarity matrix F using Equation (3);
2. find the minimum value F_ij^min in the similarity matrix F;
3. merge Ω_j into Ω_i when F_ij^min is less than or equal to the Threshold, and remove Ω_j from Ω;
4. return to Step 1 until F_ij^min is larger than the Threshold.
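A direct, if unoptimized, implementation of these four steps might look like the following sketch; the values of ρ and Threshold and the 4-connected neighbourhood test are illustrative assumptions, as the paper does not report them.

```python
import numpy as np

def merge_regions(labels, image, rho=50.0, threshold=30.0):
    """Iteratively merge the most similar 4-connected neighbouring regions, Eq. (3)."""
    labels = labels.copy()

    def neighbours(a, b):
        ma = labels == a
        grown = np.zeros_like(ma)
        grown[1:] |= ma[:-1]; grown[:-1] |= ma[1:]          # vertical neighbours
        grown[:, 1:] |= ma[:, :-1]; grown[:, :-1] |= ma[:, 1:]  # horizontal neighbours
        return np.any(grown & (labels == b))

    while True:
        regs = np.unique(labels)
        best = None
        for i, a in enumerate(regs):
            for b in regs[i + 1:]:
                if not neighbours(a, b):
                    continue
                na, nb = np.sum(labels == a), np.sum(labels == b)
                f = (abs(image[labels == a].mean() - image[labels == b].mean())
                     + rho * min(na, nb) / max(na, nb))
                if best is None or f < best[0]:
                    best = (f, a, b)
        if best is None or best[0] > threshold:     # step 4: stop when F_min > Threshold
            return labels
        labels[labels == best[2]] = best[1]         # step 3: merge the two regions
```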
3.3 Cluster the Survived Regions by RFCM
In order to ensure that the final segmentation result has only 4 classes, corresponding to the 4 tissue classes of the brain, all surviving regions are clustered into 4 classes by RFCM. Denote the surviving regions by Ω^s = {Ω_1^s, Ω_2^s, ..., Ω_{G'}^s}, where G'
is the number of surviving regions. The mean intensity value of each region Ω_i^s is denoted by z_i, i = 1, 2, ..., G'. The RFCM clustering algorithm can be formulated as

\min \; J(U, V) = \sum_{i=1}^{G'} \sum_{k=1}^{C} u_{ik}^{m} \, \| z_i - v_k \|^2 , \qquad \text{subject to } \sum_{k=1}^{C} u_{ik} = 1 ,   (4)
where the matrix U = {u_ik} is a fuzzy c-partition of Ω^s, and u_ik represents the membership of region Ω_i^s in the kth cluster, with u_ik ∈ [0, 1]. C is the number of clusters, and V = {v_1, v_2, ..., v_C} denotes the cluster centers; here C = 4. m ∈ (1, ∞) is a weighting exponent on each fuzzy membership; we chose m = 2 as an example.
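Since RFCM clusters only the scalar region means z_i, the standard fuzzy C-means alternating updates suffice. The sketch below is a minimal version, with the random initialization and fixed iteration count as illustrative choices.

```python
import numpy as np

def region_fcm(z, C=4, m=2.0, n_iter=100, seed=0):
    """Fuzzy C-means on region mean intensities z, minimizing Eq. (4)."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z, dtype=float)
    v = rng.choice(z, size=C, replace=False)               # initial cluster centers

    for _ in range(n_iter):
        d = np.abs(z[:, None] - v[None, :]) + 1e-12        # region-to-center distances
        # membership update: u_ik = 1 / sum_l (d_ik / d_il)^(2/(m-1))
        u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
        # center update: v_k = sum_i u_ik^m z_i / sum_i u_ik^m
        v = (u ** m).T @ z / (u ** m).sum(axis=0)

    return u.argmax(axis=1), v                             # hard tissue labels and centers
```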
4 Experimental Results
A T1-weighted MR image volume with 7% noise and 20% intensity non-uniformity was downloaded from BrainWeb [14]. The 98th brain-only slice of the MR volume is segmented into 4 clusters, background, cerebral spinal fluid (CSF), white matter (WM), and gray matter (GM), using the proposed method. The segmentation results at the three stages are shown in Fig. 2. Most of the noise points are eliminated in stage one (Fig. 2(b)). Through the merging stage (Fig. 2(c)), similar regions are merged, which improves the segmentation. Finally, the surviving regions in Fig. 2(c) are clustered into four classes. In order to display these 4 clusters distinctly, the intensity values labeling the different tissue regions in the ground truth and in the segmentation result images in Fig. 2(d) and the following figures are assigned as follows: 0 for background, 254 for CSF, 90 for GM, and 180 for WM.
Fig. 2. Segmentation results at three stages. (a) the original 98th brain noisy slice image, (b) segmentation result by the CLM of the LV RNNs, (c) segmentation by merging, (d) final segmentation result by RFCM.
Three other methods, including k-means, FCM and EM, are employed to compare against the segmentation performance of the proposed method. Fig. 3 shows the
segmentation results of the 98th brain-only slice image with 7% noise level and 20% intensity non-uniformity downloaded from the BrainWeb using the different methods. It is shown that the proposed method produces a better segmentation result than the other three methods, and there are fewer noise points in its segmentation result image than in those of the other three methods.
Fig. 3. Segmentation results. (a) the original 98th brain noisy slice image. (b) ground truth. (c) segmentation result of K-means. (d) segmentation result of FCM. (e) segmentation result of EM. (f) segmentation result of our method.
5 Conclusions
Medical images generally contain unknown noise and considerable uncertainty, and therefore clinically acceptable segmentation performance is difficult to achieve. In this paper, considering that it costs a great amount of memory and time to segment the whole image at once, an image is divided into many square blocks to be segmented by the CLM of the LV RNNs, which perfectly segments not only brain MR images without noise but also those with noise. The similar-neighboring-region merging algorithm merges the most similar neighboring regions according to the mean intensity value and the region area ratio of neighboring regions. In the merging stage, the smaller noise regions can be merged into the larger ones according to the merging condition. As a result, all remaining noise can be further cleared at this stage. The proposed method is insensitive to noise compared with the other three methods.
Acknowledgments. This work was supported by the Chinese 863 High-Tech Program under Grant 2008AA01Z119.
References
1. Wells, W.M., Grimson, W.E.L., Kikinis, R., Jolesz, F.A.: Adaptive segmentation of MRI data. IEEE Transactions on Medical Imaging 15(4), 429–442 (1996)
2. Shen, S., Sandham, W., Granat, M., Sterr, A.: MRI fuzzy segmentation of brain tissue using neighborhood attraction with neural-network optimization. IEEE Transactions on Information Technology in Biomedicine 9(3), 459–467 (2005)
3. Pham, D.L., Xu, C.Y., Prince, J.L.: A survey of current methods in medical image segmentation. Annual Review of Biomedical Engineering, Annual Reviews 2, 315–337 (2000)
4. Vemuri, B.C., Rahman, S.M., Li, J.: Multiresolution adaptive K-means algorithm for segmentation of brain MRI. In: Chin, R., Naiman, A., Pong, T.-C., Ip, H.H.-S. (eds.) ICSC 1995. LNCS, vol. 1024, pp. 5347–5354. Springer, Heidelberg (1995)
5. Ahmed, M.N., Yamany, S.M., Mohamed, N., Farag, A.A., Moriarty, T.: A modified fuzzy C-means algorithm for bias field estimation and segmentation of MRI data. IEEE Transactions on Medical Imaging 21(3), 193–199 (2002)
6. Alirezaie, J., Jernigan, M.E., Nahmias, C.: Neural network based segmentation of magnetic resonance images of the brain. IEEE Transactions on Nuclear Science 44(2), 194–198 (1997)
7. Wersing, H., Steil, J.J., Ritter, H.: A competitive-layer model for feature binding and sensory segmentation. Neural Computation 13, 357–387 (2001)
8. Fukai, T., Tanaka, S.: A simple neural network exhibiting selective activation of neuronal ensembles: from winner-take-all to winner-share-all. Neural Computation 9, 77–97 (1997)
9. Asai, T., Fukai, T., Tanaka, S.: A subthreshold MOS circuit for the Lotka-Volterra neural network producing the winner-share-all solution. Neural Networks 12, 211–216 (1999)
10. Yi, Z., Tan, K.K.: Global convergence of Lotka-Volterra recurrent neural networks with delays. IEEE Transactions on Circuits and Systems, Part I: Regular Papers 52(11), 2482–2489 (2005)
11. Yi, Z., Tan, K.K.: Convergence Analysis of Recurrent Neural Networks. Kluwer Academic Publishers, Norwell (2004)
12. Yi, Z.: Foundations of implementing the competitive layer model by Lotka-Volterra recurrent neural networks. IEEE Transactions on Neural Networks (in press)
13. Kuan, Y.H., Kuo, C.M., Yang, N.C.: Color-based image salient region segmentation using novel region merging strategy. IEEE Transactions on Multimedia 10(5), 832–845 (2008)
14. BrainWeb, http://www.bic.mni.mcgill.ca/brainweb/
Extract Mismatch Negativity and P3a through Two-Dimensional Nonnegative Decomposition on Time-Frequency Represented Event-Related Potentials Fengyu Cong1, Igor Kalyakin1, Anh-Huy Phan2, Andrzej Cichocki2, Tiina Huttunen-Scott3, Heikki Lyytinen3, and Tapani Ristaniemi1 1
Department of Mathematical Information Technology, University of Jyväskylä, Finland {Fengyu.Cong,Igor.Kalyakin,Tapani.Ristaniemi}@jyu.fi 2 Laboratory for Advanced Brain Signal Processing, Brain Science Institute, RIKEN, Japan {cia,phan}@brain.riken.jp 3 Department of Psychology, University of Jyväskylä, Finland {Tiina.Huttunen,Heikki.Lyytinen}@jyu.fi
Abstract. This study compares the row-wise unfolding nonnegative tensor factorization (NTF) and the standard nonnegative matrix factorization (NMF) in extracting time-frequency represented event-related potentials—mismatch negativity (MMN) and P3a from EEG under the two-dimensional decomposition. The criterion to judge performance of NMF and NTF is based on psychology knowledge of MMN and P3a. MMN is elicited by an oddball paradigm and may be proportionally modulated by the attention. So, participants are usually instructed to ignore the stimuli. However the deviant stimulus inevitably attracts some attention of the participant towards the stimuli. Thus, P3a often follows MMN. As a result, if P3a was larger, it could mean that more attention would be attracted by the deviant stimulus, and then MMN could be enlarged. The MMN and P3a extracted by the row-wise unfolding NTF revealed this coupling feature. However, through the standard NMF or the raw data, such characteristic was not evidently observed. Keywords: Nonnegative matrix/tensor factorization, mismatch negativity, P3a, attention.
1 Introduction
Nonnegative Matrix Factorization (NMF) and Nonnegative Tensor Factorization (NTF) are multi-channel source separation algorithms with the constraints of nonnegativity and sparsity on signals [1-3]. They can be used in many disciplines, including image recognition, language modeling, speech processing, gene analysis, biomedical signals extraction and recognition, and so on. In our previous study [4], we have demonstrated that the standard NMF could extract the time-frequency represented mismatch negativity (MMN) and P3a from EEG recordings and outperform independent component analysis (ICA) [5]. NTF and NMF are similar nonnegative decomposition methods. The difference is that NMF implements the two-dimensional decomposition
and NTF can employ not only the two-dimensional but also multi-dimensional decomposition. Moreover, even in the case of the two-dimensional decomposition, the row-wise unfolding NTF adds more constraints on the decomposition than the standard NMF does [6]. Hence, this study aims to investigate whether row-wise unfolding NTF can extract MMN and P3a as the standard NMF did in [4] and to test whether this NTF algorithm could better reveal the cognitive process than the standard NMF did in the research of MMN.
It is well known that NMF and NTF have the nonnegative constraints on the recordings, but raw EEG recordings do not meet this requirement. To facilitate NMF and NTF, the time-frequency representation of EEG recordings is first achieved, and then NMF and NTF decompose the time-frequency represented EEG to obtain the desired time-frequency represented components [7-8]. This study follows this line. In fact, NMF and NTF act as blind source separation (BSS) [9] in this study. The criteria to evaluate the performance of BSS algorithms usually require the real source signals and the mixing model; however, these are not available in the real EEG recordings. Thus, the criterion to judge the performance of NMF and NTF is based on the psychology knowledge of MMN and P3a in this study. MMN is a negative event-related potential (ERP) and it can be elicited by an oddball paradigm [10]. This paradigm involves the deviant stimulus that is dissimilar to the majority of repeated stimuli presented. MMN can be modulated by the attention [10]. If more attention was paid to the stimuli, MMN might be larger. This is not beneficial to the clinical study of MMN. So, participants are usually instructed to ignore the stimuli. However, the deviant stimulus inevitably attracts the participant to pay some attention to the stimuli. P3a is also produced by the oddball paradigm but the participants are usually asked to pay attention to the stimuli [11]. Thus, P3a often follows MMN [10]. P3a could also be modulated by the attention [11]. As a result, if P3a was larger, it could mean that more attention would be attracted by the deviant stimulus, and then MMN could be enlarged [10]. Such a coupling feature is the criterion to evaluate the performance of the row-wise unfolding NTF and standard NMF in this study.
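As background for the pipeline sketched above (time-frequency transform first, then nonnegative decomposition), the hedged example below builds a nonnegative time-frequency matrix from an averaged trace with a Morlet continuous wavelet transform. The sampling rate, frequency grid and use of PyWavelets' `cwt` are illustrative assumptions, not the authors' exact preprocessing.

```python
import numpy as np
import pywt

def tf_magnitude(trace, fs=200.0, freqs=np.arange(2.0, 20.0, 0.5)):
    """Morlet CWT magnitude of one averaged trace -> nonnegative (freq x time) matrix."""
    scales = pywt.central_frequency("morl") * fs / freqs   # map target frequencies to scales
    coeffs, _ = pywt.cwt(trace, scales, "morl", sampling_period=1.0 / fs)
    return np.abs(coeffs)                                   # nonnegative entries for NMF/NTF

# one way to assemble a nonnegative input for the factorizations is to stack
# the per-channel magnitude maps, e.g.:
# X = np.vstack([tf_magnitude(ch) for ch in averaged_channels])
```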
2 Standard NMF and Row-Wise Unfolding NTF
A linear model represents MMN as

X = AS,    (1)
where X ∈ ℜ^{m×T} is the matrix of observations, A ∈ ℜ^{m×n} is the unknown basis matrix, S ∈ ℜ^{n×T} is the matrix of unknown latent components, and generally T > m ≥ n. Each column of A is the basis function of the corresponding column of S. X, A and S all have non-negative entries. To factorize the non-negative matrix, an adaptive learning rule is obtained by iteratively performing the following two update rules [1]:
a_{i,j} ← a_{i,j} [X S^T]_{i,j} / [A S S^T]_{i,j},    s_{j,k} ← s_{j,k} [A^T X]_{j,k} / [A^T A S]_{j,k}    (2)
When the Euclidean distance ‖X − AS‖ does not increase, it is normally regarded that a stationary point has been reached. Many NMF algorithms are based on such gradient-
related methods [6]. As a local optimum could be mistaken for the global one, like in ICA, single-run NMF may perform badly. To resolve this problem, a sequential factorization of non-negative matrices composes the hierarchical and multistage procedure in [12]. At first, basic NMF finds a stationary point, and X = A_1S_1 is derived; secondly, NMF is performed again, but the object is S_1, and then
S_1 = A_2S_2 is computed; this procedure is applied repeatedly to the newly obtained components until some stopping criteria are met. Thus, the learning procedure can be described as

X = A_1 A_2 ⋯ A_L S_L,   A = A_1 A_2 ⋯ A_L.    (3)
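A compact illustration of Eqs. (2)-(3) is sketched below: multiplicative updates for a single NMF stage, wrapped in the multilayer scheme that refactorizes the latest S. This is an assumed reference implementation (NumPy, random initialization, fixed iteration count), not the NMFLAB routine used by the authors.

```python
import numpy as np

def nmf_multiplicative(X, n_components, n_iter=200, eps=1e-9, seed=0):
    """Basic NMF via the multiplicative updates of Eq. (2)."""
    rng = np.random.default_rng(seed)
    m, T = X.shape
    A = rng.random((m, n_components)) + eps
    S = rng.random((n_components, T)) + eps
    for _ in range(n_iter):
        A *= (X @ S.T) / (A @ S @ S.T + eps)
        S *= (A.T @ X) / (A.T @ A @ S + eps)
    return A, S

def multilayer_nmf(X, n_components, n_layers=10):
    """Hierarchical multilayer procedure of Eq. (3): X = A1 A2 ... AL SL."""
    A_total = np.eye(X.shape[0])
    S = X
    for _ in range(n_layers):
        A_l, S = nmf_multiplicative(S, n_components)
        A_total = A_total @ A_l
    return A_total, S
```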
NMFLAB [6] includes this hierarchical and multistage procedure, and it is adopted to extract ERPs in this study. The fixed point algorithm and 10 layers are selected. The NTF-1 model is flexible and useful in practice [6]. In this model, if a tensor X ∈ ℜ I ×T ×K is given, it could be factorized to A ∈ ℜ I ×R , D ∈ ℜ K ×R , S ∈ ℜ I ×R×T , i.e., a set of matrices, and each entry in each matrix is non-negative. Mathematically,
X_k = A D_k S_k,    (4)
where X_k ∈ ℜ^{I×T} is the k-th frontal slice of X ∈ ℜ^{I×T×K} and it can be considered as the mixtures in ICA; k = 1, ..., K, where K is the number of frontal slices; A is the basis and represents the common factors, and it can be regarded as the mixing matrix in ICA; D_k ∈ ℜ^{R×R} is a diagonal matrix whose main diagonal is the k-th row of D ∈ ℜ^{K×R}; S_k ∈ ℜ^{R×T} denotes the hidden components, and it can be thought of as the sources in ICA. Typically, T >> I, K > R. Normally, the non-negative, sparse, and smooth constraints are utilized for adaptive learning. In this study, the target is to estimate the set of S_k ∈ ℜ^{R×T}. Then, the NTF-1 model can be converted to the row-wise unfolding decomposition model [6]:

X = [X_1, ..., X_k, ..., X_K] = AS.    (5)
As a result, the three-dimensional NTF-1 model is transformed into a two-dimensional NMF problem by unfolding the tensor. However, it should be noted that such a 2D model in general is not exactly equivalent to a standard NMF model, since we usually need to impose different additional constraints for each slice k [6]. In other words, the unfolding model should not be considered as a standard 2-way NMF of a single 2-D matrix [6]. The local optimization problem also exists in NTF, and the hierarchical and multistage procedure for NMF is also helpful to NTF; NTFLAB [6] has already adopted it. Consequently, 10 layers are set here too.
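The sketch below shows the row-wise unfolding step of Eq. (5) for a three-way array and then reuses the multilayer NMF sketch from above on the unfolded matrix. It is a simplified assumption that omits the per-slice constraints which distinguish the NTF-1 factorization from plain NMF.

```python
import numpy as np

def rowwise_unfold(X):
    """X: array of shape (I, T, K) -> concatenate frontal slices X_k along time."""
    I, T, K = X.shape
    return np.hstack([X[:, :, k] for k in range(K)])   # shape (I, T*K)

# assumed usage on a nonnegative (channel x time x slice) array X_tensor:
# A, S_unfolded = multilayer_nmf(rowwise_unfold(X_tensor), n_components=9)
# S_slices = np.split(S_unfolded, X_tensor.shape[2], axis=1)   # recover the S_k blocks
```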
3 Experiment and Results
The EEG data was collected at the Department of Psychology at the University of Jyväskylä, Finland [13]. When we obtained the dataset, the MMN responses of 66
normal children who showed no reading or attention problems, with a mean age of 11 years 11 months, and of 16 children with reading disability (RD), with a mean age of 12 years 2 months, had been recorded. Fig. 1 gives a schematic illustration of the experimental paradigm. An uninterrupted sound composed the stimulus setup through two alternating 100 ms sine tones of 600 Hz and 800 Hz (repeated stimuli). Shorter segments of 50 ms or 30 ms duration randomly replaced 7.5% of the 600 Hz tones. Meanwhile, the experiment guaranteed at least six repetitions of the alternating 100 ms tones between any two of the shorter ones (i.e., deviants). During the experiment, children were told to concentrate on watching a subtitled silent video and not to pay attention to the auditory stimuli. In this paradigm, MMN usually appears within the time window of 50-200 ms after the offset of the deviant stimulus.

Fig. 1. A schematic illustration of the experimental paradigm (Adapted from [14])

EEG recordings started at 300 ms prior to the onset of the deviant stimulus and lasted for 350 ms after its onset. 350 trials of each type of deviant were recorded. The sampling frequency was 200 Hz and an analog band-pass of 0.1-30 Hz was applied to the raw data, so each trial contained 130 samples. Nine electrodes were placed over the standard 10-20 sites. Electrodes included frontal (F3, Fz and F4), central (C3, Cz and C4), parietal (Pz) and mastoid (M1 and M2) placements. Electrodes were referred to the tip of the nose. Data processing included 4 steps: First, the trials with large amplitude fluctuations (exceeding ±100 μV) were rejected, and then the remaining trials were averaged. Second, the Morlet wavelet transform was performed on the averaged trace to achieve the time-frequency represented EEG. Third, standard NMF and row-wise unfolding NTF each estimated nine time-frequency represented components. Fourth, the support to absence ratio (SAR) [4] of each component was calculated and the component with the largest SAR was chosen as the desired component [4]. These steps were implemented on the data of each subject under each deviant. For the SAR of MMN, the support could be the mean energy of a rectangular area in the time-frequency represented component. The dimensions of this rectangle were time by frequency; the frequency range was set as 2-8.5 Hz [13] and the time interval was between 50 ms and 200 ms after the deviant was offset [13-14]. The mean energy of the remaining area in the time-frequency represented component was the absence. SARs of MMN and P3a in normal children and children with RD were investigated through a general linear model and repeated-measures ANOVAs. In this way, the difference of SARs between the two groups of children was tested. Before the statistical tests, the SARs were averaged over the two deviants.
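A hedged sketch of the component-selection criterion described in the fourth step is given below: the support is the mean energy inside the MMN time-frequency rectangle and the absence is the mean energy outside it, combined as a ratio in dB. The exact windowing and energy definition used by the authors may differ; this helper only follows the verbal description.

```python
import numpy as np

def support_absence_ratio(tf_comp, times, freqs,
                          t_win=(0.05, 0.20), f_win=(2.0, 8.5)):
    """tf_comp: (freq x time) energy map of one component; times in s after deviant offset."""
    t_mask = (times >= t_win[0]) & (times <= t_win[1])
    f_mask = (freqs >= f_win[0]) & (freqs <= f_win[1])
    inside = tf_comp[np.ix_(f_mask, t_mask)]
    outside_mask = ~(f_mask[:, None] & t_mask[None, :])
    support = inside.mean()
    absence = tf_comp[outside_mask].mean()
    return 10.0 * np.log10(support / absence)

# the component with the largest SAR is taken as the MMN (or P3a) component:
# best = max(components, key=lambda c: support_absence_ratio(c, times, freqs))
```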
Fig. 2. Time-frequency representation: (a) raw data, (b) axes, (c) standard NMF, (d) row-wise unfolding NTF
For visual inspection, NMF and NTF were performed on the grand averaged data of the 50 ms deviant. Fig. 2 depicts the raw data and the estimated components. Both NMF and NTF separated MMN components out from the grand averaged raw data, demonstrated respectively by the 6th plot in Fig. 2-c and the 1st plot in Fig. 2-d. The color scale from blue to red denotes increasing energy. P3a was also estimated by each method, as shown by the 8th plot in Fig. 2-c and the 4th plot in Fig. 2-d. Visual inspection implies that the components estimated by NTF were more evident. For NTF, NMF, and the raw data, the averaged SARs of P3a of normal children vs. children with RD were respectively 19.5 dB vs. 25.7 dB, 19 dB vs. 22.8 dB, and 1.7 dB vs. 2.6 dB. The difference between the two groups of children was not significant in the raw data [F(1,80)=0.250, p=0.618], but was significant in the results by NMF [F(1,80)=4.067, p=0.045], and more significant by NTF [F(1,80)=7.526, p=0.008]. For NTF, NMF and the raw data, the averaged SARs of MMN of normal children vs. children with RD were respectively 17.2 dB vs. 20.3 dB, 19.8 dB vs. 20.1 dB, and 0.7 dB vs. 2.2 dB. The difference of the SAR of MMN between the two groups of children was not evident in the raw data [F(1,80)=2.187, p=0.143], and was not evident by NMF either [F(1,80)=0.01, p=0.992], but was almost evident by NTF [F(1,80)=3.512, p=0.065].
4 Discussion
Both the standard NMF and the row-wise unfolding NTF could extract the time-frequency represented MMN and P3a from the averaged traces. Under either NMF or NTF, P3a of the children with RD was larger than that of the normal children. As P3a may be proportionally modulated by the attention, this means the children with RD might have paid more attention to the stimuli than the normal children did. As illustrated in [13], the reason that children with RD paid more attention to the stimuli might be that they hated reading the subtitles of the video, so the deviant drew their attention. In theory, MMN could be modulated by the attention, and the MMN energy is proportional to the degree of attention paid to the stimuli [10]. Thus, the MMN of children with RD would be enhanced by the attracted attention. Under NTF, the MMN of children with RD was almost significantly larger than that of the normal children, which was not observed under NMF. This means that the coupling of MMN and P3a was only revealed by NTF. From this point of view, the row-wise unfolding NTF outperforms the standard NMF, though both of them belong to two-dimensional decomposition. The difference comes from the additional constraints added by the row-wise unfolding NTF to each slice to form the augmented decomposition [6].
MMN has been extensively used in cognitive studies, clinical neuroscience, and neuropharmacology [15]. NMF and NTF have been successfully used in the study of biomedical brain signals [3, 7, 8]. So, it would be very interesting to investigate the application of NMF and NTF in the MMN stream of research. Our previous contribution [4] and this presentation just attempt to discuss the feasibility of studying MMN and P3a components with the standard NMF and row-wise unfolding NTF algorithms. Surprisingly, both such basic algorithms under two-dimensional nonnegative decomposition could extract the MMN and P3a components; moreover, the coupling feature of MMN and P3a has been revealed by the simple NTF method. In fact, both NMF and NTF have better algorithms [3, 16], and it will be necessary and promising to study which algorithms would be better for studying MMN in theory and in practice, i.e., better in revealing the psychology knowledge of MMN which cannot be observed through ordinary data processing. This will be significant in the clinical study of MMN with NMF or NTF.
Acknowledgments. Cong and Kalyakin gratefully thank COMAS, a postgraduate school in computing and mathematical sciences at the University of Jyväskylä, Finland, for supporting this study; Cong particularly thanks Mr. Zhilin Zhang (University of California, San Diego) for discussion and language proofreading; Cong also thanks the international mobility grants (Spring-2009) of the University of Jyväskylä.
References 1. Lee, D.D., Seung, S.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999) 2. Cichocki, A., Zdunek, R., Amari, S.: Nonnegative Matrix and Tensor Factorization. IEEE Signal Proc. Mag. 25(1), 142–145 (2008) 3. Cichocki, A., Zdunek, R., Phan, A., Amari, S.: Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. John Wiley & Sons Inc., Chichester (2009) 4. Cong, F., Zhang, Z., Kalyakin, I., Huttunen-Scott, T., Lyytinen, H., Ristaniemi, T.: Nonnegative Matrix Factorization Vs. FastICA on Mismatch Negativity of Children. In: International Joint Conference on Neural Networks, pp. 586–590. IEEE Press, Atlanta (2009) 5. Hyvärinen, A., Karhunen, J., Oja, E.: Independent component analysis. John Wiley & Sons Inc., Chichester (2001) 6. Cichocki, A., Zdunek, R.: Guidebook of NMFLAB for signal processing (2006), http://www.bsp.brain.riken.jp/ICALAB/nmflab.html 7. Mørup, M., Hansen, L.K., Arnfred, S.M.: ERPWAVELAB: A toolbox for multi-channel analysis of time–frequency transformed event related potentials. J. Neurosci. Meth. 161(2), 361–368 (2007) 8. Lee, H., Cichocki, A., Choi, S.: Kernel nonnegative matrix factorization for spectral EEG feature extraction. Neurocomputing 72(13-15), 3182–3190 (2009) 9. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. John Wiley, Chichester (2002) 10. Näätänen, R.: Attention and brain function. Lawrence Erlbaum Associates Publishers, Hillsdale (1992) 11. Escera, C., Alho, K., Schröger, E., Winkler, I.: Iinvoluntary attention and distractibility as evaluated with event-related brain potentials. Audiol. Neuro.-Otol. 5, 151–166 (2000)
12. Cichocki, A., Zdunek, R.: Multilayer Nonnegative Matrix Factorization. Electron. Lett. 42(16), 947–948 (2006) 13. Huttunen, T., Halonen, A., Kaartinen, J., Lyytinen, H.: Does mismatch negativity show differences in reading disabled children as compared to normal children and children with attention deficit? Dev. Neuropsychol. 31(3), 453–470 (2007) 14. Kalyakin, I., González, N., Joutsensalo, J., Huttunen, T., Kaartinen, J., Lyytinen, H.: Optimal digital filtering versus difference waves on the mismatch negativity in an uninterrupted sound paradigm. Dev. Neuropsychol. 31(3), 429–452 (2007) 15. Garrido, M.I., Kilner, J.M., Stephan, K.E., Friston, K.J.: The mismatch negativity: A review of underlying mechanisms. Clin. Neurophysiol. 120(3), 453–463 (2009) 16. Hoyer, P.: Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res. 5, 1457–1469 (2004)
The Coherence Changes in the Depressed Patients in Response to Different Facial Expressions Wenqi Mao1, Yingjie Li1,*, Yingying Tang2, Hui Li3, and Jijun Wang3 1
School of Communication and Information Engineering, Shanghai University, P.O. Box 01, 200072, China Tel.: +86 21 56334214; Fax: +86 21 56334214
[email protected] 2 Department of Biomedical Engineering, Shanghai Jiao Tong University, 200240, China 3 Department of EEG Source Imaging, Shanghai Mental Health Center, 200030, China
Abstract. To characterize the changes of information transfer between different brain regions during facial expressions processing between the depressed patients and the normal subjects, we applied partial-directed coherence analysis (PDC). Participants were 16 depressed patients and 26 normal subjects, age-matched between groups. An emotion recognition task with different facial expressions (positive and negative) was utilized as stimuli. Lower frontal output PDC values in the alpha band reflected the poor frontal cortex’s regulation of parieto-occipital regions in depressed patients, while the enhanced outflow from the posterior regions to the frontal regions could be taken as an indicator that the depressed group attempted to achieve the normal performance. These topographic patterns of electrical coupling might indicate the changing functional cooperation between the brain areas in depressed patients. The depressed patients may have abnormal brain areas comprising bilateral frontal, right temporal, parietal and occipital regions. Keywords: EEG, emotional expressions, depression, partial-directed coherence.
1 Introduction
As accurate perception of facial emotion expression is considered crucial in everyday social life, more and more researchers have recently become interested in studying the emotional experience of people suffering from depression. Many studies have examined the relationship between emotion or emotion-related constructs and asymmetries in electroencephalographic (EEG) activity [1]. A growing body of evidence strongly suggests that the right and left cerebral hemispheres are differentially involved in the regulation and processing of emotion [2]. Neuroimaging studies have found functional connectivity in a neural network including the prefrontal cortex, amygdala,
hippocampus, anterior cingulate gyrus, superior temporal gyrus, insula and the occipito-temporal cortex [3-6]. Abnormalities in the frontal lobe and limbic structures are also reported in depressed patients [7]. Though a large amount of literature has studied the affective disorder of depressed patients using various approaches, few studies could reflect the information flow in the neural network during emotion identification in depressed patients. In this study, partial-directed coherence (PDC) was used to evaluate the changes of directed coherences between channels in depressed subjects as compared to those in the normal ones. PDC analysis can reflect whether and how the coherences between different neural regions change, rather than the changes in a specific region, by measuring the degree of the directional dependencies of cortical activities [8]. The alpha oscillations are regarded as reflecting activity of multifunctional neuronal networks, differentially associated with sensory, cognitive and affective processing [9]. Evidence suggests that activity within the alpha range (typically 8-13 Hz) may be inversely related to underlying cortical processing, since decreases in alpha tend to be observed when the underlying cortical systems engage in active processing [1,2]. Considering this aspect, our PDC analysis was confined to the alpha band (8-13 Hz) to investigate cognitive processing of emotion in depressed patients. We focused on the changes of directed connectivity of the cortical network during facial expression processing in the depressed patients as compared to the normal subjects.
2 Materials and Methods
2.1 Subjects
Sixteen depressed outpatients (ten female and six male) and twenty-six age-matched normal subjects (fifteen female and eleven male) participated in this experiment. There were no significant age differences between the two groups. The depressed group was recruited from the Shanghai Mental Health Center. All depressed subjects fulfilled the CCMD-3 (Chinese Classification of Mental Disorders, Version 3) diagnosis criteria and had either never taken medication or had not taken medication in the past two weeks. The normal subjects had no personal neurological or psychiatric history, no drug or alcohol abuse, no current medication and normal or corrected-to-normal vision. Before the experiments, all the participants signed an informed consent according to the guidelines of the Human Research Ethics Committee at SMHC and participated in an interview in which the HAMD (Hamilton Rating Scale for Depression), SAS (Self-rating Anxiety Scale) and SDS (Self-rating Depression Scale) were rated. The questionnaire scores of the normal group were in the normal range, which showed they had no emotional disorder, and those of the depressed group showed they had mild or major depression (see Table 1). They were paid after the experiment.
2.2 Materials and Procedure
The stimuli consisted of 24 photographs of Chinese faces (twelve female and twelve male) drawn from a standardized set, the CAFPS (Chinese Affective Face Picture System).
The facial stimuli were two basic expressions (happiness and sadness), which were considered as the positive and negative expressions in this article. Each face, with no hair, glasses, beard or other facial accessories, was processed in Adobe Photoshop to achieve the same illumination. The experiment had two blocks and each trial had two facial stimuli (S1, S2), so each block included 24 (faces) × 2 (repeated) × 2 (half matched and half unmatched) = 96 trials. Since each face was presented twice and there were two blocks, 192 stimuli were presented in total. Subjects sat in front of a 17-inch LCD screen at a distance of about 80 cm and were confronted sequentially with the facial stimuli (200×216 pixels). The temporal sequence of events within a trial was as follows. Each trial began with a fixation cross appearing at the center of the screen for 1.5 s. The first presentation of a face lasted for 1 s and was followed by an ISI (interstimulus interval) of 500 ms. After this interval, the second facial stimulus for recognition appeared for 2 s, and the subjects needed to judge, using a response box, whether the presented stimulus matched the first one or not. The next trial began after a 1.5 s ITI (intertrial interval) (see Fig. 1). During the 2 s presence of the second stimulus, subjects pressed one button with the left hand if they judged the stimuli as identical and another button with the right hand if not.

Table 1. Demographic and affective characteristics of depressed patients and control subjects

                              Age (yrs)        Scores of questionnaires
                                               HAMD              SDS               SAS
  Depression (n=16)           32.56±4.07       27.94±7.61        0.69±0.11         60.12±10.04
  Control (n=26)              36.96±9.18       1.77±1.75         0.35±0.07         30.46±6.15
  Statistical significance    t(1,40)=1.23     t(1,40)=-3.53     t(1,40)=-1.31     t(1,40)=-10.65
                              P=0.23>0.05      P=0.00<0.05*      P=0.00<0.05*      P=0.00<0.05*

* Means the difference is statistically significant (the significant value P<0.05).
2.3 EEG Data Acquisition and Preprocessing
The electroencephalogram (EEG) was recorded from 64-channel surface electrodes mounted in an elastic cap (nose tip as reference, impedance < 10 kΩ, 0.05-100 Hz band pass, 1000 samples/s). Vertical and horizontal EOGs were simultaneously recorded to monitor eye movements and blinks. The raw EEG data were preprocessed offline using Vision Analyzer 1.1 (Brain Products, Germany). Individual EEG recordings were scanned visually for artifacts (ocular artifacts, saccades and amplitudes > ±100 μV). Only correctly answered and artifact-free trials were included in the subsequent analysis. The EEG was segmented from 200 ms pre-stimulus to 1000 ms post-stimulus onset. This study focused on the following electrodes, which covered nearly the whole cortex: Fp1, Fp2, F3, F4, F7 and F8 in the prefrontal and frontal cortex, T7 and T8 in the temporal cortex, C3 and C4 in the central cortex, P3 and P4 in the parietal cortex, and O1 and O2 in the occipital cortex.
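The sketch below illustrates the kind of epoching and amplitude-based trial rejection described in this subsection for a continuous multichannel recording. The array layout, 1000 Hz rate and ±100 μV criterion follow the text, while the function name and the simple implementation are assumptions for illustration only.

```python
import numpy as np

def epoch_and_reject(eeg, events, fs=1000, pre=0.2, post=1.0, amp_crit=100.0):
    """eeg: (n_channels, n_samples) in microvolts; events: stimulus-onset sample indices."""
    n_pre, n_post = int(pre * fs), int(post * fs)
    epochs = []
    for onset in events:
        seg = eeg[:, onset - n_pre: onset + n_post]
        if seg.shape[1] != n_pre + n_post:       # skip events too close to the recording edges
            continue
        if np.abs(seg).max() <= amp_crit:        # keep only artifact-free trials
            epochs.append(seg)
    return np.stack(epochs) if epochs else np.empty((0, eeg.shape[0], n_pre + n_post))
```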
Fig. 1. Sequence of events in a typical trial from the task
2.4 PDC Analysis
PDC analysis was applied to non-averaged EEG data within the 400-600 ms post-stimulus time window (as previous work [10] found a significant affective difference in this period). The PDC, proposed by Baccalá and Sameshima in 2001 [11], is defined in terms of MVAR (multivariate autoregressive) coefficients transformed to the frequency domain. In the framework of the MVAR model, an m-channel process can be represented as a vector X of m EEG signals recorded at each time instance: X(t) = [x_1(t), x_2(t), ..., x_m(t)]^T. Then the MVAR model can be expressed as (1):

X(m) = Σ_{r=1}^{p} A_r X(m − r) + W(m)    (1)
The matrices A_r are the model coefficients up to order p, and W(m) is the vector of white noise values. In this work, the unbiased Nuttall-Strand method was used to estimate the MVAR parameters [12], while the order p was set by the Akaike information criterion [13]. A_r is transformed into the frequency domain, which yields
A(f) = I − Σ_{r=1}^{p} A_r e^{−j2πfr}

for each frequency f. Then, the PDC value from the i-th channel to the j-th one at frequency f can be obtained according to (2):

PDC_{i→j}(f) = |A_{ij}(f)| / sqrt( A_{kj}^H(f) A_{kj}(f) )    (2)
where A_{ij}(f) is the (i, j)-th element and A_{kj}(f) is the j-th column of A(f). These PDC values provide information about the information flow for each electrode pair and each frequency (the spectral frequency resolution is 1 Hz). In order to analyze the PDC in the alpha frequency band, we averaged the PDC values over the frequency band of 8-13 Hz [14]. The implementation of all computational steps is available online from the open source project BIOSIG under http://biosig.sourceforge.net [15].
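For illustration, the sketch below computes PDC values from already-estimated MVAR coefficient matrices and averages them over the alpha band, mirroring Eq. (2) and the 8-13 Hz averaging described above. It assumes the A_r matrices have been fitted beforehand (e.g., by an MVAR routine such as the one in BIOSIG) and uses plain NumPy rather than the authors' toolbox.

```python
import numpy as np

def pdc(A_r, freqs, fs):
    """A_r: (p, m, m) MVAR coefficient matrices; returns PDC of shape (n_freqs, m, m)."""
    p, m, _ = A_r.shape
    out = np.empty((len(freqs), m, m))
    for idx, f in enumerate(freqs):
        # A(f) = I - sum_r A_r * exp(-j*2*pi*(f/fs)*r), using normalized frequency f/fs
        phase = np.exp(-2j * np.pi * (f / fs) * np.arange(1, p + 1))
        Af = np.eye(m) - np.tensordot(phase, A_r, axes=(0, 0))
        denom = np.sqrt(np.sum(np.abs(Af) ** 2, axis=0))     # column norms of A(f)
        out[idx] = np.abs(Af) / denom[None, :]
    return out

# alpha-band PDC, averaged over 8-13 Hz at 1 Hz resolution as in the text:
# freqs = np.arange(8, 14)
# alpha_pdc = pdc(A_r, freqs, fs=1000).mean(axis=0)
```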
3 Results
3.1 Behavioral Results
We compared the response accuracy (ACC) and reaction time (RT) of the two groups (see Fig. 2). Only correct answers were included in the RT analysis. There was no significant difference in the ACC (F(1,40)=1.873, P>0.05) or the RT between the two groups (F(1,40)=1.346, P>0.05), but the within-subject factor Expression showed a significant main effect on the accuracy (F(1,40)=4.938, P<0.05). The result showed that the accuracy of recognizing the positive face was lower than that of the negative one. There was no significant effect of expression on the RT (F(1,40)=1.437, P>0.05).
3.2 PDC Results
For the investigated alpha-frequency band (8-13Hz), the calculation of PDC values showing directed coherences between the selected channels revealed statistical and topographical differences between depressed patients and normal controls within the 400-600 ms time-window (400–600 ms after stimulus onset). The factor Group showed a significant main effect on the PDC values (see fig.2). In particular, the network of coherence increases in the depressed group (relative to the normal group) converges towards right prefrontal position (Fp2) and left frontal position (F7) (Fig.2A). The frontal position almost exclusively received stronger information input, while the information output was in the parietal region. Smaller PDC values appeared in intra- and inter-hemispheric couplings in the depressed group relative to the normal one. The significant decreased information input turned out to be located at the right parietal region (P4) which derived from most frontal regions and the occipital region (Oz) which derived from most electrodes in left hemisphere (Fig. 2B).
Fig. 2. The changes of PDC values for the depressed group relative to that for the normal group within 600-800 ms time-window. (A) Coherence increased in the depressed group significantly (as compared to the normal group). (B) Coherence decreased significantly in the depressed group (as compared to the normal group).
The Coherence Changes in the Depressed Patients
397
In addition, a significant interaction of Group and Expression was also found at some sites. The left temporal, central and bilateral frontal positions in the depressed group had stronger information flows in response to the positive face than to the negative one (see Table 2). In response to the negative face, the depressed group had significantly decreased information input, concentrated on the right parietal position (P4), as compared to the normal group (see Table 3).

Table 2. The simple effect analysis of interaction between the group factor and emotion factor on PDC values (Positive vs. Negative)

  Emotion on control group (Positive > Negative): T7→T8: F(1,40)=8.33, p=0.006
  Emotion on depressed group (Positive > Negative): T7→F3: F(1,40)=6.34, p=0.016; C3→F3: F(1,40)=9.76, p=0.003; Fp2→T8: F(1,40)=7.48, p=0.009; C3→P4: F(1,40)=4.40, p=0.042
  (Positive < Negative): O2→Fp1: F(1,40)=8.18, p=0.007; T7→T8: F(1,40)=10.92, p=0.002
Table 3. The simple effect analysis of interaction between the group factor and emotion factor on PDC values (Depression vs. Normal)

  Group on positive emotion (Depression > Control): C3→F3: F(1,40)=10.50, p=0.002
  Group on positive emotion (Depression < Control): T7→T8: F(1,40)=4.98, p=0.031
  Group on negative emotion (Depression > Control): O2→Fp1: F(1,40)=5.46, p=0.025; T7→T8: F(1,40)=8.33, p=0.006
  Group on negative emotion (Depression < Control): T8→P4: F(1,40)=12.91, p=0.001; C3→P4: F(1,40)=14.04, p=0.001
4 Discussion
In this study, we evaluated the ability to recognize two basic expressions in depressed patients using PDC values in the alpha band during the 400-600 ms after stimulus onset. The changes of PDC values reflected several patterns of directed coherences between electrode pairs in different regions. Different coupling patterns, in the form of coherence increases or decreases, might reflect relatively increased or decreased cooperation between different functional brain areas. Enhanced information output from the posterior regions towards the frontal regions was observed in the depressed group relative to the normal one. According to anatomical physiology, the electrode O1 is near the primary visual cortex and the electrodes P3 and P4 are near Brodmann areas 18 and 19, which comprise the extrastriate (or peristriate) cortex [16]. The extrastriate cortex is
a visual association area, related to feature extraction, shape recognition, and attentional and multimodal integration functions, and known to be engaged in the analysis of emotional stimuli including facial expressions [17]. The enhanced frontal information input from parieto-occipital regions suggested that, compared to the normal subjects, the depressed patients might need a greater degree of information input to complete the emotion recognition task. On the other hand, decreased information transfer from the anterior and left temporal positions to the right parietal and occipital regions was found. As the bilateral anterior regions were engaged in emotion cognition, this weaker output from the frontal positions might reflect the dysfunction of the frontal lobe in depressed patients, which induced the poor frontal cortex's regulation of parieto-occipital regions. The enhanced outflow from the posterior regions to the frontal regions could be taken as an indicator that the depressed group attempted to achieve normal performance in recognizing the expression. Other investigations found the right hemisphere to be dominant in emotion processing [18], and the lower inflow concentrated in the right hemisphere in the depressed group might imply their deficits of emotion recognition. Based on the observed significant interaction of Group and Expression, increased left frontal and right parieto-temporal input was found in the depressed group when the presented stimuli were positive. This phenomenon might implicate the deficits of depressed patients in dealing with the positive expression, so they needed more attention. The right parietal region was a major recipient of strong temporal and central projections in the normal group in response to the negative stimuli, while the depressed patients did not present this inflow. These results suggested that the processing of negative expression might change in depressed patients. Our findings showed that temporal, parietal and frontal brain regions function as crucial parts in expression recognition, which is in line with several neuroimaging investigations [3, 18]. Finally, the differences in the functional-dynamic network between depressed and normal subjects suggested changes of functional cortex in depressed patients.
Acknowledgments. This work was supported by the National Natural Science Foundation of China (No.60871090) and the Hi-tech Research and Development Program of China (No.2008AA02Z412). We thank Haijiao Lv, Ling Wei and Jiping Ye for helping process the data in this work.
References 1. Coan, J.A., Allen, J.J.: Frontal EEG asymmetry as a moderator and mediator of emotion. Biol. Psychol. 67, 7–49 (2004) 2. Debener, S., Beauducel, A., Nessler, D., Brocke, B., Heilemann, H., Kayser, J.: Is resting anterior EEG alpha asymmetry a trait marker for depression? Findings for healthy adults and clinically depressed patients. Neuropsychobiology 41, 31–37 (2000) 3. Phillips, M.L., Drevets, W.C., Rauch, S.L., Lane, R.: Neurobiology of emotion perception I: the neural basis of normal emotion perception. Biol. Psychiatry. 54, 504–514 (2003) 4. Cullen, K.R., Gee, D.G., Klimes-Dougan, B., Gabbay, V., Hulvershorn, L., Mueller, B.A., et al.: A preliminary study of functional connectivity in comorbid adolescent depression. Neurosci. Lett. 460, 227–231 (2009)
5. Bermpohl, F., Walter, M., Sajonz, B., Lücke, C., Hägele, C., Sterzer, P., et al.: Attentional modulation of emotional stimulus processing in patients with major depression-Alterations in prefrontal cortical regions. Neurosci. Lett. 463, 108–113 (2009) 6. Terry, J., Lopez-Larson, M., Frazier, J.A.: Magnetic Resonance Imaging Studies in Early Onset Bipolar Disorder: An Updated Review. Child Adolesc. Psychiatr. Clin. N. Am. 18, 421–439 (2009) 7. Almeida, J.R., Versace, A., Mechelli, A., Hassel, S., Quevedo, K., Kupfer, D.J., et al.: Abnormal Amygdala-Prefrontal Effective Connectivity to Happy Faces Differentiates Bipolar from Major Depression. Biol. Psychiatry 66, 451–459 (2009) 8. Baccalá, L.A., Sameshima, K.: Overcoming the limitations of correlation analysis for many simultaneously processed neural structures. Prog. Brain Res. 130, 33–47 (2001) 9. Klimesch, W.: EEG alpha and theta oscillations reflect cognitive and memory performance: a review and analysis. Brain Res. Rev. 29, 169–195 (1999) 10. Ling, W., Li, Y.J., Ye, J.P., Yang, X.L., Wang, J.J.: Emotion-induced Higher Wavelet Entropy in the EEG with Depression during a Cognitive Task. In: 31st Annual International Conference of the IEEE EMBS, Minneapolis, pp. 5018–5021. IEEE Press, Minnesota (2009) 11. Baccalá, L.A., Sameshima, K.: Partial directed coherence: a new concept in neural structure determination. Biol. Cybern. 84, 463–474 (2001) 12. BioSig – An application of Octave, http://www.biosig.sf.net 13. Kamiński, M.J., Blinowska, K.J.: A new method of the description of the information flow in the brain structures. Biol. Cybern. 65, 203–210 (1991) 14. Sun, Y., Li, Y.J., Chen, X., Tong, S.: Electroencephalographic differences between depressed and control subjects: An aspect of interdependence analysis. Brain Res. Bull. 15, 559–564 (2008) 15. The BIOSIG project, http://www.biosig.sourceforge.net 16. Ge, J.G., Jin, B., Ge, Y.Z., Guo, H.X.: Study of stereoscopic depth-cognition process. Journal of Zhejiang University (Engineering Science) 29, 169–174 (2005) 17. Kesler/West, M.L., Andersen, A.H., Smith, C.D., Avison, M.J., Davis, C.E., Kryscio, R.J., Blonder, L.X.: Neural substrates of facial emotion processing using fMRI. Brain Res. Cogn. 11, 213–226 (2001) 18. Derix, M.M.A., Jolles, J.: Neuropsychological abnormalities in depression: relation between brain and behaviour. In: Honig, A., van Praag, H.M. (eds.) Depression: neurobiological, psychopathological and therapeutic advances, pp. 109–126. Cambridge University, London (1997)
Estimation of Event Related Potentials Using Wavelet Denoising Based Method Ling Zou1,2,3, Cailin Tao1, Xiaoming Zhang1, and Renlai Zhou2,3 1
School of Information Science & Engineering, Jiangsu Polytechnic University, Changzhou 213164, China 2 State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Beijing 100875, China 3 Beijing Key Lab of Applied Experimental Psychology, Beijing 100875, China {Ling.Zou,Cailin.Tao,Xiaoming.Zhang}@yahoo.com.cn, {Renlai.Zhou,zoulingme}@yahoo.com.cn
Abstract. In this paper a method based on wavelet denoising (WD) is presented to estimate event related potentials (ERPs). Firstly, the empirical mode decomposition (EMD) is used to decompose the noisy signals and the intrinsic mode functions (IMFs) are produced; secondly, the IMFs are filtered by WD method with improved semi-soft threshold function and the new IMFs are selected according to the energy distribution; lastly, the ERPs are reconstructed from the selected IMFs. Computer simulation results and the experimental results demonstrate the effectiveness of the WD-based approach in ERPs estimation, which keeps the main waveform characteristics. Keywords: Event related potentials, Empirical mode decomposition, Wavelet denoising, Estimation.
1 Introduction
In recent years, event related potentials (ERPs) analysis has become very useful for neuropsychological studies and clinical procedures. The most common way to visualize ERPs is to take an average over time locked single-trial measurements. But this method ignores the variation from trial to trial. Thus, the goal in the analysis of ERPs is the estimation of the single potentials that we call single-trial extraction. Several techniques have been proposed to improve the visualization of the ERPs from the strong background EEG [1-5]. Among these techniques, the wavelet transform (WT) method is especially promising for its optimal resolution both in the time and in the frequency domain. WT is an efficient tool for multiresolution analysis of non-stationary and fast transient signals. These properties make it especially suitable to the study of neurophysiologic signals. Numerous WT applications in biosignal analysis have been proposed, including for the attempt of single-trial-ERP analysis [1-2]. Empirical mode decomposition (EMD) is a promising one for non-linear time series [5-7]. EMD works like an adaptive high pass filter. It sifts out the fastest changing component of a composite signal first.
In this paper, the wavelet denoising (WD) method with improved semi-soft threshold function is proposed to overcome the shortcomings of the conventional threshold functions [8]. The noisy signals are decomposed by EMD and the intrinsic mode functions (IMFs) are produced; then the IMFs are filtered by the WD method and the new IMFs are selected according to the energy distribution; the ERPs are reconstructed from the selected IMFs finally. Computer simulation results show that the WD-based method provides estimations with higher SNR and lower root mean square error (RMSE) than wavelet transform alone. We also apply this proposed approach to estimate the P300 and visual memory potentials (VMPs) in a study to examine EEG correlates of genetic predisposition to alcoholism, the experimental results demonstrate the effectiveness of the WD-based approach in ERPs estimation, which keeps the main waveform characteristics.
2 Methods
Multiple trials of observed ERPs can be modeled as

x_i(n) = s(n) + v_i(n) + u_i(n),    i = 1, 2, ..., L;  0 ≤ n ≤ N − 1    (1)
where s(n) are the ERP components. The background neural activity is simulated as a mixture of colored noise v_i(n) and Gaussian noise u_i(n), which varies over trials. The objective is to estimate the ERP signal s(n) from the given L trials. Firstly, EMD decomposition is applied to the i-th noisy mixture observation from (1); the sifting process is as follows [6]:
1. construct the upper envelope and the lower envelope by connecting all local extrema with cubic spline lines;
2. compute the mean of the two envelopes and subtract it from the original data to get their difference d[n];
3. repeat the above process for d[n] until the resulting signal, c1[n], the first IMF, satisfies the criteria of an intrinsic mode function;
4. treat the residue r1[n] (r1[n] = x[n] − c1[n]) as new data subject to the sifting process described above, yielding the second IMF from r1[n];
5. continue the procedure to obtain the other IMFs until the standard deviation between two consecutive sifting results is between 0.2 and 0.3.
After this process, the original signal x(t) is written as the sum of the IMFs and the final residual. Secondly, a suitable wavelet basis and decomposition scale are selected and these IMF components are filtered by wavelet thresholding. Considering the IMFs obtained from the EMD decomposition, W and W^{-1} are defined as the forward and inverse discrete wavelet transform (DWT) operators. WD can be performed by the following process. Consider the i-th noisy mixture observation from (1):
w_i = W(IMF_i)    (2)

ŵ_i = T(w_i, λ)    (3)
ĉ_i = W^{−1}(ŵ_i)    (4)

where w_i is the wavelet coefficient vector, T(w_i, λ) is the thresholding operator
with threshold λ, ŵ_i is the vector of wavelet coefficients after thresholding, and ĉ_i is the denoised signal. The widely used soft-thresholding function is continuous but has a discontinuous derivative. However, a continuous derivative or higher order derivatives are often desired for optimization problems. A new class of nonlinear soft-like thresholding functions with continuous derivatives, motivated by the differentiable Sigmoid function which replaces the non-differentiable hard-limiting function in traditional networks, is introduced here [8]. The major energy of the evoked potential signal in this paper is at 300 ms, which is the basic principle for the IMF selection. Statistical studies show that the signal energy of P300 in the experimental data is mainly concentrated in IMF4. Finally, the signal is reconstructed from the selected IMFs denoised by WD and the final residual.
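A condensed sketch of the whole pipeline (EMD, per-IMF wavelet thresholding, IMF selection, reconstruction) is given below. It is an assumed illustration only: it uses the PyEMD and PyWavelets packages, plain soft thresholding with a universal threshold instead of the improved semi-soft threshold function of [8], a simple energy-based IMF selection, and an assumed window of interest around 300 ms.

```python
import numpy as np
import pywt
from PyEMD import EMD

def wd_denoise_imf(imf, wavelet="db4", level=4):
    coeffs = pywt.wavedec(imf, wavelet, level=level)                 # w_i = W(IMF_i)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745                   # noise scale from finest level
    lam = sigma * np.sqrt(2.0 * np.log(len(imf)))                    # universal threshold
    coeffs = [coeffs[0]] + [pywt.threshold(c, lam, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(imf)]                 # c_i = W^-1(w_i after T)

def wd_based_erp(x, fs=2016.0, keep=(0.25, 0.35)):
    """Denoise one trial x; keep IMFs whose energy concentrates in the window of interest (s)."""
    imfs = EMD().emd(x)
    denoised = [wd_denoise_imf(imf) for imf in imfs[:-1]]            # last row treated as residual
    t = np.arange(len(x)) / fs
    win = (t >= keep[0]) & (t <= keep[1])
    selected = [d for d in denoised if (d[win] ** 2).mean() > (d[~win] ** 2).mean()]
    return np.sum(selected, axis=0) + imfs[-1] if selected else imfs[-1]
```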
3 Data 3.1 Simulation Data In the present study, the simulated model for s[n] [9] is used with sampling frequency is 2016Hz (N= 2016), as shown in Fig. 1(a). The spontaneous EEG is simulated with adaptive AR model by (5) [10].
u(n) − 1.5084·u(n−1) + 0.1587·u(n−2) + 0.3109·u(n−3) + 0.0510·u(n−4) = v(n)    (5)
where u(n) is the Gaussian noise signal. We generate 15 trials of simulated data. Each trial contains 2016 data samples. The observed ERPs are composed of the simulated ERP signal s(n) and background neural activity including the Gaussian noise z(n) and the colored noise v(n). The SNR and the root mean squared error (RMSE) between the true ERP s(n) and the estimated ERP ŝ(n) are employed to assess the performance of the different approaches. The RMSE is calculated using (6),
RMSE = { (1/N) Σ_{n=1}^{N} ( s(n) − ŝ(n) )² }^{1/2}    (6)
And the SNR is calculated using (7),

SNR = 10 log₁₀ { Σ_{n=1}^{N} s(n)² / Σ_{n=1}^{N} ( s(n) − ŝ(n) )² }    (7)
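For reference, the short sketch below generates the colored background activity with the AR filter of Eq. (5) and evaluates an estimate with the RMSE and SNR of Eqs. (6)-(7). The white-noise variance and any name not given in the text are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def colored_noise(n_samples=2016, sigma=1.0, seed=0):
    """Colored noise from Eq. (5): AR filter driven by white Gaussian noise."""
    rng = np.random.default_rng(seed)
    a = [1.0, -1.5084, 0.1587, 0.3109, 0.0510]   # AR coefficients of Eq. (5)
    return lfilter([1.0], a, sigma * rng.standard_normal(n_samples))

def rmse(s, s_hat):
    return np.sqrt(np.mean((s - s_hat) ** 2))    # Eq. (6)

def snr_db(s, s_hat):
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum((s - s_hat) ** 2))   # Eq. (7)
```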
3.2 Experimental Data This experiment is to examine EEG correlates of genetic predisposition to alcoholism [11]. The data is recorded from 64 electrodes and digitized at a rate of 256 Hz. For each trial, 1s of data is saved on a hard disc (form 0.19s pre- to 0.81s poststimulation). The reference electrode is CZ. There are two groups of subjects: alcoholic and control. Each subject is exposed to either a single stimulus (S1) or to two stimuli (S1 and S2) which are pictures of objects chosen from the 1980 Snodgrass and Vanderwart picture set. When two stimuli are shown, they are presented in either a matched condition where S1 is identical to S2 or in a non-matched condition where S1 differed from S2. As the visual evoked potential signals (VEPs) at P8 electrode is most obvious, the VEPs anomalies under S1 (a single object shown), S2_match (object 2 shown in a matching condition) and S3_nomatch (object 2 shown in a non matching condition) stimulation are compared between the alcoholic and the normal objects at P8 electrode. 40 trials of experimental data are used here.
4 Results 4.1 Simulation Results We design a single channel with 15 trials; all the trials contain the same ERP. Statistical analysis of the 3 groups data show that the latency (F(2,14)=0.371, P=0.697>0.05) and amplitude (F(2,14)=0.397, P=0.681>0.05) of N100, the latency (F(2,14)=0.382, P=0.691>0.05) and amplitude (F(2,14)=0.044, P=0.957>0.05) of P300 are no significant difference. Table 1 compares the WT, and the WD-based methods in terms of the mean and standard deviation (SD) of the RMSEs and the SNRs of the estimated ERP on all trials. It is obvious that the WD based approach obtained the most accurate estimates than the WT approach. Fig. 1(b) corresponds to a single trial of the noisy raw signal. Fig. 1(c)-(d) shows the estimated ERP on the single trial by using the WD-based and WT approaches respectively. The SNR of the noisy raw signal is -12.2757 db while the SNR in the estimated ERP is increased to 2.2051db with the WT method, 4.0385db with the WD-based method. From the figure we can also see that the WDbased method has much more smooth estimates than the WT method. Smoothness is an expected property of the ERPs based on the reliable evoked potential estimation results obtained by ensemble averaging over a large number of trials.
404
L. Zou et al. Table 1. Simulation Results of the different approaches for 15 trials Approach Mean of RMSEs SD of RMSEs Mean of SNRs (dB) SD of SNRs (dB)
WD-based 2.7576 0.5298 4.5388 1.1649
WT 2.9549 0.4949 3.5099 1.4538
Fig. 1. Simulated results for a single trial from 15 trials: (a) simulated ERPs; (b) noisy signal; (c) WD-based denoised signal; (d) WT denoised signal
4.2 Experimental Results The performance by averaging and the WD-based approaches are compared in the present study. 4.2.1 Comparison of P300 by Visual Stimulation between Alcoholic Object and Control Object The P300 features of alcoholic and control objects are analyzed by analysis of variance (ANOVA). The results show that P300 latencies (F (1, 78) = 2.614, P = 0.110> 0.05) of alcoholic and control objects are no significant difference while the amplitude (F (1, 78) = 4.780, P = 0.032 <0.05) and peak values (F (1, 78) = 9.380, P = 0.003 <0.05) have significant differences. Fig.2 shows the results by WD-based method (randomly selected 3 trials, for paper limit) and averaging method (40 trials) between alcoholic object and control object. The evoked potential waveforms of alcoholic object (thick lines) at 300ms decrease significantly compared with the control object (thin lines) and it is consistent with previous research results [11]: the waveform is variable, the latency becomes extended and the amplitude declines correspondingly, but the stability of latency is more than amplitude. Experiments show that the method could be applied to estimate the P300 and the effects are obvious.
Estimation of Event Related Potentials Using Wavelet Denoising Based Method
405
Fig. 2. Sample results for the estimated ERPs of alcoholic (thick line) and control (thin line) subjects in S1 condition
4.2.2 Comparison of the Control Object’ VMPs between S2_match Stimulation and S2_nomatch Stimulation VMP is the performance index of visual memory and occurs at 247ms. The VMPs from 23 trials under S2_match stimulation and 15 trials under S2_nomatch stimulation are estimated. ANOVA shows that VMP latency (F (1, 36) = 0.515, P = 0.478> 0.05) is no significant difference while amplitude (F (1, 36) = 6.640, P = 0.014 <0.05) and peak value (F (1, 36) = 7.742, P = 0.009 <0.05) are significant different.
Fig. 3. Sample results for the estimated ERPs of a control subject in match (thick line) and non match condition (thin line)
406
L. Zou et al.
The comparison results of the control object’ VMPs between S2_match stimulation and S2_nomatch stimulation are shown in Fig.3. We could clearly observe that the VMPs under S2_match stimulation (thick lines) are much lower than S2_nomatch (thin lines) at around 170 to 250ms. The WD-based method has the same effects with the averaging method and the characteristic and difference of the waveform could be identified obviously.
5 Discussion The most common method of analyzing the parameters of ERPs takes an average over time-locked single trial measurements. However, the information of the variability between the single-trials is lost during averaging. Such information could be important in investigating some cognitive processes, and it could also help to identify features of the ERPs. In the present study, a WD-based method is developed to estimate the single-trial ERPs. The performance of the present approach is evaluated by comparing with the WT method and the classical ensemble averaging method alone in both designed computer simulations and experimental ERPs. Computer simulations results show the WD-based approach outperforms the WT method, which provided the higher SNR and lower RMSE than the WT method alone, even if the trial-by-trial evoked potential variations exists. We also apply this proposed approach to estimate the P300 and VMPs in a study to examine EEG correlates of genetic predisposition to alcoholism, the experimental results demonstrate the effectiveness of the WD-based approach in ERPs estimation, which keeps the main waveform characteristics. Acknowledgments. This work was supported by the open project of the State Key Laboratory of Cognitive Neuroscience and Learning and the open project of the Beijing Key Lab of Applied Experimental Psychology at the Beijing Normal University, Qinlan Project of Jiangsu Province.
References 1. Zou, L., Zhou, R.L., Hu, S.Q.: Single Trial Evoked Potentials Study during an Emotional Processing Based on Wavelet Transform. In: Sun, F., Zhang, J., Tan, Y., Cao, J., Yu, W. (eds.) ISNN 2008, Part I. LNCS, vol. 5263, pp. 1–10. Springer, Heidelberg (2008) 2. Zou, L., Ma, Z.H., Chen, S.Y.: Single Trial Evoked Potentials Estimation by Using Wavelet Enhanced Principal Component Analysis Method. In: Yu, W., He, H., Zhang, N. (eds.) ISNN 2009. LNCS, vol. 5553, pp. 638–647. Springer, Heidelberg (2009) 3. Palaniappan, R., Ravi, K.V.R.: Improving visual evoked potential feature classification for person recognition using PCA and normalization. Pattern Recognition Letters 27, 726–733 (2006) 4. Lemm, S., Curio, G., Hlushchuk, Y.: Enhancing the Signal-to-Noise Ratio of ICA-Based Extracted ERPs. IEEE Trans. Biomed. Eng. 53(4), 601–607 (2006) 5. Huang, N.E., Shen, Z., Long, S.R.: The Empirical Mode Decomposition and Hilbert Spectrum for Nonlinear and Nonstationary Time Series Analysis. Proc. R. Soc. London Ser. A 454, 903–995 (1998)
6. Liang, H.L., Bressler, S.L., Desimone, R., Fries, P.: Empirical mode decomposition: a method for analyzing neural data. Neurocomputing 65-66, 801–807 (2005) 7. Flandrin, P., Rilling, G., Goncalves, P.: Empirical mode decomposition as a filter bank. IEEE Signal Processing Letters 11, 112–114 (2004) 8. Zhang, X.P., Desai, M.: Adaptive de-noising based on SURE risk. IEEE Signal Processing Letters 5(10), 265–367 (1998) 9. Masahiko, N.: Waveform Estimation from Noisy Signals with Variable Signal Delay Using Bispectrum Averaging. IEEE Trans. Biomed. Eng. 40(2), 118–127 (1993) 10. Xu, H.D., Chen, J., Wen, H.: The study of event-related evoked potential P300 in chronic alcoholics. Shanghai Archives of Psychiatry 15(3), 140–142 (2005) 11. Swartz Center for Computational Neuroscience, http://sccn.ucsd.edu/eeglab
Adaptive Fit Parameters Tuning with Data Density Changes in Locally Weighted Learning

Han Lei, Xie Kun Qing, and Song Guo Jie

Key Laboratory of Machine Perception (Ministry of Education), Peking University
{hanlei,kunqing}@cis.pku.edu.cn, [email protected]
Abstract. Locally weighted learning (LWL) is a form of lazy learning and focuses on locally weighted regression. Due to its high efficiency and flexibility, the learning mechanism is widely used in prediction. However, LWL fails when the data points are sparse, and few studies have addressed tuning the fit parameters of the local model according to the density of the input data. This paper discusses the relationship between data density and fit parameters from a theoretical view. The relationship we derive also supports adaptive fit parameter selection. Experimental studies provide evidence for the mathematical derivation and show its superiority in the prediction of traffic flow. Keywords: Density; Locally weighted learning; Adaptive; Tuning.
1 Introduction
Locally weighted learning contains several important parts known as local model structures, weighting functions, smoothing parameters, distance functions, etc. Important fit parameters such as the bandwidth h and the number of neighbors K are discussed by Atkeson et al. (1997) [1]. When robustness is emphasized, this learning mechanism is required to adapt to changes in data density and distribution. However, locally weighted learning fails or performs badly when the data density is very low. Low data density makes the local regression powerless because of the barren neighborhood around the query. This difficulty can be overcome by tuning the fit parameters adaptively in prediction. Taking a smoothing or bandwidth parameter h as an example, many adaptive selection methods have been proposed, such as Partially Adaptive Bandwidth [2]. Global Bandwidth Selection (GBS), which provides a single global optimal bandwidth, is widely used due to its simplicity and universality, but collapses with changes in data density and noise. Query-based Bandwidth Selection (QBS) and Point-based Bandwidth Selection (PBS), which associate a bandwidth with each query or data point, allow rapid or asymmetric changes in the behavior of the data, but they use a lot of storage associated with the data points and require expensive calculations in preprocessing as well as updating.
Corresponding Author.
This paper discusses the relationship between data density and fit parameters in the local constant model from a theoretical view. A strict mathematical derivation of this relationship is developed to explain how the relation is detected. The result is a theorem that relates data density to two fit parameters of the local model, the bandwidth h and the number of neighbors K. Experiments are divided into two groups: the first, based on artificial stochastic datasets, verifies the correctness of the theorem; the second applies the theory to the prediction of traffic flow to show its superiority. The remainder of this paper is organized as follows. Section 2 describes the local constant model and its components. Section 3 presents the detailed mathematical derivation. Experimental studies are described in Section 4, and Section 5 concludes.
2 Preliminaries and Problem Statement
The learning problems have a standard regression model:
\[ y = f(x) + \varepsilon \qquad (1) \]
where x denotes the N-dimensional input vector, y the scalar output, and ε a mean-zero noise term. When the regression model is used for prediction of time series, x is a vector of lagged values of y. Local regression is used to approximate the unknown function f(). A constant local model is represented by the following prediction equation:
\[ \hat{y}(q) = \frac{\sum_{i=1}^{K} y_i\, G\!\left(\frac{d(x_i, q)}{h}\right)}{\sum_{i=1}^{K} G\!\left(\frac{d(x_i, q)}{h}\right)} \qquad (2) \]
Here q is the query and d() is a distance function, typically the Euclidean distance \( d(x, q) = \sqrt{\sum_i (x_i - q_i)^2} \). A weighting or kernel function G() is used to calculate a weight for each data point from its distance; a typical weighting function is the Gaussian, G(d) = e^{-d^2}. The bandwidth h defines the scale or range over which generalization is performed, and K denotes the number of nearest neighbors that take part in the local regression.
Based on the theory above, data density should be introduced into the regression with the local constant model. When a given query lies in a sparse region, the barren neighborhood makes the constant model in equation (2) powerless, because the neighbors receive small weights if the bandwidth h is not adjusted accordingly. Note that there is no variable referring to density in the local constant model, so it is difficult to find any adaptive correspondence between fit parameters and data density. The following derivation is aimed at overcoming this problem.
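To make the local constant model concrete, the following is a minimal sketch of equation (2) in Python (the paper's own experiments were implemented in Matlab); the Gaussian weighting function and the function and variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def gaussian_kernel(d):
    # G(d) = exp(-d^2), the weighting function used in the text
    return np.exp(-d ** 2)

def local_constant_predict(X, y, q, h, K):
    """Local constant prediction at query q, equation (2).

    X : (M, N) sample inputs, y : (M,) sample outputs,
    q : (N,) query point, h : bandwidth, K : number of neighbors.
    """
    d = np.linalg.norm(X - q, axis=1)      # Euclidean distances d(x_i, q)
    idx = np.argsort(d)[:K]                # K nearest neighbors of the query
    w = gaussian_kernel(d[idx] / h)        # weights G(d / h)
    return np.sum(w * y[idx]) / np.sum(w)  # weighted local constant estimate
```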
3 Fit Parameters Tuning with Density Changing
We first formalize the data density for the following theory derivation.
3.1 Expression of Local Constant Model with Data Density
Data density, reflecting the denseness or sparseness of data distribution, is represented as:
\[ \rho = \frac{K}{V} \ \text{in 3D} \qquad \text{or} \qquad \rho = \frac{K}{S} \ \text{in 2D} \qquad (3) \]
K is the number of data points in a local circular region which is measured by volume V or area S. Take a planar region as example, S = πr_k^2, where r_k is a radius of the circular region. If we treat this circle as the neighborhood of a query point q with K neighbors and assume that q is the centre of the circular region, the circle with radius r_k can be regarded as a local model. In equation (2), r_k is also the distance between the K-th neighbor and the query; that means the K-th neighbor is located on the circumference of the circle. Under a critical localized limitation (which is a constraint for the selection of the number of neighbors K), the local region is assumed to be symmetrically distributed and the density ρ a constant. Thus, each neighbor i has its own distance r_i to the centre query, and this distance can be represented as:
\[ d(x_i, q) = r_i = \sqrt{\frac{i}{\rho\pi}} \qquad (4) \]
Inserting equation (4) in equation (2), we obtain:
\[ \hat{y}(q) = \frac{\sum_{i=1}^{K} y_i\, G\!\left(\sqrt{\tfrac{i}{\rho\pi h^2}}\right)}{\sum_{i=1}^{K} G\!\left(\sqrt{\tfrac{i}{\rho\pi h^2}}\right)} \qquad (5) \]
3.2 Error Function
The error of estimation is denoted as:
\[ e = \left|\hat{y}(q) - y\right| = \left|\frac{\sum_{i=1}^{K} y_i\, G\!\left(\sqrt{\tfrac{i}{\rho\pi h^2}}\right)}{\sum_{i=1}^{K} G\!\left(\sqrt{\tfrac{i}{\rho\pi h^2}}\right)} - y\right| = \left|\frac{\sum_{i=1}^{K} (y_i - y)\, G\!\left(\sqrt{\tfrac{i}{\rho\pi h^2}}\right)}{\sum_{i=1}^{K} G\!\left(\sqrt{\tfrac{i}{\rho\pi h^2}}\right)}\right| \qquad (6) \]
This error equation implicitly contains the relationship between the number of neighbors K, the bandwidth h and the data density ρ. However, the term y_i − y is a stochastic entry that depends on the actual input values. To facilitate the analysis, the stochastic entry should be changed into non-stochastic variables. In fact, the item y_i − y is determined by the input, which has an explicit and fixed distribution. Thus, in our assumed symmetrical local region, y_i − y is constrained by ρ and oscillates stochastically within a limited range. We introduce a standard data distribution S_0 with a constant density ρ_0 and a query q_0 in it. For any dataset S_t with ρ_t and q_t, there is a relationship between the distances r_0k and r_tk:
\[ \frac{r_{tk}}{r_{0k}} = \frac{\sqrt{K/(\rho_t\pi)}}{\sqrt{K/(\rho_0\pi)}} = \sqrt{\frac{\rho_0}{\rho_t}} \qquad (7) \]
As y is the output of input x, when ŷ is used as a prediction for time series, y is actually the next state of x. Due to the Local Dependence Rule [3], y holds the similar features as x. Thus, output y has the similar distribution as x:
\[ \frac{y_{ti} - y_t}{y_{0i} - y_0} \approx \frac{x_{ti} - q}{x_{0i} - q} \qquad (8) \]
where x_i − q denotes one of the feature distances in multi-dimensional input cases, and it is positively related to the data point distance r_i. We represent this positive correlation as:
\[ x_i - q = T(r_i) \qquad (9) \]
T() has various forms that depend on the features of the input data points (|x_i − q| = r_i when x is a single-dimensional time series). We assume T() to be linear for simplification. Then, from equations (7), (8) and (9) we get:
\[ \frac{y_{ti} - y_t}{y_{0i} - y_0} \approx \frac{x_{ti} - q}{x_{0i} - q} = \frac{T(r_{ti})}{T(r_{0i})} = \frac{r_{tk}}{r_{0k}} = \sqrt{\frac{\rho_0}{\rho_t}} \qquad (10) \]
and equation (6) is changed by inserting equation (10):
\[ e = \left|\frac{\sum_{i=1}^{K} \sqrt{\tfrac{\rho_0}{\rho_t}}\,(y_{0i} - y_0)\, G\!\left(\sqrt{\tfrac{i}{\rho_t\pi h^2}}\right)}{\sum_{i=1}^{K} G\!\left(\sqrt{\tfrac{i}{\rho_t\pi h^2}}\right)}\right| \qquad (11) \]
To completely eliminate the stochastic entry, we assume again that T() has a squared form (if T() is kept linear as in equation (10), we get the Integral Error Function, see Section 3.3):
\[ e = \left|\frac{\sum_{i=1}^{K} \sqrt{\tfrac{\rho_0}{\rho_t}}\, C\, r_{0i}^2\, G\!\left(\sqrt{\tfrac{i}{\rho_t\pi h^2}}\right)}{\sum_{i=1}^{K} G\!\left(\sqrt{\tfrac{i}{\rho_t\pi h^2}}\right)}\right| = \left|\frac{\frac{C}{\pi\sqrt{\rho_t\rho_0}}\sum_{i=1}^{K} i\, G\!\left(\sqrt{\tfrac{i}{\rho_t\pi h^2}}\right)}{\sum_{i=1}^{K} G\!\left(\sqrt{\tfrac{i}{\rho_t\pi h^2}}\right)}\right| \qquad (12) \]
Treating the error e as an error function E() of ρ, K and h and using a Gaussian kernel, equation (12) can be rewritten as:
\[ E(\rho, K, h) = \frac{\frac{C}{\pi\sqrt{\rho\rho_0}}\sum_{i=1}^{K} i\, e^{-\frac{i}{\rho\pi h^2}}}{\sum_{i=1}^{K} e^{-\frac{i}{\rho\pi h^2}}} \qquad (13) \]
C is a temporary coefficient and will turn out to be irrelevant in what follows.
3.3 Boundary Determination of Error Function
Theorem. By constraining the boundary of the error function E(), we propose a theorem called the Relation of Density and Parameters Theorem (RDPT):
\[ \rho = \frac{K}{\pi h^2} \]
Proof. Integral scaling techniques based on the Squeezing Theorem are used here for the derivation. The exponential function y = e^{-x/a} has the property that the sum of exponential terms over integer x from 1 to K is larger than its integral from 1 to K, and the sum over integer x from 2 to K is smaller than that integral:
\[ \sum_{i=1}^{K} e^{-\frac{i}{\rho\pi h^2}} > \int_{1}^{K} e^{-\frac{i}{\rho\pi h^2}}\,di = -\rho\pi h^2 e^{-\frac{K}{\rho\pi h^2}} + \rho\pi h^2 e^{-\frac{1}{\rho\pi h^2}} \]
\[ \sum_{i=1}^{K} e^{-\frac{i}{\rho\pi h^2}} < \int_{1}^{K} e^{-\frac{i}{\rho\pi h^2}}\,di + e^{-\frac{1}{\rho\pi h^2}} = -\rho\pi h^2 e^{-\frac{K}{\rho\pi h^2}} + \rho\pi h^2 e^{-\frac{1}{\rho\pi h^2}} + e^{-\frac{1}{\rho\pi h^2}} \]
Propose a variable B1 and set it to:
\[ \rho\pi h^2 e^{-\frac{1}{\rho\pi h^2}} < B_1 < \rho\pi h^2 e^{-\frac{1}{\rho\pi h^2}} + e^{-\frac{1}{\rho\pi h^2}} \]
Similar properties hold for the function y = x e^{-x/a}:
\[ \sum_{i=1}^{K} \frac{C\,i\,e^{-\frac{i}{\rho\pi h^2}}}{\pi\sqrt{\rho\rho_0}} > \frac{C\,\rho\pi h^2}{\pi\sqrt{\rho\rho_0}}\left[\left(-K - \rho\pi h^2\right)e^{-\frac{K}{\rho\pi h^2}} + \left(1 + \rho\pi h^2\right)e^{-\frac{1}{\rho\pi h^2}}\right] \]
\[ \sum_{i=1}^{K} \frac{C\,i\,e^{-\frac{i}{\rho\pi h^2}}}{\pi\sqrt{\rho\rho_0}} < \frac{C\,\rho\pi h^2}{\pi\sqrt{\rho\rho_0}}\left[\left(-K - \rho\pi h^2\right)e^{-\frac{K}{\rho\pi h^2}} + \left(1 + \rho\pi h^2\right)e^{-\frac{1}{\rho\pi h^2}}\right] + \frac{C\,e^{-\frac{1}{\rho\pi h^2}}}{\pi\sqrt{\rho\rho_0}} \]
A variable B2 is proposed like B1 and set to:
\[ \frac{C\,\rho\pi h^2 e^{-\frac{1}{\rho\pi h^2}}}{\pi\sqrt{\rho\rho_0}}\left(1 + \rho\pi h^2\right) < B_2 < \frac{C\,\rho\pi h^2 e^{-\frac{1}{\rho\pi h^2}}}{\pi\sqrt{\rho\rho_0}}\left(1 + \rho\pi h^2\right) + \frac{C\,e^{-\frac{1}{\rho\pi h^2}}}{\pi\sqrt{\rho\rho_0}} \]
Introducing B1 and B2 into the error function:
\[ E(\rho, K, h) = \frac{\frac{C\,\rho\pi h^2}{\pi\sqrt{\rho\rho_0}}\, e^{-\frac{K}{\rho\pi h^2}}\left(-K - \rho\pi h^2\right) + B_2}{-\rho\pi h^2\, e^{-\frac{K}{\rho\pi h^2}} + B_1} \qquad (14) \]
Note that if T() in Section 3.2 is kept linear as in equation (10), we get the Integral Error Function, which is twice the integral of the Gaussian distribution with mean 0 and variance 1/2, \( \operatorname{erf}(x) = \frac{2}{\sqrt{\pi}}\int_{0}^{x} e^{-t^2}\,dt \), and this Integral Error Function cannot be reduced to a non-integral formula. To facilitate the analysis, we select two different forms of T(), both of which reflect the positive correlation of x_i − q and r_i, which is what matters most. Because most errors are generated when predicting points in sparse regions, the core focus should be put on the low-density situation. When the data density is below a threshold ρ_low, both B1 and B2 approach zero. Then we can make an approximation:
\[ E(\rho, K, h) \approx \frac{\frac{C}{\pi\sqrt{\rho\rho_0}}\,\rho\pi h^2\, e^{-\frac{K}{\rho\pi h^2}}\left(-K - \rho\pi h^2\right)}{-\rho\pi h^2\, e^{-\frac{K}{\rho\pi h^2}}} = \frac{CK}{\pi\sqrt{\rho\rho_0}} + \frac{C\sqrt{\rho}\,h^2}{\sqrt{\rho_0}}, \qquad \rho \le \rho_{low} \]
To minimize this error function, we calculate its derivative about ρ and make the derivative equal to zero:
\[ \frac{\partial E(\rho, K, h)}{\partial \rho} = -\frac{1}{2}\cdot\frac{CK}{\pi\sqrt{\rho_0}}\cdot\rho^{-\frac{3}{2}} + \frac{1}{2}\cdot\frac{C h^2}{\sqrt{\rho_0}}\cdot\rho^{-\frac{1}{2}} = 0 \qquad (15) \]
Finally, we get the compendious relationship of ρ, K and h, named the Relation of Density and Parameters Theorem:
\[ \rho = \frac{K}{\pi h^2} \qquad (16) \]
4 Experiments
Experiments are designed in two ways: (1) verification on artificial stochastic datasets; (2) application of the RDPT theory to the prediction of traffic flow. Experiments are implemented in Matlab 7 on an Intel Core 2 2.26 GHz CPU with 2 GB memory.
4.1 Verification by Artificial Stochastic Datasets
To support the theoretical results with data, experimental studies are designed to test the theorem. The studies described in this part are based on stochastic datasets with varying local density. The datasets are generated from random numbers whose elements are normally distributed with mean 0, variance σ² = 1 and standard deviation σ = 1, as shown in Figure 1. The sample input x, query input q, sample output y and query output y_q each have size M. x and q are generated by the stochastic process, while y and y_q are generated by adding a random noise term to x and q, respectively, in order to simulate real time series whose outputs share similar features with the inputs. The prediction accuracy is measured by the mean fractional error (MFE):
\[ \mathrm{MFE} = \frac{1}{M}\sum_{i=1}^{M}\frac{|y_{qi} - \hat{y}_{qi}|}{y_{qi}} \]
Figure 2 shows the comparison of two relations when ρ is lower than a threshold ρ_low (set to the maximum density of the 20% sparsest data points). The blue curve shows the relation between the optimal bandwidths, calculated by ideally minimizing every query prediction error with "prescient" knowledge of the real output (OBS), and the data density calculated by Kernel Density Estimation [4]. The red curve shows the ρ = K/(πh²) relation. K is constrained to be 10 here. The results show a good match of RDPT to the optimal bandwidths. Table 1 reports the prediction MFE of several bandwidth selection methods. Global Bandwidth Selection needs no preprocessing and no additional storage besides the samples, but its performance is the worst. Query-based and Point-based Bandwidth Selection are both computationally expensive in preprocessing and need additional storage depending on the sizes of the query set and sample set, respectively, but they perform better than Global Bandwidth Selection. The proposed RDPT needs no additional storage and shows the best results.
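As a small illustration of the accuracy measure, the MFE above can be computed as follows (a hedged sketch; the array names are illustrative and assume all true outputs are non-zero):

```python
import numpy as np

def mean_fractional_error(y_true, y_pred):
    # MFE = (1/M) * sum_i |y_qi - y_hat_qi| / y_qi
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) / y_true))
```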
Fig. 1. Stochastic Input with Density Contour Lines
Fig. 2. Comparison of Optimal Relation and RDPT Relation, ρ <= ρlow
Table 1. Comparison of Prediction Mean Fractional Error

M    | Bandwidth Selection | MFE    | Time consumed in prediction (s) | Time consumed in preprocessing (s) | Additional storage consumed in prediction
1000 | GBS                 | 0.1298 | 0.7330                          | 0                                  | 0
1000 | QBS                 | 0.1058 | 0.7950                          | 16.2710                            | 1000
1000 | PBS                 | 0.1292 | 0.7960                          | 16.0360                            | 1000
1000 | RDPT                | 0.0910 | 0.7170                          | 0.1400                             | 0
1000 | OBS                 | 0.0669 | -                               | -                                  | -
The RDPT policy is combined with global bandwidth selection: when ρ <= ρ_low the RDPT method is used, and global bandwidth selection is applied when ρ > ρ_low. Although data points in sparse regions are fewer than points in dense regions, large errors occur in the sparse regions and largely determine the quality of the entire prediction. Therefore, a combination mechanism is a good choice, as sketched below.
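A minimal sketch of this combined policy, under the assumptions that the local density is estimated with a Gaussian kernel density estimate and that the inputs are two-dimensional so the planar form ρ = K/(πh²) applies; `h_global` and `rho_low` stand for whatever values the global selector and the 20%-sparsest-points rule provide (names are illustrative, not from the paper):

```python
import numpy as np
from scipy.stats import gaussian_kde

def rdpt_bandwidth(X, q, K, h_global, rho_low):
    """Combined policy: RDPT bandwidth in sparse regions, global bandwidth otherwise.

    X : (M, 2) sample inputs, q : (2,) query point.
    """
    kde = gaussian_kde(X.T)                 # density estimate over the samples
    rho = float(kde(q.reshape(-1, 1))[0])   # estimated density at the query
    if rho <= rho_low:
        return np.sqrt(K / (np.pi * rho))   # RDPT: h = sqrt(K / (pi * rho))
    return h_global
```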
4.2 Application in Prediction of Traffic Flow
The experiment in this section applies the RDPT policy to the prediction of traffic flow. The studies are based on the Freeway Performance Measurement System (PeMS) of the University of California, Berkeley [5]. The traffic datasets are from Los Angeles mainline I5, segments 759700, 759707 and 716936. The data points are collected at 5-minute intervals and include traffic flow and occupancy. Sample data points are selected from Oct. 1, 2009 to Oct. 16, 2009, excluding weekends and holidays; query points are from Oct. 19, 2009. Both Global Bandwidth Selection (GBS) and RDPT bandwidth selection are used in the prediction of traffic flow. Figures 3 and 4 show the comparisons of predicted and real output. From Table 2 and the comparison of Figure 3 with Figure 4, it can be seen that the RDPT adaptive bandwidth selection shows clear advantages for traffic flow. Both experiments provide sufficient evidence for the RDPT theory and demonstrate its applicability.
Fig. 3. Prediction with GBS
Fig. 4. Prediction with RDPT
Table 2. Comparison of Prediction Mean Fractional Error

Segment No. | MFE of GBS | MFE of RDPT
759700      | 0.1254     | 0.1134
716907      | 0.1059     | 0.0996
716936      | 0.1346     | 0.1057

5 Conclusions
The relationship between data density and fit parameters is discussed in this paper. After the preliminaries, a detailed mathematical derivation of the RDPT theory is presented. Experimental studies are implemented to verify its correctness and to show its advantages in the prediction of traffic flow. The paper rests on one main assumption: the two forms of T() described in Sections 3.2 and 3.3. This assumption can be extended or modified in future research on the local linear regression model and other model structures. High-dimensional regression [6] should also be addressed in future work. Acknowledgments. This work was supported by the National Natural Science Foundation of China under Grants No. 60703066 and No. 60874082, and by the National High-Tech Research and Development Plan of China (863) under Grant No. 2006AA12Z217.
References 1. Atkeson, C., Moore, A., Schaal, S.: Locally Weighted Learning. Artificial Intelligence Review, 11–73 (1997) 2. Grabis, J.: Partially Adaptive Bandwidth Used in Prediction with Local Regression. Riga Technical University, Kalku 1, Riga Lv-1658, Latvia 3. Christopher, D.L.: Local Models for Spatial Analysis. Queen’s University (2007) 4. Kernel Density Estimation, http://en.wikipedia.org/wiki/ 5. Freeway Performance Measurement System, http://pems.eecs.berkeley.edu 6. Vijayakumar, S., Souza, A.D., Schaal, S.: Incremental Online Learning in High Dimensions. Neural Computation 17 (2005)
Structure Analysis of Email Networks by Information-Theoretic Clustering

Yinghu Huang1 and Guoyin Wang1,2

1 School of Information Science & Technology, Southwest Jiaotong University, Chengdu 610031, China
[email protected]
2 College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
wanggy
[email protected] Abstract. In the real world, many systems can be represented as a network, in which the nodes denote the objects of interest and the edges describe the relations between them, such as telecommunication networks, power grid networks, and email communication networks. These complex networks have been revealed to possess many common statistical properties such as scale-free nature and small-world property. In addition, modularity or community structure is another important characteristic of complex networks. Identifying modular structure can help us understand the function of networks. In this paper, we introduce a method based on information-theoretic clustering for finding communities/modules in complex networks. This method is robust to the feature representation of networks. Moreover, unlike most existing algorithms, this method does not need to search the number of communities in a network and can determine it automatically. We apply this method to several well-studied networks including a large-scale email communication network and the computational results demonstrate its effectiveness. Keywords: Information-theoretic clustering, Modular structure, Email communication networks, Complex network.
1 Introduction
Large-scale networks describing interactions among objects have become a powerful approach to understanding complex organization in social, technological and biological systems. Examples range from social networks (e.g. scientific collaboration networks [1]) and technological networks (e.g. the Internet [2], email networks [3,4]) to biological networks such as protein interaction networks. Complex networks in these systems have been revealed to possess various common statistical properties, such as the small-world property, power-law degree distribution, network transitivity, and clustering coefficients [5]. In recent years complex network research has become a hot topic. Structure analysis of complex networks can help us understand the function of networks and their components. Taking email networks as an example, in which the nodes are email users and an edge connects two users if they have email
communication, several researchers have found that topological properties of email networks such as clustering coefficient are good features for filtering spam emails [6,7]. In addition to the topological properties mentioned above, modularity or community structure, characterized by groups of individual nodes within which nodes are densely linked and between which the connection is sparser, is another important common characteristic of complex networks. Detecting modular structure has fundamental importance for exploiting the networks because such substructures often correspond to important functions. For example, the exchange of emails between individuals in an organization reveals how people interact and provides a map of which groups of people have similar interests, which is helpful from a managerial point of view and beyond [8]. So far many methods have been developed to detect communities in complex networks, such as betweenness-based methods [1], hierarchical clustering methods [9], heuristic methods [10], graph-theoretic methods [11], and modularity optimization [12,13,14]. Identifying community structure in complex networks is closely related to clustering of data in other areas without an underlying network structure. In contrast to above methods, Gustafsson et al. explore some standard ways of clustering data, such as k-means and hierarchical clustering, to cluster networks [15]. More algorithms for detecting community structure in complex networks can be referred to the review papers [16,17,18]. Despite the fact that many community detection methods have been proposed, there are several obvious defects in them. For example, some methods are very sensitive to the initialization and data representation (e.g. k-means in [15]). Another important problem is how to determine the number of communities in a network, which is in general a model selection problem. Many methods are required to search the number of communities before detecting communities [10,15,19] which makes them costly in terms of computation time and suffer from issues of generalization and overfitting. In this paper, we propose a method based on information-theoretical clustering for identifying community structure in email networks. Compared with the existing algorithms, this method is robust to the data representation, which help us save the efforts for preprocessing networks into Euclidean vectors or similarity matrices. In addition, the method does not require prior knowledge about the number of modules. Experimental results show that our method has good performance on both benchmark artificial networks and real-life networks.
2 Methods

2.1 Data Representation of Networks
We first introduce several similarity measures as the data features of networks and will compare their performance in the next section. Given an undirected, unweighted network G = (V, E), where V denotes the set of nodes and E denotes the set of edges in this network, its adjacency matrix is defined as A = [Aij ]n×n , where Aij = 1, if nodes vi and vj are connected and otherwise Aij = 0. n is the number of nodes in the network G. The simplest
data representation for a network is the adjacency matrix A itself. In addition, the concept of shortest paths has been applied to denote the distance of two nodes in a graph or network. Assume the length of the shortest path between nodes v_i and v_j is D_ij; then the distance matrix is D = [D_ij]_{n×n}. Gustafsson et al. [15] define an Euclidean distance measure based on this matrix. Here we propose another shortest-path-based similarity measure:
\[ SP(v_i, v_j) = \frac{1}{1 + D_{ij}^{k}} \qquad (1) \]
where k is a constant controlling the role that the shortest paths play in the similarity. In this paper, we set k = 2. Unlike the adjacency matrix A, the shortest-path-based similarity can capture the long-range relationships of nodes by utilizing the global topology of a network. In addition, we also examine the representation of a network by its diffusion kernel. The diffusion kernel is a positive-definite function and has been comprehensively used in many fields as a feature matrix [20]. The opposite Laplacian of a network is the matrix L = A − D, where D is the diagonal degree matrix of the network. The exponential of the matrix L is defined as K = exp(βL), where β is a positive constant that controls the degree of diffusion. In this paper, we set β = 0.1. The matrix K is symmetric and positive definite. It can also capture the long-range relationship between nodes that are not connected directly. A similarity measure based on this kernel can be obtained for clustering after normalizing K:
\[ DK(v_i, v_j) = \frac{K_{ij}}{\sqrt{K_{ii}K_{jj}}} \]
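A small sketch of how the two similarity matrices above could be computed for a connected, undirected network, using the parameter values k = 2 and β = 0.1 given in the text; the use of networkx/scipy here is an illustrative assumption, not the authors' implementation (which was in Matlab):

```python
import numpy as np
import networkx as nx
from scipy.linalg import expm

def similarity_matrices(G, k=2, beta=0.1):
    """Shortest-path similarity SP and normalized diffusion-kernel similarity DK."""
    nodes = list(G.nodes())
    A = nx.to_numpy_array(G, nodelist=nodes)

    # SP(vi, vj) = 1 / (1 + D_ij^k), assuming the graph is connected
    D = np.array([[nx.shortest_path_length(G, u, v) for v in nodes] for u in nodes])
    SP = 1.0 / (1.0 + D.astype(float) ** k)

    # Diffusion kernel K = exp(beta * L), with the "opposite Laplacian" L = A - diag(degrees)
    L = A - np.diag(A.sum(axis=1))
    Km = expm(beta * L)
    DK = Km / np.sqrt(np.outer(np.diag(Km), np.diag(Km)))  # K_ij / sqrt(K_ii K_jj)
    return SP, DK
```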
2.2 Information-Theoretic Community Detection
Slonim et al. developed a clustering method based on information theory [21]. The major advantage of this method is that it makes no nontrivial assumptions about cluster prototypes, similarity metrics, or the representation of the data. It provides a way of clustering based on collective similarity rather than the traditional pairwise measures. Here we adapt this clustering method for detecting communities in complex networks, a clustering problem with an underlying structure. We also examine whether this method is invariant to changes in the representation of the data. Assume a network G has n nodes V = {v_1, v_2, ..., v_n} and m (an upper bound on the number of communities) putative communities C_1, C_2, ..., C_m. Let P(C_j|v_i) denote the probability that node v_i belongs to community C_j. From Bayes' rule, it follows that
\[ P(v_i|C_j) = \frac{P(C_j|v_i)\,P(v_i)}{P(C_j)} \]
where P (vi |Cj ) is the probability to choose vi from the members of community Cj , P (Cj ) is the total probability of choosing a node in the community Cj , and
\( P(C_j) = \sum_i P(C_j|v_i)P(v_i) \). If we have no prior knowledge about the community membership of a node v_i, it naturally has a uniform prior probability of being chosen: P(v_i) = 1/n. Communities in a network are groups of individual nodes within which nodes are densely linked and between which the connection is sparser. In other words, if we pick nodes at random from a community, we would like these nodes to be as tightly connected with one another as possible. Assume that there is a way to compute the collective similarity s(v_{i1}, v_{i2}, ..., v_{ir}) among r nodes v_{i1}, v_{i2}, ..., v_{ir}; then the average similarity among nodes in the community C_j is
\[ S(C_j) = \sum_{i_1=1}^{n}\sum_{i_2=1}^{n}\cdots\sum_{i_r=1}^{n} s(v_{i_1}, v_{i_2}, \cdots, v_{i_r})\,P(v_{i_1}|C_j)\,P(v_{i_2}|C_j)\cdots P(v_{i_r}|C_j). \]
Community detection aims to find a partition with the highest similarity within communities, which can be represented by
\[ \langle S \rangle = \sum_{j=1}^{m} S(C_j)\,P(C_j). \]
From the information-theoretic view, community detection is to compress the network data into individual communities as much as possible, and meanwhile maintain the information carried by the communities:
\[ \langle I \rangle = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m} P(C_j|v_i)\,\log\frac{P(C_j|v_i)}{P(C_j)}. \]
Therefore, finding a good network partition amounts to choosing the assignment rules P(C_j|v_i) that maximize
\[ F = \langle S \rangle - w\,\langle I \rangle \]
where w is a constant controlling the trade-off between information compression and rate-distortion. In this study, we simply fix w as 0.02. In practical applications, P(C_j|v_i) can be derived theoretically by equating the derivative of F to zero [23], or iteratively by finding an explicit numerical solution [21]:
\[ P(C_j|v_i) = \frac{P(C_j)}{Z_i(w)}\,\exp\!\left\{\frac{1}{w}\left[\,r\,S(C_j; v_i) - (r-1)\,S(C_j)\,\right]\right\} \]
where Z_i(w) is a normalization constant and S(C_j; v_i) is the expected similarity between v_i and r − 1 other nodes of community C_j. The probability P(C_j|v_i) has the form of a Boltzmann distribution: higher similarity among nodes of a community corresponds to lower energy, and w plays the role of temperature. s(v_{i1}, v_{i2}, ..., v_{ir}) is derived from multi-mutual information among multiple variables [21], which can capture complex nonlinear dependencies between nodes. This feature makes the information-theoretic clustering method particularly suited to situations where the data structure is obscured at the pairwise level but clearly manifests itself at higher levels, like the community detection problem in this paper.
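The following is a hedged sketch of the trade-off objective F = ⟨S⟩ − w⟨I⟩ for a soft assignment, restricted to the pairwise case r = 2; the actual Iclust program [21] uses multi-information-based collective similarities and an iterative update of P(C_j|v_i), so this only illustrates how ⟨S⟩ and ⟨I⟩ combine (function and variable names are illustrative):

```python
import numpy as np

def objective_F(P_cv, S_pair, w=0.02):
    """F = <S> - w<I> for a soft assignment (pairwise similarities, r = 2).

    P_cv   : (n, m) matrix with P(C_j | v_i); rows sum to 1, entries assumed > 0.
    S_pair : (n, n) pairwise similarity s(v_i, v_k).
    """
    n, m = P_cv.shape
    P_c = P_cv.mean(axis=0)                       # P(C_j), with uniform P(v_i) = 1/n
    P_vc = P_cv / (n * P_c)                       # P(v_i | C_j) from Bayes' rule
    # S(C_j) = sum_{i,k} s(v_i, v_k) P(v_i|C_j) P(v_k|C_j)
    S_c = np.einsum('ij,ik,kj->j', P_vc, S_pair, P_vc)
    S_avg = np.sum(S_c * P_c)                     # <S>
    I = np.sum(P_cv * np.log(P_cv / P_c)) / n     # <I> (in nats)
    return S_avg - w * I
```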
2.3 Evaluation Criteria
The normalized mutual information index is used to evaluate the similarity between the real partition P and the identified partition P′ [16]:
\[ I_{NMI}(P, P') = \frac{-2\sum_{i=1}^{|P|}\sum_{j=1}^{|P'|} n_{ij}\,\log\!\left(\frac{n_{ij}\,n}{n_i\,n_j}\right)}{\sum_{i=1}^{|P|} n_i\log(n_i/n) + \sum_{j=1}^{|P'|} n_j\log(n_j/n)} \]
where n_i represents the number of nodes in cluster p_i of P, n_j the number of nodes in cluster p'_j of P′, and n_ij denotes the number of elements shared by clusters p_i and p'_j. A larger I_NMI(P, P') means the two partitions are more similar. Obviously, 0 ≤ I_NMI ≤ 1, with I_NMI(P, P) = 1. This measure can compare two partitions with different numbers of communities. We use this measure to evaluate partitions for those networks with known community structure. For those networks whose true community structure is unknown, we use the modularity function Q [22] to evaluate their partitions, which is defined as:
\[ Q(P_k) = \sum_{c=1}^{k}\left[\frac{L(V_c, V_c)}{L(V, V)} - \left(\frac{L(V_c, V)}{L(V, V)}\right)^{2}\right] \qquad (2) \]
where P_k is a partition of the nodes into k groups and \( L(V', V'') = \sum_{i\in V',\, j\in V''} A_{ij} \). This measure compares the number of edges in a given subnetwork with the expected value in a randomized network and provides a way to determine whether a partition is good enough to decipher the community structure of the network. Generally, the larger Q is, the better the community structure.
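For reference, a direct implementation of the modularity in equation (2) for a hard partition might look as follows (a sketch under the assumption of an undirected, unweighted network; not the authors' code):

```python
import numpy as np

def modularity(A, labels):
    """Q = sum_c [ L(Vc,Vc)/L(V,V) - (L(Vc,V)/L(V,V))^2 ], equation (2).

    A : symmetric adjacency matrix, labels : community label for each node.
    """
    L_total = A.sum()                       # L(V, V): every edge counted from both ends
    Q = 0.0
    for c in np.unique(labels):
        mask = (labels == c)
        L_cc = A[np.ix_(mask, mask)].sum()  # edge ends inside community c
        L_cV = A[mask, :].sum()             # edge ends attached to community c
        Q += L_cc / L_total - (L_cV / L_total) ** 2
    return Q
```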
3 Computational Results
In this section, we apply the information-theoretic community detection method to several well-studied networks and a large email network. The algorithm is coded and implemented in Matlab 7.6. For the core of information-theoretic clustering, we use the Iclust program [21].
3.1 Benchmark Computer-Generated Networks
A set of widely used benchmark computer-generated networks for community detection has been designed in [1]. In this network set, each network has 128 nodes, which are divided into 4 communities of size 32 each. Edges are placed randomly according to two fixed expectation values kin and kout such that the average degree of a node is 16, where kin is the average number of each node's edges within its community and kout is the average number of each node's edges connecting to other communities. For each value of kout, 100 networks are generated. Obviously, the larger kout is, the more difficult the corresponding networks are to partition.
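One way to generate such a benchmark graph is the usual planted-partition construction, in which the within- and between-community edge probabilities are chosen to match the expected degrees kin and kout; the sketch below assumes this construction (illustrative code, not from the cited benchmark's original implementation):

```python
import numpy as np
import networkx as nx

def gn_benchmark(k_out, n=128, groups=4, k_total=16, seed=None):
    """One Girvan-Newman style benchmark graph: 4 communities of 32 nodes each."""
    rng = np.random.default_rng(seed)
    size = n // groups
    k_in = k_total - k_out
    p_in = k_in / (size - 1)          # within-community edge probability
    p_out = k_out / (n - size)        # between-community edge probability
    labels = np.repeat(np.arange(groups), size)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                G.add_edge(i, j)
    return G, labels
```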
We assign each detected community to a known community by majority voting and then compute the classification error over the individual nodes. Fig. 1 shows the fraction of nodes that are classified into their correct communities with respect to kout for our method (IC algorithm, based on the adjacency matrix A), the spectral algorithm for optimization of Q [12] (Spectral Q algorithm), the hierarchical algorithm based on Q [14] (Hierarchical Q algorithm), and the betweenness-based algorithm [1] (GN algorithm), where the Spectral Q algorithm is a recently published efficient method. We can see that the information-theoretic community detection method and the spectral algorithm are much more effective than the betweenness-based algorithm and the hierarchical Q algorithm. The performance of our method is better than that of the Spectral Q algorithm. In addition, the IC algorithm based on shortest paths and on the diffusion kernel gives quite similar results; the data are not shown for clarity.
3.2 A Benchmark Real-Life Network
As a further test example, we use the American college football network [1]. This network represents the game schedule of the 2000 season of Division I of the U.S. college football league. The nodes represent the 115 teams, while the edges represent 613 games played over the 2000 season. The teams are divided into 12 conferences, with each conference having 8-12 teams. Generally games are more frequent between members of the same conference than between teams of different conferences. Interconference play is not uniformly distributed and teams that are geographically close to one another are more likely to play one another than teams separated by large geographic distances. The natural community structure in the network makes it a widely used benchmark testing example for evaluating community detection algorithms [15].
[Fig. 1 shows two panels: (a) the fraction of correctly classified nodes versus kout for the Spectral Q, IC, GN and Hierarchical Q algorithms; (b) the standard deviation versus kout for the Spectral Q and IC algorithms.]
Fig. 1. Comparison of the four methods on the computer-generated networks. (a) The results are averaged over 100 networks with a same kout . (b) The standard deviation over 100 realizations.
Table 1. Comparison of community detection methods for the football team network. NA denotes that the corresponding partition is not available for comparison. A larger Q corresponds to a better community structure. A larger INMI means the detected partition is closer to the true partition.

Methods                 | #clusters | Q      | INMI
IC A                    | 11        | 0.589  | 0.945
IC SP                   | 11        | 0.602  | 0.904
IC DK                   | 11        | 0.601  | 0.916
IC 12                   | 12        | 0.601  | 0.924
SA Q [13]               | 9         | 0.5371 | 0.848
Hierar. Q [14]          | 6         | 0.556  | 0.698
Hierar. clustering [15] | 8         | 0.46   | NA
k-means [15]            | 10        | 0.600  | NA
The computational results on this network are summarized in Table 1, where IC A, IC SP and IC DK denote the information-theoretic community detection based on the adjacency matrix, shortest paths, and diffusion kernel, respectively. IC 12 denotes the IC algorithm with the number of communities fixed at 12. SA Q is the simulated annealing optimization of Q for community detection, developed in [13]. If we do not use prior knowledge about the number of communities, the IC algorithm with the three similarity metrics all return 11 communities, in high agreement with the true community structure. When fixed at 12 communities, the IC algorithm is also able to detect a good network partition, robust to the data representation of networks. In other words, the three similarity metrics lead to the same result. In contrast, other algorithms such as the simulated annealing algorithm [13], the hierarchical Q algorithm [14], and hierarchical clustering [15] are not as good in terms of the modularity Q and the mutual information index INMI. As an illustrative result, Fig. 2 shows the 12-community partition obtained by IC 12, where the dense subgraphs in the layout are the communities detected by the IC algorithm. We can see that 10 of the 12 communities correspond to 10 conferences almost exactly, and 104 football teams (more than 90%) are correctly classified into their real conference membership. Only the Sunbelt conference (blue squares) is divided into two parts, and the teams in the Independents conference (pink squares) are distributed among other communities. This happens because of the uneven distribution of interconference play.
3.3 A Large-Scale Email Network
The exchange of e-mails between individuals in an organization forms an email network. Structure analysis of large-scale email networks can yield many useful insights. For example, email networks reveal how people interact with each other in a self-organizing way, which provides much information from the management view [24].
Fig. 2. The detected communities in the football team network. Football teams with the same shape and color are from the same conference.
Guimerà et al. constructed a large-scale email network from e-mail communications within a medium-sized university with about 1700 employees [24]. This email network, with nodes as employees and edges as email communications, has 1133 nodes and 5451 edges. Their study found that human interactions, as a complex system represented by the email network, self-organize into a state where the distribution of community sizes is self-similar. This indicates that there is an underlying universal mechanism responsible for such self-organization that drives the formation and evolution of social networks. We apply the information-theoretic community detection method to this email network and test whether it can reveal a similar phenomenon, i.e. whether it can detect communities with similar sizes. Based on the adjacency matrix and shortest paths, the IC algorithm detects 34 communities. In contrast, SA Q, the simulated annealing algorithm for optimizing Q [13], identifies 9 communities. The distribution of community sizes is illustrated in Figure 3. We can see that the structure of the email network is indeed self-similar. That is to say, the communities detected by the IC algorithm are homogeneous: the sizes of different communities are quite similar, mainly ranging from 11 to 30. On the other hand, the communities detected by the SA Q algorithm are very heterogeneous: the minimum community has 36 nodes and the maximum community has 228 nodes.
[Fig. 3 is a bar chart of the number of communities versus community size (0–10, 11–20, 21–30, 31–40, >40) for IC_A, IC_SP and SA_Q.]
Fig. 3. The distribution of the sizes of the detected communities in the email network
4 Conclusion and Discussion
In this paper, we introduced an information-theoretic community detection method. Computational results on several benchmark networks and one email network show that our method performs very well in community detection and in the structure analysis of email communication networks. Robustness to the data representation saves the effort of preprocessing networks. A potential improvement of our method is to consider how to automatically determine an optimal value for the parameter w and how to extend the algorithm for detecting real-time community structure [25]. One possible way of determining the trade-off parameter w is to sample it over an interval and select a value based on some modularity criterion such as Q [22].
References 1. Girvan, M., Newman, M.E.J.: Community Structure in Social and Biological Networks. Proc. Natl. Acad. Sci. USA 99, 7821–7826 (2002) 2. Gibson, D., Kleinberg, J., Raghavan, P.: Inferring Web Communities from Link Topology. In: Proceedings of the 9th ACM Conference on Hypertext and Hypermedia. ACM Press, New York (1998) 3. Newman, M.E.J., Forrest, S., Balthrop, J.: E-mail Networks and the Spread of Computer Viruses. Phys. Rev. E 66, 035101 (2002) 4. Ebel, H., Mielsch, L.I., Bornholdt, S.: Scale-free Topology of E-mail Networks. Phys. Rev. E 66, 035103 (2002) 5. Albert, R., Barab´ asi, A.: Statistical Mechanics of Complex Networks. Reviews of Modern Physics 74, 47–97 (2002) 6. Boykin, P.O., Roychowdhury, V.P.: Leveraging Social Networks to Fight Spam. Computer 38, 61–68 (2005) 7. Kong, J.S., Rezaei, B.A., Sarshar, N., Roychowdhury, V.P., Boykin, P.O.: Collaborative Spam Filtering Using E-mail networks. Computer 39, 67–73 (2006)
8. Guimer` a, R., Danon, L., Diaz-Guilera, A., Giralt, E., Arenas, A.: The Real Communication Network Behind the Formal Chart: Community Structure in Organizations. Journal of Economic Behavior & Organization 61, 653–667 (2006) 9. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge (1994) 10. Angelini, L., Boccaletti, S., Marinazzo, D., Pellicoro, M., Stramaglia, S.: Fast Identification of Network Modules by Optimization of Ratio Association. Chaos 17, 023114 (2007) 11. Pallal, G., Derenyi, I., Farkasl, I., Vicsek, T.: Uncovering the Overlapping Community Structure of Complex Networks in Nature and Society. Nature 435, 814–818 (2005) 12. Newman, M.E.J.: Modularity and Community Structure in Networks. Proc. Natl Acad. Sci. USA 103(23), 8577–8582 (2006) 13. Guimer` a, R., Amaral, L.A.N.: Functional Cartography of Complex Metabolic Networks. Nature 438, 895–900 (2005) 14. Newman, M.E.J.: Fast Algorithm for Detecting Community Structure in Networks. Phys. Rev. E 69, 066133 (2004) 15. Gustafsson, M., H¨ ornquista, M., Lombardi, A.: Comparison and Validation of Community Structures in Complex Networks. Physica A 367, 559–576 (2006) 16. Danon, L., Daz-Guilera, A., Duch, J., Arenas, A.: Comparing Community Structure Identification. J. Statist. Mech.: Theory and Experiment 09, P09008 (2005) 17. Fortunato, S.: Community Detection in Graphs. Physics Reports 486(3-5), 75–174 (2010) 18. Lancichinetti, A., Fortunato, S.: Community Detection Algorithms: A Comparative Analysis. Phys. Rev. E 80, 056117 (2009) 19. White, S., Smyth, P.: A Spectral Clustering Approach to Finding Communities in Graphs. In: SIAM International Conference on Data Mining, Newport Beach, CA, USA (2005) 20. Kondor, R.I., Lafferty, J.: Diffusion Kernels on Graphs and Other Discrete Input Space. In: Proc. ICML 2002, pp. 315–322. Morgan Kaufmann Publishers Inc., San Francisco (2002) 21. Slonim, N., Atwal, G.S., Tkacik, G., Bialek, W.: Information-based Clustering. Proc. Natl. Acad. Sci. USA 102(51), 18297–18302 (2005) 22. Newman, M.E.J., Girvan, M.: Finding and Evaluating Community Structure in Networks. Phys. Rev. E 69, 026113 (2004) 23. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York (1991) 24. Guimer` a, R., Danon, L., D´ıaz-Guilera, A., Giralt, F., Arenas, A.: Self-similar Community Structure in a Network of Human Interactions. Phys. Rev. E 68, 65103 (2003) 25. Leung, I.X.Y., Hui, P., Li` o, P., Crowcroft, J.: Towards Real-time Community Detection in Large Networks. Phys. Rev. E 79, 066107 (2009)
Recognizing Mixture Control Chart Patterns with Independent Component Analysis and Support Vector Machine Chi-Jie Lu1, Yuehjen E. Shao2, Po-Hsun Li3, and Yu-Chiun Wang3 1
Department of Industrial Engineering and Management, Ching Yun University, Jung-Li 320, Taoyuan, Taiwan, R.O.C.
[email protected] 2 Department of Statistics and Information Science, Fu Jen Catholic University, Hsinchuang, Taipei County 242, Taiwan, R.O.C.
[email protected] 3 Graduate Institute of Applied Statistics, Fu Jen Catholic University, Hsinchuang, Taipei County 242, Taiwan, R.O.C.
Abstract. Effective recognition of control chart patterns (CCPs) is an important issue since abnormal patterns exhibited in control charts can be associated with certain assignable causes adversely affecting the process. Most of the existing studies assumed that the observed process data to be recognized are basic types of abnormal CCPs. However, in practical situations, the observed process data could be mixture patterns in which two basic CCPs are mixed together. In this study, a hybrid scheme using independent component analysis (ICA) and support vector machine (SVM) is proposed for CCP recognition. The proposed hybrid ICA-SVM scheme first applies ICA to the mixture patterns to generate independent components (ICs). The hidden basic patterns of the mixture patterns can be discovered in these ICs. The ICs are then used as the input variables of the SVM for building the CCP recognition model. Experimental results revealed that the proposed scheme is able to effectively recognize mixture control chart patterns. Keywords: Control chart pattern, Independent component analysis, Support vector machine, Pattern recognition.
1 Introduction
Statistical process control (SPC) is one of the most used techniques to monitor and improve the quality of a process. Control charts are effective SPC tools which can help reduce variation in manufactured products and increase competitiveness by improving product quality. A process is out-of-control when a point falls outside the control limits or the control charts exhibit unnatural/abnormal patterns [1]. Effective recognition of control chart patterns (CCPs) is an important issue in SPC since unnatural CCPs can be associated with specific assignable causes adversely affecting the process. In general, eight basic CCPs are commonly exhibited in control charts
including normal (NOR), stratification (STA), systematic (SYS), cyclic (CYC), increasing trend (UT), decreasing trend (DT), upward shift (US) and downward shift (DS) [2][3]. There have been many studies on control chart pattern recognition [3][4][5]. Most of the existing studies were concerned with the recognition of single abnormal control chart patterns. That is, they assumed that the observed process data to be recognized are basic types of abnormal CCPs [3][4]. However, in most real control chart applications, the observed process data may be mixture patterns in which two patterns exist together. Only a few studies have been reported on the recognition of mixture process patterns [5]. Without loss of generality, Fig. 1 shows five mixture CCPs, each mixed from one basic abnormal pattern and the natural pattern. It can be observed from Fig. 1 that the mixture CCPs are difficult to recognize. Consequently, how to effectively identify mixture CCPs is an important and challenging task.
[Fig. 1 shows five example time-series windows (samples 1–29), one for each mixture pattern: (a) STA, (b) SYS, (c) CYC, (d) UT, (e) US.]
Fig. 1. Mixture CCPs: (a) Stratification+ Normal, (b) Systematic+Normal, (c) Cyclic+Normal, (d) Increasing Trend+Normal, (e) Upward shift+Normal.
In this study, an integrated independent component analysis (ICA) and support vector machine (SVM) scheme, called the ICA-SVM model, is proposed for identifying mixture CCPs. ICA is a novel feature extraction technique that aims at recovering independent sources from their mixtures, without knowing the mixing procedure or any specific knowledge of the sources [6]. SVM, based on statistical learning theory, is a novel neural network algorithm [7]. The proposed ICA-SVM scheme first applies ICA to the mixture patterns to generate independent components. The estimated ICs then serve as the independent sources of the mixture patterns, and the hidden basic patterns of the mixture patterns can be discovered in these ICs. The ICs are then used as the input variables of the SVM for building the CCP recognition model.
2 Methodology
2.1 Independent Component Analysis
In the basic conceptual framework of the ICA algorithm [6], it is assumed that m measured variables x = [x_1, x_2, ..., x_m]^T can be expressed as linear combinations of n unknown latent source components s = [s_1, s_2, ..., s_n]^T:
\[ \mathbf{x} = \sum_{j=1}^{n} \mathbf{a}_j s_j = \mathbf{A}\mathbf{s} \qquad (1) \]
where a_j is the j-th row of the unknown mixing matrix A. Here, we assume m ≥ n so that A is a full-rank matrix. The vector s is the latent source data that cannot be directly observed from the observed mixture data x. ICA aims to estimate the latent source components s and the unknown mixing matrix A from x under appropriate assumptions on the statistical properties of the source distribution. Thus, the ICA model intends to find a de-mixing matrix W such that
\[ \mathbf{y} = \sum_{j=1}^{n} \mathbf{w}_j x_j = \mathbf{W}\mathbf{x} \qquad (2) \]
where y = [y_1, y_2, ..., y_n]^T is the independent component vector. The elements of y must be statistically independent and are called independent components (ICs). The ICs are used to estimate the source components s_j. The vector w_j in (2) is the j-th row of the de-mixing matrix W. ICA modeling is formulated as an optimization problem by setting up a measure of the independence of the ICs as an objective function and then using optimization techniques to solve for the de-mixing matrix W. Several existing algorithms can be used for performing ICA modeling [6]. In this study, the FastICA algorithm proposed by [6] is adopted.
2.2 Support Vector Machine
The basic idea of applying SVM to pattern recognition can be stated briefly as follows. We initially map the input vectors into a feature space (possibly of higher dimension), either linearly or non-linearly, according to the selected kernel function. Then, within this feature space, we seek an optimized linear division; that is, we construct a hyperplane which separates two classes (this can be extended to multiple classes). A brief description of the SVM algorithm follows. Let \( \{(\mathbf{x}_i, y_i)\}_{i=1}^{N} \), \( \mathbf{x}_i \in \mathbb{R}^d \), \( y_i \in \{-1, 1\} \) be the training set of input vectors and labels. Here, N is the number of sample observations, d is the dimension of each observation, and y_i is the known target. The algorithm seeks the hyperplane w·x_i + b = 0, where w is the normal vector of the hyperplane and b is a bias term, that separates the data of the two classes with maximal margin width 2/||w||; the points on the boundary are called support vectors. To obtain the optimal hyperplane, the SVM solves the following optimization problem [7]:
\[ \min\ \Phi(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.}\quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1,\ i = 1, 2, \ldots, N \qquad (3) \]
It is difficult to solve (3) directly; the optimization problem must be transformed into its dual problem by the Lagrange method.
The values of α in the Lagrange method must be non-negative real coefficients. Problem (3) is then transformed into the following constrained form:
\[ \max\ \Phi(\mathbf{w}, b, \xi, \alpha, \beta) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1, j=1}^{N}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \qquad (4) \]
\[ \text{s.t.}\quad \sum_{j=1}^{N}\alpha_j y_j = 0, \qquad 0 \le \alpha_i \le C,\ i = 1, 2, \ldots, N \]
In (4), C is the penalty factor and determines the degree of penalty assigned to an error. It can be viewed as a tuning parameter which controls the trade-off between maximizing the margin and minimizing the classification error. In general, a linearly separating hyperplane cannot be found for all application data. For non-linear data, the best solution is to transform the original data into a higher-dimensional space in which linear separation is possible. This higher-dimensional space is called the feature space, and it improves the separability of the data for classification. Common kernel functions are the linear, polynomial, radial basis function (RBF) and sigmoid kernels. Although several choices for the kernel function are available, the most widely used kernel function is the RBF kernel defined as \( K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma\|\mathbf{x}_i - \mathbf{x}_j\|^2) \), γ ≥ 0 [8], where γ denotes the
430
C.-J. Lu et al.
trained SVM model is used to recognize the pattern exhibited in the IC. According to the SVM results, the process monitoring task is conducted to identify which basic patterns are exhibited in the process.
0
20
40
60
80
100
0
20
(a)
40
60
80
(b)
100
0
20
40
60
80
100
0
20
(c)
40
60
80
100
(d)
Fig. 2. (a) and (b) the observed data mixed by normal and systematic patterns; (c) the IC represents normal pattern; (d) the IC represents systematic pattern
4 Experimental Results In this study, eight basic CCPs and five mixture CCPs (as shown in Fig. 1) are used for training and testing the proposed ICA-SVM scheme. The eight basic patterns are generated using the same equations and values of different pattern parameters, as used by [3]. The values of different parameters for abnormal patterns are randomly varied in a uniform manner between the limits. It is assumed that, in the current approach for pattern generation, all the patterns in an observation window are complete. The observation window used in this study is 24 data points. For generating five mixture patterns, the models proposed by [5] are used. The proposed ICA-SVM model directly uses the 24 data pints of observation window as inputs of the SVM model. In order to demonstrate the performance the proposed ICA-SVM scheme, the single SVM models without using ICA as preprocessing is constructed, called SVM-D. After using the grid search method to the two models, the best parameter sets for the ICA-SVM and SVM-D models are (C=21, γ =21) and (C=22, γ =22), respectively. The testing results of the ICA-SVM and SVM-D models are respectively illustrated in Tables 1 and 2. Observing the Tables, it can be found that the average correct classification rates of the ICA-SVM and SVM-D models are 83.16% and 64.29%, respectively. The proposed model has the best recognition performance. Therefore, the proposed scheme can effectively recognize mixture control chart patterns. Table 1. Confusion matrix of testing result using the ICA-SVM model True pattern class NOR STA SYS CYC Trend Shift Average
NOR 51.95% 50.65% 0.00% 0.00% 0.00% 0.00%
STA 47.85% 49.00% 0.00% 0.00% 0.00% 0.00%
Identified patterns class SYS CYC 0.00% 0.00% 0.00% 0.00% 100.00% 0.00% 0.00% 100.00% 0.00% 0.00% 0.00% 0.00% 83.16%
Trend 0.20% 0.35% 0.00% 0.00% 98.00% 0.00%
Shift 0.00% 0.00% 0.00% 0.00% 2.00% 100.00%
Recognizing Mixture Control Chart Patterns with Independent Component Analysis
431
Table 2. Confusion matrix of testing result using the SVM-D model True pattern class NOR STA SYS CYC Trend Shift Average
NOR 99.25% 11.40% 24.45% 34.85% 45.85% 9.65%
STA 0.15% 88.60% 0.00% 0.00% 0.15% 0.00%
Identified patterns class SYS CYC 0.00% 0.00% 0.00% 0.00% 74.20% 0.00% 0.00% 64.90% 0.00% 0.00% 0.00% 0.00% 64.29%
Trend 0.60% 0.00% 1.35% 0.25% 54.00% 85.55%
Shift 0.00% 0.00% 0.00% 0.00% 0.00% 4.80%
5 Conclusion
Mixture CCPs, mixed from two types of basic CCPs, commonly arise in real manufacturing processes. However, most existing studies consider only the recognition of single abnormal control chart patterns. In this study, a hybrid scheme integrating ICA and SVM is proposed for recognizing mixture CCPs. The proposed scheme first applies ICA to the mixture patterns to generate ICs. Then, the SVM model is applied to each IC for pattern recognition. Five mixture CCPs are used in this study for evaluating the performance of the proposed method. Experimental results showed that the proposed ICA-SVM scheme produces the highest average correct classification rate on the testing datasets, outperforming the comparison method used in this study. According to the experimental results, it can be concluded that the proposed scheme can effectively recognize mixture control chart patterns.
References

1. Montgomery, D.C.: Introduction to Statistical Quality Control. John Wiley & Sons, New York (2001)
2. Western Electric: Statistical Quality Control Handbook. Western Electric Company, Indianapolis (1958)
3. Gauri, S.K., Chakraborty, S.: Recognition of Control Chart Patterns Using Improved Selection of Features. Computers & Industrial Engineering 56, 1577–1588 (2009)
4. Assaleh, K., Al-assaf, Y.: Feature Extraction and Analysis for Classifying Causable Patterns in Control Charts. Computers & Industrial Engineering 49, 168–181 (2005)
5. Wang, C.H., Dong, T.P., Kuo, W.: A Hybrid Approach for Identification of Concurrent Control Chart Patterns. Journal of Intelligent Manufacturing 20, 409–419 (2009)
6. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley & Sons, New York (2001)
7. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Berlin (2000)
8. Hsu, C.W., Chang, C.C., Lin, C.J.: A Practical Guide to Support Vector Classification. Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan (2003)
9. Hsu, C.W., Lin, C.J.: A Comparison of Methods for Multiclass Support Vector Machines. IEEE Transactions on Neural Networks 13, 415–425 (2002)
Application of Rough Fuzzy Neural Network in Iron Ore Import Risk Early-Warning YunBing Hou and Juan Yang College of Resources and Safety Engineering, China University of Mining and Technology (Beijing) 100012 Beijing, China
[email protected],
[email protected]
Abstract. This paper identifies the factors for iron ore import risk early warning. Rough set theory is applied to reduce the Chinese iron ore import risk factors and the test data, and a rough fuzzy neural network model of iron ore import risk assessment is constructed. The model is trained on monthly data from January 2004 to December 2008 and is then used to forecast 10 groups (January–October 2009) of iron ore import risk early-warning indicators; the predictions are compared with the actual test results and the errors are analysed. Keywords: rough set, fuzzy neural network, iron ore import risk, early-warning.
1 Introduction

The iron ore import process is non-linear, time-varying and uncertain in its characteristics, so models built with general approaches often have large errors and cannot fully reflect the characteristics of the process. Rough set theory is a data analysis theory proposed by the Polish mathematician Z. Pawlak [1] in 1982. A rough fuzzy neural network system uses rough set methods to preprocess the input information of the neural network: knowledge reduction is used to extract the most important information and to reduce the number of attributes in the information representation, and the fuzzy neural network is then constructed and trained on the reduced information. This effectively avoids the subjectivity and blindness of structure selection in neural networks.
2 Rough Fuzzy Neural Network

2.1 Theory

Rough set (RS) theory, as an analysis tool for dealing with imprecise, uncertain and fuzzy knowledge, has been a research hotspot in recent years. For example, Cho and Lee, and Hong and Wang, studied how to use the RS model for the automatic acquisition of the fuzzy rules and fuzzy membership functions of a fuzzy model [2]. Felix and Ushio used RS technology combined with the GA (genetic algorithm) method to obtain the minimum rules from a contradictory, incomplete database [3],
[Flow chart: the training and test data sets are discretized (continuous attributes), a decision table is constructed, attribute reduction yields the minimum condition attribute set and the minimum rule set together with the training and testing sample sets, the rough fuzzy neural network is constructed, and the simulation prediction is run.]
Fig. 1. Rough fuzzy neural network construction process
to acquire the intrinsic characteristics of information systems. Jagielska et al. systematically studied the application of NNs, fuzzy logic, GAs and RS to automatic knowledge acquisition. The use of RS to obtain the rules of a fuzzy neural network is therefore well worth researching. By using RS to process the input data of the fuzzy neural network, this paper obtains the minimum rule set, improves the learning speed and reduces the error.

2.2 Steps

The establishment of a rough fuzzy neural network requires the following steps:

1) Sample data processing. Rough set rule acquisition works only on discrete data, so at the beginning of building the network the sample data tables must be discretized. In this paper a fuzzy partition is used to discretize the data and to determine the membership function of each class. In a traditional fuzzy inference system the initial fuzzy partition is evenly distributed; this is blind and is not conducive to rough set rule acquisition and reduction, so a more accurate fuzzy partition is needed. The Kohonen self-organizing network (Kohonen feature map) is a competition-based neural network for data clustering. In this paper it is used to process the data, which yields a division of the data domain of discourse and the membership function of each class, from which the membership of each sample in the corresponding classes is calculated. Each data item is replaced by the class for which its membership is largest among all membership functions; in this way the condition attribute values and the decision value are obtained. After this processing, the data sheet is converted into an information table, which is the foundation for the following rules.

2) Establishing the rule table. The traditional fuzzy rule table has two features: a) it accumulates expertise to form the rules, but uncertainties inevitably arise when the knowledge is obtained from specialists or operators.
b) Each rule is a combination of fuzzy subsets over each input dimension, and the rule base is complete: for an n-dimensional input vector in which each input has m fuzzy subsets in its partition, the number of rules is m^n, and in practice far fewer rules are needed. The discernibility matrix method of rough sets is therefore used to reduce the information table, to eliminate identical rules, and then to simplify the unnecessary attribute values of each rule. This yields a minimal rule table, which replaces the corresponding parts of the fuzzy neural network and thus improves the computing speed of the network.

3) Mapping to the network. The structure of the rough fuzzy neural network (RFNN) is shown in Fig. 2.
[Figure: data preprocessing of the inputs x_i by fuzzy c-means clustering, followed by the network layers: input layer, fuzzy layer, rules layer, compensation layer and output layer producing y.]
Fig. 2. Structure of RFNN model
The network is a multi-input, single-output five-layer system with three hidden layers and a supervised learning process. The first layer is the input layer with input $X = (x_1, x_2, \ldots, x_n)^T$, where $n$ is the number of input variables. Layer 2 is the fuzzy layer; its role is to compute the membership functions $\mu_{ij} = \mu_{A_i^j}(x_i)$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m_i$, i.e. the degree to which each input component belongs to the fuzzy sets of all its linguistic variable values, where $n$ is the input dimension and $m_i$ is the number of fuzzy partitions of $x_i$. In this article all membership functions are Gaussian, $\mu_i^j = \exp\!\left[-\left(\frac{x_i - c_{ij}}{\sigma_{ij}}\right)^2\right]$, and the total number of nodes in this layer is $N_2 = \sum_{i=1}^{n} m_i$. Each node in Layer 3 represents a fuzzy rule; its neurons match the antecedents of the rule table reduced by rough sets and compute the firing strength of each rule, namely $\alpha_j = \min\left(\mu_1^{i_1}, \mu_2^{i_2}, \ldots, \mu_n^{i_n}\right)$, where $i_1 \in \{1, 2, \ldots, m_1\}, \ldots, i_n \in \{1, 2, \ldots, m_n\}$, $j = 1, 2, \ldots, m$, $m = \prod_{i=1}^{n} m_i$, and the number of nodes in this layer is $N_3 = m$. The neurons in Layer 4 match the consequents of the rule table reduced by rough sets and perform the decision making of this layer, with the formula $q_i^j = \sum_{k=1}^{p} \alpha_j^k \big/ \sum_{i=1}^{m'} \sum_{k=1}^{p} \alpha_j^k$, where $i \in \{1, 2, \ldots, m'\}$, $k \in \{1, 2, \ldots, p\}$, $j = 1, 2, \ldots, m$, and the number of nodes in this layer is $m'$. Layer 5 is the output layer, which performs the defuzzification computation $y = \sum_{i=1}^{m'} w_i q_i^j$. It can be seen that a fuzzy neural network has many nodes in its hidden layers.
With more input variables, the number of nodes grows rapidly and greatly reduces the computing speed of the network. In this paper, the rough fuzzy neural network model is programmed and solved with the neural network toolbox functions in the MATLAB 7.0 environment [8]. After normalization and fuzzy clustering, China's iron ore import risk indicators and the risk assessment level are used as the sample inputs and the network output, respectively. By continuously learning and adjusting its weights, the network identifies the complex internal correspondence between the evaluation indexes and the evaluation levels, so that the model can comprehensively evaluate the risk of China's iron ore imports.
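The layer computations described above can be illustrated with a small NumPy sketch of a single forward pass. This is only a schematic reading of the formulas (Gaussian memberships, min-composition of the rule antecedents, a simplified normalization in place of the paper's double sum, and a weighted output); it assumes an equal number of fuzzy sets per input, and the centre, width and weight values in the toy example are hypothetical, not the trained model from the paper.

```python
import numpy as np

def rfnn_forward(x, centers, sigmas, rule_index, weights):
    """One forward pass through the five-layer RFNN sketched above.

    x          : input vector, shape (n,)
    centers    : centres c_ij of the Gaussian memberships, shape (n, m)
    sigmas     : widths sigma_ij, same shape as centers
    rule_index : one tuple per retained rule, giving the fuzzy-set index
                 chosen for each input (the reduced rule table)
    weights    : consequent weights w_i, one per retained rule
    """
    # Layer 2: Gaussian memberships mu_ij
    mu = np.exp(-((x[:, None] - centers) / sigmas) ** 2)
    # Layer 3: firing strength of each retained rule (min over its antecedents)
    alpha = np.array([min(mu[i, j] for i, j in enumerate(rule)) for rule in rule_index])
    # Layer 4: normalized firing strengths (simplified consequent matching)
    q = alpha / alpha.sum()
    # Layer 5: defuzzified output
    return float(np.dot(weights, q))

# toy example: 2 inputs, 3 fuzzy sets each, 3 retained rules (all values hypothetical)
centers = np.array([[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]])
sigmas = np.full((2, 3), 0.3)
rules = [(0, 0), (1, 1), (2, 2)]
print(rfnn_forward(np.array([0.4, 0.6]), centers, sigmas, rules, np.array([1.0, 2.0, 3.0])))
```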
3 Experimental Process and Comparative Analysis

For the risk assessment of China's iron ore imports, 70 sets of monthly data on China's iron ore imports, from January 2004 to September 2009, were selected [9]. Owing to limited space, the raw data are not listed. The condition attribute set C is {ED, ER, IP, IO, SP, OP, PI}, where ED and ER stand for external dependence and the exchange rate; IP, IO and SP for import price volatility, iron ore production and steel product price; OP for ocean freight; and PI for port iron ore stocks. The decision attribute set D (RT) is the early-warning level of China's iron ore supply-demand security, as shown in Table 1.

Table 1. Decision attribute reduction rule table
Attribute index                                               Value 0    Value 1         Value 2
Import concentration, %                                       <20        20~60           >60
External dependence, %                                        <20        20~50           >50
Price volatility, %                                           <25        25~55           >55
Exchange fluctuation rate, %                                  <30        30~50           >50
Steel product price, $/ton                                    <100       100~180         >180
Gap between demand and supply, million tons                   <10000     10000~30000     >30000
Domestic transport share, %                                   >70        70~30           <30
Port ore inventories, million tons                            <30000     30000~50000     >50000
Ocean freight (dry bulk transportation price index), $/ton    <100       100~180         >180
3.1 Discretization of Continuous Attributes and Attribute Reduction

From the 70 groups of data, 60 were randomly selected as the testing data set and the remaining 10 as the training data set. The continuous data were discretized with the fuzzy classification method, with discrete attribute values of 1, 2 and 3; the minimal condition attribute set (ED, IP, IO, OP) was obtained through attribute reduction, and the discrete intervals of the attributes are shown in Table 1. In order to prevent attribute values of the test data from falling outside the interval boundaries, the interval boundary values were set on the basis of the minimum and maximum of the historical data combined with expert advice.

3.2 Extraction of Rules and the Training Sample Set

According to the reduction results, by restricting the rough membership degree, 12 deterministic rules are obtained (only partially listed). Based on the deterministic rules, 10 sets were extracted from the 70 data sets as training samples.

3.3 Rough Fuzzy Neural Network Simulation and Forecast

The structure of the rough fuzzy neural network was determined from the extracted rules; the numbers of nodes in layers 1 to 5 are 4, 12, 9, 9 and 3, respectively, and the model was built with the fuzzy toolbox of the MATLAB software. The training sample sets were fed into a BP network and the rough fuzzy neural network for learning and training; for both networks the learning rate and inertia coefficient are 0.001 and 0.5, and the overall error goal is 10^-6. The main properties of the two networks on the training sample set are shown in Table 2. Feeding the testing sample sets into the two trained neural networks produced the results shown in Fig. 3. From the predicted results, the prediction accuracy rate of the rough fuzzy neural network is more than 90%, while that of the BP neural network is 70%. From Fig. 3 and Table 2 it can be seen that, in both training speed and prediction accuracy, the rough fuzzy neural network built in this paper is superior to the general BP neural network.

Table 2. Results of iron ore import risk early-warning in 2009.1-10
No.   RFNN result (output → alarm level)        Actual result   BP result
1     (0.9135 0.2726 0.0000) → (1 0 0): 3       2               3
2     (0.0019 0.9978 0.0000) → (0 1 0): 2       2               2
3     (0.0011 0.9986 0.0000) → (0 1 0): 2       2               2
4     (0.0000 1.0000 0.0000) → (0 1 0): 2       2               2
5     (0.0000 0.9804 0.0000) → (0 1 0): 2       2               2
6     (0.0000 1.0000 0.0000) → (0 1 0): 2       2               1
7     (0.0008 1.0000 0.0026) → (0 1 0): 2       2               2
8     (0.0000 1.0000 0.0006) → (0 1 0): 2       2               2
9     (0.0013 1.0000 0.0000) → (0 1 0): 2       2               2
10    (0.0000 1.0000 0.0000) → (0 1 0): 2       2               3
[MATLAB training curve (training error in blue, goal in black): performance 0.000916071, goal 0.001, reached within 35 epochs.]
Fig. 3. Training error curve
4 Conclusion

The paper identified the factors for iron ore import risk early warning, applied rough set theory to reduce the Chinese iron ore import risk factors and test data, and constructed a rough set fuzzy neural network model of iron ore import risk assessment. The model was trained with monthly data from January 2004 to December 2008 and was then used to forecast 10 groups (January–October 2009) of iron ore import risk assessment indicators and compare them with the actual test results. The following conclusions are reached: (1) The rough set fuzzy neural network is applicable to the risk early-warning of iron ore imports. It can take into account not only quantitative factors of import risk but also fuzzy factors, which is consistent with the fact that iron ore imports are affected by many factors, both risk factors and fuzzy factors. (2) Through attribute reduction, rough set theory can reject the irrelevant attributes of the iron ore import risk early-warning indicators, simplifying the input variables and narrowing the search space of the fuzzy neural network, and thereby improving the prediction performance of the import risk forecast indicators. The results show that the prediction accuracy for a single target is higher than the accuracy of predicting three indicators simultaneously, with an average forecast error of up to 5.7%.

Acknowledgments. Supported by the Chinese National Natural Science Foundation (50774088).
References

1. Kubo, S.: Elevator Group Control System with a Fuzzy Neural Network Model. Elevator Technology. In: International Joint Conference of the Fourth IEEE International Conference on Fuzzy Systems and the Second International Fuzzy Engineering Symposium, pp. 11–20. IEEE Press, New York (1995)
2. ChunHua, L.: China's Oil Import Risk Assessment and Prevention. China University of Mining, Beijing (2009)
3. FanZhen, L.: China's Iron Ore Resources Supply and Demand Situation and Trend Analysis. China University of Mining, Beijing (2009)
4. XuQuan, J.: Ground Vibration Model Testing of Bench Blasting and BP Neural Network Analysis. In: The 7th International Symposium on Rock Fragmentation by Blasting, pp. 591–594. Science & Technology Press, Beijing (2002)
5. Serguieva, A., Hunter, J.: Fuzzy Interval Methods in Investment Risk Appraisal. Fuzzy Sets and Systems 142, 443–466 (2004)
6. Tung, W.L.: A Novel Neural-Fuzzy Based Early Warning System for Predicting Bank Failures. Neural Networks 17, 567–587 (2004)
7. Kosko, B.: Neural Networks and Fuzzy Systems: A Dynamical Systems Approach to Machine Intelligence. Prentice Hall Inc., New Jersey (1992)
8. Demuth, H., Beale, M.: Neural Network Toolbox for Use with MATLAB. The MathWorks 5, 14–17 (2001)
9. National Center for Metal Information, http://www.metalchina.com
Emotion Recognition and Communication for Reducing Second-Language Speaking Anxiety in a Web-Based One-to-One Synchronous Learning Environment Chih-Ming Chen1 and Chin-Ming Hong2 1
Graduate Institute of Library, Information and Archival Studies, National Chengchi University, Taipei Taiwan, R.O.C.
[email protected] 2 Department of Applied Electronics Technology, National Taiwan Normal University, Taipei Taiwan, R.O.C.
[email protected]
Abstract. In the language education field, many studies have investigated anxiety associated with learning a second language, noting that anxiety has an adverse effect on the performance of those speaking English as a second language. Accordingly, this study develops an embedded human-emotion-recognition system based on human pulse signals for detecting three human emotions— nervousness, peace, and joy—to help teachers reduce language-learning anxiety of individual learners in a one-to-one synchronous Web-based learning environment. Experimental results indicate that the proposed embedded humanemotion-recognition system is helpful in reducing language-based anxiety and promotes instruction effectiveness in English conversation classes. Keywords: Emotion recognition, Language-learning anxiety, E-learning, Physiological signals.
1 Introduction Many education scholars have pointed out that emotions are directly related to and affect learning performance [1][2][3][4]. Emotions can affect attention, the creation of meaning, and formation of memory channels. Hence, emotional status and learning are strongly related [5]. To interact with students effectively in an educational context, teachers often try to gain insight into “invisible” human emotions and thoughts. In a learning scenario, teachers who make correct judgments about the emotional status of students can improve the effectiveness of their interactions with students. However, recognizing learner emotions in an E-learning environment is extremely challenging. Additionally, researchers, language teachers, and even language learners have been interested in how anxiety inhibits language learning. Horwitz et al. [6] argued that anxiety associated with language learning is a specific anxiety rather than a trait anxiety. They termed this specific anxiety related to language learning Foreign Language Anxiety, which is manifested in student experiences in language classes. They developed a novel instrument, the Foreign Language Classroom Anxiety Scale
(FLCAS) [6][7], to measure this anxiety. Foreign Language Anxiety is defined by some researches as “a feeling of tension, apprehension and nervousness associated with the situation of learning a foreign language” [8]. Horwitz [9], who also discussed the relationship between anxiety and second-language learning, concluded that anxiety in some individuals is a cause of poor language learning. Horwitz also discussed the possible sources of this anxiety, including difficulty in authentic self-presentation and various language teaching practices. Horwitz indicated that people who have poor language-learning abilities will experience foreign language anxiety. Woodrow [10] surveyed a great deal of research into second- or foreign-language anxiety over the past two decades. Woodrow claimed that anxiety has an adverse effect on the language learning process. Therefore, this study proposes an embedded human-emotion-recognition system that has an affective display interface that immediately identifies the emotional states of individual learners to help English teachers in comprehending language-learning anxiety in a Web-based one-to-one synchronous learning environment. The proposed system uses support vector machines (SVMs) to construct a human-emotionrecognition model that can identify three emotions—peace, nervousness, and joy— based on emotion features extracted from frequency-domain human-pulse signals. The experiment for second-language speaking anxiety confirms that the proposed embedded human-emotion-recognition system is helpful in reducing languagelearning anxiety when teachers can provide appropriate learning assistance or guidance based on the emotional states of individual learners.
2 System Design This section introduces the embedded human-emotion-recognition system for reducing learner second-language speaking anxiety. 2.1 System Architecture The proposed embedded emotion-recognition system is composed of three parts—the module for measuring pulse, the Advanced RISC Machine (ARM) embedded platform, and human-emotion-recognition module—implemented on a remote computer server. The module for measuring pulse and the Advanced RISC Machine (ARM) embedded platform consists of an embedded human-pulse-detection system, which can accurately measure and transfer human pulse to the human-emotion-recognition module for identifying learner emotions. Figure 1 presents the system architecture of the proposed system. The module that measures human pulse (bottom-left portion of Fig. 1) senses pulse signals via a piezoelectric sensor and performs signal preprocessing to filter out noise. To transfer these pulse signals to the embedded system, analog pulse signals must be first transformed into digital signals by an analog-to-digital converter (ADC). These digital signals are then transmitted to the embedded system through a serial transmission device using a predetermined baud rate. The upper-left portion of Fig. 1 presents the embedded system with serial transmission and wireless local area network (WLAN) interfaces. When the collected pulse data meets a predetermined amount, the embedded system transmits pulse data to the remote computer
server via wireless communication. The remote server then stores the pulse data in a human physiology database. The right side of Fig. 1 shows the details of the proposed human-emotion-recognition module that assists teachers in reducing speaking anxiety of individual learners. The emotion-recognition module consists of Fast Fourier Transform (FFT) software [11], a library for support vector machines (LIBSVM) [12], which is a support vector machine tool library, and the Web meeting system JoinNet (JoinNet). JoinNet was employed as a speaking training system supported by the proposed embedded human-emotion-recognition system to aid teachers in reducing student language-learning anxiety in a Web-based English-speaking instructional environment.
Fig. 1. The system architecture of the proposed embedded human emotion recognition system for supporting teachers to reduce learner’s language speaking anxiety
2.2 The Proposed Human-Emotion-Recognition Scheme Based on Human Pulse Signals The human-emotion-recognition module is composed of FFT software [11], LIBSVM [12], and JoinNet (right side of Fig. 1). The FFT is used to transform time-domain human pulse signals into frequency-domain pulse signals for extraction of human-emotion features. The LIBSVM, which is an integrated software package for support vector classification and has excellent pattern recognition performance, was applied to construct the human-emotion-recognition model that uses the extracted human-emotion features. Moreover, JoinNet supports instructors teaching English conversation online and facilitates communication with students via audio, video, and text chat. Teachers and students can share and discuss slides, figures, documents, websites, desktops, and even control the PCs or laptops of other students remotely via this system. This study only employed the JoinNet functionality that allows teachers to communicate with learners online via an audio channel, and thereby support English conversation training with the assistance of learner emotion recognition.
Notably, the time and amplitude of pulse signals do not have logically mapping with variations of human emotions. Consequently, this study employed FFT to transform original time-domain pulse signals into frequency-domain signals as frequency-domain signals vary as human emotions vary. The study thus extracted emotion features from frequency-domain pulse signals and employed SVMs to construct a human-emotionrecognition model based on these extracted features. The following section further describes the FFT and SVMs in detail. 2.2.1 Fast Fourier Transform for Human-Emotion Feature Extraction During this FFT, the original pulse signals sensed by the measuring module are approximately transformed into combinations of many sine waves with corresponding frequencies and amplitudes. In the FFT, the corresponding amplitude of each sine wave represents a feature weight. Based on experimental results, the feature weights of amplitudes vary as human emotions vary; that is, different people experiencing the same emotion can obtain similar feature weight combinations of sine waves. Therefore, these feature weight combinations of sine waves derived from an emotion can serve as emotion features when constructing a human-emotion-recognition model using machine-learning models. To promote the identification of human emotions, this study adopted Fastest Fourier Transform in the West (FFTW) [11], which is the fastest FFT software and uses C language application programming interfaces (APIs), to transform the original time-domain pulse signals into frequency-domain pulse signals. 2.2.2 SVM for Constructing the Human Emotion Recognition Model The extracted emotion features based on the FFT served as input features to train the human-emotion-recognition model using SVMs for predicting emotional variations of individual learners during learning. These SVMs are supervised learning models that have been widely applied in pattern classification and regression [13][14]. The main consideration in employing SVMs to construct an emotion-recognition model is that human pulse signals contain large feature dimensions and a considerable amount of noise. The SVMs are good at solving such problems and are superior to other statistics-based machine-learning methods [15]. Moreover, many studies have adopted LIBSVM, which is an integrated software package for support vector classification, as tool because it rapidly analyzes data and supports multiple programming languages and platforms. Notably, LIBSVM can also automatically determine SVM parameters, including kernel function (default is the radial basis function) and parameters using the grid parameter search approach [12]. In this study, human pulse signals with corresponding emotion features obtained through the transformation of the FFTW serve as training data for LIBSVM to build a recognition model of human emotions. Furthermore, the cross-validation scheme was used in this study to assess the forecasting accuracy rate of human-emotion recognition. The human-emotion-recognition model can help teachers immediately offer feedback based on the emotions of students during learning processes.
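A minimal sketch of the feature-extraction and classification pipeline described in Sections 2.2.1–2.2.2 might look as follows. It uses NumPy's FFT and scikit-learn's SVC with cross-validation as stand-ins for FFTW and LIBSVM; the choice of 26 spectral amplitudes mirrors the feature count reported in the experiments, and the function and variable names are illustrative assumptions, not the system's actual interfaces.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

N_FEATURES = 26  # number of spectral emotion features, as in the experiments

def emotion_features(pulse_window):
    """Amplitudes of the lowest-frequency FFT bins of one pulse recording."""
    spectrum = np.abs(np.fft.rfft(pulse_window - np.mean(pulse_window)))
    return spectrum[1:N_FEATURES + 1]           # drop the DC bin

def train_emotion_model(pulse_windows, labels):
    """pulse_windows: equal-length pulse recordings; labels: 0=peace, 1=nervousness, 2=joy."""
    X = np.array([emotion_features(w) for w in pulse_windows])
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    accuracy = cross_val_score(model, X, labels, cv=5).mean()   # cross-validated accuracy
    model.fit(X, labels)
    return model, accuracy
```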
3 Experiments 3.1 Constructing the Human-Emotion-Recognition Model Using Support Vector Machines To obtain emotion features for constructing a human-emotion-recognition model that can identify the emotional states of peace, joy, and nervousness based on SVMs, this study utilized online films and PC games to elicit these human emotions. The joy experienced some volunteers was generated by a funny film. The nervousness of some volunteers was generated by computer game in which subjects shot moving objects. Similarly, the peace experienced by some volunteers was in response to an online film presenting a clear blue sky accompanied by soft music. This study logically assumes that the online film or computer game elicited specific human emotions. After collecting pulse data, the data were transformed by FFTW from the time domain to the frequency domain for extraction of emotion features. After FFTW transformation, each extracted pulse signal contained 26 emotion features that were normalized to the assigned data format of LIBSVM for constructing the emotionrecognition model. The entire forecasting accuracy rate of simultaneously experiencing the three emotions was 79.7136% under the automatically determined LIBSVM parameters. 3.2 Online English-Speaking Training Supported by the Developed Embedded Human-Emotion-Recognition System in a Web-Based One-to-One Synchronous Learning Environment The developed embedded human-emotion-recognition system was applied to support online English speaking training in a Web-based one-to-one synchronous learning environment. This study recruited four students and one English teacher from a senior high school in Taiwan to take part in the experiment. To identify the learning status of students, these four students were numbered No. 1–4. Among these four participants, the emotions of learners Nos. 1 and 2 were not conveyed immediately to the teacher during previous online English-speaking training; however, their emotional variations over time were automatically recorded by the embedded human-emotion-recognition system. Therefore, the teacher had no emotion reference while teaching English speaking to learners Nos. 1 and 2. Moreover, the emotion variations of learners Nos. 3 and 4 were recorded by the embedded human-emotion-recognition system and conveyed to the teacher’s computer monitor during learning processes. Before the online English-speaking instruction started, all learners filled out two pretest questionnaires about the anxiety and nervousness they experience during English learning—the FLCAS [6][7] and Anxiety Toward In-Class Activities Questionnaire (ATIAQ) [16]. Students also filled out these two questionnaires after finishing the online Englishspeaking training. After the pretest, each learner and the teacher were asked to wear an earmicrophone, sit in two different language-learning rooms and spoke to each other in English via the JoinNet audio channel. The English content was planned beforehand by the English teacher to ensure that individual leaner emotions can be elicited during speaking training. Additionally, students Nos. 3 and 4 wore a human pulse sensor on a
finger of their left hands for sensing learner emotions during learning processes. Figure 2 shows learner No. 3 wearing a human pulse sensor and ear-microphone during English-speaking learning with the teacher via JoinNet. All English-speaking sessions with the teacher were recorded. The speaking processes of learners Nos. 3 and 4 were also displayed on the teacher’s computer screen with the interface displaying learner emotions. Figure 3 shows the teacher instructing learner No. 3 in English speaking through JoinNet with the assistance of the human-emotion-recognition system. Table 1 presents the statistics for the emotions of the four learners during an English speaking session with the English teacher. Learner No. 4 experienced nervousness more often than the other three learners during learning. Additionally, students Nos. 1 and 3, which had good English speaking skills, had lower percentages of nervousness than students Nos. 2 and 4, who had poor English speaking skills. In Fig. 4, the places with circles represent changes from nervousness to peacefulness due to teacher feedback or guidance to reduce learner anxiety in response to individual learner emotions. Tables 2 and 3 show the pretest and posttest results for the FLCAS and ATIAQ filled out by the four learners, respectively. The test scores on the FLCAS and ATIAQ indicate the degree of nervousness; thus, a high test score represents a high degree of nervousness. Since each learner conversed in English with the teacher for about 10 minutes in the experiment, the pretest result can be viewed as self-known language-learning anxiety, and the posttest result will be close to real language learning anxiety. The FLCAS and ATIAQ have 25 and 15 items, respectively. Responses to each item are on a 5-point Likert scale. Notably, subject scores for items structured in a negative form must be reversed when computing the total score. Compared to the FLCAS pretest, this study found that all learners had lower scores on the posttest FLCAS. Restated, the anxieties of all learners when speaking English decreased with or without the support of the embedded human-emotion-recognition system. In particular, students Nos. 1 and 3, who have good English speaking skills, had lower test scores on the FLCAS than students Nos. 2 and 4, who have poor English speaking skills. Moreover, no difference existed in pretest and posttest scores in the ATIAQ for learner No. 4. Conversely, the posttest ATIAQ scores for learners Nos. 1, 2, and 3 were lower than their respective ATIAQ pretest scores.
Fig. 2. Learner No. 3 wearing the human pulse sensor and ear-microphone for English-speaking learning with a teacher through JoinNet
Fig. 3. The teacher guiding learner No. 3's English speaking through JoinNet with the assistance of the human-emotion-recognition system
Table 1. Statistics of the four learners' emotion variations during the English-speaking training processes

Learner   Total seconds   Peaceful (%)   Nervous (%)   Counts of nerve   Frequency of nerve (cpm)
No.1      665 s           79.7%          20.3%         28                2.53
No.2      650 s           68.5%          31.5%         37                3.67
No.3      612.5 s         79.7%          20.3%         25                2.49
No.4      800 s           47.8%          52.2%         52                3.9
Fig. 4. The emotion variations of the No.3 learner with time during the English conversation training
Table 2. The pretest and posttest of FLCAS filled out by the four learners
Learner   Pretest   Posttest   Difference
No.1      97        81         16
No.2      91        89         2
No.3      89        80         9
No.4      99        95         4
Table 3. The pretest and posttest of ATIAQ filled out by the four learners
Learner   Pretest   Posttest   Difference
No.1      39        31         8
No.2      47        41         6
No.3      34        27         7
No.4      46        46         0
4 Conclusions This study presents an embedded human-emotion-recognition system that supports communication tailored to emotions in Web-based learning environments. The forecasting accuracy rate of the proposed human-emotion-recognition system evaluated by cross validation is 79.7136% based on the proposed emotion feature-extraction scheme. The forecasting accuracy rate is sufficient to support teachers in immediately understanding the emotions of individual learners during learning processes. Additionally, this study also applied the proposed embedded human-emotion-recognition system to support teachers in reducing the anxiety of English learners while speaking in a Web-based one-to-one synchronous learning environment. The teacher acknowledged that the embedded human-emotion-recognition system could aid in her comprehension of student emotions, thus allowing her to provide appropriate learning feedback or guidance in a Web-based one-to-one synchronous learning environment.
References

1. Goleman, D.: Emotional intelligence. Bantam Books, New York (1995)
2. Piaget, J.: Les relations entre l'intelligence et l'affectivité dans le développement de l'enfant. In: Rimé, B., Scherer, K. (eds.) Les Émotions. Textes de base en psychologie, pp. 75–95. Delachaux et Niestlé, Paris (1989)
3. Vygotsky, L.: The problem of the environment. In: van der Veer, R., Valsiner, J. (eds.) The Vygotsky Reader. Blackwell, Oxford (1994)
4. John-Steiner, V.: Creative collaborations. Oxford University Press, Oxford (2000)
5. LeDoux, J.: Emotion, memory, and the brain. Scientific American 270, 50–57 (1994)
6. Horwitz, E.K.: Preliminary evidence for the reliability and validity of a foreign language anxiety scale. TESOL Quarterly 20(3), 559–562 (1986)
7. Horwitz, E.K., Horwitz, M.B., Cope, J.: Foreign language classroom anxiety. The Modern Language Journal 70(2), 125–132 (1986)
8. Ozcan, C.: Anxiety in learning a language – part I (Retrieved), http://www.eslteachersboard.com/cgi-bin/articles/index.pl?noframes;page=4;read=2611
9. Horwitz, E.K.: Language anxiety and achievement. Annual Review of Applied Linguistics 21, 112–126 (2001)
10. Woodrow, L.: Anxiety and speaking English as a second language. RELC Journal 37(3), 308–328 (2006)
11. Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proceedings of the IEEE 93(2), 216–231 (2005); invited paper, Special Issue on Program Generation, Optimization, and Platform Adaptation
12. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
13. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: 5th Annual ACM Workshop on COLT, pp. 144–152 (1992)
14. Vapnik, V.N.: The support vector method of function estimation. In: Suykens, J., Vandewalle, J. (eds.) Nonlinear Modeling: Advanced Black-Box Techniques, pp. 55–86. Kluwer Academic Publishers, Boston (1998)
15. Noble, W.S.: Support vector machine applications in computational biology. A survey article of Dept. of Genome Sciences, University of Washington (2003)
16. Young, D.J.: An investigation of students' perspectives on anxiety and speaking. Foreign Language Annals 23(6), 539–553 (1990)
A New Short-Term Load Forecasting Model of Power System Based on HHT and ANN Zhigang Liu, Weili Bai, and Gang Chen School of Elec. Eng., Southwest Jiaotong University, Chengdu 610031, Sichuan, China
[email protected]
Abstract. Addressing the disadvantages of short-term load forecasting with HHT, such as mode mixing and the random component, a new short-term load forecasting model based on HHT and ANN is proposed. A first-order difference algorithm is adopted to eliminate mode mixing. The random component is forecast with different methods, including BP, RBF, SVM, a linear combination and an ANN combination; the other components are forecast with appropriate methods. The simulation results show that higher accuracy of short-term load forecasting can be obtained. Keywords: Short-Term Load Forecasting, HHT, ANN, Combination Model.
1 Introduction

During the past years, a wide variety of techniques have been tried on the load forecasting problem, most of which are based on time-series analysis. Time-series models mainly include approaches based on statistical methods and artificial neural networks (ANNs) [1]. The statistical models are hard-computing techniques based on an exact model of the system; they include moving average and exponential smoothing methods, linear regression models, stochastic processes, data mining approaches, autoregressive moving average (ARMA) models, and Kalman filtering-based methods [2-6]. In [7-8], the load data is decomposed with EMD (Empirical Mode Decomposition), each component and the remainder are forecast with the same forecasting model, and the forecasting results are added together as the final result. Although the advantages of EMD can improve the forecasting precision, the suitability of the forecasting method for each decomposed component is not considered. In fact, since the components have different features, adopting the same forecasting method for all of them is irrational. In [9], three sequences are constructed from the decomposed components of EMD and three different forecasting methods are adopted, but this plan does not improve the forecasting precision. There are two main reasons. On the one hand, EMD has some serious disadvantages, including the border effect and mode mixing. End swings occur during HHT and can eventually propagate inward and
corrupt the whole data span, especially in the low-frequency components [10]. Mode mixing is due to the EMD algorithm and the sifting process [11]; the sampling frequency, signal components and amplitudes also affect the decomposition result. On the other hand, the characteristics of the load data sequence must be considered carefully. Many experiments show that the forecasting precision of the high-frequency component is the determining factor for the forecasting error of the short-term load.
2 Forecasting Method

In order to make full use of EMD and solve the problems above, a new short-term load forecasting model of power systems based on HHT and ANN is proposed in this paper. To address the mode mixing and the poor forecasting of the high-frequency component with EMD, several algorithms are adopted. The EMD decomposition is improved with a first-order difference algorithm; after the mode mixing is eliminated, a series of IMFs (intrinsic mode functions) and a remainder are obtained. By computing and analysing the frequency spectrum of each component, the low-frequency components can be extracted and reconstructed, and then forecast with a proper model. Because the random components of the load are mainly contained in the first intrinsic mode function (IMF1) and its random fluctuation is high, combination forecasting with an ANN is adopted for it. The proposed forecasting method is listed below in detail (a structural sketch of steps (2)–(7) is given after the list):

(1) The noise and abnormal data in the original data are eliminated with the wavelet transform and an error adjustment criterion.
(2) The load data is decomposed with the improved EMD. IMFs of different frequencies are obtained from high to low frequency. These IMFs represent the local features of the original load sequence, from which the periodic, random and trend components of the load can be clearly found.
(3) Through analysis of the frequency spectrum, the low-frequency components can be extracted. Because the low-frequency components show the long-term change trend of the load, they can be reconstructed into one component, which reduces the number of forecasting steps and the complexity without losing forecasting correctness.
(4) IMF1 mainly represents the random component. Weather, temperature and holidays can greatly influence IMF1, so these extra factors should be considered when forecasting it. In this paper, a BP neural network (BPNN), an RBF neural network (RBFNN), a support vector machine (SVM), a linear combination (COM1) and an ANN combination (COM2) are adopted to forecast IMF1.
(5) Because the other IMFs show strong periodicity, SVM is adopted for their forecasting.
(6) The remainder of the EMD is nearly linear, so a linear ANN is adopted to forecast it.
(7) The final forecasting result is obtained as the sum of the forecasting results of all components.
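The decomposition-and-recombination structure of steps (2)–(7) can be sketched as follows. This outline assumes the third-party PyEMD package for the EMD step and uses scikit-learn regressors as generic stand-ins for the component forecasters, so it illustrates the shape of the scheme rather than the improved first-order-difference EMD or the ANN combination model actually used in the paper.

```python
import numpy as np
from PyEMD import EMD                      # assumed third-party package
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression

LAGS = 20  # each point is forecast from the preceding 20 points (see Section 4)

def lagged(series, lags=LAGS):
    X = np.array([series[i:i + lags] for i in range(len(series) - lags)])
    return X, series[lags:]

def forecast_load(load, horizon=24):
    """Decompose the load, forecast each component recursively, and sum the forecasts."""
    imfs = EMD()(load)                     # IMFs from high to low frequency (last row ~ residue)
    total = np.zeros(horizon)
    for k, comp in enumerate(imfs):
        # last component (trend/residue): linear model; others: SVR
        model = LinearRegression() if k == len(imfs) - 1 else SVR(kernel="rbf")
        X, y = lagged(comp)
        model.fit(X, y)
        history = list(comp[-LAGS:])
        for h in range(horizon):           # recursive multi-step forecast
            nxt = model.predict(np.array(history[-LAGS:]).reshape(1, -1))[0]
            history.append(nxt)
            total[h] += nxt
    return total
```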
3 HHT of Load Sequence

The general power system load consists of the peak load, the middle load and the base load. The main reason for mode mixing when EMD is applied to the load sequence is the great difference between their energies: the peak load data is submerged in the middle and base load data, and the middle load data is hidden in the base load data, similarly to a low-frequency signal carrying weak high-frequency noise. Fig. 1 shows the load data of the Chongqing Grid from July 1st to August 25th, 2006. The decomposition results with EMD are shown in Fig. 2; mode mixing begins to occur in IMF2. The decomposition results with the difference algorithm are shown in Fig. 3, and the instantaneous frequency of each IMF is shown in Fig. 4. The method realizes mode separation effectively.
Fig. 1. Original load data
Fig. 2. The decomposition results with EMD
Fig. 3. The improved results with EMD
Fig. 4. The instantaneous frequency
4 Forecasting Experiments

The instantaneous frequency of each IMF can be computed with the Hilbert transform, which helps to analyse the frequency content of the load sequence. The mean instantaneous frequency of each IMF is listed in Table 1.

Table 1. The mean instantaneous frequency of each IMF

IMF1    IMF2    IMF3    IMF4    IMF5    IMF6    IMF7     IMF8     IMF9
0.29    0.16    0.08    0.04    0.02    0.01    0.007    0.004    0.002
In Table 1, the frequencies of IMF7, IMF8 and IMF9 are low, so they can be considered together as the long-term fluctuation component and redefined as IMF7. With the remainder added, there are 8 components. For IMF2–IMF7, SVM is adopted to forecast each component; for the remainder, a linear ANN is adopted. Each load point is forecast from the preceding 20 points (a small computational sketch of the Hilbert-based frequency analysis is given below). The forecasting results (red line) of IMF2–IMF7 and the remainder are shown in Fig. 5.
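The mean instantaneous frequency behind Table 1 can be obtained from the analytic signal as sketched below. This is only meant to show the computation; it uses SciPy's Hilbert-transform routine and treats the sampling interval as one time unit, which is an assumption rather than a parameter stated in the paper.

```python
import numpy as np
from scipy.signal import hilbert

def mean_instantaneous_frequency(imf, fs=1.0):
    """Mean instantaneous frequency of one IMF (cycles per sample when fs=1)."""
    analytic = hilbert(imf)                           # analytic signal of the IMF
    phase = np.unwrap(np.angle(analytic))             # unwrapped instantaneous phase
    inst_freq = np.diff(phase) * fs / (2.0 * np.pi)   # instantaneous frequency
    return float(np.mean(np.abs(inst_freq)))

# e.g. applied to each row of an array `imfs` of decomposed components:
# freqs = [mean_instantaneous_frequency(c) for c in imfs]
```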
Fig. 5. The forecasting results of IMF2-IMF7 and remainder
For the IMF1’s forecasting, BPNN, RBFNN, SVM, COM1 and COM2 are adopted, and temperature and holiday will be considered. The neurons of input layer are load values, lowest temperature, highest temperature, holiday type (holiday is 1 and workday is 0) at same time and previous two times on the two days before, at same time and previous two times on the one days before and at previous two times on intraday. The number of neurons is actual load value. The forecasting results (red line) of IMF1 with different methods are shown in Fig. 6.
Fig. 6. The forecasting results of IMF1 with different methods
The forecasting errors of IMF1 with different methods are listed in Table 2.

Table 2. The forecasting errors of IMF1 with different methods

Method    Max Error    Mean Error
BPNN      32.1         11.84
RBFNN     48.52        11.52
SVM       45.58        9.37
COM1      30.14        8.18
COM2      25.79        25.79
From Table 2, the forecasting precision of the combination models is better than that of the single models. The overall forecasting errors with the different models are listed in Table 3; the forecasting precision of COM2 (the ANN combination) is the highest.

Table 3. Comparison of forecasting result errors

        Max Error (%)                             Mean Error (%)
Date    RBFNN   BPNN   SVM    COM1   COM2         RBFNN   BPNN   SVM    COM1   COM2
16      3.90    3.85   3.47   3.35   2.41         1.50    1.52   1.01   1.30   0.87
17      4.60    4.81   5.06   4.78   4.85         1.30    1.34   1.00   1.22   0.98
18      6.05    6.09   5.48   5.87   5.52         2.25    1.77   1.85   1.93   1.78
19      5.47    4.64   4.50   4.51   5.00         1.77    1.90   1.52   1.54   1.53
20      4.24    4.77   3.33   3.44   3.18         1.90    1.51   1.26   1.37   1.29
21      3.41    3.51   4.80   3.59   3.25         1.60    1.45   1.45   1.30   1.13
22      4.26    4.43   6.22   4.58   4.30         1.64    1.42   1.89   1.55   1.55
5 Conclusion Considering the advantages of HHT and ANN, a new short-term load forecasting model of power system based on HHT and ANN is proposed. The disadvantages of
load forecasting with HHT are first presented and discussed, and the features of the load data are analyzed. The main problems of load forecasting are noise and abnormal data in the load, the mode mixing of EMD, and the random high-frequency component IMF1. The wavelet transform and an error adjustment criterion are adopted to eliminate the noise and abnormal data. With the mode mixing removed, each component can be forecast with a proper method. In addition, the factors of weather and holidays are considered in the model. The simulation results show that higher accuracy of short-term load forecasting can be obtained.

Acknowledgments. The authors would like to thank the New Century Excellent Talents Project Fund (NECT-08-0825), the Fok Ying Tung Education Fund (101060) and the Sichuan Province Distinguished Scholars Fund (07ZQ026-012) in China.
References

1. Hippert, H.S., Pedreira, C.E., Castro, R.: Neural Networks for Short Term Load Forecasting: A Review and Evaluation. IEEE Trans. Power Systems 16, 44–55 (2001)
2. Haida, T., Muto, S.: Regression Based Peak Load Forecasting Using a Transformation Technique. IEEE Trans. Power Systems 9, 1788–1794 (1994)
3. Papalexopoulos, A.D., Hesterber, T.C.: A Regression-based Approach to Short-term Load Forecasting. IEEE Trans. Power Systems 5, 1535–1550 (1990)
4. Rahman, S., Hazim, O.: A Generalized Knowledge-based Short Term Load-forecasting Technique. IEEE Trans. Power Systems 8, 508–514 (1993)
5. Huang, S.J., Shih, K.R.: Short-term Load Forecasting via ARMA Model Identification Including Non-Gaussian Process Considerations. IEEE Trans. Power Systems 18, 673–679 (2003)
6. Infield, D.G., Hill, D.C.: Optimal Smoothing for Trend Removal in Short Term Electricity Demand Forecasting. IEEE Trans. Power Systems 13, 1115–1120 (1998)
7. Zhu, Z.H., Sun, Y.L., Li, H.Q.: Hybrid of EMD and SVMs for Short-Term Load Forecasting. In: IEEE International Conference on Control and Automation, pp. 1044–1047 (2007)
8. Xie, J.X., Cheng, C.T., Zhou, G.H., Sun, Y.M.: A New Direct Multi-step Ahead Prediction Model Based on EMD and Chaos Analysis. Journal of Automation 34, 684–689 (2008)
9. Li, Y.Y., Niu, D.X., Qi, J.X., Liu, D.: A Novel Hybrid Power Load Forecasting Method Based on Ensemble Empirical Mode Decomposition. Power System Technology 32, 58–62 (2008)
10. Zhang, Y.S., Liang, J.W., Hu, Y.X.: Application of AR Mode to Improve End Effect of EMD. Physical Science Progress 13, 1054–1059 (2003)
11. Rilling, G., Flandrin, P., Goncalves, P.: On Empirical Mode Decomposition and its Algorithms. In: IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing, NSIP 2003, Grado, Italy (2003)
Sensitivity Analysis of CRM Indicators Virgilijus Sakalauskas and Dalia Kriksciuniene Department of Informatics, Vilnius University, Muitines 8, 44280 Kaunas, Lithuania {virgilijus.sakalauskas,dalia.kriksciuniene}@vukhf.lt
Abstract. The research aims to explore the sensitivity of CRM indicators and to define specific conditions for their application according to customer historical information. Neural network analysis was applied to the classification task of distinguishing the class of potentially returning customers from those who tend to leave the company. The research results revealed that the application of traditional time- and money-related variables has an important advantage, as they are considered to provide an objective basis for judgement. The experimental research was performed by mining the customer database of a travel agency. According to the experimental evaluation, we found that the classification model cannot be applied uniformly throughout the whole customer history during the lifecycle. The explored sensitivity of the indicators suggests applying different neural networks fitted to the particular stages of the customer lifetime cycle. The neural network analysis results are summarized and research insights are presented, based on the dynamics of the sensitivity of the customer indicators. Keywords: customer relationship management, CRM indicators, neural network analysis, sensitivity analysis.
1 Introduction

CRM (Customer Relationship Management) data analysis is a serious concern for both management research and the application of information technologies. The most urgent research directions aim at the creation of methods which could comprehensively evaluate customer processes related to potential and completed transactions and provide valuable insights into the historical data of the customer's relationship to the enterprise. The investigation of metrics includes the problems of forecasting customer churn and estimating customer lifecycle value and length. Future transactions are forecast by applying both technical analysis and fundamental analysis methods. As there are no direct means to forecast these important indicators, numerous metrics are created and analyzed by applying both managerial and computational methods. The managerial approach to CRM analysis is based on gaining insight into customer behavior by collecting all available historical data of the customer relationship. This makes the analysis of customers very complicated, due to a lack of understanding of which variables can directly explain the future development of the relationship with the customer. Large amounts of data are collected in enterprises, much of which is
never used, and executives complain that marketers add new measures and rely on many simultaneously applied metrics which lack consistency (Peppers, Rogers, 2005, 2006). The information-technology-based approach can be understood as a computational intelligence problem, which can provide insights into rules, forecasts and tendencies by analyzing data of all types. Efforts to explain customer models are hindered by the lack of reliable variables describing the customer's relationship with the enterprise. Therefore not only quantitative historical metrics are applied, but qualitative characteristics are created as well. The variety of variable types expands the application scope of data mining methods for customer database analysis. These methods include the creation of association rules, customer segmentation and identification, churn modeling, time series analysis, the application of OLAP tools, and computational intelligence methods such as neural networks (NN) and genetic and hybrid algorithms (Berry and Linoff, 2004, Sakalauskas and Kriksciuniene, 2008). The dynamic approach to customer behavior is mostly related to the application of time series models (Berry and Linoff, 2004). An analysis method which explores the individual "trajectories" of each customer is suggested by Peppers and Rogers (2004); it allows observing changes in the lifetime curve, making an impact on it, and also clustering customers with similar behavior. Ha et al. (2002, in Liao et al., 2007) proposed a dynamic customer relationship management model based on data mining and monitoring agent systems to extract longitudinal knowledge from customer data and analyse behavior patterns; the SOM (Self-Organizing Map) is used for clustering customers. The outcomes provided general impressions of customer behaviors over time and helped to improve the effectiveness of marketing strategies. The research on early analysis of airline customer lifetime value (CLV) examined in Wangenheim (2006) presents a future-transactions prediction model based on regularly updated information on the drivers of customer behavior, such as data from customer communications, channel choices, availability of choice due to competition, and exhibited transaction behavior. Although a relationship between the drivers and the customer's future transactions was noticed in that research, the selection of drivers was made only on the basis of assumptions. The problem of variable selection for the model is analysed in the implications of the customer churn related research (Neslin, Gupta, et al., 2004). One of the reasons is that, despite the attention paid in the academic literature to the application of statistical techniques such as various types of neural net or logistic models, understanding the methods of variable selection and the model-building process matters most. The further research directions in (Neslin, Gupta, et al., 2004) indicate the necessity of analysing the relative contributions of variable selection to model accuracy, because the combined application of logistic regression and exploratory data analysis does not allow separating the contribution of each variable and further explaining the performance of the model. Therefore the present research aims not only to design a model for recognizing reliable customers and indicating potentially lost ones, but also to analyze the sensitivity of the prediction model to the variables, which could then be enhanced by supporting information to increase the performance of the model.
2 Research Framework

The goal of the research was to define what parameters could characterize the further development of the customer relationship, and to forecast as early as possible which customers tend to keep their relationship with the enterprise and which of them are potentially lost. General business practice implies that customers who have made numerous purchases at the enterprise are loyal and that their loyalty strengthens with each following visit. Firms usually tend to apply various incentives and discounts to these customers, as they accumulate significant monetary value over the total purchases of the lifecycle. It is a very big challenge to define whether the customer intends to keep cooperating with the enterprise or plans to quit or switch to other service providers. The other challenge is to apply exclusively the purchase data recorded by the accounting department. In this way the customer relationships are evaluated without using any data collected in an inconsistent way or based on subjective judgments, such as surveys, subjective characteristics of communication, or unfulfilled service inquiries by the customer. The experimental research was made by using accounting data from the airline ticket sales department of a tourism agency. Trading of airline tickets is one of numerous services provided by the agency, serving customers buying both personal and business trips. Each transaction contains a data record describing only one person. The data records include sales information; there are no records of long-term contracts or incentives which could affect the strength and duration of the customer relationship. The data consists of 8000 transaction records belonging to 3429 customers, collected from January 02, 2002 till October 29, 2005. The database contains 2184 records of customers who paid only one visit. The sales transaction records include the date, flight route, customer code, purchase amount, discounts and payment details. The transaction data was processed and summarized by customer code by calculating various variables (Table 1) describing the purchase history of each customer. The variable set included indicators which are quite common in the research literature and which can be calculated from the historical transaction data. No qualitative variables were applied which could bring subjectivity into the evaluation of the customer relationship. The further calculations were made only with the data of customers who made more than four purchases. One customer who had more than one hundred visits was excluded as well, because his data could have a significant influence on the results of the data analysis. After data processing, the records of 410 customers were used for analysis. The variables and their descriptions are listed in Table 1. The last variable, 'Classif.', indicates which customers in the database are not expected to return because of the long time since their last visit. We consider a customer as lost if the time (in days) after the last visit (Recency) exceeds the average interval between visits by three standard deviations of this interval: Classif. = "0" if Recency > AverageDate + 3*StnDevDate, otherwise Classif. = "1" (a small sketch of this computation is given below). The customer database consists of 131 customers classified into the lost customer class and 279 into the existing customer class.
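The classification rule above is straightforward to express over a transaction table. The following pandas sketch derives the per-customer variables of Table 1 and the Classif. flag, under the assumption of a transaction table with 'customer', 'date' (datetime) and 'amount' columns; the column and function names are illustrative and are not those of the agency database.

```python
import pandas as pd

def customer_indicators(tx, as_of):
    """tx: transactions with columns customer, date (datetime64), amount; as_of: evaluation date."""
    tx = tx.sort_values(["customer", "date"])
    gaps = tx.groupby("customer")["date"].diff().dt.days      # days between consecutive visits

    ind = tx.groupby("customer").agg(
        Count=("date", "size"),
        GCVerte=("amount", "sum"),
        AverageB=("amount", "mean"),
        Last_Date=("date", "max"),
        First_Date=("date", "min"),
    )
    ind["GCTrukme"] = (ind["Last_Date"] - ind["First_Date"]).dt.days
    ind["AverageDate"] = gaps.groupby(tx["customer"]).mean()
    ind["StnDevDate"] = gaps.groupby(tx["customer"]).std()
    ind["Recency"] = (pd.Timestamp(as_of) - ind["Last_Date"]).dt.days

    # lost (=0) if Recency exceeds the mean inter-visit interval by 3 standard deviations
    ind["Classif"] = (ind["Recency"] <= ind["AverageDate"] + 3 * ind["StnDevDate"]).astype(int)
    return ind[ind["Count"] > 4]           # the study kept only customers with more than 4 purchases
```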
Table 1. Estimated variables

Variable     Description
Code         Customer code
Count        Frequency of the visits
GCTrukme     Length of the life cycle
GCVerte      Total sum of purchases by customer
AverageB     Average value of purchase
AverageDate  Average number of days between each visit
StnDevDate   Standard deviation of the number of days between visits
Group        Customer group, assigned by their visit frequency (three groups): 1st group visits >17, 2nd group 7< visits <18, and 3rd group visits <8
Last_Date    Date of the last visit
Classif.     Classifying variable. The customer is assigned either to the reliable customer group (=1) or lost customer group (=0)
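The variables in Table 1 can be derived directly from the raw transaction records. The following sketch shows one possible way to compute them per customer in Python; the record layout, function and field names are illustrative assumptions, not taken from the paper, and the Group variable is omitted for brevity.

```python
from datetime import date
from statistics import mean, pstdev
from collections import defaultdict

def summarize(transactions, today):
    """Compute Table 1 style variables for each customer.
    transactions: list of (customer_code, purchase_date, amount) tuples."""
    by_customer = defaultdict(list)
    for code, d, amount in transactions:
        by_customer[code].append((d, amount))

    profiles = {}
    for code, recs in by_customer.items():
        recs.sort()                                    # order visits by date
        dates = [d for d, _ in recs]
        amounts = [a for _, a in recs]
        gaps = [(b - a).days for a, b in zip(dates, dates[1:])]  # days between visits
        profile = {
            "Count": len(recs),
            "GCTrukme": (dates[-1] - dates[0]).days,   # length of the life cycle
            "GCVerte": sum(amounts),                   # total sum of purchases
            "AverageB": mean(amounts),                 # average purchase value
            "AverageDate": mean(gaps) if gaps else 0,  # mean days between visits
            "StnDevDate": pstdev(gaps) if len(gaps) > 1 else 0,
            "Last_Date": dates[-1],
        }
        recency = (today - dates[-1]).days
        # lost if the pause exceeds the mean interval by 3 standard deviations
        profile["Classif"] = 0 if recency > profile["AverageDate"] + 3 * profile["StnDevDate"] else 1
        profiles[code] = profile
    return profiles

if __name__ == "__main__":
    sample = [("C1", date(2004, 1, 5), 120.0), ("C1", date(2004, 2, 9), 80.0),
              ("C1", date(2004, 3, 15), 95.0), ("C2", date(2003, 6, 1), 300.0)]
    print(summarize(sample, today=date(2005, 10, 29)))
```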
The research is designed for detection of the variables which could be valuable for evaluating the threat of losing the customer. The model consists of two stages:
1. In the first stage the neural network method is applied for creating the customer classification model. The full customer database was applied for training and testing the neural network.
2. The second stage of the research explored the prevailing perception about customers in enterprises, which tends to consider customers more reliable if their buying frequency is high. The NN model is modified for ranking the variables according to their impact on the classification of customers with different frequencies of visits.
The model is aimed to inform what type of main and supporting information should be analyzed. Although the variables are only related to the financial transactions of the customer, they have different backgrounds: the financial values (GCVerte, AverageB), time values (GCTrukme, AverageDate, StnDevDate, and Last_Date), and frequency (Count, Group). The attempt to define the most influential variables for customer classification leads to their refinement by collecting non-transactional information related to the communication history with the customer. The sensitivity analysis of the variables used as input of the NN model can further be enhanced by a set of supporting non-transactional CRM variables, more extensively describing the time, frequency and financial characteristics of the customer relationships. The further investigation is aimed at research of the neural model, enhanced by variables increasing the sensitivity and correct performance of the classification model in different periods of the customer life cycle.
2.1 Neural Network Model for Customer Classification
The neural network for classification of customers was prepared by applying the STATISTICA Neural Network module (StatSoft Inc., 2006). The data used for the model included the categorical output variable Classif and the continuous input variables GCTrukme, GCVerte, AverageB, AverageDate, StnDevDate (Table 1).
Three types of network modeling were investigated for classification. The generally applied network types for designing NN models are Probabilistic Neural Network, Radial Basis Function and Multilayer Perceptron. The main difference is in their algorithms, used for analysis and grouping of the input cases for further classification. The data analysis is performed in stages, by including hidden layers for further solving classification tasks. Any type of neural network model is designed in training, selection and testing steps by calculating classification error and performance rates on the training, selection and test data subsets. The Probabilistic Neural Networks differ from the other types of NN by their speed in different stages of model creation. This type of network copies every training case to the hidden layer of the network, where the Gaussian kernel-based estimation is further applied. The output layer is then reduced, by making estimations from each hidden unit. The training is extremely fast, as it just copies the training cases after their normalisation to the network. But this procedure tends to make the NN very large (StatSoft Inc., 2006). The Probabilistic Neural Network model was selected for further investigation and refinement of the customer classification model, as it outperformed the Radial Basis Function and Multilayer Perceptron models. The application of neural network models required preparing three datasets. The prepared model included five input variables, 206 atoms in hidden layer and one output categorical variable. Its structure is presented in Fig.1.
Fig. 1. The structure of the Probabilistic Neural Network model
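A probabilistic neural network of the kind described above works by turning every training case into a Gaussian kernel in the hidden layer and summing the kernel responses per class in the output layer. The sketch below is a generic illustration of that mechanism, not the STATISTICA implementation used by the authors; the smoothing parameter sigma and the toy data are assumed values.

```python
import numpy as np

class PNN:
    """Minimal probabilistic neural network: one Gaussian kernel per training case."""
    def __init__(self, sigma=0.5):
        self.sigma = sigma

    def fit(self, X, y):
        self.X, self.y = np.asarray(X, float), np.asarray(y)
        self.classes = np.unique(self.y)
        return self

    def predict(self, X):
        preds = []
        for x in np.asarray(X, float):
            # Gaussian kernel response of every stored training case
            k = np.exp(-np.sum((self.X - x) ** 2, axis=1) / (2 * self.sigma ** 2))
            # average the responses within each class and pick the largest
            scores = [k[self.y == c].mean() for c in self.classes]
            preds.append(self.classes[int(np.argmax(scores))])
        return np.array(preds)

# toy usage: two input variables, binary class labels (1 = reliable, 0 = lost)
X_train = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y_train = [1, 1, 0, 0]
print(PNN(sigma=0.3).fit(X_train, y_train).predict([[0.15, 0.15], [0.85, 0.85]]))
```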
In Fig. 2 the performance of the model is evaluated by calculating the correct classification rate and the errors in the training, selection and testing stages.
Fig. 2. Probabilistic Neural Network model characteristics
The network's error function is measured as the root mean square (RMS) of the individual case errors. The NN training algorithms attempt to minimise the training error. The correct classification rate is shown by the performance measures, which indicate the proportion of cases correctly classified in the training, selection and testing stages. As we can see from Fig. 2, the correct classification rate is near 0.8 for all subsets, which means that more than 80% of cases were classified correctly.
The classification performance of the created NN model is summarised in Fig. 3. The performance of the neural network was different in correctly classifying reliable customers (Classif.1) and lost customers (Classif.0). The reliable customers were correctly assigned to the reliable customer group (=1) in 95.34% of cases, and incorrectly assigned to the group of lost customers (=0) in 4.66% of cases. But the lost customers were correctly identified as belonging to the lost customer group (=0) only in 50.38% of cases. Therefore the performance of the created model was unsatisfactory due to its inability to recognise the customers who tend to lose connection with the enterprise.
Fig. 3. Classification results of the NN model
The further investigation is based on the assumption that the accuracy of classification depends on other variables describing the customers. The number of visits is one of the variables mostly influencing the analysis of customer history. We explored whether customers with a higher frequency of visits, indicated by the variable Count, tend to be differently assigned to the group of reliable customers by the neural network model.
2.2 Sensitivity Analysis of the Input Variables of the Neural Network Model
In this part we explore the ability of the model to classify customers in various stages of the customer life cycle, denoted by frequency values. We analyse the customers' data with equal visit frequency. The customers whose total number of visits is higher than 4 are included into all corresponding data subsets with accumulated variable values, corresponding to their frequency values from 4 to the number of their last visit.
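One way to build the frequency-specific subsets just described is to truncate every customer's transaction history at each visit count from 4 up to that customer's total number of visits and recompute the Table 1 variables on the truncated history. The sketch below illustrates the subset construction only; the data layout and names are assumptions made for this example (the per-customer histories are the chronologically ordered (date, amount) lists used by the earlier summarize-style helper).

```python
def frequency_subsets(histories, min_count=4, max_count=17):
    """histories: dict customer -> chronologically ordered list of (date, amount).
    Returns {count: [(customer, truncated_history), ...]}."""
    subsets = {c: [] for c in range(min_count, max_count + 1)}
    for customer, visits in histories.items():
        total = len(visits)
        if total <= min_count:           # only customers with more than 4 purchases
            continue
        for count in range(min_count, min(total, max_count) + 1):
            # accumulated variable values: only the first `count` visits are known
            subsets[count].append((customer, visits[:count]))
    return subsets
```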
Fig. 4. The sensitivity analysis of the designed neural network model
The neural network models are designed for each subset for classifying customers to the reliable customer group (=1) and to the lost customer group (=0), the correct and incorrect classification percentages are calculated for both groups, and the variables are ranked to explore their influence on the classification. The impact of the different variables on the correctness of the customer classification is denoted as sensitivity analysis.
In Fig. 4 the variables are ranked by their influence on the neural network model. The time-related variables Length of the life cycle (GCTrukme), Standard deviation of the number of days between visits (StnDevDate) and Average number of days between each visit (AverageDate) take the three highest ranked positions in the model. The money-related variables (Average purchase AverageB and Life cycle value GCVerte) are ranked in the lowest positions. The differences of the impact of the variables in the model are quite small, as shown in the Ratio 1 line of Fig. 4, except for GCTrukme, which has the highest impact, exceeding the others by approximately 20 percent.
Our goal is to analyze the network performance for recognizing reliable customers for each value of frequency in the explored range from 4 to 17. The performance of the designed NN models corresponding to each value of visit frequency (Count) and the ranking of the variables according to the sensitivity of the NN models are presented in Table 2.

Table 2. Performance and sensitivity of the neural network model

       Correct classification (%)    Variable rank
Count  Correct_1  Correct_0   GCTrukme  GCVerte  AvB  AvDate  StddevDate
4      69.5       63.4        2         3        5    4       1
5      71.9       61.0        5         1        3    4       2
6      72.3       63.8        5         1        3    4       2
7      72.5       62.2        5         1        3    4       2
8      71.8       64.5        5         1        2    4       3
9      74.0       61.5        5         1        2    4       3
10     76.7       54.2        4         1        2    5       3
11     77.9       54.5        4         1        3    5       2
12     81.8       31.6        4         1        2    5       3
13     90.5       40.0        2         1        3    5       4
14     93.2       33.3        1         2        3    5       4
15     94.4       18.2        1         2        3    5       4
16     96.0       22.2        1         2        3    5       4
17     95.5       22.2        1         2        3    5       4
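For illustration, a variable ranking of the kind reported in Table 2 can be approximated by an error-ratio style sensitivity analysis: re-evaluate the trained network with one input variable at a time replaced by its mean and rank the variables by how much the classification error grows. This is a generic sketch of that idea, assuming a fitted classifier with a predict method and a NumPy feature matrix; it is not necessarily the exact STATISTICA procedure used in the paper.

```python
import numpy as np

def sensitivity_ranking(model, X, y, names):
    """Rank input variables by the error ratio obtained when a variable
    is 'switched off' (replaced by its column mean)."""
    base_err = np.mean(model.predict(X) != y)
    ratios = {}
    for j, name in enumerate(names):
        X_off = X.copy()
        X_off[:, j] = X[:, j].mean()               # remove the information in variable j
        err = np.mean(model.predict(X_off) != y)
        ratios[name] = err / max(base_err, 1e-9)   # ratio > 1 => the variable matters
    order = sorted(ratios, key=ratios.get, reverse=True)
    return {name: rank + 1 for rank, name in enumerate(order)}, ratios
```

Applying such a ranking to each frequency subset yields one row of variable ranks per Count value, analogous to Table 2.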
The reliability of the customer relationship differs in three main stages: early (four to seven visits), middle (seven to twelve visits), and late (thirteen to seventeen visits). In these stages different NN models should be applied for customer classification. The performance of the model for recognizing reliable customers increases across the stages, but its performance for recognizing potentially lost customers decreases. The NN model performance can be improved by a further search for non-transactional variables which support the overwhelmed variables of the applied models. Therefore, in the later stages of the customer life cycle, when the model is mostly influenced by the lifetime-related variables and tends to assign all customers with a long history to the reliable group, the ability of the model to recognize lost customers can be increased by designing and evaluating variables characterizing the average purchase and its standard deviation, which mostly depend on the success of each visit.
3 Conclusions
The presented research analysed the classification of customers into reliable and potentially lost ones by using the main variables of transactional data: frequency of visits, life cycle value, life cycle duration, average purchase, average number of days between visits and the standard deviation of the time interval between visits in days. The neural network model was designed and explored for analysing how the customers' reliability changes with an increasing number of visits. We state which variables can most sensitively describe the customer's potential to keep a further relationship with the firm. The neural network experimental analysis showed that the application of the transaction-based variables for customer classification had an increasing ability to recognize the reliable customers (from 69.5% to 95.5%), but the performance in predicting further churn of the customer weakened (from 63.4% to 22.2%). The power of the general neural model to make predictions for the entire customer database was 95.34% and 50.38%, respectively. The research results imply that the performance of the neural network model for the group of best customers is unsatisfactory, due to its high sensitivity to the increasing values of the variables related to customer relationship history. Therefore, by analysing the sequence of data subsets with increasing frequency of visits, new supplementary variables have to be applied for analysing the customer's attitude to each following visit, which could affect the values of the average purchase and the standard deviation of the number of days between visits, and improve the accuracy of the model.
References 1. Berry, M.J.A., Linoff, G.S.: Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management (2004) 2. Liao, T.W., Triantaphyllou, E. (eds.): Recent Advances in Data Mining of Enterprise Data: Algorithms and Applications, pp. 1–109. World Scientific, Singapore (2007) 3. Neslin, S.A., Gupta, S., Kamakura, W., Lu, J., Mason, C.: Defection Detection: Improving Predictive Accuracy of Customer Churn Models. Tuck School of Business, Dartmouth College (2004)
4. Peppers, D., Rogers, M.: Managing Customer Relationships: A Strategic Framework. John Wiley and Sons Inc., Hoboken (2004) 5. Peppers, D., Rogers, M.: Return on Customer: Creating Maximum Value from Your Scarcest Resource. Currency, New York (2005) 6. Peppers, D., Rogers, M.: Response to Ambler and Roberts: Beware the Silver Metrics. Report No. 06-114, Marketing Science Institute (2006) 7. Sakalauskas, V., Kriksciuniene, D.: Deriving knowledge based indicators for monitoring virtual project teamwork. In: Continuous Optimization and Knowledge-Based Technologies, Vilnius Technika, pp. 105–112 (2008) 8. StatSoft Inc.: Electronic Statistics Textbook. StatSoft, Tulsa, OK (2006), http://www.statsoft.com/textbook/stathome.html 9. Wangenheim, F.: Lifetime Value Prediction at Early Customer Relationship Stages. Working Paper Series, Issue 2, No. 06-002, Report Summary No. 06-112, Marketing Science Institute (2006)
Endpoint Detection of SiO2 Plasma Etching Using Expanded Hidden Markov Model Sung-Ik Jeon1, Seung-Gyun Kim1, Sang-Jeen Hong2, and Seung-Soo Han1 1
Department of Information Engineering, Myongji University, Yongin, Gyeonggido, 449-728, Korea
[email protected] 2 Department of Electronics Engineering, Myongji University, Yongin, Gyeonggido, 449-728, Korea
Abstract. In this paper, an extended Hidden Markov Model (eHMM) is employed to resolve transition detection problems in plasma etch processes using optical emission spectroscopy (OES) data. The proposed eHMM framework is one of various semi-Markov models: a combination of a semi-Markov model and a segmental model. In the OES data, the endpoint is correlated to the state transition in the model. The segmental model permits adaptable modeling of the data within the segments, e.g., by linear, quadratic, or other regression functions. The semi-Markov model permits blending in prior knowledge about the transition time. The semi-Markov model is an extended version of the standard Hidden Markov Model (HMM), from which the learning and inference algorithms are expanded. The verification using test data confirms the accuracy and excellence of the proposed eHMM in endpoint detection. Keywords: Optical Emission Spectroscopy, Extended Hidden Markov Model, Plasma Etching, Hidden Markov Model, Endpoint Detection.
1 Introduction
Plasma etching is an extensively used critical process in semiconductor manufacturing. Accurate in-situ monitoring of plasma etching is essential for current and future VLSI fabrication [1]. Strict control of all process factors must be sustained to increase throughput and reproducibility. The greatest need for plasma process monitoring occurs in the determination of the endpoint of the etching process, which can reduce the amount of over-etch and under-etch. In general, the dry etch process, especially the reactive ion etch (RIE) process, is used for etching thin line patterns in a silicon wafer. RIE is similar to the standard parallel-plate plasma etcher, except that the masked wafer in the chamber is placed on the RF-powered electrode. The plasma includes etchant gases that are dissociated in a radio frequency field. This condition drives the reactive ions contained in the etchant gases vertically toward the wafer surface. The accelerated reactive ions combine chemically with the unmasked material on the wafer surface. As a result, volatile etch products are
generated. The volatile etch products are incorporated into the plasma as a layer of unmasked material is etched. As the etching process reaches the end of the layer, the quantity of volatile etch product found in the plasma decreases, since the quantity of unmasked material being etched is reduced due to over-etching. The depletion or reduction in the amount of volatile etch product in the plasma during the RIE process can therefore typically be used as an indicator for the end of the etching process.
One of the most commonly used measurement techniques for in-situ plasma monitoring and sophisticated endpoint detection is optical emission spectroscopy (OES) [2]. OES systems investigate the variation of the optical emission intensity of the plasma as a function of the reactants and by-products in the chamber. The radicals and ions in the plasma can be diagnosed by measuring the intensity of the optical emission signal at particular wavelengths. In this technique, the intensity of an emission peak closely related to a particular reactant or product is monitored over time. Most endpoint detection (EPD) methods using OES focus on identifying a single wavelength that corresponds to a chemical species that shows a pronounced transition at the endpoint. When the target layer is cleared by the etching process, the concentration of reaction products from the target layer is reduced and the concentration of products from the underlying layer is increased. As an example, when SiO2 is etched with CHF3, carbon combines with oxygen from the wafer to form carbon monoxide (CO) as an etch product. It is known that CO emits light at a wavelength of 451 nm, and that this wavelength can be monitored for detecting the endpoint. When the oxide is completely etched, there is no longer a source of oxygen and the CO peak at 451 nm decreases, thus signaling the endpoint of the etch process.
In recent years, there has been extensive research into plasma etch endpoint detection using OES and interferometry signals [3][4][5][6][7]. Many researchers have studied finding the endpoint by examining the shape of the time waveform of selected spectral lines, and by pattern recognition methods such as neural networks. The disadvantage of the neural network approach is that a considerable number of training examples may be needed to build the model. To solve this problem, Mundt [5] suggested synthesizing training data. To apply this technique, a mathematical model has to be constructed for the endpoint. This is a complicated technique to perform because it requires detailed prior knowledge and needs to be repeated for each new pattern. Other pattern matching techniques have also been surveyed. Allen et al. [6] used the Haar wavelet representation to model an endpoint pattern over many resolutions. As mentioned above, it is difficult to directly incorporate prior knowledge using this approach.
To increase the sensitivity of modeling for endpoint detection, a progressive method based on two extensions of the basic HMM has been proposed. One is the semi-Markov model, to enable an arbitrary distribution on the location of the change-point. The other extension is a segmental HMM, to model the configuration in each segment. The ultimate objective is to be able to detect this change-point in an on-line real-time system. In this paper an off-line calculation of the detection problem using an extended Hidden Markov Model (eHMM) is suggested. This model can also be used to detect the endpoint in an on-line system.
The composition of this paper is as follows: Segmental semi-Markov modeling is described in Section 2. Experiments and characterization techniques are outlined in Section 3 followed by a discussion of the results and conclusions.
2 Segmental Semi-Markov Model for the Endpoint Detection
In this section, the extended version of the segmental semi-Markov model used to solve the problem of endpoint detection is suggested. This method detects a one-time change of the value and results in high computational efficiency. The proposed method has the following specific components [8]. The process is assumed to start in state 1 and then transition to state 2, where it stays until the end of the process.
• A segmental hidden Markov model is used to model the experimental data. It is assumed that the regression functions for states 1 and 2 are linear, with parameters θ1 and θ2, plus additive noise e, which is Gaussian with zero mean and unknown variance σ².
• The process is semi-Markov, characterized by a state duration distribution for state 1.
For the problem of change-point detection, a 2-state segmental semi-Markov model is used:
• State 1: 1st segment (from the starting point to the changing point)
• State 2: 2nd segment (after the changing point)
The process will start with state 1 and transition to state 2, with the initial state distribution

π = (π1, π2) = (1, 0)    (1)

and the transition matrix is defined by

A = ( 0 1 ; 0 0 )    (2)

The state duration distribution of state 1 is set to reflect prior knowledge of 'when the change-point will occur.' For example, if we expect that the change will occur approximately at time 20%, we can use a truncated normal distribution

p(d1 = d) ∝ N(d; μ, σd²)  for μ − 3σd ≤ d ≤ μ + 3σd,  and 0 otherwise,    (3)

where 3σd ≤ μ and μ corresponds to 20% of the sequence length. As we are unconcerned with the duration of state 2, we set its distribution to be

p(d2 = d) = 1,  for d ≥ 0.    (4)

And then, the linear regression function of the N states is

p(yt | st = i) = p(yt | θi),  for 1 ≤ i ≤ N.    (5)

The joint distribution of the model is

p(s1, …, sT, y1, …, yT) = p(s1, …, sT) p(y1, …, yT | s1, …, sT).    (6)
2.1 Regression Model
In this section, the regression model is defined to accommodate the functional form of the conditional densities p(yt | st) that relate the observed data to the hidden states. In the standard HMM, the observed values yt rely only on the state sequence and not on the time t. The proposed model allows each state to produce data in the form of a linear regression model [9], i.e.,

yt = f(t | θst) + et    (7)

where f(t | θi) is a linear regression function with parameters θi and et is additive independent noise. The noise is often supposed to be Gaussian, but not inevitably. Accordingly, in the Gaussian noise case we get that p(yt | st = i) is Gaussian with the time-dependent mean f(t | θi) and with variance σi². Note that, conditioned on the regression parameters θ, the observation yt depends only on the current state st, as in the regression framework.
2.2 A Viterbi-Like Algorithm to Compute the Most Likely State Sequence (MLSS)
For more sensitive change-point detection, the most likely state (i.e., segment label) sequence s1 … sT for a data sequence y1 … yT is used [10]. At each time t, this algorithm calculates the quantity δt(i) for each state i, 1 ≤ i ≤ N, where δt(i) is defined as

δt(i) = max over s1 … st−1 of p(s1 … st−1, st = i, y1 … yt),    (8)

In other words, δt(i) is the likelihood of the most likely state sequence that ends with state i, and t is the last point of segment i. At time T, δT(i) is defined as

δT(i) = max over s1 … sT−1 of p(s1 … sT−1, sT = i, y1 … yT),    (9)

By definition, the most likely state sequence for the data sequence y1 … yT is the state sequence with likelihood max_i δT(i). The recursive function for calculating δt(i) will be

δt(i) = max over j and τ < t of [ δτ(j) a_ji p(di = t − τ) p(yτ+1 … yt | s = i) ],  for 1 ≤ i ≤ N.    (10)

As the process reaches the final time, we acquire the maximum and then find the most likely state sequence (MLSS) using trace-back. In the most likely state sequence (MLSS), the change from state 1 to state 2 is the ultimate endpoint.
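For the two-state case, the MLSS computation can be read as scoring every candidate change point: fit one linear segment before the candidate and one after it, weight the joint likelihood by the duration prior of state 1, and keep the maximum. The following sketch illustrates that reading with NumPy least squares and a truncated-normal prior; it is a simplified illustration under these assumptions, not the authors' exact implementation, and the synthetic trace parameters are borrowed from the experiment described later.

```python
import numpy as np

def segment_loglik(y, t):
    """Gaussian log-likelihood of segment y under its best linear fit."""
    A = np.vstack([t, np.ones_like(t)]).T
    coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    var = max((y - A @ coef).var(), 1e-12)
    return -0.5 * len(y) * (np.log(2 * np.pi * var) + 1)

def duration_logprior(d, mu, sigma):
    """Truncated normal prior on the length of state 1 (Eq. 3)."""
    if abs(d - mu) > 3 * sigma:
        return -np.inf
    return -0.5 * ((d - mu) / sigma) ** 2

def detect_changepoint(y, mu, sigma):
    t = np.arange(len(y), dtype=float)
    best_cp, best_score = None, -np.inf
    for cp in range(2, len(y) - 2):                   # need >= 2 points per segment
        score = (segment_loglik(y[:cp], t[:cp])       # state 1 segment
                 + segment_loglik(y[cp:], t[cp:])     # state 2 segment
                 + duration_logprior(cp, mu, sigma))  # semi-Markov duration prior
        if score > best_score:
            best_cp, best_score = cp, score
    return best_cp

# synthetic 133-point trace: rising intensity, then falling after sample 99
rng = np.random.default_rng(0)
seg1 = 0.01285 * np.arange(99)
seg2 = seg1[-1] - 0.02 * np.arange(1, 35)
trace = np.concatenate([seg1, seg2]) + rng.normal(0, 0.05, 133)
print(detect_changepoint(trace, mu=99, sigma=6.6))
```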
3 Experiments
The etching tool used in this research was a MINI Plasma-Station series. A diagram of a reactive ion etching system is shown in Fig. 1-(a). Reactive ion etching is a method for eliminating material from the wafer surface with both a reactive chemical process and a physical process using ion bombardment. In this type, a DC self-bias evolves on the cathode and the wafer acquires a large voltage difference with respect to the plasma. This state lends directionality to the ionized species moving toward the wafer.
An Avaspec-128 OES system was used for monitoring the RIE chamber during the etching process. This OES system primarily consisted of optical sensors. Fiber-optic cables were used to collect the plasma emission intensity readings independently through a small quartz window in front of the RIE chamber. Fig. 2 depicts the etch chamber sensor configuration.
Fig. 1. Schematic diagrams of (a) a reactive ion etcher (RIE) and (b) the SiO2 coupon wafers
The vacuum was maintained using a turbo-molecular pump backed by a dry mechanical pump. The process gases were introduced through the top gas inlet at the center of the dielectric window at a controlled flow rate, and the chamber pressure was regulated using a throttling gate valve (TGV). The process chamber was operated at 15 mT with 1000 W of 13.56 MHz inductive source power. The following feed gases were supplied to the process chamber: 50 sccm of CF4 and 15 sccm of Ar. The thickness of the SiO2 layer was 5000 Å, and the layer was deposited by low pressure chemical vapor deposition (LPCVD) using dichlorosilane (SiH2Cl2, or DCS) and nitrous oxide (N2O) as the precursors. Figure 1-(b) shows the SiO2 test coupon wafers, which had been broken into small pieces. In manufacturing processes, the coupon wafer test is broadly used in the early phase of process development, when real patterned wafers are not readily available. The sizes of the coupon wafers used ranged from 0.2% (62.5 mm2) to 1.0% (312.5 mm2) of a 4 inch wafer. These coupon wafers were etched for 450 s in this experiment. A 0.4% coupon wafer was used to verify the etching endpoint. The optical emission signals of the OES spectrometer (175 x 110 x 44 mm) channels covering 360-1100 nm were measured at 2-second intervals during 8 minutes.
Fig. 2. OES sensor configuration for plasma monitoring
Multiple peaks obtained from the literature were considered, as shown in Table 1. After monitoring all the peaks, the noisiest wavelength (CO, 482.5 nm) was selected.

Table 1. Several relative wavelengths for the single wavelength method

Material  Wavelengths (nm)
CF        240.0, 247.5, 255.8
CF2       259.5, 271.2, 274.9
CO        219.7, 292.4, 313.9, 349.3, 482.5, 560.9
CO2       287.7, 290.0
Si        252.0, 252.3, 505.6
SiF       440.2
SiO2      248.6
4 Simulation Result
There are in total 6 OES sensor data sets with the same process conditions from the etch system (MINI Plasma-Station). Four of them are used to build the eHMM model; the other 2 are used to verify the performance of the model. Both the linear regression method and the MLSS method were utilized to represent the model. To systematically verify the linear regression method and the MLSS method, the 2 test cases were validated in the following manner:
• The model was trained using the 133-point data at the wavelength of 482.5 nm, which consist of two linear segments with additive Gaussian noise.
• The actual transition point from one segment to the other is sampled from a Gaussian distribution with mean 99 and standard deviation 6.6. This distribution was used as the prior knowledge in the eHMM. To test the sensitivity of the segmental semi-Markov method, the test data were fed into the model to verify that our model correctly finds the changing point.
• To prove the sensitivity of the MLSS method, a comparison is made after applying the MLSS.
Figure 3 shows the linear regression model of the four data sets. The two slopes are chosen to represent the two segments in the linear regression representation. In the regression model, the slopes θ1 = 0.01285 and θ2 = -0.02 represent each state, respectively. The endpoint detected using the proposed eHMM method is shown in Fig. 4. In Fig. 4-(a) and (d), the thin line shows the probability of state 1, and the thick line shows the probability of state 2. At many points before the endpoint arrives, we can see that the probability of state 2 is greater than that of state 1, which can be incorrectly detected as the endpoint. So, without MLSS, incorrect detections occur at many points before the real endpoint arrives. Fig. 4-(b) and (e) show the probability of the state transition. The greatest point corresponds to the endpoint. Because the Viterbi algorithm is able to filter out the noise, the curve becomes much flatter. With this algorithm, only one candidate point is acquired and the endpoint is found.
Fig. 3. The four data sets used in making eHMM model
After using the MLSS method, the errors in detecting the endpoint tend to be much smaller. The performance of the proposed eHMM algorithm is validated using two test samples. Fig. 4-(c) and (f) show the detection of the endpoint using the eHMM algorithm. The algorithm decided that the endpoints are at 99 s, which are correctly located. The eHMM algorithm is assumed to start from state 1 until the endpoint occurs; then the state changes to state 2. As the state changes, the proposed algorithm detects the correct endpoint. This result shows that the proposed model performs well in dealing with the ambiguities in the data.
Fig. 4. (a), (d): output before MLSS, (b), (e): output after MLSS, (c), (f): simulation results
5 Conclusions
Extended hidden Markov model (eHMM) modeling has been applied to plasma etch endpoint detection. This method is a combination of a semi-Markov model, which is characterized by a state duration distribution, and a segmental model, which is used to model the shape of the data. Experimental OES data from etching SiO2 coupon wafers with CF4 and Ar gas were utilized to build the model and to verify the performance of the suggested algorithm. Four data sets are used to train the eHMM model using the linear regression method and the Viterbi algorithm. The other two data sets are used to test whether the trained model can detect the endpoint correctly or not. The result shows that the eHMM model can detect the endpoint correctly even with noisy data. This algorithm can be applied to detecting the endpoint in etching with a small open area ratio. Acknowledgments. This work is financially supported by the Ministry of Knowledge Economy (10031812-2008-11).
References 1. Almgren, C.: The role of RF measurements in plasma etching. Semicond. Int. 20(8), 99–104 (1997) 2. Shabushnig, J., Demko, P.: Application of optical emission spectroscopy to semiconductor device fabrication. Amer. Lab. 16, 60–67 (1984) 3. Rietman, E.A., Frye, R.C., Lory, E.R., Harry, T.R.: Active neural network control of wafer attributes in a plasma etch process. Journal of Vacuum Science & Technology B (Microelectronics Processing and Phenomena), 1314–1316 (1993) 4. White, D.A., Goodlin, B.E., Gower, A.E., Boning, D.S., Chen, H., Sawin, H.H., Dalton, T.J.: Low open-area endpoint detection using a PCA-based T2 statistic and Q statistic on optical emission spectroscopy measurements. IEEE Transactions on Semiconductor Manufacturing, 193 (May 2000) 5. Mundt, R.: Model based training of a neural network endpoint detector for plasma etch applications. In: Meyyapan, M., Economou, D.J., Butler, S.W. (eds.) Proc. Symposium on Process Control, Diagnostics, and Modeling in Semiconductor Manufacturing, May 1995, pp. 178–188 (1995) 6. Allen, R.L., Moore, R., Whelan, M.: Application of neural networks to plasma etch endpoint detection. Journal of Vacuum Science & Technology B (Microelectronics and Nanometer Structures), 498–503 (1996) 7. Dreeskornfeld, L., Segler, R., Haindl, G., Wehmeyer, O., Rahn, S., Majkova, E., Kleineberg, U., Heinzmann, U., Hudek, P., Kostic, I.: Reactive ion etching endpoint detection of microstructured Mo/Si multilayers by optical emission spectroscopy. Microelect. Eng., 54–303 (2000) 8. Ge, X., Smyth, P.: Segmental Semi-Markov Models for Change-Point Detection with Applications to Semiconductor Manufacturing. Technical Report UCI-ICS 00-08 (March 2000) 9. Draper, N.R., Smith, H.: Applied Regression Analysis. John Wiley & Sons, Inc., Chichester (1998) 10. Ostendorf, M., Digalakis, V.V., Kimball, O.A.: From HMM's to segment models: a unified view of stochastic modeling for speech recognition. IEEE Transactions on Speech and Audio Processing 4(5), 360–378 (1996)
Kernel Independent Component Analysis and Dynamic Selective Neural Network Ensemble for Fault Diagnosis of Steam Turbine* Dongfeng Wang, Baohai Huang, Yan Li, and Pu Han Department of Automation, North China Electric Power University, Baoding Hebei 071003, China
Abstract. A new method for fault diagnosis of steam turbine based on kernel independent component analysis (KICA) and dynamic selective neural network ensemble is proposed. Firstly, the fault data of steam turbine is analyzed using KICA to extract main features from high dimensional patterns. Not only is the diagnosing efficiency improved but also the diagnosing accuracy is ensured. Then, the generalization errors of different neural networks to each validating sample are calculated and the information is collected into a performance matrix, according to which the K-nearest neighbor algorithm is used to predict the generalization errors of different neural networks to each testing sample. Lastly, the individual networks whose generalization errors are in a threshold λ will be dynamically selected and the predictions of the component neural networks are combined through majority voting. The practical applications in fault diagnosis of steam turbine show that the proposed approach gives promising results on performance even with smaller learning samples, and it has higher accuracy and stability. Keywords: kernel independent component analysis, features extraction, ensemble learning, dynamic selective ensemble, steam turbine; fault diagnosis.
1 Introduction
With the development of automation and the increasing capacity of steam turbo-generators, higher requirements have been put forward for high-speed, fully loaded, continuous and reliable operation of the equipment, as well as for the on-line monitoring and fault diagnosis technology of large steam turbo-generator units. Neural network technology is widely used in the field of fault diagnosis because of its self-learning, non-linear pattern recognition, associative and fault-tolerant features [1-3]. However, there are two basic problems prevalent in intelligent diagnosis based on neural networks: on the one hand, how to select the most valuable characteristics from the large number of features obtained through signal analysis; on the other hand, how to structure and train neural networks in order to improve their ability
This work is partially supported by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, North China Electric Power University (Grant No.200814002).
of generalization. Recently, methods combining neural networks with rough set theory [4], empirical mode decomposition [5], non-linear principal component analysis [6] or information fusion technology [7] have been proposed and have achieved good results. However, with the growing size and complexity of diagnosis systems, these methods have poor stability in resolving non-linear pattern recognition problems, and their improvement of the generalization ability of neural networks is very limited. Kernel independent component analysis (KICA) [8] uses "kernel tricks" to nonlinearly map the data into a high-dimensional feature space in which the data have a linear structure, and ultimately converts the problem of performing ICA [9] in feature space into a problem of implementing ICA in the kernel principal component analysis (KPCA) [10] transformed space. Neural network ensemble (NNE) [11] is a learning paradigm where a collection of a finite number of neural networks is trained for the same task. It has been proved to be a very effective approach to significantly improve the generalization ability of a neural network system. In this paper, to overcome the deficiency of existing methods, a new approach for fault diagnosis of steam turbine based on KICA and dynamic selective neural network ensemble is proposed. The practical applications show that the proposed approach gives promising results on performance and has higher accuracy and stability compared with other methods.
2 Features Extraction Based on KICA
Given a random vector x, which is possibly nonlinearly mixed, we map it into the feature space F by the nonlinear mapping Φ: x → Φ(x) ∈ F. Assume that after the nonlinear mapping, the data have a linearly separable structure in feature space F. Our task is to find a linear operator W^Φ in F to recover the independent components from Φ(x) by the following linear transformation:

s = W^Φ Φ(x).    (1)

Before applying a KICA algorithm, it is usually very useful to do some preprocessing work such as sphering or whitening the data. For the relevant method and certification please refer to reference [12]. After whitening, the following task is to find a new unmixing matrix Wp in the KPCA-transformed space to recover the independent source s from y, i.e., s = Wp y. Note that the new unmixing matrix Wp should be orthogonal. We adopt the FastICA algorithm that was proposed by Hyvärinen et al. [13]. The basic form of the FastICA algorithm is as follows:
Step 1: Choose the number m of independent components to be estimated and let the iteration number p ← 1;
Step 2: Choose a random initial weight vector Wp;
Step 3: Let Wp = E{y g(Wp^T y)} − E{g'(Wp^T y)} Wp, with the nonlinear function g: g(y) = y exp(−y²/2);
Step 4: Let Wp = Wp − Σ_{j=1}^{p−1} (Wp^T Wj) Wj;
Step 5: Let Wp = Wp / ||Wp||;
Step 6: If not converged, go back to Step 3;
Step 7: Let p = p + 1; if p ≤ m, go back to Step 2.
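The deflation-based FastICA loop above translates almost line for line into NumPy. The sketch below assumes the input has already been whitened (e.g., via the KPCA step of KICA) and uses the contrast function g(y) = y·exp(−y²/2) quoted in Step 3; it is an illustrative re-implementation, not the authors' code.

```python
import numpy as np

def g(u):
    return u * np.exp(-u ** 2 / 2.0)

def g_prime(u):
    return (1.0 - u ** 2) * np.exp(-u ** 2 / 2.0)

def fastica_deflation(Y, m, max_iter=200, tol=1e-6, seed=0):
    """Y: whitened data matrix of shape (dim, n_samples). Returns W of shape (m, dim)."""
    rng = np.random.default_rng(seed)
    dim, _ = Y.shape
    W = np.zeros((m, dim))
    for p in range(m):
        w = rng.normal(size=dim)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            wy = w @ Y                                         # projections w^T y
            w_new = (Y * g(wy)).mean(axis=1) - g_prime(wy).mean() * w   # Step 3
            w_new -= W[:p].T @ (W[:p] @ w_new)                 # Step 4: decorrelate
            w_new /= np.linalg.norm(w_new)                     # Step 5: normalise
            if abs(abs(w_new @ w) - 1.0) < tol:                # Step 6: convergence test
                w = w_new
                break
            w = w_new
        W[p] = w
    return W

# usage: the independent sources are recovered as s = W @ y for whitened data y
```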
Through the above approach, we can calculate the unmixing matrix Wp and get a group of independent sources s. Then, the sample data in the feature space F can be denoted by a linear combination of s. The coefficients of the linear combination, that is the projection coefficients of y on s, can be used as the new data set Ω to be classified and recognized.
3 Dynamic Selective Neural Network Ensemble 3.1 Neural Network Ensemble
Neural network ensemble is a learning paradigm where a collection of a finite number of neural networks is trained for the same task. It originates from Hansen and Salamon's work [11], which shows that the generalization ability of a neural network system can be significantly improved through ensembling a number of neural networks. In general, a neural network ensemble is constructed in two steps, i.e. training a number of component neural networks and then combining the component predictions. Krogh and Vedelsby proved that increasing the diversity of the component networks can effectively reduce the generalization error of the neural network ensemble [14]. The diversity of the component neural networks is usually increased by the following methods: forming component neural networks with different structures, or using the most prevailing approaches, Bagging and Boosting, to generate different training sets. As for combining the predictions of the component neural networks, the most prevailing approaches are plurality voting or majority voting for classification tasks. Existing research results show that an appropriate selection of the component neural networks can reduce the generalization error of the neural network ensemble [15].
3.2 Selective Neural Network Ensemble
Suppose the task is to use an ensemble comprising N component neural networks to approximate a function f: R^m → C, where C is the set of class labels, and the predictions of the component networks are combined through majority voting. For convenience of discussion, we assume that C contains only two class labels, i.e. the function to be approximated is f: R^m → {−1, 1}. Then suppose there are m instances; the expected output is D = [d1, d2, …, dm]^T, where dj denotes the expected output on the j-th instance, and the actual output of the i-th component network fi on those instances is [fi1, fi2, …, fim]^T, where fij denotes the actual output of the i-th component network on the j-th instance. Thus the generalization error of the i-th component neural network on those m instances is:

Ei = (1/m) Σ_{j=1}^{m} Error(fij, dj).    (2)
According to the theoretical analysis, suppose the k-th component network is excluded from the ensemble; we can derive that if Eq. (3) is satisfied then ensembling the remaining neural networks is better than ensembling all of them:

Σ_{j ∈ {j : |Sum_j| ≤ 1}} Sgn((Sum_j + f_kj) d_j) ≤ 0    (3)
It is obvious that there are cases where Eq. (3) is satisfied. Details of the proving process can be found in the previous work [15].
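Eq. (3) can be checked numerically. Reading Sum_j as the summed vote of the other N−1 networks on instance j (an assumption based on the form of the inequality), the condition inspects only the instances where the k-th vote can decide the majority. A small illustrative sketch, not taken from [15], follows.

```python
import numpy as np

def better_without(F, d, k):
    """F: (N, m) component outputs in {-1, +1}; d: (m,) true labels in {-1, +1}.
    Checks the Eq. (3) style condition for dropping network k, reading Sum_j
    as the summed vote of the other N-1 networks (an assumption, see text)."""
    Sum = F.sum(axis=0) - F[k]          # vote of the remaining N-1 networks per instance
    critical = np.abs(Sum) <= 1         # instances where network k can decide the outcome
    lhs = np.sign((Sum[critical] + F[k, critical]) * d[critical]).sum()
    return lhs <= 0

F = np.array([[1, 1, -1, 1], [1, -1, -1, 1], [-1, -1, 1, 1]])
d = np.array([1, 1, -1, 1])
print([better_without(F, d, k) for k in range(F.shape[0])])
```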
3.3 Dynamic Selective Neural Network Ensemble
In the traditional approach the component neural networks selected for the ensemble are fixed, and it is impossible to use all neural networks at any time. Such methods easily get stuck in local optimal solutions and involve rather complex computations, among other deficiencies. To solve this problem, we propose a new approach named dynamic selective neural network ensemble (DSNNE). The DSNNE algorithm can be described as:
Step 1: Divide the data set Ω into three parts: a learning sample set L, a validating sample set V and a testing sample set T. Then train N component neural networks independently using the learning sample set L;
Step 2: Calculate the generalization error Eij, which denotes the generalization error of the i-th component neural network on the j-th instance of the validating sample set V according to Eq. (2), and collect the generalization error information into a performance matrix Λ;
Step 3: Suppose the testing sample is t ∈ T, and the instances of V may be denoted as x^v = {x1^v, x2^v, …, xA^v}. We adopt the K-nearest neighbor algorithm to search for the K samples in the validating sample set V that are nearest to the testing sample t by Eq. (4):

d_tv = d(x^v, x^t) = √( Σ_{j=1}^{A} (x_j^v − x_j^t)² ).    (4)
Step 4: Predict the generalization errors of all the component neural networks on the testing sample t by Eq. (5), according to the performance matrix Λ and the Euclidean distances d_tv1, d_tv2, …, d_tvK of the testing sample t to the K samples which are nearest to it in the validating sample set V;

E_it = [ Σ_{k=1}^{K} (1/d_tvk) E_ivk ] / [ Σ_{k=1}^{K} (1/d_tvk) ],    (5)
Step 5: Standardize E_it by Eq. (6):

E'_it = E_it / Σ_{i=1}^{N} E_it,    (6)

Then the E'_it are used to comprise the predicted performance matrix Λ*, which contains the generalization error information of all component neural networks on every testing sample;
Step 6: Given a threshold λ (a general value is 1/N), according to the predicted performance matrix Λ* and the different testing samples, we dynamically select the component neural networks whose generalization errors are within the threshold λ on the testing sample to comprise the corresponding neural network ensemble. Lastly, we get the predictions of the component neural networks through majority voting by Eq. (7), where f_i(t) denotes the actual output of the i-th component neural network on the t-th testing sample,

Y = arg max_{c ∈ C} Σ_{i : f_i(t) = c, E'_it ≤ λ} 1.    (7)
Dynamic selective neural network ensemble realizes “Concrete analysis of concrete problems”, that is to say, DSNNE can dynamically select the corresponding neural network ensemble which fits the corresponding testing sample best from all neural networks at any time.
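The six steps above amount to storing the per-network errors on the validation set, transferring them to a test point through its K nearest validation neighbours, and letting only the networks below the threshold vote. A compact sketch of this flow is given below; the model interface and names are illustrative assumptions, and the fallback to the full ensemble when no network passes the threshold is a safeguard not specified in the paper.

```python
import numpy as np

def dsnne_predict(models, X_val, E_val, x_test, K=4, lam=None):
    """models: list of N trained classifiers with a sklearn-style .predict().
    E_val:  (N, |V|) performance matrix of 0/1 errors on the validation set.
    Returns the majority-vote label of the dynamically selected sub-ensemble."""
    N = len(models)
    lam = 1.0 / N if lam is None else lam
    # Step 3: K nearest validation samples to the test point (Euclidean distance)
    dist = np.sqrt(((X_val - x_test) ** 2).sum(axis=1))
    knn = np.argsort(dist)[:K]
    w = 1.0 / np.maximum(dist[knn], 1e-12)
    # Step 4: distance-weighted prediction of each network's error (Eq. 5)
    E_t = (E_val[:, knn] * w).sum(axis=1) / w.sum()
    # Step 5: standardise so the predicted errors of all networks sum to one (Eq. 6)
    E_t = E_t / max(E_t.sum(), 1e-12)
    # Step 6: only networks whose predicted error is within the threshold vote (Eq. 7)
    selected = [i for i in range(N) if E_t[i] <= lam] or list(range(N))
    votes = [int(models[i].predict(x_test.reshape(1, -1))[0]) for i in selected]
    return max(set(votes), key=votes.count)
```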
4 Fault Diagnosis of Steam Turbine Based on KICA and DSNNE
The data used in this paper is the historical failure data set of a 300 MW steam turbo-generator unit. After spectrum analysis and normalization of the vibration signals obtained from the detection system, we show the frequency spectrum distributions of the typical failures of the steam turbine in Table 1, where f is the basic frequency. The number of fault data records is 94. The numbers of samples belonging to fault types 1, 2, 3 and 4 are 23, 23, 23 and 25, respectively. Table 2 shows the specific meaning of the typical failures.
Table 1. Data of the types and characteristics of steam turbine failures

      The frequency distributions of steam turbine failures
No.   (0-0.39)f  (0.4-0.49)f  0.5f    (0.51-0.99)f  1f      2f      (3-5)f  odd*f   >5f     Type  Output
1     0.0027     0.0014       0.0082  0.0155        0.9235  0.0687  0.0404  0.0459  0.0045  1     1000
2     0.1193     0.0187       0.0079  0.1322        0.4925  0.0192  0.0440  0.0262  0.0737  2     0100
3     0.0036     0.0036       0.0048  0.0070        0.5853  0.1505  0.1140  0.1336  0.0305  3     0010
4     0.0267     0.2049       0.3946  0.1966        0.0632  0.0865  0.0265  0.0188  0.0052  4     0001
...
94    0.0132     0.2339       0.4880  0.0636        0.0994  0.0384  0.0278  0.0278  0.0079  4     0001

Table 2. Typical faults of steam turbine

Fault serial number  Fault type                                  The output vectors
1                    Rotor imbalance fault                       1000
2                    Static and dynamic gouging abrasion fault   0100
3                    Rotor crack fault                           0010
4                    Rotor loose fault                           0001
In order to verify that the proposed KICA-DSNNE method still gives good diagnostic results with smaller learning samples, we sample the above data in a stratified random way, select 16 samples as the training sample set L, another 16 samples as the validating sample set V, and the remaining 62 samples as the testing sample set T. Firstly, the literature [8] pointed out that the KICA algorithm is robust to parameters set in the light of experience and gave the scope of the related parameters and their impact on the results. In this paper, we select the Gaussian kernel function, i.e., Eq. (8) with σ = 1,

K(xi, xj) = exp(−||xi − xj||² / (2σ²)).    (8)
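For reference, the kernel of Eq. (8) is evaluated pairwise over the samples to build the Gram matrix on which the KPCA/whitening step of KICA operates. A one-function NumPy sketch, with σ = 1 as stated above:

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) for the rows of X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-sq / (2.0 * sigma ** 2))
```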
In the KICA algorithm, we set the regularization parameter κ = 0.02, the number of restarts r = 2 and the number of estimated independent components m = 5. After feature extraction with KICA on the original fault data, we form the new sample data set Ω and obtain the corresponding new training sample set L, new validating sample set V and new testing sample set T, which are displayed in Table 3, where the records numbered 1 to 16 are L, 17 to 32 are V and 33 to 94 are T. Then, we train 20 BP neural networks independently on L. Every BP network has three layers, where the number of neurons in the input layer equals the number of input variables and the number of neurons in the hidden layer is taken as n1 = √(m + n) + a, where n1 is the number of neurons in the hidden layer, m is the number of neurons in the output layer and n is that of the input layer. The parameter a is a positive integer from 1 to 10. In this paper, n1 equals 6 + x (x = 1, 2, …, 5), the transfer function of the neurons in the hidden layer is the S-tangent function tansig, and that of the output layer is logsig.
Table 3. The sample dataset Ω after features extraction

NO.   e1       e2        e3        e4        e5        Output vectors
1     0.0201   0.0093    0.0103    0.2826    -0.0023   1000
...
16    0.0167   0.0027    -0.0535   -0.0067   -0.0230   0001
17    0.0148   0.0035    0.0055    0.2172    -0.0019   1000
...
32    0.0131   -0.0015   -0.0289   -0.0231   -0.0176   0001
33    0.0185   0.0028    0.0055    0.2869    -0.0071   1000
...
94    0.0189   0.0022    -0.0488   -0.0070   -0.0273   0001
The training functions of the BP neural networks are the BFGS quasi-Newton algorithm, the Bayesian regularization algorithm, the gradient descent algorithm with adaptive learning rate and momentum, and the Levenberg-Marquardt algorithm. This ensures that the 20 BP networks differ in initial weights, structure and training function. The learning rate of the BP algorithm is 0.001 and the objective error at the end of training is 0.001. During training, if the generalization error of a BP network does not change in 5 consecutive epochs, the training of the neural network is terminated to avoid the problem of overfitting.
In the K-NN algorithm, K equals 4 and the threshold λ of generalization errors equals 0.05. Then, after 10 test runs, we found that the correct rate of fault diagnosis using KICA-DSNNE is 100%. For further study, we compare the traditional method NNE (no feature extraction for the original fault samples; the union of L and V is used as the learning sample set for the BP networks and all predictions are combined through majority voting) and KICA-NNE (the only difference from NNE is that feature extraction using KICA is applied to the original fault samples) with the KICA-DSNNE proposed in this paper. We still follow the above construction method to train 20 different BP neural networks, and randomly select different ensemble sizes of 3, 5, 7, 9, 11, 13, 15, 18 and 20, i.e. 9 kinds of ensemble scales in total, from all 20 different BP neural networks to ensemble. We use the above three different ensemble methods for simulation research at the different ensemble sizes. The tests of the different methods at the different sizes were carried out 10 times each. We choose the average rate of correctness as the evaluation index in Fig. 1.
As can be seen from Fig. 1, with increasing ensemble size, the correct rate of fault diagnosis using NNE rises; when N=18, it reaches 100%. But the efficiency of fault diagnosis is low because of the higher dimension of the input data and the larger ensemble size. The correct rate of KICA-NNE reaches 100% when N=13. The learning efficiency and accuracy of fault diagnosis are improved, and the ensemble size has also been reduced as a result of the feature extraction using KICA on the fault data. Therefore, the KICA-NNE method is superior to NNE. However, the above two methods both involve non-selective synthesis, and not all of the integrated neural networks make positive contributions to the correct rate. KICA-DSNNE can dynamically select the corresponding neural network ensemble which fits the corresponding testing sample best from all neural networks at any time. In Fig. 1, we can see that its correct rate reaches 100% when N=9. It should also be pointed out that without the neural network ensemble learning method, the average correct rate of fault diagnosis using a single BP neural network is only 84.29%. This shows that the approach we propose in this paper is effective.
Fig. 1. The comparison of correct rate of three methods
5 Conclusions
(1) In fault diagnosis of steam turbines, feature extraction from a large number of features is very necessary, so as to reduce the noise and improve the efficiency and accuracy of the neural networks.
(2) In this paper, we use the KICA algorithm to eliminate the high-order correlation of the data, remove feature redundancy and map the input pattern space to the corresponding independent component space. The classifiers gain very strong generalization ability from the high-order independence of the independent components.
(3) The generalization ability of a neural network can be improved only to a limited extent by improving the performance of an individual neural network, so the ensemble learning method in fault diagnosis is bound to become a very important research direction.
(4) The simulation results in fault diagnosis of steam turbine show that the proposed approach gives promising results on performance and has higher accuracy, better stability and certain practicability. If the component neural networks use a parallel learning method, it will greatly improve the efficiency of fault diagnosis using the neural network ensemble.
References 1. Yu, H.J., Chen, M.Z., Zhang, S.: Intelligent diagnosis based on neural networks. Metallurgy Industry, Beijing (2002) 2. Hu, S.S., Zhou, C., Wang, Y.: Pattern recognition for composite fault based on wavelet neural networks. Acta Automatica Sinica 28, 540–543 (2002) 3. Li, D.H., Liu, H.: Method and application of fault diagnosis based on probabilistic neural network. Systems Engineering and Electronics 26, 997–999 (2004) 4. Ling, W.Y., Jia, M.P.: Optimizing strategy on rough set neural network fault diagnosis system. Proceedings of the CSEE 23, 98–102 (2003) 5. Yang, Y., Yu, D.J., Cheng, J.S.: Roller Bearing Fault Diagnosis Method Based on EMD and Neural Network. Journal of Vibration and Shock 24, 85–88 (2005) 6. Hou, G.L., Sun, X.G., Zhang, J.H.: Research on fault diagnosis of condenser via nonlinear principal component analysis and probabilistic neural networks. Proceedings of the CSEE 25, 104–108 (2005) 7. Li, Y.W., Han, X.D., Wang, Z.Y.: The temperature variation fault diagnosis of highvoltage electric equipment based on information fusion. In: Proceedings of 2008 International Conference on Machine Learning and Cybernetics, pp. 127–130. IEEE Press, Kunming (2008) 8. Bach, F.R., Jordan, M.I.: Kernel Independent Component Analysis. Journal of Machine Learning Research 3, 1–48 (2002) 9. Comon, P.: Independent Component Analysis-A New Concept. Signal Processing 36, 287–314 (1994) 10. Schölkopf, B., Smola, A., Müller, K.: Non-linear component analysis as a kernel eigenvalues problem. Neural Computation 10, 1299–1319 (1998) 11. Hansen, L.K., Salamon, P.: Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 993–1001 (1990)
12. Yang, J., Gao, X.M., Zhang, D.: Kernel ICA: An alternative formulation and its application to face recognition. Pattern Recognition 38, 1784–1787 (2005) 13. Hyvärinen, A.: Fast and Robust Fixed-Point Algorithms for Independent Component Analysis. IEEE Transactions on Neural Networks 10, 626–634 (1999) 14. Krogh, A., Vedelsby, J.: Neural network ensembles, cross validation, and active learning. MIT Press, Cambridge (1995) 15. Zhou, Z.H., Wu, J.X., Tang, W.: Ensembling neural networks: many could be better than all. Artificial Intelligence 137, 239–263 (2002)
A Neural Network Model for Evaluating Mobile Ad Hoc Wireless Network Survivability Tong Wang1,2 and ChuanHe Huang1 2
1 School of Computer, Wuhan University, Wuhan 430072, China School of Computer, Hubei University of Economics, Wuhan 430205, China
[email protected]
Abstract. Due to the self-organized, unstable features of MANETs, how to ensure their survivability becomes more and more important. This paper proposes a survivability evaluation model for MANETs. A Monte Carlo simulation algorithm and an artificial neural network computing model are also proposed to calculate the survivability of a MANET based on the survivability evaluation model. Computational results show that the ANN model is an effective approach to evaluate the survivability compared with the MC method. Keywords: MANET, Neural Network, Monte Carlo simulation, Survivability, Two-terminal reliability, Evaluation.
1 Introduction
Traditionally, survivability in network systems has been defined as a "property of a system, subsystem, equipment, process, or procedure that provides a defined degree of assurance that the named entity will continue to function during and after a natural or man-made disturbance" [1]. The first step to ensure the survivability of network systems is to evaluate it. Unfortunately, the survivability concept does not refer to a measurable sense, so evaluation is not mathematically well defined. Most research effort in this area uses its qualitative and quantitative attributes, especially the reliability, availability, and fault-tolerance attributes that can be statistically modeled using the parameters of MTTF, MTTR, MTBF, failure rate, repair rate, and fault-coverage, to evaluate the survivability of network systems [2, 3].
Mobile wireless networks are attracting increasing interest due to the possibility of ubiquitous communications they offer. In particular, mobile ad hoc networks (MANETs) enable users to maintain connectivity to the fixed network or exchange information when no infrastructure, such as a base station or an access point, is available. As mobility plays a crucial role in a MANET, relative node movements can break links and thus change the topology. The underlying topology control algorithm and routing protocol will take a while to adapt. During this period, packets can be lost and the survivability of the network will degrade.
As with any system, the first step to ensuring survivability of a MANET is the ability to determine survivability. However, existing methods for network and system
survivability are insufficient to meet this need. The mobility of the nodes requires a modified approach that accommodates a dynamic network configuration. The rest of this paper is organized as follows: Section 2 describes research done in the area of network survivability for wireless networks. Section 3 discusses our MANET survivability model, which integrates node mobility features, and then proposes the MANET survivability calculation method. Sections 4 and 5 propose an enumeration algorithm and a MC algorithm, respectively. In Section 6, we give a neural network method to solve the survivability calculation problem. The computational analysis is given in Section 7, where these different methods are compared to explore the validity of the MANET survivability evaluation. Finally, Section 8 concludes with a summary of the work presented, its contributions and the future areas of interest for this research work on MANET survivability.
2 Related Work
The authors of [4] evaluate the reliability of WMNs using a stochastic link failure model. [5, 6] consider the two-terminal reliability based on the hop count of the path; the reliability degrades with increasing hop count between nodes. However, they do not take the multi-path phenomenon into consideration or perform the survivability evaluation under different scenarios; also, the topology change frequency can affect the survivability and is not evaluated in these papers. [2] uses attributes of cellular networks such as dropping probability, blocking probability, availability and voice quality as a survivability index (SI) to estimate the overall survivability. [7] argues that to maximize the network survivability, the energy efficiency of paths must be taken into account for route selection; the authors present a static single-path routing algorithm, which uses one energy-efficient path for each communicating peer throughout the network lifetime, eliminating the overhead of multi-path routing. [8] describes the survivability of mobile networks under base station failure.
3 Survivability Model for MANET

3.1 Assumptions

1) Nodes are connected if they are neighbors.
2) Nodes move randomly according to the random waypoint mobility model (RWMM) [9].
3) The capacity of every link is binary and homogeneous; that is, a link either exists at a specified capacity or it does not exist.
4) Channel and radio resources of the MANET are sufficient, so a link can be built whenever two nodes enter each other's radio coverage.

3.2 Model Definition

Definition 1: A MANET G = ⟨N, A⟩ consists of a set N of m mobile nodes and a set A of n bi-directed arcs. Each arc e is in either of two states, good or failed. Arc e operates (is good) independently of the other arcs with probability p_e and fails with probability q_e = 1 − p_e.

Definition 2: For a specified source node s and destination node t, an s−t path is a set of arcs whose operation ensures the connection from s to t. If no proper subset of an s−t path is an s−t path, it is an s−t path-set. Conversely, an s−t cut is a set of arcs whose failure interrupts the connection from s to t. If no proper subset of an s−t cut is an s−t cut, it is an s−t cut-set. The survivability R_st is the probability of connection from s to t.

Definition 3: Let {D_1, D_2, ..., D_n} and {C_1, C_2, ..., C_n} be the sets of all s−t path-sets and all s−t cut-sets of the network G = ⟨N, A⟩, respectively. Let E_i be the event that all arcs in s−t path-set D_i operate, and F_j be the event that all arcs in s−t cut-set C_j fail [10].
3.3 Survivability Calculation Method

Link existence is governed by node mobility, which in this paper is modeled with the RWMM. Implementations of this mobility model provide the average number of neighbors per node, denoted h, so the probability of link existence can be calculated as

$V_{ij} = \gamma = \frac{h}{m-1}$   (1)

where γ can be considered equal for all node pairs in the RWMM MANET; when h amounts to m − 1, γ reaches its maximum.
The existence of a link in G = ⟨N, A⟩ is probabilistic, and the number of potential network configurations is $C = 2^{m(m-1)/2}$. That is, the permutations of existing and non-existing links generate the set of all possible network configurations W_i (i = 1, 2, ..., C). The probability of each configuration existing is, in turn, a function of the link probability of existence γ, the number of linked node pairs η_l, and the number of unlinked pairs η_u in the configuration. The probability associated with each possible configuration is

$P(\alpha_{\kappa} = W_i) = \gamma^{\eta_l} (1-\gamma)^{\eta_u}$   (2)
We can obtain the path-sets D between the source and the destination. Using the principle of inclusion and exclusion, the two-terminal reliability R_st, based on the set of all s−t path-sets, can be expressed as follows.
$R_{st} = \Pr(E_1 \cup E_2 \cup \cdots \cup E_n) = \sum_{i} \Pr(E_i) - \sum_{i<j} \Pr(E_i \cap E_j) + \sum_{i<j<l} \Pr(E_i \cap E_j \cap E_l) - \cdots + (-1)^{n+1} \Pr(E_1 \cap E_2 \cap \cdots \cap E_n)$   (3)
So the average survivability is given by

$\bar{R} = \sum_{i=1}^{C} R_{st}\, P(\alpha_{\kappa} = W_i) = E[R_{st}]$   (4)
4 Enumeration Approach

As in every network where reliability needs to be computed, a logical first step is to consider complete enumeration of the possible states of the network. The approach that follows enumerates all possible configurations that a MANET can take; each configuration is then assigned a probability of existence based on the RWMM. The method proceeds as follows:

Initialization: Define m and γ.
Step 1: Enumerate all possible configurations of G = ⟨N, A⟩ and stack them in a set S.
Step 2: Determine P(α_κ = W_i) based on (2).
Step 3: For k = 1, ..., C, obtain R_st from set S based on (3) and the path-set calculation method in [11, 12].
Step 4: Apply (4) to calculate R̄.

It is important to acknowledge that the enumeration method can only be employed for networks of relatively small size, since the number of possible configurations increases exponentially with the number of mobile nodes. Thus, there is a need to develop faster computational procedures that can provide an accurate approximation for complex MANETs.
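As an illustration only, the following minimal Python sketch enumerates all configurations of a very small network and accumulates (2) and (4). For a fixed binary configuration the sketch reduces R_st to an s−t connectivity check; the paper itself computes R_st from path-sets via (3). All function names and example values below are illustrative and are not part of the original implementation.

```python
# Sketch of the enumeration approach (Sections 3.3-4) for a tiny MANET.
from itertools import combinations, product

def st_connected(active_links, m, s, t):
    # Simple graph search over the links present in one configuration.
    adj = {i: set() for i in range(m)}
    for i, j in active_links:
        adj[i].add(j); adj[j].add(i)
    seen, stack = {s}, [s]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt); stack.append(nxt)
    return t in seen

def average_survivability(m, gamma, s=0, t=None):
    t = m - 1 if t is None else t
    pairs = list(combinations(range(m), 2))          # all m(m-1)/2 potential links
    r_bar = 0.0
    for mask in product([0, 1], repeat=len(pairs)):  # all 2^(m(m-1)/2) configurations
        links = [p for p, bit in zip(pairs, mask) if bit]
        eta_l = len(links)                           # linked pairs in this configuration
        eta_u = len(pairs) - eta_l                   # unlinked pairs
        prob = gamma ** eta_l * (1.0 - gamma) ** eta_u   # Eq. (2)
        if st_connected(links, m, s, t):             # R_st of a binary configuration
            r_bar += prob                            # Eq. (4): expectation over configurations
    return r_bar

print(average_survivability(m=5, gamma=0.6))
```

Even for m = 5 the loop visits 2^10 = 1024 configurations, which illustrates why the exact enumeration quickly becomes impractical as m grows.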
5 Monte Carlo Simulation Methods

The enumeration method provides an exact solution for R_st, yet it becomes computationally expensive for the analysis of large networks. Thus, we use Monte Carlo simulation to solve the problem.

Input: number of nodes m; link configuration matrix of size m × m, where element (i, j) is denoted m_{i,j} and, since links are bi-directional, m_{i,j} = m_{j,i}; probability of link existence γ; speed of nodes v.

Func SurvMC {
  Initiate γ based on v using the RWMM
  Count = 5000; R = 0
  While i < Count {
    Generate a random configuration W_i by setting each m_{i,j} to 1 with probability γ and to 0 otherwise
    Obtain P(α_κ = W_i) based on (2)
    Obtain R_st based on (3) and the path-set calculation method in [11, 12]
    R = R + R_st
  } End while
  R = R / 5000
} End SurvMC
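A self-contained Python sketch of the same idea is given below; as in the enumeration sketch, R_st of a sampled binary configuration is reduced to an s−t connectivity check for illustration, and the helper and argument names are assumptions of this sketch rather than the paper's implementation.

```python
# Illustrative Monte Carlo estimator in the spirit of SurvMC (Section 5).
import random
from itertools import combinations

def _st_connected(links, m, s, t):
    adj = {i: set() for i in range(m)}
    for i, j in links:
        adj[i].add(j); adj[j].add(i)
    seen, stack = {s}, [s]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt); stack.append(nxt)
    return t in seen

def surv_mc(m, gamma, s=0, t=None, count=5000, seed=0):
    rng = random.Random(seed)
    t = m - 1 if t is None else t
    pairs = list(combinations(range(m), 2))
    hits = 0
    for _ in range(count):
        links = [p for p in pairs if rng.random() < gamma]  # one random configuration
        hits += _st_connected(links, m, s, t)
    return hits / count  # estimate of the average survivability

print(surv_mc(m=20, gamma=0.3))
```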
6 Neural Network Approach An ANN has the ability to learn relationships between given sets of input and output data by changing the weights. This process is called training the ANN. One of the most important properties of a trained ANN is its ability to generalize, which means that ANN can generate a satisfactory set of outputs from inputs that are not used during its training process. We use the following metrics as the candidate set of inputs for our neural network: number of nodes m , number of links NL , node degree of each node ND (0 if not present), average node degree of MANET h , link probability γ , node mobility v , survivability upper bound UB . So five input configurations were studied:
1) ND, γ, UB
2) ND, h, NL, γ, v, UB
3) ND, h, NL, v, UB
4) m, γ, v, UB
5) h, γ, v, UB
We now describe the ANN method for networks of 10 to 20 nodes. Twenty input neurons are reserved for node degrees (to accommodate networks of up to 20 nodes). For example, when data are sampled from a network with 10 nodes, the node degrees are assigned to the first 10 input neurons, and the remaining 10 input neurons are set to zero. There are 22 input neurons for the first configuration, 25 for the second configuration, 24 for the third configuration, and 4 for the fourth and fifth configurations. This topology representation uses up to 24 input neurons for a network with 20 nodes. The upper bound of each network topology and the exact network reliability were calculated to use as an input and as the target output, respectively. The output of the ANN is the result of MC simulation of two-terminal network reliability (one real-valued neuron).

We used randomly generated data sets for training and validation, considering five different link probabilities (0.80, 0.82, 0.85, 0.90, and 0.95) and five different mobility speed values (10, 20, 30, 40, 50), so that there are 25 design points. An equal number of network topologies was generated for each design point. Networks of 10, 15, and 20 nodes were generated, ranging from minimally connected to fully connected. The number of hidden neurons and the training data size were set to 15 and 3000, respectively. The model was validated using five-fold cross-validation [13], where each validation network was trained and tested using 2400 and 600 observations, respectively. A final application network was trained using all members of the data set, i.e., 3000 observations, and its validation was inferred using the average of the prediction errors of the five validation networks. The five validation ANNs each used 4/5 of the data set for training (2400 observations) and the remaining 1/5 (600 observations) for testing, where the testing set changed with each validation ANN.

Assume that there is a population F from which a random sample T_n of size n is drawn, T_n = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where t_i = (x_i, y_i) is a realization drawn from F. The grouped cross-validation estimate of root mean squared error (RMSE) for the application ANN is

$RMSE = \sqrt{\frac{1}{3000} \sum_{g=1}^{5} \sum_{h=1}^{600} \left( y_{(g-1)\cdot 600+h} - f\left[T_{(g)}, x_{(g-1)\cdot 600+h}\right] \right)^{2}}$   (5)
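The grouped cross-validation estimate in (5) can be computed with a few lines of generic code; the sketch below assumes stand-in training and prediction callables (train_ann, predict) and is not tied to any particular network library.

```python
# Sketch of the grouped five-fold RMSE of Eq. (5): 3000 observations split into
# five folds of 600; each fold is predicted by a model trained on the other 2400.
import numpy as np

def grouped_cv_rmse(X, y, train_ann, predict, n_folds=5):
    n = len(y)                      # 3000 observations in the paper
    fold = n // n_folds             # 600 observations per fold
    sq_err = 0.0
    for g in range(n_folds):
        test = np.arange(g * fold, (g + 1) * fold)
        train = np.setdiff1d(np.arange(n), test)
        model = train_ann(X[train], y[train])
        sq_err += np.sum((y[test] - predict(model, X[test])) ** 2)
    return np.sqrt(sq_err / n)      # Eq. (5)
```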
Table 1 gives the five-fold results in root mean squared error (RMSE) for the ANN model. It can be seen that the ANN estimations always improve upon the upper bound.

Table 1. Five-fold cross-validation results for MANET

Fold      RMSE training   RMSE testing   RMSE upper-bound
1         0.04534         0.05872        0.08080
2         0.04332         0.06524        0.09000
3         0.04536         0.07022        0.09980
4         0.04234         0.07166        0.09992
5         0.04736         0.04532        0.07532
Average   0.04538         0.06264        0.08672
7 Computational Results

We use MATLAB 7.0 to analyze our ANN model; the comparison results are presented in Fig. 1:
Fig. 1. Network survivability evaluation from the three different methods
8 Conclusion

For networks of larger size, where target reliabilities using backtracking [14] or another exact method are not practical, Monte Carlo simulation can be substituted. Network reliability can be accurately estimated by using many replications of Monte Carlo simulation for each network topology in the data set available for training/testing. While this is still computationally burdensome, it is feasible, and need only be done for the relatively small training/testing data set. The computational work shows that the ANN model is equivalent or superior in estimation accuracy to the other methods for estimating MANET survivability. Later research efforts should apply this method to larger, actual networks.
Acknowledgment The work was supported by the National Natural Science Foundation of China under the grant No.60633020.
References
1. Srivaree-Ratana, C., Konak, A., Smith, A.E.: Estimation of all-terminal network reliability using an artificial neural network. Computers & Operations Research 29, 849–868 (2002)
2. Purohit, N., Tokekar, S.: A new measure of survivability for a cellular network. In: Fourth International Conference on Wireless Communication and Sensor Networks, pp. 201–205 (2008)
3. Song, H., Yong, X., Ling, Z.: Study of network survivability based on multi-path routing mechanism. Science in China Series F-Information Sciences 51, 1898–1907 (2008)
4. Egeland, G., Engelstad, P.E.: The Availability and Reliability of Wireless Multi-Hop Networks with Stochastic Link Failures. IEEE Journal on Selected Areas in Communications 27, 1132–1146 (2009)
5. Cook, J.L., Ramirez-Marquez, J.E.: Two-terminal reliability analyses for a mobile ad hoc wireless network. Reliability Engineering & System Safety 92, 821–829 (2007)
6. Cook, J.L., Ramirez-Marquez, J.E.: Mobility and reliability modeling for a mobile ad hoc network. IIE Transactions 41, 23–31
7. Bejerano, Y., Seung-Jae, H., Keon-Taek, L., Kumar, A.: Single-path routing for life time maximization in multi-hop wireless networks. In: 33rd IEEE Conference on Local Computer Networks, pp. 160–167 (2008)
8. Sangjoon, P., Jiyoung, S., Byunggi, K.: A survivability strategy in mobile networks. IEEE Transactions on Vehicular Technology 55, 328–340 (2006)
9. Camp, T.: A survey of mobility models for ad hoc network research. Wireless Communication & Mobile Computing 2, 483–502 (2002)
10. Jane, C.C., Shen, W.H., Laih, Y.W.: Practical sequential bounds for approximating two-terminal reliability. European Journal of Operational Research 195, 427–441 (2009)
11. Bansal, V.K., Misra, K.B., Jain, M.P.: Minimal pathset and minimal cutsets using search technique. Microelectronics Reliability 22, 1067–1075 (1982)
12. Samad, M.A.: An efficient method for terminal and multiterminal pathset enumeration. Microelectronics Reliability 27, 443–446 (1987)
13. Twomey, J.M.: Bias and variance of validation methods for function approximation neural networks under conditions of sparse data. IEEE Transactions on Systems, Man and Cybernetics-Part C: Applications and Reviews 28, 417–430 (1998)
14. Ball, M.O., Van Slyke, R.M.: Backtracking algorithms for network reliability analysis. Annals of Discrete Mathematics 1, 49–64 (1977)
Ultra High Frequency Sine and Sine Higher Order Neural Networks Ming Zhang Christopher Newport University, Newport News, VA 23606, USA
[email protected]
Abstract. New nonlinear models of Ultra High Frequency Sine and Sine Higher Order Neural Networks (USSHONN) are presented in this paper. A new learning algorithm for USSHONN is also developed in this study, and a time series data analysis system, the USSHONN simulator, is built based on the USSHONN models. Test results show that USSHONN models are 4.4457% to 9.0276% better than Polynomial Higher Order Neural Network (PHONN) and Trigonometric Higher Order Neural Network (THONN) models. For time series data simulation, the error rate that USSHONN can reach, with order 6 and 100,000 epochs, is 0.0000%. Keywords: Ultra High Frequency, Sine and Sine, Higher Order Neural Networks, Polynomial Higher Order Neural Networks, Trigonometric Higher Order Neural Networks.
1 Introduction and Motivations

Many studies use traditional artificial neural network models, which are black-box models that do not provide users with a function describing the relationship between the input and output. The first motivation of this paper is to develop nonlinear "open box" neural network models that provide a rationale for the network's decisions and also provide better results. The Nobel Prize in Economics in 2003 rewarded two contributions: nonstationarity and time-varying volatility. These contributions have greatly deepened our understanding of the properties of many economic time series (Vetenskapsakademien [1]). Granger and Bates [2] researched the combination of forecasts. Granger [3] changed the way empirical models of macroeconomic relationships are built by introducing the concept of cointegrated variables. Granger and Weiss [4] showed the importance of cointegration in the modeling of nonstationary economic series. Granger and Lee [5] studied multicointegration. Granger and Swanson [6] further developed multicointegration in the study of cointegrated variables. The second motivation of this paper is to develop a new nonstationary data analysis system, using new-generation computing techniques, that improves the accuracy of data simulation. Zhang, Zhang, and Fulcher [7] studied a HONN group model for data simulation. By utilizing adaptive neuron activation functions, Zhang, Xu, and Fulcher [8] developed a new HONN model, called the neuron-adaptive HONN. Furthermore, HONN models are also capable of simulating higher
frequency and higher-order nonlinear data, thus producing superior data simulations compared with those derived from ANN-based models. Zhang and Fulcher [9] published a book chapter providing detailed mathematics for THONN models, which are used for high-frequency, nonlinear data simulation. Zhang [10] published a HONN book, in which all 22 chapters focus on artificial higher order neural networks for economics and business. Zhang [11] found that HONN can simulate noncontinuous data with better accuracy than SAS NLIN (nonlinear) models. Zhang [12] also developed Ultra High Frequency Trigonometric Higher Order Neural Networks, in which the model details of UCSHONN (Ultra High Frequency Cosine and Sine Higher Order Neural Network) were given, but not the details of USSHONN. Zhang [12] also gave the UCSHONN learning algorithm, but not the USSHONN learning algorithm, and provided experimental results of UCSHONN, but not the running results of USSHONN. The contributions of this paper will be:

• Present the details of a new model - USSHONN (Section 2).
• Based on the USSHONN models, build a time series simulation system - the USSHONN simulator (Section 3).
• Develop the USSHONN learning algorithm and weight update formulae (Section 4).
• Show that USSHONN can do better than PHONN and THONN models, and provide evidence that USSHONN can reach a 0.0000% error rate for simulating data (Section 5).
2 Models of USSHONN

The Nyquist rule says that the sampling rate must be at least twice the highest frequency present in the signal. In simulating and predicting time series data, the new nonlinear models of UTHONN should therefore have twice the frequency of the ultra-high-frequency time series data. To achieve this purpose, a new model should be developed that enforces high frequency in the HONN in order to make the simulation and prediction error close to zero. The new HONN model, the Ultra High Frequency Trigonometric Higher Order Neural Network (UTHONN), includes three different models based on different neuron functions. The Ultra High Frequency Cosine and Sine Trigonometric Higher Order Neural Network (UCSHONN) has neurons with cosine and sine functions. The Ultra High Frequency Cosine and Cosine Trigonometric Higher Order Neural Network (UCCHONN) has neurons with cosine functions. Similarly, the Ultra High Frequency Sine and Sine Trigonometric Higher Order Neural Network (USSHONN) has neurons with sine functions. Except for the functions in the neurons, all other parts of these three models are the same. The following section discusses USSHONN in detail. The USSHONN model structure is shown in Figure 1. The different types of USSHONN models are given as follows: formulas (1), (2), and (3) are for USSHONN models 1b, 1, and 0, respectively. Model 1b has three layers of changeable weights,
Model 1 has two layers of changeable weights, and Model 0 has one layer of changeable weights. For models 1b, 1, and 0, Z is the output while x and y are the inputs of USSHONN. c_kj^o is the weight for the output layer, c_kj^hx and c_kj^hy are the weights for the second hidden layer, and c_k^x and c_j^y are the weights for the first hidden layer. Sine functions form the first-hidden-layer nodes of USSHONN, the nodes of the second hidden layer are multiplication neurons, and the output-layer node of USSHONN is a linear function f^o(net^o) = net^o, where net^o is the input of the output-layer node. USSHONN is an open neural network model: each weight of the HONN has its corresponding coefficient in the model formula, and each node of USSHONN has its corresponding function in the model formula. The structure of USSHONN is built by a nonlinear formula, which means that, after training, there is a rationale for each component of USSHONN in the nonlinear formula. For formulas (1), (2), and (3), the values of k and j range from 0 to n, where n is an integer. The USSHONN model can simulate ultra-high-frequency time series data when n increases to a large number. This property allows the model to easily simulate and predict ultra-high-frequency time series data, since both k and j increase when n increases. Figure 1 shows the USSHONN architecture. This model structure is used to develop the model learning algorithm, which ensures the convergence of learning and allows the difference between the desired output and the real output of USSHONN to be brought close to zero.
Formula (1), USSHONN Model 1b:

$Z = \sum_{k,j=0}^{n} c_{kj}^{o} \{ c_{kj}^{hx} \sin^{k}(k \cdot c_{k}^{x} x) \} \{ c_{kj}^{hy} \sin^{j}(j \cdot c_{j}^{y} y) \}$   (1)

Formula (2), USSHONN Model 1:

$Z = \sum_{k,j=0}^{n} c_{kj}^{o} \sin^{k}(k \cdot c_{k}^{x} x) \sin^{j}(j \cdot c_{j}^{y} y)$, where $c_{kj}^{hx} = c_{kj}^{hy} = 1$   (2)

Formula (3), USSHONN Model 0:

$Z = \sum_{k,j=0}^{n} c_{kj}^{o} \sin^{k}(k x) \sin^{j}(j y)$, where $c_{kj}^{hx} = c_{kj}^{hy} = 1$ and $c_{k}^{x} = c_{j}^{y} = 1$   (3)
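For concreteness, the simplest model (Model 0, formula (3)) can be evaluated with a few lines of code; the weight values used below are illustrative placeholders, since in the paper the weights c_kj^o are learned from data.

```python
# A small numerical sketch of USSHONN Model 0 (Eq. (3)): the output is a
# double sum of sin^k(k*x) * sin^j(j*y) terms weighted by c_kj^o.
import numpy as np

def usshonn_model0(x, y, c_o):
    """c_o is an (n+1) x (n+1) array of output-layer weights c_kj^o."""
    n = c_o.shape[0] - 1
    z = 0.0
    for k in range(n + 1):
        for j in range(n + 1):
            z += c_o[k, j] * np.sin(k * x) ** k * np.sin(j * y) ** j
    return z

rng = np.random.default_rng(0)
print(usshonn_model0(0.72, 0.75, rng.normal(size=(7, 7))))  # order n = 6
```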
Fig. 1. USSHONN architecture: inputs x and y feed a first hidden layer of sine neurons (sin^k(k·c_k^x·x) and sin^j(j·c_j^y·y)), a second hidden layer of multiplication neurons weighted by c_kj^hx and c_kj^hy, and a linear output neuron weighted by c_kj^o
3 USSHONN Time Series Analysis System

The USSHONN simulator is written in the C language and runs under the X Window System on a Sun workstation. A user-friendly graphical user interface (GUI) has also been incorporated. When running the system, any step, data item, or calculation can be reviewed and modified from different windows during processing. Hence, changing data, changing network models, and comparing results can be done easily and efficiently.
4 Learning Algorithm of USSHONN

The learning algorithm of the USSHONN model can be described as follows. Let:

c_k^x = first-hidden-layer weight for input x; k = kth neuron of the first hidden layer
η = learning rate (positive and usually < 1)
E = error
t = training time
i_kj = output from the second hidden layer (= input to the output neuron)
b_k^x and b_j^y = outputs from the first-hidden-layer neurons (= inputs to the second hidden layer)
f_x and f_y = first-hidden-layer neuron activation functions
x and y = inputs to the first hidden layer

The first-hidden-layer weights are updated according to:
$c_k^x(t+1) = c_k^x(t) - \eta\,(\partial E_p / \partial c_k^x)$

The learning algorithm for the first-hidden-layer weight of the x input neuron is

$c_k^x(t+1) = c_k^x(t) - \eta\,(\partial E_p / \partial c_k^x) = c_k^x(t) + \eta \cdot \delta^{ol} \cdot c_{kj}^{o} \cdot \delta^{hx} \cdot c_{kj}^{hx} \cdot \delta^{x} \cdot x$

where

$\delta^{ol} = (d - z)\, f^{o\prime}(net^{o}) = d - z$ (linear output neuron)
$\delta^{hx} = f^{h\prime}(net_{kj}^{h})\, c_{kj}^{hy} b_{j}^{y} = c_{kj}^{hy} b_{j}^{y}$ (linear)
$\delta^{x} = f_x^{\prime}(net_k^{x}) = k^{2} \sin^{k-1}(k \cdot net_k^{x}) \cos(k \cdot net_k^{x})$

The learning algorithm for the first-hidden-layer weight of the y input neuron is

$c_j^y(t+1) = c_j^y(t) - \eta\,(\partial E_p / \partial c_j^y) = c_j^y(t) + \eta \cdot \delta^{ol} \cdot c_{kj}^{o} \cdot \delta^{hy} \cdot c_{kj}^{hy} \cdot \delta^{y} \cdot y$

where

$\delta^{ol} = (d - z)\, f^{o\prime}(net^{o}) = d - z$ (linear output neuron)
$\delta^{hy} = f^{h\prime}(net_{kj}^{h})\, c_{kj}^{hx} b_{k}^{x} = c_{kj}^{hx} b_{k}^{x}$ (linear)
$\delta^{y} = f_y^{\prime}(net_j^{y}) = j^{2} \sin^{j-1}(j \cdot net_j^{y}) \cos(j \cdot net_j^{y})$
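A single update step of the rule above can be sketched numerically as follows; all argument values are illustrative, the sketch assumes k ≥ 1, and it is not part of the USSHONN simulator itself.

```python
# Sketch of one gradient step for a first-hidden-layer weight c_k^x (Model 1b),
# following the delta terms defined above, for a single (x, y, d) pattern.
import numpy as np

def update_ck_x(ck_x, k, x, b_y_j, c_o_kj, c_hx_kj, c_hy_kj, d, z, eta=0.1):
    net_k_x = ck_x * x
    delta_ol = d - z                       # linear output neuron
    delta_hx = c_hy_kj * b_y_j             # linear multiplication neuron
    delta_x = (k ** 2) * np.sin(k * net_k_x) ** (k - 1) * np.cos(k * net_k_x)
    return ck_x + eta * delta_ol * c_o_kj * delta_hx * c_hx_kj * delta_x * x

print(update_ck_x(ck_x=0.5, k=2, x=0.72, b_y_j=0.4,
                  c_o_kj=0.3, c_hx_kj=1.0, c_hy_kj=1.0, d=0.75, z=0.6))
```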
5 Time Series Data Test Using USSHONN

This paper uses the monthly Australian dollar / US dollar exchange rate from November 2003 to December 2004 (see Table 1) as the test data for the USSHONN models. Input 1, R_{t-2}, is the value at time t-2; input 2, R_{t-1}, is the value at time t-1. The values of R_{t-2}, R_{t-1}, and R_t are converted to the range 0 to 1 and then used as inputs and output in the USSHONN model. The tests use the data from Table 1. The Australian/US dollar exchange rates are used as the test data for USSHONN model 0, model 1, and model 1b. For orders from 2 to 6 with 10,000 epochs, the errors are 0.0264% (model 0), 3.8342% (model 1), and 4.3982% (model 1b). When 100,000 epochs are used for the same test data, at order 6, all of the USSHONN models reach an error of 0.0000%. This shows that the USSHONN models can successfully simulate the Table 1 data with 0.0000% error.

The tests were also run for model 0 of USSHONN, PHONN, and THONN. After 1000 epochs, the three models USSHONN, PHONN, and THONN reached errors of 2.3672%, 8.7080%, and 10.0366%, respectively. This shows that USSHONN can reach a smaller error in the same time frame. After 100,000 epochs, the error for USSHONN is close to 0.0000%, while the errors for PHONN and THONN are still 4.4457% and 4.5712%, respectively. This result shows that USSHONN can simulate ultra-high-frequency data and is more accurate than PHONN and THONN. After 100,000 epochs, model 1 of USSHONN reached an error of 0.0000%, while the errors for PHONN and THONN are still around 9.0276% and 6.7119%, respectively; therefore, USSHONN model 1 is superior to the PHONN and THONN models for data simulation. After 100,000 epochs, model 1b of USSHONN reached an error of 0.0000%, while the errors for PHONN and THONN are still around 7.8234% and 7.3468%, respectively; therefore, USSHONN model 1b is superior to the PHONN and THONN models for data simulation.
Table 1. Australian dollars vs. US dollars for USSHONN models

Date         Rate (1 AU$ = ? US$)   Two months before (Input 1)   One month before (Input 2)   Prediction/Simulating output
11/28/2003   0.7236                 -                              -                             -
12/31/2003   0.7520                 -                              -                             -
1/30/2004    0.7625                 0.7236                         0.7520                        0.7625
2/27/2004    0.7717                 0.7520                         0.7625                        0.7717
3/31/2004    0.7620                 0.7625                         0.7717                        0.7620
4/30/2004    0.7210                 0.7717                         0.7620                        0.7210
5/28/2004    0.7138                 0.7620                         0.7210                        0.7138
6/30/2004    0.6952                 0.7210                         0.7138                        0.6952
7/30/2004    0.7035                 0.7138                         0.6952                        0.7035
8/31/2004    0.7071                 0.6952                         0.7035                        0.7071
9/30/2004    0.7244                 0.7035                         0.7071                        0.7244
10/29/2004   0.7468                 0.7071                         0.7244                        0.7468
11/30/2004   0.7723                 0.7244                         0.7468                        0.7723
6 Conclusion This paper develops the details of three nonlinear neural network models of USSHONN, which is part of the Ultra High Frequency Trigonometric Higher Order Neural Networks (UTHONN). This paper also provides the learning algorithm formulae for USSHONN, based on the structures of USSHONN. This paper tests the USSHONN models using ultra high frequency data and the running results are compared with Polynomial Higher Order Neural Network (PHONN) and Trigonometric Higher Order Neural Network (THONN) models. Experimental results show that USSHONN models are 4.4457 – 9.0276% better than PHONN and THONN models. Using nonlinear functions to model and analyze time series data will be a major goal in the future. One of the topics for future research is to continue
building models using USSHONN for different data series. The coefficients of the higher order models will be studied not only using artificial neural network techniques, but also statistical methods.
References 1. Vetenskapsakademien, K.: Time-series econometrics: Co-integration and Autoregressive Conditional Heteroskedasticity. In: Advanced information on the Bank of Sweden Prize in Economic Sciences in Memory of Alfred Nobel, pp. 1–20 (2003) 2. Granger, C.W.J.: Some properties of time series data and their use in econometric model specification. Journal of Econometrics 16, 121–130 (1981) 3. Granger, C.W.J., Bates, J.: The combination of forecasts. Operations Research Quarterly 20, 451–468 (1969) 4. Granger, C.W.J., Weiss, A.A.: Time series analysis of error-correction models. In: Karlin, S., Amemiya, T., Goodman, L.A. (eds.) Studies in Econometrics, Time Series and Multivariate Statistics, In Honor of T. W. Anderson, pp. 255–278. Academic Press, San Diego (1983) 5. Granger, C.W.J., Lee, T.H.: Multicointegration. In: Rhodes Jr., G.F., Fomby, T.B. (eds.) Advances in Econometrics: Cointegration, Spurious Regressions and Unit Roots, pp. 17– 84. JAI Press, New York (1990) 6. Granger, C.W.J., Swanson, N.R.: Further developments in study of cointegrated variables. Oxford Bulletin of Economics and Statistics 58, 374–386 (1996) 7. Zhang, M., Zhang, J.C., Fulcher, J.: Higher order neural network group models for data approximation. International Journal of Neural Systems 10(2), 123–142 (2000) 8. Zhang, M., Xu, S., Fulcher, J.: Neuron-Adaptive Higher Order Neural Network Models for Automated Financial Data Modeling. IEEE Transactions on Neural Networks 13(1), 188– 204 (2002) 9. Zhang, M., Fulcher, J.: Higher Order Neural Networks for Satellite Weather Prediction. In: Fulcher, J., Jain, L.C. (eds.) Applied Intelligent Systems, vol. 153, pp. 17–57. Springer, Heidelberg (2004) 10. Zhang, M.: Artificial Higher Order Neural Networks for Economics and Business. IGIGlobal Publisher (2009) 11. Zhang, M.: Artificial Higher Order Neural Network Nonlinear Models: SAS NLIN or HONNs? In: Zhang, M. (ed.) Artificial Higher Order Neural Networks for Economics and Business, pp. 1–47. IGI-Global Publisher (2009) 12. Zhang, M.: Ultra High Frequency Trigonometric Higher Order Neural Networks. In: Zhang, M. (ed.) Artificial Higher Order Neural Networks for Economics and Business, pp. 133–163. IGI-Global Publisher (2009)
Robust Adaptive Control Scheme Using Hopfield Dynamic Neural Network for Nonlinear Nonaffine Systems Pin-Cheng Chen1, Ping-Zing Lin2, Chi-Hsu Wang3, and Tsu-Tian Lee1 1
Department of Electrical Engineering, National Taipei University of Technology
[email protected],
[email protected] 2 Department of Applied Electronics Technology, National Taiwan Normal University
[email protected] 3 Department of Electrical Engineering, National Chiao Tung University
[email protected]
Abstract. In this paper, we propose a robust adaptive control scheme using a Hopfield-based dynamic neural network for uncertain or ill-defined nonlinear nonaffine systems. A Hopfield-based dynamic neural network is used to approximate the unknown plant nonlinearity. The robust adaptive controller is designed to achieve an L2 tracking performance to stabilize the closed-loop system. The weights of the Hopfield-based dynamic neural network are tuned on-line by adaptive laws derived in the sense of Lyapunov, so that the stability of the closed-loop system can be guaranteed and the tracking error is bounded. The proposed control scheme is applied to control an anti-lock braking system, and the simulation results illustrate the applicability of the proposed control scheme. The designed parsimonious structure of the Hopfield-based dynamic neural network makes the practical implementation of the work in this paper much easier. Keywords: adaptive control, robust control, Hopfield-based dynamic neural network, Lyapunov stability theory.
1 Introduction

Static and dynamic neural networks (NNs) are two feasible solutions often used for control problems in recent years. Some static neural networks (SNNs), such as the feedforward fuzzy neural network (FNN) or the feedforward radial basis function network (RBFN), are frequently used as powerful tools for modeling the ideal control input or the nonlinear functions of systems [1]-[2]. Although they have achieved much theoretical success, their complex structures make the practical implementation of the control schemes infeasible, and the numbers of hidden neurons in the NNs' hidden layers are hard to determine. Another well-known disadvantage is that SNNs are quite sensitive to major changes not encountered in the training phase. Some researchers adopt dynamic neural networks (DNNs) to solve the control problem of nonlinear systems. An important motivation is that a smaller DNN is
possible to provide the functionality of a much larger SNN [3]. In addition, SNNs are unable to represent dynamic system mappings without the aid of tapped delays, which results in long computation time, high sensitivity to external noise, and a large number of neurons when high-dimensional systems are considered [4]. This drawback severely affects the applicability of SNNs to system identification, which is the central part of some control techniques for nonlinear systems. On the other hand, owing to their dynamic memory, DNNs have good performance in identification, state estimation, trajectory tracking, etc., even with unmodeled dynamics [5],[6].

In this paper, a robust adaptive control scheme using a Hopfield-based dynamic neural network (RACHDNN) is proposed for SISO nonlinear nonaffine systems. The control objective is to force the system output to track a given reference signal. The Hopfield model was first proposed by Hopfield in 1982 and 1984 [7]-[8], and the Hopfield circuit is quite easy to realize. The Hopfield-based DNN can be viewed as a special kind of DNN and is used to approximate the unknown plant nonlinearity. Moreover, a simple but powerful robust adaptive controller is merged into the control law to achieve an L2 tracking performance with a designed attenuation level. All control parameters of the RACHDNN are tuned on-line by adaptive laws derived in the Lyapunov sense to achieve a favorable approximation.
2 Hopfield-Based Dynamic Neural Model

2.1 Description of the DNN Model

Consider the simplest DNN, without any hidden layers, described by the following nonlinear differential equation [5]

$\dot{\chi} = A\chi + BW\sigma(\chi) + B\Theta u$   (1)

where $\chi = [\chi_1\ \chi_2\ \ldots\ \chi_n]^T \in R^n$ is the state vector; $u = [u_1\ u_2\ \ldots\ u_m]^T \in R^m$ is the input vector; $\sigma: R^n \to R^n$; $A = \mathrm{diag}\{-a_1, -a_2, \ldots, -a_n\}$ with $a_i > 0$, $i = 1, 2, \ldots, n$, is a Hurwitz matrix; $B = \mathrm{diag}\{b_1, b_2, \ldots, b_n\} \in R^{n\times n}$; $W \in R^{n\times k}$; $\Theta \in R^{n\times m}$; and $\sigma(\cdot)$ is a sigmoid vector function responsible for nonlinear state feedback. The structure of the DNN is shown in Fig. 1. From (1), we have

$\dot{\chi}_i = -a_i \chi_i + b_i W_i^T \sigma(\chi) + b_i \Theta_i^T u$, $i = 1, 2, \ldots, n$   (2)

where $W_i^T = [w_{i1}\ w_{i2}\ \ldots\ w_{in}]$ and $\Theta_i^T = [\theta_{i1}\ \theta_{i2}\ \ldots\ \theta_{im}]$ are the ith rows of $W$ and $\Theta$, respectively. Solving the differential equation (2), we obtain

$\chi_i = b_i (W_i^T \xi_{W,i} + \Theta_i^T \xi_{\Theta,i}) + e^{-a_i t}\chi_i(0) - e^{-a_i t} b_i [W_i^T \xi_{W,i}(0) + \Theta_i^T \xi_{\Theta,i}(0)]$, $i = 1, 2, \ldots, n$   (3)

where $\chi_i(0)$ is the initial state of $\chi_i$; $\xi_{W,i} \in R^n$ and $\xi_{\Theta,i} \in R^m$ are the solutions of

$\dot{\xi}_{W,i} = -a_i \xi_{W,i} + \sigma(\chi)$   (4)

and

$\dot{\xi}_{\Theta,i} = -a_i \xi_{\Theta,i} + u$   (5)

respectively; $\xi_{W,i}(0)$ and $\xi_{\Theta,i}(0)$ are the initial states of $\xi_{W,i}$ and $\xi_{\Theta,i}$, respectively. Note that the terms $e^{-a_i t}\chi_i(0)$ and $e^{-a_i t} b_i [W_i^T \xi_{W,i}(0) + \Theta_i^T \xi_{\Theta,i}(0)]$ in (3) decay exponentially with time owing to $a_i > 0$.

2.2 Hopfield-Based DNN Approximator

A DNN approximator for continuous functions can be defined as

$\chi_i = b_i (\hat{W}_i^T \xi_{W,i} + \hat{\Theta}_i^T \xi_{\Theta,i}) + e^{-a_i t}\chi_i(0) - e^{-a_i t} b_i [\hat{W}_i^T \xi_{W,i}(0) + \hat{\Theta}_i^T \xi_{\Theta,i}(0)]$, $i = 1, 2, \ldots, n$   (6)

where $\hat{W}_i$ and $\hat{\Theta}_i$ are the estimations of $W_i$ and $\Theta_i$.
Fig. 1. The structure of the dynamic neural network
For a continuous vector function $\Phi = [\Phi_1\ \Phi_2\ \ldots\ \Phi_n]^T \in R^n$, we first define the optimal vectors $W_i^*$ and $\Theta_i^*$ as

$(W_i^*, \Theta_i^*) = \arg\min_{\hat{W}_i \in \Omega_{W_i},\, \hat{\Theta}_i \in \Omega_{\Theta_i}} \left\{ \sup_{\chi \in D_\chi,\, u \in D_U} \left| \Phi_i - b_i (\hat{W}_i^T \xi_{W,i} + \hat{\Theta}_i^T \xi_{\Theta,i}) - e^{-a_i t}\chi_i(0) + e^{-a_i t} b_i [\hat{W}_i^T \xi_{W,i}(0) + \hat{\Theta}_i^T \xi_{\Theta,i}(0)] \right| \right\}$   (7)

where $D_\chi \subset R^n$ and $D_U \subset R^m$ are compact sets; $\Omega_{W_i} = \{\hat{W}_i : \|\hat{W}_i\| \le M_{W_i}\}$ and $\Omega_{\Theta_i} = \{\hat{\Theta}_i : \|\hat{\Theta}_i\| \le M_{\Theta_i}\}$ are constraint sets for $\hat{W}_i$ and $\hat{\Theta}_i$. Then, $\Phi_i$ can be expressed as

$\Phi_i = b_i (W_i^{*T} \xi_{W,i} + \Theta_i^{*T} \xi_{\Theta,i}) + e^{-a_i t}\chi_i(0) - e^{-a_i t} b_i [W_i^{*T} \xi_{W,i}(0) + \Theta_i^{*T} \xi_{\Theta,i}(0)] + \varepsilon_i$, $i = 1, 2, \ldots, n$   (8)
P.-C. Chen et al.
where ε i is difficult to ~ χ = [χ~1 χ~2
the approximation error. Note that the optimal vectors Wi * and Θ *i are be determined and might not be unique. The modeling error vector T χ~n ] can be defined from (6) and (8) as
(~
)
~
[~
]
~
χ~i = Φ i − χ i = bi Wi T ξ W , i + ΘTi ξ Θ , i − e − a t bi Wi T ξ W , i (0) + ΘTi ξ Θ , i (0) + ε i i = 1, 2, i
, n (9)
~ ~ ˆ . where Wi = Wi * − Wˆ i , and Θ i = Θ *i − Θ i In this paper, a Hopfield-based dynamic neural network is adopted as the approximator. It is known as a special case of DNN with ai = 1 /( Ri C i ) and bi = 1 / C i , where Ri > 0 and C i > 0 representing the resistance and capacitance at the ith neuron,
respectively [6]. The sigmoid function vector σ(χ ) = [σ 1 ( χ 1 ) σ 2 ( χ 2 ) defined by a hyperbolic tangent function as
σ ( χ i ) = tanh(κ i χ i ) , i = 1, 2,
σ n ( χ n )] is
,n
T
(10)
where κ i is the slope of the hyperbolic tangent function, tanh(⋅) , at the origin.
3 Problem Formulation Consider a single-input and single-output (SISO) nonaffine nonlinear system x ( n ) = f ( x, u ) + d
(11)
where x = [ x x … x ( n −1) ]T is the measurable state vector of the system on a domain Ω x ⊂ R n , f (x, u ) : Ω x × R → R is the smooth unknown nonlinear function, u is the control input, and d is the bounded external disturbance. Here the single output is x. It should be noted that f(x, u) is an implicit function with respect to u. Feedback linearization is performed by rewriting (11) as x ( n ) = η u + Δ ( x, u ) + d
(12)
where r is a constant to be designed and Δ (x, u ) = f (x, u ) − ru . Here we assume that ∂ f ( x, u ) is nonzero for all (x, u ) ∈ Ω x × R with a known sign. Without losing ∂u generality, we further assume that [9] ∂ f ( x, u ) >0 ∂u
(13)
for all f (x, u ) ∈ Ω x × R . Note that for the nonaffine systems with property ∂ f ( x, u ) < 0 , the control scheme can be easily defined with minor modifications ∂u
Robust Adaptive Control Scheme Using Hopfield Dynamic Neural Network
501
discussed in section 4. The control objective is to develop a control scheme for the nonaffine nonlinear system (11) so that the output trajectory x can track a given trajectory xc closely. The tracking error is defined as e = xc − x
(14)
If the system dynamics and the external disturbance are well known, the ideal feedback controller can be determined as u id =
1
η
[u lc − d − Δ(x, u )]
(15)
where u lc = xc
(n)
+ kTe
(16)
with e = [e e … e ( n −1) ]T and k = [k n k n −1 … k1 ]T . Applying (15) to (12) and using (14) yield the following error dynamics e ( n ) + k1e ( n −1) +
+ kne = 0
(17)
If ki, i=1, 2, …, n are chosen so that all roots of the polynomial H ( s )Δ s n + k1 s n −1 + + k n lie strictly in the open left half of the complex plane, then lim e(t ) = 0 can be implied for any initial conditions. However, since Δ (x, u ) and the t →∞
external disturbance d may be unknown or perturbed, the ideal feedback controller uid in (15) cannot be implemented. Thus, to achieve the control objective, a Hopfield-based dynamic neural network is used to estimate the system uncertainty Δ (x, u ) in (12).
4 Design of RACHDNN The control law u in the RACHDNN is designed as
u=
1
η
(u
rac
− u HDNN )
(18)
where urac is a robust adaptive controller to achieve a L2 tracking performance with a small attenuation level and u HDNN is a Hopfield-based DNN controller used to approximate Δ(x, u ) . The input signal of the Hopfield-based dynamic neural network
is u = [x x x ( n −1) u ] and the output is uHDNN. Here we introduce the following Lemma 1 to educe the complexity of the approximation [9]. T
Lemma 1: Let the constant c satisfies the condition
η>
1 (∂f / ∂u ) 2
(19)
502
P.-C. Chen et al.
* Then, there exist a unique u HDNN which is a function of x and u rac so that * u HDNN (x, u rac ) satisfies
Δ
* * * ψ (x, u rac , u HDNN ) = Δ (x, u rac , u HDNN ) − u HDNN (x, u rac ) = 0
(20)
for all (x, u rac ) = Ω x × R . The Proof of Lemma 1 can be found in [9]. Therefore, Δ (x, u ) can be approximated by a Hopfield-based DNN ith input u = [x
x
x ( n −1)
u rac ] . Then, substituting (18) into (12) and using (14) yield T
e = A c e − B c [Δ(x, u ) − u HDNN + (u rac − u lc ) + d ]
(21)
where
⎡ 0 ⎢ Ac = ⎢ ⎢ 0 ⎢ ⎣− k n
1
0 ⎤ 0 ⎥⎥ and B c = [0 0 …1]T . 1 ⎥ ⎥ − k1 ⎦
0 0
− k n −1
Because the Hopfield-based DNN adopted to approximate Δ (x, u ) contains only one single neuron, we can eliminate the subscript i in (6) to express the output as u HDNN =
(
[
)
1 1 ˆ 1 −1t ˆ T ξ + e − RC t u ˆ T ξ (0) W ξW + Θ (0) − e RC Wˆ ξ W (0) + Θ Θ Θ HDNN C C
]
(22)
where u HDNN (0) is the initial value of u HDNN ; R and C represents the resistance and capacitance, respectively. Note that Wˆ and ξ are scalars. As we mentioned in sec. W
2.2, Hopfield-based DNN can be viewed as a special case of the DNN shown in Fig. 1 with ai = 1 /( Ri C i ) and bi = 1 / C i . Figure 2 shows the Hopfield-based DNN containing only a single neuron. Substituting (22) into (21) yields 1 1 t ⎧ 1 ~⎛ ⎫ t − − ⎞ 1 ~ ⎛ ⎞ e = A c e − B c ⎨ W ⎜⎜ ξ W − e RC ξ W (0) ⎟⎟ + Θ T ⎜⎜ ξ Θ − e RC ξ Θ (0) ⎟⎟ + ε + d + (u rac − u lc ) ⎬ ⎠ C ⎝ ⎠ ⎩C ⎝ ⎭
W
u1
um
θ11 θ 1m
∑
χ C
σ (⋅)
R
Fig. 2. The structure of Hopfield-based DNN containing only a single neuron
(23)
Robust Adaptive Control Scheme Using Hopfield Dynamic Neural Network
503
where ε is the approximation error. In order to derive the main theorem in this paper, the following lemma is required. Lemma 2: Suppose P = P T > 0 satisfies A c P + PA c + Q + PB c ( T
1 1 T − )B c P = 0 ρ2 δ
(24)
1 1 − ≤ 0 . Let Wˆ (0) ∈ ΩW and ρ2 δ ˆ , respectively. ˆ (0) ∈ Ω , where Wˆ (0) and Θ ˆ (0) are the initial values of Wˆ and Θ Θ Θ If the adaptive laws are designed as
where Q = Q T > 0 ; ρ > 0 and δ > 0 satisfies
)
(
1 1 ⎧ βW T t t − − ⎛ ⎞ ⎡ ⎤ ⎡ ⎤ e PB c ⎢ξ W − e RC ξ W (0) ⎥ if Wˆ < M W or ⎜⎜ Wˆ = M W and e T PB cWˆ ⎢ξ W − e RC ξ W (0)⎥ ≥ 0 ⎟⎟ ⎪ C ⎣ ⎦ ⎣ ⎦ ~ ⎪ ⎝ ⎠ Wˆ = −W = ⎨ 1 1 ⎫ t t − − ⎡ ⎤ ⎡ ⎤ ⎪ ⎧ βW T RC ξ W (0) ⎥ ⎬ if Wˆ = M W and e T PB cWˆ ⎢ξ W − e RC ξ W (0) ⎥ < 0 (25) ⎪Pr ⎨ C e PB c ⎢ξ W − e ⎣ ⎦⎭ ⎣ ⎦ ⎩ ⎩
(
)
1 1 ⎧βΘ T t − ⎡ ⎤ ˆ < M or ⎛⎜ Θ ˆ = M and e T PB Θ ˆ ⎡ξ − e − RC t ξ (0)⎤ ≥ 0 ⎞⎟ e PB c ⎢ξ Θ − e RC ξ Θ (0) ⎥ if Θ ⎪ c Θ Θ Θ ⎢ Θ ⎥ ⎜ ⎟ C ⎣ ⎦ ⎣ ⎦ ~ ⎪ ⎝ ⎠ ˆ = −Θ Θ =⎨ 1 1 − − t t ⎤ ⎡ ⎤⎫ ⎡ ⎪ ⎧ βΘ T T RC RC ˆ = M and e PB Θ ˆ ξ −e ξ Θ (0)⎥ ⎬ if Θ ξ Θ (0 ) ⎥ < 0 Θ c ⎢ Θ ⎪Pr ⎨ C e PB c ⎢ξ Θ − e ⎦ ⎣ ⎦⎭ ⎣ ⎩ ⎩
(26) where β W and β Θ are positive learning rates; the projection operators Pr{∗} are ⎧β ⎡ Pr ⎨ W e T PB c ⎢ξ W − e C ⎣ ⎩
1 t − RC
1 − t ⎡ ⎡ ⎤ ⎤ Wˆ ⎢ξ W − e RC ξ W (0)⎥ ⎥ ⎢ 1 t − ⎤⎫ βW ⎢ T ⎡ ⎤ ⎣ ⎦ ˆ⎥ ξ W (0) ⎥ ⎬ = e PB c ⎢ξ W − e RC ξ W (0)⎥ + e T PB c W 2 ⎥ ˆ ⎦⎭ C ⎢ ⎣ ⎦ W ⎥ ⎢ ⎢⎣ ⎦⎥
(27) 1 ⎡ ˆ T ⎡ξ − e − RC t ξ (0)⎤ ⎤⎥ Θ ⎢ Θ Θ ⎥ ⎢ 1 − − t t ⎧β ⎡ ⎤ ⎡ ⎤⎫ β ⎣ ⎦ ˆ⎥ Θ Pr ⎨ Θ e T PB c ⎢ξ Θ − e RC ξ Θ (0)⎥ ⎬ = Θ ⎢e T PB c ⎢ξ Θ − e RC ξ Θ (0)⎥ + e T PB c 2 ⎢ ⎥ C C ˆ ⎣ ⎦ ⎣ ⎦⎭ ⎩ Θ ⎢ ⎥ ⎢⎣ ⎦⎥ 1
(28) ˆ are bounded by Wˆ ≤ M and Θ ˆ ≤ M for all t ≥ 0 [10]. then Wˆ and Θ W Θ
Now we are prepared to state the main theorem of this paper.

Theorem: Suppose assumption (13) holds. Consider the SISO nonaffine nonlinear system (11) with the control law (18), where the Hopfield-based DNN controller is given by (22) and the robust adaptive controller is given by

$u_{rac} = u_{lc} + \frac{1}{2\delta} B_c^T P \mathbf{e}$   (29)

Then, the overall control scheme guarantees

$\frac{1}{2}\int_0^t \mathbf{e}^T Q \mathbf{e}\, d\tau \le \frac{1}{2}\mathbf{e}(0)^T P \mathbf{e}(0) + \frac{\tilde{W}(0)\tilde{W}(0)}{2\beta_W} + \frac{\tilde{\Theta}^T(0)\tilde{\Theta}(0)}{2\beta_\Theta} + \frac{\rho^2}{2}\int_0^t (\varepsilon + d)^2\, d\tau$   (30)
Proof. (omitted)

5 Simulation

In this section, the proposed control scheme is applied to an anti-lock braking system (ABS) to illustrate its effectiveness. The dynamics of the ABS of a half-vehicle model can be expressed as [11]

$\dot{\lambda} = \frac{1}{v}\left[\frac{L}{I}\left(-F_x L + K_b P_i\right) + (1-\lambda) a_v\right]$   (32)

where λ is the wheel slip ratio; v (m/s) is the speed of the vehicle; L = 0.275 m is the wheel radius; I = 3 kg·m² is the moment of inertia of the wheel;
F_x is the longitudinal force; P_i (kgf/cm²·s) is the braking pressure; K_b is the gain between the braking pressure and the wheel braking torque (N·m); and a_v (m/s²) is the acceleration of the vehicle. Differentiating (32), we have [11]

$\ddot{\lambda} = F(\lambda, \dot{\lambda}, P_i) + d$   (33)

where $F(\lambda, \dot{\lambda}, P_i) = f(\lambda, \dot{\lambda}) + g P_i$, in which f and g are nonlinear functions and d is the external disturbance. The control objective is to force the wheel slip ratio λ to track the optimal slip ratio of the road surface during the braking process. Assume the vehicle is braked on a dry road surface with a known optimal slip ratio of 0.28. The speed of the vehicle at the beginning of the braking process is 120 km/hr (33.33 m/s). Figure 3(a) shows that the wheel speed is reduced from 33.33 m/s to 0 m/s in 3.87 seconds. The control input P_i is shown in Fig. 3(b). In Fig. 3(c) we see that, after a short period of transient response, the wheel slip λ approaches the optimal wheel slip ratio 0.28. The simulation results demonstrate the effectiveness of the proposed control scheme in the presence of uncertainties and disturbances.
Fig. 3. Simulation results: vehicle velocity v (m/s), control input P_i (kgf/cm²·s), and slip ratio λ of the front and rear wheels versus the optimal slip ratio, all plotted against time (sec)
6 Conclusion

A RACHDNN for uncertain or ill-defined SISO nonaffine nonlinear systems has been proposed. A simple Hopfield-based DNN is used to approximate the system uncertainty, and the synaptic weights of the Hopfield-based DNN are tuned on-line by adaptive laws derived in the Lyapunov sense. We show that the RACHDNN can achieve an L2 tracking performance with a designed attenuation level. The proposed RACHDNN is applied to control an ABS and achieves favorable tracking performance in the presence of external disturbance. We emphasize that the Hopfield-based DNN used in this paper contains only one neuron; this parsimonious structure and the simple Hopfield circuit make the RACHDNN much easier to implement and more reliable for practical purposes.
Acknowledgements The authors appreciate the partial financial support from the National Science Council of Republic of China under grant NSC 98-2752-E-027-002-PAE.
References 1. Gao, Y., Er, M.J.: Online adaptive fuzzy neural identification and control of a class of MIMO nonlinear systems. IEEE Trans. Fuzzy Syst. 11, 462–477 (2003) 2. Li, Y., Qiang, S., Zhang, X., Kaynak, O.: Robust and adaptive backstepping control for nonlinear systems using RBF neural networks. IEEE Trans. Neural Networks 15, 693–701 (2004) 3. Lin, C.T., Lee, C.S.G.: Neural fuzzy systems: a neuro-fuzzy synergism to intelligent systems. Prentice Hall, Englewood Cliffs (1996) 4. Pham, D.T., Liu, X.: Dynamic system identification using partially recurrent neural networks. J. Syst. Eng. 2, 90–97 (1992) 5. Poznyak, A.S., Yu, W., Sanchez, D.N., Perez, J.P.: Nonlinear adaptive trajectory tracking using dynamic neural networks. IEEE Trans. Neural Networks 10, 1402–1411 (1999) 6. Ren, X.M., Rad, A.B., Chan, P.T., Lo, W.L.: Identification and control of continuous-time nonlinear systems via dynamic neural networks. IEEE Trans. Ind. Electronics 50, 478–486 (2003) 7. Hopfield, J.J.: Neural Networks and Physical Systems with Emergent Collective Computational Abilities. Proceedings of National Academy of Sciences, USA 79, 2554– 2558 (1982) 8. Hopfield, J.J.: Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of National Academy of Sciences, USA 81, 3088– 3092 (1984) 9. Park, J.H., Kim, S.H.: Direct adaptive self-structuring fuzzy controller for nonaffine nonlinear system. IEE Proc. Control Theory 153, 429–445 (2005) 10. Wang, L.X.: Adaptive Fuzzy Systems and Control - Design and Stability Analysis. Prentice Hall, Englewood Cliffs (1994) 11. Wang, W.Y., Chen, G.M., Tao, C.W.: Stable anti-lock braking system using outputfeedback direct adaptive fuzzy neural control. In: IEEE International Conference on Systems, Man and Cybernetics, vol. 4, pp. 3675–3680. IEEE Press, Washington (2003)
A New Intelligent Prediction Method for Grade Estimation Xiaoli Li1, Yuling Xie1, and Qianjin Guo2 1
Civil & Environmental Engineering School, University of Science and Technology Beijing, Beijing 100083, P.R. China
[email protected] 2 The State Key Laboratory of Molecular Reaction Dynamics, and Beijing National Laboratory for Molecular Sciences (BNLMS), Institute of Chemistry, Chinese Academy of Sciences, Beijing 100190, P.R. China
[email protected]
Abstract. In this paper, a novel PSO–SVR model that hybridizes constriction particle swarm optimization (PSO) and support vector regression (SVR) is proposed for grade estimation. This hybrid PSO–SVR model searches for SVR's optimal parameters using a constriction particle swarm optimization algorithm and then adopts the optimal parameters to construct the SVR models. The hybrid PSO–SVR grade estimation method has been tested on a number of real ore deposits. The results show that the method offers rapid training, generality, and accurate grade estimation, and it can provide a very fast and robust alternative to the existing time-consuming methodologies for ore grade estimation. Keywords: Support vector regression, Particle swarm optimization, Grade estimation.
1 Introduction

Grade estimation is probably one of the most important stages in reserve calculations. A number of methods, such as geometrical and geostatistical approaches, have been developed for the purpose of grade estimation [1-4]. However, geostatistics is based on certain assumptions about the spatial distribution of ore grades within the orebody. The effects of these assumptions have led to far more complicated methods requiring a large amount of knowledge in order to be applied effectively [5], and geostatistics has proved to be very difficult to learn and apply efficiently and also very time-consuming [6]. The need for a new method of ore grade estimation comes from the difficulties in applying conventional methods such as geostatistics.

In 1995, Vapnik [7] developed a neural-network algorithm called the support vector machine (SVM), which is a novel learning machine based on statistical learning theory and which adheres to the principle of structural risk minimization, seeking to minimize an upper bound on the generalization error rather than the training error (the principle followed by neural networks). With the introduction of Vapnik's
ε-insensitive loss function, SVM has been extended to solve nonlinear regression estimation problems, in the form of techniques known as support vector regression (SVR), which have been shown to exhibit excellent performance [8]. To construct an SVR model efficiently, the SVR parameters must be set carefully [6,8]. Several past investigations adopted trial-and-error approaches to tune the parameters; such methods must create many SVR models and take much time to compute the training error. In other words, so far no systematic method is available to determine these parameters. Therefore, this study proposes a new approach known as HPSO-SVR. In HPSO-SVR, particle swarm optimization (PSO) is employed to determine the optimal parameters of the SVR, which are then applied to construct the SVR model. The HPSO-SVR system is then applied to grade estimation for reserve calculations.
2 Support Vector Regression (SVR)

The basic idea in SVR is to map the input data x into a higher-dimensional feature space F via a nonlinear mapping Φ and then to obtain and solve a linear regression problem in this feature space (Fig. 1) [4,9]. Therefore, the regression approximation addresses the problem of estimating a function based on a given data set $G = \{(x_i, y_i)\}_{i=1}^{N}$ ($x_i \in R^d$ is an input vector, $y_i \in R$ is the desired value). In the SVM method, the regression function is approximated by the following function:

$y = f(x) = \sum_{i=1}^{N} w_i \Phi_i(x) + b$   (1)

where $\{\Phi_i(x)\}_{i=1}^{N}$ are the features of the inputs, and $\{w_i\}_{i=1}^{N}$ and b are coefficients.
Fig. 1. The SVR 'Max-Margin' idea: the ε-insensitive tube around y = f(x), with slack variables ξ_i and ξ_i* measuring deviations |y − f(x)| beyond ±ε
The nonlinear function is learned by a linear learning machine whose learning algorithm minimizes a convex functional. The convex functional is expressed as the following regularized risk function, and the parameters w and b are the support vector weight and bias that are calculated by minimizing it:

$R(f) = C\,\frac{1}{N}\sum_{i=1}^{N} L_\varepsilon\!\left(f(x_i) - y_i\right) + \frac{1}{2}\|w\|^2$   (2)
Vapnik [7] developed an ε-insensitive loss function that has the same structure as the Huber loss function. This ε-insensitive loss function is defined by

$L_\varepsilon\!\left(f(x_i) - y_i\right) = \begin{cases} |f(x_i) - y_i| - \varepsilon, & \text{for } |f(x_i) - y_i| \ge \varepsilon \\ 0, & \text{otherwise} \end{cases}$   (3)
where ε (≥ 0) is the maximum deviation allowed during the training and C (> 0) represents the associated penalty for excess deviation during the training. Thus, SVR is formulated as minimization of the following functional:

$\min\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^*\right)$   (4)

subject to

$\begin{cases} y_i - \langle w, x_i\rangle - b \le \varepsilon + \xi_i \\ \langle w, x_i\rangle + b - y_i \le \varepsilon + \xi_i^* \\ \xi_i,\ \xi_i^* \ge 0 \end{cases}$   (5)
where the constant C > 0 stands for the penalty degree of a sample with error exceeding ε. The two positive slack variables ξ_i, ξ_i* represent the distance from the actual values to the corresponding boundary values of the ε-tube. A dual problem can then be derived by using the optimization method to maximize the function

$\max\ -\frac{1}{2}\sum_{i,j=1}^{n}\left(\alpha_i - \alpha_i^*\right)\left(\alpha_j - \alpha_j^*\right)\langle x_i, x_j\rangle - \varepsilon\sum_{i=1}^{n}\left(\alpha_i + \alpha_i^*\right) + \sum_{i=1}^{n} y_i\left(\alpha_i - \alpha_i^*\right)$   (6)

subject to

$\sum_{i=1}^{n}\left(\alpha_i - \alpha_i^*\right) = 0$ and $0 \le \alpha_i,\ \alpha_i^* \le C$   (7)
where the model parameters α, α* represent the Lagrange multipliers satisfying the constraint 0 ≤ α, α* ≤ C. The SVM for function fitting obtained by using the above-mentioned maximization is then given by

$f(x) = \sum_{i=1}^{n}\left(\alpha_i - \alpha_i^*\right)\langle x_i, x\rangle + b$   (8)

In this work, substituting the kernel $K(x_i, x_j) = \Phi(x_i)\Phi(x_j)$ in (6) allows us to reformulate the SVM algorithm in a nonlinear paradigm. Finally, we have

$f(x) = \sum_{i=1}^{n}\left(\alpha_i - \alpha_i^*\right) K(x_i, x) + b$   (9)
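The ε-SVR of (4)-(9) with an RBF kernel is available in standard libraries; the following Python sketch uses scikit-learn, where the RBF kernel is parameterized by gamma rather than σ (gamma = 1/(2σ²)). The synthetic data and parameter values are purely illustrative.

```python
# Hedged sketch of epsilon-SVR with an RBF kernel, corresponding to Eqs. (4)-(10).
import numpy as np
from sklearn.svm import SVR

def fit_svr(X, y, C=10.0, sigma=1.0, epsilon=0.01):
    model = SVR(kernel="rbf", C=C, gamma=1.0 / (2.0 * sigma ** 2), epsilon=epsilon)
    return model.fit(X, y)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(50, 8))           # e.g. 8 neighbouring grades as inputs
y = X.mean(axis=1) + 0.05 * rng.normal(size=50)   # synthetic target, for illustration only
model = fit_svr(X, y)
print(model.predict(X[:3]))
```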
3 Hybrid PSO-SVR Model for Grade Estimation

In this section, we describe the design of our proposed model, a particle swarm optimization based SVR model, for grade estimation. To construct the SVR model efficiently,
SVR’s parameters must be set carefully, a reliable and robust parameter selection optimization strategy is a pre-requisite to obtain a well-performing and robust SVR regression model [10,11]. This research used the RBF kernel function (defined by (10)) for the SVR regression because the RBF kernel function can analyze higher dimensional data and requires that only two parameters. 1 ⎛ K ( xi , x j ) = ex p ⎜ − ⎝ 2σ
2
xi − x
2 j
⎞ ⎟ ⎠
(10)
The root-mean-square error (RMS), defined in (11), was used as the criterion for evaluating the fitness of the SVR models:

$RMS = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$   (11)
where n is the number of samples, y_i is the observed value, and ŷ_i is the calculated value.

In the HPSO-SVR model, the SVR parameters are dynamically optimized by the HPSO evolutionary process, and the SVR model then performs the prediction task using these optimal values. In other words, the HPSO searches for the optimal values that enable the SVR to fit various datasets. The algorithm can be summarized as follows (a compact code sketch is given after the step list):

Step 1 (Start): Initialize the kernel function type and all SVR parameters. Select the type of kernel function and the ranges of the SVR parameters.
Step 2: Initialize the parameters for PSO, such as the number of evolutionary generations, the population size, and the number of subpopulations and individuals per subpopulation. Initialize the three SVR parameters C, σ, and ε.
Step 3: Randomly initialize the particle positions X and the particle velocities 0 ≤ v_i ≤ v_max. Let k indicate the iteration count; set the constants k_max, c1, c2, V_max, w, d. Set i = 1.
Step 4: Input the learning samples X_n(t) and the corresponding output values O_n to the SVR, where t varies from 1 to L, the number of input nodes.
Step 5 (SVM model training): For the training set, conduct 5-fold cross-validation (CV) and calculate the average CV accuracy based on the (C, σ) represented in the second and third parts of a particle.
Step 6: Compute the fitness values of the vectors; for each particle, evaluate the desired optimization fitness function in d variables.
Step 7 (Check the stop criterion): Repeat Steps 5-8 until a criterion is met, usually a sufficiently good fitness or a maximum number of iterations/epochs.
Step 8: For all particles of the swarm, take the following steps:
(a) Compare the particle's fitness evaluation with the particle's pbest. If the current value is better than pbest, then set the pbest value equal to the current value and the pbest location equal to the current location in d-dimensional space.
(b) Compare the fitness evaluation with the population's overall previous best. If the current value is better than gbest, then reset gbest to the current particle's array index and value.
(c) For every particle, define a neighborhood and determine the local best fitness value (lbest) and its coordinates.
(d) If pbest(i) ≠ gbest, then all particles are manipulated accordingly; go to Step 10.
(e) If pbest(i) = gbest, then particle i's position x_i(t + 1) needs to be optimized, and the other particles are manipulated according to (7) in [12].
Step 9: Establish the prediction model for grade estimation based on the parameter values of C, ε, and σ.
Step 10: Save the training parameters; the whole training of the SVR is completed.

As a result, the prediction model in this paper is constructed by using HPSO as the training algorithm.
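The sketch below condenses the loop above into a global-best constriction PSO over (C, σ, ε) with cross-validated RMSE as fitness. It deliberately omits the subpopulation/lbest handling of Steps 8(c)-(e), uses scikit-learn for the inner SVR, and all names, bounds, and default values are illustrative assumptions rather than the paper's implementation.

```python
# Compact HPSO-SVR sketch: constriction PSO (chi = 0.729, c1 = 2, c2 = 2.1)
# searching SVR parameters (C, sigma, epsilon) by 5-fold CV RMSE.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

def cv_rmse(params, X, y):
    C, sigma, eps = params
    model = SVR(kernel="rbf", C=C, gamma=1.0 / (2.0 * sigma ** 2), epsilon=eps)
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    return -scores.mean()

def pso_svr(X, y, n_particles=30, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    lo = np.array([1e-1, 1e-3, 1e-4]); hi = np.array([1e4, 1e2, 1.0])  # Section 4.3 ranges
    pos = rng.uniform(lo, hi, size=(n_particles, 3))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_f = np.array([cv_rmse(p, X, y) for p in pos])
    gbest = pbest[np.argmin(pbest_f)].copy()
    chi, c1, c2 = 0.729, 2.0, 2.1
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = chi * (vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos))
        pos = np.clip(pos + vel, lo, hi)            # keep particles inside the search box
        f = np.array([cv_rmse(p, X, y) for p in pos])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = pos[improved], f[improved]
        gbest = pbest[np.argmin(pbest_f)].copy()
    return gbest                                    # best (C, sigma, epsilon) found
```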
4 Case Study for Grade Estimation

Here we use a nonlinear grade estimation model that we have developed to test the HPSO-SVR method for grade estimation. This model mainly contains sample data acquisition, sampling neighborhood scheme extraction and data normalization, network training, and grade estimation.

4.1 Sample Data Acquisition

This section uses actual grade data of a porphyry copper deposit, taken from David's book 'Geostatistical Ore Reserve Estimation' [3]. According to David, the deposit is homogeneous. The values used are assumed to come from one level of the simulated deposit, divided into 100'×100'×50' blocks, and the grades given are the actual grades of the blocks. Here the grades of 20×20 blocks are used to train and validate the HPSO-SVR.

4.2 Sampling Neighborhood Scheme and Data Normalization

After the selection of the source data, the sampling scheme used to train and validate the HPSO-SVR topologies is discussed. Quite reasonably, some researchers have tried to take advantage of the information hidden in the relationships between neighboring samples. This approach is followed in general terms by the most advanced existing methods for grade estimation, such as inverse distance weighting and kriging. Most of the examples following this approach choose as neighbors the samples closest to the estimation point and treat the problem of ore grade estimation as a mapping between the surrounding grades and the grade at the estimation point. Grade values from surrounding sample points are given as input to the network, while the grade of the central point is given as the output. The sampling scheme is one estimation point with eight surrounding neighbor points; according to this rule, the sample data are selected. The sample data are shown in Table 1 and the sampling scheme topology is shown in Fig. 2. It is supposed that the deposit is homogeneous, and geometric distance is not considered at present.
Fig. 2. I/O configurations (red indicates the estimation point)
Another issue that affects the forecasting performance of the SVR model is the data preprocessing procedure. Scaling can reportedly improve the performance of support vector machine models. Therefore, this investigation performs a scaling procedure to preprocess the input data, thus improving the forecasting accuracy of the developed HPSO-SVR model. A given data set X is normalized so that the inputs and targets fall in the range [0, 1] as follows:

\[
Y = \frac{X - MinValue}{MaxValue - MinValue} \tag{12}
\]
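For reference, this min–max scaling reduces to a one-line NumPy operation (a minimal sketch; the column-wise convention is an assumption):

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X into [0, 1]: Y = (X - min) / (max - min)."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)
```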
According to this method, the sample sets in Table 1 can be normalized to values between 0 and 1 for training. These normalized vectors are used as the input and output of the SVR.
Table 1. The sets of training samples
No.   D11    D12    D13    D21    D23    D31    D32    D33    Target (D22)
1     0.67   0.94   0.9    1.15   1.29   1.45   1.62   2.04   1.61
2     1.49   2.08   2.97   1.68   3.36   2.58   2.89   3.16   3.02
3     2.77   1.89   1.34   2.33   0.65   2.11   1.04   0.65   1.13
4     0.95   0.74   0.65   0.67   1.05   0.6    0.61   1.01   0.67
5     1.83   1.88   1.42   1.15   1.22   0.69   0.82   0.98   1.22
6     1.3    0.7    0.69   1.43   1.37   1.31   1.33   1.55   1.05
7     1.15   1.61   1.29   1.45   2.04   1.6    2.08   1.89   1.62
8     1.68   3.02   3.36   2.58   3.16   1.89   2.21   2.97   2.89
9     2.33   1.13   0.65   2.11   0.65   2.1    0.94   0.63   1.04
10    0.67   0.67   1.05   0.6    1.01   0.42   0.4    0.54   0.61
4.3 Training and Prediction with the HPSO-SVR Model
The selection of optimal parameters plays an important role in various training algorithms; a single parameter choice has a tremendous effect on the rate of
convergence. For this paper, the optimal parameters are determined by trial-and-error experimentation. In HPSO, a particle X is a set of SVR parameters denoted as (C, σ, ε). The domains of the cost parameter C, the kernel parameter σ and the insensitive value ε differ, so this work divides X into X1, X2 and X3, where X1 = {C}, X2 = {σ}, X3 = {ε}, and the rate of position change (velocity) V for particle X is accordingly divided into V1 = {V_i^C}, V2 = {V_i^σ} and V3 = {V_i^ε}. The search range for X1 is defined as (10^-1, 10^4), for X2 as (10^-3, 10^2) and for X3 as (10^-4, 1); that is, if x_d > X_max, then x_d = X_max and v_d = -α·v_d, where α lies between 0 and 1, and similarly for X_min. Thus a particle cannot move out of this range in any dimension. The other settings of the HPSO algorithm are as follows: the swarm size was set to 30, c1 and c2 were set to 2 and 2.1 respectively, so that φ = c1 + c2 = 4.1; with κ = 1 the constriction coefficient χ is 0.729 according to [12]. The maximum number of epochs was limited to 500. These system parameter settings are summarized in Table 2. In the experiment, 200 groups of sample data are acquired; 150 groups are used as the training set and the remaining 50 groups as the validation set. Sample vectors are used as the input and output of the HPSO-SVR model. The experiment with a particle size of 30 is demonstrated here for simplicity. As indicated in Fig. 3, the fitness curve gradually improved from iteration 0 to 200 and exhibited no significant improvement after iteration 150 for the training setting strategies. The optimal stopping iteration for the highest validation accuracy was around iteration 90–130. Based on the best SVR parameters (the best particle) for this optimal iteration number, the prediction accuracy was calculated from the training model created on the training set. After all possible normal operating modes of the grade are learned, the system enters the grade estimation stage.

Table 2. Setting of the system parameters

Parameter                      Cost parameter C    Kernel parameter σ    Insensitive value ε
Search range                   [10^-1, 10^4]       [10^-3, 10^2]         [10^-4, 1]
Velocity [-v_max, v_max]       [-200, 200]         [-100, 100]           [-20, 20]
Learning factors (c1, c2)      (2, 2.1)            (2, 2.1)              (2, 2.1)
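For reference, the constriction coefficient quoted above follows Clerc's formula χ = 2κ / |2 − φ − √(φ² − 4φ)|; a quick check of the value 0.729 (the function name is illustrative only):

```python
import math

def constriction(c1=2.0, c2=2.1, kappa=1.0):
    """Clerc constriction coefficient; requires phi = c1 + c2 > 4."""
    phi = c1 + c2
    return 2.0 * kappa / abs(2.0 - phi - math.sqrt(phi * phi - 4.0 * phi))

print(constriction())  # ~0.7298
```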
Fig. 3. The fitness value during the training stage (best fitness value vs. number of fitness evaluations/iterations)
Fig. 4. Prediction values and real values of ore grade (scale) over the test samples
In addition to the particle size of 60, we also tried particle sizes of 40 and 100. We found that the particle size did not significantly affect the classification accuracy or the feature-subset performance on this data set. However, it did affect the number of iterations needed to reach the optimal solution with the correct feature subset and prediction accuracy, and thus shortened or prolonged the execution time; that is, the particle size affects the convergence. After the HPSO was applied to search for the optimal parameter sets, the HPSO-SVR forecasting models were built and the forecasting simulation was performed on the testing data. The graphical comparison between the actual and predicted values is shown in Fig. 4, which plots the predicted and actual ore grades for the test set. It is observed that the proposed HPSO-SVR model fits this particular data set very well.
5 Conclusions
In this study, we applied an SVR prediction method optimized by HPSO to grade estimation in reserve calculation for a mineral deposit. The constriction-type particle swarm optimization was used to find the best SVR model parameters. The experimental results showed that the proposed system can optimize the model parameters efficiently and reliably. It can be seen from the experiment that the prediction model overcomes the main shortcomings of artificial neural networks, namely the need to define a network structure and the tendency to become trapped in local optima, so it is applicable to grade estimation.
Acknowledgements This work has been supported by the Key Projects in the National Science & Technology Pillar Program during the Eleventh Five-Year Plan Period (Grant No.2006BAB01A04).
References 1. Bardossy, G., Fodor, J.: Evaluation of Uncertainties and Risks In Geology. Springer, Heidelberg (2004) 2. Pham, T.D.: Grade Estimation Using Fuzzy Set Algorithms. Mathematical Geology 29, 291–304 (1997)
3. David, M.: Geostatistical Ore Reserve Estimation. Elsevier Scientific Publishing, Amsterdam (1977) 4. Smola, A.J., Schölköpf, B.: A Tutorial on Support Vector Regression. Technical Report NC2-TR-1998-030, NeuroCOLT2 (1998) 5. Goovaerts, P.: Geostatistics for Natural Resources Evaluation. Oxford University Press, New York (1997) 6. James, K., Russelll, C.E., Shi, Y.H.: Swarm Intelligence. Morgan Kaufmann, San Francisco (2001) 7. Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998) 8. Kennedy, J.: Small Worlds and Mega-minds: Effects of Neighborhood Topology on Particle Swarm Performance. In: Proceedings of IEEE Congress on Evolutionary Computation, pp. 1931–1938. IEEE Press, Washington (1999) 9. Christianini, V., Shawe-Taylor, J.: An Introduction to Support Vector Machines. Cambridge University Press, Cambridge (2002) 10. Nello, C., John, S.T.: An Introduction to Support Vector Machines and Other Kernel Based Learning Methods. Cambridge University Press, Cambridge (2000) 11. Lin, P.T.: Support Vector Regression: Systematic Design and Performance Analysis. PhD thesis, Department of Electronic Engineering, National Taiwan University of Science and Technology, Taipei (2001) 12. Guo, Q.J., et al.: A Hybrid PSO-GD Based Intelligent Method for Machine Diagnosis. Digital Signal Processing: A Review Journal 16, 402–418 (2006)
Kernel-Based Lip Shape Clustering with Phoneme Recognition for Real-Time Voice Driven Talking Face Po-Yi Shih, Jhing-Fa Wang, and Zong-You Chen Department of Electrical Engineering, National Cheng Kung University No. 1 University Road, Tainan City, Taiwan
[email protected],
[email protected],
[email protected]
Abstract. This work describes a real-time voice-driven method with which a speaker's lip shape is synchronized with the corresponding speech signal, targeting low-bandwidth mobile devices. Phoneme recognition is generally regarded as an important task in the operation of a real-time lip-sync system. In this work, a kernel-based lip shape clustering algorithm inspired by one-class support vector machines (SVM) is used. Speakers with similar lip shapes are clustered, and a cluster-dependent vowel phoneme model is then constructed for each cluster. We use the sum of absolute differences (SAD) as a vowel lip shape likelihood to cluster the lip shapes into categories, and then blend the source and destination lip shape pictures at the transparency level using alpha blending for the lip-sync animation. We find that this method outperforms the conventional CHMM method in phoneme error rate (PER), 8.78% versus 32.25%. Keywords: Real-time, Voice-driven, Kernel-based, Lip shape clustering, Phoneme recognition.
1 Introduction
With the development of communication systems, people are no longer satisfied with voice-only communication; they wish to interact face to face through real-time video transmission on their portable devices. Unfortunately, bandwidth is a key limitation, and most portable devices cannot transmit real-time video smoothly. Therefore, developing low-bandwidth communication tools with audio-visual synchronization has become more and more necessary [1]. A lip-sync system based on a voice-driven method is designed to animate the face of a speaker (i.e., on the other side of the communication device) so that it realistically speaks based only on the acoustic input. The objective is to synchronize the lip movements of a speaker with a speech signal in real time. Experiments show that the trust and attention of humans towards machines can increase by 30% if humans communicate with talking faces instead of voice only. However, realistic facial animation remains one of the most challenging tasks despite decades of extensive research, mainly because the mechanisms of human facial expressions are not yet well understood [2]. Therefore, producing facial animations in real time is a new field worthy of research and discussion.
Many researchers have studied the hidden Markov model (HMM) in order to achieve real-time continuous speech recognition for lip-sync, such as [3]. However, these methods require complex training. Some systems are useful in the area of human-computer interaction, such as [4], but more reliable audio-to-visual conversion methods should be implemented in such systems. [5] adapts an HMM inversion method for estimating visual parameters based on trained audio-visual HMMs and speech feature vectors in MPEG-4. [6] presents a real-time speech-driven lip-sync system and [7] presents speech-driven lip motion animation; they point out two important factors in realizing a real-time lip-sync system, namely effective context-dependent acoustic modeling and real-time continuous phoneme recognition. However, the above methods do not fit communication systems, especially low-bandwidth portable devices (e.g., cell phones, PDAs). In this paper, we propose a real-time voice-driven human talking face technology for low-bandwidth portable device communication systems. A review of the one-class SVM is given in Section 2. Section 3 describes our proposed method. Section 4 presents the experiments performed and the obtained results. Finally, Section 5 concludes this paper.
2 Previous Work
One-class SVM [8] is a kernel method based on a support vector description of a data set consisting of positive examples only. If all data are inliers, the one-class support vector machine computes the smallest sphere in feature space enclosing the image of the input data. Writing the Lagrangian in its Wolfe dual form gives the one-class SVM objective:

\[
L = \sum_{i=1}^{m} \alpha_i \,\|\phi(x_i)\|^2 - \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j\, \phi(x_i)\cdot\phi(x_j) \tag{1}
\]

subject to the constraints $0 \le \alpha_i \le C$, $i = 1,\ldots,m$, and $\sum_{i=1}^{m} \alpha_i = 1$.
The ordinary inner product between images of input points in Equation (1) can be replaced by a positive definite function $K(\cdot,\cdot)$, also called a kernel function. Examples of positive kernel functions are the linear, polynomial and radial basis function (Gaussian) kernels. The Wolfe dual form of Equation (1) thus becomes

\[
L = \sum_{i=1}^{m} \alpha_i K(x_i, x_i) - \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j K(x_i, x_j) \tag{2}
\]

with the $\alpha_i$ subject to the same constraints as above.
The distance R(x) in feature space between φ(x) and A can be evaluated in terms of the kernel function K and the input points x_i. We define the distance R(x) as follows:

\[
R^2(x) = \|\phi(x) - A\|^2 \tag{3}
\]

In view of Equation (3) and the definition of the kernel we have

\[
R^2(x) = K(x, x) - 2\sum_{i=1}^{m} \alpha_i K(x_i, x) + \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j K(x_i, x_j) \tag{4}
\]
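As an illustration, Equation (4) can be evaluated directly from a kernel function and the dual coefficients; the following is a minimal sketch under the assumption of an RBF kernel (function and variable names are ours, not the authors'):

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """Gaussian (RBF) kernel between two feature vectors."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def radius_sq(x, support_vectors, alphas, kernel=rbf):
    """Squared feature-space distance of x from the sphere centre, as in Eq. (4)."""
    k_xx = kernel(x, x)
    k_xi = np.array([kernel(sv, x) for sv in support_vectors])
    K = np.array([[kernel(si, sj) for sj in support_vectors] for si in support_vectors])
    return k_xx - 2.0 * alphas @ k_xi + alphas @ K @ alphas
```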
3 Phoneme Recognition Based on Lip Shape Clustering
We introduce a real-time voice-driven talking face for low-bit-rate mobile device communication systems. First, we use the sum of absolute differences (SAD) as a vowel lip shape likelihood to cluster lip shapes into several categories. Fig. 1 shows a block diagram of the proposed phoneme classification algorithm. The input utterance is segmented into frame units after speech is detected. Each frame is analyzed at the energy level to decide whether it contains a speech signal or not. If the frame does contain speech, we first perform pre-processing, which includes pre-emphasis and Hamming windowing, and then extract 12th-order linear predictive cepstral coefficients (LPCC) as the feature vector. Classification is then performed over the frames using an SVM with the LPCC as the input feature vector. Finally, the source and destination lip shape pictures are adjusted at the transparency level using alpha blending.
Fig. 1. Block diagram of Mandarin vowels classification procedure
3.1 Mandarin Vowels Classification Based on Lip shape
Recognition errors easily occur when phonemes are similar to each other. This often happens when uttering Mandarin phonemes such as ㄝ and ㄟ, or ㄡ and ㄛ, and it may be aggravated because each speaker has his or her own accent. For example, when phonemes with very similar sounds are pronounced, the recognition result may be wrong, and the error will cause jerkiness in the animation. To avoid this situation, we collected several sets of
16 Mandarin single-vowel mouth shape pictures, each set pronounced by a different person, and used the SAD (sum of absolute differences) method to divide the mouth shape pictures into categories according to their degree of difference. Assume that the y-axis is the current phoneme and the x-axis is the classified phoneme. The sums of the differences over the colour channels R, G, B are calculated as follows:

\[
Sum_{COLOR} = \sum_{i=1}^{S} \left| COLOR(Y_i) - COLOR(X_i) \right|, \qquad COLOR \in \{R, G, B\} \tag{5}
\]
where S is the size of the mouth shape picture. The average SAD is

\[
AvgSum = \frac{Sum_R + Sum_G + Sum_B}{S \times 3} \tag{6}
\]
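Equations (5)–(6) reduce to a few NumPy operations for two equally sized RGB lip-shape images; a minimal sketch (the uint8-to-integer conversion is an implementation assumption):

```python
import numpy as np

def average_sad(img_y, img_x):
    """Average per-pixel sum of absolute differences over the R, G, B channels."""
    diff = np.abs(img_y.astype(np.int64) - img_x.astype(np.int64))  # shape (H, W, 3)
    return diff.sum() / (diff.shape[0] * diff.shape[1] * 3)
```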
Thus, when we use SAD for lip shape classification, the question becomes how to find a threshold that discriminates, by SAD, whether two lip shapes are similar or not. Therefore, we gathered the average SAD values of ten people's lip shape pictures, as shown in Fig. 2. According to these data, the threshold value is 5
Fig. 2. Average SAD value of ten people’s lip shape pictures
Then we convert Fig. 2 into the data distribution diagram shown in Fig. 3. The two red lines are the upper and lower limits of the interval that we found. Finally, the threshold is taken as the average of 5.588 and 7.138 as in Equation (7), giving 6.363:

\[
Threshold_{avg} = \frac{Threshold_{upper} + Threshold_{lower}}{2} \tag{7}
\]
Finally, we cluster the Mandarin vowel phonemes according to this empirical result; the 16 vowel phonemes are grouped into 11 clusters, as shown in Table 1.
Table 1. The clusters of Mandarin vowel phonemes by lip shape clustering

Cluster   Vowel phoneme   IPA symbol
1         ㄚ ㄛ ㄜ          a, o, ə
2         ㄝ ㄞ ㄧ          ε, ai, i
3         ㄟ               ei
4         ㄠ               au
5         ㄡ               ou
6         ㄢ               an
7         ㄣ ㄥ             ən, əŋ
8         ㄤ               aŋ
9         ㄦ               ɚ
10        ㄨ               u
11        ㄩ               y
Fig. 3. Distribution diagram of the data in Fig. 2 (upper limit 7.138, lower limit 5.588)
3.2 Lip Shape Clustering Algorithm and Alpha Blending
Attracted by the idea of the support vector description of data sets [9], the key point of the method is to describe each cluster by a sphere of minimum radius. The assignment of lip shapes to clusters and the spheres of minimum radii are obtained through an iterative procedure similar to K-means. Starting from a cluster initialization based on the lip shape threshold for each phoneme, as shown in Table 1, at each iteration a one-class SVM is trained on each cluster, the distance of the image points (phonemes) φ(x) from the centre of each sphere is calculated, and the phonemes are assigned to the nearest sphere (cluster). The procedure continues until no cluster changes. The proposed phoneme clustering method consists of the following steps, where the parameter ρ provides robustness against outliers; a minimal sketch of the procedure follows the steps.
Step 1: Initialize K clusters w_k(ρ), k = 1, ..., K, using a subset of training phonemes clustered by lip shape.
Step 2: Train a one-class SVM for each cluster w_k(ρ).
Step 3: Update each cluster, i.e.
\[
w_k(\rho) = \Big\{ x_i \;\Big|\; k = \arg\min_{j=1,\ldots,K} \big\| \phi(x_i) - A_j \big\| \Big\} \quad \text{with } \big\| \phi(x_i) - A_j \big\| < \rho \tag{8}
\]
Step 4: Go to Step 2 until no cluster changes.
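The following is a minimal sketch of this iterative procedure using scikit-learn's OneClassSVM as the per-cluster model; the negative of the decision function is used here as a proxy for the distance to each sphere, and the initialization and ρ handling are simplified relative to the authors' description:

```python
import numpy as np
from sklearn.svm import OneClassSVM

def svm_cluster(X, init_labels, n_iter=20, nu=0.1, gamma=0.5):
    """Iteratively refit a one-class SVM per cluster and reassign each point to the
    cluster whose model scores it highest (a proxy for the nearest sphere)."""
    labels = np.asarray(init_labels).copy()
    for _ in range(n_iter):
        keys = [k for k in np.unique(labels)]
        models = {k: OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X[labels == k])
                  for k in keys}
        scores = np.column_stack([models[k].decision_function(X) for k in keys])
        new_labels = np.array(keys)[scores.argmax(axis=1)]
        if np.array_equal(new_labels, labels):   # Step 4: stop when no cluster changes
            break
        labels = new_labels
    return labels
```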
Alpha blending means blending the source and destination pixels according to the alpha value: the RGB pixels of the source are mixed with the corresponding RGB pixels of the destination. Here we use alpha blending as a smoothing method between distinct lip shape pictures.
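Such blending is simply a per-pixel weighted sum of the two frames; a minimal sketch (array names and the uint8 clipping are assumptions):

```python
import numpy as np

def alpha_blend(src, dst, alpha):
    """Blend source and destination RGB frames: out = alpha*src + (1-alpha)*dst."""
    out = alpha * src.astype(np.float32) + (1.0 - alpha) * dst.astype(np.float32)
    return out.clip(0, 255).astype(np.uint8)
```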
4 Experimental Setting and Results
The experiment has three parts. First, we perform single-word recognition via the one-class SVM before and after phoneme clustering. We then estimate the word error rate for continuous speech in the real-time system. Fig. 4 shows the block diagram of the experimental setup for Mandarin Chinese phoneme recognition.
Fig. 4. Block diagram of experimental setup
The test dataset includes several audio files recorded as sentences or single words. The audio files are acquired at a sample rate of 8 kHz with a 16-bit sample size. The frame length is 128 ms (1024 samples at the 8 kHz sampling rate) and the test audio is segmented into frames. We extract 12th-order linear predictive cepstral coefficients (LPCC) as the feature vectors. Classification is then performed over the frames using the one-class SVM with the LPCC as the input feature vector, and the output of the SVM classifier is a series of results labeled with time. To estimate the word error rate, the main phoneme of each lip shape is labeled by hand and compared against the series of results from the classifier. The experimental results of isolated word recognition via the one-class SVM are shown in Table 2. We also compare the phoneme error rate (PER) and word error rate (WER) of the proposed kernel-based lip shape clustering algorithm for the Mandarin vowel phoneme classifier and of the CHMM classifier in Table 3. Note that both the SVM and CHMM classifiers are context-independent, but the one-class SVM classifier is used in the speaker-dependent case while the CHMM classifier is used in the speaker-independent case, and the proposed real-time voice-driven facial animation system is aimed at low-bit-rate mobile communication systems; thus, the one-class SVM classifier is suitable for the talking head system. For the voice-driven system, we designed an answering-question MOS test with the questions listed in Table 4 and the score scale for these questions in Table 5. Questions 1 and 2 relate to the correctness of the recognition performance; Questions 3 and 4 relate to the general impression of the animations shown.
Table 2. The isolated word recognition rate (%) of the SVM classifier

Vowel phoneme           ㄚ    ㄛ    ㄜ    ㄝ    ㄞ    ㄟ    ㄠ    ㄡ    ㄢ    ㄣ
Before clustering (%)    5    30    20    40    25    37    20    50     5     5
After clustering (%)     5     5     5    10    10    14    15    10     5     5

Vowel phoneme           ㄤ    ㄥ    ㄦ    ㄧ    ㄨ    ㄩ     Avg
Before clustering (%)   14    36    25     8    20    12.5   19.22
After clustering (%)     4    12     5     8    15    12.5    8.78
Table 3. Phoneme error rate (PER) and word error rate (WER) of the classifier methods

Classifier              PER (%)    WER (%)
The proposed method      8.78       27.65
CHMM                    32.25       84.75
Table 4. Question sets for subjective test
            Inquiry
Question 1  Is the result correct when you pronounce each single vowel phoneme?
Question 2  Is the result correct when you pronounce an isolated word?
Question 3  How long is the delay of the facial animation?
Question 4  Is the facial animation natural?

Table 5. Score scale for correctness and delay

Score   Questions 1 & 2      Question 3                         Question 4
5       Excellent            Perfectly synchronized             Very natural
4       Acceptable           Natural, perceptible delay         Good
3       Fair                 Slight delay                       Fair
2       Little acceptable    Delay (but not objectionable)      Poor (continuous)
1       Not correct          Slow                               Discontinuous
Fig. 5. Result of MOS test of question sets
We compare the averaged MOS test results with the commercial software "CrazyTalk", as shown in Fig. 5. CrazyTalk must be fed with recorded wave data to produce its animations; it is an off-line speech-driven system. Thus, its speech segmentation and lip shape parameters are easier to process than those of the real-time human talking face system.
5 Conclusions
This paper describes a real-time lip-sync method with which a speaker's lip shape is synchronized with the corresponding speech signal for low-bandwidth mobile devices. Phoneme recognition is generally regarded as an important task in the operation of a real-time lip-sync system. In this work, a kernel-based mouth shape clustering algorithm inspired by one-class support vector machines is used. For each speech segment, we extract 12th-order linear predictive cepstral coefficients (LPCCs) as the speech feature vector, and the Mandarin vowel phonetic symbol is recognized by the one-class SVM. We use the sum of absolute differences (SAD) as a vowel lip shape likelihood to cluster lip shapes into categories, and then adjust the source and destination lip shape pictures at the transparency level using alpha blending. Experimental results indicate that this method outperforms the conventional CHMM method in phoneme error rate (PER), 8.78% versus 32.25%. In the system implementation, these phoneme recognition results, including the single vowel recognition results based on the one-class SVM, can be provided as elementary information to a talking face in order to realize more naturally smoothed lip motions in real-time voice-driven talking face lip-sync systems.
References 1. Lin, I.C., Hung, C.S., Yang, T.J., Ouhyoung, M.: A Speech Driven Talking Head System Based on a Single Face Image. In: Proc. Pacific Graphics 1999, Seoul, Korea, October 1999, pp. 43–49 (1999) IEEE ISBN 0-7695-0293-8 2. Ostermann, J., Weissenfeld, A.: Talking faces-technologies and applications. In: Proc. of ICPR 2004, August 2004, vol. 3, pp. 826–833 (2004) 3. Tamura, M.: Visual speech synthesis based on parameter generation from HMM: Speech driven and text-and-speech driven approaches. In: Proc. AVSP 1998, pp. 221–226 (1998) 4. Zoric, G., Pandzic, I.S.: Automatic lip sync. and its use in the new multimedia services for mobile devices. In: Proc. 8th Int. Conf. Telecommunications, vol. 2, pp. 353–358 (2005) 5. Xie, L., Liu, Z.: Realistic mouth-synching for speech-driven talking face using articulatory modeling. IEEE Trans. Multimedia 9(3), 500–510 (2007) 6. Park, J., Ko, H.: Real-Time Continuous Phoneme Recognition System Using ClassDependent Tied-Mixture HMM With HBT Structure for Speech-Driven Lip-Sync. IEEE Transaction on Multimedia 10(7) (November 2008) 7. Sun, N., Suigetsu, K., Ayabe, T.: An Approach to Speech Driven Animation. In: International Conference on Intelligent Information Hiding and Multimedia Signal Processing, August 15-17 (2008) 8. Camastra, F., Verri, A.: A novel kernel method for clustering. IEEE Trans. PAMI 27(5), 801–805 (2005) 9. Tax, D.M.J., Duin, R.P.W.: Support vector domain description. Pattern Recognition Letters 20(11-13), 1191–1199 (1999)
Dynamic Fixed-Point Arithmetic Design of Embedded SVM-Based Speaker Identification System Jhing-Fa Wang, Ta-Wen Kuan, Jia-Ching Wang, and Ta-Wei Sun Department of Electrical Engineering, National Cheng Kung University No. 1 University Road, Tainan City, Taiwan
[email protected],
[email protected]
Abstract. This work proposes a dynamic fixed-point arithmetic design for SVM-based speaker identification in an embedded environment. The whole speaker identification system includes LPCC extraction, SVM training with sequential minimal optimization (SMO), and SVM recognition. The proposed dynamic fixed-point design is applied to each arithmetic procedure, and a fixed-point error analysis is also performed. The fixed-point SVM-based speaker identification system has been implemented and evaluated on an ARM9 DMA2400. The experimental results show that the speaker identification accuracy is only slightly degraded with the proposed dynamic fixed-point technique. Keywords: Support vector machine (SVM), linear prediction cepstral coefficient (LPCC), sequential minimal optimization (SMO), dynamic fixed-point design.
1 Introduction
Traditionally, research on speaker recognition covers two applications: speaker identification and speaker verification. Speaker identification is to find the correct speaker of a given test utterance among the registered speakers, while speaker verification is to determine whether a claimed speaker is accepted based on whether the score exceeds a threshold. Recent studies have revealed that speaker identification systems based on support vector machines (SVMs) can achieve good identification accuracy. In some stand-alone applications, such as smart home appliances, digital doorways, and mobile consumer products, it is essential to realize the speaker identification algorithm on an embedded system. However, this kind of device is often resource-limited and may not possess a floating-point arithmetic unit. Adopting a fixed-point arithmetic design rather than a floating-point one reduces the required computational power to meet the real-time requirement. The disadvantage of a fixed-point realization is that the speaker identification accuracy may decrease considerably. Therefore, this paper proposes a dynamic fixed-point design to reduce the round-off error. With a dynamic range analysis, each variable is dynamically formatted into integer and fraction parts.
The rest of this article is organized as follows. Section 2 describes the SVM training and the SMO algorithm in detail. Section 3 describes the proposed dynamic fixed-point model and its basic notion. Section 4 elucidates the error analysis of proposed model. Section 5 shows the experimental results. Finally, Section 6 draws conclusions.
2 SVM-Based Speaker Identification
In our embedded speaker identification system, the speech signal is parameterized as a sequence of 18th-order linear predictive cepstral coefficients after preprocessing that includes end-point detection, pre-emphasis, frame blocking, and Hamming windowing. The support vector machine is used as the identification model; for training the SVM, the sequential minimal optimization algorithm is applied. Each trained model is a decision function, called a hyperplane, that can be used for 2-class identification; multi-class identification can be realized from the 2-class case, for example with the one-versus-one approach [5].
2.1 Linear Predictive Cepstral Coefficients
With the autocorrelation coefficients of a speech frame, we can use the Levinson-Durbin recursive algorithm to obtain the linear predictive coefficients (LPC) a(j), j = 1, ..., P. For cepstrum extraction, instead of taking the inverse Fourier transform of the logarithm of the spectrum, the linear predictive cepstral coefficients (LPCC) are derived from the LPC. The cepstral coefficients C(n) can be computed with a recursive formula, without computing the Fourier transform, defined by

\[
C(n) =
\begin{cases}
0, & n < 0 \\
\log \Theta_0, & n = 0 \\
a_n + \sum_{k=1}^{n-1} \dfrac{k}{n}\, C(k)\, a(n-k), & 0 < n \le P
\end{cases}
\qquad \Theta_0^2 = E_{\min}
\tag{1}
\]

where P is the LPCC order, C(n) is a linear predictive cepstral coefficient and a(n) are the LPC coefficients.
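A direct sketch of the recursion in Equation (1), assuming the LPC coefficients a(1..P) and the minimum prediction error are already available from Levinson-Durbin (the 1-indexed list convention is an assumption):

```python
import math

def lpc_to_lpcc(a, e_min, order=None):
    """Convert LPC coefficients a[1..P] (a[0] unused) to LPCC via Eq. (1)."""
    P = order or (len(a) - 1)
    c = [0.0] * (P + 1)
    c[0] = math.log(math.sqrt(e_min))          # log(Theta_0), with Theta_0^2 = E_min
    for n in range(1, P + 1):
        c[n] = a[n] + sum((k / n) * c[k] * a[n - k] for k in range(1, n))
    return c
```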
2.2 Support Vector Machine
Support vector classification is a computationally efficient way of learning good separating hyperplanes in a high-dimensional feature space. Accordingly, training an SVM is equivalent to finding the hyperplane with the maximum margin to the support vectors. This optimal hyperplane is obtained by solving the following constrained optimization problem in the case of imperfect separation:

\[
\min_{w, b, \xi} \; \frac{1}{2} w^{T} w + C \left( \sum_{i=1}^{N} \xi_i \right) \tag{2}
\]

subject to

\[
y_i \big(w\,\phi(x_i) + b\big) + \xi_i - 1 \ge 0, \qquad \xi_i \ge 0, \qquad 1 \le i \le n
\]

where $x_i$ is a training sample, $y_i \in \{-1, +1\}$ is the corresponding target value, $w \in R^m$ is a vector of weights of the training instances, $b$ is a constant, $C$ is a real-valued cost parameter, and $\xi_i$ is a penalty parameter (slack variable). By (2), the constrained optimization problem is referred to as a constrained quadratic programming problem, and it can be rewritten in dual form using Lagrange multiplier theory. The Lagrange function in the imperfectly separating case is shown in (3):

\[
\hat{\Lambda}(w, b, \xi_i, \alpha, \mu) = \frac{1}{2} w^{T} w + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left[ y_i (w^{T} x_i + b) + \xi_i - 1 \right] - \sum_{i=1}^{N} \mu_i \xi_i \tag{3}
\]
where $\alpha_i$ and $\mu_i$ are the Lagrange multipliers. After maximizing $\hat{\Lambda}(w, b, \xi_i, \alpha, \mu)$ with respect to $\alpha$ and $\mu$, under the constraints obtained by setting the gradients of $\hat{\Lambda}$ with respect to the primal variables $w$, $b$ and $\xi$ to zero, the dual presentation of the primal optimization problem is formulated as (4). Basically, it is a QP problem and can be solved by the SMO algorithm.

\[
\max_{\alpha} \; \Psi(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j\, k(x_i, x_j)\, \alpha_i \alpha_j \tag{4}
\]

subject to $\sum_{i=1}^{N} y_i \alpha_i = 0$, $0 \le \alpha_i \le C$, $i = 1, \ldots, n$.
2.3 Sequential Minimum Optimization Algorithm The SMO is initially to calculate the constraint of all multipliers, and then to find out the constrained maximum [2]. For only two multipliers to be optimized, the constraints can be regarded as two dimensions of square area with diagonal line segment bounded on a box boarder. Such a constraint shows the reason why at least two multipliers is the minimum number to be optimized rather than only one multiplier. John Platt comprehensively discussed the SMO algorithm [1], [2].
Dynamic Fixed-Point Arithmetic Design
527
3 Dynamic Fixed-Point Arithmetic Design A low cost ARM processor and dynamic fixed-point design are adopted in this paper. The advantage of ARM processor is not afforded the float-point operation. Basically, the dynamic fixed-point design is to dynamically tuning the integer part and fraction part of a variable according to the range of the floating-point simulation results. The analysis flowchart of the fixed-point format is shown in Fig.1. Take a variable with the dynamic range between 6.2456 and -5.1235 for example, in 32-bit fixed-point format, we can use 1 bit for sign, 3 bits for integer, and the rest 28 bits for fraction. This fixed-point format is denoted as Q28.
Start
Dynamic Range Analysis
Module Analysis
Data Format Decision
All Module Fixed
No
Yes No
Analysis Efficient Yes End
Fig. 1. The analysis flowchart of the fixed-point format
After the data format decision, we must transform floating-point operations into fixed-point operations. If three signals c[t], a[t] and b[t] need to be processed, they can be defined as Qd, Qn, and Qm, respectively. The basic operations include addition, subtraction, multiplication, and division [3]. For a floating-point addition/subtraction c[t] = a[t] ± b[t], the transformation for the output fixed-point signal is as follows:

\[
C[t] = 2^{d} \times c[t] = 2^{d}\,(a[t] \pm b[t]) = 2^{d-n} A[t] \pm 2^{d-m} B[t] \tag{7}
\]
\[
C[t] = (A[t] \ll (d-n)) \pm (B[t] \ll (d-m))
\]
where a[t], b[t] , c[t] are floating-point signals, and A[t], B[t], C[t] are fixed-point signals. The symbol << denotes the left-shift operation performed in the bit level. Some typical fixed-point operations are summarized in Table 1.
Table 1. Fixed-point operations

Floating-point operation     Fixed-point transformation
c[t] = a[t]                  C[t] = A[t] << (d − n)
c[t] = a[t] + b[t]           C[t] = (A[t] << (d − n)) + (B[t] << (d − m))
c[t] = a[t] − b[t]           C[t] = (A[t] << (d − n)) − (B[t] << (d − m))
c[t] = a[t] × b[t]           C[t] = (A[t] × B[t]) << (d − n − m)
c[t] = a[t] / b[t]           C[t] = (A[t] << (d − n − m)) / B[t]
c[t] = sqrt(a[t])            C[t] = isqrt(A[t] << (2 × d − n))
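A small sketch of the addition and multiplication rows of Table 1, using plain Python integers as the fixed-point words; the helper names are ours, and the table's left shift by (d − n − m) for multiplication is written here as the equivalent right shift by (n + m − d) when n + m > d:

```python
def to_fixed(x, q):
    """Quantize a float to a Qq fixed-point integer (q fraction bits)."""
    return int(round(x * (1 << q)))

def to_float(X, q):
    return X / float(1 << q)

def fixed_add(A, n, B, m, d):
    """c = a + b with a in Qn, b in Qm, result in Qd (assumes d >= n and d >= m)."""
    return (A << (d - n)) + (B << (d - m))

def fixed_mul(A, n, B, m, d):
    """c = a * b; the product of Qn and Qm is Q(n+m), shifted down to Qd."""
    return (A * B) >> (n + m - d)

# quick check
a, b = to_fixed(1.25, 27), to_fixed(-0.5, 28)
print(to_float(fixed_add(a, 27, b, 28, 28), 28))  # ~0.75
print(to_float(fixed_mul(a, 27, b, 28, 28), 28))  # ~-0.625
```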
4 Model Analysis
4.1 Dynamic Fixed-Point Analysis
Table 2 shows the data range and dynamic fixed-point formats of the variables used in LPCC extraction. The variable α denotes a linear prediction coefficient, Rn is a cell value of the autocorrelation matrix, En is the linear prediction error, Θ0 is the square root of the minimum linear prediction error, and C(n) is a cepstral coefficient. The maximum and minimum values were obtained in simulation after the feature calculation. In Table 2, the first column gives the feature-extraction parameter, the second its maximum value, the third its minimum value, the fourth the number of bits used, the fifth the sign bit, and the final column the number of fraction bits.
Table 2. Data range and dynamic fixed-point formats of variables used in LPCC extraction
Variable            Maximum     Minimum      Total bits   Sign bit   Fraction
α (LPC)             10.2265     -10.8670     32           1          Q27
Rn                  240.1496    -193.3301    64           1          Q55
En                  14.0546     0.0000028    32           0          Q28
Θ0                  3.7489      0.0017       32           0          Q29
C(0) (LPCC)         1.3215      -6.3963      32           1          Q28
C(n), 0 < n ≤ P     3.7843      -1.6579      32           1          Q28
The SMO algorithm can be divided into five steps [4]. The variables are updated in a loop until the selected pair of Lagrange multipliers satisfies the KKT conditions. Table 3 shows the dynamic fixed-point analysis of the SMO algorithm. For the
fixed-point SMO algorithm, a 64-bit fixed-point format is adopted. The integer part uses only four bits or fewer, and the fraction part occupies most of the bits to assure precision.

Table 3. Data range and dynamic fixed-point formats of variables used in the SMO algorithm
Variable        Maximum    Minimum     Total bits   Sign bit   Fraction
k11             45.7662    0           64           0          Q58
k12             40.7703    -6.1791     64           1          Q57
k22             45.8217    0           64           0          Q58
η (eta)         0          -49.0218    64           1          Q57
E1              8.1619     -9.5351     64           1          Q59
E2              8.7974     -7.6237     64           1          Q59
L               1          0           64           0          Q63
H               1          0           64           0          Q63
α1              1          0           64           0          Q63
α2              25.2191    -15.4364    64           1          Q58
s = y1·y2       1          -1          64           1          Q62
w               4.8218     -5.7868     64           1          Q60
b               8.4107     -12.6734    64           1          Q59
Δb (delta_b)    7.1288     -8.1559     64           1          Q59
4.2 Fixed-Point Error Analysis
The round-off error is generated when the floating-point format is truncated to the fixed-point format; multiplications and multiply-accumulate operations are the most serious sources of round-off error. The error can be lessened by increasing the fixed-point word length. According to our dynamic range analysis, the integer parts of the variables in the LPCC extraction and the SMO algorithm are not very large, which indicates that most of the bits can be used for the fraction part. The SNR is a useful measure for round-off error analysis [7]; its basic form is

\[
SNR = 20 \log \left( \frac{\left| dataA[i] \right|}{\left| dataA[i] - dataB[i] \right|} \right)
\]

where dataA[i] is the floating-point value, dataB[i] is its corresponding fixed-point value, i = 0, ..., N-1, and N is the number of occurrences of dataA.
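A minimal sketch of this SNR measure over a sequence of values, comparing floating-point references with their fixed-point reconstructions (array names are assumptions):

```python
import numpy as np

def roundoff_snr_db(data_a, data_b):
    """Per-sample 20*log10(|float| / |float - fixed|), averaged over all samples."""
    a = np.asarray(data_a, dtype=np.float64)
    b = np.asarray(data_b, dtype=np.float64)
    err = np.abs(a - b)
    err[err == 0] = np.finfo(np.float64).tiny     # avoid division by zero
    return float(np.mean(20.0 * np.log10(np.abs(a) / err)))
```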
Fig. 2. Fixed-point error analysis for 18-order LPCC based on SNR
For ten-speaker identification, 45 hyperplanes must be trained by the SMO algorithm under the one-versus-one multi-class classification approach [5]. Each hyperplane is determined by the two parameters b and w. Figure 3 depicts the bias value b of the 45 hyperplanes in the floating-point and fixed-point formats; the two curves almost completely overlap.
Fig. 3. Bias b value of the 45 hyperplanes by the floating-point and fixed-point formats
5 Experimental Results
The performance of speaker identification was evaluated on the FMMD dataset [6], which was recorded with a microphone array in a real-world environment. Figure 4 gives the performance comparison between the floating-point and dynamic fixed-point formats for speaker identification. Multiple constrained KKT iteration numbers are used to accelerate the SMO optimization process. The experimental results demonstrate that the proposed dynamic fixed-point format introduces only a very slight performance degradation.
Training utterance = 10 sec; recognition accuracy (%) for testing utterances of 2–6 sec.

Constrained KKT iterations   Format          2 s    3 s    4 s    5 s    6 s    Average (%)
200                          Floating        84.8   87.2   88.4   92.9   94.7   89.6
200                          Dynamic fixed   86.6   87.9   89.5   90.6   92.0   89.3
300                          Floating        84.1   87.2   88.8   92.9   94.7   89.5
300                          Dynamic fixed   86.7   87.9   89.5   90.6   92.0   89.3
400                          Floating        84.5   87.0   88.8   92.6   94.7   89.5
400                          Dynamic fixed   86.7   87.9   89.5   90.6   92.0   89.3
Average                                      85.6   87.5   89.1   91.7   93.4
Fig. 4. Performance comparison between floating-point and dynamic fixed-point formats
6 Conclusion In this paper, an SVM-based speaker identification system is realized on an embedded device. A fixed-point arithmetic design is presented to accelerate the LPCC extraction, SVM training, and SVM recognition. The proposed dynamic fixed-point design starts from the dynamic range analysis of the procedure variables and then determines the format based on the available bit number. The error analysis results reveal that the fixed-point realization only brings slight performance degradation. Our future work is to design an automatic mechanism to decide the required total bit number for use in application specific IC design.
References 1. Platt, J.C.: Advances in Kernel Methods of Support Vector Machines: Fast training of support vector machines using sequential minimal optimization. MIT Press, Cambridge (1998) 2. Platt, J.C.: Sequential minimal optimization for SVM. Internet (2007) 3. Andrew, N.S., Dominic, S., Chris, W.: ARM System Developer’s Guide: Designing and Optimizing System Software 4. Wang, J.F., Kuan, T.W., Wang, J.C., Gu, G.H.: VLSI Design of Sequential Minimal Optimization Algorithm for SVM Learning. In: Proc. IEEE Int. Conf. on Circuits and Systems (ISCAS), May 25-27, pp. 2509–2512 (2009) 5. Bishop, C.M.: Pattern Recognition and Machine Learning, pp. 338–339. Springer, UK (2006) 6. Wang, J.F., Kuan, T.W., Wang, J.C., Gu, G.H.: Ubiquitous and robust text- independent speaker recognition for home automation digital life. In: Sandnes, F.E., Zhang, Y., Rong, C., Yang, L.T., Ma, J. (eds.) UIC 2008. LNCS, vol. 5061, pp. 297–310. Springer, Heidelberg (2008) 7. Hart, C.L., Jang, J.S.: Speech Recognition on 32-bit Fixed-point Processors: Implementation & Discussions, Master’s Thesis, Tsing Hua University, Hsinchu City, Taiwan (2005)
A Neural Network Based Model for Project Risk and Talent Management Nadee Goonawardene, Shashikala Subashini, Nilupa Boralessa, and Lalith Premaratne University of Colombo School of Computing, 35, Reid Avenue, Colombo 7, Sri Lanka {nadeerodrigo,shashiks5212,nlupaboralessa}@gmail.com,
[email protected]
Abstract. The objective of this research was to examine the effectiveness of using neural and fuzzy systems in areas such as job recruitment, prediction of project success or failure, and decision making in employee performance appraisal in any company or industry. Since these activities are common to many areas, the domain of the research was confined to the software industry. Through a survey, the most relevant parameters (together with their levels of relevance) used in evaluating the suitability of candidates for recruitment to various designations were gathered, and a fuzzy-logic and neural-network based approach was used for training and testing. Performance appraisal was also implemented using a neural network, which was able to determine an employee's appraisal with a high degree of accuracy. The prediction of project success and failure was implemented similarly. Since many linguistic terms are used in these activities, a fuzzy input/output interface was added to the system. The evaluation of the proposed system with support from industry experts shows a very high level of accuracy. Keywords: Talent Management, Knowledge Management, Risk Management, Artificial Neural Networks, Fuzzy Logic.
1 Introduction
In the present competitive economy, recruitment of employees, management of employee talents and assessment of project risks play an integral part in the success of an organization. It is widely believed that the role of managers is becoming a key determinant of an enterprise's competitiveness in today's knowledge-economy era. The most important issue for human resource management is to provide the organization with a high-quality workforce; human resource development theories therefore set the goal of "selecting the right people for the right positions." In a conventional company, employee selection is based on the knowledge of experts. There are no hard and fast rules for employee selection; the selection criteria vary with the skills and experience of the interviewer, and if the interviewer is new to the company's domain, there is a risk of recruiting inappropriate employees. Apart from that, once employees are recruited, their skill
levels change over time. Thus, during the employee life cycle, performance appraisal has to be conducted periodically. This task is carried out by the managers or leaders of the company, and standardized criteria for performance appraisal are rarely found. If the managers resign or are absent, the new people who take over the responsibility may have little knowledge of the employees' skills. Apart from talent management, risk assessment of projects is another major role of a manager. Projects with no risk factors are very rare in any industry, and the area we worked on, the software industry, is no exception. Software projects can fail due to vague requirements, inefficiencies in the software development life cycle, communication gaps between stakeholders, or documentation errors. Because of these unforeseen uncertainties, companies fail to deliver projects within budget and schedule and with the expected quality, and the risks of carrying out a project are judged according to the manager's knowledge. With the above problems in mind, the objective of this research was to examine the effectiveness of using neural and fuzzy systems in areas such as job recruitment, prediction of project success or failure, and decision making in employee performance appraisal. Though the application of our research suits any industry, we evaluated our system for the software development industry.
2 Related Work Existing Enterprise Resource Planning systems are most often used to provide human capital management functionality in industries. PeopleSoft is an ERP system which was launched as an enterprise human capital management suite in order to make the management process efficient by analyzing and modeling your workforce skill pool to accurately plan the future workforce and leadership. It was also used to attract and retain ideal employees to fit the workforce plan [1]. As another example the SAP ERP human capital management solution is another complete and integrated human capital management solution which is popular among many organizations. This is capable of automating all the significant human resource processes, such as employee administration, payroll, and legal reporting. These functionalities will increase the efficiency and will also work efficiently with the changing global and local regulations [2]. Having mentioned the above two ERP systems it should be stated that they do not contain an automated employee selection system which is one of the core components of the system developed in our research. A set of researchers constructed a new model for the evaluation of managerial talents using ‘fuzzy analytic hierarchy model’. According to their model, managerial talents are classified as individual traits and managerial skills. Under individual traits capability trait, motivational trait and personality trait were considered and under managerial skills conceptual skills, interpersonal skills, technical skills were considered [3]. In the system developed in this research fuzzy systems were basically used as a way of pre and post processing of neural networks. When the above system is compared with our system we should mention that we used fuzzy interfaces not just for pre and post processing but we used for decision making also.
Another important area that should be discussed is the graphical analysis tool called the RENO. It is used to provide applications for risk analysis, complex reliability modeling, maintenance planning, optimization, and operational research etc. It is a powerful and flexible platform for visualizing and dynamically analyzing and simulating nearly any kind of physical, financial or organizational system [4]. Another available solution was ‘Modulo’ risk manager; it provides an objective methodology that offers qualitative and quantitative results that can help on effectively prioritizing actions of an organization and supporting decision making [5]. Because our system used MATLAB we had to develop the graphical visualizations using .NET technologies. Another area related to our project was the statistics provided by a set of researchers [6]. They found out that the overall project success rate was 26%, challenged project rate was 46% and failed project rate was 28%. It should also be mentioned that their statistics is similar to the statistics provided by the Standish Group in America. Base on their research findings software projects can be divided in to three ranks, such as successful projects, challenged projects (projects partially failed) and projects that are total failure. The outcome of all these research work was beneficial for us as we based some of our project work on the statistics and the categorizations, which were outcomes of these research works. Apart from above research, another set of researchers [7] proposed a new model for risk evaluation. Going into more detail on this research it should be mentioned that it was an approach to build a statistical neural network-based framework for supporting the risk management process.. The strategy proposed to use historical data to build a baseline to evaluate risk value and to automatically check whether the project is meeting the stated objectives or not. The baseline was generated by an ANN, whose output represented the ‘posterior probability’ that can be used in the risk evaluation. When it came to our research we should mention that we did not use an ANN based on statistical results. Apart from that threshold values were mostly used in fuzzy interfaces. In addition to the above work, a separate module was developed by Annie R. Pearce, Rita A. Gregory and Laura Williams to evaluate the risk of cost for a construction project using back propagation [8].
3 Design and Implementation The system has two major components, talent management component and the risk management component. The talent management component facilitates project managers/ technical leaders to identify the most suitable position for an employee. By the use of talent management component, human resource managers will able to select qualified employees based on their skill level and preference and this component assist in employee’s performance appraisal also. The risk management component facilitates in predicting the project’s success or failure based on the risk factors that were previously identified. Additionally a fuzzy application was designed to convert the linguistic values to real values. Data was gathered through questionnaires. Then the gathered data were analyzed to identify a standard set of factors (parameters) common to software development industry. Then, nineteen factors for recruitment, 16 factors for performance appraisal
and twenty two risk factors affecting significantly to the success/ failure of a project were finally selected. 3.1 Fuzzy Application As means of pre & post processing, well defined fuzzy interfaces were used to convert linguistic input values to values which could be given as input to a neural network. The process of converting linguistic values to fuzzy values was based on a set of predefined weights which were different in each network we used. 3.2 Talent Management Recruitment. 19 inputs were identified for the recruitment neural network. Inputs categorized into three categories include; skills measured at the interview, aptitude test and the qualifications extracts analyzing the CV. So the input layer of the neural network consists of nineteen input nodes. Inputs are given in Table 1. Table 1. Inputs of Recruitment Neural Network Interview Communication Skills Inter – Personal Skills Domain Knowledge
Aptitude Test Testing Skills Domain Knowledge Analytical Skills
Analytical Skills Problem Solving Skills Attitudes Programming Skills
Documentation Skills Object Oriented / UML Skills Database Implementation Skills
CV Relevance Degree GPA Relevancy of Post Graduate Qualifications Extra Activities Relevant Experience Relevance of professional qualifications
The 19 inputs were first sent through a fuzzy interface for defuzzyfication, and thereafter defuzzyfied values were multiplied by a weight factor based on the candidate’s preference. There were 3 weight sets associated with three positions. The positions are software engineers, quality assurance engineers and business analysts. Altogether the network classifies employees into four classes including a class to classify the disqualified candidates. Then the pre-processed data was given as input to a neural network for employee classification as shown in the Fig. 1. A five layered feedforward backpropagation neural network was used for talent management. It consisted of nineteen nodes in the input layer and three hidden layers which consist of 8, 6 and 4 hidden nodes. The output layer consists of 2 nodes. A trial and error approach was used to find the suitable number of hidden layers and nodes. It should also be mentioned that in the implementation process we used the tan sigmoid transfer function in order to conduct the supervised training of the network. Going into more detail of the training process, it should be stated that the connection weights of the neural network was initialized with some random values. Afterwards training data was given as input to the network after normalizing them. Then, the connection weights were adjusted according to the error back-propagation learning rule.
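The authors built the network in MATLAB; as an illustration only, an equivalent topology (19 inputs, hidden layers of 8, 6 and 4 tan-sigmoid units, trained by gradient-descent backpropagation) can be sketched with scikit-learn. Note that the original design uses two output nodes to encode the four classes, whereas this sketch lets the classifier handle the four labels directly:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline

# 19 fuzzified, preference-weighted skill scores in; one of four classes out
recruitment_net = make_pipeline(
    MinMaxScaler(),
    MLPClassifier(hidden_layer_sizes=(8, 6, 4), activation="tanh",
                  solver="sgd", learning_rate_init=0.01, momentum=0.9,
                  max_iter=2000, random_state=0),
)
# recruitment_net.fit(X_train, y_train)   # X_train: (n_samples, 19); y_train: 4 labels
# print(recruitment_net.score(X_test, y_test))
```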
We have done several changes for the network architecture, based on the Number of hidden layers, Optimum number of hidden nodes, Momentum factor and the Learning Rate parameter. The training was done using a training data set of 300. After that, the network was tested with the test data set and the accuracy of the system was calculated. The testing was done using a training data set of 40 and the accuracy for the testing process was 90%, 90%, 62%, 60% with respect to positions ,Software Engineer, Quality Assurance Engineer, Business Analyst and Disqualified. The overall accuracy was 75%.
Fig. 1. Architecture of the Recruitment Component (fuzzy interface → neural network → output classes)
Performance Appraisal. Sixteen skills were identified regarding the performance appraisal of employees. These were identified at the data analysis stage. Input layer of the neural network consists of sixteen input nodes. Inputs are given in Table 2. Table 2. Inputs of Performance Appraisal Neural Network
Communication Skills Inter- personal Skills Domain Knowledge Analytical Skills Problem Solving Skills Attitudes
Inputs Programming Skills Testing Skills Performance Efficiency Documentation Skills Object Oriented / UML Skills Relevancy of Post Graduate Qualifications
Extracurricular Activities Relevant Experience Relevance of professional qualifications Database Implementation Skills
The inputs were initially pre-processed using fuzzy interfaces. Afterwards the preprocessed data was given as input to a neural network, based on the tan sigmoid transfer function, which contained 3 hidden layers (8, 6, 4 nodes), 16 input nodes and two output nodes. The training of the network was done using a data set of 300 records based on positions such as Associate Software Engineer, Software Engineer, Senior Software Engineer and Quality Assurance Engineer. The testing was done using a data set of 40. The accuracy for the training process was 90%, 70%, 80% and 80% with respect to the positions Associate Software Engineer, Software Engineer, Senior Software Engineer and Quality Assurance Engineer. The overall accuracy was 80%.
3.3 Risk Management According to the initial study carried out, we found out that there are 22 factors affecting the success of a project. These factors basically falls into three major risk categories, which are factors affecting the schedule (time), budget (cost) and quality of a project. Table 3. Input factors of Risk Management Component Cost
Time
Quality
Financial Budget Acceptability Required Team Effort Complexity of Project Requirement Stability Developer Experience Analyst Experience Cost for new Technologies Third Party Involvement
Acceptability of Schedule Complexity of Project Requirement Stability Probability of Staff Turnover/ Absence Availability of clearly defined project milestones Clearly defined Project Goals and Objectives Availability of Reusable code Project Progress monitoring frequency
Requirement Stability Complexity of Project Availability of Documented Test Plans Availability of Change Management Process Inspection & Testing Availability QA Experience
Fig. 2. Block diagram of the Risk Management Component (cost, time and quality inputs pass through a fuzzy interface, and the three network outputs feed a fuzzy rule set that produces the final output)
Details regarding the software projects were initially sent through a pre-processing process using well defined fuzzy interfaces. These pre-processed data were given as input to three neural network structures which were designed for cost (classifies whether the cost of the project is surplus, just managed and exceed budget) time (classifies whether the project will be completed before the due date, just on due date and exceeded schedule) and project quality (classifies whether the project is high quality, expected level of quality and low quality). The output of these three networks was combined and post-processed using a fuzzy interface in order to provide a meaningful output to the user. Moreover, it should be stated that the all three networks were feedforward backpropagation networks (see Table 4 for detail network structure) which used the tan sigmoid as the transfer function (each layer) and Traingd as the training function. Training was carried out using 75 real world software
projects and testing was done using 30 projects. In addition, regression analysis was performed to map the inputs to the targets. A block diagram of the components used in the entire risk evaluation process is shown in Fig. 2. Table 4. Network Structures of Risk Management Component
Cost network:    8 input nodes, 2 hidden layers, 2 output nodes; output classes: Surplus, Just Managed, Exceeded Budget; training accuracy 91%, testing accuracy 100%
Time network:    8 input nodes, 2 hidden layers, 2 output nodes; output classes: Before due date, Just on due date, Exceeded Schedule; training accuracy 91%, testing accuracy 90%
Quality network: 6 input nodes, 2 hidden layers, 2 output nodes; output classes: High Quality, Medium Quality, Low Quality; training accuracy 91%, testing accuracy 90%
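Each of the three networks summarised in Table 4 can be set up in the same way; the fragment below sketches the cost network only. The variable names (costInputs, costTargets) are assumptions, and since the sizes of the two hidden layers are not reported, the values used here are placeholders; only the tan-sigmoid transfer function and the traingd training function are taken from the description above.

% Sketch of the cost network: 8 fuzzified inputs, 2 hidden layers, 2 output nodes.
net = newff(costInputs, costTargets, [6 4]);   % hidden sizes are placeholders
for k = 1:3
    net.layers{k}.transferFcn = 'tansig';      % tan-sigmoid in each layer
end
net.trainFcn = 'traingd';                      % gradient-descent training (traingd)
net = train(net, costInputs, costTargets);     % 75 real-world projects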
Fuzzy Rule Set. Evaluation of the overall risk of a software project was done using a fuzzy interface that included 54 carefully designed fuzzy rules. The outputs of the three networks were sent through this fuzzy interface to predict how risky the project is likely to be. A project can be in one of three states: successful, challenged or failed. Moreover, if the output indicates a challenged project, the system is capable of stating the changes that need to be made in order for it to become a successful project. Examples of such recommendations include: amendments necessary; modifications recommended in schedule and performance; adjustments in schedule recommended; re-organize the schedule and performance factors; careful monitoring recommended; re-organizing performance factors recommended; and can succeed with amendments in budget and schedule.
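The 54 fuzzy rules themselves are not listed in the paper, so the fragment below is only a crisp, much simplified stand-in for this post-processing stage; the class encoding and the thresholds are hypothetical.

% Simplified (crisp) stand-in for the fuzzy rule set combining the three outputs.
% costClass, timeClass and qualityClass are assumed integer labels, with 3
% denoting the worst outcome of the corresponding network.
worst = (costClass == 3) + (timeClass == 3) + (qualityClass == 3);
if worst == 0
    status = 'Successful project';
elseif worst < 3
    status = 'Challenged project';   % the real system also suggests amendments
else
    status = 'Failed project';
end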
4 System Testing and User Evaluation The talent management system was evaluated by a project manager from a leading software company in Sri Lanka. After a short briefing on the system, the project manager was requested to use the proposed automated system and compare the outcome with the manual process. After several comparisons the expert was satisfied with the accuracy of the automated system, which was around 90%; in fact, the manager requested the authors to accommodate more job categories. The performance appraisal component was evaluated similarly and its accuracy was also found to be around 90%; in this case the component was tested using a real-world software development project. It was found that, at present, risk assessment is basically done according to the knowledge and experience of experts. During the evaluation process it was revealed that the very high uncertainty involved in the manual process could, to a great extent, be overcome by using an automated system. The system implemented for risk assessment was also evaluated similarly with the support of a human expert, and the results were similar to those of the previous components.
Table 5. Test results (% accuracy)
System                  Training set   Test set
Recruitment             90             75
Performance Appraisal   85             80
Risk Assessment         84             83
5 Conclusion The objective of this research was to examine the effectiveness of using neural and fuzzy systems in areas such as job recruitment, prediction of project success or failure, and decision making on the performance appraisal of employees. To date there is no hard and fast rule for employee selection, so this process varies with the skills and experience of the interviewer; employee appraisal is in much the same position, with the addition of some statistical methods. Our research outcomes show that neural fuzzy systems can be used very successfully to standardize these processes. Even for predicting the risks related to software projects, there are to date no automated intelligent systems that could be used to assist managers. With the promising results of our project, we can confidently state that our system, which is based on neural and fuzzy techniques, could be used to assist managers in such predictions.
References 1. Oracle Inc., Oracle and PeopleSoft, http://www.oracle.com/peoplesoft/index.html 2. SAP, Inc., Sap - sap business management software solutions applications and services, http://www.sap.com/ 3. Huang, L.C., Huang, K.S., Huang, H.P., Jaw, B.S.: Applying fuzzy neural network in human resource selection system. In: ICNC 2007: Proceedings of the Third International Conference on Natural Computation, pp. 169–174 (2004) 4. ReliaSoft Corporation, Risk analysis software, monte carlo simulation software, probabilistic event simulation - reno, http://www.reliasoft.com/reno/ 5. ComplianceHome.com, Modulo launches risk management pandemic solution, http://www.compliancehome.com/news/HIPAA/15777.html 6. Hu, Y., Huang, J., Chen, J., Liu, M., Xie, K.: Software project risk management modeling with neural network and support vector machine approaches. In: ICNC 2007: Proceedings of the Third International Conference on Natural Computation, pp. 358–362. IEEE Computer Society, Washington (2007) 7. Sarcia, S.A., Cantone, G., Basili, V.R.: A statistical neural network framework for risk management process - from the proposal to its preliminary validation for efficiency. In: ICSOFT (SE), pp. 168–177. INSTICC Press (2007) 8. Pearcel, A.R., Gregory, R.A., Williams, L.: Range estimating for risk management using artificial neural networks (1996)
Harnessing ANN for a Secure Environment Mee H. Ling and Wan H. Hassan School of Computer Technology, Sunway University College, Petaling Jaya, Malaysia {mhling,wanh}@sunway.edu.my
Abstract. This paper explores recent works in the application of artificial neural network (ANN) for security – namely, network security via intrusion detection systems, and authentication systems. This paper highlights a variety of approaches that have been adopted in these two distinct areas of study. In the application of intrusion detection systems, ANN has been found to be more effective in detecting known attacks over rule-based system; however, only moderate success has been achieved in detecting unknown attacks. For authentication systems, the use of ANN has evolved considerably with hybrid models being developed in recent years. Hybrid ANN, combining different variants of ANN or combining ANN with non-AI techniques, has yielded encouraging results in lowering training time and increasing accuracy. Results suggest that the future of ANN in the deployment of a secure environment may lie in the development of hybrid models that are responsive for real-world applications. Keywords: Artificial neural network, security, intrusion detection systems, authentication systems.
1 Introduction Information and systems security has been one of the main areas of concern in the 21st century. Given the proliferation of networked computers and growth of the Internet, many systems are becoming increasingly vulnerable to intrusions and other forms of malicious attack. The financial implications are staggering as well – for example, the cost of viruses to organizations in the US alone exceeded USD 14 billion in 2005 [1]. Furthermore, according to the Malware Annual Report 2007 [2], new threats double every 12 months and are no longer destructive in nature but are focused on seizing control of computing resources remotely in an unobtrusive manner. Traditional computing methods that use algorithmic approaches may not be sufficiently effective in providing a secure environment. The mode of intrusion often varies, from a temporal and spatial perspective, and in most cases, the problem may not be well defined nor easily identified. As such, there is a need to build secure systems that are not only self-learning and adaptive in nature but are also self-reliant and responsive in real time. For these reasons, techniques used in artificial neural network may offer better insights and solutions in providing an environment that is both secure and immune against intrusions, present and future. L. Zhang, J. Kwok, and B.-L. Lu (Eds.): ISNN 2010, Part II, LNCS 6064, pp. 540–547, 2010. © Springer-Verlag Berlin Heidelberg 2010
Security includes confidentiality (via cryptography), integrity, authentication, as well as operational security i.e. against on-going attacks and intrusions [3]. A comprehensive review of artificial neural network applications in cryptosystems has been discussed in [4]. The focus of this paper is mainly to present recent works that utilize artificial neural networks in two other application areas, namely intrusion detection systems for network security, and authentication. This paper is thus organized as follows: Section 2 provides a brief overview of the different approaches in neural networks. Section 3 presents recent works in neural network application in intrusion detection systems while Section 4 explores neural networks that are used in authentication systems. Section 5 concludes the paper with respect to challenges in the deployment of neural networks for security.
2 Overview of ANN In essence, artificial neural network (ANN) or simply neural network mimics the way in which the human brain processes information and resolves complex problems, albeit in a very much less intricate manner. An artificial neural network is an interconnected assembly of processing elements (or neurons) that work in parallel to solve problems by detecting certain patterns or trends extracted from imprecise or complex data [5, 6]. Thus a neural network may be trained to solve problems which otherwise would not be possible using conventional problem solving techniques in computing. Since its inception several decades ago, artificial neural network has evolved considerably in terms of architectures and approaches. There are multitude of approaches to neural networks – from the classical single layer feed-forward to the multi-layered feed-forward back-propagation neural networks, from recurrent neural networks to radial basis function (RBF), and from stochastic or probabilistic neural networks to modular neural networks, to name but a few. Moreover, neural networks may be broadly categorized (though not in the strictest sense) based on their specific application. According to [7], there are five main applications areas for neural networks, which are non-mutually exclusive – prediction, classification, data association, data conceptualization, and data filtering. Neural networks that are used for prediction (for given a set of input, predict the outcome) include, amongst others, feed-forward back-propagation, delta-bar-delta, and higher order neural networks. Originally developed in the 1970s, the feed-forward backpropagation neural network is suited for ill-defined problems and consists of an input layer, an output layer and at least one (usually more) hidden layer. Data flows from the input layer into the trained neural network and is processed to give the output or answer through a procedure known as recall. Back-propagation does not occur during recall, but during the learning process of the network through the training set. The delta-bar-delta developed by Robert Jacobs [8] is similar in operations to the feedforward back-propagation approach but with one main difference – the learning process has been optimized using heuristics techniques to expedite the learning rate towards convergence. Higher order neural network, developed by Yoh-Han Pao [9], also extends the feed-forward back-propagation approach by augmenting the input
layer with higher order mathematical/algebraic functions. This also results in a higher rate of learning for the neural network. Neural networks that are used for classification purposes (for a given input, classify the data) include learning vector quantization (LVQ), counter-propagation and probabilistic neural networks. LVQ is an algorithm that was developed by Teuvo Kohonen and is considered a pre-cursor to self-organizing maps neural networks. In LVQ, there is an input layer, a Kohonen classification layer and a competitive output layer. The Kohonen layer is responsible for supervised learning and for classification based on a training set. Classification of data is made based on comparisons with prototype vectors where similarity is determined from the Euclidean distance in feature space [10]. From this distance, the nearest processing element is declared the winner for the whole layer. Consequently, the winner will determine a single output with the required associated classification. Robert Hecht-Nielsen [11] invented the counter-propagation neural network in 1987 as a method to reduce the number of processing elements in the Kohonen layer and improve learning time. It is similar to LVQ but the Kohonen layer in this approach learns in an unsupervised manner, and under certain conditions, this may result in an output that incorrectly chooses elements from different classes whenever the inputs are not well defined. Probabilistic neural network, developed in 1988 by Donald Specht [12], is based on Bayesian classifier and uses a supervised training set for learning. It has three layers – input, pattern and output/summation layers. As with LVQ, probabilistic neural network’s pattern layer works competitively whereby the highest match to an input wins to generate an output. Training probabilistic neural networks is generally simpler than back-propagation models; however, the size of the pattern layer may become large if the degree of classification varies considerably. Hopfield, Hamming neural networks and bidirectional associative memory are typical examples of neural networks that are used for data association. Neural networks that are used for data conceptualization (for a given set of input, infer the grouping of relationships) include adaptive resonance, and self-organizing maps. Lastly, recirculation neural network (also known as principal component analysis, PCA) is an example of neural network that is used for data filtering e.g. for noise filtering in signal processing.
3 ANN and Intrusion Detection An intrusion detection system may be categorized as either misuse-based or anomaly-based. The former relies on known signatures of attacks for detection, while the latter treats certain behavior as normal and interprets any variation from the norm as a probable intrusion. Misuse-based intrusion detection systems generally have low false positive rates, but are unable to detect new forms of intrusion (0-day signature). Conversely, anomaly intrusion detection systems have high false positive rates but are superior in detecting unknown attacks. Early works in intrusion detection systems were developed mainly using rule-based (expert) systems and required a considerable amount of data for the knowledge base and, subsequently, incurred high processing times. Recent trends have suggested that neural networks may yield better performance with higher accuracy (in terms of detection and recognition), lower false
positive rates and processing times, and thus may be more suited as real-time intrusion detection systems. Work by Lima et al. [13] demonstrates that an intrusion detection system based on neural network may be more feasible than conventional rule-based systems. In their work, a network intrusion detection prototype called I-IDS (Intelligent Intrusion Detection System) was built using a multilayer perceptron (MLP) neural network that uses back-propagation training algorithm and hyperbolic tangent activation function. Their prototype was a hybrid of network- and host-based system, and was built to solve the false negative and false positive problem by fine-tuning weights in the neural network model. The prototype was able to detect an acceptable proportion of unknown as well as known intrusions. Results indicate that the I-IDS prototype was moderately successful in detecting up to 74% of unknown attacks. Since knowledge is acquired through periodic training, a small pattern group used as the training set was found to be sufficiently effective in detecting intrusions into the communication network. Their work, however, did not provide any insight with respect to timings nor was a comparative study made with conventional rule-based systems. As such, the interpretation of their results may be somewhat optimistically biased especially since the training and testing sets were relatively small. Golovko et al. [14] examines the use of modular neural networks for intrusion detection. In their work, four variant architectures comprising linear and non-linear principal component neural networks (LPCA and NPCA) and multilayer perceptrons, were developed. Four classes of attacks were defined – denial of service, unauthorized access as local super user, unauthorized access remotely, and probe (where the network is scanned for confidential data). Their intrusion detection system comprised three phases – feature selection, feature extraction and classification. Network traffic was monitored and TCP connection traffic served as input to their modular neural network. PCA was used in the first feature selection phase to reduce dimensionality, while both PCA and MLP networks, in different combinations under the four variants, were integrated for the extraction and classification phases. A total of 6186 samples were used as the training set while the testing set consisted of almost 50,000 samples, all taken from the Knowledge Discovery and Data Mining (KDD’99) [15] set. The performance metrics studied were detection rate, recognition rate for each attack, and false positive rate. Results indicate that all four variants yielded high detection and recognition rates (80%-95%) for most types of attacks; however, it could not be determined conclusively which variant of their modular neural network was superior since performance varied for different types of traffic and modes of attack. Furthermore, no indication was given towards the learning or processing times incurred given the large amount of samples used for both the training and testing sets. From the work of Shum and Malki [16], feed-forward neural network with back propagation training algorithm was used to determine its effectiveness in detecting intrusions based on network traffic. The network was first trained by using data gathered by the MIT Lincoln Laboratory under Defense Advanced Research Projects Agency (DARPA) Intrusion Detection Evaluation project. The network had an input layer, a hidden layer, and an output layer. 
The number of input nodes was determined by the input data set, while the number of nodes in the hidden layer was varied throughout the experiment. There were two nodes in the output layer to distinguish between normal and compromised traffic. From the evaluation, it was found that
the model was able to detect accurately all normal traffic and known compromised data set. However, it was only moderately successful in detecting the unknown attack data set – achieving 76% correct classification. In [17, 18], genetic neural networks (GNN) were used in anomaly-based intrusion detection systems. Both works adopted genetic algorithms and back propagation algorithms as these were found to be effective for global and local searching. Each of their improved algorithm was able to detect accurately Denial of Service (DoS) attacks; unfortunately, however, it did poorly in detecting remote-to-local (R2L) types of attack.
4 ANN and Authentication According to Peltier et al. [19], authentication is the act of verifying the identity of a system entity (user, system, network node) and the entity’s eligibility to access computerized information. It also refers to the verification of the correctness of a piece of data. Authentication mechanisms may use any of the three qualities to confirm a user’s identity i.e. something a user might know e.g. PIN number or passphrase; something a user might possess e.g. ID cards or driver’s license; or something that is considered as a unique attribute of a user e.g. fingerprint or face. Typically, fingerprints and face authenticators are called biometrics and are based on a physical characteristics of the user [20]. Biometric characteristics may be classified as either physiological or behavioral. Physiological refers to the shape of the body. Some examples include fingerprint, face recognition, DNA, hand and palm geometry, and iris recognition, amongst others. Behavioral is related to the behavior of a person e.g. his/her signature, gait, or tone/pitch of voice. Heinen and Osorio [21] implemented an on-line handwritten signature authentication system based on ANN. Signatures were collected and stored in a database where position and scale adjustments were made over the signatures. Distinctive features used in the authentication process were extracted from the signatures. The input space dimensionality was reduced using PCA from 117 to 40 inputs, and Heinen and Osorio discovered that the ANN generalization rate increased with the input space reduction even with some loss of information. In order to classify the signatures, MLP with backpropagation was used. The results obtained from their simulation, using Stuttgart Neural Network Simulator, suggest that ANN may be suitable for signature-authentication tasks. The application of PCA allowed the ANN to achieve high learning rates and good generalization levels with lower incorrect signature authentication rates. Wang and Wang [22] focused on solving two inherent problems of existing layered neural network (LNN) used in password detection – namely, long training time and recall accuracy. Their approach replaces the traditional way of using verification table to authenticate passwords by using encrypted ANN weights stored within the system. In LNN, retraining typically needs to be done whenever a new user is added to the system or a password is altered. This training time may take more than 5 minutes for a small system of 50 users, or more than 30 minutes for a system with 100 users [23]. Furthermore, the output of the LNN is not a discrete binary integer and thus may lead to mistakes in authentication. Wang and Wang developed a hybrid neural network (HNN) integrating Reed-Solomon coding algorithm to address these
problems. Using a set of 10 million legal user’s ID and passwords, and a set of 10 million illegal user’s ID and passwords, the computational performance of 100 users using HNN was compared against LNN. Their results showed that HNN only required 0.00136 seconds while LNN took 1876 seconds. Moreover, in the computational performance of 10 million users, HNN required only 213 seconds whereas LNN performed poorly. In addition, the trained HNN was able to recall each legal user’s ID and password accurately and instantly, and reject each illegal user’s ID and passwords correctly. In Mazloom and Ayat [24], a hybrid neural network was developed to increase the accuracy of face recognition using a combination of wavelet transform (WT) and PCA. Existing approaches used in face recognition are either content-based or facebased. In content-based, recognition is determined on the relationship between human facial features, while in face-based, recognition is ascertained from the entire face. Generally, the content-based approach is not as robust since every human face has similar facial features while face-based approach using PCA suffers from poor discriminatory power and large computational load in finding the eigenvectors. Face recognition method consists of three stages – preprocessing, feature extraction and classification rules. In the proposed hybrid method, PCA was applied on wavelet subband for feature extraction where an image was decomposed into a number of subbands with various frequency components using a three-level wavelet transform. Images of lower resolution of 16 x16 pixels were used to reduce the computational complexity. The results showed that new sub-images gave better recognition accuracy and discriminatory power than by applying PCA on the whole original image of 128 x128 pixels. With the size reduction of 64 times, the recognition computational load was also reduced by 64 times. The proposed system comprised training and recognition stages. In the training stage, WT was applied to decompose reference images to produce sub-images of 16 x 16 pixels that were obtained by the three-level wavelet decomposition. PCA was then applied on the sub-images to obtain a set of representational basis for selecting the eigenvectors. The feature vectors were then used to train ANN using back-propagation algorithm. The recognition stage is similar to the training stage except that the inputs of unknown images are matched against those referenced images stored in the database (obtained previously using wavelet transform and PCA). The proposed system used Yale and ORL database to perform various tests. Results indicate that the hybrid method gives high recognition rates (97.68%) when compared to other non-hybrid methods.
5 Conclusion In this paper, an overview of ANN application for intrusion detection and authentication has been presented. In general, the literature suggests that ANN may be feasibly deployed to enhance existing approaches used for security, given the encouraging results obtained by recent works. However, one of the major drawbacks in ANN is the degree of responsiveness i.e. slow training time, and this may render ANN unsuitable for adoption in real-time secure systems. Interestingly enough, a number of researchers have begun to develop and adopt hybrid models, combining ANN with other non-AI techniques, or combining different types of ANN methods to
reduce further the training time whilst increasing the level of accuracy. This is especially true for authentication systems, as demonstrated by the works of Heinen and Osorio [21], Wang and Wang [22], and Mazloom and Ayat [24]. In the case of intrusion detection systems, majority of the works in the application of ANN have been successful in detecting attacks with known signatures, but not unknown attack i.e. 0-day signature – which is a major cause of concern presently. As such, one area that may be explored further is the use of hybrid ANN models to detect unknown attacks responsively with lower false positive rates for network security.
References 1. 2005 Malware Report: The Impact of Malicious Code Attacks (2005), http://computereconomics.com 2. Malware Annual Report 2007 (2007), http://www.emsisoft.com/en/kb/articles/news080116/ 3. Kurose, J.F., Ross, W.R.: Computer Networking – A Top-Down Approach, 4th edn. Pearson Education, London (2008) 4. Schmidt, T., Rahnama, H., Sadeghian, A.: A Review of Applications of Artificial Neural Networks in Cryptosystems. In: World Automation Congress (WAC 2008), Hawaii, pp. 1– 6. IEEE, Los Alamitos (2008) 5. Stergiou, C., Siganos, D.: Neural Networks, http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/ report.html 6. Gurney, K.: An Introduction to Neural Networks. CRC Press, Boca Raton (2003) 7. Artificial Neural Networks Technology, http://www.dacs.dtic.mil/techs/neural/neural6.php#RTFToC22 8. Jacobs, R.: Increased Rates of Convergence Through Learning Rate Adaptation. Neural Networks 1(4), 295–307 (1988) 9. Pao, Y.: Adaptive Patterns and Neural Networks. Addison-Wesley, Reading (1989) 10. Biehl, M., Gosh, A., Hammer, B.: Learning Vector Quantization: The Dynamics of Winner-Take-All Algorithms. Neurocomputing 69(7-9), 660–670 (2005) 11. Hecht-Nielsen, R.: Counter-Propagation Networks. In: First International Conference on Neural Networks, San Diego, California. IEEE, Los Alamitos (1987) 12. Specht, D.F.: Probabilistic Neural Networks for Classification, Mapping or Associative Memory. In: International Conference on Neural Networks (ICNN 1988), San Diego, California, pp. 11-333–11-340. IEEE, Los Alamitos (1988) 13. Lima, I.V.M., Dagaspari, J.A., Sobral, J.B.M.: Intrusion Detection Through Artificial Neural Networks. In: IEEE/IFIP Network Operations and Management Symposium: Pervasive Management for Ubiquitous Networks and Services, NOMS 2008, Salvador, Brazil, April 7-11, pp. 867–870. IEEE, Los Alamitos (2008) 14. Golovko, V., Vaitsekhovich, L.U., Kochurko, P.A., Rubanau, U.: Dimensionality Reduction and Attack Recognition using Neural Network Approaches. In: International Joint Conference on Neural Networks, Orlando, Florida, August 12-17. IEEE, Los Alamitos (2007) 15. 1999 KDD Cup Competition (1999), http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
16. Shum, J., Malki, H.A.: Network Intrusion Detection System Using Neural Networks. In: 4th International Conference on Natural Computation, pp. 242–246. IEEE, Los Alamitos (2008) 17. Jiang, H., Zhao, X.: Study on the Network Intrusion Detection Model Based on Genetic Neural Network. In: 2008 International Workshop on Modelling, Simulation and Optimization, pp. 60–64. IEEE, Los Alamitos (2009) 18. Zhou, T., Yang, L.: The Research of Intrusion Detection Based on Genetic Neural Network. In: Proceedings of the 2009 International Conference on Wavelet Analysis and Pattern Recognition, Hong Kong, pp. 276–280. IEEE, Los Alamitos (2008) 19. Peltier, T.R., Peltier, J., Backley, J.: Information Security Fundamentals. Auerbach (2004) 20. Pfleeger, C.P., Pfleeger, S.L.: Security in Computing. Prentice Hall, Englewood Cliffs (2006) 21. Heinen, M.R., Osorio, F.S.: Handwritten Signature Authentication using Artificial Neural Networks. In: International Joint Conference on Neural Networks, Vancouver, Canada, pp. 5012–5019. IEEE, Los Alamitos (2006) 22. Wang, S., Wang, H.: Password Authentication Using Hopfield Neural Network. IEEE Transactions on Systems Man, and Cybernetics – Part C: Applications and Reviews 38, 265–268 (2008) 23. Li, L.H., Lin, I.C., Hwang, M.S.: A Remote Password Authentication Scheme for Multiserver Architecture using Neural Networks. IEEE Transaction Neural Network 12(6), 1498–1504 (2001) 24. Mazloom, M., Ayat, S.: Combinational Method for Face Recognition: Wavelet, PCA and ANN. Digital Image Computing: Techniques and Applications, 90–95 (2008)
Facility Power Usage Modeling and Short Term Prediction with Artificial Neural Networks Sunny Wan and Xiao-Hua Yu Department of Electrical Engineering, California Polytechnic State University, San Luis Obispo, CA 93407, USA
Abstract. Residential and commercial buildings accounted for about 68% of the total U.S. electricity consumption in 2002. Improving the energy efficiency of buildings can save energy, reduce cost, and protect the global environment. In this research, artificial neural network is employed to model and predict the facility power usage of campus buildings. The prediction is based on the building power usage history and weather conditions such as temperature, humidity, wind speed, etc. Different neural network configurations are discussed; satisfactory computer simulation results are obtained and presented. Keywords: Power prediction, building energy management, artificial neural network applications.
1 Introduction With the limited resources of fossil fuel and the ever-increasing energy demand, the studies on energy management have become more and more important. It is reported that residential and commercial buildings accounted for about 68% of the total U.S. electricity consumption in 2002 [1]. An efficient energy management system of a building can optimize energy use, improve comfort, reduce building operational cost and energy related emissions, and thus protect the global environment. Facility management is essential for building maintenance and functioning. The energy consumption related with facility operations includes, but not limited to, HVAC (Heating, Ventilating, and Air Conditioning), lighting, water supply, etc. Facility power usage modeling and short term prediction plays an important role in an adaptive energy management system. Based on the model and/or prediction, a control signal can be generated on-line to minimize the power consumption and thus optimize the system performance. In addition, it can also be used as an integral part of the building SHM (Structural Health Monitoring) system, to detect and identify possible equipment failure, and to provide information for the retrofit of existing buildings. Power usage prediction is often based on statistics or the numerical modeling of historical data, such as linear regression models [10]. However, these models are L. Zhang, J. Kwok, and B.-L. Lu (Eds.): ISNN 2010, Part II, LNCS 6064, pp. 548–555, 2010. © Springer-Verlag Berlin Heidelberg 2010
usually not accurate, and the order of the linear regression model can be very high. In addition, recent studies show that building power usage is also related to other factors such as the building occupancy and weather conditions. The LCEA (Life Cycle Energy Assessment) of buildings is a systematic and comprehensive way to assess building energy consumption. It is part of the ISO 14000 series of standards on environmental management [2]. LCEA is based on a model of the physical structure of the building, such as the materials used in the building envelope (i.e., the separation between the interior and the exterior environments of a building – for example, the brick wall), the gross volume and floor area of the building, floor finish, ceiling height, etc. LCEA provides a general guideline to estimate building energy consumption; however, it lacks precision and detail when used for short-term energy management. There have been some developments in recent years in the application of artificial neural networks (ANN) for building power usage modeling and prediction, such as [3], [4], and [5]. It is well known that an artificial neural network of appropriate size can be employed to approximate any measurable function once it is fully trained ([7], [9]). The approach based on neural networks has some significant advantages over conventional methods, such as adaptive learning and nonlinear mapping. An energy forecasting model of a building based on an artificial neural network is discussed in [3]. The building is first modeled and calibrated using a software package ("DOE-2.1E Building Energy Analysis Program"); then a neural network is trained using the simulated data generated by the software. It is recommended that future neural network modeling should be developed and tested using "real", or "actual", building measurement data. In [4], the relationship between the energy consumption and the work shift of different units in a dairy firm is studied. The inputs to the neural network include the specific hours in a day for different types of activities (i.e., working, not working, and washing) at various processing units. However, the impact of weather conditions is not considered. The prediction of the thermal energy consumption of a hospital using neural networks is investigated in [5]. The time-series prediction is based on measurements recorded by meters, such as the natural gas consumption, the cold and hot water consumption, the external temperature, as well as the temperature inside the building. Note that this is still a rough estimation due to the low data sampling rate (about two hours). In addition, temperature is the only weather-related parameter included in the neural network model. This paper focuses on the modeling and short-term prediction of facility power usage for buildings on the Cal Poly (California Polytechnic State University) SLO (San Luis Obispo) campus. The past power consumption data is provided by the Dept. of Engineering and Utilities Facility Services at Cal Poly, and the weather data for San Luis Obispo County is obtained from the Internet [6]. Simulation results show that an artificial neural network can successfully model and predict facility power usage for campus buildings.
2 The Neural Network Model for Power Consumption Prediction In this section, the neural network model for building power usage modeling and prediction is discussed. The building power usage can be modeled as a nonlinear function of several parameters that include the history of power usage as well as the current weather conditions:
P(n+1) = f[P(n), T(n), W(n), ...]   (1)
where n is the time index; P(n+1) is the prediction and P(n) is the current building power usage; T(n), W(n) etc. are parameters that are related with the current weather condition such as temperature, wind speed, etc. A multi-layer feedforward neural network model is proposed for this application. It has an input layer, an output layer, and one or more hidden layer(s). As indicated in Eq. (1), the neural network inputs include P(n) and the measurements of weather condition while its output is the prediction P(n+1). That is, the neural network model is a multi-input, single-output system. Fig. 1 shows the neural network configuration when one hidden layer is employed:
Fig. 1. The Neural network model for power usage prediction
The activation function for each hidden neuron is chosen as the sigmoid function:
f(x) = 1 / (1 + e^(-x))   (2)
The activation function for the neuron in the output layer is simply a linear function. The weights of the neural network are initialized at random, and then updated to minimize the following objective function:
(1/2)[e(n)]^2 = (1/2)[P̂(n) - P(n)]^2 = (1/2)[y_NN(n) - y_d(n)]^2   (3)
where P̂(n) is the prediction (i.e., the NN output y_NN(n)) and P(n) is the actual building power usage (i.e., the desired output y_d(n)).
The Levenberg-Marquardt algorithm is employed to train the neural network:
W(k+1) = W(k) + ΔW   (4)
and:
ΔW = (J_a^T J_a + μI)^(-1) J_a^T e   (5)
where J_a is the first-order derivative of the error function with respect to the neural network weights (also called the Jacobian matrix); e is the output error (i.e., the difference between the neural network output and the desired output); μ is a learning parameter, and k is the index of iterations. Before training the neural network model, the number of hidden layers and the number of hidden neurons in each layer must be specified. As we know, the dimension (or size) of a neural network may have a great impact on neural network learning. In general, a larger neural network (with more hidden layers and/or more hidden neurons) is able to approximate more complex nonlinear functions; however, it may require more memory space and longer computation time. Besides, its generalization ability may be poor, which causes the "over-fitting" problem (that is, the neural network fits the training data well while failing to generate the correct output for testing data). On the other hand, a smaller neural network runs faster but may yield a higher training error (the "under-fitting" problem). Therefore, choosing the appropriate network size becomes a critical issue in the design of artificial neural networks [8]. This issue will be further discussed in the next section.
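For reference, one update step of Eqs. (4)-(5) can be written directly in MATLAB; Ja, e, W and mu are assumed to be available from the current forward/backward pass, and the sign convention of e follows Eq. (3).

% One Levenberg-Marquardt weight update (sketch).
% Ja: Jacobian of the errors with respect to the weights, e: error vector,
% W: weight vector, mu: learning (damping) parameter.
dW = (Ja' * Ja + mu * eye(size(Ja, 2))) \ (Ja' * e);   % Eq. (5)
W  = W + dW;                                           % Eq. (4)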
3 Simulation Results In this section, the proposed neural network model is applied to predict the facility power usage of a building on the Cal Poly campus. Most building energy management applications use monthly data [10]. The Dept. of Engineering and Utilities Facility Services at Cal Poly provides us with the power consumption data of Building 1 (Administration Building) for November 2008. The data sampling period is 15 minutes, resulting in a total of 2878 measurements. The weather conditions of San Luis Obispo County are obtained from the web [6], including temperature (F), dew point (F), pressure (inch), wind speed (mph), wind gust (mph), humidity (%), and rainfall rate (hourly). Note that this data set is not sampled uniformly – i.e., the time interval between two adjacent measurements is not a constant. For example, some of the measurements are 5 minutes apart, while others may be 6 minutes, 10 minutes, or even up to 15 minutes apart. To ensure the consistency of the data, nearest-neighbor interpolation (also known as proximal interpolation), a piecewise-constant interpolation algorithm, is employed to "resample" the data every 15 minutes. At each sampling instant, if the original measurement data is not available, this algorithm sets the value of the data at this
point to be the same as the value of the immediately previous measurement (i.e., the measurement that was taken right before the current sampling instant). The entire database contains 2,878 sets (or points). Among them, 2,000 points are used for neural network training and the remaining 878 points are used for testing. The power consumption of the building varies from 54 kW to 215 kW, with an average of 101.552 kW, a median of 80 kW, and a mode of 68 kW (in statistics, the term "mode" refers to the value that occurs most frequently in a data set or a probability distribution). To determine the appropriate size of the neural network for this application, two neural networks with different architectures are employed in the simulation, and their performances are compared. Both networks have 8 input neurons (7 for the measurements on weather and 1 for the current power usage) in the input layer, and 1 output neuron in the output layer (for power prediction). Network 1 has one hidden layer with 9 neurons, while network 2 has two hidden layers, with 10 neurons in the first hidden layer and 5 neurons in the second hidden layer. The computer simulation results of the proposed neural network model for building power usage are shown in Fig. 2 – Fig. 5. On-line training is employed in this research. Fig. 2 shows the performance of network 1 in the training phase, where the solid line represents the desired output and the dotted line represents the output of the neural network. Similarly, Fig. 3 illustrates the training results of network 2. The neural network performance in the testing phase is shown in Fig. 4 (for network 1) and Fig. 5 (for network 2), where the solid line represents the actual building power consumption and the dotted line represents the neural network prediction.
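The pre-processing and the two candidate architectures described above can be sketched in a few lines of MATLAB. This is illustrative only: the variable names (rawTime, rawData, P, T, Ptest) are assumptions, and the authors' training options are not reported beyond the Levenberg-Marquardt algorithm and the layer sizes given in the text.

% --- Nearest-neighbour (zero-order hold) resampling onto a 15-minute grid ---
% rawTime: irregular measurement times in minutes; rawData: one weather variable.
gridTime  = (0:15:rawTime(end))';
resampled = zeros(size(gridTime));
for k = 1:numel(gridTime)
    idx = find(rawTime <= gridTime(k), 1, 'last');   % immediate previous measurement
    if isempty(idx), idx = 1; end
    resampled(k) = rawData(idx);
end

% --- The two candidate networks (8 inputs: 7 weather variables + current power) ---
% P: 8 x 2000 training inputs, T: 1 x 2000 targets, Ptest: 8 x 878 test inputs.
net1 = newff(P, T, 9);          % network 1: one hidden layer with 9 neurons
net2 = newff(P, T, [10 5]);     % network 2: two hidden layers with 10 and 5 neurons
net1.trainFcn = 'trainlm';      % Levenberg-Marquardt, Eqs. (4)-(5)
net2.trainFcn = 'trainlm';
net1 = train(net1, P, T);
net2 = train(net2, P, T);
pred1 = sim(net1, Ptest);
pred2 = sim(net2, Ptest);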
Fig. 2. Neural Network 1 Training Results (predicted power consumption in November 2008 during training: target values and values predicted by the ANN; power consumption in kW against time in minutes)
Fig. 3. Neural Network 2 Training Results (predicted power consumption in November 2008 during training: target values and values predicted by the ANN; power consumption in kW against time in minutes)
Fig. 4. Neural Network 1 Testing Results (predicted power consumption in November 2008 during testing: target values and values predicted by the ANN; power consumption in kW against time in minutes)
Both neural networks can successfully predict the building power consumption based on weather conditions and the power usage history. Further analysis of the training and testing errors is given in Table 1, where "1" refers to network 1, "2" to network 2, "R" to the training phase and "E" to the testing phase; "Std. Dev." is the standard deviation of the error, "Avg." the average, "Max." the maximum, and "RMS" the root-mean-square value of the errors. The statistics shown in the table indicate that the output error of network 1 is unbiased, with a smaller standard deviation, average, maximum, median, mode and RMS value in both the training and testing phases. For example, the RMS value of the testing error is
5.7432 kW for network 1 (about 5.66% of the average power consumption of the building), while the RMS value of the testing error is 10.9224 kW for network 2 (about 10.76% of the average power consumption of the building). Therefore, we conclude that network 1 yields better overall performance than network 2.
Fig. 5. Neural Network 2 Testing Results (predicted power consumption in November 2008 during testing: target values and values predicted by the ANN; power consumption in kW against time in minutes)
Table 1. Statistics of Neural Network Performance (Standard Deviation, Average, Maximum, Median, Mode, and RMS Value of Error)
NN   Phase           Std. Dev.   Avg.      Max.      Median    Mode      RMS
1    Training (R)    6.8783      -0.0006   25.2227   -0.1103   -1.6127   6.8766
1    Testing (E)     5.7397      0.2785    30.0599   -0.0428   -2.3454   5.7432
2    Training (R)    6.9312      0.0282    25.8888   -0.1814   -2.8192   12.7105
2    Testing (E)     6.4726      0.3032    30.8219   -0.2951   -4.4821   10.9224
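For completeness, the statistics reported in Table 1 can be reproduced from the per-sample prediction error with standard MATLAB functions; err is an assumed variable holding the error (in kW) for one network in one phase.

% err: vector of prediction errors (network output minus actual power, in kW).
stdDev  = std(err);
avgErr  = mean(err);
maxErr  = max(err);
medErr  = median(err);
modeErr = mode(err);             % only meaningful after rounding continuous errors
rmsErr  = sqrt(mean(err.^2));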
4 Conclusions An approach to model and predict building energy consumption based on artificial neural network is studied in this paper. Satisfactory computer simulation results are obtained and presented. More tests will be conducted to further investigate the performance of the neural network model in the future. For example, this approach can be applied to and tested on the data collected from different buildings and/or different times of the year. Also, the raw measurement data may contain noise and outliers. Pre-processing the raw data using adaptive filtering and/or outlier detection may speed up neural network learning and improve the neural network generalization ability.
Acknowledgments. The authors would like to thank Mr. D. Elliot for providing the data that is used in this research.
References 1. U.S. Environmental Protection Agency Green Building Workgroup, Buildings and the environment: a statistical summary, http://www.epa.gov/greenbuilding/pubs/gbstats.pdf 2. Kofoworola, O., Gheewala, S.: Life cycle energy assessment of a typical office building in Thailand. Energy and Buildings 41, 1076–1083 (2009) 3. Cohen, D., Krarti, M.: A neural network modeling approach applied to energy conservation retrofits. In: Proceedings of the Fourth International Conference on Building Simulation, pp. 423–430 (1995) 4. Frosini, L., Petrecca, G.: System identification for the prediction of the electric energy consumption of a dairy firm. In: Proceedings of the 2001 IEEE Mountain Workshop on Soft Computing in Industrial Applications (2001) 5. Frosini, L., Petrecca, G.: Neural networks for energy flows prediction in facility systems. In: Proceedings of 1999 IEEE Midnight-Sun Workshop on Soft Computing Methods in Industrial Applications (1999) 6. Weather Underground, http://www.wunderground.com/weatherstation/index.asp 7. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice Hall, Englewood Cliffs (1999) 8. Geman, S., Bienenstock, E., Doursat, R.: Neural networks and the bias/variance dilemma. Neural Computation 4, 1–58 (1992) 9. Sjoberg, J., Hjalmerso, H., Ljung, L.: Neural networks in system identification. In: Proceedings of the 10th IFAC symposium on system identification (1994) 10. Begg, C.: Energy: management, supply and conservation. Butterworth-Heinemann (2002)
Classification of Malicious Software Behaviour Detection with Hybrid Set Based Feed Forward Neural Network Yong Wang1,2, Dawu Gu2, Mi Wen1, Haiming Li1, and Jianping Xu1 1
Department of Computer Science and Technology, Shanghai University of Electric Power, 20090 Shanghai, China 2 Department of Computer Science and Engineering, Shanghai Jiao Tong University, 200240 Shanghai, China
[email protected]
Abstract. Behavior detection of malicious software is better than signature-based detection methods when used to find unknown malicious software. This paper presents a classification method for malicious software behavior detection with a hybrid set based feed-forward neural network. For testing we choose a malicious software detection database with 57345 records from the National Anti-Computer Intrusion and Anti-Virus Research Center. According to the definitions of the selected data set relations and transfer functions, the weighted path length trees of the malicious software detection data are calculated as neural network input vectors. After repeated training, the different malicious software detection methods can be classified by the method with about 83.9 percent correct classification. Keywords: Neural Network, Malicious Software, Set, IDS, Behavior Detection.
1 Introduction Malicious software is harmful to computer hosts and servers; it includes computer viruses, worms, Trojan horses and most rootkits. Snipe Networks, a leading provider of user-centric behavioral anomaly detection, and Kaspersky Lab, a leading developer of secure content management solutions, have enhanced their cooperation on malicious behavior detection. The Czech security company AVG has completed a deal to acquire Sana Security, which specializes in detecting malicious software based on its behavior. Behavior detection is better than signature-based detection methods, which cannot find unknown malicious software. Accurately detecting unknown worm activity in individual computers is feasible by minimizing the required set of features collected from the monitored computer [1]. Behavior-based spam detection can use a hybrid method of rule-based techniques and neural networks that utilizes spamming behaviors as features for describing e-mails [2]. Neural networks are commonly used for intrusion detection, and hierarchical IDS frameworks can sometimes be used for real-time intrusion detection [3]. An intrusion detector for recognizing IIS attacks based on neural networks is feasible; the training time fully depends on the downgrade percentage of the detection rate, which determines the size of the retraining data set [4]. The generalized feed-forward neural network leads to the best confusion. Radial basis
function achieves the highest detection rate for the denial of service attack category [5]. Most existing intrusion detection models with a single-level structure can only detect either misuse or anomaly attacks; a hierarchical intrusion detection model using principal component analysis neural networks has been proposed to overcome such shortcomings [6]. Neural networks also work well as a detection tool to predict the number of stepping-stones for incoming packets, by monitoring a connection chain with only a few packets [7]. Fuzzy neural networks can involve genetic optimization mechanisms [10], and such classification methods can be used in many areas, for example epileptic seizure classification with fuzzy sets [8]. We continue our previous IDS research on the KDD Cup data [9] and seek malicious software behavior detection methods. This paper describes a hybrid set based feed-forward neural network method for malicious software behavior detection. Set operation definitions are used to find the internal rules of the neural network input vectors. After the neural network is trained, the malicious software behavior detection methods are classified.
2 Set Architecture and Transfer Function 2.1 Set Architectures of Malicious Software Detection for Neural Network A set is a collection of objects; the objects are called the members or elements of the set. The malicious software detection database is also a set. We use the data structure of the malicious software behaviour detection database, which is from the National Anti-Computer Intrusion and Anti-Virus Research Center. The structure is shown in Table 1: Table 1. Data structure of malicious software detection database
Table name    Attribute1      Attribute2       Attribute3   Attribute4
dll           dllID           dllName
api           apiID           dllID            apiName
eventClass    eventClassID    eventClassName
event         eventID         eventClassID     eventName    apiID
prog          progID          progName         progPath
trace         traceID         progID           returnVal
traceEvent    traceEventID    traceID          eventID      processName
The malicious software detection database has 7 tables: dll, api, eventClass, event, prog, trace and traceEvent. Each table has many field attributes; only the key field, the name field and the foreign keys are selected, and the other parameter fields are omitted. The tables of the malicious software detection database contain different numbers of records: the eventClass table has only 10 records, whereas the traceEvent table has 116832 records in MySQL. Only the 57345 commonly used records in the traceEvent table are selected. Data samples from the malicious software detection dataset are shown in Table 2:
Table 2. Data Samples of malicious software detection
Table name    Attribute1   Attribute2     Attribute3                 Attribute4
dll           1-8          Advapi32.dll
api           1-70         1-8            ChangeServiceConfigA
eventClass    0-10         UNKNOWN
event         1-41         0-10           Normal                     1-70
prog          1-980        Trojan.Click   F:\av\v\Trojan.click.exe
trace         1-980        1-980          0-65535
traceEvent    1-57345      1-980          1-41                       Kernel32.dll::copyFileA
The records in different tables are connected by foreign keys. Together, the tables form a structure similar to a neural network, as shown in Fig. 1:
Fig. 1. Neural network architecture of the malicious software detection tables; the tables are connected with single or double neural input vectors
2.2 Set Transfer Function of Neural Network In order to build the whole neural network, a transfer function is needed. As the data range used by the transfer function is of type double, we need to define the set transfer functions. Definition 1. (Union definition) The union of two sets is the set whose elements are elements of one set or elements of the other set. That is, as in Formula 1:
a = f_union(WP + b),   f_union = { x | x ∈ P1 or x ∈ P2 or ... or x ∈ Pn }   (1)
The union of the api record set and the dll record set is {apiID, dllID, apiName}. In order to be used as neural network inputs, apiID and dllID are transformed to integer or double values. Definition 2. (Intersection definition) The intersection of two sets is the set whose elements are elements of both sets; the "or" and "and" in these two definitions are the logical "or" and "and". That is, as in Formula 2:
a = f_intersection(WP + b),   f_intersection = { x | x ∈ P1 and x ∈ P2 and ... and x ∈ Pn }   (2)
For example, if the api record is {1, 1, ChangeServiceConfigA} and the dll record is {1, Advapi32.dll}, one record from the intersection of the api set and the dll set is {1}. Definition 3. (Set difference definition) The set difference of S and T is the set whose elements are elements of S and not elements of T; it is denoted S - T. That is, as in Formula 3:
a = f_setdifference(WP + b),   f_setdifference = { x | x ∈ P0 and x ∉ P1 }   (3)
For example, if the api record is {1, 1, ChangeServiceConfigA} and the dll record is {1, Advapi32.dll}, one record from the difference of the api set and the dll set is {1, Advapi32.dll}. Definition 4. (Cartesian product definition) The Cartesian product of two sets is the set of all ordered pairs whose first element belongs to the first set and whose second element belongs to the second set. That is, as in Formula 4:
a = f_CartesianProduct(WP + b),   f_CartesianProduct = { x | x = (Pm, Pn), Pm ∈ I1 and Pn ∈ I2 }   (4)
Example. Let dll = {1, Advapi32.dll} and eventClass = {0, UNKNOWN}. Then the Cartesian product of the dll set and the eventClass set is {(1, 0), (1, UNKNOWN), (Advapi32.dll, 0), (Advapi32.dll, UNKNOWN)}.
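On the ID fields, these operations reduce to MATLAB's built-in set functions, and the Cartesian product can be formed with meshgrid; the variable names below (apiDllID, dllID, eventClassID) are assumed, not names taken from the authors' code.

% apiDllID: dllID foreign keys appearing in the api table; dllID: keys of the dll table.
u = union(apiDllID, dllID);         % union of the two ID sets
i = intersect(apiDllID, dllID);     % intersection, e.g. {1}
d = setdiff(dllID, apiDllID);       % set difference
% Cartesian product of dll IDs and eventClass IDs:
[A, B] = meshgrid(dllID, eventClassID);
cart   = [A(:), B(:)];              % each row is one ordered pair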
3 Relations and Weighted Path Length Tree of Input Vectors 3.1 Relations Definition Definition 5. (Relation definition) Any subset of the Cartesian product of two sets is called a relation between the two sets. Any subset of the Cartesian product of a set with itself is called a relation on that set. That is, as in Formula 5:
Let A = {dll, api, eventClass, event, prog, trace, traceEvent}
R = { <dll, api>, <api, event>, <eventClass, event>, <event, traceEvent>, <prog, trace>, <trace, traceEvent> }   (5)
The relations in the malicious software detection database are given by Formula 5. The arrow representation of the relation is shown in Fig. 2.
Fig. 2. Relation of the malicious software detection table
Definition 6. (Partial order definition)
1. R is reflexive if <x, x> ∈ R for all x ∈ A.
2. R is antisymmetric if, for all x, y ∈ A, <x, y> ∈ R and <y, x> ∈ R implies x = y.
3. R is transitive if, for all x, y, z ∈ A, <x, y> ∈ R and <y, z> ∈ R implies <x, z> ∈ R.
R is a partial order on A if R is reflexive, antisymmetric, and transitive. The relation in Fig. 2 is not a partial order: it is not transitive, since it does not include the set { <api, traceEvent>, <eventClass, traceEvent>, <prog, traceEvent> }.
If the relation is reflexive, transitive and symmetric, the relation is called as equivalence relation. The detection relation is usually not equivalence relation. 3.2 Weighted Path Length Tree of Input Vectors In order to get the whole relation valuation for the record relation records, the weighted path length tree is used to measure different malicious software detection methods. The weighted path length results are used for input data vector matrix. The tree is as Fig.3.
Fig. 3. Route tree of the malicious software detection tables, whose nodes have different data ranges. The weight calculation method is given in Definition 7.
Definition 7. (Weighted path length of tree definition) The weighted path length of a tree is WPL = W1·L1 + W2·L2 + W3·L3 + ... + Wn·Ln, where n is the number of leaf nodes, Wi is the weight, and Li (i = 1, 2, ..., n) is the path length of the corresponding leaf node. That is, as in Formula 6:

$$
\begin{cases}
\mathrm{WPL} = \displaystyle\sum_{i=1}^{N} W_i \times L_i \\[6pt]
W_i = \dfrac{N_i + N_{i+1}}{2}
\end{cases}
\qquad (6)
$$
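As a numerical illustration of Formula 6 and of the WPL(t) expression given below, the following fragment computes the weighted path length of one branch (dll → api → event → traceEvent). The node values and path lengths are invented placeholders, not numbers taken from the paper's tables.

% Illustrative sketch (assumed numbers): weighted path length per Formula 6
% for one branch dll -> api -> event -> traceEvent.
node = [0.01 0.17 0.10 0.42];     % assumed values for dll, api, event, traceEvent
len  = [3 2 1];                   % path lengths of the corresponding leaves

W   = (node(1:end-1) + node(2:end)) / 2;   % W_i = (N_i + N_{i+1}) / 2
WPL = sum(W .* len);                       % WPL = sum_i W_i * L_i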
N_i is the tree node data value, which has a certain data range. The weighted path length is:

WPL(t) = ((dll+api)/2·3 + (api+event)/2·2 + (event+traceEvent)/2·1) + ((eventClass+event)/2·2 + (event+I51)/2·1) + ((prog+trace)/2·2 + (trace+traceEvent)/2·1)

3.3 Input Vectors and Targets in the Neural Network

The original data ranges between the event_id table and the trace_count table go from 1:24 to 29:25944. In order to emphasize the key relation features and keep the record lengths equal, the front 8 rows of records are selected. For example, the dll table has 8 records while the api table has 70 records; only the front 8 rows of the api table are selected for the input vectors. The relation between the dll table and the api table is given by the api records: when the dll field is 1, there are 17 records in the api table. The number of counted records is filled into the new table used as input vectors for the neural network. The weighted path lengths of the record trees are calculated as the rear field. The WPL tree value is the overall value of a malicious software detection method. All the selected data from the malicious software detection are shown in Table 3:
Table 3. Selected data count allocation table as input vectors
dll    api    eventClass  event  prog   trace  traceEvent  WPL(t)  average  feature
0.01   0.17   0.02        0      0.01   0      0.42        1.10    0.22     0.30
0.02   0.04   0.03        0.01   0.01   0.01   0.31        0.68    0.14     0.30
0.03   0.07   0.04        0.19   0.01   0.02   3.545       6.19    1.26     0.30
0.04   0.06   0.05        0.1    0.01   0.03   0.039       0.67    0.13     0.30
0.05   0.33   0.06        0.12   0.01   0.04   0.047       1.46    0.26     0.30
0.06   0.01   0.07        0.08   0.01   0.05   0.213       0.83    0.17     0.30
0.07   0.01   0.08        0.06   0.01   0.06   0.029       0.53    0.11     0.30
0.08   0.01   0.09        0.05   0.01   0.07   0.145       0.72    0.15     0.30
The WPL(t) column is the weighted path length of the algorithm's tree. The average value is the row average, and the feature value is the average of the eight row averages. According to Table 3, in total 7 different algorithms with 8 records each are used as the input vector, which is a double-valued 56×8 matrix. Using the command P = input', the P input vector becomes an 8×56 matrix. The target vector ranges from zero to one; when the corresponding bit value is one, it indicates that the corresponding malicious software detection method applies. The vector T is defined as follows:
Target_Behaviour_WPL1 = [ 1 0 0 0 0 0 0 0 ]
Target_Behaviour_WPL2 = [ 0 1 0 0 0 0 0 0 ]
Target_Behaviour_WPL3 = [ 0 0 1 0 0 0 0 0 ]
The target vector is a 56×8 matrix. Using the command T = target', the T target vector becomes an 8×56 matrix.
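A minimal sketch of how the input and target matrices described above could be assembled is given below. The variable names input and target and the transposition commands follow the text; the loading step, the placeholder values and the use of one row per detection method (7 here, following the "7 different algorithms" mentioned above) are our own assumptions.

% Illustrative sketch (Neural Network Toolbox for ind2vec): build P and T.
% 'input' would hold the real 56x8 matrix of Table 3 style rows; a random
% placeholder is used here, and 'labels' gives the method index of each row.
input  = rand(56, 8);                 % placeholder for the 56x8 data matrix
labels = kron(1:7, ones(1, 8));       % placeholder labels: 7 methods x 8 records

P = input';                           % 8x56 input matrix (P = input')
T = full(ind2vec(labels));            % one-hot target columns, one row per method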
4 Training and Classification Results Analysis
4.1 Training the Feed Forward Neural Network

In order to classify the malicious software detection methods, a pattern recognition network is preferred. We build a feed forward neural network with 2 layers, using the tan-sigmoid transfer function in both the hidden layer and the output layer. The hidden layer has 20 neurons: net = newpr(P,T,20). The performance function is the Mean of Squared Errors (MSE). The training function is trainscg [9]. The training methods are incremental training and batch training; the difference between them is the way the weights and biases are updated. In incremental training they are updated each time an input is presented to the network, while in batch training they are updated only after all the inputs have been presented [9]. The two training results are shown in Fig. 4.
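The training set-up described above can be reproduced roughly as follows. This is a hedged sketch assuming the MATLAB Neural Network Toolbox of that period, with P and T built as in the previous section; setting the performance and training functions explicitly simply restates the defaults the text mentions.

% Illustrative sketch (requires the Neural Network Toolbox): 2-layer pattern
% recognition network with 20 hidden neurons, MSE performance and scaled
% conjugate gradient (batch) training, as described in the text.
net = newpr(P, T, 20);            % tan-sigmoid hidden and output layers
net.performFcn = 'mse';           % mean of squared errors
net.trainFcn   = 'trainscg';      % scaled conjugate gradient, batch updates
[net, tr] = train(net, P, T);     % training with automatic train/val/test split

Y = sim(net, P);                  % network outputs for the inputs
plotconfusion(T, Y);              % confusion matrix of target vs. classified classes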
Fig. 4. Val fail and gradient of training: the training gradient equals 0.0097242 at epoch 19 and the number of validation checks equals 6 at epoch 19. Use performance in the training window to find the validation error, shown in Fig. 5.
Fig. 5. The best validation performance occurred at iteration 13, and the network at this iteration is returned with training errors, validation errors, and test errors. A perfect test would show points in the upper-left corner of the Receiver Operating Characteristic (ROC) curve, with good sensitivity and specificity; the network performs close to this.
The training computer is a Dell Precision M4300, Intel Core 2 Duo CPU T7250 2.0 GHz, RAM 2.0 GB.

4.2 Results Analysis

During training, the training results change greatly. The reason for this fluctuation is that the dataset records have no obvious pattern and the hidden layer has few neurons. After repeated training, the relatively best results appeared. Mean square error (MSE) is the average squared difference between outputs and targets; lower values are better. Percent error indicates the proportion of malicious software behaviour detection methods that are misclassified. After repeated training, the best classification rate over the different malicious software detection methods is about 83.9 percent.
Fig. 6. Malicious software behaviour detection algorithms classified according to the input vectors that represent the algorithm features
5 Conclusions

Behaviour detection of malicious software is better than signature-based detection when used to find unknown malicious software. The common algorithms used for behaviour detection include neural networks, fuzzy sets and so on. It is difficult to judge which behaviour is malicious; the commonly used method is to give a warning for abnormal behaviour and leave the judgement to the ordinary user. Users often dismiss the warnings when the behaviour appears normal, and once they judge wrongly, the malicious software will install its program on the computer. After analysing the malicious software detection relations with sets, weighted path lengths and a neural network, we can obtain a preliminary judgment of the different malicious software detection methods. Future work will continue to research behaviour detection methods following new detection technologies and algorithms. The set based neural network is also a research topic for intrusion detection.

Acknowledgments. Supported by National hi-tech research and development project No. 2006AA01Z405, the National Natural Science Foundation of China under Grant No. 60903188, and Shanghai postdoctoral scientific program No. 08R214131.
References 1. Robert, M., Yuval, E., Lior, R.: Detection of Unknown Computer Worms Based on Behavioral Classification of the Host. Computational Statistics & Data Analysis 52, 4544– 4566 (2008) 2. Wu, C.H.: Behavior-based Spam Detection using a Hybrid Method of Rule-based Techniques and Neural Networks. Expert Systems with Applications 36, 4321–4330 (2009) 3. Zhang, C.L., Jiang, J., Mohamed, K.: Intrusion Detection using Hierarchical Neural Networks. Pattern Recognition Letters 26, 779–791 (2005)
4. Horng, S.J., Fan, P.Z., Chou, Y.P., Chang, Y.C., Pan, Y.: A Feasible Intrusion Detector for Recognizing IIS Attacks based on Neural Networks. Computers & Security 27, 84–100 (2008) 5. Rachid, B.: Critical Study of Neural Networks in Detecting Intrusions. Computers & Security 27, 168–175 (2008) 6. Liu, G., Yi, Z., Yang, S.M.: A Hierarchical Intrusion Detection Model based on the PCA Neural Networks. Neurocomputing 70, 1561–1568 (2007) 7. Han, C.W., Shou, H.S.: Neural Networks-based Detection of Stepping-stone Intrusion. Expert Systems with Applications 37, 1431–1437 (2010) 8. Abdulhamit, S.: Automatic Detection of Epileptic Seizure Using Dynamic Fuzzy Neural Networks. Expert Systems with Applications 31, 320–328 (2006) 9. Wang, Y., Gu, D.W., Li, W., Li, H.J., Li, J.: Network Intrusion Detection with Workflow Feature Definition Using BP Neural Network. In: Yu, W., He, H., Zhang, N. (eds.) ISNN 2009. LNCS, vol. 5551, pp. 60–67. Springer, Heidelberg (2009) 10. Sung, K.O., Witold, P., Seok, B.R.: Genetically optimized fuzzy polynomial neural networks with fuzzy set-based polynomial neurons. Information Sciences 176, 3490–3519 (2006)
MULP: A Multi-Layer Perceptron Application to Long-Term, Out-of-Sample Time Series Prediction Eros Pasero*, Giovanni Raimondo, and Suela Ruffa Electronics Department, Politecnico di Torino, Torino, Italy {eros.pasero,giovanni.raimondo,suela.ruffa}@polito.it
Abstract. A forecasting approach based on Multi-Layer Perceptron (MLP) Artificial Neural Networks (named by the authors MULP) is proposed for the NN5 111 time series long-term, out of sample forecasting competition. This approach follows a direct prediction strategy and is completely automatic. It has been chosen after having been compared with other regression methods (as for example Support Vector Machines (SVMs)) and with a recursive approach to prediction. Good results have also been obtained using the ANNs forecaster together with a dimensional reduction of the input features space performed through a Principal Component Analysis (PCA) and a proper information theory based backward selection algorithm. Using this methodology we took the 10th place among the best 50% scorers in the final results table of the NN5 competition. Keywords: Machine Learning Methods, Artificial Neural Networks, Time Series Prediction.
1 Introduction In this paper, a methodology for the long-term prediction of a set of empirical time series of daily cash withdrawals at cash machines is proposed. This methodology combines direct prediction strategy and sophisticated input selection criteria. A challenge in the field of time series prediction is the long-term, out-of-sample prediction: several steps ahead have to be predicted. Many methods designed for time series forecasting perform well on a rather short-term horizon but are rather poor on a longer-term one. In general, these methods try to build a model of the process. The model is then used on the last values of the series to predict the future values. The common difficulty to all the methods is the determination of sufficient and necessary information for an accurate prediction. Long-term prediction has to face growing uncertainties arising from various sources, for instance, accumulation of errors and the lack of information. In the paper, a direct variant of prediction strategies is investigated in order to improve the time series prediction ability up to 56 steps ahead and to minimize the prediction error (expressed in terms of Symmetric Mean Absolute *
Prof. Pasero is Member IEEE and associate INFN -- sezione di Torino.
Percent Error, SMAPE). The recursive prediction strategy performance has been compared to the direct one showing far worse results. Two of the most wide-spread machine learning methodologies, such as Artificial Neural Networks, ANNs, and Support Vector Machines, SVMs, have been applied to the time series forecasting. ANNs are machine learning tools that implement simplified models of the central nervous system. They are networks of highly interconnected neural computing elements that have the ability to respond to input stimuli and to learn to adapt to the environment [1]. SVMs are a statistical learning technique, based on the computational learning theory, which implements a simple idea and can be considered as a method to minimize the structural risk [2].The ANNs and the SVMs have often been used in time series forecasting [3], [4], [5], [6], [7], [8] , [9], [10]. In this paper we used these methods trying to avoid the local minima during the training phase with an output averaging process. On the basis of the SMAPE the most performing method has been chosen. Moreover a PCA [11] and an information theory based backward selection algorithm have also been applied to the input data in order to reduce the dimensionality of the input features space. Section 2 presents an introduction to the machine learning system with a discussion about feature selection methodologies. In section 3 authors compare different techniques used to forecast the data and they choose the best model for the set of data provided by the competitions while results obtained with the best model (MULP) are reported and discussed in section 4.
2 Machine Learning Forecasting Engine 2.1 Machine Learning Methods Various heuristic approaches have been proposed to limit design complexity and computing time in ANNs modelling, parameterisation and selection for time series prediction. However, no single approach demonstrates robust superiority on arbitrary datasets, causing additional decision problems and a trial-and-error approach to network modeling [12]. Various ANNs models have been adopted for time series prediction: Neural Network Ensembles [13], Echo State Networks [14], Self-Organizing Maps [15], and many others. The results obtained by these approaches in academic competitions or business forecasting are comparable with each other and often outperforming conventional statistical approaches of ARMA-, ARIMA- or exponential smoothing-methods [16]. The performances of each of these different ANN architectures and topologies have to be carefully evaluated and compared with the alternatives in order to implement an optimal neural model for the time series to be predicted in every different situation [17]. The approach presented in this paper, MULP, is based on three layers MLP and is the improvement of a forecasting system already developed by the authors for weather nowcasting [18], and air quality time series prediction [19]. The improvement consists in an output averaging process in order to overcome the empirical variability of machine learning with MLP and to make its results more repeatable. In fact the three layers MLP topology has been proved, in the
literature [1], [20], to be one of the most effective and flexible in approximating input-output mappings, as in the case of forecasting the future values of a univariate time series on the basis of its past values. Moreover, three-layer MLP networks are capable of performing arbitrary mappings: they are universal approximators. Such mappings are possible if a sufficient number of hidden units are provided and if the network can be trained, that is, if a set of weights that perform the desired mapping can be found. It is a rule of thumb that for an increasing number of inputs we need an increasing number of hidden neurons to approximate the mapping, independently of the complexity of the model underlying the mapping. A set of feed-forward neural networks with the same topology was used. Each network had three layers with 1 neuron in the output layer and a certain number of neurons in the hidden layer (varying in a range between 3 and 15). The hyperbolic tangent function was used as transfer function. The back-propagation rule [21] was used to adjust the weights of each network, and the Levenberg-Marquardt algorithm [22] to proceed smoothly between the extremes of the inverse-Hessian method and the steepest descent method [23]. As an alternative to the ANNs, an SVM with an ε-insensitive loss function was used [2]. The kernel function of the SVM was chosen to be a Gaussian function. The principal parameters of the SVM were the regularized constant C, the width value σ of the Gaussian kernel, and the width ε of the tube around the solution. The SVM performance was optimized by choosing proper values for such parameters. An active set method [24] was used as optimization algorithm for the training of the SVM [25].

2.2 Feature Selection

The following step for the implementation of a forecasting ANN-MLP or SVM system is the selection of the best subset of features that are going to be used as the input to the forecasting tool. We used the Koller-Sahami method [26] to select the best 14 features among the last 35 samples available for each time horizon plus the sample one year before the prediction time. We also applied a manual filter method in which we chose directly the more meaningful features based on the nature of the time series data to be predicted. The general criterion for reducing the dimension is the desire to preserve most of the relevant information of the original data according to some optimality criteria. Therefore we also performed feature extraction using Principal Component Analysis. The objective of PCA is to reduce the number of predictive variables and transform them into new variables, called principal components (PC); these new variables are independent linear combinations of the original data and retain the maximum possible variance of the original set. The main advantage of PCA is that you can find patterns in the data and compress them, i.e. reduce the number of dimensions, without much loss of information. We performed a PCA analysis to select only the principal components responsible for 80% of the total variance.
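As an illustration of this PCA step, the following sketch keeps only the leading principal components that explain 80% of the variance. It uses a plain SVD, so it needs no toolbox; the feature matrix X is an assumed placeholder, not data from the NN5 series.

% Illustrative sketch: PCA-based reduction of an n-by-d feature matrix X,
% keeping the components that explain 80% of the total variance.
X  = randn(500, 14);                    % placeholder for the 14 input features
Xc = bsxfun(@minus, X, mean(X, 1));     % centre each feature

[~, S, V] = svd(Xc, 'econ');            % principal directions in the columns of V
varExpl   = diag(S).^2 / sum(diag(S).^2);
k         = find(cumsum(varExpl) >= 0.80, 1);   % smallest k reaching 80% variance

Z = Xc * V(:, 1:k);                     % reduced n-by-k representation used as input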
3 Preliminary Results with Different Techniques The available data from NN5 Forecasting competition consist of 2 years cash money demand at various automatic teller machines at different locations in England. The competition offers 2 datasets: a complete dataset of 111 daily time series and a reduced dataset which includes a sub sample of 11 time series from the 111 time series and is therefore contained in the larger dataset. The aim of the competition is to forecast automatically and as accurately as possible the last 56 out-of-sample observations (undisclosed to the participants in the competition) as a trace forecast for a forecasting horizon of 1, 2, …, 56 for each of the 11 or 111 time series. The performance of the adopted methods was calculated on the reduced data set comprising 11 time series. First of all each time series was split in a training and a test set. The test set was chosen to include the last 56 samples. The remaining samples were used as training set. We treated each time series separately as a univariate time series. To take into account the supposed weekly and monthly pseudo-periodicity of the given time series we selected as input features for the one step ahead prediction the past values corresponding to 1, 2, 3, 4, 5 and 6 days before the prediction day and the gradient 1, 2, 3 time steps ahead and a week, two weeks, 3 and 4 weeks ahead the prediction day. The selected features were normalized ( ∈ [− 1,1] ) in order for each of them to contribute with the same weight to the forecasting task. So the model for the 1 step ahead prediction consists of either a 3-layers MLPANN or an SVM with 14 input features. We built a similar direct prediction model for each time horizon (from 1 to 56 steps ahead) taking as input features the same group of features selected for the 1 day ahead prediction, but shifted to comprise only the most recent available data. Given that the results obtained running each model only once varied naturally from time to time we ran the neural engine ten times for each time horizon in order to avoid local minima, to overcome the empirical variability of machine learning with MLPANNs and to render its results more repeatable. For each iteration k of the chosen model we have an output
$y_i^k$ corresponding to the input instance $x_i$. So the final output from the system for the instance $x_i$ after the 10 iterations of the model is equal to

$$
y_i = \frac{1}{10}\sum_{k=1}^{10} y_i^{k} .
$$
Once we obtained the average forecasts of all the 56 test samples for each model we chose for each time horizon only the output of the corresponding model (i.e., for the first sample of the test set we took the prediction of the one step ahead forecaster, for the second sample we took the prediction of the two steps ahead forecaster and so on) and we calculated the SMAPE between such collection of forecasts and the 56 available testing samples (see Fig. 1).
Fig. 1. Machine Learning Forecasting System (one model per forecasting horizon, Model #1 to Model #56, each mapping the input features to its model output)
If $y_i$ are the actual values and $\hat{y}_i$ the corresponding forecasts, with $i = 1, 2, \dots, n$, then the SMAPE is calculated as follows:

$$
\mathrm{SMAPE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|y_i - \hat{y}_i\right|}{\left(y_i + \hat{y}_i\right)/2}\cdot 100
\qquad (1)
$$
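Equation (1) and the 10-run averaging can be written compactly as below. The forecast and test arrays are made-up placeholders, and the variable names are our own, not from the paper.

% Illustrative sketch: average the 10 runs of a model and score the result with
% the SMAPE of Eq. (1). 'runs' is a 10-by-56 placeholder of per-run forecasts
% and 'actual' the 56 test values; both are assumed here.
runs   = rand(10, 56) * 20 + 10;   % placeholder forecasts from 10 network runs
actual = rand(1, 56) * 20 + 10;    % placeholder test series

forecast = mean(runs, 1);          % y_i = (1/10) * sum_k y_i^k

smape = 100 * mean( abs(actual - forecast) ./ ((actual + forecast) / 2) );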
As a preprocessing step we replaced the in-sample missing values with the arithmetic mean of the adjacent samples. This is sub-optimal, so once we had selected the best prediction method we used it to forecast the missing values one step ahead: the input features were the last three available samples for the first 29 samples, and the same set of features used for forecasting the test set for the other samples.
4 Selecting and Utilizing the Best Model

4.1 Preliminary Results on Testing Set
To select the best model first of all we paid attention not to overfit the training data so we chose the number of hidden neurons and of training epochs that minimized the difference between the errors on the training set and test set respectively. Among these values we chose the ones that minimized the error on the testing set. Finally we chose 7 hidden neurons and 100 training Epochs for the MLP-ANNs. In Table 1 are reported the SMAPEs obtained with the optimal models for the reduced set of 11 time series. We applied also PCA and the backward selection algorithm to select input features (as described in the previous paragraph), but in general results obtained with these methods as feature selection do not improve the forecast with respect to the manual feature selection (Table 1). In fact the input features considered altogether contain a greater amount of information useful for the forecasting than the output of the PCA or the backward selection algorithm.
Table 1. SMAPEs for the Reduced dataset. The results are obtained with a 3-layer MLP-ANNs with 7 neurons in the hidden layer and trained for 100 Epochs with and without feature selection with PCA or information theory based backward selection: the average SMAPE on the whole dataset is equal to 25.31 using backward selection, 24.95 using PCA and 22.68 using neither PCA nor backward elimination.
TS   SMAPE with backward selection   SMAPE with PCA   SMAPE without either PCA or backward selection
1    20.67                           17.82            19.52
2    34.91                           35.44            28.16
3    25.33                           27.65            19.53
4    22.46                           21.86            24.66
5    27.05                           27.30            27.32
6    22.41                           18.37            23.82
7    30.07                           29.64            26.77
8    25.03                           27.19            15.89
9    22.29                           21.54            19.37
10   29.16                           28.98            26.04
11   19.02                           18.67            18.37
We also used the SVMs. Different assignment for SVM parameters ε, σ and C, were tried in order to find the optimum configuration with the highest performance, [19]. When ε and C were kept constant (ε=0.001 and C=1000), the SVM performances depended on σ and reached a maximum when σ=1, corresponding to an optimum trade-off between SVM generalization capability (large values of σ) and model accuracy with respect to the training data (small values of σ). When σ and C were kept constant (σ=1 and C=1000), the best performances were achieved when ε was close to 0 and the allowed training error was minimized. From this observation, by abductive reasoning we could conclude that the input noise level was low. In accordance with such a behavior the performance of the network improved when the parameter C increased from 1 to 1000. Since the results tended to flatten for values of C greater than 1000, the parameter C was set equal to 1000. With the optimal configuration of the hyper-parameters (ε=0.001, σ=1 and C=1000) we obtained worse results compared to MLP-ANNs best performance (see Table 2). Then we used the best SVMs and MLP-ANNs models in a recursive way. We ran the one step ahead models to forecast all the out-of-sample test data using as inputs the forecasted values in the previous steps. In this way we obtained far worse results than in the direct prediction cases (Table 2). 4.2 Final Results
Once we chose the optimal model for each time horizon, and filled the in-sample missing values with their 1-step ahead forecasting, we proceeded with the forecasting of the samples that are the object of the NN5 competition in the following way. We trained each of the MLP-ANNs models with the “filled” two years of data and then fed as an input to each model the last instance of the input features to forecast the sample corresponding to each different forecasting horizon.
Table 2. SMAPEs for the Reduced dataset. The results in the first column are obtained with an SVM with the following configuration of the hyper-parameters ε=0.001, σ=1 and C=1000; the results in the second column are obtained using ANNs with the same configuration of TABLE I but used in a recursive way: the average SMAPE on the whole dataset are equal to 25.96 and 40.02 respectively.
TS   SMAPE with SVM   SMAPE Recursive ANN
1    22.72            35.67
2    31.18            44.17
3    23.57            34.56
4    26.58            41.69
5    31.41            50.07
6    27.32            46.82
7    30.65            51.29
8    18.44            31.43
9    21.82            30.91
10   29.31            41.22
11   22.54            32.37
The participation to the NN5 Competition allowed us to compare our methodology with other state-of-the-art Computational Intelligence and Statistical approaches to long-term, out-of-sample time series prediction. Especially significant was the 56 out-of-sample time horizon of the requested prediction and the multiple time series data-base comprising 111 different daily time series. In fact, according to [27] desirable characteristics of an out-of-sample test are adequacy, enough forecasts at each time horizon, and diversity, desensitizing forecast error measures to special events and specific phases of business. We attained adequacy and diversity by using 111 time series and 56 time horizons for each of them. So the performance of our method has been calculated on a sufficient number of samples and on a set of time series heterogeneous in nature, thus establishing a broad based track record for our forecasting method. This makes the validation of our system statistically robust. The final results of the competition were encouraging about the correctness of our approach since we classified 10th (with respect to other NN competing methods) in the forecasting of all the 111 time series complete data set and 11th in the forecasting of the reduced 11 time series data set. Such results are also comparable with those obtained with traditional statistical methods, as can be seen in the results table of the competition, NN5 Competition Results, [28]. In Table 3 are shown the out-of-sample SMAPEs between the forecasted values and the real test data (downloadable from the NN5 Forecasting competition web-site http://www.neural-forecasting-competition.com/downloads/NN5/datasets/download.htm (2008)) for the 111 time series (the last 11 time series, from the 100th to the 111th, constitute the reduced data set). The capability of the adopted methodology to predict data is very close to that estimated on the testing set, as can be seen also from the values of SMAPE reported in Table 3.
Table 3. Out-of-sample SMAPEs for the Complete dataset. The results are obtained with a 3-layer MLP-ANN with 7 neurons in the hidden layer and trained for 100 Epochs: the average SMAPE on the whole 111 time series dataset is equal to 25.3, while the average SMAPE on the 11 time series (from the 100th to the 111th time series) reduced dataset is equal to 23.2.
TS  SMAPE   TS  SMAPE   TS  SMAPE   TS  SMAPE   TS   SMAPE   TS   SMAPE
1   18.43   21  21.90   41  37.04   61  22.27   81   21.90   101  21.21
2   21.27   22  13.27   42  16.06   62  23.91   82   20.42   102  28.01
3   37.21   23  40.28   43  19.91   63  16.19   83   26.04   103  19.83
4   25.06   24  28.41   44  18.42   64  42.89   84   25.61   104  20.42
5   33.48   25  24.51   45  23.99   65  19.38   85   32.81   105  26.61
6   31.40   26  22.90   46  21.65   66  23.43   86   23.02   106  17.93
7   22.47   27  18.80   47  18.06   67  24.11   87   23.22   107  27.05
8   29.40   28  29.03   48  32.35   68  30.65   88   50.30   108  20.95
9   21.63   29  40.68   49  28.92   69  27.43   89   34.06   109  15.91
10  20.27   30  31.01   50  31.33   70  16.05   90   21.42   110  35.73
11  31.97   31  15.97   51  29.53   71  34.01   91   28.02   111  21.40
12  25.16   32  26.03   52  28.77   72  25.86   92   29.65
13  20.14   33  31.71   53  20.96   73  34.16   93   23.33
14  17.85   34  21.31   54  27.62   74  22.31   94   26.32
15  20.06   35  24.35   55  27.19   75  26.12   95   18.59
16  17.98   36  28.31   56  27.28   76  20.89   96   20.51
17  27.58   37  36.92   57  20.41   77  18.99   97   28.76
18  20.57   38  22.27   58  21.58   78  19.88   98   26.43
19  19.15   39  27.86   59  32.14   79  21.56   99   23.53
20  20.11   40  20.34   60  39.81   80  26.01   100  19.84
5 Conclusions

In this paper we presented techniques and results for the long-term, out-of-sample prediction of time series and their application to the NN5 Forecasting Competition
organized by S. Crone. The aim of the analysis is to develop an automated forecasting scheme that works effectively for a large number of time series. We propose MultiLayer Perceptron ANNs based direct prediction models for time series forecasting. The performance of such approach gave statistically robust results when compared to other NN and statistical methods during the NN5 Competition. The best method, used to forecast data provided during the competition and named MULP, was 7 hidden neurons and 100 training Epochs direct MLP-ANNs, run ten times for each time horizon in order to avoid local minima and to overcome the empirical variability of machine learning. The method proposed can be further optimized through the adoption of more effective feature selection and preprocessing of its inputs and a larger trial-and-error selection procedure for the design of the ANNs-MLP model.
References 1. Patterson, D.W.: Artificial Neural Networks: Theory and Applications. Prentice Hall, Singapore (1996) 2. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995) 3. Crone, S., Lessmann, S., Pietsch, S.: Forecasting with Computational Intelligence - An Evaluation of Support Vector Regression and Artificial Neural Networks for Time Series Prediction. In: Proceedings of the World Congress in Computational Intelligence, WCCI 2006, Vancouver, Canada. IEEE, New York (2006) 4. Pasero, E., Moniaci, W., Meindl, T., Montuori, A.: NEMEFO: NEural MEteorological Forecast. In: Proceeding of SIRWEC 2004, 12th International Road Weather Conference, Bingen (2004) 5. Wang, H.A., Chan, A.K.H.: A feedforward neural network model for Hang Seng Index. In: Proceedings of 4th Australian Conference on Information Systems, Brisbane, pp. 575–585 (1993) 6. Windsor, C.G., Harker, A.H.: Multi-variate financial index prediction -a neural network study. In: Proceedings of International Neural Network Conference, Paris, France, pp. 357–360 (1990) 7. White, H.: Economic prediction using Neural Networks: The case of the IBM daily stock returns. In: Proceedings of IEEE International Conference on Neural Networks, pp. 451– 458 (1988) 8. Benvenuto, F., Marani, A.: Neural networks for environmental problems: data quality control and air pollution nowcasting. Global NEST: The International Journal 2(3), 281–292 (2000) 9. Perez, P., Trier, A., Reyes, J.: Prediction of PM2.5 concentrations several hours in advance using neural networks in Santiago, Chile. Atmospheric Environment 34, 1189–1196 (2000) 10. Božnar, M.Z., Mlakar, P., Grašič, B.: Neural Networks Based Ozone Forecasting. In: Proc. of 9th Int. Conf. on Harmonisation within Atmospheric Dispersion Modelling for Regulatory Purposes, Garmisch-Partenkirchen, Germany (2004) 11. Jolliffe, I.T.: Principal Component Analysis (2002) 12. Crone, S.: Stepwise Selection of Artificial Neural Network Models for Time Series Prediction. Journal of Intelligent Systems 14(2-3), 99–122 (2005) 13. Ruta, D., Gabrys, B.: Neural Network Ensembles for Time Series Prediction. In: Proc. of IJCNN 2007, Orlando, Florida, USA (2007)
MULP: A Multi-Layer Perceptron Application to Long-Term
575
14. Ilies, I., Jaeger, H., Kosuchinas, O., et al.: Stepping forward through echoes of the past: forecasting with Echo State Networks, NN3 Forecasting Competition results (2007), http://www.neural-forecasting-competition.com/NN3/results.htm 15. Simon, G., Lendasse, A., Cottrell, M., Verleysen, M.: Long-Term Time Series Forecasting Using Self-Organizing Maps: the Double Vector Quantization Method. In: ANNPR 2003 proc., IAPR-TC3, Florence, Italy, pp. 8–14 (2003) 16. Liao, K.-P., Fildes, R.: The accuracy of a procedural approach to specifying feedforward neural networks for forecasting. Computers & Operations Research, 2121–2169 (2005) 17. Adya, Collopy: How effective are neural networks at forecasting and prediction? A review and evaluation. Journal of Forecasting 17, 481–495 (1998) 18. Pasero, E., Moniaci, W.: Artificial Neural Networks for Meteorological Nowcast. In: Proc. of IEEE International Conference on Computational Intelligence for Measurement Systems and Applications, pp. 36–39 (2004) 19. Raimondo, G., Montuori, A., Moniaci, W., Pasero, E., Almkvist, E.: A Machine Learning Tool to Forecast PM10 Level. In: Proc. of the AMS 87th Annual Meeting, San Antonio, TX, USA (2007) 20. Costa, M., Moniaci, W., Pasero, E.: INFO: an artificial neural system to forecast ice formation on the road. In: Proc. of IEEE International Symposium on Computational Intelligence for Measurement Systems and Applications, pp. 216–221 (2003) 21. Werbos, P.: Beyond regression: New tools for Prediction and Analysis in the Behavioural Sciences, Ph.D. Dissertation, Committee on Appl. Math., Harvard Univ. Cambridge, MA (November 1974) 22. Marquardt, D.: An algorithm for least squares estimation of nonlinear parameters. SIAM J. Appl. Math. 11, 431–441 (1963) 23. Demuth, H., Beale, M.: Neural Network Toolbox User’s Guide. The MathWorks, Inc. (1987) Download final NN5 Competition Datasets (including test data), http://www.neuralforecastingcompetition.com/downloads/NN5/da tasets/download.htm (2005) 24. Fletcher, R.: Practical Methods of Optimization, 2nd edn. John Wiley & Sons, NY (1987) 25. Canu, S., Grandvalet, Y., Guigue, V., Rakotomamonjy, A.: SVM and Kernel Methods Matlab Toolbox. Perception Systèmes et Information, INSA de Rouen, Rouen, France (2005), http://asi.insarouen.fr/~arakotom/toolbox/index.html 26. Koller, D., Sahami, M.: Toward optimal feature selection. In: Proc. of 13th International Conference on Machine Learning (ICML), Bari, Italy, pp. 284–292 (1996) 27. Tashman, L.: Out-of-sample tests of forecasting accuracy - an analysis and review. International Journal of Forecasting 16, 437–450 (2000) 28. NN5 Competition Results, http://www.neural-forecasting-competition.com/results.htm
Denial of Service Detection with Hybrid Fuzzy Set Based Feed Forward Neural Network Yong Wang1,2, Dawu Gu2, Mi Wen1, Jianping Xu1, and Haiming Li1 1
Department of Computer Science and Technology, Shanghai University of Electric Power, 20090 Shanghai, China 2 Department of Computer Science and Engineering, Shanghai Jiao Tong University, 200240 Shanghai, China
[email protected]
Abstract. The paper presents a Denial of Service (DoS) intrusion detection method using a hybrid fuzzy set based feed forward neural network on KDD Cup 99 records. The data are pre-processed with fuzzy sets, which transform the records' string attributes to double types and capture the records' internal rules. After pre-processing, about 60 percent of the selected 64633 KDD 99 records are used for training and 20 percent each for validation and test in the neural network. All the training, validation and test results show 99.6 percent correctly classified cases and only 0.4 percent misclassified cases. The experimental results show that the design is effective. Keywords: Denial of Service, Neural Network, Fuzzy Set, DoS, IDS.
1 Introduction

A Denial of Service (DoS) attack is harmful to server systems; it attempts to make a computer resource unavailable to its intended users. Commonly used DoS methods include ping of death, teardrop attacks, SYN-Flood attacks, Smurf attacks and UDP-Flood attacks. Recently, Distributed DoS (DDoS) attacks have appeared and greatly threaten web servers. To detect these DoS attacks, many detection methods are used in Intrusion Detection Systems (IDS). As the packet sending speed grows very quickly, the required recognition rate for DoS attacks increases correspondingly. A commonly used algorithm for intrusion detection is the neural network [1]. The aim of this research is to determine which neural network classifies the attacks well and leads to a higher detection rate for each attack in KDD Cup 99 [2]. Because the TCP packet records always include some string attributes, the data set needs to be pre-processed to double values for neural network training. Fuzzy sets can transform the inaccurate data into accurate double items; hybrid fuzzy set-based neural networks were then introduced [3]. A neural network can classify multiclass items with rough set-based feature selection [4]. Such classification methods can be used in wide areas such as epileptic seizure classification [5] and white blood cell detection [6]. Fuzzy neural networks have the ability to recognize patterns and to detect weak signals in noisy data environments, for example speech detection in noisy environments [7] and detecting
weak fault signals [8]. Fuzzy neural networks can also involve genetic optimization mechanisms [3, 10]. As fuzzy neural networks have these advantages, we continue our previous IDS research on the KDD Cup data [9] and seek a pre-processing method for the neural network input. This paper describes a hybrid fuzzy set based feed forward neural network method for DoS attack detection. The fuzzy set transforms the data, and the fuzzy logic captures the data's internal rules so that the feed forward neural network can recognize attacks.
2 Data Pre-processing by Fuzzy Set for the Neural Network

2.1 DoS Attack Data Feature Analysis from KDD Cup 99

Many researchers use the KDD Cup 99 dataset for intrusion detection competition tests. The KDD Cup 99 datasets are based on the 1998 DARPA initiative, which was prepared and managed by MIT Lincoln Labs. They set up an environment to acquire raw TCP dump data for a local-area network over nine weeks. The experiment simulates a true Air Force environment, peppering it with multiple attacks for seven weeks. Additional machines are used to generate traffic, and a sniffer records all network traffic using the TCP dump format. The TCP records fall into one of five categories: Normal, Probing, Denial of Service (DoS), Remote to Local (R2L) and User to Root (U2R). A Denial of Service (DoS) attack clogs up so much memory on the target system that it cannot serve its users; examples are ping of death, teardrop attacks, SYN-Flood attacks, Smurf attacks, UDP-Flood attacks, Distributed DoS attacks and so on. The full kddcup.data.gz data set is 743 MB after being uncompressed, which is too large for neural network training. The 10 percent subset kddcup.data_10_percent.gz is 75 MB after being uncompressed. We import 65535 data records into the dataset. The Denial of Service records number 25334, 39 percent of the total records; the other three attack types, probing, U2R and R2L, account for only 1 percent. The normal records number 39298, 64 percent of the total records. For this reason, we choose the KDD Cup 99 10 percent data with 65535 records as the selected KDD Cup data set for the DoS attack test. A record in the selected KDD Cup data has 41 attributes, for example:
0,tcp,http,SF,54540,8314,0,0,0,2,0,1,1,0,0,0,0,0,0,0,0,0,2,3,0,0,0,0.33,1,0,0.67,2,2,1,0,0.5,0,0,0,0,0,back.
The back attack belongs to the DoS attack types, which involve many connections to some host(s) in a very short period of time. The connection time of DoS differs from R2L and U2R attacks, which are embedded in the data portions of packets and normally involve only a single connection. We choose the basic features of individual connections following the KDD Cup 99 task description; the front 9 features of the back attack record are selected as a sample record, as shown in Table 1.
Table 1. Basic features of individual connections for the DoS attack test

Feature name     Description                           Data Type    Record
duration         number of seconds of the connection   Continuous   0
protocol_type    tcp, udp, etc.                        Discrete     tcp
service          http, telnet, etc.                    Discrete     http
flag             normal or error connection status     Discrete     SF
src_bytes        data bytes from source                Continuous   54540
dst_bytes        data bytes from destination           Continuous   8314
land             1 if same host/port; 0 otherwise      Discrete     0
wrong_fragment   number of "wrong" fragments           Continuous   0
Because the data types are both continuous and discrete, we need to pre-process the data to a continuous type as the neural network input vector. After analysis of the DoS attack data, the data ranges of the records are as in Table 2.

Table 2. Feature data ranges of DoS and normal records

Feature name     back             Land     Neptune    Pod     Smurf   Teardrop   Normal
duration         0                0        0          0       0       0          0-25602
protocol_type    tcp              tcp      tcp        icmp    icmp    udp        tcp,icmp,udp
service          http             finger   54 types   ecr_i   ecr_i   private    19 types
flag             S1,S2,SF,RSTR    S0       S0         SF      SF      SF         9 types
src_bytes        13748-52560      0        0          1480    1032    28         0-19721
dst_bytes        0-8315           0        0          0       0       0          0-125015
land             0                0,1      0          0       0       0          0
wrong_fragment   0                0        0          1       0       1,3        0
• Protocol_type has 3 values, which are transformed to integers from 1 to 3.
• Service has 59 values, which are transformed to integers from 1 to 59.
• Flag has 5 values, which are transformed to integers from 1 to 5.

After this transformation, the different data types are unified into a continuous data type. Table 2 illustrates that the different data types obey certain rules. If we can use rule definitions during the data pre-processing, the training time of the neural network will decrease accordingly.

2.2 Hybrid Fuzzy Set Feed Forward Neural Network Architecture

Fuzzy sets can transform the string attributes in the KDD 99 records to double types, and fuzzy logic can define rules over the records to classify them. After this pre-processing, the feed forward neural network spends less training time. Fig. 1 shows the architecture:
Fig. 1. Architecture of hybrid fuzzy set feed forward neural network
2.3 Input Variables and Rules Defined Using Fuzzy Sets

Fuzzy logic develops from the theory of fuzzy sets, which can describe classes of objects with fuzzy boundaries. We use fuzzy logic to define the data rules. We use fuzzy sets to define the 8 features of the DoS attack and Normal records as input variables, as shown in Fig. 2.
Fig. 2. Input variable features pre-processed by fuzzy sets for the neural network. All input variables use the same membership function, gaussmf; the value ranges and input variable names are as shown in the figure.
In order to distinguish the DoS attack records from normal records, we define 7 rules according to the input variable ranges and Table 2. Rule 1 and Rule 7 are given below:
Rule 1: IF (duration is near Disconnected) and (protocol_type is tcp) and (service is http) and (flag is manyFlags) and (Src_bytes is Huge) and (Dst_bytes is Huge) and (land is otherWise) and (wrong_fragment is NoneWrong) THEN (DoSattackType is Back)

Rule 7: IF (duration is near Connected) and (protocol_type is allTypes) and (service is manyTypes) and (flag is manyFlags) and (Src_bytes is Huge) and (Dst_bytes is Huge) and (land is otherWise) and (wrong_fragment is NoneWrong) THEN (DoSattackType is Normal)

2.4 Variable Surfaces Affected by the Rules for DoS Attack Types

In order to see how the rules affect the results, the variable surfaces for the DoS attack types are shown in Fig. 3.
Fig. 3. Surface of variables affected by rules for DOS attack types
Fig. 3 (1) illustrates that the top area represents a normal connection when "duration" is connected and "wrong_fragment" is tiny wrong; the other part of the surface stands for a DoS attack when "wrong_fragment" is tiny wrong and "duration" is disconnected. Fig. 3 (2) illustrates that the top area represents a normal connection when "service" is an all-type protocol and "wrong_fragment" is tiny wrong; the other part of the surface stands for a DoS attack when "wrong_fragment" is tiny wrong and "service" is http, finger or ecr_i.
Fig. 3 (3) illustrates that the top area represents a normal connection when "flag" is S0, S1, S2, SF or RSTR and "wrong_fragment" is tiny wrong; the other part of the surface stands for a DoS attack when "wrong_fragment" is tiny wrong and "flag" is many flags. Fig. 3 (4) illustrates that the top area represents a normal connection when "dest_bytes" is huge and "wrong_fragment" is tiny wrong; the other part of the surface stands for a DoS attack when "wrong_fragment" is tiny wrong and "dest_bytes" is none or many. As the results in Fig. 3 illustrate, distinguishing DoS attack records using only fuzzy sets and fuzzy logic is sometimes not enough. Wrong judgments sometimes happen on test data with a normal connection record such as:
Normal connection record: 0,tcp,http,SF,181,5450,0,0
Transformed into fuzzy set format: evalfis([0 1 1 4 181 5450 0 0],b) ans = 1.0296.
This result says the record is a kind of DoS attack, a land attack.
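The mapping from a raw KDD record to the numeric vector used in the example above can be sketched as follows. The code books below are truncated stand-ins for the full 3/59/5-value lists described in Section 2.1 (their ordering is assumed), and the FIS object b is assumed to have been built already with the rules of Section 2.3.

% Illustrative sketch: turn the string attributes of a KDD record into the
% numeric vector fed to the fuzzy system. The code books are truncated
% placeholders; the paper maps 3 protocols, 59 services and 5 flags to integers.
protocols = {'tcp','udp','icmp'};
services  = {'http','finger','ecr_i','private'};   % truncated list of the 59 services
flags     = {'S0','S1','S2','SF','RSTR'};          % order assumed

rec = {'0','tcp','http','SF','181','5450','0','0'};   % example normal record from the text
x   = [ str2double(rec{1}), ...
        find(strcmp(rec{2}, protocols)), ...
        find(strcmp(rec{3}, services)), ...
        find(strcmp(rec{4}, flags)), ...
        str2double(rec{5}), str2double(rec{6}), ...
        str2double(rec{7}), str2double(rec{8}) ];      % gives [0 1 1 4 181 5450 0 0]

% With the Fuzzy Logic Toolbox and a FIS 'b' built from the Section 2.3 rules:
% out = evalfis(x, b);   % as in the text: evalfis([0 1 1 4 181 5450 0 0], b)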
3 Training Data from Fuzzy DoS Records with a Feed Forward Neural Network

3.1 Input and Target Vectors from Fuzzy DoS Attack Records

As the DoS attack recognition results contain some wrong items when only the fuzzy set and fuzzy logic algorithm is used, the DoS records pre-processed by the fuzzy set are used as the neural network input vector. The feed forward neural network needs an input vector, where vector P is an input to the network.
$$
\begin{cases}
a^{1} = \mathrm{tansig}\left(IW^{1,1} p + b^{1}\right) \\
a \in \{\,T \mid 0, 1\,\} \\
p \in \{\,p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8 \mid p_n \in [0,\ 1.25\times 10^{5}]\,\}
\end{cases}
\qquad (1)
$$
The output vector is represented by a in Equation 1. The input p is multiplied by the weight matrix W to form Wp and added to the bias b. The transfer function is logsig. The input vectors come from the DoS records pre-processed by the fuzzy set and fuzzy logic. An original example of a DoS attack record is:
0,tcp,http,SF,54540,8314,0,0
After being transformed, the input record vector is:
0,1,1,5,54540,8314,0,0
When all the records are transformed in this way, the data can be imported to the workspace as the input vector on the Matlab platform. The input data type is double, forming a 64633×8 matrix. Using the command DoSP = input', the DoSP input vector becomes an 8×64633 matrix. The averages of the input matrix are given in Table 3.
Table 3. Input vector average of fuzzy DoS records

Feature name     back       Land   Neptune   Pod       Smurf     Teardrop   Normal
duration         0.00       0.00   0.00      0.00      0.00      0.00       4.71
protocol_type    1.00       1.00   1.00      2.00      2.00      3.00       1.08
Service          1.00       2.00   5.43      3.00      3.00      4.00       7.11
Flag             5.03       1.00   1.00      5.00      5.00      5.00       5.09
src_bytes        54277.10   0.00   0.00      1480.00   1032.00   28.00      974.38
dst_bytes        8249.49    0.00   0.00      0.00      0.00      0.00       4466.67
Land             0.00       1.00   0.00      0.00      0.00      0.00       0.00
wrong_fragment   0.00       0.00   0.00      1.00      0.00      2.98       0.00
As the average data listed in Table 3 show, there are obvious differences between the normal records and the DoS attack records. The normal duration is usually larger than in the DoS attack records, and the normal src_bytes is near 974.38, whereas the back attack record has huge src_bytes and the land attack has none. The feed forward neural network needs a target vector for training. The target vector is transformed from the original KDD 99 database. The transformation rule is that the types Back, Land, Neptune, Pod, Smurf, Teardrop and Normal are transformed into double types from 1 to 7 in the listed sequence; the other types are transformed into double types from 8 to 19 according to the alphabetical order of the type names. Each vector T ranges from zero to one; when the corresponding bit value is one, it indicates that the target attack type occurs. The vector T is defined as follows:
Target_DOS_back = [1000000]
Target_DOS_land = [0100000]
Target_DOS_neptune = [ 0 0 1 0 0 0 0 ]
The target vector is a 64633×7 matrix. Using the command DoST = target', the DoST target vector becomes a 7×64633 matrix.

3.2 Training the Feed Forward Neural Network

In order to build a pattern recognition algorithm, we choose a feed forward neural network with 2 layers. The transfer function in both the hidden layer and the output layer is the tan-sigmoid transfer function. The hidden layer has 20 neurons: net = newpr(DoSP,DoST,20). The performance function is mse, which measures the network's performance according to the mean of squared errors; the algorithm adjusts the network parameters in order to minimize the mean square error. The training function is trainscg; use the Matlab command Net = train(net,DoSP,DoST). The training time is 9 minutes 7 seconds on a Dell Precision M4300 computer, Intel Core 2 Duo CPU T7250 2.0 GHz, RAM 2.0 GB. Incremental training and batch training are two different styles of training. In incremental training, the weights and biases of the network are updated each time an input is presented to the network. In batch training the weights and biases are
only updated after all the inputs have been presented. Training Styles describes these two training approaches, shown in Fig. 4.

Fig. 4. Val fail and gradient of training. The training gradient equals 1.6468e-005 at epoch 276 and the number of validation checks equals 6 at epoch 276.

In order to test the neural network training errors, use performance in the training window to find the validation error. The errors appear in Fig. 5.
Fig. 5. The best validation performance occurred at iteration 276, and the network at this iteration is returned with training errors, validation errors, and test errors
4 Results Analysis

To analyze the network response, use the confusion function in the training window to display the confusion matrix, which shows the various types of errors that occurred for the final trained network. The diagonal cells in each table show the number of cases that were correctly classified, and the off-diagonal cells show the misclassified cases. After training many times, we choose the best confusion matrices, including the training, validation and testing confusion matrices. The blue cell in the
bottom right shows that the total percentage of correctly classified cases is 99.6 percent and the percentage of wrongly classified cases is only 0.4 percent. The result is good because the data set has internal rules and the network was trained many times. Fig. 6 shows the results.
Fig. 6. Correctly classified ratio of selected KDD cup records after training
We chose about 60 percent of the selected 64633 KDD 99 records for training and 20 percent each for validation and test; all the training, validation and test results show 99.6 percent correctly classified cases and only 0.4 percent misclassified cases. The results are satisfying, but some parts still need improvement. The misclassified cases arise because the training records for some attacks, such as land and pod, are very few. Another shortcoming is the low true positive rate (sensitivity) at some thresholds of the Receiver Operating Characteristic (ROC) curve. A perfect test would show points in the upper-left corner, with 100% sensitivity and 100% specificity; our test results illustrate that the curves concentrate in the left area, so for this problem the network performs almost perfectly.
5 Conclusions

Feed forward neural networks with fuzzy sets and fuzzy logic can be applied to Denial of Service (DoS) detection. The hybrid fuzzy feed forward neural network can separate DoS attacks from normal records after training and learning. Pre-processing the data with fuzzy sets and fuzzy logic is an important issue; otherwise the training time will be longer. The fuzzy set pre-processing can transform the records' string attributes to double types and capture the records' internal rules. If the data are transformed and classified only by fuzzy sets and logic, the method can still be used to distinguish DoS attack records from normal records in the way shown in Fig. 4, but the rate of wrong classifications is higher. For actual DoS detection in a DoS firewall, the hardware needs a high packet capture rate and high efficiency, and an efficient DoS detection algorithm will get better results. Further research might be carried out to build a hardware DoS detection system in order to detect high speed, high volume DoS attack packets.
Acknowledgments. Supported by National hi-tech research and development project No.2006AA01Z405. The National Natural Science Foundation of China under Grant No.60903188. Shanghai postdoctoral scientific program No.08R214131.
References 1. Horng, S.J., Fan, P.Z., Chou, Y.P., Chang, Y.C., Pan, Y.: A Feasible Intrusion Detector for Recognizing IIS Attacks Based on Neural Networks. Computers & Security 27, 84–100 (2008) 2. Rachid, B.: Critical Study of Neural Networks in Detecting Intrusions. Computers & Security 27, 168–175 (2008) 3. Sung, K.O., Witold, P., Seok, B.R.: Hybrid Fuzzy Set-based Polynomial Neural Networks and Their Development with the Aid of Genetic Optimization and Information Granulation. Applied Soft Computing 9, 1068–1089 (2009) 4. Hung, Y.H.: A Neural Network Classifier with Rough Set-based Feature Selection to Classify Multiclass IC Package Products. Advanced Engineering Informatics 23, 348–357 (2009) 5. Abdulhamit, S.: Automatic Detection of Epileptic Seizure Using Dynamic Fuzzy Neural Networks. Expert Systems with Applications 31, 320–328 (2006) 6. Shi, T.W., Korris, F.L.C., Fu, D.: Applying the Improved Fuzzy Cellular Neural Network IFCNN to white blood cell detection. Neurocomputing 70, 1348–1359 (2007) 7. Juang, C.F., Cheng, C.N., Chen, T.M.: Speech Detection in Noisy Environments by Wavelet Energy-based Recurrent Neural Fuzzy Network. Expert Systems with Applications 36, 321–332 (2009) 8. Hongying, Y., Hao, Y., Gui, Z.W.: Dynamic Reconstruction-Based Fuzzy Neural Network Method for Fault Detection in Chaotic System. Tsinghua Science & Technology 13, 65–70 (2008) 9. Wang, Y., Gu, D.W., Li, W., Li, H.J., Li, J.: Network Intrusion Detection with Workflow Feature Definition Using BP Neural Network. In: Yu, W., He, H., Zhang, N. (eds.) ISNN 2009. LNCS, vol. 5551, pp. 60–67. Springer, Heidelberg (2009) 10. Sung, K.O., Witold, P., Seok, B.R.: Genetically Optimized Fuzzy Polynomial Neural Networks with Fuzzy Set-based Polynomial Neurons. Information Sciences 176, 3490–3519 (2006)
Learning to Believe by Feeling: An Agent Model for an Emergent Effect of Feelings on Beliefs Zulfiqar A. Memon1,2 and Jan Treur1 1
VU University Amsterdam, Department of Artificial Intelligence De Boelelaan 1081, 1081 HV Amsterdam, The Netherlands 2 Sukkur Institute of Business Administration (Sukkur IBA), Airport Road Sukkur, Sindh, Pakistan {zamemon,treur}@few.vu.nl http://www.few.vu.nl/~{zamemon,treur}
Abstract. An agent's beliefs usually depend on cognitive factors, but also affective factors may play a role. This paper presents an agent model that shows how such affective effects on beliefs can emerge and become stronger over time due to experiences obtained. In this way an effect of judgment by ‘experience’ or ‘gut feeling’ can be obtained. It is shown how based on Hebbian learning a connection from feeling to belief can develop. Some example simulation results and a mathematical analysis of the equilibria are presented. Keywords: agent model, emergent effect, believing, feeling, Hebbian learning.
1 Introduction

Beliefs that are activated usually trigger emotional responses that result in certain feelings. Conversely, emotions felt can also have an effect on beliefs, as discussed in empirical work such as [11], [12], [13], [20], [21] and [22]. Connections from feelings to beliefs might be assumed given a priori (innate), but in this paper it is shown how such connections can emerge by a Hebbian learning mechanism; cf. [2], [14], [15]. To this end, elements from neurological theories on emotion and feeling were adopted, following lines as described in [5], [9], [10] and [22]. A main focus here is on how the effect of the feeling on the belief can emerge over time. To model this, it is assumed that due to a Hebbian learning mechanism over time the connection from feeling to belief gets nonzero strength: the agent learns to strengthen its belief based on a supporting feeling. As a consequence, when such a feeling would be absent, the agent's belief would develop less strength; the feeling gives it its full strength. This principle models the idea that over time persons build up certain intuitions or gut feelings, and let these play an important role in what to (fully) believe and what not to believe. In this paper, first in Section 2 a dynamical agent model for the generation of feelings based on a recursive as-if body loop is introduced. In Section 3, the Hebbian learning model is described. Section 4 presents some simulation results. In Section 5 a mathematical analysis of the equilibria of the model is presented. Finally, Section 6 is a discussion.
2 An Agent Model for the Dynamics of Believing and Feeling As mental states in a person usually do, any belief state induces emotions felt within this person, as described by Damasio in [9] and [10]. In some more detail, emotion generation via an as-if body loop roughly proceeds according to the following causal chain; see [9] and [10]: belief → preparation for body state → sensory representation of body state → feeling. The as-if body loop is extended to a recursive as-if body loop by assuming that the preparation of the bodily response is also affected by the state of feeling the emotion: feeling → preparation for the body state; as an additional causal relation. Such recursiveness is also assumed by [10], p. 91. Thus the obtained model for emotion generation is based on reciprocal causation relations between emotion felt and preparations for body states. Within the model used both the preparation for the bodily response and the feeling are assigned an (activation) level or gradation, expressed by a number, which is assumed dynamic; for example, the strength of a smile and the extent of happiness. The cycle is modelled as a positive feedback loop, triggered by a belief and converging to a certain level of feeling and preparation for body state. An overview of this dynamical model for the agent’s believing and feeling is depicted in Figure 1. In this picture states represent groups of neurons, indicated by circle icons, labeled with representations from the detailed specifications explained below. However, note that the precise numerical relations between the indicated variables V representing activation levels, are not expressed in this picture, but below in the detailed specifications of dynamic properties for the temporal relations between these states (labeled by LP1 to LP6 as shown in the picture). Note that the sensor and effector state for body states and the dashed arrows connecting them to internal states are not used in the model. Informally described theories in, for example, biological or neurological disciplines, often are formulated in terms of causal relationships or in terms of dynamical systems. To adequately formalise such a theory the hybrid dynamic modelling language LEADSTO has been developed that subsumes qualitative and quantitative causal relationships, and dynamical systems; cf. [3]. This language has
Fig. 1. Overview of the agent model for the dynamics of believing and feeling
been proven successful to obtain agent models in a number of contexts, varying from biochemical processes that make up the dynamics of cell behaviour (cf. [16]) to neurological and cognitive processes (e.g., [4], [5] and [6]). Within LEADSTO the dynamic property or temporal relation a → →D b denotes that when a state property a occurs, then after a certain time delay (which for each relation instance can be specified as any positive real number D), state property b will occur. Below, this D will be taken as the time step Δt, and usually not be mentioned explicitly. In LEADSTO both logical and numerical calculations can be specified in an integrated manner, and a dedicated software environment is available to support specification and simulation. In the dynamic properties below capitals are used for variables (assumed universally quantified). First the part is presented that describes the basic mechanisms to generate a belief state and the associated feeling. The first dynamic property addresses how properties of the world state are sensed. Note that first a semiformal and next a formal representation is shown. LP1 Sensing a world state If world state property W occurs of strength V then the sensor state for W will have strength V. world_state(W, V) → → sensor_state(W, V)
From the sensor states, sensory representations are generated according to LP2.
LP2 Generating a sensory representation for a sensed world state
If the sensor state for world state property W has strength V, then the sensory representation for W will have strength V.
sensor_state(W, V) →→ srs(W, V)
Next the property is described that relates a sensory representation and a feeling to a belief strength. Here a connection strength ω1 from sensory representation to belief and ω2 from feeling to belief is assumed. In Section 4 it will be discussed how the connection strength ω2 is adapted by a Hebbian learning principle. A function g(β1, ω1, ω2, V1, V2) is used for the way in which activation levels V1 and V2 of sensory representation and feeling are combined, taking into account the connection strengths. Here β1 is a parameter called the person's orientation for believing; value 0 indicates that the person is reluctant to believe and 1 that the person is willing to believe.
LP3 Generating a belief state for a feeling and a sensory representation
If a sensory representation for w with strength V1 occurs,
and the associated feeling of b has strength V2,
and the belief for w has strength V3,
and the connection from sensory representation to belief of w has strength ω1,
and the connection from feeling b to belief of w has strength ω2,
and β1 is the person's orientation for believing,
and γ1 is the person's flexibility for beliefs,
then after Δt the belief for w will have strength V3 + γ1(g(β1, ω1, ω2, V1, V2) - V3) Δt.
srs(w, V1) & feeling(b, V2) & belief(w, V3) & has_connection_strength(srs(w), belief(w), ω1) & has_connection_strength(feeling(b), belief(w), ω2) →→ belief(w, V3 + γ1(g(β1, ω1, ω2, V1, V2) - V3) Δt)
For the function g(β1, ω1, ω2, V1, V2) the following was taken: g(β1, ω1, ω2, V1, V2) = β1(1-(1-ω1V1)(1-ω2V2)) + (1-β1) ω1 ω2V1V2 This function g(β1, ω1, ω2, V1, V2) can be considered to play the role of a quadratic threshold function, parameterised by β1. Note that for connection strength ω2 = 0 (no effect of feeling on belief) the formula reduces to following: g(β1, ω1, 0, V1, V2) = β1(1-(1-ω1V1)) = β1 ω1V1 In the example simulations discussed in Section 4 the connection strength ω1 was 1. Dynamic property LP4 describes the emotional response to a belief in the form of the preparation for a specific bodily reaction. The resulting level for the preparation is calculated based on a function h(β2, ω3, ω4, V1, V2) of the original levels. Here ω3 is the connection strength from belief to preparation and ω4 from feeling to preparation. LP4 From belief and feeling to preparation of a body state If belief w with strength V1 occurs and feeling the associated body state b has strength V2 and the preparation state for b has strength V3 and the connection from belief of w to preparation for b has strength ω3 and the connection from feeling b to preparation for b has strength ω4 and β2 is the person’s orientation for emotional response and γ2 is the person’s flexibility for bodily responses then after Δt the preparation state for body state b will have strength V3 + γ2(h(β2, ω3, ω4, V1, V2)-V3) Δt. belief(w, V1) & feeling(b, V2) & preparation_state(b, V3) & has_connection_strength(belief(w), preparation(b), ω3) & has_connection_strength(feeling(b), preparation(b), ω4) → → preparation_state(b, V3+γ2(h(β2, ω3, ω4, V1, V2)-V3) Δt)
For the function h(β2, ω3, ω4, V1, V2) the following has been taken:
h(β2, ω3, ω4, V1, V2) = β2(1 - (1 - ω3V1)(1 - ω4V2)) + (1 - β2) ω3ω4V1V2
In the example simulations discussed in Section 4 the connection strengths ω3 and ω4 have been set to 1. Dynamic properties LP5 and LP6 describe the as-if body loop.
LP5 From preparation to sensory representation of a body state
If the preparation state for body state B occurs with strength V, then the sensory representation for body state B will have strength V.
preparation_state(B, V) →→ srs(B, V)
LP6 From sensory representation of body state to feeling
If a sensory representation for body state B with strength V occurs, then B will be felt with strength V.
srs(B, V) →→ feeling(B, V)
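To make the interplay of LP3-LP6 concrete, the following is a minimal Python sketch of one Δt update (illustrative only; the model itself is specified in LEADSTO, and the function names, step size and initial values below are assumptions made for this sketch), with ω1 = ω3 = ω4 = 1 as in the example simulations:

def g(beta1, w1, w2, v1, v2):
    # combination of sensory representation level v1 and feeling level v2 for the belief (LP3)
    return beta1 * (1 - (1 - w1 * v1) * (1 - w2 * v2)) + (1 - beta1) * w1 * w2 * v1 * v2

def h(beta2, w3, w4, v1, v2):
    # combination of belief level v1 and feeling level v2 for the body-state preparation (LP4)
    return beta2 * (1 - (1 - w3 * v1) * (1 - w4 * v2)) + (1 - beta2) * w3 * w4 * v1 * v2

def step(stimulus, belief, prep, feeling, omega2,
         beta1=0.95, beta2=0.4, gamma1=0.5, gamma2=0.5, dt=0.1):
    srs_w = stimulus                                              # LP1/LP2: copy the stimulus level
    belief += gamma1 * (g(beta1, 1.0, omega2, srs_w, feeling) - belief) * dt   # LP3
    prep += gamma2 * (h(beta2, 1.0, 1.0, belief, feeling) - prep) * dt         # LP4
    feeling = prep                                                # LP5/LP6: as-if body loop
    return belief, prep, feeling

# example usage: repeated stepping with a constant stimulus of 0.6 and a fixed connection strength
b = p = f = 0.0
for _ in range(2000):
    b, p, f = step(0.6, b, p, f, omega2=0.0)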
3 How the Agent Learns to Believe by Feeling So far it was discussed how, via a converging loop, activated beliefs lead to certain feelings. Conversely, persons may affect the strength of their beliefs by their feelings in the sense that, for example, an optimistic person strengthens beliefs that have a
positive feeling associated and a pessimistic person strengthens beliefs with a negative associated feeling. Thus the strengths of beliefs may depend on non-informational aspects of mental processes and related personal characteristics. To model this for the case of feelings a connection from feeling to belief is used; see Figure 1. Support for a connection from feeling to belief from a neurological theory can be found in Damasio's Somatic Marker Hypothesis; cf. [1], [7], [8] and [10]. This is a theory on decision making with a central role for emotions felt. Each decision option induces (via an emotional response) a feeling which is used to mark the option. For example, when a negative somatic marker is linked to a particular option, it provides a negative feeling for that option. Similarly, a positive somatic marker provides a positive feeling for that option. Damasio describes the use of somatic markers in the following way: 'the somatic marker (...) forces attention on the negative outcome to which a given action may lead, and functions as an automated alarm signal (…) When a positive somatic marker is juxtaposed instead, it becomes a beacon of incentive. (…) on occasion somatic markers may operate covertly (without coming to consciousness) and may utilize an 'as-if-loop'.' ([7], pp. 173-174). Usually the Somatic Marker Hypothesis is applied to provide endorsements or valuations for options for a person's decisions on actions. However, it may be considered plausible that such a mechanism is applicable to valuations of internal states such as beliefs as well. One of the elements of the Somatic Marker Hypothesis is that somatic markers depend on past experiences of the person. Within the agent model introduced above this element is incorporated by making the connection strength from feeling to believing adaptive, dependent on beliefs and feelings experienced over time. From a Hebbian neurological perspective [15], strengthening of a connection from feeling to belief over time may be considered plausible, as neurons involved in the belief and in the associated feeling will often be activated simultaneously. Therefore such a connection from feeling to belief may be developed and adapted based on a Hebbian learning mechanism [2], [14] and [15]: connections between neurons that are activated simultaneously are strengthened, similar to what has been proposed for the emergence of mirror neurons in, e.g., [17] and [18]. Based on these considerations, in the agent model the connection strength ω is adapted using the following Hebbian learning rule. It takes into account a maximal connection strength 1, a learning rate η, and an extinction rate ζ. A similar Hebbian learning rule can be found in ([14], p. 406). By the factor (1 - ω) the learning rule keeps the level of ω bounded by 1 (which could be replaced by any number), as Hebbian learning without such a bound usually provides instability. When extinction is neglected, the upward changes during learning are proportional to both V1 and V2, which in particular means that no learning takes place whenever one of them is 0, and maximal learning takes place when both are 1.
LP7 Hebbian learning rule
If the belief for w has strength V1,
and the feeling of b has strength V2,
and the connection from feeling b to belief of w has strength ω,
and the learning rate from feeling b to belief of w is η,
and the extinction rate from feeling b to belief of w is ζ,
then after Δt the connection from feeling b to belief of w will have strength ω + (ηV1V2(1 - ω) - ζω) Δt.
feeling(b, V1) & belief(w, V2) & has_connection_strength(b, w, ω) & has_learning_rate(b, w, η) & has_extinction_rate(b, w, ζ) → → has_connection_strength(b, w, ω + (ηV1V2 (1 - ω) - ζω) Δt)
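As a minimal illustration, the LP7 update can be written as the following Python sketch (ours, with an assumed step size; the values η = 0.02 and ζ = 0.0002 are those used in the simulation of Figure 2): the connection strength grows when belief and feeling are repeatedly high together and slowly extinguishes otherwise.

def hebbian_update(omega, v1, v2, eta=0.02, zeta=0.0002, dt=1.0):
    # LP7: omega <- omega + (eta * V1 * V2 * (1 - omega) - zeta * omega) * dt
    return omega + (eta * v1 * v2 * (1 - omega) - zeta * omega) * dt

omega = 0.0
for _ in range(500):
    omega = hebbian_update(omega, 0.9, 0.9)   # belief and feeling both strongly active
print(round(omega, 3))                        # climbs close to, but stays below, 1 because of extinction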
4 Example Simulation Results Based on the agent model described in the previous section, a number of simulations have been performed. Some example simulation traces are included in this section as an illustration; see Figure 2, Figure 3 and Figure 4 (the time delays within the LEADSTO relations were taken 1 time unit). Here the connection strengths ω1, ω3 and ω4 have been set on 1.
Fig. 2. Example trace with a number of learning phases
Fig. 3. Approximated equilibrium for β1 = 1, β2 = 0.4
Fig. 4. Approximated equilibrium for β1 = 0.2, β2 = 0.9
In Figure 2, β1 = 0.95 , β2 = 0.4, η = 0.02 and the extinction rate ζ is 0.0002. The example trace in Figure 2 shows how after learning of the connection from feeling to belief, the strength of the belief substantially exceeds the strength of the incoming stimulus (0.9 vs 0.6). In Figure 3 and Figure 4 some example simulation traces are showing how equilibria are reached for a constant environment with settings as indicated in the upper part of the figures. These traces illustrate the outcomes of the mathematical analysis of equilibria presented in Section 5.
5 Mathematical Analysis The example simulations in Figure 3 and Figure 4 show how for a time period with a constant environment with strength 0.5, the strengths of beliefs, body states and
feelings and connection between feeling and belief reach a stable equilibrium. By a mathematical analysis it can be addressed which types of equilibria are possible. To this end equations for equilibria can be determined from the dynamical model equations for the belief and the preparation state level, which (assuming ω1, ω3 and ω4 constant) can be expressed as differential equations as follows (with b(t) the level of the belief, s(t) of the stimulus, f(t) of the feeling, and p(t) of the preparation for the body state at time t):
db(t)/dt = γ1 (β1(1 - (1 - ω1s(t))(1 - ω2(t)f(t))) + (1 - β1) ω1ω2(t)s(t)f(t) - b(t))
dp(t)/dt = γ2 (β2(1 - (1 - ω3b(t))(1 - ω4f(t))) + (1 - β2) ω3ω4b(t)f(t) - p(t))
dω2(t)/dt = η b(t)f(t)(1 - ω2(t)) - ζω2(t)
Note that below, as in Section 4, the connection strengths ω1, ω3 and ω4 are taken 1. Moreover, ω2 is denoted as ω. To obtain equations for equilibria, constant values for all variables are assumed (also the ones used as inputs such as the stimuli). Then in all of the equations the reference to time t can be left out, and in addition the derivatives dp(t)/dt and db(t)/dt can be replaced by 0. As for an equilibrium it also holds that f = p, assuming γ1, γ2, ζ and η nonzero, this leads to the following equations in b, f, ω, s:
β1(1 - (1 - s)(1 - ωf)) + (1 - β1)ωsf - b = 0    (1)
β2(1 - (1 - b)(1 - f)) + (1 - β2)bf - f = 0    (2)
ηbf(1 - ω) - ζω = 0    (3)
Note that as an extreme case b = f = s = 0 satisfies (1), (2) and (3). For the general case, first, equation (3) can be rewritten into ηbf - ωηbf - ζω = 0, providing
ω = ηbf / (ηbf + ζ) = 1 / (1 + ζ/ηbf),
where the last step only applies when b, f ≠ 0. Using b, f ≤ 1, from this it follows that ω ≤ 1 / (1 + (ζ/η)) < 1. For small ζ/η this can be approximated by ω ≤ 1 - (ζ/η). This shows that given the extinction, the maximal connection strength will be lower than 1, but may be close to 1 when the extinction rate is very small compared to the learning rate. However, it also depends on the equilibrium values for f and b. For values of f and b that are close to 1, this maximal value of ω can be approximated. When in contrast these values are low, also the equilibrium value for ω will be low, since ω = ηbf / (ηbf + ζ) ≤ ηbf/ζ. In particular, when one of b and f is 0 then also ω is 0. For the general case equation (1) can directly be used to express b in f, ω, s and β1. Using this, in (2) b can be replaced by this expression in f, ω, s and β1, which transforms (2) into a quadratic equation in f with coefficients in terms of s, ω and the parameters β1 and β2. Solving this quadratic equation algebraically provides a complex expression for f in terms of s, ω, β1 and β2. Using this, by (1) also an expression for b in terms of s, ω, β1 and β2 can be found. As the expressions for the general case become rather complex, only an overview for a number of special cases is shown in Table 1 (for 9 combinations of values 0, 0.5 and 1 for both β1 and β2). For these cases the equations (1) and (2) can be substantially simplified as shown in the second column (for equation (1)) and second row (for equation (2)). As can be seen in Table 1, persons that have a low orientation for believing (β1 = 0) and a low profile in
generating emotional responses (β2 = 0), have an equilibrium for which both the belief and the feeling have level 0, and also ω = 0. The case where both β1 = 0.5 and β2 = 0.5 indicates an equilibrium with b = f = s, and ω = 1/(1 + ζ/ηs²). Note that in Table 1 for β1 = 1 and β2 nonzero two equations in ω and b occur, which can be solved further to obtain more complex explicit expressions for each of them.
Table 1. Overview of equilibria for 9 cases of parameter settings for β1 and β2
[Table entries: equilibrium values of b, f and ω for the nine combinations of β1, β2 ∈ {0, 0.5, 1}, with the simplified forms of equation (1) as headers for β1 (b = ωsf for β1 = 0, b = (s + ωf)/2 for β1 = 0.5, 1 - b = (1 - s)(1 - ωf) for β1 = 1) and of equation (2) as headers for β2; for example, β1 = β2 = 0 gives b = f = ω = 0, and β1 = β2 = 0.5 gives b = f = s with ω = 1/(1 + ζ/ηs²).]
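The equilibrium relations can also be checked numerically. The Python sketch below (our illustration, with assumed step-size and initial values) iterates the three update equations with a constant stimulus s = 0.5 and β1 = β2 = 0.5, and verifies that the learned connection strength settles at ω = ηbf/(ηbf + ζ), as derived from equation (3):

eta, zeta = 0.02, 0.0002
s, beta1, beta2, gamma, dt = 0.5, 0.5, 0.5, 0.5, 0.1
b = p = f = omega = 0.0
for _ in range(200000):
    gb = beta1 * (1 - (1 - s) * (1 - omega * f)) + (1 - beta1) * omega * s * f
    hp = beta2 * (1 - (1 - b) * (1 - f)) + (1 - beta2) * b * f
    b += gamma * (gb - b) * dt
    p += gamma * (hp - p) * dt
    f = p                                                  # as-if body loop: feeling follows preparation
    omega += (eta * b * f * (1 - omega) - zeta * omega) * dt
print(round(b, 3), round(f, 3), round(omega, 3))
print(round(eta * b * f / (eta * b * f + zeta), 3))        # equals the printed omega at equilibrium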
6 Discussion In this paper an adaptive agent model was introduced for the emerging effect of feeling on belief. The introduced agent model on the one hand describes more specifically how a belief generates an emotional response that is felt, and on the other hand how a connection can emerge enabling that the emotion that is felt affects the strength of the belief. For feeling the emotion, a converging recursive body loop is used, based on elements taken from [5], [9] and [10]. A relation from feeling to belief is developed based on a Hebbian learning rule (cf. [2], [14], [15]). After developing a connection from feeling to belief, the strength of a belief does not only depend on the strength of a stimulus, but also on the strength of the induced emotional response and feeling. A mathematical analysis of the equilibria of the model was discussed. The agent model was specified in the hybrid dynamic modelling language LEADSTO, and simulations were performed in its software environment; cf. [3]. In [19] a nonadaptive model for interaction between belief and feeling was presented, where a priori a connection from feeling to belief is assumed given. The agent model presented in the current paper is different in three respects. Firstly, it uses an as-if body loop instead of a body loop as was used in [19]. Secondly, in the agent model presented here, the connections between different mental states have weights that can be numbers between 0 and 1. Thirdly, here the connection from feeling to belief is not given a priori as in [19] but emerges over time, which can be considered a major difference.
References 1. Bechara, A., Damasio, A.: The Somatic Marker Hypothesis: a neural theory of economic decision. Games and Economic Behavior 52, 336–372 (2004) 2. Bi, G.Q., Poo, M.M.: Synaptic Modifications by Correlated Activity: Hebb’s Postulate Revisited. Ann. Rev. Neurosci. 24, 139–166 (2001) 3. Bosse, T., Jonker, C.M., van der Meij, L., Treur, J.: A Language and Environment for Analysis of Dynamics by Simulation. Int. J. of Artificial Intelligence Tools 16, 435–464 (2007) 4. Bosse, T., Jonker, C.M., Treur, J.: Simulation and Analysis of Adaptive Agents: an Integrative Modelling Approach. Advances in Complex Systems 10, 335–357 (2007) 5. Bosse, T., Jonker, C.M., Treur, J.: Formalisation of Damasio’s Theory of Emotion, Feeling and Core Consciousness. Consciousness and Cognition 17, 94–113 (2008) 6. Bosse, T., Jonker, C.M., Los, S.A., van der Torre, L., Treur, J.: Formal Analysis of Trace Conditioning. Cognitive Systems Research Journal 8, 36–47 (2007) 7. Damasio, A.: Descartes’ Error: Emotion. Reason and the Human Brain, Papermac (1994) 8. Damasio, A.: The Somatic Marker Hypothesis and the Possible Functions of the Prefrontal Cortex. Philosophical Transactions of the Royal Society: Biological Sciences 351, 1413– 1420 (1996) 9. Damasio, A.: The Feeling of What Happens. Body and Emotion in the Making of Consciousness. Harcourt Brace, New York (1999) 10. Damasio, A.: Looking for Spinoza. Vintage Books, London (2004) 11. Eich, E., Kihlstrom, J.F., Bower, G.H., Forgas, J.P., Niedenthal, P.M.: Cognition and Emotion. Oxford University Press, New York (2000) 12. Forgas, J.P., Laham, S.M., Vargas, P.T.: Mood effects on eyewitness memory: Affective influences on susceptibility to misinformation. Journal of Experimental Social Psychology 41, 574–588 (2005) 13. Forgas, J.P., Goldenberg, L., Unkelbach, C.: Can bad weather improve your memory? An unobtrusive field study of natural mood effects on real-life memory. Journal of Experimental Social Psychology 45, 254–257 (2009) 14. Gerstner, W., Kistler, W.M.: Mathematical formulations of Hebbian learning. Biol. Cybern. 87, 404–415 (2002) 15. Hebb, D.O.: The Organisation of Behavior. Wiley, New York (1949) 16. Jonker, C.M., Snoep, J.L., Treur, J., Westerhoff, H.V., Wijngaards, W.C.A.: BDIModelling of Complex Intracellular Dynamics. Journal of Theoretical Biology 251, 1–23 (2008) 17. Keysers, C., Gazzola, V.: Unifying Social Cognition. In: Pineda, J.A. (ed.) Mirror Neuron Systems: the Role of Mirroring Processes in Social Cognition, pp. 3–28. Humana Press/Springer (2009) 18. Keysers, C., Perrett, D.I.: Demystifying social cognition: a Hebbian perspective. Trends in Cognitive Sciences 8, 501–507 (2004) 19. Memon, Z.A., Treur, J.: Modelling the Reciprocal Interaction between Believing and Feeling from a Neurological Perspective. In: Zhong, N., Li, K., Lu, S., Chen, L. (eds.) BI 2009. LNCS (LNAI), vol. 5819, pp. 13–24. Springer, Heidelberg (2009) 20. Niedenthal, P.M.: Embodying Emotion. Science 316, 1002–1005 (2007) 21. Schooler, J.W., Eich, E.: Memory for Emotional Events. In: Tulving, E., Craik, F.I.M. (eds.) The Oxford Handbook of Memory, pp. 379–394. Oxford University Press, Oxford (2000) 22. Winkielman, P., Niedenthal, P.M., Oberman, L.M.: Embodied Perspective on EmotionCognition Interactions. In: Pineda, J.A. (ed.) Mirror Neuron Systems: the Role of Mirroring Processes in Social Cognition, pp. 235–257. Springer, Heidelberg (2009)
Soft Set Theoretic Approach for Discovering Attributes Dependency in Information Systems Tutut Herawan1, Ahmad Nazari Mohd Rose2, and Mustafa Mat Deris3 1
Department of Mathematics Education, Universitas Ahmad Dahlan, Indonesia Faculty of Informatics, Universiti Darul Iman Malaysia, Terengganu, Malaysia 3 FTMM, Universiti Tun Hussein Onn Malaysia, Johor, Malaysia
[email protected],
[email protected],
[email protected] 2
Abstract. This paper presents the applicability of soft set theory for discovering attribute dependency in multi-valued information systems. The proposed approach is based on the notion of multi-soft sets. An inclusion of value sets in soft set theory is used to discover degree of attributes dependency. The results obtained are equivalent to the rough attributes dependency. Keywords: Information system; Soft set theory; Attribute dependency.
1 Introduction An attribute dependency states that the value of an attribute is uniquely determined by the values of some other attributes. The objective of discovering attribute dependency is to find the relationship among attributes in information systems. Attribute dependencies have been applied in areas such as data classification, association rule mining, etc. One of the methods for discovering attribute dependencies is using rough set theory [1]. Formally, in an information system S = (U, A, V, f) [1], attribute D is said to depend totally on attribute C, denoted C ⇒ D, if each value of D is associated with exactly one value of C. Otherwise, D depends partially on C. The discovery of attribute dependencies using rough set theory has received considerable interest (e.g., [2−6]). Soft set theory [7], proposed by Molodtsov in 1999, is a new method for dealing with uncertain data. In recent years, research on soft set theory has been active, and great progress has been achieved, including the works of theoretical soft set theory, soft set theory in abstract algebra, soft set theory in forecasting and soft set theory for data analysis, particularly in parameterization reduction and decision making. However, no current research has been done in applying soft set theory for discovering attributes dependency in multi-valued information systems. Inspired by the fact that every rough set is a soft set [8], in this paper we present the applicability of soft set theory for discovering attributes dependencies in multi-valued information systems. The proposed approach is based on the notion of multi-soft sets [9]. An inclusion of value sets in soft set theory is used to discover the degree of attributes dependency. The results obtained show that our proposed technique is equivalent to the rough-set based attributes dependency.
The rest of this paper is organized as follows. Section 2 describes the fundamental concept of information systems and set approximations. Section 3 describes the basic concept of soft set theory and multi-soft sets. Section 4 describes soft set approach for discovering attributes dependency. Finally, the conclusion of our work is described in section 5.
2 Information Systems and Set Approximations An information system is a 4-tuple (quadruple) S = (U , A,V , f ) , where U is a nonempty finite set of objects, A is a non-empty finite set of attributes, V = ∪a∈A Va , Va is the domain (value set) of attribute a, f : U × A → V is a total function such that
f(u, a) ∈ Va, for every (u, a) ∈ U × A, called the information (knowledge) function. In many applications, there is an outcome of classification that is known. This a posteriori knowledge is expressed by one (or more) distinguished attribute called decision attribute; the process is known as supervised learning. An information system of this kind is called a decision system. A decision system is an information system of the form D = (U, A ∪ {d}, V, f), where d ∉ A is the decision attribute [1]. The elements of A are called condition attributes. An example of a decision system is given in Table 1 as follows. Example 1. Hiring: an example of a multi-valued information system is presented in Table 1. In Table 1, eight students are characterized by four conditional attributes, Diploma, Math, IT and English, and one decision attribute, Decision. Table 1. Hiring: an information system
Student  Diploma  Math    IT   English    Decision
1        MBA      High    Yes  Excellent  Accept
2        MBA      Low     Yes  Neutral    Reject
3        MCE      Low     Yes  Good       Reject
4        MSc      High    Yes  Neutral    Accept
5        MSc      Medium  Yes  Neutral    Reject
6        MSc      High    Yes  Excellent  Accept
7        MBA      High    No   Good       Accept
8        MCE      Low     No   Excellent  Reject
The starting point of rough set theory is the indiscernibility relation, which is generated by information about objects of interest. Two objects in an information system are called indiscernible (or similar) if they have the same feature. Definition 2. (See [1].) Two elements x, y ∈ U are said to be B-indiscernible
(indiscernible by the set of attribute B ⊆ A in S) if and only if f (x, a ) = f ( y , a ) , for every a ∈ B .
Obviously, every subset of A induces unique indiscernibility relation. Notice that, an indiscernibility relation induced by the set of attribute B, denoted by IND(B ) , is an equivalence relation. The partition of U induced by IND(B ) is denoted by U / B and
the equivalence class in the partition U / B containing x ∈ U , is denoted by [x ]B . The notions of lower and upper approximations of a set are defined as follows. Definition 3. (See [1].) The B-lower approximation of X, denoted by B( X ) and B-
upper approximations of X, denoted by B( X ) , respectively, are defined by
B(X) = {x ∈ U : [x]B ⊆ X} (the B-lower approximation) and B(X) = {x ∈ U : [x]B ∩ X ≠ φ} (the B-upper approximation).
The accuracy of approximation (accuracy of roughness) of any subset X ⊆ U with respect to the set of attributes B ⊆ A, denoted αB(X), is numerically measured by αB(X) = |B(X)| / |B(X)|, with the lower approximation in the numerator and the upper approximation in the denominator, where |X| denotes the cardinality of X. The higher the accuracy of approximation of a subset X ⊆ U, the more precise (the less imprecise) it is. The dependency degree of attributes in an information system is given in Definition 4.
Definition 4. Let S = (U, A, V, f) be an information system and let D and C be any subsets of A. The dependency of attribute D on C in a degree k (0 ≤ k ≤ 1) is denoted by C ⇒k D. The degree k is defined by k = Σ X∈U/D |C(X)| / |U|, where C(X) is the C-lower approximation of X.
D is said to fully depend (in a degree k) on C if k = 1. Otherwise, D depends partially on C. Thus, D fully (partially) depends on C if all (some) elements of the universe U can be uniquely classified to equivalence classes of the partition U/D, employing C.
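As an illustration of Definition 4, the partitions and the dependency degree can be computed directly; the Python sketch below (ours, with the Table 1 data hard-coded) gives k = 1 for C = {Math} and D = {Decision}, in line with the functional dependency Math ⇒ Decision used later in Example 11, and only a partial dependency for C = {Diploma}.

table = {
    1: {"Diploma": "MBA", "Math": "High",   "IT": "Yes", "English": "Excellent", "Decision": "Accept"},
    2: {"Diploma": "MBA", "Math": "Low",    "IT": "Yes", "English": "Neutral",   "Decision": "Reject"},
    3: {"Diploma": "MCE", "Math": "Low",    "IT": "Yes", "English": "Good",      "Decision": "Reject"},
    4: {"Diploma": "MSc", "Math": "High",   "IT": "Yes", "English": "Neutral",   "Decision": "Accept"},
    5: {"Diploma": "MSc", "Math": "Medium", "IT": "Yes", "English": "Neutral",   "Decision": "Reject"},
    6: {"Diploma": "MSc", "Math": "High",   "IT": "Yes", "English": "Excellent", "Decision": "Accept"},
    7: {"Diploma": "MBA", "Math": "High",   "IT": "No",  "English": "Good",      "Decision": "Accept"},
    8: {"Diploma": "MCE", "Math": "Low",    "IT": "No",  "English": "Excellent", "Decision": "Reject"},
}

def partition(attrs):
    # equivalence classes of IND(attrs): objects with identical values on all attributes in attrs
    classes = {}
    for u, row in table.items():
        classes.setdefault(tuple(row[a] for a in attrs), set()).add(u)
    return list(classes.values())

def dependency_degree(C, D):
    # k = sum over X in U/D of |C-lower approximation of X|, divided by |U| (Definition 4)
    blocks_C = partition(C)
    total = 0
    for X in partition(D):
        for B in blocks_C:
            if B <= X:
                total += len(B)
    return total / len(table)

print(dependency_degree(["Math"], ["Decision"]))     # 1.0: Decision fully depends on Math
print(dependency_degree(["Diploma"], ["Decision"]))  # 0.25: only a partial dependency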
3 Multi-soft Sets We may see that the structure of a "standard" soft set [7] is a mapping which only classifies the objects into two classes (yes/1 or no/0). However, in real applications, depending on the set of parameters, a given parameter may have different values. To this, we first present the idea for representing a multi-valued information system, the so-called multi-soft sets [9]. The idea of multi-soft sets is based on a decomposition of a multi-valued information system S = (U, A, V, f) into |A| binary-valued information systems Si = (U, ai, V{0,1}, f), where |A| denotes the cardinality of A. Consequently, the |A| binary-valued information systems define multi-soft sets (F, A) = {(F, ai) : 1 ≤ i ≤ |A|}.
3.1 Soft Set Theory
Throughout this section U refers to an initial universe, E is a set of parameters, P(U ) is the power set of U and A ⊆ E . Definition 5. (See [7].) A pair (F, A) is called a soft set over U, where F is a
mapping given by F : A → P(U ) .
In other words, a soft set over U is a parameterized family of subsets of the universe U. For ε ∈ A , F (ε ) may be considered as the set of ε -elements of the soft set (F, A) or as the set of ε -approximate elements of the soft set. Clearly, a soft set is not a (crisp) set. For illustration, Molodtsov considered several examples in [7]. 3.2 Decomposition of Multi-valued Information Systems
The decomposition of S = (U, A, V, f) is based on the decomposition of A = {a1, a2, ..., a|A|} into the disjoint singleton attributes {a1}, {a2}, ..., {a|A|}. Here, we only consider complete information systems. Let S = (U, A, V, f) be an information system such that for every a ∈ A, Va = f(U, A) is a finite non-empty set and for every u ∈ U, |f(u, a)| = 1. For every ai under the i-th attribute consideration, ai ∈ A and v ∈ Va, we define the map a_v^i : U → {0,1} such that a_v^i(u) = 1 if f(u, a) = v, otherwise a_v^i(u) = 0. Next, we define a binary-valued information system as a quadruple Si = (U, ai, V{0,1}, f). The information systems Si = (U, ai, V{0,1}, f), 1 ≤ i ≤ |A|, are referred to as a decomposition of a multi-valued information system S = (U, A, V, f) into |A| binary-valued information systems, as depicted in Figure 1. Every information system Si = (U, ai, Vai, f), 1 ≤ i ≤ |A|, is a deterministic information system since for every a ∈ A and for every u ∈ U, |f(u, a)| = 1, such that the structure of a multi-valued information system and the |A| binary-valued information systems give the same value of attribute related to objects. 3.3 Multi-soft Sets
Based on the notion of a decomposition of a multi-valued information system in the previous sub-section, in this sub-section we present the notion of a multi-soft set representing a multi-valued information system. Let S = (U, A, V, f) be a multi-valued information system and Si = (U, ai, Vai, f), 1 ≤ i ≤ |A|, be the |A| binary-valued information systems. Since a standard soft set is equivalent to a binary-valued information system, we have
S = (U, A, V, f) = { S1 = (U, a1, V{0,1}, f) ⇔ (F, a1); S2 = (U, a2, V{0,1}, f) ⇔ (F, a2); ...; S|A| = (U, a|A|, V{0,1}, f) ⇔ (F, a|A|) } = ((F, a1), (F, a2), ..., (F, a|A|)).
We define (F, A) = ((F, a1), (F, a2), ..., (F, a|A|)) as a multi-soft set over universe U representing a multi-valued information system S = (U, A, V, f).
Fig. 1. A decomposition of information systems (the multi-valued table of f(u, a) values decomposed into one binary-valued table per attribute, with one {0,1}-column per attribute value)
3.4 AND and OR Operations
The notions of AND and OR operations in multi-soft sets are given below.
Definition 6. Let (F, A) = ((F, ai) : i = 1, 2, ..., |A|) be a multi-soft set over U representing a multi-valued information system S = (U, A, V, f). The AND operation between (F, ai) and (F, aj) is defined as (F, ai) AND (F, aj) = (F, ai × aj), where F(Vai, Vaj) = F(Vai) ∩ F(Vaj), for all (Vai, Vaj) ∈ ai × aj, 1 ≤ i, j ≤ |A|.
Definition 7. Let (F, A) = ((F, ai) : i = 1, 2, ..., |A|) be a multi-soft set over U representing a multi-valued information system S = (U, A, V, f). The OR operation between (F, ai) and (F, aj) is defined as (F, ai) OR (F, aj) = (F, ai × aj), where F(Vai, Vaj) = F(Vai) ∪ F(Vaj), for all (Vai, Vaj) ∈ ai × aj, 1 ≤ i, j ≤ |A|.
Thus, both AND and OR operations in multi-soft set over U define a soft set over U ×U . Definition 8. Let (F, A) be multi-soft sets over U and (F , a ) ∈ (F , A) . The value-class
of (F, a ) , i.e., class of all value sets of (F, a ) , denoted C( F ,a ) is defined by
{
C( F ,a ) = {u : f (u, α 1 ) = 1},
{ (
, u : f u, α V
a
) = 1}},
where u ∈ U and α ∈ Va . Clearly C( F ,a ) ⊆ P(U ) . Example 9. Let A = {Diploma, Math, IT, English, Decision} . Therefore, the multi-soft set representing Table 1 is given in Figure 2. Note that the class value of every soft set is a partition of U.
(F, A) = ( {MBA = {1,2,7}, MCE = {3,8}, MSc = {4,5,6}}, {Medium = {5}, Low = {2,3,8}, High = {1,4,6,7}}, {Yes = {1,2,3,4,5,6}, No = {7,8}}, {Excellent = {1,6,8}, Neutral = {2,4,5}, Good = {3,7}}, {Accept = {1,4,6,7}, Reject = {2,3,5,8}} )
Fig. 2. Multi soft-sets representing Table 1
Let C = {Diploma, Math} , we have
(F, Diploma) AND (F, Math) = { (MBA, Medium) = φ, (MBA, Low) = {2}, (MBA, High) = {1,7}, (MCE, Medium) = φ, (MCE, Low) = {3,8}, (MCE, High) = φ, (MSc, Medium) = {5}, (MSc, Low) = φ, (MSc, High) = {4,6} } with
C( F ,Diploma )AND ( F ,Mathematics ) = {{1,7}, {2}, {3,8}, {4,6}, {5}}
and
(F, Diploma) OR (F, Math) = { (MBA, Medium) = {1,2,5,7}, (MBA, Low) = {1,2,3,7,8}, (MBA, High) = {1,2,4,6,7}, (MCE, Medium) = {3,5,8}, (MCE, Low) = {2,3,8}, (MCE, High) = {1,3,4,7,8}, (MSc, Medium) = {4,5,6}, (MSc, Low) = {2,3,4,5,6,8}, (MSc, High) = {1,4,5,6,7} } with
C(F,Diploma)OR(F,Math) = { {1,2,5,7}, {1,2,3,7,8}, {1,2,4,6,7}, {3,5,8}, {2,3,8}, {1,3,4,7,8}, {4,5,6}, {2,3,4,5,6,8}, {1,4,5,6,7} }.
Discovering attributes dependency is an important issue in data analysis. In the following section, we propose an alternative technique for discovering attributes dependency in a multi-valued information system.
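The same construction is easy to reproduce programmatically. The Python sketch below (ours, with the Table 1 data hard-coded) builds the soft set of each attribute as a map from attribute values to object sets and applies the AND operation of Definition 6, recovering the non-empty value sets listed above.

from itertools import product

cols = {
    "Diploma":  ["MBA", "MBA", "MCE", "MSc", "MSc", "MSc", "MBA", "MCE"],
    "Math":     ["High", "Low", "Low", "High", "Medium", "High", "High", "Low"],
    "IT":       ["Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"],
    "English":  ["Excellent", "Neutral", "Good", "Neutral", "Neutral", "Excellent", "Good", "Excellent"],
    "Decision": ["Accept", "Reject", "Reject", "Accept", "Reject", "Accept", "Accept", "Reject"],
}

def soft_set(attr):
    # F_a maps each value of the attribute to the set of objects (1..8) taking that value
    F = {}
    for obj, value in enumerate(cols[attr], start=1):
        F.setdefault(value, set()).add(obj)
    return F

def AND(F1, F2):
    # Definition 6: intersect the value sets over all pairs of attribute values
    return {(v1, v2): F1[v1] & F2[v2] for v1, v2 in product(F1, F2)}

conj = AND(soft_set("Diploma"), soft_set("Math"))
value_class = [s for s in conj.values() if s]    # the value class, empty value sets left out
print(sorted(sorted(s) for s in value_class))    # [[1, 7], [2], [3, 8], [4, 6], [5]]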
4 Soft Set Approach for Discovering Attributes Dependencies Intuitively, a set of attributes D depends totally on a set of attributes C, denoted C ⇒ D, if all values of attributes from D are uniquely determined by values of attributes from C. In other words, D depends totally on C if there exists a functional dependency between the values of D and C. 4.1 Functional Dependency Definition 10. Let (F, A) be a multi-soft set over U and D, C be any subsets of A, where C ∩ D = φ. Attribute D functionally depends on C, denoted C ⇒ D, if each value of D is associated with exactly one value of C, i.e., for every u, v ∈ U, f(u, ci) = f(v, ci) for every ci ∈ C implies f(u, di) = f(v, di) for every di ∈ D.
Example 11. From Table 1, since
f (i, Math )i=1, 4, 6, 7 ⇒ f (i, Decision )i=1, 4, 6, 7 ,
f ( j , Math ) j =2, 3,8 ⇒ f ( j , Decision ) j =2 , 3,8 ,
f (k , Math )k =5 ⇒ f (k , Decision )k =5 . Thus, we have Math ⇒ Decision .
Definition 12. Let (F, A) be a multi-soft set over U representing S = (U, A, V, f) and (F, ai), (F, aj) ∈ (F, A). Soft set (F, ai) is said to be dominated by (F, aj), denoted (F, ai) ≤ (F, aj), if for every X ∈ C(F,ai) there exists Y ∈ C(F,aj) such that X ⊆ Y.
Example 13. From Table 1, we let C = {Diploma, Math} and D = {Decision} . We have
(F, C) = (F, Diploma) AND (F, Math) = { (MBA, Medium) = φ, (MBA, Low) = {2}, (MBA, High) = {1,7}, (MCE, Medium) = φ, (MCE, Low) = {3,8}, (MCE, High) = φ, (MSc, Medium) = {5}, (MSc, Low) = φ, (MSc, High) = {4,6} }, where C(F,C) = {{1,4,6,7}, {2,3,8}, {5}}. Since C(F,D) = {{1,4,6,7}, {2,3,5,8}}, (F, C) is dominated by (F, D), i.e., (F, C) ≤ (F, D).
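A small Python check of the domination test of Definition 12 (a sketch of ours, with the value classes from Example 13 typed in by hand):

def dominated(class_C, class_D):
    # (F, C) <= (F, D): every value set of C is contained in some value set of D
    return all(any(X <= Y for Y in class_D) for X in class_C)

class_C = [{1, 4, 6, 7}, {2, 3, 8}, {5}]        # C_(F,C) for C = {Diploma, Math}
class_D = [{1, 4, 6, 7}, {2, 3, 5, 8}]          # C_(F,D) for D = {Decision}
print(dominated(class_C, class_D))              # True, i.e. (F, C) <= (F, D)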
Proposition 14. Let (F, A) be a multi-soft set over U. A functional dependency C ⇒ D holds if and only if C(F,C) ≤ C(F,D).
Proof. (⇒) Let C, D ⊆ A, where C ∩ D = φ. From the hypothesis, we have that if f(u, ci) = f(v, ci), ci ∈ C, then f(u, di) = f(v, di), di ∈ D. Thus, every value set in C(F,C) is contained in a value set in C(F,D), i.e., for every X ∈ C(F,C) there exists Y ∈ C(F,D) such that X ⊆ Y.
(⇐) Obvious. □
Thus, we can easily check whether or not C ⇒ D, for C, D ⊆ A, by checking C(F,C) ≤ C(F,D). 4.2 Identity Dependency Definition 15. Let (F, A) be a multi-soft set over U. An identity dependency (ID) between two subsets C, D ⊆ A of attributes in S, where C ∩ D = φ, is a statement denoted by C ⇔ D that holds if and only if C ⇒ D and C ⇐ D. Proposition 16. Let (F, A) be a multi-soft set over U. An identity dependency C ⇔ D holds if and only if C(F,C) ≤ C(F,D) and C(F,C) ≥ C(F,D).
Proof. (⇒) Since C ⇔ D, it means that C ⇒ D and C ⇐ D. Thus, from Proposition 14, we have C(F,C) ≤ C(F,D) and C(F,C) ≥ C(F,D).
(⇐) Obvious. □
Thus, we can easily check whether or not C ⇔ D, for C, D ⊆ A, by checking C(F,C) = C(F,D). The concept of dependency discussed above corresponds to that considered in relational databases. Note that in an information system, each map is a tuple ti = (f(ui, a1), f(ui, a2), ..., f(ui, a|A|)), where i = 1, 2, 3, ..., |U|. A tuple t is
not necessarily associated with entity uniquely. Thus, two distinct entities could have the same tuple representation (duplicated/redundant tuple), which is not permissible in relational databases. Thus, the concept of information systems is a generalization of the concept of relational databases. In the following section, we propose the idea of generalized functional dependency in information system, so-called attributes dependency under soft set theory. We further show that the proposed approach is equivalent to rough’s attributes dependency. 4.3 Attributes Dependency Definition 17. Let
(F, A) be a multi-soft set over U and (F, ai), (F, aj) ∈ (F, A). (F, ai) is said to be dominated in degree k by (F, aj), denoted (F, ai) ≤k (F, aj), where
k = |∪ {X : X ⊆ Y}| / |U|,
where X ∈ C(F,ai) and Y ∈ C(F,aj). Obviously 0 ≤ k ≤ 1. If k = 1, then (F, ai) is dominated totally by (F, aj). Otherwise, (F, ai) is dominated partially by (F, aj).
Definition 18. Dependency attributes in Definitions 4 and 17 are equivalent.
Proof. Since U / C = C ( F ,C ) , then the proof is obvious.
□
Example 19. By Table 1, let C = {Diploma, Math, IT, English} and D = {Decision}. We have
U/C = C(F,C) = {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}} and U/D = C(F,D) = {{1,4,6,7}, {2,3,5,8}}. The degree of dependency of D on C is given by C ⇒k D, where k = 1.
5 Conclusion In this paper, we have presented the applicability of soft set theory for discovering attributes dependencies in multi-valued information systems. It is inspired by the fact that every rough set can be considered as a soft set. We used the notion of multi-soft sets to handle a multi-valued information system. Based on an inclusion of value
sets in soft set theory, we have successfully discovered the degree of attributes dependency. It is shown that, the degree defined is equivalent to the rough set based degree of dependency. In the next paper, we will present an application of such dependency for maximal association rules mining, decision making in a multi-valued domain and categorical data clustering.
Acknowledgement The authors thank to Universiti Darul Iman Malaysia (UDM) for financial support of this paper. The work of Tutut Herawan was supported by the FRGS under the Grant No. Vote 0402, Ministry of Higher Education, Malaysia.
References 1. Pawlak, Z., Skowron, A.: Rudiments of rough sets. Information Sciences (177) 1, 3–27 (2007) 2. Ziarko, W.: The Discovery, Analysis, and Representation of Data Dependencies in Databases. In: Knowledge Discovery in Databases, pp. 195–212 (1991) 3. Ziarko, W., Shan, N.: Discovering attribute relationships, dependencies and rules by using rough sets. HICSS (3), 293–299 (1995) 4. Düntsch, I., Gediga, G.: Statistical evaluation of rough set dependency analysis. International Journal of Human Computer Studies 46, 589–604 (1997) 5. Düntsch, I., Gediga, G.: Algebraic Aspects of Attribute Dependencies in Information Systems. Fundamenta Informaticae 29(1-2), 119–133 (1997) 6. Ziarko, W.: Dependencies in Structures of Decision Tables. RSEISP, 113–121 (2007) 7. Molodtsov, D.: Soft set theory-first results. Computers and Mathematics with Applications 37, 19–31 (1999) 8. Herawan, T., Mustafa, M.D.: A direct proof of every rough set is a soft set. In: Proceedings of the Third Conference of AMS 2009, pp. 119–124 (2009) 9. Herawan, T., Mustafa, M.D.: On Multi-soft Sets Construction in Information Systems. In: Huang, D.-S., Jo, K.-H., Lee, H.-H., Kang, H.-J., Bevilacqua, V. (eds.) ICIC 2009. LNCS, vol. 5755, pp. 101–110. Springer, Heidelberg (2009)
An Application of Optimization Model to Multi-agent Conflict Resolution Yu-Teng Chang1,2, Chen-Feng Wu2, and Chih-Yao Lo1,2 2
1 School of management, Huazhong University of Science & Technology Department of Information Management, Yu Da University, Miaoli County, Taiwan 361, R.O.C. {cyt,cfwu,jacklo}@ydu.edu.tw
Abstract. Conflict is a natural and very typical phenomenon in every field of the human world. It has very significant common characteristics and dynamics, and, therefore, it makes sense to examine them together and comparatively. People get involved in conflicts because their interests or their values are challenged. Because distributed problem solving considers how the work of solving a particular problem can be divided among a number of agents, the domain in which this strategy applies is one in which a macro-level or global perspective of problems and solutions may be best achieved by centralized control. This paper concentrates on the linear programming model, develops its conflict resolution algorithm and implements it on AGENT-0. Keywords: Multi-agent, Linear optimization, Agent conflict.
1 Introduction Conflict is a natural and very typical phenomenon in every field of human world. It has very significant common characteristics and dynamics, and, therefore, it makes sense to examine them together and comparatively. People get involved in conflicts because their interests or their values are challenged or because their needs are not met [1, 2]. Certain characteristics must be true for a conflict to happen. Fundamentally there are at least two parties required. The opinions of these parties should be mutually exclusive or mutually incompatible or mutually opposing [3]. As two or more individuals are involved in solving a particular problem, conflicts may arise throughout the whole problem solving process. For example, in the marketing logistics field of a company, the sales manager wants more sales points and warehouses to serve more customers, but this will increase the expenses of the warehouse manager. Because more warehouses need more expenses, therefore the conflict between the sales manager and the warehouse manager happens.
2 The Linear Programming Method to Conflict Resolution If different linear equations are represented as different agents' views, then this research can get an area, which is formed by the intersection of these different
equations. In this area, according to the need, the decision-making can be obtained. First look at the simplest of cases, that being when the problem has just two dimensions. In this situation the different linear equations can be visualized on a diagram, Figure 1 representing the views of three agents, denoted A-C, in the two dimensions x and y.
Fig. 1. Linear programming diagram (the views of agents A, B and C as lines in the x-y plane)
The problem of course can be extended to N dimensions but once N becomes greater than three it is impossible to visualize on paper. In this situation, since the agents' views are represented by linear equations in N dimensions, therefore the linear equations should each consist of N variables. If there are M agents and M is greater than N, i.e., there are M equations and N variables and M > N then there will be two situations. One is if these linear equations, which represent different agents’ view, still intersect by chance, and then the best solution will happen on the extreme points in the feasible solution area. The other is if these linear equations do not intersect, then the feasible solution area will not be formed and the decision maker has to make a decision by using some other techniques. However, if M < N then this case in mathematics will result in infinite solutions. Besides the above situations, if the views of agents cannot be represented by linear equations i.e. they are non-linear, and then in this case it is more complicated.[6][7]
3 The Proposed Algorithm According to the proposed quantitative methods, this research is going to construct the system framework and algorithm of the linear programming model. Linear programming must have a goal function, which is a linear function. The general form of a linear function is G = a1x1 + a2x2 + ⋅⋅⋅ + anxn, where a1, a2, ..., an are coefficients and x1, x2, ..., xn are decision variables. Therefore, the algorithm for linear programming is described as follows:
The Proposed Algorithm for Linear Programming;
FOR Agent A;
  Inform Agent C of the values of decision variables;
  Receive the result from Agent C;
  Detect the result;
  Inform Agent C of detecting result;
END; ( * Agent A * )
FOR Agent B;
  Inform Agent C of the values of decision variables;
  Receive the result from Agent C;
  Detect the result;
  Inform Agent C of detecting result;
END; ( * Agent B * )
FOR Agent C;
  Receive the values of decision variables from Agent A and Agent B;
  Conflict recognized;
  REPEAT
    Select the decision rule;
    Conflict Resolution;
    Inform Agent A and Agent B of the result;
  UNTIL Solution satisfied;
END; ( * Agent C * )
END. ( * Linear Program Method Algorithm * )
The above algorithm specifies that there are only two agents providing the values of decision variables to agent C. In the general case there should be M agents providing M sets of values of decision variables to agent C, i.e., agent A provides a set of values of decision variables (xA1, xA2, ⋅⋅⋅, xAn), ..., agent M provides a set of values of decision variables (xM1, xM2, ⋅⋅⋅, xMn). The goal function is G = a1x1 + a2x2 + ⋅⋅⋅ + anxn and different cases have different coefficients. Therefore agent C will have M goal function values, denoted G1, G2, …, GM, and then compares these M goal function values to select the greatest value among G1, G2, …, GM.
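The selection step described above can be sketched in Python (an illustration only; the paper's implementation is in AGENT-0, and the coefficients and proposals below are made-up placeholders): agent C evaluates the goal function on each agent's proposed decision variables and keeps the proposal with the greatest value.

coefficients = [1, 7]                                   # a1, ..., an of the goal function G

def goal(values):
    # G = a1*x1 + a2*x2 + ... + an*xn
    return sum(a * x for a, x in zip(coefficients, values))

# one set of decision-variable values per agent (M agents, n variables each)
proposals = {"Agent A": (15, 2), "Agent B": (5, 6), "Agent M": (10, 4)}

goal_values = {name: goal(x) for name, x in proposals.items()}      # G1, ..., GM
best = max(goal_values, key=goal_values.get)
print(best, proposals[best], goal_values[best])                     # the greatest goal value wins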
4 Verification and Design In this section this research takes a specific case to simulate the model by using the concept which has been proposed and then design the programming. 4.1 The Implementation of a Specific Case The specific example is described as following: a company in Manchester has 40 unit raw materials and wants to use these limited materials to manufacture product X and product Y for distribution to retailers in various sections of Taiwan. Each product X
needs 2 units of material to produce and each product Y needs 5 units of material to produce. The company management asks its departments to suggest how many units of product X and product Y should be produced to maximize the profit. The response from the production department is that, according to the machine capacity, each unit of product X requires 2 labor-hours of machining and each unit of product Y requires 3 labor-hours of machining. Manufacturing capacity available during the coming production week is 36 labor-hours. The personnel department answers that 3 man-hours are needed for each unit of product X and 2 man-hours for each unit of product Y, and the amount of man-hours available is 27 man-hours. The sales department replies that each unit of product X needs 2 units of sales cost and each unit of product Y needs 1 unit of sales cost. The amount of sales cost available is 24 units. In this case, the decision variables are the number of product X and the number of product Y, denoted x and y, that the company should produce. The objective function is to maximize the profit P = x + 7y that is made by product X and product Y. In this case this research can regard the management as a Conflict Resolution agent, and the production department, the personnel department and the sales department as three individual agents with their own expertise. Because of the limitation of the material resource, the material resource is the corporate constraint for the company and each department. Each department has its own individual constraint, such as the manufacturing capacity for the production department, the man-hours for the personnel department and the sales cost for the sales department. Figure 2 expresses this idea.
Fig. 2. Specific case diagram: the management (Conflict Resolution) agent holds the corporate constraint 2X + 5Y ≤ 40 and the goal P = X + 7Y; the production, personnel and sales agents hold the individual constraints 2X + 3Y ≤ 36, 3X + 2Y ≤ 27 and 2X + Y ≤ 24 respectively
From the point of view of the production agent, it will provide the values of decision variables to the management agent according to its own expertise and referring to its individual constraint and the corporate constraint. Therefore the values of decision variables can be decided by two constraint equations [7]. In Figure 3, there is a feasible solution region that is formed by two constraint equations:
2x + 5y ≤ 40 (corporate constraint)    (1)
2x + 3y ≤ 36 (individual constraint)    (2)
In the feasible solution region, the optimum solution will happen on the extreme point that is on the intersection point of the following two equations:
2x + 5y = 40    (3)
2x + 3y = 36    (4)
After solving equations (3) and (4), the intersection point (15, 2) is obtained. Therefore the production agent will provide this value to the management agent.
Fig. 3. Production agent diagram (feasible region of 2x + 5y ≤ 40 and 2x + 3y ≤ 36; optimum solution point (15, 2))
Fig. 4. Personnel agent diagram (feasible region of 2x + 5y ≤ 40 and 3x + 2y ≤ 27; optimum solution point (5, 6))
The similar situation will happen with the personnel and sales agents as well. Figures 4 and 5 represent these ideas.
Fig. 5. Sales agent diagram (feasible region of 2x + 5y ≤ 40 and 2x + y ≤ 24; optimum solution point (10, 4))
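Each of the three optimum points in Figures 3-5 is simply the intersection of the corporate constraint boundary with the agent's own constraint boundary, which can be checked with a few lines of NumPy (our sketch, not part of the paper's AGENT-0 code):

import numpy as np

corporate = (2.0, 5.0, 40.0)                 # 2x + 5y = 40
individual = {
    "production": (2.0, 3.0, 36.0),          # 2x + 3y = 36
    "personnel":  (3.0, 2.0, 27.0),          # 3x + 2y = 27
    "sales":      (2.0, 1.0, 24.0),          # 2x +  y = 24
}
for agent, (a, b, c) in individual.items():
    A = np.array([[corporate[0], corporate[1]], [a, b]])
    rhs = np.array([corporate[2], c])
    x, y = np.linalg.solve(A, rhs)
    print(agent, (round(float(x), 2), round(float(y), 2)))
# production (15.0, 2.0), personnel (5.0, 6.0), sales (10.0, 4.0)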
Fig. 6. Management agent diagram (the three feasible regions overlaid with the goal line P = x + 7y and the optimum solution points (15, 2), (10, 4) and (5, 6))
Now attempt to overlay Figure 3, Figure 4 and Figure 5 together to get Figure 6 and then look at it. There are three feasible regions and three optimum solution points for each agent. However, the CR agent (management agent) will only choose one optimum point to be its best point by using the total goal function P = x + 7 y, therefore from Figure 13 shows the goal function line (P = x + 7 y ) will first reach the optimum solution point (15, 2) of the production agent. It is very clearly seen from the figure that this point can not satisfy the other two agents’ individual constraints because this point is outside the other two agents’ feasible regions. Thus the goal function line will go to the second optimum solution point (10, 4) which is the optimum solution point of the sales agent, but this point still can not satisfy the personnel agent's constraint because it is outside the feasible region of the personnel agent as well. Therefore in the end the goal function line will reach the third optimum solution point (5, 6), this point satisfies not only the personnel agent's constraint but also the production and sales agents' constraints. Hence, the CR agent found the best solution point (5, 6) and the real feasible region. The above process is to be explained in the theoretical form of the linear programming model. The actual circumstance of the programming operation will be looked at in the following text. 4.2 The Design of Conflict Resolution System The structure of an agent has an agent’s name followed by the time-grain and beliefs and then commit-rules; after the commit-rules it can have different private functions. The number of functions is dependent on the need of the program. How do the production agent, the personnel agent and the sales agent pass their own optimum solutions to the management agent? Because these three agents perform the same action, therefore the proposed method just explain the production agent. In Agent-0 there is a command called “inform”, these three agents simply use this command to pass their own values of decision variables. Under the circumstance of no matter what new message the production agent receives, and no matter what fact the production agent believes, the production agent informs the management agent that the values of its decision variables are x = 15 and y = 2 at the current time. After receiving the opinions from the production agent, personnel agent and sales agent, the management agent will judge whether these opinions conflict or not. If they conflict with one another then the management agent will choose the first rule from
its decision rule database to resolve the conflict and then return the result to each agent. Each agent will check the result to see whether the result is satisfied with its own individual constraint or not. If any one of them is not satisfied with the result then the management agent has to choose the second rule from its decision rule database to resolve the conflict again. Therefore this action is an iterative action. The best result will not be found until all agents are satisfied. That it is better to have as many decision rules in the database as possible. The whole idea is similar to the production rule in an expert system; Figure 7 represents this idea. Production Personnel Sales Agents
[Figure: the production, personnel and sales agents submit their solutions to the management agent, which matches them against its decision rule database; on a match the result is returned, otherwise a decision rule is selected and conflict resolution is executed, with the data flow looping back to the agents.]
Fig. 7. Conflict resolution system
In Figure 7 the check for a match or no match is executed individually by the production agent, the personnel agent and the sales agent. Because the checking process is very similar for each agent, this research uses the production agent's checking code only to illustrate it. The production agent inspects the result of conflict resolution as follows: if the result satisfies its own individual constraint, the production agent sends the message "solution_satisfied" back to the management agent at the current time; otherwise it returns the message "solution_not_satisfied". In Figure 7 the conflict resolution execution is the most important part of this research's program code. It is built from the conflict_resolution function and the rule functions. The conflict_resolution function prints the message "enter conflict resolution" and, if rule = 1, calls the function rule1 and passes the result back to the production agent, the personnel agent and the sales agent. If rule = 2, the management agent calls the function rule2 and then passes the result back to the three agents; if rule = 3, it calls the function rule3 and passes the result back to the three agents. If rule > 3, the management agent prints a message stating that the database is not large enough to solve the problem. The function rule1 takes the maximal value of the three agents' goal functions and prints this maximum together with the values of the best decision variables. Of the functions rule2 and rule3, one takes the minimal value of the three agents' goal functions, and the other takes the mean of the three agents' decision variable values.
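The original system is written in AGENT-0 under Common Lisp and is not reproduced here; the following Python sketch only illustrates the dispatch logic just described. The function names mirror the prose (conflict_resolution, rule1-rule3), but the data structures (a proposal per agent holding its decision variables and goal value, and one constraint-checking predicate per agent) are assumptions made for the illustration.

```python
# Hypothetical sketch of the CR (management) agent's dispatch logic; not the
# original AGENT-0/Lisp code. Each proposal carries an agent's decision
# variables (x, y) and the value of its goal function.
def rule1(proposals):
    # take the proposal with the maximal goal-function value
    best = max(proposals, key=lambda p: p["goal"])
    print("rule1: maximal goal value", best["goal"], "at", best["x"], best["y"])
    return best["x"], best["y"]

def rule2(proposals):
    # take the proposal with the minimal goal-function value
    best = min(proposals, key=lambda p: p["goal"])
    return best["x"], best["y"]

def rule3(proposals):
    # take the mean of the agents' decision-variable values
    n = len(proposals)
    return (sum(p["x"] for p in proposals) / n,
            sum(p["y"] for p in proposals) / n)

def conflict_resolution(rule, proposals):
    print("enter conflict resolution")
    if rule == 1:
        return rule1(proposals)
    if rule == 2:
        return rule2(proposals)
    if rule == 3:
        return rule3(proposals)
    print("the database is not large enough to solve the problem")
    return None

def resolve(proposals, constraints):
    # iterate over the decision rules until every agent's constraint is met
    for rule in (1, 2, 3):
        result = conflict_resolution(rule, proposals)
        if result is not None and all(ok(result) for ok in constraints):
            return result        # every agent replies "solution_satisfied"
    return None                  # decision rule database exhausted
```

For the example above, the three constraint checks would encode each agent's individual feasible region, so the loop terminates at the first rule whose result all agents accept.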
5 Simulation
The linear programming algorithm was implemented on Linux-based workstations within the Agent Oriented Programming [6] environment, which runs under Common Lisp; therefore, to run AGENT-0, one first has to start up Common Lisp. For this reason the Agent Oriented Programming code is similar to any Common Lisp implementation, although there may be special-purpose applications that run on only one or some of the supported platforms. There are four agents. Each agent could be an object, a human being or a group of human beings, but fundamentally an agent can be regarded as being similar to a person; each agent therefore has its own mental state, such as its beliefs and its capability. For example, the production agent is an object (the production department) consisting of a group of people with professional knowledge in the production field. Its capability is to produce; its beliefs are the limited material resources of the company and its own limited manufacturing capacity; and its duty is to pursue the maximal profit for the company under its beliefs and capability. Among these four agents, one is called the conflict resolution agent, denoted agent CR, whose role is to resolve conflict among the other three agents according to certain criteria. These criteria are called decision rules; thus the CR agent has a decision rule database. The other three agents, denoted agent A, agent B and agent D, are of the same importance and independent of one another. They have their own knowledge bases (expertise) because of their own capabilities, and they can access the corporate database. There is a total goal to be achieved by agent A, agent B and agent D. Agents A, B and D will provide different opinions (the proposed method's model supposes that they will provide different sets of decision variables) to agent CR according to their own individual constraints and the corporate constraint. Because agents A, B and D have their own expertise and refer to the corporate database, they may provide different values of the decision variables to agent CR. There are two points to pay attention to. First, because each agent has individual and corporate constraints represented by linear inequalities in the two dimensions x and y, these constraints form a region, and inside this region there can be many values of the decision variables; these values are feasible for the problem. The proposed method therefore calls this region the feasible region and these values feasible solution values. According to the linear programming technique, the optimum solution occurs at an extreme point of the feasible region. Second, each agent is independent of the others and can only access the corporate database and its own knowledge base; therefore, when each agent makes a decision it will try to benefit itself considerably [8]. When agent CR receives the values from agent A, agent B and agent D, it first has to perform conflict recognition, i.e., to compare these values. If they are equal, agent CR informs each agent that there is no conflict and takes any one set of values to solve the problem. However, if agent CR does find conflict among these sets of values, it initiates conflict resolution. After checking the values, each agent passes the result back to agent CR. If this set of values satisfies all agents' constraints then conflict resolution is
achieved; if not, agent CR will choose another decision rule from its decision rule database and resolve the conflict again. For this reason it is better for the CR agent to have as many decision rules as possible.
6 Conclusions
These mathematical techniques are important in the resolution of conflict in DAI (Distributed Artificial Intelligence). Although only five different mathematical techniques have so far been used to describe conflict resolution in DAI, and the advantages and disadvantages of each technique have been considered, they serve as prototypes for designing decision rule databases that include conflict resolution techniques the CR agent can use to make decisions in the future. Thus they can provide the necessary infrastructure into which additional mathematical models could subsequently be accommodated. The aim of the proposed method was to develop a linear programming model to help resolve conflict in DAI problems, and an algorithm was developed and implemented in a software language called AGENT-0. This system ought to be viewed as a prototype for other mathematical models in future work.
References 1. Sycara, K.P.: Multiagent Compromise via Negotiation. Distributed Artificial Intelligence 2, 119–137 (1987) 2. Resmerita, S., Heymann, M.: Conflict resolution in multi-agent systems. In: Proc. of 2003 42nd IEEE Conference on Decision and Control, vol. 3, pp. 2537–2542 (2003) 3. Mack, R.W., Snyder, R.C.: The Analysis of Social Conflict Toward an Overview & Synthesis. In: Smith, C.G. (ed.) Conflict Resolution: Contributions of the Behavioural Sciences. University of Notre Dame Press, London (1971) 4. Johansen, J., Vallee, V., Springer, S.: Electronic Meetings: Technical Alternatives and Social Choices. Addison-Wesley, Reading (1979) 5. Stefik, M.J.: Planning with Constraints (Molgen: part 1). Artificial Intelligence 16(2), 111–140 (1990) 6. Chu, K.: Quantitative Methods for Business and Economic Analysis (1969) 7. Ozan, T.M.: Applied Mathematical Programming for Production and Engineering Management (1986) 8. Chang, E.: Participant Systems for Cooperative Work, pp. 311–339. Morgan Kaufmann, San Francisco (1980)
Using TOPSIS Approach for Solving the Problem of Optimal Competence Set Adjustment with Multiple Target Solutions Tsung-Chih Lai Department of Information and Electronic Commerce, Kainan University, 1, Kainan Road, Luzhu, Taoyuan County 33857, Taiwan
[email protected]
Abstract. Management by objectives (MBO) is an effective framework for enterprise management. In the optimal adjustment of competence set problem, the relevant coefficients are adjusted so that a given target solution (objective) could be attainable. However, various target solutions might be given from various points of view. The conventional method is concerned only with one target solution rather than multiple targets. In this paper, we employ the technique for order preference by similarity to an ideal solution (TOPSIS) method to select/evaluate target solutions suggested by the decision maker. A numerical example with four target solutions is also used to illustrate the proposed method. Keywords: competence set, competence set adjustment, TOPSIS, multiattribute decision making, management by objectives.
1 Introduction A competence set is a collection of ideas, knowledge, information, resources, and skills for satisfactorily solving a given decision problem [1-3]. By using mathematical programming, a number of researchers have focused on searching for the optimal expansion process from an already acquired competence set to a needed one [4-7]. Feng and Yu [8] designed the minimum spanning table algorithm to find the optimal competence set expansion process without formulating the related mathematical program. Huang, Tzeng, and Ong [9] employed the multi-objective evolutionary algorithm (MOEA) to obtain the optimal competence set expansion process under a fuzzy multiple-criteria environment. In recent years, the concept of competence set has been applied in consumer decision problems [10-11]. However, the competence set has been assumed to be discrete and finite so as to represent its elements by nodes of a graph. Lai, Chianglin, and Yu [12] extended the conventional competence set analysis to consider more general situations with linear programming (LP). By treating the feasible region in LP as an acquired skill set, the optimal adjustments of the relevant coefficients could be obtained by formulating the competence set adjustment model for achieving a given target.
In competence set adjustment problems, the optimal adjustment of the relevant coefficients is sought in terms of a given target solution. However, in practice the decision maker may suggest various targets from various points of view (e.g., finance, sales, and so on). As a result, a set of targets is considered rather than a single target. Observe that the decision maker may have a set of potential criteria to be considered when evaluating these suggested targets, so this problem can be viewed as a multiple criteria decision making (MCDM) problem. The technique for order preference by similarity to ideal solution (TOPSIS), initiated by Hwang and Yoon [13], is one of the well-known classical MCDM methods. TOPSIS is a practical and useful approach for evaluating and ranking a number of available alternatives. The best alternative chosen by the TOPSIS method possesses both the shortest distance from the positive ideal solution (PIS) and the farthest distance from the negative ideal solution (NIS). While the PIS maximizes benefit criteria and minimizes cost criteria, the NIS maximizes cost criteria and minimizes benefit criteria. In practice, TOPSIS has been successfully employed in various fields to solve selection/evaluation problems [14-16]. In this paper, the TOPSIS method is adopted to solve the optimal adjustment of competence set problem with a set of targets because it has a sound logic that represents the rationale of human choice [17] and has been shown to be one of the best methods with respect to rank reversal [18]. In order to overcome the problem that the effects of criteria weighting are doubled in TOPSIS, the weighted Minkowski distance function is applied to enhance the reliability of the decision process. A numerical example is used to demonstrate the TOPSIS procedure for selecting a target. The remainder of this paper is organized as follows. The class of optimal adjustment of competence set problems is reviewed in Section 2. The basic concepts and procedures of the TOPSIS approach are described in Section 3. A numerical example is used to illustrate the proposed method in Section 4. Finally, conclusions are presented in Section 5.
2 Optimal Adjustment of Competence Set
In this section, we review some important concepts of the optimal adjustment of competence sets. For a more detailed construction of the models, the reader is referred to [12]. The basic idea of optimal adjustment of a competence set is the following: given a target that may not be attainable within the current framework of productivity and resources, how should some relevant parameters, such as the constraint coefficients and the right-hand-side resource levels in linear programming (LP) problems, be optimally adjusted so that the target becomes feasible? Consider a standard LP problem as follows.
max  z0(x) = cx
s.t.  Ax ≤ b,  x ≥ 0,    (1)
where c=[cj] is the 1×n objective coefficient vector, x=[xj] denotes the n×1 decision vector, A=[aij] is the m×n consumption (or productivity) matrix, and b=[bi] is the m×1 resource availability vector. Suppose that x0 is a target solution set by the decision maker. Then the competence set adjustment (CSA) model can be formulated as follows.
min  z1(D−, γ+ | x0) = Σ_{i=1..m} Σ_{j=1..n} ( δij− / |aij| ) + Σ_{i=1..m} ( γi+ / hi )
s.t.  Σ_{j=1..n} (aij − δij−) xj0 ≤ bi + γi+,  i = 1, 2, …, m,
      δij− ≥ 0,  γi+ ≥ 0,    (2)

where D− = [δij−]m×n denotes the deviation from aij, γ+ = (γ1+, γ2+, …, γm+) denotes the deviation from bi, and hi is defined by

hi = |bi|  if bi ≠ 0,   hi = |Mi|  if bi = 0.    (3)
Note that when bi = 0, the ratio γi+ / |bi| is not defined; the positive number Mi needs to be chosen properly to reflect the impact of the adjustment on bi. Let (D−*, γ+*) be the optimal solution derived from (2). When z1(D−*, γ+*|x0) = 0 there is no need for adjustment; that is, the original system can already produce the target solution x0. In practice, the degrees of adjustment may be bounded within a certain range as follows.
δij− ≤ lij,  i = 1, 2, …, m, j = 1, 2, …, n,    (4)

γi+ ≤ ui,  i = 1, 2, …, m,    (5)

where lij and ui denote the upper bounds for adjusting aij and bi, respectively. In addition, the budget constraint can be written as

Σ_{i=1..m} [ ( Σ_{j=1..n} oij δij− ) + pi γi+ ] ≤ G,    (6)
where the cost for adjusting aij and bi is denoted by oij and pi, respectively, and G denotes the available budget for adjustment. By combining (2) and (4)-(6), we have a more practical and general CSA model as follows.
min  z1(D−, γ+ | x0) = Σ_{i=1..m} Σ_{j=1..n} ( δij− / |aij| ) + Σ_{i=1..m} ( γi+ / hi )
s.t.  Σ_{j=1..n} (aij − δij−) xj0 ≤ bi + γi+,  i = 1, 2, …, m,
      δij− ≤ lij,  i = 1, 2, …, m, j = 1, 2, …, n,
      γi+ ≤ ui,  i = 1, 2, …, m,
      Σ_{i=1..m} [ ( Σ_{j=1..n} oij δij− ) + pi γi+ ] ≤ G,
      δij− ≥ 0,  γi+ ≥ 0.    (7)
In this study, given a set of target solutions xk, k = 1, 2, …, q, we attempt to select the best target and adjust the competence set of a company accordingly. In order to determine the best target, z0(xk) and z1(D−*, γ+*|xk), derived from (1) and (7) respectively, can be treated as part of the criteria for evaluating each target solution. Therefore, the problem of optimal adjustment of a competence set with multiple target solutions can be viewed as a multiple criteria decision-making problem, and the TOPSIS method can then be employed to rank the target solutions.
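The CSA model (7) is an ordinary linear program in the variables δij− and γi+, so it can be solved with any LP solver. The following Python sketch, using scipy.optimize.linprog, is only a hypothetical illustration of how (7) might be assembled (it assumes all aij ≠ 0 and a single fallback value M whenever bi = 0); it is not code from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def solve_csa(A, b, x0, L, O, p, G, u=None, M=1.0):
    """Hypothetical sketch of CSA model (7): minimal relative adjustment of
    (A, b) that makes the target solution x0 feasible. Assumes a_ij != 0."""
    m, n = A.shape
    h = np.where(b != 0, np.abs(b), M)                      # h_i as in (3)
    # decision vector: [delta_11, ..., delta_mn, gamma_1, ..., gamma_m]
    c = np.concatenate([(1.0 / np.abs(A)).ravel(), 1.0 / h])
    A_ub = np.zeros((m + 1, m * n + m))
    b_ub = np.zeros(m + 1)
    for i in range(m):
        # sum_j (a_ij - delta_ij) x0_j <= b_i + gamma_i, rewritten as
        # -sum_j x0_j delta_ij - gamma_i <= b_i - sum_j a_ij x0_j
        A_ub[i, i * n:(i + 1) * n] = -x0
        A_ub[i, m * n + i] = -1.0
        b_ub[i] = b[i] - A[i] @ x0
    # budget constraint (6): sum_ij o_ij delta_ij + sum_i p_i gamma_i <= G
    A_ub[m, :m * n] = O.ravel()
    A_ub[m, m * n:] = p
    b_ub[m] = G
    gamma_bounds = [(0, None)] * m if u is None else [(0, ui) for ui in u]
    bounds = [(0, lij) for lij in L.ravel()] + gamma_bounds
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.fun, res.x[:m * n].reshape(m, n), res.x[m * n:]
```

The returned objective value is the relative adjustment measure z1(D−*, γ+*|x0) used as a criterion in the next section.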
3 The TOPSIS Method
The technique for order preference by similarity to an ideal solution (TOPSIS), first proposed by Hwang and Yoon [13], is one of the best-known multiple criteria decision making (MCDM) methods. The best alternative chosen by the TOPSIS method possesses both the shortest distance from the positive ideal solution (PIS) and the farthest distance from the negative ideal solution (NIS). While the PIS maximizes benefit criteria and minimizes cost criteria, the NIS maximizes cost criteria and minimizes benefit criteria. Let C = {c1, c2, …, cn} be a criterion set and A = {a1, a2, …, am} be an alternative set. The procedure of the TOPSIS method is stated as follows.
3.1 Construct Decision Matrix
The first step of the TOPSIS method is to construct the m×n decision matrix DM, whose rows correspond to the alternatives a1, …, am and whose columns correspond to the criteria c1, …, cn:

DM = [ d11  d12  …  d1n
       d21  d22  …  d2n
        ⋮     ⋮         ⋮
       dm1  dm2  …  dmn ],    (8)
where ai denotes the ith possible alternative, i=1, 2,…, m; cj denotes the jth criterion, j=1, 2,…, n; and dij denotes the crisp performance value of each alternative ai with respect to each criterion cj.
3.2 Construct Normalized Decision Matrix
Let R = [rij] be the normalized decision matrix. The normalized value rij is calculated as

rij = dij / √( Σ_{i=1..m} dij² ),  i = 1, 2, …, m; j = 1, 2, …, n.    (9)
3.3 Construct Weighted Normalized Decision Matrix
In order to obtain objective weights of criteria importance, the entropy weighting method [19], derived from the classical maximum entropy method, is used in this research. The entropy measure of the jth criterion, ej, is obtained as

ej = −K Σ_{i=1..m} rij ln rij,  j = 1, 2, …, n,    (10)

where K = 1/ln m is a constant which guarantees 0 ≤ ej ≤ 1. Then the normalized weight of criterion cj is given by

wj = (1 − ej) / Σ_{j=1..n} (1 − ej),  j = 1, 2, …, n.    (11)

Let V = [vij] be the weighted normalized decision matrix, whose entries are vij = wj rij, i.e.,

V = [vij] = [ w1 r11  w2 r12  …  wn r1n
              w1 r21  w2 r22  …  wn r2n
               ⋮        ⋮             ⋮
              w1 rm1  w2 rm2  …  wn rmn ],    (12)

where wj is the weight of importance of the jth criterion derived by (10)-(11), and Σ_{j=1..n} wj = 1.
3.4 Measure the Distance of Each Alternative from the PIS and the NIS
The separation measures of each alternative from the PIS and from the NIS are computed. Traditionally, the TOPSIS method uses the Euclidean distance for these measures. The separation measure of each alternative from the PIS, di+, is given by

di+ = √( Σ_{j=1..n} (vij − vj+)² ),  i = 1, 2, …, m.    (13)
Similarly, the separation measure of each alternative from the NIS, di− , is as follows.
di− = √( Σ_{j=1..n} (vij − vj−)² ),  i = 1, 2, …, m.    (14)
However, the use of the Euclidean distance has the problem that the effects of the weighting are doubled. This becomes obvious by rewriting (13) (or, analogously, (14)) as

di+ = √( Σ_{j=1..n} (vij − vj+)² ) = √( Σ_{j=1..n} (wj rij − wj rj+)² ) = √( Σ_{j=1..n} wj² (rij − rj+)² ).    (15)
From (15) we can easily observe that the decision results are overly controlled by the weighting. Fortunately, this problem can be overcome by means of the weighted Minkowski distance [20-21], Lwp, defined as

Lwp(x, y) = [ Σ_{j=1..n} wj |xj − yj|^p ]^(1/p),    (16)
where wj is the weight of importance with respect to the jth criterion and p ≥ 1. Note that Lwp with p = 2 is known as the weighted Euclidean distance and is applied to measure the distance of each alternative from the PIS and the NIS in this research. Based on the weighted Euclidean distance, the separation measures can be calculated as follows. Recall that R = [rij] is the normalized decision matrix. Define

R+ = {r1+, r2+, …, rn+} = { (max_i rij | j ∈ J), (min_i rij | j ∈ J′) }    (17)

and

R− = {r1−, r2−, …, rn−} = { (min_i rij | j ∈ J), (max_i rij | j ∈ J′) },    (18)

where J is the set of benefit criteria and J′ is the set of cost criteria.
Then the separation measure of each alternative ai from the PIS, based on the weighted Euclidean distance, is computed as

di+ = √( Σ_{j=1..n} wj (rij − rj+)² ),  i = 1, 2, …, m.    (19)

Similarly, the separation measure of each alternative ai from the NIS, based on the weighted Euclidean distance, is computed as

di− = √( Σ_{j=1..n} wj (rij − rj−)² ),  i = 1, 2, …, m.    (20)
3.5 Calculate the Relative Closeness Coefficient
The relative closeness coefficient, RCCi, associated with each alternative, ai, can be computed by
RCCi = di− / (di+ + di−).    (21)
Finally, all the available alternatives can be ranked according to RCCi.
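The whole procedure of Sections 3.1-3.5 (vector normalization, entropy weights, weighted Euclidean distances to the PIS and NIS, and relative closeness) can be sketched in a few lines of Python. This is only an illustrative implementation of formulas (9)-(21), not code from the paper; the function name and the benefit/cost flag are assumptions made for the sketch.

```python
import numpy as np

def topsis_rank(D, benefit):
    """D: m x n decision matrix; benefit[j] is True for benefit criteria and
    False for cost criteria. Returns the entropy weights and the closeness RCC."""
    m, n = D.shape
    # (9) vector normalization
    R = D / np.sqrt((D ** 2).sum(axis=0))
    # (10)-(11) entropy weights (0 * log 0 treated as 0)
    K = 1.0 / np.log(m)
    logs = np.log(np.where(R > 0, R, 1.0))
    e = -K * (R * logs).sum(axis=0)
    w = (1.0 - e) / (1.0 - e).sum()
    # (17)-(18) positive and negative ideal solutions on R
    r_plus = np.where(benefit, R.max(axis=0), R.min(axis=0))
    r_minus = np.where(benefit, R.min(axis=0), R.max(axis=0))
    # (19)-(20) weighted Euclidean separations (weights enter once, not squared)
    d_plus = np.sqrt((w * (R - r_plus) ** 2).sum(axis=1))
    d_minus = np.sqrt((w * (R - r_minus) ** 2).sum(axis=1))
    # (21) relative closeness coefficient
    rcc = d_minus / (d_plus + d_minus)
    return w, rcc
```

For the adjustment problem of this paper, the objective value z0(xk) and the confidence level would be benefit criteria and the relative adjustment z1 a cost criterion.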
4 Numerical Example
Consider the following LP problem:

max  z0(x) = 30x1 + 20x2 + 40x3
s.t.  x1 + x2 + x3 ≤ 100,
      x1 + 3x2 + x3 ≤ 130,
      2x1 + x2 + x3 ≤ 100,
      x1 ≥ 0, x2 ≥ 0, x3 ≥ 0,    (22)

whose optimal solution is x* = (30, 30, 10). Suppose four alternative target solutions, x1 = (32, 33, 14), x2 = (35, 30, 15), x3 = (34, 28, 16), and x4 = (28, 28, 18), and an available budget for adjustment G = 500 are set by the decision maker. The maximum deviations for adjusting aij are given by

L = [ 0.5   0.3   0.5
      0.25  0.5   0.3
      1     0.25  0.5 ],

the unit prices for adjusting aij are given by

O = [ 25  20  28
      30  25  22
      30  21  25 ],

and p = (50, 65, 48) denotes the unit prices for purchasing extra resources. The criterion set C includes the objective function value (c1), the relative adjustment measure (c2) derived from (7), and the confidence level for achieving the target (c3). The TOPSIS method is then applied according to the following procedure.
Step 1. Construct the decision matrix as listed in Table 1. Note that z0(xk) is the objective function value derived from (22) and z1(D−*, γ+*|xk) is the optimal relative adjustment obtained by solving (7) in terms of xk. The confidence level for each alternative target solution is rated by the decision maker.

Table 1. The original decision matrix

                 z0(xk)   z1(D−*,γ+*|xk)   Confidence level
x1=(32,33,14)    2180     0.751705         0.85
x2=(35,30,15)    2250     0.841249         0.75
x3=(34,28,16)    2220     0.762664         0.70
x4=(28,28,18)    2120     0.695000         0.80
Step 2. Construct the normalized decision matrix listed in Table 2 using (9).

Table 2. The normalized decision matrix

                 z1(D−*,γ+*|xk)   z0(xk)     Confidence level
x1=(32,33,14)    0.552006         0.567753   0.637999
x2=(35,30,15)    0.617762         0.585983   0.562940
x3=(34,28,16)    0.560054         0.578170   0.525411
x4=(28,28,18)    0.510365         0.552126   0.600469
Step 3. Calculate the objective weights of each criterion using (10) and (11):
w = {0.322779, 0.279909, 0.397312}.
Step 4. Calculate the positive ideal solution (PIS) and the negative ideal solution (NIS) using (17) and (18):
R+ = {0.585983, 0.510365, 0.637999},  R− = {0.552126, 0.617762, 0.525411}.
Step 5. Calculate the distance of each alternative target solution from the PIS and the NIS using (19) and (20), respectively, as listed in Table 3.

Table 3. The distance from PIS and NIS for each alternative target solution

                 d−          d+
x1=(32,33,14)    0.0795326   0.0243439
x2=(35,30,15)    0.0304892   0.0739380
x3=(34,28,16)    0.0339277   0.0758098
x4=(28,28,18)    0.0739380   0.0304892
Step 6. Calculate the relative closeness coefficient for each alternative target solution using (21).

Table 4. The relative closeness coefficients

                 RCCi
x1=(32,33,14)    0.765646
x2=(35,30,15)    0.291966
x3=(34,28,16)    0.309171
x4=(28,28,18)    0.708034
Therefore, the final ranking is x1>x4>x3>x2.
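Tying the two sketches above together, the decision matrix of Table 1 could be assembled and ranked roughly as follows. The data are those of the example (the confidence levels are the decision maker's ratings), but the helper names solve_csa and topsis_rank are the hypothetical functions sketched earlier, so this is only an illustration of the workflow, not a reproduction of the paper's exact figures.

```python
import numpy as np

# data of the numerical example
A = np.array([[1, 1, 1], [1, 3, 1], [2, 1, 1]], dtype=float)
b = np.array([100, 130, 100], dtype=float)
c = np.array([30, 20, 40], dtype=float)
L = np.array([[0.5, 0.3, 0.5], [0.25, 0.5, 0.3], [1, 0.25, 0.5]])
O = np.array([[25, 20, 28], [30, 25, 22], [30, 21, 25]], dtype=float)
p = np.array([50, 65, 48], dtype=float)
G = 500.0
targets = [np.array(t, dtype=float) for t in
           [(32, 33, 14), (35, 30, 15), (34, 28, 16), (28, 28, 18)]]
confidence = [0.85, 0.75, 0.70, 0.80]        # rated by the decision maker

# build the decision matrix: z0(xk), z1(xk), confidence level
rows = []
for xk, conf in zip(targets, confidence):
    z1, _, _ = solve_csa(A, b, xk, L, O, p, G)   # optimal relative adjustment
    rows.append([c @ xk, z1, conf])              # z0(xk) = c x
D = np.array(rows)

# z0 and confidence are benefit criteria, z1 is a cost criterion
w, rcc = topsis_rank(D, benefit=np.array([True, False, True]))
ranking = np.argsort(-rcc)                       # best target first
```

The resulting closeness values can then be compared with Table 4.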
5 Conclusion
In this paper, we have investigated the problem of optimal adjustment of a competence set with a set of target solutions proposed by the decision maker. The relative adjustment measure derived from the competence set adjustment model is regarded as a criterion
for evaluating each alternative target solution. By incorporating other relevant criteria, such as the objective function value and the confidence level of achieving the alternative target, the problem can be treated as a multiple criteria decision-making problem. Consequently, the TOPSIS method has been employed to obtain the ranking of the alternative targets. In the TOPSIS procedure, we have used the entropy method to obtain objective weights of criteria importance. Moreover, the weighted Minkowski distance function has been adopted to overcome the problem that the effects of weighting are doubled when calculating the distance of each alternative from the positive ideal solution and from the negative ideal solution. Using the TOPSIS method, the optimal competence set adjustment problem with multiple target solutions was examined and the results were demonstrated.
Acknowledgement The author would like to thank three anonymous reviewers for helpful comments on earlier drafts of this paper. This research was supported by the National Science Council of Taiwan under the grant number: NSC 98-2410-H-424-019.
References 1. Yu, P.L.: Forming winning strategies - An integrated theory of Habitual Domains. Springer, Heidelberg (1990) 2. Yu, P.L.: Habitual Domains. Oper. Res. 39(6), 869–876 (1991) 3. Yu, P.L.: Habitual Domains and forming winning strategies. NCTU Press, Taiwan (2002) 4. Yu, P.L., Zhang, D.: A foundation for competence set analysis. Math. Soc. Sci. 20(3), 251–299 (1990) 5. Yu, P.L., Zhang, D.: Optimal expansion of competence set and decision support. Oper. Res. 30(1), 68–84 (1992) 6. Li, H.L., Yu, P.L.: Optimal Competence Set Expansion Using Deduction Graphs. J. Optim. Theory Appl. 80(1), 75–91 (1994) 7. Shi, D.S., Yu, P.L.: Optimal Expansion and Design of Competence Sets with Asymmetric Acquiring Costs. J. Optim. Theory Appl. 88(3), 643–658 (1996) 8. Feng, J.W., Yu, P.L.: Minimum Spanning Table and Optimal Expansion of Competence Set. J. Optim. Theory Appl. 99(3), 655–679 (1998) 9. Huang, J.J., Tzeng, G.H., Ong, C.S.: Optimal fuzzy multi-criteria expansion of competence sets using multi-objectives evolutionary algorithms. Expert Syst. Appl. 30(4), 739–745 (2006) 10. Chen, T.Y.: Using Competence Sets to Analyze the Consumer Decision Problem. Eur. J. Oper. Res. 128(1), 98–118 (2001) 11. Chen, T.Y.: Expanding competence sets for the consumer decision problem. Eur. J. Oper. Res. 138(3), 622–648 (2002) 12. Lai, T.C., Chianglin, C.Y., Yu, P.L.: Optimal adjustment of competence set with linear programming. Taiwan. J. Math. 12(8), 2045–2062 (2008) 13. Hwang, C., Yoon, K.: Multiple attribute decision making methods and application. Springer, New York (1981) 14. Deng, H., Yeh, C.H., Willis, R.J.: Inter-company comparison using modified TOPSIS with objective weights. Comput. Oper. Res. 27(10), 963–973 (2000)
15. Janic, M.: Multicriteria Evaluation of High-speed Rail, Transrapid Maglev and Air Passenger Transport in Europe. Transp. Plan. Technol. 26(6), 491–512 (2003) 16. Lin, M.C., Wang, C.C., Chen, M.S., Chang, C.A.: Using AHP and TOPSIS approaches in customer-driven product design process. Comput. Ind. 59(1), 17–31 (2008) 17. Shih, H.S., Shyur, H.J., Lee, E.S.: An extension of TOPSIS for group decision making. Math. Comput. Model. 45(7-8), 801–813 (2007) 18. Zanakis, S.H., Solomon, A., Wishart, N., Dublish, S.: Multi-attribute decision making: A simulation comparison of select methods. Eur. J. Oper. Res. 107, 507–529 (1998) 19. Zeleny, M.: Multiple criteria decision making. McGraw-Hill, New York (1982) 20. Berberian, S.K.: Fundamentals of Real Analysis. Springer, New York (1999) 21. Steuer, R.E.: Multiple Criteria Optimization: Theory, Computation, and Application. John Wiley, New York (1986)
About the End-User for Discovering Knowledge Amel Grissa Touzi Ecole Nationale d’Ingénieurs de Tunis Bp. 37, Le Belvédère 1002 Tunis, Tunisia
[email protected]
Abstract. In this paper, we are interested in the end-user, for whom different approaches to Knowledge Discovery in Databases (KDD) have been defined. One of the problems met with these approaches is the large number of generated rules, which are not easily assimilated by the human brain. In this paper, we discuss these problems and propose a pragmatic solution by (1) proposing a new approach for KDD through the fusion of conceptual clustering, fuzzy logic and formal concept analysis, and (2) defining an Expert System (ES) allowing the user to easily exploit all the knowledge generated in the first step. Indeed, this ES can help the user to give semantics to the data and to optimize the search for information. This solution is extensible; the user can choose the fuzzy classification method according to the domain of his data and his needs, or the inference engine for the ES. Keywords: Data Mining, Clustering, Formal Concept Analysis, Fuzzy Logic, Knowledge Discovery in Databases, Expert System.
1 Introduction
Nowadays, there is a growing interest in Knowledge Discovery in Databases (KDD) methods. Several algorithms for mining association rules have been proposed in the literature [1]. Generally, the rules generated by these algorithms, sometimes exceeding several thousand, are not easily exploitable [2], [3]. In this case, the user must choose among these rules those which are intimately bound to the operation he wants to carry out. Several approaches for reducing this large number of rules have been proposed, such as the use of quality measures, syntactic filtering by constraints, and compression by representative or generic bases [4]. In our opinion, the main goal of extracting knowledge from a database is to help the user to give semantics to the data and to optimize the search for information. Unfortunately, this fundamental constraint is not taken into account by almost all knowledge discovery approaches, since they generate a large number of rules that are not easily assimilated by the human brain. Indeed, this large number of rules is due to the fact that these approaches try to determine rules starting from the data or from a data variety such as the frequent itemsets or the frequent closed itemsets, which may be huge. To cure these problems, we propose a new KDD approach with the following characteristics: (1) Extract knowledge by taking another degree of granularity into consideration in the knowledge extraction process. Indeed, we propose to define
rules (Meta-Rules) between classes resulting from a preliminary classification of the data. Then, we automatically deduce knowledge about the initial data set. We prove that the knowledge discovered contains no redundant rule. (2) Propose an Expert System (ES) allowing the end-user to easily exploit all the knowledge generated. This ES can help the user to give semantics to the data and to optimize the search for information. The rest of the paper is organized as follows: Section 2 presents the basic concepts of discovering association rules and Formal Concept Analysis (FCA). Section 3 presents problems and limits of the existing knowledge discovery approaches. Section 4 gives notations related to our new proposed approach. Section 5 describes our KDD model. Section 6 enumerates the advantages and validates the proposed approach. We finish this paper with a conclusion and a presentation of some future work.
2 Basic Concepts
2.1 Discovering Association Rules
Association rule mining was developed in order to analyze basket data in a marketing environment. The input data are composed of transactions: each transaction consists of the items purchased by a consumer during a single visit. The output is composed of rules. An example of an association rule is "90% of transactions that involve the purchase of bread and butter also include milk" [5]. Even though this method was introduced in the context of market basket analysis, it can also be used to search for frequent co-occurrences in any large data set. The first efficient algorithm to mine association rules is Apriori [6]. The first step of this algorithm is the search for frequent itemsets: the user gives a minimum threshold for the support, and the algorithm finds all itemsets that appear with a support greater than this threshold. The second step builds rules from the itemsets found in the first step: the algorithm computes the confidence of each rule and keeps only those whose confidence is greater than a threshold defined by the user. One of the main problems is how to define the support and confidence thresholds. Other algorithms were proposed to improve computational efficiency; among them, we mention CLOSED, CHARM and TITANIC.
2.2 Fuzzy Conceptual Scaling and FCA
Conceptual scaling theory is the central part of Formal Concept Analysis (FCA). It allows, for the embedding of the given data, the introduction of much more general scales than the usual chains and direct products of chains; the given data can be embedded in the direct products of the concept lattices of these scales. FCA starts with the notion of a formal context specifying which objects have which attributes; thus a formal context may be viewed as a binary relation between the object set and the attribute set with values 0 and 1. In [7], an ordered lattice extension theory has been proposed: Fuzzy Formal Concept Analysis (FFCA), in which uncertainty information is directly represented by a real membership value in the range [0,1]. This number is equal to the similarity defined as follows:
Definition. The similarity of a fuzzy formal concept C1 = (φ(A1), B1) and its subconcept C2 = (φ(A2), B2) is defined as

S(C1, C2) = |φ(A1) ∩ φ(A2)| / |φ(A1) ∪ φ(A2)|,
where ∩ and ∪ denote the intersection and union operators on fuzzy sets, respectively. In [8], we showed that FFCA is very powerful both for interpreting the results of fuzzy clustering and for optimizing flexible queries.
Example: Let Table 1 be a relational database table containing the AGE and SALARY of employees. Table 2 presents the results of fuzzy clustering (using Fuzzy C-Means [9]) applied to the Age and Salary attributes. For the Salary attribute, fuzzy clustering generates three clusters (C1, C2 and C3); for the Age attribute, two clusters have been generated (C4 and C5). In our example, α-cut(Salary) = 0.3 and α-cut(Age) = 0.5, so Table 2 can be rewritten as shown in Table 3. The corresponding fuzzy concept lattices of the fuzzy context presented in Table 3, called TAHs, are given by the line diagrams presented in Figures 1 and 2.

Table 1. A relational database table

     SALARY   AGE
t1   800      30
t2   600      35
t3   400      26
t4   900      40
t5   1000     27
t6   500      30
Table 2. Fuzzy Conceptual Scales for age and salary attributes C1 0.1 0.3 0.7 0.1 0.5
t1 t2 t3 t4 t5 t6
SALARY C2 0.5 0.6 0.2 0.4 0.5 0.5
C3 0.4 0.1 0.1 0.5 0.5 -
C4 0.5 0.4 0.7 0.2 0.6 0.5
AGE C5 0.5 0.6 0.3 0.8 0.4 0.5
Table 3. Fuzzy Conceptual Scales for age and Salary attributes with α − Cut
t1 t2 t3 t4 t5 t6
C1 0.3 0.7 0.5
SALARY C2 0.5 0.6 0.4 0.5 0.5
C3 0.4 0.5 0.5 -
C4 0.5 0.7 0.6 0.5
AGE C5 0.5 0.6 0.8 0.5
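As an aside, the similarity measure of Section 2.2 can be sketched in a few lines. The snippet below is only an illustration under common assumptions (min/max as the fuzzy intersection/union operators and the sigma-count as fuzzy cardinality); it is not code from the paper, and the membership vectors in the example are illustrative values, not taken from the tables above.

```python
import numpy as np

def fuzzy_similarity(phi_a1, phi_a2):
    """Similarity of two fuzzy sets given as membership vectors over the same
    objects: |intersection| / |union| with min/max operators and sigma-count."""
    inter = np.minimum(phi_a1, phi_a2).sum()
    union = np.maximum(phi_a1, phi_a2).sum()
    return inter / union if union > 0 else 0.0

# illustrative memberships of the tuples t1..t6 in two hypothetical clusters
c_a = np.array([0.5, 0.6, 0.4, 0.5, 0.5, 0.0])
c_b = np.array([0.5, 0.6, 0.0, 0.8, 0.4, 0.5])
print(fuzzy_similarity(c_a, c_b))
```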
3 Problems and Contributions
The traditional algorithms try to traverse the decision tree, the FCA lattice or one of its extensions to extract the association rules. In this case, researchers always focus on producing an optimal set of rules that faithfully models the initial data set, after a data cleansing step and the elimination of invalid-value elements.
Fig. 1. Salary TAH
Fig. 2. Age TAH
In our view, the limits of these approaches lie in extracting the set of rules directly from the data or from a data variety such as the frequent itemsets or the frequent closed itemsets, which may be huge. We thus note the following limits: (1) the rules generated from these data are generally redundant; (2) these algorithms generate a very large number of rules, often thousands, that the human brain cannot assimilate; (3) generally, the goal of extracting a set of rules is to help the user to give semantics to the data and to optimize the search for information, and this fundamental constraint is not taken into account by these approaches. To cure all these problems, we propose a new approach for knowledge extraction using conceptual clustering, fuzzy logic, and FCA.
4 Notations Related to Our KDD Model
In this section, we present the notations related to fuzzy conceptual scaling and some new concepts for our approach.

Definition. A Fuzzy Cluster Lattice (FCL) of a fuzzy formal concept lattice is a fuzzy concept lattice in which each equivalence class (i.e., a node of the lattice) contains only the intentional description (intent) of the associated fuzzy formal concept. In this case we abstract away the list of objects with their degrees of membership in the clusters. The nodes of an FCL are sets of clusters ordered by the inclusion relation.

Definitions.
- A level L of an FCL is the set of nodes of the FCL having cardinality equal to L.
- A knowledge level is an abstraction level regarded as a level in the generated FCL.

Definition. Let I = {C1, C2, …, Cp, Cq, …, Cn} be n clusters generated by a classification algorithm. A fuzzy association meta-rule (called a meta-rule) is an implication of the form R: I1 => I2 (CF), where I1 = {C1, C2, …, Cp} and I2 = {Cq, …, Cn}. I1 and I2 are called, respectively, the premise part and the conclusion part of the meta-rule R. The value CF lies in ]0..1] and is called the Confidence Factor of this rule; it indicates the relative degree of importance of the meta-rule. R is interpreted as follows: if an object belongs to the cluster C1 ∩ C2 ∩ … ∩ Cp, then this object can also belong to the cluster Cq ∩ … ∩ Cn with a probability equal to CF. Note that classical (or crisp) association meta-rules can be defined as a special case of fuzzy association meta-rules: when CF = 1, a fuzzy association meta-rule is equivalent to a classical one.

Example. Let R: C1 => C2 (60%). This means that any object belonging to cluster C1 can also belong to cluster C2 with a probability equal to 60%.

Definition. Let A1, A2, …, Ap, Aq, …, An be n attributes having respectively {l11, l12, …, l1m}, {l21, l22, …, l2m}, …, {lp1, lp2, …, lpm}, {lq1, lq2, …, lqm}, …, {ln1, ln2, …, lnm} as linguistic labels. A fuzzy association rule (or rule) is an implication of the form r: I1 => I2 (CF), where I1 = {A1(l1), A2(l2), …, Ap(lp)} and I2 = {Aq(lq), …, An(ln)}. Ai(li) models the attribute Ai having the linguistic label li. I1 and I2 are called, respectively, the premise part and the conclusion part of the fuzzy rule r. The value CF lies in ]0..1] and is called the Confidence Factor of this rule.

Definition. We define Meta Knowledge (resp. Knowledge) as a set of fuzzy association meta-rules (resp. rules). We define level i of the Meta Knowledge (resp. Knowledge) as the set of fuzzy association meta-rules (resp. rules) over all objects verifying i properties.

Proposition (rewriting a meta-rule). Let C1 = {A1, A2, …, An} and C2 = {B1, …, Bm} be two sets of clusters. The fuzzy association meta-rule R: A1, …, An => B1, …, Bm (CF) is equivalent to R1 defined as follows: R1: A1, …, An => C1, …, Cq (CF), such that {C1, …, Cq} = C2 \ C1.
Proposition (generation rule). Let C1 = {A1, …, An}, C2 = {B1, …, Bn} and C3 = {D1, …, Dn} be three sets of clusters, and let R1 and R2 be two meta-rules defined as follows: R1: A1, …, An => B1, …, Bn (d1) and R2: B1, …, Bn => D1, …, Dn (d2). Then we can deduce the meta-rule R3: A1, …, An => D1, …, Dn (d3), such that d3 = d2(d1) = d2 * d1.
Example. From the two meta-rules R1: C2 => C2, C4 (60%) and R2: C2, C4 => C2, C3, C4 (53%), we can deduce R3: C2 => C2, C3, C4 (31%), which can be rewritten as C2 => C3, C4 (31%).
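The two propositions above amount to straightforward set operations and a product of confidence factors. The following Python sketch is only an illustration of this bookkeeping; representing a rule as a pair of frozensets plus a confidence factor is an assumption, not the paper's implementation.

```python
def rewrite(rule):
    """Rewriting proposition: drop from the conclusion the clusters already
    present in the premise."""
    premise, conclusion, cf = rule
    return premise, conclusion - premise, cf

def compose(r1, r2):
    """Generation proposition: from premise1 => concl1 (d1) and
    concl1 => concl2 (d2), deduce premise1 => concl2 (d1 * d2)."""
    p1, c1, d1 = r1
    p2, c2, d2 = r2
    assert c1 == p2, "the conclusion of r1 must be the premise of r2"
    return p1, c2, d1 * d2

# the example from the text: C2 => C2,C4 (0.60) and C2,C4 => C2,C3,C4 (0.53)
r1 = (frozenset({"C2"}), frozenset({"C2", "C4"}), 0.60)
r2 = (frozenset({"C2", "C4"}), frozenset({"C2", "C3", "C4"}), 0.53)
r3 = rewrite(compose(r1, r2))
print(r3)   # ({'C2'}, {'C3', 'C4'}, 0.318) -- about 31%, as in the example
```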
5 KDD Model Description
In this section, we present the architecture of the KDD model and the process for discovering and exploiting knowledge. The architecture of the KDD model is presented in Figure 3. It consists of three steps: the first step organizes the data, the second aims at extracting the knowledge, and the third defines an ES allowing the end-user to easily exploit all the knowledge generated. In the following, we detail these different steps.
Fig. 3. KDD Model
5.1 Data Organization Step
This step yields a certain number of clusters for each attribute. Each tuple has values in the interval [0,1] representing its membership degrees with respect to the formed clusters. Linguistic labels, which are fuzzy partitions, are attributed to the attribute's domain. This step consists of generating the TAHs and the MTAH of the relevant attributes.
This step is very important in the KDD process because it allows the distribution of objects in the various clusters to be defined and interpreted.
Example: Consider the relational database table presented in Table 1, containing the AGE and SALARY of employees; Table 2 presents the results of the fuzzy clustering applied to the Age and Salary attributes. The minimal (resp. maximal) value of each cluster corresponds to the lower (resp. upper) bound of the interval of its values. Each cluster of a partition is labeled with a linguistic label provided by the user or a domain expert. For example, the fuzzy labels Young and Adult could belong to a partition built over the domain of the attribute AGE; likewise, the fuzzy labels Low, Medium and High could belong to a partition built over the domain of the attribute Salary. Table 4 presents the correspondence between the linguistic labels and their designations for the attributes Salary and Age. The corresponding fuzzy concept lattices of the fuzzy context presented in Table 5, called TAHs, are given by the line diagrams presented in Figures 1 and 2. This very simple sorting procedure gives us, for each many-valued attribute, the distribution of the objects in the line diagram of the chosen fuzzy scale. Usually, we are interested in the interaction between two or more fuzzy many-valued attributes. This interaction can be visualized using so-called fuzzy nested line diagrams, which are used for visualizing larger fuzzy concept lattices and for combining fuzzy conceptual scales on-line. Figure 4 shows the fuzzy nested lattice constructed from Figures 1 and 2.

Table 4. Correspondence of the linguistic labels and their designations

Attribute   Linguistic label   Designation
Salary      Low                C1
Salary      Medium             C2
Salary      High               C3
Age         Young              C4
Age         Adult              C5
Table 5. Fuzzy Conceptual Scales for age and Salary attributes with α − Cut
t1 t2 t3 t4 t5 t6
SALARY Low Medium 0.5 0.3 0.6 0.7 0.4 0.5 0.5 0.5
Fig. 4. Fuzzy Lattice: MTAH
High 0.4 0.5 0.5 -
AGE Young 0.5 0.7 0.6 0.5
Adult 0.5 0.6 0.8 0.5
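The per-attribute fuzzy partitioning used in this data organization step can be sketched as follows. The snippet implements the standard Fuzzy C-Means update equations from scratch; it is only a hedged illustration (the fuzzifier value, iteration count and the application to the SALARY column are assumptions, not choices made by the paper).

```python
import numpy as np

def fuzzy_c_means(values, c, m=2.0, iters=100, seed=0):
    """Cluster a 1-D attribute into c fuzzy clusters; returns (centers, U)
    where U[k, i] is the membership of tuple i in cluster k."""
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float)
    centers = rng.choice(x, size=c, replace=False)
    for _ in range(iters):
        dist = np.abs(x[None, :] - centers[:, None]) + 1e-12
        # membership update: u_ki = 1 / sum_j (d_ki / d_ji)^(2/(m-1))
        ratio = (dist[:, None, :] / dist[None, :, :]) ** (2.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=1)
        # center update: weighted mean with weights u^m
        w = U ** m
        centers = (w * x[None, :]).sum(axis=1) / w.sum(axis=1)
    return centers, U

# e.g. three fuzzy clusters over the SALARY column of Table 1
salary = [800, 600, 400, 900, 1000, 500]
centers, U = fuzzy_c_means(salary, c=3)
alpha = 0.3                              # alpha-cut used for Salary above
scaled = np.where(U >= alpha, U, 0.0)    # memberships below the cut are dropped
```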
5.2 Discovering Knowledge Step
This step performs the extraction of knowledge. It consists of deducing the fuzzy cluster lattice (FCL) corresponding to the MTAH lattice generated in the first step, then traversing this lattice to extract the Meta Knowledge (the set of fuzzy association meta-rules on the clusters), and finally deducing the rules modeling the Knowledge (the set of fuzzy association rules on the attributes). This set is denoted by SFR.
Example: From the fuzzy lattice obtained in the first step (Figure 4), we can draw the corresponding FCL. As shown in Figure 5, we obtain a more reduced lattice that is simpler to traverse and to store. Considering the FCL in Figure 5, we can generate the following levels. Level 0 and Level 6 are the root and the leaves of the FCL. Level 1 corresponds to the nodes {C1}, {C5}, {C2}, {C4}. Generally, Level i corresponds to the nodes having i clusters. This permits the identification of all the existing overlaps between i clusters, and it allows knowledge discovery on all objects belonging to the intersection of these i clusters. Thus, the derivation of fuzzy association meta-rules can be performed straightforwardly. Indeed, the meta-rules represent "inter-node" implications, annotated with the CF, between two adjacent comparable equivalence classes, i.e., from a set of clusters to another set of clusters immediately covering it. The confidence factor is equal to the weight of the arc binding the two nodes. Such an implication involves two comparable equivalence classes, i.e., a set of clusters and another set of clusters including it in the partial order structure.
Fig. 5. The FCL
Example: The meta-rule C5 ⇒ C2, C5 (0.83) is generated from the two equivalence classes whose respective nodes are the clusters {C5} and {C2, C5}, which are at distance d = 0.83. The algorithm for discovering fuzzy association meta-rules traverses the search space (the FCL) level by level to determine the fuzzy meta-rule set (FMRS). As input it takes the lattice of clusters (the FCL) and returns, as output, the list of all the fuzzy meta-rules generated (the FMRS). It works as follows: for each non-empty node ∈ FCL in
descending order, it generates all meta-rules with one cluster in the conclusion (level 1). Then it generates the set of all meta-rules with two clusters in the conclusion. The same process is applied to generate conclusions with four clusters, and so on until conclusions with n clusters have been generated. Note that the FMRS set does not contain any redundant rule; this is because, from one level of the lattice to another, the nodes are necessarily distinct (by the very definition of a lattice level). From the FMRS set we can easily deduce the rules modeling the Knowledge (SFR); it suffices to use Table 4, which gives the correspondence between the linguistic labels and their designations for the attributes Salary and Age.
Example: The meta-rule C5 => C2 (83%) is transformed into Age(Adult) => Salary(Medium) (83%).
5.3 Exploiting Knowledge Step
In this section, we propose the definition of an ES, called ES-DM, allowing the user to easily exploit the knowledge discovered in the second step. This ES has been designed as an intelligent tool for helping the user to give semantics to the data, to seek information in these data and to satisfy his needs. It is described by the following points:
− ES architecture. The general architecture of ES-DM is shown in Figure 3. Like most expert systems, ES-DM is composed of a Knowledge Base (KB), an Inference Engine (IE) and two interfaces: a knowledge acquisition interface and a user interface.
− Definitions. The KB of ES-DM is equal to the set of rules SFR generated in the second step. The IE and the two interfaces of ES-DM are assumed to be provided by the expert system shell used to implement the ES.
− Definition of the user interface. According to the user's needs, we adopted four dialog strategies: (1) First strategy: the user specifies the number i of properties he wants to check, and the ES displays the list of objects satisfying i properties. (2) Second strategy: the user defines a property p, and the ES displays the list of properties that an object must satisfy in order to satisfy the property p. (3) Third strategy: the user defines a property p that is already satisfied by an object, and the ES displays the list of properties that an object can satisfy given that it satisfies this property. (4) Fourth strategy: the user can ask the ES any query.
6 Advantages and Validation of the Proposed Approach
The proposed approach provides several advantages. (1) The definition of the Meta Knowledge concept: this definition is, in our opinion, very important, since the number of rules generated is smaller. Moreover, the concept of Meta Knowledge makes it possible to have a global view of a very voluminous data set; it models a certain abstraction of the data, which is fundamental when the amount of data is enormous. In this case, we define the set of meta-rules between the clusters, from which the association rules between the data can be generated automatically if more detail is wanted. (2) The definition of the ES: this system has been designed as an intelligent tool for helping the user to give semantics to the data, to seek information in these data and to
satisfy his needs. (3) Extensibility of the proposed approach: our approach can be applied with any fuzzy classification algorithm for classifying the initial data, and with any inference engine for the ES.
The comparison with the existing approaches can be made on two levels. (1) The maximum number of rules generated by our approach is

Σ_{i=0..N−1} ( C(N, i) * C(N, i+1) ),

where N is the number of clusters, given as input by the user in the classification step; thus it is independent of the size of the data. (2) The existing algorithms do not take into account any semantics of the data. Researchers have focused on reducing the set of rules, by proposing the concept of metadata or methods for visualizing these rules. The principle of our approach is to propose an extraction of knowledge based on the semantic ties of the data, which is in our opinion more interesting than the existing approaches based on the form (syntax) of the objects. To validate the proposed approach, we chose: 1) the FCM (Fuzzy C-Means) algorithm for the fuzzy classification of the data set, 2) the Ganter algorithm for the construction of the lattice, and 3) the expert system shell JESS to implement the ES.
7 Conclusion
Since the essential goal of knowledge extraction is to help the user to seek information in a data set and to satisfy his needs, in this paper we proposed a pragmatic solution to this problem, which consists in defining a new KDD model. It consists of three steps. The first organizes the database records into homogeneous clusters with common properties, which permits the data's semantics to be deduced; this step consists of generating the TAHs and the MTAH of the relevant attributes. The second step discovers the knowledge: it deduces the fuzzy cluster lattice corresponding to the MTAH lattice generated in the first step, then traverses this lattice to extract the Meta Knowledge (the set of fuzzy association meta-rules on the clusters), and finally deduces the rules modeling the Knowledge (the set of fuzzy association rules on the attributes, SFR). The third step defines an ES allowing the end-user to easily exploit all the knowledge generated; the Knowledge Base of this ES is equal to the SFR generated in the second step. This solution is extensible: the user can choose the fuzzy clustering method according to the domain of his data and his needs, or the inference engine for the ES. In the future, we propose to define an incremental method that allows the Knowledge Base generated by our model to be updated according to the modifications carried out in the initial database.
References 1. Goebel, M., Gruenwald, L.: A Survey of Data Mining and Knowledge Discovery Software Tools. ACM SIGKDD 1(1), 20–33 (1999) 2. Zaki, M.: Mining Non-Redundant Association Rules. Data Mining and Knowledge Discovery 9, 223–248 (2004) 3. Stumme, G., Taouil, R., Bastide, Y., Pasquier, N., Lakhal, L.: Intelligent structuring and reducing of association rules with formal concept analysis. In: Baader, F., Brewka, G., Eiter, T. (eds.) KI 2001. LNCS (LNAI), vol. 2174, pp. 335–350. Springer, Heidelberg (2001)
4. Pasquier, N.: Data Mining: Algorithmes d’Extraction et de Réduction des Règles d’Association dans les Bases de Données. Thèse, Département d’Informatique et Statistique, Faculté des Sciences Economiques et de Gestion, Lyon (2000) 5. Agrawal, R., Imielinski, T., Swami, A.: Mining Association Rules between Sets of Items in Large Databases. In: Proceedings of the ACM SIGMOD Intl. Conference on Management of Data, Washington, USA, June 1993, pp. 207–216 (1993) 6. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: Proceedings of the 20th Int’l Conference on Very Large Databases, June 1994, pp. 478–499 (1994) 7. Quan, T.T., Hui, S.C., Cao, T.H.: A Fuzzy FCA-based Approach to Conceptual Clustering for Automatic Generation of Concept Hierarchy on Uncertainty Data. CLA, pp. 1–12 (2004) 8. Grissa Touzi, A., Sassi, M., Ounelli, H.: An innovative contribution to flexible query through the fusion of conceptual clustering, fuzzy logic, and formal concept analysis. International Journal of Computers and Their Applications 16(4), 220–233 (2009) 9. Sun, H., Wang, S., Jiang, Q.: FCM-Based Model Selection Algorithms for Determining the Number of Clusters. Pattern Recognition 37, 2027–2037 (2004)
Author Index
Alanis, Alma Y. I-719 Alejo, R. I-303 Aoun, Mario Antoine I-33 Bai, Junqing I-365 Bai, Qinghai II-60 Bai, Weili II-448 Boralessa, Nilupa II-532 Busa- Fekete, R´ obert II-178 Cai, Qiao I-325 Cao, Jianting II-353 Cao, Jinde I-9, I-483 Cham, Wai-Kuen II-97 Chang, Yu-Teng II-606 Che, Xilong II-1 Chen, Chih-Ming II-439 Chen, Chuanbo I-436 Chen, Chunjie I-152 Chen, Dongyue II-90 Chen, Gang II-448 Chen, Guangyi II-178 Chen, Jie II-112 Chen, Pin-Cheng II-497 Chen, Qing I-169 Chen, Wen-Ching I-389 Chen, Xiaofeng I-603 Chen, Xinyu II-17 Chen, Yen-Wei II-162 Chen, Yonggang I-659 Chen, Yun I-238 Chen, Zong-You II-516 Cheng, Wei-Chen II-75 Cheng, Yong I-422 Choi, Jeoung-Nae I-215 Cichocki, Andrzej II-385 Cong, Fengyu II-385 Dai, Guiping I-58, II-112 Dai, Lengshi I-51 De Lathauwer, Lieven II-337 De Moor, Bart II-337 Deng, Feiqi I-493 Deng, Xiaolian I-643 Deng, Xiongbing II-276
Ding, Lixin I-199 Ding, Shifei I-319 Ding, Tao I-296 Ding, Wensi I-554, I-595 Duan, Lijuan II-128, II-240 Du, Jixiang II-112 Du, Tingsong I-118 Fang, Faming II-240 Fang, Lei I-161 Fang, Liang I-102, I-110 Feng, Jian I-504 Feng, Zengzhe I-102 Fiori, Simone I-185 Foruzan, Amir H. II-162 Franco, Leonardo I-86 Freeman, Walter J. I-51 Fu, Chaojin I-651 Fu, Xian I-651 Fukagawa, Daiji II-302 Gao, Daqi II-42 Gao, Jiaquan I-161 Gao, Pengyi I-436 Gao, Xieping I-347 Gao, Yun I-520 G´ omez, Iv´ an I-86 Gong, Dunwei I-288 Goonawardene, Nadee II-532 Grissa Touzi, Amel II-625 Guan, Li-He I-311 Guan, Zhengqiang II-222 Gu, Dawu II-556, II-576 Gu, Mei II-322 Guo, Chengan II-184 Guo, Ping II-17, II-33 Guo, Qianjin II-507 Guo, Shengbo I-396 Guo, Zhishan I-77 Gu, Zhenghui I-347 Han, Han, Han, Han,
Feng-Qing I-311 Min I-413, I-450, I-465 Peng II-90 Pu II-472
638
Author Index
Han, Seung-Soo II-464 Han, Zhen II-120 Hassan, Wan H. II-540 He, Guixia I-161 He, Guoping I-102 He, Haibo I-325 He, Hanlin I-542 He, Ji II-312 He, Qing I-404 He, Xingui I-280 He, Yong I-144 Herawan, Tutut I-473, II-596 Honda, Takeru I-67 Hong, Chin-Ming II-439 Hong, Kan II-360 Hong, Qun I-554 Hong, Sang-Jeen II-464 Hou, Huijing I-126 Hou, Yuexian II-282 Hou, YunBing II-432 Houllier, Maxime I-355 Hu, Guosheng II-1, II-60 Hu, Liang II-1, II-60 Hu, Ruimin II-120 Hu, Yingsong I-436 Huai, Wenjun II-112 Huang, Baohai II-472 Huang, ChuanHe II-481 Huang, LinLin II-208 Huang, Longwen I-1 Huang, Minzhang II-250 Huang, Wei I-199, I-207 Huang, Yinghu II-416 Huang, Zhenkun I-627 Huttunen-Scott, Tiina II-385 Huyen, Nguyen Thi Bich II-192 Janssens, Frizo II-337 Jeon, Sung-Ik II-464 Jerez, Jos´e M. I-86 Ji, Guoli I-17 Jia, Jia I-528 Jia, Weikuan I-319 Jiang, Chuntao I-611 Jiang, Feng I-577 Jiang, Hong I-169 Jiang, Minghui I-611, I-635 Jian, Jigui I-643, I-667 Jie, Song Guo II-408 Jin, Ming I-339
Jin, Shangzhu II-222 Jin, Shen I-331 Kalyakin, Igor II-385 K´egl, Bal´ azs II-178 Kim, Hyun-Ki I-177, I-215, I-246 Kim, Seung-Gyun II-464 Kim, Wook-Dong I-207 Kim, Young-Hoon I-246 Kong, Xianming I-110, II-83 Kriksciuniene, Dalia II-455 Kuan, Ta-Wen II-524 Kuang, Shifang I-493 Lai, Tsung-Chih II-615 Lan, Chengdong II-120 Lee, Chung-Hong II-292 Lee, Tsu-Tian II-497 Lei, Han II-408 Liang, Chuanwei II-276 Liang, Lishi I-554 Liang, Pei-Ji I-44 Liang, Zhi-ping I-465 Liao, Wudai I-193 Li, Bing I-561 Li, Dan I-436 Li, De-cai I-413 Li, Demin I-152 Li, Gang II-200 Li, Guang I-51, I-58, I-63 Li, Haiming II-556, II-576 Li, Han I-110 Li, Hao I-365 Li, Hongwei II-1, II-60 Li, Hui II-392 Li, Jianwu II-25 Li, Jie II-360 Li, Jun I-745 Li, Junhua II-360 Li, Na II-208 Li, Pengchao II-1 Li, Po-Hsun II-426 Li, Qingbo I-569 Li, Qingshun Quinn I-17 Li, Shengrong I-611 Li, Wei I-223, I-595, I-711 Li, Xiaoli II-507 Li, Xiaolin I-528 Li, Xue II-276 Li, Xuemei I-493
Author Index Li, Yan I-444, II-472 Li, Yanling II-200 Li, Yan-Ming I-457 Li, Yingjie II-392 Li, Yuanqing I-347 Li, Yuanxiang I-272 Li, Yu-qin II-330 Li, Zhanchao I-126 Li, Zhenxiao II-152 Li, Zhouhong I-585 Ling, Mee H. II-540 Lin, Ping-Zing II-497 Liou, Cheng-Yuan II-75 Liu, Chao-Chun II-68 Liu, Cheng-Liang I-457 Liu, Jian II-258 Liu, Jiangrong I-577 Liu, Jiao I-254 Liu, Jie II-276 Liu, Jinbao I-296 Liu, Kun I-280 Liu, Shuai-shi II-144 Liu, Taian II-83 Liu, Xiaolin I-548 Liu, Xinhai II-337 Liu, Yankui I-373 Liu, Zhenwei I-512 Liu, Zhigang II-448 Liu, Zhilei II-104 Lo, Chih-Yao II-606 Loukianov, Alexander G. I-719 Lu, Bao-Liang II-250 Lu, Chi-Jie II-426 Lu, Tao II-120 Lu, Yao II-25 Luo, Ding II-120 Luo, Fei I-331 Luo, Siwei I-444, II-136, II-170 Luo, Wenjuan I-404 Luo, Yuan I-355 Luo, Zhiyuan I-63 Lv, Guangjun II-33 Lv, Xue-qiang II-330 Lyytinen, Heikki II-385 Ma, Bingpeng II-240 Majewski, Maciej II-268 Ma, Jicai II-128 Man, Hong I-325 Mao, Wenqi II-392
Mao, Wentao I-365 Masada, Tomonari II-302 Mat Deris, Mustafa I-473, II-596 Ma, Xiaoyan I-110 Memon, Zulfiqar A. II-586 Meng, Zhaopeng II-282 Miao, Jun II-128, II-240 Miao, Yuping I-58 Mohd Rose, Ahmad Nazari I-473 Mu, Dayun I-450 Mu, Xuewen I-95 Murphey, Yi Lu I-430 Neruda, Roman I-534 Nie, Xiaobing I-483 Nishino, Tetsuro I-67 Oguri, Kiyoshi II-302 Oh, Sung-Kwun I-177, I-199, I-207, I-215, I-246 Pan, Quanxiang I-659 Pan, Zhisong II-42 Park, Dong-Chul II-192 Park, Ho-Sung I-177 Pasero, Eros II-566 Peng, Jun II-222 Perez-Cisneros, Marco A. I-719 Phan, Anh-Huy II-385 Premaratne, Lalith II-532 Qiao, Yuanhua II-240 Qing, Xie Kun II-408 Qin, Tiheng I-659 Qiu, Jianlong I-325 Qiu, Yi I-58 Raimondo, Giovanni II-566 Ren, Jie I-288 Ristaniemi, Tapani II-385 Rose, Ahmad Nazari Mohd II-596 Rud, Samuel II-52 Ruffa, Suela II-566 Sakalauskas, Virgilijus II-455 Sanchez, Edgar N. I-719 Sang, Nong II-214 Sanner, Scott I-396 Shang, Chun II-282 Shang, Li II-112
Shao, Yuehjen E. II-426
Shao, Yuxiang I-169
Shi, Qiwei II-353
Shi, Shui-cai II-330
Shi, Weiya II-9
Shi, Zhongzhi I-404
Shibata, Yuichiro II-302
Shih, Po-Yi II-516
Song, Jing II-1
Song, Qiankun I-561, I-603, I-619
Sotoca, J.M. I-303
Stuart, Keith Douglas II-268
Subashini, Shashikala II-532
Subirats, José L. I-86
Sun, Bin II-214
Sun, Chen-zhi I-136
Sun, Hongying I-711
Sun, Ta-Wei II-524
Sun, Wei II-136, II-170
Sun, Xiaoyan I-288
Takasu, Atsuhiro II-302
Tanaka, Shigeru I-67
Tanaka, Toshihisa II-353
Tang, Akaysha II-368
Tang, Yingying II-392
Tan, Qianrong I-745
Tan, Ying I-280
Tao, Cailin II-400
Tao, Kaiyu II-346
Tian, Xin I-27
Tian, Yan-tao II-144
Tong, Qiang I-422
Toribio, P. I-303
Trelis, Ana Botella II-268
Treur, Jan II-586
Tseng, Lin-Yu I-389
Tsubokawa, Katsumi II-162
Tu, Jianjun I-542
Tu, Zhengwen I-635, I-643, I-667
Valdovinos, R.M. I-303
Vidnerová, Petra I-534
Wang, Baoxian I-667
Wang, Chao II-184
Wang, Cheng I-675
Wang, Chi-Hsu II-497
Wang, Dingwei I-230
Wang, Dongfeng II-472
Wang, Guanjun I-727
Wang, Guoyin II-416
Wang, Haili II-240
Wang, Jhing-Fa II-516, II-524
Wang, Jia-Ching II-524
Wang, Jiacun I-152
Wang, Jiangfeng I-193
Wang, Jiao I-444
Wang, Jijun II-392
Wang, Jun I-77
Wang, Junyan I-193
Wang, Kesheng I-238
Wang, Lei II-330
Wang, Nini I-381
Wang, Rubin II-353
Wang, Shangfei II-104
Wang, Shih-Hao II-292
Wang, Tong II-481
Wang, Xiaohong I-611
Wang, Xiaoqing I-373
Wang, Yong II-556, II-576
Wang, Yongli I-102
Wang, You I-51, I-58, I-63
Wang, Yu-Chiun II-426
Wang, Yuehuan II-214
Wang, Zhanshan I-504
Wang, Zhe II-42
Wang, Zheng-Xia I-311
Wan, Sunny II-548
Wen, Mi II-556, II-576
Woo, Dong-Min II-192
Wu, Ailong I-651
Wu, Charles Q. II-230
Wu, Chen-Feng II-606
Wu, Chengdong II-90
Wu, Dongqing I-711
Wu, Lina II-136, II-170
Wu, Si I-1
Wu, Weigen I-745
Wu, Xiaohui I-17
Wu, Yuanyuan I-569
Xiao, Hongfei I-296
Xiao, Lei I-44
Xiaoling, Ding I-331
Xiao, Min I-9
Xie, Yuling II-507
Xing, Peixu I-569
Xu, Bingxin II-33
Xue, Xin II-83
Xu, Jianping II-556, II-576
Xu, Xianyun I-254, I-520
Xu, Xiaohui I-693
Xu, Xinzheng I-319
Xu, Yao-qun I-136
Yamazaki, Tadashi I-67
Yan, Guirong I-365
Yan, Hong II-68
Yan, Zhen II-17
Yang, Chenxi I-585
Yang, Fengjian I-554, I-595, I-711
Yang, Hsin-Chang II-292
Yang, Hua I-577
Yang, Jianfu I-554, I-711
Yang, Jiann-Shiou II-52
Yang, Jianxi I-619
Yang, Jingli I-118
Yang, Juan II-432
Yang, Wei I-702
Yang, Yongqing I-254, I-520
Yang, Zhen II-128
Yang, Zhichun I-735
Yang, Zhiguo I-735
Yao, Jian II-97
Yeh, Ming-Feng I-262
Yi, Gang I-339
Yi, Hu I-27
Yi, Zhang II-378
Yin, Jianchuan I-381
Yin, Qian II-33
Yin, Xing I-745
Ying, Weiqin I-272
Yousuf, Aisha I-430
Yu, Fahong I-272
Yu, Xiao-Hua II-548
Yuan, Huilin I-230
Yuan, Jimin I-745
Yuan, Jin I-457
Yuan, Kun I-548
Zha, Xuan F. I-457
Zhang, Chaolong I-595
Zhang, Chunrui I-702
Zhang, Dexian II-9
Zhang, Huaguang I-504, I-512
Zhang, Jia-hai I-136
Zhang, Jiye I-684, I-693
Zhang, Jun I-347
Zhang, Kui I-436
Zhang, Liqing II-152, II-360
Zhang, Ming II-489
Zhang, ShiLin II-322
Zhang, Shuo II-282
Zhang, Ting I-635
Zhang, Wei I-110, II-83, II-97
Zhang, Wei-Feng II-68
Zhang, Weihua I-693
Zhang, Xiaoming II-400
Zhang, Yaling I-95
Zhang, Ying-Ying I-44
Zhao, Guangyu II-60
Zhao, Hai II-250
Zhao, Kaihong I-585
Zhao, Yanhong I-520
Zheng, Bochuan II-378
Zheng, Dongjian I-126
Zheng, Jianti I-17
Zheng, Qingqing II-214
Zheng, Weifan I-684
Zheng, Yuanzhe I-63
Zhong, Jiang II-276
Zhou, Jianting I-619
Zhou, Jie I-152
Zhou, Renlai II-400
Zhou, Wei II-353
Zhou, Yafei I-561, I-603
Zhu, Haigang I-51
Zhu, Hanhong I-238
Zhu, Hong I-319
Zhu, Wei-ping II-178
Zhuang, Fuzhen I-404
Zou, Ling II-400