Yoel Tenne and Chi-Keong Goh (Eds.) Computational Intelligence in Optimization
Adaptation, Learning, and Optimization, Volume 7
Series Editor-in-Chief
Meng-Hiot Lim, Nanyang Technological University, Singapore
Yew-Soon Ong, Nanyang Technological University, Singapore
Further volumes of this series can be found on our homepage: springer.com
Vol. 1. Jingqiao Zhang and Arthur C. Sanderson, Adaptive Differential Evolution, 2009, ISBN 978-3-642-01526-7
Vol. 2. Yoel Tenne and Chi-Keong Goh (Eds.), Computational Intelligence in Expensive Optimization Problems, 2010, ISBN 978-3-642-10700-9
Vol. 3. Ying-ping Chen (Ed.), Exploitation of Linkage Learning in Evolutionary Algorithms, 2010, ISBN 978-3-642-12833-2
Vol. 4. Anyong Qing and Ching Kwang Lee, Differential Evolution in Electromagnetics, 2010, ISBN 978-3-642-12868-4
Vol. 5. Ruhul A. Sarker and Tapabrata Ray (Eds.), Agent-Based Evolutionary Search, 2010, ISBN 978-3-642-13424-1
Vol. 6. John Seiffertt and Donald C. Wunsch, Unified Computational Intelligence for Complex Systems, 2010, ISBN 978-3-642-03179-3
Vol. 7. Yoel Tenne and Chi-Keong Goh (Eds.), Computational Intelligence in Optimization, 2010, ISBN 978-3-642-12774-8
Yoel Tenne and Chi-Keong Goh (Eds.)
Computational Intelligence in Optimization Applications and Implementations
Dr. Yoel Tenne
Department of Mechanical Engineering and Science-Faculty of Engineering, Kyoto University, Yoshida-honmachi, Sakyo-Ku, Kyoto 606-8501, Japan
Formerly: School of Aerospace Mechanical and Mechatronic Engineering, Sydney University, NSW 2006, Australia
Dr. Chi-Keong Goh
Advanced Technology Centre, Rolls-Royce Singapore Pte Ltd, 50 Nanyang Avenue, Block N2, Level B3C, Unit 05-08, Singapore 639798
ISBN 978-3-642-12774-8
e-ISBN 978-3-642-12775-5
DOI 10.1007/978-3-642-12775-5
Adaptation, Learning, and Optimization ISSN 1867-4534
Library of Congress Control Number: 2010926028
© 2010 Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India. Printed on acid-free paper 987654321 springer.com
To our families for their love and support.
Preface
Optimization is an integral part of science and engineering. Most real-world applications involve complex optimization processes which are difficult to solve without advanced computational tools. With the increasing challenges of fulfilling the optimization goals of current applications, there is a strong drive to advance the development of efficient optimizers. The challenges introduced by emerging problems include:
• objective functions which are prohibitively expensive to evaluate, so typically only a small number of objective function evaluations can be made during the entire search,
• objective functions which are highly multimodal or discontinuous, and
• non-stationary problems which may change in time (dynamic).
Classical optimizers may perform poorly, or may even fail to produce any improvement over the starting vector, in the face of such challenges. This has motivated researchers to explore the use of computational intelligence (CI) to augment classical methods in tackling such challenging problems. Such methods include a) population-based search methods such as evolutionary algorithms and particle swarm optimization and b) non-linear mapping and knowledge embedding approaches such as artificial neural networks and fuzzy logic, to name a few. Such approaches have been shown to perform well in challenging settings. Specifically, CI methods are powerful tools which offer several potential benefits such as: a) robustness (they impose little or no requirements on the objective function), b) versatility (they handle highly non-linear mappings), c) self-adaptation to improve performance and d) operation in parallel (making it easy to decompose complex tasks). However, the successful application of CI methods to real-world problems is not straightforward and requires both expert knowledge and trial-and-error experiments.
As such the goal of this volume is to survey a wide range of studies where CI has been successfully applied to challenging real-world optimization problems, while highlighting the
insights researchers have obtained. Broadly, the studies in this volume focus on four main disciplines: continuous optimization, classification, scheduling and hardware implementations.
For continuous optimization, Neto et al. study the use of artificial neural networks (ANNs) and heuristic rules for solving large-scale optimization problems. They focus on a recurrent ANN to solve a quadratic programming problem and propose several techniques to accelerate the convergence of the algorithm. Their method is more efficient than one using an ANN only. Starzyk et al. propose a direct-search optimization algorithm which uses reinforcement learning, resulting in an algorithm which ‘learns’ the best path during the search. The algorithm weights past steps based on their success to yield a new candidate search step. They benchmark their algorithm with several mathematical test functions and apply it to the training of a multi-layer perceptron neural network for image recognition. Ventresca et al. use the opposition sampling approach to decrease the number of function evaluations. The approach attempts to sample the function in a subspace generated by the ‘opposites’ of an existing population of candidates. They apply their method to differential evolution and population-based incremental learning and show that the opposition method improves performance over baseline variants. Bazan studies an optimization algorithm for problems where the objective function requires large computational resources. His proposed algorithm uses locally regularized approximations of the objective function based on radial basis functions. He provides convergence proofs and formulates a framework which can be applied to other algorithms such as the Gauss-Seidel or conjugate directions methods. Ruiz-Torrubiano et al. study hybrid methods for solving large-scale optimization problems with cardinality constraints, a class of problems arising in diverse areas such as finance, machine learning and statistical data analysis.
While existing methods (such as branch-and-bound) can provide exact solutions, they require large resources. As such, the study focuses on methods which can efficiently identify approximate solutions while requiring far fewer computing resources. For problems where it is expensive to evaluate the objective function, Jayadeva et al. propose using a support-vector machine to predict the location of yet undiscovered optima. Their framework can be applied to problems where little or no a priori information is available on the objective function, as the algorithm ‘learns’ during the search process. Benchmarks show their method can outperform existing methods such as particle swarm optimization or genetic algorithms. Voutchkov and Keane study multi-objective optimization problems using surrogate models. They investigate how to efficiently update the surrogates under a small optimization ‘budget’ and compare different updating strategies. They also show that using a number of surrogate models can improve the optimization search and that the size of the ensemble should increase with the problem dimension.
Others study agent-based algorithms, that is, algorithms in which the optimization is done by agents which co-operate during the search. Dreżewski and Siwik review agent-based co-evolutionary algorithms for multi-objective problems.
Such algorithms combine co-evolution (multiple species) with the agent approach (interaction). They review and compare existing methods and benchmark them over a range of test problems. Results show the agent-based co-evolutionary algorithms can perform as well as, and even surpass, some of the best existing multi-objective evolutionary algorithms. Salhi and Töreyen propose a multi-agent algorithm based on game theory. Their framework uses multiple solvers (agents) which compete over available resources, and their algorithm identifies the most successful solver. In the spirit of game theory, successful solvers are rewarded with more computing resources, while unsuccessful solvers have theirs reduced. Test results show the framework provides a better final solution than a single solver.
For applications in classification, Arana-Daniel et al. use Clifford algebra to generalize support vector machines (SVMs) for classification (with an extension to regression). They represent input data as a multivector and use a single Clifford kernel for multi-class problems. This approach significantly reduces the computational complexity involved in training the SVM. Tests using real-world applications in signal processing and computer vision show the merit of their approach. Luukka and Lampinen propose a classification method which uses principal component analysis to pre-process the data, followed by optimization of the classifier parameters using a differential evolution algorithm. Specifically, they optimize the class vectors used by the classifier and the power of the distance metric. Test results using real-world data sets show the proposed approach performs as well as or better than some of the best existing classifiers. Lastly in this category, Zhang et al. study the problem of feature selection in high-dimensional problems. They focus on the GA-SVM approach, where a genetic algorithm (GA) optimizes the parameters of the SVM (the GA uses the SVM output as the objective values).
This approach requires large computational resources, which makes it difficult to apply to large or high-dimensional data sets. As such, they propose several measures, such as parallelization, neighbour search and caching, to accelerate the search. Test results show their approach can reduce the computational cost of training an SVM classifier.
Two studies focus on difficult scheduling problems. First, Pieters studies the problem of railway timetable design, which is an NP-hard problem with additional challenging features such as being reactive and dynamic. He studies solving the problem with Symbiotic Networks, a class of neural networks inspired by the symbiosis phenomenon in nature, in which the network uses ‘agents’ to adapt itself to the problem. Test results show the Symbiotic Network can successfully handle the complex scheduling problem. Next, Srivastava et al. propose an approach combining evolutionary algorithms, neural networks and fuzzy logic to solve multiobjective time-cost trade-off problems. They consider a range of such problems including nonlinear time-cost relationships, constrained resources and project uncertainties. They show the merit of their approach by testing it on a real-world test case.
Lastly, for applications of CI to hardware implementations, Meher studies the use of systolic arrays for efficient implementation of artificial neural networks on VLSI and FPGA platforms for real-time applications. The chapter surveys various approaches and current achievements, as well as future directions such as mixed analog-digital neural networks. This is followed by Thangavelautham et al., who propose using coarse-coding techniques to evolve multi-robot controllers, aiming to evolve the controller and the sensor configurations simultaneously. To make the problem tractable, they use an Artificial Neural Tissue to exploit regularity in the sensor data. Test results show their approach outperforms a reference one.
Overall, the chapters in this volume address a spectrum of issues arising in the application of computational intelligence to difficult real-world optimization problems. The chapters discuss both the current accomplishments and the remaining open issues, and point to future research directions in the field.
September 2009
Yoel TENNE Chi-Keong GOH
Acknowledgement to Reviewers
We express our thanks to our fellow researchers who kindly reviewed chapters for this edited book and contributed their expertise. Their assistance has been invaluable to our endeavors.
B.V. Babu
Will Browne
Pedro M. S. Carvalho
Jia Chen
Sheng Chen
Tsung-Che Chiang
Siang-Yew Chong
Antonio Della Ciopa
Carlos A. Coello Coello
Marco Cococcioni
Claudio De Stefano
Antonio Gaspar-Cunha
Kyriakos C. Giannakoglou
David Ginsbourger
Frederico Guimarães
Martin Holena
Amitay Issacs
Jayadeva
Wee Tat Koo
Slawomier Koziel
Jouni Lampinen
Xiaodong Li
Dudy Lim
Pasi Luukka
Pramod Kumar Meher
Hirotaka Nakayama
Ferrante Neri
Thai Dung Nguyen
Alberto Ochoa
Yew-Soon Ong
Khaled Rasheed
Tapabrata Ray
Abdellah Salhi
Vui Ann Shim
Ofer M. Shir
Dimitri Solomatine
Sanjay Srivastava
Janusz Starzyk
Stephan Stilkerich
Haldun Süral
Mohammhed B. Trabia
Massimiliano Vasile
Lingfeng Wang
Chee How Wong
Contents
1 New Hybrid Intelligent Systems to Solve Linear and Quadratic Optimization Problems and Increase Guaranteed Optimal Convergence Speed of Recurrent ANN
Otoni Nóbrega Neto, Ronaldo R.B. de Aquino, Milde M.S. Lira
  1.1 Introduction
  1.2 Neural Network of Maa and Shanblatt: Two-Phase Optimization
  1.3 Hybrid Intelligent System Description
    1.3.1 Method of Tendency Based on the Dynamics in Space-Time (TDST)
    1.3.2 Method of Tendency Based on the Dynamics in State-Space (TDSS)
  1.4 Case Studies
    1.4.1 Case 1: Mathematical Linear Programming Problem – Four Variables
    1.4.2 Case 2: Mathematical Linear Programming Problem – Eleven Variables
    1.4.3 Case 3: Mathematical Quadratic Programming Problem – Three Variables
  1.5 Simulations
  1.6 Conclusion
  References
2 A Novel Optimization Algorithm Based on Reinforcement Learning
Janusz A. Starzyk, Yinyin Liu, Sebastian Batog
  2.1 Introduction
  2.2 Optimization Algorithm
    2.2.1 Basic Search Procedure
    2.2.2 Extracting Historical Information by Weighted Optimized Approximation
    2.2.3 Predicting New Step Sizes
    2.2.4 Stopping Criterion
    2.2.5 Optimization Algorithm
  2.3 Simulation and Discussion
    2.3.1 Finding Global Minimum of a Multi-variable Function
    2.3.2 Optimization of Weights in Multi-layer Perceptron Training
    2.3.3 Micro-saccade Optimization in Active Vision for Machine Intelligence
  2.4 Conclusions
  References
3 The Use of Opposition for Decreasing Function Evaluations in Population-Based Search
Mario Ventresca, Shahryar Rahnamayan, Hamid Reza Tizhoosh
  3.1 Introduction
  3.2 Theoretical Motivations
    3.2.1 Definitions and Notations
    3.2.2 Consequences of Opposition
    3.2.3 Lowering Function Evaluations
    3.2.4 Comparison to Existing Methods
  3.3 Algorithms
    3.3.1 Differential Evolution
    3.3.2 Opposition-Based Differential Evolution
    3.3.3 Population-Based Incremental Learning
    3.3.4 Oppositional Population-Based Incremental Learning
  3.4 Experimental Setup
    3.4.1 Evolutionary Image Thresholding
    3.4.2 Parameter Settings and Solution Representation
  3.5 Experimental Results
    3.5.1 ODE
    3.5.2 OPBIL
  3.6 Conclusion
  References
4 Search Procedure Exploiting Locally Regularized Objective Approximation: A Convergence Theorem for Direct Search Algorithms
Marek Bazan
  4.1 Introduction
  4.2 The Search Procedure
  4.3 Zangwill’s Method to Prove Convergence
  4.4 The Main Result
    4.4.1 Closedness of the Algorithmic Transformation
    4.4.2 A Perturbation in the Line Search
  4.5 The Radial Basis Approximation
    4.5.1 Detecting Dense Regions
    4.5.2 Regularization Training
    4.5.3 Choice of the Regularization Parameter λ Value
    4.5.4 Error Bounds for Radial Basis Approximation
  4.6 Numerical Results
    4.6.1 Test Problems
    4.6.2 Results
  4.7 Summary
  References
5 Optimization Problems with Cardinality Constraints
Rubén Ruiz-Torrubiano, Sergio García-Moratilla, Alberto Suárez
  5.1 Introduction
  5.2 Approximate Methods for the Solution of Optimization Problems with Cardinality Constraints
    5.2.1 Simulated Annealing
    5.2.2 Genetic Algorithms
    5.2.3 Estimation of Distribution Algorithms
  5.3 Benchmark Optimization Problems with Cardinality Constraints
    5.3.1 The Knapsack Problem
    5.3.2 Ensemble Pruning
    5.3.3 Portfolio Optimization with Cardinality Constraints
    5.3.4 Index Tracking by Partial Replication
    5.3.5 Sparse Principal Component Analysis
  5.4 Conclusions
  References
6 Learning Global Optimization through a Support Vector Machine Based Adaptive Multistart Strategy
Jayadeva, Sameena Shah, Suresh Chandra
  6.1 Introduction and Background Research
  6.2 Global Optimization with Support Vector Regression Based Adaptive Multistart (GOSAM)
  6.3 Experimental Results
    6.3.1 One Dimensional Wave Function
    6.3.2 Two Dimensional Case: Ackley’s Function
    6.3.3 Comparison with PSO and GA on Higher Dimensional Problems
  6.4 Extension to Constrained Optimization Problems
    6.4.1 Sequential Unconstrained Minimization Techniques
  6.5 Design Optimization Problems
    6.5.1 Sample and Hold Circuit
    6.5.2 Folded Cascode Amplifier
  6.6 Discussion
  6.7 Conclusion and Future Work
  References
7 Multi-objective Optimization Using Surrogates
Ivan Voutchkov, Andy Keane
  7.1 Introduction
  7.2 Surrogate Models for Optimization
  7.3 Multi-objective Optimization Using Surrogates
  7.4 Pareto Fronts - Challenges
  7.5 Response Surface Methods, Optimization Procedure and Test Functions
  7.6 Update Strategies and Related Parameters
  7.7 Test Functions
  7.8 Pareto Front Metrics
    7.8.1 Generational Distance ([3], pp.326)
    7.8.2 Spacing
    7.8.3 Spread
    7.8.4 Maximum Spread
  7.9 Results
    7.9.1 Understanding the Results
    7.9.2 Preliminary Calculations
    7.9.3 The Effect of the Update Strategy Selection
    7.9.4 The Effect of the Initial Design of Experiments
  7.10 Summary
  References
8 A Review of Agent-Based Co-Evolutionary Algorithms for Multi-Objective Optimization
Rafał Dreżewski, Leszek Siwik
  8.1 Introduction
  8.2 Model of Co-Evolutionary Multi-Agent System
    8.2.1 Co-Evolutionary Multi-Agent System
    8.2.2 Environment
    8.2.3 Species
    8.2.4 Sex
    8.2.5 Agent
  8.3 Co-Evolutionary Multi-Agent Systems for Multi-Objective Optimization
    8.3.1 Co-Evolutionary Multi-Agent System with Co-Operation Mechanism (CCoEMAS)
    8.3.2 Co-Evolutionary Multi-Agent System with Predator-Prey Interactions (PPCoEMAS)
  8.4 Experimental Results
    8.4.1 Test Suite, Performance Metric and State-of-the-Art Algorithms
    8.4.2 A Glance at Assessing Co-operation Based Approach (CCoEMAS)
    8.4.3 A Glance at Assessing Predator-Prey Based Approach (PPCoEMAS)
  8.5 Summary and Conclusions
  References
9 A Game Theory-Based Multi-Agent System for Expensive Optimisation Problems
Abdellah Salhi, Özgun Töreyen
  9.1 Introduction
  9.2 Background
    9.2.1 Optimisation
    9.2.2 Game Theory: The Iterated Prisoners’ Dilemma
    9.2.3 Multi-Agent Systems
  9.3 Constructing GTMAS
    9.3.1 GTMAS at Work: Illustration
  9.4 The GTMAS Algorithm
    9.4.1 Solver-Agents Decision Making Procedure
  9.5 Application of GTMAS to TSP
  9.6 Tests and Results
  9.7 Conclusion and Further Work
  References
10 Optimization with Clifford Support Vector Machines and Applications
N. Arana-Daniel, C. López-Franco, E. Bayro-Corrochano
  10.1 Introduction
  10.2 Geometric Algebra
    10.2.1 The Geometric Algebra of n-D Space
    10.2.2 The Geometric Algebra of 3-D Space
  10.3 Linear Clifford Support Vector Machines for Classification
  10.4 Non Linear Clifford Support Vector Machines for Classification
  10.5 Clifford SVM for Regression
  10.6 Recurrent Clifford SVM
  10.7 Applications
    10.7.1 3D Spiral: Nonlinear Classification Problem
    10.7.2 Object Recognition
    10.7.3 Multi-case Interpolation
    10.7.4 Experiments Using Recurrent CSVM
  10.8 Conclusions
  References
11 A Classification Method Based on Principal Component Analysis and Differential Evolution Algorithm Applied for Prediction Diagnosis from Clinical EMR Heart Data Sets
Pasi Luukka, Jouni Lampinen
  11.1 Introduction
  11.2 Heart Data Sets
  11.3 Classification Method
    11.3.1 Dimension Reduction Using Principal Component Analysis
    11.3.2 Classification Based on Differential Evolution
    11.3.3 Differential Evolution
  11.4 Classification Results
  11.5 Discussion and Conclusions
  References
12 An Integrated Approach to Speed Up GA-SVM Feature Selection Model
Tianyou Zhang, Xiuju Fu, Rick Siow Mong Goh, Chee Keong Kwoh, Gary Kee Khoon Lee
  12.1 Introduction
  12.2 Methodology
    12.2.1 Parallel/Distributed GA
    12.2.2 Parallel SVM
    12.2.3 Neighbor Search
    12.2.4 Evaluation Caching
  12.3 Experiments and Results
  12.4 Conclusion
  References
13 Computation in Complex Environments; Optimizing Railway Timetable Problems with Symbiotic Networks
Kees Pieters
  13.1 Introduction
    13.1.1 Convergence Inducing Process
    13.1.2 A Classification of Problem Domains
  13.2 Railway Timetable Problems
  13.3 Symbiotic Networks
    13.3.1 A Theory of Symbiosis
    13.3.2 Premature Convergence
  13.4 Symbiotic Networks as Optimizers
  13.5 Trains as Symbiots
    13.5.1 Trains in Symbiosis
    13.5.2 The Environment
    13.5.3 The Trains
    13.5.4 The Optimizing Layer
    13.5.5 Computational Complexity
    13.5.6 Results
    13.5.7 A Symbiotic Network as a CCGA
    13.5.8 Discussion
  References
319 321 322 322
14 Project Scheduling: Time-Cost Tradeoff Problems . . . . . . . . . . . . . . . Sanjay Srivastava, Bhupendra Pathak, Kamal Srivastava 14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.1.1 A Mathematical Description of TCT Problems . . . . . . . . 14.2 Resource-Constrained Nonlinear TCT . . . . . . . . . . . . . . . . . . . . . . . 14.2.1 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 14.2.2 Working of ANN and Heuristic Embedded Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.2.3 ANNHEGA for a Case Study . . . . . . . . . . . . . . . . . . . . . . 14.3 Sensitivity Analysis of TCT Profiles . . . . . . . . . . . . . . . . . . . . . . . . 14.3.1 Working of IFAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.3.2 IFAG for a Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.4 Hybrid Meta Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.4.1 Working of Hybrid Meta Heuristic . . . . . . . . . . . . . . . . . . 14.4.2 HMH Approach for Case Studies . . . . . . . . . . . . . . . . . . . 14.4.3 Standard Test Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
325
15 Systolic VLSI and FPGA Realization of Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pramod Kumar Meher 15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15.2 Direct-Design of VLSI for Artificial Neural Network . . . . . . . . . . 15.3 Design Considerations and Systolic Building Blocks for ANN . . . 15.4 Systolic Architectures for ANN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15.4.1 Systolic Architecture for Hopfield Net . . . . . . . . . . . . . . . 15.4.2 Systolic Architecture for Multilayer Neural Network . . . 15.4.3 Systolic Implementation of Back-Propagation Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15.4.4 Implementation of Advance Algorithms and Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
325 328 329 330 331 334 336 339 339 343 345 348 352 354 355 359 360 362 364 371 371 373 373 376 376 377
16 Application of Coarse-Coding Techniques for Evolvable Multirobot Controllers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381 Jekanthan Thangavelautham, Paul Grouchy, Gabriele M.T. D’Eleuterio 16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381 16.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
XX
Contents
16.2.1 The Body and the Brain . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.2.2 Task Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.2.3 Machine-Learning Techniques and Modularization . . . . 16.2.4 Fixed versus Variable Topologies . . . . . . . . . . . . . . . . . . . 16.2.5 Regularity in the Environment . . . . . . . . . . . . . . . . . . . . . . 16.3 Artificial Neural Tissue Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.3.1 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.3.2 The Decision Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.3.3 Evolution and Development . . . . . . . . . . . . . . . . . . . . . . . . 16.3.4 Sensory Coarse Coding Model . . . . . . . . . . . . . . . . . . . . . 16.4 An Example Task: Resource Gathering . . . . . . . . . . . . . . . . . . . . . . 16.4.1 Coupled Motor Primitives . . . . . . . . . . . . . . . . . . . . . . . . . 16.4.2 Evolutionary Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.5.1 Evolution and Robot Density . . . . . . . . . . . . . . . . . . . . . . . 16.5.2 Behavioral Adaptations . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.5.3 Evolved Controller Scalability . . . . . . . . . . . . . . . . . . . . . . 16.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
385 385 386 387 388 389 389 390 391 393 395 397 399 399 403 403 406 407 409 410
Chapter 1
New Hybrid Intelligent Systems to Solve Linear and Quadratic Optimization Problems and Increase Guaranteed Optimal Convergence Speed of Recurrent ANN
Otoni Nóbrega Neto, Ronaldo R.B. de Aquino, and Milde M.S. Lira
Abstract. This chapter studies artificial neural networks (ANNs) and heuristic rules (HR) for solving optimization problems. ANNs are attractive as optimization tools for large-scale problems because the technique has great potential for hardware VLSI implementation, in which it may be more efficient than traditional optimization techniques. However, software implementations of the algorithm have shown that the technique, although effective, is slow compared with traditional mathematical methods. To make it faster, we present two ways to increase the convergence speed of the computational algorithm. For analysis and comparison, we solved three test cases. The chapter considers recurrent ANNs for solving linear and quadratic programming problems. These networks are based on the solution of a set of differential equations obtained from a transformation of an augmented Lagrange energy function. The proposed hybrid systems, which combine a recurrent ANN with HR, required less computational effort than the recurrent ANN alone.
Otoni Nóbrega Neto · Ronaldo R.B. de Aquino · Milde M.S. Lira
Electrical Engineering Department, Federal University of Pernambuco, Brazil
e-mail: [email protected], [email protected]
Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 1–26. © Springer-Verlag Berlin Heidelberg 2010. springerlink.com

1.1 Introduction
The early 1980s were marked by a resurgence of interest in artificial neural networks (ANNs). At that time, the development of ANNs had the important characteristic of temporal processing. Many researchers attribute the resumption of research on ANNs in the eighties to the Hopfield model presented in 1982 [1]. This recurrent Hopfield model represented great progress at the frontier of knowledge in the area of neural networks at that time. Nowadays, it is known that there are two ways of incorporating temporal computation in a neural network: the first uses a statistical neural network to accomplish a dynamical mapping in a short-term memory structure; the second makes internal feedback connections, with single or multi-loop feedback, in which the neural network can be fully connected. Artificial neural networks that have feedback connections in their topology are known as recurrent neural networks [2]. The theory and applications of recurrent neural networks were developed in several subsequent works [3, 4, 5, 6, 7, 8, 9].

Hopfield's works showed that an energy value can be associated with each state of the network and that this energy decreases monotonically as the trajectory moves through the state space towards a fixed point. These fixed points are therefore stable points of energy [10], i.e., the energy function behaves as a Lyapunov function for the model described in detail in Hopfield's works. This raises the question of stability in recurrent neural networks. When considering stability in a non-linear dynamical system, we usually think of stability in the sense of Lyapunov. The Direct Method of Lyapunov is widely used for stability analysis of linear and non-linear systems, which may be either time-variant or time-invariant. Therefore, it is directly applicable to the stability analysis of ANNs [2].

In 1985, Hopfield solved the traveling salesman problem [7], a combinatorial optimization problem, using a continuous model of the recurrent neural network as an optimization tool. In 1986, Hopfield proposed a specialized ANN to solve specific linear programming (LP) problems [9] based on analog circuits, which had been studied since 1956 by Insley B. Pyne and presented in [11]. On that occasion, Hopfield demonstrated that the dynamics of recurrent artificial neural networks are described by a Lyapunov function, and hence that the network is stable and that its point of stability is the solution to the problem for which the ANN was modeled.
In 1987, Kennedy and Chua demonstrated that the ANN proposed by Hopfield in 1986, although it searches for the minimum of the energy function, had not been modeled to provide a lower bound, except when the saturation of an operational amplifier of the circuit is reached [12]. Because of this deficiency, Kennedy and Chua proposed a new circuit for LP problems that also proved able to solve quadratic programming (QP) problems. These circuits were named "canonical non-linear programming circuits" and are based on the Kuhn-Tucker (KT) conditions [12]. In this kind of ANN-based optimization, the problem has to be "hard-wired" in the network, and the convergence behavior of the ANN depends greatly on how the cost function is modeled. Later studies [13, 14] confirmed that, for non-linear programming problems, the model proposed by Kennedy and Chua [15] completely satisfies the KT optimality conditions and the penalty method. Moreover, under appropriate conditions this network is stable.

Despite the important progress made in Kennedy and Chua's studies, a deficiency was observed in the model: the equilibrium point of the network lies only in the neighborhood of the optimal point of the original problem. The distance between the optimal point and the equilibrium point of the network can, however, be reduced by increasing the penalty parameter (s), as in [14] and [16]. Even so, Kennedy and Chua's network is able to solve a large class of optimization problems with and without constraints. However, when the solution of a constrained optimization problem lies near the boundary of the feasible region, i.e., constraints are close to holding with equality, the network converges only to an approximate solution that may lie outside the feasible region [17]. This is explained by the penalty function theorem [16]. For applications in which an infeasible solution cannot be tolerated, the usefulness of this technique (Kennedy and Chua's neural networks) is seriously jeopardized.

To overcome this difficulty, Maa and Shanblatt proposed the two-phase method [14]. This work extends the method presented by David W. Tank and John J. Hopfield [18] and guarantees that, under certain conditions, the proposed network evolves towards the exact solution of the optimization problem. Since the Kennedy and Chua network contains a finite penalty parameter, it generates only approximate solutions and presents an implementation problem when the penalty parameter is very large. To reach an exact solution, the Maa and Shanblatt method uses another penalty parameter in the second phase. To avoid penalty parameters altogether, some significant works have appeared in recent years. Among them, a few primal-dual neural networks with two-layer and one-layer structures were developed for solving linear and quadratic programming problems [18, 19, 20, 21]. These neural networks were proved to be globally convergent to an exact solution when the objective function is convex. Nowadays, recurrent ANNs are used to solve real-world problems such as hydrothermal scheduling, in [22] based on the augmented Lagrange Hopfield network and in [23, 24, 25] based on the Maa and Shanblatt two-phase neural network. In this work, an ANN is used to solve optimization problems, and the method proposed by Maa and Shanblatt is applied.
ANNs are studied as optimization tools for large-scale problems because the technique has great potential for hardware VLSI implementation, in which it can be more efficient than traditional optimization techniques. However, software implementations of the method have shown that, although the technique is effective in solving optimization problems, its convergence can be slow compared with traditional mathematical methods. For this reason, heuristic rules were created and combined with the network in a hybrid form to aid and accelerate the convergence of the two-phase method in software. It is important to point out that the software implementation of the method is an important part of the development and analysis of the method in hardware.

An important reason for choosing the Maa and Shanblatt network is that it solves linear and quadratic optimization problems with linear equality and inequality constraints directly, without mathematical transformations that would increase the dimension of the problem. As we plan to apply the hybrid intelligent system (HIS) developed in this work to the hydrothermal scheduling problem [24, 25], which does not need an exact solution, the first phase of the Maa and Shanblatt network was chosen for this implementation. In future works, we may try other kinds of recurrent ANNs.

Decision trees and classification rules are important and common methods of knowledge representation in expert systems [26]. Heuristic rules are rules that have no particular foundation in a scientific theory but are based only on the observation of general patterns and derived from facts. These rules are applicable to many problems, as shown in [27, 28, 29, 30]. Here the basis of the proposed heuristic rules is the dynamical behavior of neural networks. From the convergence analysis, we identified the parameters and their relationships, which were then transformed into a set of heuristic rules. We developed an algorithm based on the heuristic rules and carried out experiments to evaluate and support the proposed technique. In this work, two possible implementations were developed, tested and compared, and a large reduction in computational effort was observed when using the proposed heuristic rules. This reduction is related to the decrease in the number of ODEs computed during the convergence process. Other possible implementations are also indicated.

This chapter is organized as follows: it begins with a review of the two-phase method of Maa and Shanblatt; we then present the proposed heuristic rules and the solutions of the test cases using the techniques discussed; next, the simulation results are presented and analyzed; finally, we draw conclusions about the proposed work.
1.2 Neural Network of Maa and Shanblatt: Two-Phase Optimization
The operation of the Hopfield network model and its successors is based on the constraint violations of the optimization problem. When a constraint violation occurs, the magnitude and the direction of the violation are fed back to adjust the states of the neurons of the network so that the overall energy function of the network always decreases until it reaches a minimum level. The dynamics of these ANN models are governed by a Lyapunov function. Therefore, it can be demonstrated that these networks are stable and that the equilibrium point is the solution of the LP or QP problem that the network represents. This type of neural network was first improved in [15] and later in [14]. The network of the latter version is the one used in this work. The Maa and Shanblatt network is able to solve constrained or unconstrained convex quadratic and linear programming problems. Consider the following convex problem (P):

$$
\begin{aligned}
(P)\quad \min\ & f(x) = \tfrac{1}{2} x^T Q x + c^T x \\
\text{s.t.}\ & g(x) = Dx - b \le 0, \\
& h(x) = Hx - w = 0, \\
& x \in \mathbb{R}^n
\end{aligned}
\tag{1.1}
$$

where $c \in \mathbb{R}^n$, $D \in \mathbb{R}^{p \times n}$, $b \in \mathbb{R}^p$, $H \in \mathbb{R}^{q \times n}$, $w \in \mathbb{R}^q$, $p, q \le n$, and $Q \in \mathbb{R}^{n \times n}$ is symmetric and positive definite or positive semidefinite; $f$, the $g_i$ and the $h_j$ are functions $\mathbb{R}^n \to \mathbb{R}$. Assume that the feasible domain of P is not empty and that the objective function is bounded below over the domain of P.
In particular, P is said to be a convex programming problem if $f$ and the $g_i$ are convex functions and the $h_j$ are affine functions. A special case arises when $Q$ is the zero matrix, so that the cost function reduces to $f(x) = c^T x$. In this case, since the inequality and equality constraints are linear, P becomes a linear programming problem.

The method presented by Maa and Shanblatt [14] consists of two phases. The first phase initializes the problem and converges quickly, without high accuracy, towards the neighborhood of the optimal point, while the second phase aims to reach the exact solution of the problem. To this end, the dynamics of the first phase are based on the exact penalty Lagrangian function, or energy function, $L(s, x)$:

$$
L(s, x) = f(x) + \frac{s}{2}\left[\sum_{i=1}^{p}\left(g_i^+(x)\right)^2 + \sum_{j=1}^{q}\left(h_j(x)\right)^2\right]
\tag{1.2}
$$

where $s$ is a large positive real number and $g_i^+(x(t)) = \max\{0, g_i(x(t))\}$, with the notation simplified to $g^+ = [g_1^+, \ldots, g_p^+]^T$, according to [14]. As the system converges, $x(t) \to \hat{x}$, $s g_i^+(x(t)) \to \lambda_i$ and $s h_j(x(t)) \to \mu_j$, which are the Lagrange multipliers associated with the corresponding constraints. The first phase therefore already yields an approximation of the Lagrange multipliers. The block diagram of a two-phase optimization network is shown in Fig. 1.1. The first phase takes place in the time range $0 \le t \le t_1$ ($t_1$ is the time instant at which the switch is closed, connecting the first phase to the second one). The network operates according to the following dynamics:

$$
\frac{dx}{dt} = -\nabla f(x) - s\left[\sum_{i=1}^{p}\nabla g_i(x)\, g_i^+(x) + \sum_{j=1}^{q}\nabla h_j(x)\, h_j(x)\right]
\tag{1.3}
$$

In the second phase ($t \ge t_1$) the network begins to shift the directional vector $s g_i^+(x)$ gradually to $\lambda_i$, and $s h_j(x)$ to $\mu_j$. By imposing a small positive real value $\varepsilon$, the update rates $d\lambda_i/dt$ and $d\mu_j/dt$, given in (1.6) and (1.7) respectively, are made much slower than that of $dx/dt$ in (1.5). These dynamics can then be approximated by considering $\lambda$ and $\mu$ to be fixed, in which case (1.5) seeks a minimum point of the augmented Lagrangian function $L_a(s, x)$:

$$
L_a(s, x) = f(x) + \lambda^T g(x) + \mu^T h(x) + \frac{s}{2}\left(\left\|g^+(x)\right\|^2 + \left\|h(x)\right\|^2\right)
\tag{1.4}
$$

In the block diagram of Fig. 1.1, the subsystems within the two large rectangles do not contribute during the first phase ($t \le t_1$). In the second phase, when $t > t_1$, the dynamics of the network become:

$$
\frac{dx}{dt} = -\nabla f(x) - \left[\sum_{i=1}^{p}\nabla g_i(x)\left(s g_i^+(x) + \lambda_i\right) + \sum_{j=1}^{q}\nabla h_j(x)\left(s h_j(x) + \mu_j\right)\right]
\tag{1.5}
$$
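As a small worked illustration of the difference between the two phases, consider a one-variable problem chosen here purely for illustration (it is not one of the chapter's test cases): minimize $f(x) = \tfrac{1}{2}x^2$ subject to the single equality constraint $h(x) = x - 1 = 0$. Setting $dx/dt = 0$ in (1.3) gives the first-phase equilibrium, and combining $dx/dt = 0$ in (1.5) with a stationary multiplier gives the second-phase equilibrium:

```latex
\begin{align*}
&\text{KT point of the toy problem:} && x^* = 1,\quad \mu^* = -1
   \quad (\nabla f + \mu \nabla h = x + \mu = 0) \\[4pt]
&\text{First phase, (1.3) at rest:} && x + s\,(x - 1) = 0
   \;\Longrightarrow\; x_s = \frac{s}{1+s} \\
&\text{Constraint violation:} && h(x_s) = -\frac{1}{1+s} \neq 0
   \quad \text{for every finite } s \\
&\text{Multiplier estimate:} && s\,h(x_s) = -\frac{s}{1+s}
   \;\xrightarrow[s \to \infty]{}\; -1 = \mu^* \\[4pt]
&\text{Second phase, (1.5) at rest:} && x + s\,(x-1) + \mu = 0,
   \quad h(x) = 0 \;\Longrightarrow\; x = 1,\ \mu = -1
\end{align*}
```

For example, $s = 100$ gives $x_s \approx 0.9901$ and $s\,h(x_s) \approx -0.9901$: the first phase lands close to, but not on, the constrained optimum, whereas the second-phase equilibrium is exact for any finite $s$.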
Fig. 1.1 Block diagram of the dynamical system of the Maa and Shanblatt network
The Lagrange multipliers are updated as:

$$
\frac{d\lambda_i(t + \Delta t)}{dt} = \varepsilon\, s\, g_i^+(x(t)), \quad \text{for } i = 1, \ldots, p,
\tag{1.6}
$$

and

$$
\frac{d\mu_j(t + \Delta t)}{dt} = \varepsilon\, s\, h_j(x(t)), \quad \text{for } j = 1, \ldots, q.
\tag{1.7}
$$

A practical value is $\varepsilon = 1/s$ according to [14], which leaves the network with just one adjustment parameter. However, choosing $\varepsilon$ independently of $s$ gives more freedom to control the dynamics of the network. During the first phase, the Lagrange multipliers are null, so there is no restriction on the initial value of $x(t)$. According to the penalty function theorem, the solution achieved in the first phase is not equivalent to the minimum of the function $f(x)$ unless the penalty parameter $s$ is infinite. The second phase of optimization is therefore necessary for any finite value of $s$. The system reaches equilibrium when:

$$
g_i^+ = 0, \qquad h_j = 0, \qquad \text{and} \qquad
\nabla f(x) + \sum_{i=1}^{p}\nabla g_i(x)\,\lambda_i + \sum_{j=1}^{q}\nabla h_j(x)\,\mu_j = 0,
\tag{1.8}
$$

which is identical to the optimality condition of the KT theorem; thus the equilibrium point of the two-phase network is precisely a global minimum point of the convex problem (P). In [12] it is demonstrated that the Kennedy and Chua network for linear and quadratic programming problems completely satisfies the KT optimality conditions and the penalty function method. It is also shown that, under appropriate conditions, this network is completely stable. Moreover, it is shown that the equilibrium point lies in the neighborhood of the optimal point of the original problem and that the distance between them can be made arbitrarily small by selecting a sufficiently large value of the penalty parameter (s). For problems that cannot tolerate a solution in the infeasible region, due to physical limits of operational amplifiers, a two-phase optimization network model is proposed. In the second phase, we can obtain both the exact solution of these problems and the Lagrange multipliers associated with each constraint.
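The two-phase dynamics (1.3) and (1.5) to (1.7) can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it integrates the equations with a plain forward-Euler scheme, and the function name `two_phase_network`, the step size `dt`, the switching time `t1`, and the two-variable toy problem are all hypothetical choices made for illustration. For the linear constraints of (1.1), $\nabla g_i$ and $\nabla h_j$ are the rows of $D$ and $H$.

```python
import numpy as np

def two_phase_network(Q, c, D, b, H, w, s=10.0, eps=None,
                      dt=1e-3, t1=2.0, t_end=12.0):
    """Forward-Euler sketch of the dynamics (1.3) and (1.5)-(1.7).

    Assumption: a fixed step dt small enough for the explicit scheme to be
    stable; this is an illustrative integrator, not the chapter's solver.
    """
    eps = 1.0 / s if eps is None else eps   # practical value suggested in the text
    x = np.zeros(Q.shape[0])                # any initial state is allowed in phase 1
    lam = np.zeros(D.shape[0])              # multipliers are null during phase 1
    mu = np.zeros(H.shape[0])
    for k in range(int(round(t_end / dt))):
        g_plus = np.maximum(0.0, D @ x - b)  # g_i^+(x) = max{0, g_i(x)}
        h = H @ x - w
        grad_f = Q @ x + c
        second_phase = k * dt >= t1
        if second_phase:
            dx = -grad_f - D.T @ (s * g_plus + lam) - H.T @ (s * h + mu)  # (1.5)
        else:
            dx = -grad_f - s * (D.T @ g_plus + H.T @ h)                   # (1.3)
        x = x + dt * dx
        if second_phase:
            lam = lam + dt * eps * s * g_plus                             # (1.6)
            mu = mu + dt * eps * s * h                                    # (1.7)
    return x, lam, mu

# Hypothetical toy QP: min 0.5*||x||^2  s.t.  x1 - x2 <= -0.6,  x1 + x2 = 1.
Q = np.eye(2)
c = np.zeros(2)
D = np.array([[1.0, -1.0]])
b = np.array([-0.6])
H = np.array([[1.0, 1.0]])
w = np.array([1.0])

# eps is set independently of s here only to shorten the run.
x_opt, lam, mu = two_phase_network(Q, c, D, b, H, w, s=10.0, eps=1.0)
print(x_opt, lam, mu)  # should approach x = (0.2, 0.8), lambda = 0.3, mu = -0.5
```

For this toy problem the KT conditions (1.8) can be solved by hand, giving $x = (0.2, 0.8)$, $\lambda = 0.3$ and $\mu = -0.5$, which is what the run should approach. With the default $\varepsilon = 1/s$ the multipliers move much more slowly, so a considerably longer `t_end` would be needed.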
1.3 Hybrid Intelligent System Description
The network proposed by Maa and Shanblatt has two attractive features: guaranteed global convergence for the mathematical programming problem, and the possibility of physically implementing the neural network as a circuit of electrical components, in which the response time of the circuit dynamics would be set by the capacitance in the circuit, so that the convergence time would be negligible. Despite these attractive characteristics, the time required to run the computational algorithm becomes a barrier to ANN-based applications for solving large-scale mathematical programming problems, because several differential equations must be solved. Problems with a larger number of variables and constraints therefore involve a considerable number of differential equations. To mitigate this problem, heuristic rules were developed to accelerate the convergence of the computational algorithm involved in recurrent neural networks. The combination of a recurrent ANN with heuristic rules forms the Hybrid Intelligent System (HIS), in which the two techniques interact and exchange information with one another until the optimal solution of the problem is achieved.

The basis of the proposed heuristic rules is the dynamical behavior of neural networks. Control theory studies the dynamical behavior of a process in depth. With the aid of control theory and the Lyapunov theorem for the network [2], it can be stated that, from any given initial state x(0), the network will always change the values of the state variables xi(t) in the direction in which the value of the Lyapunov function for the network decreases continuously until it reaches a stationary point, which corresponds to a global minimum of the programming problem. The trajectory of the variables is illustrated in Fig. 1.2. It depicts a two-dimensional trajectory (orbit) of a dynamical system, where the state variables of the system can be observed at certain time instants (t0, t1, ..., t5). The dotted vector can be understood as the tendency of the convergence (indicated by the gradient vector) of the variables in the dynamics of the system.
Fig. 1.2 A two-dimensional trajectory (orbit) of a dynamical system
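The monotone decrease of the energy along a trajectory can be checked numerically. The sketch below, written under stated assumptions, records the energy function (1.2) along a forward-Euler discretization of the first-phase dynamics (1.3); the toy problem data, the step size `dt`, the starting point, and the helper names `energy` and `step` are hypothetical and chosen only for illustration. Because (1.3) is the negative gradient of (1.2), a sufficiently small step should never increase the energy.

```python
import numpy as np

# Hypothetical toy data for problem (1.1): min 0.5*||x||^2
# s.t. x1 - x2 <= -0.6 and x1 + x2 = 1.
Q = np.eye(2)
c = np.zeros(2)
D = np.array([[1.0, -1.0]])
b = np.array([-0.6])
H = np.array([[1.0, 1.0]])
w = np.array([1.0])
s, dt = 10.0, 1e-3   # penalty parameter and Euler step (illustrative values)

def energy(x):
    """Energy function L(s, x) of (1.2)."""
    g_plus = np.maximum(0.0, D @ x - b)
    h = H @ x - w
    return 0.5 * x @ Q @ x + c @ x + 0.5 * s * (g_plus @ g_plus + h @ h)

def step(x):
    """One forward-Euler step of the first-phase dynamics (1.3)."""
    g_plus = np.maximum(0.0, D @ x - b)
    h = H @ x - w
    return x + dt * (-(Q @ x + c) - s * (D.T @ g_plus + H.T @ h))

x = np.array([2.0, -1.0])       # arbitrary initial state x(0)
energies = [energy(x)]
for _ in range(2000):
    x = step(x)
    energies.append(energy(x))

# The recorded energy should be non-increasing along the whole orbit.
monotone = all(e1 <= e0 + 1e-12 for e0, e1 in zip(energies, energies[1:]))
print(monotone, x)
```

The final state is the first-phase (penalty) equilibrium for $s = 10$, which is close to but not equal to the constrained optimum $(0.2, 0.8)$ of this toy problem.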
Trajectories of the state variables of the same system are shown graphically in Fig. 1.3. These trajectories are distinct because the state variables start from different initial states. The dynamics of a recurrent ANN have the same properties and are therefore similar to the dynamics shown in Fig. 1.3. Although the Maa and Shanblatt model is a continuous-time recurrent network, in a computational algorithm the iterations are performed in discrete time, since the numerical integration of the equations requires a small but nonzero step size. We therefore have full control over the course of the iterations of the algorithm in the network. Detailed observations made while testing the algorithm of the Maa and Shanblatt model showed that the computational convergence is slow and that the trajectories of convergence of recurrent networks in the state space are smooth and possibly predictable. We then observed that, under certain conditions, it is possible not only to estimate a point closer to the minimum point of the energy function of the network, but also to estimate a point that, instead of lying on the original orbit of convergence, becomes the initial point of a new orbit of convergence. This new orbit would have a shorter curvature and, consequently, a smaller Euclidean distance to the optimal point. In this way, the number of steps needed to compute the convergence of the algorithm can be reduced and, consequently, so can the time needed to compute the equilibrium point of the network.
Fig. 1.3 An illustration of a two-dimensional state (phase) portrait of a dynamical system and the associated vector field
To estimate the equilibrium point, we use two methods. In the first, the point is calculated from the evolution of the dynamics in the space-time plane (in this work, we consider only autonomous systems). In the second, the calculation is performed by observing the evolution of the variables in state space. The mechanism of these two methods and the way they operate in the proposed HIS are described as follows.
1.3.1 Method of Tendency Based on the Dynamics in Space-Time (TDST)
Consider the convergence of first-order dynamical systems shown in the graphs of Fig. 1.4. Observing these curves, and noting that time is always increasing, i.e., the point x1(t) always comes after x1(t − Δt), we conclude that a point closer to convergence is located outside the internal area of the concavity of the convergence curve when the curve behaves as in Fig. 1.4(a). For example, for t = 0.060, x1(0.060) ≈ 0.3; in this case, a better estimate of the point would be x1(0.061) = 0.45 for Δt = 0.001 s. Restarting the network with this initial state would generate a convergence curve as illustrated in Fig. 1.5. However, this rule applies only to curves of types (a) and (c) in Fig. 1.4; for curves (b) and (d), a better estimate of the point is located inside the internal area of the concavity. Observing the particularities of the possible curvatures of the convergence curve in the space-time plane, the following parameters were created:
• curvature {curve, straight line};
• concavity, when it exists, {concave downwards, concave upwards};
• time rating {high, mean, low};
• variable xi(t) {increasing, decreasing}.

[Figure: four panels (a) to (d), each plotting x1(t) against t, with both axes ranging from 0 to 1]
Fig. 1.4 Dynamical convergences of first order (single variable systems): graphs of evolution of state variables in time

In order to assess these parameters, the network must provide at least three points (P0, P1, P2) on the convergence curve. Next, these three points are normalized along both the horizontal axis and the vertical axis, in order to avoid problems in the algorithm used to estimate a better point. The chosen normalization equation is presented in (1.9), where M is the maximum and m is the minimum of the three values to be normalized, and a and b are chosen according to the range of the normalized values. In this work, the values are normalized into the range [0, 1], so a = 0 and b = 1. Here z is the value to be normalized, and $z_N$ is the normalized value.

$$
z_N = \frac{b\,(z - m) - a\,(z - M)}{M - m}
\tag{1.9}
$$
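The normalization (1.9) is simple enough to state directly in code. The following is a minimal sketch; the function name `normalize` and the handling of the degenerate case M = m (where (1.9) is undefined) are assumptions made for illustration and are not taken from the chapter.

```python
def normalize(values, a=0.0, b=1.0):
    """Map values into [a, b] using equation (1.9).

    M and m are the maximum and minimum of the values being normalized.
    Raising an error when M == m is an assumption of this sketch, since
    (1.9) divides by M - m.
    """
    M, m = max(values), min(values)
    if M == m:
        raise ValueError("all values are equal; (1.9) is undefined")
    return [(b * (z - m) - a * (z - M)) / (M - m) for z in values]

# Three hypothetical samples of one coordinate of the points P0, P1, P2.
samples = [2.0, 3.0, 6.0]
scaled = normalize(samples)
print(scaled)  # [0.0, 0.25, 1.0]
```

The minimum maps to a and the maximum to b, so with a = 0 and b = 1 the three points always span the unit interval on each axis, whatever the original scale of t and x1(t).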
[Figure: x1(t) plotted against t, showing the curves "Original Dynamic" and "Advanced Dynamic", with annotations "Predict Point by Heuristic Rule (HIS-1)", "The Algorithm Time Gain" and "Time in which the ANN is Restarted"]
Fig. 1.5 Action of the HIS to calculate a better point in dynamics evolving over time
From the normalized points, two spatial vectors $\vec{v}_1$ and $\vec{v}_2$ are computed, and the following relevant information is obtained from them: the Euclidean norm and the angle ($\theta_i$) of each vector in relation to the horizontal axis, and the angle between them ($\Delta\theta = \theta_2 - \theta_1$). When the normalized points describe a curve, classification regions (decision regions) are generated as shown in Fig. 1.6. To understand the illustration in Fig. 1.6 better, consider that the normalized initial point ($P_{0N}$) is always found at the beginning of each area (S4, S5, S6, S7, S8 and S9). Besides these six possibilities, there are three more that occur in the case of straight-line convergences, where $\Delta\theta$ is approximately zero.
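The quantities extracted from three normalized points can be sketched as below. The text does not spell out how the two vectors are formed, so this sketch assumes $\vec{v}_1 = P_1 - P_0$ and $\vec{v}_2 = P_2 - P_1$; the function name `trend_features` and the sample points are likewise hypothetical.

```python
import math

def trend_features(p0, p1, p2):
    """Norms, angles and angle difference for three normalized (t, x) points.

    Assumption of this sketch: v1 joins P0 to P1 and v2 joins P1 to P2.
    Angles are measured from the horizontal (time) axis, in radians.
    """
    v1 = (p1[0] - p0[0], p1[1] - p0[1])
    v2 = (p2[0] - p1[0], p2[1] - p1[1])
    norms = (math.hypot(v1[0], v1[1]), math.hypot(v2[0], v2[1]))
    theta1 = math.atan2(v1[1], v1[0])
    theta2 = math.atan2(v2[1], v2[0])
    return norms, (theta1, theta2), theta2 - theta1

# Collinear points: the angle difference is zero (straight-line case).
_, _, d_line = trend_features((0.0, 0.0), (0.5, 0.5), (1.0, 1.0))

# A rising curve that flattens out: the second vector is less steep,
# so the angle difference is negative.
_, _, d_curve = trend_features((0.0, 0.0), (0.5, 0.8), (1.0, 1.0))
print(d_line, d_curve)
```

In a rule-based classifier, a small tolerance on $|\Delta\theta|$ would separate the straight-line case from the curved cases; the tolerance value itself is not given in this part of the text.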
Fig. 1.6 Classification regions to patterns that have spatial curvature
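As a rough illustration of the quantities used to build the decision regions, the sketch below computes the norms, the angles and Δθ from three normalized points. It assumes v1 = P1N − P0N and v2 = P2N − P1N; the straight-line tolerance `tol` is a hypothetical parameter, since the chapter does not state the threshold it uses.

```python
import math


def segment_features(p0, p1, p2):
    """Norms, angles (radians) and angle difference of v1 = P1 - P0 and v2 = P2 - P1."""
    v1 = (p1[0] - p0[0], p1[1] - p0[1])
    v2 = (p2[0] - p1[0], p2[1] - p1[1])
    theta1 = math.atan2(v1[1], v1[0])  # angle of v1 with the horizontal axis
    theta2 = math.atan2(v2[1], v2[0])  # angle of v2 with the horizontal axis
    return {
        "norm1": math.hypot(v1[0], v1[1]),
        "norm2": math.hypot(v2[0], v2[1]),
        "theta1": theta1,
        "theta2": theta2,
        "delta_theta": theta2 - theta1,
    }


def is_straight(delta_theta, tol=0.05):
    """Treat the pattern as a straight line when delta_theta is close to zero."""
    return abs(delta_theta) < tol


features = segment_features((0.0, 0.0), (0.5, 0.5), (1.0, 0.6))
```

In this example the second segment is flatter than the first, so Δθ is negative and the pattern would be treated as curved rather than straight.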
Region 4 (S4) is similar to the beginning of the convergence shown in Fig. 1.4(a), while region 5 (S5) describes a behavior close to the curve formed by the end of the convergence shown in Fig. 1.4(a) and the beginning of the convergence shown in Fig. 1.4(b). Region 6 (S6) has a convergence similar to that shown in Fig. 1.4(b).
O.N. Neto, R.R.B. de Aquino, and M.M.S. Lira
Region 7 (S7) describes a dynamic of the type shown in Fig. 1.4(d), and region 8 (S8) describes a behavior close to the curve formed by the end of the convergence shown in Fig. 1.4(c) and the beginning of the convergence shown in Fig. 1.4(d), while region 9 (S9) represents a behavior close to that shown in Fig. 1.4(c). The straightforward regime can be of three types: increasing, where the derivative of the curve is positive and not close to zero, corresponding to region 1 (S1); constant, where the derivative of the curve is approximately zero, corresponding to region 2 (S2); and decreasing, where the derivative is negative and not close to zero, corresponding to region 3 (S3). Note that the regimes described by regions S5 and S8 can be considered close to the constant straightforward regime. Thus, we modeled the following heuristic rules:

Rule 1: if <curvature is a straight line> and <...> and <...> then <...>.
Rule 2: if <curvature is a straight line> and <...> then <...>.
Rule 3: if <curvature is a straight line> and <...> and <...> then <...>.
Rule 4: if <curvature is a curve> and <...> and <...> and <...> then <...>.
Rule 5: if <curvature is a curve> and <...> and <...> then <...>.
Rule 6: if <curvature is a curve> and <...> and <...> and <...> then <...>.
Rule 7: if <curvature is a curve> and <...> and <...> and <...> then <...>.
Rule 8: if <curvature is a curve> and <...> and <...> then <Action II>.
Rule 9: if <curvature is a curve> and <...> and <...> and <...> then <...>.

The actions in the rules lead to sub-functions that return a better value for the next initialization point of the network. The straight-line condition implies that either the system is converging very slowly or the step size of the integration algorithm is very small. In this case, the linear function (1.10) can be applied:

$$P_{3N} = a\,(P_{2N} - P_{0N}) + P_{2N} \tag{1.10}$$

where a is a constant that yields a gain in the magnitude of v3 (v3 = P3 − P2). The rules above are summarized in Table 1.1.
Table 1.1 Description of the actions to be taken due to the heuristic rules according to each decision region

Action  Regions      Description of the action
I       S1           P3N is calculated according to (1.10) with a = a1.
II      S2, S5, S8   P3N is calculated according to (1.10) with a = a2.
III     S3           P3N is calculated according to (1.10) with a = a3.
IV      S4           P3N takes the coordinates of the uppermost point of the circle that passes through the normalized points P0N, P1N and P2N.
V       S6, S7       P3N takes the coordinates of the rightmost point of the circle that passes through the normalized points P0N, P1N and P2N.
VI      S9           P3N takes the coordinates of the lowermost point of the circle that passes through the normalized points P0N, P1N and P2N.
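Actions IV to VI rely on the circle that passes through the three normalized points. The sketch below is one possible way to compute that circle and the three candidate points; it reads "superior", "most right" and "inferior" as the top, rightmost and bottom points of the circle, which is an interpretation, and the function names are hypothetical.

```python
import math


def circle_through(p0, p1, p2):
    """Centre (cx, cy) and radius r of the circle through three non-collinear points."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    d = 2.0 * (x0 * (y1 - y2) + x1 * (y2 - y0) + x2 * (y0 - y1))
    if abs(d) < 1e-12:
        raise ValueError("points are collinear; use the straight-line rule instead")
    s0, s1, s2 = x0 * x0 + y0 * y0, x1 * x1 + y1 * y1, x2 * x2 + y2 * y2
    cx = (s0 * (y1 - y2) + s1 * (y2 - y0) + s2 * (y0 - y1)) / d
    cy = (s0 * (x2 - x1) + s1 * (x0 - x2) + s2 * (x1 - x0)) / d
    return cx, cy, math.hypot(x0 - cx, y0 - cy)


def action_point(action, p0, p1, p2):
    """Candidate P3N for actions IV (top), V (rightmost) and VI (bottom)."""
    cx, cy, r = circle_through(p0, p1, p2)
    if action == "IV":
        return (cx, cy + r)
    if action == "V":
        return (cx + r, cy)
    if action == "VI":
        return (cx, cy - r)
    raise ValueError("only actions IV, V and VI use the circle")


top = action_point("IV", (0.0, 0.0), (1.0, 1.0), (2.0, 0.0))
```

For the three sample points the circle has centre (1, 0) and radius 1, so action IV returns the point (1, 1).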
Once the normalized point P3N has been estimated by the heuristic rules, it must be denormalized to obtain P3. This value is then used to restart the recurrent network.
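The straight-line case can be illustrated end to end: normalize, extrapolate with (1.10), and map the result back to the original scale. The sketch assumes the range [0, 1] used in the chapter; the gain value 0.5 and the sample points are hypothetical, and the gain is named `gain` here to avoid confusion with the constant a of (1.9).

```python
def extrapolate(p0n, p2n, gain):
    """Eq. (1.10) applied coordinate-wise: P3N = gain * (P2N - P0N) + P2N."""
    return tuple(gain * (c2 - c0) + c2 for c0, c2 in zip(p0n, p2n))


def denormalize(z_n, m, M):
    """Invert Eq. (1.9) for the range [0, 1] (a = 0, b = 1)."""
    return z_n * (M - m) + m


# Hypothetical samples (t, x) delivered by the network.
times = [0.0, 0.1, 0.2]
values = [1.0, 1.5, 2.0]
t_min, t_max = min(times), max(times)
x_min, x_max = min(values), max(values)

p0n = ((times[0] - t_min) / (t_max - t_min), (values[0] - x_min) / (x_max - x_min))
p2n = ((times[2] - t_min) / (t_max - t_min), (values[2] - x_min) / (x_max - x_min))

p3n = extrapolate(p0n, p2n, gain=0.5)
p3 = (denormalize(p3n[0], t_min, t_max), denormalize(p3n[1], x_min, x_max))
```

Here the predicted point lies half a normalized span beyond P2 along the line through P0 and P2, which corresponds to t = 0.3 and x = 2.5 in the original scale.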
1.3.2 Method of Tendency Based on the Dynamics in State-Space (TDSS)

To calculate a better point using the dynamics in state-space, two facts must be pointed out: first, the convergence of each variable depends on the convergence of the other variables; second, when working in state-space, the variations in the state of the variables are mapped by taking one variable xi as reference, so that the curve is free to evolve in all directions, which does not happen when working in the space-time plane. Given these facts, and observing the convergence orbits in state-space shown in Fig. 1.2 and Fig. 1.3, we concluded that a better point in state-space must be located inside the concavity of the orbit of the system dynamics. To calculate such a point, we proceed as follows. First, we take one of the variables as a reference (for example, x1(t)) and draw n − 1 complex planes (for a system with n variables), obtaining the planes x1(t)Ox2(t), x1(t)Ox3(t), ..., x1(t)Oxn(t). Given the three points P2 = x(t), P1 = x(t − Δt) and P0 = x(t − 2Δt) provided by the network, we can create state vectors in each of the n − 1 complex planes. For example, for the plane x1(t)Ox2(t), we have:

$$\vec{v}_1(t) = \bigl(x_1(t) + i\,x_2(t)\bigr) - \bigl(x_1(t-\Delta t) + i\,x_2(t-\Delta t)\bigr) \tag{1.11}$$

and

$$\vec{v}_2(t) = \bigl(x_1(t-\Delta t) + i\,x_2(t-\Delta t)\bigr) - \bigl(x_1(t-2\Delta t) + i\,x_2(t-2\Delta t)\bigr) \tag{1.12}$$
From these vectors, we carry out a rotation of the axes using (1.13) and (1.14), as illustrated in Fig. 1.7:

$$\vec{v}_1{}' = \vec{v}_1 \exp(-i\theta_1) \tag{1.13}$$

$$\vec{v}_2{}' = \vec{v}_2 \exp(-i\theta_1) \tag{1.14}$$

where θ1 is the angle of v1(t), and also a translation using:

$$\begin{aligned}
x_1'(t-2\Delta t) &= -|\vec{v}_1|, & x_2'(t-2\Delta t) &= 0,\\
x_1'(t-\Delta t) &= 0, & x_2'(t-\Delta t) &= 0,\\
x_1'(t) &= x_1(t) - x_1(t-2\Delta t), & x_2'(t) &= x_2(t) - x_2(t-2\Delta t).
\end{aligned} \tag{1.15}$$
Fig. 1.7 Translation and rotation transformation in state-space
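The change of frame can be tried out with complex numbers, where a point in the plane x1Ox2 is written x1 + i·x2. The sketch below is an illustration under one stated assumption: the rotation angle is taken from the older segment P1 − P0, so that P0' lands at (−|P1 − P0|, 0) and P1' at the origin. The function names and sample points are hypothetical.

```python
import cmath


def to_local_frame(p0, p1, p2):
    """Translate so that P1 is the origin and rotate so that P0 -> P1 lies on the real axis."""
    theta = cmath.phase(p1 - p0)        # angle of the older segment
    rot = cmath.exp(-1j * theta)        # rotation by -theta in the complex plane
    return (p0 - p1) * rot, 0j, (p2 - p1) * rot, theta


def from_local_frame(p_local, p1, theta):
    """Inverse transformation: rotate back by +theta and translate back to P1."""
    return p_local * cmath.exp(1j * theta) + p1


p0, p1, p2 = 0 + 0j, 1 + 1j, 1 + 2j     # hypothetical points in the plane x1 O x2
p0_l, p1_l, p2_l, theta = to_local_frame(p0, p1, p2)
recovered = from_local_frame(p2_l, p1, theta)
```

A point P3' chosen by the heuristic rules in the local frame would be mapped back with `from_local_frame` in the same way as `recovered` above.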
The rotation and translation facilitate the analysis of the behavior of vector v2 in relation to vector v1. Heuristic rules can therefore be applied to set the gain in modulus and angle of the state vectors, yielding a vector v3' in each of the n − 1 complex planes. Finally, a strategy is needed to determine the final value of the reference variable. An effective strategy is to add to the value of the reference variable the average of the increments calculated for this variable in the complex planes. Fig. 1.8 shows two examples of sets of heuristic rules that can be used to produce vector v3'. In each panel, the leftmost point on the circle is the point P0', the point P1' is fixed at the center of the circle, and the points marked with small circles on the circle represent several possibilities for the point P2'. Finally, the results of the heuristic rules, the points P3', are marked with green circles. To obtain the final point P3, we apply the inverse translation and rotation to the point P3', thus generating the appropriate value to initialize the recurrent network. Fig. 1.9 shows an example of the application of the heuristic rules to estimate a better point through the dynamics in state-space. The external curve represents the dynamics of the recurrent network without the heuristic rules, and the internal one represents the dynamics using the HIS (ANN and TDSS), which is based on heuristic rules. In the internal curve, the points marked with circles are the iteration points computed by the network and the points marked with plus signs are the points estimated by the heuristic rules (P3).
Fig. 1.8 Variations of the rules applied to points P0, P1, P2 to calculate the estimated point P3; panels (a) and (b) plot X2'(t) against X1'(t)

Fig. 1.9 Graph of the convergence orbit of the state variables x1 and x2 (x2(t) against x1(t)): external curve, ANN; internal curve, HIS. Legend: Original Orbit; Advanced Orbit; Point by Recurrent NN; Predict Point by Heuristic Rule (HIS-2)
By iteratively applying the ANN and the heuristic rules (HR), the system progressively reduces the curvature of the orbit in state-space, jumping from one orbit to another until it reaches the solution of the problem (the point of equilibrium of the recurrent network).
1.4 Case Studies

In order to test the proposed hybrid intelligent system, we chose mathematical programming problems previously solved in [16, 31, 32]. The following cases were solved using the HIS, which was implemented in MATLAB. To solve the differential equations involved in the problem, we implemented the Richardson extrapolation method of level four, which has in its structure the classical Runge-Kutta method of order four and uses a fixed integration step size [33].
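To make the integration scheme more concrete, the sketch below combines the classical fourth-order Runge-Kutta step with a single level of Richardson extrapolation on a fixed step size. It is an assumption-laden illustration in Python: the chapter does not spell out how its "level four" scheme is organized, and the test equation y' = −y is chosen here only because its exact solution is known.

```python
import math


def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step for y' = f(t, y), with y a list."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k2)])
    k4 = f(t + h, [yi + h * k for yi, k in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def richardson_rk4_step(f, t, y, h):
    """Combine one step of size h with two steps of size h/2 (one extrapolation level)."""
    coarse = rk4_step(f, t, y, h)
    half = rk4_step(f, t, y, h / 2)
    fine = rk4_step(f, t + h / 2, half, h / 2)
    # For a fourth-order method the leading error term cancels with weight 1/(2**4 - 1).
    return [yf + (yf - yc) / 15 for yf, yc in zip(fine, coarse)]


def integrate(step, f, y0, h, n_steps):
    """Apply a fixed-step integrator n_steps times starting from t = 0."""
    t, y = 0.0, list(y0)
    for _ in range(n_steps):
        y = step(f, t, y, h)
        t += h
    return y


def decay(t, y):
    """Test problem y' = -y with exact solution exp(-t)."""
    return [-yi for yi in y]


plain = integrate(rk4_step, decay, [1.0], 0.1, 10)[0]
extrapolated = integrate(richardson_rk4_step, decay, [1.0], 0.1, 10)[0]
```

On this test problem the extrapolated result is closer to exp(−1) than the plain Runge-Kutta result, at the cost of three Runge-Kutta steps per integration step instead of one.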
1.4.1 Case 1: Mathematical Linear Programming Problem – Four Variables

A classical linear programming problem [31] was used to choose the parameters of the developed heuristic rules. In [31] the problem was solved to compare the performance of several recurrent network models. In this case we deal with the following problem LP1:

$$\begin{aligned}
(\mathrm{LP}_1)\quad \min\ & f(x) = -8x_1 - 8x_2 - 5x_3 - 5x_4\\
\text{s.t.}\ & x_1 + x_3 = 40\\
& x_2 + x_4 = 60\\
& -5x_1 + 5x_2 \le 0\\
& 2x_1 - 3x_2 \le 0\\
& x \ge 0,\quad x \in \mathbb{R}^4
\end{aligned} \tag{1.16}$$
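Problem (1.16) is small enough to check by hand. Eliminating x3 = 40 − x1 and x4 = 60 − x2 gives f = −3x1 − 3x2 − 500 with x2 ≤ x1 ≤ 40, so the point (40, 40, 0, 20) attains f = −740. The sketch below only verifies this hand-derived point; it is not the recurrent network, and the function names are hypothetical.

```python
def lp1_cost(x):
    """Objective of LP1 in Eq. (1.16)."""
    x1, x2, x3, x4 = x
    return -8 * x1 - 8 * x2 - 5 * x3 - 5 * x4


def lp1_feasible(x, tol=1e-9):
    """Check every constraint of LP1 within a small tolerance."""
    x1, x2, x3, x4 = x
    return (abs(x1 + x3 - 40) <= tol
            and abs(x2 + x4 - 60) <= tol
            and -5 * x1 + 5 * x2 <= tol
            and 2 * x1 - 3 * x2 <= tol
            and all(v >= -tol for v in x))


start = (10, 10, 10, 10)      # initial condition x(0) quoted in the caption of Fig. 1.11
candidate = (40, 40, 0, 20)   # hand-derived point, not taken from the chapter

start_cost = lp1_cost(start)
candidate_cost = lp1_cost(candidate)
```

The costs of the two points are −260 and −740, which agree with the initial and final costs reported for LP1 in Table 1.2; the initial condition itself is not feasible, since x1 + x3 = 20.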
1.4.2 Case 2: Mathematical Linear Programming Problem – Eleven Variables

With the heuristic rule parameters already defined in case 1, we chose a larger-scale problem to compare the performance of the models: a minimum cost flow problem (LP2) solved in [32]. In this problem there are several nodes (points) representing consumers and suppliers, which are connected through paths (arcs). The aim is to calculate the flow through all paths so as to minimize the total cost, which is the sum over all arcs of the product of the cost and the flow in each arc. The problem can be represented as a graph, as shown in Fig. 1.10, where the amount of flow on a node cannot exceed the capacity of the node. A network is a set of elements called nodes and a set of elements called arcs, each arc eij being an ordered pair (i, j) of distinct nodes i and j. If eij is an arc, then node i is called the tail of eij and node j is called the head of eij. The directed graph shown in Fig. 1.10 is formed of 6 nodes and 11 arcs.

[Directed graph with nodes 1 to 6 and arcs e12, e13, e15, e23, e42, e43, e53, e54, e56, e62, e64]
Fig. 1.10 A directed graph of a minimum cost flow problem
The cost of each arc is given by vector c, its maximum flow capacity by vector b, and the demands by vector w. If wi ≤ 0, node i is a supplier (source) and, if wi > 0, node i is a consumer (sink). Suppose, for instance, that we have w^T = [−9 4 17 1 −5 −8]. The matrix H is called the incidence matrix of the network. More generally, the incidence matrix of a network with n nodes and m arcs has n rows and m columns. Thus, our matrix H has size 6 × 11 and is formed as follows:

$$H = \begin{bmatrix}
-1 & -1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
1 & 0 & 0 & -1 & 1 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & -1 & 0 & 0 & -1 & 1 & 1 & 0\\
0 & 0 & 1 & 0 & 0 & 0 & -1 & 0 & -1 & 0 & -1\\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & -1 & 1
\end{bmatrix} \tag{1.17}$$

Considering that there are no losses in the network, i.e., everything that is produced is consumed, the sum of all the elements wi of the graph is zero. This condition makes the rows of H linearly dependent (LD); in other words, any row can be obtained as a linear combination of the other rows. To overcome this problem, we remove one row of the matrix H and the corresponding element of the column vector w. Here, the last row of H was removed, turning the matrix and the vector w into a truncated incidence matrix and a truncated vector, according to [32].

$$H = \begin{bmatrix}
-1 & -1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
1 & 0 & 0 & -1 & 1 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & -1 & 0 & 0 & -1 & 1 & 1 & 0\\
0 & 0 & 1 & 0 & 0 & 0 & -1 & 0 & -1 & 0 & -1
\end{bmatrix} \tag{1.18}$$
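The incidence matrix can be generated from the arc list rather than typed by hand. In the sketch below the arc order is inferred from the columns of (1.17), with −1 at the tail and +1 at the head of each arc; the function name and the list `ARCS` are assumptions introduced for illustration.

```python
# Arcs (tail, head) in the column order of Eq. (1.17); the order is inferred from the matrix.
ARCS = [(1, 2), (1, 3), (1, 5), (2, 3), (4, 2), (6, 2),
        (5, 3), (4, 3), (5, 4), (6, 4), (5, 6)]


def incidence_matrix(n_nodes, arcs):
    """Rows are nodes, columns are arcs: -1 at the tail and +1 at the head of each arc."""
    h = [[0] * len(arcs) for _ in range(n_nodes)]
    for col, (tail, head) in enumerate(arcs):
        h[tail - 1][col] = -1
        h[head - 1][col] = 1
    return h


H = incidence_matrix(6, ARCS)
H_truncated = H[:-1]                      # drop the last row, as in Eq. (1.18)
column_sums = [sum(row[j] for row in H) for j in range(len(ARCS))]
```

Every column sums to zero because each arc leaves exactly one node and enters exactly one node. This is why the six rows are linearly dependent and one of them can be removed.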
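Problem LP2 has the following form, where the equality constraints are the rows of the truncated incidence matrix (1.18) applied to x and set equal to the truncated vector w:

$$\begin{aligned}
(\mathrm{LP}_2)\quad \min\ & f(x) = 3x_1 + 5x_2 + x_3 + x_4 + 4x_5 + x_6 + 6x_7 + x_8 + x_9 + x_{10} + x_{11}\\
\text{s.t.}\ & -x_1 - x_2 - x_3 = -9\\
& x_1 - x_4 + x_5 + x_6 = 4\\
& x_2 + x_4 + x_7 + x_8 = 17\\
& -x_5 - x_8 + x_9 + x_{10} = 1\\
& x_3 - x_7 - x_9 - x_{11} = -5\\
& x \le [2\ 10\ 10\ 6\ 8\ 7\ 9\ 9\ 10\ 8\ 6]^T\\
& x \ge 0,\quad x \in \mathbb{R}^{11}
\end{aligned} \tag{1.19}$$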
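A feasibility check for LP2 can be written directly from the truncated matrix (1.18). The sketch assumes the equality constraints have the form Hx = w with w truncated to its first five entries, and it tests a flow that was worked out by hand for this illustration; the flow is not taken from the chapter and is not claimed to be the unique optimum.

```python
H_TRUNC = [
    [-1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, -1, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, -1, 0, 0, -1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0, -1, 0, -1, 0, -1],
]
W_TRUNC = [-9, 4, 17, 1, -5]                      # first five entries of w
COST = [3, 5, 1, 1, 4, 1, 6, 1, 1, 1, 1]          # objective coefficients
UPPER = [2, 10, 10, 6, 8, 7, 9, 9, 10, 8, 6]      # upper bounds on the flows


def flow_cost(x):
    """Total cost: sum of cost times flow over all arcs."""
    return sum(c * v for c, v in zip(COST, x))


def node_balances(x):
    """Left-hand side H x of the equality constraints."""
    return [sum(h * v for h, v in zip(row, x)) for row in H_TRUNC]


def is_feasible(x, tol=1e-9):
    """Check the node balances and the bounds 0 <= x <= UPPER."""
    balanced = all(abs(b - w) <= tol for b, w in zip(node_balances(x), W_TRUNC))
    bounded = all(-tol <= v <= u + tol for v, u in zip(x, UPPER))
    return balanced and bounded


candidate = [2, 3, 4, 5, 0, 7, 0, 9, 9, 1, 0]     # hand-derived flow (assumption)
candidate_cost = flow_cost(candidate)
```

The candidate flow satisfies all constraints and costs 56, the same value as the final cost reported for LP2 in Table 1.2.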
1.4.3 Case 3: Mathematical Quadratic Programming Problem – Three Variables

As an example of quadratic programming (QP), the economic dispatch problem was solved as formulated in [16]. In this problem, the aim is to minimize the total generation cost while meeting the demand of the power system. The formulation considers 3 thermal generators (n = 3) connected to a single load. Let fi be the generation cost of the i-th generation unit, xi the power generated by the i-th generation unit, and w the total power demand of the load; the limits xi,min and xi,max are defined by the physical limitations of the i-th generation unit. The economic power dispatch is expressed as QP:

$$\begin{aligned}
(\mathrm{QP})\quad \min\ & f(x) = \sum_{i=1}^{n} f_i(x_i)\\
\text{s.t.}\ & \sum_{i=1}^{n} x_i - w = 0\\
& x_{\min} \le x \le x_{\max},\quad x \in \mathbb{R}^n
\end{aligned} \tag{1.20}$$

Data were obtained from [16]: xmin = [150 100 50]^T MW, xmax = [600 400 200]^T MW, w = 850 MW, and the following costs for the generator units:

$$\begin{aligned}
f_1(x_1) &= 561 + 7.92\,x_1 + 0.00156\,x_1^2\\
f_2(x_2) &= 310 + 7.85\,x_2 + 0.00194\,x_2^2\\
f_3(x_3) &= 78 + 7.95\,x_3 + 0.00482\,x_3^2
\end{aligned} \tag{1.21}$$
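For readers who want a reference point for (1.20) and (1.21), the sketch below solves the dispatch with the classical equal-incremental-cost condition and a bisection on the multiplier. This is a different technique from the recurrent network studied in the chapter and is shown only as an independent cross-check; the bracket [0, 20] for the multiplier is an assumption that happens to be wide enough for these data.

```python
# (constant, linear, quadratic) cost coefficients of Eq. (1.21)
COEFFS = [(561.0, 7.92, 0.00156), (310.0, 7.85, 0.00194), (78.0, 7.95, 0.00482)]
X_MIN = [150.0, 100.0, 50.0]
X_MAX = [600.0, 400.0, 200.0]
DEMAND = 850.0


def dispatch_at(lam):
    """Output of each unit when its marginal cost b + 2 c x equals lam, clipped to its limits."""
    powers = []
    for (_, b, c), lo, hi in zip(COEFFS, X_MIN, X_MAX):
        powers.append(min(max((lam - b) / (2.0 * c), lo), hi))
    return powers


def solve_dispatch(tol=1e-9):
    """Bisection on the multiplier until the total output meets the demand."""
    lo, hi = 0.0, 20.0   # assumed bracket: total output is 300 MW at 0 and 1200 MW at 20
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(dispatch_at(mid)) < DEMAND:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, dispatch_at(lam)


lam, powers = solve_dispatch()
```

With these data none of the limits is active, and the three units settle near 392 MW, 334 MW and 124 MW, which is the neighborhood in which the trajectories of Figs. 1.17 to 1.19 end.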
1.5 Simulations

The parameters chosen to simulate problems LP1 and LP2 were: an integration step size of 1e−3; 100 for the parameter s of the neural network in the first and second phases; and 1.1 for the parameter e of the network in the second phase. The main results of the LP1 and LP2 simulations are presented in Table 1.2.
Table 1.2 Simulation Results of LP1 and LP2

                                                   LP1                           LP2
                                                   ANN      HIS-1a   HIS-2b     ANN      HIS-1a   HIS-2b
No. points by the ANN (Phase 1)                    8316     1000     2391       8670     2965     3099
No. points by the HR (Phase 1)                     -        331      795        -        986      1031
Total no. points (Phase 1)                         8316     1331     3186       8670     3951     4130
Normalized computer processing time (Phase 1)      1.00     0.12     0.29       1.00     0.33     0.35
Time instant when the switch is closed (s) = t1    8.32     1.33     3.19       8.67     3.95     4.13
No. of calculated points by the ANN (Phase 2)      8607     8277     8316       7692     7251     7263
No. of calculated points in both phases            16923    9608     11502      16362    11202    11393
Initial cost (Phase 1) = f(x(0))                   -260.00  -260.00  -260.00    0.00     0.00     0.00
Final cost (Phase 1) = f(x(t1))                    -741.77  -741.82  -741.78    55.67    55.65    55.63
Final cost (Phase 2) = f(x(tend))                  -740.00  -740.00  -740.00    56.00    55.99    56.00

a HIS-1 = ANN and method of Tendency based on the Dynamics in Space-Time (TDST).
b HIS-2 = ANN and method of Tendency based on the Dynamics in State-Space (TDSS).
The results shown in row 5 of Table 1.2 indicate that both proposed hybrid systems (HIS-1 and HIS-2) were able to advance the dynamics of the simulated linear problems efficiently. This greatly reduces the time needed to process the network algorithm, since at each integration step n ODEs are solved, where n is the number of variables in the problem. For instance, for problem LP1, the ANN calculated 8316 points by the end of the first phase, so 33264 ODEs were solved, whereas the HIS-1 needed only 1000 ANN points, i.e., 4000 ODEs, to reach the end of the first phase. In other words, the HIS-1 reduced the computational effort by approximately 88% compared to the ANN. For the LP2 problem, this rate was approximately 66%. The rates of computational effort reduction for problems LP1 and LP2 when comparing the ANN to the HIS-2 were 71% and 64%, respectively. Figs. 1.11–1.13 present the simulation results for the LP1 problem and Figs. 1.14–1.16 those for the LP2 problem.

The parameters chosen to simulate problem QP were: an integration step size of 1e−2 and 50 for the parameter s of the neural network in the first phase. The initial condition used was x(0) = [400 300 150]^T.
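The percentages quoted above follow from the number of points computed by the ANN in phase 1 (first row of Table 1.2). The short sketch below reproduces the arithmetic; the dictionary layout and function names are simply a convenient way to hold the table values.

```python
# Points computed by the ANN in phase 1, taken from the first row of Table 1.2.
ANN_POINTS = {"LP1": 8316, "LP2": 8670}
HIS_POINTS = {
    ("LP1", "HIS-1"): 1000, ("LP1", "HIS-2"): 2391,
    ("LP2", "HIS-1"): 2965, ("LP2", "HIS-2"): 3099,
}
N_VARIABLES = {"LP1": 4, "LP2": 11}


def odes_solved(points, n_variables):
    """Each integration point requires one ODE evaluation per variable."""
    return points * n_variables


def effort_reduction(problem, system):
    """Fraction of ANN integration points saved by the hybrid system in phase 1."""
    return 1.0 - HIS_POINTS[(problem, system)] / ANN_POINTS[problem]


reductions = {key: round(100 * effort_reduction(*key)) for key in HIS_POINTS}
lp1_ann_odes = odes_solved(ANN_POINTS["LP1"], N_VARIABLES["LP1"])
lp1_his1_odes = odes_solved(HIS_POINTS[("LP1", "HIS-1")], N_VARIABLES["LP1"])
```

Because the number of variables cancels in the ratio, the reduction in ODE evaluations equals the reduction in integration points.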
Fig. 1.11 Dynamics of the problem LP1, obtained by the ANN with the initial condition x(0) = [10 10 10 10]^T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference
Fig. 1.12 Dynamics of the problem LP1, obtained by the HIS-1 with the initial condition x(0) = [10 10 10 10]^T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference
Fig. 1.13 Dynamics of the problem LP1, obtained by the HIS-2 with the initial condition x(0) = [10 10 10 10]^T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

Fig. 1.14 Dynamics of the problem LP2, obtained by the ANN with the initial condition x(0) = [0 0 0 0 0 0 0 0 0 0 0]^T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

Fig. 1.15 Dynamics of the problem LP2, obtained by the HIS-1 with the initial condition x(0) = [0 0 0 0 0 0 0 0 0 0 0]^T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

Fig. 1.16 Dynamics of the problem LP2, obtained by the HIS-2 with the initial condition x(0) = [0 0 0 0 0 0 0 0 0 0 0]^T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

The results shown in row 5 of Table 1.3 indicate that both proposed hybrid systems were also able to advance the dynamics of the simulated quadratic problem efficiently. As before, this greatly reduces the time needed to process the network algorithm, since at each integration step n ODEs are solved, where n is the number of variables in the problem. For problem QP, the ANN calculated 172629 points by the end of the first phase, so 517887 ODEs were solved, whereas the HIS-1 needed only 8257 ANN points, i.e., 24771 ODEs, to reach the end of the first phase. In other words, the HIS-1 reduced the computational effort by approximately 95% compared to the ANN. For the HIS-2 on the QP problem, this rate was approximately 70%. Figs. 1.17–1.19 present the simulation results for problem QP.

All case studies were carried out on the same computer, so we take the processing time of the ANN in phase 1 as the base to normalize the processing time of the hybrid systems in the same phase. The hybrid systems were not used in phase 2. For the LP1 case, the HIS-1 took 12% of the ANN processing time and the HIS-2 took 29%; for the LP2 case, the HIS-1 took 33% and the HIS-2 took 35%; and for the QP case, the HIS-1 took 6% and the HIS-2 took 45%. It is important to note that the efficiency of the heuristic rules varies according to the type of problem. In this work, the results showed that the HIS-1 yielded the better performance, particularly in the LP1 and QP problems. We therefore observed a decrease in processing time produced by the implemented heuristic rules, together with a reduction in the number of ODEs computed. These rules estimate the next values of each variable of the problem throughout the convergence. Because the ODEs also compute points during the application of the HIS, the proposed systems can correct themselves in the case of an incorrect estimate, which makes them resilient.
Table 1.3 Simulation Results of QP

                                                   QP
                                                   ANN        HIS-1a     HIS-2b
No. points by the ANN (Phase 1)                    172629     8257       51900
No. points by the HR (Phase 1)                     -          2750       17299
Total no. points (Phase 1)                         172629     11007      69199
Normalized computer processing time (Phase 1)      1.00       0.06       0.45
Time instant when the switch is closed (s) = t1    1726.29    110.07     691.99
Final cost (Phase 1) = f(x(t1))                    22680.05   22680.05   22680.05

a HIS-1 = ANN and method of Tendency based on the Dynamics in Space-Time (TDST).
b HIS-2 = ANN and method of Tendency based on the Dynamics in State-Space (TDSS).
Fig. 1.17 Dynamics of the problem QP, obtained by the ANN with the initial condition x(0) = [400 300 150]^T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference
Fig. 1.18 Dynamics of the problem QP, obtained by the HIS-1 with the initial condition x(0) = [400 300 150]^T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

Fig. 1.19 Dynamics of the problem QP, obtained by the HIS-2 with the initial condition x(0) = [400 300 150]^T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference

1.6 Conclusion

In this paper, two Hybrid Intelligent Systems have been proposed. These systems combine the Maa and Shanblatt network with heuristic rules. The Maa and Shanblatt network is a two-phase recurrent neural network that provides the exact solution for linear and quadratic programming problems. When compared to conventional linear and nonlinear optimization techniques, the two-phase network formulation is advantageous because no matrix inversion is required. The main aim of the proposed HIS is to increase the speed of convergence towards the optimal point, whose attainment is guaranteed by the ANN. In the cases presented, the optimal convergence was reached. The proposed systems have both advantages: the simulation analyses show a reduction in the computational effort of approximately 95% compared to the ANN in the QP case solved in this paper, while optimal convergence remains guaranteed and no matrices are inverted. The implementation of the proposed HIS has been developed to solve large-scale operational planning problems, to which it may be applied in future work: for example, a large-scale economic power dispatch with the scheduling of hydro, thermal and wind power plants to minimize the overall production cost while satisfying the load demand in the mid-term operation planning of hydrothermal generation systems. In future work, we will also propose the combination of these heuristic rules and/or their application in the second phase of the Maa and Shanblatt method.
References

1. Hopfield, J.J.: Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA 79, 2552–2558 (1982)
2. Haykin, S.: Neural networks: a comprehensive foundation, 2nd edn. Prentice Hall, USA (1999)
3. Hopfield, J.J.: Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA 81, 3088–3092 (1984)
4. Hopfield, J.J.: Learning algorithms and probability distributions in feed-forward and feed-back networks. Proc. Natl. Acad. Sci. USA 84, 8429–8433 (1987)
5. Hopfield, J.J.: The effectiveness of analogue neural network hardware. Network: Computation in Neural Systems 1(1), 27–40 (1990)
6. Hopfield, J.J., Feinstein, D.I., Palmer, R.G.: Unlearning has a stabilizing effect in collective memories. Nature 304, 158–159 (1983)
7. Hopfield, J.J., Tank, D.W.: Neural Computation of Decisions in Optimization Problem. Biological Cybernetics 52, 141–152 (1985)
8. Hopfield, J.J., Tank, D.W.: Computing with Neural Circuits: A Model. Science 233(8), 625–633 (1986)
9. Tank, D.W., Hopfield, J.J.: Simple Neural Optimization Networks: An A/D Converter, Signal Decision Circuit, and a Linear Programming Circuit. IEEE Trans. on Circuits and Systems 33(5), 533–541 (1986)
10. Ludemir, T.B., Braga, A.P., Carvalho, A.C.P.L.F.: Redes Neurais Artificiais Teoria e Aplicacoes, 1st edn. LTC - Livros Tecnicos e Cientificos Editora S.A., Rio de Janeiro, RJ (2000)
11. Pyne, I.B.: Linear Programming on an electronic analogue computer. Trans. AIEE. Part I (Comm. & Elect.) 75, 139–143 (1956)
12. Kennedy, M.P., Chua, L.O.: Unifying Tank and Hopfield Linear Programming Circuit and the Canonical Nonlinear Programming Circuit of Chua and Lin. IEEE Trans. on Circuits and Systems 34(2), 210–214 (1987)
13. Chiu, C., Maa, C.Y., Shanblatt, M.A.: An artificial neural network algorithm for dynamic programming. Int. J. Neural Syst. 1(3), 211–220 (1990)
14. Maa, C.Y., Shanblatt, M.A.: A Two-Phase Optimization Neural Network. IEEE Transactions on Neural Networks 3(6), 1003–1009 (1992)
15. Kennedy, M.P., Chua, L.O.: Neural Networks for Nonlinear Programming. IEEE Trans. on Circuits and Systems 35(5), 210–220 (1988)
16. Maa, C.Y., Shanblatt, M.A.: Linear and Quadratic Programming Neural Network Analysis. IEEE Transactions on Neural Networks 3(4), 580–594 (1992)
17. Chiu, C., Maa, C.Y., Shanblatt, M.A.: Energy Function Analysis of Dynamic Programming Neural Networks. IEEE Transactions on Neural Networks 2(4) (July 1991)
18. Xia, Y.S.: A New Neural Network for Solving Linear and Quadratic Programming Problems. IEEE Transactions on Neural Networks 7(6), 1544–1547 (1996)
19. Tao, Q., Cao, J.D., Xue, M.S., Qiao, H.: A High Performance Neural Network for Solving Nonlinear Programming Problems with Hybrid Constraints. Phys. Lett. A 288(2), 88–94 (2001)
20. Xia, Y.S., Wang, J.: A General Methodology for Designing Globally Convergent Optimization Neural Networks. IEEE Transactions on Neural Networks 9(6), 1331–1343 (1998)
21. Xia, Y.S., Wang, J.: A Recurrent Neural Network for Solving Nonlinear Convex Programs Subject to Linear Constraints. IEEE Transactions on Neural Networks 16(2), 379–386 (2005)
22. Dieu, V.N., Ongsakul, W.: Enhanced Merit Order and Augmented Lagrange Hopfield Network for Hydrothermal Scheduling. Electrical Power and Energy Systems 30, 93–101 (2008)
23. Naresh, R., Dubey, J., Sharma, J.: Two-phase Neural Network Based Modeling Framework of Constrained Economic Load Dispatch. IEE Proc. Gener. Transm. Distrib. 151(3) (May 2004)
24. Aquino, R.R.B.: Recurrent Artificial Neural Networks: an application to optimization of hydro thermal power systems (in Portuguese). Ph.D. Thesis, COPELE/UFPE, Campina Grande, Brazil (January 2001)
25. Rosas, P., Aquino, R.R.B., et al.: Study of Impacts of a Large Penetration of Wind Power and Distributed Power Generation as a Whole on the Brazilian Power System. In: European Wind Energy Conference (EWEC), London (November 2004)
26. Witten, I.H., Frank, E.: Data Mining Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
27. Mitra, S., Mitra, M., Chaudhuri, B.B.: Pattern Defined Heuristic Rules and Directional Histogram Based Online ECG Parameter Extraction. Measurement 42, 150–156 (2009)
28. Tuncel, G.: A Heuristic Rule-Based Approach for Dynamic Scheduling of Flexible Manufacturing Systems. In: Levner, E. (ed.) Multiprocessor Scheduling: Theory and Applications, December 2007, p. 436. Itech Education and Publishing, Vienna (2007)
29. Baykasoglu, A., Ozbakir, L., Dereli, T.: Multiple Dispatching Rule Based Heuristic for Multi-Objective Scheduling of Job Shops Using Tabu Search. In: Proceedings of MIM 2002: 5th International Conference on Managing Innovations in Manufacturing (MIM), Milwaukee, Wisconsin, USA, September 9-11, pp. 1–6 (2002)
30. Idris, N., Baba, S., Abdullah, R.: Using Heuristic Rules from Sentence Decomposition of Experts Summaries to Detect Students Summarizing Strategies. International Journal of Human and Social Sciences 2, 1 (Winter 2008), www.waset.org
31. Zak, S.H., Upatising, V., Hui, S.: Solving Linear Programming Problems with Neural Networks: A Comparative Study. IEEE Transactions on Neural Networks 6(1), 94–104 (1995)
32. Chvatal, V.: Linear Programming. W.H. Freeman and Company, New York (1983)
33. Lastman, G.J., Sinha, N.K.: Microcomputer-based numerical methods for science and engineering. Saunders College Publishing, USA (1988)
Chapter 2
A Novel Optimization Algorithm Based on Reinforcement Learning

Janusz A. Starzyk, Yinyin Liu, and Sebastian Batog
Abstract. In this chapter, an efficient optimization algorithm is presented for problems with hard-to-evaluate objective functions. It uses the reinforcement learning principle to determine the particle move in the search for the optimum. A model of successful actions is built and future actions are based on past experience. The step increment combines exploitation of the known search path and exploration for an improved search direction. The algorithm does not require any prior knowledge of the objective function, nor does it require any particular characteristics of such a function. It is simple, intuitive and easy to implement and tune. The optimization algorithm was tested using several multi-variable functions and compared with other widely used random search optimization algorithms. Furthermore, the training of a multi-layer perceptron, to find a set of optimized weights, is treated as an optimization problem. The optimized multi-layer perceptron was applied to Iris database classification. Finally, the algorithm is used in image recognition to find a familiar object with retina sampling and micro-saccades.
Janusz A. Starzyk · Yinyin Liu: Ohio University, School of Electrical Engineering and Computer Science, U.S.A.
Sebastian Batog: Silesian University of Technology, Institute of Computer Science, Poland

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 27–47. © Springer-Verlag Berlin Heidelberg 2010

2.1 Introduction

Optimization is the process of finding the maximum or minimum value of a function within given constraints by changing the values of its variables. It can be essential for solving complex engineering problems in such areas as computer science, aerospace, machine intelligence applications, etc. When the analytical relation between the variables and the objective function value is explicitly known, analytical methods, such as Lagrange multiplier methods [1], interior point methods [18], Newton methods [30], gradient descent methods [25], etc., can be applied. However, in many practical applications, analytical methods do not apply. This happens when the objective functions are unknown, when the relations between variables and function value are not given or are difficult to find, when the functions are known while their derivatives are not applicable, or when the optimum value of the function cannot be verified. In these cases, iterative search processes are required to find the function optimum.

Direct search algorithms [10] are a set of optimization methods that do not require derivatives and do not approximate either the objective functions or their derivatives. These algorithms find locations with better function values by following a search strategy. They only need to compare the objective function values in successive iterative steps to make the move decision. Within the category of direct search, three classes can be distinguished: pattern search methods [28], simplex methods [6], and methods with adaptive sets of search directions [23]. In pattern search methods, the variables of the function are varied either by steps of predetermined magnitude or by steps whose sizes are reduced by the same degree [15]. Simplex methods construct a simplex in ℜ^N using N+1 points and use the simplex to drive the search for the optimum. The methods with adaptive sets of search directions, proposed by Rosenbrock [23] and Powell [21], construct conjugate directions using information about the curvature of the objective function gathered during the search.

In order to avoid local minima, random search methods were developed that use randomness in setting the initial search points and other search parameters, such as the search direction or the step size.
In Optimized Step-Size Random Search (OSSRS) [24], the step size is determined by fitting a quadratic function to the optimized function values in each of the random directions. The random direction is generated with a normal distribution of a given mean and standard deviation. Monte-Carlo optimization methods use randomness in the search process to create the possibility of escaping from local minima. Simulated Annealing (SA) [13] is one typical kind of Monte-Carlo algorithm. It exploits the analogy between the search for a minimum in the optimization problem and the annealing process in which a metal cools and stabilizes into a minimum energy crystalline structure. It accepts a move to a new position with a worse function value with a probability that is controlled by the "temperature" parameter, and this probability decreases along the "cooling process". SA can deal with highly nonlinear, chaotic problems provided that the cooling schedule and other parameters are carefully tuned. Particle Swarm Optimization (PSO) [11] is a population-based evolutionary computational algorithm. It exploits the cooperation within the solution population instead of the competition among its members. At each iteration in PSO, a group of search particles make moves in a mutually coordinated fashion. The step size of a particle is a function of both the best solution found by that particle and the best solution found so far by all the particles in the group. The use of a population of search particles and the cooperation among them enable the algorithm to evaluate function values over a wide range of variables in the input space and to find the optimum position. Each
A Novel Optimization Algorithm Based on Reinforcement Learning
particle only remembers its best solution and the global best solution of the group to determine its step sizes. Generally, during the course of the search, a sequence of decisions on the step sizes is made and a number of function values are obtained in these optimization methods. In order to implement an efficient search for the optimum point, it is desirable that such historical information be utilized in the optimization process.
Reinforcement Learning (RL) [27] is a type of learning process that maximizes certain numerical values by combining exploration and exploitation and using rewards as learning stimuli. In the reinforcement learning problem, the learning agent performs experiments to interact with the unknown environment and accumulates knowledge during this process. It is a trial-and-error exploratory process with the objective of finding the optimum action. During this process, an agent can learn to build a model of the environment to guide its search, so that the agent can predict the environment's response to its actions and choose the most useful actions for its objectives based on its past exploring experience.
Surrogate based optimization refers to the idea of speeding up the optimization process by using surrogates for the objective and constraint functions. The surrogates also allow for the optimization of problems with non-smooth or noisy responses, and can provide insight into the nature of the design space. The max-min SAGA approach [20] searches for designs that have the best worst-case performance in the presence of parameter uncertainty. By leveraging a trust-region approach which uses computationally cheap surrogate models, that approach allows for the possibility of achieving robust design solutions on a limited computational budget. Another example of surrogate based optimization is the surrogate assisted Hooke-Jeeves (SAHJA) algorithm [8], which can be used as a local component of a global optimization algorithm.
This local searcher uses the Hooke-Jeeves method, which performs its exploration of the input space intelligently employing both the real fitness and an approximated function. The idea of building knowledge about an unknown problem through exploration can be applied in the optimization problems. To find the optimum of an unknown multivariable function, an efficient search procedure can be performed using only historical information from conducted experiments to expedite the search. In this chapter, a novel and efficient optimization algorithm based on reinforcement learning is presented. This algorithm uses simple search operators and will be called reinforcement learning optimization (RLO) in the later sections. It does not require any prior knowledge of the objective function or function’s gradient information, nor does it require any characteristics of the objective function. In addition, it is conceptually very simple and easy to implement. This approach to optimization is compatible with the neural networks and learning through interaction, thus it is useful for systems of embodied intelligence and motivated learning as presented in [26]. The following section presents the RLO method and illustrates it within several machine learning applications.
2.2 Optimization Algorithm
2.2.1 Basic Search Procedure
An N-variable optimization objective function $V = f(p_1, p_2, \ldots, p_N)$ ($p_1, p_2, \ldots, p_N, V \in \mathbb{R}^1$) could have several local minima and several global minima $V_{opt1}, \ldots, V_{optN}$. It is desired that the search process, initiated from a random point, finds a path to the global optimum point. Unlike particle swarm optimization [11], this process can be performed with a single search particle that learns how to find its way to the optimum point. It does not require cooperation among a group of particles, although implementing cooperation among several search particles may further enhance the search process in this method.
At each point of the search, the search particle intends to find a new location with a better value within a searching range around it and then determines the direction and the step size for the next move. It tries to reach the optimum by a weighted random search of each variable (coordinate). The step size of the search in each variable is randomly generated with its own probability density function. These functions are gradually learned during the search process. It is expected that at the later stage of the search, the probability density functions are approximated for each variable. Then the stochastically randomized path from the start point to the minimum point of the function is learned.
The step sizes of all the coordinates determine the center of the new searching area, and the standard deviations of the probability functions determine the size of the new searching area around the center. In the new searching area, several locations $P_s$ are randomly generated. If there is a location $p'$ with a better value than the current one, the search operator moves to it. From this new location, new step sizes and a new searching range are determined, so that the search for the optimum continues.
If in the current searching area, there is no point with better value that the search particle can move to, another set of random points are generated until no improvement is obtained after several, say M, trials. Then the searching area size and step sizes are modified in order to find a better function value. If no better value is found after K trials of generating different searching areas or the proposed stopping criterion is met, we can claim that the optimum point has been found. The algorithm of searching for the minimum point is schematically shown in the Figure 2.1.
2.2.2 Extracting Historical Information by Weighted Optimized Approximation
After the search particle makes a sequence of n moves, the step sizes of these moves $dp^i_t$ ($t = 1, 2, \ldots, n$; $i = 1, 2, \ldots, N$) are available for learning. These historical steps have made the search particle move towards better values of the objective function and hopefully get closer to the optimum location. In this sense, these steps are the
Fig. 2.1 The algorithm of RLO searching for the minimum point
successful actions during the trial. It is proposed that the successful actions which result in a positive reinforcement (as the step sizes of each coordinate) follow a function of the iterative steps t, as in (2.1), where $dp^i$ represents the step sizes on the ith coordinate and $f_i(t)$ is the function for coordinate i.
\[ dp^i = f_i(t) \quad (i = 1, 2, \ldots, N) \tag{2.1} \]
These unknown functions $f_i(t)$ can be approximated, for example, using polynomials through the least-squared fit (LSF) process.
\[
\begin{bmatrix}
1 & t_1 & t_1^2 & \cdots & t_1^B \\
\vdots & \vdots & \vdots & & \vdots \\
1 & t_n & t_n^2 & \cdots & t_n^B
\end{bmatrix}
\begin{Bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_B \end{Bmatrix}
=
\begin{Bmatrix} dp^i_1 \\ \vdots \\ dp^i_n \end{Bmatrix}
\tag{2.2}
\]
In (2.2), the step sizes from $dp^i_1$ to $dp^i_n$ are the step sizes on a certain coordinate during n steps and are fitted as unknown function values using polynomials of order B. The polynomial coefficients $a_0$ to $a_B$ can be obtained and will represent the function $f_i(t)$ to estimate $dp^i$,
\[ dp^i = \sum_{j=0}^{B} a_j t^j. \tag{2.3} \]
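The fit in (2.2) and (2.3) can be illustrated with a minimal sketch. It assumes NumPy is available; the function names `fit_step_sizes` and `predict_step` and the sample step sizes are hypothetical and serve only as an illustration, not as the authors' implementation.

```python
import numpy as np

def fit_step_sizes(dp, order):
    """Least-squares polynomial fit of step sizes dp_1..dp_n against t = 1..n, as in (2.2)."""
    dp = np.asarray(dp, dtype=float)
    t = np.arange(1, len(dp) + 1, dtype=float)
    # Columns 1, t, t^2, ..., t^B: the matrix on the left-hand side of (2.2).
    A = np.vander(t, order + 1, increasing=True)
    coeffs, *_ = np.linalg.lstsq(A, dp, rcond=None)
    return coeffs  # a_0 ... a_B

def predict_step(coeffs, t):
    """Evaluate (2.3): the sum over j of a_j * t**j."""
    return float(sum(a * t**j for j, a in enumerate(coeffs)))

# Hypothetical step-size history that follows 0.5 - 0.1 t + 0.01 t^2 exactly.
t = np.arange(1, 9)
dp_history = 0.5 - 0.1 * t + 0.01 * t**2
coeffs = fit_step_sizes(dp_history, order=2)
next_step = predict_step(coeffs, 9)  # extrapolate to the next iteration
```

Because the sample data are an exact quadratic, a second-order fit recovers the generating coefficients; with noisy data the choice of order matters, which is the concern discussed next.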
Using polynomials for function approximation can be easy and efficient. However, considering the characteristics of optimization problems, we have two concerns. First, in order to generate a good approximation while avoiding overfitting, a proper order of the polynomials must be selected. In the optimized approximation algorithm (OAA) presented in [17], the goodness of fit is determined by the so-called signal-to-noise ratio figure (SNRF). Based on SNRF, an approximation stopping criterion
was developed. Using a certain set of basis functions for approximation, the error signal, computed as the difference between the approximated function and the sampled data, can be examined by SNRF to determine how much useful information it contains. The SNRF for the error signal, denoted as $SNRF_e$, is compared to the precalculated SNRF for white Gaussian noise (WGN), denoted as $SNRF_{WGN}$. If $SNRF_e$ is higher than $SNRF_{WGN}$, more basis functions should be used to improve the learning. Otherwise, the error signal shows the characteristic of WGN and should not be reduced any more to avoid fitting into the noise, and the obtained approximated function is the optimum function. Such a process can be applied to determine the proper order of the polynomial.
The second concern is that in the case of reinforcement learning, the knowledge about the originally unknown environment is gradually accumulated throughout the learning process. The information that the learning system obtains at the beginning of the process is mostly based on initially random exploration. During the process of interaction, the learning system collects the historical information and builds the model of the environment. The model can be updated after each step of interaction. The decisions made at the later stages of the interaction are based more on the built model than on random exploration. This means that the recent results are more important and should be weighted more heavily than the old ones. For example, the weights applied can be exponentially increasing from the initial trials to the recent ones, as
\[ w_t = \frac{\alpha^t}{n} \quad (t = 1, 2, \ldots, n), \tag{2.4} \]
where we can define $\alpha^n = n$. As a result, the weights are in the half-open interval (0, 1], and the weight is 1 for the most recent sample. Applying the weights in the LSF, we have the weighted least-squared fit (WLSF), expressed as follows:
\[
\begin{bmatrix}
1 \cdot w_1 & t_1 w_1 & t_1^2 w_1 & \cdots & t_1^B w_1 \\
\vdots & \vdots & \vdots & & \vdots \\
1 \cdot w_n & t_n w_n & t_n^2 w_n & \cdots & t_n^B w_n
\end{bmatrix}
\begin{Bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_B \end{Bmatrix}
=
\begin{Bmatrix} dp_1 w_1 \\ \vdots \\ dp_n w_n \end{Bmatrix}
\tag{2.5}
\]
Due to the weights applied to the given samples, the approximated function will fit the recent data better than the old ones. Utilizing the concept of OAA to obtain an optimized WLSF, the SNRF for the error signal or WGN has to be estimated considering the sample weights. In the original OAA for a one-dimensional problem [17], the SNRF for the error signal was calculated as
\[ SNRF_e = \frac{C(e_j, e_{j-1})}{C(e_j, e_j) - C(e_j, e_{j-1})} \tag{2.6} \]
where C represents the correlation calculation, $e_j$ represents the error signal ($j = 1, 2, \ldots, n$), and $e_{j-1}$ represents the (circularly) shifted version of $e_j$. The characteristics
of SNRF for WGN, expressed through the average value and the standard deviation, can be estimated from Monte-Carlo simulation, as (see derivation in [17])
\[ \mu_{SNRF_{WGN}}(n) = 0 \tag{2.7} \]
\[ \sigma_{SNRF_{WGN}}(n) = \frac{1}{\sqrt{n}}. \tag{2.8} \]
Then the threshold, which determines whether $SNRF_e$ shows the characteristic of $SNRF_{WGN}$ and the fitting error should not be further reduced, is
\[ th_{SNRF_{WGN}}(n) = \mu_{SNRF_{WGN}}(n) + 1.7\,\sigma_{SNRF_{WGN}}(n). \tag{2.9} \]
For the weighted approximation, the SNRF for the error signal is calculated as
\[ SNRF_e = \frac{C(e_j \cdot w_j,\ e_{j-1} \cdot w_{j-1})}{C(e_j \cdot w_j,\ e_j \cdot w_j) - C(e_j \cdot w_j,\ e_{j-1} \cdot w_{j-1})}. \tag{2.10} \]
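The weights of (2.4), the weighted fit of (2.5) and the SNRF of (2.6) and (2.10) can be sketched as follows. This is only an illustration under stated assumptions: the correlation C(a, b) is taken to be the inner product of the two sequences, the shift is circular, and the function names `exp_weights`, `wlsf` and `snrf` are hypothetical.

```python
import numpy as np

def exp_weights(n):
    """Weights of (2.4): w_t = alpha**t / n with alpha**n = n, so that w_n = 1."""
    alpha = n ** (1.0 / n)
    t = np.arange(1, n + 1)
    return alpha ** t / n

def wlsf(dp, order, w):
    """Weighted least-squares fit (2.5): every row of the system is scaled by w_t."""
    dp = np.asarray(dp, dtype=float)
    t = np.arange(1, len(dp) + 1, dtype=float)
    A = np.vander(t, order + 1, increasing=True)
    coeffs, *_ = np.linalg.lstsq(A * w[:, None], dp * w, rcond=None)
    return coeffs

def snrf(e, w=None):
    """SNRF of an error signal: (2.6) without weights, (2.10) with weights.

    Assumption: C(a, b) = sum_j a_j * b_j and the shifted signal wraps around.
    """
    e = np.asarray(e, dtype=float)
    if w is not None:
        e = e * w
    shifted = np.roll(e, 1)          # e_{j-1}, circular shift
    c_shift = np.dot(e, shifted)
    c_self = np.dot(e, e)
    return c_shift / (c_self - c_shift)

w = exp_weights(10)                  # increasing weights, last one equal to 1
alternating = snrf([1.0, -1.0, 1.0, -1.0])   # anti-correlated error signal
```

An error signal that still carries a smooth trend is strongly correlated with its shifted copy and gives a large SNRF, whereas the alternating signal in the last line is anti-correlated and gives a negative value.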
In Fig. 2.2(a), $\sigma_{SNRF_{WGN}}(n)$ from a 200-run Monte-Carlo simulation is shown in the logarithmic scale. The $\sigma_{SNRF_{WGN}}(n)$ can be estimated as
\[ \sigma_{SNRF_{WGN}}(n) = \frac{2}{\sqrt{n}}. \tag{2.11} \]
It is found that the 5% significance level can be approximated by the average value plus 1.5 times the standard deviation for an arbitrary n. Fig. 2.2(b) illustrates the histogram of $SNRF_{WGN}$ with $2^{16}$ samples, as an example. The threshold in this case of a dataset with $2^{16}$ samples can be calculated using $\mu + 1.5\sigma = 0 + 1.5 \times 0.0078 = 0.0117$. Therefore, to obtain an optimized weighted approximation in the one-dimensional case, the following algorithm is performed.
Optimized weighted approximation algorithm (OWAA)
Step (2.1). Assume that an unknown function F, with input space $t \subset \mathbb{R}^1$, is described by n training samples as $dp_t$ ($t = 1, 2, \ldots, n$).
Step (2.2). The signal detection threshold is pre-calculated for the given number of samples n based on $SNRF_{WGN}$. For a one-dimensional problem,
\[ th_{SNRF_{WGN}}(n) = \frac{1.5 \cdot 2}{\sqrt{n}}. \]
Step (2.3). Take a set of basis functions, for example, polynomials of order from 0 up to order B.
Step (2.4). Use these B+1 basis functions to obtain the approximated function,
\[ \hat{dp}_t = \sum_{l=1}^{B+1} f_l(x_t) \quad (t = 1, 2, \ldots, n). \tag{2.12} \]
Fig. 2.2 Characteristic of SNRF for WGN in weighted approximation
Step (2.5). Calculate the approximation error signal,
\[ e_t = dp_t - \hat{dp}_t \quad (t = 1, 2, \ldots, n). \tag{2.13} \]
Step (2.6). Determine the SNRF of the error signal using (2.10).
Step (2.7). Compare $SNRF_e$ with $th_{SNRF_{WGN}}$. If $SNRF_e$ is equal to or less than $th_{SNRF_{WGN}}$, or if B exceeds the number of samples, stop the procedure. In such case $\hat{F}$ is the optimized approximation. Otherwise, add one basis function (in this example, increase the order of the approximating polynomial to B+1) and repeat Steps (2.4)-(2.7).
Using the above algorithm, the proper order of the polynomial is determined to extract the useful information (but not the noise) from the historical data. Also, the extracted information will fit the recent results better than the old ones. We illustrate this process of learning historical information by considering a 2-variable function as an example.
Example. The function $V(p_1, p_2) = p_2^2 \sin(1.5 p_2) + 2 p_1^2 \sin(2 p_1) + p_1 \sin(2 p_2)$ has several local minima, but only one global minimum, as shown in Fig. 2.3. In the process of interaction, the historical information after each iteration is collected. The historical step sizes of the 2 coordinates are separately approximated, as shown in Fig. 2.4 (a) and 2.4 (b). The step sizes of the two coordinates are approximated by quadratic polynomials whose order is determined by OWAA and whose coefficients are obtained using WLSF. In Fig. 2.4, the approximated functions are compared with the quadratic polynomials whose coefficients are obtained from LSF. Again, it is
observed that, the function obtained using WLSF is fitted closer to the data in later iterations than the function obtained using LSF.
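The OWAA steps (2.1)-(2.7) can be sketched as a loop over the polynomial order. This is a minimal illustration, not the authors' code: the correlation C is assumed to be an inner product with a circular shift, the order is capped at n-1, a numerically zero error is treated as "no signal left", and the function name `owaa_fit` is hypothetical.

```python
import numpy as np

def owaa_fit(dp, k_sigma=1.5):
    """Pick the polynomial order by comparing the weighted SNRF with the WGN threshold."""
    dp = np.asarray(dp, dtype=float)
    n = len(dp)
    t = np.arange(1, n + 1, dtype=float)
    w = (n ** (1.0 / n)) ** t / n                 # weights of (2.4)
    threshold = k_sigma * 2.0 / np.sqrt(n)        # Step (2.2)
    for order in range(n):                        # Step (2.3): orders 0 .. n-1
        A = np.vander(t, order + 1, increasing=True)
        coeffs, *_ = np.linalg.lstsq(A * w[:, None], dp * w, rcond=None)  # Step (2.4)
        e = (dp - A @ coeffs) * w                 # weighted error, Steps (2.5)-(2.6)
        c_self = np.dot(e, e)
        c_shift = np.dot(e, np.roll(e, 1))
        if c_self < 1e-16:                        # assumption: error is numerically zero
            snrf_e = 0.0
        else:
            snrf_e = c_shift / (c_self - c_shift) if c_self > c_shift else np.inf
        if snrf_e <= threshold:                   # Step (2.7): only noise is left
            break
    return order, coeffs

# Hypothetical histories: a constant one and an exact quadratic one.
order_const, coeffs_const = owaa_fit(np.full(12, 0.3))
t = np.arange(1, 21)
order_quad, coeffs_quad = owaa_fit(0.5 - 0.1 * t + 0.01 * t**2)
```

For the constant history the loop stops at order 0; for the noise-free quadratic it cannot go beyond order 2, because the residual vanishes there.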
Fig. 2.3 A 2-variable function V (p1 , p2 )
Fig. 2.4 Function approximation for historical step sizes
The level of the approximation error signal $e_t$ for step sizes of a certain coordinate $dp^i$, which is the difference between the observed sampled data and the approximated function, can be measured by its standard deviation, as shown in (2.14).
\[ \sigma_{p_i} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (e_t - \bar{e})^2} \tag{2.14} \]
This standard deviation will be called the approximation deviation in the following discussion. It represents the maximum deviation of the location of the search particle from the prediction by the approximated function in the unknown function optimization problem.
2.2.3 Predicting New Step Sizes
The approximated functions will be used to determine the step sizes for the next iteration, as shown in (2.15) and Fig. 2.5 along with the approximated functions.
\[ dp^i_{t+1} = f_i(t + 1) \tag{2.15} \]
Fig. 2.5 Prediction of the step sizes for the next iteration
The step size functions are the model of environment that the learning system builds during the process of interaction based on historical information. The future step size determined by such model can be employed as exploitation of the existing model. However, such model built during the learning process cannot be treated as exact. Besides exploitation which best utilizes the obtained model, exploration is desired to a certain degree in order to improve the model and discover better solutions. The exploration can be implemented using Gaussian random generator (GRG). As a good trade-off between exploitation and exploration is needed, we propose to use the step sizes for the next iteration determined by the step size functions as the mean value and the approximation deviation as the standard deviation of the random generator. Gaussian random generators give several random choices of the step sizes. Effectively, the determined step sizes of multiple coordinates generate the center of the searching area, and the size of the searching range is determined by the standard deviations of GRG for the coordinates. The multiple random values generated by GRG for each coordinate effectively create multiple locations within the searching area. The objective function values of these locations will be compared and the location with the best value, called current best location, will be chosen as the place from which the search particle will continue searching in the next iteration. Therefore, the actual step sizes are calculated using the distance from the “previous best location” to the “current best location”. The actual step sizes will be added in the historical step sizes and used to update the model of the unknown environment.
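The sampling scheme described above can be sketched in a few lines. The sketch assumes NumPy's `Generator.normal` as the Gaussian random generator; the function names `propose_locations` and `best_move`, the test objective and all numeric values are hypothetical.

```python
import numpy as np

def propose_locations(p, dp_pred, sigma, n_points, rng):
    """Candidate locations: each step is drawn from N(dp_pred, sigma) per coordinate."""
    p = np.asarray(p, dtype=float)
    steps = rng.normal(loc=dp_pred, scale=sigma, size=(n_points, p.size))
    return p + steps

def best_move(f, p, candidates):
    """Return the best candidate and the actual step size, or (p, None) if none improves."""
    p = np.asarray(p, dtype=float)
    values = np.array([f(c) for c in candidates])
    k = int(np.argmin(values))
    if values[k] < f(p):
        return candidates[k], candidates[k] - p
    return p, None

rng = np.random.default_rng(0)
f = lambda q: float(np.sum(q ** 2))               # hypothetical objective
p = np.array([1.0, 1.0])                          # previous best location
cands = propose_locations(p, dp_pred=[-0.5, -0.5], sigma=[0.1, 0.1],
                          n_points=10, rng=rng)
new_p, actual_step = best_move(f, p, cands)
```

The predicted step sizes act as the exploitation part (the center of the searching area), while the approximation deviations control how far the exploration spreads around it. The returned `actual_step` is what would be appended to the step-size history.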
Several locations of the search particle in this approach are illustrated in Fig. 2.6 using a 2-variable function as an example. The search particle was located at the previous best location $p_{prev}(p_{1,prev}, p_{2,prev})$, and the previous step size was found as $dp_{prev}(dp^1_{prev}, dp^2_{prev})$ after the current best location $p(p_1, p_2)$ was found as the best location in the previous searching area (an area with $p(p_1, p_2)$ in it, not shown in the figure). At the current best position $p(p_1, p_2)$, using the environment model built with the historical step sizes, the current step size is determined to be $dp^1$ on coordinate 1 and $dp^2$ on coordinate 2, so that the center of the searching area is determined. The approximation deviations of the two coordinates, $\sigma_{p_1}$ and $\sigma_{p_2}$, give the size of the searching range. Within the searching range, several random points are generated in order to find a better position to which the search operator will move.
Fig. 2.6 Step sizes and searching area
2.2.4 Stopping Criterion
The search particle moves from every "previous best location" to the "current best location", and the step sizes actually taken are used for model learning. As new step sizes are generated, the search particle is expected to move to locations with better objective function values. In the proposed algorithm, the search particle only makes a move when a location with a better function value is found. However, if none of the points generated in the current searching range has a better function value than the current best value, the search particle does not move and the GRG will repeat generating groups of particle locations for several trials. If no better location is found after M trials, we suspect that the current searching range is too small or the current step size is too large, which makes us miss the locations with better function values. In such case, we should enlarge the size of the searching area and reduce the step size, as in (2.16),
\[ \sigma_{p_i} = \alpha\,\sigma_{p_i}, \qquad dp^i = \varepsilon\, dp^i \quad (i = 1, 2, \ldots, N), \tag{2.16} \]
where $\alpha > 1$ and $\varepsilon < 1$. If this new search is still not successful, the searching range and the step size will continue changing until some points with better function values are found. If, at a certain step of the search process, the current step size has been reduced so much in the attempt to find a new location with a better function value that the search particle can no longer move anywhere, this indicates that the optimum point has been reached. The stopping criterion can be defined by the current step size being $\beta$ times smaller than the previous step size, as
\[ dp < \beta\, dp_{prev} \quad (0 < \beta < 1,\ \beta \text{ is usually small}). \tag{2.17} \]
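A short sketch of (2.16) and (2.17) follows. The default values α = 1.1, ε = 0.9 and β = 0.005 are the ones reported for the experiments in Section 2.3.1.1; comparing Euclidean norms in (2.17) is an assumption, since the text does not say how the vector of step sizes is compared, and the function names and starting values are hypothetical.

```python
import numpy as np

def widen_and_shrink(sigma, dp, alpha=1.1, eps=0.9):
    """(2.16): enlarge the searching range (alpha > 1) and reduce the step size (eps < 1)."""
    return alpha * np.asarray(sigma, dtype=float), eps * np.asarray(dp, dtype=float)

def should_stop(dp, dp_prev, beta=0.005):
    """(2.17): stop once the current step is beta times smaller than the previous one.

    Assumption: the step-size vectors are compared through their Euclidean norms.
    """
    return np.linalg.norm(dp) < beta * np.linalg.norm(dp_prev)

# Count how many consecutive unsuccessful rounds trigger the stopping criterion.
sigma = np.array([0.2, 0.2])
dp_prev = np.array([0.4, -0.3])
dp = dp_prev.copy()
rounds = 0
while not should_stop(dp, dp_prev):
    sigma, dp = widen_and_shrink(sigma, dp)
    rounds += 1
```

With these values the loop runs 51 times, since 0.9^50 ≈ 0.00515 is still above β while 0.9^51 ≈ 0.00464 is below it. Over the same rounds the searching range grows by a factor of 1.1^51, so the criterion only fires after a wide area has been probed without success.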
2.2.5 Optimization Algorithm
Based on the previous discussion, the proposed optimization algorithm (RLO) can be described as follows.
(a). The procedure starts from a random point of the objective function with N variables, $V = f(p_1, p_2, \ldots, p_N)$. It will try to make a series of moves to get closer to the global optimum point.
(b). To change from the current location, the step size $dp^i$ ($i = 1, 2, \ldots, N$) and the standard deviation $\sigma_{p_i}$ ($i = 1, 2, \ldots, N$) for each coordinate are generated from the uniform probability distribution.
(c). The step sizes $dp^i$ ($i = 1, 2, \ldots, N$) determine the center of the searching area. The deviations of all the coordinates $\sigma_{p_i}$ ($i = 1, 2, \ldots, N$) determine the size of the searching area. Several points $P_s$ in this range are randomly chosen from a Gaussian distribution using $dp^i$ as mean values and $\sigma_{p_i}$ as standard deviations.
(d). The objective function values are evaluated at these new points. Compare the objective function values at the random points with that at the current location.
(e). If the new points generated in Step (c) have no better values than the current position, Step (c) is repeated for up to M trials until a point with a better function value is found.
(f). If the search fails after M trials, enlarge the size of the searching area and reduce the step size, as in (2.16).
(g). If the search with the updated searching area size and step sizes from Step (f) is not successful, the range and the step size will keep being adjusted until some points with better values are found. If instead the current step sizes become much smaller than the previous step sizes, as in (2.17), or the function value changes by less than a prespecified threshold, the algorithm terminates. This indicates that the optimum point has been reached.
(h). Move the search particle to the point $p(p_1, p_2)$ with the best function value $V_b$ (a local minimum or maximum depending on the optimization objective). The distance between the previous best point $p_{prev}(p_{1,prev}, p_{2,prev})$ and the current best point $p(p_1, p_2)$ gives the actual step size $dp^i$ ($i = 1, 2, \ldots, N$). Collect the historical information of the step sizes taken during the search process.
(i). Approximate the step sizes as a function of the iterative steps using the weighted least-square fit as in (2.5). The proper maximum order of the basis functions is determined using the SNRF described in Section 2.2.2 to avoid overfitting.
(j). Use the modeled function to determine the step sizes $dp^i$ ($i = 1, 2, \ldots, N$) for the next iteration step. The difference between the approximated step sizes and the actual step sizes gives the approximation deviation $\sigma_{p_i}$ ($i = 1, 2, \ldots, N$). Repeat Steps (c) to (j).
In general, the optimization algorithm based on reinforcement learning builds a model of successful moves for a given objective function. The model is built from historical successful actions and is used to determine new actions. The algorithm combines exploitation and exploration in the search by using random generators. The optimization algorithm does not require any prior knowledge of the objective function or its derivatives, nor are there any special requirements put on the objective function. The use of the search operator is conceptually very simple and intuitive. In the following section, the algorithm is verified using several experiments.
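Steps (a)-(j) can be put together in a compact sketch. It is a simplified illustration under several assumptions rather than the authors' implementation: the polynomial order is fixed at 2 instead of being selected by OWAA, the initial uniform ranges are arbitrary, a small floor keeps the searching area from collapsing, the function-value threshold of Step (g) is omitted, and the names `predict_step` and `rlo_minimize` and the parameter `max_moves` are hypothetical.

```python
import numpy as np

def predict_step(history, order=2):
    """Steps (i)-(j): weighted polynomial fit per coordinate, evaluated at t = n + 1."""
    H = np.asarray(history, dtype=float)          # shape (n moves, N coordinates)
    n = H.shape[0]
    t = np.arange(1, n + 1, dtype=float)
    w = (n ** (1.0 / n)) ** t / n                 # weights of (2.4)
    B = min(order, n - 1)                         # keep the fit well-posed
    A = np.vander(t, B + 1, increasing=True)
    coeffs, *_ = np.linalg.lstsq(A * w[:, None], H * w[:, None], rcond=None)
    dp_next = (np.vander([n + 1.0], B + 1, increasing=True) @ coeffs)[0]
    sigma = (H - A @ coeffs).std(axis=0)          # approximation deviation (2.14)
    # Assumption: a floor so the searching area never shrinks to a single point.
    sigma = np.maximum(sigma, 0.1 * np.abs(dp_next) + 1e-6)
    return dp_next, sigma

def rlo_minimize(f, p0, n_points=10, M=5, K=60, alpha=1.1, eps=0.9,
                 beta=0.005, max_moves=100, seed=0):
    rng = np.random.default_rng(seed)
    p = np.asarray(p0, dtype=float)               # step (a)
    v = f(p)
    N = p.size
    dp = rng.uniform(-1.0, 1.0, N)                # step (b); ranges are assumptions
    sigma = rng.uniform(0.1, 1.0, N)
    history = []
    for _ in range(max_moves):
        dp_ref, found = dp.copy(), False
        for _ in range(K):                        # up to K searching areas
            for _ in range(M):                    # step (e): up to M trials
                cand = p + rng.normal(dp, sigma, size=(n_points, N))  # step (c)
                vals = np.array([f(c) for c in cand])                 # step (d)
                k = int(np.argmin(vals))
                if vals[k] < v:
                    found = True
                    break
            if found:
                break
            sigma, dp = alpha * sigma, eps * dp   # step (f), (2.16)
            if np.linalg.norm(dp) < beta * np.linalg.norm(dp_ref):    # (2.17)
                break
        if not found:
            break                                 # step (g): treat as converged
        history.append(cand[k] - p)               # step (h): actual step size
        p, v = cand[k], float(vals[k])
        dp, sigma = predict_step(history)
    return p, v

# Usage on a hypothetical convex objective with its minimum at (1, -2).
f = lambda q: float(np.sum((q - np.array([1.0, -2.0])) ** 2))
p0 = np.array([4.0, 4.0])
p_best, v_best = rlo_minimize(f, p0)
```

The sketch only ever accepts improving moves, so the returned value is never worse than the starting one; how close it gets to the optimum depends on the assumptions listed above and on the random seed.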
2.3 Simulation and Discussion
2.3.1 Finding Global Minimum of a Multi-variable Function
2.3.1.1 A Synthetic Bivariate Function
A synthetic bivariate function $V(p_1, p_2) = p_2^2 \sin(1.5 p_2) + 2 p_1^2 \sin(2 p_1) + p_1 \sin(2 p_2)$, used previously in the example in Section 2.2.2, is used as the objective function. This function has several local minima and one global minimum equal to -112.2586. The optimization algorithm starts at a random point and performs the search process looking for the optimum point (the minimum in this example). The number of random points $P_s$ generated in the searching area in each step is 10. The scaling factors $\alpha$ and $\varepsilon$ in (2.16) are 1.1 and 0.9. The $\beta$ in (2.17) is 0.005. One possible search path from the start location to the final optimum location found by the RLO algorithm is shown in Fig. 2.7. The global optimum is found in 13 iterative steps. The historical locations are shown in the figure as well. The historical step sizes taken during the search process are shown in Fig. 2.8 with their approximation by WLSF. Another search process, starting from another random point, is shown in Fig. 2.9. The global optimum is found in 10 iterative steps. Table 2.1 shows the changes in the numerical function values and the adjustment of the step sizes dp1 and dp2 for p1 and p2 in the successive search steps. Notice how the step size was initially reduced and then increased again once the algorithm started to follow a correct path towards the optimum.
Fig. 2.7 Search path from start point to optimum
Fig. 2.8 Step sizes taken during the search process
Fig. 2.9 Search path from start point to optimum
Table 2.1 Function values and step sizes in a searching process

Search step   Function value V(p1, p2)   Step size dp1   Step size dp2
1                1.4430
2              -34.8100                    2.9455          0.8606
3              -61.4957                    0.3570         -1.7924
4              -69.8342                   -0.0508         -0.7299
5              -70.5394                   -0.0477         -0.3114
6              -71.5813                   -0.1232          0.2015
7             -109.0453                    0.0000          4.4358
8             -110.8888                   -0.0281          0.3408
9             -112.0104                    0.0495         -0.0531
10            -112.1666                    0.0438         -0.0772
This search process was repeated for 300 random trials. The success rate of finding the global optimum is 93.78%. On average, it takes 5.9 steps and 4299 function evaluations to find the optimum in this problem. The same problems were tested with several other direct search based optimization algorithms, including SA [29], PSO [14] and OSSRS [2]. The success rates of finding the global optimum and the average numbers of function evaluations are compared in Tables 2.2, 2.3 and 2.4. All the simulations were performed using an Intel Core Duo 2.2GHz based PC with 2GB of RAM.
Table 2.2 Comparison of optimization performances on synthetic function

                                             RLO      SA       PSO      OSSRS
Success rate of finding the global optimum   93.78%   29.08%   94.89%   52.21%
Number of function evaluations               4299     13118    4087     313
CPU time consumption [s]                     28.4     254.35   20.29    1.95
2.3.1.2 Six-Hump Camel Back Function
The classic 2D six-hump camel back function [5] has 6 local minima and 2 global minima. The function is given as
\[ V(p_1, p_2) = \left(4 - 2.1 p_1^2 + \frac{p_1^4}{3}\right) p_1^2 + p_1 p_2 + \left(-4 + 4 p_2^2\right) p_2^2 \quad (p_1 \in [-3, 3],\ p_2 \in [-2, 2]). \]
Within the specified bounded region, the function has 2 global minima equal to -1.0316. The optimization performances of these algorithms from 300 random trials are compared in Table 2.3.
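As a quick check of the formula, the function can be evaluated at its two global minimizers. The coordinates (0.0898, -0.7126) and (-0.0898, 0.7126) are the commonly quoted values for this benchmark and are not given in the text; the function name is hypothetical.

```python
def six_hump_camel(p1, p2):
    """Six-hump camel back function on p1 in [-3, 3], p2 in [-2, 2]."""
    return ((4 - 2.1 * p1**2 + p1**4 / 3) * p1**2
            + p1 * p2
            + (-4 + 4 * p2**2) * p2**2)

# The function is symmetric under (p1, p2) -> (-p1, -p2), hence two equal minima.
v_a = six_hump_camel(0.0898, -0.7126)
v_b = six_hump_camel(-0.0898, 0.7126)
```

Both evaluations give approximately -1.0316, the value of the global minimum for this benchmark.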
Table 2.3 Comparison of optimization performances on six-hump camel back function

                                             RLO      SA       PSO      OSSRS
Success rate of finding the global optimum   80.33%   45.22%   86.44%   42.67%
Number of function evaluations               5016     8045.5   3971     256
CPU time consumption [s]                     33.60    151.86   20.35    1.63

2.3.1.3 Banana Function
Rosenbrock's famous "banana function" [23], $V(p_1, p_2) = 100(p_2 - p_1^2)^2 + (1 - p_1)^2$, has 1 global minimum equal to 0 lying inside a narrow, curved valley. The optimization performances of these algorithms from 300 random trials are compared in Table 2.4.

Table 2.4 Comparison of optimization performances on banana function

                                             RLO       SA       PSO      OSSRS
Success rate of finding the global optimum   74.55%    3.33%    41%      88.89%
Number of function evaluations               48883.7   28412    4168     882.4
CPU time consumption [s]                     320.74    539.38   20.27    5.15
In these optimization problems, RLO demonstrates consistently satisfactory performance without particular tuning of the parameters. The other methods, however, show different levels of efficiency and different capabilities of handling the various problems.
2.3.2 Optimization of Weights in Multi-layer Perceptron Training
The output of a multi-layer perceptron (MLP) can be looked at as the value of a function with the weights as the approximation variables. Training the MLP, in the sense of finding optimal values of the weights to accomplish the learning task, can be treated as an optimization problem. We can take the Iris plant database [22] as a testing case. The Iris database contains 3 classes, 5 numerical features and 150 samples. In order to accomplish the classification of the iris samples, a 3-layered MLP with an input layer, a hidden layer and an output layer can be used. The size of the input layer should be equal to the number of features. The size of the hidden layer is chosen to be 6, and since the class IDs are numerical values equal to 1, 2 and 3, the size of the output layer is 1. The weight matrix between the input layer and the hidden layer contains 30 elements, and the one between the hidden layer and the
output layer contains 6 elements. Overall, there are 36 weight elements (parameters) to be optimized. In a typical trial, the optimization algorithm finds the optimal set of weights after only 3 iterations. In the testing stage, the outputs of the MLP are rounded to the nearest integers to indicate the predicted class IDs. Comparing the given class IDs with the class IDs predicted by the MLP in Fig. 2.10, we find that 146 of the 150 iris samples are correctly classified by this set of weights, so the percentage of correct classification is 97.3%. A single support vector machine (SVM) achieved a 96.73% classification rate [12]. In addition, an MLP with the same structure, trained by back-propagation (BP), achieved 96% on the Iris test case. The MLP and BP are implemented using the MATLAB neural network toolbox.
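How the 36 weights become the variables of an objective function can be sketched as follows. The text does not specify the activation functions or the error measure, so tanh hidden units, a linear output, the absence of separate bias terms and a mean-squared-error objective are all assumptions, and the function names and sample data are hypothetical.

```python
import numpy as np

def mlp_forward(weights, X):
    """5-6-1 MLP driven by a flat vector of 36 weights (30 + 6)."""
    weights = np.asarray(weights, dtype=float)
    W1 = weights[:30].reshape(5, 6)      # input layer -> hidden layer
    W2 = weights[30:36].reshape(6, 1)    # hidden layer -> output layer
    return np.tanh(X @ W1) @ W2          # shape (n_samples, 1)

def training_objective(weights, X, y):
    """Value to be minimized by the optimizer: mean squared error on the class IDs."""
    out = mlp_forward(weights, X)[:, 0]
    return float(np.mean((out - y) ** 2))

def accuracy(weights, X, y):
    """Round the outputs to the nearest integer to obtain predicted class IDs."""
    pred = np.rint(mlp_forward(weights, X)[:, 0])
    return float(np.mean(pred == y))

# Hypothetical data with 5 input features and class IDs 1, 2 and 3.
X = np.ones((4, 5))
y = np.array([1.0, 2.0, 3.0, 1.0])
loss_at_zero = training_objective(np.zeros(36), X, y)
```

Any optimizer that only needs function values, such as the search procedure of Section 2.2.5, can then be applied to `training_objective` with the 36-element weight vector as its variables.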
Fig. 2.10 RLO performance on neural network training on Iris problem
2.3.3 Micro-saccade Optimization in Active Vision for Machine Intelligence
In the area of machine intelligence, active vision has become an interesting topic. Instead of taking in the whole scene captured by the camera and making sense of all the information, as in the conventional computer vision approach, an active vision agent focuses on small parts of the scene and moves its fixation frequently. Humans and other animals use such quick movements of both eyes, called saccades [3], to focus on the interesting parts of the scene and use their resources efficiently. The interesting parts are usually important features of the input, and with the important features extracted, the high-resolution scene is analyzed and recognized with a relatively small number of samples. In a saccade movement network (SMN) presented in [16], the original images are transformed into a set of low resolution images after saccade movements and retina
sampling. The set of images, as the sampled features, are fed to the self-organizing winner-take-all classifier (SOWTAC) network for recognition. To find interesting features of the input image and to direct the movements of saccade, image segmentation, edge detection and basic morphology tools [4] are utilized.
Fig. 2.11 Face image and its interesting features in active vision [16]
Fig. 2.11 (a) shows a face image from [7] with 320×240 pixels. The interesting features found are shown in Fig. 2.11 (b). The stars represent the centers of the four interesting features found on the face image and the rectangles represent the feature boundaries. The retina sampling model [16] then places its fovea at the center of each interesting feature, so that these features are extracted. In practice, the centers of the interesting features found by image processing tools [4] are not guaranteed to be the accurate centers, which affects the accuracy of the feature extraction and pattern recognition process. To help find the optimum sampling position, the RLO algorithm can be used to direct the movement of the fovea of the retina and find the closest match between the obtained sample features and pre-stored reference sample features. These slight moves during fixation to find the optimum sampling positions can be called microsaccades in the active vision process, although the actual role of microsaccades has been debated without resolution for several decades [19]. Fig. 2.12 (a) shows a group of ideal samples of important features in face recognition. Fig. 2.12 (b) shows the group of sampled features with initial sampling positions. In the optimization process, the x-y coordinates need to be optimized so that the sampled images have the optimum similarity to the ideal images. The level of similarity can be measured by the sum of squared intensity differences [9]. Under this metric, increased similarity corresponds to a decreased intensity difference. Such a problem can also be viewed as an image registration problem. The two-variable objective function V(x, y), the sum of squared intensity differences, needs to be minimized
Fig. 2.12 Image sampling by micro-saccade
through the RLO algorithm. Note that the only information available is that V is a function of the x and y coordinates. How the function is expressed and what its characteristics are is entirely unknown. The minimum value of the objective function is not known either. RLO is a suitable algorithm for such an optimization problem. Fig. 2.12 (c) shows the optimized sampled images using RLO-directed microsaccades. The optimized feature samples are closer to the ideal feature samples, which helps the processing of the face image. After the feature images are obtained through RLO-directed microsaccades, these low-resolution images, instead of the entire high-resolution face image, are sent to the SOWTAC network for further processing or recognition.
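The objective V(x, y) described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the image is a synthetic grid of integers, the helper names (`crop`, `ssd`, `make_objective`, `local_search`) are hypothetical, and the exhaustive neighbourhood search is a plain stand-in for the optimizer, not the RLO algorithm itself.

```python
def crop(image, x, y, w, h):
    """Return the h-by-w patch whose top-left corner is at column x, row y."""
    return [row[x:x + w] for row in image[y:y + h]]

def ssd(patch, reference):
    """Sum of squared intensity differences between two equal-sized patches."""
    return sum((p - r) ** 2
               for prow, rrow in zip(patch, reference)
               for p, r in zip(prow, rrow))

def make_objective(image, reference):
    """Build V(x, y): dissimilarity between the sampled patch and the reference."""
    h, w = len(reference), len(reference[0])
    def V(x, y):
        return ssd(crop(image, x, y, w, h), reference)
    return V

def local_search(V, x0, y0, radius, x_max, y_max):
    """Exhaustive stand-in for the optimizer: try every offset within `radius`."""
    candidates = [(x, y)
                  for x in range(max(0, x0 - radius), min(x_max, x0 + radius) + 1)
                  for y in range(max(0, y0 - radius), min(y_max, y0 + radius) + 1)]
    return min(candidates, key=lambda p: V(*p))

# Synthetic 12x12 image with a distinctive 3x3 feature at column 5, row 4.
image = [[(3 * r + 7 * c) % 11 for c in range(12)] for r in range(12)]
reference = [[50, 60, 70], [80, 90, 100], [110, 120, 130]]
for dy, row in enumerate(reference):
    for dx, v in enumerate(row):
        image[4 + dy][5 + dx] = v

V = make_objective(image, reference)
# Start from an inaccurate initial centre and search nearby sampling positions.
best = local_search(V, x0=3, y0=6, radius=3, x_max=9, y_max=9)
```

The optimizer sees V only as a black box of two variables, which is the setting the text describes; a smarter search would simply replace `local_search`.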
2.4 Conclusions
In this chapter, a novel and efficient optimization algorithm is presented for problems in which the objective functions are unknown. The search particle is able to build a model of successful actions and choose its future action based on past exploring experience. The decisions on the step sizes (and directions) are made based on a trade-off between exploitation of the known search path and exploration for an improved search direction. In this sense, this algorithm falls into the category of reinforcement learning based optimization (RLO) methods. The algorithm does not require any prior knowledge of the objective function, nor does it require any characteristics of such a function. It is conceptually very simple and intuitive as well as very easy to implement and tune. The optimization algorithm was tested and verified using several multi-variable functions and compared with several other widely used random search optimization
algorithms. Furthermore, the training of a multi-layer perceptron (MLP), based on finding a set of optimized weights to accomplish the learning, is treated as an optimization problem. The proposed RLO was used to find the weights of the MLP in the training problem on the Iris database. Finally, the algorithm is used in the image recognition process to find a familiar object with retina sampling and micro-saccades. The performance of RLO depends to a certain degree on the values of several parameters that the algorithm uses. With certain preset parameters, the performance of RLO meets our requirements in several machine learning problems involved in our current research. In future research, a theoretical and systematic analysis of the effect of these parameters will be conducted. In addition, using a group of search particles and their cooperation and competition, a population based RLO can be developed. With the help of model approximation techniques and the trade-off between exploration and exploitation proposed in this work, the population based RLO is expected to have better performance.
References
1. Arfken, G.: Lagrange Multipliers, 3rd edn. §17.6 in Mathematical Methods for Physicists, pp. 945–950. Academic Press, Orlando (1985)
2. Belur, S.: A random search method for the optimization of a function of n variables. MATLAB central file exchange, http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=100
3. Cassin, B., Solomon, S.: Dictionary of Eye Terminology. Triad Publishing Company, Gainesville (1990)
4. Detecting a Cell Using Image Segmentation. Image Processing Toolbox, the Mathworks, http://www.mathworks.com/products/image/demos.html
5. Dixon, L.C.W., Szego, G.P.: The optimization problem: An introduction. Towards Global Optimization II. North Holland, New York (1978)
6. Nelder, J.A., Mead, R.: A simplex method for function minimization. The Computer Journal 7, 308–313 (1965)
7. Facegen Modeller. Singular Inversions, http://www.facegen.com/products.htm
8. del Toro Garcia, X., Neri, F., Cascella, G.L., Salvatore, N.: A surrogate associated Hooke-Jeeves algorithm to optimize the control system of a PMSM drive. IEEE ISIE, 347–352 (July 2006)
9. Hill, D.L.G., Batchelor, P.: Registration methodology: concepts and algorithms. In: Hajnal, J.V., Hill, D.L.G., Hawkes, D.J. (eds.) Medical Image Registration. CRC, Boca Raton (2001)
10. Hooke, R., Jeeves, T.A.: Direct search solution of numerical and statistical problems. Journal of the Association for Computing Machinery 8, 212–229 (1961)
11. Kennedy, J., Eberhart, R.C.: Particle swarm optimization. In: Proc. IEEE Int. Conf. Neural Networks, Perth, Australia, December 1995, vol. 4, pp. 1942–1948 (1995)
12. Kim, H., Pang, S., Je, H.: Support vector machine ensemble with bagging. In: Lee, S.-W., Verri, A. (eds.) SVM 2002. LNCS, vol. 2388, p. 397. Springer, Heidelberg (2002)
13. Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
14. Leontitsis, A.: Hybrid Particle Swarm Optimization, MATLAB central file exchange, http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=6497
15. Lewis, R.M., Torczon, V., Trosset, M.W.: Direct search methods: Then and now. Journal of Computational and Applied Mathematics 124(1), 191–207 (2000)
16. Li, Y.: Active Vision through Invariant Representations and Saccade Movements. Master thesis, School of Electrical Engineering and Computer Science, Ohio University (2006)
17. Liu, Y., Starzyk, J.A., Zhu, Z.: Optimized Approximation Algorithm in Neural Networks without overfitting. IEEE Trans. on Neural Networks 19(4), 983–995 (2008)
18. Lustig, I.J., Marsten, R.E., Shanno, D.F.: Computational Experience with a Primal-Dual Interior Point Method for Linear Programming. Linear Algebra and its Application 152, 191–222 (1991)
19. Martinez-Conde, S., Macknik, S.L., Hubel, D.H.: The role of fixational eye movements in visual perception. Nature Reviews Neuroscience 5(3), 229–240 (2004)
20. Ong, Y.-S.: Max-min surrogate-assisted evolutionary algorithm for robust design. IEEE Trans. on Evolutionary Computation 10(4), 392–404 (2006)
21. Powell, M.J.D.: An efficient method for finding the minimum of a function of several variables without calculating derivatives. The Computer Journal 7, 155–162 (1964)
22. Fisher, R.A.: Iris Plants Database (July 1988), http://faculty.cs.byu.edu/~cgc/Teaching/CS_478/iris.arff
23. Rosenbrock, H.H.: An automatic method for finding the greatest or least value of a function. The Computer Journal 3, 175–184 (1960)
24. Sheela, B.V.: An optimized step-size random search. Computer Methods in Applied Mechanics and Engineering 19(1), 99–106 (1979)
25. Snyman, J.A.: Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms. Springer, Heidelberg (2005)
26. Starzyk, J.A.: Motivation in Embodied Intelligence. In: Frontiers in Robotics, Automation and Control, October 2008, pp. 83–110. I-Tech Education and Publishing (2008), http://www.intechweb.org/book.php?%20id=78&content=subject&sid=11
27. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
28. Torczon, V.: On the Convergence of Pattern Search Algorithms. SIAM Journal on Optimization 7(1), 1–25 (1997)
29. Vandekerckhove, J.: General simulated annealing algorithm, MATLAB central file exchange, http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=10548
30. Ypma, T.J.: Historical development of the Newton-Raphson method. SIAM Review 37(4), 531–551 (1995)
Chapter 3
The Use of Opposition for Decreasing Function Evaluations in Population-Based Search Mario Ventresca, Shahryar Rahnamayan, and Hamid Reza Tizhoosh
Abstract. This chapter discusses the application of opposition-based computing to reducing the number of function calls required to perform optimization by population-based search. We provide motivation and a comparison to similar, but different, approaches including antithetic variates and quasi-randomness/low-discrepancy sequences. We employ differential evolution and population-based incremental learning as optimization methods for image thresholding. Our results confirm improvements in required function calls, as well as support the oppositional principles used to attain them.
Mario Ventresca · Hamid Reza Tizhoosh
Department of Systems Design Engineering, The University of Waterloo, Ontario, Canada
e-mail: [email protected], [email protected]
Shahryar Rahnamayan
Faculty of Engineering and Applied Science, The University of Ontario Institute of Technology, Ontario, Canada
e-mail: [email protected]
Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 49–71. © Springer-Verlag Berlin Heidelberg 2010. springerlink.com

3.1 Introduction
Global optimization is concerned with discovering an optimal (minimum or maximum) solution to a given problem, generally within a large search space. In some instances the search space may be simple (i.e. concave or convex optimization can be used). However, most real world problems are multi-modal and deceptive [5], which often causes traditional optimization algorithms to become trapped at local optima. Many strategies have been developed to overcome this for global optimization including, but not limited to, simulated annealing [9], tabu search [4], evolutionary algorithms [7] and swarm intelligence [3]. Some of these methods employ a single-solution-per-iteration methodology whereby only one solution is generated and successively perturbed towards more
appropriate solutions (e.g. simulated annealing and tabu search). Single-solution methods inherently require low computational time per iteration; however, they often lack sufficient diversity to adequately explore the search space. An alternative is population-based techniques, where many solutions are generated at each iteration and all (or most) are used to determine the search direction (e.g. evolutionary and swarm intelligence algorithms). By considering many solutions per iteration these methods have a higher diversity, but they require a large amount of computation. In this chapter we focus on population-based optimization.
Real-world problems also present the possible issue of complexity in the evaluation of a solution. That is, determining the quality of a given solution is computationally expensive. Investigation into simpler evaluation metrics is one possible direction; however, it may be found that evaluation is still expensive. Then, reducing the time spent (i.e. the number of function evaluations) becomes an important goal.
Opposition-based computing (OBC) is a newly devised concept, having as one of its aims the improvement of convergence rates of algorithms by defining and simultaneously considering pairs of opposite solutions [24]. This improved convergence rate is also usually accompanied by a more desirable final result. To date, OBC has shown improvements in reinforcement learning [20, 22, 23], evolutionary algorithms [14, 15, 16], ant colony optimization [12], simulated annealing [28], estimation of distribution [29], and neural networks [26, 27, 30]. In this chapter we discuss the application of OBC to reducing the number of function calls required to achieve a desired accuracy for population-based searches. We show the theoretical reasoning behind OBC (which has roots in monotonic optimization) and provide mathematical motivations for its ability to reduce function calls and improve the accuracy of simulation results.
The key factor in accomplishing this is the simultaneous consideration of negatively associated variables and their effect on the target evaluation function and search process. We choose to highlight the improvements for the task of image thresholding using differential evolution and population-based incremental learning. The rest of this chapter is organized as follows: Section 3.2 discusses the theoretical motivations behind our approach. Differential evolution and population-based incremental learning are discussed in Section 3.3, as are their respective opposition-based counterparts. The experimental setup is given in Section 3.4 and results are presented in Section 3.5. Conclusions are given in Section 3.6.
3.2 Theoretical Motivations
In this section we introduce notations and definitions required to explain the concept of opposition and its ability to reduce the number of function calls. We also provide a brief comparison of OBC to antithetic variates and quasi-random/low-discrepancy sequences.
3.2.1 Definitions and Notations
In the following definitions, assume that $A \subseteq \Re^d$ is non-empty, compact and $d$-dimensional. Without loss of generality, $f : A \to \Re$ is a continuous function to be maximized. We assume all elements of $A$ are feasible. The purpose of a global search method is to discover the global optima (either minimum or maximum) of a given function and not converge to one of the local optima.

Definition 1 (Global Optimum). A solution $\theta^* \in A$ is a global optimum if $f(\theta) \le f(\theta^*)$ for all $\theta \in A$. There may exist more than one global optimum.

Definition 2 (Local Optimum). A solution $\theta' \in A$ is a local optimum if there exists an $\varepsilon$-neighborhood $N_\varepsilon(\theta')$ with radius $\varepsilon > 0$, where $g(\theta, \theta') < \varepsilon$ for distance function $g$ and $\theta \in A \cap N_\varepsilon(\theta')$, and $f(\theta) \le f(\theta')$.

Recent research [1, 25, 32] has shown the benefit of utilizing monotonic transformations of the evaluation criteria as a means of discovering global optima. This causes a reordering of the solutions, and a gradient-based method can be used to search the reordered space. An issue with these convexification and concavification methods, which transform certain functions to a monotonic form, is that the mapping must be known a priori. Otherwise, optimization on the transformed function is unreliable.

Definition 3 (Monotonicity). A function $\phi : \Re \to \Re$ is monotonic if for $x, y \in \Re$ and $x < (>)\, y$, then $\phi(x) \le (\ge)\, \phi(y)$. A strictly monotonic function is one which does not permit equality (i.e. $\phi(x) < (>)\, \phi(y)$).

Theoretically, a monotonic transformation is ideal; however, OBC does not require it. Instead, opposition extends the monotonic global search idea through the use of opposite solutions, which are simultaneously considered, and the more desirable (w.r.t. $f$ and the problem definition) is used during the search.

Definition 4 (Opposite). A pair $(x, \breve{x}) \in A$ are opposites if there exists a function $\Phi : A \to A$ where $\Phi(x) = y$ and $\Phi(y) = x$. The breve notation will be used to denote the opposite element (i.e. $\breve{x} = \Phi(x) = y$).

The function $\Phi$ referred to in Definition 4 is the key to employing opposition-based techniques. It determines which elements are opposites, and a poorly selected function could lead to poor performance (see the following section).

Definition 5 (Opposite Mapping). A one-to-one function $\Phi : A \to A$ where every pair $x, \breve{x} \in A$ is unique (i.e. for $z \in A$, if $\Phi(x) = y$ and $\Phi(y) = x$ then there does not exist $\Phi(y) = z$ or $\Phi(x) = z$).

$\Phi$ can be determined via prior knowledge, intuition or through some a priori or online learning procedure. Simultaneous use of the opposites (for example, in a maximization problem) is easily accomplished by allowing $f(\theta) = f(\breve{\theta}) = \max(f(\theta), f(\breve{\theta}))$ and searching for a solution $S$ which corresponds to the most desirable solution,
$$S = \arg\max_{\theta} f(\theta). \tag{3.1}$$
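As a concrete illustration of Definitions 4 and 5, the sketch below uses the reflection $\Phi(x) = a + b - x$ on a box $[a, b]$, which is the map that appears later in Eq. (3.13); treating it as the opposite map here is an assumption made for the example, and the objective `f` and the helper names are hypothetical.

```python
def opposite(x, lower, upper):
    """Candidate opposite map Phi(x) = lower + upper - x, applied per coordinate."""
    return [a + b - xi for xi, a, b in zip(x, lower, upper)]

def with_opposition(f, lower, upper):
    """Return g(theta) = max(f(theta), f(opposite(theta))) for a maximization problem."""
    def g(theta):
        return max(f(theta), f(opposite(theta, lower, upper)))
    return g

lower, upper = [0.0, -2.0], [1.0, 2.0]

def f(th):
    # Hypothetical objective to be maximized, peaked at (0.8, 1.0).
    return -(th[0] - 0.8) ** 2 - (th[1] - 1.0) ** 2

g = with_opposition(f, lower, upper)

theta = [0.25, -1.5]
theta_opp = opposite(theta, lower, upper)
```

Because this $\Phi$ is its own inverse, applying it twice returns the original point, and `g` assigns the same value to a point and to its opposite, which is the identification $f(\theta) = f(\breve{\theta})$ used in the text.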
3.2.2 Consequences of Opposition
As with a monotonic transformation, the “optimal” opposition map is one which effectively reorders elements of $A$ such that $\phi(f)$ is monotonic. Under this transformation $\mathrm{cor}(X, \breve{X}) \le 0$ for $X, \breve{X} \subset A$, as is shown in Figure 3.1.
Fig. 3.1 Example of a transformed evaluation function to a monotonic function (evaluation, from 0 to 1, plotted against solution, with the regions $X$ and $\breve{X}$ marked). The values in $X$ and $\breve{X}$ are negatively correlated. The original function (not shown) could have been any nonlinear, continuous function
The implementation of opposition for an optimization problem involves the simultaneous examination of $x$ and $\breve{x}$ and returns the more desirable of the two. That is, we aim for a negative correlation between the two guesses. A consequence of $\Phi$ is an effective halving of possible evaluations (as shown in Figure 3.2) where $f(\theta) = f(\breve{\theta}) = \max(f(\theta), f(\breve{\theta}))$. The halving results from full information of the transformed function, such that the opposite solution need not be observed to compute the max operation. In the more general case, it is sufficient to determine a function such that $\Pr(\max(f(\theta_1), f(\breve{\theta}_1)) > \max(f(\theta_1), f(\theta_2))) > 0.5$, for $\theta_1, \theta_2 \in A$. A further consequence of simultaneous consideration of opposites is a provably more desirable $E[f]$ and lower variance [28]. Alone this does not guarantee a higher quality outcome; the probability density function of the opposite-transformed function values should also be more desirable than the joint p.d.f. of random sampling (i.e. the distribution corresponding to $\Pr(\max(x_1, x_2))$).
Fig. 3.2 Taking f(x) = max(f(x), f(−x)), we see that the possible evaluations in the search space have been halved in the optimal situation of full knowledge (evaluation, from 0 to 1, plotted against solution; curves: the transformed monotonic function, and the function representing the maximum of a solution and its opposite). In the general case, the transformed function will have a more desirable mean and lower variance
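The claim about a more desirable mean and lower variance can be checked numerically in the idealized case of a known monotonic evaluation. The sketch below is a small Monte Carlo experiment under stated assumptions: the evaluation is the hypothetical $f(x) = x$ on $[0, 1]$, the opposite map is $\Phi(x) = 1 - x$, and guesses are drawn uniformly. It compares the value obtained from a guess and its opposite with the value obtained from two independent guesses.

```python
import random
import statistics

def f(x):
    return x                      # hypothetical monotonic evaluation on [0, 1]

def phi(x):
    return 1.0 - x                # assumed opposite map on [0, 1]

rng = random.Random(1)
n = 20000
opp_vals, rnd_vals = [], []
for _ in range(n):
    x, y = rng.random(), rng.random()
    opp_vals.append(max(f(x), f(phi(x))))   # guess paired with its opposite
    rnd_vals.append(max(f(x), f(y)))        # guess paired with an independent guess

mean_opp, var_opp = statistics.fmean(opp_vals), statistics.pvariance(opp_vals)
mean_rnd, var_rnd = statistics.fmean(rnd_vals), statistics.pvariance(rnd_vals)
```

For this toy case the exact values are known: $\max(x, 1 - x)$ is uniform on $[0.5, 1]$ with mean $3/4$ and variance $1/48$, whereas the maximum of two independent uniform draws has mean $2/3$ and variance $1/18$. The example says nothing about functions for which a good $\Phi$ is not known.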
While not investigated in this chapter, we make the observation that successive applications of different opposite maps will lead to further smoothing of $f$. For example,
$$f^2(\theta) = \max\big(f(z = \arg\max_{\theta} f(\theta)),\, f(\breve{z})\big) \tag{3.2}$$
where $\breve{z}$ is determined via opposite map $\Phi_2 \neq \Phi_1$ and the superscript in $f^2$ indicates the two applications of opposite mappings. In the limit,
$$\lim_{i \to \infty} f^i(\theta) = \max\big(f^i(z_i = \arg\max_{\theta_i} f^{i-1}(\theta_i)),\, f(\breve{z}_i)\big) = f(\theta^*) \tag{3.3}$$
for $i > 0$ and global optimum $f(\theta^*)$. Effectively, this flattens the entire error surface of $f$, except for the global optimum (or optima). A more feasible alternative is to use $k > 0$ transformations which give reasonable results where $0 \le |f^{k-1} - f^k| < \varepsilon$ does not diminish greatly.
3.2.3 Lowering Function Evaluations
We briefly discuss conditions on the design of the opposition map $\Phi$ which often lead to improvements over purely random sampling with respect to lowering function evaluations. If using an algorithm solely based on randomly generating solutions to the problem at hand, then we require that for some $\varepsilon > 0$ and $\delta > 0.5$
$$\Pr\big(\max(f(x), f(\breve{x})) - \max(f(x), f(y)) > \varepsilon\big) > \delta \tag{3.4}$$
where $x, y, \breve{x} \in A$. That is, the distribution of $\max(x, \breve{x})$ must be more desirable than the distribution of i.i.d. random guesses. If this condition is met, then the probability that the optimal solution (or one of higher quality) is discovered is higher using opposite guesses. The goal in developing $\Phi$ is to maximize $\varepsilon$ and $\delta$. A similar goal is to determine $\Phi$ such that $E[g(x, \breve{x})]$ is maximized for some distance function $g$. Satisfying this condition implies (3.4). Thus, probabilistically we expect a lower number of function calls to find a solution of a given thresholded quality. If employing this strategy within a guided search method, the dynamics of the algorithm must be considered to assure the guarantee (i.e. the algorithm adds bias to the search which affects the convergence rate). Practically, the simplest manner of deciding on a satisfactory $\Phi$ is through intuition or prior knowledge about $f$. A possibility is to utilize modified low-discrepancy sequences (see below), which aim to distribute guesses evenly throughout the search space.
3.2.4 Comparison to Existing Methods
While opposition may sometimes employ methods from antithetic variates and low-discrepancy sequences, in general that is not the case. To elucidate the uniqueness of opposition, in the following we distinguish it from these two methods.
3.2.4.1 Antithetic Variates
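Suppose we desire to estimate $\xi = E[f] = E[Y_1] = E[Y_2]$ with unbiased estimator
$$\hat{\xi} = \frac{Y_1 + Y_2}{2}. \tag{3.5}$$
In general $\mathrm{var}(\hat{\xi}) = \big(\mathrm{var}(Y_1) + \mathrm{var}(Y_2) + 2\,\mathrm{cov}(Y_1, Y_2)\big)/4$, so if $Y_1, Y_2$ are i.i.d. then $\mathrm{var}(\hat{\xi}) = \mathrm{var}(Y_1)/2$. However, if $\mathrm{cov}(Y_1, Y_2) < 0$ then the variance can be further reduced. One method to accomplish this is through the use of a monotonic function $h$. Generate $Y_1$ as an i.i.d. value as before, but utilizing $h$ our two variables are $h(Y_1)$ and $h(Y_2)$ with $Y_2 = 1 - Y_1$, where $h$ is monotonic over the interval [0,1]. Then
$$\hat{\xi} = \frac{h(Y_1) + h(Y_2)}{2} \tag{3.6}$$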
will be an unbiased estimator of E[ f ]. Opposition is similar in its selection of negatively correlated samples. However, in the antithetic approach there is no guideline to construct such a monotonic function, although such a function has been proven to exist [17]. Opposition provides various means to accomplish this, as well as to incorporate the idea into optimization while guaranteeing more desirable expected values and lower variance in the target function.
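The variance reduction obtained from antithetic pairs can be seen in a short Monte Carlo sketch. The choice $h(u) = e^u$ is an assumption made purely for illustration (it is monotonic on $[0, 1]$ and its mean, $e - 1$, is known exactly); nothing here is specific to opposition-based computing.

```python
import math
import random
import statistics

def h(u):
    return math.exp(u)            # monotonic on [0, 1]; E[h(U)] = e - 1

rng = random.Random(2)
n = 20000
independent, antithetic = [], []
for _ in range(n):
    u1, u2 = rng.random(), rng.random()
    independent.append((h(u1) + h(u2)) / 2)        # two i.i.d. draws, as in Eq. (3.5)
    antithetic.append((h(u1) + h(1.0 - u1)) / 2)   # a draw and its antithetic partner, Eq. (3.6)

est_ind, var_ind = statistics.fmean(independent), statistics.pvariance(independent)
est_ant, var_ant = statistics.fmean(antithetic), statistics.pvariance(antithetic)
```

Both estimators target the same mean, but the antithetic pairs are negatively correlated, so the per-pair variance `var_ant` comes out far smaller than `var_ind`.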
Further, opposition extends beyond the generation of solutions in random sampling-based algorithms. It can also be applied to algorithm behavior and can be used to relate concepts expressed linguistically, where we have not found evidence that antithetic variates have application.
3.2.4.2 Quasi-Randomness and Low-Discrepancy Sequences
These methods aim to combine the randomness of pseudorandom generators, which select values i.i.d., with the advantages of generating points distributed in a grid-like fashion. The goal is to uniformly distribute values over the search space by achieving a low discrepancy. However, this is achieved at the cost of statistical independence. The discrepancy of a sequence is a measure of its uniformity and can be calculated via [6]:
$$D_N(X) = \sup_{B \in J} \left| \frac{|B \cap X|}{N} - \lambda_d(B) \right| \tag{3.7}$$
where $\lambda_d$ is the $d$-dimensional Lebesgue measure, $|B \cap X|$ is the number of points of $X = (x_1, \ldots, x_N)$ that fall into interval $B$, and $J$ is the set of $d$-dimensional intervals defined as:
$$\prod_{i=1}^{d} [a_i, b_i) = \{x \in \Re^d : a_i \le x_i < b_i\} \tag{3.8}$$
for $0 \le a_i < b_i \le 1$. That is, the actual number of points within each interval for a given sample is close to the expected number. Such sequences have been widely studied in quasi-Monte Carlo methods [10]. Opposition may utilize low-discrepancy sequences in some situations. In general, though, low-discrepancy sequences are simply a means of attaining a uniform distribution without regard to the correlation between the evaluations at these points. Further, opposition-based techniques simultaneously consider two points in order to smooth the evaluation function and improve the performance of the sampling algorithm, whereas quasi-random sequences are often concerned with many more points. These methods have been applied to evolutionary algorithms, where a performance study of different sampling methods (Uniform, Normal, Halton, Sobol, Faure, and Low-Discrepancy) found them to be valuable only for low-dimensional ($d < 10$, and so non-highly-sparse) populations [11].
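To make the notion of discrepancy tangible, the sketch below compares a one-dimensional low-discrepancy sequence (the base-2 van der Corput sequence, the building block of the Halton sequence) with pseudorandom points. Two simplifications are assumed: the example is one-dimensional, and it computes the star discrepancy, which restricts the intervals of Eq. (3.7) to the form $[0, b)$ and has a closed form for sorted points; it is therefore a related but not identical quantity to $D_N$.

```python
import random

def van_der_corput(n, base=2):
    """n-th element (n >= 1) of the van der Corput sequence: the digits of n mirrored."""
    value, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        value += digit / denom
    return value

def star_discrepancy_1d(points):
    """Exact 1-D star discrepancy (intervals [0, b) only) from the sorted points."""
    xs = sorted(points)
    n = len(xs)
    return 1.0 / (2 * n) + max(abs(x - (2 * i + 1) / (2 * n)) for i, x in enumerate(xs))

N = 64
quasi = [van_der_corput(i) for i in range(1, N + 1)]
rng = random.Random(3)
pseudo = [rng.random() for _ in range(N)]
d_quasi = star_discrepancy_1d(quasi)
d_pseudo = star_discrepancy_1d(pseudo)
```

The quasi-random points fill the unit interval far more evenly than the pseudorandom ones, but successive points are no longer statistically independent, which is the trade-off noted in the text.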
3.3 Algorithms
In this section we describe Differential Evolution (DE) and Population-Based Incremental Learning (PBIL), which are the
parent algorithms for this study. We also describe the oppositional variants of these methods.
3.3.1 Differential Evolution
Differential Evolution (DE) was proposed by Price and Storn in 1995 [21]. It is an effective, robust, and simple global optimization algorithm [8]. DE is a population-based directed search method. Like other evolutionary algorithms, it starts with an initial population of vectors. If no preliminary knowledge about the solution space is available, the population is randomly generated. Each vector of the initial population can be generated as follows [8]:
$$X_{i,j} = a_j + \mathrm{rand}_j(0, 1) \times (b_j - a_j); \quad j = 1, 2, \ldots, d, \tag{3.9}$$
where $d$ is the problem dimension; $a_j$ and $b_j$ are the lower and the upper boundaries of variable $j$, respectively; and $\mathrm{rand}(0, 1)$ is a uniformly generated random number in [0, 1]. Assume $X_{i,t}$ ($i = 1, 2, \ldots, N_p$) are candidate solution vectors in generation $t$, where $N_p$ is the population size. Successive populations are generated by adding the weighted difference of two randomly selected vectors to a third randomly selected vector. For classical DE (DE/rand/1/bin), the mutation, crossover, and selection operators are straightforwardly defined as follows:

Mutation: For each vector $X_{i,t}$ in generation $t$ a mutant vector $V_{i,t}$ is defined by
$$V_{i,t} = X_{a,t} + F(X_{c,t} - X_{b,t}), \tag{3.10}$$
where $i \in \{1, 2, \ldots, N_p\}$ and $a$, $b$, and $c$ are mutually different random integer indices selected from $\{1, 2, \ldots, N_p\}$. Further, $i$, $a$, $b$, and $c$ are all distinct, which requires $N_p \ge 4$. $F \in [0, 2]$ is a real constant which determines the amplification of the added differential variation $X_{c,t} - X_{b,t}$. Larger values of $F$ result in higher diversity in the generated population and lower values lead to faster convergence.

Crossover: By shuffling competing solution vectors, DE utilizes the crossover operation to generate new solutions and also to increase the population diversity. For the classical DE (DE/rand/1/bin), the binary crossover (shown by ‘bin’ in the notation) is utilized. It defines the following trial vector:
$$U_{i,t} = (U_{1i,t}, U_{2i,t}, \ldots, U_{di,t}), \tag{3.11}$$
where
$$U_{ji,t} = \begin{cases} V_{ji,t} & \text{if } \mathrm{rand}_j(0, 1) \le C_r \,\vee\, j = k, \\ X_{ji,t} & \text{otherwise,} \end{cases} \tag{3.12}$$
for $C_r \in (0, 1)$ the predefined crossover rate, and $\mathrm{rand}_j(0, 1)$ is the $j$th evaluation of a uniform random number generator. $k \in \{1, 2, \ldots, d\}$ is a random parameter index,
chosen once for each $i$ to make sure that at least one parameter is always selected from the mutated vector, $V_{i,t}$. The most popular values for $C_r$ are in the range (0.4, 1) [14].

Selection: This decides which vector ($U_{i,t}$ or $X_{i,t}$) should be a member of the next (new) generation, $t + 1$. For a minimization problem, the vector with the lower value of the objective function is chosen (greedy selection).

This evolutionary cycle (i.e., mutation, crossover, and selection) is repeated $N_p$ (population size) times to generate a new population. These successive generations are produced until the predefined termination criteria are met.
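The three operators above fit into a compact loop. The following Python sketch of DE/rand/1/bin is an illustration under stated assumptions, not the authors' implementation: the parameter defaults, the fixed generation budget used as the termination criterion, the absence of bound handling after mutation, and the sphere test function are all choices made for the example.

```python
import random

def de_rand_1_bin(f, lower, upper, n_pop=20, F=0.5, Cr=0.9, generations=200, seed=0):
    """Minimize f over the box [lower, upper] with classical DE/rand/1/bin."""
    rng = random.Random(seed)
    d = len(lower)
    # Eq. (3.9): uniform random initialization inside the bounds.
    pop = [[lower[j] + rng.random() * (upper[j] - lower[j]) for j in range(d)]
           for _ in range(n_pop)]
    fit = [f(x) for x in pop]
    for _ in range(generations):
        new_pop, new_fit = [], []
        for i in range(n_pop):
            # Eq. (3.10): three mutually different indices, all different from i.
            a, b, c = rng.sample([m for m in range(n_pop) if m != i], 3)
            mutant = [pop[a][j] + F * (pop[c][j] - pop[b][j]) for j in range(d)]
            # Eq. (3.12): binomial crossover; component k always comes from the mutant.
            k = rng.randrange(d)
            trial = [mutant[j] if (rng.random() <= Cr or j == k) else pop[i][j]
                     for j in range(d)]
            # Greedy selection for a minimization problem.
            f_trial = f(trial)
            if f_trial <= fit[i]:
                new_pop.append(trial)
                new_fit.append(f_trial)
            else:
                new_pop.append(pop[i])
                new_fit.append(fit[i])
        pop, fit = new_pop, new_fit
    best = min(range(n_pop), key=fit.__getitem__)
    return pop[best], fit[best]

def sphere(x):
    return sum(v * v for v in x)      # hypothetical test function, minimum 0 at the origin

x_best, f_best = de_rand_1_bin(sphere, [-5.0] * 3, [5.0] * 3)
```

Each generation costs exactly `n_pop` function evaluations, which is why the number of generations needed to reach a target accuracy is the quantity opposition-based variants try to reduce.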
3.3.2 Opposition-Based Differential Evolution
By utilizing opposite points, we can obtain fitter starting candidate solutions even when there is no a priori knowledge about the solution(s), according to:
1. Randomly initialize the population $P(N_p)$.
2. Calculate the opposite population by
$$OP_{i,j} = a_j + b_j - P_{i,j}, \quad i = 1, 2, \ldots, N_p;\ j = 1, 2, \ldots, d, \tag{3.13}$$
where $P_{i,j}$ and $OP_{i,j}$ denote the $j$th variable of the $i$th vector of the population and the opposite population, respectively.
3. Select the $N_p$ fittest individuals from $\{P \cup OP\}$ as the initial population.
The general ODE scheme also employs generation jumping, but it has not been used in this work; opposition is applied only to population initialization and sample generation.
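The three initialization steps can be sketched directly in Python. The function name, the shifted quadratic test function and the choice to return the random population alongside the selected one are assumptions made for illustration; the opposite population follows Eq. (3.13).

```python
import random

def opposition_based_init(f, lower, upper, n_pop, rng):
    """Return (P, initial population): the n_pop fittest points of P and OP (minimization)."""
    d = len(lower)
    # Step 1: random population inside the bounds.
    P = [[lower[j] + rng.random() * (upper[j] - lower[j]) for j in range(d)]
         for _ in range(n_pop)]
    # Step 2, Eq. (3.13): OP[i][j] = a_j + b_j - P[i][j].
    OP = [[lower[j] + upper[j] - x[j] for j in range(d)] for x in P]
    # Step 3: keep the n_pop fittest individuals of the union.
    union = sorted(P + OP, key=f)
    return P, union[:n_pop]

def shifted(x):
    return sum((v - 3.0) ** 2 for v in x)     # hypothetical test function, minimum at (3, 3)

rng = random.Random(4)
P, init = opposition_based_init(shifted, [-5.0, -5.0], [5.0, 5.0], 10, rng)
```

The selected population can never be worse than the purely random one, since it is chosen from a superset; the price is `n_pop` additional function evaluations spent on the opposite points.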
3.3.3 Population-Based Incremental Learning
PBIL is a stochastic search which abstracts the population of samples found in evolutionary computation with a probability distribution for each variable of the solution [2]. At each generation a new sample population is generated based on the current probability distribution. The best individual is retained and the probability model is updated accordingly to reflect the belief regarding the final solution. The update rule is similar to that found in reinforcement learning. A population is represented by a probability matrix $M := (m_{i,j})_{d \times c}$ which stores the probability distribution over each possible element in the solution. If considering a binary problem, then the solution is $S := (s_{i,j})_{d \times c}$ with $s_{i,j} \in \{0, 1\}$, and $m_{i,j} \in [0, 1]$ is the probability that element $s_{i,j} = 1$. For continuous problems, probability distributions are used instead of a threshold value as is the case for discrete problems [18]. Learning consists of utilizing $M$ to generate a population $P$ of $k$ samples. After evaluation of each sample according to function $f$ the “best” ($B^*$) solution is retained and $M$ is updated according to
$$M_t = (1 - \alpha) M_{t-1} + \alpha B^* \tag{3.14}$$
where $0 < \alpha < 1$ is the learning rate and $t \ge 1$ is the iteration. Initially, $m_{i,j} = 0.5$ to reflect the lack of prior information. To abstract the crossover and mutation operators of evolutionary computation, PBIL employs a randomization of $M$. Let $0 < \beta, \gamma < 1$ be the probability of mutation and degree of mutation, respectively. Then with probability $\beta$
$$m_{i,j} = (1 - \gamma) m_{i,j} + \gamma \cdot \mathrm{random}(0 \text{ or } 1). \tag{3.15}$$
Algorithm 1 provides a summary of this approach.
Algorithm 1. Population-Based Incremental Learning [2]
  {Initialize probabilities}
  M_0 := (m_{i,j}) = 0.5
  for t = 1 to ω do
    {Generate samples}
    G_1 = generate_samples(k, M_{t−1})
    {Find best sample}
    B* = select_best({B*} ∪ G_1)
    {Update M}
    M_t = (1 − α) M_{t−1} + α B*
    {Mutate probability vector}
    for i = 0...d and j = 0...c do
      if random(0, 1) < β then
        m_{i,j} = (1 − γ) m_{i,j} + γ · random(0 or 1)
      end if
    end for
  end for
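Algorithm 1 translates almost line for line into runnable code. The Python sketch below is an illustration under stated assumptions: the problem is binary with a probability vector rather than a matrix (c = 1), the parameter defaults are arbitrary, and the OneMax objective (count of ones) is a hypothetical stand-in for a real evaluation function.

```python
import random

def pbil(f, d, k=20, alpha=0.1, beta=0.02, gamma=0.05, iterations=200, seed=0):
    """Maximize f over binary strings of length d using a probability vector M."""
    rng = random.Random(seed)
    M = [0.5] * d                         # no prior information
    best, best_val = None, float("-inf")
    for _ in range(iterations):
        # Generate k samples from the current probability vector.
        samples = [[1 if rng.random() < m else 0 for m in M] for _ in range(k)]
        # Keep the best solution seen so far (B* in the text).
        for s in samples:
            val = f(s)
            if val > best_val:
                best, best_val = s, val
        # Eq. (3.14): move M towards B*.
        M = [(1 - alpha) * m + alpha * b for m, b in zip(M, best)]
        # Eq. (3.15): with probability beta, nudge a probability towards 0 or 1.
        M = [(1 - gamma) * m + gamma * rng.randint(0, 1) if rng.random() < beta else m
             for m in M]
    return best, best_val, M

one_max = sum                             # hypothetical objective: number of ones
best, best_val, M = pbil(one_max, d=16)
```

Note that the whole population is summarized by `M`; only `k` evaluations are spent per iteration, so the total number of function calls is `k * iterations`.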
3.3.4 Oppositional Population-Based Incremental Learning
The opposition-based version of PBIL (OPBIL) shown in Algorithm 2 employs the opposite concept to improve diversity within the sample generation phase. A direct effect on the convergence rate is observed as a consequence of this mechanism. Further, OPBIL has an ability to escape local optima, on which estimation of distribution algorithms such as PBIL are prone to becoming trapped [19]. The description provided here is brief and the interested reader is invited to read [29] for a more detailed description. The general structure of the PBIL algorithm remains; however, aside from the sampling procedure, the update and mutation rules are also altered to reflect a degrading
degree of opposition with respect to the number of iterations. That is, as the number of iterations $t \to \infty$, the amount by which two opposite solutions differ approaches 1 bit (w.r.t. Hamming distance). Sampling is accomplished using an opposite guessing strategy whereby half of the population, $R_1$, is generated using the probability matrix $M$ and the other half is generated via a change in Hamming distance to a given element of $R_1$. The distance is calculated using an exponentially decaying function of the form
$$\xi(t) = l\, e^{ct}, \qquad (3.16)$$
where $l$ is the maximum number of bits in a guess and $c < 0$ is a user-defined constant. Updating of $M$ is performed in lines 14-28. If a new global best solution has been discovered (i.e. $\eta = B^*$), or otherwise with probability $p_{amp}$, the sample best solution is used to focus the search. When no new optima have been discovered, this strategy tends away from $B^*$. The actual update is performed in line 16 and is based on a reinforcement learning update using the sample best solution. The degree to which $M$ is updated is controlled by the user-defined parameter $0 < \rho < 1$. Should the above criteria for update fail, a decay of $M$ is attempted with probability $p_{decay}$ in line 17. The decay, performed in lines 21-27, slowly moves $M$ away from $B^*$. This portion of the algorithm is intended to prevent premature convergence and to aid exploration through small, smooth updates. The parameter $0 < \tau < 1$ is user defined, where often $\tau \ll \rho$. The equations in lines 11 and 12 were determined experimentally and no argument regarding their optimality is provided. Indeed, there likely exist many functions which will yield more desirable results. These were chosen because they tend to lead to a good behavior and outcome of PBIL.
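The opposite-guessing step and the amplification probability can be illustrated with a short Python sketch. It is only one possible reading of the description above: the text does not say which bits are changed, so the sketch assumes that $\xi(t)$ bits (rounded, and never fewer than one) are flipped at uniformly random positions. The function names and the default value of c are hypothetical; b = 0.1 is the value given in Table 3.3.

```python
import math
import numpy as np

def xi(t, l, c=-0.05):
    """Eq. (3.16): target Hamming distance between a guess and its opposite."""
    return l * math.exp(c * t)

def generate_opposites(R1, t, c=-0.05, rng=None):
    """Flip about xi(t) randomly chosen bits (at least one) of every sample in R1."""
    rng = np.random.default_rng() if rng is None else rng
    R1 = np.asarray(R1)
    flat = R1.reshape(len(R1), -1).copy()     # one row of l bits per sample
    l = flat.shape[1]
    n_flip = max(1, min(l, round(xi(t, l, c))))
    for row in flat:                          # rows are views, so edits persist
        idx = rng.choice(l, size=n_flip, replace=False)
        row[idx] = 1 - row[idx]
    return flat.reshape(R1.shape)

def p_amp(delta, b=0.1):
    """Line 11 of Algorithm 2: grows with the iterations since the last update."""
    return 1.0 - math.exp(-b * delta)
```

At t = 0 every bit is flipped, so the opposite is the bitwise complement; for large t only a single bit differs, which matches the limiting behavior described in the text.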
3.4 Experimental Setup

In this section we provide a discussion of the image thresholding problem, and the application of evolutionary algorithms to solving it. Additionally, the evaluation measure we employ to grade the quality of a segmentation is presented. Parameter settings and problem representation are also given.

3.4.1 Evolutionary Image Thresholding

Image segmentation involves partitioning an image I into a set of segments with the goal of locating objects in the image which are sufficiently similar. Thresholding is a subset problem of image segmentation, with only 2 classes defined by whether a given pixel is above or below a specific threshold value ω. This task has numerous applications and several general segmentation algorithms have been proposed [33]. Due to the variety of image types there does not exist a single algorithm for segmenting all images optimally.
Algorithm 2. Pseudocode for the OPBIL algorithm
Require: Maximum iterations, ω
Require: Number of samples per iteration, k
 1: {Initialize probabilities}
 2: M_0 = m_{i..l} = 0.5
 3: for t = 1 to ω do
 4:   {Generate samples}
 5:   R_1 = generate_samples(k/2, M)
 6:   R̆_1 = generate_opposites(R_1)
 7:   {Find best sample}
 8:   η = select_best({R_1 ∪ R̆_1})
 9:   B* = select_best(B*, η)
10:   {Compute probabilities}
11:   p_amp(Δ) = 1 − e^{−bΔ}
12:   p_decay(Δ; f(B*), f(η)) = (1 − (f(B*) − f(η)) / f(B*)) / √(Δ + 1)
13:   {Update M}
14:   if η = B* OR random(0, 1) < p_amp then
15:     Δ = 0
16:     M_t = (1 − ρ) M_{t−1} + ρ η
17:   else if random(0, 1) < p_decay then
18:     if random(0, 1) < p_decay then
19:       use B* in line 23 instead of η
20:     end if
21:     for all i, j each with probability < p_decay do
22:       if η_{i,j} = B*_{i,j} then
23:         m_{i,j} = m_{i,j} · (1 − τ · random(0, 1)) if η_{i,j} = 1, m_{i,j} · (1 + τ · random(0, 1)) otherwise
24:       else
25:         m_{i,j} = m_{i,j} · (1 + τ · random(0, 1)) if η_{i,j} = 1, m_{i,j} · (1 − τ · random(0, 1)) otherwise
26:       end if
27:     end for
28:   end if
29:   Δ = Δ + 1
30: end for
Many general purpose segmentation algorithms are histogram based and aim to discover a deep valley between two peaks, setting ω equal to that value. However, many real world problems will have multimodal histograms and deciding which value (i.e. valley) will correspond to the best thresholding is not obvious. The difficulty is compounded by the fact that the relative size of peaks may be large (so that the valley becomes hard to distinguish) or valleys could be very broad. Several algorithms have been proposed to overcome this [33]. Other methods based on information theory and other statistical methods have been proposed as well [13]. Typically, the problem of segmentation involves a high degree of uncertainty which makes solving the problem difficult. Stochastic searches such as evolutionary algorithms and population-based incremental learning often cope well with uncertainty in optimization, hence they provide an interesting alternative approach to traditional methods. The main difficulty associated with the use of population-based methods is that they are computationally expensive due to the large number of function calls required during the optimization process. One approach to minimizing uncertainty is to split the image into subimages which (hopefully) have characteristics allowing for an easy segmentation. Combining the subimages then forms the entire segmented image, although this requires extra function calls to analyze each subimage. An important caveat is that a local subimage may represent a good segmentation, but may not be useful with respect to the image as a whole. In this chapter we investigate thresholding with population-based techniques. Using ODE we do not perform any splitting into subimages and for OPBIL we split I into four equal-sized subregions, each having its own threshold value. 
In both cases we require a single evaluation to perform the segmentation and we show that the opposition-based techniques reduce the required number of function calls. As stated above, there exist many different segmentation algorithms. Further, numerous methods for evaluating the quality of a segmentation have also been put forth [34]. In this paper we use a simple method which aims to minimize the discrepancy between the original $M \times N$ gray-level image $I$ and its thresholded image $T$ [31]:

$$\sum_{i=1}^{M} \sum_{j=1}^{N} |I_{i,j} - T_{i,j}| \qquad (3.17)$$
where $|\cdot|$ represents the absolute value operation. Using different evaluations will change the outcome of the algorithm; however, the problem of segmentation in this manner nonetheless remains computationally expensive. We use the images shown in Figure 3.3 to evaluate the algorithms. The first column represents the original image, the second column the gold image, and the third column the approximate target image for ODE and OPBIL (i.e. the value-to-reach targets, discussed below). We show the gold image for completeness; it is not required in the experiments. Both experiments employ a value-to-reach (VTR) stopping criterion which measures the time or function calls required to reach a specific value. The VTR values have been experimentally determined and are given in the following table. Due to
Fig. 3.3 The images used to benchmark the algorithms. The first column is the original gray-level image, the second is the gold image and the third column is the target image of the optimization within the required function calls
the respective algorithms' ability to solve this problem, given the representation and convergence behavior, these values differ for ODE and OPBIL.
Table 3.1 Value-to-reach (VTR) for O/DE and O/PBIL experiments

Image   O/DE    O/PBIL
1       19579   19850
2       3391    4925
3       7139    7175
4       19449   19850
5       19650   19700
6       22081   22700
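The evaluation measure of Eq. (3.17) for a single global threshold can be sketched as below. The chapter does not state which gray-level range is used or which values the thresholded image takes, so the sketch assumes intensities scaled to [0, 1] and a thresholded image with values 0 and 1; both function names are hypothetical.

```python
import numpy as np

def threshold_image(image, omega, low=0.0, high=1.0):
    """Two-class thresholding: pixels above omega map to high, the rest to low."""
    return np.where(np.asarray(image) > omega, high, low)

def discrepancy(image, thresholded):
    """Eq. (3.17): sum of absolute differences between I and T."""
    diff = np.asarray(image, dtype=float) - np.asarray(thresholded, dtype=float)
    return float(np.abs(diff).sum())
```

Under these assumptions an optimizer would minimize discrepancy(I, threshold_image(I, omega)) over omega and stop once the value drops to the VTR.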
3.4.2 Parameter Settings and Solution Representation

The ODE and OPBIL algorithms differ in that the former is a real-valued optimization algorithm and the latter operates in the binary domain. Therefore, the solution representations will also differ and consequently directly affect the quality of results. However, the goal of this chapter is to show the ability of opposition to decrease the required number of function evaluations, and so fine-tuning aspects of these algorithms is not the focus of this investigation.

ODE Settings

The differential evolution experiments follow standard encoding guidelines. The user-defined parameter settings were determined empirically and are shown in Table 3.2.
Table 3.2 Parameter settings for differential evolution-based experiments

Parameter                    Value
Population size              Np = 5
Amplification factor         F = 0.9
Crossover probability        Cr = 0.9
Mutation strategy            DE/rand/1/bin
Maximum function calls       MAXNFC = 200
Jumping rate (no jumping)    Jr = −1
In order to maintain a reliable and fair comparison, these settings are kept unchanged for all conducted experiments for both DE and ODE algorithms.
OPBIL Settings

As stated above, OPBIL requires a binary solution representation. However, thresholding aims to discover an integer value $0 \le T \le 255$ with which to perform the segmentation operation $I > T$. Additionally, we use an approach of splitting $I$ into subimages $I_1, \ldots, I_{16}$, where each $I_i$ is an equal-sized square region of the original image. The encoding is a matrix $R := (r_{i,j})_{16 \times 8}$, which corresponds to 16 subimages each having a gray-level value $< 2^8 = 256$. Each row of $R$ is converted to an integer which is used to segment the respective region of $I$. The extra regions increase problem difficulty as they result in more deceptive and multimodal problems. Parameter settings for PBIL and OPBIL are as follows:
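The encoding described above can be illustrated with the following sketch, which decodes a 16 × 8 binary matrix into 16 thresholds and applies each to its own region. Several details are assumptions that the text leaves open: the bit order (most significant bit first), the row-major assignment of rows of R to cells of a 4 × 4 grid, and the function names themselves.

```python
import numpy as np

def decode_thresholds(R):
    """Read each 8-bit row of R (most significant bit first) as an integer in 0..255."""
    R = np.asarray(R)
    weights = 2 ** np.arange(R.shape[1] - 1, -1, -1)   # 128, 64, ..., 1
    return R @ weights

def segment_by_regions(image, R, grid=4):
    """Apply I > T cell by cell on a grid x grid partition of the image."""
    image = np.asarray(image)
    thresholds = decode_thresholds(R).reshape(grid, grid)
    out = np.zeros(image.shape, dtype=bool)
    rows = np.array_split(np.arange(image.shape[0]), grid)
    cols = np.array_split(np.arange(image.shape[1]), grid)
    for a, r in enumerate(rows):
        for b, c in enumerate(cols):
            cell = np.ix_(r, c)                        # index set of region (a, b)
            out[cell] = image[cell] > thresholds[a, b]
    return out
```

A sample drawn from the probability matrix of OPBIL can be passed directly as R, so one function evaluation segments all 16 regions at once.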
Table 3.3 Parameter settings for population-based incremental learning experiments

Parameter                    Value
Maximum iterations           t = 150
Sample size                  k = 24
PBIL only
  Learning rate              α = 0.35
  Mutation probability       β = 0.15
  Mutation degree            γ = 0.25
OPBIL only
  Update frequency control   b = 0.1
  Learning rate              ρ = 0.25
  Probability decay          τ = 0.0005
3.5 Experimental Results

Using the test images and parameter settings stated above we show the ability of oppositional concepts to decrease the required function calls. Due to space limitations, only the results for OPBIL are presented in detail, but similar behavior should be observed in ODE. All results presented correspond to the average of 30 runs, unless otherwise noted.
3.5.1 ODE

Table 3.4 presents a summary of the results obtained regarding function calls for ODE versus DE. Except for image 2, we show a decrease in function calls for all images. Images 4 and 5 have statistically significant improvements with respect to the decreased number of function calls, using a t-test at a 0.9 confidence level. Further, except for image 6, we show a lower standard deviation, which indicates higher reliability in the results.
Computing the overall mean function calls we show an improvement of 322 − 277 = 45 function calls. This equates to an average of 45/6 = 7.5 saved function calls per image. The ratio 322/277 ≈ 1.16 indicates that DE requires approximately 16% more function calls than ODE (equivalently, ODE saves 45/322 ≈ 14% of the function calls). For expensive optimization problems this can correspond to a great amount of savings with respect to algorithm run-time.

Table 3.4 Summary results for DE vs. ODE with respect to required function calls. μ and σ correspond to the mean and standard deviation of the subscripted algorithm, respectively

Image   μDE   σDE   μODE   σODE
1       74    41    60     34
2       32    20    35     16
3       42    23    37     19
4       74    36    54     30
5       47    37    45     22
6       63    26    46     31
Total   322         277
3.5.2 OPBIL

Table 3.5 shows the expected number of iterations (each iteration has 24 function calls) to attain the value-to-reach given in Table 3.1. In all cases OPBIL reaches its goal in fewer iterations than PBIL, where results for images 2, 5 and 6 are found to be statistically significant using a t-test at a 0.9 confidence level. Additionally, in all cases we find a lower standard deviation, indicating a more reliable behavior for OPBIL. Overall we have 444 − 347 = 97 saved iterations using OPBIL, which is an average of about 16 iterations, or 16 × 24 = 384 function calls, per image. The ratio 444/347 ≈ 1.28 indicates that PBIL requires about 28% more iterations than OPBIL (equivalently, OPBIL saves 97/444 ≈ 22% of the iterations).
Table 3.5 Summary results for PBIL vs. OPBIL with respect to required iterations. μ and σ correspond to the mean and standard deviation of the subscripted algorithm, respectively

Image   μPBIL   σPBIL   μOPBIL   σOPBIL
1       62      19      53       12
2       80      25      65       9
3       61      12      60       5
4       47      14      40       10
5       68      13      53       9
6       128     21      76       14
Total   444             347
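The savings figures quoted in Sections 3.5.1 and 3.5.2 can be re-derived from the totals reported in Tables 3.4 and 3.5. The helper below is a hypothetical convenience function, not part of the original experiments; it returns the absolute saving, the baseline-to-opposition ratio, and the fraction of the baseline cost that is saved.

```python
def savings(total_baseline, total_opposition):
    """Return (absolute saving, baseline/opposition ratio, fraction of baseline saved)."""
    saved = total_baseline - total_opposition
    return saved, total_baseline / total_opposition, saved / total_baseline

de_vs_ode = savings(322, 277)      # totals of mean function calls, Table 3.4
pbil_vs_opbil = savings(444, 347)  # totals of mean iterations, Table 3.5
```

The ratio and the saved fraction answer slightly different questions: the ratio expresses how much more expensive the baseline is, while the fraction expresses how much of the baseline cost opposition removes.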
In the following we analyze the correlation and distance for each sample per iteration. This is to examine whether the negative correlation and larger distance properties between a guess and its opposite are found in the sample. If so, we have supported (although not confirmed) the hypothesis that the observed improvements can be due to these characteristics. Figure 3.4 shows the averaged correlation $\mathrm{cor}(R_1^t, \breve{R}_1^t)$ for randomly generated $R_1^t$ and respective opposites $\breve{R}_1^t$ at iteration $t$. The solid line corresponds to OPBIL and the dotted line to PBIL. These plots show that OPBIL indeed has a lower correlation (with respect to the evaluation function) than PBIL (where we generate $R_1$ as above, and let $\breve{R}_1$ also be randomly generated). In all cases the correlation is much stronger for PBIL (noting that if the algorithm reaches the VTR then we set the correlation to 0).
Fig. 3.4 Sample mean correlation over 30 trials for PBIL (dotted) versus OPBIL (solid). We find OPBIL indeed yields a lower correlation than PBIL. [Six panels, Image 1 to Image 6; horizontal axis: iterations (0 to 150); vertical axis: correlation (0 to 1).]
We also examine the mean distance,

$$\bar{g} = \frac{2}{k} \sum_{i=1}^{k/2} g(R_{1,i}^t, \breve{R}_{1,i}^t) \qquad (3.18)$$
which computes the fitness-distance between the $i$th guess $R_{1,i}$ and its opposite $\breve{R}_{1,i}$ at iteration $t$, and which is shown in Figure 3.5. The distance for PBIL is relatively low throughout the 150 iterations, gently decreasing as the algorithm converges. However, as a consequence of OPBIL's ability to maintain diversity, the distance between samples increases during the early stages of the search and similarly rapidly decreases. Indeed, this implies the lower correlation shown above.
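Both diagnostics can be computed from the paired samples of one iteration, as in the sketch below. The chapter does not define the fitness-distance g explicitly, so the sketch assumes g(a, b) = |f(a) − f(b)|; the correlation is taken as the Pearson correlation of the paired evaluations. The function names are hypothetical.

```python
import numpy as np

def paired_evaluations(f, guesses, opposites):
    """Evaluate f on every guess and on its paired opposite."""
    fa = np.array([f(a) for a in guesses], dtype=float)
    fb = np.array([f(b) for b in opposites], dtype=float)
    return fa, fb

def mean_fitness_distance(f, guesses, opposites):
    """Eq. (3.18) with the assumed g(a, b) = |f(a) - f(b)|, averaged over k/2 pairs."""
    fa, fb = paired_evaluations(f, guesses, opposites)
    return float(np.mean(np.abs(fa - fb)))

def fitness_correlation(f, guesses, opposites):
    """Pearson correlation between the evaluations of guesses and opposites."""
    fa, fb = paired_evaluations(f, guesses, opposites)
    return float(np.corrcoef(fa, fb)[0, 1])
```

Since there are k/2 pairs, the factor 2/k in Eq. (3.18) is simply the mean over pairs, which is what np.mean computes here.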
Fig. 3.5 Sample mean distance over 30 trials for samples of PBIL (dotted) versus OPBIL (solid). We find OPBIL indeed yields a higher distance between paired samples than PBIL. [Six panels, Image 1 to Image 6; horizontal axis: iterations (0 to 150); vertical axis: distance.]
The final test is to examine the standard deviation (w.r.t. the evaluation function) of the distance between samples, given in Figure 3.6. Both algorithms have similarly shaped plots with respect to this measure, reflecting the convergence rate of the respective algorithms. It seems as though the use of opposition aids the search by infusing diversity early on and quickly focusing once a high-quality optimum is found. Conversely, the basic PBIL does not include this bias, and therefore its convergence is less rapid.
Fig. 3.6 Sample mean standard deviations over 30 trials for samples of PBIL (dotted) versus OPBIL (solid). [Six panels, Image 1 to Image 6; horizontal axis: iterations (0 to 150); vertical axis: std. dev.]
3.6 Conclusion

In this chapter we have discussed the application of opposition-based computing techniques to reducing the required number of function calls. Firstly, a brief
introduction to the underlying concepts of opposition was given, along with conditions under which opposition-based methods should be successful. A comparison to the similar concepts of antithetic variates and quasi-random/low-discrepancy sequences made clear the uniqueness of our method. Two recently proposed algorithms, ODE and OPBIL, were briefly introduced, and the manner in which opposition is used to improve their parent algorithms, DE and PBIL respectively, was given. The manner in which opposites are used differs between the two cases, but the underlying concepts are the same. Using the expensive optimization problem of image thresholding as a benchmark, we examined the ability of ODE and OPBIL to lower the number of function calls required to reach the pre-specified target value. It was found that both algorithms reduce the expected number of function calls: the DE-to-ODE ratio of function calls is approximately 1.16 and the PBIL-to-OPBIL ratio of iterations is approximately 1.28. Further, concentrating on OPBIL, we showed the hypothesized lower correlation and higher fitness-distance measures for a quality opposite mapping. Our results are very promising; however, future work is required in various regards. Firstly, a further theoretical basis for opposition and for choosing opposite mappings is needed. This could possibly lead to general strategies of implementation when no prior knowledge is available. Further application to different real-world problems is also desired.
Acknowledgements

This work has been partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).
References

1. Bai, F., Wu, Z.: A novel monotonization transformation for some classes of global optimization problems. Asia-Pacific Journal of Operational Research 23(3), 371–392 (2006)
2. Baluja, S.: Population Based Incremental Learning - A Method for Integrating Genetic Search Based Function Optimization and Competitive Learning. Tech. rep., Carnegie Mellon University, CMU-CS-94-163 (1994)
3. Engelbrecht, A.: Fundamentals of Computational Swarm Intelligence. Wiley, Chichester (2005)
4. Glover, F., Laguna, M.: Tabu Search. Kluwer, Dordrecht (1997)
5. Goldberg, D.E., Horn, J., Deb, K.: What makes a problem hard for a classifier system? Tech. rep. In: Collected Abstracts for the First International Workshop on Learning Classifier Systems (IWLCS 1992), NASA Johnson Space (1992)
6. Niederreiter, H.: Random Number Generation and Quasi-Monte Carlo Methods. Society for Industrial and Applied Mathematics (1992)
7. Holland, J.H.: Adaptation in Natural and Artificial Systems. The University of Michigan Press (1975)
8. Price, K., Storn, R., Lampinen, J.A.: Differential Evolution: A Practical Approach to Global Optimization. Springer, Heidelberg (2005)
9. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by Simulated Annealing. Science 220(4598), 671–680 (1983)
10. Lemieux, C.: Monte Carlo and Quasi-Monte Carlo Sampling. Springer, Heidelberg (2009)
11. Maaranen, H., Miettinen, K., Penttinen, A.: On initial populations of genetic algorithms for continuous optimization problems. Journal of Global Optimization 37(3), 405–436 (2007)
12. Montgomery, J., Randall, M.: Anti-pheromone as a tool for better exploration of search space. In: Third International Workshop, ANTS, pp. 1–3 (2002)
13. O'Gorman, L., Sammon, M., Seul, M. (eds.): Practical Algorithms for Image Analysis. Cambridge University Press, Cambridge (2008)
14. Rahnamayan, S., Tizhoosh, H.R., Salama, M.M.A.: Opposition-based differential evolution. IEEE Transactions on Evolutionary Computation 12(1), 64–79 (2008)
15. Rahnamayan, S., Tizhoosh, H.R., Salama, S.: Opposition-based Differential Evolution Algorithms. In: IEEE Congress on Evolutionary Computation, pp. 7363–7370 (2006)
16. Rahnamayan, S., Tizhoosh, H.R., Salama, S.: Opposition-based Differential Evolution Algorithms for Optimization of Noisy Problems. In: IEEE Congress on Evolutionary Computation, pp. 6756–6763 (2006)
17. Rubinstein, R.: Monte Carlo Optimization, Simulation and Sensitivity of Queueing Networks. Wiley, Chichester (1986)
18. Sebag, M., Ducoulombier, A.: Extending Population-Based Incremental Learning to Continuous Search Spaces. In: Eiben, A.E., Bäck, T., Schoenauer, M., Schwefel, H.-P. (eds.) PPSN 1998. LNCS, vol. 1498, pp. 418–427. Springer, Heidelberg (1998)
19. Shapiro, J.: Diversity loss in general estimation of distribution algorithms. In: Parallel Problem Solving in Nature IX, pp. 92–101 (2006)
20. Shokri, M., Tizhoosh, H.R., Kamel, M.: Opposition-based Q(lambda) Algorithm. In: IEEE International Joint Conference on Neural Networks, pp. 646–653 (2006)
21. Storn, R., Price, K.: Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization 11, 341–359 (1997)
22. Tizhoosh, H.R.: Reinforcement Learning Based on Actions and Opposite Actions. In: International Conference on Artificial Intelligence and Machine Learning (2005)
23. Tizhoosh, H.R.: Opposition-based Reinforcement Learning. Journal of Advanced Computational Intelligence and Intelligent Informatics 10(4), 578–585 (2006)
24. Tizhoosh, H.R., Ventresca, M. (eds.): Oppositional Concepts in Computational Intelligence. Springer, Heidelberg (2008)
25. Toh, K.: Global optimization by monotonic transformation. Computational Optimization and Applications 23(1), 77–99 (2002)
26. Ventresca, M., Tizhoosh, H.R.: Improving the Convergence of Backpropagation by Opposite Transfer Functions. In: IEEE International Joint Conference on Neural Networks, pp. 9527–9534 (2006)
27. Ventresca, M., Tizhoosh, H.R.: Opposite Transfer Functions and Backpropagation Through Time. In: IEEE Symposium on Foundations of Computational Intelligence, pp. 570–577 (2007)
28. Ventresca, M., Tizhoosh, H.R.: Simulated Annealing with Opposite Neighbors. In: IEEE Symposium on Foundations of Computational Intelligence, pp. 186–192 (2007)
29. Ventresca, M., Tizhoosh, H.R.: A diversity maintaining population-based incremental learning algorithm. Information Sciences 178(21), 4038–4056 (2008)
30. Ventresca, M., Tizhoosh, H.R.: Numerical condition of feedforward networks with opposite transfer functions. In: IEEE International Joint Conference on Neural Networks, pp. 3232–3239 (2008)
31. Weszka, J., Rosenfeld, A.: Threshold evaluation techniques. IEEE Transactions on Systems, Man and Cybernetics 8(8), 622–629 (1978)
32. Wu, Z., Bai, F., Zhang, L.: Convexification and concavification for a general class of global optimization problems. Journal of Global Optimization 31(1), 45–60 (2005)
33. Yoo, T. (ed.): Insight into Images: Principles and Practice for Segmentation, Registration, and Image Analysis. AK Peters (2004)
34. Zhang, H., Fritts, J., Goldman, S.: Image segmentation evaluation: A survey of unsupervised methods. Computer Vision and Image Understanding 110, 260–280 (2008)
Chapter 4
Search Procedure Exploiting Locally Regularized Objective Approximation: A Convergence Theorem for Direct Search Algorithms

Marek Bazan
Abstract. The Search Procedure Exploiting Locally Regularized Objective Approximation is a method to speed up local optimization processes in which the objective function evaluation is expensive. It was introduced in [1] and further developed in [2]. In this paper we present the convergence theorem of the method. The theorem is proved for the EXTREM [6] algorithm but applies to any Gauss-Seidel algorithm that uses sequential quadratic interpolation (SQI) as a line search method. After some extension it can also be applied to conjugate direction algorithms. The proof is based on the Zangwill theory of closed transformations. This method of proof was chosen instead of the sufficient decrease approach because the crucial element of the presented proof is an extension of the SQI convergence proof from [14], which is itself based on the Zangwill theory.
Marek Bazan
Institute of Informatics, Automatics and Robotics, Department of Electronics, Wrocław University of Technology, ul. Janiszewskiego 11/17, 50-372 Wrocław, Poland
e-mail: [email protected]

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 73–103. © Springer-Verlag Berlin Heidelberg 2010. springerlink.com

4.1 Introduction

Optimization processes with objective functions that are expensive to evaluate occur in many fields of modern design; the expense usually arises because an evaluation requires solving a large system of linear equations or simulating some physical process. The main strategy for speeding up such processes by constructing a model that approximates the objective function is the family of trust region methods [4]. The application of radial basis function approximation as an approximation model in trust region methods was discussed in [13]. The standard way to prove the convergence of a trust region method is the method of sufficient decrease.

In [1] and [2] we presented a search procedure which can be viewed as an alternative to trust region methods. It relies on combining the direct search algorithm EXTREM [6] with the locally regularized radial basis approximation. The method achieves a speed-up of the optimization by replacing some number of direct function evaluations by radial basis approximation. The method combined with the EXTREM algorithm was implemented within the computer program ROXIE for superconducting magnet design and optimization [3]; however, it is a general framework and, as we shall see, any search algorithm that is based on the Gauss-Seidel non-gradient algorithm or a conjugate direction algorithm can be used. We will call our method the Search Procedure Exploiting Locally Regularized Objective Approximation (SPELROA).

In this paper we give the convergence theorem for the SPELROA method combined with the EXTREM minimization algorithm, under the assumption that the radial basis approximation has a relative error less than ε. The method of proof is different from those used in proofs for trust region methods since it is based on the Zangwill theory of closed transformations. The crucial element of the proof is the modification of the proof of the convergence of quadratic interpolation as a line search method (c.f. [14]) under the assumption that a perturbation of the function value may be introduced to the algorithm at each step. We give the conditions on the function values as well as on ε to maintain convergence. The proof of the convergence of the sequential quadratic interpolation in [14] is based on Zangwill's theory and, as we essentially extend it, we chose this method also to prove the convergence of the whole SPELROA method.

The plan of this chapter is the following. In the next section we sketch the SPELROA method. In the third section we give some theory from [21] to be used in the next section to prove the main result. Finally in the fourth section we discuss the radial basis approximation and heuristics used for its construction. We also describe difficulties in establishing a strict error bound in the current state of the development of radial basis function approximation for sparse data. In the remainder of the paper we give numerical results for three test functions of 6, 8 and 11 variables from a set of test functions proposed in [12].
4.2 The Search Procedure

Let a direct search optimization algorithm A be given that uses quadratic interpolation as a line search method. The SPELROA method combined with algorithm A can be written in the form of the following algorithm (c.f. [2]). While generating the set Z we have to take care that data points are not placed too close to each other. When two points are too close to each other, where the distance is controlled by a user-supplied parameter whose value is relative to the diameter of the set Z, one of the points has to be replaced by another point not yet included. Such a procedure for constructing Z keeps the separation distance (c.f. [16]) greater than the user-supplied parameter value and therefore guarantees that the radial basis function interpolation matrix is not singular. The crucial step of the scheme is point 3, containing a threefold check for whether the approximation $\tilde{f}(x_k)$ can be used in the algorithm A instead of $f(x_k)$ evaluated directly. Conditions being checked in steps 3.a) and 3.b) are related to the radial basis approximation and will be discussed
The Search Procedure Exploiting Locally Regularized Objective Approximation

Input:
  $f : \mathbb{R}^d \to \mathbb{R}$ : the objective function,
  $x^0 \in \mathbb{R}^d$ : a starting point,
  $\varepsilon > 0$ : prescribed accuracy of the approximation,
  $\tilde{f}(\cdot)$ : a radial basis function approximation of $f(\cdot)$,
  $I_s$ : number of initial steps performed by the direct optimization algorithm A,
  $N < I_s$ : size of the dataset used to construct the approximation $\tilde{f}(\cdot)$,
  ε-check : a procedure to check that the conditions required by the convergence theorem hold.

0. Perform $I_s$ initial steps of the algorithm A.
1. In the k-th step generate the point $x_k$ for which the function value is supposed to be evaluated using algorithm A.
2. Generate a set Z from the N nearest points for which the function was evaluated directly.
3. Judge whether a reliable approximation of $f(x_k)$ can be constructed for the point $x_k$.
   a. If the point $x_k$ is located in a reliable region of the search domain then construct the approximation $\tilde{f}$ and evaluate $\tilde{f}(x_k)$.
   b. If the approximation $\tilde{f}(x_k)$ was correctly constructed then perform an ε-check.
   c. If the ε-check is positive (i.e. the procedure returns true) then substitute $f(x_k) \leftarrow \tilde{f}(x_k)$ in the algorithm A.
   d. Else evaluate $f(x_k)$ directly.
4. If the stopping criterion of algorithm A is satisfied then stop. Else replace k := k + 1 and go to 1.
in section 4.5 whereas the ε -check procedure from step 3.c) is associated with the convergence theorem and will be discussed in section 4.4.
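The control flow of steps 2 and 3 can be sketched in Python as follows. This shows only the decision logic: the reliability test, the construction of the radial basis model and the ε-check are passed in as callbacks, and all function and parameter names are hypothetical. The sketch also omits the separation-distance filtering of Z described above and simply takes the N nearest evaluated points.

```python
import numpy as np

def nearest_points(x, evaluated, n):
    """Step 2: the n directly evaluated (point, value) pairs closest to x."""
    return sorted(evaluated, key=lambda pv: np.linalg.norm(np.asarray(pv[0]) - x))[:n]

def spelroa_value(x, f, evaluated, n, is_reliable, build_model, eps_check):
    """Step 3: return (value, used_approximation) for the trial point x."""
    x = np.asarray(x, dtype=float)
    Z = nearest_points(x, evaluated, n)
    if is_reliable(x, Z):                    # 3.a: x lies in a reliable region
        model = build_model(Z)               # assumed to return None on failure
        if model is not None:
            approx = model(x)
            if eps_check(x, approx, Z):      # 3.b and 3.c
                return approx, True
    value = f(x)                             # 3.d: direct (expensive) evaluation
    evaluated.append((x, value))             # only direct evaluations enter the data
    return value, False
```

Only directly evaluated points are appended to the dataset, which mirrors step 2, where Z is built from points for which the function was evaluated directly.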
4.3 Zangwill's Method to Prove Convergence

Feasible direction optimization algorithms can be written as

$$x_{k+1} = x_k + \tau_k d_k$$
where $d_k$ is a search direction and $\tau_k$ is a search step. The $k$-th iteration of such algorithms can be viewed as a composite algorithmic transformation $A = M^1 D$, where $D : \mathbb{R}^d \to \mathbb{R}^d \times \mathbb{R}^d$ is a search direction generating transform, $D(x) = (x, d)$, with $d \in \mathbb{R}^d$ a direction vector, whereas $M^1 : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ is a line minimization transform

$$M^1(x, d) = \{\, y : f(y) = \min_{\tau \in J} f(x + \tau d),\ y = x + \tau^0 d \,\}$$
where $(x, d) \in \mathbb{R}^d \times \mathbb{R}^d$ and $J$ is a variability interval for the scalar $\tau$. In his monograph [21], Zangwill gave a method of proving the convergence of feasible direction optimization algorithms based on properties of an algorithmic transformation $A$. In this section we sketch all crucial definitions and lemmas used to state the main convergence theorem.

Definition 1. A transformation $A : V \to V$ is a transformation of a point into a set when each point $x \in V$ is assigned a set $A(x)$ of points from $V$.

The result of applying the transform $A(\cdot)$ to a point $x_k$ can be any point $x_{k+1}$ from the set $A(x_k)$, thus $x_{k+1} \in A(x_k)$. A transformation $A = M^1 D$ defining a feasible direction optimization algorithm is a transformation of a point into a set.

Definition 2. We say that a transformation $A : V \to V$ is closed at a point $x_\infty$ if the following implication holds true:
1. $x_k \to x_\infty$, $k \in K$,
2. $y_k \in A(x_k)$, $k \in K$,
3. $y_k \to y_\infty$,
imply
4. $y_\infty \in A(x_\infty)$,
where $K$ is a sequence of natural numbers. We say that the transformation $A$ is closed on $X \subset V$ if it is closed at every point $x \in X$.

The property of closedness for algorithmic transformations is an analogue of the property of continuity for "usual" functions.

Theorem 1. (see [21] page 99) Let a transformation $A : V \to V$ of a point into a set be an algorithm which, for a given point $x_1$, generates a sequence $\{x_k\}_{k=1}^{\infty}$. Let $S \subset V$ be a set of solutions. Let us assume
1. All points $x_k$ are in a compact set $X \subset V$.
2. There exists a function $Z : V \to \mathbb{R}$ such that
4
Search Procedure Exploiting Locally Regularized Objective Approximation
77
a. if point x is not a solution, then for any y ∈ A(x) there is Z(y) < Z(x) b. if point x is a solution, then either the algorithm finishes or for any y ∈ A(x) Z(y) ≤ Z(x). 3. Transformation A is closed for a point x if it is not a solution. Then either the algorithm finishes in a point which is a solution or any convergent subsequence generated by the algorithm has its limit in the solution set S. Now we additionally need two lemmas from [21] concerning the closedness of a composition of closed transforms. Lemma 1. Let C : W → X be a given function and B : X → Y be a transform of a point into a set. If function C is continuous in point w∞ and B is closed in C(w∞ ), then a composition A = BC : W → Y is closed in w∞ . Lemma 2. Let f be a continuous function. Then the transform M 1 is closed, if J is a compact and bounded interval.
4.4 The Main Result
In practice another line search operator is usually considered, since an implementation of the operator $M^1(\cdot, \cdot)$ is expensive. Let us consider a line search operator $M^*$ defined as
\[ M^*(x, d) = M^1(x, d) \cup \{ y = x + \tau d : f(y) \le f(x) - \Delta,\ \tau \in J \}. \tag{4.1} \]
Its value is the set of points along the direction $d$ from the point $x$ at which the function $f$ is decreased by at least $\Delta$ or, when this set is empty, the minimum of $f$ along this direction. A suggestion of a practical application of the operator $M^*$ can be found in one of the exercises in [21]. In the algorithm EXTREM such an operator is used instead of $M^1$. The operator $M^*$ is implemented as the parabolic search algorithm. The operator $M^*$ is more practical, in particular in the initial part of the optimization process, where the first steps of the parabolic interpolation search give the most significant decrease of the function value, whereas the subsequent steps usually decrease $f$ much more slowly. The main result of this note is the following theorem.

Theorem 2. Let a function $f : \Omega \to R$, $\Omega \subset R^d$, be the objective function of the optimization problem. Suppose that a method is given that approximates the objective function $f$ at certain points of the domain $\Omega$ with a relative error $\varepsilon > 0$. If the function $f$ is strictly convex, then the SPELROA method combined with the Gauss-Seidel search algorithm using approximated sequential quadratic interpolation as the line search method converges to a stationary point $x_0 \in S$, where $S = \{x : \nabla f(x) = 0\}$.
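The action of the operator $M^*$ from (4.1) can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the interval $J$ is sampled on a uniform grid (the chapter itself implements $M^*$ by parabolic search), and the names `m_star`, `n_grid` and all default values are hypothetical.

```python
import numpy as np

def m_star(f, x, d, J=(-1.0, 1.0), delta=0.1, n_grid=201):
    """Return one element of M*(x, d), with J sampled on a uniform grid.

    If some grid step tau gives f(x + tau d) <= f(x) - delta, the first
    such point is returned; otherwise the grid minimizer along d (a crude
    stand-in for the exact line minimum M^1) is returned.
    """
    taus = np.linspace(J[0], J[1], n_grid)
    values = np.array([f(x + t * d) for t in taus])
    sufficient = np.nonzero(values <= f(x) - delta)[0]
    if sufficient.size > 0:
        tau = taus[sufficient[0]]        # a point decreasing f by at least delta
    else:
        tau = taus[np.argmin(values)]    # fall back to the line minimum
    return x + tau * d
```

Far from the minimum the sketch accepts the first step that yields a decrease of at least `delta`; close to the minimum, where no such step exists, it falls back to the line minimum, which mirrors the two parts of the union in (4.1).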
The proof of Theorem 2 is based on Theorem 1, where the transformation $A$ is
\[ A = R M^* D_d M^* D_{d-1} \cdots M^* D_2 M^* D_1, \]
where $D_i$ chooses the i-th direction from the orthogonal direction base of the k-th iteration and $R$ is an orthogonalization step that produces a new base along the direction $x_0^{k-1} x_d^{k-1}$ from iteration $k-1$.
4.4.1 Closedness of the Algorithmic Transformation
To show that point 3 of Theorem 1 is fulfilled we first have to show that the transformation $M^*$ defined in (4.1) is closed. We prove the following lemma.

Lemma 3. Let $f$ be a continuous function. Then the transformation $M^*$ is closed if $J$ is a compact (closed and bounded) interval.

Proof. According to Definition 2 let us consider sequences $\{(x_k, d_k)\}_{k=1}^{\infty}$ and $\{y_k\}_{k=1}^{\infty}$. We assume that
1. $(x_k, d_k) \to (x_\infty, d_\infty)$, $k \in K$,
2. $y_k \in M^*(x_k, d_k)$, $k \in K$,
3. $y_k \to y_\infty$, $k \in K$.
So we have that
\[ y_k = x_k + \tau_k d_k \]
where $\tau_k \in J$ is a step for which $f(x_k + \tau_k d_k) \le f(x_k) - \Delta$. Because $\tau_k \in J$ for $k \in K$ and $J$ is compact, there exists a convergent subsequence $\tau_k \to \tau_\infty$, $k \in K_1$, where $K_1 \subset K$ and $\tau_\infty \in J$. For a fixed $\tau \in J$, from the definition of $y_k$ it follows that
\[ f(y_k) < f(x_k + \tau d_k) \tag{4.2} \]
or
\[ f(y_k) - f(x_k + \tau d_k) \le \Delta. \tag{4.3} \]
Note that if (4.3) is not satisfied then (4.2) is satisfied. Since $f$ is continuous, in the limit we get
\[ f(y_\infty) = \lim_{k \in K_1} f(y_k) < \lim_{k \in K_1} f(x_k + \tau d_k) = f(x_\infty + \tau d_\infty) \tag{4.4} \]
and in the same way
\[ f(y_\infty) - f(x_\infty + \tau d_\infty) \le \Delta. \tag{4.5} \]
Because (4.4) or (4.5) is fulfilled for any $\tau$, for any point $y^* \in M^*(x_\infty, d_\infty)$ we can write
\[ f(y_\infty) < f(y^*) \tag{4.6} \]
or
\[ f(y_\infty) - f(y^*) \le \Delta. \tag{4.7} \]
On the other hand, at a point $y^* \in M^*(x_\infty, d_\infty)$ the function $f$ attains its least value for $\tau \in J$, and
\[ y_\infty = x_\infty + \tau_\infty d_\infty, \quad \tau_\infty \in J, \]
therefore
\[ f(y_\infty) < f(x_\infty) - \Delta \tag{4.8} \]
or
\[ f(y^*) - f(y_\infty) \le \Delta. \tag{4.9} \]
Comparing (4.8) and (4.9) with (4.6) and (4.7) we get the result $y_\infty \in M^*(x_\infty, d_\infty)$.

To make use of Lemma 1 in order to show the closedness of the transformation $A$ we also have to notice that the transformations $D_i(x) = (x, d_i)$ $(i = 1, \dots, d)$ are continuous functions. For direct search algorithms the transformations $D_i$ generate orthogonal directions. They are the same as in the Gauss-Seidel algorithm (cf. [21]). For conjugate direction search algorithms the transformations $D_i$ generate successive conjugate directions. The transformation $R$ that generates the orthogonal search directions $[d_0^{new}, \dots, d_{d-1}^{new}]$ for the step $k$ is defined as
\[ R(x_0^{k-1}, [d_0, \dots, d_{d-1}]) = (x_0^{k}, [d_0^{new}, \dots, d_{d-1}^{new}]), \]
where the sequence of the new orthogonal vectors is uniquely defined by the orthogonalization of the vectors $w_0, w_1, \dots, w_{d-1}$
\[ \begin{aligned} w_0 &= s_0 d_0 + s_1 d_1 + \dots + s_{d-1} d_{d-1} \\ w_1 &= s_1 d_1 + \dots + s_{d-1} d_{d-1} \\ &\ \ \vdots \\ w_{d-1} &= s_{d-1} d_{d-1} \end{aligned} \]
where the scalars $s_0, s_1, \dots, s_{d-1}$ correspond to the step sizes in all directions in the step $k-1$. The transformation $R$ is uniquely defined without any conditions on the scalars $s_0, s_1, \dots, s_{d-1}$ as long as the orthogonalization is performed using the algorithm presented in [15]. In this case it is also a continuous function.

Finally, since the transformation $A$ is a composition of the closed transformations $M^*$ with the continuous functions $D_i$ $(i = 0, \dots, d-1)$ and $R$, the assumptions of Lemma 1 are satisfied and the transformation $A$ is closed. This proves that assumption 3 of Theorem 1 holds for the unperturbed $M^*$. In the next subsection we show that the transformation $M^*$ can be realized by the perturbed transformation $M^1$.
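The construction of the new direction base from the vectors $w_0, \dots, w_{d-1}$ can be sketched in Python as follows. This is a minimal sketch that uses plain Gram-Schmidt orthogonalization, not the algorithm of [15] referred to above; unlike that algorithm it assumes that all step sizes $s_i$ are nonzero, since otherwise the vectors $w_i$ are linearly dependent. The function name `new_directions` is hypothetical.

```python
import numpy as np

def new_directions(D, s):
    """Orthonormalize w_i = sum_{j >= i} s_j d_j by Gram-Schmidt.

    D : (d, d) array whose rows are the current orthonormal directions.
    s : step sizes taken along those directions (assumed all nonzero).
    """
    D = np.asarray(D, dtype=float)
    s = np.asarray(s, dtype=float)
    d = D.shape[0]
    # Row i of W is w_i = s_i d_i + ... + s_{d-1} d_{d-1}.
    W = np.array([(s[i:, None] * D[i:]).sum(axis=0) for i in range(d)])
    new = np.zeros_like(D)
    for i in range(d):
        v = W[i].copy()
        for j in range(i):
            v -= (v @ new[j]) * new[j]   # remove components along earlier directions
        new[i] = v / np.linalg.norm(v)
    return new
```

The first new direction is parallel to $w_0$, that is, to the total displacement achieved in the previous step, and the remaining directions complete it to an orthonormal base.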
4.4.2 A Perturbation in the Line Search
For non-gradient optimization algorithms with an orthogonal basis as the set of search directions, the transformation $M^*$ is the only place in the algorithm where the perturbation from the approximation of the objective function is introduced by the SPELROA method. Therefore, to show that point 2 of Theorem 1 is fulfilled, it is sufficient that an implementation of the transformation $M^*$ that allows a perturbation of the function value on the level $\varepsilon > 0$ minimizes the function $f$ along the direction $d$. A proof of the convergence of the parabolic interpolation line search method can be found in [8] or [14]. Here we give conditions on the perturbation of the function under which the proof given in [14] holds true. We keep the notation as close as possible to that in [14].

Let the function $f : R \to R$ be unimodal. Let a triplet $\zeta^{(i)} = (\zeta_1^{(i)}, \zeta_2^{(i)}, \zeta_3^{(i)})$ satisfy $f(\zeta_2^{(i)}) \le \min\{ f(\zeta_1^{(i)}), f(\zeta_3^{(i)}) \}$, i.e. the interval $[\zeta_1^{(i)}, \zeta_3^{(i)}]$ contains the unique minimizer of the function $f$. For a non-perturbed objective function we define the set of feasible triplets $T \subset R^3$, each defining an interval $[\zeta_1, \zeta_3]$ that contains the minimizer $\hat\lambda$, as
\[ \begin{aligned} T := {} & \{\zeta \in R^3 : \zeta_1 < \zeta_2 < \zeta_3,\ f(\zeta_2) \le \min\{ f(\zeta_1), f(\zeta_3) \} \} \\ & \cup \{\zeta \in R^3 : \zeta_1 = \zeta_2 < \zeta_3,\ f'(\zeta_1) \le 0,\ f(\zeta_3) \ge f(\zeta_1) \} \\ & \cup \{\zeta \in R^3 : \zeta_1 < \zeta_2 = \zeta_3,\ f'(\zeta_3) \ge 0,\ f(\zeta_1) \ge f(\zeta_3) \} \\ & \cup \{\zeta \in R^3 : \zeta_1 = \zeta_2 = \zeta_3 = \hat\lambda \}. \end{aligned} \]
For $\zeta \in T$ with $\zeta_1 < \zeta_2 < \zeta_3$ the minimum of the quadratic interpolating the points $(\zeta_1, f(\zeta_1))$, $(\zeta_2, f(\zeta_2))$, $(\zeta_3, f(\zeta_3))$ equals
\[ \lambda^*(\zeta) = \frac{1}{2}\, \frac{(\zeta_3^2 - \zeta_2^2) f(\zeta_1) - (\zeta_3^2 - \zeta_1^2) f(\zeta_2) + (\zeta_2^2 - \zeta_1^2) f(\zeta_3)}{(\zeta_3 - \zeta_2) f(\zeta_1) - (\zeta_3 - \zeta_1) f(\zeta_2) + (\zeta_2 - \zeta_1) f(\zeta_3)} \]
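The minimum of the interpolating quadratic is easy to check numerically. The short Python sketch below evaluates the standard three-point formula, in which the term with $f(\zeta_2)$ enters with a minus sign; the function name `quad_min` is hypothetical, and a vanishing denominator (three collinear points) is reported as an error.

```python
def quad_min(z1, z2, z3, f1, f2, f3):
    """Abscissa of the vertex of the parabola through (z1,f1), (z2,f2), (z3,f3)."""
    num = (z3**2 - z2**2) * f1 - (z3**2 - z1**2) * f2 + (z2**2 - z1**2) * f3
    den = (z3 - z2) * f1 - (z3 - z1) * f2 + (z2 - z1) * f3
    if den == 0:
        raise ValueError("the three points are collinear; no unique vertex")
    return 0.5 * num / den
```

For a function that is itself quadratic the formula recovers the minimizer exactly: with $f(x) = (x - 1.5)^2$ and the triplet $(0, 1, 3)$ the function values are $2.25, 0.25, 2.25$, the numerator equals 18, the denominator equals 6, and the result is $1.5$.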
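Then the set of admissible replacement triplets $A(\zeta)$ is a set of candidate triplets that may replace $\zeta \in T$, defining a smaller interval containing $\hat\lambda$ in the next iteration of the algorithm. The set $A_0(\zeta)$ is defined as
\[ A_0(\zeta) := T \cap \{u_1(\zeta), u_2(\zeta), u_3(\zeta), u_4(\zeta)\} \]
where
\[ \begin{aligned} u_1(\zeta) &= (\zeta_1, \lambda^*(\zeta), \zeta_2), \\ u_2(\zeta) &= (\zeta_2, \lambda^*(\zeta), \zeta_3), \\ u_3(\zeta) &= (\lambda^*(\zeta), \zeta_2, \zeta_3), \\ u_4(\zeta) &= (\zeta_1, \zeta_2, \lambda^*(\zeta)). \end{aligned} \tag{4.10} \]
The crucial assumption on the perturbation that we introduce into the triplet used to construct a quadratic is that we always allow a perturbation in only one of the three points. Without loss of generality let us assume that a minimum of the perturbed quadratic is constructed only for $\zeta$ such that $\zeta_1 < \zeta_2 < \zeta_3$. For a perturbation of the value of the function $f$ on the level $\varepsilon \ne 0$ we define three sets of triplets:
\[ T_1(\varepsilon) := \{\zeta \in R^3 : \zeta_1 < \zeta_2 < \zeta_3,\ f(\zeta_2) \le \min\{ f(\zeta_1)(1 - |\varepsilon|), f(\zeta_3) \} \} \tag{4.11} \]
\[ T_2(\varepsilon) := \{\zeta \in R^3 : \zeta_1 < \zeta_2 < \zeta_3,\ f(\zeta_2)(1 + |\varepsilon|) \le \min\{ f(\zeta_1), f(\zeta_3) \} \} \tag{4.12} \]
\[ T_3(\varepsilon) := \{\zeta \in R^3 : \zeta_1 < \zeta_2 < \zeta_3,\ f(\zeta_2) \le \min\{ f(\zeta_1), f(\zeta_3)(1 - |\varepsilon|) \} \} \tag{4.13} \]
with the minima of the underlying perturbed quadratics
\[ \tilde\lambda_1^*(\varepsilon; \zeta) = \frac{1}{2}\, \frac{(\zeta_3^2 - \zeta_2^2) f(\zeta_1)(1 + |\varepsilon|) - (\zeta_3^2 - \zeta_1^2) f(\zeta_2) + (\zeta_2^2 - \zeta_1^2) f(\zeta_3)}{(\zeta_3 - \zeta_2) f(\zeta_1)(1 + |\varepsilon|) - (\zeta_3 - \zeta_1) f(\zeta_2) + (\zeta_2 - \zeta_1) f(\zeta_3)}, \]
\[ \tilde\lambda_2^*(\varepsilon; \zeta) = \frac{1}{2}\, \frac{(\zeta_3^2 - \zeta_2^2) f(\zeta_1) - (\zeta_3^2 - \zeta_1^2) f(\zeta_2)(1 + |\varepsilon|) + (\zeta_2^2 - \zeta_1^2) f(\zeta_3)}{(\zeta_3 - \zeta_2) f(\zeta_1) - (\zeta_3 - \zeta_1) f(\zeta_2)(1 + |\varepsilon|) + (\zeta_2 - \zeta_1) f(\zeta_3)}, \]
\[ \tilde\lambda_3^*(\varepsilon; \zeta) = \frac{1}{2}\, \frac{(\zeta_3^2 - \zeta_2^2) f(\zeta_1) - (\zeta_3^2 - \zeta_1^2) f(\zeta_2) + (\zeta_2^2 - \zeta_1^2) f(\zeta_3)(1 + |\varepsilon|)}{(\zeta_3 - \zeta_2) f(\zeta_1) - (\zeta_3 - \zeta_1) f(\zeta_2) + (\zeta_2 - \zeta_1) f(\zeta_3)(1 + |\varepsilon|)} \]
for a perturbation of the function $f$ at the points $\zeta_1$, $\zeta_2$ and $\zeta_3$ respectively. Such definitions of the sets $T_l(\varepsilon)$ $(l \in \{1, 2, 3\})$ ensure that the perturbed minima are contained within the interval $[\zeta_1, \zeta_3]$. The corresponding sets of admissible triplets are defined as
\[ \tilde A_l(\varepsilon; \zeta) := T_l(\varepsilon) \cap \{\tilde u_1^l(\varepsilon; \zeta), \tilde u_2^l(\varepsilon; \zeta), \tilde u_3^l(\varepsilon; \zeta), \tilde u_4^l(\varepsilon; \zeta)\} \]
where
\[ \begin{aligned} \tilde u_1^l(\varepsilon; \zeta) &= (\zeta_1, \tilde\lambda_l^*(\varepsilon; \zeta), \zeta_2), \\ \tilde u_2^l(\varepsilon; \zeta) &= (\zeta_2, \tilde\lambda_l^*(\varepsilon; \zeta), \zeta_3), \\ \tilde u_3^l(\varepsilon; \zeta) &= (\tilde\lambda_l^*(\varepsilon; \zeta), \zeta_2, \zeta_3), \\ \tilde u_4^l(\varepsilon; \zeta) &= (\zeta_1, \zeta_2, \tilde\lambda_l^*(\varepsilon; \zeta)), \end{aligned} \tag{4.14} \]
where $l \in \{1, 2, 3\}$.

Lemma 4. Let $S$ denote the set of stationary points of the function $f$
\[ S := \{\zeta \in T : f'(\zeta_1) = 0 \text{ or } f'(\zeta_2) = 0 \text{ or } f'(\zeta_3) = 0\}. \]
The following statements hold:
1. For every $\zeta \in T$, the set $A(\zeta) = A_0(\zeta) \cup \tilde A_1(\varepsilon; \zeta) \cup \tilde A_2(\varepsilon; \zeta) \cup \tilde A_3(\varepsilon; \zeta)$ is nonempty.
2. The set valued map $A(\cdot)$ is closed.
3. For every $\zeta \in T \setminus S := \{\zeta \in T : \zeta \notin S\}$ such that $\zeta_1 < \zeta_2 < \zeta_3$ there is $c(y) < c(\zeta)$ for all $y \in A(\zeta)$, for $c(\zeta) = \tilde f_1(\zeta_1) + \tilde f_2(\zeta_2) + \tilde f_3(\zeta_3)$ where $\tilde f_l(\cdot) = f(\cdot)$ or $\tilde f_l(\cdot) = f(\cdot)(1 + |\varepsilon|)$ depending on whether the value at the point $\zeta_l$ was exact or perturbed.

Proof. Let us introduce the following notation: $\tilde f_l(\cdot) = f(\cdot)$ when evaluated at points $x \ne \zeta_l$, or $\tilde f_l(\cdot) = f(\cdot)(1 + \varepsilon)$ when evaluated at the point $x = \zeta_l$. At the beginning let us note that $T_l(\varepsilon) \subset T$ for $l \in \{1, 2, 3\}$ and $\varepsilon > 0$.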
1. Let $\zeta = (\zeta_1, \zeta_2, \zeta_3) \in T$ be fixed. If $f(\zeta_1)$, $f(\zeta_2)$ and $f(\zeta_3)$ are computed without perturbation then $A(\zeta)$ is not empty by the proof in [14]. Let us consider then the case in which a function value was approximated with the relative error equal to $\varepsilon$. The minimum of the quadratic constructed in this case is $\tilde\lambda_l^*(\varepsilon; \zeta)$, where $l \in \{1, 2, 3\}$ depends on the point at which the function value is approximated. Let us consider the case when $\tilde\lambda_l^*(\varepsilon; \zeta) \in [\zeta_1, \zeta_2]$, assuming moreover that the minimum $\lambda^*(\zeta)$ obtained as if the function was evaluated without any perturbation also belongs to $[\zeta_1, \zeta_2]$. Then $A(\zeta)$ is empty if and only if both $\tilde u_1(\varepsilon; \zeta)$ and $\tilde u_3(\varepsilon; \zeta)$ are not in $A(\zeta)$, i.e. if and only if
\[ f(\tilde\lambda_1^*(\varepsilon; \zeta)) > \min\{ f(\zeta_1)(1 + |\varepsilon|), f(\zeta_2) \} = f(\zeta_2) \tag{4.15} \]
and
\[ f(\zeta_2) > \min\{ f(\tilde\lambda_1^*(\varepsilon; \zeta)), f(\zeta_3) \} \ge \min\{ f(\tilde\lambda_1^*(\varepsilon; \zeta)), f(\zeta_2) \} \tag{4.16} \]
if a function value is approximated at $\zeta_1$ (analogous inequalities are considered for perturbations at $\zeta_2$ and $\zeta_3$). Since the inequalities (4.16) imply that $f(\zeta_2) \ge f(\tilde\lambda_1^*(\varepsilon; \zeta))$, we get a contradiction with the inequality (4.15), which proves the claim when the perturbation is at the point $\zeta_1$. A similar contradiction is obtained from the analogous inequalities for perturbations at $\zeta_2$ and $\zeta_3$. Please note that the condition that the corresponding unperturbed $\lambda^*(\zeta)$ is also in $[\zeta_1, \zeta_2]$ guarantees that if we compare $\tilde y = (\tilde y_1, \tilde y_2, \tilde y_3) \in \tilde A_l(\zeta)$ $(l \in \{1, 2, 3\})$ with $y = (y_1, y_2, y_3) \in A_0(\zeta)$ we get $\tilde y_1 \le y_1$ and $\tilde y_3 \ge y_3$. The case when $\tilde\lambda^*(\zeta)$ belongs to $[\zeta_2, \zeta_3]$ is symmetric. This shows that $A(\zeta)$ is not empty.

2. We will prove the closedness of $A$ from Definition 2. Let us assume that $\{\zeta^{(i)}\}_{i=0}^{\infty}$ and $\zeta^{(i)} \to \zeta_* \in T$, and that there exist $\zeta_{**} \in T$ and an infinite subsequence $K \subset N$ such that $\zeta^{(i+1)} \in A(\zeta^{(i)})$ for every $i \in K$ and $\zeta^{(i+1)} \to^K \zeta_{**}$ as $i \to \infty$. Then there must exist $k \in \{1, 2, 3, 4\}$ and an infinite subsequence $K' \subset K$ such that $\zeta^{(i+1)} = u_k(\zeta^{(i)})$ or $\zeta^{(i+1)} = \tilde u_k^{l_i}(\varepsilon; \zeta^{(i)})$ (where $l_i \in \{1, 2, 3\}$) for every $i \in K'$. As shown in [14] the functions $u_k(\cdot)$ are continuous, and therefore when the sequence $\{\zeta^{(i)}\}_{i=0}^{\infty}$ does not contain any $\zeta^{(i)}$ generated by approximated values, from the continuity of $u_k(\cdot)$ with respect to $\zeta$ and the closedness of the set $T$ it follows that $u_k(\zeta^{(i)}) \to u_k(\zeta_*) = \zeta_{**} \in A(\zeta_*)$. This proves the closedness of the transformation $A$ if no approximation is used in any sequence $\zeta^{(i)}$. Now we will consider the sequences containing approximated triplets. Introducing the approximated triplets $\zeta^{(i+1)}$ (i.e. $\zeta^{(i+1)} \in \tilde A_l(\varepsilon; \zeta^{(i)})$) introduces discontinuities of the first kind into the functions $u_k(\zeta)$, and the argument based on the continuity of the functions $u_k(\zeta)$ cannot be applied directly.
We will show that using the algorithm $A$ introduces a finite number of isolated point discontinuities, which guarantees that for any sequence $\{\zeta^{(i)}\}_{i=1}^{\infty}$, after removing some finite number of initial elements, we can apply the proof from [14]. We have to consider two cases
• When the number of occurrences of $\tilde u_k^{l_i}$ in the sequence $\{\zeta^{(i)}\}_{i=0}^{\infty}$ is finite, the proof of the closedness of the transformation $A$ given in [14] applies after removing some number of initial elements from $\{\zeta^{(i)}\}_{i=0}^{\infty}$.
• Let us then assume that $\tilde u_k^{l_i}$ occur an infinite number of times in $\{\zeta^{(i)}\}_{i=0}^{\infty}$. Since $\{\zeta^{(i)}\}_{i=1}^{\infty}$ is convergent, for any $\delta \in R$ there exists $i_0$ such that for all $i > i_0$ we have
\[ |\zeta^{(i_0)} - \zeta^{(i_0+1)}| = |(\zeta_1^{(i_0)}, \zeta_2^{(i_0)}, \zeta_3^{(i_0)}) - (\zeta_1^{(i_0+1)}, \zeta_2^{(i_0+1)}, \zeta_3^{(i_0+1)})| < \delta. \tag{4.17} \]
Let us choose such $i_1$ for which inequality (4.17) holds. Since the approximation is used infinitely many times in the next iterations, there exists a subsequence $K = \{i : i > i_1\} \subset N$ such that if $i \in K$ then $\zeta^{(i+1)} \in \tilde A_l(\varepsilon; \zeta^{(i)})$ for some $l \in \{1, 2, 3\}$. We will show that for any $\delta$ it is possible to choose $\varepsilon$ such that
\[ \|\zeta^{(i_1)} - \zeta^{(i_1+1)}\|_2 > 2\delta. \tag{4.18} \]
This will eventually give us a contradiction with the assumed convergence. To give conditions on $\varepsilon$ we solve inequalities (4.18) using the expressions (4.14) for $\tilde u_k^l(\varepsilon; \zeta)$ for $k = 1, \dots, 4$ and $l \in \{1, 2, 3\}$ and applying the method of parabola transformation $\hat q$ from Appendix A. For example, for a perturbation at $\zeta_1^{(i)}$ we get
\[ \varepsilon_1 > \frac{(B - C)(\zeta_{(r)} - K_a) - (B - \zeta_{(r)} C) - A(K_a - 1)}{A(K_a - 1)} \tag{4.19} \]
and one of the systems of inequalities
\[ \begin{cases} A(1 + \varepsilon_1) + B - C < 0 \\ A(K_a - 1) < 0 \end{cases} \quad \text{or} \quad \begin{cases} A(1 + \varepsilon_1) + B - C > 0 \\ A(K_a - 1) > 0 \end{cases} \tag{4.20} \]
is fulfilled, where
\[ \varepsilon_1 = \frac{\varepsilon}{f(\zeta_l^{(i)})}, \qquad K_a = \left( \frac{2\delta}{\zeta_3 - \zeta_1} \right)^2 - [1 - \zeta_{(r)}]^2 \]
and $\zeta_{(r)} = \frac{\zeta_2 - \zeta_1}{\zeta_3 - \zeta_1}$, with $A$, $B$ and $C$ defined in Appendix A. We leave it to the reader to show that these conditions are not contradictory and to derive the analogous inequalities for $l = 2$ and $l = 3$. The above considerations mean that for any $\delta$ the value $\varepsilon$ can be chosen so that (4.18) holds. But this contradicts the assumption that $\{\zeta^{(i)}\}_{i=0}^{\infty}$ converges, since we can choose two subsequences from $\{\zeta^{(i)}\}_{i=0}^{\infty}$ that converge to two different accumulation points. The first subsequence is formed from the $\zeta^{(i_1)}$'s and the second from the $\zeta^{(i_1+1)}$'s. Both are infinite. From this we conclude that from
certain $i_0$ onward the sequence $\{\zeta^{(i)}\}_{i=0}^{\infty}$ cannot contain approximated points, and therefore the proof from [14] also applies in this case. This finishes the proof of the closedness of the transformation $A(\zeta)$.

3. Let us assume that $\zeta \in T \setminus S$. Then $\tilde\lambda^*(\zeta) \in (\zeta_1, \zeta_3)$. Hereafter in this point we consider the quadratic obtained with the transformation $\hat q$ from Appendix A. This transformation preserves all properties used in this proof. In the following we use the expression $\Delta_l(\varepsilon; \zeta)$, which is the distance between the unperturbed and the perturbed minimum, i.e. $\Delta_l(\varepsilon; \zeta) = |\tilde\lambda_l^*(\zeta) - \lambda^*(\zeta)|$. See Appendix B for the expression for $\Delta_l(\varepsilon; \zeta)$. Here the situation is also symmetric with respect to $\zeta_1$ and $\zeta_3$, and we consider the case where $\lambda^*(\zeta) \in (\zeta_1, \zeta_2]$ as well as $\tilde\lambda_l^*(\zeta) \in (\zeta_1, \zeta_2]$ $(l \in \{1, 2, 3\})$. Firstly we observe that $f(\zeta_2) < f(\zeta_3)$, because if there was $f(\zeta_2) = f(\zeta_3)$, then $\tilde\lambda^*(\zeta) = \frac{1}{2}(\zeta_2 + \zeta_3)$, since no value is perturbed or only the value at the point $\zeta_2$ is perturbed and this perturbation has no effect on the minimum. Both cases contradict the assumption that $\tilde\lambda_l^*(\zeta) \in (\zeta_1, \zeta_2]$ $(l \in \{1, 2, 3\})$. When $\tilde\lambda_l^*(\zeta) \in (\zeta_1, \zeta_2]$ $(l \in \{1, 2, 3\})$ then only $u_1(\zeta)$ and $\tilde u_1^l(\varepsilon; \zeta)$ $(l \in \{1, 2, 3\})$ and $u_3(\zeta)$ and $\tilde u_3^l(\zeta)$ $(l \in \{1, 2, 3\})$ can belong to $A(\zeta)$. For unperturbed function values we have $\lambda^*(\zeta) < \zeta_2$ and for perturbed function values we have
\[ \tilde\lambda_l^*(\varepsilon; \zeta) + \Delta_l(\varepsilon; \zeta) < \zeta_2 \quad (l \in \{1, 2, 3\}). \tag{4.21} \]
In Appendix B we solve (4.21) with respect to $\varepsilon$, establishing conditions under which $\tilde\lambda_l^*(\varepsilon; \zeta) < \zeta_2$ and under which the line of proof from [14] remains valid. We get only three cases:
a. $A(\zeta) = \{u_1(\zeta), \tilde u_1^1(\zeta), \tilde u_1^2(\zeta), \tilde u_1^3(\zeta)\}$. Then, since $u_3(\zeta) \notin A(\zeta)$ as well as $\tilde u_3^l(\varepsilon; \zeta) \notin A(\zeta)$ $(l \in \{1, 2, 3\})$, and $f(\lambda^*(\zeta)) < f(\zeta_2)$ as well as
\[ \begin{aligned} f(\tilde\lambda_l^*(\varepsilon; \zeta)) &< f(\zeta_2) \quad (l = 1, 3) \\ f(\tilde\lambda_2^*(\varepsilon; \zeta)) &< f(\zeta_2)(1 + |\varepsilon|) \end{aligned} \tag{4.22} \]
we must have
\[ c(\tilde u_1^l(\zeta)) = \tilde f_l(\zeta_1) + f(\tilde\lambda_l^*(\zeta)) + \tilde f_l(\zeta_3) < \tilde f_l(\zeta_1) + \tilde f_l(\zeta_2) + \tilde f_l(\zeta_3) = c(\zeta), \]
where $\tilde f_l(\cdot) = f(\cdot)$ or $\tilde f_l(\cdot) = f(\cdot)(1 + \varepsilon)$ depending on which value was perturbed.
b. $A(\zeta) = \{u_3(\zeta), \tilde u_3^1(\zeta), \tilde u_3^2(\zeta), \tilde u_3^3(\zeta)\}$. Then, since $u_1(\zeta) \notin A(\zeta)$ as well as $\tilde u_1^l(\varepsilon; \zeta) \notin A(\zeta)$ $(l \in \{1, 2, 3\})$, we must have that $f(\zeta_2) \le f(\lambda^*(\zeta))$ as well as $\tilde f(\zeta_2) \le f(\tilde\lambda_l^*(\zeta))$ $(l \in \{1, 2, 3\})$ depending on which value was perturbed. Also $f(\lambda^*(\zeta)) < f(\zeta_1)$ as well as $f(\tilde\lambda_l^*(\zeta)) < \tilde f_l(\zeta_1)$ $(l \in \{1, 2, 3\})$, since otherwise we would have a local maximum in $[\zeta_1, \zeta_2]$, contradicting unimodality. Therefore in this case we must have
\[ c(\tilde u_3^l(\zeta)) = f(\tilde\lambda_l^*(\zeta)) + \tilde f_2(\zeta_2) + \tilde f_3(\zeta_3) < \tilde f_l(\zeta_1) + \tilde f_l(\zeta_2) + \tilde f_l(\zeta_3) = c(\zeta) \]
with the perturbation at the point $\zeta_l$.
c. Finally, we can have the case $A(\zeta) = \{u_1(\zeta), u_3(\zeta)\}$. In this case we are not able to include in $A(\zeta)$ any of the approximated triplets $\tilde u^l(\varepsilon; \zeta)$ $(l = 1, 2, 3)$. This is because of the following properties:
   i. $f(\zeta_2) < f(\zeta_1)$ by assumption,
   ii. $\lambda^*(\zeta) \le \zeta_2$,
   iii. $f(\lambda^*(\zeta)) = f(\zeta_2)$, which implies $\lambda^*(\zeta) = \zeta_2$.
These equalities hold since otherwise we would have a contradiction with the unimodality of $f(\cdot)$. Approximating any of the values would mean that we would not be able to guarantee property iii. Therefore, since $f(\zeta_2) < \min\{ f(\zeta_1), f(\zeta_3) \}$, we get $c(u_1(\zeta)) < c(\zeta)$ and $c(u_3(\zeta)) < c(\zeta)$.

From a practical point of view, for a given $\zeta^{(i)}$ where one of the coordinates is approximated or all are exact, we can see whether we have to use exact values, i.e. remove the approximation, by checking whether
\[ |\tilde\lambda_l^*(\varepsilon; \zeta) - \zeta_2| > \Delta_l(\varepsilon; \zeta) \quad (l \in \{1, 2, 3\}). \]
This exhausts all the possibilities and finishes the proof of the third point.

We are now in a position to formulate the Sequential Quadratic Interpolation Algorithm with Perturbation (cf. [14] for the unperturbed version). Please note that in the following we use the expressions "approximated" and "perturbed" function values interchangeably; they are synonymous here. Moreover, the adjective "exact" here means perturbed on the level of the machine precision, which is much smaller than $\varepsilon$. The main condition under which we can make use of the available approximation of the function at one of the points is the separation property, i.e. for the approximation used at a point with an index $l \in \{1, 2, 3\}$ the triplet $\zeta^{(i)}$ belongs to $T_l$, where $T_l$ is defined by (4.11), (4.12) or (4.13) respectively.

Algorithm 2. The Sequential Quadratic Interpolation Algorithm with the Objective Function Perturbation
Input: $\zeta^{(0)} \in T$, a starting point; $\varepsilon > 0$, the relative error of the approximation available at certain points of the function evaluation.
0. Set $i = 0$.
1. Compute $\lambda^* = \lambda^*(\zeta^{(i)})$ or $\lambda^* = \tilde\lambda_l^*(\varepsilon; \zeta^{(i)})$ depending on whether the function value was exact at all points $\zeta_1^{(i)}, \zeta_2^{(i)}, \zeta_3^{(i)}$ or it was perturbed at the point $\zeta_l^{(i)}$ $(l \in \{1, 2, 3\})$ respectively.
2. If $\lambda^* = \zeta_1^{(i)}$ or $\lambda^* = \zeta_3^{(i)}$ then STOP, else construct the set $A(\zeta^{(i)})$ and
   a. If the approximation to any value of the triplet $\zeta^{(i)}$ is not available then $A(\zeta^{(i)}) = A_0$ according to (4.10).
   b. If the approximation at the point $\zeta_l^{(i)}$ $(l \in \{1, 2, 3\})$ is available
      i. Compute the transformation $\hat q$ as described in Appendix A.
      ii. Compute $\Delta_l(\varepsilon; \zeta^{(i)})$.
      iii. If $|\tilde\lambda_l^*(\varepsilon; \zeta^{(i)}) - \zeta_2^{(i)}| < \Delta_l(\varepsilon; \zeta^{(i)})$ then $A(\zeta^{(i)}) = A_0$ and go to 3.
      iv. If
      \[ \tilde\lambda_l^*(\varepsilon; \zeta^{(i)}) + \Delta_l(\varepsilon; \zeta^{(i)}) < \zeta_2^{(i)} \quad \text{or} \quad \tilde\lambda_l^*(\varepsilon; \zeta^{(i)}) - \Delta_l(\varepsilon; \zeta^{(i)}) > \zeta_2^{(i)} \tag{4.23} \]
      then $A = A_0 \cup \tilde A_l$.
3. Compute $\zeta^{(i+1)} \in \arg\min\{c(\zeta) : \zeta \in A(\zeta^{(i)})\}$.
4. Replace $i := i + 1$ and go to step 1.

Now we can formulate a convergence theorem for the above algorithm analogous to that in ([14] p. 155).

Theorem 3. Suppose that $\{\zeta^{(i)}\}_{i=0}^{\infty}$ is a sequence constructed by Algorithm 2 in minimizing a continuously differentiable and unimodal function $f : R \to R$. Then $\zeta^{(i)} \to \hat\zeta$ as $i \to \infty$ with $\hat\zeta \in S$.

To prove the above theorem we show the way in which the proof given in [14] can be applied, making use of Lemma 4.

Proof. The main difficulty in applying the method of the proof from [14] is the fact that using the approximation at certain points of the domain causes a discontinuity of the cost function $c$ as well as a discontinuity of the functional calculating the minimum of the quadratics with respect to the parameter triplet $\zeta^{(i)}$. Let us observe first that the proof of the third point of Lemma 4 shows that allowing a perturbation according to (4.23) ensures that $\{\zeta_1^{(i)}\}_{i=0}^{\infty}$ is monotone increasing as well as $\{\zeta_3^{(i)}\}_{i=0}^{\infty}$ is monotone decreasing. Since both these sequences are bounded they are both convergent. Moreover, keeping $\Delta(\varepsilon; \zeta^{(i)})$ on a level such that the above property of the mentioned sequences would be fulfilled if the approximation were not used guarantees that $\hat\zeta \in [\zeta_1^{(i)}, \zeta_3^{(i)}]$ for any $i \in N$. We have to distinguish between two nontrivial cases:
1. When $\{\zeta^{(i)}\}_{i=1}^{\infty} \to \hat\zeta$ and $\hat\zeta = (\hat\zeta_1, \hat\zeta_2, \hat\zeta_3)$ is such an accumulation point that $\hat\zeta_1 < \hat\zeta_2 < \hat\zeta_3$. In this case the contradiction is obtained using the continuity of the cost function $c$ if an unperturbed algorithm was used. We can use the continuity argument here since, as was shown in the proof of Lemma 4, the approximation can only be used in a finite number of steps. Therefore after removing a certain number of initial steps the proof from [14] applies.
2. The first case does not apply, i.e. the sequence constructed by Algorithm 2 can have two accumulation points. In [14] it is shown that it is the same accumulation point. Lemma 8 guarantees that the argument contained in [14] holds true if the constructed sequence ends up in the stopping criterion with an unperturbed triplet. Since the sequence is not infinite we have to consider one more case, namely, when the algorithm stops with a perturbed triplet. The two accumulation points are $\zeta_* = (\hat\zeta_1, \hat\zeta_1, \hat\zeta_3)$ and $\zeta_{**} = (\hat\zeta_1, \hat\zeta_3, \hat\zeta_3)$. To provide the separation of function values, in the first case the perturbation can be only at the point $\hat\zeta_3$. In the second case the perturbation can be only at the point $\hat\zeta_1$. In the above sequences we have therefore $\zeta_2^{(i)} \to \zeta_1^{(i)}$ or $\zeta_2^{(i)} \to \zeta_3^{(i)}$. On the other hand, the separation between $\tilde\lambda^*(\varepsilon; \zeta^{(i)})$ and $\zeta_2^{(i)}$ has to be greater than $\Delta_3(\varepsilon; \zeta^{(i)})$ and $\Delta_1(\varepsilon; \zeta^{(i)})$ for the two cases respectively. This gives the contradiction with the convergence.
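For orientation, the unperturbed core of the algorithm (step 2a, where $A(\zeta^{(i)}) = A_0$) can be sketched in Python. This is a simplified sketch and not the perturbed Algorithm 2: the perturbation test (4.23), the transformation $\hat q$ and $\Delta_l$ are omitted, only strictly ordered triplets of $T$ are admitted, and the tolerance-based stopping rule, the iteration limit and all names (`sqi`, `in_T`, `quad_min`, `tol`, `max_iter`) are assumptions made for the illustration.

```python
def quad_min(z, fz):
    """Vertex of the parabola through the three points; None if degenerate."""
    (z1, z2, z3), (f1, f2, f3) = z, fz
    num = (z3**2 - z2**2) * f1 - (z3**2 - z1**2) * f2 + (z2**2 - z1**2) * f3
    den = (z3 - z2) * f1 - (z3 - z1) * f2 + (z2 - z1) * f3
    return None if den == 0 else 0.5 * num / den

def in_T(f, u):
    """Membership test for the strictly ordered part of the set T."""
    return u[0] < u[1] < u[2] and f(u[1]) <= min(f(u[0]), f(u[2]))

def sqi(f, zeta, tol=1e-8, max_iter=100):
    """Unperturbed sequential quadratic interpolation on a bracketing triplet."""
    if not in_T(f, zeta):
        raise ValueError("the starting triplet must bracket the minimizer")
    for _ in range(max_iter):
        z1, z2, z3 = zeta
        lam = quad_min(zeta, (f(z1), f(z2), f(z3)))
        if lam is None or abs(lam - z2) < tol:   # practical stopping rule (assumption)
            break
        candidates = [(z1, lam, z2), (z2, lam, z3), (lam, z2, z3), (z1, z2, lam)]
        feasible = [u for u in candidates if in_T(f, u)]   # A_0 = T ∩ {u_1,...,u_4}
        if not feasible:
            break
        # Step 3: choose the admissible triplet with the smallest cost c.
        zeta = min(feasible, key=lambda u: sum(f(v) for v in u))
    return zeta
```

Every accepted triplet again satisfies the bracketing condition, so for a unimodal function the minimizer stays inside $[\zeta_1, \zeta_3]$ and the value at the middle point never increases.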
From the practical point of view, to provide the convergence of the perturbed SQI algorithm, and therefore of the whole SPELROA method, it is sufficient that the procedure ε-check in step 3.c) of Algorithm 1 is launched in each iteration of the perturbed SQI algorithm. It takes the last three points $x_{k-2}, x_{k-1}, x_k$ and computes $\zeta_k = (0, t_1, t_2)$ such that $t_1 = \frac{x_{k-2} - x_{k-1}}{x_{k-2} - x_k}$, $t_2 = 1$, and then computes the transformation $\hat q$ defined by (4.34) to obtain the points $(0, -2)$, $(\zeta_{(r)}, \hat q(\zeta_{(r)}))$, $(1, -1)$ when $\tilde f(x_{k-2}) < \tilde f(x_k)$ (for the case when $\tilde f(x_{k-2}) > \tilde f(x_k)$ it is sufficient to rotate the scaled parabola around the point 0.5) and scales $\varepsilon$
\[ \varepsilon := \begin{cases} \dfrac{\varepsilon}{f(x_{k-2})}, & \text{when a perturbation is at the point } x_{k-2}, \\[1ex] \dfrac{\varepsilon}{f(x_{k-1})}, & \text{when a perturbation is at the point } x_{k-1}, \\[1ex] \dfrac{2\varepsilon}{f(x_k)}, & \text{when a perturbation is at the point } x_k \end{cases} \]
depending on the point at which the function value was approximated. Then if
1. a perturbation is at 0, then if $\varepsilon$ satisfies (4.35) and fulfills (4.36) or (4.37) the approximation can be used, else calculate the objective function deterministically,
2. a perturbation is at $\zeta_{(r)}$, then if $\varepsilon$ satisfies (4.38) and fulfills (4.37) or (4.40) the approximation can be used, else calculate the objective function deterministically,
3. a perturbation is at 1, then $\varepsilon$ has to satisfy analogous conditions (left to the reader) for the approximation to be used by the algorithm.
4.5 The Radial Basis Approximation
In this section we discuss the main aspects and possibilities of constructing a radial basis approximation of the objective function in Algorithm 1.

4.5.1 Detecting Dense Regions
Since the data set $Z = \{x_i\}_{i=1}^{N}$ is very sparse, a method to detect regions rich in data within the convex hull $\Omega$ of $Z$ is required. In [2] we introduced a number of merit
\[ \gamma(x, Z) := \frac{\sum_{j=1, i<j}^{N} a_{ij} W_{ij}}{\sum_{j=1, i<j}^{N} W_{ij}} \tag{4.24} \]
where $a_{ij} = \frac{d_{ij}}{r_j + r_i}$, $W_{ij} = \frac{1}{r_j + r_i}$, $d_{ij} = \|x_j - x_i\|_2$ and $r_j = \|x - x_j\|_2$. $\gamma(x, Z)$ measures how well the data points from $Z$ surround the evaluation point $x$. The numbers $a_{ij}$ measure how far the point $x$ lies from the segment $x_i x_j$. The maximal value, equal to 1, is attained by $a_{ij}$ if the point $x$ lies on this segment. The weights $W_{ij}$ emphasize in $\gamma(x, Z)$ the impact of the segments $x_i x_j$ that are close to the point $x$. The additional division by the sum of the weights $W_{ij}$ ensures that for any $x \in R^d$ the range of values of $\gamma(x, Z)$ is $(0, 1]$. If the value of $\gamma(x, Z)$ is greater than a certain threshold value, then an approximation with a good local quality can be expected at the evaluation point $x$.
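The merit number (4.24) translates directly into a few lines of NumPy. The sketch below is an illustration only; the function name `gamma_merit` is hypothetical, and the data points are assumed to be pairwise different so that $r_i + r_j > 0$ for every pair.

```python
import numpy as np

def gamma_merit(x, Z):
    """Merit number gamma(x, Z) of (4.24) for a point x and data points Z (rows)."""
    Z = np.asarray(Z, dtype=float)
    x = np.asarray(x, dtype=float)
    r = np.linalg.norm(Z - x, axis=1)            # r_j = ||x - x_j||_2
    i, j = np.triu_indices(len(Z), k=1)          # all pairs with i < j
    d = np.linalg.norm(Z[j] - Z[i], axis=1)      # d_ij = ||x_j - x_i||_2
    W = 1.0 / (r[i] + r[j])                      # weights W_ij
    a = d * W                                    # a_ij = d_ij / (r_i + r_j)
    return np.sum(a * W) / np.sum(W)
```

For two data points $(0, 0)$ and $(2, 0)$ the midpoint $(1, 0)$ lies on the connecting segment and gives $\gamma = 1$, whereas the distant point $(1, 5)$ gives $\gamma = 2 / (2\sqrt{26}) \approx 0.196$.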
Fig. 4.1 $\gamma(x, Z)$ defined by (4.24), plotted over the (X1, X2) plane. Here the set $Z$ is the data set constructed at the 50th optimization step of the optimization of the 2-parameter Rosenbrock function with the EXTREM algorithm
4.5.2 Regularization Training
For a given data set $Z = \{x_i, f(x_i)\}_{i=1}^{N} \subset R^d \times R$ of pairwise different data points $x_i$, and for the Gaussian radial basis function $\phi(x) = \exp(-x^2 / r^2)$, $r > 0$ (see [9]), a radial basis function interpolant $s(x)$ is defined as
\[ s(x) = \sum_{i=1}^{N} w_i \phi(\|x - x_i\|), \tag{4.25} \]
where
\[ s(x_i) = f(x_i) = f_i, \quad i = 1, \dots, N. \]
The interpolation conditions $s(x_i) = f_i$ imply the matrix formulation
\[ \Phi w = f, \tag{4.26} \]
where $\Phi = [\phi(x_i - x_j)]_{i=1,\dots,N;\ j=1,\dots,N}$, $f = [f_1, f_2, \dots, f_N]^T$, $w = [w_1, w_2, \dots, w_N]^T$. Positive definiteness [9] of $\phi$ guarantees nonsingularity of the matrix $\Phi$, and thus there exists a unique solution $w$ for which $s(x)$ interpolates the data from the set $Z$. Although other choices of positive definite radial basis functions are possible, we choose the Gaussian function because of the natural interpretation of its parameter $r$, which can be set to a value proportional to the diameter of the data set $Z = \{x_i\}_{i=1}^{N}$.

There are two reasons for which we solve an approximation problem rather than an interpolation problem. Firstly, when the data set $Z$ is very irregular and $\phi$ is strictly positive definite, the matrix $\Phi$ is very ill-conditioned. Secondly, solving (4.26) yields a solution $s(x)$ which may oscillate between data points where the data is sparse. The approximation solution is sought by means of Tikhonov regularization. Suppose that we are given a linear mapping $T : H \to R$ and define the regularization operator $J : H \to R$ by $J(s) = \|T s\|_R^2$, where $H$ is the native space of the underlying radial basis function $\phi(\cdot)$. For a function $f \in H$ and for a prescribed value of the single regularization parameter $\lambda$, its approximation is a function $f_\lambda(x) \in H$ of the form (4.25) which is the solution of the minimization problem
\[ \min_{f_\lambda} \left\{ \sum_{i=1}^{N} [f(x_i) - f_\lambda(x_i)]^2 + \lambda^2 J(f_\lambda) : f_\lambda \in H \right\}. \tag{4.27} \]
In matrix form the problem (4.27) is written as
\[ (\Phi^T \Phi + \lambda^2 I) w = \Phi^T f \tag{4.28} \]
for a $\lambda > 0$ governing the trade-off between data reproduction and the desired smoothness of the solution $f_\lambda(x)$. Due to the ill-conditioning of the matrix $\Phi$, a direct inversion of the matrix $(\Phi^T \Phi + \lambda^2 I)$ in (4.28) is not numerically stable. To solve (4.28) in a numerically stable way we use the singular value decomposition of the matrix $\Phi$ defined as
\[ \Phi = U S V^T, \tag{4.29} \]
where $U \in R^{N \times N}$, $V \in R^{N \times N}$ are orthogonal matrices and $S = \mathrm{diag}(\sigma_1, \dots, \sigma_N)$, where $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_N$ are the singular values of $\Phi$. The singular value decomposition is unique up to the signs of the columns of the matrices $U$ and $V$. Using the above decomposition we express the inverse matrix in (4.28) as
\[ \Phi^{\dagger} = (\Phi^T \Phi + \lambda^2 I)^{-1} \Phi^T = V ((S^T S + \lambda^2 I)^{-1} S^T) U^T = V \Omega_\lambda U^T, \]
where
\[ \Omega_\lambda = \mathrm{diag}\left( \frac{\sigma_1}{\sigma_1^2 + \lambda^2}, \frac{\sigma_2}{\sigma_2^2 + \lambda^2}, \dots, \frac{\sigma_N}{\sigma_N^2 + \lambda^2} \right). \tag{4.30} \]
Using the above equation the weight vector $w_\lambda$ is expressed (see [5]) by the expansion with respect to the singular vectors of the matrix $\Phi$
\[ w_\lambda = V \Omega_\lambda U^T f = \sum_{i=1}^{N} \frac{\sigma_i}{\sigma_i^2 + \lambda^2} (u_i^T f) v_i. \tag{4.31} \]
Comparing the expansion (4.31) for $\lambda > 0$ with the expansion for $\lambda = 0$, i.e. for the interpolation problem, which reads
\[ w = V S^{-1} U^T f = \sum_{i=1}^{N} \frac{1}{\sigma_i} (u_i^T f) v_i, \tag{4.32} \]
one can see the role of the regularization parameter $\lambda$. For $\sigma_p \ge \lambda \ge \sigma_{p+1}$ we have
\[ \frac{1}{\sigma_k} > \frac{\sigma_k}{\sigma_k^2 + \lambda^2} \quad (k > p), \]
and hence the impact of the singular vectors corresponding to singular values $\sigma_k < \lambda$ is damped in the expansion (4.31). This enables us to avoid oscillations of the solution that are introduced by inverting small singular values in the expansion of the weight vector. Another approach to solving the problem of ill-conditioning of the interpolation matrix generated by multiquadric functions was presented in [7].
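The filtered singular value expansion can be sketched in a few lines of NumPy. The sketch assumes the Gaussian basis function with a user-supplied width `r` and uses the filter factors $\sigma_i / (\sigma_i^2 + \lambda^2)$; the function names `gaussian_matrix` and `tikhonov_weights` are hypothetical.

```python
import numpy as np

def gaussian_matrix(X, r):
    """Interpolation matrix Phi_ij = exp(-||x_i - x_j||^2 / r^2) for data points X (rows)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.exp(-np.sum(diff**2, axis=2) / r**2)

def tikhonov_weights(Phi, f, lam):
    """Weights w_lambda = sum_i sigma_i / (sigma_i^2 + lam^2) (u_i^T f) v_i."""
    U, sigma, Vt = np.linalg.svd(Phi)
    filt = sigma / (sigma**2 + lam**2)      # damps terms with sigma_i << lam
    return Vt.T @ (filt * (U.T @ f))
```

For `lam = 0` and a nonsingular matrix the filter reduces to $1 / \sigma_i$ and the weights solve the interpolation problem $\Phi w = f$; for `lam > 0` the result agrees with the solution of the regularized normal equations $(\Phi^T \Phi + \lambda^2 I) w = \Phi^T f$ without forming $\Phi^T \Phi$ explicitly.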
4.5.3 Choice of the Regularization Parameter λ Value An appropriate procedure to choose λ parameter is a crucial issue in the Tikhonov regularization. A chosen procedure should be able to find a λ for which the solution s(x) reproduces well the data set Z as well as it generalizes well in-between data points. These goals are conflicting and therefore a method should give us the right balance between reproduction of the data set as well as a generalization for data from outside the data set. 4.5.3.1
Weighted Gradient Variance and Local Mean Square Error
In [2] we introduced a method that relates the choice of λ to the reproduction quality on the data points close to the evaluation point x. The method is defined as follows. Let

$$ r_j = \|x - x_j\|_2 , \qquad j = 1, \dots, N. $$

For a sequence of values of λ that covers the singular value spectrum $(\sigma_N, \sigma_1)$ of the matrix S from the decomposition (4.29) we calculate the Normalized Local Mean Square Error

$$ \mathrm{NLMSE}_{\lambda,Z}(x) = \frac{\sum_{j=1}^{N} \frac{[s_\lambda(x_j) - f_j]^2}{f_j^2} \cdot \frac{1}{r_j^2}}{\sum_{j=1}^{N} \frac{1}{r_j^2}} , \tag{4.33} $$

and the Weighted Gradient Variance

$$ \mathrm{WGV}_{\lambda,Z}(x) = \frac{\sum_{j=1}^{N} \frac{\|\nabla s_\lambda(x_j) - G_{\lambda,Z}(x)\|_2}{r_j^2}}{\sum_{j=1}^{N} \frac{1}{r_j^2}} , $$

where $G_{\lambda,Z}(x)$ is the weighted mean gradient at the point x, defined as

$$ G_{\lambda,Z}(x) = \frac{\sum_{j=1}^{N} \frac{\nabla s_\lambda(x_j)}{r_j^2}}{\sum_{j=1}^{N} \frac{1}{r_j^2}} , $$
where $s_\lambda(x_j)$ is the value of the approximation constructed for λ at the point $x_j$ from the data set Z, and ∇ is the gradient operator. This is a discrepancy method, since only those values of λ are considered for which NLMSE is smaller than a prescribed, user-defined threshold $\mathrm{NLMSE}_{thr}$. The threshold specifies the approximation quality that has to be preserved on the points from Z nearest to the evaluation point x. From the set of values of λ for which $\mathrm{NLMSE}_{\lambda,Z}(x)$ is smaller than $\mathrm{NLMSE}_{thr}$, the minimizer of the oscillation measure $\mathrm{WGV}_{\lambda,Z}(x)$ is chosen as the optimum. Figure 4.2 shows a) the data set of 30 points from the optimization of the 2-parametric Rosenbrock function by the EXTREM algorithm, b) a plot of $\mathrm{NLMSE}_{\lambda,Z}(x)$ for two different points A and B of the domain, c) the corresponding plot of $\mathrm{WGV}_{\lambda,Z}(A)$ and $\mathrm{WGV}_{\lambda,Z}(B)$. Figure 4.3 shows a) the error of the obtained approximation and b) the chosen value of the parameter λ.
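The selection rule described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the chapter's code: the callback `fit(lam)`, which returns the approximation values and gradients at the data points for a given λ, is hypothetical, the toy data below are invented for the example, and the evaluation point is assumed not to coincide with a data point (so that all r_j are positive) while all f_j are assumed non-zero.

```python
import numpy as np

def inverse_square_weights(x, X):
    """Weights 1/r_j^2 with r_j the Euclidean distance from x to data point x_j."""
    return 1.0 / np.sum((X - x)**2, axis=1)

def nlmse(x, X, f, s_vals):
    """Normalized local mean square error of s_vals against the data values f."""
    w = inverse_square_weights(x, X)
    return np.sum(w * (s_vals - f)**2 / f**2) / np.sum(w)

def wgv(x, X, grads):
    """Weighted spread of the gradients around their weighted mean."""
    w = inverse_square_weights(x, X)
    G = (w[:, None] * grads).sum(axis=0) / w.sum()
    return np.sum(w * np.linalg.norm(grads - G, axis=1)) / w.sum()

def choose_lambda(x, X, f, lambdas, fit, nlmse_thr):
    """Among lambdas with NLMSE below the threshold, return the one minimizing WGV."""
    best = None
    for lam in lambdas:
        s_vals, grads = fit(lam)        # hypothetical callback
        if nlmse(x, X, f, s_vals) < nlmse_thr:
            score = wgv(x, X, grads)
            if best is None or score < best[1]:
                best = (lam, score)
    return None if best is None else best[0]

# Toy stand-in for a fitted model: larger lam reproduces the data less accurately
# but oscillates less (assumed behaviour, for illustration only).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
f = np.array([1.0, 2.0, 3.0])
base_grads = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_fit = lambda lam: (f * (1.0 + lam), base_grads / lam)

x_eval = np.array([0.4, 0.3])
chosen = choose_lambda(x_eval, X, f, [1e-1, 1e-3, 1e-4], toy_fit, nlmse_thr=1e-5)
print(chosen)
```

In the toy run the largest λ is rejected by the threshold, and of the two admissible values the one with the smaller oscillation measure is returned.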
4.5.4 Error Bounds for Radial Basis Approximation
In the previous section we presented a method to choose the value of the regularization parameter λ. Unfortunately, this method neither guarantees that the generalization error is smaller than a prescribed value ε nor allows us to estimate the generalization error directly. Like Generalized Cross Validation (cf. [18]), it estimates an error on the data points only (WGV on the points nearest to the evaluation point, GCV on all points of the data set). Error bounds for radial basis interpolation have been studied extensively for more than two decades. The first bounds establishing the rate of convergence of a radial basis interpolant for functions from the native space of the underlying radial basis function were given in [10]. That paper laid the foundations for the development of the theory of convergence of radial basis interpolation (see
[Figure 4.2: three panels. a) Data points in the plane with points A and B marked (vertical axis x2). b) log10(NLMSE) versus log10(λ) for points A and B. c) log10(WGVλ,Z(x)) versus log10(λ) for points A and B.]
Fig. 4.2 a) Data set consisting of 30 points from the optimization path of the EXTREM algorithm optimizing the 2-parametric Rosenbrock function. b) Local reproduction of the data near points A and B, measured by $\mathrm{NLMSE}_{\lambda,Z}(A)$ and $\mathrm{NLMSE}_{\lambda,Z}(B)$ respectively. Here $\mathrm{NLMSE}_{thr} = 10^{-5}$ is depicted by a straight line. c) A measure of the oscillation of the solution, $\mathrm{WGV}_{\lambda,Z}(x)$, at points A and B. The dots mark the optimal λ for points A and B.
e.g. [17], [19] and references therein). The rate of convergence is considered with respect to the density of the data set, i.e. the global fill distance

$$ h(Z, \Omega) := \max_{y \in \Omega} \, \min_{1 \le i \le N} \|y - x_i\|_2 , $$
[Figure 4.3: two panels. a) Approximation error of the approximation constructed with WGV. b) λ for the approximation constructed with WGV.]
Fig. 4.3 a) The approximation error for λ chosen by the measure $\mathrm{WGV}_{\lambda,Z}(x)$ for $\mathrm{NLMSE}_{thr} = 5.0 \cdot 10^{-6}$. b) Chosen value of the parameter λ. Note that in regions where the data is sparse the method suggests a greater value of λ.
where $Z = \{x_i\}_{i=1}^{N} \subseteq \Omega$. The fill distance is a mesh norm measuring the radius of the biggest ball contained in the domain Ω that contains no data points. The domain Ω is assumed to satisfy the interior cone condition with radius r and angle θ.

Definition 3. A set $\Omega \subseteq \mathbb{R}^d$ is said to satisfy an interior cone condition if there exist an angle θ ∈ (0, π/2) and a radius r > 0 such that for every x ∈ Ω a unit vector ξ(x) exists such that the cone

$$ C(x, \xi(x), \theta, r) := \{ x + \eta y : y \in \mathbb{R}^d, \ \|y\|_2 = 1, \ y^T \xi(x) \ge \cos\theta, \ \eta \in [0, r] \} $$

is contained in Ω.

A summary of error bounds for various radial basis functions was given in [16]. A very precise derivation of the error bounds for Gaussian radial basis function interpolation was given in [20], where one can find a derivation of all constants involved in the bound. Analogous error estimates (with all constants involved) for approximation with positive definite radial basis functions constructed with Tikhonov regularization, for functions from the Sobolev space $W_p^\tau$, were presented in [19]. Here we show that the error estimates of [19] cannot be used in our scheme because of the small number of points in the optimization process, and that the use of the heuristic methods from the previous sections is therefore justified. All of the error bounds rely on a common property of local polynomial reproduction that has to be guaranteed by the approximation procedure (cf. [19]). The error bounds can be formulated as the following theorem.

Theorem 4. Suppose that $\Omega \subset \mathbb{R}^d$ is bounded and satisfies an interior cone condition with angle θ and radius r. Let m be the maximal degree of polynomials reproduced by $f_\lambda$ of the form (4.25), defined as the solution of (4.27). Define the following quantities
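$$ \vartheta := 2 \arcsin \frac{\sin\theta}{4(1 + \sin\theta)} , \qquad Q(m, \theta) := \frac{\sin\theta \, \sin\vartheta}{8 m^2 (1 + \sin\theta)(1 + \sin\vartheta)} . $$

If the global fill distance h(Z, Ω) satisfies $h(Z, \Omega) \le Q(m, \theta)\, r$, then the approximation error can be bounded as

$$ \|f - f_\lambda\|_{L_\infty(\Omega)} \le C \, [h(Z, \Omega)]^{\tau - d/p} \, |f|_{W_p^\tau(\Omega)} + 2\varepsilon , $$

where $\varepsilon = \max_{x_j \in Z} |f(x_j) - f_\lambda(x_j)|$.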
Let us consider the unit ball B(0, 1) as the domain of the approximation. It satisfies the interior cone condition with r = 1 and θ = π/3. The approximation can be seen to reproduce only linear polynomials, i.e. m = 1, and therefore for the above bounds to hold we need h(Z, Ω) ≤ Q(1, π/3) < 0.012613. This means that the distance from one data point to another must not be greater than 1.3% of the radius of the ball containing the whole data set. A data set this dense cannot be generated by any local optimization algorithm. The above consideration shows that the existing accurate error bounds for regularized approximation with radial basis functions cannot be applied to estimate the approximation error in the SPELROA method, owing to the sparseness of the data sets constructed by local optimization algorithms.
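The size of the admissible fill distance in this example can be checked directly from the two quantities defined in Theorem 4. The sketch below is only an illustration; the function name `fill_distance_bound` is hypothetical.

```python
import math

def fill_distance_bound(m, theta):
    """Q(m, theta) from Theorem 4, with the auxiliary angle vartheta."""
    sin_t = math.sin(theta)
    vartheta = 2.0 * math.asin(sin_t / (4.0 * (1.0 + sin_t)))
    sin_v = math.sin(vartheta)
    return sin_t * sin_v / (8.0 * m**2 * (1.0 + sin_t) * (1.0 + sin_v))

# Unit ball: r = 1, theta = pi/3, linear reproduction (m = 1).
q = fill_distance_bound(1, math.pi / 3)
print(q)
```

The value stays below the bound 0.012613 quoted above, i.e. on the order of one percent of the radius, and it shrinks further as 1/m² for higher polynomial degrees.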
4.6 Numerical Results
Numerical results on the performance of the method for real optimization problems from the LHC magnet design process with up to 5 parameters were presented in [2]. Here we present results for three test problems from the set of test problems proposed in [12].
4.6.1 Test Problems
1. Six variable problem and eight variable problem I
As the six variable problem and the first eight variable problem we considered the Extended Rosenbrock Function (i.e. problem no. 21 in [12]). It is defined as

$$ f([x_1, x_2, \dots, x_d]) = \sum_{i=1}^{d/2} \left[ 100 \, (x_{2i} - x_{2i-1}^2)^2 + (1 - x_{2i-1})^2 \right] . $$

The standard starting point is $x_0 = (\xi_j)$, where $\xi_{2j-1} = -1.2$ and $\xi_{2j} = 1$, and the minimum equals f* = 0 at (1, . . . , 1).
2. Eight variable problem II
As the second eight variable problem we chose the Chebquad function (i.e. problem no. 35 in [12]). It is defined as

$$ f(x_1, \dots, x_d) = \sum_{i=1}^{d} f_i(x)^2 , \qquad \text{where} \qquad f_i(x) = \frac{1}{d} \sum_{j=1}^{d} T_i(x_j) - \int_0^1 T_i(x)\,dx $$

and $T_i$ is the i-th Chebyshev polynomial shifted to the interval [0, 1]. The standard starting point is $x_0 = (\xi_j)$, where $\xi_j = j/(d + 1)$, and the minimum for d = 8 equals $f^* = 3.51687 \cdot 10^{-3}$.
3. Eleven variable problem
As the eleven variable problem we chose the Osborne 2 function (i.e. problem no. 19 in [12]). It is defined as

$$ f(x_1, \dots, x_{11}) = \sum_{i=1}^{65} f_i(x)^2 , $$

where

$$ f_i(x) = y_i - \left( x_1 e^{-t_i x_5} + x_2 e^{-(t_i - x_9)^2 x_6} + x_3 e^{-(t_i - x_{10})^2 x_7} + x_4 e^{-(t_i - x_{11})^2 x_8} \right) , $$

$t_i = (i - 1)/10$, and $y_i$ for i = 1, . . . , 65 are constants that can be found in [12]. The standard starting point is $x_0$ = [1.3, 0.65, 0.65, 0.7, 0.6, 3, 5, 7, 2, 4.5, 5.5] and the minimum is $f^* = 4.01377 \cdot 10^{-2}$.
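The first two test functions are easy to reproduce. The following sketch assumes NumPy and uses hypothetical function names; the shifted Chebyshev polynomial is evaluated as T_i(2x − 1), and its integral over [0, 1] is taken in closed form (zero for odd i, −1/(i² − 1) for even i). The Osborne 2 function is left out because the constants y_i are tabulated in [12] and are not reproduced here.

```python
import numpy as np

def extended_rosenbrock(x):
    """Extended Rosenbrock function for an even number of variables."""
    x = np.asarray(x, dtype=float)
    odd, even = x[0::2], x[1::2]            # x_{2i-1} and x_{2i}
    return float(np.sum(100.0 * (even - odd**2)**2 + (1.0 - odd)**2))

def chebquad(x):
    """Chebquad function built from Chebyshev polynomials shifted to [0, 1]."""
    x = np.asarray(x, dtype=float)
    d = x.size
    t = 2.0 * x - 1.0                       # map [0, 1] onto [-1, 1]
    total = 0.0
    for i in range(1, d + 1):
        coeffs = np.zeros(i + 1)
        coeffs[i] = 1.0                     # selects T_i in the Chebyshev basis
        mean_ti = np.mean(np.polynomial.chebyshev.chebval(t, coeffs))
        integral = 0.0 if i % 2 else -1.0 / (i * i - 1.0)
        total += (mean_ti - integral)**2
    return float(total)

# Standard starting points from the text.
x0_rosenbrock = np.tile([-1.2, 1.0], 3)                 # d = 6
x0_chebquad = np.arange(1, 9) / 9.0                     # d = 8, xi_j = j/(d+1)
print(extended_rosenbrock(x0_rosenbrock), chebquad(x0_chebquad))
```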
4.6.2 Results
For each problem we ran Algorithm 1 combined with the EXTREM algorithm, using the following three objective function approximation methods:
1. Radial basis function approximation without regularization,
2. Radial basis function approximation with regularization using Generalized Cross Validation [18] to choose the parameter λ,
3. Radial basis function approximation with regularization using Weighted Gradient Variance to choose the parameter λ.
To construct the approximation of the objective function with each of these methods we used 30 Gaussian radial basis functions with a common shape parameter set to half of the distance between the two most distant centers. No additional parameters were required for the first and the second method. The third method used the single user-defined threshold $\mathrm{NLMSE}_{thr}$ for the measure (4.33). Apart from the parameters concerning the construction of the radial basis approximation, we had to set three parameters related directly to Algorithm 1 itself: the number of initial steps Is = 50, ε = 10^{-3} for the ε-check procedure, and $\gamma_{thr}$, the threshold value for the measure (4.24) used to detect the reliable region. The tables below show the performance of Algorithm 1 on the problems defined in the previous subsection with the above objective function approximation strategies, compared to the EXTREM algorithm. The first column shows the number
Table 4.1 6-variable Rosenbrock function optimization using: Left) pure EXTREM, Right) Algorithm 1 without regularization and with γthr = 0.65

EXTREM
step num. | ‖x − x∗‖ | f
250 | 2.267853 | 2.885768
500 | 0.650821 | 0.109817
750 | 0.418187 | 0.031570
1000 | 0.246522 | 0.012666
1250 | 0.029985 | 0.001228
1523 | 0.002014 | 0.000001

Alg. 1 no regularization
step num. | ‖x − x∗‖ | f | num. approx.
250 | 2.459117 | 3.711824 | 22
500 | 1.318398 | 0.559350 | 45
750 | 0.675244 | 0.089372 | 35
933 | 0.130808 | 0.005606 | 10
Table 4.2 6-variable Rosenbrock function optimization using: Left) Algorithm 1 with regularization using GCV, Right) Algorithm 1 with regularization using WGV with NLMSEthr = 5 · 10^{-6}. In both cases γthr = 0.65

Alg. 1 with GCV
step num. | ‖x − x∗‖ | f | num. approx.
250 | 2.363547 | 3.991352 | 11
500 | 1.094954 | 0.334820 | 28
750 | 0.778636 | 0.162569 | 48
1000 | 0.218992 | 0.009666 | 47
1250 | 0.025537 | 0.000131 | 49
1288 | 0.025537 | 0.000131 | 10

Alg. 1 with WGV
step num. | ‖x − x∗‖ | f | num. approx.
250 | 2.292762 | 2.923941 | 12
500 | 1.250487 | 0.435774 | 16
750 | 0.630858 | 0.150665 | 28
1000 | 0.513698 | 0.048269 | 38
1250 | 0.094284 | 0.001763 | 47
1262 | 0.094284 | 0.001763 | 4
Table 4.3 8-variable Rosenbrock function optimization using: Left) pure EXTREM, Right) Algorithm 1 without regularization and γthr = 0.65

EXTREM
step num. | ‖x − x∗‖ | f
250 | 3.923525 | 10.438226
500 | 2.731706 | 5.606541
750 | 1.771076 | 1.032648
1000 | 1.009230 | 0.251387
1250 | 0.484663 | 0.050215
1500 | 0.274941 | 0.015386
1750 | 0.262434 | 0.012950
2000 | 0.208392 | 0.007465
3471 | 0.000553 | 0.000000

Alg. 1 no regularization
step num. | ‖x − x∗‖ | f | num. approx.
250 | 3.030179 | 6.376938 | 22
500 | 2.153959 | 1.855113 | 23
750 | 1.380208 | 0.471618 | 30
1000 | 0.910852 | 0.215283 | 46
1250 | 0.361361 | 0.028169 | 43
1500 | 0.186309 | 0.006926 | 55
1537 | 0.186309 | 0.006926 | 7
Table 4.4 8-variable Rosenbrock function optimization using: Left) Algorithm 1 with regularization using GCV, Right) Algorithm 1 with regularization using WGV with NLMSEthr = 10^{-6}. In both cases γthr = 0.65

Alg. 1 with GCV
step num. | ‖x − x∗‖ | f | num. approx.
250 | 3.100123 | 7.932378 | 17
500 | 2.353671 | 3.580721 | 28
750 | 1.810286 | 1.085058 | 26
1000 | 0.946269 | 0.240704 | 41
1200 | 0.705151 | 0.118365 | 43

Alg. 1 with WGV
step num. | ‖x − x∗‖ | f | num. approx.
250 | 3.263247 | 9.379781 | 11
500 | 2.288176 | 2.394216 | 15
750 | 1.272541 | 0.435303 | 17
1000 | 0.767617 | 0.143009 | 35
1250 | 0.505116 | 0.057994 | 42
1500 | 0.146317 | 0.004114 | 51
1750 | 0.028930 | 0.000169 | 35
1861 | 0.028958 | 0.000169 | 12
Table 4.5 8-variable Chebquad function optimization using: Left) pure EXTREM, Right) Algorithm 1 without regularization and with γthr = 0.65

EXTREM
step num. | ‖x − x∗‖ | f
1 | 0.161512 | 0.038618
250 | 0.097709 | 0.006187
500 | 0.059121 | 0.004681
750 | 0.016460 | 0.003698
1000 | 0.000351 | 0.003517
1237 | 0.000000 | 0.003517

Alg. 1 no regularization
step num. | ‖x − x∗‖ | f | num. approx.
250 | 0.098420 | 0.006216 | 4
500 | 0.059421 | 0.004684 | 18
750 | 0.009926 | 0.003584 | 10
1000 | 0.000454 | 0.003517 | 48
1197 | 0.000141 | 0.003517 | 97
of steps of the algorithm, i.e. the sum of the number of direct function evaluations and the number of steps in which the radial basis function approximation was used. The second column shows the distance from the minimum and the third column shows the objective function value. The last column shows the number of steps, within the previous 250 steps, in which the objective function approximation was used. As we can see, in all cases SPELROA required considerably fewer steps than the pure EXTREM algorithm to stop. Using the WGV method to build the radial basis approximation gave the best convergence results, that is, when the stopping point of SPELROA with WGV is compared with the stopping points obtained with the other approximation methods. For all problems NLMSEthr was chosen heuristically to be ∼ 10^{-6}. This means that the reconstruction error of the training set in the vicinity of the evaluation point was at the level of 10^{-6}. For computing the reliable region, γthr = 0.65 was sufficient to preserve convergence of the method. In the optimization of the 8-variable Chebquad function it turned out that γthr could be reduced to 0.6. That gave 16 points in which the objective function was approximated in the first 250 steps
Table 4.6 8-variable Chebquad function optimization using: Left) Algorithm 1 with regularization using GCV and γthr = 0.65, Right) Algorithm 1 with regularization using WGV with NLMSEthr = 10^{-6} and with γthr = 0.60

Alg. 1 with GCV
step num. | ‖x − x∗‖ | f | num. approx.
250 | 0.098419 | 0.006216 | 4
500 | 0.059477 | 0.004684 | 17
750 | 0.008713 | 0.003584 | 11
1000 | 0.00156 | 0.003518 | 118
1095 | 0.00156 | 0.003518 | 68

Alg. 1 with WGV
step num. | ‖x − x∗‖ | f | num. approx.
250 | 0.109306 | 0.007110 | 16
500 | 0.064139 | 0.004759 | 73
750 | 0.009802 | 0.003586 | 46
1000 | 0.001266 | 0.003519 | 31
1077 | 0.001051 | 0.003517 | 52
Table 4.7 11-variable Osborne 2 function optimization using: Left) pure EXTREM, Right) Algorithm 1 without regularization and with γthr = 0.65

EXTREM
step num. | ‖x − x∗‖ | f
1 | 4.755269 | 2.093420
250 | 1.004450 | 0.081672
500 | 0.099411 | 0.041034
750 | 0.073530 | 0.041772
1000 | 0.007715 | 0.040138
1250 | 0.001092 | 0.040138
1434 | 0.000000 | 0.040138

Alg. 1 no regularization
step num. | ‖x − x∗‖ | f | num. approx.
250 | 1.379902 | 0.977272 | 29
500 | 0.568322 | 0.059211 | 31
710 | 0.335485 | 0.041512 | 43
Table 4.8 11-variable Osborne 2 function optimization using: Left) Algorithm 1 with regularization using GCV, Right) Algorithm 1 with regularization using WGV with NLMSEthr = 10^{-6}. In both cases γthr = 0.65

Alg. 1 with GCV
step num. | ‖x − x∗‖ | f | num. approx.
250 | 1.673133 | 0.365797 | 32
500 | 0.975217 | 0.045721 | 31
750 | 0.496683 | 0.041805 | 43
1000 | 0.068066 | 0.040462 | 35
1106 | 0.062328 | 0.040177 | 25

Alg. 1 with WGV
step num. | ‖x − x∗‖ | f | num. approx.
250 | 0.953416 | 0.106864 | 14
500 | 0.334653 | 0.047136 | 38
750 | 0.0441106 | 0.040209 | 49
919 | 0.037960 | 0.040185 | 29
instead of 4 such points when γthr = 0.65. An interesting result was also obtained for the Osborne 2 function. The EXTREM algorithm found a better solution than that suggested in [12]. Algorithm 1 did not converge to this minimum with any of the approximation methods. The method without regularization did not converge at all, whereas GCV
and WGV converged rather to the minimum suggested in [12] than to that found by EXTREM.
4.7 Summary
The Search Procedure Exploiting Locally Regularized Objective Approximation is a method to speed up local optimization processes. The method combines a non-gradient optimization algorithm with regularized local radial basis function approximation. It relies on using a locally regularized radial basis function approximation instead of a direct objective function evaluation in a certain number of function evaluation steps of the optimization algorithm. In this chapter we presented a proof of convergence of the Search Procedure Exploiting Locally Regularized Objective Approximation, which applies to any Gauss-Seidel or conjugate direction search algorithm that uses sequential quadratic interpolation as its line search procedure. Convergence is proven under the assumption that the approximation of the objective function, with the prescribed relative approximation error, is exploited only in the sequential quadratic interpolation. The performance of the method was demonstrated on the 6- and 8-parametric Rosenbrock function, the 8-parametric Chebquad function and the 11-parametric Osborne 2 function. Further studies will compare the method with trust region methods.
Acknowledgements. I would like to thank all referees for their very valuable comments.
Appendix A
The minimum of the quadratic q(ζ) built on the three points (ζ1, f(ζ1)), (ζ2, f(ζ2)) and (ζ3, f(ζ3)), where {ζ1, ζ2, ζ3} ⊂ R, equals

$$ \lambda^* = \frac{1}{2} \, \frac{(\zeta_3^2 - \zeta_2^2) f(\zeta_1) + (\zeta_3^2 - \zeta_1^2) f(\zeta_2) + (\zeta_2^2 - \zeta_1^2) f(\zeta_3)}{(\zeta_3 - \zeta_2) f(\zeta_1) + (\zeta_3 - \zeta_1) f(\zeta_2) + (\zeta_2 - \zeta_1) f(\zeta_3)} . $$
Let us transform the quadratic $q(x) = ax^2 + bx + c$, $x = \zeta_1 + t(\zeta_3 - \zeta_1)$, $t \in (-\infty, \infty)$, to the quadratic $\hat{q}(x) = \hat{a}x^2 + \hat{b}x + \hat{c}$, assuming the following:
1. The transformation is of the form

$$ \hat{q}(x) = E\,p(x) + D, \qquad p(x) = L_2(x), \tag{4.34} $$

where $L_2(x)$ is the Lagrange interpolation parabola constructed on the points $(0, f(\zeta_1))$, $(\zeta_{(r)}, f(\zeta_2))$ and $(1, f(\zeta_3))$, with $\zeta_{(r)} = \frac{\zeta_2 - \zeta_1}{\zeta_3 - \zeta_1}$.
2. a) If $f(\zeta_1) > f(\zeta_3)$: $\hat{q}(0) = -1$, $\hat{q}(1) = -2$.
b) If $f(\zeta_1) < f(\zeta_3)$: $\hat{q}(0) = -2$, $\hat{q}(1) = -1$.
From condition 1 we get $L_2(x) = a'x^2 + b'x + c'$, where

$$ a' = \frac{f(\zeta_1)}{\zeta_{(r)}} + \frac{f(\zeta_2)}{\zeta_{(r)}(\zeta_{(r)} - 1)} + \frac{f(\zeta_3)}{1 - \zeta_{(r)}} , $$

$$ b' = -\left[ \frac{f(\zeta_1)(\zeta_{(r)} + 1)}{\zeta_{(r)}} + \frac{f(\zeta_2)}{\zeta_{(r)}(\zeta_{(r)} - 1)} + \frac{f(\zeta_3)\,\zeta_{(r)}}{1 - \zeta_{(r)}} \right] , $$

$$ c' = f(\zeta_1). $$

From conditions 2 we get

a) $E = \frac{1}{a' + b'}$, $\quad D = 2 - Ec' = 2 - \frac{c'}{a' + b'}$;

b) $E = -\frac{1}{a' + b'}$, $\quad D = 1 + Ec' = 1 + \frac{c'}{a' + b'}$.

Then in the canonical form $\hat{q}(x) = \hat{a}x^2 + \hat{b}x + \hat{c}$ we have $\hat{a} = Ea'$, $\hat{b} = Eb'$, $\hat{c} = Ec' + D$. The crucial properties of this transformation are:
1. $\hat{q}(\zeta_{(r)}) < -1$.
2. $p(\zeta_{(r)}) = f(\zeta_2)$.
3. $\hat{q}(\zeta_{(r)})$ does not depend on $f(\zeta_2)$.
4. For the minimum point $\lambda^*$ of q such that $\lambda^* \in [\zeta_1, \zeta_3]$ we get a minimum point $\hat{\lambda}^* = \frac{\lambda^* - \zeta_1}{\zeta_3 - \zeta_1}$, where $\hat{\lambda}^*$ is a minimum point of $\hat{q}$.
5. The transformation has a singularity when $a' = -b'$.

This transformation reduces the number of free parameters from 6 to 3.
Appendix B
B.1 Expression for Δ(ε; ζ)
The unperturbed minimum $\lambda^*(\zeta)$ is related to the perturbed minimum $\tilde{\lambda}_l^*(\varepsilon; \zeta)$ as

$$ \tilde{\lambda}_l^*(\varepsilon; \zeta) = \lambda^*(\zeta) \pm \Delta_l(\varepsilon; \zeta). $$

Here we assume that both $\lambda^*(\zeta)$ and $\tilde{\lambda}_l^*(\varepsilon; \zeta)$ are minima of the quadratics obtained by the transformation $\hat{q}(\cdot)$ from Appendix A, with the assumption that $f(\zeta_1) < f(\zeta_3)$; if the opposite is true, we have to rotate the quadratics with respect to the center of the interval $[\zeta_1, \zeta_3]$. To simplify the derivations, let us denote $A = 2(\zeta_{(r)} - 1)$, $B = \hat{q}(\zeta_{(r)})$, $C = \zeta_{(r)}$. Then for the unperturbed quadratic we have

$$ \lambda^* = \frac{1}{2} \, \frac{A(\zeta_{(r)} + 1) + B - \zeta_{(r)} C}{A + B - C} , $$

whereas for the perturbed one we have

$$ \tilde{\lambda}_1^*(\varepsilon; \zeta) = \frac{1}{2} \, \frac{A(\zeta_{(r)} + 1)(1 + \varepsilon) + B - \zeta_{(r)} C}{A(1 + \varepsilon) + B - C} , $$

$$ \tilde{\lambda}_2^*(\varepsilon; \zeta) = \frac{1}{2} \, \frac{A(\zeta_{(r)} + 1) + B(1 + \varepsilon) - \zeta_{(r)} C}{A + B(1 + \varepsilon) - C} , $$

$$ \tilde{\lambda}_3^*(\varepsilon; \zeta) = \frac{1}{2} \, \frac{A(\zeta_{(r)} + 1) + B - \zeta_{(r)} C(1 + \varepsilon)}{A + B - C(1 + \varepsilon)} , $$

for the perturbation at 0, $\zeta_{(r)}$ and 1 respectively, where $\zeta_{(r)} = \frac{\zeta_2 - \zeta_1}{\zeta_3 - \zeta_1}$. We can simplify the expression $|\lambda^* - \tilde{\lambda}^*(\varepsilon; \zeta)|$ to get

$$ \Delta_1(\varepsilon; \zeta) = \frac{C - \zeta_{(r)} B}{2(A + B - C)} \cdot \frac{A\varepsilon}{A\varepsilon + A + B - C} $$

and $\Delta_2(\varepsilon; \zeta)$ and $\Delta_3(\varepsilon; \zeta)$ similarly.
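The simplification that leads to Δ1 can be checked with exact rational arithmetic. The sketch below is only a sanity check on sample values: the function names are hypothetical, and the chosen numbers (ζ(r) = 2/5, B = −5/2, ε = 1/100) are arbitrary, apart from respecting A = 2(ζ(r) − 1), C = ζ(r) and B < −1.

```python
from fractions import Fraction

def unperturbed_minimum(A, B, C, r):
    """Minimum of the unperturbed quadratic, with r standing for zeta_(r)."""
    return Fraction(1, 2) * (A * (r + 1) + B - r * C) / (A + B - C)

def perturbed_minimum_1(A, B, C, r, eps):
    """Minimum when the value at 0 is perturbed by the relative error eps."""
    return Fraction(1, 2) * (A * (r + 1) * (1 + eps) + B - r * C) / (A * (1 + eps) + B - C)

def delta_1(A, B, C, r, eps):
    """Closed-form expression for the shift of the minimum."""
    return (C - r * B) / (2 * (A + B - C)) * (A * eps) / (A * eps + A + B - C)

r = Fraction(2, 5)            # sample zeta_(r) in (0, 1)
A = 2 * (r - 1)
B = Fraction(-5, 2)           # sample value of the transformed quadratic at zeta_(r)
C = r
eps = Fraction(1, 100)

shift = unperturbed_minimum(A, B, C, r) - perturbed_minimum_1(A, B, C, r, eps)
print(shift, delta_1(A, B, C, r, eps))
```

For these values the difference between the unperturbed and the perturbed minimum coincides exactly with the closed-form expression, including its sign.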
B.2 The Main Condition
If inequalities (4.21) are satisfied then we have a guarantee that $\lambda^*(\zeta) < \zeta_2$. This is because if $\tilde{\lambda}_l^*(\varepsilon; \zeta)$ is shifted by $\Delta_l(\varepsilon; \zeta)$ to the left, then shifting it back to the right will not make it greater than $\zeta_2$; if it is shifted to the right, then we have a margin of $2\Delta_l(\varepsilon; \zeta)$. As previously mentioned, we consider only perturbations at 0 and $\zeta_{(r)}$, i.e. l = 1 and l = 2. For l = 1 we get

$$ \frac{A(\zeta_{(r)} + 1)(1 + \varepsilon) + B - \zeta_{(r)} C}{A(1 + \varepsilon) + B - C} < 2\zeta_2 . $$

We have to consider two cases depending on the sign of the denominator.
1. If $A(1 + \varepsilon) + B - C > 0$ then we get

$$ A(1 - \zeta_{(r)})\,\varepsilon < A(\zeta_{(r)} - 1) + B(2\zeta_{(r)} - 1) - C(2\zeta_{(r)} - 1), $$

which again splits into two cases depending on the sign of the coefficient of ε:
a. When $A(1 - \zeta_{(r)}) > 0$ then

$$ \varepsilon < \frac{A(\zeta_{(r)} - 1) + B(2\zeta_{(r)} - 1) - C(2\zeta_{(r)} - 1)}{A(1 - \zeta_{(r)})} . $$

b. When $A(1 - \zeta_{(r)}) < 0$ then

$$ \varepsilon > \frac{A(\zeta_{(r)} - 1) + B(2\zeta_{(r)} - 1) - C(2\zeta_{(r)} - 1)}{A(1 - \zeta_{(r)})} . $$

2. If $A(1 + \varepsilon) + B - C < 0$ then we get two cases:
a. When $A(1 - \zeta_{(r)}) > 0$ then

$$ \varepsilon > \frac{A(\zeta_{(r)} - 1) + B(2\zeta_{(r)} - 1) - C(2\zeta_{(r)} - 1)}{A(1 - \zeta_{(r)})} . $$

b. When $A(1 - \zeta_{(r)}) < 0$ then

$$ \varepsilon < \frac{A(\zeta_{(r)} - 1) + B(2\zeta_{(r)} - 1) - C(2\zeta_{(r)} - 1)}{A(1 - \zeta_{(r)})} . $$
Only upper bounds for ε, i.e. 1.a) and 2.b), are interpretable as a solution to our problem. So finally we get two regions for ε and $\zeta_{(r)}$ where

$$ \varepsilon < \frac{A(\zeta_{(r)} - 1) + B(2\zeta_{(r)} - 1) - C(2\zeta_{(r)} - 1)}{A(1 - \zeta_{(r)})} , \tag{4.35} $$

for

$$ A(1 + \varepsilon) + B - C > 0, \qquad A(1 - \zeta_{(r)}) > 0 \tag{4.36} $$

or

$$ A(1 + \varepsilon) + B - C < 0, \qquad A(1 - \zeta_{(r)}) < 0. \tag{4.37} $$

In the same way we can obtain conditions for l = 2:

$$ \varepsilon < \frac{-B(1 - \zeta_{(r)}) - A}{B(1 - \zeta_{(r)})} , \tag{4.38} $$

for

$$ A + B(1 + \varepsilon) - C > 0, \qquad B(1 - \zeta_{(r)}) < 0 \tag{4.39} $$

or

$$ A + B(1 + \varepsilon) - C < 0, \qquad B(1 - \zeta_{(r)}) > 0. \tag{4.40} $$
References
1. Bazan, M., Russenschuck, S.: Using neural networks to speed up optimization algorithms. Eur. Phys. J. AP 12, 109–115 (2000)
2. Bazan, M., Aleksa, M., Russenschuck, S.: An improved method using radial basis function neural networks to speed up optimization algorithms. IEEE Trans. on Magnetics 38, 1081–1084 (2002)
3. Bazan, M., Aleksa, M., Lucas, J., Russenschuck, S., Ramberger, S., Völlinger, C.: Integrated design of superconducting magnets with the CERN field computation program ROXIE. In: Proc. 6th International Computational Accelerator Physics Conference, Darmstadt, Germany (September 2000)
4. Conn, A.R., Gould, N.I.M., Toint, P.L.: Trust region methods. SIAM, Philadelphia (2005)
5. Hansen, P.C.: Rank-deficient and Discrete Ill-posed Problems. SIAM, Philadelphia (1998)
6. Jacob, H.G.: Rechnergestützte Optimierung statischer und dynamischer Systeme. Springer, Heidelberg (1982)
7. Kansa, E.J., Hon, Y.C.: Circumventing the ill-conditioning problem with multiquadratic radial basis: Applications to elliptic partial differential equations. Comp. Math. with App. 39(7-8), 123–137 (2000)
8. Luenberger, D.G.: Introduction to linear and nonlinear programming, 2nd edn. Addison-Wesley, New York (1984)
9. Micchelli, C.A.: Interpolation of Scattered Data: Distance Matrices and Conditionally Positive Definite Functions. Constructive Approximation 2, 11–22 (1986)
10. Madych, W.R., Nelson, S.A.: Multivariate interpolation and conditionally positive definite functions II. Math. Comp. 4(189), 211–230 (1990)
11. Madych, W.R.: Miscellaneous error bounds for multiquadric and related interpolators. Comp. Math. with Appl. 24(12), 121–138 (1992)
12. Moré, J.J., Garbow, B.S., Hillstrom, K.E.: Testing unconstrained optimization software. ACM Trans. Math. Software 7(1), 17–41 (1981)
13. Oeuvray, R.: Trust region method based on radial basis functions with application on biomedical imaging, Ecole Polytechnique Federale de Lausanne (2005)
14. Polak, E.: Optimization. Algorithms and Consistent Approximations. Applied Mathematical Sciences, vol. 124. Springer, Heidelberg (1997)
15. Powell, M.J.D.: On calculation of orthogonal vectors. The Computer Journal 11(3), 302–304 (1968)
16. Schaback, R.: Error estimates and condition number for radial basis function interpolation. Adv. Comput. Math. 3, 251–264 (1995)
17. Schaback, R.: Native Hilbert Spaces for Radial Basis Functions I. The new development in Approximation Theory. Birkhäuser, Basel (1999)
18. Wahba, G.: Spline models for observational data. SIAM, Philadelphia (1990)
19. Wendland, H., Rieger, C.: Approximate interpolation with applications to selecting smoothing parameters. Numerische Mathematik 101, 643–662 (2005)
20. Wendland, H.: Gaussian Interpolation Revisited. In: Kopotun, K., Lyche, T., Neamtu, M. (eds.) Trends in Approximation Theory, pp. 427–436. Vanderbilt University Press, Nashville (2001)
21. Zangwill, W.I.: Nonlinear Programming; a Unified Approach. Prentice-Hall International Series. Prentice-Hall, Englewood Cliffs (1969)
Chapter 5
Optimization Problems with Cardinality Constraints
Rubén Ruiz-Torrubiano, Sergio García-Moratilla, and Alberto Suárez
Abstract. In this article we review several hybrid techniques that can be used to accurately and efficiently solve large optimization problems with cardinality constraints. Exact methods, such as branch-and-bound, require lengthy computations and are, for this reason, infeasible in practice. As an alternative, this study focuses on approximate techniques that can identify near-optimal solutions at a reduced computational cost. Most of the methods considered encode the candidate solutions as sets. This representation, when used in conjunction with specially devised search operators, is specially suited to problems whose solution involves the selection of optimal subsets of specified cardinality. The performance of these techniques is illustrated in optimization problems of practical interest that arise in the fields of machine learning (pruning of ensembles of classifiers), quantitative finance (portfolio selection), time-series modeling (index tracking) and statistical data analysis (sparse principal component analysis).
Rubén Ruiz-Torrubiano, Sergio García-Moratilla, Alberto Suárez: Computer Science Department, Universidad Autónoma de Madrid, Spain
Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 105–130. © Springer-Verlag Berlin Heidelberg 2010
5.1 Introduction
Many practical optimization problems involve the selection of subsets of specified cardinality from a collection of items. These problems can be solved by exhaustive enumeration of all the candidate solutions of the specified cardinality. In practice, only small problems of this type can be exactly solved within a reasonable amount of
time. The number of steps required to find the optimal solution can be reduced using branch-and-bound techniques. Nevertheless, the computational complexity of the search remains exponential, which means that large problems cannot be handled by these exact methods. It is therefore important to design algorithms that can identify near-optimal solutions at a reduced computational cost. In this article we present a unified framework for handling optimization problems with cardinality constraints. A number of approximate methods within this framework are analyzed and their performance is tested in extensive benchmark experiments. In its general form, an optimization problem with cardinality constraints can be formulated in terms of a vector of binary variables $z = \{z_1, z_2, \dots, z_D\}$, $z_i \in \{0, 1\}$. The goal is to minimize a cost function that depends on z, subject to a constraint that specifies the number of non-zero bits in z:

$$ \min_{z} \{F(z)\}, \qquad \sum_{i=1}^{D} z_i = k. \tag{5.1} $$
Optimization problems with cardinality constraints given by an inequality $\sum_{i=1}^{D} z_i \le K$ can be solved by selecting the best of the solutions of K optimization problems with the equality constraint $\sum_{i=1}^{D} z_i = k$; k = 1, 2, . . . , K. Finally, the solution of a combinatorial optimization problem without restrictions can be obtained by solving the sequence of problems with cardinality constraints $\sum_{i=1}^{D} z_i = k$; k = 1, 2, . . . , D. Continuous optimization tasks with cardinality constraints can also be analyzed within this framework. Consider the problem of minimizing a function that depends on a D-dimensional continuous parameter θ. We search for solutions with exactly k non-zero components of θ:

$$ \min_{\theta} \{F(\theta)\}, \qquad \theta \in \mathbb{R}^D, \qquad \sum_{i=1}^{D} I(\theta_i \ne 0) = k, \tag{5.2} $$
where I(·) is an indicator function (I(true) = 1, I(false) = 0). This hybrid problem can be transformed into a purely combinatorial one of the type (5.1) by introducing a D-dimensional binary vector z whose i-th component indicates whether variable i is allowed to take a non-zero value ($z_i = 1$) or is set to zero ($z_i = 0$):

$$ \min_{z} \left[ F^*(z) \right], \qquad \sum_{i} z_i = k. \tag{5.3} $$

The function $F^*(z)$ is the solution of an auxiliary continuous optimization problem in the reduced space defined by z:

$$ F^*(z) = \min_{\theta^{[z]}} F(\theta^{[z]}), \tag{5.4} $$
where θ [z] denotes the k-dimensional vector formed by the components of θ for which the value of the corresponding component of z is 1. The remaining components of θ are set to zero in the auxiliary problem. This decomposition makes it clear how hybrid methods that combine techniques for combinatorial and continuous optimization can be applied to identify the solution of the subset selection problem with a continuous objective function: For a given value of z, the optimal θ [z] is calculated by solving the surrogate problem defined by (5.4), where z determines which components of θ are allowed to take a non-zero value. The final solution is obtained by searching in the purely combinatorial space of possible values of z, using the optimal function value that is a solution of (5.4) to guide the exploration. The success of this hybrid approach depends on the availability of a continuous optimization algorithm that can efficiently identify the globally optimal solution of the auxiliary optimization problem defined in (5.4) and on the efficiency of the algorithm used to address the combinatorial part of the search. For simple forms of the continuous objective function and of the remaining restrictions (other than the cardinality constraint), the auxiliary problem can be efficiently solved by exact optimization techniques. For instance, efficient linear and quadratic programming algorithms are available if the function is linear or quadratic, respectively [1]. For more complex objective functions, general non-linear optimization techniques (such as quasi-Newton [2] or interior-point methods [3]) may be necessary. In these cases, there is no guarantee that the solution of the auxiliary problem be globally optimal. As a consequence, if the solutions found are far from the global optimum, the combinatorial search that is used to solve the original problem (5.3) can be seriously misled. 
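The decomposition into an outer combinatorial search and an inner continuous problem can be sketched for a case in which the inner problem is solvable exactly. The example below is an assumption-laden illustration, not one of the methods evaluated in this chapter: the objective is taken to be a least-squares cost F(θ) = ‖Xθ − y‖², the outer search is plain exhaustive enumeration (feasible only for small D), and the function names and synthetic data are hypothetical.

```python
import itertools
import numpy as np

def inner_solve(X, y, subset):
    """F*(z): exact least-squares optimum over the components selected by z."""
    cols = list(subset)
    theta_sub, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    resid = y - X[:, cols] @ theta_sub
    return float(resid @ resid), theta_sub

def best_subset(X, y, k):
    """Outer search: enumerate all subsets of size k and keep the best F*(z)."""
    D = X.shape[1]
    best = None
    for subset in itertools.combinations(range(D), k):
        value, theta_sub = inner_solve(X, y, subset)
        if best is None or value < best[0]:
            best = (value, subset, theta_sub)
    value, subset, theta_sub = best
    theta = np.zeros(D)                 # components outside the subset stay zero
    theta[list(subset)] = theta_sub
    return value, subset, theta

# Synthetic data with a known two-component solution (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
y = 2.0 * X[:, 1] - 3.0 * X[:, 4]

value, subset, theta = best_subset(X, y, k=2)
print(subset, theta)
```

Replacing the enumeration in `best_subset` by a stochastic search over subsets, while keeping `inner_solve` as the evaluation of a candidate, gives the hybrid structure described in the text.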
In this work, we assume that the continuous optimization task defined by (5.4) can be solved exactly and focus on the solution of the combinatorial part of the original problem. Section 5.2 describes how standard combinatorial optimization techniques can be adapted to handle the cardinality constraints considered. Emphasis is placed on the use of an appropriate encoding for the search states in terms of sets. This set-based encoding is particularly well-suited for the definition of search operators that preserve the cardinality of the candidate solutions. With this adaptation, the approximate methods described provide a practicable alternative to identifying the exact solution by exhaustive search, which becomes computationally infeasible in large problems, or to computationally inexpensive optimization methods, such as greedy search, which tend to find suboptimal solutions. The experiments presented in Section 5.3 illustrate how the techniques reviewed find near-optimal solutions with limited computational resources and can therefore be used to address optimal subset selection problems of practical interest. Novel results regarding the application of these techniques to some of these problems (ensemble pruning and sparse PCA) are also provided. Finally, Section 8.5 summarizes the conclusions of this work.
R. Ruiz-Torrubiano, S. García-Moratilla, and A. Suárez
5.2 Approximate Methods for the Solution of Optimization Problems with Cardinality Constraints

In this section we describe how simulated annealing (SA), genetic algorithms (GA), and estimation of distribution algorithms (EDA) can be used to solve large optimization problems with cardinality constraints. These are stochastic search methods that generate candidate solutions, which are then rejected or selected according to their performance. In their standard formulation, no particular consideration is given to the number of non-zero components in the candidate solutions generated. Cardinality constraints can be taken into account using one of the following approaches:

(i) No candidate solution violating the constraint is generated at any time by the algorithm. Enforcing this property requires the design of appropriate genetic and neighborhood operators, such that the space of solutions of a given cardinality is closed under these search operators [4] [5].
(ii) Solutions that violate the cardinality constraint can be generated by the successor operators. Whenever a violation occurs, a repair algorithm is applied to transform the infeasible solution into a solution of the desired cardinality. Typically, a local search is used to obtain the closest feasible solution, but random repair mechanisms can be used as well [6] [7].
(iii) Solutions that violate the cardinality constraint can be generated by the successor operators. In contrast with the previous approach, infeasible solutions are not repaired. Instead, a penalty term is introduced in the evaluation function so that infeasible candidate solutions have worse scores than feasible ones with an equivalent performance [6].

In the experiments described in Section 5.3, the best overall performance is obtained by methods that use a set-based representation together with appropriately designed successor operators that preserve the cardinality of the solutions.
These results underscore the importance of using a representation that properly reflects the structure of the problem. Therefore, the focus of this study is on the design of search operators that preserve the cardinality of the candidate solutions. These specially adapted methods are generally preferable to standard schemes that take into account the restrictions by either ad-hoc repair mechanisms or by including a term in the cost function that penalizes violations of the constraints.
5.2.1 Simulated Annealing

Simulated annealing (SA) is an optimization technique inspired by the field of thermodynamics [8]. The main idea is to mimic the physical process of melting a solid and then cooling it to allow the formation of a regular crystalline structure that attains a minimum of the system's free energy. In simulated annealing the function to be minimized, $F(z)$ (the objective or cost function), takes the role of the free energy in the physical system. The physical configuration space is replaced by the space of candidate solutions, which are connected by transitions defined by a neighborhood operator. The stochastic search proceeds by considering transitions from the current state $z^{(cur)}$ to a neighboring configuration $z_l \in N(z^{(cur)})$ generated at random. The proposed transition is accepted if the value of the objective function decreases. Otherwise, if the candidate configuration is of higher cost, the transition is accepted only with a certain probability. This probability is expressed as a Boltzmann factor

\[ P_{accept}(z_l, z^{(cur)}; T_k) = \exp\left( -\frac{F(z_l) - F(z^{(cur)})}{T_k} \right), \tag{5.5} \]

where the parameter $T_k$ plays the role of a temperature. A general version of this technique is given as Algorithm 1. In this pseudocode, the function annealingSchedule returns the temperature $T_k$ for the following epoch. It is common to use a geometric schedule $T_k = \gamma T_{k-1}$, where $\gamma$, smaller than but usually close to one, regulates how fast the temperature is decreased.
Algorithm 1. Simulated annealing
• Generate initial configuration $z^{(0)}$ and initial temperature $T_0$
• $z^{(cur)} \leftarrow z^{(0)}$
• $i \leftarrow 0$
• While convergence criteria are not met [Annealing loop]
  – $i \leftarrow i + 1$
  – Fix temperature for epoch $i$: $T_i$ = annealingSchedule($T_{i-1}$)
  – Fix length for epoch $i$: $L_i$
  – For $l = 1, \ldots, L_i$ [Epoch loop]
    1. Select randomly an element $z_l \in N(z^{(cur)})$.
    2. If $F(z_l) < F(z^{(cur)})$, then $z^{(cur)} \leftarrow z_l$
    3. Else, generate $u \sim U[0, 1]$. If $u < P_{accept}(z_l, z^{(cur)}; T_i)$, then $z^{(cur)} \leftarrow z_l$
• Return the best value found.
Cardinality constraints can be handled in SA by selecting an appropriate encoding for $z$ and a corresponding neighborhood $N(z)$. In particular, the candidate solutions can be encoded as sets of specified cardinality. The components of the binary vector $z$ are then interpreted as indicating membership in the set: if $z_i = 1$, the $i$th element is included in the solution. Otherwise, if $z_i = 0$, it is excluded from the selection. It is also necessary to design a neighborhood operator that preserves the cardinality constraints, so that no penalty or repair mechanisms are needed. A simple design is to exchange an element included in the current candidate solution with an element excluded from it. This is the version of SA that will be used in Section 5.3.
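The set encoding and the exchange neighborhood just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the function names, the default parameter values, the fixed number of epochs used as a stopping rule, and the use of $T_0$ for the first epoch are all assumptions made for the example. It also assumes $0 < k < |U|$, so that an exchange is always possible.

```python
import math
import random


def swap_neighbor(selected, universe, rng):
    """Exchange one selected element with one excluded element.

    The cardinality of the candidate solution is preserved by construction.
    """
    out_elem = rng.choice(sorted(selected))
    in_elem = rng.choice(sorted(universe - selected))
    return (selected - {out_elem}) | {in_elem}


def simulated_annealing(cost, universe, k, t0=1.0, gamma=0.9,
                        epochs=50, epoch_length=100, seed=0):
    """Minimize cost(subset) over subsets of `universe` with exactly k elements.

    Hypothetical stopping rule: a fixed number of epochs (an assumption).
    """
    rng = random.Random(seed)
    universe = frozenset(universe)
    current = frozenset(rng.sample(sorted(universe), k))
    current_cost = cost(current)
    best, best_cost = current, current_cost
    temperature = t0
    for _ in range(epochs):                  # annealing loop
        for _ in range(epoch_length):        # epoch loop
            candidate = swap_neighbor(current, universe, rng)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Improvements are always accepted; otherwise the move is
            # accepted with the Boltzmann probability of (5.5).
            if delta < 0 or rng.random() < math.exp(-delta / temperature):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        temperature *= gamma                 # geometric schedule
    return best, best_cost
```

A call such as `simulated_annealing(lambda s: sum(s), range(20), k=5)` searches for the five smallest integers in the range. Every state visited has exactly five elements, so neither a penalty term nor a repair step is required.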
5.2.2 Genetic Algorithms

Genetic algorithms are a class of optimization methods that mimic the process of natural evolution [9]. Optimization is achieved by selection from a population that exhibits some random variability. The outline of a general genetic algorithm is shown in Algorithm 2.
Algorithm 2. Genetic Algorithm
• Generate an initial population $P_0$ with $P$ individuals.
• For each individual $I_j \in P_0$, calculate fitness $\Phi(I_j)$.
• Initialize the generation counter $t \leftarrow 0$.
• While convergence criteria are not met:
  – Increase the generation counter $t \leftarrow t + 1$.
  – Select a parent set $\Pi_t \subset P_t$ composed of $n_P$ individuals from the population.
  – While $\Pi_t \neq \emptyset$:
    · Extract two individuals $I_1$ and $I_2$ from $\Pi_t$.
    · Apply the crossover operator $\Theta(I_1, I_2)$ and generate $n_C$ children (with probability $p_C$).
    · Apply the mutation operator to the $n_C$ children (with probability $p_M$).
  – Calculate the fitness value of the new individuals.
  – Add the new individuals to the population.
  – Select $P$ individuals that make up $P_{t+1}$, the population for generation $t + 1$.
For problems with cardinality constraints, two alternative encodings for the candidate solutions are considered. A first possibility is a standard binary representation, where the chromosomes are bit-strings. The difficulty with this encoding is that standard mutation and crossover operators do not preserve the number of nonzero bits of the parents. A possible solution to this problem is to assign a lower fitness value to individuals in the population that violate the cardinality constraint. Assuming that a problem with an inequality cardinality constraint is considered, a penalized fitness function can be built by subtracting from the standard fitness function a penalty term that depends on the magnitude of the violation of the cardinality constraint

\[ \Delta_k(z) = |\mathrm{Card}(z) - k|. \tag{5.6} \]

The penalized fitness function is

\[ \Phi_p(z) = \Phi(z) - \beta f_p(\Delta_k(z)), \tag{5.7} \]
where $f_p : \mathbb{N} \to \mathbb{R}^+$ is a monotonically increasing function of $\Delta_k(z)$ and $\beta \geq 0$ represents the strength of the penalty. Another option is to repair infeasible individuals when they are generated. Several repair mechanisms can be defined for this purpose. For instance, an individual can be repaired by randomly setting some bits to 0 or 1, as needed, until the cardinality constraint is satisfied (random repair). Another alternative is to use a heuristic to determine which bits must be set to 0 or to 1 (heuristic repair). The results of a greedy optimization or the solutions of a relaxed version of the problem can also be used to achieve this objective [10].

An alternative to binary encoding is to use the set representation introduced for simulated annealing. The use of this representation simplifies the design of crossover and mutation operators that preserve the cardinality of the individuals. The neighborhood operator defined for SA can be used to construct mutated individuals. Since this operator swaps a variable in the set of selected variables with another variable in the complement of this set, the cardinality of the original chromosome is preserved by the mutation. Some crossover operators on sets were introduced in [5]. They are defined taking into account the properties of respect and assortment [11]. Respect ensures that the offspring inherit the common genetic material of the parents. Assortment guarantees that every combination of the alleles of the two parents is possible in the child, provided that these alleles are compatible. When cardinality constraints are considered, it is no longer possible to design crossover operators that guarantee both respect and assortment. A crossover operation that provides a good balance of these properties and ensures that the cardinality of the parents is preserved in the offspring is random assorting recombination (RAR). RAR crossover is described in Algorithm 3. In this algorithm, the integer parameter $w \geq 0$ determines the amount of common information from both parents that is retained by the offspring. For $w = 0$, elements that are present in the chromosomes of both parents are not allowed in the child.
Higher values of w assign more importance to the elements in the intersection of the parents’ sets (chromosomes). In the limit w → ∞, the child contains every element that is in both of the parents’ chromosomes with a probability that approaches 1.
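The RAR crossover of Algorithm 3 can be sketched as follows. The sketch rests on one interpretation that the algorithm statement leaves implicit and that should be read as an assumption: each entry of the bag $G$ carries a tag, so that elements of $A$ and $C$ enter as "include" votes while elements of $B$ and $D$ (the same elements as $C$, tagged negatively) enter as "bar" votes. The function name `rar_crossover` and its signature are hypothetical.

```python
import random


def rar_crossover(parent1, parent2, universe, k, w=1, rng=None):
    """Set-based RAR crossover returning a child with exactly k elements.

    Assumed interpretation: G is a bag of (element, tag) pairs, where
    tag True means "include in the child" and tag False means "bar".
    """
    if rng is None:
        rng = random.Random()
    parent1, parent2, universe = set(parent1), set(parent2), set(universe)
    a = sorted(parent1 & parent2)                 # A: present in both parents
    b = sorted(universe - (parent1 | parent2))    # B: present in neither parent
    c = sorted(parent1 ^ parent2)                 # C (and D): in exactly one parent
    bag = ([(e, True) for e in a] * w + [(e, False) for e in b] * w
           + [(e, True) for e in c] + [(e, False) for e in c])
    rng.shuffle(bag)                              # random order = draws without replacement
    child, barred = set(), set()
    while len(child) < k and bag:
        elem, include = bag.pop()
        if include:
            if elem not in barred:
                child.add(elem)
        else:
            barred.add(elem)
    # If the bag is exhausted first, complete the chromosome at random.
    remaining = sorted(universe - child)
    rng.shuffle(remaining)
    child.update(remaining[:k - len(child)])
    return child
```

With two identical parents and $w \geq 1$, no common element can ever be barred, so the child reproduces the parents exactly. This is the property of respect mentioned in the text.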
5.2.3 Estimation of Distribution Algorithms

Estimation of distribution algorithms (EDAs) are a class of evolutionary methods in which diversity is generated by a probabilistic sampling scheme [12]. Depending on the nature of this sampling scheme, different variants of EDAs can be designed. In this work, we consider the Population Based Incremental Learning (PBIL) algorithm as a representative algorithm of the EDA family [13]. It operates on binary chromosomes of fixed length ($z$) and assumes statistical independence among the genes $\{z_i; i = 1, 2, \ldots, D\}$. In generation $g$, the genotype of the population is characterized by the probability vector $p^{(g)}$, whose $i$th component is the probability of assigning the value 1 to the gene in the $i$th position. The update of the probability distribution using $D_g^{Se}$ (see Algorithm 4) in PBIL is

\[ p^{(g+1)} = \alpha \frac{1}{M} \sum_{m=1}^{M} z^{(g i_m)} + (1 - \alpha)\, p^{(g)}, \tag{5.8} \]
Algorithm 3. Random Assortment Recombination algorithm
Input: Two parents $I_1$ and $I_2$, and a fixed cardinality $k$.
Output: A child chromosome $\Theta$.
• Create auxiliary sets $A, B, C, D, E$:
  – $A$ = elements present in both parents.
  – $B$ = elements not present in any of the parents.
  – $C \equiv D$ elements present in only one parent.
  – $E = \emptyset$.
• Build set $G$ = {$w$ copies of elements from $A$ and $B$, and 1 copy of elements in $C$ and $D$}
• While $|\Theta| < k$ and $G \neq \emptyset$:
  – Extract $g \in G$ without replacement.
  – If $g \in A$ or $g \in C$, and $g \notin E$, $\Theta = \Theta \cup \{g\}$.
  – If $g \in B$ or $g \in D$, $E = E \cup \{g\}$.
• If $|\Theta| < k$, add elements chosen at random from $U - \Theta$ until chromosome is complete.
Algorithm 4. Estimation of distribution algorithm (EDA)
• Initialize the distribution that characterizes the population $P^{(0)}(z)$
• Initialize generation counter $g \leftarrow 0$.
• While convergence criteria are not met
  – Sample population of $P$ individuals using $P^{(g)}(z)$: $D_g = \{z^{(g1)}, \ldots, z^{(gP)}\}$
  – Sort population by non-increasing fitness value: $D_g = \{z^{(g i_1)}, z^{(g i_2)}, \ldots, z^{(g i_P)}\}$, where $i_1, i_2, \ldots, i_P$ is a reordering of the indices $1, 2, \ldots, P$ such that $\Phi(z^{(g i_1)}) \geq \Phi(z^{(g i_2)}) \geq \cdots \geq \Phi(z^{(g i_P)})$
  – Select the first $M \leq P$ individuals from the sorted population: $D_g^{Se} = \{z^{(g i_1)}, z^{(g i_2)}, \ldots, z^{(g i_M)}\}$
  – Estimate the new probability distribution $P^{(g+1)}(z)$ using $D_g^{Se}$
  – Update generation counter $g \leftarrow g + 1$
• Return the best solution found.
where $z^{(g i_m)}$ represents the individual in the $i_m$-th position in generation $g$, and $\alpha \in (0, 1]$ is a smoothing parameter included to avoid strong fluctuations in the estimates of the probability distribution. Individuals are sorted by decreasing fitness values. The Univariate Marginal Distribution Algorithm (UMDA [14]) from the EDA family is recovered when $\alpha = 1$. Even though the encoding is binary, the cardinality constraints can be enforced in the sampling of individuals. Algorithm 5 describes a sampling method that generates individuals of a specified cardinality $k$ from a distribution of bits characterized by the probability vector $p$. The application of this method to sample new individuals guarantees that the algorithm is closed with respect to the cardinality constraint.
Algorithm 5. Sampling individuals of a specified cardinality from $p$.
• Initialize $\hat{p} \leftarrow p$
• Initialize individual $x = 0$.
• For $i = 1, 2, \ldots, k$
  – Generate a random number $u \sim U[0, 1]$
  – Determine the value of $j$ such that $\sum_{n=1}^{j-1} \hat{p}_n < u \leq \sum_{n=1}^{j} \hat{p}_n$.
  – Set $x_j = 1$.
  – Update the value $\hat{p}_j \leftarrow 0$
  – Renormalize $\hat{p}_n \leftarrow \hat{p}_n / \sum_{m=1}^{D} \hat{p}_m$, $n = 1, 2, \ldots, D$, so that $\hat{p}$ can be interpreted as a probability vector, $\sum_{n=1}^{D} \hat{p}_n = 1$.
• Return the generated individual $x$.
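The PBIL update (5.8) and the cardinality-preserving sampling of Algorithm 5 can be sketched together. The function names are hypothetical. The sketch also makes one assumption that the algorithm statement does not spell out: the random number is scaled by the current sum of $\hat{p}$, which is equivalent to normalizing $\hat{p}$ before every draw, including the first one. The vector $p$ is assumed to have at least $k$ strictly positive components.

```python
import random


def sample_with_cardinality(p, k, rng):
    """Draw a binary vector with exactly k ones, guided by the vector p."""
    p_hat = list(p)
    x = [0] * len(p)
    for _ in range(k):
        total = sum(p_hat)
        if total <= 0:
            raise ValueError("fewer than k components have positive probability")
        # Scaling u by the total is equivalent to renormalizing p_hat.
        u = rng.random() * total
        cumulative = 0.0
        chosen = None
        for j, pj in enumerate(p_hat):
            if pj <= 0:
                continue
            cumulative += pj
            chosen = j            # last positive index doubles as round-off fallback
            if u < cumulative:
                break
        x[chosen] = 1
        p_hat[chosen] = 0.0       # this position cannot be selected again
    return x


def pbil_update(p, selected, alpha):
    """Update rule (5.8): blend the mean of the selected individuals with p."""
    m = len(selected)
    return [alpha * sum(z[i] for z in selected) / m + (1.0 - alpha) * p[i]
            for i in range(len(p))]
```

Setting `alpha=1.0` in `pbil_update` discards the previous probability vector, which corresponds to the UMDA limit mentioned above.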
5.3 Benchmark Optimization Problems with Cardinality Constraints

This section introduces a collection of optimization problems with cardinality constraints that are used to illustrate the application of the methods described in the previous section. These include standard combinatorial optimization problems, such as the knapsack problem, and real-world problems that arise in the fields of machine learning (ensemble pruning), quantitative finance (cardinality-constrained portfolio optimization), time series modeling (index tracking by partial replication) and statistical data analysis (sparse principal components analysis). The problems considered are either purely combinatorial or involve the optimization of continuous parameters. In a purely combinatorial optimization problem, $F(z)$ can be directly evaluated once $z$ is known. The knapsack problem and ensemble pruning are of this type. The remaining problems considered (portfolio selection, index tracking and sparse PCA) are hybrid optimization tasks, in which the evaluation of $F(z)$ for a fixed value of the binary vector $z$ requires the solution of an auxiliary continuous optimization problem. While it is possible to address the combinatorial and the continuous optimization problems simultaneously, we concentrate on strategies that handle these aspects separately. Therefore, the outcome of the continuous optimization algorithm is used to guide the combinatorial optimization search, as in (5.4). For the hybrid problems considered, the secondary optimization task can be efficiently solved in an exact manner by quadratic programming. Nonetheless, the scheme can be directly generalized when the evaluation of $F(z)$ requires a more complex programming solution, possibly without guarantee of convergence to the global solution of the surrogate optimization problem. Under these conditions the algorithm used to address the combinatorial part can actually be misled by the suboptimal solutions found in the auxiliary problem.
5.3.1 The Knapsack Problem

Knapsack problems are a family of combinatorial optimization problems that involve selecting a subset from a pool of items [15]. In this work, we consider the 0/1 knapsack problem, which can be shown to be NP-complete [16]. In an instance of the 0/1 knapsack problem, $D$ items are available to fill up a knapsack. A profit $p_i$ and a weight $w_i$ are associated with the $i$-th item, $i = 1, 2, \ldots, D$. The objective is to identify the subset of items whose accumulated profit is maximum and whose overall weight does not exceed a given capacity $W$

\[ \max \sum_{i=1}^{D} p_i z_i \quad \text{s.t.} \quad \sum_{i=1}^{D} w_i z_i \leq W, \quad z_i \in \{0, 1\}, \quad i = 1, 2, \ldots, D. \tag{5.9} \]
Both exact and approximate methods have been used to address the 0/1 knapsack problem. Exact algorithms based on branch-and-bound approaches and dynamic programming are reviewed in [17]. Genetic algorithms [18, 19] and EDAs [12] have also been used to address this problem. Cardinality constraints are generally not considered in the standard 0/1 knapsack problem. Nevertheless, the optimum of the unconstrained problem can be obtained by solving $D$ cardinality-constrained knapsack problems with $\sum_{i=1}^{D} z_i = k$; $k = 1, 2, \ldots, D$. The $k$-th element in this sequence is a knapsack problem with the restriction that only $k$ items can be included in the knapsack. To compare the performance of the different optimization methods analyzed in this work, we use the testing protocol proposed in [20] [18]. Three types of problems, defined in terms of two parameters $v, r \in \mathbb{R}^+$, $v > 1$, are considered:

(1) Uncorrelated: Weights and profits are generated randomly in $[1, v]$.
(2) Weakly correlated: Weights are generated randomly in $[1, v]$ and profits are generated in the interval $[w_i - r, w_i + r]$.
(3) Strongly correlated: Weights are generated randomly in $[1, v]$ and profit $p_i = w_i + r$.

In general, knapsack problems with correlations between weights and profits are more difficult to solve than problems in which the weights and profits are independent. We use $v = 10$, $r = 5$ and a capacity $W = 2v$, which tends to include very few items in the solution. The results reported are averages over 25 realizations of each problem, which are solved using the different approximate methods: SA, a standard GA with linear penalty, a GA using set encoding and the RAR operator ($w = 1$), and PBIL. The conditions under which the search is conducted are determined in exploratory experiments. A geometric annealing schedule $T_k = \gamma T_{k-1}$ with $\gamma = 0.9$ is used in SA. The GAs evolve populations composed of 100 individuals. The probabilities of crossover and mutation are $p_c = 1$ and $p_m = 10^{-2}$, respectively. In PBIL, a population composed of 1000 individuals is used. The probability distribution is updated using 10% of the individuals. The smoothing parameter $\alpha$ is 0.1. Exact results obtained with the solver SYMPHONY from the COIN-OR project [21], which implements a branch-and-cut (B&C) approach [22], are also reported for reference. In the strongly correlated problems it was not possible to find the exact solutions within a reasonable amount of time.

Table 5.1 Results for the 0-1 Knapsack problem with restrictive capacity
                      Algorithm
Corr.    No. Items    GA Lin.           GA RAR            SA                PBIL              B&C (exact)
                      Profit   Time     Profit   Time     Profit   Time     Profit   Time     Profit   Time
none     100           79.36   26.2      82.09    54.5     80.70    98.4     81.89   24.1      82.11    62.0
none     250           90.63   38.1     105.34   134.9    102.91   284.4    104.51   47.4     106.43   178.7
none     500           95.93   57.2     119.88   261.9    118.07   531.7    117.28   91.5     123.93   568.8
weak     100           52.97   26.9      54.38    53.9     53.53   99.87     54.33   24.2      54.43    81.8
weak     250           59.07   38.3      66.24   130.4     65.13   286.4     65.85   47.7      67.10   180.3
weak     500           60.40   56.1      74.17   266.1     73.40   531.9     72.05   87.9      76.61   560.1
strong   100           76.19   26.2      79.77    57.8     79.73    98.7     78.99   24.0       −       −
strong   250           83.98   37.8      94.20   139.5     94.15   286.0     92.39   47.3       −       −
strong   500           84.52   55.3     101.40   272.2    102.16   525.6     96.60   86.9       −       −
Table 5.1 displays the average profit obtained and the time (in seconds) to reach a solution for each method. The experiments were performed on an AMD Turion computer with 1.79 GHz processor speed and 1 GB RAM. None of the approximate methods reaches the optimal profit, which is calculated using an exact branch-and-cut method. The highest profit obtained by an approximate optimization is highlighted in boldface. In all cases, the algorithms that use a set encoding (GA with RAR crossover and SA) exhibit the best performance. They also require longer times to reach a solution, especially SA. PBIL obtains good results only in small uncorrelated knapsack problems. This is explained by the fact that the sampling and estimation of probability distributions becomes progressively more difficult as the dimensionality of the problem increases. Furthermore, PBIL assumes statistical independence between the variables, which makes the algorithm perform worse on problems in which correlations are present. The standard GA with linear penalty has a very poor performance in all the knapsack problems analyzed.
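The testing protocol and the decomposition of the knapsack problem into $D$ cardinality-constrained subproblems can be illustrated on a toy scale. The sketch below is not the experimental code: the function names are hypothetical, weights and profits are assumed to be drawn uniformly as real numbers (the text only says "randomly"), and each subproblem is solved by exhaustive enumeration, which is viable only for very small $D$.

```python
import itertools
import random


def make_instance(d, kind, v=10.0, r=5.0, seed=0):
    """Random 0/1 knapsack instance; returns (profits, weights, capacity)."""
    rng = random.Random(seed)
    weights = [rng.uniform(1.0, v) for _ in range(d)]
    if kind == "uncorrelated":
        profits = [rng.uniform(1.0, v) for _ in range(d)]
    elif kind == "weak":
        profits = [rng.uniform(w - r, w + r) for w in weights]
    elif kind == "strong":
        profits = [w + r for w in weights]
    else:
        raise ValueError(f"unknown instance type: {kind}")
    return profits, weights, 2.0 * v          # restrictive capacity W = 2v


def best_of_cardinality(profits, weights, capacity, k):
    """Exhaustive search over subsets with exactly k items; None if none fits."""
    best = None
    for subset in itertools.combinations(range(len(profits)), k):
        if sum(weights[i] for i in subset) <= capacity:
            profit = sum(profits[i] for i in subset)
            if best is None or profit > best[0]:
                best = (profit, subset)
    return best


def solve_by_cardinality(profits, weights, capacity):
    """Best solution over the D cardinality-constrained problems, k = 1, ..., D."""
    candidates = [best_of_cardinality(profits, weights, capacity, k)
                  for k in range(1, len(profits) + 1)]
    return max((c for c in candidates if c is not None), default=None)
```

In the approximate methods discussed in this section, the exhaustive inner search would be replaced by SA, a GA or PBIL operating on subsets of fixed size $k$.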
5.3.2 Ensemble Pruning

Consider the problem of automatic induction of classifiers from a collection of instances $\{(x_n, y_n); n = 1, 2, \ldots, N\}$, where $y_n$ is the class label of the example characterized by the vector of attributes $x_n$. The goal is to induce from these data an autonomous system that accurately predicts the class label on the basis of the vector of attributes of a previously unseen instance. There are a number of algorithms that can be used for learning different types of classifiers: decision trees, neural networks, support vector machines, etc. In practice, one of the most successful paradigms is ensemble learning [23]. Ensembles are composed of a diverse collection of classifiers that are generated from the same training data by introducing variations in the algorithm used for induction or in the conditions under which learning takes place. The outputs of the individual classifiers are then combined (for instance, by majority voting) to produce the prediction of the ensemble. Pooling the decisions of the ensemble members has the potential of improving on the generalization capacity of a single learner. However, ensembles are costly to generate and have large storage requirements. Furthermore, the time required to classify an unlabeled instance increases linearly with the size of the ensemble.

Recent work has shown that the storage requirements and classification times can be significantly reduced by selecting a subset of classifiers whose generalization capacity is equivalent and sometimes superior to that of the original complete ensemble. This process is known as ensemble pruning [24], selection [25], or thinning [26]. Ensemble pruning has been a subject of great interest in the recent literature on machine learning (see refs. in [27]). Most studies focus on the definition of appropriate quantities that can be optimized on the training set to obtain pruned ensembles with good generalization performance.
The individual properties of classifiers are not useful to guide the selection process. The generalization capacity of the pruned ensemble crucially depends on the complementarity of the classifiers that are part of it. The search in the space of subensembles is usually greedy. Notable exceptions are [28, 29, 30], where genetic algorithms, generally with real-valued chromosomes, are used. Another exception is [31], where ensemble pruning is formulated as a quadratic integer programming problem. Consider an ensemble composed of $D$ classifiers and a set of labeled instances. Define a matrix $G$, whose element $G_{ij}$ is the number of common errors between classifier $i$ and classifier $j$, where $i, j = 1, 2, \ldots, D$. The value of the diagonal term $G_{ii}$ is the number of errors made by classifier $i$. The matrix is then symmetrized and its elements normalized so that they are on the same scale

\[ \tilde{G}_{ii} = \frac{G_{ii}}{N}, \qquad \tilde{G}_{ij,\, i \neq j} = \frac{1}{2} \left( \frac{G_{ij}}{G_{ii}} + \frac{G_{ji}}{G_{jj}} \right). \tag{5.10} \]

Intuitively, $\sum_i \tilde{G}_{ii}$ measures the overall strength of the ensemble classifiers and $\sum_{i,j;\, i \neq j} \tilde{G}_{ij}$ measures their diversity. The subensemble selection problem of size $k$ can now be formulated as a quadratic integer programming problem

\[ \arg\min_{z} \; z^T \cdot \tilde{G} \cdot z, \quad \text{s.t.} \quad \sum_{i=1}^{D} z_i = k, \quad z_i \in \{0, 1\}. \tag{5.11} \]
The binary variable $z_i$ indicates whether classifier $i$ should be selected. The size of the pruned ensemble, $k$, is specified beforehand. The selection process is a combinatorial optimization problem whose exact solution requires the evaluation of the performance of the exponentially large number, $\binom{D}{k}$, of subensembles of size $k$ that can be extracted from an ensemble of size $D$. In [31] the solution is approximated in polynomial time by applying semi-definite programming (SDP) to a convex relaxation of the original problem.

To investigate the performance in the ensemble pruning problem of the optimization methods described in Section 5.2, we generate bagging ensembles for five representative benchmark problems from the UCI repository: heart, pima, satellite, waveform and wdbc (Breast Cancer Wisconsin) [32]. The individual classifiers in the ensemble are trained on different bootstrap samples of the original data [33]. If the classifiers used as base learners are unstable, the fluctuations in the bootstrap sample lead to the induction of different predictors. Assuming that the errors of these classifiers are uncorrelated, pooling their decisions by majority voting should improve the accuracy of the predictions. In the experiments performed, bagging ensembles of 101 CART trees are built [34]. The original ensemble is pruned to $k = 21$ decision trees. The strength-diversity measure $G$, the time consumed in seconds and the number of evaluations are averaged over 5 ten-fold cross-validations for heart, pima, satellite and wdbc, and over 50 independent partitions for waveform. The success rate is the average over 50 repetitions of the optimization for a given partition of the data into training and testing sets. The parameters for the metaheuristic optimization methods are determined in exploratory experiments using the results of SDP as a gauge. For the GAs, populations with 100 individuals are evolved using a steady state generational substitution scheme.
The crossover probability is set to 1. The mutation probability is $10^{-2}$ for GAs with binary representation and $10^{-3}$ for GAs with set representation. The strength of the penalty term in the GA with linear penalties is $\beta = 400$. If the best individual of the final population does not satisfy the cardinality constraint, a greedy search is performed to fulfill the restriction. The value $w = 1$ is used in RAR-GA. A geometric annealing schedule with $\gamma = 0.9$ is used in SA. In these experiments, the best solution in 10 independent executions of the SA algorithm is chosen. For PBIL, a population of 1000 individuals is generated, where 10% of the individuals are used to update the probability distribution. The smoothing constant is set to $\alpha = 0.1$. The results of the ensemble pruning experiments performed are summarized in Table 5.2. Most of the optimization methods analyzed reach similar solutions in all the classification problems considered, with the exception of the standard GA with linear penalty, which obtains the worst values of the objective function. In terms of this quantity, the best overall results correspond to SA and SDP. In terms of efficiency, SDP should be preferred.

Table 5.2 Results for the GA, SA and EDA approaches in the ensemble pruning problem

Algorithm              Problem      Best G      Success rate   Time (s)   Test error
SA                     heart        156.2940    1.00            7.575     18.06
                       pima         234.8931    0.98           14.644     23.99
                       satellite    185.1163    1.00           63.490     13.15
                       waveform     105.2465    1.00           37.621     19.64
                       wdbc         121.9183    0.88           34.085      4.50
GA Linear Penalty      heart        157.8668    0.98            2.017     17.96
                       pima         235.0572    0.92            2.072     23.86
                       satellite    186.0706    1.00            0.870     12.85
                       waveform     105.8294    0.02            5.179     19.70
                       wdbc         122.7625    0.06            4.963      4.45
GA Heuristic Repair    heart        156.3127    1.00            0.851     17.87
                       pima         234.8931    1.00            1.665     23.99
                       satellite    185.1163    1.00            0.520     13.15
                       waveform     105.2465    0.80            8.168     19.64
                       wdbc         121.9183    0.90            7.875      4.50
GA RAR (w = 1)         heart        156.3860    1.00            0.697     17.96
                       pima         234.9190    1.00            1.381     24.09
                       satellite    185.1163    1.00            0.910     13.15
                       waveform     105.2510    0.48            6.880     19.62
                       wdbc         121.9399    0.40            6.449      4.50
PBIL                   heart        156.4111    1.00           38.409     17.69
                       pima         234.9358    0.96           38.426     24.05
                       satellite    185.1163    1.00           16.400     13.15
                       waveform     105.2663    0.36           16.086     19.67
                       wdbc         122.0467    0.34           38.392      4.34
SDP                    heart        156.3034    1.00            1.137     18.15
                       pima         234.8956    1.00            1.159     24.09
                       satellite    185.1163    1.00            1.230     13.15
                       waveform     104.9984    0.90            1.230     19.60
                       wdbc         121.9143    0.90            1.117      4.39

In machine learning, the relevant measure of performance is the generalization capacity of the classifiers generated. The test error displayed in the last column of the table provides an estimate of the error rate in examples that have not been used to train the classifiers. Lower test errors indicate better generalization capacity. According to this measure the ranking of methods is rather different: classifiers that were optimal according to the objective function are suboptimal in terms of their generalization capacity. This indicates that the learning process is affected by overfitting, because the objective function is estimated on the training data. Nevertheless, the generalization performance of the pruned ensembles is very similar for all the optimization methods considered. Table 5.3 shows the test error of a single CART tree, of a complete bagging ensemble and the range of values of the test error obtained by pruned bagging ensembles of size $k = 21$. In all the classification problems considered, pruned ensembles have a lower test error than CART and complete bagging.

Table 5.3 Test errors for CART, standard bagging and pruned bagging

Problem      CART     Bagging   Pruned bagging
heart        23.63    21.48     [17.69, 18.15]
pima         24.84    24.67     [23.86, 24.09]
satellite    13.80    14.25     [12.85, 13.15]
waveform     30.27    22.53     [19.62, 19.67]
wdbc          7.28     5.68     [4.34, 4.50]
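As a small illustration of the strength-diversity measure, the normalized matrix of (5.10) and the objective of (5.11) can be computed directly from the error indicators of the ensemble members. The representation of the input (a list of 0/1 error indicators per classifier) and the function names are assumptions made for this sketch. Every classifier is assumed to make at least one error, since otherwise the off-diagonal normalization divides by zero.

```python
def normalized_error_matrix(errors):
    """Build the normalized matrix of (5.10).

    errors[i][n] is 1 if classifier i misclassifies instance n, else 0
    (assumed input format).
    """
    d, n = len(errors), len(errors[0])
    # G[i][j]: number of instances misclassified by both i and j.
    g = [[sum(ei * ej for ei, ej in zip(errors[i], errors[j]))
          for j in range(d)] for i in range(d)]
    g_tilde = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            if i == j:
                g_tilde[i][i] = g[i][i] / n
            else:
                g_tilde[i][j] = 0.5 * (g[i][j] / g[i][i] + g[j][i] / g[j][j])
    return g_tilde


def pruning_objective(g_tilde, subset):
    """Quadratic form z^T G~ z of (5.11), with z the indicator of `subset`."""
    return sum(g_tilde[i][j] for i in subset for j in subset)
```

For three classifiers with error patterns `[1, 1, 0, 0]`, `[1, 0, 1, 0]` and `[0, 0, 1, 1]`, the pair formed by the first and third classifiers shares no errors and therefore attains a lower objective value than the pair formed by the first and second.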
5.3.3 Portfolio Optimization with Cardinality Constraints

The selection of optimal investment portfolios is a problem of great interest in the area of quantitative finance and has attracted much attention in the scientific community (see refs. in [10]). It is a multiobjective optimization task with two opposed goals: the maximization of profit and the minimization of risk. Several methods have been proposed to address this problem, mostly within the classical mean-variance model developed by H. Markowitz [35]. In this framework, the returns of the assets considered for investment are modeled as white noise. Profit is quantified in terms of the expected return of the portfolio. The variance of the portfolio returns is used as a measure of risk. In its simplest version, the problem can be solved by quadratic programming [1]. However, if cardinality constraints are included, the problem becomes a mixed-integer quadratic problem, which can be shown to be NP-complete [10]

\[
\begin{aligned}
\min_{z} \quad & w_{[z]}^T \cdot \Sigma_{[z,z]} \cdot w_{[z]} && (5.12) \\
\text{s.t.} \quad & w_{[z]}^T \cdot \bar{r}_{[z]} = R^* && (5.13) \\
& a_{[z]} \leq w_{[z]} \leq b_{[z]}, \quad a_{[z]} \geq 0, \quad b_{[z]} \geq 0 && (5.14) \\
& l \leq A_{[z]} \cdot w_{[z]} \leq u && (5.15) \\
& z^T \cdot 1 \leq K && (5.16) \\
& w^T \cdot 1 = 1, \quad w \geq 0. && (5.17)
\end{aligned}
\]
The inputs of the algorithm are $\bar{r}$, the vector of expected asset returns, and $\Sigma$, the covariance matrix of the asset returns. The goal is to determine the optimal weights of the assets in the portfolio; i.e., the value of $w$ that minimizes the variance of the portfolio returns (5.12), for a given value of the expected return of the portfolio, $R^*$ (5.13). The elements of the binary vector $z$ specify whether asset $i$ is included in the final portfolio ($z_i = 1$) or not ($z_i = 0$). Column vectors $x_{[z]}$ are obtained by removing from the corresponding vector $x$ those components $i$ for which $z_i = 0$. Similarly, the matrix $A_{[z]}$ is obtained by eliminating the $i$-th column of $A$ whenever $z_i = 0$. Finally, $\Sigma_{[z,z]}$ is obtained by removing from $\Sigma$ the rows and columns for which the corresponding indicator is zero ($z_i = 0$). The symbols 0 and 1 denote vectors of the appropriate size whose entries are all equal to 0 or to 1, respectively. Minimum and maximum investment constraints, which set a lower and an upper bound on the investment of each asset in the portfolio, are captured by (5.14). Vectors $a$ and $b$ are $D \times 1$ column vectors with the lower and upper bounds on the portfolio weights, respectively. Inequality (5.15) summarizes the $M$ concentration of capital constraints. The $m$-th row of the $M \times D$ matrix $A$ is the vector of coefficients of the linear combination that defines the constraint. The $M \times 1$ column vectors $l$ and $u$ correspond to the lower and upper bounds of the $M$ linear restrictions, respectively. Concentration of capital constraints can be used, for instance, to control the amount of capital invested in a group of assets, so that investor preferences or limits for investment in certain asset classes can be formally expressed. Since these constraints are linear, they do not increase the difficulty of the problem, which can still be solved efficiently by quadratic programming. Expression (5.16) corresponds to the cardinality constraint, which limits the number of assets that can be included in the final portfolio. Finally, equation (5.17) ensures that all the capital is invested in the portfolio.

The cardinality-constrained problem is difficult to solve by standard optimization techniques. Branch-and-bound methods can be used to find exact solutions [36]. Despite the improvements in efficiency, the complexity of the search is still exponential. Genetic algorithms have also been used to address this problem: in [37], the performance of GAs is compared to SA and to tabu search (TS) [38].
According to this investigation, the best-performing portfolios are obtained by pooling the results of the different heuristics. In [39] SA is used to search directly in the space of real-valued asset weights. Tabu search is employed in [40]. This work focuses on the design of appropriate neighborhood operators to improve the efficiency of the search. In [7, 41] Multi-Objective Evolutionary Algorithms (MOEAs) are used to address the problem. These algorithms employ a hybrid encoding instead of a pure continuous one and heuristic repair mechanisms to handle infeasible individuals. The impact of local search improvements is also investigated in this work. The authors conclude that the hybrid encoding improves the overall performance of the algorithm. In the experiments carried out in this investigation, we address the problem of optimal portfolio selection with lower bounds and cardinality constraints. The parameters of the constraints considered are l_i = 0.1, u_i = 0.1, i = 1, ..., D and K = 10. The performance of the different optimization methods is compared by calculating the efficient frontier for the problem with and without these constraints. Points on the efficient frontier correspond to minimum-risk portfolios for a given expected return, or, alternatively, to portfolios that have the largest expected return from a family of portfolios with equal risk. As a measure of the quality of the solution obtained, the average relative distance to the unconstrained efficient frontier (without cardinality and lower bound constraints) is calculated

\[ D = \frac{1}{N_F} \sum_{i=1}^{N_F} \frac{\sigma_i^c - \sigma_i^*}{\sigma_i^*} \tag{5.18} \]
where N_F = 100 is the number of frontier points considered, σ_i^c is the solution of the constrained problem at the i-th point of the frontier, and σ_i^* is the solution of the corresponding unconstrained problem.
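The quality measure (5.18) is simple to evaluate once both frontiers are available. The following is a minimal sketch, not the authors' implementation: the function name and the three-point frontiers are hypothetical and serve only to illustrate the arithmetic.

```python
def average_relative_distance(sigma_constrained, sigma_unconstrained):
    """Average relative distance D of (5.18) between two efficient frontiers.

    Both arguments list the risk at the same frontier points, in the same order.
    """
    if len(sigma_constrained) != len(sigma_unconstrained):
        raise ValueError("both frontiers must have the same number of points")
    n_points = len(sigma_unconstrained)
    total = 0.0
    for s_c, s_star in zip(sigma_constrained, sigma_unconstrained):
        total += (s_c - s_star) / s_star  # relative excess risk at this point
    return total / n_points


# Hypothetical risk values at three frontier points (illustrative only).
unconstrained = [0.010, 0.020, 0.040]
constrained = [0.011, 0.020, 0.042]
print(average_relative_distance(constrained, unconstrained))
```

Because the constrained problem is a restriction of the unconstrained one, each term in the sum is non-negative when both problems are solved exactly, so smaller values of D indicate better solutions.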
Table 5.4 Results for the GA, SA and EDA approaches in the portfolio selection problem
Algorithm              Index      Best D      Success rate  Time (s)  Optimizations
SA                     Hang Seng  0.00321150  1.00          1499.9    3.87 · 10^7
                       DAX        2.53162860  0.98          2877.3    7.63 · 10^7
                       FTSE       1.92205745  0.92          3610.4    8.87 · 10^7
                       S&P        4.69373181  0.91          3567.8    9.54 · 10^7
                       Nikkei     0.20197748  0.95          4274.5    9.25 · 10^7
GA Linear penalty      Hang Seng  0.00327011  0.86          750.9     1.36 · 10^7
                       DAX        2.53314271  0.69          2999.0    4.60 · 10^7
                       FTSE       1.93255870  0.51          3539.3    5.76 · 10^7
                       S&P        4.69373181  0.76          4636.8    7.03 · 10^7
                       Nikkei     0.22992173  0.42          4811.7    6.47 · 10^7
GA Heuristic Repair    Hang Seng  0.00321150  1.00          1122.9    2.18 · 10^7
                       DAX        2.53162860  1.00          4730.6    7.45 · 10^7
                       FTSE       1.92150019  0.94          6301.4    9.70 · 10^7
                       S&P        4.69373181  1.00          7860.6    11.42 · 10^7
                       Nikkei     0.20197748  0.99          10191.2   11.47 · 10^7
GA RAR (w = 1)         Hang Seng  0.00321150  1.00          1200.8    2.77 · 10^7
                       DAX        2.53162860  1.00          3178.5    6.14 · 10^7
                       FTSE       1.92150019  0.95          6384.6    12.02 · 10^7
                       S&P        4.69373181  0.99          6575.6    12.34 · 10^7
                       Nikkei     0.20197748  1.00          9893.3    14.17 · 10^7
PBIL                   Hang Seng  0.00321150  1.00          2292.8    5.55 · 10^7
                       DAX        2.53162860  0.94          4489.1    7.70 · 10^7
                       FTSE       1.92208910  0.85          4782.3    8.06 · 10^7
                       S&P        4.69570006  0.88          5100.2    8.28 · 10^7
                       Nikkei     0.30164777  0.43          7486.5    8.21 · 10^7
The expected returns and the covariance matrix of the components of five major world markets included in the OR-Library [42] are used as inputs for the optimization: Hang Seng (Hong Kong, 31 assets), DAX (Germany, 85 assets), FTSE (UK, 89 assets), Standard and Poor’s (U.S.A., 98 assets) and Nikkei (Japan, 225 assets). The methods compared are SA, standard GA with linear penalty, standard GA with heuristic repair, GA with a set representation and RAR (w = 1) crossover, and PBIL. The SA heuristic is used with a geometric annealing scheme with constant γ = 0.9. Populations of 100 individuals are used for the GAs. The mutation and crossover probabilities are p_m = 10^-2 and p_c = 1, respectively. PBIL samples populations
of 400 individuals, 10% of which are used to update the probability distribution. The heuristic repair scheme performs an optimization without the cardinality constraint, and then either includes in the chromosome those assets with the highest weights or eliminates the assets with the smallest weights in the unconstrained solution, as needed. Table 5.4 summarizes the results of the experiments. The value of D (5.18) displayed in the third column is the best out of 5 executions of each of the methods considered. The proportion of attempts in which the corresponding optimization algorithm obtains the best known solution is given in the column labeled success rate. The last two columns report the time employed (in seconds) and the number of quadratic optimizations performed, respectively. In terms of the quality of the solutions obtained, using a binary encoding with linear penalties performs worse than all the other approximate methods. By contrast, the heuristic repair scheme identifies the best of the known solutions in all the problems investigated. GA with a set representation and RAR (w = 1) crossover also performs very well and is slightly more efficient on average. High-quality solutions are also obtained by SA, albeit at higher computational cost. PBIL performs well only in problems in which the number of assets considered for investment is small. As the dimensionality of the problem increases, sampling and estimation of the probability distribution in algorithms of the EDA family become less effective.
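The heuristic repair step described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `repair` is hypothetical, the weights of the solution without the cardinality constraint are assumed to be already available as `reference_weights`, and the chromosome is repaired to exactly K selected assets.

```python
def repair(z, reference_weights, k):
    """Return a copy of the 0/1 chromosome z with exactly k selected assets.

    reference_weights[i] is the weight of asset i in the solution computed
    without the cardinality constraint (assumed to be given).
    """
    z = list(z)
    selected = [i for i, bit in enumerate(z) if bit == 1]
    unselected = [i for i, bit in enumerate(z) if bit == 0]
    if len(selected) > k:
        # Too many assets: drop the selected ones with the smallest weights.
        selected.sort(key=lambda i: reference_weights[i])
        for i in selected[: len(selected) - k]:
            z[i] = 0
    elif len(selected) < k:
        # Too few assets: add the unselected ones with the largest weights.
        unselected.sort(key=lambda i: reference_weights[i], reverse=True)
        for i in unselected[: k - len(selected)]:
            z[i] = 1
    return z


# Hypothetical six-asset example with K = 3.
weights = [0.30, 0.05, 0.25, 0.10, 0.20, 0.10]
print(repair([1, 1, 1, 1, 0, 0], weights, 3))  # one asset too many
print(repair([0, 1, 0, 0, 0, 0], weights, 3))  # two assets missing
```

After the repair, the continuous weights of the selected assets would still have to be re-optimized by quadratic programming, which this sketch does not do.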
5.3.4 Index Tracking by Partial Replication

Index tracking is a passive investment strategy whose goal is to match the performance of a reference financial index. The problem can be solved exactly by investing in each asset an amount of capital that is proportional to the corresponding weight in the index. In practice, this strategy has the drawback of incurring high initial transaction costs. Furthermore, there is an overhead in managing a portfolio that invests in every constituent of the index. In particular, rebalancing the portfolio can be costly if the composition of the index is revised. An alternative is to create a tracking portfolio that invests only in a reduced set of assets. This partial replication strategy will in general be unable to perfectly reproduce the behavior of the index. However, a portfolio that invests in a fixed number of assets and closely follows the evolution of the index can be obtained by minimizing the tracking error

\[
\begin{aligned}
\min_{w,z}\quad & \frac{1}{T} \sum_{t=1}^{T} \Bigl( \sum_{j=1}^{D} w_j r_j(t) - r_t \Bigr)^2 && (5.19)\\
\text{s.t.}\quad & \sum_{i=1}^{D} w_i = 1, && (5.20)\\
& l \le A \cdot w \le u, && (5.21)\\
& z_i \in \{0, 1\},\quad a_i z_i \le w_i \le b_i z_i,\quad a_i \ge 0,\ b_i \ge 0,\quad i = 1, 2, \ldots, D, && (5.22)\\
& \sum_{i=1}^{D} z_i \le K, && (5.23)
\end{aligned}
\]
where T is the length of the time series considered, D is the number of constituents of the index, r_j(t) is the return of asset j at time t and r_t is the return of the index at time t. Restriction (5.20) is a budget constraint, which ensures that all the capital is invested in the portfolio. Investment concentration constraints are captured by (5.21). Expression (5.22) reflects lower and upper bound constraints. The binary variables {z_1, z_2, ..., z_D} indicate whether an asset is included in or excluded from the tracking portfolio. Note that when z_i = 0, the lower and upper bounds for the weight of asset i are both equal to zero, which effectively excludes this asset from the investment. The cardinality constraint is expressed by Eq. (5.23). Index tracking has been extensively investigated in the literature. The hybrid GA with set encoding and RAR crossover described in Section 5.2 is used in [4]. Instead of the tracking error, this work minimizes the variance of the difference between the returns of the index and of the tracking portfolio. Optimal impulse control techniques are used in [43]. In [44] the problem is solved by using the threshold accepting (TA) heuristic, which is a deterministic analogue of simulated annealing, in which transitions are rejected only when they lead to a deterioration in performance that is above a specified threshold. Evolutionary algorithms with real-valued chromosome representations are used in [45]. This investigation focuses on the influence of transaction costs and portfolio rebalancing. In [46] the portfolio optimization and index tracking problems are addressed by means of a heuristic relaxation method that consists in solving a small number of convex optimization problems with fixed transaction costs. Hybrid optimization approaches to index tracking by partial replication are also investigated in [47, 48, 49].
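The role of the binary variables in constraints (5.20) to (5.23) can be made concrete with a small feasibility check. The sketch below is only an illustration: the function name `is_feasible`, the tolerance `tol` and the three-asset example are assumptions, not part of the original formulation.

```python
def is_feasible(w, z, a, b, k, A=(), l=(), u=(), tol=1e-9):
    """Check a candidate (w, z) against constraints (5.20)-(5.23)."""
    # (5.20) budget constraint: all the capital is invested.
    if abs(sum(w) - 1.0) > tol:
        return False
    # (5.21) concentration constraints: l <= A w <= u, one row per constraint.
    for row, lower, upper in zip(A, l, u):
        value = sum(c * wi for c, wi in zip(row, w))
        if value < lower - tol or value > upper + tol:
            return False
    # (5.22) bounds: a_i z_i <= w_i <= b_i z_i, so z_i = 0 forces w_i = 0.
    for wi, zi, ai, bi in zip(w, z, a, b):
        if zi not in (0, 1):
            return False
        if wi < ai * zi - tol or wi > bi * zi + tol:
            return False
    # (5.23) cardinality constraint.
    return sum(z) <= k


# Hypothetical three-asset example with K = 2.
a, b = [0.1] * 3, [1.0] * 3
print(is_feasible([0.5, 0.5, 0.0], [1, 1, 0], a, b, 2))
print(is_feasible([0.5, 0.45, 0.05], [1, 1, 0], a, b, 2))
```

The second call fails because asset 3 carries weight although z_3 = 0, which is exactly the situation that (5.22) rules out.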
In the current investigation, publicly available benchmark data from the OR-Library [42] is used to compare the optimization techniques described in Section 5.2. Five major world market indices are used in the experiments: Hang Seng, DAX, FTSE, S&P and Nikkei. For each index, the time series of 290 weekly returns for the index and for its constituents are given. From these data, the first 145 values are used to create a tracking portfolio that includes a maximum of K = 10 assets. The last 145 values are used to measure the out-of-sample tracking error. The population sizes are 350 for the GAs and 1000 for PBIL. The values of the remaining parameters coincide with those used in the portfolio selection problem. Table 5.5 presents a summary of the experiments performed. The best out of 5 executions of each optimization method is reported. GA with random repair obtains the best overall results. GA with set encoding and RAR (w = 1) crossover matches these results except in Nikkei, which is the index with the largest number of constituents. PBIL also performs well, but its computational cost is higher than that of the other algorithms. In fact, the algorithm reached the maximum number of optimizations established without converging. The results of SA and GA with binary encoding and linear penalty are suboptimal in all but the simplest problems. They also exhibit low success rates. In all problems investigated, the out-of-sample error is typically larger than the in-sample error, but of the same order of magnitude.
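The in-sample and out-of-sample errors reported in Table 5.5 correspond to evaluating the objective (5.19) on the two halves of the time series. A minimal sketch of that evaluation follows; the function names and the toy four-period data are hypothetical, and the weights are assumed to have been obtained beforehand on the in-sample part.

```python
def tracking_error(weights, asset_returns, index_returns):
    """Mean squared tracking error (5.19).

    asset_returns[t][j] is the return of asset j at time t,
    index_returns[t] is the return of the index at time t.
    """
    total = 0.0
    for r_assets, r_index in zip(asset_returns, index_returns):
        portfolio_return = sum(w * r for w, r in zip(weights, r_assets))
        total += (portfolio_return - r_index) ** 2
    return total / len(index_returns)


def in_and_out_of_sample_error(weights, asset_returns, index_returns, n_in_sample):
    """Evaluate fixed weights on the first n_in_sample periods and on the rest."""
    in_sample = tracking_error(
        weights, asset_returns[:n_in_sample], index_returns[:n_in_sample]
    )
    out_of_sample = tracking_error(
        weights, asset_returns[n_in_sample:], index_returns[n_in_sample:]
    )
    return in_sample, out_of_sample


# Toy data: 4 periods, 2 assets (the experiments in the text use 145 + 145 weeks).
asset_returns = [[0.02, 0.00], [0.00, 0.02], [0.04, 0.00], [-0.02, 0.02]]
index_returns = [0.01, 0.01, 0.01, 0.00]
print(in_and_out_of_sample_error([0.5, 0.5], asset_returns, index_returns, 2))
```

Assets excluded from the tracking portfolio simply enter with weight zero, so the same function serves for any choice of z.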
Table 5.5 Results for the GA, SA and EDA approaches in the index tracking problem
Algorithm           Index      Best MSE In-Sample  MSE Out-of-Sample  Success rate  Time (s)  Number opts.
SA                  Hang Seng  1.3462 · 10^-5      2.0575 · 10^-5     0.40          1.12      19342
                    DAX        8.0837 · 10^-6      7.4824 · 10^-5     0.40          1.73      27101
                    FTSE       2.3951 · 10^-5      7.0007 · 10^-5     0.20          1.44      1.43
                    S&P        1.6781 · 10^-5      4.7347 · 10^-5     0.20          1.97      29764
                    Nikkei     2.1974 · 10^-5      1.0719 · 10^-4     0.20          95.00     1476549
GA Linear Penalty   Hang Seng  1.3462 · 10^-5      2.0575 · 10^-5     0.60          4.15      51509
                    DAX        8.0837 · 10^-6      7.4824 · 10^-5     0.20          13.69     144868
                    FTSE       2.7345 · 10^-5      5.3148 · 10^-5     0.20          17.66     158465
                    S&P        1.7974 · 10^-5      5.2898 · 10^-5     0.20          36.89     311008
                    Nikkei     2.0061 · 10^-5      1.0707 · 10^-4     0.20          123.28    1015774
GA Random Repair    Hang Seng  1.3462 · 10^-5      2.0575 · 10^-5     1.00          5.92      81690
                    DAX        8.0837 · 10^-6      7.4824 · 10^-5     1.00          18.89     231840
                    FTSE       2.1836 · 10^-5      8.0091 · 10^-5     0.40          21.20     255820
                    S&P        1.6573 · 10^-5      5.5457 · 10^-5     0.20          47.02     508313
                    Nikkei     1.8255 · 10^-5      6.9574 · 10^-5     0.20          170.62    1664696
GA RAR (w = 1)      Hang Seng  1.3462 · 10^-5      2.0575 · 10^-5     1.00          4.67      51513
                    DAX        8.0837 · 10^-6      7.4824 · 10^-5     1.00          14.17     124717
                    FTSE       2.1836 · 10^-5      8.0091 · 10^-5     0.40          18.83     156456
                    S&P        1.6573 · 10^-5      5.5457 · 10^-5     0.20          42.31     311002
                    Nikkei     1.8917 · 10^-5      8.1057 · 10^-5     0.20          175.34    1015766
PBIL                Hang Seng  1.3462 · 10^-5      2.0575 · 10^-5     1.00          167.04    2010000
                    DAX        8.0837 · 10^-6      7.4824 · 10^-5     1.00          199.28    2010000
                    FTSE       2.1836 · 10^-5      8.0091 · 10^-5     1.00          195.31    2010000
                    S&P        1.6781 · 10^-5      4.7347 · 10^-5     0.60          314.77    2010000
                    Nikkei     1.9510 · 10^-5      7.4572 · 10^-5     0.20          222.86    2010000
5.3.5 Sparse Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction technique that is frequently used in data analysis, data compression and data visualization. The goal is to identify the directions along which the multidimensional data have the largest variance. The principal components can be obtained by maximizing the variance of normalized linear combinations of the original variables. Typically they have a nonzero projection on all the original coordinates, which can make their interpretation difficult. The goal of sparse PCA is to find principal components that have nonzero loadings in only a small number of the original directions, while at the same time explaining most of the variance. The first sparse principal component can be obtained by solving the cardinality-constrained optimization problem
\[
\begin{aligned}
\max_{w,z}\quad & w[z]^T \cdot \Sigma[z,z] \cdot w[z] && (5.24)\\
\text{s.t.}\quad & \lVert w[z] \rVert_2 = 1, && (5.25)\\
& z^T \cdot 1 \le K, && (5.26)
\end{aligned}
\]
where Σ is the data covariance matrix. As in the previous problems, the elements of the binary vector z encode whether the principal component has a non-zero projection along the corresponding direction. Once the first principal component has been found, if more principal components are to be calculated, the covariance matrix Σ is deflated as follows

\[ \Sigma' = \Sigma - \left( w^T \cdot \Sigma \cdot w \right) w\, w^T \tag{5.27} \]

and a new problem of the form given by (5.24), defined now in terms of this deflated covariance matrix, is solved. The decomposition stops after a maximum of Rank(Σ) iterations. In practice, the number of principal components is either specified beforehand or determined by the percentage of the total variance of the data explained. The problem of finding sparse principal components has also received a fair amount of attention in the recent literature. Greedy search is used in [50]. In [51] SPCA is formulated as a regression problem, so that LASSO techniques [52] can be used to favor sparse solutions. In LASSO, an L1-norm penalty for non-zero values of the factor loadings is used. A higher weight of the penalty term in the objective function induces models that are sparser. However, it is not possible to directly control the number of non-zero coefficients in the solution. The cardinality constraint is explicitly considered in [53], which uses a method based on solving a relaxation of the problem by semidefinite programming (SDP). To compare the performance of the different methods analyzed, we use the benchmark problem introduced in [54]. Consider the sparse vector v, whose components are

\[
v_i =
\begin{cases}
1, & \text{if } i \le 50\\
1/(i - 50), & \text{if } 50 < i \le 100\\
0, & \text{otherwise}
\end{cases}
\tag{5.28}
\]

A covariance matrix is built from this vector and U, a square matrix of dimensions 150 × 150 whose elements are U[0, 1] random variables

\[ \Sigma = \sigma\, v v^T + U^T U, \tag{5.29} \]
where σ = 10 is the signal-to-noise ratio. In this manner, the pattern of cardinality is partially masked by noise. In our experiments the results of SA, binary GA with linear penalties, binary GA with random repair, set GA with RAR crossover operator and w = 1, PBIL and DSPCA, an approximate method based on semidefinite programming [53, 55], are compared. SA uses a geometric annealing scheme with γ = 0.9. The GAs use a population of 50 individuals. Crossover and mutation are performed with probabilities p_c = 1 and p_m = 10^-2, respectively. PBIL is executed with a population of 400 individuals and α = 0.1. In this algorithm, the
best 10% of the individuals are used to update the probability distribution. The first sparse principal component is then calculated. For each of the methods that involve stochastic search (all except DSPCA), the best out of 5 independent executions of the algorithm is taken. Figure 5.1 displays the variance explained by the first sparse principal component as a function of its cardinality K = 1, 2, ..., 140, for all the methods considered. GA using a linear penalty does not obtain good solutions in this high-dimensional problem. PBIL performs slightly better, but is clearly inferior to SA, GA with random repair, GA with set encoding and DSPCA. Table 5.6 shows the detailed results for cardinality K = 50, which is the cardinality of the true hidden pattern. In this table, the largest value of the variance achieved is highlighted in bold. The success rates, the computation times on an AMD Turion machine with 1.79 GHz processor speed and 1 GB RAM and the total number of optimizations are also given. The times for the DSPCA algorithm are not given, because a MATLAB implementation was used [54], which cannot be directly compared with the other results, obtained with code written in C. The GA with set encoding and RAR (w = 1) crossover and the GA with binary encoding and random repair obtain the best results and explain more variance than the solution obtained by DSPCA. The first of these methods is slightly faster. SA is very fast and achieves a result that is only slightly worse with a success rate of 100%. PBIL and GA with binary encoding and linear penalty obtain solutions that are clearly inferior.
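The synthetic benchmark and the continuous part of the sparse PCA problem can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' C implementation: all function names are hypothetical, the random matrix U is drawn with Python's generator (so the numbers will not match Table 5.6), and the inner problem for a fixed z is solved by power iteration, relying on the standard fact that the unit vector maximizing w^T Σ[z,z] w is the leading eigenvector of Σ[z,z].

```python
import math
import random


def hidden_pattern(n=150):
    """Sparse vector v of (5.28); the index i in the text is 1-based."""
    v = []
    for i in range(1, n + 1):
        if i <= 50:
            v.append(1.0)
        elif i <= 100:
            v.append(1.0 / (i - 50))
        else:
            v.append(0.0)
    return v


def benchmark_covariance(n=150, sigma=10.0, seed=0):
    """Sigma = sigma * v v^T + U^T U with U entries ~ Uniform[0, 1], as in (5.29)."""
    rng = random.Random(seed)
    v = hidden_pattern(n)
    columns = [[rng.random() for _ in range(n)] for _ in range(n)]  # columns of U
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            gram = sum(x * y for x, y in zip(columns[i], columns[j]))  # (U^T U)[i][j]
            cov[i][j] = cov[j][i] = sigma * v[i] * v[j] + gram
    return cov


def explained_variance(cov, support, iterations=500):
    """Maximize w^T cov[z,z] w over unit vectors w restricted to `support`.

    Power iteration from a uniform start; assumes the start vector is not
    orthogonal to the leading eigenvector (true for entrywise positive cov).
    """
    if not support:
        return 0.0, []
    sub = [[cov[i][j] for j in support] for i in support]
    w = [1.0 / math.sqrt(len(support))] * len(support)
    for _ in range(iterations):
        y = [sum(x * c for x, c in zip(row, w)) for row in sub]
        norm = math.sqrt(sum(c * c for c in y))
        if norm == 0.0:
            break
        w = [c / norm for c in y]
    variance = sum(wi * sum(x * c for x, c in zip(row, w)) for wi, row in zip(w, sub))
    return variance, w


def deflate(cov, w):
    """Deflation step (5.27) for a full-length unit vector w."""
    cw = [sum(x * c for x, c in zip(row, w)) for row in cov]
    lam = sum(wi * ci for wi, ci in zip(w, cw))
    n = len(cov)
    return [[cov[i][j] - lam * w[i] * w[j] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    cov = benchmark_covariance()
    variance, _ = explained_variance(cov, list(range(50)))  # true hidden support
    print(variance)
```

With this split, a combinatorial search method only has to propose the support z, and the value returned by `explained_variance` can serve as its fitness.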
[Figure: variance explained (vertical axis, "Variance") versus cardinality (horizontal axis, "Cardinality") for GA Linear Penalty, GA Random Repair, GA RAR, SA, PBIL and DSPCA]
Fig. 5.1 Comparison of results for the SPCA problem
Table 5.6 Results for the GA, SA, EDA and SDP approaches in the synthetic problem for K = 50
Algorithm            Best variance  Success rate  Time (s)  Optimizations
SA                   22.5727        1.00          65.95     11639
GA + Linear Penalty  19.7881        0.20          126.1     5137
GA + Random Repair   22.7423        0.80          172.1     7981
GA + RAR (w = 1)     22.7423        1.00          105.41    5146
PBIL                 20.1778        1.00          198.20    40800
SDP                  22.5001        −             −         −
5.4 Conclusions

Many tasks of practical interest can be formulated as optimization problems with cardinality constraints. The examples analyzed in this article arise in various fields of application: ensemble pruning, optimal portfolio selection, financial index tracking and sparse principal component analysis. They are large optimization problems whose solution by standard optimization methods is computationally expensive. In practice, using exact methods like branch-and-bound is feasible only for small problem instances. A practicable alternative is to use approximate optimization methods that can identify near-optimal solutions at a lower computational cost: genetic algorithms, simulated annealing and estimation of distribution algorithms. However, the search operators used in the standard formulations of these techniques are ill-suited to the problem because they do not preserve the cardinality of the candidate solutions. This means that either ad hoc penalization or repair mechanisms are needed to enforce the constraints. Including penalty terms in the objective function distorts the search and generally leads to suboptimal solutions. Applying repair mechanisms to infeasible configurations provides a more elegant and effective approach to the problem. Nonetheless, the best option is to use a set representation, in conjunction with specially designed search operators that preserve the cardinality of the candidate solutions. Some of the problems considered, such as the knapsack problem and ensemble pruning, are purely combinatorial optimization tasks. In problems like portfolio selection, index tracking and sparse PCA both combinatorial and continuous aspects are present. For these we advocate the use of hybrid methods that separately handle the combinatorial and the continuous aspects of cardinality-constrained optimization problems.
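The point about cardinality-preserving operators can be illustrated with the simplest such move on a set representation: exchanging one selected element for one unselected element. This sketch is not the RAR crossover used in the experiments; the function name `swap_neighbor` and the example values are hypothetical.

```python
import random


def swap_neighbor(subset, universe_size, rng):
    """Replace one selected element by one unselected element.

    The candidate solution is stored as a set of indices, so the cardinality
    of the result always equals the cardinality of the input.
    """
    inside = sorted(subset)
    outside = [i for i in range(universe_size) if i not in subset]
    if not inside or not outside:
        return set(subset)  # nothing to exchange
    removed = rng.choice(inside)
    added = rng.choice(outside)
    neighbor = set(subset)
    neighbor.remove(removed)
    neighbor.add(added)
    return neighbor


rng = random.Random(0)
candidate = {0, 3, 5}  # K = 3 elements chosen out of 10
for _ in range(5):
    candidate = swap_neighbor(candidate, 10, rng)
    print(sorted(candidate), len(candidate))
```

A bit-flip mutation on a binary string, by contrast, changes the number of selected elements by one at every flip, which is why the standard operators need penalties or repair.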
Among the approximate methods considered, a genetic algorithm with set encoding and RAR crossover obtains the best overall performance. In problems where the comparison was possible, the solutions obtained are close to the exact ones and to those identified by approximate methods that use semidefinite programming. Using the same encoding, simulated annealing also obtains fairly good solutions, generally at a higher computational cost. This indicates that the RAR crossover operator seems to enhance the search by introducing in the population individuals that effectively combine advantageous features of
their ancestors. Estimation of distribution algorithms, such as PBIL, perform well on small and medium-sized problem instances. However, they fail to obtain good solutions on large problems. The reason for this loss of efficacy is that the sampling and estimation of probability distributions become progressively more difficult as the dimensionality of the problem increases.
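For readers unfamiliar with PBIL, one generation of a textbook variant can be sketched as follows. This is an assumption-laden illustration, not the implementation used in the experiments: the function name `pbil_step` is hypothetical, α is interpreted as the learning rate, the update moves the probabilities towards the bit frequencies of the best 10% of the sample, and no cardinality handling is included.

```python
import random


def pbil_step(probabilities, fitness, rng,
              population_size=400, elite_fraction=0.1, alpha=0.1):
    """One PBIL generation: sample, select the best individuals, update.

    probabilities[i] is the current probability that bit i equals 1;
    fitness maps a list of 0/1 values to a number to be maximized.
    """
    population = [
        [1 if rng.random() < p else 0 for p in probabilities]
        for _ in range(population_size)
    ]
    population.sort(key=fitness, reverse=True)
    n_elite = max(1, int(elite_fraction * population_size))
    elite = population[:n_elite]
    updated = []
    for i, p in enumerate(probabilities):
        frequency = sum(individual[i] for individual in elite) / n_elite
        updated.append((1.0 - alpha) * p + alpha * frequency)
    return updated


# Toy run on an 8-bit problem whose fitness is the number of ones.
rng = random.Random(0)
p = [0.5] * 8
for _ in range(50):
    p = pbil_step(p, sum, rng)
print([round(x, 3) for x in p])
```

Each probability is estimated independently from a finite sample, so the number of quantities to estimate grows with the dimension while the sample size stays fixed, which is consistent with the loss of efficacy noted above.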
Acknowledgments

This research has been supported by Dirección General de Investigación (Spain), project TIN2007-66862-C02-02.
References

1. Gill, P.E., Murray, W., Saunders, M.A., Wright, M.H.: Inertia-controlling methods for general quadratic programming. SIAM Review 33, 1–36 (1991)
2. Gill, P., Murray, W.: Quasi-Newton methods for unconstrained optimization. IMA Journal of Applied Mathematics 9(1), 91–108 (1972)
3. Adler, I., Karmarkar, N., Resende, M.G.C., Veiga, G.: An implementation of Karmarkar’s algorithm for linear programming. Mathematical Programming 44, 297–335 (1989)
4. Shapcott, J.: Index tracking: genetic algorithms for investment portfolio selection. Technical report, EPCC-SS92-24, Edinburgh Parallel Computing Centre (1992)
5. Radcliffe, N.J.: Genetic set recombination. Foundations of Genetic Algorithms. Morgan Kaufmann Publishers, San Francisco (1993)
6. Coello, C.: Theoretical and numerical constraint-handling techniques used with evolutionary algorithms: a survey of the state of the art. Computer Methods in Applied Mechanics and Engineering 191, 1245–1287 (2002)
7. Streichert, F., Ulmer, H., Zell, A.: Evaluating a hybrid encoding and three crossover operators on the constrained portfolio selection problem. In: Proceedings of the Congress on Evolutionary Computation (CEC 2004), vol. 1, pp. 932–939 (2004)
8. Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 4598, 671–679 (1983)
9. Goldberg, D.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading (1989)
10. Moral-Escudero, R., Ruiz-Torrubiano, R., Suárez, A.: Selection of optimal investment portfolios with cardinality constraints. In: Proceedings of the IEEE World Congress on Evolutionary Computation, pp. 2382–2388 (2006)
11. Radcliffe, N.J.: Equivalence class analysis of genetic algorithms. Complex Systems 5, 183–205 (1991)
12. Larrañaga, P., Lozano, J.A. (eds.): Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers, Dordrecht (2002)
13. Baluja, S.: Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning. Technical Report CMU-CS-94-163, Carnegie Mellon University (1994)
14. Muehlenbein, H.: The equation for response to selection and its use for prediction. Evolutionary Computation 5, 303–346 (1998)
15. Kellerer, H., Pferschy, U., Pisinger, D.: Knapsack Problems. Springer, Heidelberg (2004)
16. Miller, R.E., Thatcher, J.W. (eds.): Reducibility among combinatorial problems, pp. 85–103. Plenum Press (1972)
17. Pisinger, D.: Where are the hard knapsack problems? Computers & Operations Research, 2271–2284 (2005)
18. Simões, A., Costa, E.: An evolutionary approach to the zero/one knapsack problem: Testing ideas from biology. In: Proceedings of the Fifth International Conference on Artificial Neural Networks and Genetic Algorithms, ICANNGA (2001)
19. Ku, S., Lee, B.: A set-oriented genetic algorithm and the knapsack problem. In: Proceedings of the IEEE World Congress on Evolutionary Computation, CEC 2001 (2001)
20. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. Springer, Heidelberg (1996)
21. Ladanyi, L., Ralphs, T., Guzelsoy, M., Mahajan, A.: SYMPHONY (2009), https://projects.coin-or.org/SYMPHONY
22. Padberg, M.W., Rinaldi, G.: A branch-and-cut algorithm for the solution of large scale traveling salesman problems. SIAM Review 33, 60–100 (1991)
23. Dietterich, T.G.: An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning 40, 139–157 (2000)
24. Margineantu, D.D., Dietterich, T.G.: Pruning adaptive boosting. In: Proc. of the 14th International Conference on Machine Learning, pp. 211–218. Morgan Kaufmann, San Francisco (1997)
25. Caruana, R., Niculescu-Mizil, A., Crew, G., Ksikes, A.: Ensemble selection from libraries of models. In: Proc. of the 21st International Conference on Machine Learning, p. 18. ACM Press, New York (2004)
26. Banfield, R.E., Hall, L.O., Bowyer, K.W., Kegelmeyer, W.P.: Ensemble diversity measures and their application to thinning. Information Fusion 6, 49–62 (2005)
27. Martínez-Muñoz, G., Lobato, D.H., Suárez, A.: An analysis of ensemble pruning techniques based on ordered aggregation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 245–259 (2009)
28. Zhou, Z.H., Wu, J., Tang, W.: Ensembling neural networks: Many could be better than all. Artificial Intelligence 137, 239–263 (2002)
29. Zhou, Z.H., Tang, W.: Selective ensemble of decision trees. In: Liu, Q., Yao, Y., Skowron, A. (eds.) RSFDGrC 2003. LNCS (LNAI), vol. 2639, pp. 476–483. Springer, Heidelberg (2003)
30. Hernández-Lobato, D., Hernández-Lobato, J.M., Ruiz-Torrubiano, R., Valle, Á.: Pruning adaptive boosting ensembles by means of a genetic algorithm. In: Corchado, E., Yin, H., Botti, V., Fyfe, C. (eds.) IDEAL 2006. LNCS, vol. 4224, pp. 322–329. Springer, Heidelberg (2006)
31. Zhang, Y., Burer, S., Street, W.N.: Ensemble pruning via semi-definite programming. Journal of Machine Learning Research 7, 1315–1338 (2006)
32. Asuncion, A., Newman, D.: UCI machine learning repository (2007)
33. Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)
34. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Chapman & Hall, New York (1984)
35. Markowitz, H.: Portfolio selection. Journal of Finance 7, 77–91 (1952)
36. Bienstock, D.: Computational study of a family of mixed-integer quadratic programming problems. In: Balas, E., Clausen, J. (eds.) IPCO 1995. LNCS, vol. 920. Springer, Heidelberg (1995)
37. Chang, T.J., Meade, N., Beasley, J.E., Sharaiha, Y.M.: Heuristics for cardinality constrained portfolio optimisation. Computers and Operations Research 27, 1271–1302 (2000)
38. Glover, F.: Future paths for integer programming and links to artificial intelligence. Computers and Operations Research 13, 533–549 (1986)
39. Crama, Y., Schyns, M.: Simulated annealing for complex portfolio selection problems. Technical report, Groupe d’Etude des Mathematiques du Management et de l’Economie 9911, Université de Liège (1999)
40. Schaerf, A.: Local search techniques for constrained portfolio selection problems. Computational Economics 20, 177–190 (2002)
41. Streichert, F., Tanaka-Tamawaki, M.: The effect of local search on the constrained portfolio selection problem. In: Proceedings of the IEEE World Congress on Evolutionary Computation (CEC 2006), Vancouver, Canada, pp. 2368–2374 (2006)
42. Beasley, J.E.: OR-Library: Distributing test problems by electronic mail. Journal of the Operational Research Society 41(11), 1069–1072 (1990)
43. Buckley, I., Korn, R.: Optimal index tracking under transaction costs and impulse control. International Journal of Theoretical and Applied Finance 1(3), 315–330 (1998)
44. Gilli, M., Këllezi, E.: Threshold accepting for index tracking. Computing in Economics and Finance 72 (2001)
45. Beasley, J.E., Meade, N., Chang, T.: An evolutionary heuristic for the index tracking problem. European Journal of Operations Research 148(3), 621–643 (2003)
46. Lobo, M., Fazel, M., Boyd, S.: Portfolio optimization with linear and fixed transaction costs. Annals of Operations Research, special issue on financial optimization 152(1), 376–394 (2007)
47. Jeurissen, R., van den Berg, J.: Index tracking using a hybrid genetic algorithm. In: ICSC Congress on Computational Intelligence Methods and Applications 2005 (2005)
48. Jeurissen, R., van den Berg, J.: Optimized index tracking using a hybrid genetic algorithm. In: Proceedings of the IEEE World Congress on Evolutionary Computation (CEC 2008), pp. 2327–2334 (2008)
49. Ruiz-Torrubiano, R., Suárez, A.: A hybrid optimization approach to index tracking. Accepted for publication in Annals of Operations Research (2007)
50. Moghaddam, B., Weiss, Y., Avidan, S.: Spectral bounds for sparse PCA. In: Advances in Neural Information Processing Systems, NIPS 2005 (2005)
51. Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. Journal of Computational and Graphical Statistics 15(2), 265–286 (2006)
52. Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B 58, 267–268 (1996)
53. d’Aspremont, A., Ghaoui, L.E., Jordan, M., Lanckriet, G.: A direct formulation for sparse PCA using semidefinite programming. SIAM Review 49(3), 434–448 (2007)
54. d’Aspremont, A., Bach, F., Ghaoui, L.E.: Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research 9, 1269–1294 (2008)
55. d’Aspremont, A., Ghaoui, L.E., Jordan, M., Lanckriet, G.: MATLAB code for DSPCA (2008), http://www.princeton.edu/~aspremon/DSPCA.htm
Chapter 6
Learning Global Optimization through a Support Vector Machine Based Adaptive Multistart Strategy

Jayadeva, Sameena Shah, and Suresh Chandra
Abstract. We propose a global optimization algorithm called GOSAM (Global Optimization using Support vector regression based Adaptive Multistart) that applies statistical machine learning techniques, viz. Support Vector Regression (SVR), to adaptively direct iterative search in large-scale global optimization. At each iteration, GOSAM builds a training set of the objective function’s local minima discovered up to the current iteration, and applies SVR to construct a regressor that learns the structure of the local minima. In the next iteration the search for the local minimum is started from the minimum of this regressor. The idea is that the regressor for local minima will generalize well to the local minima not obtained so far in the search, and hence its minimum would be a ‘crude approximation’ to the global minimum. This approximation improves over time, leading the search towards regions that yield better local minima and eventually the global minimum. Simulation results on well-known benchmark problems show that GOSAM requires significantly fewer function evaluations to reach the global optimum, in comparison with methods like Particle Swarm Optimization and Genetic Algorithms. GOSAM proves to be relatively more efficient as the number of design variables (dimension) increases. GOSAM does not require explicit knowledge of the objective function, and also does not assume any specific properties. We also discuss some real world applications of GOSAM involving constrained and design optimization problems.

Jayadeva
Dept. of Electrical Engineering, Indian Institute of Technology, Hauz Khas, New Delhi - 110016, India
e-mail: [email protected]

Sameena Shah
Dept. of Electrical Engineering, Indian Institute of Technology, Hauz Khas, New Delhi - 110016, India
e-mail: [email protected]

Suresh Chandra
Dept. of Mathematics, Indian Institute of Technology, Hauz Khas, New Delhi - 110016, India
e-mail: [email protected]

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 131–154. springerlink.com © Springer-Verlag Berlin Heidelberg 2010
6.1 Introduction and Background Research

Global optimization involves finding the optimal or best possible configuration from a large space of possible configurations. It is among the most fundamental of computational tasks, and has numerous applications including bio-informatics [26], robotics [20], portfolio optimization [5], VLSI design [31], and nearly every engineering application [27]. If the search space is small, then obtaining the global optimum is trivial; otherwise some special structure like linearity, convexity or differentiability of the problem needs to be exploited. Classical mathematical optimization techniques are based on utilizing such special structures. Difficulties in obtaining the global optimum arise when the objective function neither has any special structure, nor possesses properties like continuity and differentiability, or if it has numerous local optima that obstruct search for the global optimum [25]. Such objective functions are common in many applications including data mining, location problems, and computational chemistry, amongst others [13, 15]. Similar difficulties arise if some structure exists but is not known a priori. For the global optimization of these kinds of objective functions, one utilizes the broad class of general-purpose algorithms called local search algorithms [29]. Global optimizers typically depend on local search algorithms that search multiple states in the neighborhood of a given state or configuration. Such methods find a local optimum, which depends on the starting state. Local search generally does not yield the global optimum, because it gets stuck in a local optimum. Therefore, local search methods are usually augmented with some strategy to escape from local optima. For instance, in simulated annealing (SA), first introduced by [22], the escape strategy is a probabilistic shaking that is associated with a temperature parameter.
The “temperature” is reduced iteratively from a high initial value, based on a cooling schedule. On the other hand, though local search strategies suffer from entrapment in local optima, they are very fast. The time required to run an iteration of simulated annealing is sufficient for several iterations of gradient descent. Therefore, instead of escaping from local optima, an alternative is the use of multi-restart local search approaches. These start from a new initial state once a local search step has terminated in a local optimum. Multistart approaches are known to outperform other strategies on some problems, e.g. simulated annealing on the Travelling Salesperson Problem (TSP) [19]. The performance of any local search procedure depends on the starting state, and multi-restart local search algorithms start from a randomly chosen state. None of the above mentioned approaches exploit knowledge of the space that has been explored so far, to guide further search. In other words, their search strategy does not evolve with time. A question that comes to mind is whether, based on some knowledge
Learning Global Optimization through SVM Based Adaptive Multistart
collected about the function, it is possible to generate a start state that is better than a random one. If the answer is in the affirmative, then successive iterations will lead us closer to the global minimum.

Evolutionary algorithms like Particle Swarm Optimization (PSO), Genetic Algorithms (GA) and Ant Colony Optimization (ACO) are distributed iterative search algorithms, which indirectly use some form of information about the space explored so far to direct search. Initially, there is a finite number of “agents” that search for the global optimum. The paths of these agents are dynamically and independently updated during the search, based on the results obtained till the current update. PSO, developed by Kennedy and Eberhart [21], is inspired by the flocking behavior of birds. In PSO, particles start search in different regions of the solution space, and every particle is made aware of the best local optimum amongst those found by its neighbors, as well as the best solution obtained by the whole swarm up to the current iteration. Each particle then iteratively adapts its path and velocity accordingly. The algorithm converges when a particle finds a highly desirable path to proceed on, and the other particles effectively follow its lead. Genetic Algorithms [12] are motivated by the biological evolutionary operations of selection, mutation and crossover. In real life, the fittest individuals tend to survive, reproduce and improve over generations. Based on this, “chromosomes” that yield better optima are considered to correspond to fitter individuals, and are used for creating the next generation of chromosomes that hopefully lead us to better optima. The population of chromosomes is updated till convergence, or until a specified number of updates is completed. Ant Colony Optimization [10] mimics the behavior of a group of ants following the shortest path to a food source.
Ants (agents) exchange information indirectly, through a mechanism called “stigmergy”, by leaving a trail of pheromone on the paths traversed. States believed to be good are marked by heavier concentrations of pheromone to guide the ants that arrive later. Therefore, decisions that are taken subsequently get biased by previous decisions and their outcomes.

Some heuristic techniques use an alternate approach to guide further search, by applying machine learning techniques to past search results. Machine learning techniques help in discovering relationships, by analyzing search data that other techniques may ignore. If any relationship exists, then it could be exploited to reduce search time or improve the quality of optima. For this task, some papers try to understand the structure of the search space, while others try to tune algorithms accordingly (cf. [4] for a survey of these algorithms). Boyan used information about the complete trajectories to the local minima, and the corresponding values of the local minima reached, to construct evaluation functions [8], [9]. The minimum of the evaluation function determined a new starting point. Optimal solutions were obtained for many combinatorial optimization problems like bin packing, channel routing, etc. Agakov et al. [3] gave a compiler optimization algorithm that trains on a set of computer programs and predicts which parts of the optimization space are likely to give large performance improvements for programs. Boese et al. [6] explored the use of local minima to adapt the optimization algorithm. For graph bisection and the
Jayadeva, S. Shah, and S. Chandra
TSP, they found a “big valley” structure to the set of minima. Using this information they were able to hand code a strategy to find good starting states for these problems. Is this possible for other problems as well? The proposed work is motivated by the question: for any general global optimization problem, is there a structure to the set of local optima? If so, can it be learnt automatically through the use of machine learning?

We propose a new algorithm for the general global optimization problem, termed Global Optimization using Support vector regression based Adaptive Multistart (GOSAM). GOSAM attempts to learn the structure of local minima based on local minima discovered during earlier local searches. GOSAM uses Support Vector Machine based learning to learn a Fit function (regressor) that passes through all the local minima, thereby learning the structure of the locations of local minima. Although the regressor can only learn the structure of the local minima encountered till the present iteration, the idea is that it will generalize well to the local minima not obtained so far in the search. Consequently, its minimum would be a ‘crude approximation’ to the global minimum of the objective function. In the next iteration, the search for the local minimum is started from the minimum of this regressor. The new local minimum obtained is added as a new training point and the Fit function is re-constructed. Over time, this approximation gets better, leading the search towards regions that yield better local minima and eventually the global minimum. Surprisingly, for most problems this algorithm tends to direct search to the region containing the global minimum in just a few iterations, and is significantly faster than other methods. The results reinforce our belief that many problems have some ‘structure’ to the location of local minima, which can be exploited in directing further search.
It is important to emphasize that GOSAM’s approach is different from approximating a fitness landscape; GOSAM attempts to predict how local minima are distributed, and where the best one might lie. This turns out to be very efficient in practice. In this chapter, we wish to demonstrate the same by testing on many benchmark global optimization problems against established evolutionary methods. The rest of the chapter is organized as follows. Section 6.2 discusses the proposed algorithm. Section 6.3 is devoted to GOSAM’s performance on benchmark optimization problems, as well as a comparison with GA and PSO. Section 6.4 extends the algorithm for constrained optimization problems. Section 6.5 demonstrates how the algorithm may be applied to design optimization problems. Section 6.6 is devoted to a general discussion on the convergence of GOSAM to the global optimum, while Section 6.7 contains concluding remarks.
6.2 Global Optimization with Support Vector Regression Based Adaptive Multistart (GOSAM)

The motivation of the proposed algorithm is to use the information about the local minima encountered in earlier steps to predict the location of other, better minima. We denote the objective function to be minimized by f(x), where x = (x_i, i = 1, . . . , n) is an n-dimensional vector of variables. We assume that the lower and upper bounds of each of these variables are known. For an unconstrained optimization problem, the
feasible region is the complete search space that lies within the lower and upper bounds of all variables.

We now summarize the flow of the GOSAM algorithm. At each iteration, the algorithm performs a local search¹, starting from a location termed the start-state. Iteratively, the algorithm determines the start-state for the next iteration.

GOSAM Algorithm for Minimization of a Multivariate Function

1. Initialize start-state to an initial guess, possibly chosen randomly, lying in the search space.
2. Starting from start-state, obtain a locally optimal solution using a local search procedure. Term this solution current-local-optimum.
3. Store current-local-optimum x* = (x*_i, i = 1, . . . , n) and the corresponding function value f(x*) in the training set.
4. Apply Support Vector Regression, treating all the local optima collected so far as the independent variables and their corresponding function values as the target values. The regressor obtained will be called the current Fit function.
5. Obtain the minimum of the current Fit function using a local search procedure.
6. Set start-state to the minimum obtained in Step 5. If this minimum is out of bounds, or is the same as that obtained in the previous r runs, set start-state to a random state.
7. If the termination criteria have been met, proceed to Step 8. Otherwise, go to Step 2.
8. Return the best local minimum obtained.

Initially, we generate a feasible random state and assign it to start-state. In Step 2, we perform local search on the objective function starting from the start-state. The search terminates at a local minimum. We store the local minimum and the corresponding value of the objective function in an array. This constitutes our training data. We treat local minima as data points and their corresponding function values as target values. In Step 4, we use Support Vector Regression (SVR) to perform regression on the training data, comprising the local minima obtained till the current iteration.
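The steps above can be sketched in a few lines for a one-dimensional objective. This is a minimal illustration under stated assumptions, not the authors' implementation: an ordinary least-squares quadratic fit stands in for the quadratic-kernel SVR of Step 4, a simple step-halving descent stands in for the local minimizer, the restart rule of Step 6 is simplified to "restart randomly if the last local search returned an already known minimum", and all function names (`wave`, `local_search`, `fit_quadratic`, `gosam_1d`) are hypothetical.

```python
import math
import random


def wave(x):
    """One-dimensional test function f(x) = (|x| - 10) cos(2*pi*x)."""
    return (abs(x) - 10.0) * math.cos(2.0 * math.pi * x)


def local_search(func, x, lo, hi, step=0.1, tol=1e-9):
    """Derivative-free descent: move while improving, else halve the step."""
    fx = func(x)
    while step > tol:
        for cand in (min(x + step, hi), max(x - step, lo)):
            fc = func(cand)
            if fc < fx:
                x, fx = cand, fc
                break
        else:
            step /= 2.0
    return x, fx


def fit_quadratic(points):
    """Least-squares fit y = a*x^2 + b*x + c (stand-in for the SVR Fit function).

    Returns (a, b, c), or None if the normal equations are singular.
    """
    s = [sum(x ** k for x, _ in points) for k in range(5)]
    t = [sum(y * x ** k for x, y in points) for k in range(3)]
    m = [[s[4], s[3], s[2], t[2]],
         [s[3], s[2], s[1], t[1]],
         [s[2], s[1], s[0], t[0]]]
    for col in range(3):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) < 1e-12:
            return None
        m[col], m[piv] = m[piv], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def gosam_1d(func, lo, hi, iters=20, seed=0):
    """Adaptive multistart loop (Steps 1-8), returning the best (x*, f(x*))."""
    rng = random.Random(seed)
    start = rng.uniform(lo, hi)                      # Step 1
    minima = {}                                      # rounded x* -> (x*, f(x*))
    for _ in range(iters):
        x_star, f_star = local_search(func, start, lo, hi)   # Step 2
        is_new = round(x_star, 6) not in minima
        minima[round(x_star, 6)] = (x_star, f_star)          # Step 3
        coeffs = fit_quadratic(list(minima.values())) if len(minima) >= 3 else None
        start = None
        if is_new and coeffs is not None and coeffs[0] > 0:  # Steps 4-5
            vertex = -coeffs[1] / (2.0 * coeffs[0])
            if lo <= vertex <= hi:
                start = vertex
        if start is None:                            # Step 6: random restart
            start = rng.uniform(lo, hi)
    return min(minima.values(), key=lambda p: p[1])  # Step 8
```

A call such as `gosam_1d(wave, -10.0, 10.0)` returns the best local minimum found within the iteration budget. Because the regressor here is not an SVR, the start-states it proposes will generally differ from those reported in the chapter.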
¹ To the best of our knowledge, there is no restriction on the choice of local search procedure used.

In general, fitting the local minima with a linear SVR regressor would incur a large error. Nonlinear regression can be achieved with a wide choice of kernel functions, and the choice of kernel will also impact the number of function evaluations required to reach the global optimum. In all our experiments, we chose the commonly used second-degree polynomial kernel, also termed a quadratic kernel, primarily to simplify computation. It also facilitates Step 5 of the GOSAM algorithm, which requires minimization of the SVR regressor. For a polynomial kernel of degree 2, the problem to be minimized is a quadratic one, which can be solved efficiently. We found that GOSAM is not handicapped by the choice of kernel, and that a quadratic kernel worked well over a wide range of problems. All the local minima obtained till the current iteration are treated as training samples, and their corresponding function values as the target values. The regressor obtained in Step 4, termed the current Fit function, approximates local minima of the objective function.

In the limit that all local minima are known, SVR will construct a regressor that passes through all local minima of the objective function. The global minimum of this function would then correspond to the global minimum of the original objective function. Of course, if we knew all the local minima, regression would not be required and one could easily determine the best local minimum. In practice, we utilize only the information of the few local minima obtained through local search till the current iteration. We then rely on the excellent generalization properties of SVRs to predict how the local minima are distributed. Search is redirected to the region containing the minimum of the regressor, i.e. the Fit function. Because of the limited size of the training set, this regressor will not be an exact approximation of the local minima of the objective function. However, over successive iterations, the Fit function tends to better localize the global minimum of the function being optimized. This is demonstrated by the experiments presented in Section 6.3, which show that the ‘predictor’ turns out to be so good that search terminates in the global minimum within very few iterations.

Apart from the generalization ability of SVRs, which is imperative in predicting better starting points and finding the global optimum quickly, the choice of SVR for function approximation is also motivated by the fact that the regressors obtained using SVR are generally very simple and can be constructed by using only a few support vectors. Since minimization of the Fit function requires evaluating it at several points, the use of only the support vectors contributes to computational efficiency.
Regardless of the complexity of the kernel used, the optimization problem that needs to be solved remains a convex quadratic one, because only a kernel matrix that contains an inner product between the points is required. The meagre amounts of data to be fit, i.e. the small number of local minima and their corresponding function values, also contribute to making the process fast and efficient. In step 5, we minimize the Fit obtained and reset the start-state to its minimum. If the local minimum obtained from this start-state is the same as the one obtained in the previous r iterations, or out of bounds, then we conclude that the search has become too localized, and needs to explore other regions to discover new minima. In such a case, we reset the start-state to a random state.
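To see why minimizing the Fit function in Step 5 is a quadratic problem, the following sketch assumes the standard ε-SVR dual form of the regressor over m training points and the inhomogeneous quadratic kernel k(u, v) = (uᵀv + t)². The symbols β_i, H, g and c are introduced here for illustration only and do not appear in the original text.

```latex
\begin{aligned}
F(x) &= \sum_{i=1}^{m} \beta_i \,(x_i^{\top} x + t)^2 + b,
      \qquad \beta_i = \alpha_i - \alpha_i^{*}, \\
     &= x^{\top} H x + 2t\, g^{\top} x + c,
      \qquad H = \sum_{i=1}^{m} \beta_i\, x_i x_i^{\top},\quad
      g = \sum_{i=1}^{m} \beta_i\, x_i,\quad
      c = t^2 \sum_{i=1}^{m} \beta_i + b, \\
\nabla F(x) &= 2 H x + 2t\, g = 0
      \;\Longrightarrow\; H x^{\star} = -t\, g .
\end{aligned}
```

Only the support vectors have β_i ≠ 0, so H and g are built from a few training points. When H is positive definite, x* is the unique minimizer of the Fit function; otherwise the Fit function has no interior minimum, which would correspond to the boundary or out-of-bounds cases that trigger a random restart.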
6.3 Experimental Results

GOSAM was encoded in MATLAB and run on an 800 MHz Pentium III PC with 256 MB RAM. For all our test cases, we used the local minimizer of LINDO API [23]. We tested GOSAM on a number of difficult global optimization benchmark problems, which are available online at [11, 24]². Most of the benchmark problems available in [11, 24] are parameterized in the number of variables, and can thus be extended to any dimension. We first illustrate the working of GOSAM on one and two variable examples that possess several local minima. These examples will help visualize GOSAM’s working. We then discuss results for higher dimensional benchmarks.

² The websites also include visualizations of the two dimensional examples, a discussion of why each of these problems is difficult, and a mention of the estimated number of local minima of the benchmarks.
6.3.1 One Dimensional Wave Function

Figures 6.1 through 6.4 show the objective function f(x) = (|x| − 10) cos(2πx) (taken from [8]) as a dotted wave. The bounds for this problem were taken to be x = −10.0 to x = 10.0. As seen in Figs. 6.1 - 6.4, the objective function has several local minima, but the global minimum lies at x = 0. We now show how GOSAM finds the global minimum for this example.
Fig. 6.1 Iteration 1 for the global minimization of the objective function f (x) = (|x| − 10)cos(2π x). Local search started from a random start state given by x = -1.3444 (indicated by the circle) and terminated at the local minimum x = -1.0 where f (x) = -9.0. Using only this one local minimum in the training set, the regressor obtained till the end of iteration 1 is shown by the solid line
The initial randomly chosen starting state is x = -1.3444. This is shown as the circled point in Fig. 6.1. Local search from this point led to the local minimum at x = -1.0, indicated by a square in Fig. 6.1. At this point, the objective function has a value of f (−1.0) = -9.0. Using only one local minimum in the training set, the SVR regressor that was obtained is shown by the solid line parallel to the x-axis. Since this regressor has a constant function value, its minimum is the same everywhere; therefore, any random point can be selected as the minimum. In our simulation, the random point returned was x = -6.3. Local search from this point terminated at
the local minimum x = -6.0. The regressor obtained using these two points led to a minimum at the boundary. In cases when the minimum is at a boundary, we find that one can start the next local search from either the boundary point, or from a random new starting point. The search for the global optimum was not hampered by either choice. However, the results reported here are based on a random restart in such cases. In this simulation, search was restarted from a random point at x = 4.483.
Fig. 6.2 Iteration 3 for the global minimization of the objective function f (x) = (|x| − 10)cos(2π x). Local search started from a random start state given by x = 4.483 (indicated by the circle) and terminated at the local minimum at x = 4.0. The regressor obtained using the three points, depicted by squares, is shown as the solid concave curve. The minimum of this curve lies at x = -0.8422
In the third iteration, shown in Fig. 6.2, local search is started from x = 4.483, depicted by a circle. The local minimum was found to be at x = 4.0, and is depicted by a square in the figure. When the information of these three local minima was used, the SVR regressor shown as the solid concave curve was obtained. The minimum of this curve lies at x = -0.8422. The start state for iteration 4 was given by the minimum of the regressor obtained in the previous iteration, x = -0.8422. This point is depicted as a circle in Fig. 6.3. The local minimum obtained from this starting state is again depicted as the square at the end of the slope. The regressor obtained using these four local minima is shown as a bowl shaped curve, the minimum of which is located at x = −0.1130. In the next iteration, depicted in Fig. 6.4, local search from x = −0.1130, depicted by a circle, led us to the global minimum at x = 0.0, depicted by a square in the figure.
Fig. 6.3 Iteration 4 for the global minimization of the objective function f (x) = (|x| − 10)cos(2π x). Local search started from the minimum of the regressor obtained in the previous iteration, given by x = -0.8422 (indicated by the circle). It terminated at the local minimum at x = -1.0. The regressor obtained using the four local minima obtained till the current iteration, depicted by squares, is shown as the solid convex shaped curve. The minimum of this curve lies at x = −0.1130
Fig. 6.4 Iteration 5 for the global minimization of the objective function f (x) = (|x| − 10)cos(2π x). Local search started from the start state given by the minimum of the regressor obtained at the end of the previous iteration, given by x = -0.1130 (indicated by the circle) and terminated at the local minimum at x = 0.0 where f (x) = -10.0. The regressor obtained using all the local minima obtained, depicted by squares, is shown as the solid convex curve
6.3.2 Two Dimensional Case: Ackley’s Function

Ackley’s function is a multimodal benchmark optimization problem that is widely used for testing global optimization algorithms. The n-dimensional Ackley’s function is given by

$$f(x) = -a \exp\left(-b \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}\right) - \exp\left(\frac{1}{n} \sum_{i=1}^{n} \cos(c x_i)\right) + a + e,$$
where a = 20, b = 0.2, and c = 2π . Its global minimum is located at xi = 0, ∀i = 1, 2, . . . , n, with the function value f (0) = 0. For the purpose of illustration, we consider the two dimensional Ackley’s function. As seen in Fig. 6.5, Ackley’s function has a large number of local minima that hinder the search for the global minimum.
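For readers who wish to experiment with this benchmark, the definition above translates directly into code. The following is a small sketch; the function name `ackley` and its keyword arguments are our own choices, with defaults set to the parameter values quoted in the text.

```python
import math


def ackley(x, a=20.0, b=0.2, c=2.0 * math.pi):
    """n-dimensional Ackley function; x is a sequence of floats."""
    n = len(x)
    mean_sq = sum(xi * xi for xi in x) / n             # (1/n) * sum of x_i^2
    mean_cos = sum(math.cos(c * xi) for xi in x) / n   # (1/n) * sum of cos(c x_i)
    return -a * math.exp(-b * math.sqrt(mean_sq)) - math.exp(mean_cos) + a + math.e
```

At the origin both exponentials reach their largest values and the terms cancel, so `ackley([0.0, 0.0])` evaluates to zero up to floating-point rounding.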
Fig. 6.5 Ackley’s function. A huge number of local minima are seen that obstruct the search for the global minimum at (0, 0)
Figures 6.6 through 6.8 show the plots of the regressor function obtained after iterations 2, 3, and 4 respectively. Note that though both figures 6.7 and 6.8 look similar, there is a difference in the locations of their minima. The minimum of the bowl shaped Fit function of Fig. 6.8, when used as the start state for next local minimization procedure, led to the global minimum of Ackley’s function.
Fig. 6.6 Regressor obtained after iteration 2, while optimizing Ackley’s function
Fig. 6.7 Regressor obtained after iteration 3, while optimizing Ackley’s function
6.3.3 Comparison with PSO and GA on Higher Dimensional Problems

Particle Swarm Optimization (PSO) and Genetic Algorithms (GA) are evolutionary techniques that also use information revealed during search to generate new search points. We compare our algorithm with both these approaches on several global optimization benchmark problems, ranging in dimension (number of variables n) from
Fig. 6.8 Regressor obtained after iteration 4, while optimizing Ackley’s function. Local search starting from the minimum of this Fit function led to the global minimum
2 to 100. The Particle Swarm optimization toolbox was obtained from [30], while the Genetic Algorithm optimization toolbox (GAOT) is the one available at [16]. The next start-state in PSO and GA is obtained by simple mathematical or logical operations, whereas for GOSAM it is generated after determining the SVR followed by minimization of a quadratic problem. Therefore, an iteration of GOSAM takes more time than an iteration of either of these algorithms. Moreover, GA and PSO run a number of agents in parallel, whereas the current implementation of GOSAM is a sequential one. However, the difference in the number of function evaluations required is so dramatic that GOSAM always found the global minimum significantly faster. In all our experiments, we evaluated the three algorithms on three different performance criteria. The first criterion is the number of function evaluations required to reach the global optimum. The second criterion is the number of times the global optimum is reached in 20 runs, each from a randomly chosen starting point. The third measure is the average CPU time taken to reach the global optimum. Table 6.1 presents the results obtained. Each value indicated in the table is the average over 20 runs of the corresponding algorithm. For each run, the initial start state of all the algorithms was the same randomly chosen point. The reported results have been obtained on a PC with a 1.6 GHz processor and 512 MB RAM. The first row in the evaluation parameter for each benchmark function (Fn. Evals.) gives the average number of function evaluations required by each algorithm to find the global optimum. The number of times that the global optimum was obtained out of the 20 runs is given in the second row (GO. Obtd.). If the global minimum was not obtained in all runs, then the average value and the standard deviation of the best
optima obtained over all the runs is given within parentheses. The third row (T (s)) indicates the average time taken in seconds by each algorithm in a run.

Though any number of local minima may be used for building a predictor, we used a maximum of 100 local minima. The 101st local minimum overwrote the 1st local minimum obtained, and so on. In each case, the search was confined to lie within the box [−10, 10]^n, where n is the dimension. In all our experiments, we used the conventional SVR framework [14]. Techniques such as SMO [28] or online SVMs [7] can be used to speed up the training process further. Our focus in this work is to show the use of machine learning techniques to help predict the location of better local minima. The parameters for GA and PSO (c1 = 2, c2 = 2, c3 = 1, chi = 1, and swarm size = 20) were kept the same as the default ones. For GOSAM, the SVR parameters were taken to be ε = 10^−3, C = 1000, and the kernel to be the second-degree polynomial kernel with t = 1.

Table 6.1 shows that GOSAM consistently outperforms both PSO and GA by a large margin. This difference gets dramatically highlighted in higher dimensions. Finding the global minimum becomes increasingly difficult as the dimension n increases; PSO and GA fail to find the global optimum in many cases, despite a large number of function evaluations. However, GOSAM always found the global minimum after a relatively small number of function evaluations (the count of function evaluations for GOSAM also includes the number of times the objective function was evaluated during local search). We believe that this result is significant, because it shows that GOSAM scales very effectively to large dimensional problems. The experimental results strikingly demonstrate that GOSAM not only finds the global optimum consistently, but also does so with significantly fewer function evaluations.
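The cap of 100 stored local minima can be realized with a bounded first-in-first-out buffer. The snippet below is one possible sketch, not the authors' MATLAB code: it uses Python's `collections.deque` with `maxlen`, which discards the oldest entry when a new one arrives rather than overwriting a slot in place, so the stored set is the same. The names `MAX_MINIMA`, `training_set` and `remember` are hypothetical.

```python
from collections import deque

MAX_MINIMA = 100  # cap on stored local minima, as stated in the text

# Bounded buffer: appending to a full deque silently drops its oldest entry.
training_set = deque(maxlen=MAX_MINIMA)


def remember(x_star, f_star):
    """Store a local minimum and its objective value for the next regression fit."""
    training_set.append((tuple(x_star), f_star))
```

After 101 calls to `remember`, the buffer holds the 2nd through 101st minima, which matches the overwrite rule described above.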
6.4 Extension to Constrained Optimization Problems

Constrained optimization problems are usually solved by solving related unconstrained problems, which are obtained through the use of penalty or barrier functions. We take recourse to Sequential Unconstrained Minimization Techniques (SUMTs), which we briefly review.
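The penalty idea can be illustrated on a toy one-dimensional problem: a squared penalty on constraint violation is added to the objective, the penalized function is minimized without constraints, and the penalty weight is then increased and the process repeated from the previous minimizer. The sketch below is illustrative only; the helper names, the toy problem, the starting weight and the number of stages are all assumptions, and the derivative-free descent is merely a convenient stand-in for any unconstrained minimizer.

```python
def minimize_1d(func, x, step=0.1, tol=1e-10):
    """Derivative-free descent used as the unconstrained local minimizer."""
    fx = func(x)
    while step > tol:
        for cand in (x + step, x - step):
            fc = func(cand)
            if fc < fx:
                x, fx = cand, fc
                break
        else:
            step /= 2.0
    return x


def quadratic_penalty(objective, constraints, x0, alpha=1.0, growth=1.1, stages=200):
    """Approximate min a(x) s.t. g_i(x) <= 0 by a sequence of penalized problems."""
    x = x0
    for _ in range(stages):
        def penalized(z, w=alpha):
            # Objective plus weighted squared violation of each constraint.
            return objective(z) + w * sum(max(0.0, g(z)) ** 2 for g in constraints)
        x = minimize_1d(penalized, x)   # warm start from the previous minimizer
        alpha *= growth                 # weight grows geometrically (10% per stage)
    return x


# Toy problem (illustrative): minimize x^2 subject to 1 - x <= 0, i.e. x >= 1.
x_opt = quadratic_penalty(lambda x: x * x, [lambda x: 1.0 - x], x0=0.0)
```

For this toy problem each penalized minimizer is α/(1 + α), which approaches the constrained solution x = 1 from the infeasible side as the weight α grows.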
6.4.1 Sequential Unconstrained Minimization Techniques

SUMTs comprise a class of non-linear programming methods that solve a sequence of unconstrained optimization tasks. Given a problem of the form

$$\text{Minimize} \quad a(x) \tag{6.1}$$

subject to the constraints

$$g_i(x) \le 0, \quad \text{for } i = 1, \ldots, M, \tag{6.2}$$
Table 6.1 Comparison of GOSAM with PSO and GA on Difficult Benchmark Problems

| N | Benchmark function | Evaluation parameter | GOSAM | PSO | GAOT |
|---|---|---|---|---|---|
| 2 | Ackley | Fn. Evals. | 122.75 | 12580.0 | 2202.75 |
| | | GO Obtd. | 20 | 20 | 20 |
| | | T(s) | 0.02970 | 0.535524 | 0.677470 |
| 2 | Rastrigin | Fn. Evals. | 129.5 | 23037.0 | 2198.15 |
| | | GO Obtd. | 20 | 20 | 20 |
| | | T(s) | 0.0328 | 0.913196 | 0.662956 |
| 2 | Griewangk | Fn. Evals. | 108.95 | 91824.0 | 1180.15 |
| | | GO Obtd. | 20 | 20 | 19‡ (0.0057 ± 2.5e5) |
| | | T(s) | 0.02265 | 3.758419 | 0.357913 |
| 2 | Rotated Hyper Ellipsoid | Fn. Evals. | 24.35 | 22542.0 | 647.75 |
| | | GO Obtd. | 20 | 20 | 20 |
| | | T(s) | 0.0078 | 0.972250 | 0.236874 |
| 2 | Rosenbrock’s Valley | Fn. Evals. | 105.05 | 49702.00 | 10600.75 |
| | | GO Obtd. | 20 | 20 | 20 |
| | | T(s) | 0.0148 | 2.046909 | 3.19688 |
| 2 | Schwefel | Fn. Evals. | 37.4 | 24000 | 866.30 |
| | | GO Obtd. | 20 | 1† | 20 |
| | | T(s) | 0.01325 | 5.338205 | 0.268143 |
| 2 | Branin’s Rcos | Fn. Evals. | 81.85 | 52341.0 | 649.75 |
| | | GO Obtd. | 20 | 20 | 20 |
| | | T(s) | 0.01795 | 2.371105 | 0.206017 |
| 2 | Six Hump Camelback | Fn. Evals. | 64.5 | 29892.0 | 643.95 |
| | | GO Obtd. | 20 | 20 | 20 |
| | | T(s) | 0.01715 | 1.213340 | 0.199104 |
| 10 | Ackley | Fn. Evals. | 208.09 | 145746.0 | 11226.10 |
| | | GO Obtd. | 20 | 20 | 20 |
| | | T(s) | 0.03516 | 6.636169 | 3.973002 |
Table 6.1 (continued)

| N | Benchmark function | Evaluation parameter | GOSAM | PSO | GAOT |
|---|---|---|---|---|---|
| 10 | Rastrigin | Fn. Evals. | 298.65 | 300040.0 | 11219.25 |
| | | GO Obtd. | 20 | 0‡ (3.035 ± 2.054) | 12‡ (0.4478 ± 0.4587) |
| | | T(s) | 0.0398 | 12.854138 | 3.435049 |
| 10 | Rotated Hyper Ellipsoid | Fn. Evals. | 46.9 | 105417.00 | 4766.35 |
| | | GO Obtd. | 20 | 20 | 20 |
| | | T(s) | 0.01255 | 4.995237 | 1.448820 |
| 10 | Rosenbrock’s Valley | Fn. Evals. | 2177.90 | 300040.0 | 22917.3 |
| | | GO Obtd. | 20 | 3‡ (1.9104 ± 1.2932) | 3‡ (2.9247 ± 3.0229) |
| | | T(s) | 0.04460 | 12.467250 | 7.737082 |
| 100 | Ackley | Fn. Evals. | 7437.4 | 300040.0 | 24708.40 |
| | | GO Obtd. | 20 | 0‡ (8.0094 ± 5.7940) | 0‡ (1.6847 ± 0.2302) |
| | | T(s) | 8.707 | 24.979903 | 14.027134 |
| 100 | Rastrigin | Fn. Evals. | 4931.5 | 2000040.00 | 36528.90 |
| | | GO Obtd. | 20 | 0‡ (486.3559 ± 737.8523) | 0‡ (60.2946 ± 8.7735) |
| | | T(s) | 7.615 | 128.840828 | 23.566207 |
| 100 | Rotated Hyper Ellipsoid | Fn. Evals. | 431.5 | 292666.0 | 37001.70 |
| | | GO Obtd. | 20 | 0‡ (0.3350 ± 1.0915) | 0‡ (2.2683 ± 0.9803) |
| | | T(s) | 0.0823 | 46.069183 | 29.993730 |
| 100 | Rosenbrock’s Valley | Fn. Evals. | 8676.20 | 500040.0 | 61365.65 |
| | | GO Obtd. | 20 | 0‡ (14324550.17 ± 1535813.439) | 0‡ (357.198 ± 165.5572) |
| | | T(s) | 0.89920 | 17.913865 | 36.662314 |

† The global optimum obtained was not within the specified bounds.
‡ The global optimum was not obtained in all the 20 runs. The value in the corresponding parentheses indicates the mean and the standard deviation of the quality of global minima obtained in the 20 runs.
where a(x) is the objective function, and $g_i(x)$, for $i = 1, \ldots, M$, are the M constraints. One kind of SUMT, the quadratic penalty function method, minimizes a sequence of functions of the form ($p = 1, 2, \ldots, M$)

$$F_p(x) = a(x) + \sum_{i=1}^{M} \alpha_{pi} \max(0, g_i(x))^2, \tag{6.3}$$

where $\alpha_{pi}$ is a scalar weight, and p is the problem number. The minimizer for the pth problem in the sequence forms the guess or starting point for the (p + 1)th problem. The scalars change from one problem to the next based on the rule that $\alpha_{pi} \ge \alpha_{(p-1)i}$; they are typically increased geometrically, by say 10%. These weights indicate the relative emphasis of the constraints and the objective function. In the limit, as the weights on the constraints become overwhelmingly large, the sequence of minima of the unconstrained problems converges to a solution of the original constrained optimization problem.

We now illustrate the use of SUMT through the application of GOSAM to the graph coloring problem. Given a graph with a set of nodes or vertices, and an adjacency matrix D, the Graph Coloring Problem (GCP) requires coloring each node or vertex so that no two adjacent nodes have the same color. The adjacency matrix entry $d_{ij}$ is 1 if nodes i and j are adjacent, and 0 otherwise. A minimal coloring requires finding a valid coloring that uses the least number of colors. The GCP can be solved through an energy minimization approach. We used an approach based on the Compact Analogue Neural Network (CANN) formulation [17]. In this approach, an N-vertex GCP is solved by considering a network of N neurons, whose outputs denote the node colors. The outputs are represented by a set of real numbers $X_i$, $i = 1, 2, \ldots, N$. The color is not assumed to be an integer as is done conventionally. The GCP is solved by minimizing a sequence ($p = 1, 2, \ldots$) of functions of the form

$$E = \frac{A}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (1 - d_{ij})\, V_m \ln \cosh \beta (X_i - X_j) + \frac{B_p}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} d_{ij}\, V_m \ln \frac{\cosh \beta (X_i - X_j + \delta)\, \cosh \beta (X_i - X_j - \delta)}{\cosh^2 \beta (X_i - X_j)} \tag{6.4}$$
In keeping with the earlier literature on neural network approaches to the GCP, we term E in (6.4) as an energy function. The first term of equation (6.4) is present only for di j = 0, i.e. for non-adjacent nodes. The term is minimized when Xi = X j . The term therefore minimizes the number of distinct colors used. The second term is minimized if the values of Xi and X j corresponding to adjacent nodes differ by at least δ . This term corresponds to the adjacency constraint in the GCP, and becomes large as the problem sequence index p increases. Nodes colored by colors that differ by less than δ correspond to nodes with identical colors.
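The behavior of the two terms of the energy function can be checked numerically on very small graphs. The sketch below is an illustration rather than the CANN implementation of [17]: it reads the denominator of the second term in (6.4) as cosh² β(X_i − X_j), the default parameter values (A, B, β, δ, V_m) are arbitrary choices made here, and the names `log_cosh` and `gcp_energy` are hypothetical.

```python
import math


def log_cosh(z):
    """Numerically stable ln(cosh(z)); avoids overflow for large |z|."""
    z = abs(z)
    return z + math.log1p(math.exp(-2.0 * z)) - math.log(2.0)


def gcp_energy(X, D, A=1.0, B=1.0, beta=5.0, delta=1.0, Vm=1.0):
    """Energy of Eq. (6.4) for node outputs X and a 0/1 adjacency matrix D."""
    n = len(X)
    color_term = 0.0      # pulls non-adjacent nodes towards equal outputs
    adjacency_term = 0.0  # pushes adjacent nodes at least delta apart
    for i in range(n):
        for j in range(n):
            diff = X[i] - X[j]
            if D[i][j]:
                adjacency_term += Vm * (log_cosh(beta * (diff + delta))
                                        + log_cosh(beta * (diff - delta))
                                        - 2.0 * log_cosh(beta * diff))
            else:
                color_term += Vm * log_cosh(beta * diff)
    return 0.5 * A * color_term + 0.5 * B * adjacency_term
```

For two adjacent nodes, the energy at outputs (0, 2δ) is much lower than at (0, 0); for two non-adjacent nodes, the energy is zero when the outputs coincide and positive otherwise, in line with the description of the two terms above.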
We used GOSAM to minimize the energy function corresponding to difficult GCP benchmark problems [1], which have a large number of connections. Of these, the Myciel instances are graphs based on the Mycielski transformation. These graphs are difficult to solve because they are triangle free (clique number 2) but the coloring number increases with the problem size. The “Huck” instance is a graph in which each node represents a character, and two nodes are connected by an edge if the corresponding characters encounter each other in Twain’s book “Huckleberry Finn”. In the “Games120” instance, the games played in a college football season are represented by a graph in which the nodes represent the college teams, and two teams are connected by an edge if they played each other during the season. The energy functions for these problems are very complex and lead to extremely hard global optimization problems. However, the constrained optimizer was able to obtain the optimal coloring for each of these instances. Table 6.2 sums up the results obtained.

Note that the starting value of B, and the amount of increment in B for successive iterations, are both related to the time taken to reach the optimal solution. One would like to start with a value of B that quickly takes us into the feasible region. This leads us to believe that a large value of B would do the trick. However, if the value of B is taken to be too large, then we might not be able to reach the optimal solution. Thus there is no obvious way to determine a good starting value of B; instead, it is based on educated guesses. A natural reasoning would be that for dense adjacency matrices a large value of B should be chosen, while a relatively smaller value would suffice for sparse adjacency matrices. If we reach the feasible region, then we could slowly and cautiously (making sure that we don’t exit the feasible region) increase the value of A till we reach the optimal solution.
We defer a more detailed discussion of this aspect as it is beyond the scope of this chapter.

Table 6.2 Constrained optimization on benchmark GCP instances

Instance   Nodes  Edges  Optimal coloring  Best solution obtained  Iterations required
Myciel3    11     20     4                 4                       3
Myciel4    23     71     5                 5                       5
Huck       74     301    11                11                      8
Games120   120    638    9                 9                       10
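The node and edge counts listed for the Myciel instances in Table 6.2 can be reproduced with a short sketch of the Mycielski transformation. The function names below are illustrative and do not come from the chapter, and the choice of a single edge as the starting graph is an assumption that happens to match the tabulated sizes.

```python
def mycielski(n, edges):
    """Return (n', edges') for the Mycielskian of a graph on nodes 0..n-1.

    edges is a set of pairs (u, v) with u < v.
    """
    new_edges = set(edges)
    for u, v in edges:
        # Shadow node n+u is joined to every neighbour of u (same for v).
        new_edges.add((u, n + v))
        new_edges.add((v, n + u))
    apex = 2 * n
    for u in range(n):
        # The apex node is joined to every shadow node.
        new_edges.add((n + u, apex))
    return 2 * n + 1, new_edges


def has_triangle(n, edges):
    """True if some edge has a common neighbour at both ends."""
    adj = {u: set() for u in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return any(adj[u] & adj[v] for u, v in edges)


def iterate_mycielski(steps):
    """Apply the transformation `steps` times, starting from a single edge."""
    n, edges = 2, {(0, 1)}
    for _ in range(steps):
        n, edges = mycielski(n, edges)
    return n, edges


for steps in (2, 3):
    n, edges = iterate_mycielski(steps)
    print(steps, n, len(edges), has_triangle(n, edges))
```

Two applications give 11 nodes and 20 edges, and three give 23 nodes and 71 edges, in both cases without triangles, which is consistent with the first two rows of the table.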
6.5 Design Optimization Problems
Designers are usually confronted with the problem of finding optimal settings for a large number of design parameters, with respect to several simulated product or process characteristics. Problems of design and synthesis in the electronic domain are generally constrained non-linear optimization problems. The principal characteristics of these problems are very time-consuming function evaluations and the absence of derivative information. In most cases, evaluating the cost or objective function requires a system simulation, and the function is rarely available in an analytical form. In fact, the use of classical optimization techniques to give an optimal solution is
Jayadeva, S. Shah, and S. Chandra
nearly impossible. For instance, VLSI design engineers carry out time-consuming function evaluations using circuit or other simulation tools, e.g., Spectre [2], and choose a circuit with optimal component values. Since there are still many possible design parameter settings and computer simulations are time-consuming, it is crucial to find the best possible design with a minimum number of simulations. We used GOSAM to solve several circuit optimization problems. The interface between the optimizer and the circuit simulators is shown in Fig. 6.9. Preliminary details of this work were reported in [18].
[Block diagram. Boxes: Optimizer, Interface, Netlist, Spectre (run a simulation), Output File. Arrow labels: updated design variables, update design variables, invoke Spectre, read, write function value, read, function value.]
Fig. 6.9 GOSAM’s interface with the circuit simulator Spectre
We initially start with values for the design variables that are provided by a designer, or choose them randomly. Since there are no analytical formulae to compute the output for the input design parameters, the function values are calculated by using a circuit simulator such as Spectre. The simulator writes the output value to a file, which is read by the interface and returned to GOSAM. GOSAM then uses SVR on the set of values obtained so far, to determine the Fit function. The SVR yields a smooth and differentiable regressor. GOSAM then computes the minimum of the Fit function, and sends it as the vector of new design parameters, to the interface. A key feature of this approach is that we can apply it even to scenarios where the objective function is not available in analytical form or is not differentiable. A major bonus is that examination of the Fit function yields information about what constitutes a good design. We now briefly discuss a few interesting circuit optimization examples.
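The file-based exchange described above can be sketched as follows. This is only an illustration under stated assumptions: the file names, the `p0=...` parameter format, and `fake_simulator` (a stand-in with a made-up quadratic cost) are all hypothetical, and no actual Spectre command line is assumed since the text does not give one.

```python
import os
import tempfile


def fake_simulator(netlist_path, output_path):
    """Hypothetical stand-in for a circuit simulator.

    Reads 'name=value' lines and writes a single cost value to a file.
    """
    with open(netlist_path) as f:
        values = [float(line.split("=")[1]) for line in f if "=" in line]
    cost = sum((v - 1.0) ** 2 for v in values)  # made-up objective
    with open(output_path, "w") as f:
        f.write(f"{cost!r}\n")


def evaluate(design, workdir, run_simulator=fake_simulator):
    """Write the design variables, run the simulator, read back the value."""
    netlist = os.path.join(workdir, "design.net")
    output = os.path.join(workdir, "result.out")
    with open(netlist, "w") as f:
        for i, value in enumerate(design):
            f.write(f"p{i}={value!r}\n")
    run_simulator(netlist, output)
    with open(output) as f:
        return float(f.read().strip())


with tempfile.TemporaryDirectory() as workdir:
    value = evaluate([1.0, 3.0], workdir)
    print(value)
```

From the optimizer's side, `evaluate` behaves like an ordinary black-box function. In a real setup, `run_simulator` would be replaced by a call that launches the actual tool.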
6.5.1 Sample and Hold Circuit
For a sample and hold circuit, the objective was to hold the sampled value as constant as possible during the hold period. The design variables are the widths of 22 MOSFETs, along with the values of four capacitors named C1, C2, C3, and C4. The transistor widths were constrained to lie between 250 nm and 1200 nm. Capacitor C3 was required to be between 1 fF and 5000 fF, while all other capacitors were constrained to lie between 1 fF and 500 fF. Simulations show that the sampled
value is maintained well during the hold period. To date, numerous complex VLSI circuits have been designed using GOSAM interfaced with the circuit simulator Spectre. The chosen circuits include Phase Locked Loops (PLLs), a variety of operational amplifiers, and filters. In these examples, transistor sizes and other component values have been selected to optimize specified objectives such as jitter, gain, phase margin, and power, while meeting specified constraints on other performance measures as well as on transistor sizes.
Fig. 6.10 Response of the optimized Sample-and-Hold Circuit, showing output voltage versus time. The goal was to keep the output constant during the hold period
6.5.2 Folded Cascode Amplifier
For a folded cascode amplifier, the design objective was to maximize the phase margin. The variables for the optimization task were taken to be the widths of 16 MOSFET transistors. The result obtained, depicted in Figure 6.11, shows that GOSAM obtained a maximum phase margin of 169.73°, as well as an excellent solution with a phase margin of around 120°. An industry-level commercial tool found a solution with a phase margin of around 59°.
6.6 Discussion
An important question relates to assumptions that may be implicitly or explicitly made regarding the function to be optimized. We mentioned previously that any local search mechanism could be used in conjunction with GOSAM. Figure 6.12 illustrates this with the help of a toy example. For the objective function shown by the dashed curve in Fig. 6.12, the gradient cannot be computed to reach two of the minima. A line search method is used in the outer triangular regions, while for the parabolic region in the middle the gradient is available and a simple gradient descent leads us to the local minimum. These three local minima, when used by SVR to construct the regressor, yield the parabolic shaped solid curve of Fig. 6.12. Local search starting from the minimum of this curve led to the global minimum.
Fig. 6.11 Phase margin versus iteration count for a folded cascode amplifier
Fig. 6.12 A toy example illustrating that any local minimizing procedure can be used with GOSAM. The function is depicted as the dotted curve. For the outer triangular regions, the gradient information cannot be used, so the local minima are found by a line search method. However for the inner parabolic region, the local minimum can be found using gradient descent. The regressor obtained is shown by the solid curve that passes through the local minima obtained
In the worst case, GOSAM performs similarly to a random multistart. This is because whenever it is not possible to use the minimum of the Fit function (for example, when it is out of bounds, or when almost the same minimum was given by the previous two iterations), GOSAM restarts the search from a random state. Therefore, in the worst case it will randomly explore the search space for new starting points. However, real applications never involve functions that are discontinuous everywhere, and we have not encountered this worst case.
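The restart rule just described might look like the sketch below. It is a reading of the text, not the authors' code: the function name, the tolerance `tol`, and the interpretation of "almost the same minimum" as closeness to both of the two most recent minima are all assumptions.

```python
import random


def next_start(fit_minimum, previous_minima, bounds, rng, tol=1e-6):
    """Return (start_point, restarted).

    The minimum of the Fit function is used unless it is out of bounds or
    nearly coincides with the minima of the previous two iterations; in
    those cases a uniformly random point inside the bounds is used instead.
    """
    in_bounds = all(lo <= x <= hi for x, (lo, hi) in zip(fit_minimum, bounds))

    def close(a, b):
        return max(abs(x - y) for x, y in zip(a, b)) <= tol

    recent = previous_minima[-2:]
    stalled = len(recent) == 2 and all(close(fit_minimum, p) for p in recent)

    if in_bounds and not stalled:
        return list(fit_minimum), False
    return [rng.uniform(lo, hi) for lo, hi in bounds], True


rng = random.Random(0)
bounds = [(0.0, 1.0), (0.0, 1.0)]
print(next_start([0.5, 0.5], [], bounds, rng))   # surrogate minimum is usable
print(next_start([1.5, 0.5], [], bounds, rng))   # out of bounds: random restart
```

If every call ends in the restart branch, the procedure reduces to a plain random multistart, which is the worst case mentioned above.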
Fig. 6.13 A toy example to illustrate that the regressor for the objective function f (x), denoted as Fit of f (x) is smoother than f (x). Recursively, the Fit for the Fit of f (x) is smoother than the Fit of f (x), and in the limit leads to a convex underestimate of f (x)
[Block diagram. Boxes: Client, Web Server, Optimizer. Arrow labels: client sends a request; server invokes instance of GOSAM; server requests function / sends optimized points; client sends function value at requested points.]
Fig. 6.14 Testing: A web based service
Minimization of the regressor function is an essential step in GOSAM. In all our experiments we used local search to accomplish this step. A doubt that comes to mind is what might happen if the Fit function itself turns out to have multiple local minima. Such a situation is certainly possible, and is theoretically interesting. An alternative approach that we suggest is to use GOSAM recursively. This idea is intuitive because the regressor function, called Fit function in Fig 6.13, is smoother than the objective function, as it is a smooth interpolation of only the local minima of the objective function encountered earlier. Therefore, a Fit of the Fit function’s
local minima would be even smoother. This is depicted pictorially in Fig. 6.13, which uses a hypothetical example to illustrate what the application of GOSAM to f(x), and recursively to the Fit functions, might achieve. The original function f(x) has a number of minima. As can be seen, the number of minima reduces at each step and the sequence of recursively computed Fit functions becomes increasingly smooth; the sequence terminates at a convex function that is related to the double conjugate of the original function. However, local minimization of the Fit function, as is done in the present implementation, seems to be more than adequate. It is possible to construct functions where GOSAM’s strategy will fail. For example, it would be impossible to learn any structure from a function with a uniform distribution of randomly located minima, or a function that is discontinuous almost everywhere. However, on most problems of any practical interest, small perturbations from a local minimum will lead us to another locally minimal configuration. This implies that a learning tool can be used to predict locations of other minima from the knowledge of only a few.
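To give a feel for the convex limit mentioned above, the sketch below computes the lower convex envelope of a sampled one-dimensional function, i.e. the largest convex piecewise-linear curve lying on or below the samples. This is not GOSAM's recursive Fit procedure; it is a direct geometric construction (a lower convex hull), offered only as an illustration of what a convex underestimate looks like. The sample values are made up.

```python
def lower_convex_envelope(points):
    """Vertices of the lower convex hull of (x, f(x)) samples, sorted by x."""
    hull = []
    for p in sorted(points):
        # Drop the last vertex while it lies on or above the chord
        # from hull[-2] to the new point p.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull


# Hypothetical samples with three local minima, at x = 1, 3 and 5.
samples = [(0, 3), (1, 1), (2, 2), (3, 0), (4, 2), (5, 1), (6, 3)]
print(lower_convex_envelope(samples))
```

The envelope keeps the vertices at x = 0, 1, 3, 5 and 6 and has a single minimum at x = 3, whereas the original samples have three local minima.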
6.7 Conclusion and Future Work
In this paper, we presented GOSAM, a fast and effective multistart global minimization algorithm for solving optimization problems. GOSAM applies support vector regression on the training set formed by previously discovered local minima, to guide search towards better local minima. This is different from approximating a fitness landscape; GOSAM attempts to predict how local minima are distributed, and where the best one might lie. A regressor that fits local minima is smoother than one that tries to fit the original function. Approximating the fitness landscape requires fitting all points and not just a few minima. The use of Support Vector Regression allows only support vectors to be retained, and redundant information can be discarded. Experimental results on large benchmarks show that GOSAM searches far more efficiently, uses significantly fewer function evaluations, and finds the global optimum more consistently than other state-of-the-art methods. The effectiveness of GOSAM confirms that the generalizing ability of SVRs is very useful in predicting where good local minima lie. We have also shown how GOSAM can be applied to unconstrained tasks, as well as combinatorial optimization tasks such as graph coloring, that are traditionally solved as integer programming problems. GOSAM does not require the function to be known in terms of an analytical expression. It is enough to have a black box that can evaluate the function at a chosen point. This allows GOSAM to be interfaced to any such black box. We have presented results in the VLSI domain, where GOSAM has been interfaced to a commercial circuit simulator and used to optimize MOSFET sizes and component values to meet desired objectives subject to specified constraints. The objectives are typically complex, such as phase margin of a folded cascode amplifier, or jitter in a Phase Locked Loop.
A current version of GOSAM is equipped with a web interface that allows a user to access it without revealing information about the function being optimized. The set up of the web based service is shown in Fig. 6.14. As the figure illustrates,
only vectors and corresponding cost values are exchanged between the GOSAM server and a client running a simulator or emulator. This allows GOSAM to be provided as a service across the web while protecting proprietary information about the optimizer and the objective function. Other aspects worthy of investigation include the use of different approaches to SVR, such as online learning techniques, and parallelizing operations in GOSAM to speed up search. Ongoing efforts include extending GOSAM to multi-objective optimization tasks. GOSAM may be obtained from the authors for non-commercial academic use on a trial basis.

Acknowledgements. The authors would like to thank Dr. R. Kothari of IBM India Research Laboratory, Prof. R. Newcomb, University of Maryland, College Park, USA, and Prof. S.C. Dutta Roy of the Department of Electrical Engineering, IIT Delhi, for their valuable comments and a critical appraisal of the manuscript.
References
1. http://mat.gsia.cmu.edu/COLOR02/
2. http://www.cadence.com/products/custom ic/spectre/index.aspx
3. Agakov, F., Bonilla, E., Cavazos, J., Franke, B., Fursin, G., O’Boyle, M., Thomson, J., Toussaint, M., Williams, C.: Using machine learning to focus iterative optimisation. In: Proceedings of the 4th Annual International Symposium on Code Generation and Optimization (CGO), New York, NY, USA, pp. 295–305 (2006)
4. Baluja, S., Barto, A., Boese, K., Boyan, J., Buntine, W., Carson, T., Caruana, R., Davies, S., Dean, T., Dietterich, T., Hazlehurst, S., Impagliazzo, R., Jagota, A., Kim, K., McGovern, A., Moll, R., Moss, E., Perkins, T., Sanchis, L., Su, L., Wang, X., Wolpert, D.: Statistical machine learning for large-scale optimization. Neural Computing Surveys 3, 1–58 (2000)
5. Black, F., Litterman, R.: Global portfolio optimization. Financial Analysts Journal 48(5), 28–43 (1992)
6. Boese, K., Kahng, A.B., Muddu, S.: A new adaptive multi-start technique for combinatorial global optimizations. Operations Research Letters 16(2), 101–113 (1994)
7. Bordes, A., Bottou, L.: The huller: A simple and efficient online SVM. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 505–512. Springer, Heidelberg (2005), http://leon.bottou.org/papers/bordes-bottou-2005
8. Boyan, J.: Learning evaluation functions for global optimization. PhD dissertation, CMU (1998)
9. Boyan, J., Moore, A.: Learning evaluation functions for global optimization and boolean satisfiability. In: Proceedings of the Fifteenth National Conference on Artificial Intelligence, vol. 15, pp. 3–10. John Wiley and Sons Ltd., Chichester (1998), http://www.cs.cmu.edu/~jab/cv/pubs/boyan.stage2.ps.gz
10. Dorigo, M., Maniezzo, V., Colorni, A.: Ant system: Optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics-Part B 26(1), 29–41 (1996), http://iridia.ulb.ac.be/~mdorigo/ACO/publications.html
11. GEATbx: Genetic and evolutionary algorithm toolbox (1994), http://www.geatbx.com/docu/fcnindex.html
12. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Longman Publishing Co., Inc., Boston (1989)
13. Grossmann, I.E. (ed.): Global optimization in engineering design. Kluwer Academic Publishers, Dordrecht (1996)
14. Gunn, S.: Support vector machines for classification and regression. Technical report, Image Speech and Intelligent Systems Research Group, University of Southampton, UK (1998), http://www.isis.ecs.soton.ac.uk/resources/svminfo/
15. Horst, R., Tuy, H.: Global optimization: deterministic approaches. Springer, Berlin (1993)
16. Houck, C., Joines, J., Kay, M.: A genetic algorithm for function optimization: A matlab implementation. NCSU-IE TR 95-09 (1995), http://www.ie.ncsu.edu/mirage/GAToolBox/gaot/
17. Jayadeva, Dutta Roy, S.C., Chaudhary, A.: Compact analogue neural network: A new paradigm for neural based combinatorial optimisation. IEE Proc-Circuits Devices Syst. 146(3) (1999)
18. Jayadeva, Shah, S., Chandra, S.: Learning to optimize vlsi design problems. In: INDICON, pp. 1–5. IEEE, New Delhi (2006)
19. Johnson, D., McGeoch, L.: The travelling salesman problem: A case study in local optimisation. In: Local Search in Combinatorial Optimisation, pp. 215–310. John Wiley and Sons, London (1997)
20. Kazerounian, K., Wang, Z.: Global versus local optimization in redundancy resolution of robotic manipulators. The International Journal of Robotics Research 7(5), 2–12 (1988)
21. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of the IEEE International Conference on Neural Networks, Perth, Australia, vol. 4, pp. 1942–1948 (1995)
22. Kirkpatrick, S., Gelatt, C., Vecchi, M.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
23. LINDO SYSTEMS Inc.: LINDO API User’s Manual (2002)
24. Madsen, K.: Test problems for global optimization, http://www2.imm.dtu.dk/~km/Test_ex_forms/test_ex.html
25. Mangasarian, O.: Nonlinear Programming. SIAM, Philadelphia (1994)
26. Moles, C., Mendes, P., Banga, J.: Parameter estimation in biochemical pathways: a comparison of global optimization methods. Genome Research 13, 2467–2474 (2003)
27. Neumaier, A.: Global optimization, http://www.mat.univie.ac.at/~neum/glopt/applications.html
28. Platt, J.: Fast Training of Support Vector Machines using Sequential Minimal Optimization. In: Advances in Kernel Methods - Support Vector Learning, pp. 185–208. MIT Press, Cambridge (1999)
29. Russell, S., Norvig, P.: Artificial intelligence: a modern approach. Prentice Hall, Englewood Cliffs (1995)
30. Singh, J.: PSO algorithm toolbox (2003), http://psotoolbox.sourceforge.net/
31. Wang, M., Yang, X., Sarrafzadeh, M.: Congestion minimization during placement. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 19(10), 1140–1148 (2000)
Chapter 7
Multi-objective Optimization Using Surrogates
Ivan Voutchkov and Andy Keane
Abstract. Until recently, optimization was regarded as a discipline of rather theoretical interest, with limited real-life applicability due to the computational or experimental expense involved. Practical multiobjective optimization was considered almost a utopia even in academic studies, because this expense is multiplied. This paper discusses the idea of using surrogate models for multiobjective optimization. With recent advances in grid and parallel computing, more companies are buying inexpensive computing clusters that can work in parallel. This allows, for example, efficient fusion of surrogates and finite element models into a multiobjective optimization cycle. The research presented here demonstrates this idea using several response surface methods on a pre-selected set of test functions. We aim to show that there are a number of techniques that can be used to tackle difficult problems, and we also demonstrate that a careful choice of response surface methods is important when carrying out surrogate-assisted multiobjective search.
Ivan Voutchkov
University of Southampton, Southampton SO17 1BJ, United Kingdom
e-mail: [email protected]
Andy Keane
University of Southampton, Southampton SO17 1BJ, United Kingdom
e-mail: [email protected]

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 155–175. © Springer-Verlag Berlin Heidelberg 2010. springerlink.com

7.1 Introduction
In the world of real engineering design, there are often multiple targets which manufacturers are trying to achieve. For instance, in the aerospace industry a general problem is to minimize weight, cost and fuel consumption while keeping performance and safety at a maximum. Each of these targets might be easy to achieve individually. An airplane made of balsa wood would be very light and would have low fuel consumption; however, it would not be structurally strong enough to perform at high speeds or carry a useful payload. Also, such an airplane might not be very safe, i.e., robust to various weather and operational conditions. On the other hand, a solid body and a very powerful engine will make the aircraft structurally sound and able to fly at high speeds, but its cost and fuel consumption will increase enormously. So engineers are continuously making trade-offs and producing designs that will satisfy as many requirements as possible, while industrial, commercial and ecological standards are at the same time getting ever tighter. Multiobjective optimization (MO) is a tool that aids engineers in choosing the best design in a world where many targets need to be satisfied. Unlike conventional optimization, MO will not produce a single solution, but rather a set of solutions, most commonly referred to as the Pareto front (PF) [12]. By definition it will contain only non-dominated solutions (see footnote 1). It is up to the engineer to select a final design by examining this front. Over the past few decades, with the rapid growth of computational power, the focus in optimization algorithms in general has shifted from local approaches that find the optimal value with the minimal number of function evaluations to more global strategies, which are not necessarily as efficient as local searches but (some more than others) promise to converge to global solutions, the main players being various strands of genetic and evolutionary algorithms. At the same time, computing power has essentially stopped growing in terms of flops per CPU core. Instead, parallel processing is an integral part of any modern computer system. Computing clusters are ever more accessible through various techniques and interfaces such as multi-threading, multi-core, Windows HPC, Condor, Globus, etc. Parallel processing means that several function evaluations can be obtained at the same time, which perfectly suits the ideology behind genetic and evolutionary algorithms.
For example, genetic algorithms are based on an idea borrowed from biological reproduction, where the offspring of two parents copy the best genes of their parents but also introduce some mutation to allow diversity. The entire generation of offspring produced by the parents in a generation represents designs that can be evaluated in parallel. The fittest individuals survive and are copied into the next generation, whilst weak designs are given a small random chance of survival. Such parallel search methods are conveniently applicable to multiobjective optimization problems, where the fitness of an individual is measured by how close to the Pareto front the design is. All individuals are ranked: those that are part of the Pareto front get the lowest (best) rank, the next best have a higher rank, and so on. Thus the multiobjective optimization is reduced to a single-objective minimization of the rank of the individuals. This idea has been developed by Deb and implemented in NSGA2 [5]. In the context of this paper, the aim of MO is to produce a well spread out set of optimal designs with as few function evaluations as possible. There are a number of methods published and widely used to do this: MOGA, SPEA, PAES, VEGA, NSGA2, etc. Some are better than others; generally the most popular in the literature are NSGA2 (Deb) and SPEA2 (Zitzler), because they are found to achieve good results for most problems [2, 3, 4, 5, 6]. The first is based on genetic algorithms and the second on an evolutionary algorithm, both of which are known to need many function evaluations. In real engineering problems the cost of evaluating a design is probably the biggest obstacle that prevents extensive use of optimization procedures. In the multiobjective world, this cost is multiplied, because there are multiple expensive results to obtain. Directly evaluating a finite element model can take several days, which makes it very expensive to try hundreds or thousands of designs.

1 Non-dominated designs are those where, to improve performance in any particular goal, performance in at least one other goal must be made worse.
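The ranking idea described in the introduction (Pareto-front members get the best rank, the next layer the next rank, and so on) can be sketched as plain non-dominated sorting. This is a simple quadratic-time illustration assuming all objectives are minimized. It is not the NSGA2 implementation, which also uses a crowding measure that is omitted here, and the function names and sample designs are made up.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_ranks(objectives):
    """Rank 1 = non-dominated; rank k = non-dominated once ranks < k are removed."""
    ranks = [0] * len(objectives)
    remaining = set(range(len(objectives)))
    rank = 0
    while remaining:
        rank += 1
        front = {
            i for i in remaining
            if not any(dominates(objectives[j], objectives[i])
                       for j in remaining if j != i)
        }
        for i in front:
            ranks[i] = rank
        remaining -= front
    return ranks


# Five hypothetical designs with two objectives each (both minimized).
designs = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
print(pareto_ranks(designs))
```

The first three designs are mutually non-dominated and get rank 1, (3, 4) is dominated only by (2, 3) and gets rank 2, and (5, 5) gets rank 3. A single-objective search can then minimize this rank.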
7.2 Surrogate Models for Optimization
It seems that increased computing power leads to increased hunger for even more computing power, as engineers realise that they can run more detailed and realistic models. In essence, from an engineering point of view, the available computing power is never enough, and this tendency does not seem to be changing, at least in the foreseeable future. To put these words into perspective: to be useful to an engineering company, a modern optimization approach should be able to tackle a global multiobjective optimization problem in about a week. The problem would typically have 20-30 variables, 2-5 objectives and 2-5 constraints, with evaluation times of about 12-48 h per design and often per objective. Unless you have access to 5000-7000 parallel CPUs, the only way to currently tackle such problems is to use surrogate models. In the single objective world, approaches using surrogate models are fairly well established and have proven to deal successfully with the problem of computational expense (see Fig. 7.1) [22]. Since their introduction, more and more companies have adopted surrogate-assisted optimization techniques and some are making steps to incorporate this approach in their design cycle as standard. The reason for this is that instead of using the expensive computational models during the optimization step, they are substituted with a much cheaper but still accurate replica. This makes optimization not only useful, but usable and affordable. The key idea that makes surrogate models efficient is that they should become more accurate in the region of interest as the search progresses, rather than being equally accurate over the entire design space, as an FE representation will tend to be. This is achieved by adding to the surrogate knowledge base only at points of interest. The procedure is referred to as surrogate update. Various publications address the idea of surrogate models and multiobjective optimisation [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. As one would expect, no approximation method is universal. Factors such as function modality, number of variables, number of objectives, constraints, computation time, etc., all have to be taken into account when choosing an approximation method. The work presented here aims to demonstrate this diversity and hints at some possible strategies to make best use of surrogates for multi-objective problems.

Fig. 7.1 Direct search versus surrogate models for optimization
7.3 Multi-objective Optimization Using Surrogates
To illustrate the basic idea, the zdt1 – zdt6 test function suite [3] will be used to begin with. It is a good suite for demonstrating the effectiveness of surrogate models, as it is fairly simple for response surface (surrogate) modelling. Fig. 7.2 represents the zdt2 function and the optimisation procedure. It is a striking comparison, demonstrating the surrogate approach. The problem has two objective functions and two design variables. The Pareto front obtained using surrogates with 40 function evaluations is far superior to the one obtained without surrogates with the same number of function evaluations. On the other hand, 2500 evaluations without surrogates were required to obtain a Pareto front of similar quality to the one found with surrogates in 40 evaluations. The difference is even more significant if more variables are added (see Table 7.1). Here we have chosen objective functions with simple shapes to demonstrate the effectiveness of using surrogates. Both functions would be readily approximated using most available methods. It is not uncommon to have relationships of similar simplicity in real problems, although external noise factors could make them look rougher. Relationships with a higher order of multimodality would be more of a challenge for most methods, as will be demonstrated later.

Table 7.1 Full function evaluations for ZDT2 (Fig. 7.2)

Number of variables   Evaluations without surrogates   Evaluations with surrogates
2                     2500                             40
5                     5000                             40
10                    10000                            60

Fig. 7.2 A (left) – Function ZDT2; B (right) – ZDT2 – Pareto front achieved in 40 evaluations: Diamonds – Pareto front with surrogates; Circles – solution without surrogates
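For readers who want to experiment, the ZDT2 test function can be written in a few lines. The formula below is the form in which ZDT2 is commonly stated in the literature (all variables in [0, 1], both objectives minimized). It is not reproduced from this chapter, so treat it as an assumption about which variant the authors used.

```python
def zdt2(x):
    """Two-objective ZDT2 test function for a vector x with len(x) >= 2."""
    f1 = x[0]
    g = 1.0 + 9.0 * sum(x[1:]) / (len(x) - 1)
    f2 = g * (1.0 - (f1 / g) ** 2)
    return f1, f2


# With all remaining variables at zero, g = 1 and f2 = 1 - f1**2.
print(zdt2([0.5, 0.0]))
# Raising the second variable increases g and moves the point away from the front.
print(zdt2([0.5, 1.0]))
```

Both objectives are smooth, low-order functions of the variables, which fits the remark above that they are readily approximated by most response surface methods.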
7.4 Pareto Fronts - Challenges
Depending on the search algorithm, the quality of the Pareto front can vary greatly. Various characteristics describe a good quality Pareto front:
1. Spacing – better search techniques will space the points on the Pareto front uniformly rather than producing clusters. See Fig. 7.3a.
2. Richness – better search techniques will put more points on the Pareto front than others. See Fig. 7.3b.
3. Diversity – better search techniques will produce fronts that are spread out better with respect to all objectives. See Fig. 7.3c.
4. Optimality – better search techniques will produce fronts that dominate the fronts produced by less good techniques. In test problems this is usually measured as ‘generational distance’ to an ideal Pareto front. We discuss this later. See Fig. 7.3d.
5. Globality – the obtained Pareto front is global as opposed to local. As in single objective optimization, in the multiobjective world it is also possible to have local and global optimal solutions. This concept is demonstrated using the F5 test function (a full description is given in sections 15.4 and 15.5). Fig. 7.4 illustrates the function and the optimization procedure. Due to the sharp nature of the global solution, it cannot be guaranteed that the correct solution will be found with a small number of GA evaluations. Furthermore, since the surrogate is based only on sampled data, if this data does not contain any points in the area of the global optimum, then the surrogate will never know about its existence. Therefore any optimization based only on such surrogates will lead us to the local solution. For this reason, conventional optimization approaches based on surrogate models rely on constant updating of the surrogate. A widely accepted technique in single objective optimization is to update the surrogate with its current optimal solution. In multiobjective terms this translates to updating the surrogate with one or more points belonging to its Pareto front. If the surrogate Pareto front is local and not global, then the next update will also be around the local Pareto front. Continuing with this procedure, the surrogate model will become more and more accurate in the area of the local optimal solution, but will never know about the existence of the global solution.
6. Robust convergence from any start design with any random number sequence. It turns out that the success of a conventional multiobjective optimization based on surrogates, using updates at previously found optimal locations, depends strongly on the initial data used to train the first surrogate before any updates are added. If this data happens to contain points around the global Pareto front, then the algorithm will be able to converge quickly and find a good global Pareto front. However, the odds are that the local Pareto fronts have smoother shapes that are easier to find, and in most cases this is where the procedure will converge unless suitable global exploration steps are taken.
7. Efficiency and convergence – better search techniques will converge using fewer function evaluations.
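Two of the characteristics listed above, spacing and optimality, are often turned into numbers. The sketch below shows one simple way to do so. The definitions are simplified variants chosen for illustration (the published spacing and generational distance measures differ in details such as the distance norm and the normalisation), and the function names and sample fronts are assumptions.

```python
import math


def nearest_distances(points):
    """Distance from each point to its nearest neighbour in the same set."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def spacing(points):
    """Standard deviation of nearest-neighbour distances (0 = evenly spaced)."""
    d = nearest_distances(points)
    mean = sum(d) / len(d)
    return math.sqrt(sum((x - mean) ** 2 for x in d) / len(d))


def generational_distance(front, reference):
    """Mean distance from each obtained point to its closest reference point."""
    return sum(min(math.dist(p, r) for r in reference) for p in front) / len(front)


even = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
clustered = [(0.0, 1.0), (0.1, 0.9), (1.0, 0.0)]
print(spacing(even), spacing(clustered))
print(generational_distance(clustered, even))
```

A clustered front gives a larger spacing value than an evenly spread one, and a front that coincides with the reference front gives a generational distance of zero.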
Fig. 7.3 Pareto front potential problems - (a) clustering; (b) too few points; (c) lack of diversity; (d) non-optimality
Fig. 7.4 F5: Local and global solutions
7.5 Response Surface Methods, Optimization Procedure and Test Functions In a previous publication [20] we have shown that for complex and high-dimensional functions Kriging is the response surface method of choice [22]. We have also stressed the importance of applying a high level of understanding when using Kriging. There have been various publications that critique kriging, due to lack of understanding. Our opinion is that if the user understands the strengths and weaknesses of this approach it can become an invaluable tool, often the only one capable of producing meaningful results in reasonable times. Kriging is a Response Surface (RSM) method, designed in the 60’s for geological surveys [7]. It can be a very efficient RSM model for cases where it is expensive to obtain large amounts of data. A significant number of publications discuss the kriging procedure in detail. An important role for the success of the method is the tuning of its hyper parameters. It should be mentioned that researchers who have chosen rigorous training procedures, report positive results when using kriging, while publications that use basic training procedures often reject this method. Nevertheless, the method is becoming increasingly popular in the world of optimization as it often provides a surrogate with usable accuracy. This method was used to build surrogates for the above test cases, therefore it is useful to briefly outline its major pros and cons: Pros: • can always predict with no error at sample points, • the error in close proximity to sample points is minimal, • requires small number of sample points in comparison to other response surface methods, • reasonably good behaviour with high dimensional problems.
Cons:
• for a large number of data points and variables, training of the hyper-parameters and prediction may become computationally expensive.
Researchers should make a conscious decision when choosing Kriging for their RSMs. Such a decision should take into account the cost of a direct function evaluation including constraints (if any), the available computational power, and the dimensionality of the problem. Sometimes it might be possible to use kriging for one of the objectives while another is evaluated directly, or a different RSM is used to minimize the cost. As this paper aims to demonstrate various approaches to making better use of surrogate models, we will use kriging throughout, but most conclusions could be generalised to other RS methods as well. The chosen multiobjective algorithm is NSGA2. Other multiobjective optimizers might show slightly different behaviour. The basic procedure is as follows:
1. Carry out 20 LPtau [8] spaced initial direct function evaluations.
2. Train hyper-parameters, using a combination of GA and DHC (dynamic hill climbing) [23].
3. Choose a selection of update strategies with a specified number of updates.
4. Search the RSMs using each of the selected methods.
5. Select designs that are best in terms of ranking and space-filling properties.
6. Evaluate the selected designs and add them to the data set.
7. Produce the Pareto front and compare it with the previous one. Stop if 2-3 consecutive Pareto fronts are identical. Otherwise continue.
8. If the Pareto front contains too many points, choose a specified number of points that are furthest away from each other.
9. Repeat from step 2.
There are several possible stopping criteria:
• fixed number of update iterations,
• stop when all update points are dominated,
• stop if the percentage of new update points that belong to the Pareto front falls below a pre-defined value,
• stop if the percentage of old points on the current Pareto front rises above a pre-defined value,
• stop when there is no further improvement of the Pareto front quality.
The quality of the Pareto front is a complex multiobjective problem on its own. The best Pareto front could be defined as the one being as close as possible to the origin of the objective function space, while having the best diversity, i.e., spread on all
objectives and the points are evenly distributed. Metrics for assessing the quality of the Pareto front are discussed by Deb [3]. We have used the last of these criteria for our studies.
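Step 7 and the stopping test of the procedure above rest on two operations: extracting a non-dominated set from the evaluated designs and comparing consecutive fronts. The following is a minimal Python sketch of both, assuming that all objectives are minimized and that each design is given as a tuple of objective values. The names `dominates`, `pareto_front` and `should_stop` are illustrative only and are not taken from the OPTIONS suite.

```python
def dominates(a, b):
    """True if objective vector a dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the non-dominated subset of points, sorted for easy comparison."""
    front = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(set(front))


def should_stop(history, patience=2):
    """Stop when the last `patience` Pareto fronts are identical (step 7)."""
    if len(history) < patience:
        return False
    last = history[-patience:]
    return all(front == last[0] for front in last)


# Hypothetical objective values of four evaluated designs.
evaluated = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = pareto_front(evaluated)  # (3.0, 3.0) is dominated by (2.0, 2.0)
```

Comparing fronts for exact equality is the simplest reading of "identical"; with noisy objective functions a tolerance-based comparison would probably be more appropriate.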
7.6 Update Strategies and Related Parameters
One of the main aims of this publication is to show the effect of different update strategies and numbers of updates. Here we consider the following six approaches in various combinations:
• UPDMOD = 1; (Nr) – Random updates. These can help escape from local Pareto fronts and enrich the genetic material.
• UPDMOD = 2; (Nrsm) – RSM Pareto front. A specified number of points are extracted from the Pareto front obtained after the search of the response surface models of the objectives and constraints (if any). When the RSM Pareto front is rich it is possible to extract data that is uniformly distributed.
• UPDMOD = 3; (Nsl) – Secondary NSGA2 layer. A completely independent NSGA2 algorithm is applied directly to the non-RSM objective functions and constraints. This exploits the well-known property of NSGA2 that makes it (slowly) converge to global solutions. During each update iteration, the direct NSGA2 is run for one generation with a population size of Nsl. There are two strands to this approach. The first one is referred to as ‘decoupled’: the genetic material is completely independent from the other update strategies, and no entries other than those from the direct NSGA2 are used. The second strand is referred to as ‘coupled’: the genetic information is composed of suitable designs obtained by the other participating update strategies. Suitable designs are selected in terms of Pareto optimality, or rank in terms of NSGA2. Please note that although it might sound similar, this is a completely different approach from the Mμ GA algorithm proposed by Coello and Toscano (2000).
• UPDMOD = 4; (Nrmse) – Root of the Mean Squared Error (RMSE). When using kriging as a response surface model, it is possible to compute an estimate of the RMSE at no significant computational cost. The value of this metric is large where there are large gaps between data points. RMSE is minimal close to or at existing data points.
Therefore adding updates at the location of the maximum RMSE should significantly improve the quality and coverage of the response surface model. When dealing with multiple objectives/constraints it is appropriate to construct a Pareto front of maximum RMSEs for all objectives and extract Nrmse points from it.
• UPDMOD = 5; (Nie) – Expected improvement (EI). This is another kriging-specific function, which estimates how much improvement over the current best value can be expected at a new location. The update points are extracted from the Pareto front of the maximum values of the EI for all objectives. For constrained problems, the values of EI for all objectives are multiplied by the feasibility of the constraints, which is 1 for satisfied constraints, 0 for infeasible ones, and a rather smooth ridge around the constraint boundary, see Forrester et al. [22].
• UPDMOD = 6; (Nkmean) – The RSMs are searched using GA or DHC and points are extracted using a k-means cluster detection algorithm.
All these update strategies have their own strengths and weaknesses, and therefore a suitable combination should be carefully considered. The results section of this chapter provides some insights into the effects of each of these strategies when used in various combinations.
Additional Parameters that Can Affect the Search
The following parameters can also affect the performance of a multi-objective RSM search:
• RSMSIZE – number of points used for RSM construction. It is expected that the more points are used, the more accurate the RSM predictions become; however, this comes at an increasing training cost. Therefore the number of training points should be limited.
• EVALSIZE – number of points used during RSM evaluation. This stage is considerably less expensive than training and therefore more points can be used during the evaluation stage. Ultimately this should increase the density of quality material and therefore leave fewer gaps for the RSM to approximate.
• EPREWARD – endpoint reward factor. Higher values reward the end points of the Pareto front, and this improves its spread. Lower values increase the pressure on the GA to explore the centre of the Pareto front.
• GA NPOP and GA NGEN – the population size and number of generations used to search the RSM, RMSE and EI Pareto fronts.
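The expected improvement used by the UPDMOD = 5 strategy can be sketched in a few lines. The code below is a hedged illustration rather than the authors' implementation: it uses the standard closed form for a normally distributed kriging prediction, assumes minimization, and takes the kriging mean `mu`, the RMSE estimate `sigma` and the current best value `f_best` as inputs (these names, and the helper `constrained_ei`, are assumptions made for the example).

```python
import math


def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for minimization with a normal predictive distribution.

    mu     -- kriging prediction at the candidate point
    sigma  -- kriging RMSE estimate at the candidate point
    f_best -- best (smallest) objective value observed so far
    """
    if sigma <= 0.0:
        # No predictive uncertainty (e.g. at a sample point): no expected gain.
        return 0.0
    z = (f_best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (f_best - mu) * cdf + sigma * pdf


def constrained_ei(ei, feasibility):
    """Scale EI by a feasibility value in [0, 1], as described for constraints."""
    return ei * feasibility
```

The early return for `sigma <= 0` reflects the property noted above that kriging predicts sample points without error, so no improvement can be expected there.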
7.7 Test Functions
Several test functions of varying complexity have been chosen to demonstrate the RS methods for the purpose of multiobjective optimization. These functions are well known from the literature:
F5 (Fig. 7.4): High complexity shape – has a smooth and a sharp feature. The combination of both makes it easier for the optimization procedure to converge to the smooth feature, which represents a local Pareto front. The global Pareto front lies around the sharp feature, which is harder to reach. Two objectives, x(i) = 0 .. 1, i = 1, 2; no constraints [3], page 350.
ZDT1 - ZDT6: Clustered and discontinuous Pareto fronts. Shape complexity is moderate. Two objectives, n variables (in the present study n = 2), no constraints. x(i) = 0 .. 1, i = 1, 2 [3], page 357.
ZDT1cons: Same formulation as for ZDT1 but with 25 variables and 2 constraints. Constraints are described in [3], page 368.
Bump: The bump function, 25 variables, 2 objectives, 1 constraint. We have used the function as provided in [21], which is a single objective with two constraints. We have made one of the constraints into a second objective, so that the optimization problem is defined as: maximise the original objective, minimize the sum of
variables whilst keeping the product of the variables greater than 0.75. There are 25 variables, each varying between 0 and 3.
7.8 Pareto Front Metrics
To measure the performance of the various strategies discussed in this paper, we have adopted several metrics. Some of them use comparison to an ‘ideal’ solution, which is denoted by Q and represents the Pareto front obtained using direct search with a large number of function evaluations (20,000). All metrics are designed so that smaller is better.
7.8.1 Generational Distance ([3], p. 326)
The average of the minimum Euclidean distances between the points of the two Pareto fronts,
\[ gd = \frac{\sum_{i=1}^{|Q|} d_i}{|Q|}, \]
where
\[ d_i = \min_{k=1,\dots,|P|} \sqrt{\sum_{j=1}^{M} \left( f_j^{(i)} - p_j^{(k)} \right)^2} \]
is the Euclidean distance between the solution (i) and the nearest member of Q.
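As a rough illustration of the generational distance described above, the following Python sketch averages, over the points of an obtained front, the Euclidean distance to the nearest point of the ‘ideal’ front. The argument names `front` and `ideal`, and the reading that distances are measured from the obtained front to the ideal one, are assumptions made for this example.

```python
import math


def generational_distance(front, ideal):
    """Mean Euclidean distance from each point of `front` to its nearest
    neighbour in `ideal`; both are sequences of objective-value tuples."""
    nearest = [min(math.dist(f, p) for p in ideal) for f in front]
    return sum(nearest) / len(nearest)
```

A front that coincides with the ideal front gives a value of zero, which is consistent with the convention that smaller is better.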
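7.8.2 Spacing
Standard deviation of the absolute differences between the solution (i) and the nearest member of Q,
\[ sp = \sqrt{\frac{1}{|Q|} \sum_{i=1}^{|Q|} \left( d_i - \bar{d} \right)^2}, \qquad d_i = \min_{k=1,\dots,|P|} \sum_{j=1}^{M} \left| f_j^{(i)} - p_j^{(k)} \right|, \]
where \bar{d} is the mean of the d_i.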
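The spacing metric can be sketched in the same style. This is a minimal example under the same assumptions as before (obtained points in `front`, reference points in `ideal`, both hypothetical names); it uses the sum of absolute differences as the distance and the population form of the standard deviation, dividing by the number of points.

```python
import math


def spacing(front, ideal):
    """Standard deviation of the distances d_i, where d_i is the smallest
    sum of absolute objective differences between point i of `front` and
    any point of `ideal`."""
    d = [min(sum(abs(a - b) for a, b in zip(f, p)) for p in ideal)
         for f in front]
    mean = sum(d) / len(d)
    return math.sqrt(sum((x - mean) ** 2 for x in d) / len(d))
```

If all points lie at the same distance from their nearest reference point, the metric is zero regardless of how large that distance is, so spacing is best read alongside the generational distance.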
7.8.3 Spread
\[ \Delta = 1 - \frac{\sum_{m=1}^{M} d_m^{e} - \sum_{i=1}^{|Q|} \left| d_i - \bar{d} \right|}{\sum_{m=1}^{M} d_m^{e} + |Q| \, \bar{d}}, \]
where d_i is the absolute difference between neighbouring solutions. For compatibility with the above metrics, the value of the spread is subtracted from 1, so that a wider spread will produce a smaller value.
7.8.4 Maximum Spread
Normalized distance between the most distant points on the Pareto front. The distance is normalized against the maximum spread of the ‘ideal’ Pareto front. For compatibility with the above metrics, the value of the maximum spread is subtracted from 1, so that a wider spread will produce a smaller value,
\[ MS = 1 - \sqrt{\frac{1}{M} \sum_{m=1}^{M} \left( \frac{\max_{i=1,\dots,|Q|} f_m^{(i)} - \min_{i=1,\dots,|Q|} f_m^{(i)}}{P_m^{\max} - P_m^{\min}} \right)^2}. \]
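A minimal Python sketch of the maximum spread metric follows. It assumes that the normalizing range of each objective is taken from the extreme points of the ‘ideal’ front and that this range is non-zero; the names `front` and `ideal` are again illustrative.

```python
import math


def maximum_spread(front, ideal):
    """One minus the normalized extent of `front`, per objective, relative
    to the extent of `ideal`. Smaller values indicate a wider front."""
    n_obj = len(front[0])
    total = 0.0
    for m in range(n_obj):
        f_range = max(p[m] for p in front) - min(p[m] for p in front)
        p_range = max(p[m] for p in ideal) - min(p[m] for p in ideal)
        total += (f_range / p_range) ** 2  # assumes p_range > 0
    return 1.0 - math.sqrt(total / n_obj)
```

A front that spans the same range as the ideal front in every objective gives zero, and a front covering half of each range gives 0.5.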
7.9 Results
The study aims to show the effect of applying various update strategies, numbers of training and evaluation points, etc. The performance of each particular approach is measured using the metrics described in the previous section. An overall summary is given at the end of this section, but the best recipe appears to be highly problem dependent. It is also not possible to show all results for all functions due to limited space, and we have therefore chosen several that best represent the ideas discussed. To correctly appreciate the results, please bear in mind that they are meant to show diversity rather than a magic recipe that works in all situations.
7.9.1 Understanding the Results
The legend on the figures represents the selected strategy in the form [Nr]-[Nrsm]-[Nsl]-[Nrmse]-[Nie]-[Nkmean]MUPD[RSMSIZE]MEVL[EVALSIZE], so that 8-14-15-10-3-3MUPD50MEVL300 would represent 8 random update points, 14 RSM updates, 15 NSGA2 second-layer updates, 10 RMSE updates, 3 EI updates and 3 KMEAN updates, with 50 krig training points and 300 krig evaluation points. All approaches were given a maximum of 60 update iterations and a stopping criterion of reaching two consecutive unchanged Pareto fronts. The total number of runs is recorded for each update iteration and all metrics are plotted against the number of real function evaluations (i.e. the likely cost on real expensive problems). Strategies with ‘dec’ appended to their name use the decoupled second layer, as opposed to the coupled one for those where Nsl = 30 and without any suffix. Those labelled ‘43’ use a one-pass constraint penalty expected improvement strategy, whilst those that have Nie = 30 and no suffix use a constraint feasibility algorithm.
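The legend format can be decoded mechanically, as the following Python sketch shows. The regular expression, the function `parse_legend` and the handling of the trailing suffix are assumptions made for illustration: the text does not specify exactly how suffixes such as ‘dec’ or ‘43’ are attached, so the sketch treats anything after the EVALSIZE digits as a free-form suffix and would misread a purely numeric suffix written without a separator.

```python
import re

# Six update counts, then the RSMSIZE and EVALSIZE values, then any suffix.
LEGEND = re.compile(
    r"^(\d+)-(\d+)-(\d+)-(\d+)-(\d+)-(\d+)MUPD(\d+)MEVL(\d+)(.*)$"
)
FIELDS = ("Nr", "Nrsm", "Nsl", "Nrmse", "Nie", "Nkmean", "RSMSIZE", "EVALSIZE")


def parse_legend(label):
    """Split a figure legend label into its named integer fields."""
    match = LEGEND.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognised legend label: {label!r}")
    parsed = dict(zip(FIELDS, (int(g) for g in match.groups()[:8])))
    parsed["suffix"] = match.group(9).strip()
    return parsed


example = parse_legend("8-14-15-10-3-3MUPD50MEVL300")
updates_per_iteration = sum(example[k] for k in FIELDS[:6])
```

Summing the first six fields gives the number of update points requested per iteration, which is the quantity that drives the cost in real function evaluations.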
7.9.2 Preliminary Calculations
7.9.2.1 Finding the Ideal Pareto Front
As mentioned in Section 7.8, most of the Pareto front metrics are based on comparison to an ‘ideal’ Pareto front. To find it, each of the test functions has been run through a direct NSGA2 search (direct = without the use of surrogates) with a population size of 100 for 200 generations, which takes 20000 function evaluations.
7.9.2.2 How Many Generations for the RSM Search?
We have conducted a study for each of the test functions to find the minimum number of generations needed to achieve the best convergence. We found that a population size of 70 with 80 generations is sufficient for all of the test problems, and this is what we have used for our tests. Some test functions, such as ZDT1 - ZDT6 with two variables, could be converged using a smaller number of individuals and generations; however, for comparison purposes we decided to use the same settings for all functions.
7.9.2.3 What Is the Best Value for EPREWARD during the RSM Search?
The EPREWARD value is strictly individual for each function. Chosen with the specifics of the test function in mind, it can improve the diversity of the Pareto front. The default value is 0.65, which works well for most of the functions, but we have also conducted studies where this parameter is varied between -1 and 1 in steps of 0.1, and an individual value for each function is selected based on the best Pareto front metrics.
7.9.3 The Effect of the Update Strategy Selection
Fig. 7.5 shows that the selection of update strategy is important even for functions with only two variables. F5 has a deceptive Pareto front and several update strategies were not able to escape from the local Pareto front. Fig. 7.6 clearly shows that some strategies have converged earlier than others, but some of them to the local front. Methods such as random updates and secondary NSGA2 layer updates are not based on the RSM and are generally the strongest candidates when deceptive features in the multiobjective space are expected. A common observation for most of the low-dimensional objective functions (two or three variables) is that using all the update techniques together is not necessarily the winning strategy. However, combining at least one RSM and one non-RSM technique proves to work well. It is important to note that the second NSGA2 layer shows its effect only after the sixth or seventh update iteration, as it needs time to converge and gather genetic information. Update strategies that employ a greater variety of techniques prove to be more successful for functions with a higher number of variables (25).
Fig. 7.5 Pareto front for F5
Fig. 7.6 Generational distance for F5
Fig. 7.9 and Fig. 7.10 show that the ‘bump’ function is particularly difficult for all strategies, which makes it a good test problem. This function has an extremely tight constraint and multimodal features. It is not yet clear which combination of strategies should be recommended, as the ‘ideal’ Pareto front has not been reached; however, it seems that a decoupled secondary NSGA2 layer is showing a good
Fig. 7.7 Pareto front for ZDT1cons
Fig. 7.8 Generational distance for ZDT1cons
advancement. We are continuing studies on this function and will give results in future publications. To summarize the performance of each strategy, an average statistic is computed. It is derived as follows. The actual performance in most cases is a tradeoff between a given metric and the number of function evaluations needed for
Fig. 7.9 Pareto front for the ‘bump’ function
Fig. 7.10 Generational distance for the ‘bump’ function
convergence. Therefore the four metrics can be ranked against the number of runs, in the same way as ranks are obtained during NSGA2 operation. The obtained ranks are then averaged across all test functions. A low average rank means that the strategy has been optimal for more metrics and functions. These results are summarized in Table 7.2.
Table 7.2 Summary of performance
Random  RSM PF  SL  RMSE  EI  KMEAN  Av. Rank  Min. Rank  Max. Rank  Note
0       30      0   0     30  0      1.53      1          2          EI const.feas
0       30      30  0     0   0      1.83      1          3.33       SL coupled
0       30      0   30    0   0      2         1.33       3.33       RMSE
0       30      0   0     30  0      2.2       1.33       3          EI normal
0       30      30  0     0   0      2.8       1.33       4          SL decoupled
30      30      0   0     0   0      2.84      2          4          Random
0       60      0   0     0   0      2.85      2          3.33       RSM PF
The summary shows that all strategies are generally better than using only the conventional RSM-based updates, which is expected, as the conventional method is almost always bound to converge at local solutions. However, it must be underlined that the correct selection is problem dependent and must be made with care and understanding.
7.9.4 The Effect of the Initial Design of Experiments
All methods presented here start from a given initial design of experiments (DOE). This is the starting point on which the initial surrogate model is based. It is of course important to show the effect of these initial conditions. In what follows we show that effect by using a range of different initial DOEs. We have again
Fig. 7.11 Generational distance for zdt1 starting from different initial DOEs
Fig. 7.12 Generational distance for F5 starting from different initial DOEs
Fig. 7.13 Pareto fronts for ‘bump’ starting from different initial DOEs
used 10 updates for each of the techniques (60 updates per iteration in total) for all functions, the only difference being the starting set of designs. Fig. 7.11 and Fig. 7.12 illustrate the generational distance for the zdt1 and f5 functions, both with two variables. They both demonstrate low variance between the different starting sets,
Fig. 7.14 Generational distance for ‘bump’ starting from different initial DOEs
Fig. 7.15 Pareto fronts for ‘zdt1cons’ starting from different initial DOEs
confirming once again that the surrogate updates are fairly robust for functions with a low number of variables. Figures 7.13, 7.14 and 7.15 illustrate much greater variance and show that high dimensionality is a difficult challenge for surrogate strategies; however, one should also consider the low number of function evaluations used here.
7.10 Summary
In this publication we have aimed to share our experience in tackling expensive multiobjective problems. We have shown that as soon as we decide to use surrogate models to substitute for expensive objective functions, we need to consider a number of other specifics in order to produce a useful Pareto front. We have discussed the challenges that one might face when using surrogates and have proposed six update strategies that one might wish to use. Given an understanding of these strategies, the researcher should decide on the budget of updates they can afford and then spread this budget over several update strategies. We have shown that it is best to use at least two different strategies, ideally a mixture of RSM and non-RSM based techniques. For problems with few variables we have shown that a combination of two or three techniques is sufficient; with higher-dimensional problems, however, one should consider using more techniques. It is also beneficial to limit the number of designs that are used for RSM training and for RSM evaluation in order to limit the cost. The method for selecting these designs is open to further research. In this material we have used selection based on Pareto front ranking. Our research also included parameters that reward the search for exploring the end points of the Pareto front. Although not explicitly mentioned in this material, our studies use features such as improved crossover, mutation and selection strategies, and a declustering algorithm applied in both the variable and the objective space to avoid data clustering. Data is also automatically conditioned and filtered, and advanced kriging tuning techniques are used. These features are part of the OPTIONS [1], OptionsMATLAB and OptionsNSGA2 RSM suites [24].
Acknowledgements. This work was funded by Rolls-Royce Plc, whose support is gratefully acknowledged.
References
1. Keane, A.J.: OPTIONS manual, http://www.soton.ac.uk/~ajk/options.ps
2. Obayashi, S., Jeong, S., Chiba, K.: Multi-Objective Design Exploration for Aerodynamic Configurations, AIAA-2005-4666
3. Deb, K.: Multi-objective optimization using evolutionary algorithms. John Wiley & Sons, Ltd., New York (2003)
4. Zitzler, et al.: Comparison of multiobjective evolutionary algorithms: Empirical results. Evolutionary Computational Journal 8(2), 125–148 (2000)
5. Knowles, J., Corne, D.: The Pareto archived evolution strategy: A new baseline algorithm for multiobjective optimisation. In: Proceedings of the 1999 Congress on Evolutionary Computation, pp. 98–105. IEEE Service Center, Piscataway (1999)
6. Fonseca, C.M., Fleming, P.J.: Multiobjective optimization and multiple constraint handling with evolutionary algorithms - Part II: Application example. IEEE Transactions on Systems, Man, and Cybernetics: Part A: Systems and Humans, 38–47 (1998)
7. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13, 455–492 (1998)
8. Sobol’, I.M., Turchaninov, V.I., Levitan, Y.L., Shukhman, B.V.: Quasi-Random Sequence Generators, Keldysh Institute of Applied Mathematics, Russian Academy of Sciences, Moscow (1992)
9. Nowacki, H.: Modelling of Design Decisions for CAD. In: Goos, G., Hartmanis, J. (eds.) Computer Aided Design Modelling, Systems Engineering, CAD-Systems. LNCS, vol. 89. Springer, Heidelberg (1980)
10. Kumano, T., et al.: Multidisciplinary Design Optimization of Wing Shape for a Small Jet Aircraft Using Kriging Model. In: 44th AIAA Aerospace Sciences Meeting and Exhibit, January 2006, pp. 1–13 (2006)
11. Nain, P.K.S., Deb, K.: A multi-objective optimization procedure with successive approximate models. KanGAL Report No. 2005002 (March 2005)
12. Keane, A., Nair, P.: Computational Approaches for Aerospace Design: The Pursuit of Excellence (2005) ISBN: 0-470-85540-1
13. Leary, S., Bhaskar, A., Keane, A.J.: A derivative based surrogate model for approximating and optimizing the output of an expensive computer simulation. J. Global Optimization 30, 39–58 (2004)
14. Leary, S., Bhaskar, A., Keane, A.J.: A Constraint Mapping Approach to the Structural Optimization of an Expensive Model using Surrogates. Optimization and Engineering 2, 385–398 (2001)
15. Emmerich, M., Naujoks, B.: Metamodel-assisted multiobjective optimization strategies and their application in airfoil design. In: Parmee, I. (ed.) Proc. of Fifth Int’l. Conf. on Adaptive Design and Manufacture (ACDM), Bristol, UK, April 2004, pp. 249–260. Springer, Berlin (2004)
16. Giotis, A.P., Giannakoglou, K.C.: Single- and Multi-Objective Airfoil Design Using Genetic Algorithms and Artificial Intelligence. In: EUROGEN 1999, Evolutionary Algorithms in Engineering and Computer Science (May 1999)
17. Knowles, J., Hughes, E.J.: Multiobjective optimization on a budget of 250 evaluations. In: Coello Coello, C.A., Hernández Aguirre, A., Zitzler, E. (eds.) EMO 2005. LNCS, vol. 3410, pp. 176–190. Springer, Heidelberg (2005)
18. Chafekar, D., et al.: Multi-objective GA optimization using reduced models. IEEE SMCC 35(2), 261–265 (2005)
19. Nain, P.: A computationally efficient multi-objective optimization procedure using successive function landscape models. Ph.D. dissertation, Department of Mechanical Engineering, Indian Institute of Technology (July 2005)
20. Voutchkov, I.I., Keane, A.J.: Multiobjective optimization using surrogates. In: Proc. 7th Int. Conf. Adaptive Computing in Design and Manufacture (ACDM 2006), Bristol, pp. 167–175 (2006) ISBN 0-9552885-0-9
21. Keane, A.J.: Bump: A Hard (?) Problem (1994), http://www.soton.ac.uk/~ajk/bump.html
22. Forrester, A., Sobester, A., Keane, A.: Engineering design via Surrogate Modelling. Wiley, Chichester (2008)
23. Yuret, D., Maza, M.: Dynamic hill climbing: Overcoming the limitations of optimization techniques. In: The Second Turkish Symposium on Artificial Intelligence and Neural Networks, pp. 208–212 (1993)
24. OptionsMatlab & OptionsNSGA2 RSM, http://argos.e-science.soton.ac.uk/blogs/OptionsMatlab/
Chapter 8
A Review of Agent-Based Co-Evolutionary Algorithms for Multi-Objective Optimization
Rafał Dreżewski and Leszek Siwik
Abstract. Agent-based evolutionary algorithms result from combining two paradigms: multi-agent systems and evolutionary algorithms. Agent-based co-evolutionary algorithms allow many species and sexes of agents to exist within the system, and allow co-evolutionary interactions between species and sexes to be defined. Algorithms based on the model of the co-evolutionary multi-agent system have already been applied in many domains, such as multi-modal optimization, generation of investment strategies, portfolio optimization, and multi-objective optimization. In this chapter we present an overview of selected agent-based co-evolutionary algorithms, their formal models, and the results of experiments with standard test problems and a financial problem, aimed at comparing agent-based and “classical” state-of-the-art multi-objective algorithms. The presented results show that, depending on the problem being solved, agent-based algorithms obtain results comparable to, and sometimes even better than, those of “classical” algorithms; they are not, however, a universal solver for every multi-objective optimization problem.
Rafał Dreżewski · Leszek Siwik, Department of Computer Science, AGH University of Science and Technology, Kraków, Poland, e-mail: {drezew,siwik}@agh.edu.pl
Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 177–209. © Springer-Verlag Berlin Heidelberg 2010, springerlink.com
8.1 Introduction
In spite of the huge potential of evolutionary algorithms and many successful applications of such algorithms to difficult optimization and search problems, such methods have very frequently been unable to deal with the problem at hand, and the results obtained have not been satisfactory. Among the reasons for this, the following can be mentioned:
• centralization of the evolutionary process, where selection as well as the creation of new generations are controlled by one single algorithm;
• reduction of specimens to (systems of) genes with no capability of exerting any influence on the process of evolution;
• omission of some operations and processes observable in nature that are crucial from the point of view of evolution and adaptation capabilities. Moreover, there are opinions in the literature that crossover and mutation are merely variants of one single, destructive and exploration-oriented operator, and there is no agreement on whether (and if so, when) they should be used, or even whether they should be distinguished [17];
• specimens are able neither to gather nor to utilize any kind of information from the environment during decision-making in order to realize their own goals;
• specimens are deprived of biological and social behaviours that are absolutely natural and obvious in nature, such as competition, rivalry, cooperation etc.;
• as a consequence of the previous point (limited number of operators), it is almost impossible to define more sophisticated, and at the same time more effective, algorithms and computational methods within classical evolutionary algorithms.
Consequently, arguments are raised in the literature that classical evolutionary algorithms are methods of adapting and fitting an algorithm's parameters to given conditions rather than truly creative methods of search and optimization. It is therefore not surprising that intensive research is being performed on methods that use the ideas of computer models of the Darwinian evolution observable in nature, but that are free of the shortcomings mentioned above and could be perceived as a full analogy to natural processes. During this research, decentralization and autonomy have been in the limelight.
The resulting method, called Evolutionary Multi-Agent System (EMAS) [2], should be perceived as a new trend among evolutionary algorithms, one that realizes these postulates by combining the advantages of both the evolutionary and the agent-based approaches. The paradigm of the evolutionary multi-agent system is characterized by the following features, which are crucial in view of the shortcomings of classical evolutionary algorithms:
• autonomous agents take part in the process of evolution. Agents are able to make decisions to realize their own goals; they are not passive units of a global, central evolution, reduced to the role of (groups of) genes;
• the process of evolution is decentralized, and the agents taking part in it are able to create advanced social structures and to realize sophisticated strategies of cooperation, competition, interaction and reciprocal relations;
• agents taking part in the process of evolution are able to observe the environment (and the changes occurring in it) and to make appropriate decisions and take appropriate actions, which further enriches the range of complex and effective computational methods and algorithms that can be realized.
During further research on realizing advanced, complex social and biological mechanisms within EMAS, a general model of so-called co-evolutionary multi-agent systems (CoEMAS) [8] was proposed. It turned out that with such a model almost any kind of interaction, cooperation or competition among many species or sexes of co-evolving agents is possible, which improves the quality of the results obtained. This improvement results mainly from better maintenance of population diversity, which is especially important when such systems are applied to multi-modal or multi-objective optimization tasks. In this chapter we focus on applying co-evolutionary multi-agent systems to multi-objective optimization tasks. Following [5], the multi-objective optimization problem (MOOP) in its general form is defined as follows:
\[ \text{MOOP} \equiv \begin{cases} \text{Minimize/Maximize } f_m(\bar{x}), & m = 1, 2, \dots, M \\ \text{Subject to } g_j(\bar{x}) \ge 0, & j = 1, 2, \dots, J \\ \phantom{\text{Subject to }} h_k(\bar{x}) = 0, & k = 1, 2, \dots, K \\ \phantom{\text{Subject to }} x_i^{(L)} \le x_i \le x_i^{(U)}, & i = 1, 2, \dots, N \end{cases} \]
The authors of this chapter assume that readers are familiar with at least the fundamental concepts and notions of multi-objective optimization in the Pareto sense (the relation of domination, Pareto frontier, Pareto set etc.), so their explanation is omitted here (interested readers can find definitions and a deep analysis of all the necessary concepts of Pareto multi-objective optimization in, for instance, [3, 5]). This chapter is organized as follows:
• in Section 8.2 the formal model as well as a detailed description of the Co-Evolutionary Multi-Agent System (CoEMAS) is presented;
• in Section 8.3 a detailed description and formal model of two realizations of CoEMAS applied to solving MOOPs are given. In this section the Co-Evolutionary Multi-Agent System with Predator-Prey interactions (PPCoEMAS) as well as the Co-Evolutionary Multi-Agent System with Cooperation (CCoEMAS) are discussed;
• in Section 8.4 we briefly discuss the test suite and performance metric used during the experiments, and then look at the results obtained by both systems presented in this chapter (PPCoEMAS and CCoEMAS);
• in Section 8.5 the most important remarks, conclusions and comments are given.
8.2 Model of Co-Evolutionary Multi-Agent System
Agent-based models of evolutionary algorithms are the result of combining two paradigms: multi-agent systems and evolutionary algorithms. The result is a decentralized evolutionary system, in which agents “live” within the environment of the system, compete for limited resources, reproduce, die, migrate from one
computational node to another, observe the environment and other agents, communicate with other agents, and change the environment. The basic model of an agent-based evolutionary algorithm (the so-called evolutionary multi-agent system, or EMAS, model) was proposed in [2]. The EMAS model included all of the features mentioned above. However, for some problems, for example multi-modal or multi-objective optimization, these mechanisms turned out to be insufficient. Such problems require mechanisms for maintaining population diversity, speciation mechanisms, and the possibility of introducing additional biologically and socially inspired mechanisms in order to obtain satisfying results.

The limitations of the basic EMAS model mentioned above, together with research aimed at applying agent-based evolutionary algorithms to multi-modal and multi-objective problems, led to the formulation of the model of co-evolutionary multi-agent system (CoEMAS) [8]. This model allows different species and sexes to exist in the system and allows co-evolutionary interactions between them to be defined. Below we present the basic ideas and notions of the CoEMAS model, which we will use in Section 8.3 to describe the systems used in the experiments.
8.2.1 Co-Evolutionary Multi-Agent System

The CoEMAS is described as a 4-tuple:

$$\mathrm{CoEMAS} = \langle E, S, \Gamma, \Omega \rangle \tag{8.1}$$

where $E$ is the environment of the CoEMAS, $S$ is the set of species ($s \in S$) that co-evolve in the CoEMAS, $\Gamma$ is the set of resource types that exist in the system (the amount of the resource of type $\gamma$ is denoted by $r^{\gamma}$), and $\Omega$ is the set of information types that exist in the system (the information of type $\omega$ is denoted by $i^{\omega}$).
8.2.2 Environment

The environment of the CoEMAS may be described as a 3-tuple:

$$E = \langle T^E, \Gamma^E, \Omega^E \rangle \tag{8.2}$$

where $T^E$ is the topography of environment $E$, $\Gamma^E$ is the set of resource types that exist in the environment, and $\Omega^E$ is the set of information types that exist in the environment. The topography of the environment is given by:

$$T^E = \langle H, l \rangle \tag{8.3}$$

where $H$ is a directed graph with a cost function $c$ defined: $H = \langle V, B, c \rangle$, $V$ is the set of vertices, and $B$ is the set of arcs. The distance between two nodes is defined as the length of the shortest path between them in graph $H$.
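The distance between two nodes of $H$ can be illustrated with a short Python sketch based on Dijkstra's algorithm. This is only one possible realization: the chapter does not say how the shortest path is computed, and the sketch additionally assumes that the cost function $c$ is non-negative. The dictionary-based graph encoding and the name `distance` are illustrative choices.

```python
import heapq
import math
from typing import Dict, Hashable, Tuple


def distance(arcs: Dict[Tuple[Hashable, Hashable], float],
             source: Hashable, target: Hashable) -> float:
    """Length of the shortest directed path from source to target; inf if unreachable.

    arcs maps each arc (u, v) of the set B to its cost c(u, v).
    """
    adjacency: Dict[Hashable, list] = {}
    for (u, v), cost in arcs.items():
        if cost < 0:
            raise ValueError("this sketch assumes non-negative arc costs")
        adjacency.setdefault(u, []).append((v, cost))

    best = {source: 0.0}
    counter = 0  # tie-breaker so that vertices themselves are never compared
    queue = [(0.0, counter, source)]
    while queue:
        dist, _, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbour, cost in adjacency.get(node, []):
            candidate = dist + cost
            if candidate < best.get(neighbour, math.inf):
                best[neighbour] = candidate
                counter += 1
                heapq.heappush(queue, (candidate, counter, neighbour))
    return math.inf
```

Because $H$ is directed, the distance from one node to another need not equal the distance in the opposite direction.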
A Review of Agent-Based Co-Evolutionary Algorithms
Fig. 8.1 Co-evolutionary multi-agent system
The function $l$ makes it possible to locate a particular agent in the environment space:

$$l : A \to V \tag{8.4}$$

where $A$ is the set of agents that exist in the CoEMAS. A vertex $v$ is given by:

$$v = \langle A^v, \Gamma^v, \Omega^v, \varphi \rangle \tag{8.5}$$

where $A^v$ is the set of agents that are located in the vertex $v$, $\Gamma^v$ is the set of resource types that exist within $v$ ($\Gamma^v \subseteq \Gamma^E$), $\Omega^v$ is the set of information types that exist within $v$ ($\Omega^v \subseteq \Omega^E$), and $\varphi$ is the fitness function.
8.2.3 Species

Species $s \in S$ is defined as follows:

$$s = \langle A^s, SX^s, Z^s, C^s \rangle \tag{8.6}$$

where:
• $A^s$ is the set of agents of species $s$ (by $a^s$ we denote an agent of species $s$, $a^s \in A^s$);
• $SX^s$ is the set of sexes within $s$;
• $Z^s$ is the set of actions that can be performed by the agents of species $s$ ($Z^s = \bigcup_{a \in A^s} Z^a$, where $Z^a$ is the set of actions that can be performed by the agent $a$);
• $C^s$ is the set of relations with other species that exist within the CoEMAS.

The set of relations of $s_i$ with other species ($C^{s_i}$) is the union of the following sets of relations:

$$C^{s_i} = \left\{ \xrightarrow{s_i, z-} : z \in Z^{s_i} \right\} \cup \left\{ \xrightarrow{s_i, z+} : z \in Z^{s_i} \right\} \tag{8.7}$$

where $\xrightarrow{s_i, z-}$ and $\xrightarrow{s_i, z+}$ are relations between species, based on some action $z \in Z^{s_i}$ that can be performed by the agents of species $s_i$:

$$\xrightarrow{s_i, z-} = \left\{ \langle s_i, s_j \rangle \in S \times S : \text{agents of species } s_i \text{ can decrease the fitness of agents of species } s_j \text{ by performing the action } z \in Z^{s_i} \right\} \tag{8.8}$$

$$\xrightarrow{s_i, z+} = \left\{ \langle s_i, s_j \rangle \in S \times S : \text{agents of species } s_i \text{ can increase the fitness of agents of species } s_j \text{ by performing the action } z \in Z^{s_i} \right\} \tag{8.9}$$

If $s_i \xrightarrow{s_i, z-} s_i$, then we are dealing with intra-species competition, for example the competition for limited resources, and if $s_i \xrightarrow{s_i, z+} s_i$, then there is some form of co-operation within the species $s_i$.

With the use of the above relations we can define many different co-evolutionary interactions, e.g., mutualism, predator-prey, host-parasite, etc. For example, mutualism between two species $s_i$ and $s_j$ ($i \ne j$) takes place if and only if $\exists z_k \in Z^{s_i}\ \exists z_l \in Z^{s_j}$ such that $s_i \xrightarrow{s_i, z_k+} s_j$ and $s_j \xrightarrow{s_j, z_l+} s_i$, and these two species live in tight co-operation. Predator-prey interactions between two species, $s_i$ (predators) and $s_j$ (preys) ($i \ne j$), take place if and only if $\exists z_k \in Z^{s_i}\ \exists z_l \in Z^{s_j}$ such that $s_i \xrightarrow{s_i, z_k-} s_j$ and $s_j \xrightarrow{s_j, z_l+} s_i$, where $z_k$ is the action of killing the prey (kill), and $z_l$ is the action of death (die).
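To make the relation-based definitions concrete, the sketch below encodes each relation as a small record and tests the two interaction patterns described above. The encoding (a `Relation` tuple holding source species, action, sign and target species) is an illustrative assumption rather than part of the CoEMAS model, and the informal requirement that mutualistic species "live in tight co-operation" is not captured by the check.

```python
from typing import NamedTuple, Set


class Relation(NamedTuple):
    source: str  # species whose agents perform the action
    action: str  # action z performed by agents of the source species
    sign: str    # "+" raises and "-" lowers the fitness of the target species
    target: str  # species whose agents are affected


def is_mutualism(relations: Set[Relation], si: str, sj: str) -> bool:
    """Both species can raise each other's fitness through some action."""
    if si == sj:
        return False
    helps_ij = any(r.source == si and r.target == sj and r.sign == "+" for r in relations)
    helps_ji = any(r.source == sj and r.target == si and r.sign == "+" for r in relations)
    return helps_ij and helps_ji


def is_predator_prey(relations: Set[Relation], predator: str, prey: str) -> bool:
    """Predators lower prey fitness via kill; prey raise predator fitness via die."""
    if predator == prey:
        return False
    kills = Relation(predator, "kill", "-", prey) in relations
    dies = Relation(prey, "die", "+", predator) in relations
    return kills and dies
```

A relation whose source and target are the same species would, under the same encoding, express intra-species competition (sign "-") or co-operation (sign "+").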
8.2.4 Sex

The sex $sx \in SX^s$ within the species $s$ is defined as follows:

$$sx = \langle A^{sx}, Z^{sx}, C^{sx} \rangle \tag{8.10}$$

where $A^{sx}$ is the set of agents of sex $sx$ and species $s$ ($A^{sx} \subseteq A^s$):

$$A^{sx} = \{ a : a \in A^s \wedge a \text{ is the agent of sex } sx \} \tag{8.11}$$

With $a^{sx}$ we denote an agent of sex $sx$ ($a^{sx} \in A^{sx}$). $Z^{sx}$ is the set of actions that can be performed by the agents of sex $sx$, $Z^{sx} = \bigcup_{a \in A^{sx}} Z^a$, where $Z^a$ is the set of actions that can be performed by the agent $a$. Finally, $C^{sx}$ is the set of relations between $sx$ and other sexes of the species $s$.

Analogously to the case of species, we can define the relations between the sexes of the same species. The set of all relations of the sex $sx_i \in SX^s$ with other sexes of species $s$ ($C^{sx_i}$) is the union of the following sets of relations:

$$C^{sx_i} = \left\{ \xrightarrow{sx_i, z-} : z \in Z^{sx_i} \right\} \cup \left\{ \xrightarrow{sx_i, z+} : z \in Z^{sx_i} \right\} \tag{8.12}$$

where $\xrightarrow{sx_i, z-}$ and $\xrightarrow{sx_i, z+}$ are the relations between sexes, in which some action $z \in Z^{sx_i}$ is used:

$$\xrightarrow{sx_i, z-} = \left\{ \langle sx_i, sx_j \rangle \in SX^s \times SX^s : \text{agents of sex } sx_i \text{ can decrease the fitness of agents of sex } sx_j \text{ by performing the action } z \in Z^{sx_i} \right\} \tag{8.13}$$

$$\xrightarrow{sx_i, z+} = \left\{ \langle sx_i, sx_j \rangle \in SX^s \times SX^s : \text{agents of sex } sx_i \text{ can increase the fitness of agents of sex } sx_j \text{ by performing the action } z \in Z^{sx_i} \right\} \tag{8.14}$$

With the use of these relations between sexes we can model, for example, sexual selection interactions, in which agents of one sex choose partners for reproduction from agents of the other sex within the same species, taking into account some preferred features (see [10]).
8.2.5 Agent

Agent $a$ (see Fig. 8.2) of sex $sx$ and species $s$ (in order to simplify the notation we assume that $a \equiv a^{sx,s}$) is defined as follows:

$$a = \langle gn^a, Z^a, \Gamma^a, \Omega^a, PR^a \rangle \tag{8.15}$$

where:
• $gn^a$ is the genotype of agent $a$, which may be composed of any number of chromosomes (for example: $gn^a = (x_1, x_2, \ldots, x_k)$, where $x_i \in \mathbb{R}$, $gn^a \in \mathbb{R}^k$);
• $Z^a$ is the set of actions that agent $a$ can perform;
• $\Gamma^a$ is the set of resource types that are used by agent $a$ ($\Gamma^a \subseteq \Gamma$);
• $\Omega^a$ is the set of information types that agent $a$ can possess and use ($\Omega^a \subseteq \Omega$);
• $PR^a$ is the partially ordered set of profiles of agent $a$ ($PR^a \equiv \langle PR^a, \succeq \rangle$) with the partial order relation $\succeq$ defined on it.
Fig. 8.2 Agent in the CoEMAS
The relation $\succeq$ is defined in the following way:

$$\succeq \; = \left\{ \langle pr_i, pr_j \rangle \in PR^a \times PR^a : \text{realization of active goals of profile } pr_i \text{ has equal or higher priority than the realization of active goals of profile } pr_j \right\} \tag{8.16}$$

The active goal (denoted as $gl^*$) is a goal $gl$ that should be realized at the given time. The relation $\succeq$ is reflexive, transitive and antisymmetric, and it partially orders the set $PR^a$:

$$pr \succeq pr \quad \text{for every } pr \in PR^a \tag{8.17a}$$

$$(pr_i \succeq pr_j \wedge pr_j \succeq pr_k) \Rightarrow pr_i \succeq pr_k \quad \text{for every } pr_i, pr_j, pr_k \in PR^a \tag{8.17b}$$

$$(pr_i \succeq pr_j \wedge pr_j \succeq pr_i) \Rightarrow pr_i = pr_j \quad \text{for every } pr_i, pr_j \in PR^a \tag{8.17c}$$

The set of profiles $PR^a$ is defined in the following way:

$$PR^a = \{pr_1, pr_2, \ldots, pr_n\} \tag{8.18a}$$

$$pr_1 \succeq pr_2 \succeq \cdots \succeq pr_n \tag{8.18b}$$

Profile $pr_1$ is the basic profile, which means that the realization of its goals has the highest priority and they will be realized before the goals of other profiles. Profile $pr$ of agent $a$ ($pr \in PR^a$) can be the profile in which only resources are used:

$$pr = \langle \Gamma^{pr}, ST^{pr}, RST^{pr}, GL^{pr} \rangle \tag{8.19}$$
Algorithm 6. Basic activities of agent $a$ in CoEMAS

    $r^{\gamma} \leftarrow r^{\gamma}_{init}$;  /* $r^{\gamma}_{init}$ is the initial amount of resource given to the agent */
    while $r^{\gamma} > 0$ do
        activate the profile $pr_i \in PR^a$ with the highest priority and with the active goal $gl^*_j \in GL^{pr_i}$;
        if $pr_i$ is the resource profile then
            if $0 < r^{\gamma} < r^{\gamma}_{min}$ then  /* $r^{\gamma}_{min}$ is the minimal amount of resource needed by the agent to realize its activities */
                choose the strategy $st_k \in ST^{pr_i}$ with the highest priority that can be used to take some resources from the environment or other agent;
                perform actions contained within $st_k$;
            else if $r^{\gamma} = 0$ then
                execute ⟨die⟩ strategy;
            end
        else if $pr_i$ is the reproduction profile then
            if $r^{\gamma} > r^{rep,\gamma}_{min}$ then  /* $r^{rep,\gamma}_{min}$ is the minimal amount of resource needed for reproduction */
                choose the strategy $st_k \in ST^{pr_i}$ with the highest priority that can be used to reproduce;
                perform actions contained within $st_k$;
            end
        else if $pr_i$ is the migration profile then
            if $r^{\gamma} > r^{mig,\gamma}_{min}$ then  /* $r^{mig,\gamma}_{min}$ is the minimal amount of resource needed for migration */
                choose the strategy $st_k \in ST^{pr_i}$ with the highest priority that can be used to migrate;
                perform actions contained within $st_k$;
                give $r^{mig,\gamma}_{min}$ amount of resource to the environment;
            end
        end
    end
or the profile in which only information is used:

$$pr = \langle \Omega^{pr}, M^{pr}, ST^{pr}, RST^{pr}, GL^{pr} \rangle \tag{8.20}$$

or the profile in which both resources and information are used:

$$pr = \langle \Gamma^{pr}, \Omega^{pr}, M^{pr}, ST^{pr}, RST^{pr}, GL^{pr} \rangle \tag{8.21}$$

where:
• $\Gamma^{pr}$ is the set of resource types that are used within the profile $pr$ ($\Gamma^{pr} \subseteq \Gamma^a$);
• $\Omega^{pr}$ is the set of information types that are used within the profile $pr$ ($\Omega^{pr} \subseteq \Omega^a$);
• $M^{pr}$ is the set of information representing the agent's knowledge about the environment and other agents (it is the model of the environment of agent $a$);
• $ST^{pr}$ is the partially ordered set of strategies ($ST^{pr} \equiv \langle ST^{pr}, \succeq \rangle$) that can be used by the agent within the profile $pr$ in order to realize an active goal of this profile;
• $RST^{pr}$ is the set of strategies that are realized within the profile $pr$; generally, not all of the strategies from the set $ST^{pr}$ have to be realized within the profile $pr$, and some of them may be realized within other profiles;
• $GL^{pr}$ is the partially ordered set of goals ($GL^{pr} \equiv \langle GL^{pr}, \succeq \rangle$) that the agent has to realize within the profile $pr$.

On the set of strategies, the relation $\succeq$ is defined in the following way:

$$\succeq \; = \left\{ \langle st_i, st_j \rangle \in ST^{pr} \times ST^{pr} : \text{strategy } st_i \text{ has equal or higher priority than strategy } st_j \right\} \tag{8.22}$$

This relation is reflexive, transitive and antisymmetric, and it partially orders the set $ST^{pr}$. Every single strategy $st \in ST^{pr}$ consists of actions whose ordered performance leads to the realization of some active goal of the profile $pr$:

$$st = \langle z_1, z_2, \ldots, z_k \rangle, \quad st \in ST^{pr}, \quad z_i \in Z^a \tag{8.23}$$

On the set of goals, the relation $\succeq$ is defined in the following way:

$$\succeq \; = \left\{ \langle gl_i, gl_j \rangle \in GL^{pr} \times GL^{pr} : \text{goal } gl_i \text{ has equal or higher priority than the goal } gl_j \right\} \tag{8.24}$$

This relation is reflexive, transitive and antisymmetric, and it partially orders the set $GL^{pr}$. The partially ordered sets of profiles $PR^a$, goals $GL^{pr}$ and strategies $ST^{pr}$ are used by the agent to decide which goal to realize and to choose the appropriate strategy for realizing it. The basic activities of the agent $a$ are shown in Algorithm 6. In CoEMAS systems the set of profiles is usually composed of the resource profile ($pr_1$), the reproduction profile ($pr_2$), and the migration profile ($pr_3$):

$$PR^a = \{pr_1, pr_2, pr_3\} \tag{8.25a}$$

$$pr_1 \succeq pr_2 \succeq pr_3 \tag{8.25b}$$

The resource profile has the highest priority, followed by the reproduction profile and finally the migration profile.
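The priority-driven activation of profiles can be sketched in Python as follows. This is a simplified illustration under several assumptions: a single resource type, one strategy per profile, and threshold values that are arbitrary demonstration numbers. The class and function names (`Profile`, `Agent`, `make_demo_agent`) are hypothetical and do not appear in the chapter.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Profile:
    name: str
    has_active_goal: Callable[["Agent"], bool]  # is a goal of this profile active?
    act: Callable[["Agent"], None]              # perform the chosen strategy


@dataclass
class Agent:
    resource: float
    profiles: List[Profile]  # ordered from highest to lowest priority
    log: List[str] = field(default_factory=list)

    def step(self) -> bool:
        """Activate the highest-priority profile that has an active goal."""
        for profile in self.profiles:
            if profile.has_active_goal(self):
                profile.act(self)
                self.log.append(profile.name)
                return True
        return False


def make_demo_agent(initial: float = 50.0) -> Agent:
    # All thresholds below are arbitrary demonstration values.
    r_min, r_get = 20.0, 30.0        # resource profile
    r_rep_min, r_give = 30.0, 15.0   # reproduction profile
    r_mig_min = 10.0                 # migration profile

    def gain(agent: Agent) -> None:
        agent.resource += r_get

    def reproduce(agent: Agent) -> None:
        agent.resource -= r_give

    def migrate(agent: Agent) -> None:
        agent.resource -= r_mig_min

    return Agent(initial, [
        Profile("resource", lambda a: 0 < a.resource < r_min, gain),
        Profile("reproduction", lambda a: a.resource > r_rep_min, reproduce),
        Profile("migration", lambda a: a.resource > r_mig_min, migrate),
    ])
```

Starting from 50 units, four calls to `step()` activate reproduction twice, then migration, then the resource profile, which shows how a lower-priority profile only runs when no higher-priority goal is active.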
8.3 Co-Evolutionary Multi-Agent Systems for Multi-Objective Optimization

In this section we describe the two co-evolutionary multi-agent systems used in the experiments. Each uses a different co-evolutionary mechanism: co-operation in one and predator-prey interactions in the other. Both systems are based on the general model of co-evolution in multi-agent systems described in Section 8.2, so only those elements that are specific to these instantiations of the general model are described here. In both systems, real-valued vectors are used as agents' genotypes. Mutation with self-adaptation and intermediate recombination are used as evolutionary operators [1].
8.3.1 Co-Evolutionary Multi-Agent System with Co-Operation Mechanism (CCoEMAS)

The co-evolutionary multi-agent system with co-operation mechanism is defined as follows (see Eq. (8.1)):

$$\mathrm{CCoEMAS} = \langle E, S, \Gamma, \Omega \rangle \tag{8.26}$$

The number of species corresponds to the number of criteria ($n$) of the multi-objective problem being solved: $S = \{s_1, \ldots, s_n\}$. Three information types ($\Omega = \{\omega_1, \omega_2, \omega_3\}$) and one resource type ($\Gamma = \{\gamma\}$) are used. Information of type $\omega_1$ denotes the nodes to which an agent can migrate. Information of type $\omega_2$ denotes (for an agent of a given species) all agents from other species that are located within the same node at time $t$. Information of type $\omega_3$ denotes (for a given agent) all agents from the same species located within the same node.

8.3.1.1 Species

The species $s$ is defined as follows:

$$s = \langle A^s, SX^s = \{sx\}, Z^s, C^s \rangle \tag{8.27}$$

where $SX^s$ is the set of sexes that exist within the species $s$, $Z^s$ is the set of actions that agents of species $s$ can perform, and $C^s$ is the set of relations of species $s$ with other species that exist in the CCoEMAS.

Actions

The set of actions $Z^s$ is defined as follows:

$$Z^s = \{\mathit{die}, \mathit{seek}, \mathit{get}, \mathit{give}, \mathit{accept}, \mathit{seekPartner}, \mathit{clone}, \mathit{rec}, \mathit{mut}, \mathit{migr}\} \tag{8.28}$$
where:
• die is the action of death (an agent dies when it is out of resources);
• seek is the action of finding a dominated agent from the same species in order to take some resources from it;
• get takes some resource from another agent located within the same node, which is dominated by the agent that performs the get action;
• give gives some resources to the agent that performs the get action;
• accept accepts a partner for reproduction when the amount of resource possessed by the agent is above the given level;
• seekPartner seeks a partner for reproduction that comes from another species and has an amount of resource above the minimal level needed for reproduction;
• clone is the action of producing offspring (parents give some of their resources to the offspring during this action);
• rec is the recombination operator (intermediate recombination is used [1]);
• mut is the mutation operator (mutation with self-adaptation is used [1]);
• migr is the action of migrating from one node to another; during this action the agent loses some of its resource.

Relations

The set of relations of species $s_i$ with other species that exist within the system is defined as follows:

$$C^{s_i} = \left\{ \xrightarrow{s_i, \mathit{get}-}, \xrightarrow{s_i, \mathit{accept}+} \right\} \tag{8.29}$$

The first relation models intra-species competition for limited resources:

$$\xrightarrow{s_i, \mathit{get}-} = \{ \langle s_i, s_i \rangle \} \tag{8.30}$$

The second one models co-operation between species:

$$\xrightarrow{s_i, \mathit{accept}+} = \{ \langle s_i, s_j \rangle \} \tag{8.31}$$

8.3.1.2 Agent

Agent $a$ of species $s$ ($a \equiv a^s$) is defined as follows:

$$a = \langle gn^a, Z^a = Z^s, \Gamma^a = \Gamma, \Omega^a = \Omega, PR^a \rangle \tag{8.32}$$

The genotype of agent $a$ consists of two vectors (chromosomes): $x$ of real-coded decision parameter values and $\sigma$ of standard deviation values, which are used during mutation with self-adaptation. Agents of a given species are evaluated according to only the one criterion associated with this species. $Z^a = Z^s$ (see Eq. (8.28)) is the set of actions that agent $a$ can perform. $\Gamma^a$ is the set of resource types used by the agent, and $\Omega^a$ is the set of information types. The basic activities of agent $a$ in CCoEMAS with the use of profiles are presented in Alg. 7.
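The two-chromosome genotype $(x, \sigma)$ and its operators can be illustrated with the sketch below. The chapter defers the operator details to [1], so the concrete forms here are assumptions: the textbook evolution-strategy mutation with one log-normally adapted step size per variable, and intermediate recombination taken as the arithmetic mean of the parents. The learning rates and all names are illustrative.

```python
import math
import random
from typing import List, Tuple

Genotype = Tuple[List[float], List[float]]  # (x, sigma)


def intermediate_recombination(p1: Genotype, p2: Genotype) -> Genotype:
    """Child is the component-wise mean of both parents (x and sigma alike)."""
    x = [(a + b) / 2.0 for a, b in zip(p1[0], p2[0])]
    sigma = [(a + b) / 2.0 for a, b in zip(p1[1], p2[1])]
    return x, sigma


def self_adaptive_mutation(parent: Genotype, rng: random.Random) -> Genotype:
    """Mutate sigma log-normally first, then perturb x with the new sigma."""
    x, sigma = parent
    n = len(x)
    tau_global = 1.0 / math.sqrt(2.0 * n)            # assumed learning rates,
    tau_local = 1.0 / math.sqrt(2.0 * math.sqrt(n))  # common textbook defaults
    shared = rng.gauss(0.0, 1.0)  # one draw shared by all step sizes
    new_sigma = [
        s * math.exp(tau_global * shared + tau_local * rng.gauss(0.0, 1.0))
        for s in sigma
    ]
    new_x = [xi + si * rng.gauss(0.0, 1.0) for xi, si in zip(x, new_sigma)]
    return new_x, new_sigma
```

Because the step sizes are multiplied by an exponential factor, they stay positive, and good step sizes are inherited together with the decision variables they produced.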
Algorithm 7. Basic activities of agent $a$ in CCoEMAS

    $r^{\gamma} \leftarrow r^{\gamma}_{init}$;
    while $r^{\gamma} > 0$ do
        activate the profile $pr_i \in PR^a$ with the highest priority and with the active goal $gl^*_j \in GL^{pr_i}$;
        if $pr_1$ is activated then
            if $0 < r^{\gamma} < r^{\gamma}_{min}$ then
                ⟨seek, get⟩;
                $r^{\gamma} \leftarrow r^{\gamma} + r^{\gamma}_{get}$;
            else if $r^{\gamma} = 0$ then
                ⟨die⟩;
            end
        else if $pr_2$ is activated then
            if $r^{\gamma} > r^{rep,\gamma}_{min}$ then
                ⟨seekPartner, clone, rec, mut⟩;
                $r^{\gamma} \leftarrow r^{\gamma} - r^{rep,\gamma}_{give}$;
            end
        else if $pr_3$ is activated then
            if ⟨accept⟩ is activated then
                $r^{\gamma} \leftarrow r^{\gamma} - r^{rep,\gamma}_{give}$;
            else if ⟨give⟩ is activated then
                $r^{\gamma} \leftarrow r^{\gamma} - r^{\gamma}_{get}$;
            end
        else if $pr_4$ is activated then
            if $r^{\gamma} > r^{mig,\gamma}_{min}$ then
                ⟨migr⟩;
                $r^{\gamma} \leftarrow r^{\gamma} - r^{mig,\gamma}_{min}$;
            end
        end
    end
Profiles

The partially ordered set of profiles includes the resource profile ($pr_1$), the reproduction profile ($pr_2$), the interaction profile ($pr_3$), and the migration profile ($pr_4$):

$$PR^a = \{pr_1, pr_2, pr_3, pr_4\} \tag{8.33a}$$

$$pr_1 \succeq pr_2 \succeq pr_3 \succeq pr_4 \tag{8.33b}$$

The resource profile is defined in the following way:

$$pr_1 = \langle \Gamma^{pr_1} = \Gamma, \Omega^{pr_1} = \{\omega_3\}, M^{pr_1} = \{i^{\omega_3}\}, ST^{pr_1}, RST^{pr_1} = ST^{pr_1}, GL^{pr_1} \rangle \tag{8.34}$$

The set of strategies includes two strategies:

$$ST^{pr_1} = \{ \langle \mathit{die} \rangle, \langle \mathit{seek}, \mathit{get} \rangle \} \tag{8.35}$$

The goal of the $pr_1$ profile is to keep the amount of resources above the minimal level or to die when the amount of resources falls to zero. This profile uses the model $M^{pr_1} = \{i^{\omega_3}\}$.

The reproduction profile is defined as follows:

$$pr_2 = \langle \Gamma^{pr_2} = \Gamma, \Omega^{pr_2} = \{\omega_2\}, M^{pr_2} = \{i^{\omega_2}\}, ST^{pr_2}, RST^{pr_2} = ST^{pr_2}, GL^{pr_2} \rangle \tag{8.36}$$

The set of strategies includes one strategy:

$$ST^{pr_2} = \{ \langle \mathit{seekPartner}, \mathit{clone}, \mathit{rec}, \mathit{mut} \rangle \} \tag{8.37}$$

The only goal of the $pr_2$ profile is to reproduce. In order to realize this goal the agent can use the reproduction strategy ⟨seekPartner, clone, rec, mut⟩. During reproduction the agent transfers the amount $r^{rep,\gamma}_{give}$ of resources to the offspring.

The interaction profile is defined as follows:

$$pr_3 = \langle \Gamma^{pr_3} = \Gamma, \Omega^{pr_3} = \{\omega_2, \omega_3\}, M^{pr_3} = \{i^{\omega_2}, i^{\omega_3}\}, ST^{pr_3} = \{ \langle \mathit{accept} \rangle, \langle \mathit{give} \rangle \}, RST^{pr_3} = ST^{pr_3}, GL^{pr_3} \rangle \tag{8.38}$$

The goal of the $pr_3$ profile is to interact with agents from another species with the use of the ⟨accept⟩ and ⟨give⟩ strategies.

The migration profile is defined as follows:

$$pr_4 = \langle \Gamma^{pr_4} = \Gamma, \Omega^{pr_4} = \{\omega_1\}, M^{pr_4} = \{i^{\omega_1}\}, ST^{pr_4} = \{ \langle \mathit{migr} \rangle \}, RST^{pr_4} = ST^{pr_4}, GL^{pr_4} \rangle \tag{8.39}$$

The goal of the $pr_4$ profile is to migrate within the environment. In order to realize this goal the migration strategy ⟨migr⟩ is used, which first chooses the node on the basis of the information $\{i^{\omega_1}\}$ and then realizes the migration. As a result of migrating, the agent loses some of its resources.
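The resource bookkeeping behind the get and give actions (the updates $r^{\gamma} \leftarrow r^{\gamma} + r^{\gamma}_{get}$ and $r^{\gamma} \leftarrow r^{\gamma} - r^{\gamma}_{get}$ in Algorithm 7) can be sketched as a single transfer. Capping the transfer at what the giving agent still owns is an assumption added here so that resources never become negative; the chapter does not state how this case is handled. The names `ResourceHolder` and `get_give` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ResourceHolder:
    name: str
    resource: float


def get_give(getter: ResourceHolder, giver: ResourceHolder, r_get: float) -> float:
    """Move up to r_get units of resource from giver to getter; return the amount moved."""
    if r_get < 0:
        raise ValueError("r_get must be non-negative")
    amount = min(r_get, giver.resource)  # assumption: never take more than is available
    giver.resource -= amount
    getter.resource += amount
    return amount
```

With this bookkeeping the total amount of resource held by the two agents is unchanged by the exchange, and a giver whose resource reaches zero would subsequently perform the die action.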
8.3.2 Co-Evolutionary Multi-Agent System with Predator-Prey Interactions (PPCoEMAS)

The co-evolutionary multi-agent system with predator-prey interactions (PPCoEMAS) is defined as follows (see Eq. (8.1)):

$$\mathrm{PPCoEMAS} = \langle E, S, \Gamma, \Omega \rangle \tag{8.40}$$
The set of species includes two species, preys and predators: $S = \{prey, pred\}$. Two information types ($\Omega = \{\omega_1, \omega_2\}$) and one resource type ($\Gamma = \{\gamma\}$) are used. Information of type $\omega_1$ denotes the nodes to which an agent can migrate. Information of type $\omega_2$ denotes the prey that are located within a particular node at time $t$.

8.3.2.1 Prey Species

The prey species ($prey$) is defined as follows:

$$prey = \langle A^{prey}, SX^{prey} = \{sx\}, Z^{prey}, C^{prey} \rangle \tag{8.41}$$

where $SX^{prey}$ is the set of sexes that exist within the prey species, $Z^{prey}$ is the set of actions that agents of species $prey$ can perform, and $C^{prey}$ is the set of relations of the prey species with other species that exist in the PPCoEMAS.

Actions

The set of actions $Z^{prey}$ is defined as follows:

$$Z^{prey} = \{\mathit{die}, \mathit{seek}, \mathit{get}, \mathit{give}, \mathit{accept}, \mathit{seekPartner}, \mathit{clone}, \mathit{rec}, \mathit{mut}, \mathit{migr}\} \tag{8.42}$$

where:
• die is the action of death (a prey dies when it is out of resources);
• seek seeks another prey agent that is dominated by the prey performing this action or is too close to it in the criteria space;
• get takes some resource from another prey agent located within the same node, which is dominated by the agent that performs the get action or is too close to it in the criteria space;
• give gives some resource to another agent (which performs the get action);
• accept accepts a partner for reproduction when the amount of resource possessed by the prey agent is above the given level;
• seekPartner is used to find a partner for reproduction when the amount of resource is above the given level and the agent can reproduce;
• clone is the action of producing offspring (parents give some of their resources to the offspring during this action);
• rec is the recombination operator (intermediate recombination is used [1]);
• mut is the mutation operator (mutation with self-adaptation is used [1]);
• migr is the action of migrating from one node to another; during this action the agent loses some of its resource.

Relations

The set of relations of the prey species with other species that exist within the system is defined as follows:

$$C^{prey} = \left\{ \xrightarrow{prey, \mathit{get}-}, \xrightarrow{prey, \mathit{give}+} \right\} \tag{8.43}$$
The first relation models intra-species competition for limited resources:

$$\xrightarrow{prey, \mathit{get}-} = \{ \langle prey, prey \rangle \} \tag{8.44}$$

The second one models predator-prey interactions:

$$\xrightarrow{prey, \mathit{give}+} = \{ \langle prey, pred \rangle \} \tag{8.45}$$

8.3.2.2 Predator Species

The predator species ($pred$) is defined as follows:

$$pred = \langle A^{pred}, SX^{pred} = \{sx\}, Z^{pred}, C^{pred} \rangle \tag{8.46}$$

Actions

The set of actions $Z^{pred}$ is defined as follows:

$$Z^{pred} = \{\mathit{seek}, \mathit{getFromPrey}, \mathit{migr}\} \tag{8.47}$$

where:
• seek finds the "worst" prey (according to the criterion associated with the given predator) located within the same node as the predator;
• getFromPrey takes all resources from the chosen prey;
• migr allows the predator to migrate between nodes of the graph $H$, which results in losing some of its resources.

Relations

The set of relations of the pred species with other species that exist within the system is defined as follows:

$$C^{pred} = \left\{ \xrightarrow{pred, \mathit{getFromPrey}-} \right\} \tag{8.48}$$

This relation models predator-prey interactions:

$$\xrightarrow{pred, \mathit{getFromPrey}-} = \{ \langle pred, prey \rangle \} \tag{8.49}$$

When a predator performs the getFromPrey action and takes all resources from the selected prey, the prey dies.

8.3.2.3 Prey Agent

Agent $a$ of species $prey$ ($a \equiv a^{prey}$) is defined as follows:

$$a = \langle gn^a, Z^a = Z^{prey}, \Gamma^a = \Gamma, \Omega^a = \Omega, PR^a \rangle \tag{8.50}$$
Algorithm 8. Basic activities of agent $a \equiv a^{prey}$ in PPCoEMAS

    $r^{\gamma} \leftarrow r^{\gamma}_{init}$;
    while $r^{\gamma} > 0$ do
        activate the profile $pr_i \in PR^a$ with the highest priority and with the active goal $gl^*_j \in GL^{pr_i}$;
        if $pr_1$ is activated then
            if $0 < r^{\gamma} < r^{\gamma}_{min}$ then
                ⟨seek, get⟩;
                $r^{\gamma} \leftarrow r^{\gamma} + r^{\gamma}_{get}$;
            else if $r^{\gamma} = 0$ then
                ⟨die⟩;
            end
        else if $pr_2$ is activated then
            if $r^{\gamma} > r^{rep,\gamma}_{min}$ then
                if ⟨seekPartner, clone, rec, mut⟩ is performed then
                    $r^{\gamma} \leftarrow r^{\gamma} - r^{clone,\gamma}_{give}$;
                else if ⟨accept⟩ is performed then
                    $r^{\gamma} \leftarrow r^{\gamma} - r^{accept,\gamma}_{give}$;
                end
            end
        else if $pr_3$ is activated then
            if get is performed by prey agent then
                ⟨give⟩;
                $r^{\gamma} \leftarrow r^{\gamma} - r^{\gamma}_{give}$;
            else if get is performed by predator agent then
                ⟨give⟩;
                $r^{\gamma} \leftarrow 0$;
            end
        else if $pr_4$ is activated then
            if $r^{\gamma} > r^{mig,\gamma}_{min}$ then
                ⟨migr⟩;
                $r^{\gamma} \leftarrow r^{\gamma} - r^{mig,\gamma}_{min}$;
            end
        end
    end
The genotype of agent $a$ consists of two vectors (chromosomes): $x$ of real-coded decision parameter values and $\sigma$ of standard deviation values, which are used during mutation with self-adaptation. $Z^a = Z^{prey}$ (see Eq. (8.42)) is the set of actions that agent $a$ can perform. $\Gamma^a$ is the set of resource types used by the agent, and $\Omega^a$ is the set of information types. The basic activities of agent $a$ are presented in Alg. 8.

Profiles

The partially ordered set of profiles includes the resource profile ($pr_1$), the reproduction profile ($pr_2$), the interaction profile ($pr_3$), and the migration profile ($pr_4$):

$$PR^a = \{pr_1, pr_2, pr_3, pr_4\} \tag{8.51a}$$

$$pr_1 \succeq pr_2 \succeq pr_3 \succeq pr_4 \tag{8.51b}$$

The resource profile is defined in the following way:

$$pr_1 = \langle \Gamma^{pr_1} = \Gamma, \Omega^{pr_1} = \{\omega_2\}, M^{pr_1} = \{i^{\omega_2}\}, ST^{pr_1}, RST^{pr_1} = ST^{pr_1}, GL^{pr_1} \rangle \tag{8.52}$$

The set of strategies includes two strategies:

$$ST^{pr_1} = \{ \langle \mathit{die} \rangle, \langle \mathit{seek}, \mathit{get} \rangle \} \tag{8.53}$$

The goal of the $pr_1$ profile is to keep the amount of resources above the minimal level or to die when the amount of resources falls to zero. This profile uses the model $M^{pr_1} = \{i^{\omega_2}\}$.

The reproduction profile is defined as follows:

$$pr_2 = \langle \Gamma^{pr_2} = \Gamma, \Omega^{pr_2} = \{\omega_2\}, M^{pr_2} = \{i^{\omega_2}\}, ST^{pr_2}, RST^{pr_2} = ST^{pr_2}, GL^{pr_2} \rangle \tag{8.54}$$

The set of strategies includes two strategies:

$$ST^{pr_2} = \{ \langle \mathit{seekPartner}, \mathit{clone}, \mathit{rec}, \mathit{mut} \rangle, \langle \mathit{accept} \rangle \} \tag{8.55}$$

The only goal of the $pr_2$ profile is to reproduce. In order to realize this goal the agent can use the reproduction strategy ⟨seekPartner, clone, rec, mut⟩ or can accept partners for reproduction (⟨accept⟩).

The interaction profile is defined as follows:

$$pr_3 = \langle \Gamma^{pr_3} = \Gamma, \Omega^{pr_3} = \emptyset, M^{pr_3} = \emptyset, ST^{pr_3} = \{ \langle \mathit{give} \rangle \}, RST^{pr_3} = ST^{pr_3}, GL^{pr_3} \rangle \tag{8.56}$$

The goal of the $pr_3$ profile is to interact with predators and preys with the use of the strategy ⟨give⟩.

The migration profile is defined as follows:

$$pr_4 = \langle \Gamma^{pr_4} = \Gamma, \Omega^{pr_4} = \{\omega_1\}, M^{pr_4} = \{i^{\omega_1}\}, ST^{pr_4} = \{ \langle \mathit{migr} \rangle \}, RST^{pr_4} = ST^{pr_4}, GL^{pr_4} \rangle \tag{8.57}$$

The goal of the $pr_4$ profile is to migrate within the environment. In order to realize this goal the migration strategy is used, which first chooses the node and then realizes the migration. As a result of migrating, the prey loses some amount of resource.
Algorithm 9. Basic activities of agent $a \equiv a^{pred}$ in PPCoEMAS

    $r^{\gamma} \leftarrow r^{\gamma}_{init}$;
    while $r^{\gamma} > 0$ do
        activate the profile $pr_i \in PR^a$ with the highest priority and with the active goal $gl^*_j \in GL^{pr_i}$;
        if $pr_1$ is activated then
            if $0 < r^{\gamma} < r^{\gamma}_{min}$ then
                ⟨seek, getFromPrey⟩;
                $r^{\gamma} \leftarrow r^{\gamma} + r^{prey,\gamma}_{get}$;  /* $r^{prey,\gamma}_{get}$ are all resources of the prey agent that was chosen by $a$ */
            end
        else if $pr_2$ is activated then
            if $r^{\gamma} > r^{mig,\gamma}_{min}$ then
                ⟨migr⟩;
                $r^{\gamma} \leftarrow r^{\gamma} - r^{mig,\gamma}_{min}$;
            end
        end
    end
8.3.2.4 Predator Agent

Agent $a$ of species $pred$ is defined analogously to the prey agent (see Eq. (8.50)). There are two main differences. The genotype of a predator agent consists only of the information about the criterion associated with the given agent. The set of profiles consists of only two profiles, the resource profile ($pr_1$) and the migration profile ($pr_2$): $PR^a = \{pr_1, pr_2\}$, where $pr_1 \succeq pr_2$. The basic activities of agent $a$ are presented in Alg. 9.

Profiles

The resource profile is defined in the following way:

$$pr_1 = \langle \Gamma^{pr_1} = \Gamma, \Omega^{pr_1} = \{\omega_2\}, M^{pr_1} = \{i^{\omega_2}\}, ST^{pr_1} = \{ \langle \mathit{seek}, \mathit{getFromPrey} \rangle \}, RST^{pr_1} = ST^{pr_1}, GL^{pr_1} \rangle \tag{8.58}$$

The goal of the $pr_1$ profile is to keep the amount of resource above the minimal level with the use of the strategy ⟨seek, getFromPrey⟩.

The migration profile is defined as follows:

$$pr_2 = \langle \Gamma^{pr_2} = \Gamma, \Omega^{pr_2} = \{\omega_1\}, M^{pr_2} = \{i^{\omega_1}\}, ST^{pr_2} = \{ \langle \mathit{migr} \rangle \}, RST^{pr_2} = ST^{pr_2}, GL^{pr_2} \rangle \tag{8.59}$$

The goal of the $pr_2$ profile is to migrate within the environment. In order to realize this goal the migration strategy ⟨migr⟩ is used. The realization of the migration strategy results in losing some of the resource possessed by the agent.
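The predator's ⟨seek, getFromPrey⟩ strategy can be sketched as follows. Two assumptions are made that the chapter does not spell out: objectives are minimized, so the "worst" prey for a predator is the one with the largest value of the predator's criterion, and a prey whose resources are taken is removed from its node immediately. The names `Prey` and `get_from_worst_prey` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prey:
    objectives: Tuple[float, ...]  # criteria values of the prey's solution
    resource: float


def get_from_worst_prey(criterion: int, prey_in_node: List[Prey]) -> float:
    """Remove the worst prey (by the predator's criterion) and return its resources."""
    if not prey_in_node:
        return 0.0  # nothing to hunt in this node
    worst_index = max(range(len(prey_in_node)),
                      key=lambda i: prey_in_node[i].objectives[criterion])
    worst = prey_in_node.pop(worst_index)  # the prey dies
    return worst.resource
```

Because each predator looks at a single criterion, predators associated with different criteria remove different prey, which is what pushes the surviving population towards trade-off solutions.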
8.4 Experimental Results

The agent-based co-evolutionary approaches for multi-objective optimization presented formally in Section 8.3 have been tentatively assessed. The preliminary results obtained during the experiments were presented in some of our previous papers and are briefly summarized in this section.
8.4.1 Test Suite, Performance Metric and State-of-the-Art Algorithms

As the first test problem, a slightly modified version of the so-called Laumanns multi-objective problem was used, which is defined as follows [15, 18]:

$$\mathrm{Laumanns} = \begin{cases} f_1(x) = x_1^2 + x_2^2 \\ f_2(x) = (x_1 + 2)^2 + x_2^2 \\ -5 \le x_1, x_2 \le 5 \end{cases} \tag{8.60}$$

The second test problem was the so-called Kursawe problem. Its definition is as follows [18]:

$$\mathrm{Kursawe} = \begin{cases} f_1(x) = \sum_{i=1}^{n-1} \left( -10 \exp\left( -0.2 \sqrt{x_i^2 + x_{i+1}^2} \right) \right) \\ f_2(x) = \sum_{i=1}^{n} \left( |x_i|^{0.8} + 5 \sin x_i^3 \right) \\ n = 3, \quad -5 \le x_1, x_2, x_3 \le 5 \end{cases} \tag{8.61}$$

In one of the experiments briefly discussed in this chapter, the problem of building an effective portfolio was used. The assumed definition as well as the true Pareto frontier for this problem can be found in [16]. Well-known and commonly used test suites were also used during our experiments, among them the ZDT test suite ([19, p. 57–63], [21], [5, p. 356–362], [4, p. 194–199]).
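As a quick reference, the two test problems of Eqs. (8.60) and (8.61) can be written in Python as below. The sketch assumes the usual form of the Kursawe function, with a square root inside the exponential and sums starting at $i = 1$; the function names and the bounds check are illustrative additions.

```python
import math
from typing import Sequence, Tuple


def laumanns(x1: float, x2: float) -> Tuple[float, float]:
    """Objectives of the Laumanns problem, Eq. (8.60); both variables lie in [-5, 5]."""
    if not (-5 <= x1 <= 5 and -5 <= x2 <= 5):
        raise ValueError("variables must lie in [-5, 5]")
    return x1 ** 2 + x2 ** 2, (x1 + 2) ** 2 + x2 ** 2


def kursawe(x: Sequence[float]) -> Tuple[float, float]:
    """Objectives of the Kursawe problem, Eq. (8.61); the chapter uses n = 3."""
    if any(not -5 <= xi <= 5 for xi in x):
        raise ValueError("variables must lie in [-5, 5]")
    f1 = sum(-10.0 * math.exp(-0.2 * math.sqrt(a * a + b * b))
             for a, b in zip(x, x[1:]))
    f2 = sum(abs(xi) ** 0.8 + 5.0 * math.sin(xi ** 3) for xi in x)
    return f1, f2
```

For example, `laumanns(0, 0)` gives `(0, 4)` and `laumanns(-2, 0)` gives `(4, 0)`, the two single-objective optima between which the trade-off solutions lie.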
Fig. 8.3 Two goals of multi-objective optimization: drifting towards the true Pareto frontier and dispersing solutions over the whole approximation of the true Pareto frontier (both objectives, f1 and f2, are maximized)
The two main distinguishing features of a high-quality solution of a MOOP are closeness to the true Pareto frontier and dispersion of the found non-dominated solutions over the whole (approximation of the) Pareto frontier (see Figure 8.3). A single measure is generally not enough to assess the effectiveness of (evolutionary) algorithms for multi-objective optimization [23]. However, because the Hypervolume Ratio measure (HVR) [20] allows both of these aspects to be estimated, the discussion and presentation of the obtained results in this chapter is based on this measure.

Hypervolume (HV) describes the area covered by the solutions of the obtained result set. For each solution, a hypercube is evaluated with respect to a fixed reference point. To obtain the hypervolume ratio (HVR), the hypervolume of the obtained set is normalized by the hypervolume computed for the true Pareto frontier. HV and HVR are defined as follows:

$$\mathrm{HV} = v\left( \bigcup_{i=1}^{N} v_i \right) \tag{8.62a}$$

$$\mathrm{HVR} = \frac{\mathrm{HV}(PF^*)}{\mathrm{HV}(PF)} \tag{8.62b}$$

where $v_i$ is the hypercube computed for the $i$-th solution, $PF^*$ represents the obtained Pareto frontier and $PF$ is the true Pareto frontier.

To assess PPCoEMAS and CCoEMAS quantitatively, their results have to be compared with results obtained with state-of-the-art algorithms. We therefore compare the results obtained by the approaches discussed in this chapter with those obtained by the NSGA-II [6, 7] and SPEA2 [12, 22] algorithms, which are among the most efficient and most commonly used evolutionary multi-objective optimization algorithms. Additionally, the obtained results are also compared with the NPGA [13] and PPES [15] algorithms.
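For two objectives, the HV and HVR measures of Eqs. (8.62a) and (8.62b) can be computed with a simple sweep, as in the sketch below. It assumes that both objectives are minimized and that the reference point is worse than every solution of interest (for maximized objectives the values can be negated first). The function names are illustrative and not taken from the chapter.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def hypervolume_2d(points: Iterable[Point], reference: Point) -> float:
    """Area dominated by the points and bounded by the reference point (minimization)."""
    rx, ry = reference
    inside = sorted((x, y) for x, y in points if x < rx and y < ry)
    area = 0.0
    best_y = ry  # lowest second objective seen so far in the sweep
    for x, y in inside:
        if y < best_y:  # only non-dominated points add a new strip of area
            area += (rx - x) * (best_y - y)
            best_y = y
    return area


def hypervolume_ratio(obtained: Iterable[Point], true_front: Iterable[Point],
                      reference: Point) -> float:
    """HVR = HV(obtained frontier) / HV(true frontier)."""
    true_hv = hypervolume_2d(true_front, reference)
    if true_hv == 0.0:
        raise ValueError("the true frontier has zero hypervolume for this reference point")
    return hypervolume_2d(obtained, reference) / true_hv
```

An HVR close to 1 indicates that the obtained frontier is both close to the true frontier and well spread along it, which is why a single number can reflect both goals shown in Figure 8.3.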
8.4.2 A Glance at Assessing Co-operation Based Approach (CCoEMAS)

The co-evolutionary multi-agent system with co-operation mechanism (CCoEMAS) presented in Section 8.3.1 was tentatively assessed using, among others, the ZDT test suite. The population sizes of CCoEMAS and of the benchmarking algorithms (NSGA-II and SPEA2) assumed during the presented experiments were as follows: CCoEMAS 200, NSGA-II 300 and SPEA2 100. The selected parameters and their values assumed during those experiments were as follows: $r^{\gamma}_{init} = 50$ (the level of resources possessed by an individual just after its creation), $r^{\gamma}_{get} = 30$ (the amount of resources transferred in the case of domination), $r^{rep,\gamma}_{min} = 30$ (the level of resources required for reproduction), and $p_{mut} = 0.5$ (mutation probability).
Fig. 8.4 HVR values obtained by CCoEMAS, SPEA2, and NSGA-II run against Zitzler's problems ZDT1 (a) and ZDT2 (b) [11]. Each panel plots the HVR measure value against time [s].
As the results presented in Figures 8.4 and 8.5 show, CCoEMAS, a less complex algorithm than NSGA-II or SPEA2, initially obtains better solutions, but with time the classical algorithms, especially NSGA-II, become the better alternatives. It is worth mentioning, however, that for the ZDT4 problem this characteristic seems to be reversed: initially the classical algorithms seem to be the better alternatives, but finally CCoEMAS obtains better solutions (observed as higher values of the HVR metric). A deeper analysis of the results obtained during these experiments can be found in [11].

Fig. 8.5 HVR values obtained by CCoEMAS, SPEA2, and NSGA-II run against Zitzler's problems ZDT3 (a), ZDT4 (b) and ZDT6 (c) [11]. Each panel plots the HVR measure value against time [s].
8.4.3 A Glance at Assessing Predator-Prey Based Approach (PPCoEMAS)
This section presents selected results for the co-evolutionary multi-agent system with predator-prey interactions introduced in section 8.3.2. Among others, PPCoEMAS was assessed using some of the classical benchmark problems presented in section 8.4.1, starting with the Laumanns [15] and Kursawe [14] test problems. Classical algorithms other than NSGA-II and SPEA2 were also used in the experiments with the predator-prey approach, namely the predator-prey evolutionary strategy (PPES) and the niched Pareto genetic algorithm (NPGA). Only a summary of the obtained results is given here; a more detailed analysis can be found in [9, 16].
Fig. 8.6 Pareto frontier approximations obtained by PPCoEMAS (a) and PPES (b) algorithms for Laumanns problem after 6000 steps [9]. Both panels plot f2 against f1.
Fig. 8.7 The value of HV (a) and HVR (b) measure for Laumanns problem obtained by PPCoEMAS, PPES and NPGA after 6000 steps
In the very first experiments with PPCoEMAS the relatively simple Laumanns test problem was used. Figure 8.6 presents the Pareto frontier approximations obtained by the PPCoEMAS and PPES algorithms, and Figure 8.7 presents the values of the HV and HVR metrics for all three algorithms being compared (PPCoEMAS, PPES and NPGA). As can be seen, the differences between the analyzed algorithms are not very distinct; however, the proposed PPCoEMAS system seems to be the best alternative.
The second problem used was the more demanding multi-objective Kursawe problem, in which both the Pareto set and the Pareto frontier are disconnected. Figure 8.9 presents the final approximations of the Pareto frontier obtained by PPCoEMAS and by the reference algorithms after 6000 time steps. PPCoEMAS is clearly the best alternative here, since the Pareto frontier it obtains is located very close to the model solution, is very well dispersed and, what is also very important, is more numerous than the PPES- and NPGA-based solutions. These observations are confirmed by the values of the HV and HVR metrics presented in Figure 8.8.
Fig. 8.8 The value of HV (a) and HVR (b) measure for Kursawe problem obtained by PPCoEMAS, PPES and NPGA after 6000 steps
Fig. 8.9 Pareto frontier approximations for Kursawe problem obtained by PPCoEMAS (a), PPES (b) and NPGA (c) after 6000 steps [9]. All panels plot f2 against f1.
The proposed co-evolutionary multi-agent system with predator-prey interactions was also assessed on the problem of building an effective portfolio. In this case, each individual in the prey population is represented as a p-dimensional vector, in which the i-th dimension (i ∈ 1 . . . p) represents the percentage participation of the i-th share in the whole portfolio. In the presented experiments, Warsaw Stock Exchange quotations from 2003-01-01 until 2005-12-31 were taken into consideration. The portfolio consisted of three (experiment I) or seventeen (experiment II) stocks quoted on the Warsaw Stock Exchange. In experiment I: RAFAKO, PONARFEH, PKOBP. In experiment II: KREDYTB, COMPLAND, BETACOM, GRAJEWO, KRUK, COMARCH, ATM, HANDLOWY, BZWBK, HYDROBUD, BORYSZEW, ARKSTEEL, BRE, KGHM, GANT, PROKOM, BPHPBK. WIG20 was used as the market index.
Figure 8.10 presents the final Pareto frontiers obtained using the PPCoEMAS, NPGA and PPES algorithms after 1000 steps in experiment I. In this case, the frontier obtained by PPCoEMAS is more numerous than the NPGA-based one and as numerous as the PPES-based one. Unfortunately, the diversity of the population in the PPCoEMAS approach is visibly worse than for the NPGA- or PPES-based frontiers.
Fig. 8.10 Pareto frontier approximations after 1000 steps obtained by PPCoEMAS (a), PPES (b), and NPGA (c) for building effective portfolio consisting of 3 stocks [16]. All panels plot profit against risk.
A similar situation can be observed in Figure 8.11, which presents the Pareto frontiers obtained by PPCoEMAS, NPGA and PPES when the portfolio being optimized consists of 17 shares. Again, the PPCoEMAS-based frontier is quite numerous and quite close to the true Pareto frontier, but the tendency to concentrate solutions around only selected part(s) of the whole frontier is very distinct. An explanation of the observed tendency can be found in [9, 16]; at a very general level, it is caused by the stagnation of the evolution process in PPCoEMAS. Hypothetical, non-dominated average portfolios for experiments I and II are presented in Figure 8.12 and Figure 8.13, respectively (in Figure 8.13 the shares are presented from left to right in the order in which they were mentioned above).
Fig. 8.11 Pareto frontier approximations after 1000 steps obtained by PPCoEMAS (a), PPES (b), and NPGA (c) for building effective portfolio consisting of 17 stocks [16]. All panels plot profit against risk.
Fig. 8.12 Effective portfolio consisting of three stocks proposed by PPCoEMAS [16]. The panels show the percentage share in the portfolio for each share name (RAFAKO, PONAR, PKOBP) after 1 step (a) and after 900 steps (b).
Fig. 8.13 Effective portfolio consisting of seventeen stocks proposed by PPCoEMAS [16]. The panels show the percentage share in the portfolio for each share name after 1 step (a) and after 900 steps (b).
8.5 Summary and Conclusions
Agent-based (co-)evolutionary algorithms have already been applied in many different domains, including multi-modal optimization, multi-objective optimization, and financial problems. Agent-based models of evolutionary algorithms allow different bio-inspired techniques and algorithms to be mixed and used simultaneously within one coherent agent model, and allow new biologically and socially inspired operators and mechanisms to be added in a very natural way. They also allow parallel and decentralized computations to be used without any additional changes, because these models are decentralized and use asynchronous computations.
In this chapter we have presented two selected agent-based co-evolutionary algorithms for multi-objective optimization: one uses co-operative mechanisms and the other uses a predator-prey mechanism. Formal models of these systems were presented, together with the results of experiments on standard multi-objective test problems and on the financial problem of multi-objective portfolio optimization. The results show that agent-based algorithms may obtain quite satisfactory results, comparable to, or for some problems even better than, those of state-of-the-art multi-objective evolutionary algorithms, although there is still room for improvement and further research. The presented results also lead to the conclusion that no existing evolutionary algorithm for multi-objective optimization can alone solve all problems in the best way; there is, and always will be, room for new algorithms and improvements suited to particular problems.
Future research on agent-based models will concentrate on improvements to the already proposed algorithms as well as on new algorithms and techniques. Examples of new techniques that may be incorporated into agent-based models of evolutionary algorithms include cultural and immunological mechanisms. Another direction of development would be to add social and economic layers to the existing biological one and to use such agent-based models for modeling and simulation of complex and emergent phenomena from social and economic life.
References
1. Bäck, T., Fogel, D., Michalewicz, Z. (eds.): Handbook of Evolutionary Computation. IOP Publishing and Oxford University Press (1997)
2. Cetnarowicz, K., Kisiel-Dorohinicki, M., Nawarecki, E.: The application of evolution process in multi-agent world to the prediction system. In: Tokoro, M. (ed.) Proceedings of the 2nd International Conference on Multi-Agent Systems (ICMAS 1996). AAAI Press, Menlo Park (1996)
3. Coello, C., Lamont, G., Van Veldhuizen, D.: Evolutionary Algorithms for Solving Multi-Objective Problems, 2nd edn. Springer, New York (2007)
4. Coello Coello, C., Van Veldhuizen, D., Lamont, G.: Evolutionary algorithms for solving multi-objective problems, 2nd edn. Genetic and evolutionary computation. Springer, Heidelberg (2007)
5. Deb, K.: Multi-Objective Optimization using Evolutionary Algorithms. John Wiley & Sons, Chichester (2001)
6. Deb, K., Agrawal, S., Pratap, A., Meyarivan, T.: A Fast Elitist Non-Dominated Sorting Genetic Algorithm for Multi-Objective Optimization: NSGA-II. In: Deb, K., Rudolph, G., Lutton, E., Merelo, J.J., Schoenauer, M., Schwefel, H.-P., Yao, X. (eds.) PPSN 2000. LNCS, vol. 1917, pp. 849–858. Springer, Heidelberg (2000), citeseer.ist.psu.edu/article/deb00fast.html
7. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multi-objective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6(2), 181–197 (2002)
8. Dreżewski, R.: A model of co-evolution in multi-agent system. In: Mařík, V., Müller, J.P., Pěchouček, M. (eds.) CEEMAS 2003. LNCS (LNAI), vol. 2691, pp. 314–323. Springer, Heidelberg (2003)
9. Dreżewski, R., Siwik, L.: The application of agent-based co-evolutionary system with predator-prey interactions to solving multi-objective optimization problems. In: Proceedings of the 2007 IEEE Symposium Series on Computational Intelligence. IEEE, Los Alamitos (2007)
10. Dreżewski, R., Siwik, L.: Agent-based co-evolutionary techniques for solving multiobjective optimization problems. In: Kosiński, W. (ed.) Advances in Evolutionary Algorithms. IN-TECH, Vienna (2008)
11. Dreżewski, R., Siwik, L.: Agent-based co-operative co-evolutionary algorithm for multiobjective optimization. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2008. LNCS (LNAI), vol. 5097, pp. 388–397. Springer, Heidelberg (2008)
12. Zitzler, E., Laumanns, M., Thiele, L.: SPEA2: Improving the strength Pareto evolutionary algorithm for multiobjective optimization. In: Giannakoglou, K., et al. (eds.) Evolutionary Methods for Design, Optimisation and Control with Application to Industrial Problems (EUROGEN 2001). International Center for Numerical Methods in Engineering (CIMNE), pp. 95–100 (2002)
13. Horn, J., Nafpliotis, N., Goldberg, D.E.: A niched Pareto genetic algorithm for multiobjective optimization. In: Proceedings of the First IEEE Conference on Evolutionary Computation. IEEE World Congress on Computational Intelligence, vol. 1, pp. 82–87. IEEE Service Center, Piscataway (1994), citeseer.ist.psu.edu/horn94niched.html
14. Kursawe, F.: A variant of evolution strategies for vector optimization. In: Schwefel, H.P., Männer, R. (eds.) PPSN 1990. LNCS, vol. 496, pp. 193–197. Springer, Heidelberg (1991), citeseer.ist.psu.edu/kursawe91variant.html
15. Laumanns, M., Rudolph, G., Schwefel, H.P.: A spatial predator-prey approach to multiobjective optimization: A preliminary study. In: Eiben, A.E., Bäck, T., Schoenauer, M., Schwefel, H.-P. (eds.) PPSN 1998. LNCS, vol. 1498, p. 241. Springer, Heidelberg (1998)
16. Siwik, L., Dreżewski, R.: Co-evolutionary multi-agent system for portfolio optimization. In: Brabazon, A., O’Neill, M. (eds.) Natural Computation in Computational Finance, pp. 273–303. Springer, Heidelberg (2008)
17. Spears, W.: Crossover or mutation? In: Proceedings of the 2-nd Foundation of Genetic Algorithms, pp. 221–237. Morgan Kaufmann, San Francisco (1992)
18. Van Veldhuizen, D.A.: Multiobjective evolutionary algorithms: Classifications, analyses and new innovations. PhD thesis, Graduate School of Engineering of the Air Force Institute of Technology Air University (1999)
19. Zitzler, E.: Evolutionary algorithms for multiobjective optimization: methods and applications. PhD thesis, Swiss Federal Institute of Technology, Zurich (1999)
20. Zitzler, E., Thiele, L.: An evolutionary algorithm for multiobjective optimization: The strength Pareto approach. Tech. Rep. 43, Swiss Federal Institute of Technology, Zurich, Gloriastrasse 35, CH-8092 Zurich, Switzerland (1998), citeseer.ist.psu.edu/article/zitzler98evolutionary.html
21. Zitzler, E., Deb, K., Thiele, L.: Comparison of Multiobjective Evolutionary Algorithms: Empirical Results. Evolutionary Computation 8(2), 173–195 (2000)
22. Zitzler, E., Laumanns, M., Thiele, L.: SPEA2: Improving the strength Pareto evolutionary algorithm. Tech. Rep. TIK-Report 103, Computer Engineering and Networks Laboratory (TIK), Department of Electrical Engineering, Swiss Federal Institute of Technology (ETH) Zurich, ETH Zentrum, Gloriastrasse 35, CH-8092 Zurich, Switzerland (2001)
23. Zitzler, E., Thiele, L., Laumanns, M., Fonseca, C.M., da Fonseca, V.G.: Performance assessment of multiobjective optimizers: An analysis and review. IEEE Transactions on Evolutionary Computation 7(2), 117–132 (2003)
Chapter 9
A Game Theory-Based Multi-Agent System for Expensive Optimisation Problems
Abdellah Salhi and Özgun Töreyen
Abstract. This paper is concerned with the development of a novel approach to solving expensive optimisation problems. The approach relies on game theory and a multi-agent framework in which a number of existing algorithms, cast as agents, are deployed with the aim of solving the problem at hand as efficiently as possible. The key factor for the success of this approach is a dynamic resource allocation biased toward algorithms that are promising on the given problem. This is achieved by allowing the agents to play a cooperative-competitive game, the outcomes of which are used to decide which algorithms, if any, will drop out of the list of solver-agents and which will remain in use. A successful implementation of this framework will result in the algorithm(s) most suited to the given problem being predominantly used on the available computing platform. In other words, it guarantees the best use of the resources, both algorithms and hardware, with the by-product being the best approximate solution for the problem given the available resources. GTMAS is tested on a standard collection of TSP problems. The results are included.
9.1 Introduction
Modelling problems arising in real-world applications, taking into account the nonlinearity and the combinatorial aspects of solution sets, often leads to optimisation problems that are expensive to solve; they are inherently intractable. Indeed, even checking a given solution for optimality is NP-hard [10, 17, 32]. It is, therefore, not reasonable, in general, to expect the optimum solution to be found in acceptable time. What one can almost always expect is only an approximate solution, the quality of which is crucial to its potential use.
Abdellah Salhi · Özgun Töreyen, Department of Mathematical Sciences, The University of Essex, Colchester CO4 3SQ, UK, e-mail: [email protected], [email protected]
Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 211–232. © Springer-Verlag Berlin Heidelberg 2010, springerlink.com
It is well known that, at least in the case of stochastic algorithms, the quality of the approximate solution (or the confidence the user may have in it) is proportional to the time spent searching for it [22, 23, 24]. As in many applications there is a time constraint, a deadline beyond which a better approximate solution is of no use, it is essential that all available resources (software and hardware) be used as well as possible, to ensure that the best approximate solution under the circumstances is obtained. This is what the novel approach suggested here attempts to achieve. To do so, it must:
1. find which algorithm(s), in the suite of algorithms, is the most appropriate for the given instance of the expensive optimisation problem;
2. replicate this algorithm(s) on all available processor nodes in a parallel environment, or allocate to it all of the remaining CPU time if a single processor, or sequential environment, is used.
Point (1) above is dealt with through measuring the performance of the algorithms used. Point (2) is dealt with via the implementation of a cooperative/competitive game of the Iterated Prisoners’ Dilemma (IPD) type [4, 7, 16, 20, 25, 27]. Although other paradigms of cooperative/competitive behaviour, such as the Stag Hunt game [7], can be used, the IPD seems appropriate. Note that implementing cooperation is fairly straightforward, whereas implementing competition is not. We believe that competition is at least as important as cooperation between agents for an effective search. To the best of our knowledge, this is the first time that implementing competition for optimisation purposes has been attempted. We use payoff matrices as a handle to manipulate it. Two algorithms (agents) cooperate by exchanging their current solutions; they compete by not exchanging their solutions.
Note that, intuitively, cooperation may lead to early convergence to a local optimum, by virtue of propagating a given solution potentially to all algorithms and having all of them search the same area. Competition, on the other hand, may lead to good coverage of the search space by virtue of not sharing solutions, i.e. helping algorithms “stay away” from each other and therefore, potentially, explore different areas of the search space.
Although the study presents the prototype of a generic solver that can involve any number of solver algorithms and run on any computing platform, here a system with only two search algorithms, implemented sequentially, is investigated. This simplified model, however, has the inherent complexities of a system with many more agents and should show how good or otherwise the general system can be for expensive optimisation. Note that the generic nature of this approach makes it applicable in any discipline where problem solving is involved and more than one solution method is available.
This document is organised as follows. In Section 9.2, a brief literature review is given. In Sections 9.3 and 9.4, the design and implementation of the system are explained. Section 9.5 explains how the system is applied to solve the Travelling Salesman Problem. The results are presented in Section 9.6. Finally, conclusions are drawn and future research prospects are outlined in Section 9.7.
9.2 Background
In the following, a brief review of the three main topics involved, i.e. optimisation, the IPD and agent systems, is given.
9.2.1 Optimisation
The general optimisation problem is of the global type, constrained, nonlinear and involves mixed variables, i.e. both discrete and continuous variables. However, many optimisation problems encountered in real applications do not have all of these characteristics, but are still intractable. The 0-1 Knapsack problem, for example, involves only binary variables and has a single constraint, but is still NP-hard. The general optimisation problem can be cast in the following form. Let $f$ be a function from $\mathbb{R}^n$ to $\mathbb{R}$ and $A \subset \mathbb{R}^n$; then find $x^* \in A$ such that $\forall x \in A$, $f(x^*) \le f(x)$.
9.2.2 Game Theory: The Iterated Prisoners’ Dilemma
The Prisoners’ Dilemma (PD), brought to attention by Merrill Flood of the Rand Corporation in 1951 and later formulated by Al Tucker [8, 11], is a popular paradigm for the problem of cooperation. It can be formulated as follows.
Table 9.1 Formulation of PD: The Payoff Matrix
                 Player 2
                 C             D
Player 1   C     R=3, R=3      S=0, T=5
           D     T=5, S=0      P=1, P=1

In the payoff matrix of Table 9.1, actions C and D stand for ‘Cooperate’ and ‘Defect’, and payoffs R, P, T, and S stand for ‘Reward’, ‘Punishment’, ‘Temptation’, and ‘Sucker’s’ payoff, respectively. This payoff matrix shows that defecting is attractive to each player for two reasons. First, it leads to a greater payoff (T = 5) if the other player cooperates (and receives S = 0). Second, it is a safe move, because neither player knows what the other’s move will be. So, to rational players, defecting is the best choice. But if both players choose to defect, this leads to a worse payoff (P = 1) than mutual cooperation (R = 3). That is the dilemma.
The special setting of the one-shot PD is seen by many to be contrary to the idea of cooperation. This is because the only equilibrium point is the outcome [P, P], which is a Nash equilibrium [7]. Also, [P, P] is at the intersection of the minimax strategy choices of both players. These minimax strategies are dominant for both players, hence the exclusion in principle of cooperation (by virtue of the dominance of the chosen strategies). Moreover, even if cooperative strategies were chosen, the resulting cooperative ‘solution’ is not an equilibrium point. This means that it is not stable, because both players are tempted to defect from it. It should also be noted that cooperative problems in real life are likely to be faced repeatedly. This makes the IPD a more appropriate model for the study of cooperation than the one-shot version of the game.
The PD game is characterised by the strict inequality relations between the payoffs: T > R > P > S. To avoid coordination or total agreement getting a ‘helping hand’, most experimental PD games have a payoff matrix satisfying 2R > S + T, as in Table 9.1. Close analysis of the IPD reveals that, unlike the one-shot PD, it has a large number of Nash equilibria. These lie inside the convex hull of the outcomes (0,5), (3,3), (5,0), (1,1) of the pure strategies in the one-shot PD (see Figure 9.1). Note that (1,1), corresponding to [P, P], is a Nash equilibrium for the IPD also. For a comprehensive investigation of the IPD, please refer to [4, 5, 15].
Fig. 9.1 Set of Nash Equilibrium Points. The figure marks the outcomes (0, 5), (3, 3), (1, 1) and (5, 0) and a region labelled N.
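The statements about the one-shot PD can be checked mechanically. The sketch below uses the payoff values of Table 9.1; the dictionary layout and the helper name `is_nash` are illustrative choices, not part of the original formulation.

```python
# Payoffs from Table 9.1: Temptation, Reward, Punishment, Sucker's payoff.
T, R, P, S = 5, 3, 1, 0

# PAYOFF[(a1, a2)] = (payoff to Player 1, payoff to Player 2)
PAYOFF = {
    ("C", "C"): (R, R),
    ("C", "D"): (S, T),
    ("D", "C"): (T, S),
    ("D", "D"): (P, P),
}
ACTIONS = ("C", "D")


def is_nash(a1: str, a2: str) -> bool:
    """True if neither player can gain by unilaterally changing its action."""
    u1, u2 = PAYOFF[(a1, a2)]
    no_gain_1 = all(PAYOFF[(alt, a2)][0] <= u1 for alt in ACTIONS)
    no_gain_2 = all(PAYOFF[(a1, alt)][1] <= u2 for alt in ACTIONS)
    return no_gain_1 and no_gain_2


# Pure-strategy Nash equilibria of the one-shot game.
nash_points = [(a1, a2) for a1 in ACTIONS for a2 in ACTIONS if is_nash(a1, a2)]
```

Enumerating the four outcomes this way leaves mutual defection as the only pure-strategy equilibrium, in line with the discussion of [P, P] above.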
9.2.3 Multi-Agent Systems
Multi-Agent Systems (MAS) are collections of agents that work together to accomplish a task that is normally beyond the capabilities of a single agent [19, 30, 31]. In our case, however, every agent is capable of solving the instance of the optimisation problem at hand. The agents in a classical MAS communicate among themselves, cooperate, collaborate and sometimes negotiate [1, 9, 12]. They do not, normally, compete; in the framework used here, however, they are allowed, and in fact required, to do so. The present paper describes a Game Theoretic Multi-Agent Solver (GTMAS) [18], which implements the ideas introduced above. It is limited to three agents (Figure 9.2): a coordinator-agent, Solver-Agent 1 (SA1), running the Genetic Algorithm (GA) [14, 26], and Solver-Agent 2 (SA2), running Simulated Annealing (SA) [28]. The system, implemented in Matlab, is tested on a set of Travelling Salesman Problems (TSP) from TSPLIB [21].
9.3 Constructing GTMAS
The GTMAS architecture closely follows the goal-based MAS architecture of Park and Sugumaran [19] and the IPD model as described in [2, 3, 6]. Let the problem to be solved be ℘. The overall goal of GTMAS is to solve ℘ efficiently using the best available algorithm(s) in the system’s library. This is equivalent to completing two tasks: selecting the best algorithm(s) among all available algorithms and obtaining a ‘good’ solution. The overall goal is divided into sub-goals, which are then matched to the system’s agents. Each of the solver-agents runs a different algorithm and tries to be the first to solve ℘ by using as much as possible of the available computing facilities, here CPU time. The coordinator-agent enables coordination, manages the game through which the solver-agents compete for the facilities, allocates the facilities to the solver-agents, and also communicates with the user, i.e. the owner of problem ℘ (Figure 9.2).
Fig. 9.2 Goal Hierarchy Diagram
Solver-agents cooperate (C) or compete (D) with each other by sharing or not sharing their solutions. If an agent takes an opponent’s possibly better solution when it is stuck in a local optimum, say, and uses it, then it can improve its own search. The decision to cooperate or to compete is made autonomously by the agents using their beliefs (no notable change in the objective value in the last few iterations, for instance, may mean convergence to a local optimum), the history of the previous encounters with their opponents (the number of times they cooperated and competed), and certain rules which follow from observations of the behaviours of agents. Some are explained below. The rules are set to prevent the game from converging too soon to a near pure competition game (which is equivalent to playing the GRIM strategy [2]).
A go-it-alone type strategy cannot contribute to the solution quality more than running an algorithm on its own. These rules are:
• If the number of times SA1 knows the solution of SA2 increases, then the likelihood that SA2 finds the solution to ℘ first decreases. Therefore, SA2 is unlikely to cooperate. Since all solver-agents are aware of this, they would cooperate less and take their opponent’s solution more often, given the chance.
• If SA1 does not cooperate when SA2 cooperates, then SA2 would retaliate. This leads to the TIT-FOR-TAT and the go-it-alone type strategies.
• If a solver-agent cooperates in the first encounter with another solver-agent, then it can be perceived as in need of help, i.e. as being stuck at a local optimum. Agents, therefore, perceive the first cooperation of their opponent as a “forceful invitation to cooperate or else...”, following from the second bullet point above.
There are all sorts of rules which are implicit in the IPD. Agents, however, do not have to apply them systematically.
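One possible way to turn these beliefs and rules into a decision procedure is sketched below. It is purely illustrative and is not the decision logic of GTMAS: the stagnation test, the `window` and `tol` values, and the functions `is_stagnating` and `decide` are hypothetical, combining only two of the ingredients named in the text (a belief about convergence and TIT-FOR-TAT style retaliation).

```python
from typing import List, Sequence


def is_stagnating(history: Sequence[float], window: int = 5, tol: float = 1e-6) -> bool:
    """Belief: no notable change in the objective value over the last `window` iterations."""
    if len(history) <= window:
        return False  # not enough evidence yet
    return abs(history[-1] - history[-1 - window]) <= tol


def decide(history: Sequence[float], opponent_moves: List[str]) -> str:
    """Return 'C' (cooperate) or 'D' (compete) for the next encounter."""
    # Retaliation: answer the opponent's last defection with a defection.
    if opponent_moves and opponent_moves[-1] == "D":
        return "D"
    # A stagnating agent asks for help by cooperating; otherwise it goes it alone.
    return "C" if is_stagnating(history) else "D"
```

Because agents "do not have to apply the rules systematically", a fuller version would presumably make such choices probabilistic; the deterministic form here only keeps the example short.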
9.3.1 GTMAS at Work: Illustration
Recall that the solver-agents, in order to solve ℘, play the IPD game. In each encounter, they either cooperate (C) or compete (D). Figure 9.3 shows an encounter of 2 agents after they both obtain their intermediary solutions. Node 1 is the starting node, which shows the decision alternatives of SA1. It can cooperate and end up in Node 2, or compete and end up in Node 3. Nodes 2 and 3 are the decision nodes of SA2, which has the same two alternative decisions as SA1. Node 4 follows the cooperation of both agents, which may result in a solution exchange. Node 5 shows the situation where SA1 cooperates and SA2 competes, which means SA2 may take the solution of SA1 and SA1 takes nothing. Node 6 depicts the same situation as that leading to Node 5, but with the agents’ actions swapped. In Node 7, neither gives its solution; they continue without any exchange.
Fig. 9.3 Decision Tree of the Game
The decision tree is expanded further by branching from Nodes 4-7, with the alternatives now being “Take the opponent’s solution” and “Do not take the opponent’s solution”. This branching determines which agent has a better solution and is essential for setting up the payoff matrices that drive the system. Eight new nodes (leaves) arise. Each pair of sibling nodes yields a different payoff matrix. The labels (G) (for good) and (B) (for bad) refer to an agent having a better or a worse solution than its opponent, respectively. The cells that are crossed out refer to impossible outcomes.
Managing the resources is based on the outcomes of the decisions of the solver-agents. When an agent cooperates, it gains one unit (of CPU time, or the equivalent in terms of the iterations it is allowed to perform) and loses double that. When it competes, it gains two units and loses one (or half of the initial gain). This means that the GTMAS payoff matrix rewards competition. The idea behind supporting competition is to counter the “helping hand” that cooperation gets from the rules underpinning the construction of GTMAS (see above). It can also be argued that, intuitively at least, exchanging solutions too frequently will lead to early convergence to local optima. So, competition gives solver-agents the chance to cover the search space better. The 4 payoff matrices in Figure 9.3 can be combined into one payoff matrix (Table 9.2).
Table 9.2 Combined Payoff Table for Evaluating and Rewarding Agents
                 B
                 C          D
G          C     (1,-2)     (1,-1)
           D     (2,-2)     (2,-1)
The equilibrium point for the payoff matrix is (D, D), with payoffs 2 and -1. It is also a regret-free point. The payoff matrix at the core of GTMAS is different from those commonly found in the literature. Those matrices would be drawn up immediately after the decisions have been taken, i.e. at Nodes 4 to 7 in Figure 9.3. Here, they are drawn up after further decisions are taken. In fact, one can highlight three main differences:
(i) The return of a player does not depend directly on the opponent’s choice. Whether the opponent cooperates or competes only becomes relevant after the exchange of solutions has been decided.
(ii) The payoff is affected by what has been achieved in terms of the quality of the solution after the exchange (or otherwise) of solutions. Unlike traditional games, here, after the players (solver-agents) have made their choices, they are given a chance to progress with the consequences of those choices. Only after that are they rewarded or punished. This was made explicit in the above paragraph, where the reasons for rewarding competition and penalising cooperation were given; for instance, when we said that a cooperating agent “gains one unit and loses double that”, we meant that the solver-agent first runs for a unit of CPU time (or the equivalent in iterations) and only after that is it penalised by taking 2 units of CPU time from its account. Basically, the quality of the solution following the decisions has to be measured before the payoffs are allocated. Time is an important factor in the IPD.
(iii) The players are not “Solver-Agent 1” and “Solver-Agent 2”, but instead “Solver-Agent with the better solution” and “Solver-Agent with the worse solution”. The configuration of the table may change at each stage according to the solution qualities of the solver-agents. The one with the better solution is always placed as the row player.
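The evaluation step implied by Table 9.2 and points (i) to (iii) can be sketched as follows. This rests on one reading of the table, which is an assumption: the first entry of each cell is taken as the payoff of the agent with the better solution (G) and the second as the payoff of the agent with the worse solution (B), each depending only on that agent's own action. The function name `rewards`, the use of a cost to be minimised, and the tie-breaking in favour of the first agent are hypothetical.

```python
from typing import Tuple

# Payoffs read off Table 9.2 (assumed interpretation, see lead-in).
GAIN = {"C": 1, "D": 2}    # agent with the better solution (row player G)
LOSS = {"C": -2, "D": -1}  # agent with the worse solution (column player B)


def rewards(action1: str, cost1: float, action2: str, cost2: float) -> Tuple[int, int]:
    """Return the rewards of agent 1 and agent 2 after their solutions are compared.

    Lower cost means a better solution (minimisation); ties favour agent 1.
    """
    if cost1 <= cost2:
        return GAIN[action1], LOSS[action2]
    return LOSS[action1], GAIN[action2]
```

Under this reading each agent's payoff is maximised by competing whatever its rank turns out to be, which matches the statement that (D, D) with payoffs 2 and -1 is the equilibrium point.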
9.4 The GTMAS Algorithm

GTMAS is a generic (n + 1)-agent system that consists of a coordinator-agent and n solver-agents. It can be seen as a loosely coupled hybrid algorithm in which any number of algorithms can contribute to the hybridisation. The pseudocode of the agents is given below.

Coordinator-Agent Pseudocode
1. Initialise belief. Initialise resources, n.
2. For Nstage stages, play the game and update belief, where Nstage limits the number of stages the game is played.
   2.1. Start decision phase: Run the solver-agents to decide.
   2.2. Manage the solution exchange.
   2.3. Start competition phase: Run the solver-agents to compete.
   2.4. Evaluate and reward/punish the solver-agents. Update resources to m_1, m_2 iterations, where $m_i = n + \sum_{j=1}^{\mathrm{currentstage}-1} r_{ij}$ and $r_{ij}$ is the reward of agent i at stage j.
   2.5. Increment stage.
3. End the game. Select the best algorithm. Report the results.
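The resource bookkeeping of step 2.4 can be sketched as follows. This is a minimal illustration only: it assumes the payoffs of Table 9.2 and the convention that the agent holding the better solution is the row player, and the names `PAYOFF`, `stage_rewards` and `budgets` are hypothetical rather than taken from the authors' Matlab code.

```python
# Payoffs of Table 9.2, keyed by (move of better agent G, move of worse agent B).
# "C" = cooperate, "D" = compete. Values are (reward of G, reward of B).
PAYOFF = {
    ("C", "C"): (1, -2),
    ("C", "D"): (1, -1),
    ("D", "C"): (2, -2),
    ("D", "D"): (2, -1),
}


def stage_rewards(move_a, move_b, a_is_better):
    """Return (reward_a, reward_b) for one stage.

    The agent with the better solution is looked up as the row player.
    """
    if a_is_better:
        return PAYOFF[(move_a, move_b)]
    reward_b, reward_a = PAYOFF[(move_b, move_a)]
    return reward_a, reward_b


def budgets(n, history):
    """Iteration budgets m_a, m_b: m_i = n + sum of rewards over completed stages.

    history is a list of (move_a, move_b, a_is_better) tuples, one per stage.
    """
    m_a, m_b = n, n
    for move_a, move_b, a_is_better in history:
        reward_a, reward_b = stage_rewards(move_a, move_b, a_is_better)
        m_a += reward_a
        m_b += reward_b
    return m_a, m_b


if __name__ == "__main__":
    # Stage 1: both cooperate, agent a is better. Stage 2: both compete, b is better.
    print(budgets(10, [("C", "C", True), ("D", "D", False)]))
```

Because the row player is re-assigned at every stage, an agent that falls behind starts drawing the negative entries of the table, which is how its account shrinks over time.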
Solver-Agents Pseudocode
1. Initialise belief.
2. If it is a decision phase, do:
   2.1. If it is the first stage, do:
        2.1.1. Initialise memory and algorithm specific parameters.
        2.1.2. Run own algorithm for n iterations.
        2.1.3. Cooperate.
        2.1.4. End run. Send the results to the Coordinator-Agent.
   2.2. If it is the second stage, do:
        2.2.1. Update belief.
        2.2.2. Run own algorithm for m_i iterations.
        2.2.3. Compete.
        2.2.4. End run. Send the results to the Coordinator-Agent.
   2.3. If stage > 2, do:
        2.3.1. Update belief.
        2.3.2. Run own algorithm for m_i iterations.
        2.3.3. Decide to cooperate/compete.
        2.3.4. End run. Send the results to the Coordinator-Agent.
3. If it is a competition phase, do:
   3.1. Update belief.
   3.2. Run own algorithm for n iterations.
   3.3. End run. Send the results to the Coordinator-Agent.

A prototype GTMAS is constructed with 3 agents: a coordinator-agent and two solver-agents. GTMAS starts with the initialisation of the coordinator-agent. The coordinator-agent reads the problem data and initialises the payoff table. It also initialises the resource accounts (seconds of CPU time or number of iterations) assigned to the solver-agents. The overall iterations of GTMAS are called stages, each of which consists of a decision phase and a competition phase. In the decision phases, the coordinator-agent asks the solver-agents for their decisions. The solver-agents start by initialising their memory (the outcome of encounters with their opponents), their algorithm parameters (e.g. the population for the Genetic Algorithm) and their decision parameters (η, β, α, σ and γ). After running for the number of iterations set by the coordinator-agent to obtain their initial solutions for ℘, they make their decisions. The decisions in the first two stages are not subject to analysis since no historical data (memory of earlier encounters) exist yet. Both agents cooperate in the first stage and compete in the second stage, regardless of the results and without prior analysis.
The following stages differ from the first two stages with the introduction of memory. Decisions are made according to procedure Decide() below.
9.4.1 Solver-Agents Decision Making Procedure

Let SA1 be a solver-agent and SA2 its opponent in an IPD game. SA1 decides whether to compete or cooperate according to the following procedure.
Procedure Decide() Pseudocode
0. Begin
1. If (SA2 has cooperated in the last move) then
   1.1. If (P(IC|OC) < σ%) then
        1.1.1. Cooperate.
   1.2. Else If (P(OC|IC) < γ%) then
        1.2.0. Cooperate.
        1.2.1. Else If (SA1 didn’t improve by γ% in last 2 stages) then
               1.2.1.0. Decide randomly to cooperate or compete
        1.2.1.1. Else
               1.2.1.1.0. Compete.
        1.2.1.2. End If
   1.2.2. End If
   1.3. End If
2. Else (Ask SA2 for its solution);
   2.1. If (SA2 solution is α% better than that of SA1) then
        2.1.1. If (SA1 is stuck with γ%) then
               2.1.1.0. Cooperate.
        2.1.2. Else If (SA2 solution is 2α% better than that of SA1) then
               2.1.2.1. If (P(OC|ID) > β%) then
                        2.1.2.1.0. Compete.
               2.1.2.2. Else
                        2.1.2.2.0. Cooperate.
               2.1.2.3. End If
        2.1.3. End If
   2.3. Else
        2.3.1. Compete.
   2.4. End If
3. End If
4. Stop
In the decision-making process, P(IC|OC) is the probability that SA1 will cooperate in the next iteration given that SA2 cooperates in this iteration. It is equal to the ratio of the number of times SA1 cooperates in the (n + 1)st iteration given that SA2 cooperated in the nth iteration to the total number of encounters. P(OC|IC) is the probability that SA2 will cooperate in the next iteration given that SA1 cooperates in this iteration. It is equal to the ratio of the number of times SA2 cooperates in the (n + 1)st iteration given that SA1 cooperated in the nth iteration to the total number of encounters. P(OC|ID) is the probability that SA2 will cooperate in the next iteration given that SA1 competes in this iteration. It is equal to the ratio of the number of times SA2 cooperates in the (n + 1)st iteration given that SA1 competed in the nth iteration to the total number of encounters.
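The three ratios can be estimated from the recorded move histories along the following lines. This is only a sketch: it follows the definition given above literally (the count is divided by the total number of encounters, not by the number of times the conditioning move occurred), and the function name `response_rate` and the list-of-strings encoding of moves are assumptions made for illustration.

```python
def response_rate(given_moves, response_moves, given, response="C"):
    """Ratio used for P(IC|OC), P(OC|IC) and P(OC|ID).

    Counts the encounters t in which given_moves[t] == given and
    response_moves[t + 1] == response, then divides by the total number
    of encounters recorded so far.
    """
    total = len(given_moves)
    if total == 0:
        return 0.0
    hits = sum(
        1
        for t in range(total - 1)
        if given_moves[t] == given and response_moves[t + 1] == response
    )
    return hits / total


if __name__ == "__main__":
    own = ["C", "D", "C", "C"]  # moves of SA1, "C" = cooperate, "D" = compete
    opp = ["C", "C", "D", "C"]  # moves of SA2
    p_ic_oc = response_rate(opp, own, given="C")  # SA1 cooperates after SA2 cooperated
    p_oc_ic = response_rate(own, opp, given="C")  # SA2 cooperates after SA1 cooperated
    p_oc_id = response_rate(own, opp, given="D")  # SA2 cooperates after SA1 competed
    print(p_ic_oc, p_oc_ic, p_oc_id)
```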
The decision parameters are as follows:
σ is a measure of how likely it is for SA1 to adopt a TIT-FOR-TAT strategy; it is referred to as “responsiveness”;
η is a measure of how likely it is for SA2 to adopt a TIT-FOR-TAT strategy;
γ is how likely it is for either agent to get stuck in a local optimum;
α is the difference between SA1’s solution and that of SA2;
β is how likely it is for the opponent to cooperate (be nice!).

If the opponent cooperated in the last move, the agents first check their own responsiveness. If it is less than σ, then they cooperate; if not, they check the opponent’s responsiveness. If the latter is less than η, they conclude that the opponent is not as responsive as it should be, so they compete. If the opponent is responsive, then they check their own status: if they performed γ% better than what they obtained over the previous 2 stages, then they compete; otherwise they make a random choice as to whether they cooperate or compete.

If the opponent did not cooperate in the last move, then they compare their own status with that of their opponent. If the opponent’s last solution is not α% better than their own solution, they compete. If it is at least α% better, then they check their own progress. If they are stuck with γ%, in other words if the solution has not improved by more than γ% in the last 2 stages, then they cooperate. If they are not stuck, they check the difference between the opponent’s and their own solutions. If the solution of the opponent is not 2α% better than their own solution, they compete. If it is 2α% better, then they check the opponent’s attitude to competition. If the opponent is likely to cooperate, i.e. if the probability that it cooperates after competition is larger than β, they compete; otherwise they cooperate. Note that this description is given as the procedure Decide(). After an agent makes a decision, it is sent to the coordinator-agent, which manages the solution exchange.
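The decision logic can be sketched in code as below. The sketch follows the prose description; where the prose and the pseudocode of Decide() differ (for the branch in which the opponent's responsiveness is below the threshold, the prose uses η and competes), the prose is followed, and this choice is an assumption. All names are hypothetical, the thresholds are treated as fractions as in Table 9.6, and `own_improvement` and `opp_advantage` stand for relative quantities that the caller is assumed to compute.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionParams:
    alpha: float
    beta: float
    gamma: float
    eta: float
    sigma: float


# Values of Table 9.6.
COOPERATIVE = DecisionParams(alpha=0.1, beta=0.9, gamma=0.01, eta=0.2, sigma=0.6)
COMPETITIVE = DecisionParams(alpha=0.5, beta=0.1, gamma=0.01, eta=0.8, sigma=0.2)


def decide(opp_cooperated, p_ic_oc, p_oc_ic, p_oc_id,
           own_improvement, opp_advantage, params, rng=random):
    """Return "C" (cooperate) or "D" (compete) for SA1.

    own_improvement: relative improvement of SA1 over the last 2 stages.
    opp_advantage:   relative amount by which SA2's solution beats SA1's.
    """
    if opp_cooperated:
        if p_ic_oc < params.sigma:        # SA1 itself is not responsive enough
            return "C"
        if p_oc_ic < params.eta:          # opponent is not responsive (prose: compete)
            return "D"
        if own_improvement > params.gamma:
            return "D"                    # SA1 is progressing well on its own
        return rng.choice(("C", "D"))     # no clear signal: random choice
    if opp_advantage < params.alpha:      # opponent is not clearly better
        return "D"
    if own_improvement <= params.gamma:   # SA1 is stuck
        return "C"
    if opp_advantage < 2 * params.alpha:
        return "D"
    return "D" if p_oc_id > params.beta else "C"


if __name__ == "__main__":
    print(decide(False, 0.5, 0.5, 0.5, own_improvement=0.0,
                 opp_advantage=0.15, params=COOPERATIVE))
```

Keeping the probabilities and progress measures as plain arguments makes the rule easy to test in isolation from the solvers.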
Solution exchange is settled randomly: when a solution is offered following a cooperate move, it is accepted with probability 0.5. The worse performing agent is not entirely removed. It is kept, but it is only allowed one iteration in each stage. This is specific to the case of two solver-agents since it seems, from experimental results, that the weaker algorithm still helps the stronger one to reach, overall, a better solution. This, however, may not be the case if a large number of algorithms were used. When all the stages are completed, the results of the best solver-agent are reported.
9.5 Application of GTMAS to TSP

GTMAS is applied to a collection of Travelling Salesman Problems (TSP) [21]. The Genetic Algorithm (GA) and Simulated Annealing (SA) are selected as the solvers in the library. GA is coded in Matlab 7.0 and the Simulated Annealing Matlab code is borrowed from Matlab Central [29]. They are customised to be incorporated in GTMAS, which is also coded in Matlab 7.0. Generic parameters of GTMAS are defined by pre-experimentation. Nstage, the number of stages the game is played, is set to 5. Preliminary analyses show that
it is sufficient for convergence to the optimum or a good solution. n, the number of iterations for which the solver-agents run at the start of the decision phase and in the competition phases, is set to 10. Accordingly, all the entries in the payoff table, r_ij, are quadrupled for faster progress (see Table 9.3).

Table 9.3 Payoff Table of GTMAS
G \ B | C | D
C | (4,-8) | (4,-4)
D | (8,-8) | (8,-4)
(rows: moves of player G; columns: moves of player B)
GTMAS is customised specifically for the GA-SA competition. SA is fast in the initial iterations because of the high temperature and the large probability of accepting bad solutions in order to escape local optima. Afterwards, it slows down considerably as the temperature decreases. The resource in the decision phase is CPU time. The number of iterations SA is run, m_SA, is updated at the beginning of each stage to balance the CPU time usage of SA and that of GA:

\[ m_{SA} = \frac{10 \, m_{SA}}{\mathrm{currentstage}}. \]
Prior to experimentation, GA is expected to perform better than SA. GA is coded for the specific problem at hand and its parameters are tuned accordingly. The parameters of SA are default parameters as found in the literature. The game played between the agents affects the solution quality substantially. It depends on the solver-agents’ attitude to cooperation and competition, the solution exchanges and the payoff matrix. These parameters are summarised with their possible values in Table 9.4.
Table 9.4 GTMAS Parameter Values
Payoff Matrix | Decision Model | Characteristic of GA | Characteristic of SA
simple | random | random | random
cooperation-rewarded | evaluative | cooperative | cooperative
competition-rewarded | - | competitive | competitive
time-dependent | - | - | -
The payoff matrix can be simple, cooperation-rewarded, competition-rewarded or time-dependent.
Table 9.5 Simple, Competition-Rewarded, Cooperation-Rewarded and Time-Dependent Payoff Matrices (rows: moves of player G; columns: moves of player B)

Simple
G \ B | C | D
C | (4,-4) | (4,-4)
D | (4,-4) | (4,-4)

Competition-Rewarded
G \ B | C | D
C | (4,-8) | (4,-4)
D | (8,-8) | (8,-4)

Cooperation-Rewarded
G \ B | C | D
C | (8,-4) | (8,-8)
D | (4,-4) | (4,-8)

Time-Dependent
G \ B | C | D
C | (4/t, -4t) | (4/t, -4/t)
D | (4t, -4t) | (4t, -4/t)
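The four payoff matrices can be generated from one small function, as sketched below. Two assumptions are made here and should be checked against the original table: the time-dependent entries are read as 4/t and 4t, and t is treated simply as a positive number (its exact meaning, e.g. a stage index, is not fixed by this sketch). The function name `payoff_matrix` is hypothetical.

```python
def payoff_matrix(kind, t=1.0):
    """Return {(move_G, move_B): (payoff_G, payoff_B)} for the named matrix.

    In every matrix the payoff of G depends only on G's own move and the
    payoff of B only on B's own move, so two small lookups are enough.
    """
    if kind == "simple":
        g = {"C": 4, "D": 4}
        b = {"C": -4, "D": -4}
    elif kind == "competition-rewarded":
        g = {"C": 4, "D": 8}
        b = {"C": -8, "D": -4}
    elif kind == "cooperation-rewarded":
        g = {"C": 8, "D": 4}
        b = {"C": -4, "D": -8}
    elif kind == "time-dependent":
        # Assumed reading of the table: cooperation pays 4/t, competition 4t.
        g = {"C": 4 / t, "D": 4 * t}
        b = {"C": -4 * t, "D": -4 / t}
    else:
        raise ValueError(f"unknown payoff matrix: {kind}")
    return {(mg, mb): (g[mg], b[mb]) for mg in "CD" for mb in "CD"}


if __name__ == "__main__":
    for name in ("simple", "competition-rewarded", "cooperation-rewarded"):
        print(name, payoff_matrix(name))
    print("time-dependent, t=2", payoff_matrix("time-dependent", t=2))
```

Under this reading the time-dependent matrix coincides with the simple one at t = 1 and rewards competition increasingly strongly as t grows.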
The solver-agents arrive at decisions using the decision-making parameters η, β, α, σ and γ. These parameters were explained earlier and their values are given in Table 9.6.

Table 9.6 The Values of the Decision Parameters
Agent Characteristics | α | β | γ | η | σ
cooperative | 0.1 | 0.9 | 0.01 | 0.2 | 0.6
competitive | 0.5 | 0.1 | 0.01 | 0.8 | 0.2
Twenty combinations of parameters were used and, for each, five runs were carried out on the problems of Table 9.13, from TSPLIB [21]. GA is found to be the better algorithm in 98 runs and SA is found to be better only in the remaining 2. The final solutions, the elapsed times, the solver algorithm and the series of cooperation/competition bouts and exchanges of solutions are recorded. The results are entered into SPSS 14 for analysis of significant factors. The cooperation/competition series and the exchange series are categorised prior to analysis. The categorisation is summarised in Table 9.7.

Table 9.7 Cooperation and Solution Exchange Sequences of Solver-Agents
Cooperate | Take Opponent’s Solution
less than twice | never takes
more than twice | takes in first stage
- | takes after second stage
- | takes in both

These are added to the factors of the experiment. The final factors in question are summarised in Table 9.8 with their corresponding values and the number of occurrences.
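Table 9.8 ANOVA - Data Summary
Factors | Values | Number of Observations
PAYOFF | simple (1) | 25
PAYOFF | cooperation-rewarded (2) | 25
PAYOFF | competition-rewarded (3) | 25
PAYOFF | time-dependent (4) | 25
DECISION PROCESS | coop GA vs coop SA (1) | 20
DECISION PROCESS | coop GA vs comp SA (2) | 20
DECISION PROCESS | comp GA vs coop SA (3) | 20
DECISION PROCESS | comp GA vs comp SA (4) | 20
DECISION PROCESS | random (5) | 20
CATEGORY GA COOPERATES | less than twice (1) | 64
CATEGORY GA COOPERATES | more than twice (2) | 36
CATEGORY SA COOPERATES | less than twice (1) | 43
CATEGORY SA COOPERATES | more than twice (2) | 57
CATEGORY GA TAKES SA’S SOLUTION | never takes (1) | 25
CATEGORY GA TAKES SA’S SOLUTION | takes in first stage (2) | 23
CATEGORY GA TAKES SA’S SOLUTION | takes after second stage (3) | 26
CATEGORY GA TAKES SA’S SOLUTION | takes in both (4) | 26
CATEGORY SA TAKES GA’S SOLUTION | never takes (1) | 22
CATEGORY SA TAKES GA’S SOLUTION | takes in first stage (2) | 33
CATEGORY SA TAKES GA’S SOLUTION | takes after second stage (3) | 33
CATEGORY SA TAKES GA’S SOLUTION | takes in both (4) | 12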
Table 9.9 shows ANOVA results for the dependent variable deviation. Here, deviation is the difference between the true solution objective value and the objective value of the solution found. All factors and reasonable multiple interactions are included in the model. Most of them are far from significant because of the high random variability. However, the interaction of the agents’ solution-taking sequences is significant at roughly the 10% level (Sig. = .102 in Table 9.9). The solution-taking sequence factors themselves are therefore considered significant. Even though this is not a very reliable level of confidence, these are the factors most expected to be significant in explaining the data, since the solution quality is expected to depend on when solution exchanges occur.

Table 9.10 shows ANOVA results for the dependent variable time. The only factor approaching significance, at the 12% level (Sig. = .119), is the solution exchange sequence of the SA solver-agent. This matches expectations, since the run time of SA varies a lot both between iterations within a problem and between problems. When SA takes a solution in any stage, the average elapsed time is about 100 seconds. When it does not take the GA solution, the average elapsed time is about 60 seconds.
Table 9.9 ANOVA - Significant Factors Affecting Deviation From True Solution Value
Source | Type III SS | Deg.F | Mean Sq. | F.Value | Sig.
Corrected Model | 589.795 (a) | 76 | 7.760 | .865 | .689
Intercept | 2098.415 | 1 | 2098.415 | 233.899 | .000
PAYOFF | 26.754 | 3 | 8.918 | .994 | .413
DECISION PROCESS | 16.223 | 4 | 4.056 | .452 | .770
PAYOFF*DECISION PROCESS | 25.528 | 4 | 6.382 | .711 | .593
CATEGORY COOP1*CATEGORY COOP2 | 5.214 | 1 | 5.214 | .581 | .454
CATEGORY TAKEN1*CATEGORY TAKEN2 | 124.516 | 7 | 17.788 | 1.983 | .102
PAYOFF*CATEGORY COOP1*CATEGORY COOP2 | .000 | 0 | . | . | .
PAYOFF*CATEGORY TAKEN1*CATEGORY TAKEN2 | 106.348 | 13 | 8.181 | .912 | .555
DECISION PROCESS*CATEGORY COOP1*CATEGORY COOP2 | .000 | 0 | . | . | .
DECISION PROCESS*CATEGORY TAKEN1*CATEGORY TAKEN2 | 7.187 | 4 | 1.797 | .200 | .936
PAYOFF*DECISION PROCESS*CATEGORY COOP1*CATEGORY COOP2 | .000 | 0 | . | . | .
PAYOFF*DECISION PROCESS*CATEGORY TAKEN1*CATEGORY TAKEN2 | 5.806 | 2 | 2.903 | .324 | .727
PAYOFF*DECISION PROCESS*CATEGORY COOP1*CATEGORY COOP2*CATEGORY TAKEN1*CATEGORY TAKEN2 | .000 | 0 | . | . | .
CATEGORY COOP1 | .041 | 1 | .041 | .005 | .947
CATEGORY COOP2 | 4.731 | 1 | 4.731 | .527 | .475
CATEGORY TAKEN1 | 9.223 | 3 | 3.074 | .343 | .795
CATEGORY TAKEN2 | 11.471 | 3 | 3.824 | .426 | .736
Error | 206.344 | 23 | 8.971 | |
Total | 4238.542 | 100 | | |
Corrected Total | 796.139 | 99 | | |
a. R Squared = .741 (Adjusted R Squared = -.116)
Table 9.10 ANOVA - Significant Factors Affecting Time
Source | Type III SS | Deg.F | Mean Sq. | F.Value | Sig.
Corrected Model | 331363.405 (a) | 76 | 4360.045 | .522 | .981
Intercept | 626245.612 | 1 | 626245.612 | 74.956 | .000
CATEGORY COOP1 | 6.103 | 1 | 6.103 | .001 | .979
CATEGORY COOP2 | 111.092 | 1 | 111.092 | .013 | .909
CATEGORY TAKEN1 | 7246.948 | 3 | 2415.649 | .289 | .833
CATEGORY TAKEN2 | 54448.233 | 3 | 18149.411 | 2.172 | .119
PAYOFF | 2783.043 | 3 | 927.681 | .111 | .953
DECISION PROCESS | 878.624 | 4 | 219.656 | .026 | .999
CATEGORY COOP1*CATEGORY COOP2 | .538 | 1 | .538 | .000 | .994
CATEGORY TAKEN1*CATEGORY TAKEN2 | 50147.857 | 7 | 7163.980 | .857 | .553
PAYOFF*DECISION PROCESS | 794.601 | 4 | 198.650 | .024 | .999
CATEGORY COOP1*CATEGORY COOP2*DECISION PROCESS | .000 | 0 | . | . | .
CATEGORY COOP1*CATEGORY COOP2*PAYOFF | .000 | 0 | . | . | .
CATEGORY TAKEN1*CATEGORY TAKEN2*DECISION PROCESS | 1624.845 | 4 | 406.211 | .049 | .995
CATEGORY TAKEN1*CATEGORY TAKEN2*PAYOFF | 25019.542 | 13 | 1924.580 | .230 | .996
CATEGORY TAKEN1*CATEGORY TAKEN2*PAYOFF*DECISION PROCESS | 786.468 | 2 | 393.234 | .047 | .954
CATEGORY COOP1*CATEGORY COOP2*PAYOFF*DECISION PROCESS | .000 | 0 | . | . | .
CATEGORY COOP1*CATEGORY COOP2*CATEGORY TAKEN1*CATEGORY TAKEN2*PAYOFF*DECISION PROCESS | .000 | 0 | . | . | .
Error | 192161.123 | 23 | 8354.831 | |
Total | 1536492.299 | 100 | | |
Corrected Total | 523524.528 | 99 | | |
a. R Squared = .633 (Adjusted R Squared = -.580)
Table 9.11 Results with Respect to Solution Exchanges
GA Takes SA’s Solution | SA Takes GA’s Solution | Average Deviation | Average Time (sec)
never takes | never takes | - | -
never takes | takes in first stage | 5.51% | 108.8
never takes | takes after second stage | 4.89% | 81.28
never takes | takes in both | 8.01% | 98.8
takes in first stage | never takes | 4.62% | 60.41
takes in first stage | takes in first stage | 6.85% | 105.6
takes in first stage | takes after second stage | 5.18% | 211.16
takes in first stage | takes in both | 6.06% | 71.12
takes after second stage | never takes | 3.79% | 58.24
takes after second stage | takes in first stage | 6.30% | 119.97
takes after second stage | takes after second stage | 7.08% | 93.08
takes after second stage | takes in both | 7.50% | 296.77
takes in both | never takes | 7.41% | 57.16
takes in both | takes in first stage | 3.36% | 140.57
takes in both | takes after second stage | 5.74% | 92.54
takes in both | takes in both | 5.34% | 116.08

Table 9.12 Selection of Best Parameters
Payoff | Characteristic of GA | Characteristic of SA | # of Occur. | Dev. | Time (sec.)
coop-rewarded | cooperative | cooperative | 2 | 3.59% | 61.97
coop-rewarded | cooperative | competitive | 1 | 1.27% | 45.49
coop-rewarded | competitive | cooperative | 1 | 3.33% | 60.49
coop-rewarded | competitive | competitive | 2 | 7.39% | 50.17
coop-rewarded | random | random | - | - | -
comp-rewarded | cooperative | cooperative | - | - | -
comp-rewarded | cooperative | competitive | - | - | -
comp-rewarded | competitive | cooperative | - | - | -
comp-rewarded | competitive | competitive | 1 | 0.65% | 58.09
comp-rewarded | random | random | - | - | -
simple | cooperative | cooperative | - | - | -
simple | cooperative | competitive | - | - | -
simple | competitive | cooperative | 1 | 6.50% | 66.41
simple | competitive | competitive | 1 | 3.20% | 70.54
simple | random | random | - | - | -
time-dependent | cooperative | cooperative | - | - | -
time-dependent | cooperative | competitive | - | - | -
time-dependent | competitive | cooperative | - | - | -
time-dependent | competitive | competitive | 1 | 1.34% | 54.77
time-dependent | random | random | 1 | 3.47% | 60.31
In Table 9.11, the best deviation is found when GA takes SA’s solution both in the first stage and after the second stage and SA takes GA’s solution only in the first stage. The average deviation is 3.36%. However, the average time elapsed to obtain this average deviation is quite high, at 141 seconds. The second best deviation is observed when GA takes SA’s solution after the second stage and SA never takes GA’s solution. The average deviation is 3.79% with an average elapsed time of 58 seconds. From these results, it can be said that the best performance is obtained in this setting, i.e. when GA competes and takes the solution of SA while SA cooperates by offering its own solution and never takes that of GA. Whether this solution exchange setting arises at random is not clear. What is clear is that it occurs quite often. Table 9.12 records some of its occurrences. Amongst these 11 occurrences, the best average deviation is obtained with a competition-rewarded payoff matrix and both agents being competitive. The deviation is 0.65%, which actually comes from only one occurrence, and the time is 58 seconds. This analysis does not show that a competition-rewarded environment in which competitive agents play against each other is the best environment; rather, it shows that if competitive agents play against each other in a competition-rewarded environment and their solution exchange happens to be a one-way benefit to one of the solver-agents, then this might be the best setting.
9.6 Tests and Results

GTMAS is tested on 10 problems from TSPLIB [21]. The results are summarised in Table 9.13, which shows, in sequence, the runs for GA alone, SA alone, and GA and SA together under the framework of GTMAS. For each problem instance, GTMAS selected GA as the best solving agent. When average deviations are compared within problem instances, GTMAS is found to dominate GA: on average, GTMAS finds better solutions than GA on every instance. This is due to GA benefiting from the presence of SA, presumably through a synergistic effect. It is observed that when GA takes the solution of SA, the quality of the overall solution increases considerably. Solution exchange seems to play a critical role in determining the quality of the solution. However, when elapsed times are compared, the average time of GTMAS is almost double that of GA for almost all problems. This rather unfavourable time count is the cost of keeping the SA solver-agent, which is retained because it improves the quality of the overall solution. There are also other overheads that come with the need for coordination, decision making and so on. It should also be noted that the recorded time counts for GTMAS are those of a sequential implementation. The times of a parallel implementation are expected to be significantly lower.
Table 9.13 Results of GTMAS Applied to TSP
Problem | Agent 1 / Agent 2 | Average Deviation | Average Time (sec) | Selected Algorithm
burma14 | GA | 0.70% | 6.95 | -
burma14 | SA | 0.53% | 8.34 | -
burma14 | GA / SA | 0.00% | 13.49 | GA
ulysses16 | GA | 0.27% | 8.26 | -
ulysses16 | SA | 0.17% | 42.28 | -
ulysses16 | GA / SA | 0.00% | 14.71 | GA
ulysses22 | GA | 1.56% | 9.83 | -
ulysses22 | SA | 1.16% | 97.12 | -
ulysses22 | GA / SA | 1.47% | 17.78 | GA
att48 | GA | 4.97% | 41.23 | -
att48 | SA | 31.48% | 10.52 | -
att48 | GA / SA | 4.06% | 129.87 | GA
eil51 | GA | 4.46% | 44.45 | -
eil51 | SA | 18.17% | 423.26 | -
eil51 | GA / SA | 2.72% | 91.58 | GA
berlin52 | GA | 8.67% | 42.99 | -
berlin52 | SA | 36.37% | 11.44 | -
berlin52 | GA / SA | 5.32% | 66.94 | GA
st70 | GA | 11.62% | 66.17 | -
st70 | SA | 24.89% | 232.21 | -
st70 | GA / SA | 7.97% | 143.52 | GA
eil76 | GA | 6.84% | 74.90 | -
eil76 | SA | 33.34% | 1162.02 | -
eil76 | GA / SA | 5.88% | 136.23 | GA
pr76 | GA | 6.25% | 94.02 | -
pr76 | SA | 35.91% | 254.20 | -
pr76 | GA / SA | 5.71% | 140.05 | GA
eil101 | GA | 10.37% | 143.99 | -
eil101 | SA | 50.11% | 220.71 | -
eil101 | GA / SA | 8.58% | 211.98 | GA
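The two comparisons made in Section 9.6 (GTMAS versus GA on deviation and on elapsed time) can be checked directly from the figures of Table 9.13. The short script below is only a convenience sketch: the numbers are copied from the table (the eil76 GTMAS deviation of 5.88 is assumed to be a percentage), and the names `ROWS`, `gtmas_beats_ga` and `time_ratios` are hypothetical.

```python
# (problem, GA deviation %, GA time s, GTMAS deviation %, GTMAS time s)
ROWS = [
    ("burma14", 0.70, 6.95, 0.00, 13.49),
    ("ulysses16", 0.27, 8.26, 0.00, 14.71),
    ("ulysses22", 1.56, 9.83, 1.47, 17.78),
    ("att48", 4.97, 41.23, 4.06, 129.87),
    ("eil51", 4.46, 44.45, 2.72, 91.58),
    ("berlin52", 8.67, 42.99, 5.32, 66.94),
    ("st70", 11.62, 66.17, 7.97, 143.52),
    ("eil76", 6.84, 74.90, 5.88, 136.23),
    ("pr76", 6.25, 94.02, 5.71, 140.05),
    ("eil101", 10.37, 143.99, 8.58, 211.98),
]


def gtmas_beats_ga(rows):
    """True if the GTMAS deviation is lower than the GA deviation on every problem."""
    return all(gtmas_dev < ga_dev for _, ga_dev, _, gtmas_dev, _ in rows)


def time_ratios(rows):
    """Map each problem to (GTMAS time) / (GA time)."""
    return {name: gtmas_t / ga_t for name, _, ga_t, _, gtmas_t in rows}


if __name__ == "__main__":
    print("GTMAS better on every instance:", gtmas_beats_ga(ROWS))
    for name, ratio in time_ratios(ROWS).items():
        print(f"{name:10s} time ratio GTMAS/GA = {ratio:.2f}")
```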
9.7 Conclusion and Further Work

A generic smart solver, GTMAS, has been constructed that combines a multi-agent system architecture and game theory to deal with expensive optimisation problems. Within GTMAS, different algorithms attached to agents play an Iterated Prisoners’ Dilemma type game in which they cooperate to solve the problem and compete over the computing facilities available (here CPU time). In the process, the system finds the most appropriate algorithm for the given problem from a library of available algorithms and solves the problem. It also obtains a better quality approximate
solution than the best algorithm would obtain on its own. This is because of the synergistic effect of the algorithms working together. GTMAS implements an interesting resource allocation process that uses a purpose-built payoff matrix to encourage competition for the available computing resources. Solver-agents are rewarded for good performance by increasing their access to the computing facilities; they are punished for bad performance by reducing their access to the computing facilities. This simple rule guarantees that the computing platform is increasingly dedicated to the most suitable algorithm. In other words, the bulk of the computing platform will eventually be used by the best performing algorithm, which is synonymous with the computing resources being used efficiently.

GTMAS as implemented here involves only two players. The study will benefit from a more extensive investigation with a large number of algorithms. To extend it to n players, the results obtained can be used. The game can be designed such that, given the players A_1, A_2, ..., A_n, pair-wise games are considered and each game is evaluated separately according to the same 2-by-2 payoff matrix introduced here. The solvers that fail in the simultaneous 2-by-2 competitions are eliminated and the tournament continues with the ones that survive. Another approach would be to play the n-by-n game simultaneously, using notions of Nash’s poker game [13], with a specially created n-by-n payoff matrix that would evaluate all agents at once but select the best iteratively. Current and future research directions concern extending the ideas of the GTMAS prototype to a general n-by-n environment which deals with n algorithms, running in parallel, according to one of the two proposed payoff matrices.
References
1. Aldea, A., Alcantra, R.B., Jimenez, L., Moreno, A., Martinez, J., Riano, D.: The scope of application of multi-agent systems in the process industry: Three case studies. Expert Systems with Applications 26, 39–47 (2004)
2. Axelrod, R.: Effective choice in the prisoner’s dilemma. Journal of Conflict Resolution 24(1), 3–25 (1980)
3. Axelrod, R.: More effective choice in the prisoner’s dilemma. Journal of Conflict Resolution 24(3), 379–403 (1980)
4. Axelrod, R.: The Evolution of Cooperation. Basic Books, New York (1984)
5. Axelrod, R.: The evolution of strategies in the iterated prisoners’ dilemma. In: Davis, L. (ed.) Genetic Algorithms and Simulated Annealing, pp. 32–42. Morgan Kaufmann, Los Altos (1987)
6. Axelrod, R., Hamilton, W.D.: The evolution of cooperation. Science 211, 1390–1396 (1981)
7. Binmore, K.: Fun and Games. D.C. Heath, Lexington (1991)
8. Binmore, K.: Playing fair: Game theory and the social contract. MIT Press, Cambridge (1994)
9. Bratman, M.E.: Shared cooperative activity. The Philosophical Review 101(2), 327–341 (1992)
10. Byrd, R.H., Dert, C.L., Rinnooy Kan, A.H.G., Schnabel, R.B.: Concurrent stochastic methods for global optimization. Mathematical Programming 46, 1–30 (1990)
11. Colman, A.M.: Game Theory and Experimental Games. Pergamon Press Ltd., Oxford (1982)
12. Doran, J.E., Franklin, S., Jennings, N.R., Norman, T.J.: On cooperation in multi-agent systems. The Knowledge Engineering Review 12(3), 309–314 (1997)
13. Nash, J.F.: Non-cooperative games. Annals of Mathematics 54(2), 286–295 (1951)
14. Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor (1975)
15. Linster, B.: Essays on Cooperation and Competition. PhD thesis, University of Michigan, Michigan (1990)
16. Luce, R., Raiffa, H.: Games and Decisions. Wiley, New York (1957)
17. Murty, K.G., Kabadi, S.N.: Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming 39, 117–130 (1987)
18. Töreyen, Ö.: A game-theory based multi-agent system for solving complex optimisation problems and a clustering application related to the integration of Turkey into the EU community. M.Sc. Thesis Submitted to the Department of Mathematical Sciences, University of Essex, UK (2008)
19. Park, S., Sugumaran, V.: Designing multi-agent systems: A framework and application. Expert Systems with Applications 28, 259–271 (2005)
20. Rapoport, A., Chammah, A.M.: Prisoner’s Dilemma: A Study in Conflict and Cooperation. University of Michigan Press, Ann Arbor (1965)
21. Reinelt, G.: TSPLIB, http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95
22. Rinnooy Kan, A.H.G., Timmer, G.T.: Stochastic global optimization methods Part I: Clustering methods. Mathematical Programming 39, 27–56 (1987)
23. Rinnooy Kan, A.H.G., Timmer, G.T.: Stochastic global optimization methods Part II: Multi-level methods. Mathematical Programming 39, 57–78 (1987)
24. Rinnooy Kan, A.H.G., Timmer, G.T.: Global optimization. In: Nemhauser, G.L., Rinnooy Kan, A.H.G., Todd, M.J. (eds.) Optimization. Handbooks in Operations Research and Management Science, ch. IX, vol. 1, pp. 631–662. North Holland, Amsterdam (1989)
25. Salhi, A., Glaser, H., De Roure, D.: A genetic approach to understanding cooperative behaviour. In: Osmera, P. (ed.) Proceedings of the 2nd International Mendel Conference on Genetic Algorithms, MENDEL 1996, pp. 129–136 (1996)
26. Salhi, A., Glaser, H., De Roure, D.: Parallel implementation of a genetic-programming based tool for symbolic regression. Information Processing Letters 66(6), 299–307 (1998)
27. Salhi, A., Glaser, H., De Roure, D., Putney, J.: The prisoners’ dilemma revisited. Technical Report DSSE-TR-96-2, Department of Electronics and Computer Science, The University of Southampton, U.K. (February 1996)
28. Salhi, A., Proll, L.G., Rios Insua, D., Martin, J.: Experiences with stochastic algorithms for a class of global optimisation problems. RAIRO Operations Research 34(22), 183–197 (2000)
29. Seshadri, A.: Simulated annealing for travelling salesman problem, http://www.mathworks.com/matlabcentral/fileexchange
30. Tweedale, J., Ichalkaranje, H., Sioutis, C., Jarvis, B., Consoli, A., Phillips-Wren, G.: Innovations in multi-agent systems. Journal of Network and Computer Applications 30, 1089–1115 (2007)
31. Wooldridge, M., Jennings, N.R.: Intelligent agents: Theory and practice. Knowledge Engineering Review 10(2), 115–152 (1995)
32. Zhigljavsky, A.A.: Theory of Global Search. Mathematics and its applications, Soviet Series, vol. 65. Kluwer Academic Publishers, Dordrecht (1991)
Chapter 10
Optimization with Clifford Support Vector Machines and Applications
N. Arana-Daniel, C. López-Franco, and E. Bayro-Corrochano
Abstract. This chapter introduces a generalization of the real- and complex-valued SVMs using the Clifford algebra. In this framework we handle the design of kernels involving the geometric product for linear and nonlinear classification and regression. The major advantage of our approach is that we redefine the optimization variables as multivectors. This allows us to have a multivector as output and, therefore, we can represent multiple classes according to the dimension of the geometric algebra in which we work. By using the CSVM with one Clifford kernel we reduce the complexity of the computation greatly. This can be done thanks to the Clifford product, which performs the direct product between the spaces of different grade involved in the optimization problem. We conduct comparisons between CSVM and the most widely used approaches to multi-class classification to show that ours is more suitable for practical use on certain types of problems. This chapter includes several experiments that show the application of CSVM to classification and regression problems, as well as to 3D object recognition for visually guided robotics. In addition, we show the design of a recurrent system involving an LSTM network connected with a CSVM, and we study the performance of this system with time series experiments and robot navigation using reinforcement learning.
N. Arana-Daniel · C. López-Franco
Computer Science Department, Exact Sciences and Engineering Campus, CUCEI, University of Guadalajara, Av. Revolucion 1500, Col. Olímpica, C.P. 44430, Guadalajara, Jalisco, México
e-mail: {nancy.arana,carlos.lopez}@cucei.udg.mx

E. Bayro-Corrochano
Cinvestav del IPN, Department of Electrical Engineering and Computer Science, Zapopan, Jalisco, México

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 233–262. © Springer-Verlag Berlin Heidelberg 2010, springerlink.com

10.1 Introduction

The Support Vector Machine (SVM) [1, 2, 3, 4] is a powerful optimization algorithm for solving classification and regression problems, but it was originally designed for binary classification. How to extend this algorithm to multi-classification is still an ongoing research issue. Currently there are two main types of approaches for multi-class SVM [6, 7]. One constructs and combines several binary classifiers, while the other directly considers all data in one big optimization problem. The latter approach is computationally more expensive for solving multi-class problems. This is why the authors were motivated to develop an SVM-based algorithm for multi-classification and multi-regression, which furthermore is based on the framework of Clifford (or geometric) algebra [5]. The authors’ hypothesis was that these algebras could be the appropriate mathematical framework in which to develop this algorithm, because Clifford algebras allow us to express in a compact way many geometric entities (which are used to represent multiple classes) and the products between them.

This chapter presents the results obtained from developing the above-mentioned hypothesis: i) the design of the generalization of the real- and complex-valued Support Multi-Vector Machines for classification and regression using the Clifford geometric algebra, which from now onwards will be called Clifford Support Vector Machines (CSVM), ii) the development of Multiple Input Multiple Output (MIMO) CSVM and iii) the application of CSVM as classifiers, regressors and as an important component of a recurrent system. This work is a continuation of an earlier one on the generalization of SVMs [8].
10.2 Geometric Algebra
Let $G_n$ denote the geometric (Clifford) algebra of n dimensions; this is a graded linear space. As well as vector addition and scalar multiplication we have a non-commutative product which is associative and distributive over addition: this is the geometric or Clifford product. A further distinguishing feature of the algebra is that any vector squares to give a scalar. The geometric product of two vectors a and b is written ab and can be expressed as a sum of its symmetric and antisymmetric parts
\[ ab = a \cdot b + a \wedge b, \tag{10.1} \]
where the inner product $a \cdot b$ and the outer product $a \wedge b$ are defined by
\[ a \cdot b = \tfrac{1}{2}(ab + ba), \qquad a \wedge b = \tfrac{1}{2}(ab - ba). \tag{10.2} \]
The inner product of two vectors is the standard scalar or dot product and produces a scalar. The outer or wedge product of two vectors is a new quantity which we call a bivector. We think of a bivector as an oriented area in the plane containing a and b, formed by sweeping a along b. Thus, $b \wedge a$ has the opposite orientation, making the wedge product anti-commutative as given in (10.2). The outer product is immediately generalizable to higher dimensions: for example, $(a \wedge b) \wedge c$, a trivector, is interpreted as the oriented volume formed by sweeping the area $a \wedge b$ along the vector c. The outer product of k vectors is a k-vector or k-blade, and such a quantity is said to have grade k. A multivector $A \in G_n$ is the sum of k-blades of different or equal grade. This linear combination is called homogeneous of grade r ($A = \langle A \rangle_r$) if it contains terms of only a single grade.
10 Optimization with Clifford Support Vector Machines and Applications
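10.2.1 The Geometric Algebra of n-D Space
In an n-dimensional space $V^n$ we can introduce an orthonormal basis of vectors $\{\sigma_i\}$, $i = 1, ..., n$, such that $\sigma_i \cdot \sigma_j = \delta_{ij}$. This leads to a basis for the entire algebra:
\[ 1, \quad \{\sigma_i\}, \quad \{\sigma_i \wedge \sigma_j\}, \quad \{\sigma_i \wedge \sigma_j \wedge \sigma_k\}, \quad \ldots, \quad \sigma_1 \wedge \sigma_2 \wedge \ldots \wedge \sigma_n = I, \tag{10.3} \]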
which spans the entire geometric algebra $G_n$. Here I is the hypervolume called the pseudoscalar, which commutes with all the multivectors and is used as the dualization operator as well. Note that the basis vectors are not represented by bold symbols. Any multivector can be expressed in terms of this basis. Because the addition of k-vectors (homogeneous vectors of grade k) is closed and the multiplication of a k-vector by a scalar is again a k-vector, the set of k-vectors is a vector space, denoted $\bigwedge^k V^n$. Each of these spaces is spanned by $\binom{n}{k}$ k-vectors, where $\binom{n}{k} := \frac{n!}{(n-k)!\,k!}$. Thus, our geometric algebra $G_n$, which is spanned by $\sum_{k=0}^{n} \binom{n}{k} = 2^n$ elements, is a direct sum of its homogeneous subspaces of grades 0, 1, 2, ..., n, that is,
\[ G_n = \textstyle\bigwedge^0 V^n \oplus \bigwedge^1 V^n \oplus \bigwedge^2 V^n \oplus \cdots \oplus \bigwedge^n V^n, \tag{10.4} \]
where $\bigwedge^0 V^n = \mathbb{R}$ is the set of real numbers and $\bigwedge^1 V^n = V^n$ corresponds to the linear n-dimensional vector space. Thus, any multivector of $G_n$ can be expressed in terms of the basis of these subspaces.
In this chapter we will specify a geometric algebra $G_n$ of the n-dimensional space by $G_{p,q,r}$, where p, q and r stand for the number of basis vectors which square to 1, -1 and 0 respectively and fulfill n = p + q + r. Its even subalgebra will be denoted by $G^{+}_{p,q,r}$.
In the n-D space there are multivectors of grade 0 (scalars), grade 1 (vectors), grade 2 (bivectors), grade 3 (trivectors), and so on up to grade n. Any two such multivectors can be multiplied using the geometric product. Consider two multivectors $A_r$ and $B_s$ of grades r and s respectively. The geometric product of $A_r$ and $B_s$ can be written as
\[ A_r B_s = \langle AB \rangle_{r+s} + \langle AB \rangle_{r+s-2} + \ldots + \langle AB \rangle_{|r-s|}, \tag{10.5} \]
where $\langle M \rangle_t$ is used to denote the t-grade part of the multivector M, e.g. consider the geometric product of two vectors $ab = \langle ab \rangle_0 + \langle ab \rangle_2 = a \cdot b + a \wedge b$. Another simple illustration is the geometric product of $A = 4\sigma_3 + 2\sigma_1\sigma_2$ and $b = 8\sigma_2 + 6\sigma_3$:
\[ Ab = 24(\sigma_3)^2 + 16\sigma_1(\sigma_2)^2 + 32\sigma_3\sigma_2 + 12\sigma_1\sigma_2\sigma_3 = 24 + 16\sigma_1 - 32\sigma_2\sigma_3 + 12I. \tag{10.6} \]
Note here that the Clifford product of a basis vector with itself is $\sigma_i \sigma_i = (\sigma_i)^2 = \sigma_i \cdot \sigma_i = 1$, because the wedge product $\sigma_i \wedge \sigma_i = 0$, and that $\sigma_i \sigma_j = \sigma_i \wedge \sigma_j$ for $i \neq j$: the geometric product of different unit basis vectors is equal to their wedge product, whose symbol can be omitted to simplify the notation. Using equation (10.5) we can express the inner and outer products for multivectors as
\[ A_r \cdot B_s = \langle A_r B_s \rangle_{|r-s|}, \tag{10.7} \]
\[ A_r \wedge B_s = \langle A_r B_s \rangle_{r+s}. \tag{10.8} \]
In order to deal with more general multivectors, we define the scalar product
\[ A * B = \langle AB \rangle_0. \tag{10.9} \]
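The worked product in (10.6) can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the Euclidean algebra $G_{3,0,0}$ and stores a multivector as a dictionary from a basis-blade bitmask to its coefficient. The names `reorder_sign`, `gp` and `grade_part` are illustrative choices.

```python
from collections import defaultdict

# Basis blades of G_{3,0,0} as bitmasks (an encoding assumed for this sketch):
# bit 0 = sigma1, bit 1 = sigma2, bit 2 = sigma3.
S1, S2, S3 = 0b001, 0b010, 0b100
S12, S23, I3 = 0b011, 0b110, 0b111


def reorder_sign(a, b):
    """Sign picked up when the blade product a*b is sorted into canonical order."""
    swaps = 0
    a >>= 1
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps % 2 else 1


def gp(A, B):
    """Geometric product of multivectors stored as {blade_bitmask: coefficient}."""
    out = defaultdict(float)
    for blade_a, coef_a in A.items():
        for blade_b, coef_b in B.items():
            # Euclidean metric: a repeated basis vector squares to +1, so it cancels (XOR).
            out[blade_a ^ blade_b] += reorder_sign(blade_a, blade_b) * coef_a * coef_b
    return {blade: c for blade, c in out.items() if c != 0.0}


def grade_part(M, t):
    """Grade-t part <M>_t: keep blades built from exactly t basis vectors."""
    return {blade: c for blade, c in M.items() if bin(blade).count("1") == t}


A = {S3: 4.0, S12: 2.0}   # A = 4 sigma3 + 2 sigma1 sigma2
b = {S2: 8.0, S3: 6.0}    # b = 8 sigma2 + 6 sigma3
Ab = gp(A, b)             # scalar 24, sigma1 16, sigma2sigma3 -32, pseudoscalar 12
```

Here `grade_part(Ab, 0)` is the scalar product $A * b$ of (10.9), and `gp({S1: 1.0}, {S2: 1.0})` and `gp({S2: 1.0}, {S1: 1.0})` differ only in sign, which illustrates the anti-commutativity of orthogonal basis vectors.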
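For an r-grade multivector $A_r = \sum_{i=0}^{r} \langle A_r \rangle_i$, the following operations are defined:
\[ \text{Grade involution:} \quad \hat{A}_r = \sum_{i=0}^{r} (-1)^{i} \langle A_r \rangle_i \tag{10.10} \]
\[ \text{Reversion:} \quad A_r^{\dagger} = \sum_{i=0}^{r} (-1)^{\frac{i(i-1)}{2}} \langle A_r \rangle_i \tag{10.11} \]
\[ \text{Clifford conjugation:} \quad \tilde{A}_r = \hat{A}_r^{\dagger} \tag{10.12} \]
\[ \phantom{\text{Clifford conjugation:} \quad \tilde{A}_r} = \sum_{i=0}^{r} (-1)^{\frac{i(i+1)}{2}} \langle A_r \rangle_i \tag{10.13} \]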
The grade involution simply negates the odd-grade blades of a multivector. The reversion can also be obtained by reversing the order of the basis vectors making up the blades in a multivector and then rearranging them to their original order using the anti-commutativity of the Clifford product. The scalar product $*$ is positive definite, i.e. one can associate with any multivector $A = \langle A \rangle_0 + \langle A \rangle_1 + \ldots + \langle A \rangle_n$ a unique positive scalar magnitude |A| defined by
\[ |A|^2 = A^{\dagger} * A = \sum_{r=0}^{n} |\langle A \rangle_r|^2 \geq 0, \tag{10.14} \]
where |A| = 0 if and only if A = 0. For a homogeneous multivector $A_r$ its magnitude is defined as $|A_r| = \sqrt{A_r^{\dagger} A_r}$. In particular, for an r-vector $A_r$ of the form $A_r = a_1 \wedge a_2 \wedge \ldots \wedge a_r$ we have $A_r^{\dagger} = (a_1 \ldots a_{r-1} a_r)^{\dagger} = a_r a_{r-1} \ldots a_1$ and thus $A_r^{\dagger} A_r = a_1^2 a_2^2 \ldots a_r^2$, so we say that such an r-vector is null if and only if it has a null vector for a factor. If in such a factorization of $A_r$, p, q and s factors square to a positive number, a negative number and zero, respectively, we say that $A_r$ is an r-vector with signature (p, q, s). In particular, if s = 0 such a non-singular r-vector has a multiplicative inverse
\[ A^{-1} = (-1)^q \frac{A^{\dagger}}{|A|^2} = \frac{A}{A^2}. \tag{10.15} \]
In general, the inverse $A^{-1}$ of a multivector A, if it exists, is defined by the equation $A^{-1} A = 1$.
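The three involutions (10.10)-(10.13) and the magnitude (10.14) only depend on the grade of each blade, which makes them easy to sketch in code. The example below is an illustration under the same assumptions as a bitmask encoding of $G_{3,0,0}$ (Euclidean metric, multivectors as `{blade_bitmask: coefficient}` dictionaries); all function names are hypothetical.

```python
from collections import defaultdict


def grade_of(blade):
    """Grade of a basis blade = number of basis vectors in it (set bits)."""
    return bin(blade).count("1")


def reorder_sign(a, b):
    """Sign from sorting the product of two basis blades into canonical order."""
    swaps = 0
    a >>= 1
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps % 2 else 1


def gp(A, B):
    """Geometric product in a Euclidean algebra (every basis vector squares to +1)."""
    out = defaultdict(float)
    for ba, ca in A.items():
        for bb, cb in B.items():
            out[ba ^ bb] += reorder_sign(ba, bb) * ca * cb
    return dict(out)


def grade_involution(A):          # (10.10): sign (-1)^i on grade i
    return {bl: (-1) ** grade_of(bl) * c for bl, c in A.items()}


def reversion(A):                 # (10.11): sign (-1)^(i(i-1)/2)
    return {bl: (-1) ** (grade_of(bl) * (grade_of(bl) - 1) // 2) * c
            for bl, c in A.items()}


def clifford_conjugation(A):      # (10.13): sign (-1)^(i(i+1)/2)
    return {bl: (-1) ** (grade_of(bl) * (grade_of(bl) + 1) // 2) * c
            for bl, c in A.items()}


def magnitude_squared(A):         # (10.14): scalar part of A^dagger A
    return gp(reversion(A), A).get(0, 0.0)


# A = 1 + 2 sigma1 + 3 sigma1 sigma2 + 4 I in G_{3,0,0}
A = {0b000: 1.0, 0b001: 2.0, 0b011: 3.0, 0b111: 4.0}
```

For this A, the reversion flips the signs of the bivector and trivector parts, the Clifford conjugation coincides with `reversion(grade_involution(A))` as stated in (10.12), and `magnitude_squared(A)` equals the sum of the squared coefficients.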
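10.2.2 The Geometric Algebra of 3-D Space
The basis for the geometric algebra $G_{3,0,0}$ of the 3-D space has $2^3 = 8$ elements and is given by:
\[ \underbrace{1}_{\text{scalar}}, \quad \underbrace{\{\sigma_1, \sigma_2, \sigma_3\}}_{\text{vectors}}, \quad \underbrace{\{\sigma_1\sigma_2, \sigma_2\sigma_3, \sigma_3\sigma_1\}}_{\text{bivectors}}, \quad \underbrace{\{\sigma_1\sigma_2\sigma_3\} \equiv I}_{\text{trivector}}. \tag{10.16} \]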
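In $G_{3,0,0}$ a typical multivector v will be of the form $v = \alpha_0 + \alpha_1\sigma_1 + \alpha_2\sigma_2 + \alpha_3\sigma_3 + \alpha_4\sigma_2\sigma_3 + \alpha_5\sigma_3\sigma_1 + \alpha_6\sigma_1\sigma_2 + \alpha_7 I_3 = \langle v \rangle_0 + \langle v \rangle_1 + \langle v \rangle_2 + \langle v \rangle_3$, where the $\alpha_i$ are real numbers and $\langle v \rangle_0 = \alpha_0 \in \bigwedge^0 V^n$, $\langle v \rangle_1 = \alpha_1\sigma_1 + \alpha_2\sigma_2 + \alpha_3\sigma_3 \in \bigwedge^1 V^n$, $\langle v \rangle_2 = \alpha_4\sigma_2\sigma_3 + \alpha_5\sigma_3\sigma_1 + \alpha_6\sigma_1\sigma_2 \in \bigwedge^2 V^n$, $\langle v \rangle_3 = \alpha_7 I_3 \in \bigwedge^3 V^n$.
In geometric algebra a rotor (short name for rotator), R, is an even-grade element of the algebra which satisfies $R\tilde{R} = 1$, where $\tilde{R}$ stands for the conjugate of R. If $A = \{a_0, a_1, a_2, a_3\} \in G_{3,0,0}$ represents a unit quaternion, then the rotor which performs the same rotation is simply given by
\[ R = \underbrace{a_0}_{\text{scalar}} + \underbrace{a_1(I\sigma_1) - a_2(I\sigma_2) + a_3(I\sigma_3)}_{\text{bivectors}} \tag{10.17} \]
\[ \phantom{R} = a_0 + a_1\sigma_2\sigma_3 - a_2\sigma_3\sigma_1 + a_3\sigma_1\sigma_2. \tag{10.18} \]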
The quaternion algebra is therefore seen to be a subset of the geometric algebra of 3-space. The conjugate of a rotor is given by $\tilde{R} = a_0 - a_1\sigma_2\sigma_3 + a_2\sigma_3\sigma_1 - a_3\sigma_1\sigma_2$. The transformation in terms of a rotor, $a \rightarrow R a \tilde{R} = b$, is a very general way of handling rotations; in contrast to quaternion calculus, it works for multivectors of any grade and in spaces of any dimension. Rotors combine in a straightforward manner, i.e. a rotor $R_1$ followed by a rotor $R_2$ is equivalent to a total rotor R where $R = R_2 R_1$.
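The sandwich product $R a \tilde{R}$ and the composition rule $R = R_2 R_1$ can be illustrated with a small sketch. It assumes $G_{3,0,0}$ with a bitmask blade encoding, and it assumes the common parametrization $R = \cos(\theta/2) - \sin(\theta/2)\,\sigma_1\sigma_2$ for a rotation by $\theta$ in the $\sigma_1\sigma_2$ plane; the chapter itself does not fix a parametrization, and the helper names are hypothetical.

```python
import math
from collections import defaultdict

S1, S2, S3, S12 = 0b001, 0b010, 0b100, 0b011   # assumed bitmask encoding of blades


def reorder_sign(a, b):
    """Sign from sorting the product of two basis blades into canonical order."""
    swaps = 0
    a >>= 1
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps % 2 else 1


def gp(A, B):
    """Geometric product in G_{3,0,0} on {blade_bitmask: coefficient} dictionaries."""
    out = defaultdict(float)
    for ba, ca in A.items():
        for bb, cb in B.items():
            out[ba ^ bb] += reorder_sign(ba, bb) * ca * cb
    return dict(out)


def reverse(A):
    """Reversion; for an even-grade rotor this equals its conjugate."""
    def sign(blade):
        k = bin(blade).count("1")
        return (-1) ** (k * (k - 1) // 2)
    return {bl: sign(bl) * c for bl, c in A.items()}


def rotor(theta):
    """Rotor for a rotation by theta in the sigma1-sigma2 plane (assumed convention)."""
    return {0: math.cos(theta / 2), S12: -math.sin(theta / 2)}


def rotate(R, a):
    """Sandwich product R a R~."""
    return gp(gp(R, a), reverse(R))


quarter_turn = rotate(rotor(math.pi / 2), {S1: 1.0})       # sigma1 is carried to sigma2
R_total = gp(rotor(math.pi / 4), rotor(math.pi / 4))       # R = R2 R1, two 45-degree steps
composed = rotate(R_total, {S1: 1.0})                      # same result as quarter_turn
```

Up to floating-point rounding, both `quarter_turn` and `composed` have coefficient 1 on $\sigma_2$ and 0 elsewhere, while a vector along $\sigma_3$ is left unchanged by this rotor.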
10.3 Linear Clifford Support Vector Machines for Classification
For the case of the Clifford SVM for classification we represent the data set in a certain Clifford algebra $G_n$ where n = p + q + r, and where any basis vector squares to 1, -1 or 0 depending on whether it belongs to the p, q or r basis vectors respectively. We consider the general case of an input comprising D multivectors and one multivector output, i.e. each ith data vector has D multivector entries $x_i = [x_{i1}, x_{i2}, ..., x_{iD}]^T$, where $x_{ij} \in G_n$ and D is its dimension. Thus the dimension of the ith vector is $D \times 2^n$, and each data vector $x_i \in G_n^D$. Each of these vectors is associated with one output of the $2^n$ possibilities given by the following multivector output
\[ y_i = y_{is} + y_{i\sigma_1}\sigma_1 + y_{i\sigma_2}\sigma_2 + \ldots + y_{iI} I \in \{\pm 1 \pm \sigma_1 \pm \sigma_2 \ldots \pm I\}, \tag{10.19} \]
where the first subindex s stands for the scalar part. For classification, the CSVM separates these multivector-valued samples into $2^n$ groups by selecting a good enough function from the set of functions
\[ \{ f(x) = w^{\dagger T} x + b \}, \tag{10.20} \]
where $x, w = [w_1, w_2, \ldots, w_D]^T \in G_n^D$ and $f(x), b \in G_n$. An entry of the optimal hyperplane w is given by $w_i = w_{is} + w_{i\sigma_1}\sigma_1 + \ldots + w_{i\sigma_1\sigma_2}\sigma_1\sigma_2 + \ldots + w_{iI} I$. Let us look at the last equation in detail:
\[ f(x) = w^{\dagger T} x + b = [w_1^{\dagger}, w_2^{\dagger}, \ldots, w_D^{\dagger}]\,[x_1, x_2, \ldots, x_D]^T + b = \sum_{i=1}^{D} w_i^{\dagger} x_i + b, \tag{10.21} \]
where $w_i^{\dagger} x_i$ corresponds to the Clifford product of two multivectors and $w_i^{\dagger}$ is the reversion of the multivector $w_i$. Next, we introduce a structural risk functional similar to the real-valued one of the SVM for classification. Using a loss function similar to Vapnik's $\xi$-insensitive one, we arrive at the following linearly constrained quadratic program for the primal problem:
\[ \begin{aligned} \min\ & L(w, b, \xi) = \tfrac{1}{2} w^{\dagger T} w + C \sum_{i,j} \xi_{ij} \\ \text{subject to}\ & y_{ij} (f(x_i))_j = y_{ij} (w^{\dagger T} x_i + b)_j \geq 1 - \xi_{ij}, \\ & \xi_{ij} \geq 0 \ \text{for all } i, j, \end{aligned} \tag{10.22} \]
where the $\xi_{ij}$ are the slack variables, i indexes the data vectors and j indexes the multivector components, i.e. j = 1 for the coefficient of the scalar part, j = 2 for the coefficient of $\sigma_1$, ..., $j = 2^n$ for the coefficient of I. The dual expression of this problem can be derived straightforwardly. First, let us consider the expression for the orientation of the optimal hyperplane,
\[ w = [w_1, w_2, \ldots, w_D]^T, \tag{10.23} \]
where each of the $w_k$ is given by the multivector
\[ w_k = w_{ks} + w_{k\sigma_1}\sigma_1 + \ldots + w_{k\sigma_1\sigma_2}\sigma_1\sigma_2 + \ldots + w_{kI} I. \tag{10.24} \]
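Each component of these weights is computed as follows:
\[ \begin{aligned} w_{ks} &= \sum_{j=1}^{l} \big( (\alpha_s)_j (y_s)_j \big) (x_{ks})_j, \\ w_{k\sigma_1} &= \sum_{j=1}^{l} \big( (\alpha_{\sigma_1})_j (y_{\sigma_1})_j \big) (x_{k\sigma_1})_j, \\ &\ \ \vdots \\ w_{kI} &= \sum_{j=1}^{l} \big( (\alpha_I)_j (y_I)_j \big) (x_{kI})_j, \end{aligned} \tag{10.25} \]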
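where $(\alpha_s)_j, (\alpha_{\sigma_1})_j, \ldots, (\alpha_I)_j$, $j = 1, \ldots, l$, are the Lagrange multipliers. According to the Wolfe dual programming [1] the dual form reads
\[ \min\ \tfrac{1}{2} (w^{\dagger T} w) - \sum_{i,j} \alpha_{ij} \tag{10.26} \]
subject to $a^T \cdot 1 = 0$, where all the Lagrange multipliers should fulfill $0 \leq (\alpha_s)_j \leq C$, $0 \leq (\alpha_{\sigma_1})_j \leq C$, ..., $0 \leq (\alpha_{\sigma_1\sigma_2})_j \leq C$, ..., $0 \leq (\alpha_I)_j \leq C$ for $i = 1, \ldots, D$ and $j = 1, \ldots, l$. In $a^T \cdot 1 = 0$, 1 denotes a vector of all ones. The entries of the vector
\[ a = [a_s, a_{\sigma_1}, a_{\sigma_2}, \ldots, a_{\sigma_1\sigma_2}, \ldots, a_I] \tag{10.27} \]
are given by
\[ \begin{aligned} a_s^T &= [(\alpha_s)_1 (y_s)_1, (\alpha_s)_2 (y_s)_2, \ldots, (\alpha_s)_l (y_s)_l], \\ a_{\sigma_1}^T &= [(\alpha_{\sigma_1})_1 (y_{\sigma_1})_1, (\alpha_{\sigma_1})_2 (y_{\sigma_1})_2, \ldots, (\alpha_{\sigma_1})_l (y_{\sigma_1})_l], \\ &\ \ \vdots \\ a_I^T &= [(\alpha_I)_1 (y_I)_1, (\alpha_I)_2 (y_I)_2, \ldots, (\alpha_I)_l (y_I)_l]; \end{aligned} \tag{10.28} \]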
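note that the vector $a^T$ has the dimension $(l \times 2^n) \times 1$. We require a compact and simple representation of the resulting Gram matrix of the multi-components, which helps in programming the algorithm. To this end, let us first consider the Clifford product $(w^{\dagger T} w)$, which can be expressed as follows:
\[ w^{\dagger T} w = \langle w^{\dagger T} w \rangle_s + \langle w^{\dagger T} w \rangle_{\sigma_1} + \langle w^{\dagger T} w \rangle_{\sigma_2} + \ldots + \langle w^{\dagger T} w \rangle_I. \tag{10.29} \]
Since w has the components presented in (10.25), equation (10.29) can be rewritten as follows:
\[ \begin{aligned} w^{\dagger T} w = {} & a_s^T \langle x^{\dagger T} x \rangle_s a_s + \ldots + a_s^T \langle x^{\dagger T} x \rangle_{\sigma_1\sigma_2} a_{\sigma_1\sigma_2} + \ldots + a_s^T \langle x^{\dagger T} x \rangle_I a_I \\ & + a_{\sigma_1}^T \langle x^{\dagger T} x \rangle_s a_s + \ldots + a_{\sigma_1}^T \langle x^{\dagger T} x \rangle_{\sigma_1\sigma_2} a_{\sigma_1\sigma_2} + \ldots + a_{\sigma_1}^T \langle x^{\dagger T} x \rangle_I a_I \\ & + \ldots \\ & + a_I^T \langle x^{\dagger T} x \rangle_s a_s + a_I^T \langle x^{\dagger T} x \rangle_{\sigma_1} a_{\sigma_1} + \ldots + a_I^T \langle x^{\dagger T} x \rangle_{\sigma_1\sigma_2} a_{\sigma_1\sigma_2} + \ldots + a_I^T \langle x^{\dagger T} x \rangle_I a_I. \end{aligned} \tag{10.30} \]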
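Renaming the matrices of the t-grade parts of $\langle x^{\dagger T} x \rangle_t$, we rewrite the previous equation as:
\[ \begin{aligned} w^{\dagger T} w = {} & a_s^T H_s a_s + a_s^T H_{\sigma_1} a_{\sigma_1} + a_s^T H_{\sigma_1\sigma_2} a_{\sigma_1\sigma_2} + \ldots + a_s^T H_I a_I \\ & + a_{\sigma_1}^T H_s a_s + a_{\sigma_1}^T H_{\sigma_1} a_{\sigma_1} + \ldots + a_{\sigma_1}^T H_{\sigma_1\sigma_2} a_{\sigma_1\sigma_2} + \ldots + a_{\sigma_1}^T H_I a_I \\ & + \ldots \\ & + a_I^T H_s a_s + a_I^T H_{\sigma_1} a_{\sigma_1} + \ldots + a_I^T H_{\sigma_1\sigma_2} a_{\sigma_1\sigma_2} + \ldots + a_I^T H_I a_I. \end{aligned} \tag{10.31} \]
Taking into consideration the previous equations and definitions, the primal equation (10.22) now reads as follows:
\[ \min\ L(w, b, \xi) = \tfrac{1}{2} a^T H a + C \cdot \sum_{i,j} \xi_{ij}. \tag{10.32} \]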
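Using the previous definitions and equations we can define the dual optimization problem as follows:
\[ \begin{aligned} \max\ & a^T 1 - \tfrac{1}{2} a^T H a \\ \text{subject to}\ & 0 \leq (\alpha_s)_j \leq C,\ 0 \leq (\alpha_{\sigma_1})_j \leq C,\ \ldots,\ 0 \leq (\alpha_{\sigma_1\sigma_2})_j \leq C,\ \ldots,\ 0 \leq (\alpha_I)_j \leq C \\ & \text{for } j = 1, \ldots, l, \end{aligned} \tag{10.33} \]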
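where a is given by (10.27) and, again, 1 denotes a vector of all ones. H is a positive semidefinite matrix which is the expected generalized Gram matrix. In terms of the matrices of the t-grade parts of $\langle x^{\dagger T} x \rangle_t$, this matrix is written as follows:
\[ H = \begin{bmatrix} H_s & H_{\sigma_1} & H_{\sigma_2} & \cdots & H_{\sigma_1\sigma_2} & \cdots & H_I \\ H_{\sigma_1}^T & H_s & \cdots & H_{\sigma_4} & \cdots & H_{\sigma_1\sigma_2} & \cdots & H_I & H_s \\ H_{\sigma_2}^T & H_{\sigma_1}^T & H_s & \cdots & H_{\sigma_1\sigma_2} & \cdots & H_I & H_s & H_{\sigma_1} \\ \vdots & & & & & & & & \\ H_I^T & \cdots & H_{\sigma_1\sigma_2}^T & \cdots & H_{\sigma_2}^T & H_{\sigma_1}^T & H_s \end{bmatrix}, \tag{10.34} \]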
note that the diagonal entries are equal to $H_s$ and, since H is a symmetric matrix, the lower matrices are transposed. The optimal weight vector w is as given by (10.23). The threshold $b \in G_n$ can be computed by using the KKT conditions with the Clifford support vectors as follows:
\[ b = b_s + b_{\sigma_1}\sigma_1 + \ldots + b_{\sigma_1\sigma_2}\sigma_1\sigma_2 + \ldots + b_I I = \sum_{j=1}^{l} (y_j - w^{\dagger T} x_j)/l. \tag{10.35} \]
The decision function can be seen as sectors reserved for each involved class, i.e. in the case of complex numbers ($G_{1,0,0}$) or quaternions ($G_{0,2,0}$) we can see that the circle or the sphere is divided by means of spherical vectors. Thus the decision function can be envisaged as
\[ y = \mathrm{csign}_m\big[ f(x) \big] = \mathrm{csign}_m\big[ w^{\dagger T} x + b \big] = \mathrm{csign}_m\Big[ \sum_{j=1}^{l} (\alpha_j \circ y_j)(x_j^{\dagger T} x) + b \Big], \tag{10.36} \]
where $\mathrm{csign}_m[f(x)]$ is the function for detecting the sign of f(x), m stands for the different values which indicate the state valency, e.g. bivalent or tetravalent, and the operation "$\circ$" is defined as
\[ (\alpha_j \circ y_j) = \langle \alpha_j \rangle_0 \langle y_j \rangle_0 + \langle \alpha_j \rangle_1 \langle y_j \rangle_1 \sigma_1 + \ldots + \langle \alpha_j \rangle_{2^n} \langle y_j \rangle_{2^n} I; \tag{10.37} \]
that is, the coefficients of the multivector basis are simply the products of the coefficients of blades of the same grade. For clarity we introduce this operation "$\circ$", which takes place implicitly in the previous equation (10.25). Note that the cases of complex numbers with 2 states (outputs 1 for $-\frac{\pi}{2} \leq \arg(f(x)) < \frac{\pi}{2}$ and -1 for $\frac{\pi}{2} \leq \arg(f(x)) < \frac{3\pi}{2}$) and with 4 states (outputs 1+i for $0 \leq \arg(f(x)) < \frac{\pi}{2}$, -1+i for $\frac{\pi}{2} \leq \arg(f(x)) < \pi$, -1-i for $\pi \leq \arg(f(x)) < \frac{3\pi}{2}$ and 1-i for $\frac{3\pi}{2} \leq \arg(f(x)) < 2\pi$) can be solved by the multi-class real-valued SVM; however, for higher representations like the 16-state case using quaternions, it would be awkward to resort to multi-class real-valued SVMs. The major advantage of our approach is that we redefine the optimization vector variables as multivectors. This allows us to use the components of the multivector output to represent different classes. The number of achievable class outputs is directly proportional to the dimension of the involved geometric algebra. The key idea for solving multi-class classification in the geometric algebra is to prevent the multivector elements of different grade from collapsing into a scalar; this is possible thanks to the redefinition of the primal problem (10.22) using the Clifford product instead of the inner product. The reader should bear in mind that the Clifford product performs the direct product between the spaces of different grade and its result is represented by a multivector; thus the outputs of the CSVM are represented by $y = y_s + y_{\sigma_1} + y_{\sigma_2} + \ldots + y_I \in \{\pm 1 \pm \sigma_1 \pm \sigma_2 \ldots \pm I\}$.
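To make the role of the multivector output concrete, the sketch below reads the class off the signs of the components of f(x). It is an illustration only: it assumes that $\mathrm{csign}_m$ acts component-wise, which reproduces the quadrant rule for the 4-state complex case above, and the mapping from a sign pattern to an integer class index (`class_index`) is a hypothetical encoding, not one given in the chapter.

```python
def csign(f):
    """Component-wise sign of a multivector given as a list of coefficients.

    Assumption of this sketch: a zero coefficient is assigned to +1.
    """
    return [1 if c >= 0 else -1 for c in f]


def class_index(signs):
    """Hypothetical encoding: read the sign pattern as a binary number (+1 -> 1, -1 -> 0)."""
    index = 0
    for s in signs:
        index = (index << 1) | (1 if s > 0 else 0)
    return index


# Complex 4-state case: f = 0.3 - 2.0 i has its argument in [3*pi/2, 2*pi),
# so the output is 1 - i.
complex_output = csign([0.3, -2.0])

# Quaternion case: four components give 2**4 = 16 sign patterns.
quaternion_output = csign([0.5, -1.0, 2.0, -0.1])
quaternion_class = class_index(quaternion_output)
```

With four components the sign patterns enumerate 16 distinct outputs, which matches the 16-state quaternion case mentioned in the text.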
10.4 Non Linear Clifford Support Vector Machines for Classification
For nonlinear Clifford-valued classification problems we require a Clifford-valued kernel K(x, y). In order to fulfill the Mercer theorem we resort to a component-wise Clifford-valued mapping
\[ x \in G_n \xrightarrow{\ \phi\ } \Phi(x) = \Phi_s(x) + \Phi_{\sigma_1}(x)\sigma_1 + \Phi_{\sigma_2}(x)\sigma_2 + \ldots + I\,\Phi_I(x) \in G_n. \]
In general we build a Clifford kernel $K(x_m, x_j)$ by taking the Clifford product between the reversion of $x_m$ and $x_j$ as follows:
\[ K(x_m, x_j) = \Phi(x_m)^{\dagger} \Phi(x_j); \tag{10.38} \]
note that the kind of reversion operation $(\cdot)^{\dagger}$ of a multivector depends on the signature of the involved geometric algebra $G_{p,q,r}$. Next, as an illustration, we present kernels using different geometric algebras. According to the Mercer theorem, there exists a mapping $u : G \rightarrow F$, which maps the multivectors $x \in G_n$ into the complex Euclidean space, $x \xrightarrow{\ u\ } u_r(x) + I u_I(x)$.
Complex-valued linear kernel function in $G_{1,0,0}$ (the center of this geometric algebra, i.e. s, $I = \sigma_1\sigma_2$, is isomorphic to $\mathbb{C}$):
\[ \begin{aligned} K(x_m, x_n) &= u(x_m)^{\dagger} u(x_n) \\ &= \big( u(x_m)_s u(x_n)_s + u(x_m)_I u(x_n)_I \big) + I \big( u(x_m)_s u(x_n)_I - u(x_m)_I u(x_n)_s \big) \\ &= \big( k(x_m, x_n)_{ss} + k(x_m, x_n)_{II} \big) + I \big( k(x_m, x_n)_{Is} - k(x_m, x_n)_{sI} \big) \\ &= H_r + I H_i, \end{aligned} \tag{10.39} \]
where $(x_s)_m$, $(x_s)_n$, $(x_I)_m$, $(x_I)_n$ are vectors of the individual components of the complex numbers $(x)_m = (x_s)_m + I(x_I)_m \in G_{1,0,0}$ and $(x)_n = (x_s)_n + I(x_I)_n \in G_{1,0,0}$ respectively.
For the quaternion-valued Gabor kernel function, we use $i = \sigma_2\sigma_3$, $j = -\sigma_3\sigma_1$, $k = \sigma_1\sigma_2$. The Gaussian window Gabor kernel function reads
\[ K(x_m, x_n) = g(x_m, x_n) \exp\big( -i\, w_0^T (x_m - x_n) \big), \tag{10.40} \]
where the normalized Gaussian window function is given by
\[ g(x_m, x_n) = \frac{1}{\sqrt{2\pi}\,\rho} \exp\Big( -\frac{\|x_m - x_n\|^2}{2\rho^2} \Big) \tag{10.41} \]
and the variables $w_0$ and $x_m - x_n$ stand for the frequency and space domains respectively. Unlike the Hartley transform or the 2D complex Fourier transform, this kernel function nicely separates the even and odd components of the involved signal, i.e.
\[ \begin{aligned} K(x_m, x_n) = {} & K(x_m, x_n)_s + K(x_m, x_n)_{\sigma_2\sigma_3} + K(x_m, x_n)_{\sigma_3\sigma_1} + K(x_m, x_n)_{\sigma_1\sigma_2} \\ = {} & g(x_m, x_n) \cos(w_0^T x_m) \cos(w_0^T x_m) + g(x_m, x_n) \cos(w_0^T x_m) \sin(w_0^T x_m)\, i \\ & + g(x_m, x_n) \sin(w_0^T x_m) \cos(w_0^T x_m)\, j + g(x_m, x_n) \sin(w_0^T x_m) \sin(w_0^T x_m)\, k. \end{aligned} \]
Since $g(x_m, x_n)$ fulfills Mercer's condition, it is straightforward to prove that the $k(x_m, x_n)_u$ in the above equations satisfy these conditions as well. Having defined these kernels, we can proceed with the formulation of the SVM conditions. We substitute the mapped data $\Phi(x) = \sum_{u=1}^{2^n} \langle \Phi(x) \rangle_u$ into the linear function $f(x) = w^{\dagger T} x + b = w^{\dagger T} \Phi(x) + b$. The problem can be stated similarly as in (10.22)-(10.26). In fact, we can substitute the kernel function into (10.33) to accomplish the Wolfe dual programming and thereby obtain the kernel function group for nonlinear classification:
\[ \begin{aligned} H_s &= [K_s(x_m, x_j)]_{m,j=1,\ldots,l}, \\ H_{\sigma_1} &= [K_{\sigma_1}(x_m, x_j)]_{m,j=1,\ldots,l}, \\ &\ \ \vdots \\ H_{\sigma_n} &= [K_{\sigma_n}(x_m, x_j)]_{m,j=1,\ldots,l}, \\ &\ \ \vdots \\ H_I &= [K_I(x_m, x_j)]_{m,j=1,\ldots,l}. \end{aligned} \tag{10.42} \]
In the same way we use the kernel functions to replace the dot product of the input data in (10.36). In general the output function of the nonlinear Clifford SVM reads
\[ y = \mathrm{csign}_m\big[ f(x) \big] = \mathrm{csign}_m\big[ w^{\dagger T} \Phi(x) + b \big], \tag{10.43} \]
where m stands for the state valency.
10.5 Clifford SVM for Regression
The representation of the data set for the case of Clifford SVM for regression is the same as for Clifford SVM for classification; we represent the data set in a certain Clifford algebra $G_n$. Each ith data vector has multivector entries $x_i = [x_{i1}, x_{i2}, ..., x_{iD}]^T$, where $x_{ij} \in G_n$ and D is its dimension. Let $(x_1, y_1), (x_2, y_2), ..., (x_j, y_j), ..., (x_l, y_l)$ be the training set of independently and identically distributed multivector-valued sample pairs, where each label is $y_i = y_{is} + y_{i\sigma_1}\sigma_1 + y_{i\sigma_2}\sigma_2 + \ldots + y_{iI} I$, and the first subindex s stands for the scalar part. The regression problem using multivectors is to find a multivector-valued function f(x) that deviates by at most $\varepsilon$ from the actually obtained targets $y_i \in G_n$ for all the training data and, at the same time, is as flat as possible. We will use a multivector-valued $\varepsilon$-insensitive loss function and arrive at the formulation of Vapnik [1]:
\[ \begin{aligned} \min\ & \tfrac{1}{2} w^{\dagger T} w + C \cdot \sum_{i,j} (\xi_{ij} + \tilde{\xi}_{ij}) \\ \text{subject to}\ & (y_i - w^{\dagger T} x_i - b)_j \leq (\varepsilon + \xi_{ij}), \\ & (w^{\dagger T} x_i + b - y_i)_j \leq (\varepsilon + \tilde{\xi}_{ij}), \\ & \xi_{ij} \geq 0,\ \tilde{\xi}_{ij} \geq 0 \ \text{for all } i, j, \end{aligned} \tag{10.44} \]
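where $w, x \in G_n^D$, and $(\cdot)_j$ extracts the scalar coefficient accompanying a multivector basis element. Next we proceed as in Section 10.3, since the expression of the orientation of the optimal hyperplane is the same as in (10.23), and each of the $w_i$ is computed as follows:
\[ \begin{aligned} w_s &= \sum_{j=1}^{l} \big( (\alpha_s)_j - (\tilde{\alpha}_s)_j \big) (x_s)_j, \\ w_{\sigma_1} &= \sum_{j=1}^{l} \big( (\alpha_{\sigma_1})_j - (\tilde{\alpha}_{\sigma_1})_j \big) (x_{\sigma_1})_j, \\ &\ \ \vdots \\ w_I &= \sum_{j=1}^{l} \big( (\alpha_I)_j - (\tilde{\alpha}_I)_j \big) (x_I)_j. \end{aligned} \]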
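We can now redefine the entries of the vector in (10.27); these are given by
\[ \begin{aligned} a_s^T &= [(\alpha_{s1} - \tilde{\alpha}_{s1}), (\alpha_{s2} - \tilde{\alpha}_{s2}), \ldots, (\alpha_{sl} - \tilde{\alpha}_{sl})], \\ a_{\sigma_1}^T &= [(\alpha_{\sigma_1 1} - \tilde{\alpha}_{\sigma_1 1}), (\alpha_{\sigma_1 2} - \tilde{\alpha}_{\sigma_1 2}), \ldots, (\alpha_{\sigma_1 l} - \tilde{\alpha}_{\sigma_1 l})], \\ &\ \ \vdots \\ a_I^T &= [(\alpha_{I1} - \tilde{\alpha}_{I1}), (\alpha_{I2} - \tilde{\alpha}_{I2}), \ldots, (\alpha_{Il} - \tilde{\alpha}_{Il})]. \end{aligned} \tag{10.45} \]
Now we can rewrite the Clifford product, as we did in (10.29)-(10.31), to get the primal problem as follows:
\[ \begin{aligned} \min\ & \tfrac{1}{2} a^T H a + C \cdot \sum_{i=1}^{l} (\xi + \tilde{\xi}) \\ \text{subject to}\ & (y - w^{\dagger} x - b)_j \leq (\varepsilon + \xi)_j, \\ & (w^{\dagger} x + b - y)_j \leq (\varepsilon + \tilde{\xi})_j, \\ & \xi_{ij} \geq 0,\ \tilde{\xi}_{ij} \geq 0 \ \text{for all } i, j. \end{aligned} \]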
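Thereafter, we straightforwardly write the dual (10.46) for solving the regression problem:
\[ \begin{aligned} \max\ & -\tilde{\alpha}^T (\varepsilon - y) - \alpha^T (\varepsilon + y) - \tfrac{1}{2} a^T H a \\ \text{subject to}\ & \sum_{j=1}^{l} (\alpha_{sj} - \tilde{\alpha}_{sj}) = 0,\ \sum_{j=1}^{l} (\alpha_{\sigma_1 j} - \tilde{\alpha}_{\sigma_1 j}) = 0,\ \ldots,\ \sum_{j=1}^{l} (\alpha_{Ij} - \tilde{\alpha}_{Ij}) = 0, \\ & 0 \leq (\alpha_s)_j \leq C,\ 0 \leq (\alpha_{\sigma_1})_j \leq C,\ \ldots,\ 0 \leq (\alpha_{\sigma_1\sigma_2})_j \leq C,\ \ldots,\ 0 \leq (\alpha_I)_j \leq C, \\ & 0 \leq (\tilde{\alpha}_s)_j \leq C,\ 0 \leq (\tilde{\alpha}_{\sigma_1})_j \leq C,\ \ldots,\ 0 \leq (\tilde{\alpha}_{\sigma_1\sigma_2})_j \leq C,\ \ldots,\ 0 \leq (\tilde{\alpha}_I)_j \leq C, \\ & j = 1, \ldots, l. \end{aligned} \tag{10.46} \]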
For nonlinear regression, as explained in Section 10.4, we use a particular kernel for computing $k(x_m, x_j) = \tilde{\Phi}(x_m) \Phi(x_j)$; again, the kind of conjugation operation applied to a multivector depends on the signature of the involved geometric algebra $G_{p,q,r}$. We can use the kernels described in Section 10.4.
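The constraints in (10.44) apply the $\varepsilon$-insensitive idea to every multivector component separately. A minimal sketch of the resulting per-component slack is given below; it assumes that targets and predictions are available as lists of coefficients, and the function names are illustrative rather than taken from the chapter.

```python
def eps_insensitive_slack(target, prediction, eps):
    """Slack needed per multivector component: max(0, |y_j - f_j| - eps)."""
    return [max(0.0, abs(y - f) - eps) for y, f in zip(target, prediction)]


def total_penalty(targets, predictions, eps, C):
    """C times the sum of all slacks over samples i and components j."""
    return C * sum(sum(eps_insensitive_slack(y, f, eps))
                   for y, f in zip(targets, predictions))


# One quaternion-valued sample: only the second component leaves the eps-tube.
slack = eps_insensitive_slack([1.0, 0.0, -2.0, 0.5], [1.05, 0.4, -2.0, 0.5], eps=0.1)
```

Components whose deviation stays within $\varepsilon$ contribute nothing, so only the second entry of `slack` is non-zero here (about 0.3).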
10.6 Recurrent Clifford SVM
SVMs are very powerful for solving regression and classification tasks. They carry out predictions by linearly combining kernel basis functions. By mapping the input feature space to a higher dimensional space, SVMs can linearly separate clusters by means of an optimal hyperplane. A rather limited way to apply existing SVMs to sequence prediction [? ? ] or classification [12] is to build a training set either by transforming the sequential input to an input vector of some static domain (e.g., a frequency or phase representation, a Hidden Markov Model (HMM) [13, 14]), or by simple frequency counting of patterns, symbols or substrings, or by taking fixed time windows of k sequential values [10]. The window-based approaches, of course, fail if the temporal dependency exceeds the length of k steps. As for training HMMs with long sequences, unfortunately they exhibit numerous local minima [15, 16]. Suykens and Vandewalle [17] incorporate the dynamic equations in the primal problem for an SVM solution. The major disadvantage of this approach is that the problem is no longer convex, thus there is no guarantee of finding a global optimum. In all these attempts there has not been a recurrent SVM which learns tasks involving time lags of arbitrary length between important input events. However, a pioneering attempt using a real-valued SVM and neuroevolution for sequence prediction was made by Schmidhuber et al. [18]. Unfortunately, at present the research activity on recurrent SVMs is very scarce. We started to explore a way to build a CSVM-based recurrent system which profits from all the advantages of the CSVM, namely it helps to maintain convexity, it is MIMO and it is suited to processing sequences with geometric characteristics. In order to do that, we decided to connect two processing modules in cascade: a Long Short-Term Memory (LSTM) [20] and a CSVM.
LSTM-CSVM is an Evolino- and Evoke-based system [18, 19]: the underlying idea of these systems is that two cascaded modules are needed, namely a robust module to process short- and long-time dependencies (LSTM) and an optimization module to produce precise outputs (CSVM, Moore-Penrose pseudo-inverse method, and SVM respectively). The LSTM module addresses the disadvantage of having relevant pieces of information outside the history window and also avoids the problem of the "vanishing error" presented by algorithms like Back-Propagation Through Time (BPTT, e.g., Williams and Zipser 1992) or Real-Time Recurrent Learning (RTRL, e.g., Robinson and Fallside 1987) (see footnote 1). Meanwhile, the CSVM maps the internal activations of the first module to a set of precise outputs; again, the multivector output representation is exploited to implement a system with fewer processing units and therefore lower computational complexity. LSTM-CSVM works as follows: a sequence of input vectors (u(0)...u(t)) is given to the LSTM, which in turn feeds the CSVM with the outputs of each of its memory cells, see Fig. 10.1.
Fig. 10.1 LSTM-CSVM system
The CSVM aims at finding the expected nonlinear mapping of the training data. The input and output equations of Figure 10.1 are
\[ \begin{aligned} \phi(t) &= f(W, u(t), u(t-1), \ldots, u(0)), \\ y(t) &= b + \sum_{i=1}^{k} w_i K(\phi(t), \phi_i(t)), \end{aligned} \tag{10.47} \]
where $\phi(t) = [\psi_1, \psi_2, ..., \psi_n]^T \in \mathbb{R}^n$ is the activation at time t of the n units of the LSTM, given the input vectors (u(0)...u(t)) and the weight matrix W; this activation serves as input to the CSVM. Since the LSTM is a recurrent net, the argument of the function f(.) represents the history of the input vectors.
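The output equation in (10.47) is a kernel expansion over stored activation vectors. The sketch below shows one way to evaluate it when the weights $w_i$ and the bias b are multivectors stored as coefficient lists. The Gaussian kernel `rbf` is a placeholder chosen for this example, since the chapter does not state which kernel is used in the recurrent system, and all names are hypothetical.

```python
import math


def rbf(u, v, sigma=1.0):
    """Placeholder real-valued kernel (an assumption of this sketch)."""
    return math.exp(-sigma * sum((a - b) ** 2 for a, b in zip(u, v)))


def csvm_output(phi_t, support_activations, weights, bias, kernel=rbf):
    """Evaluate y(t) = b + sum_i w_i K(phi(t), phi_i) component by component.

    phi_t               : LSTM activation vector at time t
    support_activations : stored activation vectors phi_i
    weights             : one multivector (coefficient list) per phi_i
    bias                : multivector b (coefficient list)
    """
    y = list(bias)
    for phi_i, w_i in zip(support_activations, weights):
        k = kernel(phi_t, phi_i)
        for component, w in enumerate(w_i):
            y[component] += k * w
    return y


# With a single stored activation equal to phi(t), the kernel value is 1
# and the output is simply b + w_1.
y_t = csvm_output([0.2, -0.1], [[0.2, -0.1]], [[2.0, 0.0, -1.0, 0.5]], [0.5, 0.5, 0.5, 0.5])
```

Each entry of `y_t` corresponds to one component of the multivector output, which is how a single CSVM can deliver several outputs at once.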
Footnote 1: The reader can find more information about the vanishing error of BPTT and RTRL versus the constant error flow of LSTM in [20].
First, the LSTM-CSVM system was trained using the conventional algorithm for the LSTM. Although the system learns, it unfortunately takes too long to find a suitable matrix W. Instead, while propagating the training data through the LSTM-CSVM system, we evolved the rows of the matrix using the evolutionary algorithm known as Enforced Sub-Populations (ESP) [21]. This approach differs from the standard methods because, instead of evolving the complete set of net parameters, it evolves subpopulations of the LSTM memory cells. For the mutation of the chromosomes, ESP uses the Cauchy density function.
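The mutation step mentioned above can be sketched as follows. This is only an illustration of Cauchy-distributed mutation of a weight chromosome, not the ESP algorithm of [21] itself: the mutation rate, the scale and the function names are assumptions, and the subpopulation bookkeeping of ESP is omitted.

```python
import math
import random


def cauchy_sample(rng, scale=1.0):
    """Draw from a zero-centred Cauchy density via its inverse CDF."""
    return scale * math.tan(math.pi * (rng.random() - 0.5))


def mutate(chromosome, rng, rate=0.1, scale=0.3):
    """Return a copy in which each gene is perturbed with probability `rate`."""
    return [gene + cauchy_sample(rng, scale) if rng.random() < rate else gene
            for gene in chromosome]


rng = random.Random(0)                 # fixed seed so the sketch is reproducible
parent = [0.0] * 8                     # e.g. the weights of one LSTM memory cell
child = mutate(parent, rng, rate=0.5)
```

The heavy tails of the Cauchy density occasionally produce large jumps, which is the usual motivation for preferring it over Gaussian noise in this kind of search.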
10.7 Applications
In this section we present five interesting experiments. The first one shows multi-class classification using CSVM with a simulated example. Here we also present the number of variables computed per approach and a time comparison between CSVM and three approaches to multi-class classification using real-valued SVMs. The second is about multi-class object classification with two types of training data: Phase a) artificial data and Phase b) real data obtained from a stereo vision system. We also compare the CSVM against MLPs (for multi-class classification). The third experiment presents a multi-class interpolation. The fourth and fifth include the experimental analysis of the recurrent CSVM.
10.7.1 3D Spiral: Nonlinear Classification Problem
We extended the well-known 2-D spiral problem to 3-D space. This experiment tests whether the CSVM is able to separate five 1-D manifolds embedded in $\mathbb{R}^3$. In this application we used a quaternion-valued CSVM which works in $G_{0,2,0}$ (see footnote 2); this allows us to have quaternion inputs and outputs, and therefore with one output quaternion we can represent up to $2^4$ classes. The functions were generated as follows, for $i = 1, \ldots, 5$:
\[ f_i(t) = [x_i(t), y_i(t), z_i(t)] = [z_i \cos(\theta) \sin(\theta),\ z_i \sin(\theta) \sin(\theta),\ z_i \cos(\theta)]. \]
To depict these vectors they were normalized by 10. In Fig. 10.2 one can see that the problem is highly nonlinearly separable. The CSVM uses 50 input quaternions from each of the five functions for training; since these have three coordinates we simply use the bivector part of the quaternion, namely $x_i = x_i(t)\sigma_2\sigma_3 + y_i(t)\sigma_3\sigma_1 + z_i(t)\sigma_1\sigma_2 \equiv [0, x_i(t), y_i(t), z_i(t)]$. The CSVM used the kernel given by (10.42). Note that the CSVM indeed manages to separate the five classes.
Fig. 10.2 3D spiral with five classes. The marks represent the support multivectors found by the CSVM
Footnote 2: The dimension of this geometric algebra is $2^2 = 4$.
10.7.1.1 Comparisons Using 3D Spiral Example
According to [22] the most used methods for multi-class classification are: one-against-all [23], one-against-one [24], DAGSVM [26], and some methods that solve the multi-class problem in one step, known as all-together methods [27]. Table 10.1 compares the number of variables computed per approach, including CSVM. The experiments shown in [22] indicate that "one-against-one and DAG methods are more suitable for practical use than the other methods"; from these methods we have chosen one-against-one, together with the earliest implementation for SVM multi-class classification, the one-against-all approach, for comparison with our proposed CSVM. The comparisons were made using the 3D spiral toy example and the quaternion CSVM shown in the previous subsection. The number of classes was increased in each experiment; we started with K = 3 classes and 50 training inputs for each class. Since the training inputs have three coordinates we simply use the bivector part of the quaternion for the CSVM approach, namely $x_i = x_i(t)\sigma_2\sigma_3 + y_i(t)\sigma_3\sigma_1 + z_i(t)\sigma_1\sigma_2 \equiv [0, x_i(t), y_i(t), z_i(t)]$, therefore CSVM computes D * N = 3 * 150 = 450 variables. The approaches one-against-all and one-against-one compute 450 and 300 variables respectively; however, the training times of CSVM and one-against-one are very similar in the first experiment. Note that when we increase the number of classes the performance of CSVM is much better than that of the other approaches because the number of variables to compute is greatly reduced. We improved the computational efficiency of all these algorithms by using the decomposition method [28] and the shrinking technique [29]. We can see in Table 10.2 that the CSVM, using a quarter of the variables, is still faster, with around a quarter of the processing time of the other approaches. The classification performance of the four approaches is presented in Table 10.3. We used 50 and 20 vectors per class during training and test respectively.
We can see that the CSVM for classification has overall the best performance.
Table 10.1 Number of variables per approach

Approach                                    NQP         NVQP     TNV
CSVM                                        1           D*N      D*N
One-against-all                             K           N        K*N
One-against-one                             K(K-1)/2    2*N/K    N(K-1)
DAGSVM                                      K(K-1)/2    2*N/K    N(K-1)
A method by considering all data at once    1           K*N      K*N

NQP: number of quadratic problems to solve. NVQP: number of variables to compute per quadratic problem. TNV: total number of variables. D: training input data dimension. N: total number of training examples. K: number of classes.
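The totals in Table 10.1 are simple to evaluate for a given experiment. The helper below is a small sketch (the function name and dictionary keys are illustrative) that returns the total number of variables per approach, so that the figures quoted for the spiral experiments can be re-derived.

```python
def total_variables(K, N, D):
    """Total number of variables (TNV column of Table 10.1) per approach.

    K: number of classes, N: total number of training examples,
    D: training input data dimension.
    """
    return {
        "CSVM": D * N,
        "one-against-all": K * N,
        # K(K-1)/2 problems with 2N/K variables each give N(K-1) in total.
        "one-against-one": N * (K - 1),
        "DAGSVM": N * (K - 1),
        "all-at-once": K * N,
    }


first_experiment = total_variables(K=3, N=150, D=3)
second_experiment = total_variables(K=5, N=250, D=3)
```

For K = 3 this gives 450, 450 and 300 variables for CSVM, one-against-all and one-against-one, as stated in the text. For K = 16 and N = 800 the formula yields 3200 CSVM variables only with D = 4, which is the value listed in Table 10.2.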
Table 10.2 Time training per approach (seconds)
Approach
K=3, N=150 K=5, N=250 K=16, N=800 (Variables) (Variables) (Variables) CSVM 0.07 0.987 10.07 C=1000 (450) (750) (3200) One-against-all 0.11 8.54 131.24 (C, σ )=(1000,2−3) (450) (1250) (12800) One-against-one 0.09 2.31 30.86 (C, σ )=(1000,2−2) (300) (1000) (12000) DAGSVM 0.10 3.98 38.88 (C, σ )=(1000,2−3) (300) (1000) (12000) K Number of classes, N Number of training examples (50 each class) Used kernels K(xi , x j ) = e−σ ||xi −x j with parameters taken from σ = {2, 20 , 2−1 , 2−2 , 2−3 } and costs C={1,10,100,1000,10000}. From these 5 × 5 combinations, the best result was selected for each approach.
Table 10.3 Percent of accuracy in training and test

Approach (parameters)                    K=3              K=5              K=16
                                         Ntrain=150       Ntrain=250       Ntrain=800
                                         Ntest=60         Ntest=100        Ntest=320
CSVM (C=1000)                            98.66 (95.00)    99.2 (98.00)     99.87 (99.68)
One-against-all ((C, σ)=(1000, 2^-3))    96.00 (90.00)    98.00 (96.00)    99.75 (99.06)
One-against-one ((C, σ)=(1000, 2^-2))    98.00 (95.00)    98.4 (99.00)     99.87 (99.375)
DAGSVM ((C, σ)=(1000, 2^-3))             97.33 (95.00)    98.4 (97)        99.87 (99.68)

K: number of classes. Ntrain: total number of training vectors. Ntest: number of test vectors.
Each cell gives the percent of accuracy in the training phase, followed in brackets by the percent of accuracy in the test phase.
Used kernels $K(x_i, x_j) = e^{-\sigma \|x_i - x_j\|}$ with parameters taken from $\sigma = \{2, 2^{0}, 2^{-1}, 2^{-2}, 2^{-3}\}$ and costs C = {1, 10, 100, 1000, 10000}. From these 5 × 5 combinations, the best result was selected for each approach.
10.7.2 Object Recognition

In this subsection we show an application of the Clifford SVM to multi-class object classification. In these experiments we want to use only one CSVM with a quaternion as input and a quaternion as output, which allows up to $2^4 = 16$ classes. Basically, we packed into a feature quaternion one 3D point (which lies on the surface of the object) and the magnitude of the distance between this point and the point that lies on the main axis of the object in the same level curve. Fig. 10.3 depicts the four features taken from the object:

$$X_i = \delta_i s + x_i \sigma_2\sigma_3 + y_i \sigma_3\sigma_1 + z_i \sigma_1\sigma_2 \equiv [\delta_i, (x_i, y_i, z_i)]^T. \qquad (10.48)$$

For each object we trained the CSVM using a set of several feature quaternions obtained from different level curves; that is, each object is represented by several feature quaternions rather than only one. Because of this training scheme, the order in which the feature quaternions are presented to the CSVM is important: we sample data from the bottom to the top of each object and present the training and test data to the CSVM in this order. We processed the outputs using a counter that computes which class fires the most for each training or test set in order to decide which class the object belongs to (see Fig. 10.4). Note that this experiment is a challenge for any recognition algorithm because the feature signature is sparse. We will show later that, with this kind of feature vector, the CSVM outperforms the MLP. Of course, if more effort is spent on improving the quality of the feature signature, the CSVM's performance should increase accordingly.

Fig. 10.3 Geometric characteristics of one training object (panels a and b). The magnitude $\delta_i$ and the 3D coordinates $(x_i, y_i, z_i)$ are used to build the feature vector $[\delta_i, (x_i, y_i, z_i)]$
Fig. 10.4 After we get the outputs, these are accumulated using a counter to determine which class the object belongs to (diagram blocks: inputs, CSVM, outputs, counter, winner class)
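The feature packing of equation (10.48) and the counter-based decision of Fig. 10.4 can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the axis point on the same level curve is passed in explicitly because the chapter does not say how it is computed, and the per-quaternion class labels are assumed to have already been decoded from the CSVM outputs.

```python
import math
from collections import Counter


def feature_quaternion(surface_point, axis_point):
    """Pack [delta, x, y, z] as in Eq. (10.48).

    delta is the distance between the surface point and the point on the
    main axis of the object in the same level curve (axis_point is an
    assumed input here).
    """
    delta = math.dist(surface_point, axis_point)
    x, y, z = surface_point
    return (delta, x, y, z)


def winner_class(fired_classes):
    """Return the class that fires most often over all feature quaternions
    of one object. Ties go to the class that was seen first."""
    return Counter(fired_classes).most_common(1)[0][0]


if __name__ == "__main__":
    print(feature_quaternion((3.0, 4.0, 1.0), (0.0, 0.0, 1.0)))
    print(winner_class(["cube", "rock", "cube", "cube", "rock"]))
```

Accumulating votes over many feature quaternions is what lets a sparse per-point signature still yield a decision for the whole object.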
It is important to note that all the objects (synthetic and real) were preprocessed to have a common center and the same scale, so our learning process can be regarded as invariant to centering and scale.

10.7.2.1 Phase a) Synthetic Data
In the first phase of this experiment we used training data obtained from synthetic objects; the training set is shown in Fig. 10.5. We have six different objects, which means a six-class classification problem, and we solve it with only one CSVM by making use of its multi-output characteristic. In general, the “one versus all” approach needs n SVMs (one for each class). In contrast, the CSVM needs only one machine because its quaternion output allows up to 16 class outputs. For the input-data coding, we used a 3D point, which is packed into the $\sigma_2\sigma_3, \sigma_3\sigma_1, \sigma_1\sigma_2$ basis of the feature quaternion, and the magnitude, which was packed into the scalar part of the quaternion. Figure 10.6 shows the 3D points sampled from the objects. We compared the performance of the following approaches: CSVM, a 4-7-6 MLP, and the real-valued SVM approaches one-against-one, one-against-all and DAGSVM. The results in Tables 10.4 and 10.5 show that, overall, the CSVM generalizes better and makes fewer training errors than the MLP and the real-valued SVM approaches. Note that all methods were sped up using the acceleration techniques [28, 29]. We believe that the MLP makes more training and generalization errors because the way we represent the objects (as sets of feature quaternions) causes the MLP to get stuck in local minima very often during the learning phase, whereas the CSVM is guaranteed to find the optimal solution to the classification problem because it solves a convex quadratic problem with a global minimum. With respect to the real-valued SVM-based approaches, the CSVM takes advantage of the Clifford product, which enhances the discriminatory power of the classifier itself, unlike the other approaches, which are based solely on inner products.
Fig. 10.5 Training synthetic object set (panels a to f)
Table 10.4 Object-recognition performance in percent (%) during training using synthetic data

Object   NTS   CSVM       MLP     1-vs-all   1-vs-1   DAGSVM
               (C=1200)           a)         b)       c)
C        86    93.02      48.83   87.2       90.69    90.69
S        84    89.28      46.42   89.28      90.47    90.47
F        84    85.71      40.47   83.33      84.52    83.33
W        86    91.86      46.51   90.69      91.86    93.02
D        80    93.75      50.00   87.5       91.25    90.00
U        84    86.90      48.80   82.14      83.33    84.52

C = cylinder, S = sphere, F = fountain, W = worm, D = diamond, U = cube. NTS = number of training vectors.
Used kernels $K(x_i, x_j) = e^{-\sigma \|x_i - x_j\|}$ with parameters taken from $\sigma = \{2^{-1}, 2^{-2}, 2^{-3}, 2^{-4}, 2^{-5}\}$ and costs C = {150, 1000, 1100, 1200, 1400, 1500, 10000}. From these 8 × 5 combinations, the best result was selected for each approach: a) $(2^{-4}, 1500)$, b) $(2^{-3}, 1200)$, c) $(2^{-4}, 1400)$.

Table 10.5 Object-recognition performance in percent (%) during test using synthetic data

Object   NTS   CSVM       MLP     1-vs-all   1-vs-1   DAGSVM
               (C=1200)           a)         b)       c)
C        52    94.23      80.76   90.38      96.15    96.15
S        66    87.87      45.45   83.33      84.84    86.36
F        66    90.90      51.51   83.33      86.36    84.84
W        66    89.39      57.57   86.36      83.33    86.36
D        58    93.10      55.17   93.10      93.10    93.10
U        66    92.42      46.96   89.39      90.90    89.39

C = cylinder, S = sphere, F = fountain, W = worm, D = diamond, U = cube. NTS = number of training vectors.
Used kernels $K(x_i, x_j) = e^{-\sigma \|x_i - x_j\|}$ with parameters taken from $\sigma = \{2^{-1}, 2^{-2}, 2^{-3}, 2^{-4}, 2^{-5}\}$ and costs C = {150, 1000, 1100, 1200, 1400, 1500, 10000}. From these 8 × 5 combinations, the best result was selected for each approach: a) $(2^{-4}, 1500)$, b) $(2^{-3}, 1200)$, c) $(2^{-4}, 1400)$.

Fig. 10.6 Sampling of the training synthetic object set (panels a to f)

10.7.2.2 Phase b) Real Data

In this phase of the experiment we obtained the training data using our robot “Geometer”, shown on the right side of Fig. 10.7. We took two stereoscopic views of each object: one frontal view and one view rotated by 180° (with respect to the frontal view). We then applied the Harris filter to each view to obtain the object's corners and, with the stereo system, computed the 3D points $(x_i, y_i, z_i)$ lying on the object surface and the magnitude $\delta_i$ for the feature quaternion of equation (10.48). This process is illustrated in Fig. 10.7, and the whole training object set is shown in Fig. 10.8. We take the non-normalized 3D point for the bivector basis $\sigma_2\sigma_3, \sigma_3\sigma_1, \sigma_1\sigma_2$ of the feature quaternion in (10.48).
Fig. 10.7 Left: sampling view of a real object; the sampled points are depicted with big white crosses. Right: stereo vision system in the experiment environment
After training, we tested with a set of feature quaternions that the machine had not seen during training, and we used the “winner takes all” approach to decide which class the object belongs to. The training and test results are shown in Table 10.6. We trained the CSVM with an equal number of training data for each object, namely 90 feature quaternions per object, but we tested with a different number of data per object. Note that we have two pairs of objects that are very similar to each other. The first pair is composed of the half sphere shown in Fig. 10.8.c) and the rock in Fig. 10.8.d). In spite of their similarity, we obtained good accuracy in the test phase for both objects: 65.9% for the half sphere and 84% for the rock. We think we obtained better results for the rock because this object has a lot of texture, which produces many corners that in turn capture the irregularities better; therefore we have more test feature quaternions for the rock than for the half sphere (75 against 44, respectively). The second pair of similar objects is shown in Fig. 10.8.e) and Fig. 10.8.f). These are two identical plastic juice bottles, but one of them (Fig. 10.8.f)) is burned. This makes the difference between them and gives the CSVM enough distinguishing features to form two object classes, as shown in Table 10.6: 60% of the test samples were correctly classified for the bottle in Fig. 10.8.e) against 61.2% for the burned bottle in Fig. 10.8.f). The lower recognition rates for the objects in Fig. 10.8 c), e) and f) arise because the CSVM slightly confuses the classes, since the feature vectors are not large and rich enough.

Fig. 10.8 Training real object set, stereo pair images. We include only the frontal views
Table 10.6 Experimental results using real data

Object             NTS   NES   CTS   %
Cube               90    50    38    76.00
Prism              90    43    32    74.42
Half sphere        90    44    29    65.90
Rock               90    75    63    84.00
Plastic bottle 1   90    65    39    60.00
Plastic bottle 2   90    67    41    61.20

NTS: number of training samples. NES: number of test samples. CTS: number of correctly classified test samples. %: percentage of correctly classified test samples.
10.7.3 Multi-case Interpolation

A real-valued SVM can carry out regression and interpolation for multiple inputs and one real output. A Clifford-valued SVM, in contrast, can have multiple inputs and $2^n$ outputs for an n-dimensional space $R^n$. For regression we use $1.0 > \varepsilon > 0$, where the diameter of the tube surrounding the optimal hyperplane is $2\varepsilon$. For interpolation we use $\varepsilon = 0$. We have chosen an interesting task in which a CSVM is used for interpolation in order to encode a certain kind of behavior that we want a visually guided robot to perform. The robot should autonomously draw a complicated 2D pattern. This capacity should be coded internally in Long Term Memory (LTM), so that the robot reacts immediately without the need for reasoning, similar to a skilled person, for example a tennis player or a tango dancer, who reacts within milliseconds and with incredible precision to accomplish a very difficult task. For our purpose we trained a CSVM off line using two real-valued functions. The CSVM used the geometric algebra $G_3^+$ (quaternion algebra), with two inputs (two components of the quaternion input) and two outputs (two components of the quaternion output). The first input u and first output x coded the relation $x = a \sin(3u)\cos(u)$ for one axis. The second input v and second output y coded the relation $y = a \sin(3v)\sin(v)$ for the other axis; see Fig. 10.9.a), b). The 2D pattern can be drawn using the 50 points generated by the functions for x and y; see Fig. 10.9.c). We tested whether the CSVM interpolates well enough using 100 and 400 previously unseen input tuples {u, v}; see Fig. 10.9.d) and e), respectively. Once the CSVM was trained, we incorporated it as part of the LTM of the visually guided robot shown in Fig. 10.9.f). To carry out its task, the robot called the CSVM for a sequence of input patterns. The robot was able to draw the desired 2D pattern, as we see in Fig. 10.10.a)-d).
The reader should bear in mind that this experiment was designed using the equation of a standard function in order to have a ground truth. In any case, our algorithm should also be able to learn 3D curves that do not have explicit equations.
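The two training relations of this experiment are easy to reproduce. The sketch below generates 50 input and output pairs; the amplitude a = 1, the sampling interval [0, π] and the choice u = v are assumptions made for illustration, since the chapter does not state them, and the function name is hypothetical.

```python
import math


def rose_training_set(n_points=50, a=1.0):
    """Return ((u, v), (x, y)) pairs with x = a*sin(3u)*cos(u) and
    y = a*sin(3v)*sin(v), sampled uniformly with u = v in [0, pi]."""
    samples = []
    for i in range(n_points):
        u = math.pi * i / (n_points - 1)
        v = u  # assumed: both inputs follow the same parameter
        x = a * math.sin(3 * u) * math.cos(u)
        y = a * math.sin(3 * v) * math.sin(v)
        samples.append(((u, v), (x, y)))
    return samples


if __name__ == "__main__":
    for inputs, outputs in rose_training_set()[:3]:
        print(inputs, outputs)
```

With u = v the points lie on the polar curve r = a sin(3θ), a three-petal rose, which gives a closed-form ground truth against which interpolated outputs can be compared.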
Fig. 10.9 a) and b) Continuous curves of training output data for axes x and y (50 points). c) 2D result by testing with 400 input data. d) Experiment environment
Fig. 10.10 a), b), c) Image sequence while the robot is drawing. d) Robot's drawing. Result by testing with 400 input data
10.7.4 Experiments Using Recurrent CSVM

In this section we first analyze the performance of the recurrent CSVM against state-of-the-art algorithms on a time series problem. Then, in a second experiment, we apply the recurrent CSVM to a partially observable problem in robotics.

10.7.4.1 Time Series
We used the water-level data of the Venice Lagoon for the periods 1980 to 1989 and 1990 to 1995 (footnote 3). The recurrent CSVM was trained with the first 400 values of the series. The LSTM module was evolved with four memory cells over 100 generations using the Cauchy parameter $\alpha = 10^{-3}$. The CSVM was trained using the output values of these four memory cells as inputs. The achieved training error was 0.0019, and the recurrent CSVM was tested with 600 steps; the system was able to predict 600 steps of the series in advance. Figure 10.11.a) shows the prediction performance of the recurrent CSVM on the training data, and Figure 10.11.b) depicts the predictions for 600 unseen test values. In the figures the ordinate range is [0, 1].
Fig. 10.11 a) Venice Lagoon time series, training. b) Recall data. Thick line (in red): real data; thin line: values predicted by the LSTM-CSVM

Footnote 3: A. Tomasin, CNR-ISDMG, Università Ca' Foscari, Venice.
In the next test we employed the Mackey-Glass time series, which is commonly used for testing the generalization and prediction ability of an algorithm. The series is generated by the following differential equation:

$$\dot{y}(t) = \alpha \frac{y(t-\tau)}{1 + y(t-\tau)^{\beta}} - \gamma y(t), \qquad (10.49)$$
where the parameters are usually set to $\alpha = 0.2$, $\beta = 10$ and $\gamma = 0.1$. This equation is chaotic when the delay is $\tau > 16.8$. We select the most commonly used delay, $\tau = 17$. The task is to predict the series value y[t + P], that is, P steps ahead, using the previous points y[t], y[t − 6], y[t − 12], y[t − 18]. With P = 50 sampled values, the algorithm is expected to learn the four-dimensional function y(t) = f(y[t − 50], y[t − 50 − 6], y[t − 50 − 12], y[t − 50 − 18]). The LSTM-CSVM was trained with the first 3000 values of the series using P = 100. The LSTM module with 4 memory cells was evolved with a Cauchy parameter $\alpha = 10^{-5}$ over 150 generations. The echo state approach was trained with 1000 neurons and reached a mean square error of $10^{-4}$, while the Evolino system achieved an error of $1.9 \times 10^{-3}$ with 30 cells evolved over 200 generations [19, 30]. It has been reported [31] that the minimum error achieved with an LSTM was 0.1184, using the same number of 4 neurons as in our LSTM-CSVM.
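To make the prediction task concrete, the following sketch generates a Mackey-Glass series from equation (10.49) and builds the delayed input vectors described above. It is an illustration under assumptions: a forward Euler step of size 1, a constant initial history of 1.2, and the function names are choices made here and are not given in the chapter.

```python
def mackey_glass(n_steps, tau=17, alpha=0.2, beta=10, gamma=0.1, y0=1.2, dt=1.0):
    """Integrate Eq. (10.49) with forward Euler (assumed step dt) starting
    from a constant history y(t) = y0 for t <= 0 (assumed)."""
    history = int(round(tau / dt))
    y = [y0] * (history + 1)
    for _ in range(n_steps):
        delayed = y[-history - 1]  # y(t - tau)
        growth = alpha * delayed / (1.0 + delayed ** beta)
        y.append(y[-1] + dt * (growth - gamma * y[-1]))
    return y[history + 1:]


def delay_embedding(series, horizon, lags=(0, 6, 12, 18)):
    """Build (inputs, target) pairs: inputs are y[t], y[t-6], y[t-12],
    y[t-18] and the target is y[t + horizon]."""
    pairs = []
    for t in range(max(lags), len(series) - horizon):
        inputs = [series[t - lag] for lag in lags]
        pairs.append((inputs, series[t + horizon]))
    return pairs


if __name__ == "__main__":
    series = mackey_glass(3000)
    pairs = delay_embedding(series, horizon=50)
    print(len(series), len(pairs))
```

A finer integration step or a higher-order scheme would follow the continuous equation more closely; the coarse Euler step is used here only to keep the sketch short.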
Table 10.7 Mackey-Glass time series

Approach     Units        Generations   Error
Echo state   1000         200           10^-4
Evolino      30           200           1.9 × 10^-3
LSTM         4            -             0.1184
LSTM-CSVM    4 and CSVM   150           0.011
Table 10.7 summarizes the comparison. The LSTM alone performs worse than the LSTM-CSVM, which indicates that the CSVM clearly improves the prediction precision. Note that for this complex time series, in contrast to the echo state and Evolino approaches, the LSTM-CSVM uses fewer neurons and requires fewer generations of training to reach an acceptable error of 0.011.

10.7.4.2 Robot Navigation in Discrete Labyrinths
Finally, we use the LSTM-CSVM with reinforcement learning in a task for a robotic perception-action system. The robot system comprises a stereoscopic camera, a 6-DOF robot arm and a four-finger Barrett hand. The task consisted of moving the robot hand through a real 2D labyrinth, which was built using 10 blocks, each 10 cm high. To enhance visibility, their top faces were painted red, which facilitated the segmentation of the blocks by the stereoscopic cameras. The stereoscopic system took images from an angle of 45 degrees; we therefore needed to calibrate the cameras and correct the camera views as if they were oriented perpendicularly above the labyrinth. With this information we had a complete 3D view from above. Using a color segmentation algorithm, we obtained the vector coordinates of the block corners. These observation vectors were then fed to the LSTM-CSVM. The architecture of the LSTM-CSVM with reinforcement learning and the training were the same as in the previous application. The differences from the simulated experiment were: i) the 3D vectors of the block corners were obtained by the stereoscopic camera (the blocks build a 2D labyrinth), ii) the robot actions were hand movements through the 2D labyrinth and iii) this real labyrinth was shorter than the previous simulated one. We had 4 different labyrinths, each 10 blocks long. The evolution of the system consisted of 50 generations using a Cauchy noise parameter of $\alpha = 10^{-4}$. The CSVM module is fed with a vector of the outputs of the last 4 memory cells of the LSTM. The 4 outputs of the CSVM represent the 4 different actions to be carried out during navigation through the labyrinth. After each generation the best net was kept, and the task was considered fulfilled with a perfect reward of 4.0. The four possible actions of the system are robot hand movements of 10 cm to the left, to the right, backward and forward. The initial position of the robot arm was at the entry of a labyrinth. In all the labyrinths we exploited the internal state (support state), i.e. the coordinates of the exit, which was the same in all cases.
Fig. 10.12 Training labyrinths 1 and 2; recall labyrinths 3 and 4. (Third column) The robot hand is at the entry of the labyrinth holding a plastic object. (Fourth column) Position of the hand marked with a cross
Figure 10.12 shows the four labyrinths. The images in the first and third columns were obtained by the stereoscopic system. The images in the second and fourth columns were obtained after perspective correction and color segmentation. Labyrinths 1 and 2 were used for training, whereas labyrinths 3 and 4 were used for recall. The third and fourth columns in Figure 10.12 show the agent at the beginning of the labyrinths. In this labyrinth, only one trajectory was successful (reward 4.0). The training and the test were done off line; the robot then had to follow the action vectors computed by the LSTM-CSVM system.
10.8 Conclusions

This chapter generalizes the real-valued SVM to the Clifford-valued SVM, which is used for classification, regression and recurrence. The CSVM accepts multiple multivector inputs and produces multivector outputs, like a MIMO architecture, which allows multi-class applications. We can use the CSVM over complex, quaternion or hypercomplex numbers according to our needs. The application section shows experiments in pattern recognition and visually guided robotics that illustrate the power of the algorithms and help the reader understand the Clifford SVM and use it in various tasks of complex and quaternion signal and image processing, pattern recognition and computer vision using high-dimensional geometric primitives. This generalization appears promising, particularly in geometric computing and its applications, such as graphics, augmented reality and robot vision.
References

1. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
2. Burges, C.J.C.: A tutorial on Support Vector Machines for Pattern Recognition. In: Knowledge Discovery and Data Mining, vol. 2, pp. 1–43. Kluwer Academic Publishers, Dordrecht (1998)
3. Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., Schölkopf, B.: An Introduction to Kernel-Based Learning Algorithms. IEEE Trans. on Neural Networks 12(2), 181–202 (2001)
4. Cristianini, N., Shawe-Taylor, J.: Support Vector Machines and other kernel-based learning methods. Cambridge University Press, Cambridge (2000)
5. Hestenes, D., Li, H., Rockwood, A.: New algebraic tools for classical geometry. In: Sommer, G. (ed.) Geometric Computing with Clifford Algebras. Springer, Heidelberg (2001)
6. Lee, Y., Lin, Y., Wahba, G.: Multicategory Support Vector Machines, Technical Report No. 1043, University of Wisconsin, Department of Statistics, pp. 10–35 (2001)
7. Weston, J., Watkins, C.: Support vector machines for multi-class pattern recognition. In: Proceedings of the 6th European Symposium on Artificial Neural Networks (ESANN), pp. 185–201 (1999)
8. Bayro-Corrochano, E., Arana-Daniel, N., Vallejo-Gutierrez, R.: Geometric Preprocessing, geometric feedforward neural networks and Clifford support vector machines for visual learning. Journal Neurocomputing 67, 54–105 (2005)
9. Bayro-Corrochano, E., Arana-Daniel, N., Vallejo-Gutierrez, R.: Recurrent Clifford Support Machines. In: Proceedings IEEE World Congress on Computational Intelligence, Hong-Kong (2008)
10. Mukherjee, S., Osuna, E., Girosi, F.: Nonlinear prediction of chaotic time series using a support vector machine. In: Principe, J., Gile, L., Morgan, N., Wilson, E. (eds.) Neural Networks for Signal Processing VII - Proceedings of the 1997 IEEE Workshop, New York, pp. 511–520 (1997)
11. Müller, K.-R., Smola, A.J., Rätsch, G., Schölkopf, B., Kohlmorgen, J., Vapnik, V.N.: Predicting time series with support vector machines. In: Gerstner, W., Hasler, M., Germond, A., Nicoud, J.-D. (eds.) ICANN 1997. LNCS, vol. 1327, pp. 999–1004. Springer, Heidelberg (1997)
12. Salomon, J., King, S., Osborne, M.: Framewise phone classification using support vector machines. In: Proc. Int. Conference on Spoken Language Processing, Denver (2002)
13. Altun, Y., Tsochantaridis, I., Hofmann, T.: Hidden Markov support vector machines. In: Proc. Int. Conference on Machine Learning (2003)
14. Jaakkola, T.S., Haussler, D.: Exploiting generative models in discriminative classifiers. In: Proc. of the Conference on Advances in Neural Information Systems II, Cambridge, pp. 487–493 (1998)
15. Bengio, Y., Frasconi, P.: Diffusion of credit and Markovian models. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (eds.) Advances in Neural Information Systems 14. MIT Press, Cambridge (2002)
16. Hochreiter, S., Mozer, M.: A discrete probabilistic memory for discovering dependencies in time. In: Int. Conference on Neural Networks, pp. 661–668 (2001)
17. Suykens, J.A.K., Vandewalle, J.: Recurrent least squares support vector machines. IEEE Transactions on Circuits and Systems-I 47, 1109–1114 (2000)
18. Schmidhuber, J., Gagliolo, M., Wierstra, D., Gomez, F.: Recurrent Support Vector Machines, Technical Report, no. IDSIA 19-05 (2005)
19. Schmidhuber, J., Wierstra, D., Gómez, F.J.: Hybrid neuroevolution optimal linear search for sequence prediction. In: Kaufman, M. (ed.) Proceedings of the 19th International Joint Conference on Artificial Intelligence, IJCAI, pp. 853–858 (2005)
20. Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory, Technical Report FKI-20795 (1996)
21. Gómez, F.J., Miikkulainen, R.: Active guidance for a finless rocket using neuroevolution. In: Proc. GECCO, pp. 2084–2095 (2003)
22. Hsu, C.W., Lin, C.J.: A comparison of methods for multi-class Support Vector Machines. Technical report, National Taiwan University, Taiwan (2001)
23. Bottou, L., Cortes, C., Denker, J., Drucker, H., Guyon, I., Jackel, L.Y., Muller, U., Sackinger, E., Simard, P., Vapnik, V.: Comparison of classifier methods: a case study in handwriting digit recognition. In: International Conference on Pattern Recognition, pp. 77–87. IEEE Computer Society Press, Los Alamitos (1994)
24. Knerr, S., Personnaz, L., Dreyfus, G.: Single-layer learning revisited: a stepwise procedure for building and training a neural network. In: Fogelman, J. (ed.) Neurocomputing: Algorithms, Architectures and Applications. Springer, Heidelberg (1990)
25. Kreßel, U.: Pairwise classification and support vector machines. In: Schölkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods - Support Vector Learning, pp. 255–268. MIT Press, Cambridge (1999)
26. Platt, J.C., Cristianini, N., Shawe-Taylor, J.: Large margin DAGs for multiclass classification. In: Advances in Neural Information Processing Systems, vol. 12, pp. 547–553. MIT Press, Cambridge (2000)
27. Weston, J., Watkins, C.: Multi-class support vector machines. Technical Report CSDTR-98-04, Royal Holloway, University of London, Egham (1998)
28. Hsu, C.W., Lin, C.J.: A simple decomposition method for Support Vector Machines. Machine Learning 46, 291–314 (2002)
29. Joachims, T.: Making large-scale SVM learning practical. In: Schölkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge (1998); Journal of Machine Learning Research 5, 819–844 (1998)
30. Jaeger, H.: Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science (304), 78–80 (2004)
31. Gers, F.A., Eck, D., Schmidhuber, J.: Applying LSTM to time series predictable through time-window approaches. In: Dorffner, G., Bischof, H., Hornik, K. (eds.) ICANN 2001. LNCS, vol. 2130, pp. 669–685. Springer, Heidelberg (2001)
Chapter 11
A Classification Method Based on Principal Component Analysis and Differential Evolution Algorithm Applied for Prediction Diagnosis from Clinical EMR Heart Data Sets

Pasi Luukka and Jouni Lampinen
Abstract. In this article we study a classification method that first preprocesses the data using principal component analysis and then uses the compressed data in the actual classification process, which is based on the differential evolution algorithm, an evolutionary optimization algorithm. The method is applied here to prediction diagnosis from clinical data sets with a chief complaint of chest pain, using classical Electronic Medical Record (EMR) heart data sets. For experimentation we used a set of five frequently applied benchmark data sets: the Cleveland, Hungarian, Long Beach, Switzerland and Statlog data sets. These data sets contain demographic properties, clinical symptoms, clinical findings, laboratory test results, specific electrocardiography (ECG) results, results pertaining to angina and coronary infarction, etc.; in other words, classical EMR data pertaining to the evaluation of a chest pain patient and ruling out angina and/or Coronary Artery Disease (CAD). The prediction diagnosis results with the proposed classification approach were found promisingly accurate. For example, the Switzerland data set was classified with 94.5% ± 0.4% accuracy. Combining all these data sets resulted in a classification accuracy of 82.0% ± 0.5%. We compared the results of the proposed method with the corresponding results of other methods reported in the literature that have demonstrated relatively high classification performance on this problem. Depending on the case, the results of the proposed method were on a par with the best compared methods or clearly outperformed their classification accuracy. In general, the results suggest that the proposed method has potential in this task.

Pasi Luukka
Laboratory of Applied Mathematics, Lappeenranta University of Technology, P.O. Box 20, FIN-53851 Lappeenranta, Finland
e-mail: [email protected]

Jouni Lampinen
Department of Computer Science, University of Vaasa, P.O. Box 700, FI-65101 Vaasa, Finland
e-mail: [email protected]

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 263–283. © Springer-Verlag Berlin Heidelberg 2010. springerlink.com
11.1 Introduction

Many data sets that come from the real world are inevitably contaminated with noise. Noise can be described as a random error or variance of a measured variable [13]. Data analysis is almost always burdened with uncertainty of different kinds, and there are several techniques for dealing with noisy data [7]. A major problem in mining scientific data sets is that the data is often high dimensional: in many cases a large number of features represents each object. One problem is that the computational time for pattern recognition algorithms can become prohibitive when the number of dimensions grows high. This can be a severe problem, especially as some of the features are not discriminatory. Besides the computational cost, irrelevant features may also reduce the accuracy of some algorithms. To address this problem of high dimensionality, a common approach is to identify the most important features associated with an object, so that further processing can be simplified without compromising the quality of the final results. There are several ways in which the dimension of a problem can be reduced. The simplest approach is to identify important attributes based on input from domain experts. Another commonly used approach is Principal Component Analysis (PCA) [19], which defines new attributes (principal components or PCs) as mutually orthogonal linear combinations of the original attributes. For many data sets it is sufficient to consider only the first few PCs, thus reducing the dimension. However, for some data sets PCA does not provide a satisfactory representation. Mutually orthogonal linear combinations are not always the best way to define new attributes; for example, nonlinear combinations sometimes need to be considered. The analysis of high-dimensional data is both difficult and subtle, and the information loss caused by these methods is also sometimes a problem.

One of the more recent methods in evolutionary computation is the differential evolution (DE) algorithm [30]. In this paper we examine the applicability to the diagnosis of heart disease of a classification method in which the data is first preprocessed with PCA and the resulting data is then classified with a DE classifier. In the literature there are several papers in which evolutionary computation research has concerned the theory and practice of classifier systems [4], [16], [17], [18], [31], [35], [10]. The differential evolution algorithm has been studied in unsupervised learning problems, which can in a sense be recast as classification problems, in [26], [11]. DE was also combined with artificial neural networks in [1] for the diagnosis of breast cancer. It has also been used to tune classifier parameter values in [12] and, in a similarity classifier [23], to tune the parameters of similarity measures.
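The idea of a principal component as a linear combination of the original attributes can be illustrated with a small, dependency-free sketch. It computes only the first principal component, using power iteration on the sample covariance matrix; this is a simplification chosen for brevity and is not the implementation used by the authors, and the function name is hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores): the unit vector of largest variance and
    the projection of each centered row onto it."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    direction = [1.0] * d
    for _ in range(iterations):
        # Power iteration: repeatedly apply the covariance and renormalize.
        w = [sum(cov[i][j] * direction[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(value * value for value in w))
        if norm == 0.0:
            raise ValueError("data has no variance along the start vector")
        direction = [value / norm for value in w]
    scores = [sum(c[j] * direction[j] for j in range(d)) for c in centered]
    return direction, scores


if __name__ == "__main__":
    data = [(float(t), 2.0 * t) for t in range(5)]
    print(first_principal_component(data))
```

Keeping only the scores along the first few such directions is what reduces the dimension; further components would be obtained by removing the found direction from the data and repeating, or by a full eigendecomposition.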
Here we propose a method that first preprocesses the data using PCA and then classifies the processed data using a differential evolution classification method. The differential evolution algorithm is applied to find an optimal class vector to represent each class, so that a sample can be classified by comparing it with the class vectors. In addition, it is applied to determine the value of a distance parameter that we use in making the final classification decision. One advantage of this procedure is that we are able to reduce the dimensionality and hence the computational cost, which would otherwise be intolerably high, especially for high-dimensional data sets. Another advantage is that we are able to filter out noise, which improves the creation of the class vector for each class in the classifier. The class vectors are optimized using the DE algorithm. Using this procedure we also find the optimal dimension for these data sets. The combination of finding the best reduced dimension, filtering out noise from the data, and optimizing the class vectors and the required parameters yields a more accurate solution to the problem at hand. The data sets for the empirical evaluation of the proposed approach were taken from the UCI Repository of Machine Learning data sets [25]. The classifier and preprocessing methods were implemented with MATLAB™ software. From the optimization and modelling point of view, the classification problem under investigation can be divided into two parts: the classification model and the optimization approach applied for fitting (or learning) the model.
Generally, a multipurpose classifier can be viewed as a scalable and learnable model that can be fitted to a particular data set by scaling it to the data dimensionality and by optimizing a set of model parameters to maximize the classification accuracy. For the optimization, the classification accuracy over the learning set may simply serve as the objective function value to be maximized. Alternatively, the optimization problem can be formulated as a minimization task, as we did here, in which the number of misclassified samples is to be minimized. In the literature, mostly linear or nonlinear local optimization approaches, or approaches that can be viewed as such, have been applied to solve the actual classifier model optimization problem. This is the most common approach despite the fact that the underlying optimization problem is a global optimization problem. For example, the weight set of a feed-forward neural network classifier is typically optimized with a gradient-descent-based local optimizer, or alternatively with some other local optimizer such as the Levenberg-Marquardt algorithm. Using limited-capacity optimizers in this way to fit the classification model limits the achievable classification accuracy in two ways. First, the model has to be restricted so that local optimizers can be applied to fit it. This means that only very moderately multimodal classification models can be used, and this modelling limitation limits the classification capability correspondingly. Second, if a local optimizer is applied to optimize (to fit or to learn) even a moderately multimodal classification model, it is likely to become trapped in a local optimum, i.e., a suboptimal solution. Thereby, the only way to obtain classifier models with a higher modelling capacity, and also to get the full capacity out of the current multimodal classification models, is to apply global
266
P. Luukka and J. Lampinen
optimization for fitting the classification models to the data to be classified. For example, a nonlinear feed-forward neural network classifier is clearly a multimodal model, but it is practically always fitted with a local optimizer that is capable of providing only locally optimal solutions. Thus, we consider the use of global instead of local optimization an important fundamental issue that is currently severely constraining the further development of classifiers. The capabilities of the currently used local optimizers limit the selection of applicable classifier models, and the capabilities of the currently used models with multimodal properties are likewise limited by the optimizers applied to fit them to the data. Based on these considerations, our basic motivation for applying a global optimizer to learn the classifier model is that local (nonlinear) optimizers have typically been applied for the purpose, even though the underlying optimization problem is actually a multimodal global optimization problem in which a local optimizer should be expected to become trapped in a local suboptimal solution. The advantage of our proposed method is that, since DE is much less prone to becoming trapped in a local minimum, we can expect it to find better solutions than the nearest local minimum. Another motivation was that we also wanted to optimize the parameter p of the Minkowski distance metric (see Section 11.3). In practice, this means increased nonlinearity and increased multimodality of the classification model, resulting in more locally optimal points in the search space, where a local optimizer would be even more likely to become trapped. In practice, optimizing p successfully requires an effective global optimizer, since local optimizers are then unlikely to provide even an acceptably good suboptimal solution.
With a global optimizer, on the other hand, optimization of p becomes possible. Two advantages were expected from this. First, by optimizing the value of p systematically, instead of selecting it a priori by trial and error as before, a higher classification accuracy may be reached. Second, the value of p can be selected automatically this way, so that laborious trial-and-error experimentation by the user is not needed at all. Furthermore, the potential for further developments is increased. Local optimization approaches severely limit the selection of classifier models that can be used, and they also limit the possible problem formulations for the classifier model optimization task. Simply put, local optimizers are restricted to fitting, or learning, only those classifier models for which trapping in a local suboptimal solution is not a major problem, while global optimizers have no such fundamental limitation. For example, the range of possible class membership functions can be extended to those requiring global optimization (due to increased nonlinearity and multimodality), which can no longer be handled by simple local optimizers, not even nonlinear ones. In addition, we would like to remark that we have not yet fully utilized the possibilities for further development provided by our global optimization approach. For example, even more difficult optimization problem settings are now within reach, and differential evolution has good capabilities for multi-objective and multi-constrained nonlinear optimization, which provides further possibilities for our future developments.
11.2 Heart Data Sets

The heart data sets that we used for experimentation were all taken from [25], where they are freely available. They all contain 13 attributes (extracted from a larger set of 75). Information about the attributes can be found in Table 11.1, and the basic properties of the data sets are summarized in Table 11.2. Regarding attribute types: attributes 1, 4, 5, 8, 10 and 12 are real valued, attribute 11 is ordered, attributes 2, 6 and 9 are binary, and attributes 3, 7 and 13 are nominal. The variable to be predicted is the absence or presence of heart disease. The data sets were collected in different locations, and the principal investigators responsible for the data collection are: 1) Andras Janos, Hungarian Institute of Cardiology, 2) William Steinbrunn, University Hospital, Zurich, 3) Matthias Pfisterer, University Hospital, Basel, 4) Robert Detrano, V.A. Medical Center, Long Beach and Cleveland Clinic Foundation. The donor of the Statlog data set was Ross D. King, University of Strathclyde, Glasgow. The Statlog data set is a slightly modified version of the Cleveland data set (it uses only 270 samples instead of 303).

Table 11.1 Heart data sets attribute information
no    Attribute
1.    Age
2.    Sex
3.    Chest pain type (4 values)
4.    Resting blood pressure
5.    Serum cholesterol in mg/dl
6.    Fasting blood sugar > 120 mg/dl
7.    Resting electrocardiographic results (values 0, 1, 2)
8.    Maximum heart rate achieved
9.    Exercise induced angina
10.   Oldpeak = ST depression induced by exercise relative to rest
11.   The slope of the peak exercise ST segment
12.   Number of major vessels (0-3) colored by fluoroscopy
13.   Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
Table 11.2 Test data sets and their properties
Name                  Nb. classes   Dim   Nb. cases
Heart-Cleveland       2             13    303
Heart-Hungarian       2             13    294
Heart-Long-Beach-va   2             13    200
Heart-Switzerland     2             13    123
Heart-Statlog         2             13    270
11.3 Classification Method

The heart data sets were classified by first preprocessing the data with the PCA algorithm and then classifying the resulting data with a classification method based on differential evolution. We first explain the principal component analysis method in more detail, then the classification method based on the differential evolution algorithm, and finally we give a more thorough description of the differential evolution algorithm itself.
11.3.1 Dimension Reduction Using Principal Component Analysis

High-dimensional data sets present many mathematical challenges as well as some opportunities, and are bound to give rise to new theoretical developments [7]. One of the problems with high-dimensional data sets is that, in many cases, not all the measured variables are “important” for understanding the underlying phenomena of interest. In mathematical terms, the problem we investigate in dimension reduction can be stated as follows: given the r-dimensional random variable $x = (x_1, \ldots, x_r)^T$, find a lower dimensional representation of it, $y = (y_1, \ldots, y_k)^T$ with $k < r$, that captures the content in the original data according to some criterion. The components of y are sometimes called the hidden components. Different fields use different names for the r coordinates of such multivariate vectors: the term “variable” is mostly used in statistics, while “feature” and “attribute” are alternatives commonly used in the computer science and machine learning literature.

PCA [19] is the best linear dimension reduction technique in the mean-square error sense. In various fields, it is also known as the singular value decomposition (SVD), the Karhunen-Loeve transform, the Hotelling transform, and the empirical orthogonal function (EOF) method. Let $x_1, \ldots, x_n$ be the n r-dimensional real vectors constituting the data set. In PCA the data is first centered so that

\[ \frac{1}{n} \sum_{p=1}^{n} x_p = 0. \tag{11.1} \]

PCA attempts to find a k-dimensional subspace L of $\mathbb{R}^r$ such that the orthogonal projections $P_L x_p$ of the n points on L have maximal variance. If L is the line spanned by the unit vector u, the projection of $x \in \mathbb{R}^r$ on L is

\[ P_L x = (u' x)\, u, \tag{11.2} \]

where the prime denotes transposition. The variance of the data in the direction of L is therefore

\[ \frac{1}{n} \sum_{p=1}^{n} (u' x_p)^2 = \frac{1}{n} \sum_{p=1}^{n} u' x_p x_p' u = u' \Bigl( \frac{1}{n} \sum_{p=1}^{n} x_p x_p' \Bigr) u = u' S u, \tag{11.3} \]
where S is the sample covariance matrix of the data. PCA thus looks for the vector $u^*$ that maximizes $u' S u$ under the constraint $\|u\| = 1$. It is easy to show that the solution is the normalized eigenvector $u_1$ of S associated with its largest eigenvalue $\lambda_1$, and

\[ u_1' S u_1 = \lambda_1 u_1' u_1 = \lambda_1. \tag{11.4} \]

This is then extended to find the k-dimensional subspace L on which the projected points $P_L x_p$ have maximal variance. The lines spanned by the eigenvectors $u_j$ are called the principal axes of the data, and the k new features $y_j = u_j' x$ defined by the coordinates of x along the principal axes are called principal components. The vector $y_p$ of principal components for each initial pattern vector $x_p$ may easily be computed in matrix form as $y_p = U_k' x_p$, where $U_k = [u_1, \ldots, u_k]$ is the $r \times k$ matrix having the k normalized eigenvectors of S associated with its k largest eigenvalues as its columns.

PCA can be used in classification problems to display data in the form of informative plots. The score values have the same properties as weighted averages, i.e., they are not sensitive to random noise but show the processes that affect several variables simultaneously in a systematic way. This makes them suitable for detecting multivariate trends, such as the clustering of objects or variables in multivariate data sets. PCA can be seen as a data compression method that can be used to (1) display multivariate data sets, (2) filter noise and (3) study and interpret multivariate processes. One clear limitation of PCA is that it can only handle linear relations between variables [9]. We acknowledge that this may not be the best kernel for the approach, but in our procedure it seems to work.
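The steps in (11.1)-(11.4) can be illustrated with a minimal sketch in plain Python. It is not the MATLAB implementation used in the study: the toy data, the function names and the use of power iteration to approximate the leading eigenvector $u_1$ are all assumptions made for illustration.

```python
import math

def center(data):
    """Subtract the per-feature mean so that the data satisfy (11.1)."""
    n, r = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(r)]
    return [[row[j] - means[j] for j in range(r)] for row in data]

def covariance(centered):
    """Sample covariance S = (1/n) * sum_p x_p x_p', as in (11.3)."""
    n, r = len(centered), len(centered[0])
    return [[sum(x[a] * x[b] for x in centered) / n for b in range(r)]
            for a in range(r)]

def leading_eigenvector(S, iterations=500):
    """Power iteration (an assumed choice): approximates the unit vector u_1."""
    r = len(S)
    u = [1.0 / math.sqrt(r)] * r
    for _ in range(iterations):
        w = [sum(S[a][b] * u[b] for b in range(r)) for a in range(r)]
        norm = math.sqrt(sum(v * v for v in w))
        u = [v / norm for v in w]
    return u

def variance_along(S, u):
    """u' S u, the variance of the data in the direction u."""
    r = len(S)
    return sum(u[a] * S[a][b] * u[b] for a in range(r) for b in range(r))

# Hypothetical 2-D data lying close to the line x2 = x1.
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
X = center(data)
S = covariance(X)
u1 = leading_eigenvector(S)
lambda1 = variance_along(S, u1)                                # (11.4)
scores = [sum(ui * xi for ui, xi in zip(u1, x)) for x in X]    # y_p = u_1' x_p
```

For this toy data almost all of the variance lies along the first principal axis, so a one-dimensional representation would lose very little information. A full PCA would repeat the eigenvector computation for the k largest eigenvalues.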
11.3.2 Classification Based on Differential Evolution

The problem of classification is basically one of partitioning the feature space into regions, one region for each category. Ideally, one would like to arrange this partitioning so that none of the decisions is ever wrong [8]. The objective is to classify a set X of objects into N different classes $C_1, \ldots, C_N$ by their features. We suppose that T is the number of different kinds of features that we can measure from the objects. The key idea is to determine for each class the ideal vector

\[ y_i = (y_{i1}, \ldots, y_{iT}) \tag{11.5} \]

that represents class i as well as possible. From here on we call these vectors class vectors. Once the class vectors have been determined, we have to decide to which class a sample x belongs according to some criterion. This can be done, e.g., by computing the distances $d_i$ between the class vectors and the sample that we want to classify. For computing the distance we used the Minkowski metric:

\[ d(x, y) = \Bigl( \sum_{j=1}^{T} |x_j - y_j|^p \Bigr)^{1/p}. \tag{11.6} \]
The metric (11.6) is defined for $x, y \in \mathbb{R}^T$. We used the Minkowski metric because it is more general than the Euclidean metric, which is included as the special case p = 2. We also found that when the value of p was optimized using DE, the optimum was not even near p = 2. After computing the distances between the sample and the class vectors, we make the classification decision according to the shortest distance: we decide that $x \in C_m$ if

\[ d(x, y_m) = \min_{i=1,\ldots,N} d(x, y_i). \tag{11.7} \]
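The distance (11.6) and the decision rule (11.7) can be sketched as follows. The class vectors and the sample below are hypothetical values, chosen only to show that the value of p can change the classification decision; the function names are likewise illustrative.

```python
def minkowski(x, y, p):
    """Minkowski distance (11.6) between two vectors of equal length (p > 0)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def classify(x, class_vectors, p):
    """Index m of the class vector closest to x, as in (11.7)."""
    distances = [minkowski(x, y, p) for y in class_vectors]
    return distances.index(min(distances))

# Hypothetical class vectors for two classes and one sample to classify.
class_vectors = [[0.0, 0.0], [5.0, 1.5]]
sample = [2.0, 2.0]

label_p1 = classify(sample, class_vectors, p=1)  # city-block distance
label_p2 = classify(sample, class_vectors, p=2)  # Euclidean distance
```

With p = 1 the sample is closer to the second class vector (distance 3.5 against 4.0), whereas with p = 2 it is closer to the first (about 2.83 against about 3.04). This is one way to see why the choice of p matters for the classifier.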
Before doing the actual classification, all the parameters of the classifier must be determined. These parameters are

1. The class vectors $y_i = (y_i(1), \ldots, y_i(T))$ for each class $i = 1, \ldots, N$
2. The power value p in (11.6).

In this study we used the differential evolution algorithm [30] to optimize both the class vectors and the value of p. For this purpose we split the data into a learning set and a testing set, so that half of the data was used in the learning set and half in the testing set. We used the data in the learning set to find the optimal class vectors $y_i$, and the data in the testing set was used to assess the classification performance of the proposed classifier. A brief description of the differential evolution algorithm is presented in the following section.

The number of parameters that the differential evolution algorithm needs to optimize here is classes × dimension + 1, the additional parameter being the power value p of the Minkowski distance. As the results will later show, PCA can be used to lower the dimensionality of the data, and with lower dimensions we can still find results that are clearly better than those found using the DE classifier alone. If we are not satisfied with merely lowering the dimensionality and the improvement achieved this way, but also want to find the best reduced dimension, we have to repeat the optimization for every dimension lower than the maximum dimension, so that in total

\[ \sum_{i=1}^{\text{dimension}} \bigl( \text{classes} \cdot (\text{dimension} - i) + 1 \bigr) \]

parameters have to be optimized. In short, the procedure of our algorithm is as follows:

1. Divide the data into a learning set and a testing set
2. Create initial class vectors for each class (here we simply used random numbers)
3. Compute the distances between the samples in the learning set and the class vectors
4. Classify the samples according to their minimum distance
5. Compute the classification accuracy (number of correctly classified samples / total number of samples in the learning set)
6. Compute the objective function value to be minimized as cost = 1 − accuracy
7. Create new class vectors for each class for the next population using the selection, mutation and crossover operations of the differential evolution algorithm, and go to step 3 until the stopping criterion is reached (e.g., the maximum number of iterations is reached)
8. Classify the data in the testing set according to the minimum distance between the class vectors and the samples.
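Steps 3 to 6 of the procedure above amount to evaluating one candidate solution. A minimal sketch of such an objective function is given below. The flat layout of the candidate vector (all class vectors followed by p), the function names and the toy learning set are assumptions made for illustration, not details taken from the original implementation.

```python
def minkowski(x, y, p):
    """Minkowski distance (11.6); assumes p > 0."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def decode(candidate, n_classes, n_features):
    """Split a flat candidate into class vectors and the power value p.

    Assumed layout: [y_1(1..T), ..., y_N(1..T), p].
    """
    vectors = [candidate[i * n_features:(i + 1) * n_features]
               for i in range(n_classes)]
    p = candidate[n_classes * n_features]
    return vectors, p

def cost(candidate, samples, labels, n_classes):
    """Steps 3-6: cost = 1 - accuracy over the learning set."""
    n_features = len(samples[0])
    vectors, p = decode(candidate, n_classes, n_features)
    correct = 0
    for x, label in zip(samples, labels):
        distances = [minkowski(x, y, p) for y in vectors]
        if distances.index(min(distances)) == label:
            correct += 1
    return 1.0 - correct / len(samples)

def n_parameters(n_classes, n_features):
    """Number of values DE optimizes: classes * dimension + 1 (for p)."""
    return n_classes * n_features + 1

# Hypothetical two-class learning set in two dimensions.
learn_x = [[0.1, 0.2], [0.3, -0.1], [4.9, 5.2], [5.1, 4.8]]
learn_y = [0, 0, 1, 1]
good_candidate = [0.0, 0.0, 5.0, 5.0, 2.0]
bad_candidate = [5.0, 5.0, 0.0, 0.0, 2.0]
```

For the heart data sets with two classes and 13 attributes, `n_parameters(2, 13)` gives 27 values to be optimized when no dimension reduction is applied.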
11.3.3 Differential Evolution

The DE algorithm [33], [30] was introduced by Storn and Price in 1995 and it belongs to the family of Evolutionary Algorithms (EAs). The design principles of DE are simplicity, efficiency, and the use of floating-point encoding instead of binary numbers. As a typical EA, DE has a random initial population that is then improved using selection, mutation, and crossover operations. Several ways exist to determine a stopping criterion for EAs, but usually a predefined upper limit $G_{max}$ for the number of generations to be computed provides an appropriate stopping condition. Other control parameters for DE are the crossover control parameter CR, the mutation factor F, and the population size NP. In each generation G, DE goes through each D-dimensional decision vector $v_{i,G}$ of the population and creates the corresponding trial vector $u_{i,G}$ as follows in the most common DE version, DE/rand/1/bin [29]:

    r1, r2, r3 ∈ {1, 2, ..., NP}  (randomly selected, except mutually different and different from i)
    j_rand = floor(rand_i[0, 1) · D) + 1
    for (j = 1; j ≤ D; j = j + 1)
    {
        if (rand_j[0, 1) < CR ∨ j = j_rand)
            u_{j,i,G} = v_{j,r3,G} + F · (v_{j,r1,G} − v_{j,r2,G})
        else
            u_{j,i,G} = v_{j,i,G}
    }

In this DE version, NP must be at least four, and it remains fixed along with CR and F during the whole execution of the algorithm. The parameter CR ∈ [0, 1], which controls the crossover operation, represents the probability that an element of the trial vector is chosen from a linear combination of three randomly chosen vectors and not from the old vector $v_{i,G}$. The condition “j = j_rand” makes sure that at least one element differs from the elements of the old vector. The parameter F is a scaling factor for mutation, and its value is typically in (0, 1+], where the notation 1+ means that the practical upper limit is about 1 but not strictly defined. In practice, CR controls the rotational invariance of the search: a small value (e.g., 0.1) is practicable with separable problems, while larger values (e.g., 0.9) are for non-separable problems.

The control parameter F controls the speed and robustness of the search, i.e., a lower value of F increases the convergence rate but also adds the risk of getting stuck in a local optimum. The parameters CR and NP have the same kind of effect on the convergence rate as F has.
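The creation of a trial vector in DE/rand/1/bin can be written as a short Python function. This is a sketch under assumptions: indices are 0-based rather than 1-based as in the pseudocode, the function name `make_trial` and the explicit random generator argument are illustrative, and no bound handling is included.

```python
import random

def make_trial(population, i, F, CR, rng):
    """Create the trial vector u_i for target v_i (DE/rand/1/bin)."""
    NP = len(population)
    if NP < 4:
        raise ValueError("DE/rand/1/bin needs a population of at least four")
    D = len(population[i])
    # r1, r2, r3: mutually different and different from i.
    r1, r2, r3 = rng.sample([k for k in range(NP) if k != i], 3)
    j_rand = rng.randrange(D)  # this position always comes from the mutant
    target = population[i]
    trial = []
    for j in range(D):
        if rng.random() < CR or j == j_rand:
            trial.append(population[r3][j]
                         + F * (population[r1][j] - population[r2][j]))
        else:
            trial.append(target[j])
    return trial

rng = random.Random(0)
population = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(6)]
trial_low_cr = make_trial(population, 0, F=0.5, CR=0.0, rng=rng)
trial_high_cr = make_trial(population, 0, F=0.5, CR=1.0, rng=rng)
```

With CR = 0 only the position `j_rand` is taken from the mutated combination, so the trial differs from the old vector in at most one element; with CR close to 1 almost every element comes from the mutation.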
After the mutation and crossover operations, the trial vector $u_{i,G}$ is compared to the old vector $v_{i,G}$. If the trial vector has an equal or better objective value, it replaces the old vector in the next generation. This can be presented as follows (in this paper minimization of objectives is assumed) [29]:

\[ v_{i,G+1} = \begin{cases} u_{i,G} & \text{if } f(u_{i,G}) \le f(v_{i,G}) \\ v_{i,G} & \text{otherwise.} \end{cases} \]

DE is an elitist method, since the best population member is always preserved and the average objective value of the population never gets worse. As the objective function f to be minimized we used the number of incorrectly classified learning set samples. Each population member $v_{i,G}$, as well as each new trial solution $u_{i,G}$, contains the class vectors for all classes and the power value p. In other words, DE seeks the vector (y(1), ..., y(T), p) that minimizes the objective function f. After the optimization process the final solution, defining the optimized classifier, is the best member of the population of the last generation $G_{max}$, the individual $v_{i,G_{max}}$. The best individual is the one providing the lowest objective function value and therefore the best classification performance for the learning set. The control parameters of the DE algorithm were set as follows: CR = 0.9 and F = 0.5 were used for all classification problems. NP was chosen so that it was six times the number of optimized parameters or if size of the NP [...]. However, these selections were mainly based on general recommendations and practical experience with the use of DE, and no systematic investigations were performed to find the optimal control parameter values. Further improvements in classification performance through better control parameter settings are therefore possible in the future.
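The selection rule and the control parameter values quoted above (CR = 0.9, F = 0.5, NP equal to six times the number of parameters) can be put together in a compact loop. The sketch below minimizes a simple sphere function instead of the classification cost, purely to keep the example short; the function `de_minimize`, the bounds, the seed and the number of generations are assumptions and not the authors' settings.

```python
import random

def de_minimize(f, bounds, NP, F=0.5, CR=0.9, G_max=100, seed=0):
    """Minimal DE/rand/1/bin loop with the one-to-one selection rule."""
    rng = random.Random(seed)
    D = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(NP)]
    costs = [f(v) for v in pop]
    best_history = [min(costs)]
    for _ in range(G_max):
        new_pop, new_costs = [], []
        for i in range(NP):
            r1, r2, r3 = rng.sample([k for k in range(NP) if k != i], 3)
            j_rand = rng.randrange(D)
            trial = [pop[r3][j] + F * (pop[r1][j] - pop[r2][j])
                     if (rng.random() < CR or j == j_rand) else pop[i][j]
                     for j in range(D)]
            trial_cost = f(trial)
            # Selection: the trial replaces the old vector if it is
            # equally good or better, otherwise the old vector survives.
            if trial_cost <= costs[i]:
                new_pop.append(trial)
                new_costs.append(trial_cost)
            else:
                new_pop.append(pop[i])
                new_costs.append(costs[i])
        pop, costs = new_pop, new_costs
        best_history.append(min(costs))
    best = costs.index(min(costs))
    return pop[best], costs[best], best_history

def sphere(v):
    """Toy objective standing in for the classification cost."""
    return sum(x * x for x in v)

D = 3
best_vector, best_cost, history = de_minimize(
    sphere, [(-5.0, 5.0)] * D, NP=6 * D)
```

Because a population member is only ever replaced by an equally good or better trial, the best objective value recorded in `history` never increases from one generation to the next, which is the elitist behaviour described in the text.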
11.4 Classification Results

All data sets were split in half; one half was used for training and the other half for testing the classifier. The training sets were randomly created 30 times for each dimension. The results are also compared with existing results in the literature. Table 11.3 reports the results for the data sets used, together with the results achieved without PCA. The first column gives the data set and whether the data was first preprocessed with PCA. The second column gives the best classification accuracy and the third the mean classification accuracy. The variance is reported next, followed by the reduced dimension providing the best results. Finally, the optimized p-value is given in the last column. The results for the Cleveland heart data set are given under Heart-Cleveland, for the Hungarian data set under Heart-Hungarian, for the Switzerland data set under Heart-Switzerland and for the Long Beach data set under Heart-Long-Beach. These four data sets are combined in Heart-All. There are several missing values in these data sets, and a dummy value of −9 is simply used for each missing value. The results for the heart-statlog data set are given under Heart-statlog. The best mean classification accuracies are in boldface.
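The evaluation protocol described above (random half splits repeated 30 times, then best, mean and variance of the accuracies) can be sketched as follows. The helper names are hypothetical, the accuracies in the example are made-up placeholder values rather than results from the study, and the use of the sample variance is an assumption, since the text does not say which variance estimator was used.

```python
import random
import statistics

def random_half_split(n_samples, rng):
    """Randomly divide sample indices into a learning half and a testing half."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    half = n_samples // 2
    return indices[:half], indices[half:]

def summarize(accuracies):
    """Best, mean and (sample) variance of a list of accuracies in percent."""
    return (max(accuracies),
            statistics.mean(accuracies),
            statistics.variance(accuracies))

rng = random.Random(1)
learn_idx, test_idx = random_half_split(270, rng)  # e.g. 270 Statlog samples

# Placeholder accuracies (in %) standing in for repeated runs.
best, mean, variance = summarize([80.0, 90.0, 85.0])
```

In the actual protocol each of the 30 splits would be followed by training the DE classifier on the learning half and measuring accuracy on the testing half before the summary is computed.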
Table 11.3 Classification results of heart data sets. Comparison of classification results with original data and data preprocessed with two dimension reduction methods. Best mean accuracy is in boldface

Data                     Best result (%)   Mean result (%)   Variance (%)   dim   p-value
Heart-Cleveland          89.44             82.86             7.71           13    19.3
Heart-Cleveland(PCA)     91.55             86.48             2.82           12    82.8
Heart-Hungarian          88.44             83.42             5.95           13    88.1
Heart-Hungarian(PCA)     93.20             87.48             3.34           11    96.7
Heart-Switzerland        95.16             94.35             0.67           13    70.8
Heart-Switzerland(PCA)   95.16             94.46             0.66           5     82.1
Heart-Long Beach         80.20             78.32             1.31           13    54.4
Heart-Long Beach(PCA)    85.15             79.93             2.70           12    67.9
Heart-All                78.22             76.98             0.94           13    1.8
Heart-All(PCA)           84.22             82.01             1.05           13    49.2
Heart-statlog            88.89             83.21             10.80          13    81.1
Heart-statlog(PCA)       91.86             87.63             4.01           13    90.9
Cleveland data set: From Table 11.3 we can observe that the best mean classification accuracy for the Cleveland data set is 86.5%, and when the 99% confidence interval is computed for the results (using Student’s t distribution, $\mu \pm t_{1-\alpha/2} S_\mu / \sqrt{n}$), we get 86.5% ± 0.8%. This result was obtained when the data was first preprocessed with PCA. Preprocessing with PCA improved the results by over 3 percentage points. The best mean accuracy was found with a target dimensionality of 12. The results achieved with the Cleveland data set are compared to other results in Tables 11.4–11.6. In Table 11.4 the classification results obtained by our DE-based approach are compared to the corresponding results reported in [32], where a method called Classification by Feature Partitioning (CFP) was introduced. This method is an inductive, incremental and supervised learning method. There the data set was divided into two sets as here, but the training and testing set sizes were somewhat different. When comparing our results with those of Sirin and Güvenir [32], we observed that the DE classifier classified the Cleveland data set with a higher accuracy (82.9%) than the IB classifiers and C4, but with a slightly lower accuracy than CFP (84.0%). When the data was first preprocessed with PCA, the DE classifier reached a classification accuracy of 86.5%. In Table 11.5 the classification results obtained by the DE-based approach are compared to the results of the classifiers reported in [21]. They used a decision tree classifier and also preprocessed the data with the wavelet transform. They also used a two-fold technique as here, but the division into training and testing sets was 80−20. For their decision tree classifier an accuracy of 76% was reported. In comparison, DE yielded an accuracy of 83%. Li et al. [21] managed to enhance the results by first preprocessing the data with the wavelet transform, gaining about 4 percentage points and reaching a mean accuracy of 80%.
We reached an improvement of about 3 percentage points by using PCA, corresponding to 86% classification accuracy. In Table 11.6 we compare our results with those reported in [3], where ten-fold cross-validation was used instead of the two-fold division used in our experiment. As can be seen there, the smart crossover operator with multiple parents for a Pittsburgh Learning
Classifier seems to be giving around 10% better performance with this data set. Generally, the results obtained here by the DE classifier with PCA preprocessing appeared to be rather promising.
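The confidence interval quoted for the Cleveland data set can be checked with a few lines of Python. This sketch assumes that the variance column of Table 11.3 is the variance of the n = 30 repeated accuracies (in squared percentage points) and uses the tabulated two-sided 99% critical value of Student’s t distribution with 29 degrees of freedom, approximately 2.756; both the interpretation and the constant name are assumptions.

```python
import math

# Two-sided 99 % critical value t_{1 - alpha/2} for 29 degrees of freedom
# (standard table value; assumes n = 30 repetitions).
T_CRIT_99_DF29 = 2.756

def half_width(variance, n, t_crit):
    """Half width t * S / sqrt(n) of the confidence interval for the mean."""
    return t_crit * math.sqrt(variance) / math.sqrt(n)

# Heart-Cleveland(PCA): variance 2.82 from Table 11.3.
cleveland_half_width = half_width(2.82, 30, T_CRIT_99_DF29)
```

Under these assumptions the half width comes out at roughly 0.8 percentage points, which is consistent with the reported interval of 86.5% ± 0.8%.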
Table 11.4 Comparison of the DE classifier’s classification results with the results reported by Sirin and Güvenir [32] for the Cleveland and Hungarian data sets

Data set    IB1    IB2    IB3    C4     CFP    DE     PCA + DE
Hungarian   58.7   55.9   80.5   78.2   82.3   83.4   87.5
Cleveland   75.7   71.4   78.0   75.5   84.0   82.9   86.5
Table 11.5 Comparison of the DE classifier’s classification results with the results reported by Li et al. for the Cleveland, Hungarian and Switzerland data sets

Data set      Decision tree (Dt)   Dt + wavelet   DE   PCA + DE
Hungarian     76                   80             83   87
Cleveland     76                   80             83   86
Switzerland   88                   88             94   94
Table 11.6 Comparison of the DE classifier’s classification results with the results of Bacardit and Krasnogor [3] for the Cleveland, Hungarian and Statlog data sets

Data set    Enhanced PLCS   DE      PCA + DE
Hungarian   86.05           83.42   87.48
Cleveland   95.54           82.86   86.48
Statlog     94.44           83.21   87.63
Hungarian data set: With the Hungarian data set the same situation was observed. The best results were found when the data was first preprocessed with PCA. The best mean accuracy with 99% confidence interval was 87.5% ± 0.9%. Preprocessing with PCA improved the results by over 3 percentage points. The best accuracy was found with a target dimensionality of 11. The results obtained with the Hungarian data set are compared to the results of the other classifiers in Tables 11.4–11.7. When the results are compared with those reported by Sirin and Güvenir [32] (Table 11.4), the DE classifier yielded a slightly higher mean accuracy, 83.4%, than the second best classifier, CFP, with an accuracy of 82.3%. When the Hungarian data set was preprocessed with PCA, the accuracy of the DE classifier increased to 87.5%, which can be considered a remarkably good result. In Table 11.5 our results are compared with the corresponding ones by Li et al. [21]. They reported an accuracy of 76% with their decision tree classifier, while our DE classifier reached 83% accuracy. Li et al. also preprocessed the
data, and their wavelet transform preprocessing gained about 4 percentage points in accuracy (80%). We also obtained an improvement of 4 percentage points when we preprocessed the data using PCA and then performed the classification with the DE classifier, reaching an accuracy of 87%. In Table 11.6 our results are compared with the results of [3]. With their method an accuracy of 86% was reported for the Hungarian data set. This accuracy outperforms DE alone, but when the data is first preprocessed with PCA and the linear combination is then used, the DE classifier achieves better results. In Table 11.7 our results are compared with the results of Detrano et al. [6]. They reported 77% accuracy with CDF, while our corresponding result with the DE classifier was 83%.

Table 11.7 Comparison of the DE classifier’s classification results with the results reported by Detrano et al. [6] for the Hungarian, Long Beach and Switzerland data sets

Data set      CDF   CADENZKA   DE   PCA + DE
Hungarian     77    74         83   87
Long Beach    79    77         78   80
Switzerland   81    81         94   94
Switzerland data set: With the Switzerland data set an even higher classification accuracy was reached than in the previous two cases. The best mean accuracy with 99% confidence interval was 94.5% ± 0.4%. The variances were also considerably lower than with the previous two data sets. The results with the original data and the PCA-preprocessed data were very close, and there was no statistically significant difference between classifying the original data and the preprocessed data. A comparison with the other results for the Switzerland data set is provided in Tables 11.5 and 11.7. Our findings are rather similar to those in [21] and [6]: there, too, the Switzerland data set could be classified with a higher accuracy than the other data sets. Furthermore, [21] also observed, as we did, that preprocessing the Switzerland data set did not considerably enhance the results, which remained similar to those with the original data. Li et al. [21] reported an accuracy of 88% with the decision tree classifier and Detrano et al. [6] an accuracy of 81% with both CDF and CADENZKA. We managed to classify the data with 94% accuracy using preprocessed data and the DE classifier. The DE classifier’s accuracy of 94% is the highest among these results.

Long Beach data set: The Long Beach data set seemed to be the most difficult to classify with the algorithm presented in this paper. The best mean accuracy with 99% confidence interval was 79.9% ± 0.9%. This result was achieved when the data was first preprocessed using the principal component analysis algorithm and then classified with the DE classifier. The dimension at which the best results were achieved was 12. The Long Beach data set has the highest number of missing values among these four data sets, and it is often left out of studies [21] for this reason. Here all missing values were replaced by the dummy value −9, and the large number of −9 values is probably
the reason why accuracies are somewhat lower with this data set than with the other data sets. For the Long Beach data set the results are compared in Table 11.7. The results without preprocessing are rather similar to those of the classifiers CDF and CADENZKA, for which accuracies of 79% and 77%, respectively, were reported. We obtained an accuracy of 78% with the DE classifier. When the data was preprocessed with PCA and then classified with DE, an accuracy of 80% was obtained for this data set.

Heart-All: In Heart-All the previous four data sets are combined to obtain a larger amount of data. When all four data sets were combined, the best mean accuracy with 99% confidence interval was 82.0% ± 0.5%. This was achieved when the data was first preprocessed with PCA, using 13 dimensions. Here the variance (1.05) was lower than with the other data sets, with the exception of the Switzerland data set. Compared with the results without PCA preprocessing, PCA improved the results by about 5 percentage points. Table 11.8 compares our classification results with the results reported by Łeski in [22] and Pedreira in [27]. In all these results the Cleveland, Hungarian, Switzerland and Long Beach data sets are combined. Łeski used an ε-margin nonlinear classifier based on fuzzy if-then rules and divided the data into two folds as in our procedure, but in that study the testing set was much larger than the training set. Pedreira used Kohonen’s LVQ2.1 with training data selection and ten-fold cross-validation.

Table 11.8 Comparison of the DE classifier’s classification results with the results reported by Łeski in [22] and Pedreira in [27]. In all the results the Cleveland, Hungarian, Switzerland and Long Beach data sets are combined

Method                                                             Accuracy (%)
SVM-best for several experiments varying C and δ² [27]             80.00
Kohonen’s LVQ2.1 (mean of 10-fold experiment) [27]                 73.00
ε-margin nonlinear classifier based on fuzzy if-then rules [22]    79.96
Incremental SVM [5]                                                78.70
Fisher’s linear discriminant [22]                                  78.04
Logistic-regression-derived discriminant function [6]              77.00
Bayes point machine [14]                                           77.20
DE classifier                                                      76.98
DE classifier with PCA                                             82.01

The DE classifier provided a similar level of accuracy (76.98%) to Incremental SVM (78.70%), Fisher’s linear discriminant (78.04%), the logistic-regression-derived discriminant function (77.00%) and the Bayes point machine (77.20%). Kohonen’s LVQ2.1 performed slightly worse, with 73.00% accuracy. The SVM-best result for several experiments varying C and δ² (80.00%) and the ε-margin nonlinear classifier based on
fuzzy if-then rules (79.96%) classified with a somewhat higher accuracy than the DE classifier. When the data was preprocessed with PCA and then classified with the DE classifier, a higher mean accuracy of 82.01% was obtained. As reported in Table 11.8, the result of the DE classifier with PCA preprocessing outperformed the results of these seven classifiers.

Heart-statlog: When the heart-statlog data set was classified, the best mean accuracy with 99% confidence interval was 87.6% ± 1.0%. Compared with the original Cleveland data set, the accuracy is only a little higher. Removing the samples with missing values in advance thus increased the accuracy by only about 1 percentage point. The variances were actually higher with the heart-statlog data set than with the original Cleveland data set. When the results are compared to those of [20], where this data set was classified with 19 different classifiers, the results obtained here (87.6%) are more accurate than the best compared result, which was 84.4% accuracy with the NewId classifier. The results with the heart-statlog data set are compared in Tables 11.9–11.11. In Table 11.9 the mean accuracy observed with the DE classifier (83.2%) is similar to that of LMT (83.2%), SLogistic (83.3%) and MLogistic (83.7%). In this experiment Hervas-Martinez & Martinez-Estudillo [15] used two folds as we did, but the division between the training set and the testing set was 75−25. When the data was preprocessed with PCA and then classified with DE, we observed a mean accuracy of 87.6%. In Table 11.10 the results are compared with the results reported by Abdel-Aal [2] with GMDH. With GMDH, optimal feature selection was also carried out. They also used a two-fold division of the data set, with a 70−30 division between the training set and the testing set. When comparing the results of GMDH with all features (82.5%) to the results of the DE classifier (83.2%), the DE classifier provided a slightly higher accuracy.
When optimal features were selected with GMDH, an accuracy of 85% is reported, which is slightly lower than the 87.6% we reached by applying the DE classifier to the PCA-preprocessed data. In Table 11.11 the results are compared with those reported by Polat et al. [28], who used a ten-fold cross-validation scheme. They reported that their main classification system, the artificial immune recognition system (AIRS), reached an accuracy of 84.50%.

Table 11.9 Comparison of the DE classifiers' heart-statlog classification results with the results reported by Hervas-Martinez & Martinez-Estudillo [15]
Method       Accuracy (%)
LMT          83.22 ± 6.50
SLogistic    83.30 ± 6.48
MLogistic    83.67 ± 6.43
C4.5         78.15 ± 7.42
CART         78.00 ± 8.25
LotusS       77.63 ± 7.16
DE           83.2 ± 1.7
PCA+DE       87.6 ± 1.0
Table 11.10 Comparison of the DE classifiers' heart-statlog classification results with the results reported by Abdel-Aal [2], using all features and the optimal dimension

Method                       Accuracy (%)
GMDH (all 13 features)       82.5
GMDH (optimal 6 features)    85
DE (all 13 features)         83.2
PCA+DE (13 features)         87.6

Table 11.11 Comparison of the DE classifiers' classification results with the results Polat et al. [28] reported for the heart-statlog data set
Author                Method                          Accuracy (%)
ToolDiag, RA          IB1-4                           50.00
WEKA, RA              InductH                         58.50
ToolDiag, RA          RBF                             60.00
WEKA, RA              FOIL                            64.00
ToolDiag, RA          MLP+BP                          65.60
WEKA, RA              T2                              68.10
WEKA, RA              1R                              71.40
WEKA, RA              IB1c                            74.00
WEKA, RA              K*                              76.70
Robert Detrano        Logistic regression             77.00
Cheung (2001)         C4.5                            81.11
Cheung (2001)         Naive-Bayes                     81.48
Cheung (2001)         BNND                            81.11
Cheung (2001)         BNNF                            80.96
WEKA, RA              Naive-Bayes                     83.60
Polat et al. (2005)   AIRS                            84.50
Polat et al. (2007)   Fuzzy-AIRS-k-NN-based system    87.00
Our work              DE-classifier                   83.21
Our work              PCA+DE-classifier               87.63
The AIRS accuracy of 84.50% is slightly higher than the 83.21% that we reached with the DE classifier. Polat et al. also used a preprocessing method, which they called a weighting scheme based on k-nearest neighbour (k-NN), as a preprocessing step before classifying with their main classifier, AIRS. Using this method they obtained an accuracy of 87.00%, while we reached an accuracy of 87.63% by preprocessing the data with PCA and then classifying with the DE classifier. Thus, the compared results appear to have a rather similar level of accuracy in these cases. When we compare the results with the enhanced PLCS [3], that method gives about 7% higher classification accuracy than the presented method.
[Figure 11.1: two panels, each titled "Heart data classification". (a) Mean classification accuracy versus reduced dimension (with PCA); (b) variance of classification accuracy versus reduced dimension (with PCA). Curves: Switzerland, Statlog, Hungarian, Cleveland, All, Long Beach.]
Fig. 11.1 Classification results with respect to the reduced dimension: (a) mean classification accuracies when the data are first preprocessed with PCA and then classified with the DE classifier; (b) variances
In Fig. 11.1 the mean classification accuracies and variances are plotted for every dimension to show how classification accuracy changes with the reduced dimension. As can be seen from Fig. 11.1, accurate results are already obtained with few dimensions. After preprocessing the data with PCA, good results are still found with all data sets when the dimension is as low as 4. The variances are also low for all dimensions if the first three dimensions are disregarded. The figure suggests that rather accurate results can be achieved when the reduced dimensionality is only about half of the original. This information can be very useful when dealing with large amounts of high-dimensional data, since computations take considerably less
time when the dimensionality is reduced. Computations with only six dimensions were about 3.5 times faster than computations with the full 13 dimensions.

Comparison to Support Vector Machine classifier
To make our comparison clearer, we also computed the classification results with the Support Vector Machine (SVM) [34] and with the combination PCA+SVM. The results of these classification runs can be found in Table 11.12. With SVM we used RBF as the kernel function. Comparing the results for the original data and the preprocessed data, one can see that with the SVM classifier the results are not always improved when the data are first preprocessed with PCA. Sometimes (as in the case of the Hungarian data set) we even obtained worse results with the combination PCA+SVM than with SVM alone. Moreover, although SVM can be considered a fairly new and high-performing classifier, it does not perform that well in this task: its classification accuracies are about 20% lower than those of the DE classifier. This experiment was done to emphasize that PCA works well with the DE classifier on this data, but this observation is not directly transferable to other classifiers.

Table 11.12 Results with SVM and with the combination PCA+SVM: best and mean classification results, and variances
Data & method           Best result   Mean result   Variance
Original data
Heart C & SVM           62.86         58.33         0.065
Heart H & SVM           70.07         63.20         0.095
Heart L & SVM           73.74         65.86         0.136
Heart S & SVM           91.94         86.60         0.021
Heart Statlog & SVM     65.93         60.93         0.073
Heart all & SVM         68.37         64.75         0.029
Preprocessed with PCA
Heart C & SVM           66.70         64.05         0.20
Heart H & SVM           46.94         44.83         0.12
Heart L & SVM           52.43         50.26         0.14
Heart S & SVM           99.67         93.38         0.79
Heart Statlog & SVM     67.19         65.22         0.13
Heart all & SVM         65.96         65.07         0.03
11.5 Discussion and Conclusions
In this paper we applied a classification method, based on first preprocessing the data with PCA and then applying a differential evolution classifier, to the diagnosis of heart disease. To demonstrate and assess the proposed classification approach, we computed results for four different heart data sets individually, and also the results
for the case when all data sets were combined. For the diagnosis of heart disease we found that preprocessing the data with PCA yields a higher classification accuracy than no preprocessing. This was observed in all studied cases except the Switzerland data set. Another aspect is the reduced overall computing time. With dimensionality reduction, high-dimensional data sets can be classified considerably faster than with the original data. This procedure also made it possible for DE to find more robust and accurate class vectors, which improved the classification accuracy. In this way we also managed to filter out noise and improve the results. We consider that the main factor behind the good classification accuracy in the studied cases was the application of an effective global optimizer, differential evolution, for fitting the classification model instead of approaches based on local optimization. The results indicate that preprocessing the data before classification may, in successful cases, not only help with the curse of increasing data dimensionality but also provide a further improvement in classification accuracy. Nevertheless, the main contributor to the accuracy was the global optimization method in the classifier, which made it possible, at least to some extent, to avoid getting trapped in locally optimal (and thereby suboptimal) solutions and to improve the solution further in comparison with the compared approaches. Another important point contributing to the classification accuracy was the systematic optimization of the parameter p, instead of keeping it fixed or setting it manually by trial and error.
It should be noted that including p among the optimized parameters was possible because the global optimizer is capable of handling the extra parameter, the extra nonlinearity and the extra multimodality that this inclusion introduces into the classifier model optimization problem. Generally, the classification accuracy yielded by the proposed approach compared well with the corresponding results of several classifiers reported in the literature. We managed to classify the Switzerland data set with 94.5% ± 0.4% mean accuracy, and when all heart data sets were combined, we achieved a mean accuracy of 82% ± 0.5%. The results suggest that the proposed classification approach has potential in the diagnosis of heart disease. A further advantage of the approach is that when the dimension of the data is reduced, the overall computational time is reduced, allowing classification of even larger data sets.
References
1. Abbass, H.A.: An evolutionary artificial neural networks approach for breast cancer diagnosis. Artificial Intelligence in Medicine 25, 265–281 (2002)
2. Abdel-Aal, R.E.: GMDH-based Feature Ranking and Selection for Improved Classification of Medical Data. Journal of Biomedical Informatics 38, 456–468 (2005)
3. Bacardit, J., Krasnogor, N.: Smart Crossover Operator with Multiple Parents for a Pittsburgh Learning Classifier System. In: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation (GECCO 2006), pp. 1441–1448. ACM Press, New York (2006)
4. Booker, L.: Improving the performance of genetic algorithms in classifier systems. In: Grefenstette, J.J. (ed.) Proc. 1st Int. Conf. on Genetic Algorithms, Pittsburgh, PA, July 1985, pp. 80–92 (1985)
5. Cauwenberghs, G., Poggio, T.: Incremental and decremental support vector machine learning. In: Advances in Neural Information Processing Systems, vol. 13. MIT Press, Cambridge (2001)
6. Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Sandhu, S., Guppy, K., Lee, S., Froelicher, V.: International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology 64, 304–310 (1989)
7. Donoho, D.: High-dimensional data analysis: The curses and blessings of dimensionality. In: Lecture at the “Mathematical Challenges of the 21st Century” conference of the American Math. Society, Los Angeles, August 6-11 (2000)
8. Duda, R., Hart, P.: Pattern Classification and Scene Analysis. John Wiley & Sons, Chichester (1973)
9. Fodor, I.K.: A Survey of Dimension Reduction Techniques, LLNL technical report (June 2002)
10. Fogarty, T.C.: Co-evolving co-operative populations of rules in learning control systems. In: Fogarty, T.C. (ed.) AISB-WS 1994. LNCS, vol. 865, pp. 195–209. Springer, Heidelberg (1994)
11. Giacobini, M., Brabazon, A., Cagnoni, S., Gianni, A.D., Drechsler, R.: Automatic Recognition of Hand Gestures with Differential Evolution - Applications of Evolutionary Computing: Evoworkshops (2008)
12. Gomes-Skarmeta, A.F., Valdes, M., Jimenez, F., Marin-Blazquez, J.G.: Approximative fuzzy rules approaches for classification with hybrid-GA techniques. Information Sciences 136, 193–214 (2001)
13. Han, J., Kamber, M.: Data Mining: Concepts and Techniques.
Morgan Kaufmann Publisher, San Francisco (2000)
14. Herbrich, R., Graepel, T., Campbell, C.: Bayes point machines. J. Machine Learning Res. 1, 245–279 (2001)
15. Hervas-Martinez, C., Martinez-Estudillo, F.: Logistic Regression Using Covariates Obtained by Product-unit Neural Network Models. Pattern Recognition 40, 52–64 (2007)
16. Holland, J.H.: Properties of the bucket-brigade algorithm. In: Grefenstette, J.J. (ed.) Proc. 1st Int. Conf. on Genetic Algorithms, Pittsburgh, PA, July 1985, pp. 1–7 (1985)
17. Holland, J.H.: Genetic algorithms and classifier systems: foundations and future directions. In: Proc. 2nd Int. Conf. on Genetic Algorithms, pp. 82–89 (1987)
18. Holland, J.H., Holyoak, K.J., Nisbett, R.E., Thagard, P.R.: Classifier systems, Q-morphisms and induction. In: Davis, L. (ed.) Genetic algorithms and Simulated Annealing, ch. 9, pp. 116–128 (1987)
19. Jolliffe, I.: Principal Component Analysis. Springer, Heidelberg (1986)
20. King, R.D., Feng, C., Sutherland, A.: Statlog: Comparison of Classification Algorithms on Large Real-World Problems. Applied Artificial Intelligence 9(3), 256–287 (1995)
21. Li, Q., Li, T., Zhu, S., Kambhamettu, C.: Improving Medical/Biological Data Classification Performance by Wavelet Preprocessing. In: Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 657–660 (2002)
22. Łeski, J.M.: An ε-Margin Nonlinear Classifier Based on Fuzzy If-Then Rules. IEEE Transactions on Systems, Man and Cybernetics-Part B: Cybernetics 34(1), 68–76 (2004)
23. Luukka, P., Sampo, J.: Similarity Classifier Using Differential Evolution and Genetic Algorithm in Weight Optimization. Journal of Advanced Computational Intelligence and Intelligent Informatics 8(6), 591–598 (2004)
24. Martens, H., Naes, T.: Multivariate Calibration. John Wiley, Chichester (1989)
25. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of machine learning databases. University of California, Department of Information and Computer Science, Irvine, CA, http://www.ics.uci.edu/~mlearn/MLRepository.html (Cited 30 November 2008)
26. Omran, M., Engelbrecht, A.P., Salman, A.: Differential Evolution Methods for Unsupervised Image Classification. In: Proceedings of the Seventh Congress on Evolutionary Computation (CEC 2005), Edinburgh, Scotland. IEEE Press, Los Alamitos (2005)
27. Pedreira, C.E.: Learning Vector Quantization with Training Data Selection. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(1), 157–162 (2006)
28. Polat, K., Sahan, S., Günes, S.: Automatic detection of heart disease using an artificial immune recognition system (AIRS) with fuzzy resource allocation mechanism and k-nn (nearest neighbour) based weighting preprocessing. Expert Systems with Applications 32, 625–631 (2007)
29. Price, K.V.: New Ideas in Optimization. In: An Introduction to Differential Evolution, ch. 6, pp. 79–108. McGraw-Hill, London (1999)
30. Price, K., Storn, R., Lampinen, J.: Differential Evolution - A Practical Approach to Global Optimization. Springer, Heidelberg (2005)
31. Robertson, G.: Parallel implementation of genetic algorithms in a classifier system. In: Davis, L. (ed.) Genetic algorithms and Simulated Annealing, ch. 10, pp. 129–140 (1987)
32.
Sirin, I., Güvenir, H.A.: An Algorithm for Classification by Feature Partitioning. Technical Report CIS-9301, Bilkent University, Dept. of Computer Engineering and Information Science, Ankara (1993)
33. Storn, R., Price, K.V.: Differential Evolution - a Simple and Efficient Heuristic for Global Optimization over Continuous Space. Journal of Global Optimization 11(4), 341–359 (1997)
34. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
35. Wilson, S.W.: Hierarchical credit allocation in a classifier system. In: Davis, L. (ed.) Genetic algorithms and Simulated Annealing, ch. 8, pp. 104–115 (1987)
Chapter 12
An Integrated Approach to Speed Up GA-SVM Feature Selection Model
Tianyou Zhang, Xiuju Fu, Rick Siow Mong Goh, Chee Keong Kwoh, and Gary Kee Khoon Lee
Abstract. Significant information or features are often overshadowed by noise, which leads to poor classification results. Feature selection methods such as GA-SVM are desirable for filtering out irrelevant features and thereby improving accuracy; the selection itself might also offer critical insights into the problem. However, the high computational cost greatly discourages the application of GA-SVM, especially to large-scale datasets. In this paper, an HPC-enabled GA-SVM (HGA-SVM) is proposed and implemented by integrating data parallelization, multithreading and heuristic techniques, with the ultimate goal of maintaining robustness while lowering computational cost. Our proposed model comprises four improvement strategies: 1) GA Parallelization, 2) SVM Parallelization, 3) Neighbor Search and 4) Evaluation Caching. Each of the four strategies improves a particular aspect of the feature selection algorithm, and together they contribute towards higher computational throughput.
Tianyou Zhang · Xiuju Fu · Rick Siow Mong Goh · Gary Kee Khoon Lee
Institute of High Performance Computing, 1 Fusionopolis Way, #16-16 Connexis, Singapore 138632
e-mail: [email protected]
Chee Keong Kwoh
Nanyang Technological University, Singapore 637457
Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 285–298. springerlink.com © Springer-Verlag Berlin Heidelberg 2010

12.1 Introduction
Booming information technologies have promoted the production of data in all sorts of domains. The significant information or features are often mixed with noise inside the data. This poses a challenging machine learning task: filtering out the irrelevant features and selecting the truly important ones. Given data samples with class labels, supervised classification models are usually used together with optimization algorithms for feature selection, in which classification accuracy serves as the fitness evaluation of the selected feature subsets. In this
work, we develop a feature selection model that combines the merits of the support vector machine (SVM), the genetic algorithm (GA) and high performance computing techniques.

A supervised learning model is capable of learning a set of functions (classifiers) from prior knowledge. It has been widely applied in many domains, including bioinformatics, cheminformatics and financial forecasting. In a typical supervised learning example, given a set of training data with multiple input features and labeled outputs, a classifier is learnt from the “known” examples and generalized to label the “unknown” ones. The rationale for applying supervised learning is that labeling can be expensive. For example, in most bioinformatics problems, laboratory approaches are the most reliable and trustworthy, but they are time-consuming, labor-intensive and costly. A more cost-effective and efficient alternative is to conduct laboratory experiments to collect sufficient labeled data, train a classifier to label the rest of the input examples, and finally verify the highlighted examples in the laboratory.

The support vector machine [1] is a set of supervised learning tools based on the structural risk minimization principle, and it is popular in both classification and regression tasks. The principle of SVM is to construct an optimal separating hyperplane that maximizes the margin between two classes of data. The concept can be visualized as follows: two boundary planes parallel to the hyperplane are constructed, one on each side, and are pushed as far as possible towards the data points as long as no data points fall between the boundary planes. In this picture, “margin” refers to the distance between the boundary planes, and “support vectors” are the data points sitting on those boundary planes. However, in many real-world problems, the unavoidable presence of noise makes it infeasible to construct such a hard-margin classifier with zero errors.
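The hard-margin picture above is usually written as the following optimization problem. This is the standard textbook formulation rather than anything specific to this chapter, and the symbols are assumptions introduced here: training points x_i with labels y_i in {-1, +1}, hyperplane normal w, offset b, and N training examples.

```latex
\min_{\mathbf{w},\, b} \; \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w}^{\top} \mathbf{x}_i + b \right) \ge 1,
\qquad i = 1, \dots, N
```

Under this scaling the two boundary planes are $\mathbf{w}^{\top}\mathbf{x} + b = \pm 1$, the margin between them is $2 / \lVert \mathbf{w} \rVert$, and the support vectors are the points for which the constraint holds with equality.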
A soft margin [2] is then introduced to allow data points to fall between the boundary planes, or even across the hyperplane, at a penalty cost C, which controls the tradeoff between maximal margin and minimal error. For linearly separable data, constructing the linear classifier is straightforward. For data that are not linearly separable, the kernel trick [3] is needed to map the data into a high-dimensional space in which the transformed data become linearly separable. The performance of an SVM classifier is often estimated by k-fold cross validation: the data are divided into k subsets, and in each round a different subset is used for validation while the remaining subsets are used for training. The values of generalization error or accuracy computed in all k rounds are finally aggregated to measure the overall performance of the SVM. When SVM is employed for data classification, the choice of the margin cost C and the kernel parameters is considered very important for obtaining high-performance SVM classifiers [4]. The optimal parameters that lead to the minimal generalization error are data-dependent. Presently no rule or formula can compute such values analytically, so parameter tuning is often required. An intuitive realization of parameter tuning is grid search [5]: the parameters are varied by a fixed step size within a preset range of values (forming a “grid”), and the optimal values are found by measuring every combination (every node in the grid). Because of its complexity, usually a two-dimensional grid is used to tune a pair of parameters
such as C and γ (the width of the Gaussian function in the RBF kernel). Even after parameter tuning, an SVM classifier might deliver poor accuracy on some datasets. One possible reason is noise interference, in which an overwhelming number of irrelevant features are included in the inputs so that a truly representative classifier cannot be learnt. If prior knowledge is insufficient to tell which features are truly relevant to the output, so that all possible features are included in the training data, the accuracy of learning deteriorates. In those cases, the key to improving learning performance is feature selection, the technique of searching for the significant candidates in the feature space with optimization methods such as the genetic algorithm.

The genetic algorithm (GA) [6] is a search technique inspired by natural evolution. In evolution, the individuals with better genetic merits (chromosomes) are more likely to survive natural selection and reproduce; the unfit ones are filtered out. Through constant filtering, generation after generation, the population tends to carry fitter and fitter chromosomes. To mimic this process, every candidate solution to a feature-selection problem is encoded as a “chromosome” (a feature subset representation), which takes the form of a bit array (e.g. 10010101) in which 1s and 0s denote the presence or absence of each feature. A group of candidate solutions is sampled randomly to form the initial population of chromosomes. The chromosomes are then evaluated by an objective function to compute their fitness scores. Multiple chromosomes are stochastically selected based on their fitness, recombined (crossover) and mutated, and finally form the next generation. Random mutations and crossovers introduce variety into the chromosomes, which is evaluated in every generation and gradually evolves the solutions towards the optimum. The process is iterated until convergence, i.e.
there is no more improvement in the best fitness score in the population. At the end, the chromosome with the best fitness ever observed is the final solution, and all the features denoted by 0s are filtered out. By taking SVM as the objective function in GA, GA-SVM has been widely used to filter out irrelevant features and improve learning accuracy in noisy settings. However, GA-SVM has a practical problem of high computational cost. Assuming a population of m chromosomes and g generations in GA, a w × w grid in parameter tuning, and t seconds for one SVM run with 10-fold cross validation, the overall runtime is m·g·w²·t seconds. This is so time-consuming that even a small-scale problem may need nearly a day to complete (as demonstrated in the Results section). That strongly discourages the application of GA-SVM to larger and more complex data.

In this paper, we introduce high performance computing (HPC) techniques and heuristic methods to speed up the traditional GA-SVM feature selection model. In our HPC-enabled GA-SVM (HGA-SVM), we employ data parallelization, multithreading, repeated evaluation reduction and heuristic optimization, with the ultimate goal of trimming down computational cost and making large-scale feature selection more feasible. HGA-SVM comprises four improvement strategies: 1) GA Parallelization, 2) SVM Parallelization, 3) Neighbor Search, and 4) Evaluation Caching. All four strategies work collectively towards higher computational efficiency.
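To make the runtime estimate above concrete, the following minimal sketch evaluates the product of population size, generations, grid nodes and per-evaluation time. The function name and the numeric settings are hypothetical and chosen only for illustration; they are not the settings used in the chapter's experiments.

```python
def ga_svm_runtime_seconds(m: int, g: int, w: int, t: float) -> float:
    """Serial GA-SVM cost: m chromosomes x g generations x (w x w) grid nodes,
    each node costing t seconds (one SVM run with cross validation)."""
    return m * g * w * w * t


# Hypothetical settings, for illustration only.
seconds = ga_svm_runtime_seconds(m=20, g=40, w=10, t=1.0)
hours = seconds / 3600.0
print(f"{seconds:.0f} s = {hours:.1f} h")  # prints "80000 s = 22.2 h"
```

Even with a one-second evaluation, these assumed settings already put the serial runtime close to a full day, which is the scale of cost the text describes.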
12.2 Methodology
The GA-SVM feature selection model operates a GA to search the feature space for the subset of features that produces the best learning performance through SVM. An implementation of such a model comprises three operators: crossover, mutation, and SVM evaluation. With respect to population size, the first two are of linear complexity and the last one is of quadratic complexity. Clearly, reducing the population size lowers computational cost effectively. Moreover, when the input data grow larger and more complex, most of the additional time is spent in SVM evaluation, since all other operators work only at the chromosome level. Suppose a single SVM training is slowed down by t seconds on a larger dataset: through the effect of 10-fold cross validation and a 10 × 10 grid search, the time lag of each chromosome in every generation is amplified to 1000t seconds. SVM evaluation is apparently the biggest obstacle in GA-SVM that discourages its application to large-scale datasets, so a speedup specific to SVM is desirable.

High computational cost in GA-SVM also arises from parameter tuning in SVM. Exhaustive grid search is an intuitive and straightforward technique for tuning the parameters of SVM; however, it is slow even when the grid is small. A simple 10 × 10 grid search requires 100 runs of SVM learning (each with 10-fold cross validation) for each chromosome in every generation. This exhaustive search incurs a huge computational cost. However, parameter tuning cannot be omitted even though the cost is high; otherwise SVM learning would be biased and the purpose of improving learning performance would be undermined.

An additional waste of computational power is redundant evaluation. It happens when identical chromosomes re-emerge in different GA generations due to random mutations and crossovers. Since a standard GA is memory-less, i.e.
it does not keep any historical record of previously evaluated chromosomes, it has to evaluate every appearance of those identical chromosomes. This spends computation power on unnecessary evaluations and increases the runtime.

We have enumerated three causes of the slow execution of GA-SVM on large-scale datasets. With these as targets, four improvement strategies were designed to alleviate the computational cost. Parallel GA and parallel SVM speed up the GA and the SVM respectively; neighbor search replaces grid search to reduce the number of parameter combinations to be measured; and lastly, evaluation caching avoids repeated unnecessary evaluations. Figure 12.1 shows the workflow of all four improvement strategies in HGA-SVM.
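The evaluation caching idea described above can be sketched as a lookup table keyed by the chromosome's bit pattern. This is a minimal illustration under stated assumptions, not the chapter's implementation: the class name `CachedFitness` and the stand-in `toy_fitness` are hypothetical, and in the real model the wrapped function would be the SVM evaluation with cross validation.

```python
from typing import Callable, Dict, Tuple

Chromosome = Tuple[int, ...]  # bit array: 1 = feature kept, 0 = feature dropped


class CachedFitness:
    """Wrap a fitness function so each distinct chromosome is evaluated once."""

    def __init__(self, evaluate: Callable[[Chromosome], float]) -> None:
        self._evaluate = evaluate
        self._cache: Dict[Chromosome, float] = {}
        self.evaluations = 0  # number of real (expensive) evaluations performed

    def __call__(self, chromosome: Chromosome) -> float:
        if chromosome not in self._cache:
            self._cache[chromosome] = self._evaluate(chromosome)
            self.evaluations += 1
        return self._cache[chromosome]


def toy_fitness(chromosome: Chromosome) -> float:
    """Hypothetical stand-in for SVM cross-validation accuracy."""
    return sum(chromosome) / len(chromosome)


fitness = CachedFitness(toy_fitness)
population = [(1, 0, 0, 1), (0, 1, 1, 0), (1, 0, 0, 1)]  # third repeats the first
scores = [fitness(c) for c in population]
print(scores, fitness.evaluations)
```

Tuples are used for the chromosomes because they are hashable, so a re-emerging chromosome in a later generation hits the cache instead of triggering another expensive evaluation.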
12.2.1 Parallel/Distributed GA
The design of GA parallelization follows the parallel island model [7] in a coarse-grained architecture. The entire population of chromosomes is divided into n subpopulations, and each subpopulation is assigned to a different parallel node. Every parallel node evolves its local subpopulation with a serial GA. At the end of every generation, multiple chromosomes are selected randomly at each node and exchanged
among the peers, which is called “migration”. Migration brings new variety into the local population and helps build up a common trend of evolution in all subpopulations. By distributing chromosomes to n parallel nodes, the local population size is reduced to 1/n of the original (the population size is an even integer before and after reduction). The execution time of a single GA generation is sped up by approximately n times, because selection, crossover and mutation are of O(n³) complexity and SVM evaluation is of O(n³) complexity with respect to population size [8]. Parallelization does, however, have drawbacks in the form of parallel overheads, which include start-up/termination overhead, synchronization overhead and communication overhead. The first two overheads are unavoidable in order to coordinate parallel computing on multiple nodes, so we focus on reducing communication overhead in this study.

Because we adopt the parallel island framework, the GAs run independently at different nodes with their local copies of the data, so the need for data communication is minimized. Another reduction is realized in the migration operation, which requires passing arrays of bits (chromosomes) among the nodes. Many migration schemes specific to certain topologies exist in the literature. To minimize communication overhead, we adopt a “ring” topology for migration, in which each node transfers its local best chromosomes to its neighbor on the ring. For instance, among three parallel nodes A, B and C, the exchange happens as A → B, B → C and C → A. The incurred communication overhead is then linear w.r.t. the number of parallel nodes. If there are n parallel nodes, it takes n generations for the migrated chromosomes to travel around the ring and return to their original node. Therefore any serial GA should be allowed to terminate only if there has been no further improvement for at least n consecutive generations. Once any serial GA terminates, the parallel GA stops the iteration and collects the local best chromosome from the individual nodes to compute the final solution.

GA parallelization is developed using the MPI library [9] and is usually targeted at distributed-memory hardware architectures. Parallel GA using MPI is able to scale to all the compute nodes available, which significantly benefits the application of our HGA-SVM to large-scale datasets. With the introduction of multi-core processors, GA parallelization can also be applied to mainstream shared-memory systems to achieve good speedup.

Fig. 12.1 Design scheme of the HPC-enabled GA-SVM feature selection model
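The ring migration described in this section can be simulated in a single process, without MPI, to show the data movement. The sketch below is an illustration under assumptions: the function name `ring_migrate` is hypothetical, one local best chromosome is sent per node, and the incoming migrant replaces the local worst chromosome (the text does not state the replacement policy).

```python
from typing import List, Tuple

Chromosome = Tuple[int, ...]
Scored = Tuple[float, Chromosome]  # (fitness, chromosome)


def ring_migrate(islands: List[List[Scored]]) -> List[List[Scored]]:
    """Send each island's best chromosome to the next island on the ring.

    Assumption: the migrant replaces the worst chromosome of the receiving
    island, so every subpopulation keeps its size.
    """
    n = len(islands)
    migrants = [max(island) for island in islands]  # local best per island
    new_islands = []
    for i, island in enumerate(islands):
        incoming = migrants[(i - 1) % n]  # from the previous node on the ring
        survivors = sorted(island)[1:]  # drop the local worst
        new_islands.append(survivors + [incoming])
    return new_islands


# Three islands A, B, C: the exchange is A -> B, B -> C, C -> A.
islands = [
    [(0.9, (1, 1, 0)), (0.2, (0, 0, 1))],  # A
    [(0.5, (1, 0, 0)), (0.4, (0, 1, 0))],  # B
    [(0.7, (0, 1, 1)), (0.1, (0, 0, 0))],  # C
]
after = ring_migrate(islands)
for name, island in zip("ABC", after):
    print(name, island)
```

Each node sends exactly one message per generation regardless of how many nodes exist, which is why the total communication grows only linearly with the number of nodes.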
12.2.2 Parallel SVM
SVM training is compute-intensive because it requires quadratic programming (QP) [1] to determine the optimal separating hyperplane. Over the years, several methods have been developed to lower the computation cost. One of them is sequential minimal optimization (SMO) [10]. SMO divides a large QP problem into a series of smaller QP problems that can be solved analytically, which speeds up the training. However, when SVM training must be repeated thousands of times, as in a typical feature selection task, SVM with SMO (SVM-SMO) is still considerably slow. To speed up SVM-SMO, we applied parallelization to distribute the computations to multiple nodes/threads for concurrent execution. As parallelization introduces extra overheads in coordination and communication, it is wise to parallelize the most computationally intensive section to achieve the maximal speedup. Table 12.1 shows a typical execution profile of SVM-SMO (retrieved from a LIBSVM [11] training execution). It is clear that kernel calculations take up most of the computational time.

Table 12.1 A typical execution profile for SVM-SMO (LIBSVM)
Time (%)   Self (sec)   Calls         Function name
81.15      231.34       791,262,338   Kernel::kernel
11.67      33.27        269,460       SVC_Q::get_Q
4.78       13.64        66,095        Solver::select_working_set
2.22       6.34         1             Solver::Solve
0.15       0.44         66            Solver::do_shrinking
0.02       0.05         3,806         Cache::swap_index

The caller of those kernel calculations consists of an iterative loop that scans through the instance space to select a pair for optimization. Thus OpenMP [12], a parallelization protocol designed for shared-memory multi-processor/multi-core systems, is the most suitable choice. The implementation of the parallel SVM is relatively simpler than GA parallelization. It can be done by identifying and resolving the data dependencies inside the loop and then inserting OpenMP directives, without any modification to the structure of the algorithm. The parallel SVM is able to utilize multiple CPU cores concurrently in the form of multithreading (refer to Figure 12.1), and so effectively reduces the computation time. Since OpenMP also introduces overheads, the parallel SVM performs more efficiently if the training dataset is sufficiently large. In our HGA-SVM, the GA and SVM operations are parallelized using MPI and OpenMP respectively. Used together, the two techniques form a hybrid parallelization that speeds up the whole workflow.
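The loop-level parallelization described above can be mirrored in Python to show its structure: every kernel evaluation in the loop is independent of the others, so the iterations can be handed to a pool of workers. This is only an analogy under assumptions, not the OpenMP code used with LIBSVM. The function names are hypothetical, and in CPython the global interpreter lock means that threads will not give a real speedup for pure-Python arithmetic like this.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """RBF kernel value exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def kernel_row(i: int, data: List[Sequence[float]], gamma: float,
               workers: int = 4) -> List[float]:
    """Kernel values between instance i and every instance, computed by a pool.

    No iteration reads a value written by another iteration, which is the
    property that makes the loop safe to split across threads.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, like the serial loop would.
        return list(pool.map(lambda x: rbf_kernel(data[i], x, gamma), data))


data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]  # hypothetical toy instances
row = kernel_row(0, data, gamma=0.5)
print(row)
```

The design point carried over from the text is that only the hot loop changes: the surrounding algorithm is untouched, and the parallel version returns exactly what the serial loop would.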
12.2.3 Neighbor Search

Parameter tuning is crucial to achieve minimal generalization error in SVM learning; however, it is also time-consuming when an exhaustive grid search is employed. Inspired by the pattern search method [13], we propose a new derivative-free method, neighbor search, as a general solution to the parameter selection problem. Neighbor search inherits the underlying structure of grid search but does not attempt to measure every node in the grid. In our context, the parameters to be tuned are the margin cost C and the RBF kernel width γ. The neighbor search for C and γ starts from an initial position in the grid (say, a 10x10 grid) as the centroid and samples multiple neighbor nodes with uniform distribution within the grid of parameter domains. The centroid and its neighbors are measured by the SVM learning accuracy obtained with the corresponding pairs of parameters, and the best node (the one with the highest accuracy) is nominated as the new centroid. By repeating this process, the centroid keeps moving towards the best node until convergence, i.e. until the centroid itself is the best among the group of examinees.

Neighbor search is a heuristic search method, and the confidence level of its solution depends on the sampling size, i.e. how many neighbor nodes are sampled in every round. With a larger sampling size, the solution is more likely to be the optimum of the grid, but the search is slower because more measurements are needed; with a smaller sampling size, the search is faster but the solution is less reliable. Neighbor search thus allows the tradeoff between solution confidence and runtime cost to be adjusted, achieving considerable speedup at the price of a tolerable suboptimal solution.
It is intuitive that if two chromosomes differ in only a few bits, their optimal locations in the grid are likely to be close to each other. This applies to mutated chromosomes in a new generation if the Hamming distance between parent and child chromosome is small. Since neighbor search has already found the optimal node for the parent chromosome, the same node can be used as the initial centroid for the child chromosome, which favors faster convergence.
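The procedure of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the HGA-SVM implementation: the function name `neighbor_search`, its parameters and the toy accuracy function are hypothetical, and neighbors are assumed to be drawn uniformly from the whole grid, which is one possible reading of the description above.

```python
import random


def neighbor_search(score, start, grid=10, samples=8, seed=0, max_rounds=100):
    """Move a centroid over a grid x grid lattice towards higher score.

    score   -- hypothetical callable (i, j) -> accuracy for grid node (i, j)
    start   -- initial centroid, e.g. the best node found for the parent
    samples -- neighbor sampling size per round
    Returns the final centroid, its score, and the number of nodes measured.
    """
    rng = random.Random(seed)
    measured = {}                    # never measure the same node twice

    def measure(node):
        if node not in measured:
            measured[node] = score(*node)
        return measured[node]

    centroid = start
    for _ in range(max_rounds):
        # Assumption: neighbors are sampled uniformly from the whole grid.
        neighbors = [(rng.randrange(grid), rng.randrange(grid))
                     for _ in range(samples)]
        best = max([centroid] + neighbors, key=measure)
        if measure(best) <= measure(centroid):
            break                    # the centroid is the best of the group
        centroid = best
    return centroid, measure(centroid), len(measured)


# Hypothetical stand-in for cross-validated SVM accuracy, peaked at (6, 3).
def toy_accuracy(i, j):
    return 1.0 - 0.01 * ((i - 6) ** 2 + (j - 3) ** 2)


node, acc, evaluations = neighbor_search(toy_accuracy, start=(0, 0))
```

Passing the parent's final node as `start` for a child chromosome corresponds to the warm start described in the last paragraph. The returned node is the best one seen, which need not be the global optimum of the grid.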
12.2.4 Evaluation Caching

Caching is employed in our algorithm because it avoids repeated, unnecessary evaluations across generations. A cache stores all previously evaluated chromosomes. Whenever an evaluation is requested, the cache is searched first. Only if a cache miss occurs is an SVM evaluation executed, after which the cache is updated. The efficiency of the cache depends on how frequently identical chromosomes re-emerge, which varies and is stochastic by nature. However, according to probability theory, the cache tends to become less effective as the data dimension grows, because the chance of encountering an identical chromosome (after random mutation and crossover) decreases.

Evaluation caching requires additional memory to store cache entries. Keeping the footprint of the cache small is a challenge, as a large number of entries can be expected. In HGA-SVM, encoding compression is introduced to reduce the length of individual cache entries. Two encoding schemes were developed for different types of data. For low dimensional datasets, a simple multi-bit encoding compresses a chromosome into a multi-bit symbol string, where the compression rate depends on the number of distinct symbols used. For high dimensional sparse datasets, further compression is achieved by encoding the difference between consecutive bits of a chromosome and then compressing the runs of consecutive 0s in the encoded string. The encoding compression schemes not only reduce the footprint of the evaluation cache, but also cut down the computational cost of cache search, since every cache entry is shorter.
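A minimal sketch of the cache and of the two encoding ideas is given below. The names `pack_hex`, `delta_rle` and `EvaluationCache` are hypothetical, and the concrete encodings (one hexadecimal symbol per four bits; XOR of consecutive bits followed by run lengths of zeros) are assumptions chosen to match the description, not the C++ routines used in HGA-SVM.

```python
def pack_hex(bits):
    """Multi-bit encoding: pack every 4 bits into one of 16 symbols."""
    padded = bits + [0] * (-len(bits) % 4)
    groups = (padded[k:k + 4] for k in range(0, len(padded), 4))
    return "".join(format(int("".join(map(str, g)), 2), "x") for g in groups)


def delta_rle(bits):
    """Difference of consecutive bits (XOR), then run lengths of the zeros.

    Returns one zero-run length per 1 in the difference string plus the
    trailing run, so long constant stretches cost a single integer.
    """
    delta = [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]
    runs, zeros = [], 0
    for d in delta:
        if d:
            runs.append(zeros)
            zeros = 0
        else:
            zeros += 1
    runs.append(zeros)
    return tuple(runs)


class EvaluationCache:
    """Look up a chromosome before paying for an evaluation."""

    def __init__(self, evaluate, encode):
        self.evaluate = evaluate      # expensive fitness function
        self.encode = encode          # compresses a chromosome into a key
        self.table = {}
        self.hits = 0
        self.misses = 0

    def fitness(self, bits):
        key = self.encode(bits)
        if key in self.table:
            self.hits += 1
        else:
            self.misses += 1
            self.table[key] = self.evaluate(bits)
        return self.table[key]


# `sum` is a cheap placeholder for the SVM cross-validation accuracy.
cache = EvaluationCache(evaluate=sum, encode=pack_hex)
a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1]
b = [0] * 14
for chromosome in (a, b, list(a)):
    cache.fitness(chromosome)
```

Both encodings are only safe as cache keys when all chromosomes have the same length, which holds for a fixed feature set.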
12.3 Experiments and Results

The source code of GA [14] was ported to Octave [15] and extended with the improvement strategies, including parallel GA, neighbor search and evaluation caching. The MPI support required by the parallel GA is provided by the MPITB library [16]. The encoding compression required by the evaluation cache was coded as C++ libraries and linked to Octave for efficiency. The source code of LibSVM 2.8.6 [11] was modified to implement the parallel SVM and also ported to Octave as an external library, which allows direct access to runtime variables in memory. The experiment platform was a machine with two quad-core Intel Xeon processors (3.0 GHz) and 32 GB of memory.

The parameters of our HGA-SVM are listed in Table 12.2. The population/subpopulation sizes and the crossover/mutation/migration rates are self-explanatory, and their values were set based on experience. Max-generation refers to the maximum number of generations to be evolved, and max-convergence denotes the number of consecutive generations to wait before termination if there is no further improvement in the fitness score, which is measured by the average accuracy in stratified 10-fold cross validation of the SVM classifier with the tuned parameters. The minimal update level of the fitness score is 0.01%. The RBF kernel was used in SVM. C and γ were tuned in the range of $10^{-2}$ to $10^{3}$ on a 10x10 grid with a neighbor sampling size of 8.

Table 12.2 List of Preset Parameters in GA-SVM

GA Parameters
# of parallel nodes       n
population size           80
subpopulation size        80/n
cross-over rate           60%
mutation rate             5%
migration rate            50%
max-generation            100
max-convergence           n
fitness epsilon           0.01%

SVM Parameters
SVM kernel                RBF
cross-validation          10-fold stratified
C and γ range             $10^{-2}$ to $10^{3}$
grid size                 10x10
neighbor sampling size    8

Two datasets were used in our study to evaluate the performance of our algorithm. The details of the datasets are shown in Table 12.3. Both are preprocessed datasets from the LibSVM collection (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).

The performance of the improvement strategies was measured by computation reduction (including search reduction and evaluation reduction) and runtime reduction per generation. Search reduction is the percentage of search tasks saved compared to grid search, and evaluation reduction is the percentage of evaluation tasks saved compared to GA without a cache. As GA is a stochastic process, the number of generations needed for convergence may vary between runs, so the overall runtime is not appropriate for benchmarking. Instead, the runtime per generation is used to illustrate the effectiveness of the improvement strategies.

GA-SVM is time-consuming even for a small-scale dataset like Austrian. For 690 examples with 14 features, it took 1048 minutes (about 17.5 hours) to complete 16 generations of GA-SVM. This slowness affirmed our determination to speed up GA-SVM to make it feasible in practice.
Table 12.3 List of Datasets Tested with GA-SVM

Name       # of Examples   # of Features   # of Classes
Austrian   690             14              2
Adult      1605            123             2
Fig. 12.2 Distributions of Runtime Per Generation w.r.t. MPI Nodes
Parallel GA can significantly speed up the traditional GA by distributing chromosomes to different parallel nodes. The reduced population size per node cuts down the computational cost of all GA operations, especially SVM evaluation. Fig. 12.2 confirms this expectation in the experiment with the Austrian dataset: the runtime per generation decreased as the number of parallel nodes increased. Fig. 12.3 plots the relationship between the average runtime per generation and the number of parallel nodes. The respective speedups for 2, 4 and 8 nodes were 2.01, 4.00 and 8.46, i.e. the runtime reduction was approximately linear in the number of nodes.

In Parallel-SVM, the kernel computations are evenly distributed among multiple threads, and each thread is allocated to a processing core for concurrent execution. Fig. 12.4 shows how the SVM training time changes with a growing number of threads (cores) for the Adult dataset. The speedups for 2, 3 and 4 threads were 1.89, 2.60 and 2.88 (equivalent to 0.94, 0.86 and 0.71 per thread) respectively, and the training-time curve flattened as threads were added. This is the result of parallel overheads. In our design, communication overhead was minimized by data localization and a faster migration algorithm. The remaining overheads (like thread start-up and termination) have a fixed cost and are independent of data size. Therefore, better performance can be expected with larger datasets, since the cost of the overheads would be amortized.

Fig. 12.3 Parallel-GA average runtime per generation w.r.t. MPI Nodes (x-axis: # of MPI nodes; y-axis: runtime per generation, sec)

Fig. 12.4 Parallel-SVM training time w.r.t. number of threads (x-axis: # of threads; y-axis: runtime, sec)

Evaluation caching avoids unnecessary evaluations of identical chromosomes in different generations. If the chance of a cache hit (i.e. a chromosome has been evaluated earlier and stored in the cache) is significant, the overall speedup is remarkable. Fig. 12.5 shows the percentage of evaluations saved by caching in the experiment. Identical chromosomes re-emerged frequently as a result of the low feature dimension and mutation rate. A cache hit rate of 76.25% was observed in the experiment, which led to a 75.62% reduction in the average runtime per generation (from 61.42 to 14.97 minutes, about a 4 times speedup).

The performance of evaluation caching depends heavily on the re-emergence probability of identical chromosomes, which is mostly affected by the feature dimension. With binary encoding of the GA's chromosomes, the total number of feature combinations is $2^n$, where n is the feature dimension. When the feature dimension increases, the chance of hitting a chromosome from past evaluations drops rapidly. This phenomenon was confirmed in the experiment with the Adult dataset. As the feature dimension rose from 14 to 123 with the same population size, there were only 3 cache hits during 40
Fig. 12.5 Evaluation Reduction Distribution for Evaluation Caching (left: Austrian, right: Adult)
Fig. 12.6 Search Reduction Distribution for Neighbor Search (left: Austrian, right: Adult)
Fig. 12.7 Integration of Four Improvement Strategies (Austrian) (left: standard GA-SVM, right: HGA-SVM)
generations of GA. Considering the overhead incurred by caching, the results suggest that this strategy should be applied with caution to high dimensional data.

Neighbor search is an improvement over grid search for tuning C and γ that replaces exhaustive search with neighbor sampling and heuristic search. Fig. 12.6 summarizes the search reduction of neighbor search in the experiments with the Austrian and Adult datasets. For the Austrian dataset, two independent runs of HGA-SVM were conducted, one with grid search and one with neighbor search. Using the same 10x10 grid and parameter range, grid search required 8000 SVM measurements per generation (80 chromosomes x 10 x 10 grid), whereas neighbor search required only 1410 to 1524 (80.95% to 82.37% reduction, 81.77% on average). Both runs of HGA-SVM found the best fitness of 87.97% classification accuracy, but the one using neighbor search was 5.79 times faster per generation (61.42 min vs. 10.60 min). A similar observation was made with the Adult dataset: 79.10% to 84.70% search reduction by neighbor search, i.e. on average 5.76 times faster per generation.

Finally, the collective speedup of all four improvement strategies was evaluated. A remarkable reduction in computational cost was observed. Fig. 12.7 shows the distributions of runtime per generation with the Austrian dataset. The average runtime per generation was reduced from 61.42 min to 0.46 min, a speedup of about 133 times. In all the above experiments, an improvement in SVM learning accuracy over 10-fold cross validation was also observed (Fig. 12.8). The learning accuracy was enhanced by 3.74% to 8.10% as a result of feature selection.
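Several of the figures quoted in this section can be re-derived from the reported measurements. The sketch below does so with two hypothetical helper functions, `reduction` and `speedup`; the inputs are the rounded values printed in the text, so the last digit of a result may differ slightly from the published one.

```python
def reduction(before, after):
    """Fraction of work or time saved when going from `before` to `after`."""
    return 1.0 - after / before


def speedup(before, after):
    """How many times faster `after` is than `before`."""
    return before / after


GRID_EVALS = 80 * 10 * 10            # chromosomes x grid nodes per generation

# Search reduction of neighbor search (Austrian), worst and best case.
search_reduction = [reduction(GRID_EVALS, n) for n in (1524, 1410)]

# Runtimes are average minutes per generation (Austrian).
neighbor_speedup = speedup(61.42, 10.60)
cache_reduction = reduction(61.42, 14.97)
overall_speedup = speedup(61.42, 0.46)

# Parallel efficiency = speedup / number of threads (Parallel-SVM, Adult).
efficiency = {t: s / t for t, s in {2: 1.89, 3: 2.60, 4: 2.88}.items()}

# Size of the chromosome space for the two feature dimensions.
search_space = {n: 2 ** n for n in (14, 123)}
```

The last line makes the caching argument concrete: with 14 features there are 16384 possible chromosomes, so a population of 80 revisits them often, whereas with 123 features the space is so large that repeated chromosomes become rare.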
Fig. 12.8 Improvement to Classification Accuracy
12.4 Conclusion

Our HGA-SVM illustrates an integrated approach that combines parallelization and heuristic techniques to lower the computational cost effectively. We have demonstrated the individual speedup gains from parallel GA, parallel SVM, neighbor search and evaluation caching, as well as their collective gain. Through feature selection, the learning accuracy of SVM was enhanced as well. Overall, we show that HGA-SVM reduces the computational cost while improving learning performance, which makes application to larger and more complex data feasible. In future work, caching, cross validation, and more efficient heuristic techniques will be explored to further improve the current algorithm.
References

1. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152 (1992)
2. Cortes, C., Vapnik, V.: Support Vector Networks. Machine Learning 20, 273–297 (1995)
3. Aronszajn, N.: Theory of reproducing kernels. Transactions of the American Mathematical Society 68, 337–404 (1950)
4. Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing Multiple Parameters for Support Vector Machines. Machine Learning 46, 131–159 (2002)
5. Hsu, C.W., Chang, C.C., Lin, C.J.: A practical guide to support vector classification (2003)
6. Mitchell, M.: An Introduction to Genetic Algorithms (1998)
7. Tanese, R.: Distributed genetic algorithms. In: Proceedings of the third international conference on Genetic algorithms, George Mason University, United States, pp. 434–439. Morgan Kaufmann Publishers Inc., San Francisco (1989)
8. Vapnik, V.N.: Statistical Learning Theory. Wiley Interscience, Hoboken (1998)
9. Foster, I.: Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison-Wesley, Reading (1995)
10. Platt, J.: Sequential minimal optimization: A fast algorithm for training support vector machines. Advances in Kernel Methods-Support Vector Learning, 185–208 (1999)
11. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines 80, 604–611 (2001), Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
12. Dagum, L., Menon, R.: OpenMP: An Industry-Standard API for Shared-Memory Programming. IEEE Computational Science & Engineering, 46–55 (1998)
13. Momma, M., Bennett, K.P.: A pattern search method for model selection of support vector regression. In: Proceedings of the SIAM International Conference on Data Mining (2002)
14. Houck, C.R., Joines, J., Kay, M.: A Genetic Algorithm for Function Optimization: A Matlab Implementation, NCSU-IE TR, vol. 95 (1995)
15. Eaton, J.W.: Octave, http://www.gnu.org/software/octave/
16. Fernández, J., Anguita, M., Ros, E., Bernier, J.: SCE Toolboxes for the Development of High-Level Parallel Applications. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3992, pp. 518–525. Springer, Heidelberg (2006)
Chapter 13
Computation in Complex Environments; Optimizing Railway Timetable Problems with Symbiotic Networks

Kees Pieters
13.1 Introduction

The title of this contribution balances on two concepts. The first is ‘complex environments’. ‘Complex’ should not be read in the colloquial sense of the word; complexity addresses, amongst others, non-linear, contingent and ‘chaotic’ phenomena ([10],[11]). Many thinkers on complexity consider such characteristics, sometimes called organized complexity, to demarcate a transition point where analytical approaches are no longer feasible ([26]:18). Put another way, organized complexity moves away from traditional machines, which have supreme performance for their intended tasks but also require very stable and predictable environments. Rather, a line is drawn towards living organisms, which are very robust and are better adjusted to contingent environments than machines are. Along this gradient, ‘robust machines’ form an interesting field of inquiry for optimization problems. Railway Timetable Problems (RTP) can be seen as a benchmark for such robust machines (or algorithms).

The second concept, ‘symbiotic networks’, is introduced as an optimization strategy that can, to some extent, optimize in such complex environments. RTP has been a benchmark problem for symbiotic networks, and so this contribution does not focus on RTP in itself, but rather uses RTP to analyze the behaviour of symbiotic networks in a practical setting.

This paper is outlined as follows. First, a ‘meta-perspective’ on optimization processes is drawn; then RTP is introduced as a complex environment. Next, the theory of symbiotic networks is discussed, together with how this approach was implemented to optimize RTP. Last, the various tests that have been carried out with the simulation environment are discussed.

Cornelis P. Pieters
Condast, Omloop 82, 3552 AZ, Utrecht, the Netherlands
e-mail: [email protected]
Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 299–324. © Springer-Verlag Berlin Heidelberg 2010. springerlink.com
13.1.1 Convergence Inducing Process

Research on computational intelligence often tends to focus on the algorithms or heuristics that aim to solve certain problems. This focus sometimes obfuscates the fact that the algorithms are taken up in a larger process, which also contributes to the eventual solutions. In the specific case of computational intelligence, the algorithms that are used are often part of a specific pattern [1, 20, 21, 22] called a ‘convergence inducing process’ ([5]:424-425), as depicted in Table 13.1.
Table 13.1 Convergence Inducing Process

Pattern:      Convergence Inducing Process
Description:  An actor samples an environment by an iterative cycle of testing and evaluation until a certain goal criterion has been met
a.k.a.:       Problem Solver, Global Search
Notes:        The actor typically contains a balanced mix between divergent processes (exploration) and convergent processes (optimization), which map the evaluated variables to the goal function. The latter processes tend to dominate (eventually)
Typically, this process combines a divergent process of exploration with a convergent process of optimization, of which the latter is dominant. The actor, which in the case of computational intelligence consists of the strategies that are deployed, usually includes a securing mechanism which stores high-ranking values, and is able to use these for further evaluation. Genetic algorithms, goal-directed agent systems, and the learning phase of neural networks can all be considered instantiations of this pattern ([6],[18]). The algorithms or heuristics that implement the strategies determine how the convergence-inducing process behaves. The pattern may thus take various specific forms, depending on the problems that are tackled and the manner and quality of the optimization that is aimed for. However, the essential property of this process is always a form of feedback between actor —for instance a certain computational algorithm— and data that are present in the actor’s environment, which is often called the ‘problem domain’ in computational intelligence.
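The pattern can be made concrete with a small sketch. It is an illustration under stated assumptions, not an algorithm taken from the cited literature: the function `convergence_inducing_process`, its parameters and the toy problem domain are all hypothetical. The actor mixes a divergent step (fresh sampling) with a convergent step (refining the secured best value), and the convergent step is chosen more often.

```python
import random


def convergence_inducing_process(evaluate, sample, refine, goal,
                                 explore=0.3, max_cycles=10_000, seed=0):
    """Iterate test-and-evaluate cycles until the goal criterion is met.

    evaluate -- scores a candidate taken from the problem domain
    sample   -- divergent step: draw a fresh candidate (exploration)
    refine   -- convergent step: perturb the secured best (optimization)
    goal     -- goal criterion on the best score found so far
    explore  -- probability of a divergent step; kept below 0.5 so that
                the convergent process dominates
    """
    rng = random.Random(seed)
    best = sample(rng)
    best_score = evaluate(best)          # securing mechanism: keep the best
    cycles = 0
    while not goal(best_score) and cycles < max_cycles:
        cycles += 1
        if rng.random() < explore:
            candidate = sample(rng)
        else:
            candidate = refine(best, rng)
        score = evaluate(candidate)
        if score > best_score:           # feedback from the environment
            best, best_score = candidate, score
    return best, best_score, cycles


# Hypothetical toy problem domain: maximize -(x - 3)^2 on [-10, 10].
best, score, cycles = convergence_inducing_process(
    evaluate=lambda x: -(x - 3.0) ** 2,
    sample=lambda rng: rng.uniform(-10.0, 10.0),
    refine=lambda x, rng: x + rng.gauss(0.0, 0.5),
    goal=lambda s: s > -1e-4,
)
```

Swapping `sample` and `refine` for crossover and mutation, or for agent moves, changes the specific form of the process but leaves the feedback loop between actor and problem domain intact.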
13.1.2 A Classification of Problem Domains

This ‘broader’ perspective on computational intelligence shows that there often is a relationship between the process and the characteristics of the problem domain. This, of course, is widely known by now. There are a number of ‘No Free Lunch’ (NFL) theorems that stress this relationship [27], and for instance De Jong functions provide powerful benchmarks against which computational intelligence can be tested and (mutually) evaluated [7]. Some optimization strategies are also paired with so-called deceptive functions, which can be seen as problem domains that deceive the algorithms into optimizing towards poor (sub-optimal) results [14]. Complex problem domains can also be categorized as follows:

Table 13.2 Characteristics of Problem Domains

Property: Remarks
Familiar/Uncertain: In a predominantly uncertain environment, the process incorporates only limited specific knowledge about the problem domain for its optimization strategy. Often only certain aspects of the problem domain are known, such as its structure (tree, graph, etc.).
NP-hard (-complete): Results in a combinatorial explosion of the search space.
Linear/nonlinear: Usually increases the uncertainty of trends and patterns, such as curves.
Static/non-static: In a static environment, the problem domain does not change while it is being processed by the actor. A non-static environment can be dynamic, stochastic or in any other way changing while being processed.
Reactive: Implies that the problem domain is influenced by the actor. A reactive problem domain is always dynamic as well.
Contingency: The environment can provide unexpected situations or events that disrupt the process of optimization.
The ‘Traveling Salesman Problem’ and certain job shop problems are examples of familiar, NP-hard problem domains, while agent systems typically operate in reactive and often highly nonlinear environments [28]. Strictly speaking, many neural networks can be considered as operating in ‘familiar’ environments, as the patterns that they learn during the training phase (design) are assumed to be present in the problem domain once the network is in operation. In practical settings, most solutions require a mix of these, in which many constraints and heuristics that are specific to the problem domain are also designed in. Note that the categories are ‘actor-centric’; the designer of the process may, and usually will, have more knowledge about the environment than the actor itself.

Along the scales that are introduced with these characteristics, it becomes clear that computational intelligence often combines intelligence ‘by design’ and ‘true’ forms of computational intelligence. ‘Designed’ intelligence includes the implicit assumptions about a chosen solution strategy, the type of solution that is chosen (GAs, neural networks, etc.) and ‘tweaking’ of the algorithms by the designer.

Problem domains that are predominantly uncertain, NP-hard, non-linear, reactive and contingent are amongst the most difficult to tackle. These form the characteristics of a complex environment. A convergence inducing process operating in these domains usually cannot be tailored to a specific environment, but will need to find a near-optimal solution in finite time in a problem domain that continuously changes, and which may react to the process itself. A robust algorithm will try to optimize regardless of the conditions the environment imposes on it, and a good robust optimizer may even use these conditions to its advantage. One specific class of problems provides an interesting point of reference for optimization in these severely complex problem domains: railway timetable and pathing problems [3, 8, 12, 13, 17].
13.2 Railway Timetable Problems

The typical railway timetable problem (RTP) revolves around the question of which set of departure times (timetable) causes no conflicts between trains while they are traveling on a railroad infrastructure. The infrastructure (the problem domain) is a resource-constrained network; trains tend to share tracks at some point, and an ideal timetable would prevent trains from hindering each other during service. The railway timetable problem can be extended by including other variables, such as waiting time for passengers, the availability of carriages and personnel at stations at a given time, and so on.

The form of the railroad infrastructure plays an important role in the complexity of the problem domain. Some countries can develop a tree structure, which can reduce the complexity significantly. Many densely populated countries with old railway infrastructures, such as the Netherlands, have to work with dense networks that contain many cycles. For an extensive overview of RTP, see ([12]:37-58) and [3]. This contribution will focus on the ‘simpler’ variant that optimizes on conflicts alone.

Besides the infrastructure, usually some other aspects of the problem domain are known as well, such as the allowed speeds of the trains in different situations, the minimal (and often also the maximal) waiting times at stations and so on. A railway timetable can also be further constrained by specific demands of the services that are offered. For instance, a certain trajectory may require trains to travel at fixed intervals, or at least a number of times within a certain time frame (e.g. four times an hour). This is called a cyclic, or periodic, railway timetable, which is known to be a special form of a Periodic Event Scheduling Problem (PESP) ([13]:10-12). PESP is known to be NP-hard and therefore a challenge in itself, even without the dynamic aspects.

Generally speaking, most solution strategies meticulously record these characteristics in graphs and process them, which usually results in computationally intensive solutions. As a result, to date few solutions have been developed this way that can optimize a full service of trains on a full infrastructure. Most research therefore concentrates on describing one or a few tracks [3], or focuses on train movement around stations [8].
Train scheduling in practice therefore still relies on human experts [29], although some supportive tools have been developed to assist in generating timetables for a full service of trains on a full infrastructure ([13]:15-16). Another approach can be taken where optimization is performed on a full infrastructure, and where the algorithm has to optimize without prior knowledge of the problem domain. In this case, the number of conflicts will be measured as a function of a given timetable (see Figure 13.1). This approach makes the problem domain predominantly uncertain, NP-hard, non-linear and reactive. If the optimization has to account for delays of trains, or other sources of contingencies, then RTP fulfills all the criteria for a complex environment. In this case, efficiency is not the only criterion for the optimization strategy, but robustness also becomes an important aspect.
Fig. 13.1 RTP as a convergence inducing process
In effect, the trains just follow their intended trajectories (in a simulated environment), while the optimization strategy measures the conflicts and adjusts the timetables ‘on-the-fly’. The departure times are the only degrees of freedom that the problem solver has, but changing them also changes the conflicts (reactivity).

Intuitively it is clear that train timetable problems cannot always be resolved entirely through optimization of the departure times. The railroad infrastructure may be too restrictive, for instance when two trains are heading towards each other on a single track. Most existing railroad infrastructures will be sufficiently extensive to avoid these infrastructural limitations. A second source of conflicts is related to the number of trains in service at a given time. The potential for conflicts increases with the number of trains that use the infrastructure. Beyond a certain critical threshold, the traffic on the railroad infrastructure is so dense that conflicts can no longer be resolved. This is another boundary of the RTP. A practical description of RTP therefore is as follows: given a certain rail infrastructure, and given a certain required service of trains, can the infrastructure provide this service without conflicts?
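Measuring conflicts as a function of a timetable can be sketched as follows. The model is a deliberate simplification and an assumption of this example, not the simulation environment used in the experiments: a route is a list of tracks with travel times, a conflict is any pair of trains that occupies the same track during overlapping time intervals, and the names `occupancy` and `count_conflicts` are hypothetical.

```python
from itertools import combinations


def occupancy(route, departure):
    """Yield (track, enter, leave) for a train that starts at `departure`.

    route -- list of (track_id, travel_time) pairs in travel order
    """
    t = departure
    for track, duration in route:
        yield track, t, t + duration
        t += duration


def count_conflicts(routes, timetable):
    """Number of train pairs that share a track at overlapping times."""
    conflicts = 0
    for a, b in combinations(sorted(routes), 2):
        slots_b = list(occupancy(routes[b], timetable[b]))
        clash = any(
            track_a == track_b and enter_a < leave_b and enter_b < leave_a
            for track_a, enter_a, leave_a in occupancy(routes[a], timetable[a])
            for track_b, enter_b, leave_b in slots_b
        )
        conflicts += clash
    return conflicts


# Hypothetical infrastructure: three trains that all need track "B".
routes = {
    "t1": [("A", 5), ("B", 10)],
    "t2": [("C", 3), ("B", 10)],
    "t3": [("B", 10), ("D", 4)],
}
timetable = {"t1": 0, "t2": 0, "t3": 0}
before = count_conflicts(routes, timetable)

# Departure times are the only degrees of freedom of the problem solver.
timetable.update({"t2": 12, "t3": 25})
after = count_conflicts(routes, timetable)
```

An optimization strategy only sees the value returned by `count_conflicts`; shifting one departure time can remove one conflict and create another, which is the reactivity mentioned above.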
One attempt to automatically generate an optimal rail timetable for a full railroad infrastructure (a model of the Dutch railroad infrastructure) used so-called symbiotic networks to resolve this [16, 17, 19].
13.3 Symbiotic Networks

Like many forms of evolutionary computational intelligence, symbiotic networks have been inspired by a phenomenon known in nature, namely symbiosis (or mutualism). As a reference, a possible repertoire of interactions between agents is listed in a pattern called actor/co-actor (see Table 13.3).¹

Table 13.3 Actor/Co-actor Pattern

Pattern:      Actor/Co-actor
Description:  An actor can interact with others according to a number of strategies. The notion of perceived ‘benefit’ is assigned to these interactions between actor, the entity initiating the interaction, and the co-actor, the entity that is subject to the interaction. This could, for instance, be according to the following table:
a.k.a.:
Constraints:
  Name:              Co-existence  Competition  Parasitism  Altruism  Symbiosis  Synnecrosis
  Benefit actor:     0 + + + 0/+ -
  Benefit co-actor:  0 0/-/0/+ + + 0/+ -
  a.k.a.:            Adversarialism  Mutualism  (e.g. spite)
Notes:        A further distinction could be made in predator-prey relationships (the actor is predator), which means that the co-actor is the resource that the actor aims to acquire. Also a ‘vertical’ interaction where the actor is a participant of a co-actor (aggregate form) is an interesting special case.
Of all the interaction patterns between biological organisms, such as competition and altruism, symbiosis is the only one that, to date, has not been elucidated by mathematical means. The initial research therefore aimed to make a model that would demonstrate the minimal conditions under which autonomous agents, which do not necessarily rely on each other, engage in a relationship of mutual dependency [15, 16]. It will be clear that this premise already biased the research towards an agent network. The eventual network, however, proved to have a relationship with neural networks, as the convergence criterion for symbiotic networks was very similar to the Rosenblatt convergence criterion for perceptron neural nets [25]. Basically, a symbiotic network can optimize (or ‘learn’) one pattern, whereas a neural network can learn multiple patterns. A symbiotic network however, like most agent systems, does not know a distinct learning cycle. A symbiotic network ‘learns by doing’.

¹ The original name of the pattern was ‘actor-actant’, but it has been renamed because of the specific use of ‘actant’ in the social sciences.
Practical tests showed that symbiotic networks are particularly interesting in dynamic environments. If a number of n agents collaborated in solving a certain task, the complexity of O(n3 ) proved to be relatively poor in static environments, such as when the Traveling Salesman Problem was taken as benchmark [15]. A chance article in a Dutch newspaper on railway timetable problems provided the ideal experimental environment for symbiotic networks. A model of the Dutch railway infrastructure was made, in which various optimization strategies were implemented and compared. It was experimentally demonstrated that the complexity of the solution remained approximately O(n3 ) for RTP. First, the theory behind symbiotic networks will be given some attention. As was mentioned earlier, the initial research aimed to find the minimal requirements that agents would need to engage in a symbiotic interaction. Symbiosis is seen here as a mutually beneficiary pattern of interaction. According to actor/co-actor, mutualism will also conform to this pattern, but mutualism is generally associated with a parasitic interaction that turns out to be mutually beneficial [9]. Symbiosis in the way it is used here rather starts from co-existence, which develops into symbiosis under certain conditions. This applies for more co-operative strategies, such as CCGAs [23, 24], swarming algorithms and ANT algorithms [2], although these solutions are usually designed to be co-operative. Symbiotic networks have to learn this, given a certain overall (predetermined) goal. As a first crude description, one could say that the agents in a symbiotic network provide some sort of service that benefits the others. This is usually the implicit assumption of symbiosis or mutualism, which comes down to a form of ‘I’ll scratch your back if you’ll scratch mine’. However, the model that was developed shows that this elementary form of feedback is insufficient to create a stable relationship. 
A specific kind of communication is required to negotiate the need for the services through the network. Symbiosis assumes that the benefit, which is mathematically represented by a goal function, is provided by the other participants in the network, and so the agents are encouraged to maintain and optimize the relationship once this benefit is ‘detected’. In symbiotic networks, this optimization is considered to be the result of an unusual feedback loop that is established once the agents are in each other’s sphere of influence. This loop arises from the agents’ ability to communicate their needs (through so-called stress signals) and to change their behavior based on these stress signals. It is a bit like a parent and her baby: when the baby starts crying, the parent stops doing whatever she was engaged in and addresses the baby’s needs. In a way, some agents are sensitive to certain contingencies in the environment, or require resources from it, while other agents have the means to address them. It is clear that competitive or altruistic approaches will not be a preferred form of interaction between these agents in such situations. Parasitic or (other) invasive strategies might work if agents know beforehand which others to target. However, it is not always known which teams should be formed. Symbiotic networks are, to some extent, able to figure this out by learning patterns in the stress signals that are communicated.
K. Pieters
Fig. 13.2 In a symbiotic network, the problem domain is ‘folded in’ the network
The model presented in this paper considers the environment of the system to be an integral part of the network (Figure 13.2) [4]. This approach is similar to that of Pnuelian reactive systems and agent-based systems [28].
13.3.1 A Theory of Symbiosis

The environment is a vector $E = \{E_0, \ldots, E_k\}$ of sub-environments $E_i$, or neighborhoods². For reasons of descriptive clarity, a dotted line marked with the token $E_i$ will be used to denote the neighborhood from which the entity obtains its input or releases its output. Such a neighborhood usually has a dimension, such as force, temperature, velocity, etc. The model further starts from the premise of a primary transformation process of an input signal to an output signal. The response of this process is:

$$O_i = \mu_i \cdot I_i \tag{13.1}$$
Every agent is connected to the environment through a pair of neighborhoods {Ei , Ei+1 }, that may have different dimensions. Agents are connected through these neighborhoods. For now, the following is presupposed:
Fig. 13.3 Primary Transformation Process

² With the inevitable progression of insight, ‘surroundings’ is currently preferred for the interaction space of an agent.
$$I_{i+1} = E_{i+1} \cdot O_i \tag{13.2}$$
Apart from this, the response of the neighborhoods cannot be controlled and may even be unknown. Each agent is expanded with the following properties:

- A goal $g_i$
- A dimensionless stress function $s_i(I_i, g_i)$ (13.3)
- Symbiotic behavior $\mu_i(s_0, s_1, \ldots, s_n)$ (13.4)
This results in the following agent, a symbiot (Figure 13.4):
Fig. 13.4 A symbiot emits a stress signal and is able to change its behavior as a function of the stress signals in the network
The goal is associated with the input of the entity, which reflects for instance a biological organism’s need for food. When these symbiots are connected, a very simple network is formed (Figure 13.5). In this figure, symbiot$_1$ is the successor of symbiot$_0$ through the environment. At convergence, symbiot$_0$ should be able to support symbiot$_1$ in reaching its goal $I_1 = g_1$. This means that the following will apply:

$$I_1 = g_1 = O_0 \cdot E_1 \;\Rightarrow\; O_0 = \frac{g_1}{E_1} \tag{13.5}$$

Fig. 13.5 Minimal Symbiotic Network
According to (13.1), this results in:

$$O_0 = \mu_0 \cdot I_0 \;\Rightarrow\; \mu_0 = \frac{g_1}{E_1 \cdot I_0} \tag{13.6}$$
The behavior $\mu_0$ is determined by the stress signals and should converge to a situation where $I = g$, or $I_0 = g_0$ and $I_1 = g_1$. In such a situation (13.6) becomes

$$\mu_0 = \frac{g_1}{g_0 \cdot E_1} \tag{13.7}$$
It is clear that if this applies for all symbiots in the system, one could say that the system has mapped its behavior $\boldsymbol{\mu} = [\mu_0, \mu_1]$ against its environment and its goals. This is interesting when the environment is unknown, for a converged system can provide information about the environmental relationships between the various probes that the system has put in the environment. Note that in this situation the inputs (i.e. the goals) should never be allowed to become zero, as this would impair convergence.

If the symbiots are connected in such a way that $E_2 = E_0$, then a situation of mutual benefit has been formed. This results in symbiosis according to the pattern of actor/co-actor. Convergence of the symbiotic system takes time. A system has converged when:

$$\lim_{t \to t_c} \Delta\mu_i = 0 \tag{13.8}$$
$\Delta$ is the change over a certain amount of time and the convergence time $t_c$ is the time when the system has converged. Suppose the behavior $\mu_i$ changes according to a certain algorithm $f_{\mu_i}(s_0, s_1, \ldots, s_n)$:

$$\Delta\mu_i = \mu_i^{t+1} - \mu_i^{t} = f_{\mu_i}(s_0, s_1, \ldots, s_n) \tag{13.9}$$
Here, $t$ stands for the iteration step in time. At convergence the following should apply:

$$f_{\mu_i}(s_0, s_1, \ldots, s_n) = 0 \tag{13.10}$$
Adding this so-called symbiotic algorithm to the symbiot results in figure 13.6.
Fig. 13.6 Symbiot
The stress signal $s_i(I_i, g_i)$ reflects the aim that one has when applying the network to a specific problem, and should be zero once the goals have been achieved. The network aims to achieve $I = g$. Take, for instance, the following stress function:

$$s_i(I_i, g_i) = \rho\,(g_i - I_i), \quad \rho > 0 \tag{13.11}$$
In this equation, the stress increases the further the input moves away from the goal. The factor $\rho$ is used to make the stress signal dimensionless and to normalize its value, for instance to the interval $\langle -1, 1 \rangle$. In this case, at convergence:

$$\lim_{t \to t_c} s_i = \lim_{t \to t_c} \rho\,(g_i - I_i) = 0 \tag{13.12}$$
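The fixed point in (13.7) can be checked numerically. The Python sketch below iterates the minimal two-symbiot network of Figure 13.5 using (13.1), (13.2) and the stress function (13.11). The step size `delta`, the choice of the symbiotic algorithm as the plain sum of the stress signals, the assumption that symbiot0 already receives its goal as input, and all numeric values are illustrative assumptions and are not taken from the original experiments.

```python
def simulate_minimal_network(g0=2.0, g1=3.0, e1=0.5, rho=0.1, delta=0.5, steps=500):
    """Iterate the two-symbiot network and return (mu0, i1)."""
    i0 = g0              # assumption: symbiot0's own input already equals its goal
    mu0 = 1.0            # arbitrary initial behavior
    i1 = e1 * mu0 * i0
    for _ in range(steps):
        o0 = mu0 * i0              # primary transformation (13.1)
        i1 = e1 * o0               # neighborhood response (13.2)
        s0 = rho * (g0 - i0)       # stress of symbiot0 (13.11), zero here
        s1 = rho * (g1 - i1)       # stress of symbiot1 (13.11)
        mu0 += delta * (s0 + s1)   # assumed symbiotic algorithm: sum of stress
    return mu0, i1


mu0, i1 = simulate_minimal_network()
predicted_mu0 = 3.0 / (2.0 * 0.5)   # g1 / (g0 * E1), the fixed point of (13.7)
print(mu0, predicted_mu0, i1)
```

With these values the behavior settles at the value predicted by (13.7) and the input of symbiot1 approaches its goal. Other step sizes may converge more slowly or oscillate, which this sketch does not explore.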
Convergence should ideally never take an infinite amount of time. However, the convergence time $t_c$ also poses restrictions on the goals of the symbiot. If these goals are functions of time, then ideally a goal should not change during optimization and the input signal should be stable, as the system might otherwise never be able to converge. This also applies to the environment $E$. In practice, however, this cannot be assumed, and thus the system will be in continuous flux. Besides this, the various signals are limited to minimum and maximum ranges in practical systems. If the behavior or the stress signals of the symbiots run into these limits, then the system may converge to a state where $I \neq g$.

So far, it has been assumed that convergence has taken place. The question, however, is which criteria allow the symbiotic network to converge to a situation where $I \approx g$.

Convergence

Consider a network of $n + 1$ symbiots with:

• an environment $E = [E_0, E_1, \ldots, E_n]$
• an input vector $I = [I_0, I_1, \ldots, I_n]$
• a goal vector $g = [g_0, g_1, \ldots, g_n]$
• an output vector $O = [O_0, O_1, \ldots, O_n]$
• a stress vector $s = [s_0, s_1, \ldots, s_n]$

The goal vector is constant within the convergence time $t_c$. For each symbiot$_i$, the following applies at a given time $t$:

$$O_i^t = \mu_i^t \cdot I_i^t \tag{13.13}$$

$$\mu_i^{t+1} = \mu_i^t + \delta \cdot f(s), \quad \delta > 0,\ f(s) \text{ is dimensionless} \tag{13.14}$$

$$s_i^t = \rho\,(g_i^t - I_i^t), \quad \rho > 0,\ s_i \text{ is dimensionless} \tag{13.15}$$
δ and ρ are included in (13.14) and (13.15) to scale the dimensionless factors f (s) and si to the dimensions of Ii and μi . For now, each neighborhood is defined as follows:
$$I_j = E_j \cdot O_i, \quad E_j > 0 \quad \forall i, j \tag{13.16}$$
Now suppose an initial situation where $I \neq g$ and the following relations apply:

$$f(s) \begin{cases} \geq 0, & \text{if } \sum_{i=0}^{n} s_i \geq 0 \\ < 0, & \text{if } \sum_{i=0}^{n} s_i < 0 \end{cases} \tag{13.17}$$
Because of (13.14), $\mu_i$ will increase or decrease due to (13.17). The same will also happen to $O_i$ due to (13.13), (13.14) and (13.16). But the opposite will occur due to (13.15). This process will therefore repeat itself until $\sum_{i=0}^{n} s_i = 0$. The time $t_c$ that
this convergence takes depends on the values of the elements of $s$, and therefore on the values of $I$ (13.15) and $E$ (13.16). This means that the environment affects the convergence time of the network. The convergence criterion shows that any symbiotic algorithm that complies with the constraints of (13.17) will cause the system to converge, and that every symbiot in the network should be connected to at least one other through the environment. The convergence criterion also shows that $I = g$ is only one of the possible solutions. Depending on the symbiotic algorithm, averages of the various $(g_i - I_i)$ functions can also cause convergence. This will result in premature convergence $(g_i - I_i) = e_i$, where $e_i$ is not equal to zero. The problem of premature convergence is similar to limitations in pattern matching that are recognized in neural networks [5].

The $n$ symbiots as a whole form an $n \times n$ matrix ($n$ symbiots that are ‘listening’ to $n$ stress signals, including their own) that is dynamically altered by the stress vector $s$. This results in a number of eigenvectors, which stand for solutions of $s$ where the symbiotic algorithm is zero. The feedback loop that is constructed through the environment causes the system to converge to one of them. However, the eigenvector still represents a whole range of solutions of $s$, of which only $s = 0$ is desired. The challenge of the system is to approximate this ideal convergence. This will be discussed further in the next section.

The neighborhood influences the convergence process through its sign. $E_i$ could be a negative function in (13.16), in which case convergence is still possible, provided that the signs of the appropriate $s_{i+1}$ are inverted in the symbiotic algorithm that is used.

Up to now a very rigid goal criterion has been used, namely one where $I_i = g_i$. In nature, the survival goals are often less strict and could be, for instance, $I_i \geq g_i$ (e.g. food).
This would translate to the stress signals as follows:

$$s_i(I_i, g_i) = \begin{cases} \rho\,(g_i - I_i), & \text{if } I_i < g_i \\ 0, & \text{if } I_i \geq g_i \end{cases} \tag{13.18}$$

This choice increases the solution space of the network significantly, as a whole range of input vectors $I$ lead to a situation where $s = 0$. This allows a much larger portion of the solution space to give adequate results, leading to efficient systems that use very simple symbiotic algorithms.
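As a small illustration of the difference between the strict stress function (13.11) and the relaxed one (13.18), the following Python sketch implements both. The function names and the default value of `rho` are assumptions made for this example.

```python
def strict_stress(i, g, rho=1.0):
    """Stress of (13.11): non-zero whenever the input differs from the goal."""
    return rho * (g - i)


def relaxed_stress(i, g, rho=1.0):
    """Stress of (13.18): zero as soon as the input meets or exceeds the goal."""
    return rho * (g - i) if i < g else 0.0


# An input above the goal still causes (negative) stress under the strict rule,
# but none under the relaxed rule.
print(strict_stress(5.0, 3.0), relaxed_stress(5.0, 3.0))
```

Under the relaxed rule every input at or above the goal is stress-free, which is what enlarges the set of input vectors with $s = 0$.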
13.3.2 Premature Convergence

As a symbiotic network is intrinsically embedded in its environment, the latter will partially determine the network structure. This unpredictable aspect influences the network in two ways:

1. By the transformation of an output signal of one symbiot to the input of another. This is determined by the environment $E = [E_0, \ldots, E_n]$.
2. By the connection scheme, or structure of the network.

The convergence criterion showed that the first issue mainly concerns the sign of each transformation, and furthermore that the values of $E$ affect the convergence time $t_c$ of the entire network. The connection scheme, which deals with the question of which and how many successors a symbiot has, deserves further scrutiny, as it influences premature convergence, where $I \neq g$ when $f(s) = 0$. A successor of a certain symbiot$_i$ is a symbiot$_j$ whose input is connected to the output of symbiot$_i$ either through a neighborhood (immediate successor) or through a number of symbiot-neighborhood pairs, as depicted in Figure 13.7. This figure shows a parallel construction, where one symbiot services two immediate successors, as well as a branch, a sequence of symbiots. Both constructions have implications for the convergence of the network.
Fig. 13.7 Extended Symbiotic Network
Parallel connections of symbiots can impair ideal convergence. It is clear that if a symbiot services multiple successors, convergence to all their goals is only possible if there is overlap between them. The symbiot will never be able to serve its successors if they have contradicting goals. In such a case the network will opt for an average, the upper, or the lower bounds of their goals, depending on the symbiotic algorithm that is chosen. These limitations are intrinsically determined by the way the symbiots are connected and the goals they have. There is no way that a symbiotic algorithm can work around these limitations.
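The averaging effect of a parallel connection can be reproduced with a few lines of Python. In this sketch one symbiot feeds two successors through the same neighborhood, and its behavior is driven by the plain sum of their stress signals. The summing algorithm, the step size `delta` and the numeric values are assumptions chosen for illustration; other symbiotic algorithms could settle on the upper or lower bound instead.

```python
def serve_two_successors(g1, g2, e=1.0, i0=1.0, rho=0.1, delta=0.5, steps=1000):
    """Return the input both successors receive once the behavior has settled."""
    mu = 0.0
    for _ in range(steps):
        shared_input = e * mu * i0   # both successors receive the same input
        stress_sum = rho * (g1 - shared_input) + rho * (g2 - shared_input)
        mu += delta * stress_sum     # assumed symbiotic algorithm: sum of stress
    return e * mu * i0


# Contradicting goals: the network settles between them, so neither goal is met.
print(serve_two_successors(2.0, 6.0))
# Overlapping (identical) goals: ideal convergence is possible.
print(serve_two_successors(3.0, 3.0))
```

With goals 2 and 6 the shared input settles at their average, where the two stress signals cancel each other although neither is zero. This is the premature convergence described above.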
Branches in a network can also cause premature convergence through propagation. In a branch, a change in the behavior of one symbiot results in changes to the input of its immediate successors, and their immediate successors, and so forth. If such a sequence of successors contains amplifying elements, then a small adjustment at the start of the branch can cause enormous fluctuations just behind the amplification. This amplification is transferred to the stress signals and results in a situation where the contributions of symbiots further down the chain are much higher than those of the immediate successors. They ‘out-shout’ not only the immediate successors, but also other stress signals in the network. Therefore, they impair correct functioning of the network. Note that amplification occurs when a goal value of a symbiot is lower than that of a successor.

For instance, take a symbiotic algorithm of the following type:

$$f(s) = w_0 \cdot s_0 + \cdots + w_n \cdot s_n = \sum_{i=0}^{n} (w_i \cdot s_i) \tag{13.19}$$
It is clear that a major issue for such a symbiotic algorithm is to ‘decide’ which stress signals should get which weights. A symbiot can only serve its (immediate) successors and therefore the symbiotic algorithm should ideally be constructed to discard the stress signals of other symbiots. These contributions may only pollute useful information. It is like trying to hear your children call out on a busy playground. The dilemma that a symbiot faces is that its successors are determined by the environment and are therefore not necessarily known. If a symbiot knows what its immediate successors are, and there is overlap in their goals, then it can concentrate on optimizing these goals. This would result in a network where ideal convergence can take place. Therefore, a perfect symbiotic algorithm will be able to:

1. identify the immediate successors
2. set their corresponding weights to be unequal to zero

The weights of other stress signals should ideally be zero, as their sum totals zero at convergence anyway. Besides this, they should ideally not influence the successor’s operation. These issues were investigated in an experimental setting, which confirmed these observations [15]. One of the most interesting algorithms proved to be a variant of (13.19) where the weight factor was determined by a Hebbian learning rule:
$$w_i^t = w_i^{t-1} + \rho \cdot s_i, \quad \rho \in \langle 0, 1 \rangle \tag{13.20}$$
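A minimal Python sketch of the weighted sum (13.19) and the weight update (13.20) is given below. It only transcribes the two formulas; the function names and the example values are assumptions, and no forgetting or other refinement of the original implementation is modeled.

```python
def weighted_stress(weights, stress):
    """Symbiotic algorithm of (13.19): weighted sum of all stress signals."""
    return sum(w * s for w, s in zip(weights, stress))


def hebbian_update(weights, stress, rho=0.1):
    """Weight update of (13.20): each weight moves with its own stress signal."""
    return [w + rho * s for w, s in zip(weights, stress)]


weights = [0.0, 0.5]
stress = [1.0, -1.0]
print(weighted_stress(weights, stress))
print(hebbian_update(weights, stress))
```

Because every weight is driven by its own stress signal, the weights of persistently stressed symbiots drift away from those of the others over repeated updates.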
The algorithm also included a forgetting rule in order to compensate erroneous learning during the initial phase of the convergence process. The resulting network converged almost ideally in situations where branches had decreasing goals (no amplifications). Apparently the network is able to ‘learn’ its (immediate) successors. When amplifications were included (random goals), the system converged to the same level as an averaging algorithm, a variant of (13.19) where:
$$w_i = w \quad \forall w_i, \quad w \in \langle 0, 1 \rangle.$$
This particular configuration gives the network traits of a neural network, in which the algorithms learn patterns of communication in the stress signals. At this point also, the environment has become less specific than was initially depicted in (13.2) and (13.16). All that is now required is that a symbiot should be able to service its successors in whatever way. If this is possible, then the network will try to optimize. As the environment influences the stress signals, the traits of the environment are taken up in the patterns that are learned. This does not mean that the system will converge ideally, but only that the system will do its best given the means and limitations of the problem domain. The symbiots ‘fold around’ the neighborhoods and try to address the contingencies through the feedback loops they form, and this contributes to the robustness of the algorithm. The challenge then is to see which symbiotic algorithms are best suited to guide the convergence-inducing process.
13.4 Symbiotic Networks as Optimizers

The theoretical model of symbiotic networks may have obfuscated the fact that in order to use a symbiotic network as an optimizer, a relatively simple heuristic needs to be implemented. Every agent in the network must be able to monitor stress signals, and it must be able to change its behaviour so that the overall stress becomes less. At first glance, this seems problematic, for the change of the global stress signals does not (necessarily) depend on the temporal change of an individual agent. However, as all the agents are trying to achieve the same, the network will try to optimize globally. The quality of the optimization is another matter, of course, but it was hypothesized that the dynamics of the network might actually support optimization to some extent, as this might keep the network from getting trapped in a local minimum. Besides this, as more stress signals are resolved, the network can concentrate more and more on optimizing the remaining ‘pockets of resistance’. This hypothesis needed experimental verification, which was investigated with RTP.

When symbiotic networks are used as optimizers, the manner of optimization becomes relatively simple. As the stress signals are both a measure of the effectiveness of the co-operation the agents engage in and a monitor of events from the environment, optimization is a matter of trying to minimize the stress signals, albeit at the expense that these stress signals partially reflect an environment that can disrupt the optimization process. Therefore, for many practical optimization problems, a symbiotic network may be configured to be continuously in operation (like a living cell), but it can also terminate when a certain condition has been met, such as:
$$\lVert s \rVert < \varepsilon, \quad \varepsilon > 0 \tag{13.21}$$
It is also worth pointing out that, in principle, there are no parameters that need to be ‘tweaked’; the multiple feedback loops in an n-agent network are the principal operators of the optimization process.
This does not mean that specific heuristics or optimization algorithms cannot improve the overall convergence when applied to a specific problem. A number of these ‘designed’ interventions will return in the specific case of RTP, which will be discussed next. Globally, RTP configured as a symbiotic network means simulating the railway infrastructure and the intended services of the trains. The simulation lets the trains travel their trajectory, and every time a conflict between trains occurs, a stress signal is generated. The actual optimizing layer collects the stress signals and generates a new timetable, which defines the ‘behaviour’ of the symbiot. The symbiotic algorithms that are used determine how the stress signals modify the timetable. Most of the strategies implemented here change the departure time of a train a minute earlier or later per optimization cycle, and keep track of whether a previous change results in an improvement or not. Thus, local optimization (of individual trains) should result in global optimization. The overall flow chart is given in Figure 13.8. This approach will be discussed in greater detail next.
Fig. 13.8 Basic Flow Chart of RTP Configured as a Symbiotic Network
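The optimization cycle and the termination condition (13.21) can be sketched in Python as follows. This is a simplified hill-climbing loop under stated assumptions: the callback `evaluate` stands in for the railway simulation and returns one stress signal per trip, each trip is shifted by one minute per cycle, and a shift is kept only if the overall stress norm decreases. The names `hill_climb`, `evaluate` and `stress_norm`, and the toy stress model at the end, are hypothetical and not part of the original software.

```python
import math


def stress_norm(stress):
    """Euclidean norm of the stress vector, used for the test in (13.21)."""
    return math.sqrt(sum(s * s for s in stress))


def hill_climb(departures, evaluate, max_cycles=1000, eps=1e-9):
    """Shift each departure by one minute per cycle while stress decreases."""
    departures = list(departures)
    direction = [1] * len(departures)     # +1: leave later, -1: leave earlier
    stress = evaluate(departures)
    for _ in range(max_cycles):
        if stress_norm(stress) < eps:     # termination condition (13.21)
            break
        for k in range(len(departures)):
            if stress[k] == 0:
                continue                  # this trip has no conflict to resolve
            old_norm = stress_norm(stress)
            departures[k] += direction[k]
            new_stress = evaluate(departures)
            if stress_norm(new_stress) < old_norm:
                stress = new_stress       # keep the improvement
            else:
                departures[k] -= direction[k]   # undo the change
                direction[k] = -direction[k]    # try the other way next cycle
    return departures, stress


# Toy stand-in for the simulation: stress is the distance to a conflict-free slot.
targets = [3, -2]
result = hill_climb([0, 0], lambda d: [t - x for t, x in zip(targets, d)])
print(result)
```

In the toy model every trip has an independent conflict-free slot, so the loop always terminates through (13.21). In the railway setting the stress signals of different trips are coupled through the simulation, and termination is not guaranteed.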
13.5 Trains as Symbiots

One of the major problems of a dense railway infrastructure is that trains put a demand on scarce resources, i.e. the tracks, switch points, platforms of railway stations and so on. Trains traveling along different trajectories at different speeds may sometimes find other trains ahead of them traveling much more slowly. On single-track trajectories there is a risk of running into a train coming from the opposite direction. There are only a limited number of places where such conflicts can be resolved. Railway stations often have a number of extra tracks that allow faster trains to overtake slower ones, and switch points can cause the trains to take different directions. This contributes to the highly dynamic nature of such conflicts, and makes it a complex problem to tackle. As was mentioned earlier, optimizing the departure times of the trains is one of the most cost-effective means of improving the capacity of the rail net. As the conflicts are a function of these departure times, there should be a timetable that results in a minimal number of conflicts, and preferably none, of course. A symbiotic network was implemented to generate a timetable with a minimal number of conflicts for a model of the Dutch rail net. The Netherlands Railways
(NS) operates a (daily) cyclic rail timetable, so trains depart at fixed intervals (for instance twice or three times an hour) [12, 13]. Before continuing, it may be helpful to give a few definitions:

• A timetable is a set of departure times for all the trains that are required to travel on a certain rail infrastructure during a certain time frame, for instance daily.
• A trajectory is a service that one or more trains are required to travel and runs from a certain station (origin) to a destination. In between, the trains usually stop at one or more stopover stations.
• A (train) schedule is the schedule that is assigned to a trajectory. This includes the number of trips during a certain period (e.g. four times per hour) and the departure times.
• A trip is the service that a certain train is actually carrying out at some point.

RTP has been used as an environment to investigate the behaviour of a symbiotic network in a practical setting. The focus was not on generating actual timetables. The research did confirm that the network does optimize, and therefore could, in principle, be used to generate real timetables. The experiments were conducted by running a computer simulation of the Dutch railway system with the different symbiotic algorithms and under different conditions. Each run consisted of 30,000 iterations, where an iteration step represents a ‘real time’ of 15 seconds. All the experiments initially select a random departure time within the range of a trip.
13.5.1 Trains in Symbiosis

The previous discussion on symbiotic networks has suggested that a practical application basically consists of three layers that interact with each other:

• The environment, which consists of the railway infrastructure and the rules and constraints that apply, such as maximum speed, waiting times at stations, etc.
• The trains, which are considered the active optimizing entities in this network, as the departure times of these trains can be changed.
• An optimizing layer, which consists of the symbiotic algorithms that are used.

In the particular solution that was developed (other strategies are also possible), the trains ‘collect’ stress signals, based on the conflicts they encounter. These are passed to the optimizing layer, which uses the stress signals to manage train schedules. This alters the departure time of the trains, which is the ‘change of behavior’ that symbiotic agents need to be capable of. The departure times thus depend on the stress signals. While the environment is domain-specific, the optimizing layer does not depend on the problem domain. The optimizing layer also has a sort of ‘plugin’ structure for the symbiotic algorithms. This way different types of algorithms can be tested and compared.
Fig. 13.9 Dutch Rail Infrastructure and Global Software Architecture
13.5.2 The Environment

For this research, an environment was created in software that simulates a very constrained version of the Dutch railway infrastructure. This ensured that the model would generate a lot of conflicts, but also implied that a conflict-free timetable would be nearly impossible. This choice was made for practical reasons and because the focus was on optimization strategies rather than creating an actual timetable. The environment consists of a set of neighborhoods, which provide the interaction space of the trains and constrain their movement. The neighborhoods implement tracks, stations, border crossings, curves and so on. Single, double, or multiple (parallel) tracks are possible, but in the experimental model the environment mainly consists of single and double tracks. Only in very dense areas, such as in and between major cities (especially the densely populated area between Amsterdam, Rotterdam, the Hague and Utrecht, called the ‘Randstad’), are four or six tracks sometimes used in order to anticipate congestion in those areas. For details on the implementation, see [17].

Every train is subject to a trajectory, a list of neighborhoods that determines the travel plan. The trajectory starts and ends at stations, the origin and destination. The trajectory also defines at which stations the train stops in between (stopover stations), the departure time in both directions, and the number of trains traveling this trajectory per hour. The application generates one timetable for a full day (from 05.00 AM to midnight), and does not differentiate between weekdays and weekends, or other specific circumstances.
13.5.3 The Trains

The trains are the active entities in the network. As mentioned earlier, they travel along their trajectory according to the constraints defined by the neighborhoods they pass. But trains also have their own set of constraints, which sometimes overrule
those of the neighborhoods. For one, the different types of trains define a hierarchical structure (Table 13.4):

Table 13.4 Types of Trains

Type            vmax [km/h]   Description
International   140           Only stops in a few major cities and border crossings
Intercity       140           Only stops in major cities of its trajectory
Express Train   120           Stops at a limited number of stations of its trajectory
Local Train     100           Stops at most of the stations of its trajectory
The table shows the hierarchy from top to bottom. The type of train not only determines the maximum speed $v_{max}$ and the number of stops, it also defines how the trains are influenced by stress signals. The departure time will only be influenced by stress signals of trains of equal type or higher. An international train will only respond to stress signals of other international trains, while local trains respond to stress signals of all other train types. This approach minimizes possible goal conflicts of trains that are considered more important, but imposes a great deal of stress on express trains and local trains. On the other hand, a lot of local trains operate in rural areas, sparsely populated parts of the railway infrastructure, where they mainly encounter stress around the origin and destination stations of their trajectory. In such cases, the hierarchy in train types contributes to a situation where stress is distributed from ‘hot spots’ to the periphery.

If the trajectory is a train’s general travel plan, a trip is its instantiation. If a trajectory is traveled, say, four times an hour, the application creates eight trips for that trajectory, four for each direction. A train may ‘collect’ stress signals based on its encounters with other trains, but it is the trip that uses the result to change the departure time of the next train. When a train has reached its destination, the collected stress is made available to the optimizing layer. This approach makes the optimization process less sensitive to short-term fluctuations of the stress signals during a trip. Strictly speaking, the trip is therefore the active optimizing element in the network, while the train merely causes and collects stress signals. This way a relationship between departure time and stress signal is established.

The Stress Signals

The stress signals reflect the conflicts that trains can encounter:

• Two trains traveling in opposite directions pass each other on a single track.
• A train tries to enter a neighborhood that has no free tracks.
• A train encounters another train of lower rank (speed) less than 1500 meters in front of it, heading in the same direction.

The model does not adjust the behavior of the trains; instead they pass each other as if the other train is not there. Therefore, only the stress signal is a reminder of these
encounters. When two trains find themselves in conflict, the stress that is calculated is determined by the location of the trains and the nearest neighborhood that has sufficient tracks to resolve the problem. If the nearest free neighborhood is in front of a train, it outputs a negative stress signal, which is translated to a request to leave earlier at the next trip. In the other situation a positive stress signal is given, which is a request to leave later. The system enforces that both trains give a stress signal that leads them to the same free neighborhood. This prevents stalemates, especially for trains coming from opposite directions. If, for instance, both trains were to give a stress signal to leave earlier, it would result in exactly the same conflict occurring a bit earlier on the next trip.

An advantage of this particular environment is that every agent in the system knows exactly which other agents it is interacting with, and so its response is targeted to service those that actually profit from it. This improves the behavior of the system as a whole, as the chance of premature convergence due to goal conflicts and neutralizing stress signals becomes smaller. Of course, the stress collected by a train is still the superposition of the collected stress signals, but due to the dynamic character of the network there is a good chance that the system ‘pulls’ itself out of temporary premature convergence due to fluctuation of the stress signals. The dynamic character of the network also prevents stable branches with amplifications from occurring in the network. This means that the dynamics of the environment can, to some extent, actually be used by the system to improve its overall behavior.
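The two rules described in this section, the train-type hierarchy and the sign of the stress signal, can be sketched in Python as follows. The sketch assumes a one-dimensional track coordinate and a heading of +1 or -1. The names `RANK`, `responds_to` and `stress_sign` are hypothetical, and the magnitude of the stress, which depends on the locations involved, is deliberately left out.

```python
# Rank per train type, following the top-to-bottom order of Table 13.4.
RANK = {"International": 3, "Intercity": 2, "Express Train": 1, "Local Train": 0}


def responds_to(receiver, sender):
    """A train only responds to stress from trains of equal or higher type."""
    return RANK[sender] >= RANK[receiver]


def stress_sign(train_pos, heading, free_pos):
    """Sign of the stress signal relative to the nearest free neighborhood.

    heading is +1 when the train travels toward larger positions and -1 otherwise.
    Returns -1 (request to leave earlier) if the free neighborhood lies ahead of
    the train, and +1 (request to leave later) if it lies behind.
    """
    ahead = (free_pos - train_pos) * heading > 0
    return -1 if ahead else 1


# Two trains meet head-on on a single track; the nearest free neighborhood is at 20.
print(stress_sign(10.0, +1, 20.0))   # train heading toward it: leave earlier
print(stress_sign(12.0, -1, 20.0))   # train that has just left it: leave later
```

In the head-on example the two trains receive opposite signs, so that on the next trip they meet at the same free neighborhood instead of repeating the conflict elsewhere.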
13.5.4 The Optimizing Layer

The optimizing layer collects the stress signals of the trains and feeds these to the symbiotic algorithms, which in turn calculate a new departure time for those trains. The optimizing layer is not specific to a certain problem domain. Various algorithms were initially tested, which were mainly variants of the following formula:

$$d_i^t = d_i^{t-1} + \rho \cdot r_i \sum_{k=0}^{m} s_{ki}^{t} \tag{13.22}$$
In this equation, $d_i^t$ is the departure time of trip$_i$ at a given time $t$, $r_i$ is the range in which the departure time can fluctuate, and $\rho$ is an additional limiting factor. The sum of the stress of the $m$ trains that were encountered during the trip determines the new departure time of the trip. The stress is normalized to the range $[-1, 1]$. The range $r_i$ depends on the number of trips that are carried out per hour. If a trajectory is traversed four times per hour, the departure time fluctuates 7.5 minutes around every quarter of an hour. For a trajectory that runs two times an hour, it fluctuates 15 minutes around every half hour. The factor $\rho \in [0, 2\rangle$ has been added as an additional optimization parameter. For most experiments it defaulted to 1. The range and the factor $\rho$ together determine the bounds of the ‘liveliness’ in the system.
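The update rule (13.22) and the range $r_i$ can be sketched in Python as follows. The range is taken as half the interval between trips, which reproduces the two examples in the text (7.5 minutes at four trips per hour, 15 minutes at two). Clamping the summed stress to [-1, 1] and keeping the result within the nominal departure time plus or minus the range are assumptions made here to keep the sketch self-consistent; the original implementation may normalize and bound differently. All names are hypothetical.

```python
def fluctuation_range(trips_per_hour):
    """Range r_i in minutes: half the interval between two consecutive trips."""
    return 60.0 / trips_per_hour / 2.0


def update_departure(previous, nominal, trips_per_hour, stresses, rho=1.0):
    """One application of (13.22); all times are in minutes past the hour."""
    r = fluctuation_range(trips_per_hour)
    total = max(-1.0, min(1.0, sum(stresses)))   # assumed normalization to [-1, 1]
    candidate = previous + rho * r * total
    # Assumption: the departure stays within the nominal slot plus or minus r.
    return max(nominal - r, min(nominal + r, candidate))


# A trip scheduled around quarter past the hour receives a request to leave earlier.
print(update_departure(15.0, 15.0, 4, [-0.5]))
```

Negative stress moves the departure earlier and positive stress moves it later, in line with the sign convention of the stress signals.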
Most learning algorithms included a weight vector for every stress signal, in a similar fashion as (13.19) and (13.20). A series of tests have been carried out to analyze the optimizing behaviour of the network using three algorithms, namely a hill-climbing algorithm, Hebbian learning and a third that made the system behave a bit like a Kohonen network [17].
13.5.5 Computational Complexity

With every iteration step, a total of n trains can be traveling, with n possibly being larger than the number of trips of the system, as the duration of a trip is usually longer than the amount of trips per hour. The active trains can interact with maximally n other trains, leading to n stress signals that need to be updated with every iteration step. This leads to an upper bound of the computational complexity of $O(n^2)$ per iteration step. However, most, if not all, trains will only interact with a limited number of other trains, leading to a practical complexity closer to $O(n)$ per iteration step for most railway infrastructures. The complexity needs to be multiplied by the convergence time $t_c$ of the network in order to get an estimate of the computational complexity of the solution. In practice, convergence has usually completed well within 30,000 iterations for 186 trips, which is much smaller than $n^2$, so the upper bound of the complexity is $O(n^4)$, although the practical complexity is better than $O(n^3)$.
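The arithmetic behind these bounds can be written out as follows. The sketch takes n to be the 186 trips mentioned in the text, which is an assumption, since the number of active trains may differ, and it treats the single observed convergence time as indicative of $t_c < n^2$ in general.

```latex
\begin{align*}
t_c &< 30\,000 < 34\,596 = 186^2 = n^2, \\
O(n^2) \cdot t_c &\subseteq O(n^2) \cdot O(n^2) = O(n^4) \quad \text{(upper bound)}, \\
O(n) \cdot t_c &\subseteq O(n) \cdot O(n^2) = O(n^3) \quad \text{(practical case)}.
\end{align*}
```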
13.5.6 Results

A typical result of a number of runs is depicted in Figure 13.10.
Fig. 13.10 Average Amount of Trains Without Conflicts in %
On average, a system with Hebbian learning manages to reduce the number of conflicts to 17 ± 5% of the trips. However, individual runs have been made that managed to reduce the number of conflicts to less than 9%, while the upper bound of 22% was fairly constant. This gives reason to believe that there is still much room for improvement.
Fig. 13.11 Convergence with Delays: Maximally 5 minutes, 10% probability
Fig. 13.12 Convergence with Delays: Maximally 5 minutes, 20% probability
Figure 13.11 shows the results when random delays of trains were introduced to test the robustness of the solutions. The most striking result is the fact that on average the network hardly seems affected by the delays. Normally a conflict is resolved the moment when conflicting trains find a free track. Delays push the trains further away from the conflict area into the free neighborhoods. As most of these ‘sinkholes’ are formed by stations with a three minute stopover time, they are sufficient to resolve the majority of the delays that are generated, while delays larger than three minutes will only cause incidental stress and not lead to structural changes in the system. Delays with a higher impact will at some point deteriorate convergence, although the variance seems fairly constant (Figure 13.12). Similar results were obtained when the maximum delay time is increased.
13.5.7 A Symbiotic Network as a CCGA

The lack of comparable solutions makes it difficult to find a benchmark for RTP. Alternative strategies based on, for instance, competition are hard to implement, as these normally work best when a number of known alternatives can be tested and mutually compared. However, as symbiotic networks are fundamentally co-operative, they are comparable to so-called co-evolving cooperative genetic algorithms (CCGA) as proposed by Potter and de Jong [23, 24]. Therefore, as follow-up research, an attempt at benchmarking was made by implementing a CCGA.

In a CCGA, a number of cooperating genetic algorithms (GAs) are mutually connected by certain credit-assignment strategies, which are implemented by design. The populations optimize individually, but they share their results, which are then translated into a fitness function. As the stress signals provide a means of credit assignment, it was relatively easy to configure the symbiotic network to operate as a CCGA. Every symbiotic algorithm is implemented as a GA operating on a population of departure times of trains, and the stress signals are used as the fitness function. Starting from an initial, random population that covers the first trips, new individuals are formed through the standard operators for GAs.

This approach can provide a benchmark for symbiotic networks against a well-documented alternative co-operative strategy, as well as a novel approach for CCGAs, as symbiotic networks learn a form of credit assignment rather than having it implemented by design. Besides this, the configuration could be used to assess the intuition that the network would be less prone to the notorious ‘tweaking’ of reproduction and mutation rates in GAs; symbiotic networks tend to utilize the dynamics of the network itself for optimization. For details see [19].
On average, the CCGA configuration performs a bit better than Hebbian learning strategies, presumably because the latter tend to dampen out when the stress decreases. Hebbian learning tends to ‘coagulate’ the system near convergence. Due to mutation and crossover, CCGAs keep trying new solutions around convergence and, like the delays, manage to utilize the possibilities of neighborhoods with free tracks much better. On average, CCGAs manage to let 85 ± 3% of the trains run without conflicts. The configuration is hardly affected by varying mutation rates up to 5%. Above that, the system gives significantly poorer results.
13.5.8 Discussion

When problem domains and solution strategies are considered from a wider perspective, such as that provided by the pattern of a convergence inducing process, various classes of solution strategies can be compared in relation to the specific problems they aim to address. Besides the more intuitive distinctions between intelligence ‘by design’ and ‘true’ computational intelligence, it also allows a more comprehensive assessment of various solution strategies provided by interactions of multiple agents, such as those depicted in the actor/co-actor pattern. Most of all, it introduces the environment as an integral part of the optimization process. The specific problem domain provided by railway timetable problems has demonstrated the relationships between environmental conditions, its constraints, heuristics and possibilities, and the interplay between designed and computational intelligence.

In this contribution, symbiotic networks were introduced as an approach to optimization in complex, dynamic environments. So far the research has pursued modest goals and has concentrated on understanding how agents can learn to collaborate in complex environments in order to achieve an overall goal. Railway timetable problems offered a means to analyze this, but the results demonstrate that symbiotic networks have potential to be applied in real-world applications as robust problem solvers.
Acknowledgments

I would like to thank prof. dr. Harry Hunneman from the University for Humanistics in Utrecht in the Netherlands, and prof. dr. Paul Cilliers from the Centre for Studies in Complexity of Stellenbosch University in South Africa for their valuable support in developing a ‘helicopter view’ on the issues related to complexity thinking. I am also greatly indebted to dr. ir. Schil de Vos and dr. Jack Gerissen for their support during my research in symbiotic algorithms at the Open University in the Netherlands, and to Schil also for his feedback on the draft version of this chapter.
References

1. Alexander, C.: A Pattern Language: Towns, Buildings, Construction. Oxford University Press, USA (1977)
2. Blum, C.: Ant colony optimization: Introduction and recent trends. Physics of Life Reviews 2(4), 353–373 (2005), http://www.dx.doi.org/10.1016/j.plrev.2005.10.001
3. Caprara, A., Fischetti, M., Toth, P.: Modeling and solving the train timetabling problem. Operations Research 50(5), 851–861 (2002), http://www.jstor.org/stable/3088485
4. Cilliers, P.: Boundaries, hierarchies and networks in complex systems. International Journal of Innovation Management 5(2), 135–147 (2001)
5. Hassoun, M.H.: Fundamentals of Artificial Neural Networks. The MIT Press, Cambridge (1995)
6. Holland, J.H.: Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. The MIT Press, Cambridge (1992)
7. Jong, K.A.D.: An analysis of the behavior of a class of genetic adaptive systems. PhD thesis, University of Michigan (1975), http://portal.acm.org/citation.cfm?id=907087
8. Lee, Y., Chen, C.: Modeling and solving the train pathing problem. In: Twelfth World Multi-Conference on Systemics, Cybernetics and Informatics, IIIS, Orlando (2008)
9. Margulis, L.: Symbiotic Planet: A New Look At Evolution, 1st edn. Basic Books (1999)
10. Mitchell, M.: Complexity: A Guided Tour. Oxford University Press, USA (2009)
11. Morin, E.: On Complexity. Hampton Press (2008)
12. Odijk, M.A.: Railway timetable generation (1998)
13. Peeters, L.: Cyclic Railway Timetable Optimization. PhD thesis, ERIM PhD Series Research in Management, Erasmus Universiteit, Rotterdam (2003)
14. Picek, S., Gloub, M.: Dealings with problem hardness in genetic algorithms. WSEAS Transactions on Computers 8(5) (2009)
15. Pieters, C.P.: Symbiotic algorithms. Master’s thesis, Open University (2003)
16. Pieters, C.P.: Symbiotic networks. In: The 2003 Congress on Evolutionary Computation, CEC 2003, vol. 2, pp. 921–927 (2003)
17. Pieters, C.P.: Trains in symbiosis. In: IASTED 2004 Congress on Artificial Intelligence and Soft Computing 2004, pp. 481–487 (2005)
18. Pieters, C.P.: Effective Adaptive Plans, pp. 277–282. Springer, Heidelberg (2006), http://dx.doi.org/10.1007/1-4020-5263-4_44
19. Pieters, C.P.: Reflections on the geno- and the phenotype. In: CEC 2006 IEEE Congress on Evolutionary Computation, pp. 1632–1638 (2006), http://ieeexplore.ieee.org/Xplore/login.jsp?url=/iel5/11108/35623/01688504.pdf?tp=&isnumber=&arnumber=1688504
20. Pieters, C.P.: Complex systems and patterns. In: Twelfth World Multi-Conference on Systemics, Cybernetics and Informatics, vol. VII, pp. 268–275 (2008)
21. Pieters, C.P.: A pattern-oriented approach to health; using pac in a discourse of health. International Journal of Education and Information Technologies 3(2), 126–134 (2009), http://www.naun.org/journals/educationinformation/eit-90.pdf
22. Pieters, C.P.: Patterns, complexity and the lingua democratica. In: Proceedings of the 10th WSEAS International Conference on Automation and Information, ICAI 2009. Recent Advances in Automation & Information. WSEAS Press, Prague (2009)
23. Potter, M.A., de Jong, K.A.: A cooperative coevolutionary approach to function optimization. In: Davidor, Y., Männer, R., Schwefel, H.-P. (eds.) PPSN 1994. LNCS, vol. 866, pp. 249–257. Springer, Heidelberg (1994), http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.119.2706
24. Potter, M.A., de Jong, K.A.: Cooperative coevolution: An architecture for evolving coadapted subcomponents. Evolutionary Computation 8(1), 1–29 (2000), http://dx.doi.org/10.1162/106365600568086
25. Rosenblatt, F.: The perceptron: A probabilistic model for information. Psychological Review 65(6), 386–408 (1958)
26. Weinberg, G.M.: An Introduction to General Systems Thinking, 25th edn. Dorset House Publishing Company, Incorporated (2001)
27. Wolpert, D., Macready, W.: No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1(1), 67–82 (1997)
28. Wooldridge, M.: Reasoning about Rational Agents, 1st edn. The MIT Press, Cambridge (2000)
29. Zwaneveld, P.J., Kroon, L.G., van Hoesel, S.P.M.: Routing trains through a railway station based on a node packing model. European Journal of Operational Research 128(1), 14–33 (2001), http://www.sciencedirect.com/science? ob=ArticleURL& udi=B6VCT-41Y1XYH-2& user=10& rdoc=1& fmt=& orig=search& sort=d&view=c& acct=C000050221& version=1& urlVersion=0& userid=10&md5=0a0c12353a918d29b76ed2ad4c741a4f
Chapter 14
Project Scheduling: Time-Cost Tradeoff Problems

Sanjay Srivastava, Bhupendra Pathak, and Kamal Srivastava
Abstract. We design and implement new methods to solve multiobjective time-cost tradeoff (TCT) problems in project scheduling using an evolutionary algorithm and its hybrid variants with fuzzy logic and artificial neural networks. We deal with a wide variety of TCT problems encountered in real-world engineering projects. These include consideration of (i) nonlinear time-cost relationships of project activities, (ii) the presence of a constrained resource apart from precedence constraints, and (iii) project uncertainties. We also present a hybrid meta heuristic (HMH) combining a genetic algorithm with simulated annealing to solve the discrete version of the multiobjective TCT problem. HMH is employed to solve two test cases of TCT.
Sanjay Srivastava
Department of Mechanical Engineering, Dayalbagh Educational Institute, Dayalbagh, Agra, India
e-mail: [email protected]

Bhupendra Pathak · Kamal Srivastava
Department of Mathematics, Dayalbagh Educational Institute, Dayalbagh, Agra, India
e-mail: [email protected], [email protected]

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 325–357. springerlink.com © Springer-Verlag Berlin Heidelberg 2010

14.1 Introduction

The project manager handles conflicting states to optimize various parameters of the project scheduling process. Minimizing project completion time and minimizing project cost continue to be universally sought objectives. They conflict with each other, which is known as the time-cost tradeoff (TCT) in project scheduling. TCT belongs to the class of multiobjective optimization (MOO) problems, in which there is no single optimum solution; rather, there exist a number of solutions that are all optimal. These Pareto-optimal solutions are known as the optimal TCT profile in the project scheduling literature. The tradeoff between project time and cost gives project managers both challenges and opportunities to work out the best schedule to complete a project, and is of considerable economic importance.

Projects are usually represented using networks having nodes and directed arcs (Figure 14.1). These diagrams provide a powerful visualization of the relationships among the various project activities, which form the precedence constraints in TCT analysis. There are two kinds of networks:

i Activity-On-Arc (AOA): The arcs represent activities and the nodes represent events. An event represents a stage of accomplishment: either the start or the completion of an activity.
ii Activity-On-Node (AON): The nodes represent activities and the directed arcs represent precedence relations. This representation is easier to construct.

Fig. 14.1 Network model

The nonincreasing time-cost relationship of a project activity can be either continuous or discrete (cost refers to direct cost throughout this work). The continuous relationship can be linear or nonlinear. Accordingly, TCT problems may be categorized as: (i) the linear TCT problem, for a linear continuous time-cost relationship; (ii) the nonlinear TCT problem, for a nonlinear continuous time-cost relationship; and (iii) the discrete TCT problem, for a discrete time-cost relationship.

There are a large number of activities within real-life projects; therefore, it is almost impossible to enumerate all possible combinations to identify the best decisions for completing a project in the shortest time and at the minimum cost. Several researchers have suggested various methods [1], including mathematical techniques and heuristics, for obtaining the TCT profile, but there still remain many serious impediments that restrict a wider use of the TCT profile as a management tool. The first concerns the form of the time-cost relationship of project activities and the size of project networks. Most of the existing methodologies for determining the optimal TCT profile need to rely on unrealistic assumptions about the time-cost relationship of activities (linear, convex, continuous, etc.) in order to manage computational costs. This, however, renders the derived profile inaccurate.
Most of the reported methodologies that attempt to deal with realistic time-cost relationships are accompanied by an only-for-small-networks admonition. Moreover, the discrete version of the TCT problem is known to be NP-hard, and it has been proved that any exact solution algorithm would very likely exhibit an exponential worst-case complexity [2]. The complexity of the TCT problem increases further if resource constraints are also present, which is not uncommon in realistic projects. In addition, to solve the TCT problem in a generalized way, the scheduler must consider the presence of project uncertainties such as weather conditions, space congestion, labor performance etc., which
dynamically affect both the project duration and cost during its implementation. In view of the foregoing research issues, we developed comprehensive and intelligent methods to solve a variety of realistic TCT problems.

Some related research efforts follow. Richard et al. [3] developed nonlinear time-cost tradeoff models with quadratic cost relations. Vanhouke [4] applied a branch and bound method to solve the discrete TCT problem with time switch constraints. Vanhouke and Debels [5] used an analytical method as well as tabu search to solve the discrete TCT problem.

As mentioned, TCT is a MOO problem with two conflicting objectives. MOO has been reasonably well explored by researchers since 1990, and as a result diverse techniques have been developed over the years [6]. Most of these techniques sidestep the complexities involved in MOO and usually transform the multiobjective problem into a single-objective problem by employing some user-defined function. Since MOO involves determining Pareto-optimal solutions, it is hard to compare the results of various MOO solution techniques, as it is the decision-maker who decides the best solution out of all optimal solutions pertaining to a specific scenario [7].

Evolutionary algorithms (EAs) are meta heuristics that are able to search large regions of the solution space without being trapped in local optima [8]. Some well-known meta heuristics are genetic algorithm (GA), simulated annealing (SA), and tabu search. Genetic algorithms are search algorithms [9] that are based on the mechanics of natural selection and genetics and search through the decision space for optimal solutions [10]. In a GA, a string represents a set of decisions (chromosome combination), a potential solution to a problem. Each string is evaluated on its performance with respect to the fitness function (objective function). The ones with better performance (fitness value) are more likely to survive than the ones with worse performance.
Then the genetic information is exchanged between strings by crossover and perturbed by mutation. The result is a new generation with (usually) better survival abilities. This process is repeated until a certain termination condition is met. A genetic algorithm uses a population of solutions in each iteration of its search procedure, instead of a single solution. Since a population of solutions is processed in each iteration, the outcome of a GA is also a population of solutions. This unique feature makes the GA a true multiobjective optimization technique, and that is how GAs transcend classical search and optimization techniques [11]. The robustness of GAs is greatly enhanced when they are hybridized with other meta heuristics such as simulated annealing, tabu search etc. [8].

Different versions of multiobjective GAs have been successfully employed to solve many MOO problems in science and engineering [7]. GA-based MOO techniques have also been used to solve the TCT problem [12], [13]. More recently, Azaron et al. [14] proposed models using genetic algorithms and the Pareto front approach to solve the nonlinear TCT problem in PERT networks. However, these models did not consider the presence of a constrained resource and/or nonlinear time-cost relationships of project activities.
14.1.1 A Mathematical Description of TCT Problems

TCT problems dealt with in this work involve, at the activity level, expediting an activity anywhere between two time limits: (i) normal time, NT (maximum activity time), associated with normal cost (least activity direct cost), and (ii) crash time, CT (minimum activity time), associated with crash cost (maximum activity direct cost). We describe TCT problems as follows. Let the set $\phi$ represent the space of all feasible instances $\theta$ of the network, where an instance is $\theta = \{ \langle t_i, c_i \rangle \mid CT_i \le t_i \le NT_i,\ i = 1, 2, \ldots, n \}$. Here $t_i$ and $c_i$, which have a nonincreasing relationship, denote the time and cost of the $i$th activity respectively, $n$ is the number of activities in the network, and $CT_i$, $NT_i$ are the crash time and normal time of the $i$th activity respectively. Further, for the $i$th activity, $c_i = f_i(t_i)$, where $f_i : [CT_i, NT_i] \to \mathbb{R}$ ($\mathbb{R}$ is the set of real numbers) is a linear map for a linear TCT and a nonlinear map for a nonlinear TCT. $t_\theta$ and $c_\theta$ denote project duration and project cost respectively.

The discrete version of the TCT problem requires defining $\theta$ in the following way. Let $A_i = \{ \langle t_{ij}, c_{ij} \rangle \mid j = 1, 2, \ldots, p_i \}$, $i = 1, 2, \ldots, n$, denote the set of $p_i$ alternatives for activity $i$, where $t_{ij}$ and $c_{ij}$ are the time required and the cost involved for the $j$th alternative. An instance $\theta$ is defined as $\theta = \{ \langle x, y \rangle \mid \langle x, y \rangle \in A_i,\ i = 1, \ldots, n \}$; each $A_i$ contributes exactly one pair $\langle x, y \rangle$ to $\theta$. Clearly $|\theta| = n$.

Three possible problem formulations for the general TCT problem are:

i Find $\theta^{*}$ such that $c_{\theta^{*}} = \min_{\theta \in \phi} \{ c_\theta \mid t_\theta \le \text{ProjectDeadline} \}$.
ii Find $\theta^{*}$ such that $t_{\theta^{*}} = \min_{\theta \in \phi} \{ t_\theta \mid c_\theta \le \text{Budget} \}$.
iii When the objective is to identify the entire TCT profile for the project network, the problem is to find $B = \{ \theta \in \phi \mid \text{there does not exist another } \theta' \in \phi \text{ with } (t_{\theta'} \le t_\theta) \wedge (c_{\theta'} \le c_\theta) \}$, with strict inequality in at least one case. Here the set of instances $\theta$ represents the entire TCT profile over the set of feasible project durations for the network.
The decision-maker is free to choose a $\theta$ depending on specific project requirements. This formulation is the most general one, and it is the one addressed throughout our work.

The chapter is organized in the following sections. Section 1 introduces the time-cost tradeoff problem and establishes the background of our work. It includes the mathematical description of TCT problems. Section 2 presents a methodology using ANNs and a heuristic with a multiobjective GA to solve the multimode resource-constrained nonlinear time-cost tradeoff (RCNTCT) problem. An integrated Fuzzy-ANN-GA method is developed to carry out the sensitivity analysis of the nonlinear TCT profile in Section 3. The design and implementation of the hybrid meta heuristic (HMH), an evolutionary MOO method, to solve the discrete TCT problem is detailed in Section 4. Section 5 presents conclusions and lays out important dimensions of future work.
14.2 Resource-Constrained Nonlinear TCT

We present a method for solving the nonlinear TCT problem of project scheduling under a constrained resource. In real-world projects, some resources are usually constrained, and in such situations the project manager should include resource constraints in the TCT analysis in addition to precedence constraints [15]. Further, existing methods usually ignore the nonlinear time-cost relationship of project activities. We tackled the foregoing obstacles through an intelligent method that integrates ANN models and a heuristic with a GA-based MOO algorithm. The GA, in its usual role, is employed to search for the Pareto-optimal front. The method is termed artificial neural network and heuristic embedded genetic algorithm (ANNHEGA).

RCNTCT may also be viewed as a case of the resource-constrained project scheduling problem (RCPSP) [16]. A survey of solution procedures for RCPSP can be found in Ozdamar and Ulusoy [17], and more recently in Kolish and Hartman [18]. Many different objectives are possible in RCPSP depending on the needs of the decision-maker; the one that has been investigated the most is to find the minimum makespan of the project. In our case, however, the objective is to search for the entire RCNTCT profile. With this objective, the problem dealt with in this section may be placed in the class of multimode resource-constrained project scheduling (MM-RCPSP) [19]. There have been a few earlier research attempts at combining the TCT problem with RCPSP. Erenguc et al. [20] proposed a branch and bound method to find the resource requirements, the extent of crashing, and the start time for each activity so as to minimize total project cost. Leu and Yang [13] proposed a model unifying resource allocation, time-cost tradeoff and resource leveling using a GA with a multiple attribute decision-making approach. To the best of our knowledge, the literature on the class of TCT problem (RCNTCT) as defined here is nearly void.
We design a heuristic to account for a resource constraint in solving the nonlinear TCT problem; it checks the availability of the resource required by project activities. Some related definitions follow. A mode of an activity implies a different option in terms of cost and duration for performing the activity under consideration. In mode $m$, let activity $i$ require a processing time $t_i(m)$ and a constant amount of resource $r_{t_i}(m)$ during each period while it is in progress. Once a mode is selected, the activity continues in this mode until it finishes. A constant work content $W_i$ is defined for each activity as the product of $t_i(m)$ and $r_{t_i}(m)$. A constrained resource is assumed to be available in a constant amount $R_A$ throughout the project, which is not unusual in practice. Non-preemptive scheduling of activities is considered. The activities are numbered from $0$ to $n + 1$, where activities $0$ and $n + 1$ are dummy activities denoting the beginning and the end of a project in an AON network. $CT_i$ and $NT_i$, the two limits of $t_i$, determine the upper and lower limits of the constrained resource requirement of the $i$th activity respectively. In RCNTCT it is important to note that although there is a continuous and nonincreasing relationship between $t_i$ and $r_i$, as well as between $t_i$ and $c_i$, for every activity of the project network, there is no direct correspondence between $r_i$ and $c_i$.
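The constant work content definition implies that shortening an activity raises its per-period resource requirement. A minimal sketch of that relationship, assuming integer durations between crash and normal time (the function name and the example numbers are hypothetical):

```python
def modes_for_activity(work_content, crash_time, normal_time):
    """List (duration, resource_per_period) pairs with duration * resource == W_i.

    Assumption: one mode per integer duration in [crash_time, normal_time].
    """
    if not 0 < crash_time <= normal_time:
        raise ValueError("need 0 < crash_time <= normal_time")
    return [(t, work_content / t) for t in range(crash_time, normal_time + 1)]


# Hypothetical activity: W_i = 24 resource-periods, CT_i = 3, NT_i = 6.
for duration, resource in modes_for_activity(24, 3, 6):
    print(duration, resource)
```

The resource requirement is highest at the crash time and lowest at the normal time, which matches the statement that $CT_i$ and $NT_i$ determine the upper and lower limits of the requirement.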
For RCNTCT, the mathematical formulation of nonlinear TCT described in Subsection 1.1 will include the following constraints:

$$s_k(m) - s_i(m) \ge t_i(m) \quad \forall k \in S \qquad \text{(precedence constraints)}$$

$$\sum_{s_i(m) \in O_{s_i}(m)} r_{t_i}(m) \le R_A \qquad \text{(resource constraints)}$$

where $s_i(m)$ and $s_k(m)$ are the starting times of activities $i$ and $k$ respectively, in mode $m$; $S$ is the set of the succeeding activities of the $i$th activity; $r_{t_i}(m)$ is the resource requirement of the $i$th activity at a processing time of $t_i(m)$; and $O_{s_i}(m)$ is the set of activities being performed at $s_i(m)$.
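The two constraints can be checked for a given schedule as sketched below. This is only an illustration under stated assumptions: integer periods, a per-period resource check (rather than a check only at activity start times), and hypothetical names and data.

```python
def is_feasible(start, duration, rate, successors, capacity):
    """Check precedence and resource constraints for one schedule.

    start, duration, rate: dicts keyed by activity (integer periods).
    successors: dict mapping an activity to the activities that follow it.
    capacity: constant resource availability R_A.
    """
    # Precedence: s_k - s_i >= t_i for every successor k of activity i.
    for i, following in successors.items():
        for k in following:
            if start[k] - start[i] < duration[i]:
                return False
    # Resource: total requirement of the activities in progress never exceeds R_A.
    horizon = max((start[a] + duration[a] for a in start), default=0)
    for period in range(horizon):
        in_progress = [a for a in start if start[a] <= period < start[a] + duration[a]]
        if sum(rate[a] for a in in_progress) > capacity:
            return False
    return True


# Hypothetical three-activity network: A and B both precede C.
start = {"A": 0, "B": 0, "C": 3}
duration = {"A": 3, "B": 2, "C": 2}
rate = {"A": 2, "B": 3, "C": 4}
successors = {"A": ["C"], "B": ["C"]}
print(is_feasible(start, duration, rate, successors, capacity=5))
```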
14.2.1 Artificial Neural Networks

We utilize the function approximation capabilities of ANNs, using a back propagation neural network with the Levenberg-Marquardt (LM) learning rule, to model the time-cost relationship of each activity of a project network. An ANN model is capable of capturing a nonlinear time-cost relationship. One back propagation neural network is employed for each activity of the project network for rapid estimation of the corresponding activity cost. The LM learning rule uses an approximation of Newton's method to get better performance, and it is relatively fast. The LM update rule is $\Delta W = (J^T J + \mu I)^{-1} J^T e$, where $\Delta W$ is the weight update matrix, $J$ is the Jacobian matrix of derivatives of each error with respect to each weight, $\mu$ is a scalar and $e$ is an error vector. If the scalar $\mu$ is very large, the expression approximates the gradient descent method, while if it is small the expression becomes the Gauss-Newton method. The Gauss-Newton method is faster and more accurate near error minima. Hence, the aim is to shift towards Gauss-Newton as quickly as possible. Thus $\mu$ is decreased after each successful step and increased only when a step increases the error. The architecture of the ANN employed to solve the RCNTCT problem is shown in Figure 14.2.
Fig. 14.2 Artificial neural network architecture for nonlinear TCT problem
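The LM update and the adjustment of $\mu$ can be illustrated on a deliberately tiny problem. The sketch below is not the chapter's ANN: it assumes a two-parameter stand-in model cost(t) = w0 + w1 / t, takes the error as target minus model output with $J$ the derivative of the model output with respect to the weights (so the step is added to the weights), and changes $\mu$ by a factor of 10; all of these are illustrative choices.

```python
def lm_fit(times, costs, weights, mu=1.0, iterations=30):
    """Fit cost(t) = w0 + w1 / t with Levenberg-Marquardt style steps."""

    def errors(w):
        return [c - (w[0] + w[1] / t) for t, c in zip(times, costs)]

    def sse(e):
        return sum(x * x for x in e)

    e = errors(weights)
    for _ in range(iterations):
        jac = [(1.0, 1.0 / t) for t in times]  # rows: d(model)/d(w0), d(model)/d(w1)
        # Entries of (J^T J + mu * I) and of J^T e for the 2x2 case.
        a11 = sum(r[0] * r[0] for r in jac) + mu
        a12 = sum(r[0] * r[1] for r in jac)
        a22 = sum(r[1] * r[1] for r in jac) + mu
        g1 = sum(r[0] * x for r, x in zip(jac, e))
        g2 = sum(r[1] * x for r, x in zip(jac, e))
        det = a11 * a22 - a12 * a12
        step = ((a22 * g1 - a12 * g2) / det, (a11 * g2 - a12 * g1) / det)
        trial = [weights[0] + step[0], weights[1] + step[1]]
        trial_e = errors(trial)
        if sse(trial_e) < sse(e):
            weights, e, mu = trial, trial_e, mu / 10.0  # success: move towards Gauss-Newton
        else:
            mu *= 10.0  # failure: move towards gradient descent
    return weights


# Hypothetical activity data generated from cost = 10 + 60 / t.
times = [2.0, 3.0, 4.0, 5.0, 6.0]
costs = [10.0 + 60.0 / t for t in times]
fitted = lm_fit(times, costs, [0.0, 0.0])
print(fitted)
```

A trained network would replace the two-parameter model, with $J$ holding one column per network weight.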
14.2.2 Working of ANN and Heuristic Embedded Genetic Algorithm

A string represents a potential solution to a problem in ANNHEGA (see Subsection A). ANNHEGA starts by generating an initial population (see Subsection B). Each string is checked beforehand against the resource constraint throughout the project schedule in the heuristic module of the algorithm (see Subsection G). The TCT profile and convex hull of the existing (initial) generation (see Subsection C) are determined and plotted. Fitness values ($fit_u$) of the individuals of the initial population (see Subsection D) are then determined. Keeping the individuals on the TCT profile, a pool of chromosomes is generated according to the individuals' fitness values. The GA employed here incorporates elitism by keeping the individuals on the TCT profile for the next generation, as this helps in converging to the true TCT profile. Arias and Coello [21] have proved that GAs for MOO converge to the global optimal solution for some functions in the presence of an elite-preserving operator. As the next step of ANNHEGA, the crossover operator (see Subsection E) and the mutation operator (see Subsection F) are applied to produce the next generation. This process is repeated for a pre-specified number of generations or, alternatively, until no improvements are observed in the non-dominated solutions for a pre-specified number of iterations.
A. Structure of a Solution

A solution here is a string that represents an instance $\theta$ of the project schedule (Figure 14.3); each element $t_i$ of an $n$-tuple string $T$ can assume any natural-number value from $[CT_i, NT_i]$. As already mentioned, $c_i = f_i(t_i)$ for the $i$th activity, and $f_i : [CT_i, NT_i] \to \mathbb{R}$ is a nonlinear map for nonlinear TCT. The associated $c_\theta$ and $t_\theta$ of each individual string are determined by summing up the corresponding cost for each activity and by computing the maximum path time respectively.
Fig. 14.3 An instance of project schedule
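Evaluating a string therefore needs two computations: a sum of activity costs and a longest-path calculation over the network. A minimal sketch for an AON network follows; the cost functions are simple stand-ins for the trained ANNs, and the network, names and numbers are hypothetical.

```python
def evaluate(durations, cost_functions, predecessors):
    """Return (t_theta, c_theta) for one string on an acyclic AON network."""
    # Project cost: sum of the activity costs c_i = f_i(t_i).
    total_cost = sum(cost_functions[a](durations[a]) for a in durations)

    # Project duration: the largest earliest-finish time (maximum path time).
    finish = {}

    def earliest_finish(activity):
        if activity not in finish:
            begin = max((earliest_finish(p) for p in predecessors.get(activity, ())),
                        default=0)
            finish[activity] = begin + durations[activity]
        return finish[activity]

    project_duration = max(earliest_finish(a) for a in durations)
    return project_duration, total_cost


# Hypothetical network: A and B precede C; cost falls linearly as duration grows.
durations = {"A": 3, "B": 2, "C": 4}
predecessors = {"C": ["A", "B"]}
cost_functions = {name: (lambda t: 100 - 10 * t) for name in durations}
print(evaluate(durations, cost_functions, predecessors))
```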
B. Initial Population

The initial population consists of $n_p$ solutions, where $(n_p - 2)$ strings are selected randomly from the feasible search space, i.e., each $t_i$ of a string is chosen randomly from $[CT_i, NT_i]$. The remaining two strings are formed such that for the first string $t_i = NT_i\ \forall i = 1, \ldots, n$ and for the second string $t_i = CT_i\ \forall i = 1, \ldots, n$. This helps in identifying the extent of diversification of the population in each generation of the GA while searching for the optimal TCT profile. These solutions are referred to as parents. Each string, representing a unique network schedule, is tested beforehand against the resource constraint, and the early start time of non-critical activities is modified if necessary as per the procedure in Subsection G. The associated cost of each individual string is determined by summing up the cost for each activity, and the project duration of each string is determined by computing the maximum path time. The cost of each activity is estimated by the corresponding trained ANN. These $n_p$ strings form the initial population of ANNHEGA.
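The construction of the initial population can be sketched as follows. The resource-constraint test of Subsection G is deliberately left out, and the function name, argument names and example bounds are hypothetical.

```python
import random


def initial_population(crash, normal, n_p, rng):
    """Build n_p strings: all-normal, all-crash, and (n_p - 2) random strings.

    crash, normal: lists of CT_i and NT_i (integers) per activity.
    rng: a random.Random instance, passed in so runs can be reproduced.
    """
    if n_p < 2:
        raise ValueError("n_p must be at least 2")
    population = [list(normal), list(crash)]
    while len(population) < n_p:
        # Each t_i is drawn uniformly from the integers in [CT_i, NT_i].
        population.append([rng.randint(ct, nt) for ct, nt in zip(crash, normal)])
    return population


crash = [2, 3, 1, 4]
normal = [5, 6, 3, 8]
population = initial_population(crash, normal, n_p=6, rng=random.Random(1))
print(population)
```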
C. Time-Cost Tradeoff Profile and Convex Hull

Let $\theta_1$ and $\theta_2$ be two strings in a population $F$. $\theta_1$ dominates $\theta_2$ if $t_{\theta_1} \le t_{\theta_2}$ and $c_{\theta_1} \le c_{\theta_2}$, with either $t_{\theta_1} < t_{\theta_2}$ or $c_{\theta_1} < c_{\theta_2}$. Let $D$ be a binary relation defined on the set $F$ by $D = \{ (\theta_1, \theta_2) \mid \theta_1, \theta_2 \in F \wedge \theta_1 \text{ dominates } \theta_2 \}$; then the non-dominating set (NDS) is given by $NDS = \{ \theta_i \in F \mid (\theta_j, \theta_i) \notin D\ \forall j,\ j \ne i \}$, i.e., it represents the strings (solutions) of $F$ that are not dominated by any other string of $F$. The curve formed by joining these solutions is referred to as the time-cost tradeoff profile, and the solutions as the tradeoff points, in the project scheduling literature.

We define a convex hull merely as a boundary that encloses all members of a population from below (Figure 14.4). This boundary is in the form of straight line segments. The purpose of drawing a convex hull for each population is to evaluate the fitness of each individual in the population [12]. A convex hull may not include all the solution points of the non-dominated set.
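The dominance relation and the non-dominated set translate directly into code. In this minimal sketch each solution is reduced to its (project duration, project cost) pair; the function names and sample values are hypothetical.

```python
def dominates(first, second):
    """True if first = (t, c) dominates second: no worse in both, better in one."""
    no_worse = first[0] <= second[0] and first[1] <= second[1]
    strictly_better = first[0] < second[0] or first[1] < second[1]
    return no_worse and strictly_better


def non_dominated(solutions):
    """Return the solutions that are not dominated by any other solution."""
    return [s for s in solutions if not any(dominates(other, s) for other in solutions)]


solutions = [(10, 100), (8, 120), (9, 130), (12, 90), (10, 110)]
print(non_dominated(solutions))  # [(10, 100), (8, 120), (12, 90)]
```

Sorting the returned points by duration gives the tradeoff points in the order in which they would be joined to draw the TCT profile.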
D. Distance Measure vs. Fitness

After determining the TCT profile and the convex hull of the existing generation (the parent generation), we calculate the minimal distance ($d_u$) between the parent and each of the segments of the convex hull (Figure 14.4). Then we determine the fitness value and the probability of selection for each individual within the parent population as follows:

$$fit_u = d_{max} - d_u \tag{14.1}$$

$$prob_u = \frac{fit_u}{\sum_u fit_u} \tag{14.2}$$

where $fit_u$ = fitness value of parent $u$; $d_{max}$ = maximum $d_u$ in the generation; $d_u$ = minimal distance between the parent $u$ and each of the segments $v$ of the convex hull, $d_u = \min(d_{uv}, \text{for all } v)$; and $prob_u$ = probability of selection of parent $u$.
14
Project Scheduling: Time-Cost Tradeoff Problems
333
Fig. 14.4 Fitness evaluation of a member of the population
E. Crossover
We consider one-point crossover, wherein a parent P1 produces a child by crossing over with another parent P2 selected randomly. A random integer q with 1 ≤ q ≤ n is chosen, where q represents the crossover site. The first q positions of the child are taken from the first q positions of P1, while the remaining (n − q) positions are taken from the last (n − q) positions of P2.
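A minimal Python sketch of this operator is given below. The function name and the optional `q` and `rng` arguments are illustrative assumptions added so that the crossover site can be fixed for testing.

```python
import random
from typing import List, Optional


def one_point_crossover(p1: List[int], p2: List[int],
                        q: Optional[int] = None,
                        rng: Optional[random.Random] = None) -> List[int]:
    """Child takes its first q genes from p1 and the remaining n - q from p2."""
    if len(p1) != len(p2):
        raise ValueError("parents must have the same length")
    n = len(p1)
    if q is None:
        rng = rng or random.Random()
        q = rng.randint(1, n)  # inclusive bounds, so 1 <= q <= n
    return p1[:q] + p2[q:]


child = one_point_crossover([1, 2, 3, 4, 5], [6, 7, 8, 9, 10], q=2)
# child == [1, 2, 8, 9, 10]
```

With q = n the child is simply a copy of P1, which the definition 1 ≤ q ≤ n permits.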
F. Mutation
The mutation operator modifies a randomly selected activity of a string with probability $m_r$; that is, $(m_r \times |F|)$ strings undergo mutation. The operator works on a given string in the following manner. Let the string be represented by str(i), i = 1, . . . , n. A random number q, 1 ≤ q ≤ n, is generated for the location of the gene to be mutated. Another random natural number r, $r \in [CT_q, NT_q]$, is generated and str(q) is replaced by r.
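The mutation step can be sketched in Python as follows. The function names are hypothetical, gene positions are 0-based in the code (the text counts from 1), and rounding $m_r \times |F|$ to the nearest integer is an assumption, since the text does not say how a fractional count is handled.

```python
import random
from typing import List, Optional, Sequence


def mutate(string: Sequence[int], crash_time: Sequence[int],
           normal_time: Sequence[int],
           rng: Optional[random.Random] = None) -> List[int]:
    """Replace one randomly chosen gene q by a random integer in [CT_q, NT_q]."""
    rng = rng or random.Random()
    child = list(string)
    q = rng.randrange(len(child))  # 0-based gene location
    child[q] = rng.randint(crash_time[q], normal_time[q])  # inclusive bounds
    return child


def mutate_population(population: Sequence[Sequence[int]],
                      crash_time: Sequence[int], normal_time: Sequence[int],
                      m_r: float,
                      rng: Optional[random.Random] = None) -> List[List[int]]:
    """Apply mutate() to round(m_r * |F|) randomly chosen strings."""
    rng = rng or random.Random()
    k = round(m_r * len(population))
    chosen = set(rng.sample(range(len(population)), k))
    return [mutate(s, crash_time, normal_time, rng) if i in chosen else list(s)
            for i, s in enumerate(population)]
```

Because r is drawn from the whole interval, a mutated gene can occasionally keep its old value; the text does not exclude this for the continuous-duration case.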
G. Heuristic Procedure
The float value of an activity is defined as the time by which the activity can be delayed without affecting the project deadline. The float value of a critical activity is therefore zero. Each string, representing a unique network schedule, is tested beforehand against the resource constraint in this module, and the early start time of non-critical activities is modified if necessary by exploiting their float values. The heuristic checks the resource requirement ($R_R$) period by period for each string against the resource availability ($R_A$) in the given project. If $R_R > R_A$ in any time interval Δt of a network schedule, the start time of a non-critical activity falling in Δt is shifted period by period, exploiting its float value, so as to bring $R_R$ within $R_A$ throughout the network schedule. Further, if more than one non-critical activity falls
334
S. Srivastava, B. Pathak, and K. Srivastava
in Δt, each one is processed in turn (ties may be broken arbitrarily) in the manner described above until $R_R$ is brought within $R_A$. If, even after shifting the corresponding non-critical activity (or activities), $R_R$ still exceeds $R_A$ for a string, the string is rejected altogether and does not participate in the evolutionary process of ANNHEGA. When strings are rejected for violating the resource constraint, ANNHEGA keeps generating other strings and checking them against the resource constraint until the population size is met. That is how ANNHEGA maintains the population size in the evolutionary process. It is well known that, in general, the resource requirements of a project are never constant over all periods, even after applying the best resource leveling procedure. The proposed heuristic procedure makes use of this fact while fixing the upper limit of $R_A$, as detailed below. The peak resource requirement $R_{\max}$ based on the activities' normal time is computed for two extreme cases: (1) all the non-critical activities are scheduled to start at their earliest start time (ES), and (2) all the non-critical activities are scheduled to start at their latest start time (LS). The average of the peak resource requirements of these two cases is taken as the peak resource requirement of the project network:

$$R_{\max} = \frac{R_{\max}\ \text{of ES} + R_{\max}\ \text{of LS}}{2}$$
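The averaging rule can be sketched in Python as below. The schedule representation (start period, duration, resource units per period) and the function names are assumptions made for illustration; in the chapter the per-period resource need of an activity would follow from its work content and duration.

```python
from typing import Dict, List, Tuple

# (start period, duration in periods, resource units needed per period)
Activity = Tuple[int, int, int]


def peak_resource(schedule: List[Activity]) -> int:
    """Largest total resource requirement over all periods of a schedule."""
    usage: Dict[int, int] = {}
    for start, duration, units in schedule:
        for t in range(start, start + duration):
            usage[t] = usage.get(t, 0) + units
    return max(usage.values(), default=0)


def initial_resource_limit(es_schedule: List[Activity],
                           ls_schedule: List[Activity]) -> float:
    """R_max as the average of the ES-schedule and LS-schedule peaks."""
    return (peak_resource(es_schedule) + peak_resource(ls_schedule)) / 2


# Toy example: the second activity is non-critical and can slide from 0 to 3.
es = [(0, 3, 2), (0, 2, 3), (3, 2, 1)]
ls = [(0, 3, 2), (3, 2, 3), (3, 2, 1)]
r_max = initial_resource_limit(es, ls)  # peaks 5 and 4 give 4.5
```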
The initial value of $R_A$ used to run ANNHEGA is taken equal to $R_{\max}$; this may be termed the upper limit of the constrained resource of the project. In generating the TCT profile the project duration is crashed, and more resources are required for each subsequent crashing. To determine the lowest possible limit of $R_A$ (below which project expediting is not possible), $R_A$ is successively reduced and ANNHEGA is run each time, until a point is reached at which the project time starts increasing instead of decreasing in order to satisfy the resource constraint. More formally, the ANNHEGA scheme can be summarized in pseudo code as follows. Let CHP be the set of children, I be the current generation number, and GEN be the maximum number of iterations.
14.2.3 ANNHEGA for a Case Study
ANNHEGA is applied to a case study to illustrate its working (Figure 14.5). The basic structure of the case study is similar to that of Feng et al. [12]; however, a continuous and nonlinear time-cost relationship with additional time-cost information is incorporated for each activity, and a work content (the product of the activity duration and the amount of resource needed) is assigned to each one. The project has 18 activities, numbered i = 1, . . . , 18. The per-period availability of the resource ($R_A$) is constant and computed as described earlier.
Fig. 14.5 Network of the test problem
14.2.3.1 Computational Results
Table 14.1 shows the network data of the test problem: time, cost and work content (WC) of each activity, where ID denotes the activity number. The ANNs are trained with the time-cost data of Table 14.1. Seven time-cost options are available for each activity; the training data for each ANN consist of the first and last time-cost options together with three more options selected at random, and the remaining two options of each activity are used as testing data. A three-layer network, as shown in Figure 14.3, with one input (activity time) and one output (activity cost) is used. The training effort is small with the LM learning rule (it takes only 5 to 7 iterations). One network is trained for each activity, so a total of 18 ANNs are employed here. An error goal of 1.0e-03 is specified. The modelling power of the ANNs is validated using the testing data set: the activity cost is evaluated using the ANNs and compared with the known cost data. The close agreement between the known costs and the costs obtained using the neural networks (ANN cost) demonstrates the accuracy of the ANN module of ANNHEGA (Table 14.2). The initial population size ($n_p$), mutation rate ($m_r$), and crossover rate ($c_r$) are chosen as 200, 0.02, and 1.0 respectively, based on the criterion of faster convergence to the final Pareto front. We use these parameters for the other test problems as well. The initial value of resource availability ($R_A$) is computed as described in Subsection 2.2G. Below the lowest limit of $R_A$ the project duration would start increasing instead. In addition, the search is set to terminate when the tradeoff points do not change in 5 consecutive iterations. An initial generation of 200 strings is randomly selected and shown in Figure 14.6. It can be seen that the initial generation is distributed over the solution space and does not gather in one region. The fitness of an individual in a population depends on its proximity to the convex hull.
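The way the seven options of an activity are divided into training and testing data can be sketched as follows. This is only an illustration of the split described above, with hypothetical names; it does not reproduce the ANN training itself, and the sample data are the options of activity 1 from Table 14.1.

```python
import random
from typing import List, Optional, Tuple

Option = Tuple[float, float]  # (activity time, activity cost)


def split_options(options: List[Option], n_random: int = 3,
                  rng: Optional[random.Random] = None
                  ) -> Tuple[List[Option], List[Option]]:
    """Train on the first, the last and n_random random interior options;
    the remaining options form the test set."""
    rng = rng or random.Random()
    interior = range(1, len(options) - 1)
    train_idx = {0, len(options) - 1} | set(rng.sample(interior, n_random))
    train = [options[i] for i in sorted(train_idx)]
    test = [o for i, o in enumerate(options) if i not in train_idx]
    return train, test


activity_1 = [(14, 2400), (15, 2150), (16, 1900), (18, 1750),
              (21, 1500), (23, 1340), (24, 1200)]
train, test = split_options(activity_1, rng=random.Random(0))
```

Keeping the two extreme options in the training set means the network is only ever asked to interpolate between the crash and normal ends of the time range.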
The improved population moves towards the convex hull with each passing generation. Accordingly, the convex hull also moves towards the coordinate axes. This illustrates the tendency of the algorithm to converge towards the Pareto-optimal front. Figure 14.7 depicts intermediate improvements in the tradeoff points and convex hull as these move towards the axes. Figure 14.8 shows the convex hull and tradeoff points of the final population; a clear improvement is visible. Since the tradeoff points do not improve further, these points are taken to be the best solution points found by ANNHEGA. Further, these are compared with the analytical solutions to judge the accuracy of ANNHEGA: averaged over 50 runs, it is able to find at least 90% of the Pareto-optimal solutions. An ANNHEGA-based system can help to monitor and control the project in a cost-effective way in real time, and one can choose the best alternative on the RCNTCT profile to execute real-world projects.
14.3 Sensitivity Analysis of TCT Profiles
In real-life projects the duration and cost of each activity can change dynamically as a result of many uncertain variables, such as management experience (ME), labor skill (LS), and weather conditions (WC). Project managers must take these uncertainties into account and provide an optimal balance of time and cost based on their own experience and knowledge. Such uncertainty can be represented well by fuzzy set concepts. Time analysis of a project under uncertainty has been studied using a fuzzy set theoretic approach [22]. Daisy and Thomas [23] applied fuzzy set theory to model managers' behavior in predicting project network parameters within an activity. Leu et al. [24] used fuzzy set theory to model the variations in the duration of activities due to changing environmental factors. Other types of uncertainty, such as budget uncertainty, have also been incorporated into the project time-cost tradeoff [25]. Existing methods for sensitivity analysis of TCT profiles with regard to project uncertainties ignore the cost parameter of project activities [26] and make no provision for a nonlinear time-cost relationship of project activities. To overcome these problems we devised and implemented a novel method: it examines the effects of project uncertainties on both the duration and the cost of the activities, and incorporates the nonlinear time-cost relationship of project activities. The method integrates three key fields of computational intelligence (fuzzy logic, ANNs and multiobjective genetic algorithms) and is referred to as Integrated Fuzzy-ANN-GA (IFAG). A rule-based fuzzy logic framework is developed which yields the changes in the duration and the cost of each activity for the input uncertainties; ANNs are then trained with these time-cost data (one ANN per activity) to model the time-cost relationships.
It has already been shown in Section 2 that the integration of ANNs with the GA facilitates the evaluation of the GA's fitness function. The GA is employed to search for the Pareto-optimal front for a given set of time-cost pairs of each project activity. That is how the integration of the fuzzy logic framework and ANNs with the GA is implemented to capture the responsiveness of the nonlinear TCT profile to project uncertainties. A test case of the TCT problem is solved using IFAG. Fuzzy sets and fuzzy inference systems are briefly described below.
A. Fuzzy Sets
Fuzzy set theory is an efficient tool for modelling uncertainties associated with vagueness, imprecision, and/or lack of information regarding the variables of the decision space. The underlying power of fuzzy set theory is that it uses linguistic variables,
Table 14.1 Network Data of Test Problem (seven time-cost options per activity)

ID | Time | Cost
1 | 14, 15, 16, 18, 21, 23, 24 | 2400, 2150, 1900, 1750, 1500, 1340, 1200
2 | 15, 17, 18, 20, 21, 23, 25 | 3000, 2630, 2400, 1800, 1720, 1500, 1000
3 | 15, 17, 19, 22, 25, 30, 33 | 4500, 4415, 4220, 4000, 3730, 3375, 3200
4 | 12, 13, 15, 16, 18, 19, 20 | 45000, 44300, 38450, 35000, 33700, 32400, 30000
5 | 22, 24, 25, 26, 27, 28, 30 | 20000, 17500, 16400, 15900, 15700, 15000, 10000
6 | 14, 16, 17, 18, 20, 22, 24 | 40000, 39200, 34500, 32000, 27700, 20300, 18000
7 | 9, 11, 13, 14, 15, 17, 18 | 30000, 27200, 26100, 25600, 24000, 22300, 22000
8 | 14, 15, 16, 17, 21, 23, 24 | 220, 215, 200, 190, 167, 150, 120
9 | 15, 18, 20, 23, 24, 25, 25 | 300, 240, 180, 150, 130, 110, 100
10 | 15, 22, 23, 27, 28, 30, 33 | 450, 400, 390, 345, 320, 325, 320
11 | 12, 13, 14, 16, 17, 19, 20 | 450, 420, 370, 350, 330, 305, 300
12 | 22, 24, 25, 27, 28, 29, 30 | 2000, 1750, 1690, 1525, 1500, 1200, 1000
13 | 14, 15, 16, 18, 21, 23, 24 | 4000, 3795, 3500, 3200, 2750, 2155, 1800
14 | 9, 10, 12, 14, 15, 17, 18 | 3000, 2930, 2825, 2605, 2400, 2295, 2200
15 | 10, 13, 14, 16, 17, 18, 20 | 6525, 5990, 4500, 3500, 3355, 2600, 1930
16 | 20, 22, 24, 26, 28, 29, 30 | 3000, 2000, 1750, 1685, 1500, 1385, 1000
17 | 14, 16, 17, 18, 21, 23, 24 | 4000, 3700, 3455, 3200, 2780, 2335, 1800
18 | 9, 10, 12, 14, 15, 16, 18 | 3000, 2900, 2790, 2565, 2400, 2315, 2200

WC values as listed in the source: 224, 900, 860, 625, 300, 520, 420, 800, 120, 225, 100, 380, 480, 90, 80, 650, 100, 108
Table 14.2 Comparison of ANN Cost with Actual Cost

ID | Time | Cost | ANN Cost
1 | 15 | 2150 | 2150.07
1 | 21 | 1500 | 1500.05
2 | 17 | 2630 | 2629.97
2 | 23 | 1500 | 1499.92
3 | 22 | 4000 | 4000.05
3 | 25 | 3730 | 3730.01
4 | 15 | 38450 | 38450.2
4 | 18 | 33700 | 33700.5
5 | 27 | 15700 | 15700.0
5 | 28 | 15000 | 15000.4
6 | 17 | 34500 | 34500.6
6 | 22 | 20300 | 20300.2
7 | 14 | 25600 | 25600.5
7 | 15 | 24000 | 24000.1
8 | 17 | 190 | 190.001
8 | 23 | 150 | 150.003
9 | 18 | 240 | 239.997
9 | 23 | 150 | 149.988
10 | 23 | 390 | 390.000
10 | 28 | 330 | 329.928
11 | 14 | 370 | 370.000
11 | 17 | 330 | 330.000
12 | 24 | 1750 | 1750.03
12 | 27 | 1525 | 1525.09
13 | 16 | 3500 | 3500.00
13 | 21 | 2750 | 2750.30
14 | 12 | 2825 | 2825.30
14 | 14 | 2605 | 2605.11
15 | 13 | 5990 | 5990.93
15 | 16 | 3500 | 3500.30
16 | 22 | 2000 | 2000.70
16 | 29 | 1385 | 1385.21
17 | 21 | 2780 | 2780.41
17 | 23 | 2335 | 2335.24
18 | 14 | 2565 | 2565.00
18 | 16 | 2315 | 2315.10
rather than quantitative variables, to represent imprecise concepts. The values of linguistic variables are words or sentences in a given language. For example, management experience can be considered a linguistic variable, since its values, such as long experience or short experience, are not clearly defined but are nonetheless meaningful classifications.
B. Fuzzy Inference System
Fuzzy inference is the process of formulating the mapping from a given input to an output using fuzzy logic. The mapping then provides a basis from which decisions can be made. The process of fuzzy inference involves membership functions, fuzzy logic operators, and if-then rules. Fuzzy inference systems (FIS) have been successfully applied in fields such as automatic control, data classification, decision analysis, expert systems and computer vision. We have employed a Mamdani-type fuzzy inference system using MATLAB's fuzzy logic toolbox. Mamdani's fuzzy inference method [27] is the most commonly used fuzzy methodology. Mamdani's effort was based on Lotfi Zadeh's work on fuzzy algorithms for complex systems and decision processes [28].
14.3.1 Working of IFAG
IFAG starts with inputting the project uncertainties into the Fuzzy Logic Framework (see Subsection 3.1.1), which in turn generates a set of time-cost pairs for each activity of a given project. This results in the same project network but with different time-cost data for the activities. These data are then fed to the ANNs for training as described earlier. The working of the GA is already described in Section 2; the only difference is that a string generated by the GA need not go through the heuristic module, as resources are assumed to be sufficiently available in this part of the work. The working of IFAG is further detailed with a case study presented in Subsection 3.2.

14.3.1.1 Fuzzy Logic Framework for Project Uncertainties
A. FIS for Activity Duration and Cost
The FIS that captures the effect of the linguistic variables on activity duration and cost is designed with three input variables (ME, LS and WC) and two output variables (activity duration and activity cost). Triangular membership functions are used to model all linguistic variables, inputs as well as outputs. The FIS editor interfaces inputs and outputs.
B. Membership Function (MF) Curves of Input Linguistic Variables
The linguistic variables ME, LS, and WC are each modeled using five membership functions, such as those shown in Figure 14.9 for weather condition. The linguistic variables are defined in the range 0–1.
C. Membership Function (MF) Curves for Output Variables
The output variables, activity duration and activity cost, are modeled by seven membership functions (Figure 14.10) over the universe of discourse (UOD). The UOD for activity duration is assumed to range from (D − 0.2 × D) to (D + 0.2 × D), where D represents an initial estimate of the activity duration by the project experts. Similarly, the UOD for activity cost is assumed to range from (C − 0.2 × C) to (C + 0.2 × C), where C represents an initial estimate of the activity cost by the project experts.
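The membership functions described above can be illustrated with a short Python sketch. It is not the MATLAB toolbox configuration used by the authors: the function names are hypothetical, and placing the triangles evenly across the range, with neighbouring triangles crossing at membership 0.5, is an assumption, since the exact breakpoints of Figures 14.9 and 14.10 are not given in the text.

```python
from typing import List, Tuple

Triangle = Tuple[float, float, float]  # (left foot, peak, right foot)


def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership with feet at a and c and peak at b."""
    if x == b:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if b < x < c:
        return (c - x) / (c - b)
    return 0.0


def evenly_spaced_triangles(lo: float, hi: float, n: int) -> List[Triangle]:
    """n triangles whose peaks are evenly spaced on [lo, hi] (assumed layout)."""
    step = (hi - lo) / (n - 1)
    return [(lo + (k - 1) * step, lo + k * step, lo + (k + 1) * step)
            for k in range(n)]


def output_universe(estimate: float, spread: float = 0.2) -> Tuple[float, float]:
    """UOD from (D - 0.2 D) to (D + 0.2 D) around an expert estimate D."""
    return estimate - spread * estimate, estimate + spread * estimate


# Five input sets for weather condition on [0, 1] (labels from Fig. 14.9).
labels = ["VeryBad", "Bad", "Medium", "Good", "VeryGood"]
weather_sets = dict(zip(labels, evenly_spaced_triangles(0.0, 1.0, 5)))
degrees = {name: triangular(0.6, *abc) for name, abc in weather_sets.items()}

# Seven output sets for an activity whose estimated duration is 20 days.
duration_sets = evenly_spaced_triangles(*output_universe(20.0), 7)
```

Under this layout a weather score of 0.6 belongs partly to Medium and partly to Good and to no other set, which is the kind of graded input a Mamdani rule base then combines.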
14.3.2 IFAG for a Case Study
The project network shown in Fig. 14.5 is taken as a test problem, with the time-cost options listed in Table 14.1. However, the work content of the activities in Table 14.1 is not considered here, as resources are assumed to be sufficiently available.
Fig. 14.6 Initial population (project cost, ×10^5, against project time)

Fig. 14.7 Intermediate improvements in the tradeoff points and convex hull (population, tradeoff points and convex hull; project cost, ×10^5, against project time)

Fig. 14.8 Tradeoff points and convex hull of the final population (population, tradeoff points and convex hull; project cost, ×10^5, against project time)
Fig. 14.9 Membership curves for weather condition (fuzzy sets: VeryBad, Bad, Medium, Good, & VeryGood)
Fig. 14.10 Membership curves for activity duration (fuzzy sets: VerySmall, Small, SmallMedium, Medium, LongMedium, Long, VeryLong)
14.3.2.1 Computational Results
IFAG starts by generating a TCT profile from the time-cost data of Table 14.1 with the project uncertainties set to (ME = 0.5, LS = 0.5, and WC = 0.5), using its ANN-GA module. This is equivalent to running IFAG without considering project uncertainties. The corresponding result (a TCT profile), shown in Fig. 14.11 and Table 14.3, is termed the normal TCT profile. Thereafter, we input project uncertainties at the user interface by changing the values of ME, LS, and WC, which causes the normal TCT profile to move up or down, as it is sensitive to the values of ME, LS and WC. If the values of these linguistic variables are greater than (0.5, 0.5, 0.5), i.e. better than normal conditions, the profile moves towards the coordinate axes, i.e. project duration and cost are reduced, and vice versa. The following tables present the important results, wherein project cost is in $(1.0e+005), and project time is in days. Table 14.4 depicts the pessimistic case, wherein the values of ME, LS, and WC worsen (i.e. ME = 0.3, LS = 0.3, WC = 0.3) and fall below the normal ones. As expected, in this case the whole TCT profile shifts upwards, i.e., it moves away from the coordinate axes (Figure 14.11). Along similar lines, Table 14.5 illustrates the optimistic case, in which the TCT profile moves towards the coordinate axes (Figure 14.11).
Table 14.3 Project Cost and Time Under Normal Condition

Project Time | 169 | 162 | 159 | 152 | 133 | 121 | 116 | 108 | 104
Project Cost | 98040 | 98370 | 98520 | 99630 | 101670 | 104360 | 107120 | 122540 | 137700

Table 14.4 ME = 0.3, LS = 0.3, WC = 0.3

Project Time | 185 | 152 | 146 | 145 | 138 | 117 | 104 | 103 | 102
Project Cost | 110650 | 111450 | 112350 | 112630 | 114060 | 118610 | 154250 | 158750 | 166440

Table 14.5 ME = 0.9, LS = 0.9, WC = 0.9

Project Time | 151 | 125 | 121 | 117 | 117 | 105 | 102
Project Cost | 73550 | 76680 | 77550 | 78440 | 79280 | 87170 | 97910
Fig. 14.11 TCT Profiles under different values of linguistic variables (project cost, ×10^5, against project time, for (ME, LS, WC) = (0.5, 0.5, 0.5), (0.9, 0.9, 0.9), (0.8, 0.6, 0.9), (0.3, 0.3, 0.3) and (0.2, 0.3, 0.4))
The TCT profile responds consistently for scenarios such as (ME = 0.2, LS = 0.3, and WC = 0.4) and (ME = 0.8, LS = 0.6, WC = 0.9) (Figure 14.11). Further, the values of the linguistic variables are also mixed, i.e. some are set above the normal conditions and some below. The TCT profile under normal conditions (i.e. ME, LS, and WC equal to 0.5, 0.5, and 0.5 respectively) is shown in Figure 14.12 for a run different from the earlier one. For (ME, LS, WC) = (0.8, 0.4, 0.7), the project duration and cost are obtained as shown in Table 14.6 and Figure 14.12. Table 14.7 depicts the case (ME = 0.3, LS = 0.8, and WC = 0.2). The results are shown in Figure 14.12 for
Table 14.6 ME = 0.8, LS = 0.4, WC = 0.7

Project Time | 154 | 145 | 137 | 129 | 119 | 110 | 106 | 105 | 103
Project Cost | 99720 | 100410 | 101970 | 103880 | 108190 | 126740 | 143980 | 148630 | 158120

Table 14.7 ME = 0.3, LS = 0.8, WC = 0.2

Project Time | 187 | 178 | 176 | 159 | 157 | 122 | 119 | 117 | 115
Project Cost | 108280 | 109100 | 109280 | 112440 | 113280 | 117640 | 131100 | 141300 | 148830
Fig. 14.12 TCT profiles under different values of linguistic variables (project cost, ×10^5, against project time, for (ME, LS, WC) = (0.5, 0.5, 0.5), (0.8, 0.4, 0.7) and (0.3, 0.8, 0.2))
comparison purposes. IFAG provides project managers with a comprehensive tool for analyzing their time-cost optimization decisions in a flexible and realistic manner.
14.4 Hybrid Meta Heuristic
Lastly, we present a hybrid meta heuristic (HMH) technique for solving the multiobjective discrete TCT problem, which is known to be NP-hard. HMH hybridizes a multiobjective GA with simulated annealing, and is suited to problems where the generation of the complete Pareto front, a TCT profile in this case, is essential for a decision-maker. We validated HMH on two standard test problems of multiobjective optimization (MOO). We also present two case studies of discrete TCT which are solved using HMH. As mentioned, the robustness of GAs is greatly enhanced when they are hybridized with other meta heuristics such as simulated annealing or tabu search. Yip and Pao [29] employed a hybrid of simulated annealing and simulated evolution to solve the traveling salesman problem as a single-objective optimization; the design and development of HMH for multiobjective optimization of the discrete TCT problem is motivated by this work.
It is important to mention here that the HMH presented in this work is unconventional in its working (fitness function evaluation etc.) in comparison with the existing multiobjective evolutionary algorithms (MOEAs) covered comprehensively in [11]. HMH suits our problem of searching for the optimal TCT profile well. The Pareto front solutions obtained from HMH are diverse enough from the project expediting viewpoint; in fact, they include almost all the relevant solution points on the Pareto front through which project compression needs to be carried out by the decision-maker. We have not incorporated a mechanism to preserve diversity other than generating two non-random extreme solutions. Further, HMH is meant to work for two objectives only and for problems with a convex Pareto-optimal front. Apart from diversity preservation, convergence to the true Pareto front is another important issue in MOEAs [11]. In most real-world projects the true Pareto front (optimal TCT profile) is unknown, so metrics which measure the extent of convergence to a known set of Pareto-optimal solutions cannot be used for our problem. However, HMH is validated on two standard test problems with a convex Pareto front [11]. The proposed HMH incorporates the concept of Pareto optimality to evolve a family of non-dominated solutions distributed along the TCT profile, hence eliminating the need to aggregate multiple objectives into a compromise function. HMH embeds simulated annealing in the GA to decide the number of children to be generated from the parents of the next generation. The algorithm is general enough to incorporate the various aspects of time-cost relationships as per the given specifications of real-life networks, such as linearity of time-cost relationships, or a continuous mapping between time and cost of activities. The mathematical description of the discrete TCT problem is presented in Subsection 1.1.
The preliminaries and definitions needed to understand HMH are explained concisely below. Some of these definitions are similar to those given in Subsection 2.2; the changes are due to the discrete version of TCT. Identical definitions are not repeated here.
A. Structure of a Solution
A solution here is a string which represents an instance θ of the project schedule (Fig. 14.2); each element $t_i$ of an n-tuple string T can assume any value from the set $\{t_{ij}\}$, $j = 1, \dots, p_i$. The associated cost $c_\theta$ and duration $t_\theta$ of each individual string are determined in the usual manner.
B. Initial Population
The initial population, consisting of $n_p$ solutions, is generated by randomly selecting $(n_p - 2)$ individual strings from the feasible search space, i.e., each $t_i$ of a string is chosen randomly from the set $\{t_{ij}\}$, $j = 1, \dots, p_i$. The remaining two strings, $T_{\max}$ and $T_{\min}$, are added to the population non-randomly such that every activity i takes the duration $t^{\max}_i = \max_{\forall j}\{t_{ij}\}$ and $t^{\min}_i = \min_{\forall j}\{t_{ij}\}$ respectively. See Subsections 2.2C and 2.2E for definitions of TCT profile/convex hull and crossover respectively.
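A minimal Python sketch of this initialisation is shown below, assuming the duration options of each activity are available as a list of lists; the function name and the toy options are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def initial_population(options: Sequence[Sequence[int]], n_p: int,
                       rng: Optional[random.Random] = None) -> List[List[int]]:
    """(n_p - 2) random strings plus the all-longest and all-shortest strings."""
    if n_p < 2:
        raise ValueError("n_p must be at least 2 to hold T_max and T_min")
    rng = rng or random.Random()
    population = [[rng.choice(list(opts)) for opts in options]
                  for _ in range(n_p - 2)]
    population.append([max(opts) for opts in options])  # T_max
    population.append([min(opts) for opts in options])  # T_min
    return population


# Toy network with three activities and their discrete duration options.
toy_options = [[14, 15, 16], [20, 22], [9]]
pop = initial_population(toy_options, n_p=6, rng=random.Random(1))
```

The two non-random strings anchor the ends of the TCT profile from the first generation, which is the only diversity mechanism HMH relies on.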
C. Distance Measurement
The distance $d_w$ of an individual solution point in a population is determined by calculating the minimal Euclidean distance ($d_{wv}$) between the w-th solution point and each segment v of the convex hull, i.e., $d_w = \min_{\forall v}(d_{wv})$ (Figure 14.4). Solutions with a lower distance are considered fitter than those with a larger distance.
D. Mutation
The mutation operator modifies a randomly selected activity of a string with probability $m_r$; that is, $(m_r \times |F|)$ strings undergo mutation. The operator works on a given string in the following manner. For a string T with values $t_i$, i = 1, . . . , n, a random number q, 1 ≤ q ≤ n, is generated for the location of the gene to be mutated. Another random value $t'_q$, with $t^{\min}_q \le t'_q \le t^{\max}_q$ and $t'_q \ne t_q$, is generated and $t'_q$ replaces $t_q$.
E. Simulated Annealing
Simulated annealing (SA) is a popular search technique which imitates the cooling process of a material in a heat bath. SA as a stochastic optimization method was introduced in the context of minimization problems by Kirkpatrick et al. [30]. It is a global optimization method that distinguishes between different local optima. Starting from an initial configuration, the SA algorithm generates at random new configurations from the neighborhood of the current configuration. If the change results in a better configuration, the transition to the new configuration is accepted; otherwise it is accepted with a Boltzmann probability factor. The probability factor is regulated by a parameter called temperature (temp) and provides a mechanism for accepting a bad move. In the initial iterations (temp = temp_0) this probability is high (almost one), and as the temperature is subsequently lowered using a cooling ratio (cool_r) it comes down to almost zero in the final stage of iterations (temp = temp_f).
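The acceptance rule and the geometric cooling described above can be sketched in Python as follows. This is the generic Metropolis-style rule for a minimisation problem, given as an illustration under that assumption rather than as the exact routine used in HMH; the function names are hypothetical.

```python
import math
import random
from typing import List, Optional


def accept_move(delta: float, temp: float,
                rng: Optional[random.Random] = None) -> bool:
    """Accept an improving move always, a worsening one with
    Boltzmann probability exp(-delta / temp); delta = new cost - old cost."""
    if delta <= 0:
        return True
    rng = rng or random.Random()
    return rng.random() < math.exp(-delta / temp)


def cooling_schedule(temp0: float, cool_r: float, temp_f: float) -> List[float]:
    """Temperatures temp0, temp0*cool_r, ... that stay above temp_f."""
    if not 0.0 < cool_r < 1.0 or temp_f <= 0.0:
        raise ValueError("need 0 < cool_r < 1 and temp_f > 0")
    temps = []
    temp = temp0
    while temp > temp_f:
        temps.append(temp)
        temp *= cool_r
    return temps


schedule = cooling_schedule(100.0, 0.5, 10.0)  # [100.0, 50.0, 25.0, 12.5]
```

At a very high temperature almost any worsening move passes the test, while near temp_f the exponential is close to zero and the search behaves greedily.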
14.4.1 Working of Hybrid Meta Heuristic
HMH begins by generating an initial population of $n_p$ solutions, which are referred to as parents. Initially each parent u is allowed to produce child_num(u) = $n_c / n_p$ children, where $n_c$ is the total number of children produced in a generation; this number is chosen so that the search space can be extensively scanned for the selection process to follow. A parent produces a child by crossing over with a randomly selected string from the remaining population of parents. Each parent repeats this process until it has produced child_num(u) children. A parent together with its children constitutes a family; all the solutions in a family are referred to as members of the family. Thus, throughout the procedure, $n_p$ families exist in a population. In the initial generation, each family has a single parent. The procedure followed in each iteration is explained below.
The Pareto front (or TCT profile) of the generation, i.e., of all the parents together with their children, is determined; it represents the non-dominated set of the generation. Thereafter, a convex hull that encloses all members of the population from below is drawn. The basic idea is that the smaller the distance of an individual from the convex hull, the better its fitness with respect to either or both of the objectives (Figure 14.4). For each family u, its members on the Pareto front are counted (par_num(u), u = 1, . . . , $n_p$). These par_num(u) members become the parents of the u-th family. However, it is important to note that if no member of a family u appears on the Pareto front, the family is not rejected altogether; in the hope of future improvement, the member of that family which is 'nearest' to the Pareto front is selected to be the parent for the next generation. This 'nearness' is measured by the distance function described in Section 4. The importance of the number par_num(u) is twofold. Firstly, it determines the parents for the next generation chosen from each family. This is how elitism is incorporated in the algorithm, which helps it converge closer to the Pareto-optimal front. Elites of the current population are given an opportunity to be carried over directly to the next generation. Therefore, a 'good' solution found in the current generation is never lost unless a better solution is discovered. Without elitism this is not guaranteed. Importantly, the presence of elites enhances the probability of creating better offspring. Secondly, it helps in keeping a 'good' distribution of solutions over the Pareto front. The next step is to decide the number of children, child_num(u), u = 1, . . . , $n_p$, allocated to each family in the next generation. This number indicates how good the region is.
To accomplish this, the distance measure defined in Section 4 is used, which measures the nearness of each member of a family to the Pareto front. To find child_num(u), simulated annealing is incorporated into the selection process through the procedure find_num() given in Subsection 4.1.1. It first counts the members of the family which satisfy the Boltzmann criterion. Clearly, child_num(u) is proportional to the number of members of the u-th family which are close to the convex hull. Further, par_num(u) also plays a direct role in measuring the fitness of each family u; that is, the number of children to be produced in the next generation by family u is determined by par_num(u) plus the number of family members who satisfy the Boltzmann criterion. This is natural, as these par_num(u) members are on the TCT profile. The next step is the generation of children by each family. As mentioned earlier, initially each family has a single parent, but in subsequent generations the number of parents per family may be more than one (as par_num(u) ≥ 1 for the families whose members are on the Pareto front). In such a case, child_num(u) is divided almost equally among these par_num(u) parents for producing the children. The method of producing children by any parent is the same as explained for the initial generation. Mutation is then applied to randomly selected strings of the population and the temperature is lowered. The process is repeated until no improvement is observed in the TCT profile for a specified number of generations. The algorithm is able to search for the best family in the evolution process.
14.4.1.1 Pseudocode of HMH
A step-wise pseudocode of HMH follows.

Step 1: Set the initial temperature temp = temp_0 and no_improve_iter. Set n_p and n_c. Set Gen = 1.
Step 2: Select (n_p − 2) parents randomly, and add the remaining two parents by taking the shortest and the longest durations of the activities.
Step 3: For u = 1 to n_p do child_num(u) = n_c/n_p and par_num(u) = 1.
Step 4: Generate n_c children from the parents, with parent(u) producing child_num(u) children. This creates n_p families consisting of parents and their corresponding children.
Step 5: Determine the TCT profile and convex hull of the existing generation, i.e.,

\[ \sum_{u=1}^{n_p} \bigl( \mathrm{child\_num}(u) + \mathrm{par\_num}(u) \bigr) \ \text{strings constitute a generation} \tag{14.3} \]

For each family(u), u = 1, ..., n_p, do steps 6 to 8.
Step 6: Find the number of members appearing on the obtained non-dominated front, i.e., par_num(u). These members become the parents for the next generation.
Step 7: For each member w, w = 1, ..., par_num(u) + child_num(u), $d_w^u$ is computed as defined by the distance function $d_w^u = \min_{\forall v} (d_{wv}^u)$, where $d_{wv}^u$ is the distance of the w-th member from the v-th line segment of the convex hull (Figure 14.4).
Step 8: Determine the number of children child_num(u) that will be generated by the family in the next generation, as detailed in Procedure find_num.
Step 9: Parents (those mentioned in step 6) produce child_num(u) children by crossing over randomly with the other members of the obtained non-dominated front. If no member of a family appears in the front, the new parent for this family is decided as follows: the string having the best fitness value among all the members of the family becomes the parent for the next generation.
Step 10: Apply mutation.
Step 11: Gen = Gen + 1.
Step 12: Decrease the temperature: temp = temp ∗ cool_r.
Step 13: Repeat steps 5 to 12 until Gen ≥ Max_iter or the TCT profile remains identical for improve_iter number of generations.

Procedure find_num()
Step 1: sum = 0.
Step 2: For u = 1 to n_p do accept(u) = 0. Repeat steps 3 to 6 for each family(u).
Step 3: Repeat step 4 for each member of the family.
Step 4: If the member is not in the Pareto front, then if exp(−d_w/temp) > ρ, accept(u) = accept(u) + 1; end.
Step 5: sum = sum + accept(u) + par_num(u).
Step 6: For u = 1 to n_p do child_num(u) = (n_c × accept(u))/sum.
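Procedure find_num can be sketched in Python as below. This is a hedged illustration rather than the authors' MATLAB code. The data layout (each family as a list of (d_w, on_front) pairs) is hypothetical, ρ is assumed to be a uniform random number in [0, 1) because the procedure does not define it, and no rounding of child_num(u) to an integer is applied because none is specified. The formula of Step 6 is followed exactly as printed.

```python
import math
import random


def find_num(families, temp, n_c, rho=random.random):
    """Return child_num(u) for every family, following Procedure find_num.

    families: list of families; each family is a list of (d_w, on_front)
              pairs (hypothetical layout chosen for this sketch).
    temp:     current annealing temperature.
    n_c:      total number of children per generation.
    rho:      callable returning the acceptance threshold; assumed here to be
              a uniform random number in [0, 1).
    """
    accept = []
    total = 0
    for family in families:
        accepted = 0
        par_num = 0
        for d_w, on_front in family:
            if on_front:
                par_num += 1                      # member is on the front
            elif math.exp(-d_w / temp) > rho():   # Boltzmann criterion (Step 4)
                accepted += 1
        accept.append(accepted)
        total += accepted + par_num               # Step 5
    if total == 0:
        return [0.0] * len(families)              # guard against division by zero
    # Step 6 as printed: child_num(u) = n_c * accept(u) / sum
    return [n_c * a / total for a in accept]
```

Passing a fixed threshold, for example rho=lambda: 0.5, makes the procedure deterministic, which is convenient when checking the counts by hand.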
14.4.2 HMH Approach for Case Studies
Many test cases were generated to validate the efficiency and accuracy of HMH; two case studies of discrete TCT are detailed in this section. First, a project network (Fig. 14.5) is considered, with the time-cost options of the activities (time in days and cost in $) shown in Table 14.8 [12].

Table 14.8 Options of First Test Problem
ID   Time   Cost      ID   Time   Cost      ID   Time   Cost      ID   Time   Cost
1    14     2400      5    24     17500     9    20     180       14   9      3000
1    15     2150      5    28     15000     9    23     150       14   15     2400
1    16     1900      5    30     10000     9    25     100       14   18     2200
1    21     1500      6    14     40000     10   15     450       15   16     3500
1    24     1200      6    18     32000     10   22     400       16   20     3000
2    15     3000      6    24     18000     10   33     320       16   22     2000
2    18     2400      7    9      30000     11   12     450       16   24     1750
2    20     1800      7    15     24000     11   16     350       16   28     1500
2    23     1500      7    18     22000     11   20     300       16   30     1000
2    25     1000      8    14     220       12   22     2000      17   14     4000
3    15     4500      8    15     215       12   24     1750      17   18     3200
3    22     4000      8    16     200       12   28     1500      17   24     1800
3    33     3200      8    21     208       12   30     1000      18   9      3000
4    16     35000     8    24     120       13   14     4000      18   15     2400
4    20     30000     9    15     300       13   24     1800      18   18     2200
5    22     20000     9    18     240
Experiments are performed to select the SA and GA parameters. The SA parameters, namely the initial temperature (temp_0), the cooling ratio (cool_r), and the final temperature (temp_f), are chosen as 100, 0.85 and 0.1 respectively. We initially experimented with cool_r = 0.75, 0.8, 0.85, and 0.9, and found that 0.85 gave the best results. Similar experiments are conducted to decide temp_0 and temp_f so as to ensure faster convergence to the final Pareto front. The GA parameters, namely the initial population n_p, the ratio n_c/n_p, and the mutation rate m_r, are selected as 60, 8 and 0.02 respectively. We illustrate one such selection: to decide n_p, experiments are done with values of n_p ranging from 20 to 100. For each value of n_p, 50 trials are conducted, keeping the other parameters constant. The average time to converge to the final Pareto front is reported in Table 14.9. The results indicate that convergence to the final Pareto front is fastest for n_p = 60. In addition, the search is set to terminate when the TCT profile does not change in five consecutive iterations (this number is found to be sufficient). We use these parameters for all the experiments with HMH on the test problems.

Table 14.9 Selection of Initial Population
n_p                               20      40      60      80      100
Average time (sec) for 10 runs    20.95   17.85   9.33    17.44   18.74

Table 14.10 Options of Second Test Problem (activities A to I)
Duration   Cost      Duration   Cost      Duration   Cost
5          480       12         1860      19         2000
6          300       13         1450      13         1900
9          450       14         1050      14         1200
12         850       16         3860      7          950
13         600       17         3220      8          640
15         420       18         2600      9          560

An initial generation of n_p strings is randomly selected and n_c children are produced. Results of a typical run of HMH for this test problem follow. It can be seen (Figure 14.13) that the initial generation is well distributed over the solution space. Figure 14.14 illustrates the intermediate improvements. In succeeding iterations HMH searches for the optimal TCT profile. Figure 14.15 depicts the tradeoff points of the final generation population. Since the tradeoff points do not improve further, these points are taken to be the best points obtained. On average, HMH takes 6 iterations to find the best possible TCT profile for this test problem.

HMH is efficient, as it finds a Pareto-optimal front after examining an extremely small fraction of the possible solutions. For the project network of Figure 14.5, the total number of possible schedules is 4.72 × 10^9, whereas HMH (averaged over 50 runs) examined only 3600 (180 × 20) possible schedules to converge to the best possible TCT profile, which is an extremely small fraction (0.00007627%) of the solution space. The TCT profile of the final generation obtained by HMH is compared with the analytical results obtained from exhaustive enumeration. HMH is also accurate: it finds 95% of the optimal solutions on the TCT profile (averaged over 50 runs of HMH). Further, a visual comparison of the HMH results with the GA-based MOO results [12] for the same test problem (Figure 14.5) shows that HMH is better in terms of both the degree of convergence to the true Pareto front and the diversity of solutions.
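The figures quoted above can be checked with a few lines of arithmetic. The sketch below recomputes the explored fraction of the solution space and, purely as an illustration, counts how many geometric cooling steps would take the temperature from temp_0 = 100 down to temp_f = 0.1 with cool_r = 0.85. The function names are hypothetical, and the text does not state that every run cools all the way to temp_f.

```python
def explored_percentage(evaluated, total):
    """Percentage of the solution space that was actually examined."""
    return 100.0 * evaluated / total


def cooling_steps(temp0, cool_r, temp_f):
    """Number of updates temp = temp * cool_r needed to reach temp <= temp_f."""
    steps = 0
    temp = temp0
    while temp > temp_f:
        temp *= cool_r
        steps += 1
    return steps


# 180 x 20 schedules examined out of 4.72e9 possible schedules.
pct = explored_percentage(180 * 20, 4.72e9)
# Geometric cooling schedule with the parameter values quoted in the text.
steps = cooling_steps(100.0, 0.85, 0.1)
print(f"explored: {pct:.8f} %, cooling steps: {steps}")
```

The computed percentage agrees with the 0.00007627% reported in the text.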
Fig. 14.13 Initial population with tradeoff points and convex hull (Project Cost, × 10^5, versus Project Time; legend: Population, Tradeoff points, Convex hull)
Fig. 14.14 Intermediate improvements (Project Cost, × 10^5, versus Project Time; legend: Population, Tradeoff points, Convex hull)
The second problem involves an adaptation of a 9-activity network (Figure 14.16 and Table 14.10) from [31]. The resources are assumed to be available without constraints. The HMH parameters chosen for this problem are the same as mentioned earlier. An initial generation of n_p strings is randomly selected and n_c children are produced. The initial generation is found to be well diversified over the solution space for this test problem as well. The diversity of solutions is maintained in the intermediate improvements as the tradeoff points move towards the axes. Figure 14.17 illustrates the initial population, and Figure 14.18 presents the tradeoff points and their convex hull for the final generation population. HMH again proves efficient as well as accurate when its results are compared with those obtained by exhaustive enumeration.
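The tradeoff points referred to here are the non-dominated (time, cost) pairs of a generation, with both objectives minimized. A minimal sketch of extracting them from a population is given below. It is a generic non-dominated filter written for illustration, not the authors' implementation; the function name and the sample values are hypothetical.

```python
def tradeoff_points(points):
    """Return the non-dominated (time, cost) pairs, sorted by increasing time.

    Both time and cost are minimized. After sorting by time (then cost), a
    point is non-dominated exactly when its cost is strictly lower than the
    cost of every point that precedes it.
    """
    front = []
    best_cost = float("inf")
    for time, cost in sorted(set(points)):
        if cost < best_cost:
            front.append((time, cost))
            best_cost = cost
    return front


# Illustrative population of (project time, project cost) pairs.
population = [(5, 480), (6, 300), (6, 500), (7, 300), (4, 900)]
print(tradeoff_points(population))
```

In this sample, (6, 500) and (7, 300) are discarded because (6, 300) is at least as good in both objectives.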
Fig. 14.15 Tradeoff points and convex hull of final generation population (Project Cost, × 10^5, versus Project Time; legend: Population, Tradeoff points, Convex hull)
Fig. 14.16 Network of the second test problem
Fig. 14.17 Initial population (Project Cost versus Project Time)
Fig. 14.18 Tradeoff points and convex hull of final generation population (Project Cost versus Project Time; legend: Population, Tradeoff points, Convex hull)
14.4.3 Standard Test Problems
To test and validate HMH, two standard test problems with convex Pareto-optimal fronts from [11] are attempted here. Visual inspection of the results shows that HMH produces Pareto solutions that are good from both the diversity and the convergence viewpoints.

14.4.3.1 Schaffer's Two Objective Problem
This problem has two objectives, which are to be minimized:

\[ \text{SCH1:} \quad \begin{cases} \text{Minimize } f_1(x) = x^2, \\ \text{Minimize } f_2(x) = (x - 2)^2, \\ -A \le x \le A. \end{cases} \]

This problem has Pareto-optimal solutions $x^* \in [0, 2]$, and the Pareto-optimal set is a convex set: $f_2^* = \bigl(\sqrt{f_1^*} - 2\bigr)^2$ in the range $0 \le f_1^* \le 4$. Different values of the bound parameter A are used in different studies, from values as low as A = 10 to values as high as A = 10^5. Figure 14.19 shows the first generation tradeoff points and convex hull along with the population. The tradeoff points and convex hull of the final generation, reached in only the 3rd iteration, are shown in Figure 14.20; it can be clearly seen that the obtained non-dominated front matches the known Pareto-optimal front well.

14.4.3.2 Zitzler-Deb-Thiele's 1st (ZDT1) Test Problem
ZDT1 has two objectives to be minimized. In the general form the problem is as below:

\[ \text{Minimize } f_1(x), \qquad \text{Minimize } f_2(x) = g(x)\, h\bigl(f_1(x), g(x)\bigr). \]

Other ZDT test problems vary in the way the three functions f_1(x), g(x), and h(x) are defined. ZDT1 is given below:

\[ \text{ZDT1:} \quad \begin{cases} f_1(x) = x_1, \\ g(x) = 1 + \dfrac{9}{n-1} \sum_{i=2}^{n} x_i, \\ h(f_1, g) = 1 - \sqrt{f_1 / g}. \end{cases} \]

The problem has 30 variables, which lie in the range [0, 1]. It has a convex Pareto-optimal region that corresponds to $0 \le x_1^* \le 1$ and $x_i^* = 0$ for i = 2, 3, ..., 30. In this problem, the Pareto-optimal front is formed with g(x) = 1. Figure 14.21 shows the first generation tradeoff points and the convex hull along with the population. The final generation tradeoff points and convex hull are shown in Figure 14.22; importantly, the obtained non-dominated front matches the known Pareto-optimal front fairly well. Further, it has a good distribution of non-dominated solutions across the front. HMH is, therefore, efficient and accurate in tackling a large number of decision variables. HMH has performed well on the above standard test problems: the non-dominated solutions have converged very close to the known Pareto-optimal front, and the plots show that they maintain a good diversity.

All the test problems presented in this work were run on an HP machine with an Intel(R) Pentium(R) 4 CPU (3.2 GHz) and 1 GB RAM. The procedures are coded in MATLAB 7.0 and tested under Microsoft Windows XP Professional version 2002.

Fig. 14.19 First generation tradeoff points (NDS) and convex hull (Schaffer's two objective problem; f2 versus f1; legend: Population, Tradeoff points, Convex hull)
Fig. 14.20 Final tradeoff points (NDS) and convex hull (Schaffer's two objective problem; f2 versus f1; legend: Tradeoff points, Convex hull)
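The two benchmark problems of this subsection can be written as short objective functions. The sketch below follows the definitions of SCH1 and ZDT1 given above; the function names are chosen here for illustration, and Python is used only for readability (the authors report a MATLAB implementation).

```python
import math


def sch1(x):
    """Schaffer's problem SCH1: returns (f1, f2) for a scalar x."""
    return x * x, (x - 2.0) ** 2


def zdt1(x):
    """ZDT1: returns (f1, f2) for a sequence x with every x[i] in [0, 1]."""
    n = len(x)
    f1 = x[0]
    g = 1.0 + 9.0 * sum(x[1:]) / (n - 1)
    h = 1.0 - math.sqrt(f1 / g)
    return f1, g * h


# A point of the SCH1 Pareto-optimal set (0 <= x <= 2).
print(sch1(0.5))
# A Pareto-optimal ZDT1 point: x_1 in [0, 1] and all other variables zero.
print(zdt1([0.25] + [0.0] * 29))
```

On the Pareto-optimal set of SCH1 the two values satisfy f2 = (sqrt(f1) − 2)^2, and for ZDT1 a point with x_2 = ... = x_30 = 0 gives g(x) = 1, so f2 = 1 − sqrt(f1).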
Fig. 14.21 First generation tradeoff points (NDS) and convex hull (Zitzler-Deb-Thiele's (ZDT) test problem; f2 versus f1; legend: Population, Tradeoff points, Convex hull)
Fig. 14.22 Final tradeoff points (NDS) and convex hull
14.5 Conclusions
ANNHEGA combines ANN models and a heuristic technique with GA to solve the resource-constrained nonlinear TCT problem, and becomes a powerful multiobjective optimization method without losing its simplicity. The method makes TCT analysis more realistic by adding two important dimensions to it. First, any existing arbitrarily shaped time-cost relationship can be dealt with using its ANN module. Second, the heuristic module takes care of a constrained resource in the TCT analysis. The feasibility of ANNHEGA is shown through an illustrative test case. An additional outcome of this work is that it delivers the lowest limit of the constrained resource beyond which project expediting is not feasible; this information is important for schedule planners. An ANNHEGA-based system can help to monitor and control the project in the most cost-effective way in real time, and one can choose the best alternative on the RCNTCT profile to execute the project. There are interesting future extensions of this work: more than one constrained resource can be incorporated in the system, and other precedence relationships may be considered.

IFAG is presented to carry out the sensitivity analysis of nonlinear TCT profiles with respect to real-life project uncertainties. The fuzzy logic framework facilitates (1) the representation of imprecise activity duration as well as activity cost; (2) the estimation of a new time-cost pair for each activity based on the input uncertainties; and (3) the interpretation of the fuzzy results in crisp form. A case study is solved using IFAG to demonstrate its working. The method provides a comprehensive tool to project managers for analyzing their time-cost optimization decisions in a more flexible and realistic manner. In future we intend to investigate the responsiveness of the RCNTCT profile to project uncertainties.

HMH is a new MOO method that combines a genetic algorithm and simulated annealing to solve the TCT problem, incorporating the concept of Pareto optimality to evolve a family of non-dominated solutions well distributed along the TCT profile. Two case studies of discrete TCT are solved using HMH to illustrate its performance. HMH can discover near-optimal solutions after examining an extremely small fraction of the possible solutions. HMH is also tested on two standard MOO test problems to validate its performance.

HMH suits our problems well. From the algorithmic viewpoint, however, we intend as part of future work to (1) incorporate a mechanism to preserve diversity in the algorithm, (2) compare it with standard MOEAs such as NSGA-2, SPEA-2 and PAES, using metrics that evaluate diversity and convergence properties, and (3) extend it to more than two objectives. An obvious extension of our work is to use HMH in place of GA in ANNHEGA to solve RCNTCT. Similarly, sensitivity analysis of TCT profiles can be investigated using HMH along with fuzzy logic and ANNs. HMH may be further explored for solving other complex MOO problems.
References
1. De, P., Dunne, E.J., Ghosh, J.B., Wells, C.E.: The discrete time-cost tradeoff problem revisited. European Journal of Operational Research 81, 225–238 (1995)
2. De, P., Dunne, E.J., Ghosh, J.B., Wells, C.E.: Complexity of the discrete time/cost tradeoff problem for project networks. Operations Research 45, 302–306 (1997)
3. Richard, F.D., Hebert, J.E., Verdini, W.A., Grimsrud, P.H., Venkateshwar, S.: Nonlinear time/cost tradeoff models in project management. Computers & Industrial Engineering 28(2), 219–229 (1995)
4. Vanhoucke, M.: New computational results for the discrete time/cost trade-off problem with time-switch constraints. European Journal of Operational Research 165, 359–374 (2005)
5. Vanhoucke, M., Debels, D.: The discrete time/cost trade-off problem: extensions and heuristic procedures. Journal of Scheduling 10(4-5), 311–326 (2007)
6. Ehrgott, M., Gandibleux, X.: A survey and annotated bibliography of multiobjective combinatorial optimization. OR Spektrum 22, 425–460 (2000)
7. Coello, C.A.C.: An updated survey of GA-based multiobjective optimization techniques. ACM Computing Surveys 32(2), 109–142 (2000)
8. Dimopoulos, C., Zalzala, M.S.: Recent developments in evolutionary computation for manufacturing optimization: problems, solutions and comparisons. IEEE Transactions on Evolutionary Computation 4, 93–113 (2000)
9. Holland, J.H.: Adaptation in natural selection and artificial systems. Univ. of Michigan Press, Ann Arbor (1975)
10. Goldberg, D.E.: Genetic algorithms in search optimization & machine learning. Addison Wesley, Reading (1998)
11. Deb, K.: Multi-objective optimization using evolutionary algorithms. Wiley, Chichester (2001)
12. Feng, C.W., Liu, L., Burns, A.: Using genetic algorithms to solve construction time-cost trade-off problems. Journal of Computing in Civil Engineering 11, 184–189 (1997)
13. Leu, S.S., Yang, C.H.: GA-based multicriteria optimal model for construction scheduling. Journal of Construction Engineering and Management 125(6), 420–427 (1999)
14. Azaron, A., Perkgoz, C., Sakawa, M.: A genetic algorithm approach for the time cost trade-off in PERT networks. Applied Mathematics and Computation 168, 1317–1339 (2005)
15. Pathak, B.K., Singh, H.K., Srivastava, S.: Multi-resource-constrained discrete time-cost tradeoff with MOGA based hybrid method. In: Proc. 2007 IEEE Congress on Evolutionary Computation, pp. 4425–4432 (2007)
16. Demeulemeester, E., Herroelen, W.: Project scheduling: A research handbook. Kluwer Academic Publishers, Boston (2002)
17. Ozdamar, L., Ulusoy, G.A.: Survey on the Resource-Constrained Project Scheduling Problem. IIE Transactions 27, 574–586 (1995)
18. Kolisch, R., Hartmann, S.: Experimental investigation of heuristics for resource-constrained project scheduling: An update. European Journal of Operational Research 174, 23–37 (2006)
19. Yang, B., Geunes, J., O'Brien, W.J.: Resource-Constrained Project Scheduling: Past Work and New Directions. Research Report, Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL (2001)
20. Erenguc, S.S., Ahn, T.D., Conway, G.: The resource constrained project scheduling problem with multiple crashable modes: An exact solution method. Naval Research Logistics 48(2), 107–127 (2001)
21. Arias, M.V., Coello, C.A.C.: Asymptotic convergence of metaheuristics for multiobjective optimization problems. Soft Computing 10, 1001–1005 (2005)
22. Mares, M.: Network analysis of fuzzy set methodology in industrial engineering. In: Evans, G., Karwowski, W., Wilhelm, M.R. (eds.), pp. 115–125. Elsevier Science Publishers, B. V., Amsterdam (1989)
23. Daisy, X.M., Thomas, S.: Stochastic Time-cost optimization model incorporating fuzzy sets theory and nonreplaceable front. Journal of Construction Engineering and Management 131(2), 176–186 (2005)
24. Leu, S.S., Chen, A.T., Yang, C.H.: A GA-based fuzzy optimal model for construction time-cost trade-off. International Journal of Project Management 19, 47–58 (2001)
25. Yang, T.: Impact of budget uncertainty on project time-cost tradeoff. IEEE Transactions on Engineering Management 52(2), 167–174 (2005)
26. Pathak, B.K., Srivastava, S.: MOGA-based time-cost tradeoffs: responsiveness for project uncertainties. In: Proc. 2007 IEEE Congress on Evolutionary Computation, pp. 3085–3092 (2007)
27. Mamdani, E.H.: Application of fuzzy logic to approximate reasoning using linguistic synthesis. IEEE Transactions on Computers 26(12), 1182–1191 (1977)
28. Zadeh, L.A.: Outline of a new approach to the analysis of a complex system and decision processes. IEEE Transactions on Systems, Man and Cybernetics SMC-3, 28–44 (1973)
29. Yip, P., Pao, Y.H.: Combinatorial optimization with use of guided evolutionary simulated annealing. IEEE Transactions on Neural Networks 6(2), 290–295 (1995)
30. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
Chapter 15
Systolic VLSI and FPGA Realization of Artificial Neural Networks
Pramod Kumar Meher

Abstract. Systolic architectures are established as a widely popular class of VLSI structures for repetitive and computation-intensive applications due to the simplicity of their processing elements (PEs), modularity of design, regular and nearest-neighbor interconnections between the PEs, high level of pipelinability, small chip-area and low power consumption. In systolic arrays, the desired data is pumped rhythmically at regular intervals across the PEs to yield high throughput by fully pipelined processing. Systolic array architectures significantly reduce the I/O bottleneck by feeding the data at the chip boundary and pipelining it across the structure. The extensive reuse of data within the array allows a large volume of computation to be executed with only a modest increase of bandwidth. Since FPGA devices consist of regularly placed interconnected logic blocks, they closely resemble systolic processors. The systolic computation within the PEs can therefore easily be mapped to the configurable logic blocks of an FPGA device. Artificial neural network (ANN) algorithms are also quite suitable for systolic implementation due to their repetitive multiply-accumulate behaviour. Several variations of one-dimensional and two-dimensional systolic arrays are reported in the literature for the implementation of different types of neural networks. Special-purpose systolic designs for various ANN-based applications relating to pattern recognition and classification, adaptive filtering and channel equalization, vector quantization, image compression and general signal/image processing have been suggested in the last two decades. This chapter is devoted to systolic architectures for the implementation of ANN algorithms in custom VLSI and FPGA platforms. The key techniques used for the design of basic systolic building blocks of ANN algorithms are discussed in detail. Moreover, the mapping of the fully-connected unconstrained ANN, as well as the multilayer ANN algorithm, into fully-pipelined systolic architectures is described with a generalized dependence graph formulation. A brief overview of systolic architectures for advanced ANN algorithms for different applications is presented at the end.

Pramod Kumar Meher, Department of Embedded Systems, Institute for Infocomm Research, 1 Fusionopolis Way, Singapore-138632, e-mail: [email protected]
Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 359–380. © Springer-Verlag Berlin Heidelberg 2010, springerlink.com
15.1 Introduction
Over the years, the artificial neural network (ANN) has not only become increasingly popular due to its adaptive and non-linear capabilities, but has also been established as a potential intelligent tool in many areas of technology for solving ill-posed problems where conventional techniques fail to be effective. ANN algorithms are, however, computation-intensive, and their computational complexity increases with the number of inputs in a training pattern, the number of layers of neurons in the case of a multilayer network, and the number of neurons in the different layers. Apart from that, the ANN algorithms for the training phase are iterative by nature and require several iterations to train the network. General-purpose computers based on the sequential von Neumann architecture are found to be slow in implementing the iterative training, particularly when the network consists of a large number of neurons and multiple hidden layers. On the other hand, ANN algorithms are inherently parallel, and each layer of a multilayer network can easily be implemented by a separate pipeline stage. Attempts have, therefore, been made to exploit these features of ANN algorithms to implement them in single instruction-stream multiple data-stream (SIMD) machines and array processors [1, 2]. The SIMD configuration has been considered a good choice for the implementation of these algorithms, as it provides a large number of processing cells using a shared controller with minimal programming and a low burden on the operating system. Real-time and embedded systems, however, impose stringent limitations on the cost, size, power consumption, throughput rate and computational latency of the neural algorithms. To fit into the embedding environment, the size of the computing structure very often has to be small, and at the same time it has to meet the speed requirement of time-critical and hard-real-time applications.
Although general-purpose computers can execute the ANN algorithms of small-sized networks through software, it is essential to realize these algorithms in dedicated VLSI or field programmable gate array (FPGA) devices to meet the cost, size and time requirements of embedded and real-time applications. Conventional general-purpose machines and SIMD machines fall far short of the requirements and specifications of many such application environments. Several attempts have therefore been made in the last two decades to realize ANNs in analog as well as digital VLSI. There are two kinds of approaches to the hardware implementation of ANN algorithms, namely the direct-design approach and the indirect-design approach [3, 4]. In the direct-design approach, the neural algorithms are directly mapped into dedicated hardware, while the indirect approach makes use of the matrix processing behaviour of the neural models. ANN algorithms are found to be well suited for systolic implementation due to their repetitive and recursive behaviour. Several variations of one-dimensional and two-dimensional systolic arrays are, therefore, reported for the implementation of artificial neural networks [5]-[15]. Application-specific systolic architectures for various ANN-based applications relating to pattern recognition and classification, adaptive filtering and channel equalization, vector quantization, image compression and general signal/image processing have come up in the last two decades.

The general organization of a systolic structure is shown in Fig. 15.1. It is a network of simple and identical processing elements (PEs), arranged regularly in an array and connected by localized communication links. Each PE computes rhythmically and pumps data synchronously in and out such that a regular flow of data is maintained across the array for fully pipelined computation. Systolic architectures are now established as a class of the most popular VLSI structures for repetitive and computation-intensive applications due to the simplicity of their PEs, modularity of their structure, regular and nearest-neighbor interconnections between the PEs, high level of pipelinability, small chip-area and low power consumption [16]-[19]. Systolic array architectures significantly reduce the I/O bottleneck by feeding the data at the boundary and pipelining it across the structure. The extensive reuse of data within the array allows a large volume of computation to be executed with only a modest increase of bandwidth [17].

Fig. 15.1 A generalized representation of a systolic architecture (a regular array of processing elements)

FPGA devices have progressed steadily and substantially, not only in terms of their logic and I/O resources but also in terms of their performance. The reusability of these devices, along with widely and freely available easy-to-use software tools for simulation, synthesis, and place and route, has made the FPGA a preferred platform for computation-intensive applications in signal processing and communication. FPGA implementation of several ANN-based applications has been reported in the literature [20]-[22]. The generalized structure of an FPGA device is shown in Fig. 15.2. It consists of regularly arranged programmable logic components called configurable logic blocks (CLBs), and a hierarchy of reconfigurable interconnects that allow the CLBs to be connected together to perform the necessary computations for specific applications. From Fig. 15.1 and Fig. 15.2 it can be seen that the structure of an FPGA and a systolic structure are quite similar, particularly in terms of the regularity and modularity of the computing logic. It is therefore straightforward to map systolic architectures to FPGA platforms, whose repetitive modular structure matches that of systolic designs.

Fig. 15.2 A generalized structure of an FPGA device (Configurable Logic Blocks, I/O Blocks, Programmable Interconnects)

In this chapter, we discuss the basic design principles and development of systolic VLSI for ANN algorithms, which can easily be ported to FPGA platforms as well. The rest of the chapter is organized as follows. In the next section, we discuss the direct-design approach for the implementation of ANN algorithms. The basic design principles of mapping neural algorithms into systolic array architectures are discussed in Section 15.3. The systolic architectures for the fully connected unconstrained network and the multilayer neural network are discussed in Section 15.4. A brief overview of systolic architectures for advanced ANN algorithms for different applications is placed at the end of that section. Conclusions of the chapter are presented in Section 15.5.
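The rhythmic, nearest-neighbour data flow of a systolic array can be illustrated with a small cycle-by-cycle simulation. The sketch below models a linear array in which PE i keeps the i-th row of a weight matrix and an accumulator, while the input samples are shifted from one PE to the next on every clock cycle, so that the array computes the weighted sums y = Wx. It is a software illustration under assumed conventions (stationary weights, inputs entering at PE 0, hypothetical function name), not one of the architectures described in this chapter.

```python
def systolic_matvec(W, x):
    """Simulate a linear systolic array computing y = W x.

    PE i stores row W[i] and an accumulator. Input x[j] enters PE 0 at
    cycle j and moves one PE to the right per cycle, so PE i processes
    x[j] at cycle i + j. Returns (y, number_of_cycles).
    """
    n, m = len(W), len(x)
    acc = [0.0] * n
    regs = [None] * n            # (index, value) currently held by each PE
    cycles = n + m - 1
    for t in range(cycles):
        # Nearest-neighbour shift: every PE passes its sample to the right.
        for i in range(n - 1, 0, -1):
            regs[i] = regs[i - 1]
        regs[0] = (t, x[t]) if t < m else None
        # All PEs perform one multiply-accumulate in the same cycle.
        for i in range(n):
            if regs[i] is not None:
                j, xj = regs[i]
                acc[i] += W[i][j] * xj
    return acc, cycles


y, cycles = systolic_matvec([[1, 2], [3, 4], [5, 6]], [1, 1])
print(y, cycles)
```

Only the boundary PE receives data from outside the array, and each input sample is reused by every PE as it travels across the array, which mirrors the reduced I/O bandwidth and the data reuse noted above.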
15.2 Direct-Design of VLSI for Artificial Neural Network In the direct-design approach, the computing structure directly follows the neural computing model by straight forward mapping. These designs are tailored for particular ANN models for specific applications, and aim at high-performance
15
Systolic VLSI and FPGA Realization of Artificial Neural Networks
363
processing of the algorithm by fully dedicated implementation. Unlike the structures for the indirect-designs, these structures use global connectivity and, therefore, can be used only for networks of smaller dimensions. Although there are some optical implementations of direct-designs for some neural models [23, 24], direct-design electronic neural processors are relatively more prevalent. Several electronic processors have been designed and developed using analog CMOS [25, 26], as well as, digital CMOS technologies [3, 4] for various dedicated applications e.g., pattern recognition, feature extraction and machine vision etc. Analog designs are continuous-valued and may be implemented in continuous-time RC circuits or discrete-time switched capacitor circuits, while the digital designs, by definition, are discrete-time, as well as, discrete-valued. Although the analog, as well as, the digital approaches have shown some success in specific application areas, each of these approaches is associated with certain disadvantages also. The salient features of analog neural processors are discussed in the followings. Analog approach of implementation of neural algorithm has considerable potential, as it can be used for massive computation with low power dissipation since the transistors in sub-threshold region consume very low power. An analog neuron can be implemented conveniently by a simple differential amplifier, where the synaptic weights are taken care of by the resistive circuit elements. Consequently, it is possible to pack several neurons in a small analog chip. Analog circuits are capable to process more than 1 bit per transistor to yield very high-throughput of result, and consequently, analog approach offers compactness of hardware and higher speed performance compared with the digital neural processors. 
The computation of the weighted sum is performed conveniently in analog circuits either by the magnitude of currents or the amount of charge, while addition operations in digital circuits introduce substantial latency in the implementation of neural algorithms. Analog neural network allows asynchronous updating of weights which helps to realize very high-speed processing. The non-linear behaviour of electronic circuits facilitates convenient implementation of the non-linear threshold functions at the output of the neurons. The non-volatile storage of analog weights, however, sets serious limitation on the learning capabilities of analog processors. Therefore, the learning algorithm may be executed in a companion digital signal processor chip on the printed circuit board, so that the weight values are updated and stored conveniently in digital memory. The updated weight values are dynamically transferred and stored in the capacitors formed by MOS transistors with additional augmented capacitors in the analog circuit [27]. In spite of several encouraging features, the analog neural processors have only limited domain of applicability due to some inherently associated drawbacks: 1. The analog neural networks are more susceptible to noise, crosstalk, variation of power supply, and effect of temperature. 2. Analog circuits are not perfectly reproducible and will have variation of performances from chip to chip. 3. The error voltage due to switching transient and data loss due to leakage current tends to deteriorate the operational accuracy of the analog neural processors.
P.K. Meher
4. Analog processors are used with only limited precision (usually 8 bits) because the chip area increases with the precision.
In contrast, the digital designs offer better tolerance to variations of power supply, ambient temperature, noise and crosstalk compared with the analog processors. Besides, they provide accurate storage of weights in digital memory cells to facilitate updating of weights during learning. Compared with analog circuits, digital designs also allow a flexible choice of word-length depending on the precision requirement. In addition, digital designs offer fast turn-around due to the availability of advanced CAD software and better support for silicon fabrication. Digital implementations, on the other hand, are relatively slow and involve more chip area than their analog counterparts.
15.3 Design Considerations and Systolic Building Blocks for ANN

A neural network usually consists of a number of neurons with dense interconnections among them. A variety of network topologies and interconnections of the neurons have been suggested to perform a wide range of applications using different learning mechanisms. Based on the type of connectivity, these networks may be put into two major categories. In one category, every neuron is connected to every other neuron; such a network may be called a neural network of unconstrained connectivity. The other category is the multilayer neural network, in which the neurons are arranged in different layers. The neurons of a given layer are not connected with each other, but each neuron in a layer is connected to every neuron in its adjacent layer through some connecting weights. Irrespective of the type, a neural network goes through a series of iterations to arrive at the solution of a given problem. In general, the neural algorithms are executed in two phases: the search phase and the learning phase [3]-[6]. In the search phase, all the neurons of the network iteratively update their activation values or states. Assuming unconstrained connectivity among the neurons, the state/output of the i-th neuron at the (k + 1)-th iteration can be expressed in a generalized form as:

$$x_i(k+1) = f\big(u_i(k+1),\, u_i(k),\, x_i(k)\big) \quad \text{for } i = 1, 2, \ldots, N, \tag{15.1}$$

where the internal activation of the neuron is given by

$$u_i(k+1) = \sum_{j=1}^{N} W_{ij}(k) \cdot x_j(k) + \theta_i(k). \tag{15.2}$$
$W_{ij}$ is the weight value for the connection running from the j-th neuron to the i-th neuron, and $\theta_i(k)$ is the bias value associated with the i-th neuron. $\{W_{ij}\}$ constitutes the weight matrix W of size N × N, where N is the number of neurons in the network. f is a non-linear function, usually called the threshold function or the activation function, which may be a sigmoid function, a step function, a squashing function or
Systolic VLSI and FPGA Realization of Artificial Neural Networks
may be a stochastic function. In the learning phase, the neurons adaptively update their weights either by supervised or by unsupervised learning. In this article, we focus on the VLSI implementation of neural networks based on the most basic supervised learning rule [5], namely the Widrow-Hoff rule (also popularly known as the delta rule), where the weights of the neurons are updated in every iteration of learning according to

$$(W_{ij})_{\mathrm{NEW}} = (W_{ij})_{\mathrm{OLD}} + \eta \cdot x_j \cdot (d_i - x_i). \tag{15.3}$$

η in (15.3) is the learning rate, which determines the rate at which the weights converge to the optimal values and also the magnitude of the residual mean square error. $d_i$ is the desired target value specified for the i-th neuron for a given set of inputs. The processing of each neuron (given by equations (15.1)-(15.3)) is performed in two successive stages, as shown in Fig. 15.3.

[Fig. 15.3 Two-stage processing of a neuron. Block diagram with two stages, "Computation of neuron output" and "Updating of neuron weights"; signal labels: neuron weights, input values, neuron output, error values.]
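As a minimal sketch of the two stages of Fig. 15.3, the following Python fragment first computes the neuron outputs and then applies the delta rule of (15.3). It assumes, for illustration only, that the activation f depends on $u_i(k+1)$ alone (a special case of (15.1)) and that the error term uses the outputs produced by the first stage; the function names `neuron_outputs` and `delta_rule_update` are hypothetical and do not come from the chapter.

```python
def neuron_outputs(W, x, theta, f):
    """Stage 1: u_i = sum_j W[i][j] * x[j] + theta[i], then x_i = f(u_i)."""
    return [f(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + th)
            for row, th in zip(W, theta)]


def delta_rule_update(W, x_in, x_out, d, eta):
    """Stage 2: Widrow-Hoff update of (15.3); returns a new weight matrix.

    x_in  : input values x_j seen by the neurons
    x_out : outputs x_i computed in stage 1 (assumption: used in the error term)
    d     : desired target values d_i
    """
    return [[w_ij + eta * x_j * (d_i - x_i) for w_ij, x_j in zip(row, x_in)]
            for row, d_i, x_i in zip(W, d, x_out)]


if __name__ == "__main__":
    W = [[0.0, 0.0], [0.0, 0.0]]
    x = [1.0, 2.0]
    out = neuron_outputs(W, x, [0.0, 0.0], lambda u: u)  # identity activation
    print(delta_rule_update(W, x, out, [1.0, -1.0], 0.5))
```

Because every weight update depends only on values that are already available after the first stage, the list comprehension in `delta_rule_update` could be evaluated for all (i, j) pairs at once.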
The computation of the neuron output/state is performed in the first stage, followed by the updating of the weights to be used for computing the output of the next iteration. From equations (15.1)-(15.3) we find that, once the outputs of all the neurons are available, it is fairly simple and straightforward to perform the weight updating in a fully parallel computing section without any data dependencies. The outputs of all the neurons for any given iteration can also be computed in parallel according to equations (15.1) and (15.2), since there are no data dependencies between them; data dependencies do exist, however, between the partial sums for computing the output of an individual neuron, and these need to be taken care of in a parallel implementation. Moreover, the matrix-vector multiplication of the form of equation (15.2) is frequently encountered in most ANN algorithms. (Note that for a multilayer feed-forward network, as well as a network with back-propagation learning, we have computations similar to those of the unconstrained network given in (15.2) for computing the outputs of the neurons of a given layer and for updating the neuron weights, where the connection weights run from the j-th neuron in the preceding layer to the i-th neuron in the current layer.) Keeping this in view, we discuss here the derivation of a systolic array for the computation of equation (15.2), which could be used as a building block for the systolic implementation of many different types of ANN. To derive the systolic architecture, let us represent equation (15.2) for any given neuron for a specific iteration in a generalized matrix-vector product form:
$$y_i = \sum_{j=1}^{N} W_{ij} \cdot x_j + \theta_i. \tag{15.4}$$
A three-step systolic mapping procedure, namely,
• representation of the computation in a locally recursive form,
• transformation of the recursive computation into a dependence graph (DG),
• appropriate mapping of the DG onto a suitable systolic array architecture,
is normally followed to derive a systolic array by DG formulation [5]. Since equation (15.4) is already in a locally recursive form, we can derive a localized DG directly from it, as shown in Fig. 15.4. It consists of $N^2$ identical nodes arranged in N rows and N columns. The function of each node of the DG is depicted in Fig. 15.4(b). The elements of the vector $x = \{x_1, x_2, \ldots, x_N\}$ are fed to the N nodes on the first row of the DG and move vertically down to the adjacent nodes, while the outputs of the nodes of different columns are transferred horizontally to the adjacent nodes on their right. The desired matrix-vector product in this case is obtained from the right-most column of nodes of the DG. The nodes of this DG can be projected vertically down to obtain a linear systolic array (shown in Fig. 15.5) consisting of N locally connected PEs. The function of the PEs is described in Fig. 15.5(b). Each PE of the structure performs one multiplication and one addition during a clock period. The elements of the weight matrix W are stored in N circular-shift registers of size N, such that the i-th register stores the elements of the i-th column of the weight matrix and feeds a weight value to the
[Fig. 15.4 DG for the computation of matrix-vector product of equation (15.4). (a) The DG: N rows and N columns of nodes, with node (i, j) holding $W_{ij}$; $x_1, \ldots, x_N$ enter the first row, $\theta_i$ enters row i on the left and $y_i$ leaves it on the right. (b) Function of each node of the DG: Xout ← Xin + Wij · Yin; Yout ← Yin.]

[Fig. 15.5 The linear systolic array for the computation of matrix-vector product. (a) The systolic array: N PEs; $x_1, x_2, \ldots, x_N$ are fed through delays of 0, Δ, 2Δ, ..., (N − 1)Δ; $\theta_1, \ldots, \theta_N$ enter at the left-most PE and $y_1, y_2, \ldots, y_N$ leave the right-most PE; the i-th PE receives $W_{1i}, W_{2i}, \ldots, W_{Ni}$ in successive cycles. (b) Function of each PE of the structure: Xout ← Xin + Win · Yin.]
i-th PE in each clock cycle. The elements of the input vector x are fed to N different PEs as shown in Fig. 15.5(a), such that the input to a PE is staggered by one clock cycle relative to the adjacent PE on its left. The elements of the input vector x (once loaded to the PEs) stay in their respective PEs throughout the computation, while the computed output of each PE is transferred to its neighbouring PE on the right. The first output value of the array is obtained after N cycles from the right-most PE, while the remaining N − 1 output values are obtained in the next N − 1 cycles, where the duration of a clock period is $T = T_M + T_A$, with $T_M$ and $T_A$ being the times required to perform one multiplication and one addition in a PE, respectively. The structure has a latency of N cycles and a throughput rate of one output per cycle. It has all the advantages of systolic design, but the output must be demultiplexed and stored in separate registers to be used in the next iteration. For a large ANN, the time required for demultiplexing is large, and the demultiplexer involves considerably high area complexity and requires many additional interconnections. To avoid this difficulty of the pure systolic realization, we discuss here a semi-systolic implementation of the matrix-vector product of (15.4). For semi-systolic realization, the
dependence graph of Fig. 15.4 can be modified to the form shown in Fig. 15.6, where the nodes are flipped about the diagonal (so that the weight values appear in transposed form) and the i-th column of the resulting DG is circularly shifted up by (i − 1) places. As in the case of the original DG of Fig. 15.4, the modified DG also consists of $N^2$ nodes arranged in N rows and N columns. In this case also, the input values are loaded to the nodes on the first row of the DG, but they move diagonally down to the adjacent nodes on the left in the next lower row, while the outputs computed by the nodes of each row are transferred vertically down to the adjacent nodes.

[Fig. 15.6 The modified DG for semi-systolic computation of matrix-vector product of (15.4). (a) The dependence graph: column i holds, from top to bottom, $W_{ii}, W_{i(i+1)}, \ldots, W_{iN}, W_{i1}, \ldots, W_{i(i-1)}$; $\theta_i$ and $x_i$ enter at the top of column i and $y_i$ leaves at the bottom. (b) Function of each node of the DG: Yout ← Yin + Wij · Zin; Zout ← Zin. Note: N' = N − 1.]
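The weight arrangement of the modified DG can be checked with a short Python sketch that builds the shifted layout and evaluates the graph row by row. It rests on two assumptions that are not stated for the DG itself: an input leaving the left-most column re-enters the right-most column on the next row (the circular movement used by the array derived from this DG), and the partial sum of column i starts from $\theta_i$. The names `modified_dg_layout` and `evaluate_modified_dg` are hypothetical, and indices are 0-based.

```python
def modified_dg_layout(W):
    """Row r, column i of the modified DG holds W[i][(i + r) mod N]."""
    n = len(W)
    return [[W[i][(i + r) % n] for i in range(n)] for r in range(n)]


def evaluate_modified_dg(W, x, theta):
    """Evaluate y = W x + theta by sweeping the modified DG from top to bottom."""
    n = len(W)
    layout = modified_dg_layout(W)
    y = list(theta)   # partial sums: one per column, moving vertically down
    z = list(x)       # inputs present on the current row
    for r in range(n):
        y = [y[i] + layout[r][i] * z[i] for i in range(n)]
        z = z[1:] + z[:1]  # inputs move diagonally down-left (assumed circular)
    return y


if __name__ == "__main__":
    W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(modified_dg_layout(W))
    print(evaluate_modified_dg(W, [1, 0, -1], [0.5, 0.0, 0.0]))
```

Each column of the layout contains one full row of W, rotated so that the weight on row r always meets the input that has travelled r places to the left.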
The DG of Fig. 15.6 can be projected vertically down, as in the case of the DG of Fig. 15.4, to obtain a semi-systolic linear array consisting of N identical PEs, as shown in Fig. 15.7. The function of the PEs of this structure is described in Fig. 15.7(b). Each PE in this case also performs one multiplication and one addition during a cycle period. The input structure for the elements of the weight matrix W is also the same as that of the pure systolic structure of Fig. 15.5. Unlike the latter, the input values $x_i$ for i = 1, 2, ..., N in this case are loaded simultaneously to the individual PEs without staggering, and are transferred to the adjacent PEs on the left in the subsequent cycles. The elements of the input vector x thus move circularly across the array, and the output of the left-most PE in this case is connected to the input of the right-most PE. The results of computation in different PEs in this case
[Fig. 15.7 A semi-systolic array for the computation of equation (15.4). (a) The semi-systolic array: N PEs; the i-th PE is loaded with $x_i$, receives the weights $W_{ii}, W_{i(i+1)}, \ldots, W_{i(i-1)}$ in successive cycles and outputs $y_i$. (b) Function of each PE of the structure: Initialise: X ← Yin and S ← 0; for count = 1 to N, in every cycle do: S ← S + Win · X; Xout ← X; X ← Xin; count ← count + 1; end do; Yout ← S. Note: N' = N − 1.]
do not move, and get accumulated in the respective PEs. The sum of products computed in each PE is finally released simultaneously after N cycles as output. The semi-systolic structure also has a latency of N cycles, and the duration of each clock cycle equals the time required to perform one multiply-accumulate operation, $T = T_M + T_A$, as in the case of the pure systolic structure of Fig. 15.5. Unlike the latter, since all the outputs in this case are obtained after N cycles from the N PEs of the structure, the output of each PE can be reused as input in the same PE for the
[Fig. 15.8 An alternative dependence graph of the semi-systolic computation of equation (15.4). (a) The DG, with the same arrangement of weight values as in Fig. 15.6. (b) Function of each node of the DG: Zout ← Zin + Wij · Yin; Yout ← Yin. (c) Function of each PE of a semi-systolic array: Initialise: X ← Yin; Xout ← 0 and Xin ← 0; for count = 1 to N, in every cycle do: Xout ← Xin + Win · X; count ← count + 1; end do; Yout ← Xin.]
next iteration. In spite of its global communication, the semi-systolic structure of Fig. 15.7 may be suitable for fast implementation of the matrix-vector product due to its scope for reuse of computed values, which is discussed in detail in the next Section. The computation of (15.4) can alternatively be represented by a DG as shown in Fig. 15.8, where the distribution of the weight values is similar to that of the DG of Fig. 15.6, and the elements of the vector x are loaded to individual nodes on the first row. But the input values in this case move vertically down to the respective adjacent
nodes, while the computed results move diagonally down to the adjacent nodes in the left column on the next lower row. The output computed by the left-most node of a row of the DG is transferred to the right-most node on the next row of the DG. The DG of Fig. 15.8 can be projected vertically to obtain an array similar to the semi-systolic array of Fig. 15.7, where the function of the PEs is depicted in Fig. 15.8(c).
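To contrast the output timing of the two kinds of array discussed in this Section, the pure systolic array of Fig. 15.5 can be modelled cycle by cycle. The Python sketch below is only an illustrative model under stated assumptions: the partial sum for $y_i$ enters the left-most PE in cycle i already initialised with $\theta_i$, each $x_j$ is taken to be resident in its PE by the time it is needed (the effect of the staggered loading), and indices are 0-based. The name `pure_systolic_matvec` is hypothetical.

```python
def pure_systolic_matvec(W, x, theta):
    """Cycle-level model of a linear systolic array computing y = W x + theta.

    Returns a list of (cycle, i, y_i): the number of elapsed cycles after
    which y_i leaves the right-most PE.
    """
    n = len(W)
    pipe = [None] * n          # output register of each PE: (i, partial sum)
    finished = []
    for cycle in range(2 * n - 1):
        new_pipe = [None] * n
        for j in range(n):
            # PE 0 receives a fresh partial sum; the others read their left neighbour.
            if j == 0:
                incoming = (cycle, theta[cycle]) if cycle < n else None
            else:
                incoming = pipe[j - 1]
            if incoming is not None:
                i, partial = incoming
                # one multiply-accumulate per PE per cycle, using column j of W
                new_pipe[j] = (i, partial + W[i][j] * x[j])
        pipe = new_pipe
        if pipe[-1] is not None:
            i, y_i = pipe[-1]
            finished.append((cycle + 1, i, y_i))
    return finished


if __name__ == "__main__":
    W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    for cycles, i, y_i in pure_systolic_matvec(W, [1, 0, -1], [0.5, 0.0, 0.0]):
        print(f"after {cycles} cycles: y[{i}] = {y_i}")
```

For N = 3 the model delivers its outputs after 3, 4 and 5 cycles, one per cycle from the right-most PE, whereas the semi-systolic array releases all N sums together after N cycles from N different PEs.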
15.4 Systolic Architectures for ANN

In this Section, we describe the derivation of systolic architectures for the Hopfield net [28] and the multilayer ANN [29] with back-propagation learning [30], using the basic matrix-vector processing units discussed in Section 15.3. The structures presented here are closely similar to those suggested by Kung and Hwang in references [5] and [6], and those of Shams and Przytula in [2].
15.4.1 Systolic Architecture for Hopfield Net

The system dynamics of the Hopfield net [28] for the search phase can be presented as a matrix-vector product computation followed by a non-linear activation as

$$u_i(k+1) = W_{ij}(k) \cdot x_j(k) + \theta_i, \tag{15.5}$$

where the summation over the repeated index is assumed, and

$$x_i(k+1) = \frac{1 + \tanh\big[(u_i(k) + \eta \cdot u_i(k+1))/u_0\big]}{2} \quad \text{for } 1 \le i \le N. \tag{15.6}$$
The vector $u(k) = \{u_i(k), \text{ for } 1 \le i \le N\}$ represents the states of all the N neurons at the k-th iteration, and $\theta = \{\theta_i, \text{ for } 1 \le i \le N\}$ is the bias vector. The systolic and semi-systolic architectures for the matrix-vector product discussed in the last Section can be utilized for VLSI realization of Hopfield nets, as shown in Fig. 15.9. The array consists of N PEs, where N is the length of the input activation. Each of these PEs consists of two sub-cells, PE-1 and PE-2, and a circular-shift register. The set of circular-shift registers $R_i$ for 1 ≤ i ≤ N is used to feed the appropriate values of weights to the PEs, as shown in Fig. 15.9. The function of PE-1 is the same as that of the PEs of the semi-systolic structure of Fig. 15.7. The function of PE-2 is described in Fig. 15.9(b); it performs the desired non-linear function given by equation (15.6). Several techniques have been reported for efficient computation of the tanh function performed by these cells, which can be implemented in many different ways [31]-[34]. For low-complexity implementation of this function, one may use a CORDIC circuit or a look-up table consisting of $2^L$ words, where L is the word-length [31]. In the search phase, each PE may be treated as a neuron, where the weight vector $W_j = \{W_{jj}, W_{j(j+1)}, \ldots, W_{jN}, W_{j1}, W_{j2}, \ldots, W_{j(j-2)}, W_{j(j-1)}\}$ in the shift register $R_j$ of the j-th PE (for 1 ≤ j ≤ N) corresponds to the synaptic weights of the neuron. During the (k + 1)-th iteration of the search phase, the activation output $x_i(k)$ of the i-th PE (for 1 ≤ i ≤ N) from the k-th iteration is reloaded to the same PE, which moves
[Fig. 15.9 The linear array architecture for implementation of the search phase of the Hopfield net. (a) The array architecture: N PEs, the i-th consisting of PE-1, the register $R_i$ and PE-2, with input $x_i^{in}$, bias $\theta_i$ and output $x_i^{out}$. $R_i$ of the i-th PE is a circular-shift register that contains the i-th column of the weight matrix W as shown in Fig. 5. (b) Function of the non-linear processing cell PE-2: X ← η · (Xin + θin); X ← (Ain + X)/u0; Xout ← [1 + tanh(X)]/2. $u_0$ is the initial activation value stored in the PE and η is a constant.]
across the array from one PE to its adjacent PE in every computational cycle, such that each input activation visits each PE once in every N cycles. When $x_j(k)$ arrives at the i-th PE, it is multiplied with $W_{ij}$, and the product is accumulated in the same PE. N such products are accumulated in N consecutive cycles, and the accumulated sum is then transferred to the non-linear processing cell PE-2. The output activation $x_i(k+1)$ obtained from the non-linear processing cell of the i-th PE is reloaded to the same PE for the processing of the next iteration. The iterative process is continued until convergence is reached after a certain number of iterations, and the learning phase then starts for the adjustment of weights. The architecture derived for the search phase (Fig. 15.9) can be reused for the learning phase as follows:
1. Calculate the product of the error value $(d_i - x_i)$ and the learning rate η as $S_i = \eta \cdot (d_i - x_i)$, and store it in the i-th PE.
2. To calculate the weight increment terms according to (15.3), the converged activation values $x_j$ for 1 ≤ j ≤ N move across the PEs as in the case of the search phase
from a PE to its adjacent PE on the left in every computational cycle. When $x_j$ visits the i-th PE, it is multiplied with $S_i$ to find the current weight increment term $\Delta W_{ij}$ for updating the weights available from the circular-shift register.
3. After N clock cycles, all the activation values have passed through each PE, and all the N weights associated with each PE are adjusted.
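The search iteration and the three learning steps above can be sketched in Python as a behavioural model of the ring of PEs. This is an illustration under assumptions rather than the chapter's design: the bias is added to the accumulated sum before the non-linearity, following (15.5) and (15.6); the same symbol `eta` is used for the constant of (15.6) and for the learning rate, as in the text; indices are 0-based; and the function names are hypothetical.

```python
import math


def hopfield_search_iteration(W, x, u_prev, theta, eta, u0):
    """One search iteration on an N-PE ring; returns (u_new, x_new)."""
    n = len(W)
    acc = [0.0] * n                  # accumulators of the PE-1 cells
    held = list(x)                   # activation currently held in each PE
    for cycle in range(n):
        for i in range(n):
            j = (i + cycle) % n      # PE i currently holds x_j
            acc[i] += W[i][j] * held[i]
        held = held[1:] + held[:1]   # pass activations to the left neighbour (circular)
    u_new = [acc[i] + theta[i] for i in range(n)]              # equation (15.5)
    x_new = [(1.0 + math.tanh((u_prev[i] + eta * u_new[i]) / u0)) / 2.0
             for i in range(n)]                                # equation (15.6)
    return u_new, x_new


def hopfield_learning_pass(W, x, d, eta):
    """Delta-rule weight update of (15.3) using the same circular data movement."""
    n = len(W)
    s = [eta * (d[i] - x[i]) for i in range(n)]   # step 1: S_i stored in PE i
    held = list(x)
    W_new = [row[:] for row in W]
    for cycle in range(n):                         # step 2: x_j visits every PE
        for i in range(n):
            j = (i + cycle) % n
            W_new[i][j] += s[i] * held[i]          # increment S_i * x_j for W_ij
        held = held[1:] + held[:1]
    return W_new                                   # step 3: N weights per PE adjusted


if __name__ == "__main__":
    W = [[0.0, 1.0], [1.0, 0.0]]
    u, x_next = hopfield_search_iteration(W, [1.0, 0.0], [0.0, 0.0], [0.0, 0.0], 1.0, 1.0)
    print(u, x_next)
    print(hopfield_learning_pass(W, x_next, [1.0, 0.0], 0.5))
```

The index `(i + cycle) % n` plays the role of the circular-shift register $R_i$: it selects the weight that matches the activation visiting PE i in the current cycle.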
15.4.2 Systolic Architecture for Multilayer Neural Network

The multilayer perceptron model [29] is very popular due to its wide range of applications. In this Subsection, we show that the architectural design discussed in the previous Section can also be used for the implementation of the multilayer neural model. The system dynamics of the search phase of the l-th layer of an L-layer network is given by:

$$x_i(l) = f(u_i(l)), \tag{15.7}$$

where

$$u_i(l) = \sum_{j=1}^{N_{l-1}} W_{ij} \cdot x_j(l-1) + \theta_i(l) \quad \text{for } 1 \le i \le N_l \text{ and } 1 \le l \le L. \tag{15.8}$$
$N_l$ is the number of nodes in the l-th layer. For simplicity of presentation (without loss of generality), we have assumed that each layer consists of an equal number of nodes (i.e., $N_l = N$), and $\theta_i(l)$ is the bias input of the i-th neuron in the l-th layer. The computations of each neuron can be performed by a systolic array of the kind shown in Fig. 15.9 (discussed in Subsection 15.4.1), such that the computation of (15.7) and (15.8) for all the neurons of a layer can be realized in fully parallel form in L systolic arrays. The resulting mesh architecture, consisting of LN PEs arranged in L rows and N columns, is shown in Fig. 15.10. The bias values (not shown explicitly in the structure) are used to initialize the accumulation registers in the PEs. For a reduced-hardware implementation, the computation of the different layers may be performed by a single array structure by time-multiplexing, using a simple control unit and external storage elements to store the outputs of the neurons.
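The reduced-hardware option mentioned above can be sketched as follows: one ring of N PEs is reused for every layer, the accumulators are initialised with the bias values, and a plain list stands in for the external storage of layer outputs. This is a hedged behavioural model that assumes equal layer sizes; `ring_layer` and `time_multiplexed_forward` are hypothetical names, and indices are 0-based.

```python
def ring_layer(W, x_prev, theta, f):
    """Compute one layer, (15.7)-(15.8), on a ring of N PEs."""
    n = len(W)
    acc = list(theta)        # bias values initialise the accumulation registers
    held = list(x_prev)      # activations of the previous layer, one per PE
    for cycle in range(n):
        acc = [acc[i] + W[i][(i + cycle) % n] * held[i] for i in range(n)]
        held = held[1:] + held[:1]   # circular shift to the left neighbour
    return [f(u) for u in acc]


def time_multiplexed_forward(weights, biases, x0, f):
    """Reuse the same ring for all L layers; returns the stored outputs of every layer."""
    stored = [list(x0)]      # stands in for the external storage elements
    for W, theta in zip(weights, biases):
        stored.append(ring_layer(W, stored[-1], theta, f))
    return stored


if __name__ == "__main__":
    weights = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
    biases = [[0, 0], [1, 1]]
    print(time_multiplexed_forward(weights, biases, [1.0, 2.0], lambda u: u))
```

With L separate arrays, as in the mesh of Fig. 15.10, the loop over layers would instead become L rows of hardware, each row feeding the next.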
15.4.3 Systolic Implementation of Back-Propagation Algorithm

The back-propagation (BP) algorithm [30] is one of the most widely used learning schemes for multilayer neural nets. It is an iterative gradient-descent technique to minimize the mean-square error between the desired target values and the output values of the neurons in a multilayer neural net. The BP algorithm comprises two basic steps: (i) the feed-forward step and (ii) the reverse or backward step. In the feed-forward step, one of the input training patterns is fed to the network, and the output activation values are computed for each layer according to (15.7) and (15.8). In the reverse step, the differences between the desired targets and the output
[Fig. 15.10 Systolic mesh architecture for multilayer ANN: L rows of N PEs; row l takes $x_1(l-1), \ldots, x_N(l-1)$ and produces $x_1(l), \ldots, x_N(l)$.]
activation values (also called the error signals) are estimated for all the neurons at the output layer, and propagated back for weight adjustment of the preceding layers progressively backward. The weight adaptation of the l-th layer according to the BP algorithm pertaining to the m-th input pattern is given by

$$W_{ij}^{m}(l) = W_{ij}^{m-1}(l) + \eta \cdot \delta_i^{m}(l) \cdot x_j^{m}(l-1), \tag{15.9}$$

where the error signal $\delta_i^{m}(l)$ for updating the weights of the l-th layer is recursively computed as:

$$\delta_i^{m}(L) = \big(d_i^{m} - x_i^{m}(L)\big) \cdot f'\big(u_i^{m}(L)\big) \quad \text{for } l = L, \text{ and} \tag{15.10}$$

$$\delta_i^{m}(l) = \Big[\sum_{j} \delta_j^{m}(l+1) \cdot W_{ij}^{m-1}(l+1)\Big] \cdot f'\big(u_i^{m}(l)\big) \quad \text{for } l < L. \tag{15.11}$$
The feed-forward step is the same as that of the search phase and can be implemented by the structure of Fig. 15.10. For the weight adjustment by the back-propagation algorithm, the structure of the search phase can be reused with a simple modification. The formula for weight updating for the L-th layer is given by (15.10), which is similar to that of (15.3) except that η is replaced by the derivative of the non-linear function, f', and can be implemented by the structure discussed in Subsection 15.4.1. For all other values of l (i.e., for l < L in (15.11)), the error signal used in (15.10) is replaced by an inner-product of the N-point error vector and a row of the weight matrix. The inner-products of (15.11) can be easily realized by the semi-systolic structure of Fig. 15.7 using the PEs whose function is described in Fig. 15.8(c). The
[Fig. 15.11 Systolic array structure of weight updating by back-propagation algorithm: the PEs of the l-th layer hold $\delta_1(l), \ldots, \delta_N(l)$ and the PEs of the (l+1)-th layer hold $\delta_1(l+1), \ldots, \delta_N(l+1)$.]
array structure for updating of the weights $\{W_{ij}^{m-1}(l)\}$ and calculation of $\{\delta_i^{m}(l)\}$ is shown in Fig. 15.11, and is discussed in the following:
1. The weight value $W_{ij}^{m-1}(l+1)$ available from the local shift register of the j-th PE of the (l+1)-th layer is multiplied with the residing error value $\delta_j^{m}(l+1)$, and the product is then added to the accumulated products from the preceding PE on its right. The accumulated product for calculating $\delta_i^{m}(l)$ is initialized at the i-th PE, and then propagated circularly leftward across the array.
2. Similar operations are repeated for all i, for i = 1, 2, 3, ..., N. The accumulated value originating from the i-th PE returns to the same PE after adding up all the products relating to the inner product of (15.11). In the meantime, $f'(u_i^{m}(l))$ is calculated at the i-th PE on the l-th layer. The inner product (not shown in the figure) computed at the i-th PE of the (l+1)-th layer is transferred to the i-th PE on the l-th layer, where it is multiplied with $f'(u_i^{m}(l))$ to obtain $\delta_i^{m}(l)$.
3. $\delta_i^{m}(l)$ is then used by the i-th PE on the l-th layer for updating $\{W_{ji}^{m-1}(l)$ for j = i, i + 1, ..., N, 1, ..., i − 1$\}$ at the l-th layer and for calculation of $\delta_i^{m}(l-1)$ at the (l − 1)-th layer.
All these 3 steps are repeated for all the layers backward until the input layer is reached. The computation time of each layer here is proportional to the number of neurons on the layer. The hardware utilization of the structure will, therefore, be optimal when all the layers have an equal number of neurons. If different layers have different numbers of neurons, the layers with fewer neurons are required to wait until the layers with more neurons complete their computations. An architecture for unequal numbers of neurons is discussed in [6]. Optimized one- and two-dimensional systolic architectures are suggested in [7]-[9] for the reduction of computation time. Several other variations and optimizations of systolic architectures for the ANN models are reported in [10]-[15].
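The backward step can be sketched in Python as follows. The accumulation order mimics the circular, leftward movement of the partial sum described in step 1. Note the index convention, which is an assumption of this sketch: the weight on the connection from neuron i of layer l to neuron j of layer l+1 is stored as `W_next[j][i]`, following the connection convention of (15.8), and this is how the weight factor in (15.11) is read here. All function names are hypothetical, and indices are 0-based.

```python
def output_deltas(d, x_out, fprime_u):
    """Error signals of the output layer, equation (15.10)."""
    return [(d_i - x_i) * fp for d_i, x_i, fp in zip(d, x_out, fprime_u)]


def hidden_deltas(delta_next, W_next, fprime_u):
    """Error signals of layer l from those of layer l+1, equation (15.11)."""
    n = len(fprime_u)
    deltas = []
    for i in range(n):
        acc = 0.0
        pe = i                      # the partial sum for delta_i starts at PE i
        for _ in range(n):          # and visits every PE once, moving leftward
            acc += delta_next[pe] * W_next[pe][i]
            pe = (pe - 1) % n
        deltas.append(acc * fprime_u[i])   # multiplied by f'(u_i(l)) on layer l
    return deltas


def update_layer_weights(W, delta, x_prev, eta):
    """Weight adaptation of one layer, equation (15.9)."""
    return [[W[i][j] + eta * delta[i] * x_prev[j] for j in range(len(x_prev))]
            for i in range(len(W))]


if __name__ == "__main__":
    d_out = output_deltas([1.0, 0.0], [0.5, 0.5], [0.25, 0.25])
    d_hid = hidden_deltas(d_out, [[1.0, 2.0], [3.0, 4.0]], [1.0, 0.5])
    print(d_out, d_hid)
    print(update_layer_weights([[0.0, 0.0], [0.0, 0.0]], d_hid, [1.0, 2.0], 0.1))
```

In hardware the N partial sums circulate simultaneously, so the outer loop over i corresponds to N sums in flight at once rather than to N sequential passes.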
15.4.4 Implementation of Advanced Algorithms and Applications

Systolic realizations of several ANN applications relating to signal processing, image processing, pattern recognition and classification problems have been reported in the literature. Recurrent neural networks (RNNs) form the most general class of neural networks, in which every node can be connected to any other node. An RNN has the ability to implement highly non-linear dynamical systems of arbitrary complexity [35], [36]. A systolic array architecture (similar to the one discussed for the Hopfield net in Section 15.4.1) for the recurrent learning algorithm of [35] has been derived by Kechriotis and Manolakos [37] systematically from the DG formulation using the canonical mapping methodology, for the implementation of the retrieving phase as well as the learning phase. A highly regular and modular architecture for recurrent neural network implementation of a shortest-path processor in reconfigurable hardware and dedicated VLSI is suggested in [38]. Ramacher et al. have developed a neural signal processor that executes the compute-bound primitives shared by all neural nets for high-speed signal processing [39]. Vidal and Massicotte have presented an efficient architecture for channel equalization using a piecewise linear multilayer neural network [40]. Broomhead et al. have presented a fully systolic network based on the multilayer feed-forward perceptron model using the radial basis function for nonlinear system identification, nonlinear adaptive filtering and pattern classification [41]. Cavaiuolo et al. have presented a systolic neural network for image processing, and have discussed its advantages over conventional implementations [42]. VLSI architectures for pattern recognition and classification using ANN algorithms are discussed in [43] and [44]. A reconfigurable systolic implementation of a face recognition system based on a principal component neural network is presented in a recent paper [45]. A neural chip along with an analog vector quantizer for image processing applications is suggested by Sheu et al. in [46]. An interesting scheme for hybrid analog-digital systolization of neural networks is presented in [47].
15.5 Conclusion

Systolic array architectures, owing to their several advantageous features, are considered attractive for the implementation of computation-intensive ANN algorithms in custom VLSI and FPGA devices for real-time applications. The key techniques used for mapping ANN algorithms into systolic computing structures are discussed, and a brief overview of systolic architectures for different ANN applications is presented in this chapter. Along with the design of basic systolic building blocks for various ANN algorithms, the mapping of fully-connected unconstrained ANN, as well as multilayer ANN algorithms, into fully-pipelined systolic architectures is described by generalized dependence graph formulation. Readers may refer to the cited references for detailed discussions on hardware implementation of advanced ANN algorithms and extended forms of ANN for different applications. Interested readers may also find several variations and optimizations of systolic architectures for the ANN models in the references. Most of the VLSI structures suggested in the literature are meant for a particular topology of network,
and a specific training algorithm. Only a few of the architectures offer the flexibility of adapting to different learning processes or topologies and constitutions of the network [48]-[49]. Self-configuring and adaptive architectures could also be designed for complex multi-modal learning applications and for applications subjected to different constraints and environmental influences [50, 51]. It is observed that both the analog and the digital implementations have their inherent advantages and disadvantages. Therefore, it is expected that mixed analog-digital circuits might be able to deliver the best of the two for the VLSI implementation of different ANN models. Mixed analog-digital neural networks have significant potential to be deployed directly and more efficiently for various applications in signal processing, communication and instrumentation, where real-world interaction in the analog domain is very much prevalent during different phases of network operation.
References

1. Brown, J.R., Garber, M.M., Venable, S.F.: Artificial neural network on a SIMD architecture. In: Proceedings Frontiers of Massively Parallel Computation, pp. 43–47 (1988)
2. Shams, S., Przytula, K.W.: Mapping of neural networks onto programmable parallel machines. In: Proceedings IEEE International Symposium on Circuits and Systems, vol. 4, pp. 2613–2617 (1990)
3. Kung, S.Y.: Digital Neurocomputing. Prentice Hall, Englewood Cliffs (1992)
4. Kung, S.Y.: Tutorial: digital neurocomputing for signal/image processing. In: Proceedings of IEEE Workshop Neural Networks for Signal Processing, pp. 616–644 (1991)
5. Kung, S.Y., Hwang, J.N.: Parallel architectures for artificial neural nets. In: Proceedings of IEEE International Conference on Neural Networks, vol. 2, pp. 165–172 (1988)
6. Kung, S.Y., Hwang, J.N.: A unifying algorithm/architecture for artificial neural networks. In: Proceedings of International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 2505–2508 (1989)
7. Amin, H., Curtis, K.M., Hayes Gill, B.R.: Efficient two-dimensional systolic array architecture for multilayer neural network. Electronics Letters 33(24), 2055–2056 (1997)
8. Amin, H., Curtis, K.M., Hayes Gill, B.R.: Two-ring systolic array network for artificial neural networks. In: IEE Proceedings Circuits, Devices and Systems, vol. 164(5), pp. 225–230 (1999)
9. Myoupo, J.F., Seme, D.: A single-layer systolic architecture for back propagation learning. In: Proceedings of IEEE International Conference on Neural Networks, vol. 2, pp. 1329–1333 (1996)
10. Khan, E.R., Ling, N.: Systolic architectures for artificial neural nets. In: Proceedings of IEEE International Joint Conference on Neural Networks, vol. 1, pp. 620–627 (1991)
11. Zubair, M., Madan, B.B.: Systolic implementation of neural networks. In: Proceedings of IEEE International Conference on Computer Design: VLSI in Computers and Processors, pp. 479–482 (1989)
12. Pazienti, F.: Systolic array for neural network implementation. In: Proceedings 6th Mediterranean Electrotechnical Conference, vol. 2, pp. 981–984 (1991)
13. Girones, R.G., Salcedo, A.M.: Systolic implementation of a pipelined on-line back propagation. In: Proceedings Seventh International Conference on Microelectronics for Neural, Fuzzy and Bio-Inspired Systems, pp. 387–394 (1999)
14. Naylor, D., Jones, S.: A performance model for multilayer neural networks in linear arrays. IEEE Transactions on Parallel and Distributed Systems 5(12), 1322–1328 (1994)
15. Naylor, D., Jones, S., Myers, D.: Back propagation in linear arrays-a performance analysis and optimization. IEEE Transactions on Neural Networks 6(3), 583–595 (1995)
16. Kung, H.T.: Why systolic architectures? Computer 15(1), 37–46 (1982)
17. Kung, S.Y.: VLSI Array Processors. Prentice Hall, Englewood Cliffs (1988)
18. Parhi, K.K.: VLSI Digital Signal Processing Systems: Design and Implementation. Wiley-Interscience Publication, John Wiley & Sons, New York (1999)
19. Zhang, D., Pal, S.K. (eds.): Neural Networks and Systolic Array Design. World Scientific, River Edge (2002)
20. Ben Salem, A.K., Ben Othman, S., Ben Saoud, S.: Design and implementation of a neural command rule on a FPGA circuit. In: Proceedings 12th IEEE International Conference on Electronics, Circuits and Systems, pp. 1–4 (2005)
21. Liu, J., Liang, D.: A Survey of FPGA-Based Hardware Implementation of ANNs. In: Proceedings 1st International Conference on Neural Networks and Brain, pp. 915–918 (2005)
22. Mohan, A.R., Sudha, N., Meher, P.K.: An embedded face recognition system on a VLSI array architecture and its FPGA implementation. In: Proceedings 34th Annual Conference of IEEE Industrial Electronics, pp. 2432–2437 (2008)
23. Farhat, N.H., Psaltis, D., Prata, A., Paek, E.: Optical Implementation of the Hopfield Model. Applied Optics 24, 1469–1475 (1985)
24. Wagner, K., Psaltis, D.: Multilayer optical learning networks. Applied Optics 26, 5061–5076 (1987)
25. Mead, C.: Analog VLSI and neural systems. Addison Wesley, Reading (1989)
26. Sivilotti, M.A., Mahowald, M.A., Mead, C.A.: Real-time visual computations using analog CMOS processing arrays. In: Losleben, P. (ed.) Advanced Research on VLSI, pp. 295–312. MIT Press, Cambridge (1987)
27. Sheu, B.J., Choi, J.: Neural Information Processing and VLSI. Kluwer Academic Publishers, Dordrecht (1995)
28. Hopfield, J.J., Tank, D.W.: Neural computation of decisions in optimization problems. Biological Cybernetics 52, 141–154 (1985)
29. Rumelhart, D.E., McClelland, J.L.: Parallel and distributed processing: Explorations in the Microstructure of cognition. MIT Press, Cambridge (1986)
30. Werbos, P.: Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University, Cambridge, Mass. (1974)
31. Gisutham, B., Srikanthan, T., Asari, K.V.: A high speed flat CORDIC based neuron with multi-level activation function for robust pattern recognition. In: Proceedings Fifth IEEE International Workshop on Computer Architectures for Machine Perception, pp. 87–94 (2000)
32. Anna Durai, S., Siva Prasad, P.V., Balasubramaniam, A., Ganapathy, V.: A learning strategy for multilayer neural network using discretized sigmoidal function. In: Proceedings Fifth IEEE International Conference on Neural Networks, pp. 2107–2110 (1995)
33. Zhang, M., Vassiliadis, S., Delgado-Frias, J.G.: Sigmoid generators for neural computing using piecewise approximations. IEEE Transactions on Computers 45, 1045–1049 (1996)
34. Saichand, V., Nirmala, D.M., Arumugam, S., Mohankumar, N.: FPGA realization of activation function for artificial neural networks. In: Proceedings Eighth International Conference on Intelligent Systems Design and Applications, vol. 3, pp. 159–164 (2008)
15
Systolic VLSI and FPGA Realization of Artificial Neural Networks
379
35. Williams, R.J., Zipser, D.: Experimental analysis of the real-time recurrent learning algorithm. Connection Science 1, 87–111 (1989) 36. Williams, R.J., Zipser, D.: Gradient-based learning algorithms for recurrent networks and their computational complexity. In: Back-propagation: Theory, Architectures and Applications. Erlbaum, Hillsdale (1992) 37. Kechriotis, G., Manolakos, E.S.: A VLSI array architecture for the on-line training of recurrent neural networks. In: Conference Record of Asilomar Conference on the TwentyFifth Signals, Systems and Computers, vol. 1, pp. 506–510 (1991) 38. Shaikh-Husin, N., Hani, M.K., Teoh, G.S.: Implementation of recurrent neural network algorithm for shortest path calculation in network routing. In: Proceedings of International Symposium on Parallel Architectures, Algorithms and Networks, I-SPAN 2002, pp. 313–317 (2002) 39. Ramacher, U., Beichter, J., Bruls, N., Sicheneder, E.: Architecture and VLSI design of a VLSI neural signal processor. In: Proceedings IEEE International Symposium on Circuits and Systems, vol. 3, pp. 1975–1978 (1993) 40. Vidal, M., Massicotte, D.: A VLSI parallel architecture of a piecewise linear neural network for nonlinear channel equalization. In: Proceedings the 16th IEEE Conference on Instrumentation and Measurement Technology, vol. 3, pp. 1629–1634 (1999) 41. Broomhead, D.S., Jones, R., McWhirter, J.G., Shepherd, T.J.: A systolic array for nonlinear adaptive filtering and pattern recognition. In: Proceedings IEEE International Symposium on Circuits and Systems, vol. 2, pp. 962–965 (1990) 42. Cavaiuolo, M., Yakovleff, A.J.S., Watson, C.R., Kershaw, J.A.: A systolic neural network image processing architecture. In: Proceedings Computer Systems and Software Engineering, pp. 695–700 (1992) 43. Bermak, A., Martinez, D.: Digital VLSI implementation of a multi-precision neural network classifier. In: Proceedings 6th International Conference on Neural Information Processing, vol. 2, pp. 560–565 (1999) 44. 
Shadafan, R.S., Niranjan, M.: A systolic array implementation of a dynamic sequential neural network for pattern recognition. In: Proceedings IEEE World Congress on Computational Intelligence and IEEE International Conference on Neural Networks, vol. 4, pp. 2034–2039 (1994) 45. Sudha, N., Mohan, A.R., Meher, P.K.: Systolic array realization of a neural networkbased face recognition system. In: Proceedings 3rd IEEE Conference on Industrial Electronics and Applications, pp. 1864–1869 (2008) 46. Sheu, B.J., Chang, C.F., Chen, T.H., Chen, O.T.C.: Neural-based analog trainable vector quantizer and digital systolic processors. In: Proceedings IEEE International Symposium on Circuits and Systems, vol. 3, pp. 1380–1383 (1991) 47. Moreno, J.M., Castillo, F., Cabestany, J., Madrenas, J., Napieralski, A.: An analog systolic neural processing architecture. IEEE Micro. 14(3), 51–59 (1994) 48. Madraswala, T.H., Mohd, B.J., Ali, M., Premi, R., Bayoumi, M.A.: A reconfigurable ‘ANN’ architecture. In: Proceedings IEEE International Symposium on Circuits and Systems, vol. 3, pp. 1569–1572 (1992) 49. Jang, Y.-J., Park, C.-H., Lee, H.-S.: A programmable digital neuro-processor design with dynamically reconfigurable pipeline/parallel architecture. In: Proceedings International Conference on Parallel and Distributed Systems, pp. 18–24 (1998) 50. Patra, J.C., Lee, H.Y., Meher, P.K., Ang, E.L.: Field Programmable Gate Array Implementation of a Neural Network-Based Intelligent Sensor System. In: Proceeding International Conference on Control Automation Robotics and Vision, December 2006, pp. 333–337 (2006)
380
P.K. Meher
51. Patra, J.C., Chakraborty, G., Meher, P.K.: Neural Network-Based Robust Linearization and Compensation Technique for Sensors under Nonlinear Environmental Influences. IEEE Transactions on Circuits and Systems-I: Regular Papers 55(5), 1316–1327 (2008)
About the author

Pramod Kumar Meher received the B.Sc. and M.Sc. degrees in physics and the Ph.D. in science from Sambalpur University, Sambalpur, India, in 1976, 1978, and 1996, respectively. He has a wide scientific and technical background covering physics, electronics, and computer engineering. Currently, he is a Senior Scientist with the Institute for Infocomm Research, Singapore. Prior to this assignment he was a visiting faculty member with the School of Computer Engineering, Nanyang Technological University, Singapore. Previously, he was a Professor of computer applications with Utkal University, Bhubaneswar, India, from 1997 to 2002, a Reader in electronics with Berhampur University, Berhampur, India, from 1993 to 1997, and a Lecturer in physics with various Government Colleges in India from 1981 to 1993. His research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal processing, image processing, communication, and intelligent computing. He has published more than 140 technical papers in various reputed journals and conference proceedings. Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers (IETE), India, and a Fellow of the Institution of Engineering and Technology (IET), UK. He is currently serving as Associate Editor for the IEEE Transactions on Circuits and Systems-II: Express Briefs, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, and Journal of Circuits, Systems, and Signal Processing. He was the recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for the year 1999.
Chapter 16
Application of Coarse-Coding Techniques for Evolvable Multirobot Controllers

Jekanthan Thangavelautham, Paul Grouchy, and Gabriele M.T. D’Eleuterio
Abstract. Robots, in their most general embodiment, can be complex systems trying to negotiate and manipulate an unstructured environment. They ideally require an ‘intelligence’ that reflects our own. Artificial evolutionary algorithms are often used to generate a high-level controller for single-robot and multirobot scenarios. But evolutionary algorithms, for all their advantages, can be very computationally intensive. It is therefore very desirable to minimize the number of generations required for a solution. In this chapter, we incorporate the Artificial Neural Tissue (ANT) approach for robot control from previous work with a novel Sensory Coarse Coding (SCC) model. This model is able to exploit regularity in the sensor data of the environment. Determining how the sensor suite of a robot should be configured and utilized is critical for the robot’s operation. Much as nature evolves body and brain simultaneously, we should expect improved performance resulting from artificially evolving the controller and sensor configuration in unison. Simulation results on an example task, resource gathering, show that the ANT+SCC system is capable of finding fitter solutions in fewer generations. We also report on hardware experiments for the same task that show complex behaviors emerging through self-organized task decomposition.
Jekanthan Thangavelautham
Mechanical Engineering Department, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, MA, USA, 02139
e-mail: [email protected]

Paul Grouchy · Gabriele M.T. D’Eleuterio
Institute for Aerospace Studies, University of Toronto, 4925 Dufferin St., Toronto, Canada, M3H5T6
e-mail: {paul.grouchy,gabriele.deleuterio}@utoronto.ca

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 381–412. © Springer-Verlag Berlin Heidelberg 2010, springerlink.com

16.1 Introduction

Our motivation for evolutionary-based control approaches for multirobot systems originates in the use of robots for space exploration and habitat construction on alien planets and planetoids, such as Mars and the Moon. Potential scenarios include establishing a distributed antenna for communications, deploying a mobile array of actuators and sensors for geological measurements, or constructing elements of an outpost in preparation for the arrival of humans. These kinds of projects call for not just a single monolithic robotic system but teams of elemental robots working in collaboration and coordination. While space applications may require the use of a multiagent strategy, they are by no means the only ones. Consider, for example, terrestrial applications such as search and rescue, mapping, manufacturing and construction.

A number of factors make the team approach viable and attractive. Among them is increased reliability: one can afford to lose a member of the team without destroying the team’s integrity. A team approach can also offer increased efficiency through parallelization of operations. As such, multiagent systems are more readily scalable. Most important, however, a team can facilitate task decomposition. A complex task can be parsed into manageable subtasks which can be delegated to multiple elemental units.

Robotic systems can themselves be complex and their environments are generally unstructured. Sound control strategies are therefore not easy to develop. A methodical approach is not only desired but arguably required. It would be ideal if the controller could be automatically generated starting from a ‘blank slate,’ where the designer is largely relieved of the design process and detailed models of the systems or environment can be avoided. It is by such a road that we have come to the use of evolutionary algorithms for generating controllers that are based on neural-network architectures. We have developed and tested, both in simulation and hardware, a neuroevolutionary approach called the Artificial Neural Tissue (ANT).
This neural-network-based controller employs a variable-length genome consisting of a regulatory system that dictates the rate of morphological growth and can selectively activate and inhibit neuron ensembles through a coarse-coding scheme [31]. The approach requires an experimenter to specify a goal function, a sensory input layout for the robots and a repertoire of allowable basis behaviors. The control topology and its contents emerge through the evolutionary process.

But is it possible to evolve, in addition to the controllers themselves, the necessary sensor configurations and the selection of behavior primitives (motor-actuator commands) concurrently with the evolution of the controller? It is this question that we address in this work. In tackling this challenge, we turn to a key theme in our ANT concept, namely, coarse coding. Coarse coding is an efficient, distributed means of representation that makes use of multiple coarse receptive fields to represent a higher-resolution field. As is well known, nature exploits coarse coding in the brain and sensory systems. In artificial systems, coarse coding is used to interpret data. In ANT, however, it is the program (the artificial neural structure responsible for computation) that is coarse coded. This allows the development of spatially modular functionality in the architecture that mimics the modularity in natural brains.
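The representational idea behind coarse coding can be illustrated with a minimal Python sketch. It is a generic, one-dimensional example of coarse coding applied to data and is not the ANT mechanism, in which the program itself is coarse coded. The field layout `FIELDS` and the function names `active_fields` and `decode_interval` are assumptions introduced here for illustration only.

```python
# Hypothetical layout: four overlapping receptive fields, each 4 units wide,
# shifted by 2 units. Each field alone localizes a value only coarsely.
FIELDS = [(0, 4), (2, 6), (4, 8), (6, 10)]


def active_fields(x, fields):
    """Return the indices of all receptive fields [lo, hi) that contain x."""
    return [i for i, (lo, hi) in enumerate(fields) if lo <= x < hi]


def decode_interval(active, fields):
    """Intersect the active fields to recover a finer interval for x."""
    lo = max(fields[i][0] for i in active)
    hi = min(fields[i][1] for i in active)
    return lo, hi


active = active_fields(4.5, FIELDS)
print(active, decode_interval(active, FIELDS))
```

In this sketch the value 4.5 activates two fields of width 4, and their overlap narrows the estimate to an interval of width 2, which is the sense in which several coarse fields jointly represent a higher-resolution field.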
We present in this work a Sensory Coarse Coding (SCC) model that extends the capabilities of ANT and allows for the evolution of sensor configuration and coupled motor primitives. The evolution of sensor configuration and behavior primitives can be used to take advantage of regularities in the task space and can help to guide and speed up the evolution of the controller.

Training neural-network controllers for robotics using an evolutionary-algorithm approach like ANT is computationally expensive and tends to be used when other standard search algorithms like gradient descent are unsuitable or require substantial supervision for the given task space. The controllers are developed using a biologically motivated development process and replicated on one or more robotic platforms for evaluation. Training on hardware is often logistically difficult, requiring a long-term power source and a means of automating the controller evaluation process. The alternative is to simulate the robotic evaluation process on one or more computers. The bulk of the required time in training is the evaluation process. Genetic operations including selection, mutation and crossover tend to take less than one percent of the computational time. Therefore any method that can reduce the number of genetic evaluations will have a substantial impact on the training process. Furthermore, a significant reduction in the number of generations required can also make the training process feasible on hardware.

Robotic simulations often take into account the dynamics and kinematics of robotic vehicle interactions. However, care has to be taken to ensure the simulation environment resembles or is compatible with actual hardware. In other circumstances, it may not be beneficial to prototype and demonstrate capabilities and concepts in simulation before proceeding towards expensive hardware demonstration.
With robotics applications on the lunar surface, the low-gravity environment cannot be easily replicated on Earth and hence high-fidelity dynamics simulations may be needed to demonstrate aspects of system capability.

For multirobotic tasks, the global effect of local interactions between robots is often difficult to gauge, and the specific interactions required to achieve coordinated behavior may even be counterintuitive. Furthermore, it is not at all straightforward to determine the best sensor configuration. Often detailed analysis of the task needs to be performed to figure out the necessary coordination rules and sensory configurations. The alternative is to use optimization techniques to in effect shrink the search space sufficiently to enable evolutionary search algorithms to find suitable solutions. Evolving this configuration may also give useful insight into the sensors necessary for a task. This may help guide robotic designers in their design processes (we do not presume to make the designer completely redundant) by helping them determine which sensors and actuators are best to achieve a given objective. In addition, the evolution of the sensor configuration in conjunction with the controller would allow us to mimic nature more closely. Nature perforce evolves body and brain together.

The remainder of this chapter is organized as follows. First, we provide background to our problem by reviewing past work on the use of evolutionary algorithms for the development of multirobot controllers and on ‘body and brain’ evolution. We present the workings of the Artificial Neural Tissue approach followed by the Sensory Coarse Coding model. We refer to the integration of the latter into the former
as the ANT+SCC system. Next, we report on a number of experiments we have conducted to demonstrate the ANT+SCC system on a group of robots, concentrating on an example of resource gathering. This is followed by a discussion of the findings and finally we venture some concluding remarks.
16.2 Background

Coordination and control of multirobot systems are often inspired by biology. In nature, multiagent systems such as social insects use a number of mechanisms for control and coordination. These include the use of templates, stigmergy, and self-organization. Templates are environmental features perceptible to the individuals within the collective [3]. In insect colonies, templates may be a natural phenomenon or they may be created by the colonies themselves. They may include temperature, humidity, chemicals, or light gradients. Stigmergy is a form of indirect communication mediated through the environment [12]. One way in which ants and termites exploit stigmergy is through the use of pheromone trails. Self-organization describes how local or microscopic behaviors give rise to a macroscopic structure in systems [2]. However, many existing approaches suffer from another emergent feature called antagonism [5]. This is the effect that arises when multiple agents trying to perform the same task interfere with each other and reduce the overall efficiency of the group.

Within the field of robotics, many have sought to develop multirobot control and coordination behaviors based on one or more of the prescribed mechanisms used in nature. These solutions have been developed using user-defined deterministic ‘if-then’ rules or preprogrammed stochastic behaviors. Such techniques in robotics include template-based approaches that exploit light fields to direct the creation of walls [33] and planar annulus structures [34]. Stigmergy has been used extensively in collective-robotic construction tasks, including blind bulldozing [24], box pushing [21] and heap formation [1]. Inspired by insect societies, the robots are equipped with the necessary sensors required to demonstrate multirobot control and coordination behaviors. Furthermore, the robot controllers are often designed by hand to be reactive and have access only to local information.
They are nevertheless able to self-organize through cooperation to achieve an overall objective. This is difficult to do by hand, since the global effect of these local interactions is often very difficult to predict. The simplest hand-coded techniques design a controller for a single robot and scale it to multiple units by treating other units as obstacles to be avoided [1], [24]. Other more sophisticated techniques make use of explicit communication or design an extra set of coordination rules to gracefully handle agent-to-agent interactions [33]. These approaches are largely heuristic, rely on ad hoc assumptions that often require knowledge of the task domain and are implemented with a specified robot configuration in mind.
16.2.1 The Body and the Brain

The ecological balance principle [25] states that the complexity of an ‘agent’ must match the complexity of the task environment. In the natural world, this matching is done by the evolutionary process, which molds both the body and the brain of individual organisms for survival. Adaptive systems may evolve to exploit regularities in the task environment that concurrently impact the physical design and control. This has been demonstrated with artificial evolution of both the body and brain of artificial creatures.

Early work by Sims [27] explored breeding virtual creatures that could swim and hop in a simulated 3-D environment. A generative encoding system was used to describe the phenotype and was based on Lindenmayer’s L-system [19]. The creatures were fully described by the genome and composed of a hierarchical description outlining three-dimensional rigid parts and various joints. Reactive control laws were also evolved that described the interaction of the various parts. Framsticks, another method of evolving artificial body-brain systems, combined a neural-network controller with a specified morphology [17]. The resultant virtual creatures were evolved to perform locomotion in various environments, both on land and in water. Grammar-based techniques have also been used to evolve brains and bodies for robotic applications [22] and have demonstrated simple Braitenberg tasks. Work by Lipson and Pollack [20] demonstrated the evolution of robot designs and actuators that were realized by a 3-D printer. Standard components such as motors were added, allowing the system to perform locomotion. Further work in this area by Zykov et al. [35] has been directed towards evolving self-replicating machine configurations using a physical substrate.

In robotics, there are both potential advantages and disadvantages when designing both body and brain concurrently for solving individual tasks.
One advantage is that the end design may be specifically tuned towards solving a specialized task in an efficient manner (equivalent to finding a niche in nature). However, this may be at the cost of losing multipurpose capabilities. On the other hand, specific needs and performance considerations may warrant the need for special-purpose robots. In many practical situations, one may not have the resources necessary to design specialized robots for a task at hand. Hence one has to use standard robot configurations and implement a controller for this configuration. However, it may be practical to reconfigure placement of sensors on a standard robot configuration for use on a specified task. How best to configure these sensors and implement the associated controllers remains a crucial question in multirobot systems.
16.2.2 Task Decomposition

Sensor configuration is particularly key to task decomposition. The ability to partition or segment a complex task into subtasks is a vital capability for both natural and artificial systems. Part of solving a real-world task requires sensing the environment and using it to provide feedback when performing actions.
One of the advantages of multirobot systems, as mentioned above, is precisely the opportunity to facilitate task decomposition. With the use of such systems, control and coordination are critical. Both the control and coordination capabilities are dependent on the individual robot’s ability to sense its environment and its ability to perform actions.
16.2.3 Machine-Learning Techniques and Modularization

Machine-learning techniques, particularly artificial evolution, exploit self-organization and relieve the designer of having to determine a suitable control strategy and sensor configuration. Using these techniques, controllers and sensor configurations develop cooperation and interaction strategies by setting the evolving system loose in the environment. By contrast, it is difficult to design controllers by hand with cooperation in mind because it is difficult, given the complexity of the system as well as the environment, to predict or control the global behaviors that will result from local interactions.

One machine-learning technique to overcome the difficulties of design by hand is based on Cellular Automata (CA) look-up tables. A genetic algorithm can be used to evolve the table entries [7]. The assumption is that each combination of sensory inputs will result in a particular choice of output behaviors. This approach is an instance of a ‘tabula rasa’ technique. The control system starts off as a blank slate with limited assumptions regarding control architecture and is guided through training by a fitness function (system goal function). Such approaches can be used to obtain robust, scalable controllers that exploit multirobot mechanisms such as stigmergy and self-organization. Furthermore, these approaches are beneficial for hardware experiments as there is minimal computational overhead incurred, especially if onboard sensor processing is available.

One of the limitations of a look-up table approach is that the table size grows exponentially with the number of inputs. For a 3 × 3 tiling formation task, a single look-up table architecture is found to be intractable owing to premature search stagnation [29]. To address this limitation, the controller can be modularized into subsystems by exploiting regularities in the task environment.
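The exponential growth of a look-up table controller can be made concrete with a small Python sketch. It assumes, purely for illustration, that every sensor reports one of a fixed number of discrete states and that each table entry selects one behavior; the names `table_size`, `random_table` and `act`, and the behavior labels, are hypothetical and do not come from the cited work.

```python
import random
from itertools import product


def table_size(num_sensors, states_per_sensor):
    """Number of table entries: one per combination of sensor states."""
    return states_per_sensor ** num_sensors


def random_table(num_sensors, states_per_sensor, behaviors, rng):
    """Build a randomly initialized table mapping each input combination to a behavior.

    A genetic algorithm would treat the table entries as the genome to evolve;
    the random fill here only stands in for an initial individual.
    """
    keys = product(range(states_per_sensor), repeat=num_sensors)
    return {key: rng.choice(behaviors) for key in keys}


def act(table, sensor_reading):
    """Look up the behavior for the current combination of sensor states."""
    return table[tuple(sensor_reading)]


BEHAVIORS = ["forward", "turn_left", "turn_right"]  # illustrative labels only
table = random_table(3, 2, BEHAVIORS, random.Random(0))

for n in (4, 8, 16):
    print(n, "sensors ->", table_size(n, 3), "entries")
```

With three states per sensor, doubling the number of sensors from 8 to 16 squares the number of entries, which is why a single monolithic table quickly becomes impractical to search.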
These subsystems can explicitly communicate and coordinate actions with other agents. This act of dividing the agent functionality into subsystems is a form of user-assisted task decomposition. Such intervention requires domain knowledge of the task and ad hoc design choices to facilitate searching for a solution.

Use of neural networks is another form of modularization, where each neuron can communicate and perform some form of sensory information processing. The added advantage of neural-network architectures is that the neurons can generalize (unlike CAs) by recognizing correlations between a combination of sensory inputs, thus effectively shrinking the search space. Fixed-topology neural-network architectures have been used extensively for multirobot tasks, including building walls [33], tile formation [30], and cooperative transport [13]. However, monolithic fixed-topology neural-network architectures also face scalability problems. With an increasing number of hidden neurons, one must contend with the effects of spatial crosstalk, where noisy neurons interfere and drown out signals from feature-detecting neurons [16]. Crosstalk in combination with limited supervision (through use of a global fitness function) can lead to the ‘bootstrap problem’ [23], where evolutionary algorithms are unable to pick out incrementally fitter solutions, resulting in premature stagnation of the evolutionary run. Thus, choosing the wrong network topology may lead to a situation that is either unable to solve a task or is difficult to train [31].
16.2.4 Fixed versus Variable Topologies

Fixed-topology architectures accordingly have limitations, particularly in robotics, for the very reason that the topology must be determined a priori and there is no opportunity to modify it without starting over. However, variable-topology architectures allow for the evolution of both the network architecture and the neuronal weights simultaneously. The genotypes for these systems are encoded in a one-to-one mapping such as in Neuro-Evolution of Augmenting Topologies (NEAT) [28]. The use of recursive rewriting of the genotype contents to produce a phenotype is used in methods such as Cellular Encoding [14], L-systems [27] and artificial ontogeny [8]. Ontogeny (morphogenesis) models developmental biology and includes a growth program in the genome that starts from a single egg and subdivides into specialized daughter cells. Other morphogenetic systems include [4] and Developmental Embryonal Stages (DES) [10].

The growth program within many of these morphogenetic systems is controlled through artificial gene regulation. Artificial gene regulation is a process in which gene activation/inhibition regulates (and is regulated by) the expression of other genes. Once the growth program has been completed, there is no further use for gene regulation within the artificial system, which is in stark contrast to biological systems where gene regulation is always present. These variable topologies also have to be grown incrementally starting from a single cell in order to minimize the dimensional search space as the size of the network architecture may inadvertently make training difficult [28]. With recursive rewriting of the phenotype, limited mutations can result in substantial changes to the growth program. Such techniques also introduce a deceptive fitness landscape where limited fitness sampling of a phenotype may not correspond well to the genotype, resulting once again in premature search stagnation [26].
The Artificial Neural Tissue concept [31] is intended to address limitations evident with existing variable topologies through the modeling of a number of biologically plausible mechanisms. ANT also uses a nonrecursive genotype-to-phenotype mapping, avoiding deceptive fitness landscapes, and includes gene duplication similar to DES. Gene duplication produces redundant copies of a master gene and facilitates neutral ‘complexification,’ where the copied gene undergoes mutational drift and results in the expression of incremental innovations [10].
16.2.5 Regularity in the Environment

Most of the developmental systems described above deal with techniques used to facilitate evolution of network topologies. A critical advantage of the neural-network approach is its ability to generalize. Generalization often relies on regularities and patterns in the sensory input space to be effective. However, the sensor configuration used by these developmental controllers still needs to be specified by the experimenter. As discussed earlier, in a multirobot environment, it is often counterintuitive as to what control rules are necessary to obtain desired global behaviors. Furthermore, it is difficult to know what sensor configuration is necessary to facilitate these control rules. A number of techniques, including the one presented here, attempt to address this limitation. By allowing the evolutionary search process to modify sensor configuration and geometry, it is expected that this will facilitate finding efficient solutions with fewer genetic evaluations.

Blondie24 [6], an evolved checkers-playing neural network, is one of the early efforts to exploit regularity in task space. A standard fixed neural-network topology was designed to take into account the regularity of the checkerboard. Hidden nodes of the network were tied to subsquares (cell regions arranged in a square shape). The resultant network was coevolved with real players on an Internet checkers server and reached an expert level of proficiency in the game. HyperNEAT extends NEAT’s capabilities by combining a variable neural-network topology with a hypercube-based generative description of the topology [11]. Instead of encoding every weight in the network separately in the genome, HyperNEAT uses a type of Compositional Pattern Producing Network (CPPN) to produce a network that represents the weight (connectivity) parameters of the phenotype network (controller).
It also allows for the CPPNs to represent symmetries and regularities from the geometry of the task inputs directly in the controller. Geometric regularities can also be extracted using coarse-coding techniques. Coarse coding allows for the partitioning of separate geometric locations and thus allows for each part to be learned separately through task decomposition [11]. While this is advantageous, it has also been argued that it may prevent the learning system from discovering interdimensional regularities and alternate approaches have been shown to overcome this limitation using a priori knowledge [18]. However, as we show here, this capability can also be evolved. ANT+SCC extends the ANT approach with a sensor mapping and filtering scheme. This functionality is performed by a group of sensory neurons that interact in a coarse-coding fashion. This interaction helps determine resultant sensor geometry and resolution and aids in the filtering process. The output from these sensor neurons feeds into an ANT controller where higher-level processing is performed. This laminar scheme bears some resemblance in functionality to how the visual cortex in the mammalian brain operates with lower layers performing sensory filtering such as edge and line detection which in turn is used by higher-level functionality to make, for example, optical flow measurements.
Coarse-Coding Techniques for Evolvable Multirobot Controllers
16.3 Artificial Neural Tissue Model

The ANT architecture (Fig. 16.1a) presented in this paper consists of a developmental program, encoded in the ‘genome,’ that constructs a three-dimensional neural tissue and associated regulatory functionality. The tissue consists of two types of neural units, decision neurons and motor-control neurons, or simply motor neurons. Regulation is performed by the decision neurons, which dynamically excite or inhibit motor-control neurons within the tissue based on a coarse-coding technique. The following sections discuss the computational aspects of the tissue and how it is created.
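16.3.1 Computation

We imagine the motor neurons of our network to be spheres arranged in a regular rectangular lattice in which the neuron $N_\lambda$ occupies the position $\lambda = (l, m, n) \in I^3$ (sphere centered within cube). The state $s_\lambda$ of the neuron is binary, i.e., $s_\lambda \in S = \{0, 1\}$. Each neuron $N_\lambda$ nominally receives inputs from neurons $N_\kappa$ where $\kappa \in \Uparrow(\lambda)$, the nominal input set. Here we shall assume that these nominal inputs are the 3 × 3 neurons centered one layer below $N_\lambda$; in other terms, $\Uparrow(\lambda) = \{(i, j, k) \mid i = l-1, l, l+1;\ j = m-1, m, m+1;\ k = n-1\}$. (As will be explained presently, however, we shall not assume that all the neurons are active all the time.) The activation function of each neuron is taken from among four possible threshold functions of the weighted input $\sigma$:

\[
\begin{aligned}
\psi_{down}(\sigma, \theta_1) &= \begin{cases} 0, & \text{if } \sigma \ge \theta_1 \\ 1, & \text{otherwise} \end{cases} \\
\psi_{up}(\sigma, \theta_2) &= \begin{cases} 0, & \text{if } \sigma \le \theta_2 \\ 1, & \text{otherwise} \end{cases} \\
\psi_{ditch}(\sigma, \theta_1, \theta_2) &= \begin{cases} 0, & \min(\theta_1, \theta_2) \le \sigma < \max(\theta_1, \theta_2) \\ 1, & \text{otherwise} \end{cases} \\
\psi_{mound}(\sigma, \theta_1, \theta_2) &= \begin{cases} 0, & \sigma \le \min(\theta_1, \theta_2) \text{ or } \sigma > \max(\theta_1, \theta_2) \\ 1, & \text{otherwise} \end{cases}
\end{aligned}
\tag{16.1}
\]

The weighted input $\sigma_\lambda$ for neuron $N_\lambda$ is nominally taken as

\[
\sigma_\lambda = \frac{\sum_{\kappa \in \Uparrow(\lambda)} w_\kappa^\lambda s_\kappa}{\sum_{\kappa \in \Uparrow(\lambda)} s_\kappa}
\tag{16.2}
\]

with the proviso that $\sigma = 0$ if the numerator and denominator are zero. Also, $w_\kappa^\lambda \in R$ is the weight connecting $N_\kappa$ to $N_\lambda$. We may summarize these threshold functions in a single analytical expression as

\[
\psi = (1 - k_1)\left[(1 - k_2)\psi_{down} + k_2 \psi_{up}\right] + k_1\left[(1 - k_2)\psi_{ditch} + k_2 \psi_{mound}\right]
\tag{16.3}
\]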
J. Thangavelautham, P. Grouchy, and G.M.T. D’Eleuterio
where $k_1$ and $k_2$ can take on the value 0 or 1. The activation function is thus encoded in the genome by $k_1$, $k_2$ and the threshold parameters $\theta_1, \theta_2 \in R$. It may appear that $\psi_{down}$ and $\psi_{up}$ are mutually redundant, as one type can be obtained from the other by reversing the signs on all the weights. However, retaining both increases diversity in the evolution because a single 2-bit ‘gene’ is required to encode the threshold function and only one mutation suffices to convert $\psi_{down}$ into $\psi_{up}$ or vice versa, as opposed to changing the sign of every weight.

The sensor data are represented by the activation of the sensor input neurons $N_{\alpha_i}$, $i = 1 \dots m$, summarized as $A = \{s_{\alpha_1}, s_{\alpha_2} \dots s_{\alpha_m}\}$. Similarly, the output of the network is represented by the activation of the output neurons $N_{\omega_j^k}$, $j = 1 \dots n$, summarized as $\Omega = \{s_{\omega_1^1}, s_{\omega_2^1} \dots s_{\omega_n^b}\}$, where $k = 1 \dots b$ specifies the output behavior. Each output neuron commands one behavior of the agent. (In the case of a robot, a typical behavior may be to move forward a given distance. This may result in the coordinated action of several actuators. Alternatively, the behavior may be more primitive, such as augmenting the current of a given actuator.) If $s_{\omega_j^k} = 1$, output neuron $\omega_j^k$ votes to activate behavior $k$; if $s_{\omega_j^k} = 0$, it does not. Since multiple neurons can have access to a behavior pathway, an arbitration scheme is imposed to ensure the controller is deterministic, where $p(k) = \sum_{j=1}^{n_k} s_{\omega_j^k} / n_k$ and $n_k$ is the number of output neurons connected to output behavior $k$, resulting in behavior $k$ being activated if $p(k) \ge 0.5$.

As implied by the set notation of $\Omega$, the outputs are not ordered. In this embodiment, the order of activation is selected randomly. We are primarily interested here in the statistical characteristics of relatively large populations, but such an approach would likely not be desirable in a practical robotic application. However, this can be remedied by simply assigning a sequence a priori to the activations (as shown in Table 16.2 for the resource gathering task). We moreover note that the output neurons can be redundant; that is, more than one neuron can command the same behavior, in which case for a given time step one behavior may be “emphasized” by being voted multiple times. Neurons may also cancel each other out.
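The computation described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the vote arbitration assumes that `votes` holds the binary states of exactly those output neurons connected to one behavior $k$.

```python
from typing import Sequence


def psi_down(sigma: float, theta1: float) -> int:
    return 0 if sigma >= theta1 else 1


def psi_up(sigma: float, theta2: float) -> int:
    return 0 if sigma <= theta2 else 1


def psi_ditch(sigma: float, theta1: float, theta2: float) -> int:
    lo, hi = min(theta1, theta2), max(theta1, theta2)
    return 0 if lo <= sigma < hi else 1


def psi_mound(sigma: float, theta1: float, theta2: float) -> int:
    lo, hi = min(theta1, theta2), max(theta1, theta2)
    return 0 if (sigma <= lo or sigma > hi) else 1


def activation(sigma: float, k1: int, k2: int, theta1: float, theta2: float) -> int:
    """Eq. (16.3): the 2-bit gene (k1, k2) selects one of the four thresholds."""
    return ((1 - k1) * ((1 - k2) * psi_down(sigma, theta1) + k2 * psi_up(sigma, theta2))
            + k1 * ((1 - k2) * psi_ditch(sigma, theta1, theta2)
                    + k2 * psi_mound(sigma, theta1, theta2)))


def weighted_input(weights: Sequence[float], states: Sequence[int]) -> float:
    """Eq. (16.2): weighted sum of input states divided by the number of active inputs."""
    active = sum(states)
    if active == 0:  # proviso: sigma = 0 when numerator and denominator vanish
        return 0.0
    return sum(w * s for w, s in zip(weights, states)) / active


def behavior_activated(votes: Sequence[int]) -> bool:
    """Arbitration: behavior k fires if p(k) >= 0.5 (assumed: votes are for one behavior)."""
    if not votes:
        return False
    return sum(votes) / len(votes) >= 0.5
```

For example, `activation(0.3, 1, 0, 0.5, 0.0)` selects the ditch function and returns 0 because 0.3 lies between the two thresholds, whereas the same input under the mound function (`k1 = 1, k2 = 1`) returns 1.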
16.3.2 The Decision Neuron The coarse-coding nature of the artificial neural tissue is provided by the decision neurons. Decision neurons can be thought of as rectangular structures occupying nodes in the lattice as established by the evolutionary process (Fig. 16.1). The effect of these neurons is to excite into operation or inhibit (disable) the motor control neurons (shown as spheres). Once a motor control neuron is excited into operation, the computation outlined in (16.2) is performed. Motivated as we are to seek biological support for ANT, we may look to the phenomenon of chemical communication among neurons. In addition to communicating electrically along axons, some neurons release chemicals that are read by other neurons, in essence serving as a “wireless” communication system to complement the “wired” one.
Fig. 16.1 Synaptic connections between motor neurons and operation of neurotransmitter field: (a) synaptic connections and (b) coarse coding
The state $s_\mu$ of a decision neuron $T_\mu$ is binary and determined by one of the same activation functions (16.1) that is used to calculate the output of a motor control neuron. The inputs to $T_\mu$ are all the input sensor neurons $N_\alpha$; i.e., $s_\mu = \psi_\mu(s_{\alpha_1} \dots s_{\alpha_m})$ where $\sigma_\mu = \sum_\alpha v_\alpha^\mu s_\alpha / \sum_\alpha s_\alpha$ and $v_\alpha^\mu$ are the weights. The decision neuron is dormant if $s_\mu = 0$ and releases a virtual neurotransmitter chemical of uniform concentration $c_\mu$ over a prescribed field of influence if $s_\mu = 1$. Motor control neurons within the highest chemical concentration field are excited into operation. Only those neurons that are so activated will establish the functioning network for the given set of input sensor data. Owing to the coarse-coding effect, the sums used in the weighted input of (16.2) are over only the set $\bar{\Uparrow}(\lambda) \subseteq \Uparrow(\lambda)$ of active inputs to $N_\lambda$. Likewise the output of ANT is in general $\bar{\Omega} \subseteq \Omega$. The decision neuron's field of influence is taken to be a rectangular box extending $\pm d_\mu^r$, where $r = 1, 2, 3$, from $\mu$ in the three perpendicular directions. These three dimensions along with $\mu$ and $c_\mu$, the concentration level of the virtual chemical emitted by $T_\mu$, are encoded in the genome.
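A minimal Python sketch of this coarse-coding selection is given below. It rests on two assumptions that the text does not state explicitly: concentrations from overlapping fields of influence are summed, and the motor neurons excited are those sitting at the peak (positive) concentration. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

Pos = Tuple[int, int, int]


@dataclass
class DecisionNeuron:
    mu: Pos                      # lattice position of the decision neuron
    d: Tuple[int, int, int]      # half-extents of the box along the three axes
    c: float                     # concentration emitted when active
    active: bool                 # True when s_mu = 1


def concentration_at(pos: Pos, neurons: Iterable[DecisionNeuron]) -> float:
    """Sum the concentrations of all active fields that contain pos (assumed additive)."""
    total = 0.0
    for t in neurons:
        inside = all(abs(p - m) <= r for p, m, r in zip(pos, t.mu, t.d))
        if t.active and inside:
            total += t.c
    return total


def excited_motor_neurons(motor_positions: List[Pos],
                          neurons: List[DecisionNeuron]) -> Set[Pos]:
    """Return the motor neuron positions lying in the highest-concentration region."""
    conc = {p: concentration_at(p, neurons) for p in motor_positions}
    peak = max(conc.values(), default=0.0)
    if peak <= 0.0:              # no decision neuron is releasing chemical
        return set()
    return {p for p, value in conc.items() if value == peak}
```

With two overlapping active fields, only the motor neurons in the overlap reach the peak concentration, which is how several coarse fields can single out a fine region of the tissue.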
16.3.3 Evolution and Development

A population of ANT controllers is evolved in an artificial Darwinian manner. The ‘genome’ for a controller contains a ‘gene’ for each cell with a specifier D that is used to distinguish the functionality (between motor control, decision and tissue). A constructor protein (an autonomous program) interprets the information encoded in the gene and translates this into a cell descriptor protein (see Fig. 16.2). The gene ‘activation’ parameter is a binary flag resident in all the cell genes and is used to either express or repress the contents of the gene. When repressed, a descriptor protein of the gene content is not created. Otherwise, the constructor protein ‘grows’ the tissue in which each cell is located relative to a specified seed-parent address. A cell death flag determines whether the cell commits suicide after being grown. Once again, this feature in the genome helps the evolutionary process: a cell that commits suicide still occupies a volume in the lattice, although it is dormant. Because the cell otherwise retains its characteristics, evolution can reinstate it by merely toggling a bit.

Fig. 16.2 Gene map for the Artificial Neural Tissue
Fig. 16.3 Genes are ‘read’ by constructor proteins that transcribe the information into a descriptor protein which is used to construct a cell. When a gene is repressed, the constructor protein is prevented from reading the gene contents
In turn mutation (manipulation of gene parameters with uniform random distribution) to the growth program results in new cells being formed through cell division. The rate at which mutation occurs to a growth program is also specified for each tissue and is dependent on the neuron replication probability parameter. Cell division requires a parent cell (selected with highest replication probability relative to the rest of the cells within the tissue) and results in copying m% of the original cell contents to a daughter cell (where m is determined based on a uniform random distribution), with the remaining cell contents initialized with a uniform random distribution. The cell type of each new cell is determined based on the ratio of motor control to decision neurons specified in the tissue gene. The new cell can be located in one of six neighboring locations (top, bottom, north, south, east, west), sharing a common side with the parent, as long as the volume is not occupied by another cell.
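The cell-division step can be illustrated with the following Python sketch. It is a simplified reading of the description above, not the authors' code: cell contents are modeled as a flat list of real-valued parameters, the range used to re-initialize uncopied parameters is an arbitrary assumption, and all names are hypothetical.

```python
import random
from typing import Dict, List, Optional, Set, Tuple

Pos = Tuple[int, int, int]

# top, bottom, north, south, east, west: the six face-sharing neighbors
NEIGHBOR_OFFSETS: List[Pos] = [(0, 0, 1), (0, 0, -1), (0, 1, 0),
                               (0, -1, 0), (1, 0, 0), (-1, 0, 0)]


def select_parent(replication_prob: Dict[str, float]) -> str:
    """Pick the cell with the highest replication probability in the tissue."""
    return max(replication_prob, key=replication_prob.get)


def divide(parent_pos: Pos, parent_genes: List[float], occupied: Set[Pos],
           rng: random.Random,
           gene_range: Tuple[float, float] = (-1.0, 1.0)
           ) -> Optional[Tuple[Pos, List[float]]]:
    """Create a daughter cell next to the parent, or return None if no site is free."""
    free = [tuple(p + o for p, o in zip(parent_pos, off)) for off in NEIGHBOR_OFFSETS]
    free = [pos for pos in free if pos not in occupied]
    if not free:
        return None
    daughter_pos = rng.choice(free)
    m = rng.random()                               # fraction of contents copied
    n_copy = round(m * len(parent_genes))
    copied = set(rng.sample(range(len(parent_genes)), n_copy))
    lo, hi = gene_range                            # assumed initialization range
    genes = [g if i in copied else rng.uniform(lo, hi)
             for i, g in enumerate(parent_genes)]
    return daughter_pos, genes
```

The daughter's cell type (motor control or decision neuron) would then be drawn according to the ratio stored in the tissue gene; that step is omitted here.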
16.3.4 Sensory Coarse Coding Model

In this section we present the Sensory Coarse Coding (SCC) model. The model includes two components that allow for filtering and mapping locations of sensory inputs. Sensory Coarse Coding provides additional functionality not found within the ANT model, namely the ability to search for spatial mappings of sensory inputs while simultaneously filtering these inputs for further processing. Biological motivation for this capability comes from analyzing the visual cortex.

Within the tissue architecture, we include one additional type of neuron, the sensor neuron. There exists a group of $v$ sensory neurons $\Pi = [\Phi_{\tau_1}, \Phi_{\tau_2} \dots \Phi_{\tau_v}]$, where each neuron has a position $\tau = (l, m)$, $l \in [0, h-1] + 0.5$, $m \in [0, h-1] + 0.5$ on a spatial map spanning $h \times h$ grid squares and representing the agent/robot and its surroundings (Fig. 16.4 left). The state $s_\tau$ of a sensor neuron can assume one of up to $q$ states, i.e., $s_\tau \in A = \{s_{\alpha_1}, s_{\alpha_2} \dots s_{\alpha_q}\}$. Each neuron $\Phi_\tau$ receives input from spatial map locations $L_\varphi$, where $\varphi \in \Uparrow(\tau)$, the input set. Each grid square $L_\varphi$ assumes a sensor reading, one of $q$ states, i.e., $L_\varphi \in A$. Here we shall assume that the receptive field for this sensor neuron is a bounded area containing $a_\tau \times b_\tau$ grid squares centered at $(l, m)$, i.e., $\Uparrow(\tau) = \{(i, j) \mid i \in [l - a_\tau/2,\ l + a_\tau/2],\ j \in [m - b_\tau/2,\ m + b_\tau/2]\}$.

The sensory neurons are not necessarily fed their entire input set. A coarse coding system is used to decide which inputs, if any, a sensory neuron will receive. Each sensory neuron emits a stimulus chemical in the area $\Uparrow(\tau)$ such that the amount of chemical diffused at location $\varphi$ due to sensory neuron $\Phi_{\tau_i}$ is the following:

\[
c_{\varphi, \tau_i} = \begin{cases} 1, & \text{if } \varphi \in \Uparrow(\tau_i) \\ 0, & \text{otherwise} \end{cases}
\tag{16.4}
\]

Therefore, the net concentration of chemical diffused due to the $v$ sensory neurons at $\varphi$ is:

\[
c_\varphi = \sum_{i=1}^{v} c_{\varphi, \tau_i}
\tag{16.5}
\]
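Equations (16.4) and (16.5) amount to counting how many receptive fields cover each grid square. The sketch below illustrates this in Python under one stated assumption: grid square $(i, j)$ is taken to have its center at $(i + 0.5, j + 0.5)$ and to belong to a receptive field when that center lies within the bounds given for $\Uparrow(\tau)$. The function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Square = Tuple[int, int]


def receptive_field(tau: Tuple[float, float], a: float, b: float, h: int) -> Set[Square]:
    """Grid squares of an h x h map whose centers fall inside the a x b field at tau."""
    l, m = tau
    return {(i, j)
            for i in range(h) for j in range(h)
            if abs(i + 0.5 - l) <= a / 2 and abs(j + 0.5 - m) <= b / 2}


def net_concentration(fields: Iterable[Set[Square]], h: int) -> List[List[int]]:
    """Eq. (16.5): each field adds one unit of chemical, Eq. (16.4), to every square it covers."""
    c = [[0] * h for _ in range(h)]
    for field in fields:
        for i, j in field:
            c[i][j] += 1
    return c


# Two sensor neurons on a 4 x 4 map: a coarse 3 x 3 field and a fine 1 x 1 field.
fields = [receptive_field((1.5, 1.5), 3, 3, 4),
          receptive_field((2.5, 2.5), 1, 1, 4)]
concentration = net_concentration(fields, 4)
```

In this example the square (2, 2) is covered by both fields and therefore carries a net concentration of 2, while squares covered only by the coarse field carry 1.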
Fig. 16.4 (Left) Three sensor neurons and their respective receptive field shown shaded. With S = 1, only Φ2 is selected since ∑ϕ ∈⇑(τ ) cϕ is the highest among the three. (Right) Once Φ2 is activated, only locations with the highest chemical concentrations (shaded in dark gray) are fed as inputs to the evolved priority filter. The result is a single output from the neuron, indicating red
To determine which grid squares a sensory neuron $\Phi_\tau$ will receive, the chemical concentration at each location $\varphi \in \Uparrow(\tau)$ is calculated. The states of the locations $L_\varphi$ that have a maximum chemical concentration in the grid are fed to the sensory neuron inputs. If $\sum_{\varphi \in \Uparrow(\tau)} c_\varphi = 0$, sensory neuron $\Phi_\tau$ is deactivated. Furthermore, the $S$ sensor neurons with the highest $\sum_{\varphi \in \Uparrow(\tau)} c_\varphi$ are activated, where $S \in I$ can be evolved within the tissue gene or be specified for a task. Therefore, we define $I_\tau = \{L_\varphi \mid c_\varphi = \max_{\varphi \in \Uparrow(\tau)} c_\varphi\}$ as the input set after coarse coding to sensory neuron $\Phi_\tau$. For sensor neurons that are active, we calculate $s_\tau$:

\[
s_\tau = \min_{i \in [1, \dots, q]} (p_i) \;\Big|\; p_j \cap I_\tau = \emptyset,\ \forall j < i
\tag{16.6}
\]

where $p_j$ is an element of a global priority list $P$ of sensory states, $P = [p_1 \dots p_q]$ and where $p_j \in A$. The global priority list is obtained by polling a group of filter units and is described in Section 16.3.4.1. In summary, each sensory neuron takes inputs from $\Uparrow(\tau)$ and produces a single output $s_\tau$, where both inputs and outputs are restricted to the states in $A$. This reduction of inputs to a single output is done through prioritized filtering using the global priority list $P$. Thus if a sensory neuron's input set $\Uparrow(\tau)$ contains one or more states $p_1$, the sensory neuron's output $s_\tau$ is set to $p_1$, regardless of its other input states. Similarly, if a sensory neuron's input set contains one or more states $p_2$ and no states $p_1$, the output is set to $p_2$, regardless of its other inputs, and so on down the priority list.

16.3.4.1 Input Filtering
The priority list $P$ is generated by polling a group of $n$ filter units. Each of these independent units takes $q$ weighted inputs and produces a single binary output using the threshold activation function $\psi_{up}$ from (16.1). Each filter unit $j$ has $q$ weights $w_{jk}$, $1 \le k \le q$. To poll the filter units for a particular input state $s_{\alpha_k} \in A$, the units are given an input vector of size $q$ containing all zeros, except at position $k$, which is set to one, and their outputs are summed to yield $V_{s_{\alpha_k}}$.
Thus to tally the votes for input state $s_{\alpha_3}$, the filter units receive the input vector [0 0 1 0 . . . 0] of size $q$ and their outputs are summed as given below:

\[
V_{s_{\alpha_k}} = \sum_{j=1}^{n} \psi_{up}(w_{jk}, \theta_j)
\tag{16.7}
\]
This process is repeated for all states in $A$, and the priority list is generated by assigning the state with the highest number of votes to $p_1$, assigning the state that garnered the second highest number of votes to $p_2$, etc. In case of a tie, the tiebreaker is the sum of the raw outputs of the filter networks, i.e., before the $\psi_{up}$ activation function is applied.

16.3.4.2 Evolution and Development
Fig. 16.5 shows the additional types of genes included in the tissue genome. These genes are developed similarly to the motor neurons and decision neurons as described in Section 16.3.3. The sensor neurons are grown on a two-dimensional spatial map. Mutations can perturb the contents of an existing gene or result in the development of new ones. Both the filter units and sensor neurons also have a ‘sensor type’ specifier which restricts each genome to access certain types of sensory inputs such as obstacle detection or color detection (See Section 16.4 for further details). Furthermore, sensor neurons have the capability of referencing different groups of filter units using the ‘Filter Reference’ parameter. However for the experiments presented here we set this value to 0.
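Returning to the input filtering of Section 16.3.4.1, the polling of filter units (16.7) and the prioritized filtering (16.6) can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, and the tiebreaker assumes that the "raw output" of unit $j$ for state $k$ is simply its weight $w_{jk}$, since the one-hot input vector leaves only that term.

```python
from typing import Hashable, List, Optional, Sequence, Set


def psi_up(sigma: float, theta: float) -> int:
    return 0 if sigma <= theta else 1


def priority_list(weights: Sequence[Sequence[float]], thetas: Sequence[float],
                  states: Sequence[Hashable]) -> List[Hashable]:
    """Poll n filter units; weights[j][k] is the weight of unit j for state k."""
    q = len(states)
    # Eq. (16.7): with a one-hot input at position k, unit j sees weighted input w_jk.
    votes = [sum(psi_up(w[k], th) for w, th in zip(weights, thetas)) for k in range(q)]
    # Assumed tiebreaker: summed raw (pre-threshold) inputs.
    raw = [sum(w[k] for w in weights) for k in range(q)]
    order = sorted(range(q), key=lambda k: (votes[k], raw[k]), reverse=True)
    return [states[k] for k in order]


def sensor_output(inputs: Set[Hashable], priorities: Sequence[Hashable]) -> Optional[Hashable]:
    """Prioritized filtering: return the highest-priority state present in the input set."""
    for state in priorities:
        if state in inputs:
            return state
    return None


states = ["blue", "red", "orange", "floor"]
weights = [[0.9, 0.2, 0.1, 0.0],
           [0.8, 0.7, 0.1, 0.0],
           [0.3, 0.9, 0.6, 0.0]]
P = priority_list(weights, thetas=[0.5, 0.5, 0.5], states=states)
```

Here "blue" and "red" both receive two votes, and the tie is resolved in favor of "blue" because its summed raw input is larger; a sensor neuron that sees both "floor" and "red" therefore outputs "red".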
Fig. 16.5 Gene map for the Sensory Coarse Coding Model
16.4 An Example Task: Resource Gathering The effectiveness of the ANT controller is demonstrated in simulation on the resource gathering task [32]. A team of robots collects resource material distributed throughout its work space and deposits it in a designated dumping area. The
workspace is modeled as a two-dimensional grid environment with one robot occupying four grid squares. For this task, the controller must possess a number of capabilities including gathering resource material, avoiding the workspace perimeter, avoiding collisions with other robots, and forming resources into a berm at the designated location. (In the present experiment, a berm is simply a mound of the resource material.) The berm location has perimeter markings on the floor and a light beacon mounted nearby. The two colors on the border are intended to allow the controller to determine whether the robot is inside or outside the berm location (Fig. 16.6).
Fig. 16.6 2D grid world model of experiment chamber
Though solutions can be found without the light beacon, its presence improves the efficiency of the solutions found, as it allows the robots to track the target location from a distance instead of randomly searching the workspace for the perimeter. The global fitness function for the task measures the amount of resource material accumulated in the designated location within a finite number of time steps, in this case T = 300. Darwinian selection is performed based on the fitness value of each controller averaged over 100 different initial conditions.

Table 16.1 Predefined Sensor Inputs

Sensor Variables    Function              Description
V1 ... V4           Resource Detection    Resource, No Resource
C1 ... C4           Template Detection    Blue, Red, Orange, Floor
S1, S2              Obstacle Detection    Obstacle, No Obstacle
LP1                 Light Position        Left, Right, Center, No Light
LD1                 Light Range           0-10 (distance to light)
Fig. 16.7 Predefined input sensor mapping, with simulation model inset

Table 16.2 Preordered Basis Behaviors

Order           Behavior         Description
1               Dump Resource    Move one grid square back; turn left
2               Move Forward     Move one grid square forward
3               Turn Right       Turn 90° right
4               Turn Left        Turn 90° left
5, 7, 9, 11     Bit Set          Set memory bit i to 1, i = 1 ... 4
6, 8, 10, 12    Bit Clear        Set memory bit i to 0, i = 1 ... 4
Simple feature-detection heuristics are used to determine the values of V1 . . .V4 and C1 . . .C4 based on the grid locations shown. For detection of the light beacon, the electronic shutter speed and gain are adjusted to ensure that the light source is visible while other background features are underexposed. The position of the light LP1 is determined based on the pan angle of the camera. The distance to the light source LD1 is estimated based on its size in the image. The robots also have access to four memory bits, which can be manipulated using some of the basis behaviors. Table 16.2 lists the basis behaviors the robot can perform. These behaviors are activated based on the output of the ANT controller, and all occur within a single time step.
16.4.1 Coupled Motor Primitives

In this section we consider an alternative setup, where the ANT controllers are provided building blocks for the basis behaviors in the form of motor primitive [9] sequences. The motor primitives are taken as discrete voltage signals over a discrete time window applied on DC motors as shown in Fig. 16.8 and as arguments to the motor primitive commands in Table 16.3. These voltage output signals feed to the actuators and can be in one of three states, {1, 0, −1}V for a discrete time window, $\Delta t_n$, $n \in \{0, 1, 2, 3, 4, 5\}$ as shown. In addition, each actuator takes on a default voltage value of 0. The actual value of V, the voltage constant, is dependent on the actuator.

Fig. 16.8 Motor primitives composed of discretized voltage signals shown for a simulated robot

Fig. 16.9 Modified tissue gene that includes order of execution of motor primitive sequences

Table 16.3 Coupled Motor Primitives for the Sign-Following Task

Neuron ID         Behavior          Coupled Motor Signals
1                 Move Forward      Left Motor 1 || Right Motor 1
2                 Turn Right 90°    Left Motor 1 || Right Motor -1
3                 Turn Left 90°     Left Motor -1 || Right Motor 1
4                 Pivot Right       Left Motor 0 || Right Motor -1
5                 Pivot Left        Left Motor 0 || Right Motor 1
6                 Pivot Right       Left Motor 1 || Right Motor 0
7                 Pivot Left        Left Motor -1 || Right Motor 0
8, 10, 12, 14     Bit set           Set memory bit i to 1, i = 1 ... 4
9, 11, 13, 15     Bit clear         Set memory bit i to 0, i = 1 ... 4

The ANT controller also needs to determine the order of execution of these motor primitive sequences. The modified tissue gene is shown in Fig. 16.9. The order of the output coupled motor primitive (CMP) sequences is evolved as additional parameters in the tissue gene and is read starting from the left. The elements of the table, $o_1, \dots, o_\varepsilon$, contain the Neuron ID values. The order is randomly initialized when starting the evolutionary process, with each Neuron ID occupying one spot on the gene. Point mutations to this section of the tissue gene result in swapping
Neuron ID values between sites. Table 16.3 shows the repertoire of coupled motor primitives provided for the ANT controllers and thus ε = 15 for this particular setup (Fig. 16.9). The motor primitives are coupled, where for example the left drive motor and the right drive motor are executed in parallel (indicated using ||). Under this setup, it is still possible for the controller to execute a sequence of motor primitives in a serial fashion.
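The evolved execution order can be pictured as a permutation of the Neuron IDs that is mutated by swapping two sites. The Python sketch below illustrates this reading; the function names are hypothetical, and the mapping from IDs to motor signals covers only rows 1 to 7 of Table 16.3.

```python
import random
from typing import Dict, Iterable, List, Tuple

EPSILON = 15  # number of coupled motor primitives in Table 16.3

# (left motor, right motor) signals executed in parallel, from Table 16.3
COUPLED_SIGNALS: Dict[int, Tuple[int, int]] = {
    1: (1, 1), 2: (1, -1), 3: (-1, 1), 4: (0, -1),
    5: (0, 1), 6: (1, 0), 7: (-1, 0),
}


def random_order(rng: random.Random, epsilon: int = EPSILON) -> List[int]:
    """Random initial ordering: every Neuron ID occupies exactly one site."""
    order = list(range(1, epsilon + 1))
    rng.shuffle(order)
    return order


def swap_mutation(order: List[int], rng: random.Random) -> List[int]:
    """Point mutation: swap the Neuron IDs held at two distinct sites."""
    i, j = rng.sample(range(len(order)), 2)
    mutated = list(order)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


def execution_sequence(order: List[int], active_ids: Iterable[int]) -> List[int]:
    """Read the gene from the left and keep only the primitives voted active."""
    active = set(active_ids)
    return [neuron_id for neuron_id in order if neuron_id in active]
```

Because a swap keeps the gene a valid permutation, no repair step is needed after mutation. For instance, `execution_sequence([3, 1, 2], {1, 3})` yields `[3, 1]`: turn left, then move forward.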
16.4.2 Evolutionary Parameters

The evolutionary algorithm population size for the experiments is P = 100, with crossover probability pc = 0.7, mutation probability pm = 0.025 and a tournament size of 0.06P. The tissue is initialized as a ‘seed culture’, with 3 × 6 motor control neurons in one layer. After this, the tissue is grown to include 70–110 neurons (selected from a uniform random distribution) before starting the evolutionary process. These seeding parameters are not task specific and have been observed to be sufficient for a number of different robotic tasks.
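As a rough illustration of how these parameters fit together, the sketch below shows one generation of a tournament-selection evolutionary algorithm. It is a generic sketch, not the authors' algorithm: the `crossover` and `mutate` callables are hypothetical placeholders supplied by the caller, and whether pm applies per gene or per genome is left to `mutate` because the text does not specify it.

```python
import random
from typing import Callable, List, Sequence

POP_SIZE = 100
P_CROSSOVER = 0.7
P_MUTATION = 0.025
TOURNAMENT_SIZE = round(0.06 * POP_SIZE)  # 0.06P = 6 contenders

Genome = List[float]


def tournament_select(fitness: Sequence[float], rng: random.Random,
                      k: int = TOURNAMENT_SIZE) -> int:
    """Draw k distinct individuals at random and return the index of the fittest."""
    contenders = rng.sample(range(len(fitness)), k)
    return max(contenders, key=lambda i: fitness[i])


def next_generation(population: Sequence[Genome], fitness: Sequence[float],
                    rng: random.Random,
                    crossover: Callable[[Genome, Genome, random.Random], Genome],
                    mutate: Callable[[Genome, float, random.Random], Genome]
                    ) -> List[Genome]:
    """Build a new population of the same size by selection, crossover and mutation."""
    offspring: List[Genome] = []
    while len(offspring) < len(population):
        child = list(population[tournament_select(fitness, rng)])
        if rng.random() < P_CROSSOVER:
            mate = population[tournament_select(fitness, rng)]
            child = crossover(child, mate, rng)
        offspring.append(mutate(child, P_MUTATION, rng))
    return offspring
```

With P = 100, each tournament compares six controllers, so selection pressure is moderate: a weak individual can still reproduce if it avoids drawing a stronger rival.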
16.5 Results

We compare evolutionary performance of various evolvable control system models in Fig. 16.10. Included is a Cellular Automata lookup table that consists of a table of reactive rules spanning $12^{16384}$ entries for this task, which was evolved using population, selection and mutation parameters from Section 16.4.2. The genome is binary and is merely the contents of the lookup table. For this approach, we also assumed that the light beacon is turned off. Hence there exist $2^4 \times 4^4 \times 2^2 = 16384$ possible combinations of sensory input states, accounting for resource detection, template detection and obstacle detection sensors respectively (Table 16.1). For each combination of sensory input, the 12 allowable behaviors outlined in Table 16.2 could be executed. As can be seen, the population quickly stagnates at a very low fitness due to the ‘bootstrap problem’ [23]. With limited supervision, the fitness function makes it difficult to distinguish between incrementally fitter solutions. Instead the system depends on ‘bigger leaps’ in fitness space (through sequences of mutations) for it to be distinguishable during selection. However, bigger leaps become more improbable as evolution progresses, resulting in search stagnation.

The performance of a population of randomly initialized fixed-topology, fully-connected networks, consisting of between 2 and 3 layers, with up to 40 hidden and output neurons, is also shown in Fig. 16.10. In a fixed-topology network there tend to be more ‘active’ synaptic connections present (since all neurons are active), and thus it takes longer for each neuron to tune these connections for all sensory inputs. In this regard ANT is advantageous, since the topology is evolved and decision neurons learn to inhibit noisy neurons through a masking process. The net result is that ANT requires fewer genetic evaluations to evolve desired solutions in comparison to standard neural networks.
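The count of sensory input combinations quoted above follows directly from Table 16.1 once the light sensors are excluded. A short Python check, with a hypothetical dictionary layout, is:

```python
from math import prod

# (number of sensors, states per sensor) from Table 16.1, light beacon turned off
SENSORS = {
    "resource detection (V1..V4)": (4, 2),
    "template detection (C1..C4)": (4, 4),
    "obstacle detection (S1, S2)": (2, 2),
}


def input_combinations(sensors: dict) -> int:
    """Multiply states**count over all sensor groups."""
    return prod(states ** count for count, states in sensors.values())


combinations = input_combinations(SENSORS)  # 2**4 * 4**4 * 2**2
print(combinations)
```

Each added four-state sensor multiplies the table size by four, which is why a lookup-table controller scales so poorly with sensor count.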
Fig. 16.10 Evolutionary performance comparison, showing population best averaged over 30 evolutionary algorithm runs of various evolvable control architectures. Error bars indicate standard deviation. As shown, ANT combined with Sensory Coarse Coding (SCC) and Coupled Motor Primitives (CMP) ordered through evolution obtains desired solutions with fewer genetic evaluations. The CA lookup table approach as shown remains stagnant and is unable to solve the task while fixed-topology neural nets converge at a much slower rate

The standard ANT model using sensory inputs and basis behaviors outlined in Tables 16.1 and 16.2, respectively, shows a noticeable improvement in evolutionary performance over the lookup table and fixed topology architectures. Further improvement is gained when we allow the ANT architecture to evolve the execution order scheme and coupled motor primitives (Section 16.4.1) instead of using a list of preordered basis behaviors. Finally, we also compare ANT+SCC using coupled motor primitives with these models. To make the comparison meaningful with respect to the other models, we impose some restrictions on the ANT+SCC configuration. These include limiting the maximum number of active (selected) sensor neurons from the SCC model to 4 for the resource detection layer and 4 for the template detection layer. We also used predefined layouts for the other spatial sensors, namely obstacle detection. As can be seen in these results, ANT+SCC shows a noticeable performance advantage over the baseline ANT model. Furthermore, we obtain equivalent population best fitness values with approximately 5 times fewer genetic evaluations than with the baseline ANT model (Table 16.4). It should be noted that desired solutions (f ≥ 0.885) were not obtained using standard neural networks within 10,000 generations of evolution. Examples of an evolved execution order scheme and sensor priority table using ANT+SCC+CMP are shown in Fig. 16.12. A typical evolutionary run for the resource gathering task using ANT takes approximately six hours on a dual core Intel T7200, 2GHz desktop processor, with
only one core being used for the evolutionary run. With ANT+SCC+CMP, one can get a comparably suitable solution in less than one hour and thirty minutes. Furthermore, since this fivefold improvement in performance is due to enhancements in the search process, the improvement is expected to carry over to faster processors. Using regular neural networks, comparable solutions were not obtained even after approximately 30 hours (10,000 generations) of evolution.

Table 16.4 Number of generations required to obtain desired solutions (f ≥ 0.885)

Method                              Avg. Generations    Standard Deviation
ANT+SCC+CMP                         421                 133
ANT+CMP                             1,142               241
ANT+SCC+CMP Control Experiment 1    1,343               104
ANT+SCC+CMP Control Experiment 2    1,968               312
ANT                                 1,983               235
Fixed Topology Neural Net.          > 10,000            NA

The solutions obtained in Table 16.4 can accumulate at least 88.5% of the dispersed resources in the designated dumping area within T = 300 timesteps and have been determined to be of sufficient quality to complete the task (see Fig. 16.18 for hardware demonstration). Given more time steps, it is expected that the robots will have accumulated the remaining resources.

One would ideally like to provide raw sensor data as input to the robot controller. However, this results in an exponential increase in search space for a linear increase in sensor states. The alternative would be to filter out and guess which subset of the sensory states may be useful for solving a prespecified task. A wrong guess or poor design choice may make the process of finding a suitable controller difficult or impossible. Hence, ANT+SCC allows for additional flexibility by helping to filter out and determine suitable sensory input states.

Fig. 16.13 shows the evolved population of sensory neurons on the body-centric spatial map. Fig. 16.11 (left) shows the average area used by selected sensor neurons and the average number of sensor neurons that participated in the selection process during evolution. The average area remains largely constant, indicating there is strong selective pressure towards particular geometric shapes and area. This makes sense for the resource gathering task, as controllers need to detect a sufficiently large area in front to identify color cues indicating whether the robot is inside or outside the dumping area.
What is interesting is that with S = 4, for template detection sensor neurons, we still see a steady increase in the number of sensor neurons competing to get selected. The increased number of neurons can potentially act in a cooperative manner, reinforcing and serving as redundant receptive fields covering key locations on the spatial map. Redundancy is beneficial in limiting the impact of
deleterious mutations. Fig. 16.11 (right) shows that the individuals in the evolutionary process start off by sensing a smaller area and that this area is steadily increased as the solutions converge. If each sensor neuron senses just one grid square area, then filtering is effectively disabled. At the beginning of the evolutionary process, individuals take on reduced risk by sensing a smaller effective area, but as the filtering capability evolves concurrently (correctly prioritizing the sensory cues), it allows the individual controllers to sense and filter a larger area. The number of active filter units continues to get pruned until it reaches a steady state. This trend is consistent with experiments using ANT [31], where noisy neurons are shut off as the controllers converge towards a solution.

In order to measure the impact of the coarse-coding and filtering on the ANT+SCC performance improvement, we performed control experiment 1, where the maximum size of the sensor cells was restricted to one grid square and where the net concentration of each grid square within the spatial map was set to 1 (Table 16.4). These two modifications effectively prevent coarse sensor cells from forming and interacting to form fine representations. Instead, what is left is a group of fine sensor neurons that are always active. With the sensor cell area being restricted to one grid square, the priority filter has no effect, since it requires at least two grid squares with differing sensory input states. The fitness performance of this model is comparable to the baseline ANT model. However, since this model also uses coupled motor primitives and it performed worse than ANT+CMP alone, the net impact of these imposed constraints is actually a decrease in performance.

Furthermore, we performed a second control experiment (control experiment 2), where we fixed the receptive field sizes at 3 × 3 grid squares and set the net concentration at each grid square to 1 (Table 16.4). These two modifications ensure the receptive field remains coarse and prevent coarse coding interactions from occurring, while leaving the filter functionality within SCC turned on. The net effect is that we see a noticeable drop in performance due to SCC. Both of these experiments indicate that coarse-coding interaction between sensor neurons is helping to find desired solutions with fewer genetic evaluations.

Fig. 16.11 (Left) Average area occupied by selected sensor neurons and number of sensor neurons that participated in the selection process during evolution. (Right) Number of active filter units and number of grid squares accessible by the sensor neurons during evolution. Both plots show parameters from population best averaged over 30 evolutionary algorithm runs

Fig. 16.12 Evolved coupled motor primitives ordering scheme and sensor priority list for template detection shown for an ANT+SCC+CMP controller with a fitness of 0.98. See Tables 16.3 and 16.1 for reference

Fig. 16.13 Example of an evolved sensor layout (fitness of 0.98) using the ANT+SCC+CMP model. (Left) Participating sensor neurons and receptive fields (template detection) shown. (Right) Selected sensor neurons shown. Shaded area indicates resultant regions sensed by the controller
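The relative performance of the models can be read off Table 16.4 by dividing the baseline ANT generation count by that of each model. The short Python sketch below does this arithmetic; the dictionary and function names are hypothetical, and the fixed-topology network is omitted because only a lower bound (> 10,000) is reported for it.

```python
# Average generations to reach f >= 0.885, from Table 16.4
AVG_GENERATIONS = {
    "ANT+SCC+CMP": 421,
    "ANT+CMP": 1142,
    "ANT+SCC+CMP Control Experiment 1": 1343,
    "ANT+SCC+CMP Control Experiment 2": 1968,
    "ANT": 1983,
}


def speedup_over(baseline: str, table: dict = AVG_GENERATIONS) -> dict:
    """Ratio of baseline generations to each model's generations (higher is better)."""
    reference = table[baseline]
    return {method: reference / generations for method, generations in table.items()}


speedups = speedup_over("ANT")
for method, ratio in speedups.items():
    print(f"{method}: {ratio:.2f}x")
```

The ratio for ANT+SCC+CMP comes out at about 4.7, consistent with the "approximately 5 times" figure quoted in the text, while both control experiments fall below the ratio obtained by ANT+CMP alone.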
16.5.1 Evolution and Robot Density Fig. 16.14 shows the fitness (population best) of the overall system evaluated at each generation of the artificial evolutionary process using the baseline ANT model, with a specified initial resource density and various robot densities. These results show that system performance increases with the number of robots present (with total area held constant). For scenarios initialized with more robots, each robot has a smaller area to cover in trying to gather and dump resources.
16.5.2 Behavioral Adaptations
In an ANT-based architecture, networks are dynamically formed with decision neurons processing the sensory input and in turn ‘selecting’ motor-control neurons through coarse-coding [31]. The behavioral activity of the controllers (see Fig. 16.16) shows the formation of small networks of neurons which handle individual behaviors, such as dumping resources or detecting visual templates (boundary perimeters,
404
J. Thangavelautham, P. Grouchy, and G.M.T. D’Eleuterio
Fig. 16.14 Evolutionary performance comparison of ANT-based solutions for one to five robots. Error bars indicate standard deviation
target area markings, etc.). Localized regions within the tissue do not exclusively handle these specific user-defined, distal behaviors. Instead, the activity of the decision neurons indicates a distribution of specialized ‘feature detectors’ among independent networks. Some of the emergent solutions indicate that the individual robots all learn to dump nearby resources into the designated berm area, but that not all robots deliver resource material all the way to the dumping area every time. Instead, the robots learn to pass the resource material from one individual to another during an encounter, forming a ‘bucket brigade’ (see Figs. 16.15 and 16.18). This technique improves the overall efficiency of the system, as less time is spent traveling to and from the dumping area. Since the robots cannot explicitly communicate with one another, these encounters happen by chance rather than through preplanning. As with other multiagent systems, communication between robots occurs through the manipulation of the environment, in the form of stigmergy. The task in [33] is similar in that distributed objects must be delivered to a confined area; however, the hand-designed controller does not scale as well as the ‘bucket brigade’ solution that the ANT controllers discovered here. We also noticed that the robot controllers use the light beacon located next to the dumping area to home in on it; however, there is no noticeable difference in fitness performance when the robot controllers are evolved with the light turned off [32]. In these simulation experiments, the robots have no way to measure the remaining time available; hence, the system cannot greedily accumulate resource materials without periodically dumping the material at the designated area.
Fig. 16.15 Snapshots of robots and trajectories of a task simulation (4 robots)
Fig. 16.16 Tissue topology and neuronal activity of a select number of decision neurons. Decision neurons in turn ‘select’ (excite into operation) motor control neurons within their diffusion fields
Fig. 16.17 Scaling of ANT-based solutions from one to five robots
Fig. 16.18 Snapshots of two rovers performing the resource gathering task using an ANT controller. Frames 2 and 3 show the ‘bucket brigade’ behavior, while frames 4 and 5 show the boundary avoidance behavior
16.5.3 Evolved Controller Scalability
We examine the fittest solutions from the simulation runs shown in Fig. 16.17 for scalability in the number of robots while holding the amount of resources constant. Taking the controller evolved for a single robot and running it on a multirobot system shows limited performance improvement. In fact, using four or more robots results in a decrease in performance, due to the increased antagonism created.
The scalability of the evolved solution depends in large part on the number of robots used during the training runs. As expected, the single-robot controller lacks the cooperative behavior necessary to function well within a multiagent setting. For example, such controllers fail to develop ‘robot collision avoidance’ or ‘bucket brigade’ behaviors. Similarly, the robot controllers evolved with two or more robots perform demonstrably worse when scaled down to a single robot, showing that the solutions are dependent on cooperation among the robots.
16.6 Discussion
In this chapter, we use a global fitness function to train multirobot controllers with limited supervision to perform self-organized task decomposition. Techniques that perform well for the task make use of modularity and generalization. Modularity is the use and reuse of components, while generalization is the process of finding patterns or making inferences from many particulars. With a multirobot setup, modularity together with parallelism is exploited by evolved controllers to accomplish the task. Rather than have one centralized individual attempting to solve a task using global information, the individuals within the group are decentralized, make use of local information and take on different roles through a process of self-organization. This process of having different agents solve different subcomponents of the task in order to complete the overall task is a form of task decomposition.
In this multirobot setup, there are both advantages and disadvantages to consider. Multiple robots working independently exploit parallelism, helping to reduce the time and effort required to complete a task. Furthermore, we also see that solutions show improved overall system performance when evolved with groups of robots. It should be noted that the density of robots is critical to solving the task. Higher densities of robots result in antagonism, with robots spending more time getting out of the way of one another rather than progressing on the task, leading to reduced system performance.
A CA look-up table architecture that lacks both modularity and generalization was found to be intractable due to the ‘bootstrap problem,’ resulting in premature search stagnation. This occurs because EAs are unable to find an incrementally better solution during the early phase of evolution.
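To give a rough sense of why a look-up table controller is hard to evolve, the following sketch counts the table rows and the number of candidate tables for a small sensor suite. The sensor names, the number of states per sensor, and the number of actions are illustrative assumptions and are not the values used in the experiments reported in this chapter.

```python
from math import prod


def lookup_table_rows(states_per_sensor):
    """Rows in the table: one per unique combination of sensor states."""
    return prod(states_per_sensor)


def candidate_tables(states_per_sensor, num_actions):
    """Distinct look-up tables: an independent action choice for every row."""
    return num_actions ** lookup_table_rows(states_per_sensor)


# Hypothetical sensor suite (name -> number of discrete states); assumed values.
SENSORS = {"front_left": 4, "front": 4, "front_right": 4,
           "below": 3, "heading": 4, "carrying": 2}
NUM_ACTIONS = 5  # assumed number of basis behaviors

if __name__ == "__main__":
    rows = lookup_table_rows(SENSORS.values())
    tables = candidate_tables(SENSORS.values(), NUM_ACTIONS)
    print(f"table rows to tune: {rows}")
    # Report only the order of magnitude; the exact integer is very long.
    print(f"candidate tables: about 10^{len(str(tables)) - 1}")
```

Because a table has no mechanism for sharing what is learned in one row with any other row, every row must be tuned separately, and the row count grows multiplicatively with each added sensor.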
Use of neural networks is a form of functional modularization, where each neuron performs sensory-information processing and makes solving the task more tractable. However, with increased numbers of hidden neurons, one is faced with the effects of spatial crosstalk, where noisy neurons interfere with and drown out signals from feature-detecting neurons [16]. Crosstalk in combination with limited supervision (through use of a global fitness function) can again lead to the ‘bootstrap problem’ [23]. Thus, choosing the wrong network topology may lead to a network that is either unable to solve the problem or difficult to train [31]. With the use of Artificial Neural Tissues (ANT), we introduce hierarchical functional modularity into the picture. The tissue consists of modular neurons that can form dynamic, modular networks of neurons. These groups of neurons handle
specialized functionality, as we have shown, and can be reused repeatedly for this purpose. In contrast, with a standard fixed topology neural network, similar functionality may need to evolve independently multiple times in different parts of the network. In these various neural network architectures, modularity is functional, with behaviors and capabilities existing in individual neurons or in groups and triggered when necessary. ANT facilitates evolution of this capability by allowing for regulatory functionality that enables dynamic activation and inhibition of neurons within the tissue. Groups of neurons can be easily activated or shut off through a coarse-coding process. Furthermore, with the ANT+SCC model, we allow for evolution of both spatial and functional modularity. Spatial modularity is possible with the SCC model, since we may get specialized sensory neurons that find spatial sensory patterns. The output from these sensory neurons is used as input by various groups of neurons active within ANT. These sensor neurons act as specialized feature detectors looking for either color cues or resources. Comparison of the various evolvable control system models indicates that controllers with an increased ability to generalize evolve desired solutions with far fewer genetic evaluations. The CA lookup table architecture lacks generalization capability and performed the worst. For the CA lookup table, evolved functionality needs to be tuned for every unique combination of sensory inputs. A regular fixed topology network performed better, but since the topology had no capacity to increase in size or selectively activate/inhibit neurons within the network, it needed to tune most of the neurons in the network towards both helping perform input identification or actions and preventing these same neurons from generating spurious outputs.
Thus the same capabilities may have to be acquired by different neurons located in different parts of the network, requiring an increased number of genetic evaluations to reach a desired solution. The standard ANT architecture can quickly shut off (mask out) neurons generating spurious output and thus does not require sequences of mutations that tune each neuron within the tissue to acquire compatible (or similar) capabilities or remain dormant. Thus certain networks of neurons within the tissue can acquire and apply a certain specialized capability (Fig. 16.16), while most others remain dormant through the regulatory process. Hence within ANT, increased functional generalization is achieved through specialization. With the fixed topology neural network, the net effect of all the neurons having to be active all the time implies that the controllers have to evolve to individually silence each of the spurious neurons or acquire the same capabilities repeatedly, thus implying reduced functional generalization. ANT+SCC can generalize even further. Apart from being able to selectively activate/inhibit neurons, it can also choose to receive a coarse or fine representation of the sensory input. In other words, it can perform further sensor generalization. A coarse representation of the sensory input in effect implies some degree of generalization. The priority filtering functionality prioritizes certain sensor states over others, while the coarse coding representation selects a subset of the inputs to send to the filter. The resultant input preprocessing facilitates finding and exploiting underlying patterns in the input set. The net effect is that the controller does not have
to deal with as many unique conditions, since the number of unique sensory input combinations seen by the ANT controller is reduced by SCC. This in turn facilitates evolution of controllers that require fewer generations to reach a desired solution. At the same time, over-generalization of the sensory inputs is problematic (see ANT+SCC+CMP control experiment 2). By imposing coarse receptive fields and preventing coarse-coded interactions, the controllers may miss key (fine) features through prioritized filtering. Hence, although the sensory input space may effectively have shrunk, through over-generalization valuable information is lost. These results justify the need for representations that selectively increase or decrease generalization of sensory input through coarse-coding. This increased ability to generalize by the ANT+SCC model also seems to offset the increased number of parameters (increased search space) that need to be evolved. Herein lies a tradeoff, as a larger search space alone may require a greater number of genetic evaluations to reach a desired solution, but this may also provide some unexpected benefits. In particular, a larger space may help in finding more feasible or desirable solutions than those already present and may even reduce the necessary number of genetic evaluations by guiding evolution (as in the ANT+SCC case). As pointed out, ANT+SCC with its ability to further generalize sensory input appears to provide a net benefit, even though it needs to be evolved with additional parameters (in comparison to the standard ANT model). This benefit is also apparent when comparing the baseline ANT controller with ANT-ordered coupled motor primitives. The additional genomic parameters appear to be beneficial once again, since the search process has access to more potential solutions.
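Returning to the reduction in unique sensory input combinations attributed to SCC earlier in this discussion, the effect can be illustrated with a minimal sketch of a priority filter applied to the cells of a receptive field. This is not the SCC implementation used in the chapter: the state names, their priority order, and the function names are hypothetical, and the filter is assumed simply to report the highest-priority state present among the selected cells.

```python
from itertools import product

# Hypothetical sensor states, listed from highest to lowest priority.
PRIORITY = ["robot", "resource", "wall", "empty"]


def priority_filter(cells, priority=PRIORITY):
    """Return the highest-priority state present among the selected cells."""
    rank = {state: i for i, state in enumerate(priority)}
    return min(cells, key=rank.__getitem__)


def count_unique_inputs(num_cells, priority=PRIORITY):
    """Compare raw input combinations with distinct filtered outputs."""
    raw = set(product(priority, repeat=num_cells))
    filtered = {priority_filter(cells, priority) for cells in raw}
    return len(raw), len(filtered)


if __name__ == "__main__":
    raw, filtered = count_unique_inputs(4)
    print(f"raw combinations: {raw}, filtered outputs: {filtered}")
    # A fine feature can be masked by a higher-priority neighbor.
    print(priority_filter(("resource", "robot")))
```

Under these assumptions, four cells with four states each give 256 raw combinations but only 4 filtered outputs, so the controller sees far fewer distinct conditions. The same filter maps a field containing both a resource and a robot to "robot", masking the resource, which is the kind of information loss associated with over-generalization. With a single selected cell the filter returns its input unchanged.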
Furthermore, it should be noted that these additional degrees of freedom within the ANT+SCC controller do not appear to introduce deceptive sensory inputs or capabilities. Deceptive inputs and capabilities can slow down the evolutionary process, since the evolving system may retain these capabilities when they initially provide a slight fitness advantage. However, these functionalities can in turn limit or prevent the controllers from reaching the desired solution. Thus, in effect, the evolving population can get stuck at a local optimum, unable to move towards a better fitness peak.
16.7 Conclusion
This chapter has reported on a number of experiments used to automatically generate neural-network-based controllers for multirobot systems. We have shown that with judicious selection of a fitness function, it is possible to encourage self-organized task decomposition using evolutionary algorithms. We have also shown that by exploiting hierarchical modularity, regulatory functionality and the ability to generalize, controllers can overcome tractability concerns. Controllers with increased modularity and generalization abilities are found to evolve desired solutions with fewer training evaluations by effectively reducing the size of the search space. These techniques are also able to find novel multirobot coordination and control strategies. To facilitate this process of evolution, coarse-coding techniques are used
to evolve ensembles of arbitration neurons that acquire specialized functionality. Similar techniques are used to evolve sensor-filter configurations. Both techniques facilitate functional and spatial modularity and generalization. This combination allows for a methodical approach to control development, particularly one where the controller and robot sensory configurations can be automatically generated starting from a blank slate, where the designer can be largely relieved of the design process and where detailed models of the system or environment can be avoided.
References
1. Beckers, R., Holland, O.E., Deneubourg, J.L.: From local actions to global tasks: Stigmergy and collective robots. In: Fourth International Workshop on the Synthesis and Simulation of Living Systems, pp. 181–189. MIT Press, Cambridge (1994)
2. Bonabeau, E., Theraulaz, G., Deneubourg, J.-L., Aron, S., Camazine, S.: Self-organization in social insects. Trends in Ecology and Evolution 12, 188–193 (1997)
3. Bonabeau, E., Dorigo, M., Theraulaz, G.: Swarm Intelligence: From Natural to Artificial Systems. Oxford Univ. Press, New York (1999)
4. Bongard, J., Pfeifer, R.: Repeated structure and dissociation of genotypic and phenotypic complexity in artificial ontogeny. In: Proceedings of the Genetic and Evolutionary Computation Conference 2001, San Francisco, CA, pp. 829–836 (2001)
5. Chantemargue, F., Dagaeff, T., Schumacher, M., Hirsbrunner, B.: Implicit cooperation and antagonism in multi-agent systems, University of Fribourg, Technical Report (1996)
6. Chellapilla, K., Fogel, D.B.: Evolving an expert checkers playing program without using human expertise. IEEE Transactions on Evolutionary Computation 5(4), 422–428 (2001)
7. Das, R., Crutchfield, J.P., Mitchell, M., Hanson, J.: Evolving globally synchronized cellular automata. In: Proceedings of the Sixth International Conference on Genetic Algorithms 1995, pp. 336–343. Morgan Kaufmann, San Francisco (1995)
8. Dellaert, F., Beer, R.: Towards an evolvable model of development for autonomous agent synthesis. In: Artificial Life IV: Proceedings of the 4th International Workshop on the Synthesis and Simulation of Living Systems, pp. 246–257. MIT Press, Cambridge (1994)
9. Demeris, J., Matarić, M.J.: Perceptuo-Motor Primitives in Imitation. In: Autonomous Agents 1998 Workshop on Agents in Interaction Acquiring Competence (1998)
10. Federici, D., Downing, K.: Evolution and Development of a Multicellular Organism: Scalability, Resilience, and Neutral Complexification. Artificial Life 12, 381–409 (2006)
11. Gauci, J., Stanley, K.: A Case Study on the Critical Role of Geometric Regularity in Machine Learning. In: Proceedings of the 23rd AAAI Conference on AI. AAAI Press, Menlo Park (2008)
12. Grassé, P.: La reconstruction du nid les coordinations interindividuelles; la theorie de stigmergie. Insectes Sociaux 35, 41–84 (1959)
13. Groß, R., Dorigo, M.: Evolving a Cooperative Transport Behavior for Two Simple Robots. In: Liardet, P., Collet, P., Fonlupt, C., Lutton, E., Schoenauer, M. (eds.) EA 2003. LNCS, vol. 2936, pp. 305–316. Springer, Heidelberg (2004)
14. Gruau, F., Whitley, D., Pyeatt, L.: A comparison between cellular encoding and direct encoding for genetic neural networks. In: Genetic Programming 1996, pp. 81–89. MIT Press, Cambridge (1996)
15. Hastie, T., Tibshirani, R., Friedman, R.: The Elements of Statistical Learning. Springer, New York (2001)
16. Jacobs, R., Jordan, M., Barto, A.: Task decomposition through competition in a modular connectionist architecture. Cognitive Science (15), 219–250 (1991)
17. Komosinski, M., Ulatowski, S.: Framsticks: towards a simulation of a nature-like world, creatures and evolution. In: Proceedings of the 5th European Conference on Artificial Life. Springer, Berlin (1998)
18. Leffler, B.R., Littman, M.L., Edmunds, T.: Efficient reinforcement learning with relocatable action models. AAAI Journal, 572–577 (2007)
19. Lindenmayer, A.: Mathematical models for cellular interaction in development I. Filaments with one-sided inputs. Journal of Theoretical Biology 18, 280–289 (1968)
20. Lipson, H., Pollack, J.: Automatic design and manufacture of artificial lifeforms. Nature 406, 974–978 (2000)
21. Matarić, M.J., Nilsson, M., Simsarian, K.T.: Cooperative multi-robot box-pushing. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 556–561 (1995)
22. Mautner, C., Belew, R.K.: Evolving Robot Morphology and Control. In: Sugisaka, M. (ed.) Proceedings of Artificial Life and Robotics 1999 (AROB 1999), Oita, ISAROB (1999)
23. Nolfi, S., Floreano, D.: Evolutionary Robotics: The Biology, Intelligence, and Technology of Self-Organizing Machines. MIT Press, Cambridge (2000)
24. Parker, C.A., Zhang, H., Kube, C.R.: Blind bulldozing: Multiple robot nest construction. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2010–2015 (2003)
25. Pfeifer, R., Scheier, C.: Understanding Intelligence. MIT Press, Cambridge (1999)
26. Roggen, D., Federici, D.: Multi-cellular Development: Is There Scalability and Robustness to Gain? In: Yao, X., Burke, E.K., Lozano, J.A., Smith, J., Merelo-Guervós, J.J., Bullinaria, J.A., Rowe, J.E., Tiňo, P., Kabán, A., Schwefel, H.-P. (eds.) PPSN 2004. LNCS, vol. 3242, pp. 391–400. Springer, Heidelberg (2004)
27. Sims, K.: Evolving 3D Morphology and Behavior by Competition. In: Proceedings of Artificial Life IV, pp. 28–39. MIT Press, Cambridge (1994)
28. Stanley, K., Miikkulainen, R.: Continual Coevolution through Complexification. In: Proceedings of the Genetic and Evolutionary Computation Conference 2002. Morgan Kaufmann, San Francisco (2002)
29. Thangavelautham, J., Barfoot, T., D’Eleuterio, G.M.T.: Coevolving communication and cooperation for lattice formation tasks (updated). In: Advances In Artificial Life: Proceedings of the 7th European Conference on ALife, pp. 857–864 (2003)
30. Thangavelautham, J., D’Eleuterio, G.M.T.: A neuroevolutionary approach to emergent task decomposition. In: Yao, X., Burke, E.K., Lozano, J.A., Smith, J., Merelo-Guervós, J.J., Bullinaria, J.A., Rowe, J.E., Tiňo, P., Kabán, A., Schwefel, H.-P. (eds.) PPSN 2004. LNCS, vol. 3242, pp. 991–1000. Springer, Heidelberg (2004)
31. Thangavelautham, J., D’Eleuterio, G.M.T.: A coarse-coding framework for a gene-regulatory-based artificial neural tissue. In: Advances In Artificial Life: Proceedings of the 8th European Conference on ALife, pp. 67–77 (2005)
32. Thangavelautham, J., Alexander, S., Boucher, D., Richard, J., D’Eleuterio, G.M.T.: Evolving a Scalable Multirobot Controller Using an Artificial Neural Tissue Paradigm. In: IEEE International Conference on Robotics and Automation, Washington, D.C. (2007)
33. Wawerla, J., Sukhatme, G., Matarić, M.: Collective construction with multiple robots. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2696–2701 (2002)
34. Wilson, M., Melhuish, C., Sendova-Franks, A.B., Scholes, S.: Algorithms for building annular structures with minimalist robots inspired by brood sorting in ant colonies. Autonomous Robots 17, 115–136 (2004)
35. Zykov, V., Mytilinaios, E., Adams, B., Lipson, H.: Self-reproducing machines. Nature 435(7038), 163–164 (2005)