Lecture Notes in Artificial Intelligence Edited by R. Goebel, J. Siekmann, and W. Wahlster
Subseries of Lecture Notes in Computer Science
6471
Jaume Bacardit Will Browne Jan Drugowitsch Ester Bernadó-Mansilla Martin V. Butz (Eds.)
Learning Classifier Systems 11th International Workshop, IWLCS 2008 Atlanta, GA, USA, July 13, 2008 and 12th International Workshop, IWLCS 2009 Montreal, QC, Canada, July 9, 2009 Revised Selected Papers
Series Editors
Randy Goebel, University of Alberta, Edmonton, Canada
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany

Volume Editors
Jaume Bacardit, University of Nottingham, Nottingham, NG8 1BB, UK, E-mail: [email protected]
Will Browne, Victoria University of Wellington, Wellington 6140, New Zealand, E-mail: [email protected]
Jan Drugowitsch, University of Rochester, Rochester, NY 14627, USA, E-mail: [email protected]
Ester Bernadó-Mansilla, Universitat Ramon Llull, 08022 Barcelona, Spain, E-mail: [email protected]
Martin V. Butz, University of Würzburg, 97070 Würzburg, Germany, E-mail: [email protected]
Library of Congress Control Number: 2010940267
CR Subject Classification (1998): I.2.6, I.2, H.3, D.2.4, D.2.8, F.1, H.4, H.2.8
LNCS Sublibrary: SL 7 – Artificial Intelligence
ISSN 0302-9743
ISBN-10 3-642-17507-4 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-17507-7 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

springer.com
© Springer-Verlag Berlin Heidelberg 2010
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper 06/3180
Preface
Learning Classifier Systems (LCS) constitute a fascinating concept at the intersection of machine learning and evolutionary computation. LCS's genetic search, generally in combination with reinforcement learning techniques, can be applied to both temporal and spatial problem solving and promotes powerful search in a wide variety of domains. The LCS concept allows many representations of the learned knowledge, from simple production rules to artificial neural networks to linear approximations, often in a human-readable form.

The concepts underlying LCS have been developed for over 30 years, with the annual International Workshop on Learning Classifier Systems supporting the field since 1992. From 1999 onwards the workshop has been held yearly, in conjunction with PPSN in 2000 and 2002 and with GECCO in 1999, 2001, and from 2003 onwards. This book is the continuation of the six volumes containing selected and revised papers from the previous workshops, published by Springer as LNAI 1813, LNAI 1996, LNAI 2321, LNAI 2661, LNCS 4399, and LNAI 4998.

The articles in this book have been loosely organized into four overlapping themes. Firstly, the breadth of research into LCS and related areas is demonstrated. Then the ability to approximate complex multidimensional function surfaces is shown by the latest research on computed predictions and piecewise approximations. This work leads on to LCS for complex domains, such as temporal decision making and continuous domains, whereas traditional learning approaches often require problem-dependent manual tuning of the algorithms and discretization of problem spaces, resulting in a loss of information. Finally, diverse application examples are presented to demonstrate the versatility and broad applicability of the LCS approach.

Pier Luca Lanzi and Daniele Loiacono investigate the use of general-purpose Graphical Processing Units (GPUs), which are becoming increasingly common in evolutionary computation, for speeding up the matching of environmental states to rules in LCS. Depending on the problem investigated and the representation scheme used, they find that the use of GPUs improves the matching speed by 3 to 50 times when compared with matching on standard CPUs.

Association rule mining, where interesting associations in the occurrence of items in streams of unlabelled examples are to be extracted, is addressed by Albert Orriols-Puig and Jorge Casillas. Their novel CSar Michigan-style learning classifier system shows promising results when compared with the benchmark approach to this problem.

Stewart Wilson shows that there is still much scope for generating novel approaches with the LCS concept. He proposes an automatic system for creating pattern generators and recognizers based on a three-cornered competitive co-evolutionary algorithm approach.
Patrick O. Stalph and Martin V. Butz investigate current capabilities and challenges facing XCSF, an LCS in which each rule builds a locally linear approximation to the payoff surface within its matching region. It is noted that the XCSF approach was the most popular branch of LCS research within the latest editions of this workshop. In a second paper, the same authors investigate the impact of variable offspring set sizes, which show promise beyond the standard two offspring used in many genetics-based machine learning techniques.

The model used in XCSF by Gerard David Howard, Larry Bull, and Pier Luca Lanzi uses an artificial neural network, instead of standard rules, for matching and action selection, thus illustrating the flexible nature of LCS techniques. Their method is compared with principles from the NEAT (NeuroEvolution of Augmenting Topologies) approach and augmented with previous LCS neural constructivism work to improve performance in continuous environments.

Gilles Enée and Mathias Péroumalnaïk also examine how LCS copes with complex environments by introducing the Adapted Pittsburgh Classifier System and applying it to maze-type environments containing aliasing squares. This work shows that the LCS is capable of building accurate strategies in non-Markovian environments without the use of rules with memory.

Ajay Kumar Tanwani and Muddassar Farooq compare three LCS-based data mining techniques to three benchmark algorithms on biomedical data sets, showing that, although not completely dominant, the GAssist LCS approach is in general able to provide the best classification results on the majority of the datasets tested.

Illustrating the diversity of application domains for LCS, supply chain management sales is investigated by María Franco, Ivette Martínez, and Celso Gorrin, showing that the set of generated rules solves the sales problem in a satisfactory manner. Richard Preen uses the well-established XCS LCS to identify trade entry and exit timings for financial time-series forecasting. These results show the promise of LCS in this difficult domain, due to its noisy, dynamic, and temporal nature. In the final application paper, José G. Moreno-Torres, Xavier Llorà, David E. Goldberg, and Rohit Bhargava provide an approach to the homogenization of laboratory data through the use of a genetic programming based algorithm.

As in the previous volumes, we hope that this book will be a useful support for researchers interested in learning classifier systems and will provide insights into the most relevant topics. Finally, we hope it will encourage new researchers, business, and industry to investigate the LCS concept as a method to discover solutions to their varied problems.

September 2010
Will Browne
Jaume Bacardit
Jan Drugowitsch
Organization
The postproceedings of the International Workshops on Learning Classifier Systems 2008 and 2009 were assembled by the organizing committee of IWLCS 2009.
IWLCS 2008 Organizing Committee

Jaume Bacardit (University of Nottingham, UK)
Ester Bernadó-Mansilla (Universitat Ramon Llull, Spain)
Martin V. Butz (Universität Würzburg, Germany)

Advisory Committee

Tim Kovacs (University of Bristol, UK)
Xavier Llorà (Univ. of Illinois at Urbana-Champaign, USA)
Pier Luca Lanzi (Politecnico di Milano, Italy)
Wolfgang Stolzmann (Daimler Chrysler AG, Germany)
Keiki Takadama (Tokyo Institute of Technology, Japan)
Stewart Wilson (Prediction Dynamics, USA)

IWLCS 2009 Organizing Committee

Jaume Bacardit (University of Nottingham, UK)
Will Browne (Victoria University of Wellington, New Zealand)
Jan Drugowitsch (University of Rochester, USA)

Advisory Committee

Ester Bernadó-Mansilla (Universitat Ramon Llull, Spain)
Martin V. Butz (Universität Würzburg, Germany)
Tim Kovacs (University of Bristol, UK)
Xavier Llorà (Univ. of Illinois at Urbana-Champaign, USA)
Pier Luca Lanzi (Politecnico di Milano, Italy)
Wolfgang Stolzmann (Daimler Chrysler AG, Germany)
Keiki Takadama (Tokyo Institute of Technology, Japan)
Stewart Wilson (Prediction Dynamics, USA)
Referees

Ester Bernadó-Mansilla, Lashon Booker, Will Browne, Larry Bull, Martin V. Butz, Jan Drugowitsch, Ali Hamzeh, Francisco Herrera, John Holmes, Tim Kovacs, Pier Luca Lanzi, Xavier Llorà, Daniele Loiacono, Drew Mellor, Luis Miramontes Hercog, Albert Orriols-Puig, Wolfgang Stolzmann, Keiki Takadama, Stewart W. Wilson
Past Workshops

1st IWLCS, October 1992, NASA Johnson Space Center, Houston, TX, USA
2nd IWLCS, July 1999, GECCO 1999, Orlando, FL, USA
3rd IWLCS, September 2000, PPSN 2000, Paris, France
4th IWLCS, July 2001, GECCO 2001, San Francisco, CA, USA
5th IWLCS, September 2002, PPSN 2002, Granada, Spain
6th IWLCS, July 2003, GECCO 2003, Chicago, IL, USA
7th IWLCS, June 2004, GECCO 2004, Seattle, WA, USA
8th IWLCS, June 2005, GECCO 2005, Washington, DC, USA
9th IWLCS, July 2006, GECCO 2006, Seattle, WA, USA
10th IWLCS, July 2007, GECCO 2007, London, UK
11th IWLCS, July 2008, GECCO 2008, Atlanta, GA, USA
12th IWLCS, July 2009, GECCO 2009, Montreal, Canada
13th IWLCS, July 2010, GECCO 2010, Portland, OR, USA
Table of Contents
LCS and Related Methods

Speeding Up Matching in Learning Classifier Systems Using CUDA
  Pier Luca Lanzi and Daniele Loiacono ................................. 1

Evolution of Interesting Association Rules Online with Learning Classifier Systems
  Albert Orriols-Puig and Jorge Casillas ............................... 21

Coevolution of Pattern Generators and Recognizers
  Stewart W. Wilson .................................................... 38

Function Approximation

How Fitness Estimates Interact with Reproduction Rates: Towards Variable Offspring Set Sizes in XCSF
  Patrick O. Stalph and Martin V. Butz ................................. 47

Current XCSF Capabilities and Challenges
  Patrick O. Stalph and Martin V. Butz ................................. 57

LCS in Complex Domains

Recursive Least Squares and Quadratic Prediction in Continuous Multistep Problems
  Daniele Loiacono and Pier Luca Lanzi ................................. 70

Use of a Connection-Selection Scheme in Neural XCSF
  Gerard David Howard, Larry Bull, and Pier Luca Lanzi ................. 87

Building Accurate Strategies in Non Markovian Environments without Memory
  Gilles Enée and Mathias Péroumalnaïk ................................. 107

Classification Potential vs. Classification Accuracy: A Comprehensive Study of Evolutionary Algorithms with Biomedical Datasets
  Ajay Kumar Tanwani and Muddassar Farooq .............................. 127

Applications

Supply Chain Management Sales Using XCSR
  María Franco, Ivette Martínez, and Celso Gorrin ...................... 145

Identifying Trade Entry and Exit Timing Using Mathematical Technical Indicators in XCS
  Richard Preen ........................................................ 166

On the Homogenization of Data from Two Laboratories Using Genetic Programming
  José G. Moreno-Torres, Xavier Llorà, David E. Goldberg, and Rohit Bhargava ... 185

Author Index ........................................................... 199
Speeding Up Matching in Learning Classifier Systems Using CUDA

Pier Luca Lanzi and Daniele Loiacono

Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy
{lanzi,loiacono}@elet.polimi.it
Abstract. We investigate the use of NVIDIA’s Compute Unified Device Architecture (CUDA) to speed up matching in classifier systems. We compare CUDA-based matching and CPU-based matching on (i) real inputs using interval-based conditions and on (ii) binary inputs using ternary conditions. Our results show that on small problems, due to the memory transfer overhead introduced by CUDA, matching is faster when performed using the CPU. As the problem size increases, CUDA-based matching can outperform CPU-based matching resulting in a 3-12× speedup when the interval-based representation is applied to match real-valued inputs and a 20-50× speedup for ternary-based representation.
1 Introduction
Learning classifier systems [10,8,17] combine evolutionary computation with methods of temporal difference learning to solve classification and reinforcement learning problems. A classifier system maintains a population of condition-action-prediction rules, called classifiers, which identifies its current knowledge about the problem to be solved. At each time step, the system receives the current state of the problem and matches it against all the classifiers in the population. The result is a match set containing the classifiers that can be applied to the problem in its current state. Based on the value of the actions in the match set, the classifier system selects an action to perform on the problem to progress toward its solution. As a consequence of the executed action, the system receives a numerical reward that is distributed to the classifiers accountable for it. While the classifier system is interacting with the problem, a genetic algorithm is applied to the population to discover better classifiers through selection, recombination, and mutation.

Matching is the main and most computationally demanding process of a classifier system [14,3]; it can occupy up to 65%-85% of the overall computation time [14]. Accordingly, several methods have been proposed in the literature to speed up matching in learning classifier systems. Llorà and Sastry [14] compared the typical encoding of classifier conditions for binary inputs, an encoding based on the underlying binary arithmetic, and a version of the
same encoding optimized via vector instructions. Their results show that binary encodings combined with optimizations based on the underlying integer arithmetic can speed up the matching process up to 80 times. The analysis of Llorà and Sastry [14] did not consider the influence of classifier generality on the complexity of matching. As noted in [3], matching usually stops as soon as it is determined that the classifier cannot be applied to the current problem instance (e.g., [1,12]). Accordingly, matching a population of highly specific classifiers takes much less time than matching a population of highly general classifiers. Butz et al. [3] extended the analysis in [14] (i) by considering more encodings (the specificity-based encoding used in Butz's implementation [1] and the encoding used in some implementations of Alecsys [7]) and (ii) by taking into account classifiers' generality. Their results show that, overall, specificity-based matching can be 50% faster than character-based encoding when general populations are involved, but it can be slower than character-based encoding if more specific populations are considered. Binary encoding was confirmed to be the fastest option, with a reported improvement of up to 90% compared to the usual character-based encoding. Butz et al. [3] also proposed a specificity-based encoding for real-coded inputs which could halve the time required to match a population.

In this work, we took a different approach to speeding up matching in classifier systems, based on the use of Graphics Processing Units (GPUs). More precisely, we used NVIDIA's Compute Unified Device Architecture (CUDA) to implement matching for (i) real inputs using interval-based conditions and for (ii) binary inputs using ternary conditions. We tested our GPU-based matching by applying the same experimental design used in [14,3]. Our results show that on small problems, due to the memory transfer overhead introduced by GPUs, matching is faster when performed using the usual CPU. On larger problems, involving either more variables or more classifiers, GPU-based matching can outperform the CPU-based implementation, with a 3-12× speedup when the interval-based representation is applied to match real-valued inputs and a 20-50× speedup for the ternary-based representation.
2 General-Purpose Computation on GPUs
Graphics Processing Units (GPUs) currently provide the best floating-point performance, with a throughput that is at least ten times higher than the one provided by multi-core CPUs. Such a large performance gap has pushed developers to move several computationally intensive parts of their software to GPUs. Many-core GPUs perform better than general-purpose multi-core CPUs on floating-point computation because they have a different underlying design philosophy (see Figure 1). The design of a CPU is optimized for sequential code performance. It exploits sophisticated control logic to execute instructions from a single thread in parallel while maintaining the appearance of sequential execution. In addition, large cache memories are provided to reduce the instruction and data access latencies required in large complex applications. On the other hand, the GPU design is optimized for the execution of a massive number of threads. It exploits the large number of executed threads to find work to do during long-latency memory accesses, minimizing the control logic required for each thread. Small cache memories are provided so that when multiple threads access the same memory data, they do not all need to access the DRAM. As a result, much more chip area is dedicated to floating-point calculations.
Fig. 1. An overview of the CPUs and GPUs design philosophies
2.1 The CUDA Programming Model
NVIDIA's Compute Unified Device Architecture (CUDA)1 allows developers to write computationally intensive applications on a GPU by using an extension of C which provides abstractions for parallel programming. In CUDA, GPUs are represented as devices that can run a large number of threads. Parallel tasks are represented as kernels mapped over a domain. Each kernel represents a sequential task to be executed as a thread on each point of the domain. The data to be processed by the GPU must be loaded into the board memory and, unless deallocated or overwritten, they remain available for subsequent kernels. Kernels have built-in variables to identify themselves in the domain and to access the data in the board memory. The domain is defined as a 5-dimensional structure consisting of a two-dimensional grid of three-dimensional thread blocks. Thread blocks are limited to 512 total threads; each block is assigned to a single processing element and runs as a unit until completion without preemption. Note that the resources used by a block are released only after the execution of all the threads in the same block is completed. Once a block is assigned to a streaming multiprocessor, it is further divided into groups of 32 threads, called warps. All threads within the same block are simultaneously live and they are temporally multiplexed but, at any time, the processing element executes only one of its resident warps. When the number of thread blocks in a grid exceeds the hardware resources, new blocks are assigned to processing elements as soon as previous ones complete their execution. In addition to the global shared memory of the device, GPUs also have a private memory visible only to threads within the same block, called per-block shared memory (PBSM).

1 http://www.nvidia.com/object/cuda_home_new.html
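As a minimal illustration of this programming model (a sketch of ours, unrelated to the matching kernels discussed later; the kernel name, sizes, and helper function are assumptions), the following code maps one thread to each element of an array, launches a grid of one-dimensional blocks, and moves data to and from the board memory:

// Minimal illustrative CUDA sketch (ours): one thread per array element.
__global__ void scale(float *data, float factor, int size)
{
    // built-in variables identify this thread's point in the domain
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < size)               // guard: the grid may be larger than the data
        data[i] *= factor;
}

void scale_on_gpu(float *host_data, int size)
{
    float *dev_data;
    // load the data into the board memory
    cudaMalloc((void**)&dev_data, size * sizeof(float));
    cudaMemcpy(dev_data, host_data, size * sizeof(float), cudaMemcpyHostToDevice);
    // launch a grid of one-dimensional blocks covering all the elements
    int block = 64;
    scale<<<(size + block - 1) / block, block>>>(dev_data, 2.0f, size);
    // retrieve the result and release the board memory
    cudaMemcpy(host_data, dev_data, size * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_data);
}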
2.2 Performance Issues
Although CUDA is very intuitive, it requires a deep knowledge of the underlying hardware architecture. CUDA developers need to take into account the specific features of the GPU architecture, such as memory transfer overhead, shared memory bank conflicts, and the impact of control flow. In fact, in CUDA, it is necessary to manage the communication between main memory and GPU shared memory explicitly. Developers have to reduce the transfer overhead by avoiding frequent data transfers between the GPU and CPU. Accordingly, rather than increasing the amount of communication with the CPU, computation on the GPU is usually duplicated, and computation is typically overlapped with data communication.

Once the memory transfer overhead has been optimized, developers must optimize the access to the global memory of the device, which represents one of the most important performance issues in CUDA. In general, CUDA applications exploit massive data parallelism in that they process a massive amount of data within a short period of time. Therefore, a CUDA kernel must be able to access a massive amount of data from the global memory within a very short period of time. As memory access is a very slow process, modern DRAMs use a parallel process to increase their data access rate: when a memory location is accessed, many consecutive locations are also accessed. If an application exploits data from multiple, consecutive locations before moving on to other locations, the DRAMs can supply the data at a much higher rate than for accesses to a random sequence of locations. In CUDA, it is possible to take advantage of the fact that threads in a warp execute the same instruction at any given point in time. When all threads in a warp execute a load instruction, the hardware detects whether the threads access consecutive global memory locations. The most favorable access pattern is achieved when the same instruction for all threads in a warp accesses consecutive global memory locations. In this case, the hardware combines, or coalesces, all these accesses into a consolidated access to the DRAMs that requests all the consecutive locations involved. Such a coalesced access allows the DRAMs to deliver data at a rate close to the maximal global memory bandwidth.

Finally, control flow instructions (e.g., the if or switch statements) can significantly affect the instruction throughput when threads within the same warp follow different branches. When executing different branches, either the execution of each path must be serialized or all threads within the warp must execute each instruction, with predication used to mask out the effects of instructions that should not be executed [19]. Thus, kernels should be optimized to avoid excessive use of control flow statements or to ensure that the branches executed will be the same across the whole warp.
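The coalescing issue can be made concrete with a small sketch (ours; the kernels below are illustrative and anticipate the two data layouts used for matching in Section 4). When N threads each read n values, storing the data so that thread tidx reads element i*N+tidx makes the accesses of a warp consecutive, while the layout tidx*n+i spreads them n elements apart:

// Illustrative sketch (ours): the same per-thread reduction with two layouts.
__global__ void sum_strided(const float *data, float *out, int n, int N)
{
    int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    if (tidx < N) {
        float acc = 0.0f;
        for (int i = 0; i < n; i++)
            acc += data[tidx * n + i];   // warp accesses are n elements apart
        out[tidx] = acc;
    }
}

__global__ void sum_coalesced(const float *data, float *out, int n, int N)
{
    int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    if (tidx < N) {
        float acc = 0.0f;
        for (int i = 0; i < n; i++)
            acc += data[i * N + tidx];   // warp accesses consecutive locations
        out[tidx] = acc;
    }
}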
3 The XCS Classifier System
XCS [17] maintains a population of condition-action-prediction rules (or classifiers), which represents the current system's knowledge about a problem solution. Each classifier represents a portion of the overall solution. The classifier's condition identifies a part of the problem domain; the classifier's action represents a decision on the part of the domain identified by its condition; the classifier's prediction p estimates the value of the action in terms of the problem solution. Classifier conditions are usually strings defined over the ternary alphabet {0,1,#}, in which the don't care symbol # indicates that the corresponding position can match either a 0 or a 1. Actions are usually binary strings. XCS applies supervised or reinforcement learning to evaluate the classifiers' prediction and a genetic algorithm to discover better classifiers by selecting, recombining, and mutating existing ones. To guide the evolutionary process, the classifiers keep three additional parameters: the prediction error ε, which estimates the average absolute error of the classifier prediction p; the fitness F, which estimates the average relative accuracy of the payoff prediction given by p and is a function of the prediction error ε; and the numerosity num, which indicates how many copies of classifiers with the same condition and the same action are present in the population.

At time t, XCS builds a match set [M] containing the classifiers in the population [P] whose condition matches the current input s_t; for each classifier, the match procedure scans all the input bits to check whether the classifier condition contains a don't care symbol (#) or an input bit is equal to the corresponding character in the condition. If [M] contains fewer than θ_mna actions, covering takes place and creates a new classifier with a random action and a condition, with a proportion P_# of don't care symbols, that matches s_t. For each possible action a in [M], XCS computes the system prediction P(s_t, a), which estimates the payoff that XCS expects if action a is performed in s_t. The system prediction P(s_t, a) is computed as the fitness-weighted average of the predictions of the classifiers in [M] that advocate action a:

    P(s_t, a) = \sum_{cl_k \in [M](a)} p_k \cdot \frac{F_k}{\sum_{cl_i \in [M](a)} F_i}        (1)
where [M](a) represents the subset of classifiers of [M] with action a, p_k identifies the prediction of classifier cl_k, and F_k identifies the fitness of classifier cl_k. Next, XCS selects an action to perform; the classifiers in [M] that advocate the selected action form the current action set [A]. The selected action a_t is performed, and a scalar reward r_{t+1} is returned to XCS together with a new input s_{t+1}. The incoming reward r_{t+1} is used to compute the estimated payoff P(t) as

    P(t) = r_{t+1} + \gamma \max_{a \in [M]} P(s_{t+1}, a)        (2)
Next, the parameters of the classifiers in [A] are updated [5]. At first, the prediction p is updated with learning rate β (0 ≤ β ≤ 1) as

    p \leftarrow p + \beta (P(t) - p)        (3)
Then, the prediction error ε and the fitness are updated [17,5]. On a regular basis (dependent on the parameter θ_ga), the genetic algorithm is applied to the classifiers in [A]. It selects two classifiers, copies them, and with probability χ performs crossover on the copies; then, with probability μ, it mutates each allele. The resulting offspring classifiers are inserted into the population, and two other classifiers are deleted from the population to keep the population size N constant.
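As a sketch of how the system prediction of Equation (1) can be computed (an illustration of ours, not XCS source code; the struct and field names are assumptions):

#include <vector>

// Illustrative sketch (ours) of the fitness-weighted prediction of Eq. (1).
struct Classifier { int action; double p; double F; };

double system_prediction(const std::vector<Classifier> &match_set, int a)
{
    double weighted = 0.0, fitness_sum = 0.0;
    for (const Classifier &cl : match_set) {
        if (cl.action != a) continue;   // only classifiers in [M](a)
        weighted    += cl.p * cl.F;     // p_k * F_k
        fitness_sum += cl.F;            // accumulates the F_i in the denominator
    }
    return fitness_sum > 0.0 ? weighted / fitness_sum : 0.0;
}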
4 Matching Interval-Based Conditions Using GPUs
Learning classifier systems typically assume that inputs are encoded as binary strings and that classifier conditions are strings defined over the ternary alphabet {0,1,#} [9,8,16,17]. There are, however, several representations that can deal with real-valued inputs: center-based intervals [18], simple intervals [19,15], convex hulls [13], ellipsoids [2], and hyper-ellipsoids [4].

4.1 Interval-Based Conditions and Matching
In the interval-based case [19], a condition is represented by a concatenation of n real interval predicates, int_i = (l_i, u_i); given an input s consisting of n real numbers, a condition matches s if, for every i ∈ {1, ..., n}, the predicate l_i ≤ s_i ∧ s_i ≤ u_i is verified. The matching is straightforward and its pseudo-code is reported as Algorithm 1: the condition (identified by the variable condition) is represented as a vector of intervals; the inputs are a vector of real values (in double precision); the n inputs (i.e., inputs.size()) are scanned and each input is tested against the corresponding interval; the process stops either when all the inputs have matched or as soon as one of the intervals does not match (when result in Algorithm 1 becomes false). Butz et al. [3] showed that this matching procedure can be sped up by changing the order in which the inputs are tested: if smaller (more specific) intervals are tested first, the match is more likely to fail early, which speeds up the matching process. Their results on matching alone showed that this specificity-based matching could produce a 60% speed increase when applied to populations containing classifiers with highly specific conditions. However, they reported no significant improvement when their specificity-based matching was applied to typical testbeds.
Algorithm 1. Matching for interval-based conditions in XCSLib.

// representation of classifier condition
vector<interval> condition;
// representation of classifier inputs
vector<double> inputs;

// matching procedure
int pos = 0;
bool result = true;
while ( (result) && (pos<inputs.size()) )
{
    result = ((inputs[pos]>=condition[pos].lower) &&
              (condition[pos].upper>=inputs[pos]));
    pos++;
}
return result;
4.2 Interval-Based Matching Using CUDA
Implementing interval-based matching using CUDA is straightforward and involves three simple design steps. First, we need to decide how to represent classifier conditions in the graphic board memory; then, we have to decide how parallelization is organized; finally, we need to implement the required kernel functions. Once these steps are performed, the matching of interval-based conditions on the GPU consists of (i) transferring the data to the board memory of the GPU, (ii) invoking the kernels that perform the matching, and finally (iii) retrieving the result from the board memory.

Condition Representation. An interval-based condition can be easily encoded using two arrays of float variables, one to store all the condition's lower bounds and one to store all the condition's upper bounds. Algorithm 2 reports the matching algorithm using the lower and upper bound vectors. We can apply the same principle to encode a population of N classifiers using two matrices of float variables lb and ub which contain all the lower bounds and all the upper bounds of the conditions in the population. Given a problem with n real inputs, the matrices lb and ub can be organized either (i) by rows, putting in each row of the matrices the n lower/upper bounds of the same classifier (Figure 2a), or (ii) by columns, putting in each column of the matrices the n lower/upper bounds of the same classifier (Figure 2b). In both representations, the matrices lb and ub are then linearized into arrays to be stored into the GPU memory. In particular, when the representation by rows is used, the first n values of lb contain the lower bounds of the first classifier condition in the population, while the first n values of ub contain the upper bounds of the same condition. The next n values in lb and ub contain the lower and upper bounds of the second classifier condition, and so on for all the N classifiers in the population. In contrast, when the representation by columns is used, the first N values of lb contain the lower bounds associated to the first input of the N classifiers in the population; similarly, the first N values of ub contain the corresponding upper bounds. The next N values in lb and ub contain the lower and upper bounds associated to the second input, and so on for all the n inputs of the problem.

Algorithm 2. Matching for interval-based conditions using arrays.

// representation of classifier condition
float lb[n];
float ub[n];
// representation of classifier inputs
float inputs[n];

// matching procedure
int pos = 0;
bool result = true;
while ( (result) && (pos<n) )
{
    result = ((inputs[pos]>=lb[pos]) && (ub[pos]>=inputs[pos]));
    pos++;
}
return result;
Fig. 2. Classifier conditions in the GPU global memory are represented as two matrices lb and ub which can be stored (a) by rows or (b) by columns; cl_i represents the variables in the classifier condition; s_i shows which variables should be matched in parallel by the kernel
Matching. To perform matching, the classifier conditions in the population are stored (either by rows or by columns) in the GPU main memory as the two vectors lb and ub of n × N elements each; the current input is stored in the GPU memory as a vector s of n floats. A result vector matched of N integers in the GPU memory is used to store the result of a matching procedure: a 1 in position i means that the condition of classifier cl_i matched the current input; a 0 in the same position means that the condition of cl_i did not match. Matching is then performed by running the matching kernel on the data structures that have been loaded into the device memory.

Memory Organization. As previously noted, the vectors lb and ub can be stored into the device memory by rows (Figure 2a) or by columns (Figure 2b). To maximize the performance of a GPU implementation, at each clock cycle the GPU must access very close memory positions, since the GPU accesses blocks of contiguous memory locations. Note that, while the representation of lb and ub by rows (Figure 2a) appears to be straightforward, it also provides the least parallelization possible. As an example, consider the first two classifiers in the population (cl_0 and cl_1), whose lower bounds are stored in positions 0 to n-1 for cl_0 and n to 2n-1 for cl_1. At the first clock cycle, one kernel will start the matching of the first condition and will access the value in lb[0], while the second kernel will access the value in lb[n] (i.e., the first lower bounds of cl_0 and cl_1). When n is large, these two memory positions will be too distant and will require the GPU to perform two separate memory accesses. Accordingly, the GPU will remain idle for a significant amount of time to access memory. In contrast, if lb and ub are represented by columns (Figure 2b), the same operations will access contiguous memory locations. In fact, at the first clock cycle, one kernel will now access the value in lb[0] (the first lower bound of cl_0), while the second kernel will access the nearby memory position lb[1], where the first lower bound of cl_1 is stored. As a result, the GPU can perform several operations using just one memory access, resulting in the maximum parallelization possible.

Kernels are the basic computation units in CUDA and they are the source of the parallelization. Kernels are executed in parallel on separate GPU cores, grouped into blocks whose size depends on the model of GPU used and must be properly set to achieve the best parallelization. As soon as a core completes the execution of a block of kernels, a new block is assigned to it. In our case, a kernel is in charge of performing the matching of one classifier. Accordingly, the GPU will execute N kernels, one for each classifier in the population. We used blocks of 64 kernels, which we empirically found to be the best block size on the card models we tested. Algorithm 3 shows the kernel for interval-based matching in CUDA when the representation of lb and ub by rows is used. Each kernel reads the condition of a classifier from the device shared memory and checks whether it matches the current input. If a match is found, the position of the matched array in the device memory corresponding to the classifier is set to one; otherwise it is set to zero.

Algorithm 3. Kernel for interval-based matching in CUDA using a row-based representation.

// LB and UB represent the classifier conditions
// n is the size of the input
// N is the population size
__global__ void match( float* LB, float* UB, float *input,
                       int *matched, int n, int N)
{
    // computes position of the classifier condition in the arrays LB and UB
    const unsigned int tidx = threadIdx.x + blockIdx.x*blockDim.x;
    const unsigned int pos = tidx*n;

    if (tidx<N)
    {
        int i = 0;
        bool has_matched = true;
        while (has_matched && i<n)
        {
            has_matched = (input[i] >= LB[pos+i]) && (input[i] <= UB[pos+i]);
            i++;
        }
        matched[tidx] = has_matched;
    }
}
Algorithm 4. Kernel for interval-based matching in CUDA using a column-based representation.

// LB and UB represent the classifier conditions
// n is the size of the input
// N is the population size
__global__ void matchReal( float* LB, float* UB, float *input,
                           int *matched, int n, int N)
{
    // access thread id
    const unsigned int tidx = threadIdx.x + blockIdx.x*blockDim.x;

    if (tidx<N)
    {
        int i = 0;
        bool has_matched = true;
        while (has_matched && i<n)
        {
            has_matched = (input[i] >= LB[i*N+tidx]) && (input[i] <= UB[i*N+tidx]);
            i++;
        }
        matched[tidx] = has_matched;
    }
}
Algorithm 4 shows the kernel for interval-based matching in CUDA when the column-based representation is used. The only difference with respect to the row-based implementation is the computation of the index of the interval to be tested.
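A host-side driver for these kernels might look as follows (a sketch of ours; only the kernel signature and the block size of 64 come from the text above, while the function and buffer names are illustrative; error checking is omitted):

// Illustrative host-side sketch (ours): load the population and the input,
// run the column-based matching kernel, and retrieve the result vector.
void match_on_gpu(const float *lb, const float *ub, const float *input,
                  int *matched, int n, int N)
{
    float *dLB, *dUB, *dIn;
    int   *dMatch;
    cudaMalloc((void**)&dLB, n * N * sizeof(float));
    cudaMalloc((void**)&dUB, n * N * sizeof(float));
    cudaMalloc((void**)&dIn, n * sizeof(float));
    cudaMalloc((void**)&dMatch, N * sizeof(int));

    // the conditions can stay resident on the board; only the input changes
    cudaMemcpy(dLB, lb, n * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dUB, ub, n * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dIn, input, n * sizeof(float), cudaMemcpyHostToDevice);

    const int block = 64;   // block size reported in the text
    matchReal<<<(N + block - 1) / block, block>>>(dLB, dUB, dIn, dMatch, n, N);

    cudaMemcpy(matched, dMatch, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dLB); cudaFree(dUB); cudaFree(dIn); cudaFree(dMatch);
}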
5 Matching Ternary Conditions Using GPUs
Ternary conditions are usually implemented using a character-based encoding that represents conditions as strings of characters and encodes each one of the three symbols {0, 1, #} as a character variable (e.g., as a char in C/C++). Character-based encoding is very simple and for this reason widely used, but it is also highly inefficient in that (1) it wastes 75% of the memory by using 8-bit characters to encode three symbols and (2) it processes input information that is in principle useless [3]. More compact encodings were used in the early days of classifier systems research [7], and similar ones have been recently proposed to speed up the matching using standard CPUs [14]. In their famous classifier system Alecsys [7], Dorigo, Colombetti, and colleagues implemented classifier conditions as arrays of bits packed up inside unsigned integers. In Alecsys, a condition was represented by two arrays, fp and sp, of unsigned integers; a one in the condition was represented by a bit set to one in the same position of fp and sp; a zero was represented by a bit set to zero in the same position of fp and sp; a don't care (#) could be represented either by a 0 in fp and a 1 in sp or by a 1 in fp and a 0 in sp. Given the bit-encoded inputs i, a condition matches if fp^i & sp^i returns a set of zero bits, where ^ is the bitwise exclusive or and & is the bitwise logical and. Algorithm 5 shows the C++ implementation of the encoding used in Alecsys and the corresponding matching, taken from [3]. The condition is represented as two variables, fp and sp, using the Standard Template Library (STL) bitset class [11], which encodes a set of bits; the condition matches if the resulting bitset has all the bits set to zero, i.e., if result.none() returns true.

Algorithm 5. Binary representation and matching.

// representation of classifier condition
bitset<n> fp;
bitset<n> sp;
// representation of classifier inputs
bitset<n> inputs;

// matching procedure
bitset<n> result = ((inputs^fp) & (inputs^sp));
return result.none();

We can apply the same approach we used for interval-based conditions to speed up the matching of ternary conditions using CUDA. For this purpose, we need to modify Alecsys's encoding as follows. A classifier condition is still represented using two arrays, fp and sp, each one representing part of the condition. In the GPU representation, however, the array fp encodes the bit values of the specific positions, while the array sp distinguishes specific from general positions: its bits are set at the specific positions and cleared at the don't care positions. As a result, this encoding reduces the number of bitwise operations needed to match an input bitstring. In fact, given an input bitstring i, a condition matches if the expression fp^i & sp returns all zero bits (^ is the bitwise exclusive or and & is the bitwise logical and). This new matching procedure requires only one bitwise exclusive or and one bitwise and, while, in Alecsys, matching required two bitwise exclusive ors and one bitwise and. This small modification dramatically reduces the number of registers and memory accesses required to perform fast bitwise operations on the GPU. Finally, since the Standard Template Library (STL) bitset is unavailable on GPUs, the arrays fp and sp must be represented as two arrays of unsigned integers, each encoding 32 bits (i.e., the size of integers and unsigned integers in the CUDA specification). The matching procedure for ternary conditions is reported as Algorithm 6. As the algorithm shows, using two arrays of unsigned integers instead of two bitsets allows the matching procedure to stop as soon as a non-matching position is found, as happens in the character-based encoding (see [3]).

Algorithm 6. Improved binary representation and matching.

// representation of classifier condition (m is the number of unsigned integers
// necessary to represent n bits)
unsigned int fp[m];
unsigned int sp[m];
// representation of classifier inputs
unsigned int inputs[m];

// matching procedure
int i = 0;
bool matched = true;
while ( (matched) && (i<m) )
{
    matched = ( ( (fp[i]^inputs[i]) & (sp[i]) ) == 0 );
    i++;
}
return matched;

As in the case of interval-based conditions, also with ternary conditions we can have (i) a row-based matching, in which the unsigned integers representing a condition are stored in subsequent positions, and (ii) a column-based matching, in which the same unsigned integers are stored in positions that are N positions away. The kernels implementing the two approaches are reported as Algorithm 7 and Algorithm 8, respectively.

Algorithm 7. Kernel for the row-based matching on ternary conditions in CUDA.

// fp and sp represent the classifier conditions
// m is the number of integers required to represent the classifier condition (and the input)
// N is the population size
__global__ void matchBinary( int* fp, int* sp, int *input,
                             int *matched, int m, int N)
{
    // access thread id
    const unsigned int tidx = threadIdx.x + blockIdx.x*blockDim.x;
    // base position in fp and sp arrays
    const unsigned int pos = tidx*m;

    if (tidx<N)
    {
        int i = 0;
        bool has_matched = true;
        while (has_matched && i<m)
        {
            has_matched = ( ((fp[pos+i]^input[i]) & sp[pos+i]) == 0 );
            i++;
        }
        matched[tidx] = has_matched;
    }
}

Algorithm 8. Kernel for the column-based matching on ternary conditions in CUDA.

// fp and sp represent the classifier conditions
// m is the number of integers required to represent the classifier condition (and the input)
// N is the population size
__global__ void matchBinary( int* fp, int* sp, int *input,
                             int *matched, int m, int N)
{
    // access thread id
    const unsigned int tidx = threadIdx.x + blockIdx.x*blockDim.x;

    if (tidx<N)
    {
        int i = 0;
        bool has_matched = true;
        while (has_matched && i<m)
        {
            has_matched = ( ((fp[i*N+tidx]^input[i]) & sp[i*N+tidx]) == 0 );
            i++;
        }
        matched[tidx] = has_matched;
    }
}
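As an illustration of this encoding, consider packing a ternary condition string into the fp and sp words on the host; the following sketch is ours (the helper name encode and the loop layout are assumptions, not code from Alecsys or from the paper):

// Illustrative sketch (ours): pack a ternary condition string into fp/sp words.
// fp holds the bit values of the specific positions; sp has a 1 at every
// specific position and a 0 at every don't care position, so that a condition
// matches input i exactly when ((fp ^ i) & sp) is zero in every word.
void encode(const char *cond, int m, unsigned int *fp, unsigned int *sp)
{
    for (int w = 0; w < m; w++) { fp[w] = 0; sp[w] = 0; }
    for (int j = 0; cond[j] != '\0'; j++) {
        unsigned int bit = 1u << (j % 32);
        if (cond[j] == '1')      { fp[j/32] |= bit; sp[j/32] |= bit; }
        else if (cond[j] == '0') { sp[j/32] |= bit; }
        // '#': leave both bits clear so this position is ignored by the match
    }
}

For instance, encode("01##", 1, fp, sp) yields fp[0] = 0b0010 and sp[0] = 0b0011 (bit j corresponds to input bit j): any input whose first bit is 0 and second bit is 1 matches, regardless of the remaining bits.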
6 Experimental Results
We performed two sets of experiments to evaluate the speedup introduced by the GPU-based implementations of matching for interval-based conditions and for ternary conditions.

6.1 Design of Experiments
In this work, we used an experimental design similar to the one applied in [3], which was in turn inspired by the previous work of Llorà and Sastry [14]. We generated a population of N interval-based or ternary conditions of length n with different generality and 1000 random input configurations. For interval-based conditions, the generality of a random population was determined by setting an adequate value of the parameter r0 (see [19] for details); for ternary conditions, generality was set using the don't care probability P_#. We matched each random input against the N conditions using one of the kernels previously discussed and measured the average time required to perform all the match operations using the functions provided by the CUDA distribution. We repeated this procedure 10 times. Overall, we tested two matching kernels (one using the representation by rows and one using the representation by columns) on the CPU (a two quad-core Xeon (2.66 GHz) machine with 8GB of RAM running Linux Fedora Core 6), on a Tesla C1060, and on a GeForce 9600 GT (see Appendix A). The performance was measured as the average CPU time to perform the 1000 matches over the N conditions. The reported average performance takes into account (i) the time to load each one of the 1000 inputs to be matched into the GPU and (ii) the time to move the result vector from the GPU to main CPU memory.
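The average match time can be measured with the event functions of the CUDA distribution; the following sketch is ours (the buffer names refer to the illustrative driver shown at the end of Section 4) and times 1000 match cycles:

// Illustrative timing sketch (ours) using CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
for (int t = 0; t < 1000; t++) {
    cudaMemcpy(dIn, inputs + t * n, n * sizeof(float), cudaMemcpyHostToDevice);
    matchReal<<<(N + 63) / 64, 64>>>(dLB, dUB, dIn, dMatch, n, N);
    cudaMemcpy(matched, dMatch, N * sizeof(int), cudaMemcpyDeviceToHost);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);   // total time for 1000 matches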
6.2 Interval-Based Conditions
Table 1 reports the average matching time using either the CPU, a Tesla C1060 GPU, or a GeForce 9600 GT GPU, when (i) the number of inputs n is 10, 100, or 1000; (ii) the generality is chosen in {0.25, 0.50, 0.75, 0.90, 0.95}; (iii) the population size N is either 1000 (Table 1a), 10000 (Table 1b), or 100000 (Table 1c); and (iv) the representation is either row-based or column-based. As expected, the Tesla C1060 is always faster than the GeForce 9600 GT which, on the other hand, has only 512MB of memory and cannot manage a large population of 100000 classifiers (Table 1c). As anticipated, the column-based representation results in superior performance on GPUs. However, on the CPU, the column-based representation can be 10 times slower than its row-based counterpart. In fact, on the CPU, row-based matching allows for the caching of contiguous data positions (both condition bounds and inputs), which significantly speeds up the matching process. In contrast, column-based matching accesses data positions in a scattered way with respect to the storage, resulting in slower matching on CPUs.
Table 1. Time (ms) required to match 1000 instances when the problem consists of 10, 100 or 1000 real inputs; the population size N is (a) 1000, (b) 10000, and (c) 100000; the population generality gen varies between 0.25 and 0.95; statistics are averages over 10 runs.

(a) N = 1000

n    | gen  | CPUrow        | CPUcol         | TESLArow      | TESLAcol      | GF9600row     | GF9600col
10   | 0.25 | 0.029 ± 0.002 | 0.032 ± 0.001  | 0.047 ± 0.000 | 0.045 ± 0.000 | 0.046 ± 0.002 | 0.044 ± 0.002
10   | 0.50 | 0.032 ± 0.002 | 0.035 ± 0.002  | 0.047 ± 0.000 | 0.045 ± 0.000 | 0.052 ± 0.002 | 0.052 ± 0.006
10   | 0.75 | 0.032 ± 0.003 | 0.037 ± 0.003  | 0.047 ± 0.000 | 0.045 ± 0.000 | 0.056 ± 0.001 | 0.054 ± 0.004
10   | 0.90 | 0.030 ± 0.003 | 0.034 ± 0.002  | 0.047 ± 0.000 | 0.045 ± 0.000 | 0.059 ± 0.002 | 0.055 ± 0.002
10   | 0.95 | 0.029 ± 0.002 | 0.033 ± 0.002  | 0.047 ± 0.000 | 0.045 ± 0.000 | 0.059 ± 0.001 | 0.055 ± 0.001
100  | 0.25 | 0.159 ± 0.010 | 0.170 ± 0.008  | 0.162 ± 0.001 | 0.147 ± 0.000 | 0.246 ± 0.005 | 0.206 ± 0.006
100  | 0.50 | 0.207 ± 0.011 | 0.216 ± 0.014  | 0.167 ± 0.000 | 0.147 ± 0.000 | 0.297 ± 0.001 | 0.252 ± 0.010
100  | 0.75 | 0.237 ± 0.004 | 0.250 ± 0.006  | 0.171 ± 0.000 | 0.147 ± 0.000 | 0.350 ± 0.010 | 0.295 ± 0.011
100  | 0.90 | 0.257 ± 0.005 | 0.268 ± 0.003  | 0.174 ± 0.001 | 0.147 ± 0.000 | 0.381 ± 0.017 | 0.315 ± 0.002
100  | 0.95 | 0.265 ± 0.009 | 0.277 ± 0.009  | 0.175 ± 0.000 | 0.147 ± 0.000 | 0.384 ± 0.005 | 0.325 ± 0.007
1000 | 0.25 | 1.654 ± 0.029 | 7.487 ± 0.121  | 1.463 ± 0.006 | 1.148 ± 0.001 | 3.009 ± 0.056 | 1.678 ± 0.026
1000 | 0.50 | 2.154 ± 0.026 | 9.423 ± 0.094  | 1.588 ± 0.005 | 1.148 ± 0.000 | 3.678 ± 0.051 | 1.982 ± 0.037
1000 | 0.75 | 2.537 ± 0.013 | 11.002 ± 0.073 | 1.659 ± 0.005 | 1.148 ± 0.000 | 4.133 ± 0.038 | 2.222 ± 0.039
1000 | 0.90 | 2.719 ± 0.022 | 11.815 ± 0.065 | 1.694 ± 0.004 | 1.148 ± 0.000 | 4.374 ± 0.028 | 2.362 ± 0.025
1000 | 0.95 | 2.779 ± 0.017 | 12.101 ± 0.049 | 1.703 ± 0.003 | 1.148 ± 0.001 | 4.444 ± 0.033 | 2.392 ± 0.012

(b) N = 10000

n    | gen  | CPUrow         | CPUcol          | TESLArow       | TESLAcol      | GF9600row       | GF9600col
10   | 0.25 | 0.282 ± 0.009  | 0.321 ± 0.009   | 0.136 ± 0.001  | 0.091 ± 0.001 | 0.258 ± 0.006   | 0.204 ± 0.005
10   | 0.50 | 0.312 ± 0.017  | 0.353 ± 0.015   | 0.153 ± 0.001  | 0.091 ± 0.001 | 0.316 ± 0.002   | 0.241 ± 0.001
10   | 0.75 | 0.300 ± 0.003  | 0.345 ± 0.018   | 0.164 ± 0.001  | 0.091 ± 0.001 | 0.365 ± 0.004   | 0.272 ± 0.005
10   | 0.90 | 0.284 ± 0.008  | 0.324 ± 0.011   | 0.168 ± 0.001  | 0.091 ± 0.001 | 0.390 ± 0.007   | 0.291 ± 0.006
10   | 0.95 | 0.277 ± 0.009  | 0.318 ± 0.013   | 0.169 ± 0.001  | 0.091 ± 0.001 | 0.399 ± 0.005   | 0.294 ± 0.004
100  | 0.25 | 2.442 ± 0.041  | 3.651 ± 0.094   | 0.935 ± 0.004  | 0.367 ± 0.001 | 2.244 ± 0.009   | 1.468 ± 0.008
100  | 0.50 | 2.663 ± 0.046  | 4.106 ± 0.077   | 1.067 ± 0.004  | 0.368 ± 0.000 | 2.829 ± 0.007   | 1.878 ± 0.005
100  | 0.75 | 2.799 ± 0.039  | 4.506 ± 0.080   | 1.155 ± 0.002  | 0.369 ± 0.001 | 3.307 ± 0.008   | 2.207 ± 0.005
100  | 0.90 | 2.869 ± 0.026  | 4.643 ± 0.068   | 1.202 ± 0.002  | 0.368 ± 0.001 | 3.576 ± 0.006   | 2.395 ± 0.009
100  | 0.95 | 2.884 ± 0.013  | 4.701 ± 0.067   | 1.218 ± 0.001  | 0.369 ± 0.001 | 3.657 ± 0.005   | 2.448 ± 0.003
1000 | 0.25 | 19.232 ± 0.106 | 121.913 ± 0.574 | 29.569 ± 0.208 | 2.505 ± 0.002 | 117.867 ± 0.535 | 10.975 ± 0.112
1000 | 0.50 | 23.703 ± 0.083 | 158.491 ± 0.753 | 41.444 ± 0.132 | 2.512 ± 0.001 | 164.823 ± 0.680 | 12.382 ± 0.211
1000 | 0.75 | 27.025 ± 0.103 | 189.205 ± 0.427 | 44.579 ± 0.109 | 2.510 ± 0.001 | 199.877 ± 0.362 | 14.107 ± 0.243
1000 | 0.90 | 28.793 ± 0.094 | 205.408 ± 0.642 | 44.872 ± 0.076 | 2.511 ± 0.001 | 216.719 ± 0.288 | 14.912 ± 0.324
1000 | 0.95 | 29.354 ± 0.098 | 210.484 ± 0.627 | 44.839 ± 0.076 | 2.511 ± 0.001 | 222.016 ± 0.283 | 15.206 ± 0.482

(c) N = 100000

n    | gen  | CPUrow          | CPUcol           | TESLArow        | TESLAcol       | GF9600row      | GF9600col
10   | 0.25 | 3.042 ± 0.016   | 5.598 ± 0.092    | 1.079 ± 0.002   | 0.628 ± 0.001  | 2.355 ± 0.009  | 1.795 ± 0.008
10   | 0.50 | 3.290 ± 0.022   | 5.891 ± 0.119    | 1.242 ± 0.003   | 0.630 ± 0.002  | 2.951 ± 0.007  | 2.180 ± 0.011
10   | 0.75 | 3.265 ± 0.017   | 5.733 ± 0.086    | 1.353 ± 0.002   | 0.632 ± 0.002  | 3.433 ± 0.007  | 2.487 ± 0.015
10   | 0.90 | 3.073 ± 0.025   | 5.592 ± 0.077    | 1.396 ± 0.001   | 0.630 ± 0.002  | 3.694 ± 0.013  | 2.659 ± 0.031
10   | 0.95 | 2.983 ± 0.028   | 5.536 ± 0.062    | 1.407 ± 0.001   | 0.630 ± 0.002  | 3.772 ± 0.003  | 2.699 ± 0.007
100  | 0.25 | 29.947 ± 0.086  | 42.500 ± 0.124   | 9.171 ± 0.010   | 3.329 ± 0.001  | 21.414 ± 0.091 | 13.847 ± 0.053
100  | 0.50 | 31.040 ± 0.076  | 47.296 ± 0.149   | 10.499 ± 0.007  | 3.340 ± 0.001  | 27.100 ± 0.084 | 17.631 ± 0.104
100  | 0.75 | 31.086 ± 0.093  | 51.373 ± 0.187   | 11.329 ± 0.010  | 3.341 ± 0.001  | 32.071 ± 0.036 | 20.926 ± 0.074
100  | 0.90 | 30.946 ± 0.061  | 53.472 ± 0.231   | 11.757 ± 0.007  | 3.341 ± 0.001  | 34.769 ± 0.043 | 22.711 ± 0.035
100  | 0.95 | 30.849 ± 0.060  | 54.254 ± 0.205   | 11.900 ± 0.005  | 3.341 ± 0.001  | 35.654 ± 0.026 | 23.304 ± 0.056
1000 | 0.25 | 192.730 ± 0.724 | 1253.090 ± 4.225 | 329.946 ± 0.878 | 23.186 ± 0.075 | -              | -
1000 | 0.50 | 236.868 ± 0.827 | 1640.120 ± 3.299 | 423.478 ± 0.421 | 23.286 ± 0.047 | -              | -
1000 | 0.75 | 270.308 ± 0.831 | 1950.790 ± 3.422 | 445.080 ± 0.353 | 23.306 ± 0.054 | -              | -
1000 | 0.90 | 287.616 ± 0.792 | 2120.160 ± 2.686 | 449.720 ± 0.333 | 23.255 ± 0.052 | -              | -
1000 | 0.95 | 292.706 ± 0.729 | 2177.270 ± 2.225 | 450.194 ± 0.289 | 23.299 ± 0.067 | -              | -
In small populations (i.e., when N = 1000), GPUs provide no speedup on very small problems, when n is 10 or 100. When more variables are involved (n is 1000), GPUs achieve a rather limited speedup (Table 1a). In fact, when we compare the fastest CPU implementation (CPUrow) against the fastest GPU implementation (TESLAcol), we note a speedup of up to 2.42×. However, as the number of classifiers increases, GPUs scale up very well: when N = 10000 the speedup ranges from 3× up to 11× in the largest problems involving 1000 inputs; the speedup is even higher in huge populations containing 100000 classifiers, where it ranges between 4.8× and 12.56×.

Note that the performance of the Tesla C1060 is not influenced by the classifiers' generality. In contrast, the average match time for the CPU increases with the classifiers' generality. This is not surprising and can be easily explained. In the experiments performed, the classifiers' generality ranges between 0.25 and 0.95; thus, at least one out of four classifiers will match. On the GPUs, many matches are run in parallel and the overall matching time depends on the slowest match. Accordingly, even when classifier generality is 0.25, the overall matching time is almost the same. However, this does not happen with the GeForce 9600 GT, where the results are very similar to the ones of the CPU, i.e., matching time increases with the classifiers' generality (as in [3]). This is due to the stricter requirements that the GeForce 9600 GT poses on the memory access pattern. To maximize parallelization with the GeForce 9600 GT, cores need to access memory positions that are both contiguous and adequately aligned, whereas the Tesla C1060 only poses constraints on the former. As more and more matches are performed, the memory access pattern of the GeForce 9600 GT tends to diverge (accessed memory positions become more and more misaligned), resulting in a worsening of the overall performance.

6.3 Ternary Conditions
We repeated a similar set of experiments using ternary conditions. As previously done, we used a CPU, a Tesla C1060 GPU, and a GeForce 9600 GT GPU; the number of binary inputs n was chosen in {32, 512, 1024, 4096, 10240}; the generality, tuned by the parameter P#, was selected from {0.0, 0.25, 0.50, 0.75, 0.99, 1.0}; the population size N was either 1000, 10000, or 100000; we considered both row-based and column-based representations. Table 2, Table 3, and Table 4 report the results.

Table 2. Time (ms) required to match 1000 instances when the problem size is 32, 512, 1024, 4096 or 10240 bits, the population size N is 1000, and the population generality gen is 0.00, 0.25, 0.50, 0.75, 0.99 or 1.0. Data are averages over 10 runs.

n     | P#   | CPUrow        | CPUcol        | TESLArow      | TESLAcol      | GF9600row     | GF9600col
32    | 0.00 | 0.003 ± 0.000 | 0.004 ± 0.000 | 0.034 ± 0.000 | 0.034 ± 0.001 | 0.030 ± 0.003 | 0.028 ± 0.001
32    | 0.25 | 0.003 ± 0.000 | 0.004 ± 0.000 | 0.035 ± 0.000 | 0.034 ± 0.000 | 0.029 ± 0.002 | 0.029 ± 0.002
32    | 0.50 | 0.004 ± 0.000 | 0.004 ± 0.000 | 0.035 ± 0.000 | 0.035 ± 0.000 | 0.029 ± 0.001 | 0.029 ± 0.001
32    | 0.75 | 0.004 ± 0.000 | 0.004 ± 0.000 | 0.035 ± 0.001 | 0.035 ± 0.000 | 0.029 ± 0.001 | 0.029 ± 0.002
32    | 0.99 | 0.006 ± 0.000 | 0.004 ± 0.000 | 0.035 ± 0.001 | 0.035 ± 0.001 | 0.029 ± 0.001 | 0.029 ± 0.002
32    | 1.00 | 0.004 ± 0.000 | 0.004 ± 0.000 | 0.035 ± 0.001 | 0.035 ± 0.000 | 0.029 ± 0.001 | 0.029 ± 0.002
512   | 0.00 | 0.006 ± 0.000 | 0.005 ± 0.000 | 0.037 ± 0.000 | 0.035 ± 0.000 | 0.032 ± 0.001 | 0.031 ± 0.001
512   | 0.25 | 0.006 ± 0.000 | 0.005 ± 0.000 | 0.037 ± 0.001 | 0.035 ± 0.000 | 0.033 ± 0.001 | 0.033 ± 0.002
512   | 0.50 | 0.006 ± 0.000 | 0.005 ± 0.000 | 0.037 ± 0.000 | 0.035 ± 0.000 | 0.033 ± 0.002 | 0.032 ± 0.002
512   | 0.75 | 0.006 ± 0.001 | 0.005 ± 0.000 | 0.038 ± 0.000 | 0.037 ± 0.000 | 0.033 ± 0.002 | 0.032 ± 0.001
512   | 0.99 | 0.034 ± 0.002 | 0.038 ± 0.003 | 0.051 ± 0.000 | 0.050 ± 0.000 | 0.052 ± 0.001 | 0.050 ± 0.001
512   | 1.00 | 0.041 ± 0.003 | 0.047 ± 0.004 | 0.061 ± 0.000 | 0.050 ± 0.001 | 0.085 ± 0.002 | 0.077 ± 0.001
1024  | 0.00 | 0.007 ± 0.001 | 0.006 ± 0.000 | 0.038 ± 0.001 | 0.037 ± 0.001 | 0.035 ± 0.001 | 0.035 ± 0.002
1024  | 0.25 | 0.007 ± 0.001 | 0.006 ± 0.000 | 0.037 ± 0.000 | 0.037 ± 0.001 | 0.037 ± 0.003 | 0.034 ± 0.001
1024  | 0.50 | 0.007 ± 0.001 | 0.006 ± 0.000 | 0.038 ± 0.001 | 0.037 ± 0.000 | 0.036 ± 0.002 | 0.036 ± 0.003
1024  | 0.75 | 0.008 ± 0.001 | 0.006 ± 0.000 | 0.039 ± 0.000 | 0.038 ± 0.001 | 0.038 ± 0.002 | 0.037 ± 0.002
1024  | 0.99 | 0.037 ± 0.002 | 0.037 ± 0.002 | 0.064 ± 0.000 | 0.066 ± 0.000 | 0.066 ± 0.004 | 0.063 ± 0.001
1024  | 1.00 | 0.080 ± 0.005 | 0.091 ± 0.007 | 0.069 ± 0.001 | 0.067 ± 0.001 | 0.152 ± 0.011 | 0.133 ± 0.004
4096  | 0.00 | 0.016 ± 0.001 | 0.014 ± 0.001 | 0.047 ± 0.001 | 0.046 ± 0.001 | 0.058 ± 0.003 | 0.055 ± 0.003
4096  | 0.25 | 0.015 ± 0.001 | 0.014 ± 0.001 | 0.047 ± 0.001 | 0.045 ± 0.001 | 0.057 ± 0.002 | 0.054 ± 0.002
4096  | 0.50 | 0.015 ± 0.001 | 0.013 ± 0.001 | 0.046 ± 0.001 | 0.045 ± 0.001 | 0.058 ± 0.002 | 0.053 ± 0.000
4096  | 0.75 | 0.016 ± 0.001 | 0.013 ± 0.001 | 0.048 ± 0.001 | 0.046 ± 0.001 | 0.057 ± 0.002 | 0.056 ± 0.003
4096  | 0.99 | 0.047 ± 0.003 | 0.045 ± 0.002 | 0.086 ± 0.001 | 0.088 ± 0.001 | 0.102 ± 0.005 | 0.093 ± 0.002
4096  | 1.00 | 0.312 ± 0.004 | 0.350 ± 0.011 | 0.276 ± 0.001 | 0.170 ± 0.001 | 0.633 ± 0.002 | 0.436 ± 0.002
10240 | 0.00 | 0.031 ± 0.002 | 0.029 ± 0.002 | 0.061 ± 0.002 | 0.061 ± 0.002 | 0.096 ± 0.002 | 0.094 ± 0.004
10240 | 0.25 | 0.031 ± 0.002 | 0.029 ± 0.002 | 0.062 ± 0.002 | 0.062 ± 0.002 | 0.098 ± 0.005 | 0.093 ± 0.002
10240 | 0.50 | 0.031 ± 0.002 | 0.029 ± 0.002 | 0.062 ± 0.002 | 0.061 ± 0.002 | 0.099 ± 0.005 | 0.095 ± 0.004
10240 | 0.75 | 0.032 ± 0.002 | 0.029 ± 0.002 | 0.063 ± 0.002 | 0.061 ± 0.002 | 0.097 ± 0.003 | 0.095 ± 0.003
10240 | 0.99 | 0.067 ± 0.004 | 0.064 ± 0.004 | 0.101 ± 0.002 | 0.103 ± 0.002 | 0.142 ± 0.004 | 0.133 ± 0.003
10240 | 1.00 | 0.768 ± 0.002 | 3.491 ± 0.020 | 0.435 ± 0.001 | 0.373 ± 0.001 | 1.335 ± 0.002 | 0.969 ± 0.003
Table 3. Time (ms) required to match 1000 instances when the problem size is 32, 512, 1024, 4096 or 10240 bits, the population size N is 10000, and the population generality gen is 0.00, 0.25, 0.50, 0.75, 0.99 or 1.0. Data are averages over 10 runs.

n     | P#   | CPUrow        | CPUcol         | TESLArow      | TESLAcol      | GF9600row      | GF9600col
32    | 0.00 | 0.034 ± 0.002 | 0.036 ± 0.002  | 0.065 ± 0.001 | 0.066 ± 0.000 | 0.075 ± 0.002  | 0.075 ± 0.002
32    | 0.25 | 0.034 ± 0.002 | 0.038 ± 0.003  | 0.066 ± 0.001 | 0.066 ± 0.001 | 0.076 ± 0.003  | 0.077 ± 0.003
32    | 0.50 | 0.032 ± 0.002 | 0.037 ± 0.003  | 0.066 ± 0.001 | 0.066 ± 0.001 | 0.075 ± 0.003  | 0.076 ± 0.003
32    | 0.75 | 0.035 ± 0.003 | 0.036 ± 0.002  | 0.065 ± 0.001 | 0.066 ± 0.001 | 0.077 ± 0.004  | 0.077 ± 0.004
32    | 0.99 | 0.061 ± 0.004 | 0.036 ± 0.002  | 0.066 ± 0.001 | 0.066 ± 0.001 | 0.078 ± 0.003  | 0.079 ± 0.010
32    | 1.00 | 0.037 ± 0.003 | 0.037 ± 0.003  | 0.066 ± 0.001 | 0.066 ± 0.001 | 0.075 ± 0.003  | 0.077 ± 0.003
512   | 0.00 | 0.050 ± 0.002 | 0.037 ± 0.003  | 0.078 ± 0.001 | 0.066 ± 0.001 | 0.088 ± 0.002  | 0.079 ± 0.002
512   | 0.25 | 0.054 ± 0.004 | 0.036 ± 0.002  | 0.079 ± 0.001 | 0.065 ± 0.001 | 0.090 ± 0.002  | 0.080 ± 0.003
512   | 0.50 | 0.053 ± 0.004 | 0.040 ± 0.003  | 0.079 ± 0.001 | 0.066 ± 0.001 | 0.090 ± 0.003  | 0.077 ± 0.002
512   | 0.75 | 0.056 ± 0.004 | 0.040 ± 0.003  | 0.081 ± 0.001 | 0.067 ± 0.001 | 0.091 ± 0.003  | 0.082 ± 0.004
512   | 0.99 | 0.322 ± 0.010 | 0.333 ± 0.011  | 0.173 ± 0.001 | 0.109 ± 0.001 | 0.284 ± 0.003  | 0.224 ± 0.001
512   | 1.00 | 0.375 ± 0.000 | 0.429 ± 0.011  | 0.323 ± 0.001 | 0.111 ± 0.001 | 0.625 ± 0.011  | 0.456 ± 0.011
1024  | 0.00 | 0.072 ± 0.010 | 0.040 ± 0.003  | 0.075 ± 0.001 | 0.067 ± 0.001 | 0.095 ± 0.003  | 0.084 ± 0.003
1024  | 0.25 | 0.071 ± 0.008 | 0.040 ± 0.003  | 0.075 ± 0.001 | 0.067 ± 0.001 | 0.093 ± 0.003  | 0.083 ± 0.004
1024  | 0.50 | 0.067 ± 0.007 | 0.039 ± 0.003  | 0.075 ± 0.001 | 0.066 ± 0.001 | 0.098 ± 0.010  | 0.084 ± 0.004
1024  | 0.75 | 0.072 ± 0.011 | 0.042 ± 0.003  | 0.076 ± 0.001 | 0.068 ± 0.001 | 0.095 ± 0.003  | 0.084 ± 0.003
1024  | 0.99 | 0.359 ± 0.007 | 0.381 ± 0.016  | 0.194 ± 0.001 | 0.131 ± 0.001 | 0.330 ± 0.007  | 0.260 ± 0.002
1024  | 1.00 | 0.747 ± 0.007 | 0.877 ± 0.015  | 0.423 ± 0.001 | 0.163 ± 0.001 | 1.194 ± 0.004  | 0.854 ± 0.003
4096  | 0.00 | 0.728 ± 0.008 | 0.048 ± 0.004  | 0.090 ± 0.001 | 0.076 ± 0.001 | 0.126 ± 0.003  | 0.102 ± 0.003
4096  | 0.25 | 0.732 ± 0.003 | 0.047 ± 0.003  | 0.090 ± 0.001 | 0.076 ± 0.002 | 0.125 ± 0.004  | 0.103 ± 0.004
4096  | 0.50 | 0.730 ± 0.004 | 0.049 ± 0.003  | 0.091 ± 0.002 | 0.076 ± 0.002 | 0.126 ± 0.003  | 0.101 ± 0.003
4096  | 0.75 | 0.730 ± 0.003 | 0.051 ± 0.004  | 0.092 ± 0.002 | 0.078 ± 0.002 | 0.134 ± 0.017  | 0.103 ± 0.003
4096  | 0.99 | 1.582 ± 0.067 | 0.427 ± 0.009  | 0.255 ± 0.001 | 0.164 ± 0.002 | 0.450 ± 0.002  | 0.300 ± 0.002
4096  | 1.00 | 3.571 ± 0.075 | 7.691 ± 0.075  | 2.259 ± 0.001 | 0.467 ± 0.001 | 5.985 ± 0.001  | 3.218 ± 0.002
10240 | 0.00 | 0.360 ± 0.003 | 0.060 ± 0.003  | 0.100 ± 0.002 | 0.090 ± 0.002 | 0.197 ± 0.005  | 0.140 ± 0.003
10240 | 0.25 | 0.361 ± 0.002 | 0.062 ± 0.005  | 0.102 ± 0.002 | 0.091 ± 0.003 | 0.196 ± 0.003  | 0.139 ± 0.002
10240 | 0.50 | 0.362 ± 0.004 | 0.061 ± 0.004  | 0.102 ± 0.002 | 0.090 ± 0.002 | 0.195 ± 0.001  | 0.142 ± 0.007
10240 | 0.75 | 0.364 ± 0.002 | 0.066 ± 0.004  | 0.103 ± 0.002 | 0.092 ± 0.002 | 0.198 ± 0.004  | 0.139 ± 0.002
10240 | 0.99 | 1.483 ± 0.013 | 0.452 ± 0.009  | 0.268 ± 0.003 | 0.178 ± 0.001 | 0.610 ± 0.007  | 0.341 ± 0.005
10240 | 1.00 | 9.350 ± 0.106 | 67.344 ± 0.055 | 4.421 ± 0.001 | 0.998 ± 0.001 | 21.601 ± 0.007 | 7.764 ± 0.007
Table 4. Time (ms) required to match 1000 instances when the problem size is 32, 512, 1024, 4096 or 10240 bits, the population size N is 100000, and the population generality gen is 0.00, 0.25, 0.50, 0.75, 0.99 or 1.0. Data are averages over 10 runs.

 n      P#    CPUrow           CPUcol            TESLArow         TESLAcol        GF9600row        GF9600col
 32     0.00  0.317 ± 0.009    0.349 ± 0.003     0.358 ± 0.003    0.357 ± 0.003   0.548 ± 0.017    0.542 ± 0.005
 32     0.25  0.314 ± 0.000    0.348 ± 0.001     0.356 ± 0.003    0.357 ± 0.003   0.551 ± 0.014    0.553 ± 0.020
 32     0.50  0.314 ± 0.000    0.351 ± 0.009     0.358 ± 0.003    0.359 ± 0.003   0.541 ± 0.005    0.546 ± 0.011
 32     0.75  0.336 ± 0.003    0.351 ± 0.008     0.360 ± 0.003    0.357 ± 0.004   0.550 ± 0.014    0.550 ± 0.013
 32     0.99  0.568 ± 0.009    0.348 ± 0.000     0.359 ± 0.003    0.357 ± 0.003   0.548 ± 0.011    0.559 ± 0.023
 32     1.00  0.339 ± 0.000    0.348 ± 0.001     0.359 ± 0.003    0.358 ± 0.003   0.563 ± 0.017    0.573 ± 0.031
 512    0.00  4.226 ± 0.115    0.353 ± 0.009     0.494 ± 0.003    0.359 ± 0.004   0.681 ± 0.030    0.560 ± 0.017
 512    0.25  4.276 ± 0.046    0.349 ± 0.001     0.493 ± 0.002    0.357 ± 0.003   0.658 ± 0.009    0.553 ± 0.009
 512    0.50  4.195 ± 0.229    0.352 ± 0.003     0.494 ± 0.001    0.358 ± 0.004   0.662 ± 0.013    0.548 ± 0.011
 512    0.75  4.276 ± 0.061    0.384 ± 0.001     0.499 ± 0.002    0.366 ± 0.003   0.662 ± 0.006    0.554 ± 0.007
 512    0.99  4.819 ± 0.043    7.296 ± 0.214     1.445 ± 0.004    0.789 ± 0.002   2.617 ± 0.023    2.015 ± 0.018
 512    1.00  4.644 ± 0.106    8.449 ± 0.191     2.949 ± 0.001    0.817 ± 0.001   5.959 ± 0.014    4.289 ± 0.089
 1024   0.00  5.255 ± 0.090    0.350 ± 0.000     0.440 ± 0.002    0.360 ± 0.003   0.674 ± 0.019    0.550 ± 0.011
 1024   0.25  5.269 ± 0.007    0.350 ± 0.001     0.441 ± 0.004    0.359 ± 0.003   0.669 ± 0.007    0.570 ± 0.050
 1024   0.50  5.244 ± 0.088    0.352 ± 0.001     0.437 ± 0.001    0.359 ± 0.003   0.649 ± 0.004    0.534 ± 0.003
 1024   0.75  5.241 ± 0.088    0.386 ± 0.001     0.447 ± 0.004    0.366 ± 0.002   0.660 ± 0.005    0.544 ± 0.005
 1024   0.99  9.374 ± 0.013    9.308 ± 0.010     1.589 ± 0.005    0.979 ± 0.001   2.905 ± 0.005    2.257 ± 0.010
 1024   1.00  9.377 ± 0.108    16.793 ± 0.074    3.915 ± 0.005    1.311 ± 0.001   11.650 ± 0.016   8.243 ± 0.016
 4096   0.00  7.333 ± 0.010    0.361 ± 0.001     0.517 ± 0.001    0.365 ± 0.003   0.809 ± 0.010    0.561 ± 0.012
 4096   0.25  7.338 ± 0.002    0.360 ± 0.001     0.517 ± 0.001    0.364 ± 0.000   0.801 ± 0.003    0.555 ± 0.003
 4096   0.50  7.332 ± 0.008    0.363 ± 0.002     0.517 ± 0.001    0.364 ± 0.000   0.803 ± 0.003    0.559 ± 0.014
 4096   0.75  7.331 ± 0.011    0.507 ± 0.033     0.527 ± 0.001    0.373 ± 0.001   0.817 ± 0.004    0.563 ± 0.004
 4096   0.99  19.103 ± 0.012   9.451 ± 0.014     1.999 ± 0.003    1.082 ± 0.002   3.924 ± 0.008    2.411 ± 0.007
 4096   1.00  37.569 ± 0.185   115.499 ± 0.296   22.712 ± 0.004   4.239 ± 0.001   58.887 ± 0.041   30.017 ± 0.039
 10240  0.00  5.711 ± 0.021    0.391 ± 0.004     0.737 ± 0.001    0.378 ± 0.001   -                -
 10240  0.25  5.710 ± 0.019    0.392 ± 0.004     0.737 ± 0.001    0.379 ± 0.002   -                -
 10240  0.50  5.718 ± 0.014    0.396 ± 0.001     0.737 ± 0.001    0.379 ± 0.001   -                -
 10240  0.75  5.727 ± 0.036    0.566 ± 0.007     0.744 ± 0.001    0.386 ± 0.001   -                -
 10240  0.99  19.537 ± 0.074   9.496 ± 0.018     2.392 ± 0.006    1.095 ± 0.004   -                -
 10240  1.00  93.529 ± 0.208   690.014 ± 0.985   68.851 ± 0.020   9.764 ± 0.003   -                -
report the average matching time for one condition when N is 1000 (Table 2), 10000 (Table 3), and 100000 (Table 4), respectively. The results confirm several of the previous findings. Column-based matching outperforms row-based matching on GPUs, and the Tesla C1060 is generally faster than the GeForce 9600 GT, as expected. Again, with the smaller population, the CPU is generally faster than both GPUs. In addition, even with 10000 classifiers, the CPU is faster when only 32 or 512 binary inputs are considered (i.e., when conditions are represented by 1 to 16 unsigned integers); as the population size or the number of inputs increases, the GPUs outperform the CPU. When P# ≥ 0.99, the speedup provided by the Tesla C1060 with respect to the column-based implementation on the CPU can be close to 50×. Compared to the row-based implementation on the CPU, the results show that the Tesla C1060 implementation outperforms the CPU on medium and large problems (when n > 512 and N ≥ 10000) with a speedup of nearly 20×. As before, column-based matching outperforms row-based matching on GPUs. However, while with interval-based conditions the CPU performed best with the row-based implementation, in this case the column-based implementation always performs better except when classifiers are fully general (i.e., P# = 1.0). To understand this result, we need to consider the memory access patterns of the two implementations. When P# is not very high, i.e., P# < 0.99, the probability of matching is close to zero as soon as more than a few dozen inputs are considered (see footnote 3). Accordingly, the matching process is very likely to stop early, before the first 100 bits have been tested; this is why the average matching times of classifiers with P# in the range [0, 0.75] are very close. As a result, with the column-based representation only the small memory areas where the first bits are stored are accessed. Thus, in this case, cache locality is exploited across the matching of several classifiers: as matching involves only a few initial inputs for each classifier, once the data is loaded for matching one classifier it is readily available for the following ones. In contrast, in the row-based implementation the memory accesses spread over the whole memory area where the classifiers are allocated. On the other hand, when the classifiers are fully general (i.e., when P# = 1.0), matching involves all the inputs of all the classifiers. Accordingly, locality is fully exploited by the row-based implementation, which performs a sequential memory access pattern, whereas the memory access pattern of the column-based representation becomes highly inefficient.
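To make the locality argument concrete, the following sketch contrasts the two layouts for bit-encoded ternary conditions. The care/value mask encoding and the array sizes are illustrative assumptions of ours, not the paper's exact data structures; what matters is where early termination leaves the memory accesses.

```c
/* Row- vs column-major matching of bit-encoded ternary conditions.
 * Assumed encoding: per 32-bit word, `care` marks the specified bits
 * and `val` their required values (an illustrative choice). */
#include <stdint.h>

#define N_CLS   10000  /* population size (illustrative)           */
#define N_WORDS 32     /* 32 words of 32 bits = 1024 binary inputs */

/* Row-major: all words of one classifier lie contiguously. */
int match_row(const uint32_t care[N_CLS][N_WORDS],
              const uint32_t val[N_CLS][N_WORDS],
              const uint32_t x[N_WORDS], int cl)
{
    for (int w = 0; w < N_WORDS; w++)
        if ((x[w] ^ val[cl][w]) & care[cl][w])
            return 0;              /* early exit on first mismatch */
    return 1;
}

/* Column-major: word w of all classifiers lies contiguously, so when
 * matching stops after a few words, successive classifiers keep
 * hitting the same small, cache-resident region. */
void match_col(const uint32_t care[N_WORDS][N_CLS],
               const uint32_t val[N_WORDS][N_CLS],
               const uint32_t x[N_WORDS], uint8_t out[N_CLS])
{
    for (int cl = 0; cl < N_CLS; cl++) {
        out[cl] = 1;
        for (int w = 0; w < N_WORDS; w++)
            if ((x[w] ^ val[w][cl]) & care[w][cl]) {
                out[cl] = 0;       /* early exit on first mismatch */
                break;
            }
    }
}
```

With low P#, the inner loops almost always break at small w, which touches only the first rows of the column-major arrays but one scattered region per classifier in the row-major layout; with P# = 1.0 the loop runs to completion and the row-major layout becomes the sequential, cache-friendly one.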
7 Conclusions
In this paper, we studied GPU-based parallelization of the matching process in learning classifier systems for real inputs (using interval-based conditions) and binary
³ The probability that a classifier generated with don't-care probability P# matches an input with n bits is ((1 + P#)/2)^n; thus, when P# = 0.75, the probability of matching an input of size n = 100 is lower than 10⁻⁵ (0.875^100 ≈ 1.6 · 10⁻⁶).
inputs (using ternary conditions). In particular, we applied NVIDIA's Compute Unified Device Architecture (CUDA) to implement matching procedures that exploit the massive parallelism available in GPUs. Our results show that in small problems, CPU-based matching is faster due to the transfer overhead introduced by GPUs. However, as the problem size increases, the transfer overhead becomes less significant with respect to the time gained through parallelization. Accordingly, GPU-based matching significantly outperforms CPU-based matching, providing a 3-12× speedup on the interval-based representation and a 20-50× speedup on the ternary representation.
A Device Specifications

Table 5. Specification of GeForce 9600GT
Major revision number:                          1
Minor revision number:                          1
Total amount of global memory:                  536608768 bytes
Number of multiprocessors:                      8
Number of cores:                                64
Total amount of constant memory:                65536 bytes
Total amount of shared memory per block:        16384 bytes
Total number of registers available per block:  8192
Warp size:                                      32
Maximum number of threads per block:            512
Maximum sizes of each dimension of a block:     512 x 512 x 64
Maximum sizes of each dimension of a grid:      65535 x 65535 x 1
Maximum memory pitch:                           262144 bytes
Texture alignment:                              256 bytes
Clock rate:                                     1.60 GHz
Concurrent copy and execution:                  Yes
Table 6. Specification of Tesla C1060

Major revision number:                          1
Minor revision number:                          3
Total amount of global memory:                  4294705152 bytes
Number of multiprocessors:                      30
Number of cores:                                240
Total amount of constant memory:                65536 bytes
Total amount of shared memory per block:        16384 bytes
Total number of registers available per block:  16384
Warp size:                                      32
Maximum number of threads per block:            512
Maximum sizes of each dimension of a block:     512 x 512 x 64
Maximum sizes of each dimension of a grid:      65535 x 65535 x 1
Maximum memory pitch:                           262144 bytes
Texture alignment:                              256 bytes
Clock rate:                                     1.30 GHz
Concurrent copy and execution:                  Yes
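For reference, Tables 5 and 6 can be reproduced with a small host program based on cudaGetDeviceProperties from the CUDA runtime API. The sketch below prints the same fields; only the core count is derived rather than reported, assuming the 8 cores per multiprocessor of these compute-capability 1.x devices.

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int d = 0; d < n; d++) {
        struct cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("Device %d: %s\n", d, p.name);
        printf("  Revision (major.minor):   %d.%d\n", p.major, p.minor);
        printf("  Global memory:            %zu bytes\n", p.totalGlobalMem);
        printf("  Multiprocessors:          %d\n", p.multiProcessorCount);
        printf("  Cores (8 per SM assumed): %d\n", 8 * p.multiProcessorCount);
        printf("  Constant memory:          %zu bytes\n", p.totalConstMem);
        printf("  Shared memory per block:  %zu bytes\n", p.sharedMemPerBlock);
        printf("  Registers per block:      %d\n", p.regsPerBlock);
        printf("  Warp size:                %d\n", p.warpSize);
        printf("  Max threads per block:    %d\n", p.maxThreadsPerBlock);
        printf("  Max block dimensions:     %d x %d x %d\n",
               p.maxThreadsDim[0], p.maxThreadsDim[1], p.maxThreadsDim[2]);
        printf("  Max grid dimensions:      %d x %d x %d\n",
               p.maxGridSize[0], p.maxGridSize[1], p.maxGridSize[2]);
        printf("  Max memory pitch:         %zu bytes\n", p.memPitch);
        printf("  Texture alignment:        %zu bytes\n", p.textureAlignment);
        printf("  Clock rate:               %.2f GHz\n", p.clockRate * 1e-6);
        printf("  Concurrent copy and exec: %s\n",
               p.deviceOverlap ? "Yes" : "No");
    }
    return 0;
}
```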
Evolution of Interesting Association Rules Online with Learning Classifier Systems

Albert Orriols-Puig¹ and Jorge Casillas²

¹ Grup de Recerca en Sistemes Intel·ligents, La Salle - Universitat Ramon Llull, Quatre Camins 2, 08022 Barcelona, Spain
[email protected]
² Dept. Computer Science and Artificial Intelligence, University of Granada, 18071 Granada, Spain
[email protected]
Abstract. This paper presents CSar, a Michigan-style learning classifier system designed to extract quantitative association rules from streams of unlabeled examples. The main novelty of CSar with respect to existing association rule miners is that it evolves its knowledge online and is thus prepared to adapt quickly and efficiently to changes in the variable associations hidden in the stream of unlabeled data. The results provided in this paper show that CSar is able to evolve interesting rules on problems that consist of both categorical and continuous attributes. Moreover, the comparison of CSar with Apriori on a problem consisting only of categorical attributes highlights the competitiveness of CSar with respect to more specific learners that perform enumeration to return all possible association rules. These promising results encourage us to investigate CSar further.
1 Introduction
Association rule mining [2] aims at extracting interesting associations among the attributes of repositories of unlabeled data, i.e., associations that occur with a certain frequency and strength. Research on association rule mining originally focused on extracting rules that identified strong relationships between the occurrence of two or more attributes or items in collections of binary data, e.g., "if item X occurs then item Y will also occur" [2,3,14]. Later on, several researchers concentrated on extracting association rules from data described by continuous attributes [10,22], which posed new challenges to the field. Several algorithms applied a discretization method in advance to transform the original data into binary values [16,18,22,24] and then used a binary association rule miner; this led to further research on designing discretization procedures that avoid losing useful information. Other approaches mined interval-based association rules and permitted the algorithm to independently move the interval
bound of each rule's variable [17]. Also, fuzzy modeling was introduced to create fuzzy association rules (e.g., see [13,15]). Association rules are widely used in areas such as telecommunication networks, market and risk management, and inventory control. All these applications are characterized by generating data online, so that data may be made available in the form of streams [1,19]. Nonetheless, all the aforementioned algorithms were designed for static collections of data. Learning from data streams has received considerable attention in the last few years, particularly in supervised learning [1,19]. However, few proposals of online association rule miners can be found in the literature, and most of them are only able to deal with problems with categorical attributes (e.g., see [23]). In this paper, we address the problem of mining association rules from streams of examples online. We propose a learning classifier system (LCS) whose architecture is inspired by XCS [25,26] and UCS [6], which we refer to as the classifier system for association rule mining (CSar). CSar uses an interval-based representation for evolving quantitative association rules from data with continuous attributes and a discrete representation for categorical attributes. The system receives a stream of unlabeled examples, which are used to create new rules and to tune the parameters of the existing ones, with the aim of evolving as many interesting rules as possible. CSar is first compared with Apriori [3] on a problem defined only by categorical attributes. The results on this problem indicate that CSar can evolve rules of similar interest to those created by Apriori, one of the most cited algorithms in the association rule mining realm, which considers all possible combinations of attribute values to create all interesting association rules (notice that this approach can only be used in domains with categorical data). The experimentation is then extended by considering a collection of real-world problems and by analyzing the behavior of different configurations of CSar on these problems. The results show that CSar is able to create highly supported and interesting interval-based association rules in which the intervals have not been prefixed by a discretization algorithm. The remainder of this paper is organized as follows. Section 2 provides the basic concepts of association rules and reviews the main proposals in the literature for both binary and quantitative association rule mining. Section 3 describes our proposal in detail. Section 4 explains the methodology followed in the experiments, and Section 5 analyzes the results of these experiments. Finally, Section 6 summarizes and concludes, and outlines future lines of work.
2 Framework
Before proceeding with the description of our proposal, this section introduces some important concepts of association rules. We first describe the problem of extracting association rules from categorical data. Then, we extend the problem to mining association rules from data with continuous attributes and review different proposals that can be found in the literature.
2.1 Association Rule Mining
The problem of association rule mining was first defined over binary data in [2] as follows. Let I = {i1, i2, ..., iℓ} be a set of binary attributes called items. Let T be a set of transactions, where each transaction t is represented as a binary vector of length ℓ. Each position i of t indicates whether item i is present (ti = 1) or not (ti = 0) in the transaction. X is an itemset if X ⊆ I. An itemset X has a support supp(X), which is computed as

supp(X) = |X(T)| / |T|,   (1)

that is, the number of transactions in the database that contain the itemset X, denoted |X(T)|, divided by the total number of transactions in the database, |T|. An itemset is said to be a frequent itemset if its support is greater than a user-set threshold, typically referred to as minsupp in the literature. An association rule R is then an implication of the form X → Y, where both X and Y are itemsets and X ∩ Y = ∅. Typically, association rules are assessed with two quality measures, their support (supp) and their confidence (conf). The support of a rule is defined as the ratio of the support of the union of antecedent and consequent to the number of transactions in the database, i.e.,

supp(R) = supp(X ∪ Y) / |T|.   (2)

The confidence is computed as the ratio of the support of the union of antecedent and consequent to the support of the antecedent, i.e.,

conf(R) = supp(X ∪ Y) / supp(X).   (3)

Therefore, support indicates the frequency of occurring patterns, and confidence evaluates the strength of the implication denoted by the association rule. Since the proposal of AIS [2], the first algorithm to mine association rules from categorical examples, several algorithms have been designed to perform this task. Agrawal et al. [3] presented the Apriori algorithm, probably the most influential categorical association rule miner. This work resulted in several papers proposing modifications to the initial Apriori algorithm (e.g., see [8,21]). All these algorithms used the same methodology as Apriori to mine association rules, which basically consisted of two phases: (1) identification of all frequent itemsets (i.e., all itemsets whose support is greater than minsupp), and (2) generation of association rules from these frequent itemsets.
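A minimal C sketch of these measures, assuming for simplicity that each transaction is stored as a bitmask over at most 32 items (this representation is our own choice, not prescribed by the text):

```c
#include <stdint.h>

/* Number of transactions in T that contain every item of `itemset`. */
static int supp_count(const uint32_t *T, int nT, uint32_t itemset)
{
    int c = 0;
    for (int t = 0; t < nT; t++)
        if ((T[t] & itemset) == itemset)
            c++;
    return c;
}

/* supp(X) as in Eq. (1). */
double support(const uint32_t *T, int nT, uint32_t X)
{
    return (double)supp_count(T, nT, X) / (double)nT;
}

/* conf(X -> Y) as in Eq. (3): supp(X u Y) / supp(X). */
double confidence(const uint32_t *T, int nT, uint32_t X, uint32_t Y)
{
    int sx = supp_count(T, nT, X);
    return sx ? (double)supp_count(T, nT, X | Y) / (double)sx : 0.0;
}
```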
2.2 Quantitative Association Rules
Early research in the realm of association rules only addressed the problem of extracting association rules from binary data. Therefore, these types of rules could only reflect whether particular items were present in the transaction, but they did not consider their quantities. Later on, researchers focused
on algorithms that were able to extract association rules from databases containing quantitative attributes. Srikant and Agrawal [22] designed an Apriori-like approach to mine quantitative association rules. The authors used equi-depth partitioning to transform continuous attributes into categorical ones. Moreover, they identified the problem of the sharp boundary between discrete intervals, highlighting that quantitative mining algorithms may either ignore or over-emphasize items that lie near the boundary of intervals. Attempting to address this problem, several authors applied different clustering mechanisms to extract the best possible intervals from the data [16,18]. A completely different approach was taken in [17], where a genetic-algorithm-based technique was used to evolve interval-based association rules without applying any discretization procedure to the variables. The GA was responsible for creating new promising association rules and for evolving the intervals of the variables of the association rules. The problem associated with creating variables with unbounded intervals is that, in general, the support of small intervals is smaller than that of large intervals, which pushes the system to create rules with large intervals covering nearly the whole domain. To avoid this, the system penalized the fitness of rules with large intervals. A similar approach was followed in [20], where the authors proposed a framework in which finding good intervals from which interesting association rules could be extracted was addressed as an optimization problem. As done in [17,20], CSar does not apply any discretization mechanism to the original data, and interval bounds are evolved by the genetic procedure. The main novelty of our proposal is that association rules are mined not from static databases but from streams of examples. This characteristic guides some parts of the algorithm design, which is described in detail in the next section.
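For illustration, the equi-depth partitioning used in [22] can be sketched in a few lines of C: after sorting, each of the k intervals receives roughly the same number of values. Tie handling at the boundaries, the source of the sharp-boundary problem discussed above, is deliberately left naive here.

```c
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Writes the k-1 interior cut points of an equi-depth partition of the
 * n values in v (which is sorted as a side effect) into cuts[]. */
void equi_depth(double *v, int n, int k, double *cuts)
{
    qsort(v, n, sizeof *v, cmp_double);
    for (int i = 1; i < k; i++)
        cuts[i - 1] = v[(long)i * n / k];
}
```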
3 Description of CSar
CSar is a Michigan-style LCS for mining interval-based association rules from data that contain both quantitative and categorical attributes. The learning architecture of CSar is inspired by UCS [6] and XCS [25,26]. CSar aims at evolving populations of interesting association rules, i.e., rules with large support and confidence. For this purpose, CSar evaluates a set of association rules online and evolves this rule set by means of a steady-state genetic algorithm (GA) [11,12] that is applied to population niches. In what follows, a detailed description of the system is provided, focusing on the differences in the knowledge representation and learning process with respect to those of XCS and UCS.
3.1 Knowledge Representation
CSar evolves a population of classifiers [P], where each classifier consists of a quantitative association rule and a set of parameters. The quantitative association rule is represented as

if xi ∈ vi and ... and xj ∈ vj then xk ∈ vk,
where the antecedent is represented by a set of a input variables xi, ..., xj (0 < a < ℓ, 0 ≤ i < ℓ, and 0 ≤ j < ℓ, where ℓ is the number of variables of the problem) and the consequent contains a single variable xk. Note that rules may have an arbitrary number of variables in the antecedent, but only a single variable in the consequent. Restricting the number of consequent variables to one aims at simplifying the creation of niches (see the next subsection). For quantitative attributes, a representation similar to that of XCSR [27] is used, in which both antecedent and consequent variables are represented by the interval of values to which the variable applies, i.e., vi = [li, ui]. A maximum interval length maxInt is set to avoid large intervals that contain nearly all the possible values of a given variable; therefore, ∀i: ui − li ≤ maxInt. Categorical attributes are represented by one of the possible categorical values xij, i.e., vi = xij. A rule matches an input example if, for all the variables in the antecedent and consequent of the rule, the corresponding value of the example is either included in the interval defined for continuous variables or equal to the value defined for categorical variables. Each classifier has seven main parameters: (1) the support supp, i.e., the occurring frequency of the rule; (2) the confidence conf, which indicates the strength of the implication; (3) the fitness F, which denotes the quality of the rule; (4) the experience exp, which counts the number of times that the antecedent of the rule has matched an input instance; (5) the consequent matching sum cm, which counts the number of times that the whole rule has matched an input instance; (6) the numerosity num, which counts the number of copies of the classifier in the population; and (7) the time of creation of the classifier, tcreate. The next subsection explains how the classifiers are created and evolved and how their parameters are updated.
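A possible C rendering of such a classifier is sketched below. Field names, the fixed-size variable array, and the tagged union of interval and category are our own assumptions; the original implementation is not specified in the text.

```c
#define MAX_VARS 64              /* illustrative upper bound on ell */

typedef struct {
    int    used;                 /* variable appears in the rule       */
    int    is_categorical;       /* category vs. interval              */
    double l, u;                 /* interval [l, u], u - l <= maxInt   */
    int    category;             /* categorical value                  */
} Var;

typedef struct {
    Var    ante[MAX_VARS];       /* antecedent variables               */
    int    cons;                 /* index of the consequent variable   */
    Var    cons_val;             /* consequent value or interval       */
    double supp, conf, F;        /* (1) support (2) confidence (3) fitness */
    long   exp;                  /* (4) experience                     */
    long   cm;                   /* (5) consequent matching sum        */
    int    num;                  /* (6) numerosity                     */
    long   tcreate;              /* (7) creation time                  */
    double as;                   /* association set size estimate      */
} Classifier;
```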
3.2 Learning Process Organization
At each learning iteration, CSar receives an input example (e1, e2, ..., eℓ). The system then creates the match set [M] with all the classifiers in the population that match the input example. If [M] contains fewer than θmna classifiers, the covering operator is triggered to create as many new matching classifiers as required to have θmna classifiers in [M]. Classifiers in [M] are then organized into association set candidates following one of the two methodologies explained below. Each association set candidate is given a selection probability proportional to the average confidence of the classifiers that belong to it. The selected association set [A] is checked for subsumption with the aim of diminishing the number of rules that express similar associations among variables. Then, the parameters of all the classifiers in [M] are updated. At the end of the iteration, a GA is applied to the selected association set if the average time since the last application of the GA to its classifiers is greater than θGA (a user-set parameter). Finally, for each continuous attribute, we maintain a list without repeated elements that stores the last few values seen for the attribute (in our experiments we stored the last hundred different
values). This list is used by the mutation operator with the aim of preventing the existence of intervals that cover the same examples but are slightly different. In what follows, we provide details about (1) the covering operator, (2) the procedures to create association set candidates, (3) the association set subsumption mechanism, and (4) the parameter update procedure. The next section explains the discovery component in more detail. It is worth noting that some of the operators are similar to those of several existing systems, such as the ones described in [5,9].

Covering Operator. The purpose of the covering operator is to feed classifiers that denote interesting associations among variables into the population. Given the sampled input example e, the covering operator creates a new matching classifier as follows. Each variable is selected with probability 1 − P# to belong to the rule's antecedent, with the restriction that, at the end of this process, at least one variable has been selected. The values of the selected variables are initialized differently depending on the type of attribute. For categorical attributes, the variable is initialized to the corresponding input value ei. For continuous attributes, the interval [li, ui] that represents the variable is obtained by generalizing the input value ei, i.e.,

li = ei − rand(maxInt/2)   (4)
ui = ei + rand(maxInt/2),   (5)

where maxInt is the maximum interval length. Finally, one of the previously unselected variables is randomly chosen to form the consequent of the rule, and it is initialized following the same procedure. Note that the created association rule is supported by at least the sampled example.

Creation of Association Set Candidates. The aim of creating association set candidates, or niches, is to group rules that express similar associations, to establish a competition among them, and so let the best ones take over their niche. While the creation of such niches of similar rules is quite immediate in reinforcement learning [25] and classification [6] tasks, several approaches could be used to form groups of similar rules in association rule mining. Herein, we propose two alternatives, guided by different heuristics:

Grouping by antecedent. This strategy considers that two rules are similar if they have exactly the same variables in their antecedent, regardless of the corresponding values vi. Therefore, this grouping strategy creates Na association set candidates, where Na is the number of distinct antecedent variable sets among the rules in [M]. Each association set contains the rules that have exactly the same variables in the antecedent. The underlying idea is that rules with the same antecedent may express similar knowledge. Note that, under this strategy, rules with different variables in the consequent can be grouped in the same association set.

Grouping by consequent. This strategy groups in the same association set the classifiers in [M] that have the same variable in the consequent with
equivalent values. We consider that two continuous variables are equivalent if their intervals overlap, and that two categorical variables are equivalent if they have the same categorical value. For this purpose, the following process is applied. The rules in [M] are sorted in ascending order according to the variable that they have in their consequent. Given two rules r1 and r2 that have the same variable in the consequent, we consider that r1 is smaller than r2 if

l1 < l2 or (l1 = l2 and u1 > u2)   for a continuous attribute,
ord(x1) < ord(x2)                  for a categorical attribute,

where l1, l2, u1, and u2 are the lower and upper bounds of the consequent variable of r1 and r2 for a continuous attribute, x1 and x2 are the values of the consequent variable for a categorical attribute, and ord(xi) maps each categorical value to a numeric value. It is worth noting that, given two continuous variables with the same lower bound, we sort first the rule with the more general variable (i.e., the rule with the larger ui). We take this approach with the aim of forming association set candidates with the largest number of overlapping classifiers, using the procedure explained next. Once [M] has been sorted, the association set candidates are built as follows. At the beginning, an association set candidate is created and the first classifier in [M] is added to it. The following classifier is then added if it has the same variable in the consequent and its lower bound is smaller than the minimum upper bound of the classifiers already in the association set. This process is repeated until the first classifier that violates this condition is found; in that case, a new association set candidate is created, and the same process is applied to add new classifiers to it. The underlying idea of this strategy is that rules that explain the same region of the consequent may denote the same associations among variables. The cost of both methodologies for creating the association sets is dominated by the cost of sorting the population. We applied a quicksort strategy for this purpose, which has a cost of O(n log n), where n is the match set size.

Association Set Subsumption. A subsumption mechanism inspired by the one presented in [26] was designed with the aim of reducing the number of different rules that express the same or similar knowledge. The process works as follows. Each rule of the selected association set is checked for subsumption against each other rule in the same association set. A rule ri is a candidate subsumer of rj if it satisfies the following three conditions: (1) ri has a sufficiently high confidence and is experienced enough (i.e., confi > conf0 and expi > θexp, where conf0 and θexp are user-set parameters); (2) all the variables in the antecedent of ri are also present in the antecedent of rj and both rules have the same variable in the consequent (rj can have more variables in the antecedent than ri); and (3) ri
is more general than rj. A rule ri is more general than rj if all the input and output variables of ri are also defined in rj, each categorical variable of ri has the same value as the corresponding variable in rj, and the interval [li, ui] of each continuous variable in ri includes the interval [lj, uj] of the corresponding variable in rj (i.e., li ≤ lj and ui ≥ uj).

Parameter Update. At the end of each learning iteration, the parameters of all the classifiers that belong to the match set are updated. First, we increment the experience of the classifier. Next, we increment the consequent matching sum cm if the rule's consequent also matches the input example. These two parameters are used to update the support and confidence of rule i as follows. The support is computed as

suppi = cmi / (ctime − tcreatei),   (6)
where ctime is the time of the current iteration and tcreatei is the iteration in which classifier i was created. Then, the confidence is computed as

confi = cmi / expi.   (7)
Lastly, the fitness of each rule i in [M] is updated with the following formula:

Fi = (confi · suppi)^ν,   (8)
where ν is a user-set parameter that permits controlling the pressure toward highly fit classifiers. Note that, with this fitness computation, the system pushes toward the evolution of rules with not only high confidence but also high support. We empirically tested computing the fitness only from conf, but preliminary experiments indicated that CSar could obtain a larger variety of interesting association rules if support was included in the fitness computation. Finally, the association set size estimate of all rules that belong to the selected association set is updated; each rule maintains the average size of all the association sets in which it has participated.
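Assuming the Classifier struct sketched in Sect. 3.1, the update of Eqs. (6)-(8) is a direct transcription; whether the antecedent and the whole rule matched is assumed to have been determined beforehand.

```c
#include <math.h>

/* Update one classifier of [M] at iteration ctime (Eqs. 6-8).
 * Assumes ctime > tcreate, i.e., the rule was not created this step. */
void update_parameters(Classifier *c, int whole_rule_matched,
                       long ctime, double nu)
{
    c->exp += 1;                         /* antecedent matched        */
    if (whole_rule_matched)
        c->cm += 1;                      /* consequent also matched   */
    c->supp = (double)c->cm / (double)(ctime - c->tcreate);  /* Eq. (6) */
    c->conf = (double)c->cm / (double)c->exp;                /* Eq. (7) */
    c->F    = pow(c->conf * c->supp, nu);                    /* Eq. (8) */
}
```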
3.3 Discovery Component
CSar uses a steady-state niched GA to discover new promising rules. The GA is applied to the selected association set [A]; therefore, niching is intrinsically provided, since the GA is applied to rules that are similar according to one of the heuristics for association set formation. The GA is triggered when the average time since its last application to the classifiers in [A] exceeds the threshold θGA. It selects two parents p1 and p2 from [A] using proportionate selection [11], where the probability of selecting a classifier k is

psel(k) = Fk / Σ_{i∈[A]} Fi.   (9)
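Eq. (9) is ordinary roulette-wheel selection; a minimal sketch over the association set [A], again using the classifier structure assumed earlier and rand() for brevity:

```c
#include <stdlib.h>

/* Returns the index of a parent drawn from A with probability
 * proportional to fitness, as in Eq. (9). */
int select_parent(const Classifier *A, int nA)
{
    double total = 0.0;
    for (int i = 0; i < nA; i++)
        total += A[i].F;
    double r = ((double)rand() / RAND_MAX) * total;
    for (int i = 0; i < nA; i++) {
        r -= A[i].F;
        if (r <= 0.0)
            return i;
    }
    return nA - 1;   /* guard against floating-point round-off */
}
```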
The two parents are copied into offspring ch1 and ch2, which undergo crossover and mutation if required. The system applies uniform crossover with probability Pχ. First, it considers each variable in the antecedent of both rules. If only one parent has the variable, one child is randomly selected and the variable is copied to this child. If both parents contain the variable, the variable is copied to each offspring. The procedure ensures that, at the end of the process, each offspring has at least one input variable. Then, the rule consequent is crossed by adding to the first offspring the consequent of one of the parents (randomly selected) and adding to the other offspring the consequent of the remaining parent. Three types of mutation can be applied to a rule: (1) introduction/removal of antecedent variables (with probability PI/R), (2) mutation of a variable's values (with probability Pμ), and (3) mutation of the consequent variable (with probability PC). The first type of mutation randomly chooses whether a new antecedent variable has to be added to the rule or one of the antecedent variables has to be removed from it. If a variable has to be added, one of the non-existing variables is randomly selected and added to the rule; this operation can only be applied if the rule does not already have all the possible variables. If a variable has to be removed, one of the existing variables is randomly selected and removed from the rule; this operation can only be applied if the rule has at least two variables in the antecedent. The second type of mutation selects one of the existing variables of the rule and mutates its value. For continuous variables, two random amounts ranging in [−m0, m0] are added to the lower bound and the upper bound, respectively, where m0 is a user-set parameter. If the interval surpasses the maximum length or the lower bound becomes greater than the upper bound, the interval is repaired. Finally, the lower and upper bounds of the mutated variable are moved to the closest values in the list of seen values for this variable. This process is applied to avoid having rules in the population with very similar interval bounds, since keeping all of them may not only provide no additional knowledge, but also hinder human experts from reading the whole population. For categorical variables, a new value for the variable is randomly selected. The last type of mutation randomly selects one of the variables in the antecedent and exchanges it with the output variable. After crossover and mutation, the new offspring are introduced into the population. First, each offspring is checked for subsumption [26] against its parents. To decide whether any parent can subsume the offspring, the same procedure explained for association set subsumption is followed. If a parent is identified as a possible subsumer of the offspring, the offspring is not inserted and the numerosity of the parent is increased by one. Otherwise, we check [A] for the most general rule that can subsume the offspring. If no subsumer can be found, the classifier is inserted into the population. If the population is full, excess classifiers are deleted from [P] with probability proportional to their association set size estimate as. Moreover, if a classifier k is sufficiently experienced (expk > θdel) and its fitness Fk is significantly
lower than the average fitness of the classifiers in [P] (Fk < δF[P], where F[P] = (1/N) Σ_{i∈[P]} Fi), its deletion probability is further increased. That is, each classifier k has a deletion probability pk of

pk = dk / Σ_{∀j∈[P]} dj,   (10)

where

dk = as · num · F[P] / Fk   if expk > θdel and Fk < δF[P],
dk = as · num               otherwise.   (11)
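In code, Eq. (11) reduces to computing a deletion vote per classifier; sampling proportionally to the votes then realizes Eq. (10). A sketch under the same assumed classifier structure, with Fbar the average fitness of [P]:

```c
/* Deletion vote d_k of Eq. (11). */
double deletion_vote(const Classifier *c, double Fbar,
                     long theta_del, double delta)
{
    double d = c->as * (double)c->num;
    if (c->exp > theta_del && c->F < delta * Fbar)
        d *= Fbar / c->F;     /* extra pressure on low-fitness rules */
    return d;
}
```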
Thus, the deletion algorithm balances the classifier allocation in the different association sets by pushing toward the deletion of rules belonging to large association sets. At the same time, it biases the search toward highly fit classifiers, since the deletion probability of rules whose fitness is much smaller than the average fitness is increased.

3.4 Rule Set Reduction
At the end of the learning process, the final rule set is processed to provide the user with only interesting rules. For this purpose, we apply the following reduction mechanism. First, we remove all rules whose experience is smaller than θexp (a user-set parameter). Then, each rule is checked against every other rule for subsumption, following the same procedure used for association set subsumption with one exception: now, a rule ri is a candidate subsumer for rj if ri and rj have the same variables in their antecedent and consequent, ri is more general than rj, and ri has higher confidence than rj. Note that, during learning, the subsumption mechanism instead requires that the confidence of ri be greater than conf0. After applying the rule set reduction mechanism, we make sure that the final population consists of distinct rules. Other policies could easily be incorporated into this process, such as removing rules whose support and confidence are below a predefined threshold. Nonetheless, in our experiments we return all the experienced rules in the final population that are not subsumed by any other. This section has described the mechanisms that CSar uses to evolve a population of interesting association rules online. Unlike other quantitative association-rule miners, CSar has a maximum population size that limits the number of different interesting association rules that can exist in the final population. The system organizes rules into different association sets and uses a GA to make rules in the same association set compete. Therefore, CSar does not aim at returning all possible association rules, but at providing the user with a population of limited size containing "phenotypically" different and interesting association rules.
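A compact sketch of this reduction, where same_vars() and is_more_general() are assumed helpers implementing the variable-set and generality tests described in Sect. 3.2:

```c
#include <stdlib.h>

int same_vars(const Classifier *a, const Classifier *b);       /* assumed */
int is_more_general(const Classifier *a, const Classifier *b); /* assumed */

/* Removes inexperienced and subsumed rules in place; returns the new
 * population size. */
int reduce_rule_set(Classifier *P, int nP, long theta_exp)
{
    int n = 0, m = 0;
    for (int i = 0; i < nP; i++)          /* keep experienced rules   */
        if (P[i].exp >= theta_exp)
            P[n++] = P[i];

    char *dead = calloc(n, 1);
    for (int i = 0; i < n; i++) {
        if (dead[i]) continue;
        for (int j = 0; j < n; j++)       /* does rule i subsume j?   */
            if (j != i && !dead[j] && same_vars(&P[i], &P[j]) &&
                is_more_general(&P[i], &P[j]) && P[i].conf > P[j].conf)
                dead[j] = 1;
    }
    for (int i = 0; i < n; i++)           /* compact the survivors    */
        if (!dead[i])
            P[m++] = P[i];
    free(dead);
    return m;
}
```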
4 Experimental Methodology
After having described the system in detail, we are now in a position to experimentally analyze the behavior of CSar. The aim of the experimental analysis
Table 1. Properties of the data sets. The columns describe: the identifier of the data set (Id.); the name of the data set (dataset); the number of instances (#Inst); the total number of features (#Fea); the number of real features (#Re); the number of integer features (#In); and the number of nominal features (#No).

Id.    dataset                          #Inst  #Fea  #Re  #In  #No
adl    Adult                            48841   15    0    6    9
ann    Annealing                          898   39    6    0   33
aud    Audiology                          226   70    0    0   70
aut    Automobile                         205   26   15    0   11
bpa    Bupa                               345    7    6    0    1
col    Horse colic                        368   23    7    0   16
gls    Glass                              214   10    9    0    1
h-s    Heart-s                            270   14   13    0    1
irs    Iris                               150    5    4    0    1
let    Letter recognition               20000   17    0   16    1
pim    Pima                               768    9    8    0    1
tao    Tao                               1888    3    2    0    1
thy    Thyroid                            215    6    5    0    1
wdbc   Wisc. diagnose breast-cancer       569   31   30    0    1
wne    Wine                               178   14   13    0    1
wpbc   Wisc. prognostic breast-cancer     198   34   33    0    1
was to (1) study whether CSar could actually evolve a set of interesting association rules, and (2) examine the behavior of the system under different configurations. With these objectives in mind, we performed the following two experiments. Our first concern was to analyze whether CSar could evolve the most interesting association rules despite having a fixed population size. Therefore, we compared CSar with Apriori [3], probably the most influential association rule miner, on the zoo problem [4]. We selected the zoo problem for this analysis because Apriori only works on problems described by categorical attributes, and the zoo problem satisfies this requirement. More specifically, the zoo problem is defined by (1) fifteen binary attributes that indicate whether the animal has each of fifteen characteristics, such as a tail or hair, and (2) two categorical attributes that can take more than two values, representing the number of legs and the type of animal. Secondly, we studied the impact of using the two different procedures to create association set candidates and of using progressively larger maximum intervals. For this purpose, we ran CSar (1) with both antecedent- and consequent-grouping strategies to create association set candidates and (2) with different maximum interval lengths on a collection of real-world problems extracted from the UCI repository [4] and from local repositories [7]. The characteristics of these problems are reported in Table 1. In all runs, CSar employed the following configuration: num iterations = 100 000, popSize = 6 400, conf0 = 0.95, ν = 10, θmna = 10, {θdel, θGA} = 50, θexp = 1000, Pχ = 0.8, {PI/R, Pμ, PC} = 0.1, m0 = 0.2. Association set subsumption was activated in all runs.
5 Analysis of the Results
In what follows, we discuss the experimental results in the light of the aims stated above.

5.1 Ability of CSar to Discover Interesting Rules
In order to study the ability of CSar to extract interesting association rules, we first compared the system with Apriori on a problem with only categorical attributes, the zoo problem. CSar was run with both antecedent-grouping and consequent-grouping strategies. As we wanted to analyze the interestingness of the rules created by the systems, we report the number of rules with different minimum supports and confidences obtained by CSar with the two grouping strategies (see Figure 1). The same information is reported for Apriori in Figure 2; however, in this case, the resulting rules of Apriori have been filtered.
Fig. 1. Number of rules evolved with minimum support and confidence for the zoo problem with (a) antecedent-grouping and (b) consequent-grouping strategies. The curves are averages over five runs with different random seeds.
Fig. 2. Number of rules created by Apriori with minimum support and confidence for the zoo problem. Lower confidence and support are not shown since Apriori creates all possible combinations of attributes, exponentially increasing the number of rules.
Table 2. Comparison of the number of rules evolved by CSar with antecedent- and consequent-grouping strategies to form the association set candidates with the number of rules evolved by Apriori at high support and confidence values

          antecedent grouping              consequent grouping          Apriori
Support   0.4        0.6        0.8        0.4       0.6       0.8      0.4   0.6   0.8
0.40      275 ± 30   271 ± 27   230 ± 23   65 ± 10   63 ± 9    59 ± 9   2613  2514  2070
0.50      123 ± 4    123 ± 4    106 ± 3    61 ± 8    61 ± 8    58 ± 8   530   523   399
0.60      58 ± 2     58 ± 2     51 ± 4     51 ± 8    51 ± 8    47 ± 7   118   118   93
0.70      21 ± 1     21 ± 1     19 ± 1     19 ± 2    19 ± 2    18 ± 2   30    30    27
0.80      2 ± 0      2 ± 0      2 ± 0      2 ± 0     2 ± 0     2 ± 0    2     2     2
0.90      0 ± 0      0 ± 0      0 ± 0      0 ± 0     0 ± 0     0 ± 0    0     0     0
1.00      0 ± 0      0 ± 0      0 ± 0      0 ± 0     0 ± 0     0 ± 0    0     0     0

(Column sub-headers 0.4/0.6/0.8 are the minimum confidence values.)
That is, Apriori is a two-phase algorithm that exhaustively explores the whole feature space, discovers all the itemsets with a minimum predefined support, and creates all the possible rules from these itemsets. Therefore, some of the rules supplied by Apriori are included in other rules. We consider that a rule r1 is included in another rule r2 if r1 has at least the same variables, with the same values, in the rule antecedent and the rule consequent as r2 (r1 may have more variables). In the results provided herein, we removed from the final population all the rules that were included in other rules; thus, we provide an upper bound of the number of different rules that can be generated. Two important observations can be made from these results. Firstly, the results clearly show that Apriori can create a higher number of rules than CSar (for the sake of clarity, Table 2 specifies the number of rules for support values ranging from 0.4 to 1.0 and confidence values of {0.4, 0.6, 0.8}). This behavior was expected, since CSar has a limited population size, while Apriori returns all possible association rules. Nevertheless, it is worth noting that CSar and Apriori found exactly the same number of highly interesting rules; that is, both systems discovered two rules with both confidence and support higher than 0.8. This highlights the robustness of CSar, whose mechanisms guide the system to discover the most interesting rules. Secondly, focusing on the results reported in Figure 1, we can see that the populations evolved with the antecedent-grouping strategy are larger than those built with the consequent-grouping strategy. This behavior will also be present, and discussed in more detail, in the extended experimental analysis conducted in the next subsection.

5.2 Study of the Behavior of CSar
After showing that CSar can create highly interesting association rules in a case-study problem characterized by categorical attributes, we now extend the experimentation by running the system on 16 real-world data sets. We ran the system with (1) antecedent-grouping and consequent-grouping strategies and (2)
Table 3. Average (± standard deviation of the) number of rules with support and confidence greater than 0.60 created by CSar with antecedent- and consequent-grouping strategies and with maximum interval sizes of MI={0.10, 0.25, 0.50}. The average and standard deviation are computed on five runs with different random seeds.

          antecedent                               consequent
       MI=0.10      MI=0.25      MI=0.50       MI=0.10      MI=0.25     MI=0.50
adl    135 ± 3      294 ± 15     567 ± 66      46 ± 1       74 ± 3      147 ± 23
ann    1736 ± 133   1765 ± 79    1702 ± 135    478 ± 86     525 ± 112   489 ± 34
aud    2206 ± 80    2017 ± 147   1999 ± 185    1014 ± 12    982 ± 100   880 ± 215
aut    84 ± 14      192 ± 7      710 ± 106     25 ± 6       58 ± 3      188 ± 6
bpa    11 ± 4       174 ± 15     365 ± 42      17 ± 2       100 ± 4     123 ± 22
col    134 ± 14     188 ± 7      377 ± 64      180 ± 13     191 ± 7     198 ± 8
gls    33 ± 4       160 ± 17     694 ± 26      23 ± 2       89 ± 6      205 ± 23
h-s    28 ± 1       61 ± 4       248 ± 32      13 ± 1       29 ± 1      92 ± 13
irs    0 ± 0        0 ± 0        50 ± 5        0 ± 0        0 ± 0       28 ± 8
let    0 ± 0        113 ± 17     991 ± 40      0 ± 0        103 ± 6     205 ± 13
pim    4 ± 1        93 ± 9       570 ± 51      3 ± 0        53 ± 5      154 ± 25
tao    0 ± 0        0 ± 0        8 ± 1         0 ± 0        0 ± 0       5 ± 2
thy    46 ± 2       152 ± 4      350 ± 27      29 ± 2       80 ± 3      160 ± 2
wdbc   0 ± 0        419 ± 43     1143 ± 131    0 ± 0        145 ± 17    304 ± 16
wne    116 ± 9      273 ± 48     536 ± 34      26 ± 3       65 ± 9      137 ± 17
wpbc   0 ± 0        0 ± 0        740 ± 234     0 ± 0        0 ± 0       264 ± 34
allowing intervals of maximum length maxInt = {0.1, 0.25, 0.5} for continuous variables. Note that by using different grouping strategies we change the way the system creates association set candidates; therefore, as competition is held among rules within the same association set, the resulting rules can differ in the two cases. On the other hand, an increasingly larger interval length for continuous variables enables the system to obtain more general rules. Table 3 reports the number of rules, with confidence and support greater than or equal to 0.6, created by the different configurations of CSar. All the reported results are averages of five runs with different random seeds. Comparing the results obtained with the two grouping schemes, we can see that the antecedent-grouping strategy yielded larger populations than the consequent-grouping strategy, on average. This behavior was expected, since antecedent grouping creates smaller association sets and thus maintains more diversity in the population. Nonetheless, a closer examination of the final population indicates that the difference in the final number of rules decreases if we consider only the rules with the highest confidence and support. For example, considering all the rules with confidence and support greater than or equal to 0.60, the antecedent-grouping strategy results in populations 2.16 times bigger than those of the consequent-grouping strategy. However, considering only the rules with confidence and support greater than or equal to 0.85, the average difference in population size is reduced to 1.12. This indicates that a big proportion of the most interesting rules are discovered by both strategies. It is worth
highlighting, therefore, that the lower number of rules evolved by the consequent-grouping strategy can be considered an advantage, since the strategy avoids creating and maintaining uninteresting rules in the population, which implies a lower computational time to evolve the population. Focusing on the impact of varying the interval length, the results indicate that for lower maximum interval lengths CSar tends to evolve rules with less support. This behavior can be easily explained as follows. Large maximum interval lengths enable the existence of highly general rules, which will have higher support. Moreover, if both antecedent and consequent variables are maximally general, rules will also have high confidence. Taking this idea to the extreme, rules that contain variables whose intervals range from the minimum to the maximum value of the variable will have maximum confidence and support; nonetheless, these rules will be uninteresting to human experts. On the other hand, small interval lengths may result in more interesting association rules, though too small lengths may result in rules that denote strong associations but have less support. This highlights a tradeoff in the setting of this parameter, which should be adjusted for each particular problem. As a rule of thumb, similarly to what can be done with other association rule miners, the practitioner may start with small interval lengths and increase them if rules with enough support for the particular domain are not obtained.
6 Summary, Conclusion, and Further Work
In this paper, we presented CSar, a Michigan-style LCS designed to evolve quantitative association rules. The experiments conducted in this paper have shown that the method holds promise for online extraction of both categorical and quantitative association rules. Results with the zoo problem indicated that CSar was able to create interesting categorical rules, which were similar to those built by Apriori. Experiments with a collection of real-world problems also pointed out the capabilities of CSar to extract quantitative association rules and served to analyze the behavior of different configurations of the system. These results encourage us to study the system further with the aim of applying CSar to mine quantitative association rules from new challenging real-world problems. Several future work lines can be followed in light of the present work. Firstly, we aim at comparing CSar with other quantitative association rule miners to see if the online architecture can extract knowledge similar to that obtained by other approaches that go several times through the learning data set. Actually, the online architecture of CSar makes the system suitable for mining association rules from changing environments with concept drift [1]; and we think that the existence of concept drift may be a common trait in many real-world problems to which association rules have historically been applied such as profile mining from customer information. Therefore, it would be interesting to analyze how CSar adapts to domains in which variable associations change over time.
Acknowledgements

The authors thank the support of Ministerio de Ciencia y Tecnología under projects TIN2008-06681-C06-01 and TIN2008-06681-C06-05, Generalitat de Catalunya under Grant 2005SGR-00302, and the Andalusian Government under grant P07-TIC-3185.
References

1. Aggarwal, C. (ed.): Data streams: Models and algorithms. Springer, Heidelberg (2007)
2. Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington D.C., pp. 207–216 (May 1993)
3. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Bocca, J.B., Jarke, M., Zaniolo, C. (eds.) Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, Santiago, Chile, pp. 487–499 (September 1994)
4. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository, University of California (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
5. Bacardit, J., Krasnogor, N.: Fast rule representation for continuous attributes in genetics-based machine learning. In: GECCO 2008: Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation, pp. 1421–1422. ACM, New York (2008)
6. Bernadó-Mansilla, E., Garrell, J.M.: Accuracy-based learning classifier systems: Models, analysis and applications to classification tasks. Evolutionary Computation 11(3), 209–238 (2003)
7. Bernadó-Mansilla, E., Llorà, X., Garrell, J.M.: XCS and GALE: A comparative study of two learning classifier systems on data mining. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2001. LNCS (LNAI), vol. 2321, pp. 115–132. Springer, Heidelberg (2002)
8. Cai, C.H., Fu, A.W.-C., Cheng, C.H., Kwong, W.W.: Mining association rules with weighted items. In: International Database Engineering and Application Symposium, pp. 68–77 (1998)
9. Divina, F.: Hybrid Genetic Relational Search for Inductive Learning. PhD thesis, Department of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands (2004)
10. Fukuda, T., Morimoto, Y., Morishita, S., Tokuyama, T.: Mining optimized association rules for numeric attributes. In: PODS 1996: Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 182–191. ACM, New York (1996)
11. Goldberg, D.E.: Genetic algorithms in search, optimization & machine learning, 1st edn. Addison-Wesley, Reading (1989)
12. Holland, J.H.: Adaptation in natural and artificial systems. The University of Michigan Press (1975)
13. Hong, T.P., Kuo, C.S., Chi, S.C.: Trade-off between computation time and number of rules for fuzzy mining from quantitative data. International Journal of Uncertainty, Fuzziness, and Knowledge-Based Systems 9(5), 587–604 (2001)
14. Houtsma, M., Swami, A.: Set-oriented mining of association rules. Technical Report RJ 9567, Almaden Research Center, San Jose, California (October 1993)
15. Kaya, M., Alhajj, R.: Genetic algorithm based framework for mining fuzzy association rules. Fuzzy Sets and Systems 152(3), 587–601 (2005)
16. Lent, B., Swami, A.N., Widom, J.: Clustering association rules. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 220–231 (1997)
17. Mata, J., Alvarez, J.L., Riquelme, J.C.: An evolutionary algorithm to discover numeric association rules. In: SAC 2002: Proceedings of the 2002 ACM Symposium on Applied Computing, pp. 590–594. ACM, New York (2002)
18. Miller, R.J., Yang, Y.: Association rules over interval data. In: SIGMOD 1997: Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pp. 452–461. ACM, New York (1997)
19. Núñez, M., Fidalgo, R., Morales, R.: Learning in environments with unknown dynamics: Towards more robust concept learners. Journal of Machine Learning Research 8, 2595–2628 (2007)
20. Salleb-Aouissi, A., Vrain, C., Nortet, C.: QuantMiner: A genetic algorithm for mining quantitative association rules. In: Veloso, M.M. (ed.) Proceedings of the 2007 International Joint Conference on Artificial Intelligence, pp. 1035–1040 (2007)
21. Savasere, A., Omiecinski, E., Navathe, S.: An efficient algorithm for mining association rules in large databases. In: Proceedings of the 21st VLDB Conference, Zurich, Switzerland, pp. 432–443 (1995)
22. Srikant, R., Agrawal, R.: Mining quantitative association rules in large relational tables. In: Jagadish, H.V., Mumick, I.S. (eds.) Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, pp. 1–12 (1996)
23. Wang, C.-Y., Tseng, S.-S., Hong, T.-P., Chu, Y.-S.: Online generation of association rules under multidimensional consideration based on negative border. Journal of Information Science and Engineering 23, 233–242 (2007)
24. Wang, K., Tay, S.H.W., Liu, B.: Interestingness-based interval merger for numeric association rules. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, KDD, pp. 121–128. AAAI Press, Menlo Park (1998)
25. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2), 149–175 (1995)
26. Wilson, S.W.: Generalization in the XCS classifier system. In: 3rd Annual Conf. on Genetic Programming, pp. 665–674. Morgan Kaufmann, San Francisco (1998)
27. Wilson, S.W.: Get real! XCS with continuous-valued inputs. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1813, pp. 209–219. Springer, Heidelberg (2000)
Coevolution of Pattern Generators and Recognizers Stewart W. Wilson Prediction Dynamics, Concord MA 01742 USA Department of Industrial and Enterprise Systems Engineering The University of Illinois at Urbana-Champaign IL 61801 USA [email protected]
Abstract. Proposed is an automatic system for creating pattern generators and recognizers that may provide new and human-independent insight into the pattern recognition problem. The system is based on a three-cornered coevolution of image-transformation programs.
1
Introduction
Pattern recognition is a very difficult problem for computer science. A major reason is that in many cases pattern classes are not well-specified, frustrating the design of algorithms (including learning algorithms) to identify or discriminate them. Intrinsic specification (via formal definition) is often impractical—consider the class consisting of hand-written letters A. Extrinsic specification (via finite sets of examples) has problems of generalization and over-fitting. Many interesting pattern classes are hard to specify because they exist only in relation to human or animal brains. Humans employ mental processes such as scaling, point of view adjustment, contrast and texture interpretation, saccades, etc., permitting classes to be characterized very subtly. It is likely that truly powerful computer pattern recognition methods will need to employ all such techniques, which is not generally the case today. In this paper we are concerned mainly with human-related pattern classes. A further challenge for pattern recognition research is to create problems with large sets of examples that can be learned from. An automatic pattern generator would be valuable, but it should be capable of producing examples of each class that are diverse and subtle as well as numerous. This paper proposes an automatic pattern generation and recognition process, and speculates that it would shed light on both the formal characterization problem and recognition techniques. The process would permit unlimited generation of examples and very great flexibility of methods, by relying on competitive and cooperative coevolution of pattern generators and recognizers. The paper is organized into a first part in which the pattern recognition problem is discussed in greater detail; a second part in which the competitive and cooperative method is explained in concept; and a third part containing suggestions for a specific implementation.
2
Pattern Recognition Problem
The following is a viewpoint on the pattern recognition problem and what makes it difficult. Let us first see some examples of what are generally regarded as patterns.
– Characters, such as letters and numerals. Members of a class can differ in numerous ways, including placement in the field of view, size, orientation, shape, thickness, contrast, constituent texture, distortion including angle of view, noise of construction, and masking noise, among others.
– Patterns in time series, such as musical phrases, price data configurations, and event sequences. Members of a class can differ in time-scale, shape, intensity, texture, etc.
– Natural patterns, such as trees, landscapes, terrestrial features, and cloud patterns. Members of a class can differ in size, shape, contrast, color, texture, etc.
– Circumstantial patterns, such as situations, moods, plots. Members of a class can differ along a host of dimensions themselves often hard to define.
This sampling illustrates the very high diversity within even ordinary pattern classes and suggests that identifying a class member while differentiating it from members of other classes should be very difficult indeed. Yet human beings learn to do it, and apparently quite easily. While that of course has been pointed out before, we note two processes which may play key roles, transformation and context. Transformative processes would include among others centering an object of interest in the field of view via saccades, i.e., translation, and scaling it to a size appropriate for further steps. Contextual processes would include adjusting the effective brightness (of a visual object) relative to its background, and seeing a textured object as in fact a single object on a differently textured background. It is clear that contextual processes are also transformations and that viewpoint will be taken here. A transformational approach to pattern recognition would imply a sequence in which the raw stimulus is successively transformed to a form that permits it to be matched against standard or iconic exemplars, or produces a signal that is associated with a class. Human pattern recognition is generally rapid and its steps are not usually conscious, except in difficult cases or in initial learning. However, people when asked for reasons for a particular recognition will often cite transformational steps like those above that allow the object to be interpreted to some standard form. For this admittedly informal reason, transformations are emphasized in the algorithms proposed here. It is possible to provide a more formal framework. Pattern recognition can be viewed as a process in which examples are mapped to classes. But the mappings are complicated. They are unlike typical functions that map vectors of elements into, e.g., reals. In such a function, each element has a definite position in the
vector (its index). Each position can be thought of as a place, and there is a value there. An ordinary function is thus a mapping of “values in places” into an outcome. Call it a place/value (PV) mapping. If you slide the values along the places—or expand them from a point—the outcome is generally completely different. The function depends on just which values are in which places. Patterns, on the other hand, are relative place/relative value (RPRV) mappings. Often, a given instance can be transformed into another instance, but with the same outcome, by a transformation that maintains the relative places or values of the elements—for example, such transformations as scaling, translation, rotation, contrast, even texture. The RPRV property, however, makes pattern recognition very difficult for machine learning methods that attach absolute significance to input element positions and values. There is considerable work on relative-value, or relational, learning systems, e.g., in classifier systems [5,4], and in reinforcement learning generally [1]. But for human-related pattern classes, what seems to be required is a method that is intrinsically able to deal with both relative value and relative place. This suggests that the method must be capable of transformations, both of its input and in subsequent stages. The remainder of the paper lays out one proposal for achieving this.
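To make the place/value versus relative place/relative value distinction concrete, consider the following sketch (ours, not from the paper; the arrays and weights are illustrative assumptions). An ordinary PV mapping changes when the values slide along the places, while a matcher that scores the best correlation over all translations depends only on relative places:

    import numpy as np

    pattern = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0])
    shifted = np.roll(pattern, 3)  # same relative structure, different places

    weights = np.linspace(1.0, 0.0, num=8)

    def pv_score(x):
        # Ordinary function: each value is tied to its absolute place (index).
        return float(np.dot(weights, x))

    def rprv_score(x, template):
        # Shift-invariant matcher: best correlation over all translations.
        return max(float(np.dot(np.roll(template, s), x)) for s in range(len(x)))

    print(pv_score(pattern), pv_score(shifted))                        # differ
    print(rprv_score(pattern, pattern), rprv_score(shifted, pattern))  # equal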
3
Let the Computer Do It
Traditionally, pattern recognition research involves choosing a domain, creating a source of exemplars, and trying learning algorithms that seem likely to work in that domain. Here, however, we are looking broadly at human-related pattern recognition, or relative place/relative value mappings (Sec. 2). Such a large task calls for an extensive source of pattern examples. It also calls for experimentation with a very wide array of transformation operators. Normally, for practicality one would narrow the domain and the choice of operators. Instead, we want to leave both as wide as possible, in hopes of achieving significant generality. While it changes the problem somewhat, there fortunately appears to be a way of doing this by allowing the computer itself to pose and solve the problem. Imagine a kind of communication game (Figure 1). A sender, or source, S, wants to send messages to a friend F. The messages are in English, and the letters are represented in binary by ASCII bytes. As long as F can decode bytes to ASCII (and knows English), F will understand S's messages. But there is also an enemy E that sees the messages and is not supposed to understand them. S and F decide to encrypt the messages. But instead of encrypting prior to conversion to bits, or encrypting the resulting bit pattern, they decide to encrypt each bit. That is, E's problem is to tell which bits are 1s and which 0s. If E can do that, the messages will be understandable. Note that F also must decrypt the bits. For this peculiar setup, S and F agree that when S intends to send a 0, S will send a variant of the letter A; for a 1, S will send a variant of B. S will produce these variants using a generation program. Each variant of A created will in general be different; similarly for B. F will know that 0 and 1 are represented
[Figure 1 appears here: a diagram of sender S, friend F, and enemy E.]
Fig. 1. S sends messages to F that are sniffed by E
by variants of A and B, respectively, and will use a recognition program to tell which is which. E, also using a recognition program, knows only that the messages are in a binary code but does not know anything about how 0s and 1s are represented. In this setup, S's objective is to send variants of As and Bs that F will recognize but E will not recognize. The objectives of both F and E are to recognize the letters; for this F has some prior information that E does not have. All the agents will require programs: S for generation and F and E for recognition. The programs will be evolved using evolutionary computation. Each agent will maintain its own population of candidate programs. The overall system will carry out a coevolution [2] in which each agent attempts to evolve the best program consistent with its objectives. Evolution requires a fitness measure, which we need to specify for each of the agents. For each bit transmitted by S, F either recognizes it or does not, and E either recognizes it or does not. S's aim is for F to recognize correctly but not E; call this a success for S. A simple fitness measure for an S program would be the number of its successes divided by a predetermined number of transmissions, T, assuming that S sends 0s and 1s with equal probability. A success for F as well as for E would be a correct recognition. A simple fitness measure for their programs would be the number of correct recognitions, again divided by T transmissions. S's population would consist of individuals each of which consists of a generation program. To send a bit, S picks an individual, randomly¹ decides whether to send a 0 or a 1, then as noted above, generates a variant of A for 0, or of B for 1, the variant differing each time the program is called. The system determines whether the transmission was a success (for S). After a total of T transmissions using a given S individual, its fitness is updated. F and E each have populations of individual recognition programs. Like S, after T recognition attempts using a population individual, its fitness is updated based on its number of successes. The testing of individuals could be arranged so that for each transmission, individuals from the S, F, and E populations would be selected at random. Or an individual from S could be used for T successive transmissions with F
¹ For our purposes, the bits need not encode natural language.
and E individuals still randomly picked on each transmission. Various testing schemes are possible. Selection, reproduction, and genetic operations would occur in a population at intervals long enough so that the average individual gets adequately evaluated. Will the coevolution work? It seems there should be pressure for improvement in each of the populations. Some initial programs in S should be better than others; similarly for F and E. The three participants should improve, but the extent is unknown. It could be that all three success rates end up not much above 50%. The best result would be 100% for S and F and 0% for E. But that is unlikely since some degree of success by E would be necessary to push S and F toward higher performance.
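The fitness bookkeeping just described is simple enough to sketch directly. In the following illustration (ours; generate and recognize are hypothetical stand-ins for the evolved generation and recognition programs), one S individual is evaluated over T transmissions against randomly picked F and E individuals:

    import random

    def evaluate_sender(s_prog, f_pop, e_pop, T=100):
        # Fitness of an S individual: its successes divided by T transmissions.
        successes = 0
        for _ in range(T):
            bit = random.randint(0, 1)             # 0s and 1s with equal probability
            image = s_prog.generate(bit)           # variant of A for 0, of B for 1
            f_guess = random.choice(f_pop).recognize(image)
            e_guess = random.choice(e_pop).recognize(image)
            if f_guess == bit and e_guess != bit:  # success for S: F decodes, E does not
                successes += 1
        return successes / T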
4
Some Implementation Suggestions
Having described a communications game in which patterns are generated and recognized, and a scheme for coevolving the corresponding programs, it remains to suggest the form of these programs. For concreteness we consider generation and recognition of two-dimensional, gray-scale visual patterns and take the transformational viewpoint of Sec. 2. The programs would be compounds of operators that take an input image and transform it into an output image. The input of one of S's generating programs would be an image of an archetypical A or B and its output would be, via transforms, a variant of the input. A recognition program would take such a variant as input and, via transforms, output a further variant. F would match its program's output against the same archetypes of A and B, picking the better match, and deciding 0 or 1 accordingly. E would simply compute the average gray level of its program's output image and compare that to a threshold to decide between 0 and 1. For a typical transformation we imagine in effect a function that takes an image—an array of real numbers—as input and produces an image as output. The value at a point x, y of the output may depend on the value at a point (not necessarily the same point) of the input, or on the values of a collection of input points. As a simple example, in a translation transformation, the value at each output point would equal the value at an input point that is displaced linearly from the output point. In general, we would like the value at an output point potentially to be a rather complicated function of the points of the input image. Sims [6], partly with an artistic or visual design purpose, evolved images using fitnesses based on human judgements. In his system, a candidate image was generated by a Lisp-like tree of elementary functions taking as inputs x, y, and outputs of other elementary functions. The elementary functions included standard Lisp functions as well as various image-processing operators such as blurs, convolutions, or gradients that use neighboring pixel values to calculate their outputs. Noise generating functions were also included. The inputs to the function tree were simply the coordinates x and y, so that the tree in effect performed a transformation of the "blank" x-y plane to yield the
output image. The results of evolving such trees of functions could be surprising and beautiful. Sims' article gives a number of examples of the images, including one (Figure 2) having the following symbolic expression:

    (round (log (+ y (color-grad (round (+ (abs (round (log (+ y (color-grad (round (+ y (log (invert y) 15.5)) x) 3.1 1.86 #(0.95 0.7 0.59) 1.35)) 0.19) x)) (log (invert y) 15.5)) x) 3.1 1.9 #(0.95 0.7 0.35) 1.35)) 0.19) x)
[Figure 2 appears here.]
Fig. 2. Evolved image from Sims [6]. Gray-scale rendering of color original. © 1991 Association for Computing Machinery, Inc. Reprinted with permission.
Such an image-generating program is a good starting point for us, except for two missing properties. First, the program does not transform an input image; its only inputs are x and y. Second, the program is deterministic: it is not able to produce different outputs for the same image input, a property required in order to produce image variants. To transform an image, the program needs to take as input not only x and y, but also the input image values. A convenient way to do this appears to be to add the image to the function set. That is, add Im(x, y) to the function set, where Im is a function that maps image points to image values of the current input. For example, consider the expression (* k (Im (- x x0) (- y y0))). The effect is to produce an output that translates the input by x0 and y0 in the x and y directions and alters its contrast by the factor k. It seems fairly clear that adding the current input image, as a kind of function, to the function set (it could apply at any stage), is quite general and would permit a great variety of image transformations.
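As a concrete reading of that expression, the sketch below (ours; it assumes images are 2-D NumPy arrays and treats points outside the input as zero) computes Output(x, y) = k · Im(x − x0, y − y0), i.e., a translation by (x0, y0) combined with a contrast change by factor k:

    import numpy as np

    def translate_and_scale(im, k=1.5, x0=2, y0=1):
        # Output(x, y) = k * Im(x - x0, y - y0); out-of-range input points give 0.
        h, w = im.shape
        out = np.zeros_like(im, dtype=float)
        for x in range(h):
            for y in range(w):
                xs, ys = x - x0, y - y0
                if 0 <= xs < h and 0 <= ys < w:
                    out[x, y] = k * im[xs, ys]
        return out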
To allow different transformations from the same program is not difficult. One approach is to include a "switch" function, Sw, in the function set. Sw would have two inputs and would pass one or the other of them to its output depending on the setting of a random variable at evaluation time (i.e., set when a new image is to be processed and not reset until the next image). The random variable would be a component of a vector of random binary variables, one variable for each specific instance of Sw in the program. Then at evaluation time, the random vector would be re-sampled and the resulting component values would define a specific path through the program tree. The number of distinct paths is 2 raised to the number of instances of Sw, and equals the number of distinct input image variants that the program can create. If that number turns out to be too small, other techniques for creating variation will be required. The transformation programs just described would be directly usable by S to generate variants of A and B starting with archetypes of each. F and E would also use such programs, but not alone. Recognition, in the present approach, reverses generation: it takes a received image and attempts to transform it back into an archetype. Since it does not know the identity of the received image, how does the recognizer know which transformations to apply? We suggest that a recognition program be a kind of "Pittsburgh" classifier system [7] in which each classifier has a condition part intended to be matched against the input, and an action part that is a transformation program of the kind used by S (but without Sw). In the simplest case the classifier condition would be an image-like array of reals to be matched against the input image; the best-matching classifier's transformation program would then be applied to the image. The resulting output would then be matched (by F) against archetypes A and B and the better-matching character selected. E, as noted earlier, would compare the average of the output image with a threshold. It might be desirable for recognition to take more than one match-transform step; they could be chained up to a certain number, or until a sufficiently sharp A/B decision (or difference from threshold) occurred.²
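The switch mechanism can be sketched as follows (our toy encoding, not the paper's: a program is a tree of nested tuples, and each ('Sw', i, a, b) node reads the i-th component of a random bit vector sampled once per input image):

    import random

    def evaluate(node, bits):
        # Leaves are constants; an ('Sw', i, a, b) node passes one of its two
        # inputs through, chosen by the per-image random bit bits[i].
        if not isinstance(node, tuple):
            return node
        _, i, a, b = node
        return evaluate(a if bits[i] == 0 else b, bits)

    # Two Sw instances -> 2**2 = 4 distinct variants of the same program.
    program = ('Sw', 0, 1.0, ('Sw', 1, 2.0, 3.0))
    bits = [random.randint(0, 1) for _ in range(2)]  # re-sampled per image
    print(evaluate(program, bits))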
5
Discussion and Conclusion
A coevolutionary framework has been proposed that, if it works, may create interesting pattern generators and recognizers. We must ask, is it relevant to the kinds of natural patterns noted in Section 2? Natural patterns are not ones created by generators to communicate with friends without informing enemies³. Instead, natural patterns seem to be clusters of variants that become as large as possible without confusing their natural recipients, and no intruder is involved. Perhaps that framework, which also may
suggest a coevolution, ought to be explored. But the present framework should give insights, too. A basic hypothesis here is that recognition is a process of transforming a pattern into a standard or archetypical instance. Success by the present scheme—since it uses transformations—would tend to support that hypothesis. More important, the kinds of operators that are useful will be revealed (though extracting such information from symbolic expressions can be a chore). For instance, will the system evolve operators similar to human saccades and will it size-normalize centered objects? It would also be interesting to observe what kinds of matching templates evolve in the condition parts of the recognizer classifiers. For instance, are large-area, relatively crude templates relied upon to get a rough idea of which transforms to apply? If so, it would be in contrast to recognition approaches that proceed from bottom up—e.g. finding edges—instead of top down. Such autonomously created processes would seem of great interest to more standard studies of pattern recognition. The reason is that standard studies involve choices of method that are largely arbitrary, and if they work there is still a question of generality. In contrast, information gained from a relatively unconstrained evolutionary approach might, by virtue of its human-independence, have a greater credibility and extensibility. It is unclear how well the present framework will work—for instance whether F's excess of a priori information over E's will be enough to drive the coevolution. It is also unclear, even if it works, whether the results will have wider relevance. But the proposal is offered in the hope that its difference from traditional approaches will inspire new experiments and thinking about a central problem in computer science.
² Recognition will probably require a chain of steps, as the system changes its center of attention or other viewpoint. State memory from previous steps will likely be needed, which favors use of a Pittsburgh over a "Michigan" [3,8] classifier system, since the former is presently more adept at internal state.
³ There may be special cases!
References
1. Džeroski, S., de Raedt, L., Driessens, K.: Relational reinforcement learning. Machine Learning 43, 7–52 (2001)
2. Hillis, W.D.: Co-evolving parasites improve simulated evolution as an optimization procedure. Physica D 42, 228–234 (1990)
3. Holland, J.H.: Escaping Brittleness: The Possibilities of General-Purpose Learning Algorithms Applied to Parallel Rule-Based Systems. In: Mitchell, Michalski, Carbonell (eds.) Machine Learning, an Artificial Intelligence Approach, vol. II, ch. 20, pp. 593–623. Morgan Kaufmann, San Francisco (1986)
4. Mellor, D.: A first order logic classifier system. In: Beyer, H.-G., O'Reilly, U.-M., Arnold, D.V., Banzhaf, W., Blum, C., Bonabeau, E.W., Cantu-Paz, E., Dasgupta, D., Deb, K., Foster, J.A., de Jong, E.D., Lipson, H., Llora, X., Mancoridis, S., Pelikan, M., Raidl, G.R., Soule, T., Tyrrell, A.M., Watson, J.-P., Zitzler, E. (eds.) GECCO 2005: Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, Washington DC, USA, June 25-29, vol. 2, pp. 1819–1826. ACM Press, New York (2005)
5. Shu, L., Schaeffer, J.: VCS: Variable Classifier System. In: Schaffer, J.D. (ed.) Proceedings of the 3rd International Conference on Genetic Algorithms (ICGA 1989), George Mason University, pp. 334–339. Morgan Kaufmann, San Francisco (June 1989), http://www.cs.ualberta.ca/~jonathan/Papers/Papers/vcs.ps
6. Sims, K.: Artificial evolution for computer graphics. Computer Graphics 25(4), 319–328 (1991), http://doi.acm.org/10.1145/122718.122752, also http://www.karlsims.com/papers/siggraph91.html
7. Smith, S.F.: A Learning System Based on Genetic Adaptive Algorithms. PhD thesis, University of Pittsburgh (1980)
8. Wilson, S.W.: Classifier Fitness Based on Accuracy. Evolutionary Computation 3(2), 149–175 (1995)
How Fitness Estimates Interact with Reproduction Rates: Towards Variable Offspring Set Sizes in XCSF Patrick O. Stalph and Martin V. Butz Department of Cognitive Psychology III, University of Würzburg Röntgenring 11, 97080 Würzburg, Germany {patrick.stalph,butz}@psychologie.uni-wuerzburg.de http://www.coboslab.psychologie.uni-wuerzburg.de
Abstract. Despite many successful applications of the XCS classifier system, a rather crucial aspect of XCS' learning mechanism has hardly ever been modified: exactly two classifiers are reproduced when XCSF's iterative evolutionary algorithm is applied in a sampled problem niche. In this paper, we investigate the effect of modifying the number of reproduced classifiers. In the investigated problems, increasing the number of reproduced classifiers increases the initial learning speed. In less challenging approximation problems, the final approximation accuracy is also unaffected. In harder problems, however, learning may stall, yielding worse final accuracies. In this case, over-reproductions of inaccurate, ill-estimated, over-general classifiers occur. Since the quality of the fitness signal decreases if there is less time for evaluation, a higher reproduction rate can deteriorate the fitness signal, thus—dependent on the difficulty of the approximation problem—preventing further learning improvements. In order to speed up learning where possible while still assuring learning success, we propose an adaptive offspring set size that may depend on the current reliability of classifier parameter estimates. Initial experiments with a simple offspring set size adaptation show promising results. Keywords: LCS, XCS, Reproduction, Selection Pressure.
1
Introduction
Learning classifier systems were introduced over thirty years ago [1] as cognitive systems. Over all these years, it has been clear that there is a strong interaction between parameter estimations—be it by traditional bucket brigade techniques [2], the Widrow-Hoff rule [3,4], or by recursive least squares and related linear approximation techniques [5,6]—and the genetic algorithm, in which the successful identification and propagation of better classifiers depends on the accuracy of these estimates. Various control parameters have been used to balance genetic reproduction with the reliability of the parameter estimation, but to the best of our knowledge, there is no study that addresses the estimation problem explicitly.
In the XCS classifier system [4], reproduction takes place by means of a steady-state, niched GA. Reproductions are activated in current action sets (or match sets in function approximation problems, as well as in the original XCS paper). Upon reproduction, two offspring classifiers are generated, which are mutated and recombined with certain probabilities. Reproduction is balanced by the θGA threshold. It specifies that GA reproduction is activated only if the average time since the last GA activation in the set exceeds θGA. It has been shown that the threshold can delay learning speed but it also prevents the neglect of rarely sampled problem niches in the case of unbalanced data sets [7]. Nonetheless, the reproduction of two classifiers seems to be rather arbitrary—except for the fact that two offspring classifiers are needed for simple recombination mechanisms. Unless the Learning Classifier System has a hard time learning the problem, the reproduction of more than two classifiers could speed up learning. Thus, this study investigates the effect of modifying the number of offspring classifiers generated upon GA invocation. We further focus our study on the real-valued domain and thus on the XCSF system [8,9]. Besides, we use the rotating hyperellipsoidal representation for the evolving classifier condition structures [10]. This paper is structured as follows. Since we assume general knowledge of XCS¹, we immediately start investigating performance of XCSF on various test problems and with various offspring set sizes. Next, we discuss the results and provide some theoretical considerations. Finally, we propose a road-map for further studying the observed effects and adapting the offspring set sizes according to the perceived problem difficulty and learning progress as well as on the estimated reliability of available classifier estimates.
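For concreteness, the following sketch (ours, not the authors' implementation; select_parent, crossover, and mutate stand in for XCSF's usual selection and variation operators) shows a GA invocation guarded by θGA, with the offspring set size exposed as the parameter varied in this paper:

    import random

    def select_parent(match_set):
        # Placeholder for XCSF's usual fitness-based selection.
        return random.choice(match_set)

    def maybe_invoke_ga(match_set, t, theta_ga, crossover, mutate, n_offspring=2):
        # Numerosity-weighted average time of the last GA activation in the set.
        num_sum = sum(cl.num for cl in match_set)
        avg_ts = sum(cl.ts * cl.num for cl in match_set) / num_sum
        if t - avg_ts <= theta_ga:
            return []                      # activated too recently: no reproduction
        for cl in match_set:
            cl.ts = t                      # stamp the set with the current time
        offspring = []
        for _ in range(n_offspring // 2):  # offspring are produced in pairs
            a, b = select_parent(match_set), select_parent(match_set)
            c1, c2 = crossover(a, b)
            offspring.extend([mutate(c1), mutate(c2)])
        return offspring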
2
Increased Offspring Set Sizes
To study the effects of increased offspring set sizes, we chose four challenging functions defined in [0, 1]², each with rather distinct regularities:

    f1(x1, x2) = sin(4π(x1 + x2))    (1)

    f2(x1, x2) = exp(−8 Σi (xi − 0.5)²) · cos(8π Σi (xi − 0.5)²)    (2)

    f3(x1, x2) = max{ exp(−10(2x1 − 1)²), exp(−50(2x2 − 1)²), 1.25 exp(−5((2x1 − 1)² + (2x2 − 1)²)) }    (3)

    f4(x1, x2) = sin(4π(x1 + sin(πx2)))    (4)
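The four test functions can be transcribed directly from Eqs. (1)-(4); the following plain Python rendering (ours, for reproducibility) may help when re-implementing the benchmarks:

    import math

    def f1(x1, x2):
        return math.sin(4 * math.pi * (x1 + x2))

    def f2(x1, x2):
        r2 = (x1 - 0.5) ** 2 + (x2 - 0.5) ** 2   # sum_i (x_i - 0.5)^2
        return math.exp(-8 * r2) * math.cos(8 * math.pi * r2)

    def f3(x1, x2):
        u, v = 2 * x1 - 1, 2 * x2 - 1
        return max(math.exp(-10 * u * u),
                   math.exp(-50 * v * v),
                   1.25 * math.exp(-5 * (u * u + v * v)))

    def f4(x1, x2):
        return math.sin(4 * math.pi * (x1 + math.sin(math.pi * x2)))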
Function f1 has been used in various studies [10] and has a diagonal regularity. It requires the evolution of stretched hyperellipsoids that are rotated by 45°. Function f2 is a radial sine function that requires a somewhat circular distribution of
¹ For details about XCS refer to [4,11].
[Figure 1 appears here: surface plots and contour lines for (a) the sine function, (b) the radial sine function, (c) the crossed ridge function, and (d) the sine-in-sine function.]
Fig. 1. Final function approximations, including contour lines, are shown on the left-hand side. The corresponding population distributions after compaction are shown on the right-hand side. For visualization purposes, the conditions are drawn 80% smaller than their actual size.
[Figure 2 appears here: prediction error and macro classifier curves for (a) the sine function and (b) the radial sine function.]
Fig. 2. Different selection strengths with fixed (left-hand side) or match-set-size relative (right-hand side) offspring set sizes can speed up learning significantly but potentially increase the final error level reached. The vertical axis is log-scaled. Error bars represent one standard deviation and the thin dashed line shows the target error ε0 = 0.01.
classifiers. Function f3 is a crossed ridge function, for which it has been shown that XCSF performs competitively in comparison with deterministic machine learning techniques [10]. Finally, function f4 twists two sine functions so that it becomes very hard for the evolutionary algorithm to receive enough signal from the parameter estimates in order to structure the problem space more effectively for an accurate function approximation. Figure 1 shows the approximation surfaces and spatial partitions generated by XCSF with a population size of N = 6400 and with compaction [10] activated after 90k learning iterations². The graphs on the left-hand side show the actual function predictions and qualitatively confirm that XCSF is able to learn accurate approximations for all four functions. On the right-hand side, the corresponding condition structures of the final populations are shown. In XCS and
² Other parameters were set to the following values: β = .1, η = .5, α = 1, ε0 = .01, ν = 5, θGA = 50, χ = 1.0, μ = .05, r0 = 1, θdel = 20, δ = 0.1, θsub = 20. All results in this paper are averaged over 20 runs.
[Figure 3 appears here: prediction error and macro classifier curves for (a) the crossed ridge function and (b) the sine-in-sine function.]
Fig. 3. While in the crossed ridge function larger offspring set sizes mainly speed up learning, in the challenging sine-in-sine function, larger offspring set sizes can strongly affect the final error level reached
XCSF, two classifiers are selected for reproduction, crossover, and mutation. We now investigate the influence of modified reproduction sizes. Performance of the standard setting, where two classifiers are selected for reproduction (with replacement), is compared with four other reproduction size choices. In the first experiment, the offspring set size was set to four and eight classifiers, respectively. Thus, four (eight) classifiers are reproduced upon GA invocation and crossover is applied twice (four times) before the mutation operator is applied. In a second, more aggressive setting, the offspring set size is set relative to the current match set size, namely to 10% and 50% of the match set size. Especially the last setting was expected to reveal that excessive reproduction can deteriorate learning. Learning progress is shown in Figure 2 for functions f1 and f2. It can be seen that in both cases standard XCSF with two offspring classifiers learns significantly more slowly than settings with a larger number of offspring classifiers. The number of distinct classifiers in the population (so-called macro classifiers), on the other hand, shows that initially larger offspring set sizes increase the population sizes much faster. Thus, an initially higher diversity due to larger offspring sets yields faster initial learning progress. However, towards the end of the run,
standard XCSF actually reaches a slightly lower error than the settings with larger offspring sets. The larger the offspring set, the more pronounced this effect. In the radial sine function, this effect is not as strong as in the sine function. Similar observations can also be made in the crossed ridge function, which is shown in Figure 3(a). In the sine-in-sine function f4 (Figure 3(b)), larger offspring set sizes degrade performance most severely. While a selection of four offspring classifiers as well as a selection of a size of 10% of the match set size still shows slight error decreases, larger offspring set sizes completely stall learning—despite large and diverse populations. It appears that the larger offspring set sizes prevent the population from identifying relevant structures and thus prevent the development of accurate function approximations.
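The five settings compared above differ only in the offspring count handed to the GA upon invocation; a hypothetical helper (ours, with the relative settings floored at two so that crossover always has a pair to work with) makes the mapping explicit:

    def offspring_count(setting, match_set_size):
        # Maps an experimental setting label to the number of offspring.
        fixed = {"select2": 2, "select4": 4, "select8": 8}
        if setting in fixed:
            return fixed[setting]
        if setting == "sel10%":
            return max(2, round(0.10 * match_set_size))
        if setting == "sel50%":
            return max(2, round(0.50 * match_set_size))
        raise ValueError("unknown setting: " + setting)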
3
Theoretical Considerations
What is the effect of increasing the number of offspring generated upon GA invocation? The results indicate that initially, faster learning can be induced. However, later on, learning potentially stalls. Previously, learning in XCS was characterized as an interactive learning process in which several evolutionary pressures [12] foster learning progress: (1) A fitness pressure is induced since usually, on average, more accurate classifiers are selected for reproduction than for deletion. (2) A set pressure, which causes an intrinsic generalization pressure, is induced since, also on average, more general classifiers are selected for reproduction than for deletion. (3) Mutation pressure causes diversification of classifier conditions. (4) Subsumption pressure causes convergence to maximally accurate, general classifiers, if found. Since fitness and set pressure work on the same principle, increasing the number of reproductions increases both pressures equally. Thus, their balance is maintained. However, the fitness pressure only applies if there is a strong-enough fitness signal, which depends on the number of evaluations a classifier underwent before the reproduction process. The mutation pressure also depends on the number of reproductions; thus, a faster diversification can be expected given larger offspring set sizes. Another analysis estimated the reproductive opportunities a superior classifier might have before being deleted [13]. Moreover, a niche support bound was derived [14], which characterizes the probability that a classifier is sustained in the population, given that it represents an important problem niche for the final solution. Both of these bounds assume that classifier accuracy is estimated accurately. However, the larger the offspring set size is, the faster the classifier turnaround, thus the shorter the average time a classifier stays in the population, and thus the fewer iterations available to a classifier until it is deleted. The effect is that the GA in XCS has to work with classifier parameter estimates that are less reliable since they underwent fewer updates on average. Thus, larger offspring set sizes induce larger noise in the selection process. As long as the fitness pressure leads in the right direction because the parameter estimates have enough signal, learning proceeds faster. This latter reason
stands also in relation to the estimated learning speed of XCS approximated elsewhere [15]. Since reproductions of more accurate classifiers are increased, learning speed increases as long as more accurate classifiers are detected. Due to this reasoning, however, it can also be expected that learning can stall prematurely. This should be the case when the noise, induced by an increased reproduction rate, is too high so that the identification of more accurate classifiers becomes impossible. Better offspring classifiers get deleted before their fitness is sufficiently evaluated. In other words, the fitness signal is too weak for the selection process. This signal-to-noise ratio (fitness signal to selection noise) depends on (1) the problem structure at hand, (2) the solution representation given to XCS (condition and prediction structures), and (3) the population size. Thus, it is hard to specify the ratio exactly and future research is needed to derive mathematical bounds on this problem. Nonetheless, these considerations explain the general observations in the considered functions: The more complex the function, the more problematic larger offspring sets become—even reproducing the traditional two offspring classifiers may be too fast to reach the target error ε0. To control the signal-to-noise problem, consequently, it is important to balance reproduction rates and offspring set sizes problem-dependently. A similar suggestion was made elsewhere for the control of parameter θGA [7]. In the following, we investigate an approach that decreases the offspring set size over a learning experiment to get the best of both worlds: fast initial learning speeds and maximally accurate final solution representations.
4
Adapting Offspring Set Sizes
As a first approach to determine if it can be useful to use larger initial offspring set sizes and to decrease those sizes during the run, we linearly scale the offspring set size from 10% of the match set size down to two over the 100k learning iterations. Figure 4 shows the resulting performance in all four functions, comparing the linear scaling with the traditional two offspring classifiers and a fixed 10% offspring set size. In graphs 4(a)-(c) we can see that the scaling technique reaches maximum accuracy. Particularly in graph 4(a) we can see that the performance stalling is overcome and an error level is reached that is similar to the one reached with the traditional XCS setting. However, performance in function f4 shows that the error still stays on a high level initially but it starts decreasing further when compared to a 10% offspring set size later in the run. Thus, the results show that a linear reduction of offspring set sizes can have positive effects on initial learning speed while low reproduction rates at the end of a run allow for a refinement of the final solution structure. However, the results also suggest that the simple linear scheme is not necessarily optimal and its success is highly problem-dependent. Future research needs to investigate flexible adaptation schemes that take the signal-to-noise ratio into account.
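The linear schedule can be sketched as follows (our illustration; the 10% starting point and the 100k-iteration horizon follow the experimental setup described above):

    def offspring_set_size(t, match_set_size, t_max=100000):
        # Linearly anneal from 10% of the match set size at t = 0
        # down to two classifiers at t = t_max.
        start = 0.10 * match_set_size
        frac = min(t / t_max, 1.0)
        return max(2, round(start + frac * (2.0 - start)))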
[Figure 4 appears here: prediction error and macro classifier curves for (a) the sine function, (b) the radial sine function, (c) the crossed ridge function, and (d) the sine-in-sine function under the sel2, sel10%, and sel10%to2 settings.]
Fig. 4. When decreasing the number of generated offspring over the learning trial, learning speed is kept high while the error convergence reaches the level that is reached by always generating two offspring classifiers (a,b,c). However, in the case of the challenging sine-in-sine function, further learning would be necessary to reach a similarly low error level (d).
5
Conclusions
This paper has shown that a fixed offspring set size does not necessarily yield the best learning speed that XCSF can achieve. Larger offspring set sizes can strongly increase the initial learning speed but do not necessarily reach maximum accuracy. Adaptive offspring set sizes, if scheduled appropriately, can get the best of both worlds in yielding high initial learning speed and low final error. The results, however, also suggest that a simple adaptation scheme is not generally applicable. Furthermore, the theoretical considerations suggest that a signal-to-noise estimate could be used to control the GA offspring schedule and the offspring set sizes. Given a strong fitness signal, a larger set of offspring could be generated. Another consideration that needs to be taken into account in such an offspring generation scheme, however, is the fact that problem domains may be
strongly unbalanced, in which some subspaces may be very easily approximated while others may be very hard. In these cases, it has been shown, though, that the θGA threshold can be increased to ensure a representation of the complete problem [7]. Future research should consider adapting θGA hand-in-hand with the offspring set sizes. In which way this may be accomplished exactly still needs to be determined. Nonetheless, it is hoped that the results and considerations of this work provide clues in the right direction in order to speed up XCS(F) learning and to make learning even more robust in hard problems.
Acknowledgments The authors acknowledge funding from the Emmy Noether program of the German research foundation (grant BU1335/3-1) and would like to thank their colleagues at the department of psychology and the COBOSLAB team.
References
1. Holland, J.H.: Adaptation. In: Progress in Theoretical Biology, vol. 4, pp. 263–293. Academic Press, New York (1976)
2. Holland, J.H.: Properties of the bucket brigade algorithm. In: Proceedings of the 1st International Conference on Genetic Algorithms, Hillsdale, NJ, USA, pp. 1–7. L. Erlbaum Associates Inc., Mahwah (1985)
3. Widrow, B., Hoff, M.E.: Adaptive switching circuits. Western Electronic Show and Convention, Convention Record, Part 4, 96–104 (1960)
4. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2), 149–175 (1995)
5. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: Prediction update algorithms for XCSF: RLS, Kalman filter, and gain adaptation. In: GECCO 2006: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pp. 1505–1512. ACM, New York (2006)
6. Drugowitsch, J., Barry, A.: A formal framework and extensions for function approximation in learning classifier systems. Machine Learning 70, 45–88 (2008)
7. Orriols-Puig, A., Bernadó-Mansilla, E.: Bounding XCS's parameters for unbalanced datasets. In: GECCO 2006: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pp. 1561–1568. ACM, New York (2006)
8. Wilson, S.W.: Get real! XCS with continuous-valued inputs. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1813, pp. 209–219. Springer, Heidelberg (2000)
9. Wilson, S.W.: Classifiers that approximate functions. Natural Computing 1, 211–234 (2002)
10. Butz, M.V., Lanzi, P.L., Wilson, S.W.: Function approximation with XCS: Hyperellipsoidal conditions, recursive least squares, and compaction. IEEE Transactions on Evolutionary Computation 12, 355–376 (2008)
11. Butz, M.V., Wilson, S.W.: An algorithmic description of XCS. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp. 267–274. Springer, Heidelberg (2001)
12. Butz, M.V., Kovacs, T., Lanzi, P.L., Wilson, S.W.: Toward a theory of generalization and learning in XCS. IEEE Transactions on Evolutionary Computation 8, 28–46 (2004)
13. Butz, M.V., Goldberg, D.E., Tharakunnel, K.: Analysis and improvement of fitness exploitation in XCS: Bounding models, tournament selection, and bilateral accuracy. Evolutionary Computation 11, 239–277 (2003)
14. Butz, M.V., Goldberg, D.E., Lanzi, P.L., Sastry, K.: Problem solution sustenance in XCS: Markov chain analysis of niche support distributions and the impact on computational complexity. Genetic Programming and Evolvable Machines 8, 5–37 (2007)
15. Butz, M.V., Goldberg, D.E., Lanzi, P.L.: Bounding learning time in XCS. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3103, pp. 739–750. Springer, Heidelberg (2004)
Current XCSF Capabilities and Challenges Patrick O. Stalph and Martin V. Butz Department of Cognitive Psychology III, University of Würzburg Röntgenring 11, 97080 Würzburg, Germany {patrick.stalph,butz}@psychologie.uni-wuerzburg.de http://www.coboslab.psychologie.uni-wuerzburg.de
Abstract. Function approximation is an important technique used in many different domains, including numerical mathematics, engineering, and neuroscience. The XCSF classifier system is able to approximate complex multi-dimensional function surfaces using a patchwork of simpler functions. Typically, locally linear functions are used due to the trade-off between expressiveness and interpretability. This work discusses XCSF's current capabilities, but also points out current challenges that can hinder learning success. A theoretical discussion on when XCSF works is intended to improve the comprehensibility of the system. Current advances with respect to scalability theory show that the system constitutes a very effective machine learning technique. Furthermore, the paper points out how to tune relevant XCSF parameters in actual applications and how to choose appropriate condition and prediction structures. Finally, a brief comparison to the Locally Weighted Projection Regression (LWPR) algorithm highlights positive as well as negative aspects of both methods. Keywords: LCS, XCS, XCSF, LWPR.
1
Introduction
The increasing interest in Learning Classifier Systems (LCS) [1] has propelled research, and LCS have proven their capabilities in various applications, including multistep problems [2,3], data mining tasks [4,5], as well as robot applications [6,7]. The focus of this work is on the Learning Classifier System XCSF [8], which is a modified version of the original XCS [2]. XCSF is able to approximate multi-dimensional, real-valued function surfaces from samples by locally weighted, usually linear, models. While XCS theory has been investigated thoroughly in the binary domain [5], theory on real-valued input and output spaces remains sparse. There are two important questions: When does the system work at all and how does it scale with increasing complexity? We will address these questions by first carrying over parts of the XCS theory and, secondly, showing the results of a scalability analysis, which suggests that XCSF scales optimally in the required population size.
However, even when theory tells us that a system is applicable to a specific problem type, the problem is not yet solved. The practitioner has to choose appropriate parameters and has to decide on the solution representation, which means condition and prediction structures for XCSF. Therefore, we give a short guide on the system's relevant parameters and how to set them appropriately. Furthermore, a brief discussion on condition and prediction structures is provided to foster the understanding of how XCSF's generalization power can be fully exploited. Finally, we briefly compare XCSF with Locally Weighted Projection Regression (LWPR). LWPR is a statistics-based greedy algorithm for function approximation that also uses spatially localized linear models to predict the value of non-linear functions. A discussion of pros and cons points out the capabilities of each algorithm. The remainder of this article is structured as follows. Section 2 is concerned with theoretical aspects of XCSF, that is, (1) when the system works at all and (2) how XCSF scales with increasing problem complexity. In contrast, Section 3 discusses how to set relevant parameters given an actual, unknown problem. In Section 4, we briefly compare XCSF with LWPR and the article ends with a short summary and concluding remarks.
2
Theory
We assume sufficient knowledge about the XCSF Learning Classifier System and directly start with a theoretical analysis. We carry over preconditions for successful learning known from binary XCS and propose a scalability model, which shows how the population size scales with increasing function complexity and dimensionality. 2.1
Preconditions - When It Works
In order to successfully approximate a function, XCSF has to overcome the same challenges that were identified for XCS in binary domains [5]. These challenges were described as (1) covering challenge, (2) schema challenge, (3) reproductive opportunity challenge, (4) learning time challenge, and (5) solution sustenance challenge. The following paragraphs briefly summarize results from a recent study [9] that investigated the mentioned challenges in depth with respect to XCSF. Covering Challenge. The initial population of XCSF should be able to cover the whole input space, because otherwise the deletion mechanism creates holes in the input space and local knowledge about these subspaces is lost (the so-called covering-deletion cycle [10]). Consequently, when successively sampled problem instances tend to be located in empty subspaces, the hole is covered with a default classifier and another hole is created due to the deletion mechanism. In analogy to results with binary XCS, there is a linear relation between initial classifier volume and the required population size to master the covering challenge. In particular, the population size has to grow inversely linearly with the initial classifier volume.
Schema and Reproductive Opportunity Challenge. When the covering challenge is met, it is required that the genetic algorithm (a) discovers better substructures and (b) reproduces these substructures. In binary genetic algorithms such substructures are often termed Building Blocks, as proposed in John H. Holland's Schema Theory [1]. However, the definition of real-valued schemata is non-trivial [11,12,13,14] and it is even more difficult to define building blocks for infinite input and output spaces [15,16]. While the stepwise character in binary functions emphasizes the processing of building blocks via crossover, the smooth character of real-valued functions emphasizes hill-climbing mechanisms. To the best of our knowledge, there is no consensus in the literature on this topic and consequently it remains unclear how a building block can be defined for the real-valued XCSF Learning Classifier System. If XCSF's fitness landscape is neither flat nor deceptive, there remains one last problem: noise on the fitness signal due to a finite number of samples. Prediction parameter estimates rely on the samples seen so far, and so do the prediction error and the fitness. If the classifier turnaround (that is, reproduction and deletion of classifiers) is too high, the selection mechanism cannot identify better substructures and the learning process is stuck [17], which can be alleviated by slowing down the learning, e.g. by increasing θGA [18]. Learning Time Challenge. The learning time mainly depends on the number of mutations from initial classifiers to the target shape of accurate and maximally general classifiers. A too-small population size may delay the learning time, because good classifiers get deleted and knowledge is lost. Furthermore, redundancy in the space of possible mutations (e.g. rotation for dimensions n > 3 is not unique) may increase the learning time. A recent study estimated a linear relation between the number of required mutations and the learning time [9]. Solution Sustenance Challenge. Finally, XCSF has to assure that the evolved accurate solution is sustained. This challenge is mainly concerned with the deletion probability. Given the population size is high enough, the GA has enough "room" to work without destroying accurate classifiers. The resulting bound states that the population size needs to grow inversely linearly with the volume of the accurate classifiers to be sustained. 2.2
2.2 A Scalability Model
Given that all of the above challenges are overcome and the system is able to learn an accurate approximation of the problem at hand, it is important to know how changes in the function complexity or dimensionality affect XCSF's learning performance. In particular, we model the relation between
– function complexity (defined via the prediction error),
– input space dimensionality,
– XCSF's population size, and
– the target error ε0.
In order to simplify the model, we assume a uniform function structure and uniform sampling¹. This also implies a uniform classifier structure, that is, uniform shape and size. Without loss of generality, let the n-dimensional input space be confined to [0,1]^n. Furthermore, we assume that XCSF evolves an optimal solution [19]. This includes four properties, namely

1. completeness, that is, each possible input is covered in that at least one classifier matches;
2. correctness, that is, the population predicts the function surface accurately in that the prediction error is below the target error ε0;
3. minimality, that is, the population contains the minimum number of classifiers needed to represent the function completely and correctly;
4. non-overlappingness, that is, no input is matched by more than one classifier.

In sum, we assume a uniform patchwork of equally sized, non-overlapping, accurate, and maximally general classifiers. These assumptions reflect reality on uniform functions except for non-overlappingness, which is almost impossible for real-valued input spaces. We consider a uniformly sampled function of uniform structure

f_Γ : [0,1]^n → R,    (1)
where n is the dimensionality of the input space and Γ reflects the function complexity. Since we fix neither the condition type nor the predictor used in XCSF, we have to define the complexity via the prediction error. We define Γ such that a linear increase in this value results in the same increase in the prediction error. Thus, saying that the function is twice as complex means that the prediction error is twice as high for the same classifiers. Since the classifier volume V influences the prediction error ε in a polynomial fashion on uniform functions, we can summarize the assumptions in the following equation:

ε = Γ · V^(1/n)    (2)

We can now derive the optimal classifier volume and the optimal population size. Using the target error ε0, we get an optimal volume of

V_opt = (ε0 / Γ)^n.    (3)

The volume of the input space to be covered is one, and it follows that the optimal population size is

N_opt = (Γ / ε0)^n.    (4)

To sum up, the dimensionality n has an exponential influence on the population size, while the function complexity Γ and the target error ε0 have a polynomial influence. Increasing the function complexity will require a polynomial increase of the population size in the order n.
¹ Non-uniform sampling is discussed elsewhere [18].
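To make the bounds concrete, the following minimal Python sketch evaluates Equations 3 and 4; the function names are ours and merely illustrate the derived relations, they are not part of any XCSF implementation.

def optimal_volume(gamma, eps0, n):
    # Optimal classifier volume V_opt = (eps0 / Gamma)^n (Equation 3).
    return (eps0 / gamma) ** n

def optimal_population_size(gamma, eps0, n):
    # Optimal population size N_opt = (Gamma / eps0)^n (Equation 4).
    return (gamma / eps0) ** n

# Dimensionality acts exponentially, complexity polynomially (order n):
print(optimal_population_size(gamma=1.0, eps0=0.01, n=2))  # 10000.0
print(optimal_population_size(gamma=2.0, eps0=0.01, n=2))  # 40000.0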
Fig. 1. Comparative plots of the final population size after condensation (data points) and the developed scalability theory (solid lines) for dimensions n = 1 to n = 6. The number of macro classifiers is plotted against the function complexity, which is modeled via the increasing gradient (both axes on log scale). The order of the polynomial is equal to the dimension n; thus, an increase in dimensionality requires an exponential increase in population size, while an increasing function complexity results in a polynomial increase. Apart from an approximately constant overhead due to overlapping classifiers, the scalability model fits reality.
Note that no assumptions are made about the condition type or the predictor used. The intentionally simple Equations 3 and 4 hide a complex geometric problem in the variable Γ. For example, assume a three-dimensional non-linear function that is approximated using linear predictions and rotating ellipsoidal conditions. Calculating the prediction error is non-trivial for such a setup. When the above bounds are required exactly, this geometric problem has to be solved anew for every condition-prediction-function combination. In order to validate the scalability model, we conducted experiments with interval conditions and constant predictions on a linear function². XCSF with constant predictions is equivalent to XCSR [20] with a single dummy action. As done before in [19] with respect to XCS, we analyze a restricted class of problems for XCSF. On the one hand, the constant prediction makes this setup a worst-case scenario in terms of required population size. On the other hand, the simple setup allows for solving the geometric problem analytically; thus, we can compare the theoretical population size bound from Equation 4 with the actual population size that is required to approximate the respective function. A so-called bisection algorithm runs XCSF with different population size settings in a binary search fashion. On termination, the bisection procedure returns the approximately minimal population size N that is required for successful learning.
² Other settings: 500000 iterations, ε0 = 0.01, β = 0.1, α = 1, δ = 0.1, ν = 5, χ = 1, μ = 0.05, r0 = 1, θGA = 50, θdel = 20, θsub = 20. GA subsumption and uniform crossover were applied.
For details of the bisection algorithm and how the geometric problem is solved, please refer to [9]. Figure 1 shows the results of the bisection experiments on the one- to six-dimensional linear function f_Γ(x_1, ..., x_n) = Γ Σ_{i=1}^{n} x_i, where solid lines represent the developed theory (Equation 4) and the data shown represent the final population size after condensation [21]. For each dimension n, the function difficulty Γ was linearly increased by increasing the gradient of the linear function. The polynomials appear as straight lines on a log-log plot, where the gradient of a line equals the order of the corresponding polynomial. We observe an approximately constant overhead of the actual population size over the scalability theory. This overhead is expected, since the scalability model assumes non-overlappingness. Most importantly, the prediction of the model lies parallel to the actual data, which indicates that the dimension n fits the exponent of the theoretical model. Thus, the experiment confirms the scalability model: problem dimensionality has an exponential influence on the required population size (given full problem space sampling). Furthermore, a linear increase in the problem difficulty (or a linear decrease of the target error ε0) induces a polynomial increase in the population size.
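The bisection procedure itself can be sketched in a few lines of Python; run_xcsf is a hypothetical callback that runs XCSF with a given population size and reports whether learning succeeded, and the bounds are arbitrary placeholders (the actual procedure and success criterion are detailed in [9]).

def bisect_population_size(run_xcsf, n_fail=50, n_success=51200):
    # Assumes run_xcsf(N) -> True iff XCSF learns successfully with
    # population size N, that success is monotone in N, and that the
    # initial n_fail fails.
    while not run_xcsf(n_success):      # grow until learning succeeds
        n_fail, n_success = n_success, n_success * 2
    while n_success - n_fail > 1:       # standard binary search
        mid = (n_fail + n_success) // 2
        if run_xcsf(mid):
            n_success = mid
        else:
            n_fail = mid
    return n_success                    # approximately minimal N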
3 How to Set XCSF's Parameters
Although theory shows that XCSF can learn optimally, it is also important to understand the influence of XCSF's parameter settings, such as population size, condition structures, and prediction types. Besides the importance and the direct influence of a parameter, the interdependencies between parameters are also relevant for the practitioner. In the following, we give a brief overview of important parameters, their dependencies, and how to tune them in actual applications.
3.1 Important Parameters and Interdependencies
A long list of available parameters exists for both XCS and XCSF. Among obviously important parameters, such as the population size N, there are less frequently tuned parameters (e.g. θGA) and parameters that are rarely changed at all, such as the crossover rate χ or the accuracy scale ν. The most important parameters are summarized here.

Population Size N – This parameter specifies the available workspace for the evolutionary search. Therefore it is crucial to set this value high enough to prevent deletion of good classifiers (see Section 2.1).

Target Error ε0 – The error threshold defines the desired accuracy. Evolutionary pressures drive classifiers towards this threshold of accurate and maximally general classifiers.

Condition Type – The structuring capability of XCSF is defined by this setting. Various condition structures are available, including simple axis-parallel intervals [22], rotating ellipsoids [23], and arbitrary shapes using gene expression programming [24].
Prediction Type – Typically linear predictors are used for a good balance of expressiveness and interpretability. However, others are possible, such as constant predictors [8] or polynomial ones [25].

Learning Time – The number of iterations should be set high enough to assure that the prediction error converges to a value below the desired ε0.

GA Frequency Threshold θGA – This threshold specifies that GA reproduction is activated only if the average time since the last GA activation in the set is greater than θGA. Increasing this value delays learning, but may also prevent forgetting and overgeneralization in unbalanced data sets [18].

Mutation Rate μ – The probability of mutation is closely related to the available mutation options of the condition type and thus it is also connected to the dimensionality of the problem. It should be set according to the problem at hand, e.g. μ = 1/m, where m is the number of available mutation options (see the configuration sketch after this list).

Initial Classifier Size r0 – On the one hand, this value should be set high enough to meet the covering challenge, that is, such that simple covering with fewer than N classifiers suffices to cover the whole input space. On the other hand, the initial size should be small enough to yield a fitness signal upon crossover or mutation in order to prevent oversized classifiers from taking over the population.

The other parameters can be set to their default values, thus ensuring a good balance of the evolutionary pressures. The strongest interdependencies can be found between population size N, target error ε0, condition structure, and prediction type, as indicated by the scalability model of Section 2.2. Changing any of these will affect XCSF's learning performance significantly. For example, with a higher population size a lower target error can be reached. An appropriate condition structure may turn a polynomial problem into a linear one, thus requiring fewer classifiers. Advanced predictors are able to approximate more complex functions and thus enable a coarser structuring of the input space, again reducing the required population size. When tuning any of these settings, the related parameters should be kept in mind.
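As an illustration only, the following hypothetical Python configuration sketch collects the parameters discussed above; the numeric values loosely follow the settings reported in footnote 2 where available, the population size is a mere placeholder, and the mutation rate follows the μ = 1/m heuristic.

def xcsf_config(n_dims, mutation_options_per_dim=2):
    m = n_dims * mutation_options_per_dim  # available mutation options (assumption)
    return {
        "N": 6400,            # population size (placeholder); must prevent deletion of good classifiers
        "eps0": 0.01,         # target error
        "theta_GA": 50,       # GA frequency threshold
        "mu": 1.0 / m,        # mutation rate heuristic described above
        "r0": 1.0,            # initial classifier size; must allow covering with fewer than N classifiers
        "iterations": 500000, # learning time
    }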
XCSF’s Solution Representation
Before running XCSF with some arbitrary settings on a particular problem, a few things have to be considered. This mainly concerns the condition and prediction structures, that is, XCSF's solution representation. The next two paragraphs highlight some issues of different representations.

Selecting an Appropriate Predictor. The first step is to select the type of prediction to be used for the function approximation. Linear predictions have a reasonable computational complexity and good expressiveness, while the final solution remains well interpretable. In some cases, it might be required to invert the approximated function after learning, which is easily possible with a linear predictor. However, if prior knowledge suggests a special type of function (e.g. polynomials
or sinusoidal functions), this knowledge can be exploited by using corresponding predictors. The complexity of the prediction mainly influences the classifier updates, which is usually, depending on the dimensionality, a minor factor.

Structuring Capabilities. Closely related to the predictor is the condition structure. The simplest formulations are intervals, that is, rectangles. Alternatively, spheres or ellipsoids (also known as radial basis functions or receptive fields) can be used. More advanced structures include rotation, which allows for exploiting interdimensional dependencies, but also increases the complexity of (1) the evolutionary search space and (2) the computational time for matching, which are major influences on the learning time. On the other hand, if interdependencies can be exploited, the required population size may shrink dramatically, effectively speeding up the whole learning process by orders of magnitude. Finally, it is also possible to use arbitrary structures such as gene expression programming or neural networks. However, the improved generalization capabilities can reduce the interpretability of the developed solutions, and learning success can usually not be guaranteed because the genetic operators used may not yield a mainly local phenotypic search through the expressible condition structures.
3.3 When XCSF Fails
Even the best condition and prediction structures do not necessarily guarantee successful learning. This section discusses some issues where fine-tuning of some parameters may help to reach the desired accuracy. Furthermore, we point out when XCSF reaches its limits, such that simple parameter tuning cannot overcome learning failures. Ideally, given an unknown function, XCSF's prediction error quickly drops below ε0 (see Figure 2(a) for a typical performance graph). When XCSF is not able to accurately learn the function, there are four possible main reasons:

1. The prediction error has not yet converged.
2. The prediction error converged to an average error above the target error.
3. The prediction error stays on an initially very low level, but the function surface is not fully approximated.
4. The prediction error stays on an initially high level.

Given case 1, the learning time is too short to allow for an appropriate structuring of the input space. Increasing the number of iterations will solve this issue. In contrast, case 2 indicates that the function is too difficult to approximate with the given population size, target error, predictor, and condition structure. Figure 2(b) illustrates a problem in which the system does not reach the target error. Increasing the learning time allows for a settling of the prediction error, but the target error is only reached when the maximum population size is increased. While in the previous examples XCSF just does not reach the target error, in other scenarios the system completely fails to learn anything due to bad parameter choices. There are two major factors that may prevent learning completely: covering-deletion cycles and flat fitness landscapes.
Fig. 2. Typical performance measurements on two benchmark functions; each plot shows the prediction error and the number of macro classifiers (in the population and in the match set) against the number of learning steps (in 1000s). The target error ε0 = 0.01 is represented by a dashed line. (a) The chosen settings are well suited for the crossed ridge 2D function and the prediction error converges to a value below the target error. (b) In contrast, the sine-in-sine 2D function is too difficult for the same settings: the system neither reaches the target error nor does the prediction error converge within the given learning time.
Fig. 3. Especially on high-dimensional functions (here: a 20-dimensional sine function), it is crucial to set the initial classifier size r0 to a reasonable value. (a) A too small initial size leads to a covering-deletion cycle. (b) A too large initial size yields a fitness landscape that is too flat: the evolutionary search is unable to identify better substructures and oversized classifiers prevent learning.
Although case 3 seems strange, there is a simple explanation. If the population size and initial classifier size are set such that the input space cannot be covered by the covering mechanism, the system continuously covers and deletes classifiers without any knowledge gain (the covering-deletion cycle [10] mentioned above). Typically, the average match set size is one, the population size quickly reaches the maximum, and the average prediction error is almost zero because the error during covering is zero. As an example, we equip XCSF with a small initial classifier size r0 and run the system on a 20-dimensional sine function, as shown in Figure 3(a). Especially high-dimensional input spaces are prone to this problematic cycle, because (1)
the initial classifier volume has to be high enough to allow for a complete coverage, but (2) the initial volume must not exceed the size beyond which the GA no longer receives a sufficient fitness signal. The latter may be the case when a single mutation of the initial covering shape cannot produce a sufficiently small classifier that captures the (possibly fine-grained) structure of the underlying function. Thus, the GA is missing a fitness gradient and, due to higher reproductive opportunities, over-general classifiers take over the population, as shown in Figure 3(b). Typically, the prediction error does not drop at all. Here XCSF reaches its limits and "simple" parameter tuning may not help to overcome the problem with a reasonable population size. Possibly, a refined initial classifier size yields a reasonable fitness signal and prevents over-general classifiers. Otherwise, it might be necessary to reconsider the condition structure or the corresponding evolutionary operators.
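A rough feasibility check for the covering challenge can be written down directly, under the crude simplifying assumption that each covering classifier occupies a volume of about r0^n in the unit input space (actual covering volumes depend on the condition type and the randomized initialization):

def covering_feasible(population_size, r0, n_dims):
    # Expected volume per covering classifier (crude simplification).
    volume_per_classifier = r0 ** n_dims
    # The initial population must be able to tile [0,1]^n; otherwise
    # a covering-deletion cycle is likely.
    return population_size * volume_per_classifier >= 1.0

print(covering_feasible(6400, r0=0.5, n_dims=20))  # False: 0.5^20 is tiny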
4 A Brief Comparison with Locally Weighted Projection Regression
Apart from traditional function fitting, where the general type of the underlying function has to be known before fitting the data, the so-called Locally Weighted Projection Regression (LWPR) algorithm [26,27] approximates functions iteratively by means of local linear models, as does XCSF. The following paragraphs highlight the main differences between LWPR and XCSF and sketch some theoretical thoughts on performance as well as on the applicability of both systems. The locality of each model is defined by so-called receptive fields, which correspond to XCSF's rotating hyperellipsoidal condition structures [23]. However, in contrast to the steady-state GA in XCSF, the receptive fields in LWPR are structured by means of a statistical gradient descent. The center, that is, the position of a receptive field, is never changed once it is created. Based on the prediction errors, the receptive fields can shrink in specific directions, which, in theory, minimizes the error. Indefinite shrinking is prevented by a penalty term that penalizes small receptive fields. Thus, receptive fields shrink due to prediction errors and enlarge if the influence of prediction errors is less than the influence of the penalty term. However, the ideal statistics from batch learning can only be estimated in an iterative algorithm, and experimental validation is required to shed light on the actual performance of both systems when compared on benchmark functions. One disadvantage of LWPR is that all its statistics are based on linear predictions and the ellipsoidal shape of receptive fields. Thus, alternative predictions or conditions cannot be applied directly. In contrast, a wide variety of prediction types and condition structures are available for XCSF, allowing for a higher representational flexibility. Furthermore, it is easily possible to decouple conditions and predictions in XCSF [6], in which case conditions cluster a contextual space for predictions in another space. Since the fitness signal for the GA is based only on prediction errors, no coupling is necessary. It remains an open research challenge to realize similar mechanisms and modifications with LWPR.
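For intuition, the weighting of a single LWPR receptive field is commonly a Gaussian of the distance to its fixed center c under a learned metric D (the metric that the gradient descent shrinks or enlarges); a minimal sketch, with names ours:

import numpy as np

def receptive_field_activation(x, c, D):
    # Gaussian activation w = exp(-0.5 (x - c)^T D (x - c)); D is a
    # positive definite distance metric, c the fixed receptive field center.
    diff = np.asarray(x, dtype=float) - np.asarray(c, dtype=float)
    return float(np.exp(-0.5 * diff @ D @ diff))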
On the other hand, the disadvantage of XCSF is a higher population size during learning, which is necessary for the niched evolutionary algorithm to work successfully. Different condition shapes have to be evaluated with several samples before a stable fitness value can be used in the evolutionary selection process. Nevertheless, it has been shown that both systems achieve comparable prediction errors in particular scenarios [23]. Future research will compare XCSF and LWPR in detail, including theoretical considerations as well as empirical evaluations on various benchmark functions.
5 Summary and Conclusions
This article discussed XCSF's current capabilities as well as scenarios that pose a challenge for the system. From a theoretical point of view, we analyzed the preconditions for successful learning and, if these conditions are met, how the system scales to higher problem complexities, including function structure and dimensionality. In order to successfully learn the surface of a given function, XCSF has to overcome the same challenges that were identified for XCS: the covering challenge, schema challenge, reproductive opportunity challenge, learning time challenge, and solution sustenance challenge. Given a uniform function structure and uniform sampling, the scalability model predicts an exponential influence of the input space dimensionality on the population size. Moreover, a polynomial increase in the required population size is expected when the function complexity is linearly increased or when the target error is linearly decreased. From a practitioner's viewpoint, we highlighted XCSF's important parameters and gave a brief guide on how to set these parameters appropriately. Additional parameter tuning suggestions may help if initial settings fail to reach the desired target error in certain cases. Examples illustrated when XCSF completely fails due to a covering-deletion cycle or due to flat fitness landscapes. Thus, failures in actual applications can be understood and refined parameter choices can possibly resolve the problem. Finally, a brief comparison with a statistics-based machine learning technique, namely Locally Weighted Projection Regression (LWPR), discussed advantages and disadvantages of the evolutionary approach employed in XCSF. A current study, which also includes empirical experiments, supports the presented comparison with respect to several relevant performance measures [28].
Acknowledgments

The authors acknowledge funding from the Emmy Noether program of the German Research Foundation (grant BU1335/3-1) and would like to thank their colleagues at the Department of Psychology and the COBOSLAB team.
References

1. Holland, J.H.: Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. The MIT Press, Cambridge (1992)
2. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2), 149–175 (1995)
3. Butz, M.V., Goldberg, D.E., Lanzi, P.L.: Gradient descent methods in learning classifier systems: Improving XCS performance in multistep problems. Technical report, Illinois Genetic Algorithms Laboratory (2003)
4. Bernadó-Mansilla, E., Garrell-Guiu, J.M.: Accuracy-based learning classifier systems: Models, analysis, and applications to classification tasks. Evolutionary Computation 11, 209–238 (2003)
5. Butz, M.V.: Rule-Based Evolutionary Online Learning Systems: A Principled Approach to LCS Analysis and Design. Springer, Heidelberg (2006)
6. Butz, M.V., Herbort, O.: Context-dependent predictions and cognitive arm control with XCSF. In: GECCO 2008: Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation, pp. 1357–1364. ACM, New York (2008)
7. Stalph, P.O., Butz, M.V., Pedersen, G.K.M.: Controlling a four degree of freedom arm in 3D using the XCSF learning classifier system. In: Mertsching, B., Hund, M., Aziz, Z. (eds.) KI 2009. LNCS, vol. 5803, pp. 193–200. Springer, Heidelberg (2009)
8. Wilson, S.W.: Classifiers that approximate functions. Natural Computing 1, 211–234 (2002)
9. Stalph, P.O., Llorà, X., Goldberg, D.E., Butz, M.V.: Resource management and scalability of the XCSF learning classifier system. Theoretical Computer Science (in press), http://dx.doi.org/10.1016/j.tcs.2010.07.007
10. Butz, M.V., Kovacs, T., Lanzi, P.L., Wilson, S.W.: How XCS evolves accurate classifiers. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2001), pp. 927–934 (2001)
11. Wright, A.H.: Genetic algorithms for real parameter optimization. In: Foundations of Genetic Algorithms, pp. 205–218. Morgan Kaufmann, San Francisco (1991)
12. Goldberg, D.E.: Real-coded genetic algorithms, virtual alphabets, and blocking. Complex Systems 5, 139–167 (1991)
13. Radcliffe, N.J.: Equivalence class analysis of genetic algorithms. Complex Systems 5, 183–205 (1991)
14. Mühlenbein, H., Schlierkamp-Voosen, D.: Predictive models for the breeder genetic algorithm – I. Continuous parameter optimization. Evolutionary Computation 1, 25–49 (1993)
15. Beyer, H.G., Schwefel, H.P.: Evolution strategies – a comprehensive introduction. Natural Computing 1(1), 3–52 (2002)
16. Bosman, P.A.N., Thierens, D.: Numerical optimization with real-valued estimation-of-distribution algorithms. In: Scalable Optimization via Probabilistic Modeling. SCI, vol. 33, pp. 91–120. Springer, Heidelberg (2006)
17. Stalph, P.O., Butz, M.V.: How fitness estimates interact with reproduction rates: Towards variable offspring set sizes in XCSF. In: Bacardit, J. (ed.) IWLCS 2008/2009. LNCS (LNAI), vol. 6471, pp. 47–56. Springer, Heidelberg (2010)
18. Orriols-Puig, A., Bernadó-Mansilla, E.: Bounding XCS's parameters for unbalanced datasets. In: GECCO 2006: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pp. 1561–1568. ACM, New York (2006)
19. Kovacs, T., Kerber, M.: What makes a problem hard for XCS? In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp. 251–258. Springer, Heidelberg (2001)
20. Wilson, S.W.: Get real! XCS with continuous-valued inputs. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1813, pp. 209–219. Springer, Heidelberg (2000)
21. Wilson, S.W.: Generalization in the XCS classifier system. In: Genetic Programming 1998: Proceedings of the Third Annual Conference, pp. 665–674 (1998)
22. Stone, C., Bull, L.: For real! XCS with continuous-valued inputs. Evolutionary Computation 11(3), 299–336 (2003)
23. Butz, M.V., Lanzi, P.L., Wilson, S.W.: Function approximation with XCS: Hyperellipsoidal conditions, recursive least squares, and compaction. IEEE Transactions on Evolutionary Computation 12, 355–376 (2008)
24. Wilson, S.W.: Classifier conditions using gene expression programming. In: Bacardit, J., Bernadó-Mansilla, E., Butz, M.V., Kovacs, T., Llorà, X., Takadama, K. (eds.) IWLCS 2006 and IWLCS 2007. LNCS (LNAI), vol. 4998, pp. 206–217. Springer, Heidelberg (2008)
25. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: Extending XCSF beyond linear approximation. In: GECCO 2005: Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, pp. 1827–1834 (2005)
26. Vijayakumar, S., Schaal, S.: Locally weighted projection regression: An O(n) algorithm for incremental real time learning in high dimensional space. In: ICML 2000: Proceedings of the Seventeenth International Conference on Machine Learning, pp. 1079–1086 (2000)
27. Vijayakumar, S., D'Souza, A., Schaal, S.: Incremental online learning in high dimensions. Neural Computation 17(12), 2602–2634 (2005)
28. Stalph, P.O., Rubinsztajn, J., Sigaud, O., Butz, M.V.: A comparative study: Function approximation with LWPR and XCSF. In: GECCO 2010: Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (in press, 2010)
Recursive Least Squares and Quadratic Prediction in Continuous Multistep Problems

Daniele Loiacono and Pier Luca Lanzi

Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy
{loiacono,lanzi}@elet.polimi.it
Abstract. XCS with computed prediction, namely XCSF, has recently been extended in several ways. In particular, a novel prediction update algorithm based on recursive least squares and the extension to polynomial prediction led to significant improvements of XCSF. However, these extensions have so far been studied only on single step problems, and it is currently not clear whether these findings might be extended also to multistep problems. In this paper we investigate this issue by analyzing the performance of XCSF with recursive least squares and with quadratic prediction on continuous multistep problems. Our results show that both of these extensions improve the convergence speed of XCSF toward an optimal performance. As shown by the analysis reported in this paper, these improvements are due to the capability of recursive least squares and of polynomial prediction to provide a more accurate approximation of the problem value function after the first few learning problems.
1 Introduction
Learning Classifier Systems are a genetics-based machine learning technique for solving problems through the interaction with an unknown environment. The XCS classifier system [16] is probably the most successful learning classifier system to date. It couples effective temporal difference learning, implemented as a modification of the well-known Q-learning [14], to a niched genetic algorithm guided by an accuracy-based fitness to evolve accurate maximally general solutions. In [18] Wilson extended XCS with the idea of computed prediction to improve the estimation of the classifier prediction. In XCS with computed prediction, XCSF in brief, the classifier prediction is not stored in a parameter but computed as a linear combination of the current input and a weight vector associated with each classifier. Recently, in [11] the classifier weights update has been improved with a recursive least squares approach, and the idea of computed prediction has been further extended to polynomial prediction. Both the recursive least squares update and the polynomial prediction have been effectively applied to solve function approximation problems as well as to learn Boolean functions. However, it is currently not clear whether these findings might be extended also to continuous multistep problems, where Wilson's XCSF has
already been successfully applied [9]. In this paper we investigate this important issue. First, we extend the recursive least squares update algorithm to multistep problems with covariance resetting, a well-known approach to deal with a non-stationary target. Then, to test our approach, we compare the usual Widrow-Hoff update rule to the recursive least squares one (extended with covariance resetting) on a class of continuous multistep problems, the 2D Gridworld problems [1]. Our results show that XCSF with recursive least squares outperforms XCSF with the Widrow-Hoff rule in terms of convergence speed, although both finally reach an optimal performance. Thus, the results confirm the findings of previous works on XCSF with recursive least squares applied to single step problems. In addition, we performed a similar experimental analysis to investigate the effect of polynomial prediction on the same set of problems. Also in this case, the results suggest that quadratic prediction results in a faster convergence of XCSF toward the optimal performance. Finally, to explain why recursive least squares and polynomial prediction increase the convergence speed of XCSF, we show that they improve the accuracy of the payoff landscape learned in the first few learning problems.
2 XCS with Computed Prediction
XCSF differs from XCS in three respects: (i) classifier conditions are extended for numerical inputs, as done in XCSI [17]; (ii) classifiers are extended with a vector of weights w, that are used to compute prediction; finally, (iii) the original update of classifier prediction must be modified so that the weights are updated instead of the classifier prediction. These three modifications result in a version of XCS, XCSF [18,19], that maps numerical inputs into actions with an associated calculated prediction. In the original paper [18] classifiers have no action and it is assumed that XCSF outputs the estimated prediction instead of the action itself. In this paper, we consider the version of XCSF with actions and linear prediction (named XCS-LP [19]) in which more than one action is available. As said before, throughout the paper we do not keep the (rather historical) distinction between XCSF and XCS-LP, since the two systems are basically identical except for the use of actions in the latter case.

Classifiers. In XCSF, classifiers consist of a condition, an action, and four main parameters. The condition specifies which input states the classifier matches; as in XCSI [17], it is represented by a concatenation of interval predicates, int_i = (l_i, u_i), where l_i ("lower") and u_i ("upper") are integers, though they might also be real. The action specifies the action for which the payoff is predicted. The four parameters are: the weight vector w, used to compute the classifier prediction as a function of the current input; the prediction error ε, which estimates the error affecting the classifier prediction; the fitness F, which estimates the accuracy of the classifier prediction; and the numerosity num, a counter used to represent different copies of the same classifier. Note that the size of the weight vector w depends on the type of approximation. In the case of piecewise-linear approximation, considered in this paper, the weight vector w has one weight w_i
for each possible input, and an additional weight w_0 corresponding to a constant input x_0, which is set as a parameter of XCSF.

Performance Component. XCSF works as XCS. At each time step t, XCSF builds a match set [M] containing the classifiers in the population [P] whose condition matches the current sensory input s_t; if [M] contains less than θ_mna actions, covering takes place and creates a new classifier that matches the current inputs and has a random action. Each interval predicate int_i = (l_i, u_i) in the condition of a covering classifier is generated as l_i = s_t(i) − rand(r_0) and u_i = s_t(i) + rand(r_0), where s_t(i) is the input value of state s_t matched by the interval predicate int_i, and the function rand(r_0) generates a random integer in the interval [0, r_0] with r_0 a fixed integer. The weight vector w of covering classifiers is randomly initialized with values from [−1, 1]; all the other parameters are initialized as in XCS (see [3]). For each action a_i in [M], XCSF computes the system prediction, which estimates the payoff that XCSF expects when action a_i is performed. As in XCS, in XCSF the system prediction of action a is computed by the fitness-weighted average of all matching classifiers that specify action a. However, in contrast with XCS, in XCSF the classifier prediction is computed as a function of the current state s_t and the classifier weight vector w. Accordingly, in XCSF the system prediction is a function of both the current state s_t and the action a. Following a notation similar to [2], the system prediction for action a in state s_t, P(s_t, a), is defined as:

P(s_t, a) = ( Σ_{cl∈[M]|a} cl.p(s_t) × cl.F ) / ( Σ_{cl∈[M]|a} cl.F )    (1)

where cl is a classifier, [M]|a represents the subset of classifiers in [M] with action a, cl.F is the fitness of cl, and cl.p(s_t) is the prediction of cl computed in the state s_t. In particular, when piecewise-linear approximation is considered, cl.p(s_t) is computed as:

cl.p(s_t) = cl.w_0 × x_0 + Σ_{i>0} cl.w_i × s_t(i)    (2)

where cl.w_i is the weight w_i of cl and x_0 is a constant input. The values of P(s_t, a) form the prediction array. Next, XCSF selects an action to perform. The classifiers in [M] that advocate the selected action are put in the current action set [A]; the selected action is sent to the environment and a reward P is returned to the system.

Reinforcement Component. XCSF uses the incoming reward P to update the parameters of the classifiers in the action set [A]. The weight vector w of the classifiers in [A] is updated using a modified delta rule [15]. For each classifier cl ∈ [A], each weight cl.w_i is adjusted by a quantity Δw_i computed as:

Δw_i = (η / |s_t|²) (P − cl.p(s_t)) s_t(i)    (3)

where η is the correction rate and |s_t|² is the norm of the input vector s_t (see [18] for details). Equation 3 is usually referred to as the "normalized" Widrow-Hoff
update or "modified delta rule", because of the presence of the term |s_t|² [5]. The values Δw_i are used to update the weights of classifier cl as:

cl.w_i ← cl.w_i + Δw_i    (4)

Then the prediction error ε is updated as:

cl.ε ← cl.ε + β(|P − cl.p(s_t)| − cl.ε)    (5)
Finally, classifier fitness is updated as in XCS.

Discovery Component. The genetic algorithm and subsumption deletion in XCSF work as in XCSI [17]. On a regular basis depending on the parameter θga, the genetic algorithm is applied to classifiers in [A]. It selects two classifiers with probability proportional to their fitness, copies them, and with probability χ performs crossover on the copies; then, with probability μ it mutates each allele. Crossover and mutation work as in XCSI [17,18]. The resulting offspring are inserted into the population and two classifiers are deleted to keep the population size constant.
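The computations of Equations 1 to 5 are compact enough to sketch in Python; the Classifier class below is a hypothetical minimal stand-in for the full classifier structure, and, as a simplifying assumption, the constant input x0 is included in the norm of Equation 3 (a common variant that also avoids division by zero).

import numpy as np

class Classifier:
    def __init__(self, w, action, F=0.01, eps=0.0):
        self.w = np.asarray(w, dtype=float)  # weight vector, w[0] paired with x0
        self.action = action
        self.F = F                           # fitness
        self.eps = eps                       # prediction error

    def p(self, s, x0=1.0):
        # Classifier prediction, Equation 2.
        x = np.concatenate(([x0], s))
        return float(self.w @ x)

def system_prediction(match_set, s, a):
    # Fitness-weighted system prediction for action a, Equation 1.
    cls = [cl for cl in match_set if cl.action == a]
    return sum(cl.p(s) * cl.F for cl in cls) / sum(cl.F for cl in cls)

def widrow_hoff_update(cl, s, P, eta=0.2, beta=0.2, x0=1.0):
    # Normalized Widrow-Hoff weight and error updates, Equations 3-5.
    x = np.concatenate(([x0], s))
    error = P - cl.p(s)
    cl.w += (eta / (x @ x)) * error * x      # note: x0 included in the norm (assumption)
    cl.eps += beta * (abs(error) - cl.eps)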
3 Improving and Extending Computed Prediction
The idea of computed prediction, introduced by Wilson in [18], has recently been improved and extended in several ways [11,12,6,10]. In particular, Lanzi et al. extended computed prediction to polynomial functions [7] and introduced in [11] a novel prediction update algorithm based on recursive least squares. Although these extensions proved to be very effective in single step problems, both in function approximation problems [11,7] and in Boolean problems [8], they have never been applied to multistep problems so far. In the following, we briefly describe the classifier update algorithm based on recursive least squares and how it can be applied to multistep problems. Finally, we show how computed prediction can be extended to polynomial prediction.
3.1 XCSF with Recursive Least Squares
In XCSF with recursive least squares, the Widrow-Hoff rule used to update the classifier weights is replaced with a more effective update algorithm based on recursive least squares (RLS). At time step t, given the current state s_t and the target payoff P, recursive least squares updates the weight vector w as

w_t = w_{t−1} + k_t [P − x_t^T w_{t−1}],

where x_t = [x_0 s_t]^T and k_t, called the gain vector, is computed as

k_t = V_{t−1} x_t / (1 + x_t^T V_{t−1} x_t),    (6)

while the matrix V_t is computed recursively by

V_t = (I − k_t x_t^T) V_{t−1}.    (7)
The matrix V_t is usually initialized as V_0 = δ_rls I, where δ_rls is a positive constant and I is the n × n identity matrix. A higher δ_rls denotes an uncertain initial parametrization; accordingly, the algorithm will initially use a higher, thus faster, update rate (k_t). A lower δ_rls denotes a rather certain initial parametrization; accordingly, the algorithm will use a slower update. It is worthwhile to note that the recursive least squares approach presented above involves two basic underlying assumptions [5,4]: (i) the noise on the target payoff P used for updating the classifier weights can be modeled as a unitary variance white noise, and (ii) the optimal classifier weight vector does not change during the learning process, i.e., the problem is stationary. While the first assumption is often reasonable and usually has a small impact on the final outcome, the second assumption is not justified in many problems and may have a big impact on the performance. In the literature [5,4] many approaches have been introduced for relaxing this assumption. In particular, a straightforward approach is the resetting of the matrix V: every τ_rls updates, the matrix V is reset to its initial value δ_rls I. Intuitively, this prevents RLS from converging toward a fixed parameter estimate by continually restarting the learning process. We refer the interested reader to [5,4] for a more detailed analysis of recursive least squares and other related approaches, like the well-known Kalman filter. The extension of XCSF with recursive least squares is straightforward: we added to each classifier the matrix V as an additional parameter and we replaced the usual update of classifier weights with the recursive least squares update described above and reported as Algorithm 1.

Algorithm 1. Update classifier cl with the RLS algorithm

procedure UPDATE_PREDICTION(cl, s, P)
    error ← P − cl.p(s)                        // compute the current error
    x(0) ← x_0                                 // build x by adding x_0 to s
    for i ∈ {1, ..., |s|} do
        x(i) ← s(i)
    end for
    if number of updates since last reset > τ_rls then
        cl.V ← δ_rls I                         // reset cl.V
    end if
    η_rls ← (1 + x^T · cl.V · x)^(−1)
    cl.V ← cl.V − η_rls (cl.V · x x^T · cl.V)  // update cl.V
    k ← cl.V · x                               // gain vector
    for i ∈ {0, ..., |s|} do
        cl.w_i ← cl.w_i + k(i) · error         // update the classifier's weights
    end for
end procedure
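A compact Python rendering of Algorithm 1 may help; the class layout and names are ours, only the update rule follows the paper.

import numpy as np

class RLSPredictor:
    def __init__(self, n_inputs, x0=1.0, delta_rls=10.0, tau_rls=50):
        self.w = np.zeros(n_inputs + 1)            # weights, w[0] paired with x0
        self.V = delta_rls * np.eye(n_inputs + 1)  # matrix V, initialized to delta_rls * I
        self.x0, self.delta_rls, self.tau_rls = x0, delta_rls, tau_rls
        self.updates_since_reset = 0

    def predict(self, s):
        x = np.concatenate(([self.x0], s))
        return float(self.w @ x)

    def update(self, s, P):
        x = np.concatenate(([self.x0], s))
        error = P - self.w @ x
        if self.updates_since_reset > self.tau_rls:       # covariance resetting
            self.V = self.delta_rls * np.eye(len(x))
            self.updates_since_reset = 0
        eta = 1.0 / (1.0 + x @ self.V @ x)
        self.V -= eta * np.outer(self.V @ x, x @ self.V)  # update V (Equation 7, rearranged)
        k = self.V @ x                                    # gain vector (Equation 6)
        self.w += k * error                               # weight update
        self.updates_since_reset += 1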
Computational Complexity. It is worth comparing the complexity of the Widrow-Hoff rule and recursive least squares, both in terms of memory required for each classifier and in terms of time required by each classifier update. For each classifier, recursive least squares stores the matrix cl.V, which is n × n; thus its additional space complexity is O(n²), where n = |x| is the size of the input vector. With
respect to the time required for each update, the Widrow-Hoff update rule involves only n scalar multiplications and is thus O(n); instead, recursive least squares requires a matrix multiplication, which is O(n²). Therefore, recursive least squares is more complex than the Widrow-Hoff rule both in terms of memory and time requirements.
3.2 Beyond Linear Prediction
Usually in XCSF the classifier prediction is computed as a linear function, so that piecewise-linear approximations of the action-value function are evolved. However, XCSF can easily be extended to evolve also polynomial approximations. Let us consider a simple problem with a single-variable state space. At time step t, the classifier prediction is computed as cl.p(s_t) = w_0 x_0 + w_1 s_t, where x_0 is a constant input and s_t is the current state. Thus, we can introduce a quadratic term in the approximation evolved by XCSF:

cl.p(s_t) = w_0 x_0 + w_1 s_t + w_2 s_t².    (8)
To learn the new set of weights we use the usual XCSF update algorithm (either RLS or Widrow-Hoff) applied to the input vector x_t, defined as x_t = ⟨x_0, s_t, s_t²⟩. When more variables are involved, so that s_t = ⟨s_t(1), ..., s_t(n)⟩, we define x_t = ⟨x_0, s_t(1), s_t²(1), ..., s_t(n), s_t²(n)⟩ and apply XCSF to the newly defined input space. The same approach can be generalized to allow the approximation of any polynomial of order k by extending the input vector x_t with higher order terms. However, in this paper, for the sake of simplicity, we limit our analysis to quadratic prediction.
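The quadratic input expansion can be sketched directly; each state variable contributes a linear and a squared term, plus the constant input x0:

import numpy as np

def quadratic_input(s, x0=1.0):
    # Builds x_t = <x0, s(1), s(1)^2, ..., s(n), s(n)^2>.
    features = [x0]
    for v in s:
        features.extend([v, v * v])
    return np.array(features)

print(quadratic_input([0.3, 0.7]))  # [1.   0.3  0.09 0.7  0.49]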
4 Experimental Design
To study how recursive least squares and quadratic prediction affect the performance of XCSF on continuous multistep problems we considered a well-known class of problems: the 2D gridworld problems, introduced in [1]. They are two-dimensional environments in which the current state is defined by a pair of real-valued coordinates ⟨x, y⟩ in [0,1]², the only goal is in position ⟨1,1⟩, and there are four possible actions (left, right, up, and down), coded with two bits; each action corresponds to a step of size s in the corresponding direction; actions that would take the system outside the domain [0,1]² take the system to the nearest position of the grid border. The system can start anywhere but in the goal position, and it reaches the goal when both coordinates are equal to or greater than one. When the system reaches the goal it receives 0; in all the other cases it receives -0.5.
Fig. 1. The 2D Continuous Gridworld problems: (a) the optimal value function of Grid(0.05) when γ = 0.95; (b) the Puddles(0.05) environment; (c) the optimal value function of Puddles(0.05) when γ = 0.95
We called the problem described above the empty gridworld, dubbed Grid(s), where s is the agent step size. Figure 1a shows the optimal value function associated to the empty gridworld problem when s = 0.05 and γ = 0.95. A slightly more challenging problem can be obtained by adding some obstacles to the empty gridworld environment, as proposed in [1]: each obstacle represents an area in which there is an additional cost for moving. These areas are called "puddles" [1], since they actually create a sort of puddle in the optimal value function. Figure 1b depicts the Puddles(s) environment that is derived from Grid(s) by adding two puddles (the gray areas). When the system is in a puddle, it receives an additional negative reward of -2, i.e., the action has an additional
cost of -2; in the area where the two puddles overlap, the darker gray region, the two negative rewards add up, i.e., the action has a total additional cost of -4. We called this second problem puddle world, dubbed Puddles(s), where s is the agent step size. Figure 1c shows the optimal value function of the puddle world, when s = 0.05 and γ = 0.95. The performance is computed as the average number of steps to reach the goal during the last 100 test problems. To speed up the experiments, problems can last at most 500 steps; when this limit is reached the problem stops even if the system did not reach the goal. All the statistics reported in this paper are averaged over 20 experiments.
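The Grid(s) dynamics described above are simple enough to sketch; the class below is our illustrative rendering, not the exact code used in the experiments (the puddle cost of Puddles(s) would merely add -2 per puddle the agent stands in).

import random

class Grid:
    MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}

    def __init__(self, step=0.05):
        self.step = step
        self.x, self.y = random.random(), random.random()  # start anywhere but the goal

    def act(self, action):
        dx, dy = self.MOVES[action]
        # Moves that would leave [0,1]^2 stop at the nearest border position.
        self.x = min(max(self.x + dx * self.step, 0.0), 1.0)
        self.y = min(max(self.y + dy * self.step, 0.0), 1.0)
        done = self.x >= 1.0 and self.y >= 1.0              # goal at <1,1>
        return (0.0 if done else -0.5), done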
5 Experimental Results
Our aim is to study how the RLS update and the quadratic prediction affect the performance of XCSF on continuous multistep problems. To this purpose we applied XCSF with different types of prediction, i.e., linear and quadratic, and with different update rules, i.e., Widrow-Hoff and RLS, to the Grid(0.05) and Puddles(0.05) problems. In addition, we also compared the performance of XCSF to the one obtained with tabular Q-learning [13], a standard reference in the RL literature. In order to apply tabular Q-learning to the 2D Gridworld problems, we discretized the continuous problem space, using the step size s = 0.05 as the resolution for the discretization process. In the first set of experiments we investigated the effect of the RLS update on the performance of XCSF, while in the second set of experiments we extended our analysis also to quadratic prediction. Finally, we analyzed the results obtained and the accuracy of the action-value approximations learned by the different versions of XCSF.
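A tabular Q-learning baseline with the s-resolution discretization can be sketched as follows; the class is our illustrative rendering, not the exact code used in the experiments.

import numpy as np

class TabularQ:
    def __init__(self, s=0.05, n_actions=4, beta=0.2, gamma=0.95):
        self.n = int(round(1.0 / s)) + 1          # cells per dimension
        self.Q = np.zeros((self.n, self.n, n_actions))
        self.beta, self.gamma = beta, gamma

    def cell(self, x, y):
        # Map continuous coordinates to the nearest grid cell.
        i = min(int(x * (self.n - 1) + 0.5), self.n - 1)
        j = min(int(y * (self.n - 1) + 0.5), self.n - 1)
        return i, j

    def update(self, state, a, reward, next_state):
        i, j = self.cell(*state)
        ni, nj = self.cell(*next_state)
        target = reward + self.gamma * self.Q[ni, nj].max()
        self.Q[i, j, a] += self.beta * (target - self.Q[i, j, a])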
5.1 Results with Recursive Least Squares
In the first set of experiments we compared Q-learning and XCSF with the two different updates on the 2D continuous gridworld problems. For XCSF we used the following parameter settings: N = 5000, ε0 = 0.05, β = 0.2, α = 0.1, γ = 0.95, ν = 5, χ = 0.8, μ = 0.04, pexplr = 0.5, θdel = 50, θGA = 50, and δ = 0.1; GA subsumption is on with θsub = 50, while action-set subsumption is off; the parameters for integer conditions are m0 = 0.5 and r0 = 0.25 [17]; the parameter x0 for XCSF is 1 [18]. In addition, with the RLS update we used δrls = 10 and τrls = 50. Accordingly, for Q-learning we set β = 0.2, γ = 0.95, and pexplr = 0.5. Figure 2a compares the performance of Q-learning and of the two versions of XCSF on the Grid(0.05) problem. All the systems are able to reach an optimal performance, and XCSF with the RLS update is able to learn much faster than XCSF with the Widrow-Hoff update, although Q-learning is even faster. This is not surprising, as Q-learning is provided with the optimal state space discretization to solve the problem, while XCSF has to search for it. However, it is worthwhile to notice that when the RLS update rule is used, XCSF is able to learn almost as fast as Q-learning. Moving to the more difficult Puddles(0.05) problem, we find very similar results, as shown by Figure 2b.
(b) Fig. 2. The performance of Q-learning (reported as QL), XCSF with the Widrow-Hoff update (reported as WH), and of XCSF with the RLS update (reported as RLS) applied to: (a) Grid(0.05) problem (b) Puddles(0.05) problem. Curves are averages on 20 runs.
Also in this case, XCSF with the RLS update is able to learn faster than XCSF with the usual Widrow-Hoff update rule, and the difference with Q-learning is even less evident. Therefore, our results suggest that the RLS update rule is able to exploit the collected experience more effectively than the Widrow-Hoff rule, and confirm the previous findings on single step problems reported in [11].
5.2 Results with Quadratic Prediction
In the second set of experiments, we compared linear prediction to quadratic prediction on the Grid(0.05) and the Puddles(0.05) problems, using both the Widrow-Hoff and the RLS updates. Parameters are set as in the previous experiments. Table 1a reports the performance of the systems in the first 500 test problems as a measure of the convergence speed.
Table 1. XCSF applied to Grid(0.05) and to Puddles(0.05) problems. (a) Average number of steps to reach the goal per episode in the first 500 test problems; (b) average number of steps to reach the goal per episode in the last 500 test problems; (c) size of the population evolved. Statistics are averages over 20 experiments.
As found in the previous set of experiments, the RLS update leads to a faster convergence, also when quadratic prediction is used. In addition, the results suggest that quadratic prediction also affects the learning speed: both with the Widrow-Hoff update and with the RLS update, quadratic prediction outperforms linear prediction. In particular, XCSF with quadratic prediction and the RLS update is able to learn even faster than Q-learning in both the Grid(0.05) and Puddles(0.05) problems. However, as Table 1b shows, all the systems reach an optimal performance. Finally, it can be noticed that the number of macroclassifiers evolved (Table 1c) is very similar for all the systems, suggesting that XCSF with quadratic prediction does not evolve a more compact solution.
5.3 Analysis of Results
Our results suggest that in continuous multistep problems the RLS update and the quadratic prediction do not give any advantage either in terms of final performance or in terms of population size. On the other hand, both these extensions lead to an effective improvement of the learning speed, that is, they play an important role in the early stage of the learning process.
Fig. 3. Average absolute error of the value functions learned by XCSF on (a) the Grid(0.05) problem and (b) the Puddles(0.05) problem. Curves are averages over 20 runs.
However, this result is not surprising: (i) the RLS update exploits the collected experience more effectively and learns an accurate approximation faster; (ii) the quadratic prediction allows a broader generalization in the early stages that leads very quickly to a rough approximation of the payoff landscape. Figure 3 reports the error of the value function learned by the four XCSF versions during the learning process. The error of a learned value function is measured as the absolute error with respect to the optimal value function, computed as the average of the absolute errors over a uniform grid of 100 × 100 samples of the problem space. For each version of XCSF this error measure is computed at different stages of the learning process and then averaged over the 20 runs to generate the error curves reported in Figure 3. The results confirm our hypothesis: both quadratic prediction and the RLS update lead very quickly to accurate approximations of the optimal value function, although the final approximations are as accurate as the ones evolved by XCSF with the Widrow-Hoff rule and linear prediction.
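The error measure itself is straightforward to sketch: the mean absolute deviation from the optimal value function over a uniform 100 × 100 grid (learned_V and optimal_V are assumed to be callables returning the value at a given position).

import numpy as np

def value_function_error(learned_V, optimal_V, resolution=100):
    xs = np.linspace(0.0, 1.0, resolution)
    errors = [abs(learned_V(x, y) - optimal_V(x, y)) for x in xs for y in xs]
    return float(np.mean(errors))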
Fig. 4. Examples of the value function evolved by XCSF with linear prediction and Widrow-Hoff update on the Grid(0.05) problem: (a) after 50 learning episodes; (b) after 500 learning episodes; (c) at the end of the experiment (after 5000 learning episodes)
Fig. 5. Examples of the value function evolved by XCSF with linear prediction and RLS update on the Grid(0.05) problem: (a) after 50 learning episodes; (b) after 500 learning episodes; (c) at the end of the experiment (after 5000 learning episodes)
Fig. 6. Examples of the value function evolved by XCSF with quadratic prediction and Widrow-Hoff update on the Grid(0.05) problem: (a) after 50 learning episodes; (b) after 500 learning episodes; (c) at the end of the experiment (after 5000 learning episodes)
Fig. 7. Examples of the value function evolved by XCSF with quadratic prediction and RLS update on the Grid(0.05) problem: (a) after 50 learning episodes; (b) after 500 learning episodes; (c) at the end of the experiment (after 5000 learning episodes)
To better understand how the different versions of XCSF approximate the value function, Figures 4, 5, 6, and 7 show some examples of the value functions learned by XCSF at different stages of the learning process. In particular, Figure 4a and Figure 5a show the value function learned by XCSF with linear prediction after a few learning episodes, using respectively the Widrow-Hoff update and the RLS update. While the value function learned by XCSF with Widrow-Hoff is flat and very uninformative, the one learned by XCSF with the RLS update provides a rough approximation of the slope of the optimal value function, although it is still far from being accurate. Finally, Figure 6 and Figure 7 report similar examples of value functions learned by XCSF with quadratic predictions. Figure 7a shows how XCSF with both quadratic prediction and the RLS update may learn a rough approximation of the optimal value function after very few learning episodes. A similar analysis can be performed on the Puddles(0.05) problem, but it is not reported here due to lack of space.
6 Conclusions
In this paper we investigated the application of two successful extensions of XCSF, the recursive least squares update algorithm and the quadratic prediction, to multistep problems. First, we extended the recursive least squares approach, originally devised only for single step problems, to multistep problems with covariance resetting, a technique to deal with a non-stationary target. Second, we showed how the linear prediction used by XCSF can be extended to quadratic prediction in a very straightforward way. Then the recursive least squares update and the quadratic prediction were compared to the usual XCSF on the 2D Gridworld problems. Our results suggest that the recursive least squares update as well as the quadratic prediction lead to a faster convergence of XCSF toward the optimal performance. The analysis of the accuracy of the value function estimate showed that recursive least squares and quadratic prediction play an important role in the early stage of the learning process. The capability of recursive least squares to exploit the collected experience more effectively, and the broader generalization allowed by the quadratic prediction, lead to a more accurate estimate of the value function after a few learning episodes. In conclusion, we showed that the previous findings on recursive least squares and polynomial prediction applied to single step problems can be extended also to continuous multistep problems. Further investigations will include the analysis of the generalizations evolved by XCSF with recursive least squares and quadratic prediction.
Use of a Connection-Selection Scheme in Neural XCSF

Gerard David Howard¹, Larry Bull¹, and Pier-Luca Lanzi²

¹ Department of Computer Science, University of the West of England, Bristol, UK
{gerard2.howard,larry.bull}@uwe.ac.uk
² Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milan, Italy
[email protected]
Abstract. XCSF is a modern form of Learning Classifier System (LCS) that has proven successful in a number of problem domains. In this paper we exploit the modular nature of XCSF to include a number of extensions, namely a neural classifier representation, self-adaptive mutation rates and neural constructivism. It is shown that, via constructivism, appropriate internal rule complexity emerges during learning. It is also shown that self-adaptation allows this rule complexity to emerge at a rate controlled by the learner. We evaluate this system on both discrete and continuous-valued maze environments. The main contribution of this work is the implementation of a feature selection derivative (termed connection selection), which is applied to modify network connectivity patterns. We evaluate the effect of connection selection, in terms of both solution size and system performance, on both discrete and continuous-valued environments.

Keywords: feature selection, neural network, self-adaptation.
1 Introduction

Two main theories to explain the emergence of complexity in the brain are constructivism (e.g. [1]), where complexity develops by adding neural structure to a simple network, and selectionism [2], where an initial amount of over-complexity is gradually pruned over time through experience. We are interested in the feasibility of combining both approaches to realize flexible learning within Learning Classifier Systems (LCS) [3], exploiting their Genetic Algorithm (GA) [4] foundation in particular. In this paper we present a form of neural LCS [5] based on XCSF [6] which includes the use of self-adaptive search operators to exploit both constructivism and selectionism during reinforcement learning. The focus of this paper is the impact of a form of feature selection that we apply to the neural classifiers, allowing a more granular exploration of the network weight space. Unlike traditional feature selection, which acts only on input channels, we allow every connection in our networks to be enabled or disabled. We term this addition "connection selection", and evaluate in detail the effects of its inclusion in our LCS, in terms of solution size, internal knowledge representation and stability of evolved solutions, in two evaluation environments: the first a discrete maze and the second a continuous maze.
For clarity's sake, we shall refer to the system without connection selection as N-XCSF, and the version with connection selection as N-XCSFcs. Applications of this type of learning system are varied, including (but not limited to) agent navigation, data mining and function approximation; we are interested in the field of simulated agent navigation. The rest of this paper is organized as follows: Section 2 details background research, Section 3 introduces the evaluation environments used, and Section 4 details the implementation of neural XCSF. Section 5 describes connection selection, Section 6 provides results of the experiments conducted, and Section 7 provides a brief discussion and suggests further avenues of research.
2 Background

2.1 Neural Classifier Systems

The benefits of Artificial Neural Network (ANN) representations mimic those of their biological inspiration, including flexibility, robustness to noise and graceful performance degradation. The type of neural network used in this work is the Multi-Layer Perceptron (MLP) [7]. A number of neural LCS in the literature are relevant to this paper. The initial work exploring artificial neural networks within LCS used traditional feedforward MLPs to represent the rules [5]. Recurrent MLPs were then shown able to provide memory for a simple maze task [8]. Radial Basis Function networks [9] were later used for both simulated [10] and real [11] robotics tasks. Both forms of neural representation have been shown amenable to a constructivist approach wherein the number of nodes within the hidden layer is under evolutionary control, along with the network connection weights [5][11]; here a mutation operator either adds or removes nodes from the hidden layer. MLPs have also been used in LCS to calculate the predicted payoff [12][13][14], to compute only the action [15], and to predict the next sensory state [16].

2.2 Neural Constructivism

Heuristic approaches to neural constructivism include FAST [17]. Here, a learning agent is made to navigate a discrete maze environment using Q-learning [18]. The system begins with a single network, and more are added if the oscillation in Q-value between two states is greater than a given threshold (e.g. there exist two states specifying different payoffs/actions, with only one network to cover both states). Networks are added until the solution space is fully covered by a number of neural networks, which allows the system to select optimal actions for each location within the environment. With regard to the use of constructivism in LCS, the first implementation is described in [5], where Wilson's Zeroth-level Classifier System (ZCS) [19] is used as a basis, the resulting system (NCS) being evaluated on the Woods1 environment. The author implements a constructivist approach to topology evolution, using fully-connected MLPs to represent a classifier condition. Each classifier begins with one hidden layer node. A constructivism event may be triggered during a GA cycle, and adds or
removes a single fully-connected hidden-layer neuron within the classifier condition. The author then proceeds to define the use of NCS in continuous-valued environments using a bounded-range representation, which reduces the number of neurons required by each MLP. This constructivist LCS was then modified to include parameter self-adaptation in [11]. The probabilities of constructivism events occurring are self-adaptive in the same way as the mutation rate in [20], where an Evolutionary Strategy-inspired implementation is used to control the amount of genetic mutation that occurs within each GA niche in a classifier system. This allows classifiers that match in suboptimal niches to search more broadly within the solution space when µ is large, while decreasing the mutation rate once an optimal solution has been found maintains stability within the niche. In both cases it is reported that networks of different structure evolve to handle different areas of the problem space, thereby identifying the underlying structure of the task.

Constructivism leads us to the field of variable-length neural representations. Traditional genetic crossover operators are of questionable utility when applied to the variable-length genomes that constructivism generates, as all rely on randomly picking points within the genome at which to perform crossover. This can have the effect of breaking the genome in areas that rely on spatial proximity to provide high utility. A number of methods, notably Harvey's Species Adaptive Genetic Algorithm (SAGA) [21] and Hutt and Warwick's Synapsing Variable-Length Crossover (SVLC) [22], provide means of crossing variable-length genetic strings, with SVLC reporting performance superior to SAGA on a variable-length test problem. SVLC also eliminates the main weakness of SAGA: that the initial crossover point on the first genome is still chosen randomly, with only the second subject to a selection heuristic. It should be noted that neither N-XCSF nor N-XCSFcs uses any version of crossover during a GA cycle, the reasoning behind this omission being twofold. Firstly, directly addressing the problem would require increasing the complexity of the system (adding SVLC-like functionality, for example). Secondly, and more importantly, experimental evidence suggests that sufficient solution-space exploration can be obtained via a combination of GA mutation, self-adaptive mutation and neural constructivism to produce optimal solutions in both discrete and continuous environments. This view is reinforced elsewhere in the literature, e.g. [23].

Aside from GA-based crossover difficulties, there are also problems related to creating novel network structures of high utility. For example, the competing conventions problem (e.g. [24]) demonstrates how two networks of different structure but identical utility may compete with each other for fitness, despite being essentially the same network. NeuroEvolution of Augmenting Topologies (NEAT) [25] presents a method for addressing this problem under constructivism. Each gene under the NEAT scheme encodes a connection, specifying the input and output neurons, the connection weight, and a Boolean flag indicating whether the connection is currently enabled or disabled. Each gene also has a marker that corresponds to that gene's first appearance in the population, with markers passed down from parents to children during a GA event; this is based on the assumption that genes of common origin are more likely to encode similar functions.
The marker is retained to make it more likely that homologous genes will be selected during crossover. NEAT has been applied to evolve robot controllers [26].
2.3 Feature Selection

Feature selection is a method of streamlining the data input to a process, where the input data can be imagined as a vector of inputs with dimension > 1. This can be done manually (by a human with relevant domain knowledge), although this process can be error-prone, costly in terms of time and potentially money, and, of course, requires expert domain knowledge. A popular alternative in the machine learning community is automatic feature selection. The use of feature selection brings two major benefits: firstly, the amount of data being input to a process can be reduced (increasing computational efficiency), and secondly, noisy connections (or those otherwise inhibitory to the successful performance of the system) can be disabled. Useful features within the input vector are preserved, as the performance of the system can be expected to drop if they are disabled, with the converse being true for disabling noisy/low-fitness connections. This is especially useful in the case of mobile robot control, where sensors are invariably subject to a certain level of noise that can be automatically filtered out by the feature selection mechanism. Feature selection has a strong relationship with the MLP (and indeed any connectionist neural) paradigm, which uses a collection of clearly discretised input channels to produce an output: disabling connections within the input layer of an MLP can have a (sometimes drastic) effect on the output of the network [27]. Related work on the subject of feature selection in neural networks can be found in [28] and [29], which explore the use of feature selection in a variety of neural networks. Especially pertinent is the implementation of feature selection within the NEAT framework (FS-NEAT) [30], which is applied to a double pole balancing task with 256 inputs. FS-NEAT performs feature selection by giving each input feature a small chance (1/I, where I is the dimension of the input vector) to be connected to every output node. An unaltered NEAT mutation sequence then allows these connections to connect to nodes in the hidden layers of the networks, as well as providing the ability to add further input nodes to the networks, again with a small probability of input addition. The authors make the point that NEAT, following a constructivist methodology, tends to evolve small networks without superfluous connections. They observe both quicker convergence to optimality and networks with only around 32% of the available input nodes connected in the best-performing network: a reduction from 256 inputs to an average "useful" subset size of 83.6 enabled input nodes. Also highly relevant is the derivative FD-NEAT (Feature Deselection NEAT) [31], where all connections are enabled by default and pruning, rather than growing, of connections takes place (it should be noted that FS-NEAT and neural constructivism [1] are similar, as are FD-NEAT and Edelman's theory of neural Darwinism [2]). Consistent between all four papers mentioned above is that they perform input feature selection only (in other words, only input connections are viable candidates for enabling/disabling).

A comparative study of neuroevolution for both (supervised) classification and regression tasks can be found in [32], where the authors compare purely heuristic approaches with an ensemble of evolutionary neural networks (ENNs), whose MLPs
are designed through evolutionary computing. In the former case, randomly-weighted fully-connected networks with hidden layer size N (determined experimentally) are used to solve the tasks. In the latter, each network begins with a bounded-random number of hidden layer nodes. A feature-selection derivative similar to our approach is then implemented, whereby each network connection is probabilistically enabled. Structural mutation is then applied so that, with each GA application, a random number of either nodes or connections are added or deleted. Also similarly to our implementation, the authors disable crossover, citing [17], due to its negligible impact on final solution performance. They then expand this work to evolve topologies and weights simultaneously, as evolving one without the other was revealed to be disruptive to the learning process. In their implementation, the non-adaptive rates of weight mutation and topological mutation are controlled by individual variables, each with a 50% chance of altering the network.

Finally, it should be noted that this work builds on a previous publication [33], which introduces the design of the N-XCSF (and N-XCS [ibid.], which does not include function approximation). That research highlights the benefits of the N-XCSF, mainly in terms of generalization capability and population size reduction. It is shown that the use of MLPs allows the same classifier to match in multiple locations within the same environmental payoff level, indicating differing actions thanks to action computation. It is also shown that the inclusion of function approximation allows the same classifier to match accurately in many payoff levels; combined, these two features allow the system to perform optimally with a degree of generalization (i.e. fewer total networks required in [P]).
3 Environments

Discrete maze experiments are conducted on a real-valued version of the Maze4 environment [34] (Figure 1). In the diagram, "O" represents an obstacle that the agent cannot traverse, "G" is the goal state, which the agent must reach to receive reward, and "*" is a free space that the agent can occupy. The environmental discount rate γ=0.71. The environmental representation was altered to loosely approximate a real robot's sensor readings: the binary string normally used to represent a given input state s_t is replaced with a real-valued counterpart in the same way as [5]. That is, each exclusive object type the agent can encounter is represented by a random real number within a specified range ([0.0, 0.1] for free space, [0.4, 0.5] for an obstacle and [0.9, 1.0] for the goal state). In the discrete environment, the input state s_t consists of the cell contents of the 8 cells directly surrounding the agent's current position, and this boundedly-random numeric representation attempts to emulate the sensory noise that real robots encounter. Performance is gauged by a steps-to-goal count, the number of discrete movements required to reach the goal state from a random starting position in the maze; in Maze4 the optimal figure is 3.5. Upon reaching the goal state, the agent receives a reward of 1000. Action calculation is covered in Section 4.

The test environment for the continuous experiments is the 2-D continuous grid world, Grid(0.05) (Figure 2) [35]. This is a two-dimensional environment where the agent's current state, s_t, consists of the x and y components of the agent's current location within the environment; to emulate sensory noise, both the x and y location of the
agent are subject to random noise of +/- [0%-5%] of the agent's true position. Both x and y are bounded in the range [0,1]; any movement outside of this range takes the agent to the nearest grid boundary. The environmental discount rate γ=0.95. The agent moves a predetermined step size (in this case 0.05) within this environment. The only goal state is in the top-right corner of the grid, where (x+y > 1.90). The agent can start anywhere except the goal state, and must reach the goal state in the fewest possible movements, upon which it receives a reward of 1000. Again, action calculation is covered in Section 4.
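As a concrete illustration of these dynamics, a minimal sketch follows; the function names are ours, and the noise model (a uniform multiplicative perturbation of each coordinate) is an assumption consistent with the description above.

```python
import random

STEP = 0.05            # predetermined step size
GOAL_REWARD = 1000.0

def perceived_state(x, y):
    """The noisy (x, y) the agent senses: each coordinate is perturbed
    by up to +/-5% of the agent's true position."""
    return (x * (1.0 + random.uniform(-0.05, 0.05)),
            y * (1.0 + random.uniform(-0.05, 0.05)))

def step(x, y, dx, dy):
    """Apply one discrete move, with (dx, dy) one of the four compass
    directions, e.g. (0, 1) for north. Positions are clipped to [0, 1]
    and the goal region is the top-right corner where x + y > 1.90."""
    x = min(max(x + dx * STEP, 0.0), 1.0)
    y = min(max(y + dy * STEP, 0.0), 1.0)
    reward = GOAL_REWARD if x + y > 1.90 else 0.0
    return x, y, reward
```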
O O O O O O O O
O * * O * * G O
O O * * O * * O
O O * O * * O O
O * * * * * * O
O O * O * * * O
O * * * * O * O
O O O O O O O O

Fig. 1. The discrete Maze4 environment
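A minimal sketch of the boundedly-random sensory encoding just described follows; the names and the clockwise ordering of the eight neighbours are our own illustrative choices.

```python
import random

# Each object type is perceived as a random real within its range,
# emulating a real robot's noisy sensor readings.
SENSOR_RANGES = {'*': (0.0, 0.1),   # free space
                 'O': (0.4, 0.5),   # obstacle
                 'G': (0.9, 1.0)}   # goal state

NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]

def input_state(maze, row, col):
    """Build the 8-element real-valued input state s_t from the cells
    surrounding the agent; the all-obstacle border of Maze4 guarantees
    every neighbour index is in bounds."""
    return [random.uniform(*SENSOR_RANGES[maze[row + dr][col + dc]])
            for dr, dc in NEIGHBOURS]
```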
Fig. 2. The continuous grid(0.05) environment (both the x and y axes range from 0.0 to 1.0)
4 Neural XCSF (N-XCSF)

XCSF [6] is a form of classifier system in which a classifier's prediction (that is, the reward a classifier expects to gain from executing its action given the current input state) is computed. Like other classifier systems, XCSF evolves a population of classifiers, [P], to cover a problem space. Each classifier consists of a condition and an action, as well as a number of other parameters. In our case, a fully-connected Multi-Layer Perceptron (MLP) neural network [7] is used in place of the traditional ternary condition, and is used to calculate the action. Prediction computation is unchanged, computed linearly using a separate series of weights. Each classifier is represented by a vector that details the connection weights of an MLP. Each connection weight is initialized uniformly at random in the range [-1, 1]. In the discrete case, there are 8 input neurons, representing the contents of the cells in the 8 compass directions surrounding the agent's current location. For the continuous environment, each network comprises 2 input neurons (representing the noisy x and y location of the agent). Both network types also consist of a number of hidden layer neurons under evolutionary control (see Section 4.2), and 3 output neurons. Each node (hidden and output) in the neural network has a sigmoidal activation function to constrain the range of output values. The first two output neurons represent the strength of action passed to the left and right motors of the robot respectively, and the third output neuron is a "don't-match" neuron that excludes the classifier from the
match set if its activation is greater than 0.5. This is necessary as the action of the classifier must be re-calculated for each state the classifier encounters, so each classifier "sees" each input. The outputs at the other two neurons (real numbers) are mapped to a single discrete movement, which differs between the discrete and continuous environments. In the discrete case, the outputs at the two motor neurons are mapped to a movement in one of eight compass directions (N, NE, E, etc.). This takes place in a way similar to [5], where three ranges of discrete output are possible for each node: 0.0 < x < 0.4 (low), 0.4 < x < 0.6 (medium), and 0.6 < x < 1.0 (high). The unequal partitioning is used to counteract the insensitivity of the sigmoid function to values within the extreme reaches of its range. A discrete movement is mapped from these continuous outputs: (high, high) = north, (high, medium) = northeast, (high, low) = east, and so on. It should be noted that the final two motor pairings, (low, medium) and (low, low), both produce a move to the northwest. In the continuous environment, movement is constrained to one of four compass directions (north, east, south, west). This takes place similarly to the discrete environment, except here there are four possible directions and only two ranges of discrete output: 0.0 < x < 0.5 (low) and 0.5 < x < 1.0 (high). The combined actions of each motor translate to a discrete movement according to the two motor output strengths: (high, high) = north, (high, low) = east, (low, high) = south, and (low, low) = west.

At each time-step, XCSF builds a match set, [M], from [P], consisting of all classifiers whose conditions match the current input state s_t. In neural XCSF, every action must be present in each [M]. If this is not the case, covering is used to generate classifiers that advocate the missing action(s); covering repeatedly generates random networks until the network action matches the desired output for the given input state. Once [M] is formed, a prediction array is created. In XCSF, each classifier's prediction (cl.p) is calculated as a product of the environmental input (or state, s_t) and the prediction weight vector (w) associated with each classifier. This vector has one element for each input (8 in the discrete case, 2 in the continuous case), plus an additional element w_0 which corresponds to x_0, a constant input that is set as a parameter of XCSF. A classifier's prediction is calculated as shown in equation (1):

cl.p(s_t) = \sum_i cl.w_i \cdot x_i \qquad (1)
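The neural action computation described above (the forward pass, the don't-match neuron, and the banded motor outputs) can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the weight-matrix layout is an assumption, and the full band-to-direction lookup beyond the cases listed above is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def band(v, discrete=True):
    """Map a sigmoidal motor output onto a discrete band."""
    if discrete:   # three unequal bands, as described above
        return 'low' if v < 0.4 else ('medium' if v < 0.6 else 'high')
    return 'low' if v < 0.5 else 'high'   # continuous case: two bands

def compute_action(w_ih, w_ho, state, discrete=True):
    """Forward pass of a classifier's MLP. Returns None when the
    don't-match neuron (third output) exceeds 0.5, excluding the
    classifier from [M]; otherwise returns the two motor bands,
    which are then looked up as a compass move."""
    hidden = sigmoid(w_ih @ np.asarray(state))
    left, right, dont_match = sigmoid(w_ho @ hidden)
    if dont_match > 0.5:
        return None
    return band(left, discrete), band(right, discrete)
```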
The prediction array holds the fitness-weighted average of the calculated predictions for each possible action. An action selection policy is used to decide which action should be taken (in [6], a random action selection policy is used on explore trials and a deterministic one on exploit trials). All classifiers that advocate the selected action form the action set [A]. The action is taken and, if the goal state is reached, a reward is returned from the environment and used to update the parameters of the classifiers in [A]. A discounted reward is propagated to the previous action set [A-1] if it exists. Rather than updating a scalar prediction value, the prediction weight vector of each classifier in the action set is updated using a version of the delta rule (equation (2)); each prediction weight is then updated (equation (3)) and the prediction error is calculated (equation (4)). Here, the vector x is the state s_t augmented by the parameter x_0, P is the target payoff, and η is the correction rate:

\Delta w = \frac{\eta}{|x|^2} \, (P - cl.p(s_t)) \, x \qquad (2)
cl.w \leftarrow cl.w + \Delta w \qquad (3)

cl.\varepsilon \leftarrow cl.\varepsilon + \beta \, (|P - cl.p(s_t)| - cl.\varepsilon) \qquad (4)
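Equations (1)-(4) can be read as the following short procedure; this is an illustrative transcription rather than the authors' code, with η and β taken from the parameterizations listed in Section 6.

```python
import numpy as np

def predict(w, state, x0=1.0):
    """Equation (1): the computed prediction over the augmented input."""
    x = np.concatenate(([x0], state))
    return x, float(w @ x)

def update_prediction(w, eps, state, P, eta=0.2, beta=0.2, x0=1.0):
    """Equations (2)-(4): delta-rule update of the prediction weight
    vector and of the classifier's prediction error."""
    x, p = predict(w, state, x0)
    delta_w = (eta / (x @ x)) * (P - p) * x   # equation (2)
    w = w + delta_w                           # equation (3)
    eps = eps + beta * (abs(P - p) - eps)     # equation (4)
    return w, eps
```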
Further details of the update procedure used in XCSF can be found in [6]. The GA may then fire if the average time since the last GA application to the classifiers in [A] exceeds a threshold θGA. Our GA is modified to be a two-stage process: stage 1 (Section 4.1) controls the rates of mutation and constructivism/connection selection that occur within the system, while stage 2 (Section 4.2 and Section 5) controls the evolution of the neural architecture in terms of both neurons and connections. Deletion occurs as in [6]. This cycle of [P] → [M] → [A] → reward is called a trial. Each experiment consists of 50,000 trials (20,000 in the continuous case). Each trial is either in exploration mode (roulette-wheel action selection) or exploitation mode (deterministic action selection). We employ roulette-wheel action selection on exploration trials to discourage potentially time-wasting agent movements, especially as the agent's payoff landscape becomes more accurate.

4.1 Self-adaptation

Traditionally in XCSF, two offspring classifiers are generated by reproducing, crossing, and mutating the parents. The offspring are inserted into the population, and two classifiers are deleted. As in all other models of classifier systems, parents stay in the population competing with their offspring. A GA is periodically triggered in [A] to evolve fitter classifiers in an environmental niche. It is potentially beneficial if a learning system is able to exert some form of control over its own learning interactions with the environment. To this end, we include a number of self-adaptive mechanisms which grant the learner flexibility to tailor its internal knowledge representation in a problem-dependent manner, at a rate controlled by the learner. We apply self-adaptation as in [20], to dynamically control the amount of genetic search (the frequency of network weight mutation events) taking place within the niche. This provides stability to parts of the problem space that are already "solved", as the mutation rate for a niche is typically directly proportional to its distance from the goal state during learning; generalization learning, along with value function learning, occurs faster nearer the goal state. Self-adaptive mutation is applied as follows: the µ value (rate of mutation per allele) of each classifier is initialized uniformly at random in the range [0,1]. During a GA cycle, a parent's µ value is modified as µ ← µ · e^{N(0,1)}. The offspring then applies its own µ to itself (for each allele) before being inserted into the population.

4.2 Neural Constructivism

The implementation of neural constructivism in this system is based on the work of Bull [5]. Each rule has a varying number of hidden layer neurons (initially 1, and always > 0), with additional neurons being added or removed from the hidden layer by the constructivism element of the system. Constructivism takes place during a GA cycle, after mutation. Two new self-adaptive parameters, ψ and ω, are added. Here, ψ represents the
probability of performing a constructivism event and ω is the probability of adding a neuron, with removal occurring with probability 1-ω. As with self-adaptive mutation, both are initially generated uniformly at random in the range [0,1], and offspring classifiers have their parents' ψ and ω values modified during reproduction in the same way as µ. It is worth drawing a number of comparisons between NEAT [25] and our constructivism mechanism, mainly with respect to GA action, which in our case is confined to the area of search space covered by the current action set [A] when a GA event is triggered. This encourages the GA to select similar classifiers within a niche, much as NEAT selects structures from its own niches. In N-XCSF, as successful classifiers evolve within a niche, offspring tend to share the same heredity (a nexus of high-fitness parents that, due to roulette selection, will be more likely to be repeatedly selected for reproduction within that niche). This phenomenon shares certain traits with the NEAT "genetic marker" mechanism, whereby genes that possess a common ancestry are more likely to be mated together, although our approach does not require niches to be predefined. The effect of the mechanisms described in Sections 4.1 and 4.2 is to tailor the evolution of the classifier to the complexity of the environment, either by altering the amount of mutation that takes place in a given niche at a given time [20], or by adapting the hidden layer topology of the neural networks to reflect the complexity of the problem space considered by the network [5,11]. During a GA cycle, the operators are applied in the following order, as sketched below: (1) self-adaptation, (2) mutate MLP weights, (3) enable/disable connections, (4) add/remove nodes.
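A compact sketch of this two-stage cycle follows. The dictionary layout of a classifier, the per-allele weight-perturbation range, and the capping of rates at 1.0 are assumptions, and the bookkeeping for growing or shrinking the weight and flag vectors when a node is added or removed is elided.

```python
import math
import random

def ga_cycle(parent):
    """One application of the two-stage GA to a parent classifier.
    `parent` holds 'weights' and 'enabled' lists of equal length
    (one flag per connection), plus 'n_hidden' and the rates."""
    child = {k: (v[:] if isinstance(v, list) else v) for k, v in parent.items()}
    # (1) self-adaptation: each self-adaptive rate is scaled by e^N(0,1)
    for rate in ('mu', 'psi', 'omega', 'tau'):
        child[rate] = min(1.0, parent[rate] * math.exp(random.gauss(0.0, 1.0)))
    # (2) mutate MLP weights per allele with the child's own mu
    child['weights'] = [w + random.uniform(-1.0, 1.0)
                        if random.random() < child['mu'] else w
                        for w in child['weights']]
    # (3) connection selection (Section 5): flip each Boolean flag w.p. tau;
    #     a re-enabled connection receives a fresh weight in [-1, 1]
    for i in range(len(child['enabled'])):
        if random.random() < child['tau']:
            child['enabled'][i] = not child['enabled'][i]
            child['weights'][i] = (random.uniform(-1.0, 1.0)
                                   if child['enabled'][i] else 0.0)
    # (4) constructivism: with prob. psi, add a node (prob. omega) or
    #     remove one (prob. 1 - omega), keeping at least one hidden node
    if random.random() < child['psi']:
        if random.random() < child['omega']:
            child['n_hidden'] += 1   # new node's connections enabled w.p. 0.5
        elif child['n_hidden'] > 1:
            child['n_hidden'] -= 1
    return child
```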
5 Connection Selection (N-XCSFcs)

Feature selection is a way of streamlining the inputs to a given process. Automatic feature selection includes both wrapper approaches (where feature subsets can change during the running of the algorithm) and filter approaches (where subset selection is a pre-processing step). It should be noted that traditional feature selection is applied only to input channels (in a neural network, this corresponds to connections between the input and hidden layers). In this work we allow any connection within the network to be disabled, allowing potentially more parsimonious results whilst retaining the capability to filter out unnecessary or noisy channels. We term this network-wide feature selection "connection selection". Additional information can be found in [36], where many neuro-evolution methodologies are compared and contrasted, including connection selection. The authors report favourable performance using a connection selection scheme similar to our own, although in a single neural network as opposed to the collection of networks used in this paper. The purpose of the experimentation carried out in Sections 6 and 7 is to assess the impact of a network-wide feature selection scheme on our neural XCSF system, in terms of both performance and computational efficiency. Connection selection is implemented in our system as follows: each connection in a classifier's condition has a Boolean flag attached to it. During a GA cycle, and based on a new self-adaptive parameter τ (which is initialized and self-adapted in the same manner as the other parameters), the Boolean flag can be flipped. If the flag is false, the connection is
disabled (set to 0.0, and not a viable target for connection weight mutation). If the flag was false but is flipped to true, the connection weight is reinitialised uniformly at random in the range [-1, 1]. All flags are initially set to true for newly initialised classifiers and for classifiers created via covering. During a node addition event, the flags representing the new node's connections are set probabilistically, with P(connection enabled) = 0.5. Occupying something of a middle ground between FS-NEAT [30] and FD-NEAT [31], we grow whole neurons (via constructivism, high-granularity feature selection), but tend to prune connections from those neurons (via our network-wide feature selection implementation, low-granularity feature deselection). The exception to this is node addition, which produces neurons that are on average 50% connected.
6 Experimentation

Following this brief introduction is a comparison of neural XCSF with (N-XCSFcs) and without (N-XCSF) connection selection, in both the discrete (Maze4) and continuous (Grid(0.05)) environments. In Figure 3(a), as in all other steps-to-goal graphs (4(a), 5(a), 6(a)), the red dashed line represents optimal performance.

6.1 Discrete Environment

Discrete-environment experiments are parameterized as follows: N=7000, β=0.2, ε0=0.01, ν=5, θGA=25, θDEL=50. Additionally, the x0 parameter is set to 1.0 and the correction rate η to 0.2. Each experiment is repeated ten times, with the results being the mean of these 10 runs. Every 50 trials, the current state of the system is analyzed. All averages are means over the population as a whole.

6.1.1 System Performance

In the Maze4 environment, it can be seen that the system without connection selection initially descends steeply towards the optimal steps-to-goal value (Figure 3(a)), but then takes some time to transition from near-optimal to optimal performance. Connection selection (Figure 4(a)) shows a less uniform curvature of descent, and can be seen to reach optimality slightly more quickly. Connection selection has the advantage of being able to disable connections as a binary event, and hence can more drastically alter network performance (effectively allowing it to perform connection weight changes that are out of range of the standard mutation operator), whilst also providing an extra degree of freedom in which to alter network behaviour. The non-uniform curvature of the plot could hint at the potential disruptiveness of connection selection; however, since the granularity of the changes (and hence the magnitude of the potential disruption) is minimal (i.e. a connection is the smallest possible component of an MLP to alter), and small network alterations usually result in small performance alterations, this does not prevent the connection-selection-enabled system from performing optimally.

After each explore-exploit cycle, an additional knowledge test trial is held, with the agent always starting in the closest available location to the top-left corner of the maze, and the steps-to-goal count recorded.
Fig. 3. (a) Steps to goal, (b) average number of nodes per classifier in the population, (c) self-adaptive parameter values, in N-XCSF (with no connection selection) in Maze4
Under the standard maze scenario used above, and in the LCS literature generally, it is not possible to perform standard statistical tests for significant differences in performance, as performance is plotted as a 50-point moving average due to the random start location. Using an extra exploit trial from a fixed position eases statistical comparison and allows us to define stability. Stability is defined as follows: a solution can be said to be stable if, for each of 50 consecutive knowledge test trials (interspersed between standard explore and exploit trials) from the constant location in the maze, the solution always finds the optimal path to the goal. The first trial at which each run of a system reaches stability is recorded, and this set of 10 numbers is compared to the sets produced by the other variants of the system using a standard T-test. We also record various other indicators, namely the average self-adaptive mutation rate, µ, and the average number of connected hidden layer nodes.
Fig. 4. (a) Steps to goal, (b) average number of nodes per classifier in the population, (c) self-adaptive parameter values, (d) average enabled connections per classifier (%), in N-XCSFcs (with connection selection) in Maze4
Table 1 shows that the stability of the two systems is not greatly affected by the addition of connection selection. Most indicative of this fact is the P value of 0.64. It does, however, show that connection selection has the potential to be performance-enhancing (comparing average steps to stability).
Table 1. Detailing the average time to stability and T-test results when comparing N-XCSF and N-XCSFcs in Maze4

                          Average    P value
Connection selection      8838.40    0.64
No connection selection   10358.8
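The comparison behind Table 1 (and Tables 2-6) is a standard two-sample T-test over the ten per-run values; a minimal sketch, assuming SciPy, follows.

```python
from scipy import stats

def compare_runs(values_a, values_b, alpha=0.05):
    """Two-sample T-test over per-run indicators, e.g. the first trial
    at which each of the ten runs of a system variant reached stability."""
    t, p = stats.ttest_ind(values_a, values_b)
    return t, p, p < alpha   # True when the difference is significant
```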
6.1.2 Effect on Self-adaptive Mutation Rates

Comparison of Figures 3(c) and 4(c) shows that overall the plots follow the same pattern (i.e. ω is the highest value, then µ, and finally ψ); additionally, all three plots share similar curvature. The most obvious difference is that all three self-adaptive values are higher in the connection selection version. A possible explanation for these results is that, since disabled connections cannot have their weights mutated, more search is required through the enabled connections to effectively explore the connection weight space (i.e. higher mutation rates make up for the reduced mutation variety due to there being fewer connections to mutate). Interestingly, where 60% of connections are enabled, the difference in final µ values is approximately 60% (µ ≈ 0.11 for the connection-selection version vs. µ ≈ 0.06 for the standard system; Figures 3(c) and 4(c)/4(d)). Table 2 highlights a statistically significant difference in the final self-adaptive mutation rate, and Table 3 shows a difference in the average number of connected hidden layer nodes per classifier that is also statistically significant. This indicates that, although performance is not statistically different when connection selection is added, the internal problem representation (that is, the way the system solves the problem) is altered significantly.

Table 2. Detailing the average self-adaptive mutation rate and T-test results when comparing N-XCSF and N-XCSFcs in Maze4

                          Average    P value
Connection selection      0.11       2.99E-09
No connection selection   0.058
Table 3. Detailing the average number of hidden layer nodes and T-test results when comparing N-XCSF and N-XCSFcs in Maze4

                          Average    P value
Connection selection      2.81       9.16E-08
No connection selection   1.50
6.1.3 Effect on Computational Efficiency

We explore the effect of connection selection on computational efficiency in three ways: the size of the population, the size of the action sets produced, and the number of enabled connections within the population. When connection selection is enabled, the average final population size is 3041, whereas with no connection selection this value is 4652.5. This translates into a saving during match set generation, where the entire population must be processed to derive the classifiers' actions from the current input. Interestingly, the
reverse is true for the action set size estimate: 225.7 is the average when connection selection is enabled, 143.8 without. So even though match set generation is quicker with connection selection, all action set-based operations (overall action determination, parameter updates, reinforcement, GA activation) can be expected to be computationally less efficient with a connection selection scheme applied. In terms of actual enabled connections within the population, we can observe that the average number of connected nodes in the hidden layers of the classifiers (Figures 3(b) and 4(b)) does not favour connection selection (1.5 connected nodes vs. 2.7 connected nodes). However, connection selection has only 60% of connections enabled on average (Figure 4(d)). We can then calculate the number of connections enabled in the entire population as: average population size × average connected hidden layer nodes × average enabled connections per connected hidden layer node.

Connection selection: 2.7 × (0.6 × 11) = 17.82 connections per network; 3041 × 2.7 × (0.6 × 11) = 54,190.62 connections in the population.

No connection selection: 1.5 × (1.0 × 11) = 16.5 connections per network; 4652 × 1.5 × (1.0 × 11) = 76,758 connections in the population.

Hence, even though there are more connections per network with connection selection (as those networks have more hidden layer nodes on average), the smaller required population means that fewer connection computations are necessary overall. For a neural representation to function, it is postulated that information from all surrounding locations would be needed to make an accurate decision with regard to movement in the environment (i.e. keeping a Markov problem structure). Observations of the final networks agree with this, showing that connections are more frequently cut between the hidden and output neurons.

6.2 Continuous Environment

The continuous-environment experiments are parameterized as follows: N=20000, β=0.2, ε0=0.005, ν=5, θGA=50, θDEL=50, x0=1, η=0.2. As the continuous environment has fewer connections per node, it is expected that more of the connections will be required to preserve the necessary classifier utility. Note that, due to the differing environmental representations and difficulties found in the continuous environment with calculating certain actions in certain areas of the state space, a single bias node was added to all networks, providing a constant weighted positive input in the range [0,1] to each hidden layer node.

6.2.1 System Performance

Comparison of Figures 5(a) and 6(a) reveals that N-XCSF and N-XCSFcs have very similar performance in the continuous environment. Connection selection (Figure 6(a)) shows a less uniform curvature of descent, and can be seen to form optimal solutions with fewer connected hidden layer nodes (Figures 5(b) and 6(b)). These figures also reveal some of the disruption that connection selection can add to the solution, evidenced most obviously in the uneven curvature in Figure 6(b), which is echoed in Figure 6(c) (in the path of the µ variable towards the end of the trial, and the unsteady path of τ after the initial steep descent).
Table 4. Detailing the average time to stability and T-test results when comparing N-XCSF and N-XCSFcs in the continuous Grid(0.05) environment

                          Average    P value
Connection selection      11453.5    0.45
No connection selection   13453.7
Figure 6(d) shows over 80% network connectivity in the continuous case; the reason for this reasonably high value is given at the start of Section 6.2. Table 4 shows the results of T-tests carried out to assess the impact of connection selection in the continuous environment; it can be seen that connection selection is in this case beneficial, in terms of a lower average steps-to-stability. However, a P value of 0.45 shows that any performance differences are not statistically significant.
Fig. 5. (a) Steps to goal, (b) average number of nodes per classifier in the population, (c) self-adaptive parameter values, in N-XCSF (with no connection selection) in Grid(0.05)
6.2.2 Effect on Self-adaptive Mutation Rates

Again, it can be said that the self-adaptation mechanisms in both versions of the system perform comparably (Figures 5(c) and 6(c)). Connection selection (N-XCSFcs) introduces a slight instability that is not apparent in N-XCSF. This observation was also made when the systems were compared in the Maze4 environment (Figures 3(c) and 4(c)). However, unlike the discrete environment results, the actual self-adaptive parameter values are very similar between N-XCSF and N-XCSFcs, being on average only slightly higher in the connection selection version.
Fig. 6. (a) Steps to goal, (b) average number of nodes per classifier in the population, (c) self-adaptive parameter values, (d) average enabled connections per classifier (%), in N-XCSFcs (with connection selection) in Grid(0.05)
The final self-adaptive τ parameter shows the most pronounced of these differences, differing between the continuous and discrete environments by a factor of ten (0.3 and 0.03 respectively). Table 5 shows the impact on the self-adaptive mutation rate when connection selection is added in the continuous environment. The connection selection and non-connection selection versions share similar average values, although the P value reveals that the difference is still close to statistically significant. Table 6 indicates that, in contrast to the discrete case (Table 3), the average numbers of hidden layer nodes evolved by the connection selection and non-connection selection versions in the continuous case are not statistically significantly different. These results indicate that the impact of connection selection is smaller in the continuous environment than in the discrete environment. A possible explanation is that, since there are fewer connections per node in the continuous environment (5 as opposed to 11), the ability of a connection selection scheme to alter the functionality of a network is reduced.

Table 5. Detailing the average self-adaptive mutation rate and T-test results when comparing N-XCSF and N-XCSFcs in the continuous Grid(0.05) environment

                          Average    P value
Connection selection      0.45       0.02
No connection selection   0.47
Table 6. Detailing the average number of hidden layer nodes and T-test results when comparing N-XCSF and N-XCSFcs in the continuous Grid(0.05) environment

                          Average    P value
Connection selection      1.36       0.46
No connection selection   1.29
6.2.3 Effect on Computational Efficiency

Similarly to Section 6.1.3, the average number of connections in the entire population of the final solutions can be compared; the results are calculated as follows: average connections in population = average population size × average connected hidden layer nodes × average enabled connections per connected hidden layer node.

Connection selection: 2.1 × (0.82 × 5) = 8.61 connections per network; 8128 × 2.1 × (0.82 × 5) = 69,982 connections in the population.

No connection selection: 2.3 × (1.0 × 5) = 11.5 connections per network; 12187 × 2.3 × (1.0 × 5) = 140,150.5 connections in the population.

These results show that, even with over 20% more network connectivity (Figure 4(d) vs. Figure 6(d)), the reduced population needs of N-XCSFcs in the continuous environment provide a greater efficiency enhancement: not only does each network contain fewer connections, but the overall number of networks in the final solution is also significantly reduced.
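The same bookkeeping, for both environments, can be written as a one-line helper; the values below are taken directly from Sections 6.1.3 and 6.2.3.

```python
def connections_in_population(pop_size, hidden_nodes, conns_per_node, frac_enabled):
    """Total enabled connections across the final population."""
    return pop_size * hidden_nodes * frac_enabled * conns_per_node

# Maze4: 11 connections per fully-connected hidden node (8 inputs + 3 outputs)
print(connections_in_population(3041, 2.7, 11, 0.6))    # 54190.62 (N-XCSFcs)
print(connections_in_population(4652, 1.5, 11, 1.0))    # 76758.0  (N-XCSF)
# Grid(0.05): 5 connections per hidden node (2 inputs + 3 outputs)
print(connections_in_population(8128, 2.1, 5, 0.82))    # 69982.08 (N-XCSFcs)
print(connections_in_population(12187, 2.3, 5, 1.0))    # 140150.5 (N-XCSF)
```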
7 Discussion

This paper has detailed the implementation of an XCSF system for simulated agent navigation, along with a number of extensions, namely a neural classifier representation, self-adaptive mutation rates, and neural constructivism. The effects of a network-wide feature-selection derivative have been examined, with particular emphasis placed on computational efficiency and final solution parsimony. It has been shown that such a scheme can have a significant impact on both of these factors in solving both discrete and continuous agent navigation tasks. The research presented here could be extended in a number of ways, including the comparison of other network types or classifier representations on the same tasks. We also aim to investigate the effects of different methods of performing constructivism (see e.g. [35]).
References

[1] Quartz, S.R., Sejnowski, T.J.: The Neural Basis of Cognitive Development: A Constructionist Manifesto. Behavioural and Brain Sciences 20(4), 537–596 (1997)
[2] Edelman, G.: Neural Darwinism: The Theory of Neuronal Group Selection. Basic Books, New York (1987)
[3] Holland, J.H.: Adaptation. In: Rosen, R., Snell, F.M. (eds.) Progress in Theoretical Biology, vol. 4, pp. 263–293. Academic Press, New York (1976)
[4] Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor (1975)
[5] Bull, L.: On Using Constructivism in Neural Classifier Systems. In: Guervós, J.J.M., Adamidis, P.A., Beyer, H.-G., Fernández-Villacañas, J.-L., Schwefel, H.-P. (eds.) PPSN 2002. LNCS, vol. 2439, pp. 558–567. Springer, Heidelberg (2002)
[6] Wilson, S.W.: Function Approximation with a Classifier System. In: Spector, L.D., Wu, G.E.A., Langdon, W.B., Voight, H.M., Gen, M. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2001), pp. 974–981. Morgan Kaufmann, San Francisco (2001)
[7] Rumelhart, D.E., McClelland, J.L.: Parallel Distributed Processing. MIT Press, Cambridge (1986)
[8] Bull, L., Hurst, J.: A Neural Learning Classifier System with Self-Adaptive Constructivism. In: IEEE Congress on Evolutionary Computation. IEEE Press, Los Alamitos (2003)
[9] Buhmann, M.D.: Radial Basis Functions: Theory and Implementations. Cambridge University Press, Cambridge (2003)
[10] Bull, L., O'Hara, T.: Accuracy-based Neuro and Neuro-Fuzzy Classifier Systems. In: Langdon, W.B., Cantu-Paz, E., Mathias, K., Roy, R., Davis, D., Poli, R., Balakrishnan, K., Honavar, V., Rudolph, G., Wegener, J., Bull, L., Potter, M.A., Schultz, A.C., Miller, J.F., Burke, E., Jonoska, N. (eds.) GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 905–911. Morgan Kaufmann, San Francisco (2002)
[11] Hurst, J., Bull, L.: A Neural Learning Classifier System with Self-Adaptive Constructivism for Mobile Robot Control. Artificial Life 12(3), 353–380 (2006)
[12] Giani, A., Baiardi, F., Starita, A.: PANIC: A Parallel Evolutionary Rule Based System. In: Proceedings of the Fourth Annual Conference on Evolutionary Programming, EP 1995 (1995)
[13] O'Hara, T., Bull, L.: Prediction Calculation in Accuracy-based Neural Learning Classifier Systems. Tech. report UWELCSG04-004 (2004)
[14] Lanzi, P.L., Loiacono, D.: XCSF with Neural Prediction. In: IEEE Congress on Evolutionary Computation, CEC 2006, pp. 2270–2276 (2006)
[15] Dam, H.H., Abbass, H.A., Lokan, C., Yao, X.: Neural-Based Learning Classifier Systems. IEEE Trans. on Knowl. and Data Eng. 20(1), 26–39 (2008)
[16] O'Hara, T., Bull, L.: Building Anticipations in an Accuracy-based Learning Classifier System by use of an Artificial Neural Network. In: Proceedings of the IEEE Congress on Evolutionary Computation, pp. 2046–2052. IEEE Press, Los Alamitos (2005)
[17] Pérez-Uribe, A., Sanchez, E.: FPGA Implementation of an Adaptable-Size Neural Network. In: Vorbrüggen, J.C., von Seelen, W., Sendhoff, B. (eds.) ICANN 1996. LNCS, vol. 1112, pp. 383–388. Springer, Heidelberg (1996)
[18] Watkins, C.J.C.H.: Learning from Delayed Rewards. PhD thesis, Psychology Department, University of Cambridge, England (1989)
[19] Wilson, S.W.: ZCS: A Zeroth-level Classifier System. Evolutionary Computation 2(1), 1–18 (1994)
[20] Bull, L., Hurst, J., Tomlinson, A.: Self-Adaptive Mutation in Classifier System Controllers. In: Meyer, J.-A., Berthoz, A., Floreano, D., Roitblatt, H., Wilson, S.W. (eds.) From Animals to Animats 6: The Sixth International Conference on the Simulation of Adaptive Behaviour. MIT Press, Cambridge (2000)
[21] Harvey, I., Husbands, P., Cliff, D.: Seeing the Light: Artificial Evolution, Real Vision. In: Cliff, D., Husbands, P., Meyer, J.-A., Wilson, S.W. (eds.) From Animals to Animats 3: Proceedings of the Third International Conference on Simulation of Adaptive Behaviour, pp. 392–401. MIT Press, Cambridge (1994)
[22] Hutt, B., Warwick, K.: Synapsing Variable-Length Crossover: Meaningful Crossover for Variable-Length Genomes. IEEE Transactions on Evolutionary Computation 11(1), 118–131 (2007)
[23] Rocha, M., Cortez, P., Neves, J.: Evolutionary Neural Network Learning. In: Pires, F.M., Abreu, S.P. (eds.) EPIA 2003. LNCS (LNAI), vol. 2902, pp. 24–28. Springer, Heidelberg (2003)
[24] Schaffer, J.D., Whitley, D., Eshelman, L.J.: Combinations of genetic algorithms and neural networks: A survey of the state of the art. In: Whitley, D., Schaffer, J. (eds.) Proceedings of the International Workshop on Combinations of Genetic Algorithms and Neural Networks (COGANN 1992), pp. 1–37. IEEE Press, Piscataway (1992)
[25] Stanley, K.O., Miikkulainen, R.: Evolving Neural Networks Through Augmenting Topologies. Evolutionary Computation 10(2), 99–127 (2002)
[26] Stanley, K.O., Miikkulainen, R.: Competitive Coevolution through Evolutionary Complexification. Journal of Artificial Intelligence Research 21, 63–100 (2004)
[27] Basheer, A., Hajmeer, M.: Artificial neural networks: fundamentals, computing, design, and application. Journal of Microbiological Methods 43(1) (2000)
[28] Belue, L.M., Bauer Jr., K.W.: Determining input features for multilayer perceptrons. Neurocomputing 7, 111–121 (1995)
[29] Basak, J., Mitra, S.: Feature selection using radial basis function networks. Neural Computing and Applications 8, 297–302 (1999)
[30] Whiteson, S., Stone, P., Stanley, K.O., Miikkulainen, R., Kohl, N.: Automatic feature selection in neuroevolution. In: Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, Washington DC, USA, June 25-29 (2005)
[31] Tan, M., Hartley, M., Bister, M., Deklerck, R.: Automated feature selection in neuroevolution. Evolutionary Intelligence 1(4), 271–292 (2009)
[32] Rocha, M., Cortez, P., Neves, J.: Evolution of neural networks for classification and regression. Neurocomputing 70(16-18), 2809–2816 (2007)
[33] Howard, D., Bull, L., Lanzi, P.-L.: Self-Adaptive Constructivism in Neural XCS and XCSF. In: Keijzer, M., et al. (eds.) GECCO 2008: Proceedings of the Genetic and Evolutionary Computation Conference. ACM Press, New York (2008)
[34] Lanzi, P.L.: An Analysis of Generalization in the XCS Classifier System. Evolutionary Computation 7(2), 125–149 (1999)
[35] Boyan, J.A., Moore, A.W.: Generalization in reinforcement learning: Safely approximating the value function. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (eds.) Advances in Neural Information Processing Systems 7, pp. 369–376. The MIT Press, Cambridge (1995)
[36] Schlessinger, E., Bentley, P.J., Lotto, R.B.: Analysing the Evolvability of Neural Network Agents through Structural Mutations. In: Capcarrère, M.S., Freitas, A.A., Bentley, P.J., Johnson, C.G., Timmis, J. (eds.) ECAL 2005. LNCS (LNAI), vol. 3630, pp. 312–321. Springer, Heidelberg (2005)
Building Accurate Strategies in Non Markovian Environments without Memory

Énée Gilles¹ and Péroumalnaïk Mathias²

Université des Antilles Guyane, LAMIA Laboratory, Campus Fouillole, BP 592, 97157 Pointe à Pitre Cedex, Guadeloupe
[email protected], [email protected]
Abstract. This paper focuses on the study of the behavior of a genetic-algorithm-based classifier system, the Adapted Pittsburgh Classifier System (APCS), on maze-type environments containing aliasing squares. This type of environment is often used in the reinforcement learning literature to assess the performance of learning methods when facing problems containing non-Markovian situations. Through this study, we discuss the performance of the APCS on two mazes (Woods 101 and Maze E2), and also the efficiency of an improvement to the APCS learning method inspired by the XCS: the covering mechanism. We show that, without any memory mechanism, the APCS is able to build and keep accurate strategies that regularly produce sub-optimal solutions to these maze problems. This is shown through a comparison between the results obtained by the XCS on two specific maze problems and those obtained by the APCS.
1 Introduction
Classifier systems based on genetic algorithms are rule-based systems whose diagnostic ability is commonly exploited in parameter optimisation problems [10]. Nevertheless, this kind of classifier needs to perform a learning step before being used in a production and/or diagnostic context. Most often, this learning stage is performed on a sample of data representing the available and validated/expertised data of the considered environment. Tendencies contained in this data set are assimilated by the classifier system using reinforcement learning. For this purpose, the system is continuously exposed to signals created from the learning sample. At this point, the action performed by the classifier in reaction to the incoming signal is rewarded thanks to the fitness function. This function is defined depending on the learning problem considered: its aim is to maintain accurate classifiers within the population by preventing them from being deleted or lost through genetic pressure.
In the literature, we encounter various methods used to successfully perform this type of reinforcement learning; concerning learning classifier systems using genetic algorithms, Q-Learning reinforcement methods and anticipation-based methods are the most widely used [13,5,4,14]. However, when facing some multi-step problems, most of these methods have difficulty building accurate strategies. A solution which is often used is to add a certain amount of information in order to build more precise strategies [11,13]. The main drawback of this solution is determining how much information should be added to solve a given multi-step problem. In this study, we chose to focus on another possibility that allows us to create a cognitive pattern within the learning system by using a different structure of knowledge. Our assessment relies on a parallel exploration of the available cognitive space by different collections of classification rules. Our approach is mainly supported by the structure of the cognitive system we use: the Pittsburgh Classifier System. This paper is structured as follows. In Section 2 we introduce the type of multi-step environment chosen and the measures we used. Then, in Section 3, we describe the main algorithm of the Adapted Pittsburgh Classifier System, including the improvements brought by the covering mechanism. After this point, we describe the experiments we conducted and discuss the obtained results in Section 4. This discussion is extended in Section 5 with a comparison between the results previously obtained and results obtained with the eXtended Classifier System (XCS) [5] on the same reinforcement learning problems. We then conclude on the measured improvements brought by the covering mechanism and on further results.
2 Context and Related Work

2.1 Maze Problems
Maze problems, as simplified reinforcement learning problems, are often used in the classifier systems literature to assess the efficiency of a learning method (XCS, ZCS), an improvement of an existing classifier system (XCSM, ZCSM), or to validate a new algorithm (ACS, AgentP, ATNoSFERES) [18]. Moreover, some mazes also offer perceptually similar situations which require different actions to reach the goal. These situations are designated in previous studies as “aliasing” situations. In addition, the abilities needed by a learning classifier system to solve a maze problem may be related to the abilities needed to solve a given optimisation problem with a learning sample containing missing or aliased data. A maze can be defined as a given number of neighbouring cells. A cell is a formally defined bounded space: it is the elementary unit of a maze. When it is not empty, a cell can contain an obstacle, food, an animat, or eventually a predator of the animat. The maze problem is defined as follows: an animat is randomly placed in the maze and its aim is to move to a cell containing food. To perform this task, it possesses a limited perception of its environment. This perception is defined by collecting the state (i.e. obstacle, food or empty) of the eight cells
surrounding its position. The animat can move only to an empty cell among these neighbouring cells, moving step by step through the maze in order to fulfill its goal. A cognitive system studied on this kind of environment has to pilot an animat through the maze in order to reach the food. The problem given to this cognitive system is to adopt a policy of moves inside this environment. This strategy must allow the animat to complete its goal within an accurate and finite number of steps. Maze environments offer plenty of parameters that allow one to evaluate the complexity of a given maze and the efficiency of a given learning method. Aliasing positions are described in [2] as positions with identical perceptions for the animat. According to this study, there exist three types of aliasing positions. Type I aliasing positions are located at different distances from the food but require the same actions to get closer to it (Action_{x,y} = Action_{x′,y′} ∧ D(x, y) ≠ D(x′, y′)). Type II aliasing positions are located at different distances from the food and require different actions to get closer to this objective (Action_{x,y} ≠ Action_{x′,y′} ∧ D(x, y) ≠ D(x′, y′)). At last, type III aliasing squares, which also require different actions to reach the goal, are located at the same distance from it (Action_{x,y} ≠ Action_{x′,y′} ∧ D(x, y) = D(x′, y′)). The following chart (fig. 1), extracted from the same study, presents most of the mazes available in the literature. It was built considering both the type of the aliasing squares contained by each maze and the ratio φQm/φm between the average number of steps φQm done by a Q-Learning algorithm to reach the food and the average distance to the food φm measured for this maze. We have chosen to focus our experimental part on the study of two mazes: the Woods101 and the Maze E2. The Woods101 maze is easy to represent (fig. 2) and offers a high complexity (φQm = 402.3 and φm = 2.7, φQm/φm = 149 [18]).
Fig. 1. Complexity chart of maze-type environments
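To make the three aliasing types concrete, here is a small sketch (ours, not from the paper) that classifies a pair of perceptually identical positions according to the definitions above:

import random  # unused here; kept minimal

def aliasing_type(action_a, action_b, dist_a, dist_b):
    """Classify an aliasing pair following the typology of [2]."""
    same_action = action_a == action_b
    same_distance = dist_a == dist_b
    if same_action and not same_distance:
        return "Type I"    # same move, different distances to the food
    if not same_action and not same_distance:
        return "Type II"   # different moves, different distances
    if not same_action and same_distance:
        return "Type III"  # different moves, same distance
    return None            # identical action and distance: not problematic

# e.g. the two aliasing squares of Woods101 (fig. 3a): different moves
# (south-east vs. south-west) at the same distance from the food
assert aliasing_type("SE", "SW", 2, 2) == "Type III"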
Fig. 2. Maze environments used in this study: (a) Maze Woods101, (b) Maze E2
On the other hand, the E2 maze, which is also easy to represent, has a higher complexity than the Woods101 (φQm = 710.23 and φm = 2.33, φQm/φm = 304.81 [18]). Moreover, this maze also presents both type II and type III aliasing squares.

2.2 Random Walk and Optimal Number of Steps
In order to determine the quality of the results obtained on a given maze, there are two situations that we need to test and establish clearly: the performance observed when using a random method and that which would be obtained considering the optimal choices. We refer to as “optimal” the best choices that could be made by a cognitive system which is not able to distinguish two perceptually similar situations. We have chosen to perform the same measures on the random walk and on the optimal choice case as on the performances of the classifier systems. Concerning the random walk, the animat is randomly placed in the maze and at each time step it chooses a random direction among the eight directions available. We measure the number of steps done by the animat and its final distance to the food (these choices are clearly established in Section 4.2; please refer to it for more details). A sketch of this baseline is given at the end of this subsection. To calculate the optimal number of steps that should be done by the animat to fulfill its goal, we had to consider the fact that the mazes we have chosen contain aliasing squares. As the learning classifier systems we study do not use memory mechanisms, we must take into account that neither the XCS nor the APCS is able to distinguish between two squares of a given aliasing situation. As a consequence, we need to reevaluate the optimal number of steps for each maze considering the optimal policy that these classifier systems would be able to adopt to solve the problem. Concerning the Woods101 environment, each aliasing square (dashed on fig. 3a) requires the system to maintain two different moves in order to reach the food. If the animat is placed on the left-side aliasing position, it must go south-east. On the contrary, if the animat is placed on the right-side aliasing position, it must go south-west to reach the food. As a consequence, the optimal number of steps when starting from those positions should be set to 3 instead of 2 (fig. 3b). This in turn increases the number of steps needed to reach the food from squares situated behind the aliasing positions. Those modifications raise the
Fig. 3. Considered optimal policies: (a) Woods101 optimal policy, (b) Woods101 optimal number of steps, (c) Maze E2 optimal policy, (d) Maze E2 optimal number of steps
average number of steps for this maze to 3.5 instead of 2.7 [2] under this new optimal policy. The impact of the aliasing positions is harder to evaluate for the Maze E2. Sixteen aliasing squares are at the same distance from the food (2 squares) and offer the same perception (8 empty squares). When we consider each one of these squares, we notice that at least two opposite available diagonal directions can be kept in order to reach the food (see fig. 3c). As a consequence, in order to establish a metric for this maze, we adopt the following rule to evaluate the optimal number of steps for those squares: we exclude action chains that allow moves from an aliasing position to another of the same type. These modifications raise the average number of steps when using an optimal policy on this maze to 3.16 instead of 2.33.
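To make the random-walk baseline concrete, here is a minimal sketch (ours, under simplifying assumptions: a toroidal grid, as is common for Woods-style mazes, and blocked moves counted as steps; final-distance tracking is omitted for brevity):

import random

# the eight directions, clockwise from North
DIRECTIONS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def random_walk(maze, start, max_steps=50):
    """maze: list of strings with '.' empty, 'O' obstacle, 'F' food.
    Returns the number of steps done before reaching the food (or max_steps)."""
    rows, cols = len(maze), len(maze[0])
    r, c = start
    for step in range(1, max_steps + 1):
        dr, dc = random.choice(DIRECTIONS)
        nr, nc = (r + dr) % rows, (c + dc) % cols
        if maze[nr][nc] == 'F':
            return step
        if maze[nr][nc] == '.':
            r, c = nr, nc  # on an obstacle the animat stays in place
    return max_steps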
3 Principles of Studied CS

3.1 The Adapted Pittsburgh Classifier System
The Adapted Pittsburgh Classifier System (APCS) is derived from the original work of Smith on LS1 [16]. As for Michigan classifier systems (ZCS, XCS) [15], Pittsburgh classifier systems rely on two basic elements: the classifier, which is the container of the knowledge acquired by the system, and the genetic algorithm, which allows this knowledge to evolve. Nevertheless, instead of considering a global collection of classifiers, Pittsburgh classifier systems co-evolve multiple small collections of classifiers in parallel.
In the next sub-sections, we specify how the structure, evaluation and evolution mechanisms of the APCS differ from the original work done by Smith on LS1 and from other existing Pittsburgh approaches.

Structure. As other classifier systems, the APCS is built from production rules, also called classifiers. These classifiers are formed of two parts: a condition part (also called sensor) which is sensitive to signals from the environment, and an action part (also called effector) which is the answer predicted by the classifier to signals that activate the condition part. When dealing with multi-step problems, this answer may induce a modification of the perception of the system. In this case of study, the condition part is defined over a ternary alphabet {0, 1, #}, where # (wildcard) stands for both 0 and 1. The action part contains only bits {0, 1}. As said before, Pittsburgh classifier systems evolve collections of classifiers. These collections are designated as individuals. Contrary to other Pittsburgh classifier systems (GABIL [7], GALE [3], GAssist [1]), the individuals of the APCS contain a fixed number of classifiers (see fig. 4). Table 1 summarizes the structural differences between two recent Pittsburgh classifier systems and the APCS. We can notice that, regarding the structure of the population, the main difference resides in the composition of the individuals. As a consequence, the classical genetic operators (crossover and mutation) are also impacted. In the APCS, a population of individuals is initially created using four parameters:

– A fixed number Ni of individuals in the population.
– A fixed number Nc of classifiers per individual.
– A fixed size Lc for all classifiers.
– An allelic probability P# of having a wildcard in the condition part.
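A minimal sketch (ours; the split of the classifier length into condition and action parts is illustrative) of a population built from these four parameters:

import random

def random_condition(lc, p_wildcard):
    return ''.join('#' if random.random() < p_wildcard else random.choice('01')
                   for _ in range(lc))

def random_action(bits):
    return ''.join(random.choice('01') for _ in range(bits))

def init_population(ni, nc, lc, action_bits, p_wildcard):
    """Each classifier is a (condition, action) pair; each individual is a
    fixed-size list of classifiers."""
    return [[(random_condition(lc, p_wildcard), random_action(action_bits))
             for _ in range(nc)]
            for _ in range(ni)]

# e.g. the maze setting of Section 4.1: 16-bit conditions, 3-bit actions;
# the value of P# here is illustrative
population = init_population(ni=30, nc=30, lc=16, action_bits=3, p_wildcard=0.5)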
This population is first filled with random classifiers. It is also possible to fill the initial population with specific classifiers uploaded from a file. We now focus on how this population is evaluated in the case of the APCS.

Evaluation mechanism. As they are collections of classifiers, the individuals of an APCS interact with their environment through their sensors and effectors.
Fig. 4. APCS’ Population
Table 1. Comparison between GALE/GAssist structure and APCS structure

Element                           GALE/GAssist               APCS
Number of Individuals Ni          Fixed                      Fixed
Classifiers (Nc) per Individual   Variable                   Fixed
Crossover Operator                Variable length operator   Standard monopoint
Mutation Operator                 Allelic                    Allelic
As these collections evolve in parallel, each individual is assigned either a local partition of the global environment or a copy of this environment, which provides it the elements needed to perform its learning. An individual is rewarded thanks to a fitness function which measures the adequacy of the answers given by this individual in its cognitive context. As a consequence, a scalar (the strength) is assigned to each individual and globally reflects the mean strength of the classifiers filling it. As first noted by Smith, by Énée [8] and by other studies related to Pittsburgh classifier systems [1], at the end of the simulation each individual tends to be a potential solution to the problem. In order to ensure that the strength of an individual reflects its composition, all individuals are submitted to a fixed number K of trials. A trial consists of four steps (see fig. 5). First, the environment sends a signal to the individual which is currently evaluated. Then, thanks to that signal, the individual forms a match set [M] which contains all of its classifiers that have been activated by this signal. Next, a classifier is selected in [M] to perform its action. Four methods can be used to select a classifier in [M]: (1) randomly, (2) the one with the lowest wildcard rate in the condition part (specific), (3) the one with the highest wildcard rate in the condition part (generic), or (4) the first one in [M]. The selected classifier then performs its action upon the environment. At last, the fitness function gives a 1/K-weighted reward to the individual considering the action expressed.
Fig. 5. Evaluation mechanism
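A minimal sketch (ours) of one trial of fig. 5: build the match set [M], select one classifier with one of the four strategies, act upon the environment, and collect the 1/K-weighted reward. The callbacks `act` and `reward_fn` are placeholders for the environment coupling:

import random

def matches(condition, message):
    return all(c == '#' or c == m for c, m in zip(condition, message))

def select_from_match_set(match_set, strategy='random'):
    if strategy == 'random':
        return random.choice(match_set)
    if strategy == 'specific':  # lowest wildcard rate in the condition
        return min(match_set, key=lambda cl: cl[0].count('#'))
    if strategy == 'generic':   # highest wildcard rate in the condition
        return max(match_set, key=lambda cl: cl[0].count('#'))
    return match_set[0]         # 'first'

def run_trial(individual, message, act, reward_fn, k_trials, strategy='random'):
    match_set = [cl for cl in individual if matches(cl[0], message)]
    # an empty [M] would trigger the covering mechanism (see below)
    assert match_set, "covering would be invoked here"
    classifier = select_from_match_set(match_set, strategy)
    act(classifier[1])                      # perform the action
    return reward_fn(classifier[1]) / k_trials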
Individuals are evaluated separately. Due to that fact, regarding multi-step problems, the environment related to each individual may change along the learning stage. The major interest we found in separated environments is that the genetic algorithm mixes together the positive experiences of each individual and allows the population of individuals to evolve and to improve their adequacy to the global environment. In the following sub-section, we describe this evolution mechanism more precisely.

Evolution mechanism: The Genetic Algorithm. The evolution algorithm (here a genetic algorithm) is applied once all individuals have been evaluated K times. The reward attributed to each individual can either be continuous, using a Q-Learning method for fitness/strength update, or be reset at each generation and based only upon the trials that occurred during the last generation. As for other learning classifier systems, the GA is essential to the APCS: it allows individuals to exchange cognitive material. The GA applies three main operators to the individuals of the population using their fitness. It first selects parents, then reproduces them using the crossover operator and applies the mutation operator to create new offspring. The selection mechanism mainly uses the roulette wheel or tournament methods to select the parents that will be kept to generate the next generation. The crossover operator randomly chooses a {n, i} pair, where i ∈ {2, ..., Lc − 1} is the position where crossover will occur within classifier n (n ∈ {1, ..., Nc}) in each individual selected for reproduction. The crossover is one-point and manipulates individuals of the same size (a sketch of these operators is given after the parameter lists below). The reason why Énée has chosen fixed-length individuals [8] essentially comes from Smith's work [16] and from Bacardit's work on the “bloat effect” [1]. Both studies have stated that crossover between individuals of different lengths mainly accentuates the difference of length between the two offspring. The selection mechanism would then progressively erase short individuals that would not be able to answer problems because they do not have enough classifiers. Thus, bigger individuals would be selected and the individual size would tend to grow through generations. Smith also observed that while the individuals tend to grow, they first reach an optimal size for answering the problem well. Then additional classifiers appear and produce noise in the answer: this effect is also known as the “bloat effect” [1]. In our case of study, fixed-size individuals and monopoint crossover appeared to be the simplest way to avoid this side effect without discarding the advantages of the intrinsic parallelism of the individuals of an APCS. The mutation operator is allelic: each binary position (allele) of a classifier can be mutated depending upon a mutation probability PMut. Crossover and mutation are expressed as probabilities. The best genetic algorithm parameters usually taken by Smith are (except for the selection mechanism):
– Selection mechanism: roulette wheel.
– Crossover probability: 50–100%.
– Allelic mutation probability: 0.01–0.5%.

As with other Pittsburgh LCS, the individuals contained in a population of APCS tend to be very similar after a given number of generations, which is problem dependent [9]. At this point, each individual is a complete solution of the problem.

Further discussion upon APCS and covering mechanism. Figure 3 shows the optimal policy that could be used to solve mazes E2 and Woods101 without memory. This optimal policy can be considered only when the learning system is able to keep different actions associated with the same biased perceptual situation. The structure of individuals used in the APCS allows it to maintain classifiers that match the same perceptual situation but act differently. This particularity is mainly due to the fact that each individual is globally rewarded. Due to that fact, the strength of an individual directly reflects the adequacy of its collection of classifiers to the environment. As previously described in this section, there are four mechanisms that can be used during the evaluation stage to select a classifier within the matching set [M]: (1) random, (2) specific, (3) generic, (4) first. If the random classifier selection mechanism is chosen, the system is guaranteed to fire, at least sometimes, the adequate classifier if it is in its pool [M]. These two previous points indicate that the APCS should be efficient when facing problems with non-Markovian chains to the solution. As a consequence, our system should be able to build accurate strategies without memory. This assessment is strongly linked to the ability of the fitness function to reward the most fitted individuals. Algorithm 1 proposes a complete view of the evaluation mechanism. The ending criterion evoked in this algorithm can either be a number of generations or be met when the K trials are successful for an individual. So, to evaluate an APCS, the following parameters are needed:

– Simulator/Environment reset mode: to zero / to previous status.
– A number of trials.
– A selection rule mechanism: random / specific / generic / first.
– A number of generations.
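Returning to the GA step, here is a minimal sketch (ours, with a simplified string representation) of the fixed-length monopoint crossover and the allelic mutation described above. Each individual is a list of Nc classifier strings of length Lc; the cut point is the {n, i} pair of the text, applied to both parents:

import random

def monopoint_crossover(a, b, lc):
    n = random.randrange(len(a))   # classifier index n in {1, ..., Nc}
    i = random.randrange(1, lc)    # an interior cut position inside classifier n
    child_a = a[:n] + [a[n][:i] + b[n][i:]] + b[n + 1:]
    child_b = b[:n] + [b[n][:i] + a[n][i:]] + a[n + 1:]
    return child_a, child_b

def allelic_mutation(individual, p_mut=0.005, alphabet='01#'):
    # each allele is resampled with probability p_mut; in the APCS the action
    # bits would be restricted to {0, 1} -- omitted here for brevity
    def mutate(rule):
        return ''.join(random.choice(alphabet) if random.random() < p_mut else c
                       for c in rule)
    return [mutate(rule) for rule in individual]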
In the introduction of this study, we announced that the covering mechanism is newly implemented in the APCS. Directly inspired by Wilson's work with XCS [17], this mechanism consists here in replacing the sensor part of a classifier when no classifier matches a given signal from the environment. To enhance this mechanism, we add a parameter to each classifier, called covering time Ct, in order to measure the number of generations a classifier has not been activated before it should be covered. This permits every classifier to have a chance of being useful to its cognitive pool, i.e. the individual, before being covered. The covering mechanism replaces the condition part of a classifier with the message from the environment, adding wildcards depending on the wildcard probability
P# (see Section 3.1). Nevertheless, the action part of the covered classifier is kept as it was before covering. Now that we have presented the main trends of our system, the APCS, we will focus on the other learning classifier system used in this study: the eXtended Classifier System (XCS).
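Before turning to XCS, a minimal sketch (ours) of the covering step just described: pick a classifier inactive for at least Ct generations, rebuild its condition from the environmental message (inserting wildcards with probability P#), and keep its action part untouched:

import random

def cover(individual, last_active, message, generation, ct, p_wildcard):
    """individual: list of (condition, action) pairs; last_active[j]: the
    generation at which classifier j last matched a signal."""
    candidates = [j for j in range(len(individual))
                  if generation - last_active[j] >= ct]
    j = random.choice(candidates) if candidates else random.randrange(len(individual))
    new_condition = ''.join('#' if random.random() < p_wildcard else bit
                            for bit in message)
    individual[j] = (new_condition, individual[j][1])  # action part is kept
    return j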
3.2 The eXtended Classifier System
The eXtended Classifier System is a classifier system issued from the Michigan approach, first built by Wilson in 1995 [17]. It started to become mature near 2000 thanks to Butz, Lanzi and Kovacs, who realised rigorous performance and accuracy studies of this system, allowing it to evolve through new mechanisms: improvements on the main algorithm, and the addition of memory (XCSM1, XCSM2 [12]). As a consequence, this system shows fairly good results, even on maze-type environments with aliasing squares (see [13]).

Table 2. Differences between XCS and APCS

Element                           XCS                    APCS
Number of Individuals Ni          Adaptive               Fixed
Classifiers (Nc) per Individual   1                      Fixed
Crossover Operator                Literature standards   Standard monopoint
Mutation Operator                 Allelic                Allelic
The XCS principle, as described in [17], consists in a population of classifiers (sensor plus effector) called individuals, whose evolution relies on the ability of each classifier to predict, thanks to a Q-Learning algorithm, the reward that should be obtained by an individual for a given action in answer to a given signal. These classifiers are mixed with a classical GA and converge easily towards fairly generalist classifiers that answer the problem globally. Table 2 summarizes the main structural differences between XCS and APCS. The main difference resides in the fact that a classifier is an individual in the design of XCS. As a consequence, a given reward concerns only one classifier. This mainly explains why XCS without memory may not be able to solve non-Markovian processes (please also refer to [11]). In this study, we have compared the performances of the APCS with those obtained with an XCS on the same mazes. To perform our measures, we have used version 1.2 of the XCS [6], which corresponds to the algorithmic description of XCS published in the article of Butz and Wilson [5]. This version includes most of the mechanisms developed so far for the XCS, without including the register memory mechanism. We strongly recommend the reader to consult [17] and [5] for further details upon XCS.
Algorithm 1. Adapted Pittsburgh-style classifier system

Begin
    // Random initialization of P, or initialization from a file.
    Fill(P);
    Generation = 1;
    Repeat
        For (All Ij of P) Do
            // Reset environment to individual Ij's last status or to zero.
            Reset_Environment(j);
            // In case of continuous reward, remove the following line.
            Reward_Ij = 0;
            For (A number of Trials k) Do
                // Store a message from the environment.
                Fill(Message);
                // Create the match set from classifiers of Ij that match the signal.
                Fill(Match-List);
                If (IsEmpty(Match-List)) Then
                    // Select a classifier not used during the Ct last generations.
                    ReplacePosition = ChooseCoveredClassifier();
                    // Replace the condition part of classifier number ReplacePosition.
                    CoverClassifier(ReplacePosition, Message);
                    Fill(Match-List);
                Endif
                If (Size(Match-List) > 1) Then
                    // Choose a classifier in Match-List using a strategy described in this section.
                    C = BestChoice(Match-List);
                Else
                    C = First(Match-List);
                Endif
                A = Action(C);
                // Act upon the environment using action A.
                DoAction(A);
                // ActionReward is the fitness function.
                Reward_Ij = Reward_Ij + ActionReward(A)/K;
            EndFor
            // Change the strength of individual Ij using Reward_Ij.
            ChangeStrength(Ij, Reward_Ij);
        EndFor
        // The genetic algorithm is applied after every individual has been evaluated.
        ApplyGA(P);
        Generation = Generation + 1;
    Until (Ending criteria encountered);
End
4 Experiments – Results
4.1 Experimental Settings
For each experiment, we submitted 20000 consecutive problems to the system: for each problem, the animat is randomly put on a free square of the maze and the trial stops when one of these two conditions is fulfilled:

1. the position of the animat in the maze is equal to the position of the food;
2. the number of steps done by the animat surpasses a certain threshold (MaxSteps, equal to 50 steps for every presented result).

When the problem is solved, we record the starting distance of the animat to the food, its final distance to the food (due to MaxSteps, this may be greater than 0) and its total number of steps. Both APCS and XCS results shown in this paper are averaged over 10 experiments. As done in [5] and in [13], the signal received by the system consists in a 16-bit string that represents each of the 8 squares surrounding the animat. These squares are encoded clockwise, starting from North: (00) stands for an empty cell, (11) for food and (10) for an obstacle. As a consequence, the sensor part of the classifier also contains 16 positions. Each position in the sensor can be randomly occupied by 0, 1 or by a wildcard (#). The effector part, coded by a string of 3 bits, stands for one of the eight directions available for the animat, coded clockwise, as the sensor part. The specific settings used for XCS are the same as those used by Lanzi in 1999 [13]; please refer to this experiment for more details. Concerning the APCS, each evaluation group, i.e. individual, controls an animat. As a consequence, during the experiment, each group is submitted to the 20000 problems and solves them asynchronously. The experiment stops when all evaluation groups have solved at least 20000 problems. The number of moves is measured during each of the K trials (see Section 3.1).
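A small sketch (ours; the cell symbols and the toroidal wrap are assumptions) of the sensor and effector coding just described:

CELL_CODE = {'.': '00', 'F': '11', 'O': '10'}  # empty, food, obstacle

# clockwise from North: N, NE, E, SE, S, SW, W, NW
NEIGHBOURS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def perceive(maze, r, c):
    """Build the 16-bit environmental message for position (r, c)."""
    rows, cols = len(maze), len(maze[0])
    return ''.join(CELL_CODE[maze[(r + dr) % rows][(c + dc) % cols]]
                   for dr, dc in NEIGHBOURS)

def decode_action(bits):
    """3-bit effector -> direction offset, also clockwise from North."""
    return NEIGHBOURS[int(bits, 2)]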
Algorithm 2. Algorithmic view of the fitness function used in this study

if (next position of Ij = food)
    Reward_Ij ← Reward_Ij + 1.0/K
else if (next position of Ij = obstacle)
    Reward_Ij ← Reward_Ij − 0.5/K
else
    Reward_Ij ← Reward_Ij + 0.2/K
endif
In Algorithm 2, we propose an algorithmic view of the fitness function we use in this study. This function is defined by the move done by a given animat at each trial: if it is correct (i.e. if the animat moves toward an empty cell), the evaluated individual receives a reward of 0.2/K and the movement is performed by
the animat of the group; if it reaches the food, it receives a reward of 1.0/K and the animat of this group is randomly placed in the maze. Otherwise, the individual receives a negative reward of −0.5/K and the animat of the group is not moved. During the K trials, for each individual, the random selection method is used to select a classifier from [M] (see Section 3.1). Concerning the GA step, the mutation mechanism reinforces the exploratory ability of the system by creating new classifiers. These classifiers are created by modifying existing classifiers locally and randomly according to a certain rate, which is expressed by the chosen mutation probability PMut. However, two consecutive positions of the considered mazes rarely differ from one another by more than 4 bits, and the non-sense sequence (01) can, in the present case, invalidate the activation of a classifier. As a consequence, the higher the number of mutated bits, the more the system may lose classifiers that could have allowed it to find the food. Experiments performed in [8] have also validated that a high mutation rate (PMut > 0.2) prevents the system from keeping optimal classifiers. The other important parameter of the GA step, the cross-over rate, has a great influence on the homogeneity and on the stabilisation of the system: combined with elitism, it allows the system to preserve the best behaviours expressed inside the population. Énée has measured in [8] that a low cross-over rate (PCross < 0.6) slows the convergence of the system by preventing good genetic precursors from being replicated inside the population. Elitism was set to 60% in order to keep the best parents and to stabilize the population more rapidly. As shown in [9,8], stable results are obtained using a mutation probability PMut set to 0.005 and a cross-over probability PCross set to 0.75. As a consequence, we have chosen to use those parameter values to perform our experiments.
4.2 Results without Covering
For the presented experiments, we made two significant measures. The first measure, the average number of steps done by the animats, allows us to assess the ability of the system to conform to a moving policy inside the maze; the evolution of this measure reflects and characterizes the evolution of this ability. The second measure, the average final distance of the evaluation groups at the end of a trial, allows us to validate the results outlined by the first measure. As the animat is randomly placed in the environment when the number of steps it has done crosses a certain threshold, this second measure shows the efficiency of the policy built by the system. Now, we can study the influence of parameters proper to the classifier system. The number of individuals and the number of classifiers are the only parameters that really have an influence on the cognitive properties of the CS [8,16]. First, for a fixed number of classifiers (Nc = 30), let us compare the results obtained for NI between 20 and 50 (Table 3).
Table 3. Measure of the influence of the variation of the number of individuals on the average number of steps (Nc = 30, NI = 20, ..., 50)

              Woods 101                              Maze E2
              Avg. nb of steps  Avg. final dist.     Avg. nb of steps  Avg. final dist.
Random walk   29.19             0.66                 32.49             0.78
NI = 20        8.72             0.12                 24.84             0.58
NI = 30        7.52             0.08                 15.03             0.11
NI = 40        7.11             0.05                 13.43             0.08
NI = 50        6.56             0.03                 12.62             0.06
Table 4. Measure of the influence of the variation of the number of classifiers on the average number of steps (NI = 30, Nc = 20, ..., 50)

              Woods 101                              Maze E2
              Avg. nb of steps  Avg. final dist.     Avg. nb of steps  Avg. final dist.
Random walk   29.19             0.66                 32.49             0.78
Nc = 20       12.13             0.33                 33.35             1.25
Nc = 30        7.52             0.08                 15.03             0.11
Nc = 40       11.10             0.28                 12.53             0.06
Nc = 50        8.67             0.18                 11.88             0.07
When the number of individuals changes, it also modifies the number of evaluation groups (see Section 3.1). As, in this experiment, each group controls the moves of an animat, the classifier system is able to learn on NI different situations for each trial. As a consequence, raising the number of individuals also raises the exploratory ability of the system, which accelerates and improves its convergence. When considering the evolution of the average final distance to the food (Table 3, column “Avg. final dist.”), we can conclude that raising the number of individuals contributes to the system's stabilization by diffusing a common accurate strategy more efficiently. Let us now consider the influence of the number of classifiers contained in an individual on the performance of the system on this problem. The following experiments (Table 4) have been conducted with a fixed number of individuals (NI = 30) and various numbers of classifiers (Nc between 20 and 50). Each classifier determines the answer of the individual to one or several given signals coming from the environment. As a consequence, the number of classifiers contained in an individual can possibly influence the number of signals which may trigger an answer from a given individual. As shown by Smith [16], if the potential information contained in an individual is too high, the exceeding information generates noise that disturbs the answer of the system and the evolution of more fitted classifiers. In addition to this phenomenon, as non-Markovian situations may induce a rise in the
average wildcard rate of the classifiers [2], each additional classifier may bring more unnecessary information. This side effect emphasizes the fact that there exists a threshold on the potential information contained in an individual. When considering the obtained measures, we can conclude that this threshold depends on the considered problem. As a conclusion, if at the beginning providing the individuals of the APCS with additional cognitive capacity can improve the quality of the answer of the system, additional classifiers may contain useless precursors that disturb the convergence of the system. As said earlier, this issue is successfully addressed in studies of other Pittsburgh approaches as the “bloat effect” [1].

4.3 Results Obtained Using Covering
In order to improve our results, we have adapted the covering mechanism used in the XCS to allow the APCS to generate well-fitted classifiers when encountering an unknown signal. To measure the impact of the activation of this mechanism on the evolution of the classifiers contained in the individuals of the APCS, we have chosen to measure the changes registered when choosing different values for the Ct parameter. To prevent any additional perturbation, these tests have been performed using a fixed number of classifiers and a fixed number of individuals. In this paper, we present those results for NI = 30 and Nc = 30 for the Woods 101, and NI = 40, Nc = 30 for the Maze E2, with Ct between 0 (no covering) and 30 (covering uses the classifiers that have been least triggered during the last 30 generations). We can notice in Table 5 that the number of steps done by the system to reach the food decreases when the Ct parameter increases. This improvement occurs due to the modification done to the available pool of classifiers by the covering mechanism. As we have measured, a classifier that has not been activated over Ct generations has a strong probability of containing one or many defective precursors. While this mechanism allows removing those precursors from the population, it also increases the accuracy of the strategy built by the system. This phenomenon is suggested by the evolution of the average final distance of the animats to the food: the tendencies observed on the first measure are confirmed by the second one.

Table 5. Measure of the influence of the variation of the parameter Ct on the obtained results
          Woods 101 (NI = 30, Nc = 30)               Maze E2 (NI = 40, Nc = 30)
          Avg. nb of steps  Avg. final dist.         Avg. nb of steps  Avg. final dist.
Ct = 0    7.52              0.08                     13.43             0.08
Ct = 3    6.22              0.03                      8.65             0.01
Ct = 7    5.7               0.001                     8.47             0.01
Ct = 11   6.03              0.02                      8.60             0.01
Ct = 15   5.84              0.01                      8.64             0.01
Ct = 20   5.86              0.01                      8.48             0.01
Ct = 30   5.77              0.01                      8.83             0.01
Fig. 6. Measure of the gain related to the covering mechanism: (a) Woods 101, (b) Maze E2
As a summary, through the results presented in Table 5, we deduce that the classifiers whose condition part has been replaced by the covering mechanism carry an amount of information whose usability decreases with the number of evaluation steps passed without being activated. As the animat problem studied is a multi-step environment, it may occur that a classifier that has not been triggered at a given evaluation step t_i is triggered at step t_{i+n}, depending on the moves of the animats through the environment. However, the number of squares which can be occupied by an animat is finite, so there exists an upper bound on the number of available signals generated by this environment. If we relate this hypothesis to the previous one, we can suppose that, for each environment, each situation will have been tested by the system after a finite number of generations. Figures 6a and 6b plot the average gain in % over all the conducted experiments when we increase the Ct parameter. We can observe that this gain becomes equal to 0 when a certain value of Ct is crossed. Thus, we can suppose that there exists a finite number of generations NCT which may allow us to diagnose that, if a classifier has not matched any signal emitted by the environment during the last NCT generations, the signals corresponding to this classifier are not available in the current environment. Due to that fact, and according to the structure of the APCS (see Section 3.1), if we extend the range of this conclusion, we can also suppose that for each environment of this type there exists a threshold value of evaluation steps K ∗ NCT over which an upper value of the Ct parameter does not carry any additional knowledge on the potentially useless information carried by a classifier.
5 Comparison between APCS and XCS
As shown by the results of the experiments presented in this paper, the APCS manages to evolve classifiers allowing it to adopt a stable moving policy whose quality is greatly improved by the recently added covering mechanism.
Fig. 7. Best performances with Woods 101: (a) XCS (2000 individuals), (b) APCS (NI = 30, Nc = 20, Ct = 7)
We will now focus on the comparative study of the best results obtained with the XCS (fig. 7 and 8) and the best results obtained with the APCS. The parameters used for the XCS during those experiments are those used in the experiment conducted by Lanzi in 1999 on this type of environment [13], except for the number of individuals, which is 2000 for the Woods101 and 8000 for the Maze E2. We center our discussion on the policies built by the classifiers. When we consider the differences between the best results obtained by the XCS and the best results obtained by the APCS, we notice two differences. The first one, significant on the Maze E2 but not on the Woods 101, is that the average number of steps done by the APCS is closer to the optimal than the average number of steps done by the XCS. This statement finds its foundation in the second difference noticed: the average final distance to the food for the APCS on the two mazes is lower than the one measured for the XCS. This difference means that the APCS accurately finds the food more often than the XCS, which implies that the policy built by the APCS is more stable and accurate. However, the learning strategies employed by the two systems differ considerably: the XCS learning stage focuses on the value function, and the policy deployed by this system is strongly dependent on this value function. In addition, in order to keep the accuracy of its prediction, this mechanism requires maintaining most of the actions available for a given signal. XCS was built to find Markov chains to solve a problem, so it is not supposed to maintain, at the same level of prediction, classifiers with identical condition parts and different actions. Reward in XCS is given to each classifier that is part of the Markov chain to the solution. This is what we pointed out while describing both CS. The APCS maintains knowledge structures with several classifiers. As the reward concerns the whole knowledge structure, it is possible to have two or more classifiers with the same condition part but with different actions upon the environment. Due to those facts, even if the XCS succeeds in keeping an almost stable policy for the Woods 101 environment, it fails when facing an environment with numerous aliasing situations.
Fig. 8. Best performances with Maze E2: (a) XCS (8000 individuals), (b) APCS (NI = 40, Nc = 50, Ct = 11)
On the opposite side, the learning mechanism used by the APCS relies on its cognitive capacity (Nc), which is highly problem dependent but allows it to develop strategies regarding its past actions. The most accurate classifiers will tend to stay in the population because they allow their owner to reach the food more accurately and more often than the other individuals of the APCS. As a consequence, in a multi-step environment, the classifiers contained in those strong individuals allow them to build action chains which are reward dependent. Moreover, instead of solving one problem at a time, the APCS tries to solve NI problems at the same time (see Section 3.1), which allows it to explore a higher number of situations in a short simulation time. Due to all those facts, the system tends to find a “sub-optimal” solution that allows it to be rewarded accurately and more often.
Fig. 9. Example of policy deployed by the APCS to solve the Maze E2
We observe the formation of policies (see fig. 9 for the policy obtained with the APCS in Maze E2) that make it possible to reach the food from every position, even when the environment contains aliasing squares. This policy evolves with the frequency of the reward encountered by the individuals, which allows them to adapt and modify (via the genetic algorithm) their classifiers.
6 Conclusions
Through this paper, we have shown and studied results which indicate that, without any knowledge of its environment, even when facing non-Markovian positions, the Adapted Pittsburgh Classifier System, improved with the covering mechanism and with fitted parameters, is able to adopt an almost stable policy in maze environments containing aliasing squares. This policy allows it to reach the food accurately with a low but not optimal number of steps. When studying the number of classifiers contained in an individual, we have shown that raising this local cognitive capacity can benefit the system if it remains under a problem-dependent threshold (see also [9]). Cognitive capacity provided over this threshold led to the conservation of defective precursors and a strong disturbance of the system's answer due to them. Fortunately, as shown in the experiments, those precursors are assimilated or eliminated by the system in a number of trials depending on the amount of useless information they carry. We have also shown that the covering mechanism we propose has a noticeable influence on the performance of the system: classifiers eliminated by this mechanism carry information whose usability decreases with the number of evaluation steps during which those classifiers are not triggered by a signal from the environment. Some interesting further work remains to be finalized, especially concerning the precise construction of the policy evolved by the APCS along the experiment.
References

1. Bacardit, J., Garrell-Guiu, J.M.: Bloat control and generalization pressure using the minimum description length principle for a Pittsburgh approach learning classifier system. In: Kovacs, T., Llorà, X., Takadama, K., Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2003. LNCS (LNAI), vol. 4399, pp. 59–79. Springer, Heidelberg (2007)
2. Bagnall, A.J., Zatuchna, Z.: On the classification of maze problems. In: Bull, L., Kovacs, T. (eds.) Applications of Learning Classifier Systems. Studies in Fuzziness and Soft Computing, vol. 183, pp. 307–316. Springer, Heidelberg (2005)
3. Bernadó-Mansilla, E., Llorà, X., Garrell-Guiu, J.M.: XCS and GALE: A comparative study of two learning classifier systems on data mining. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2001. LNCS (LNAI), vol. 2321, pp. 115–132. Springer, Heidelberg (2002)
4. Bull, L.: Lookahead and latent learning in ZCS. In: GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, New York, July 9-13, pp. 897–904. Morgan Kaufmann Publishers, San Francisco (2002)
5. Butz, M., Wilson, S.W.: An algorithmic description of XCS. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp. 253–272. Springer, Heidelberg (2001)
6. Butz, M.V.: Documentation of XCS+TS c-code 1.2. IlliGAL Report 2003023, Illinois Genetic Algorithms Laboratory (October 2003)
7. De Jong, K.A., Spears, W.M., Gordon, D.F.: Using Genetic Algorithms for Concept Learning. Machine Learning 13(3), 161–188 (1993)
8. Énée, G.: Systèmes de Classeurs et Communication dans les Systèmes Multi-Agents. PhD thesis, Ecole Doctorale de STIC, Université de Nice Sophia-Antipolis (January 2003)
9. Énée, G., Barbaroux, P.: Adapted Pittsburgh-style classifier-system: Case-study. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2003. LNCS (LNAI), vol. 2661, pp. 30–45. Springer, Heidelberg (2003)
10. Holmes, J.H., Lanzi, P.L., Stolzmann, W., Wilson, S.W.: Learning classifier systems: New models, successful applications. Inf. Process. Lett. 82(1), 23–30 (2002)
11. Lanzi, P.L.: Adding Memory to XCS. In: Proceedings of the IEEE Conference on Evolutionary Computation (ICEC 1998). IEEE Press, Los Alamitos (1998), http://ftp.elet.polimi.it/people/lanzi/icec98.ps.gz
12. Lanzi, P.L.: An analysis of the memory mechanism of XCSM. In: Proceedings of the Third Genetic Programming Conference, pp. 643–651. Morgan Kaufmann, San Francisco (1998), http://ftp.elet.polimi.it/people/lanzi/gp98.ps.gz
13. Lanzi, P.L., Wilson, S.W.: Optimal classifier system performance in non-Markovian environments. Technical Report 99.36, Illinois Genetic Algorithms Laboratory, Milan, Italy (1999)
14. Sigaud, O.: Les systèmes de classeurs: un état de l'art. Revue d'Intelligence Artificielle, RSTI série RIA, Lavoisier, vol. 21 (February 2007)
15. Sigaud, O., Wilson, S.W.: Learning classifier systems: a survey. Soft Comput. 11(11), 1065–1078 (2007)
16. Smith, S.F.: A Learning System based on Genetic Adaptive Algorithms. PhD thesis, University of Pittsburgh (1980)
17. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2), 148–175 (1995)
18. Zatuchna, Z.V.: AgentP: A Learning Classifier System with Associative Perception in Maze Environments. PhD thesis, School of Computing Sciences, UEA (2005)
Classification Potential vs. Classification Accuracy: A Comprehensive Study of Evolutionary Algorithms with Biomedical Datasets

Ajay Kumar Tanwani and Muddassar Farooq

Next Generation Intelligent Networks Research Center (nexGIN RC), National University of Computer & Emerging Sciences (FAST-NU), Islamabad, Pakistan
{ajay.tanwani,muddassar.farooq}@nexginrc.org
Abstract. Biomedical datasets pose a unique challenge for machine learning and data mining techniques to extract accurate, comprehensible and hidden knowledge from them. In this paper, we investigate the role of a biomedical dataset in the classification accuracy of an algorithm. To this end, we quantify the complexity of a biomedical dataset in terms of its missing values, imbalance ratio, noise and information gain. We have performed our experiments using six well-known evolutionary rule learning algorithms – XCS, UCS, GAssist, cAnt-Miner, SLAVE and Ishibuchi – on 31 publicly available biomedical datasets. The results of our experiments and statistical analysis show that GAssist gives better classification results on the majority of biomedical datasets among the compared schemes but cannot be categorized as the best classifier. Moreover, our analysis reveals that the nature of a biomedical dataset – not the selection of the evolutionary algorithm – plays a major role in determining the classification accuracy of a dataset. We further show that noise is a dominating factor in determining the complexity of a dataset and that it is inversely proportional to the classification accuracy of all evaluated algorithms. Towards the end, we provide researchers with a meta-classification model that can be used to determine the classification potential of a dataset on the basis of its complexity measures.

Keywords: Classification, Evolutionary Rule Learning Algorithms, Biomedical Datasets, Performance Measures.
1 Introduction

Recent advancements in the field of bioinformatics and computational biology are increasing the complexity of underlying biomedical datasets. The use of sophisticated equipment like mass spectrometers and magnetic resonance imaging (MRI) scanners generates large amounts of data that pose a number of issues regarding electronic storage and efficient processing. One of the major challenges in this context is to automatically extract accurate, comprehensible, and hidden knowledge from large amounts of raw data. The discovered knowledge can then help medical experts in the classification of anomalies in these datasets. Well-known data mining techniques for knowledge extraction and classification include probabilistic methods, neural networks, support vector machines, decision trees,
instance-based learners, rough sets and evolutionary algorithms. The evolutionary algorithms – inspired by the evolution process of biological species – show a number of desirable properties like self-adaptation, robustness and collective learning, which make them suitable for challenging real-world problems. The Evolutionary Computation (EC) paradigm has been successfully used in several data mining techniques including, but not limited to, genetics-based machine learning systems (GBML), learning classifier systems (LCS), ant colony inspired classifiers, and hybrid variants of evolutionary fuzzy systems and neural networks. Evolutionary classifiers are becoming popular for data mining of medical datasets because of their ability to find hidden patterns in electronic records that are not otherwise obvious even to physicians [1]. However, it is not obvious for a researcher working on the classification of biomedical datasets to choose a suitable classifier. Consequently, the common methodology adopted by researchers is to empirically evaluate their dataset with a few well-known machine learning techniques and select the one that gives better results. As a result, no attempt is made to systematically investigate the factors that define the accuracy of a classifier. An important contribution of this paper is to show that the accuracy of a classifier depends on the complexity of a dataset. We define the complexity of a dataset in terms of missing values, imbalance ratio, noise and information gain. Moreover, we evaluate the performance of six well-known evolutionary rule learning classifiers – XCS, UCS, GAssist, cAnt-Miner, SLAVE and Ishibuchi – on 31 publicly available biomedical datasets. The results of our experiments provide two valuable insights: (1) classification accuracy strongly depends on the complexity of a biomedical dataset, and (2) the noise of a dataset predominately defines its complexity. To conclude, we propose that researchers should first evaluate the complexity of their medical dataset and then use our proposed meta-model to determine its classification potential. The remainder of the paper is organized as follows: we introduce the evolutionary algorithms used in our study in Section 3. In Section 4, we quantify the complexity of the biomedical datasets. We report the results of our experiments, which are followed by statistical analysis and discussions, in Section 5. Finally, we conclude the paper with an outlook on our future work.
2 Related Work

We now present a brief overview of different studies that analyze the performance of evolutionary algorithms in various biomedical domains. In [2], Wong et al. applied evolutionary algorithms to discover knowledge in the form of rules and causal structures from fracture and scoliosis databases. Their results suggest that evolutionary algorithms are useful in finding interesting patterns. John Holmes in [3] presented his stimulus-response learning classifier system, EpiCS, to enhance classification accuracy on an imbalanced class dataset. He, however, used an artificially created liver cancer dataset. Bernadó-Mansilla in [4] characterized the complexity of the classification problem by a set of geometrical descriptors and analyzed the competence of XCS in this domain. The authors in [5] compared XCS with a Bayesian network, SMO and C4.5 for mining breast cancer data and showed that XCS provides significantly higher accuracy, followed by C4.5. However, its rules are considered more comprehensible and descriptive by the
domain experts. The work in [6] evaluates two competitive learning classifier systems, XCS and UCS, for extracting knowledge from imbalanced data using both fabricated and real-world problems. The results of their study prove the robustness of these algorithms compared with IBk, C4.5 and SMO. In [7], the authors compared the Pittsburgh and Michigan style classifiers using XCS and GAssist on 13 publicly available datasets to reveal important differences between the two systems. The comparative study performed in [8] between evolutionary algorithms (XCS and GALE) and non-evolutionary algorithms (instance-based learners, decision trees, rule learning, statistical models and support vector machines) on several datasets suggests that evolutionary algorithms are more suitable for data mining and classification. The results of the experiments carried out in [9] show better classification accuracy for the well-known ant colony inspired Ant-Miner compared with C4.5 on 4 biomedical datasets. The authors in [10] have analyzed several strategies of evolutionary fuzzy models for data mining and knowledge discovery. In our earlier work [11], we provide several guidelines to select a suitable machine learning scheme for the classification of biomedical datasets; however, that work is limited to non-evolutionary algorithms. A common theme observed in various studies is that they are inclined towards particular classifier(s) instead of the biomedical dataset(s). In contrast, our study uses a novel methodology to quantify the complexity of a dataset, which, we show, defines the accuracy of a classifier. Moreover, we also build a meta-model of our findings that can be used to determine the classification potential of a biomedical dataset.
3 Evolutionary Algorithms

We have selected a diverse set of well-known evolutionary rule learning algorithms for our empirical study. The selected algorithms are: (1) reinforcement learning based Michigan-style XCS [12], (2) supervised learning based Michigan-style UCS [13], (3) Pittsburgh-style GAssist [14], (4) Ant Colony Optimization (ACO) inspired cAnt-Miner [15], (5) genetic fuzzy iterative learner SLAVE [16], and (6) genetic fuzzy classifier Ishibuchi [17]. In all our experiments, the parameters are selected to achieve the best operating point on the ROC (Receiver Operating Characteristic) curve [18].

3.1 XCS

XCS is a reinforcement learning based Michigan-style classifier that evolves a set of rules as a population of classifiers (P). Each rule consists of a condition, an action and three performance parameters: (1) payoff prediction (p), (2) prediction error (ε), and (3) fitness (F). The first step in classification is to build a match set (M) that consists of rules whose conditions are satisfied. The payoff prediction of each rule is computed and its corresponding action set (A) is created. Online learning is made possible with a reward (r), returned by the environment, that is subsequently used to tune the performance parameters of the rules in the action set. The updated fitness is inversely proportional to the prediction error. Finally, a genetic algorithm (GA), with crossover and mutation probabilities χ and μ respectively, is applied to the rules in the action set and consequently new rules are added to the population. Some rules are also deleted from the population depending on their experience.
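A sketch (ours, following the standard algorithmic description of XCS; the attribute names are illustrative) of the classification step just outlined: form the match set and compute a fitness-weighted payoff prediction for each action:

from collections import defaultdict

def matches(condition, state):
    return all(c == '#' or c == s for c, s in zip(condition, state))

def prediction_array(population, state):
    """population: rules carrying .condition, .action, .p (payoff
    prediction) and .F (fitness)."""
    match_set = [cl for cl in population if matches(cl.condition, state)]
    num, den = defaultdict(float), defaultdict(float)
    for cl in match_set:
        num[cl.action] += cl.p * cl.F
        den[cl.action] += cl.F
    predictions = {a: num[a] / den[a] for a in num if den[a] > 0}
    return match_set, predictions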
The parameter configuration of XCS used in our experiments is as follows: population size N = 6400, learning rate β = 0.2, θsub = θdel = 50, tournament size = 0.4, χ = 0.8, μ = 0.04, and the number of explorations is kept at 100,000.

3.2 UCS

UCS is an accuracy based Michigan-style classifier which is in principle quite similar to XCS. However, it uses a supervised learning scheme to compute fitness instead of the reinforcement learning employed by XCS. UCS, like XCS, also evolves a population of rules (P). Each rule has two parameters: (1) accuracy (acc), and (2) fitness (F). During the training phase, for every instance, the rules whose conditions are satisfied become part of its match set (M). The rules that perform correct classification become part of the correct set (C), and the others become part of the incorrect set (!C). Finally, the genetic algorithm is applied to the correct set to update the population. During testing, every instance is classified through weighted voting, on the basis of fitness, to select the action. We have used the following parameter settings: N = 6400, number of iterations = 100,000 and acc0 = 0.99. The other tuning parameters of the GA are kept the same as in XCS.

3.3 GAssist

GAssist (Genetic Algorithms based claSSIfier sySTem), in contrast to XCS and UCS, is a Pittsburgh-style learning classifier in which the rules are assembled in the form of a decision list. GAssist-ADI uses the Adaptive Discretization Intervals (ADI) rule representation. In such systems, the continuous space is discretized into fixed intervals for developing rules. Generalization is introduced by deleting and selecting rule sets as a function of their accuracy and length. The crossover between two rules takes place across attribute boundaries rather than attribute intervals. The GAssist parameter setting is as follows: crossover probability = 0.6, number of iterations = 500, minimum number of rules for rule deletion = 12, and a set of uniform discretizations with 4, 5, 6, 7, 8, 10, 15, 20 and 25 bins.

3.4 cAnt-Miner

Ant-Miner, inspired by the behavior of real ant colonies, uses Ant Colony Optimization (ACO) to construct classification rules from the training data. The rule discovery process consists of three steps, i.e. rule generation, rule pruning and rule updating. In the rule generation step, an ant starts with an empty rule list and adds one term at a time based on the probability of that attribute-value pair. It continues to add terms to the rule without duplication until all the attributes are exhausted or the new terms would make the rule more specific than a user-specified threshold allows. In the rule pruning step, the terms whose removal improves the accuracy of the rule are pruned one by one. While updating rules, the pheromone values of terms are increased or decreased on the basis of their usage in the rule discovery process. cAnt-Miner is a variant of Ant-Miner for real-valued attributes. The parameters of cAnt-Miner are: number of ants = 3000, minimum cases per rule = 5, maximum number of uncovered cases = 10 and convergence test size = 10.
3.5 SLAVE

SLAVE (Structural Learning Algorithm in Vague Environment) differs substantially from the classical Michigan-style and Pittsburgh-style rule learning algorithms. In this approach, every entity in the population represents a unique rule, but during an iteration of the genetic algorithm only the best individual is added to the final set of rules, which is eventually used for classification. In this way, SLAVE combines its iterative learning approach with fuzzy models. The fitness of the rules is determined by their completeness and consistency. In our experiments, the parameter configuration of SLAVE is: number of labels = 5, population size = 100, number of iterations allowed without change = 500 and mutation probability = 0.01.

3.6 Ishibuchi

Ishibuchi et al. proposed a fuzzy rule learning method for multidimensional pattern classification problems with continuous attributes. The classification is done with the help of a fuzzy rule base in which each fuzzy if-then rule is handled as an individual, and a fitness value is assigned to each rule. The criterion for assigning a class label is based on a simple heuristic procedure that assigns a grade of certainty to each fuzzy if-then rule. Because the method uses linguistic values with fixed membership functions as antecedent fuzzy sets, a linguistic interpretation of each fuzzy if-then rule is easily obtained, which greatly helps in comprehending the generated solution. The experiments are carried out with the following parameters: number of labels = 5, population size = 100, number of evaluations = 10,000, along with crossover and mutation probabilities of 1.0 and 0.9, respectively.
4 Nature of Biomedical Datasets

Biomedical datasets provide a whole spectrum of difficulties – high dimensionality, multiple classes, imbalanced classes, missing values and noisy data – that affect the classification accuracy of algorithms. The inconsistencies and inherent complexities in biomedical datasets obtained from different sources justify the need to separately investigate the impact of the nature of a biomedical dataset on classification. To this end, we have selected 31 diverse biomedical datasets publicly available from the UCI machine learning repository [19]. We now introduce four parameters that we use to quantify the complexity of a biomedical dataset: (1) missing values, (2) imbalance ratio, (3) noise, and (4) information gain.

4.1 Missing Values

A major focus of the machine learning community has been to analyze the effect of missing data on the accuracy of a classifier. Missing data is generally classified into three types: (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) not missing at random (NMAR). The datasets obtained from clinical databases contain several missing fields, which can belong to all three categories of missing values. In Table 1 we see that the VA-Heart dataset contains up to 27% missing values in its attributes.
4.2 Imbalance Ratio

Orriols-Puig and Bernadó-Mansilla compute class imbalance as the ratio between the number of majority class instances and the number of minority class instances [6]. However, this is only suitable for two-class problems, as it does not account for the proportions of the other classes in a multi-class dataset. For example, Thyroid0387 has a total of 32 classes with 6771 majority class instances and only 1 minority class instance. The imbalance ratio computed with the above method is 6771, which definitely does not represent the true picture, because the distribution of instances across the other classes is relatively uniform. Therefore, we use the following definition of the imbalance ratio Ir, which caters for the proportions of all class distributions:

    I_r = \frac{N_c - 1}{N_c} \sum_{i=1}^{N_c} \frac{I_i}{I_n - I_i}    (1)

where Ir is in the range (1 ≤ Ir < ∞) and Ir = 1 corresponds to a completely balanced dataset having equal instances of all classes; Nc is the number of classes, Ii is the number of instances of class i and In is the total number of instances. Hyperthyroid is the most imbalanced dataset in our repository, with an imbalance ratio of 28.81.

4.3 Noise

Noise is of two types: (1) attribute noise, and (2) class noise. Research has shown that the impact of class noise on classification accuracy is significantly greater than that of attribute noise [20]; hence, we only quantify class noise in our study. The common sources of class noise are inconsistent and mislabeled instances. A number of research efforts have been made to quantify the level of noise in a dataset, but its definition still remains subjective. Brodley and Friedl characterized noise as the proportion of instances incorrectly classified by a set of trained classifiers [21]. We use a similar approach to quantify noise, but utilize the confusion matrices of a set of classifiers to determine the noisy instances. Noise is then quantified as the sum of all off-diagonal entries (incorrectly classified instances), where each entry is the minimum of the corresponding elements across the set of confusion matrices. The defined criterion is based upon two assumptions: (1) an inconsistent or mislabeled instance is likely to confuse every classifier, and (2) the bias of an algorithm towards particular class instances can be factored out by using a set of classifiers. The advantage of our approach is that we separately identify the misclassified instances of every class and only categorize as noisy those which are misclassified by all the classifiers. The confusion matrix of the n-th classifier in a set of n classifiers can in general be represented as

    C_n = \begin{pmatrix} i_{11}^{n} & i_{12}^{n} & \cdots & i_{1j}^{n} \\ i_{21}^{n} & i_{22}^{n} & \cdots & i_{2j}^{n} \\ \vdots & \vdots & \ddots & \vdots \\ i_{i1}^{n} & i_{i2}^{n} & \cdots & i_{ij}^{n} \end{pmatrix}

where the diagonal elements of C_n represent the correctly classified instances and the off-diagonal elements are the incorrectly classified instances. The percentage of class noise in a dataset of In instances can be computed as below:
    \text{Noise} = \frac{100}{I_n} \sum_{i=1}^{N_c} \sum_{j=1}^{N_c} \min\big(C_1(i,j), C_2(i,j), \ldots, C_n(i,j)\big)    (2)
where i ≠ j, and min(C_1(i,j), C_2(i,j), ..., C_n(i,j)) is the entry for the corresponding i and j that represents the minimum number of class instances misclassified by all the classifiers. We have used five well-known and diverse machine learning algorithms as the set of classifiers in our study: Naive Bayes (probabilistic), SMO (support vector machines), J48 (decision trees), Ripper (inductive rule learner) and IBk (instance based learner). We use the standard implementations of these schemes in the Waikato Environment for Knowledge Analysis (WEKA) [22]. It is evident from Table 1 that biomedical datasets are generally associated with high noise levels.

4.4 Information Gain

Information gain is an information-theoretic measure that evaluates the quality of the attributes in a dataset [22]. It measures the reduction in uncertainty if the values of an attribute are known. For a given attribute X and a class attribute Y, the uncertainty is given by their respective entropies H(X) and H(Y). The information gain of X with respect to Y is then given by

    I(Y; X) = H(Y) − H(Y|X)    (3)
The average and total information gain of a biomedical dataset, shown in Table 1, give a direct measure of the quality of its attributes for classification.
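To make the three complexity measures concrete, the following sketch computes the imbalance ratio of Eq. (1), the class noise of Eq. (2) and the information gain of Eq. (3) from plain Python lists. It is a minimal illustration under the assumption that class labels and per-classifier confusion matrices are available as lists; the function names are ours, not taken from any of the cited tools.

import math
from collections import Counter

def imbalance_ratio(labels):
    """Imbalance ratio Ir of Eq. (1); Ir = 1 for a perfectly balanced dataset."""
    counts = Counter(labels)
    n_c, i_n = len(counts), len(labels)
    return (n_c - 1) / n_c * sum(i_i / (i_n - i_i) for i_i in counts.values())

def class_noise(confusion_matrices, n_instances):
    """Percentage of class noise, Eq. (2): only off-diagonal entries that are
    misclassified by every classifier in the set are counted as noisy."""
    n_c = len(confusion_matrices[0])
    total = sum(min(cm[i][j] for cm in confusion_matrices)
                for i in range(n_c) for j in range(n_c) if i != j)
    return 100.0 * total / n_instances

def information_gain(xs, ys):
    """I(Y;X) = H(Y) - H(Y|X) of Eq. (3), for a nominal attribute X."""
    def entropy(values):
        counts, n = Counter(values), len(values)
        return -sum(c / n * math.log2(c / n) for c in counts.values())
    h_y_given_x = sum(
        len(sub) / len(ys) * entropy(sub)
        for x in set(xs)
        for sub in [[y for xv, y in zip(xs, ys) if xv == x]])
    return entropy(ys) - h_y_given_x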
5 Results and Discussion

We now present the results of the experiments that we conducted to analyze the nature of the 31 biomedical datasets with the six evolutionary algorithms. We have used the standard ACO framework MYRA [23] for cAnt-Miner, and Knowledge Extraction based on Evolutionary Learning (KEEL) [24] for the other evolutionary classifiers, to remove any implementation bias from our study. We evaluate the classification accuracy of the evolutionary algorithms using standard ten-fold stratified cross-validation in order to ensure a systematic and unbiased analysis. The results summarized in Table 1 show the nature of each dataset in terms of its quantified parameters, along with the resulting classification accuracies of all the algorithms.
Table 1. The table shows: (1) a summary of the used datasets in alphabetical order – number of instances, classes, attributes (continuous, binary, nominal), percentage of missing values in the attributes, noise, average information gain (Avg Info Gain) and total information gain (Net Info Gain); (2) the classification accuracies of the evolutionary rule-learning algorithms (XCS, UCS, GAssist, cAnt-Miner, SLAVE and Ishibuchi), together with their per-algorithm means, standard deviations and average ranks; bold entries in every row represent the best accuracy. The 31 datasets are: Ann-Thyroid, Breast Cancer, Breast Cancer Diagnostic, Breast Cancer Prognostic, Cardiac Arrhythmia, Cleveland-Heart, Contraceptive Method, Dermatology, Echocardiogram, E-Coli, Haberman's Survival, Hepatitis, Horse Colic, Hungarian Heart, Hyper Thyroid, Hypo-Thyroid, Liver Disorders, Lung Cancer, Lymph Nodes, Mammographic Masses, New Thyroid, Pima Indians Diabetes, Post Operative Patient, Promoters Genes Sequence, Protein Data, Sick, Splice-Junction Gene Sequence, Statlog Heart, Switzerland Heart, Thyroid0387 and VA-Heart.

We now provide insights into the obtained results, first using statistical procedures to analyze the effect of the evolutionary learning paradigm, and then discussing in detail the role of the nature of a biomedical dataset in classification accuracy.

5.1 Statistical Analysis of Results

In this section, we provide a statistical analysis of the results in Table 1 to systematically quantify the performance of the evolutionary algorithms. The common approach used by many researchers in such cases is to make pairwise comparisons between all the classifiers using commonly used statistical tests, such as the paired t-test or the Wilcoxon
signed-rank test, and to report significant differences between the pairs [6][8]. Demsar has criticized the misuse of these approaches for multiple classifier comparisons because: (1) none of them reasons about comparing the means of more than two random variables, and (2) in doing so, a certain portion of the null hypotheses is always rejected due to random chance [25]. In this paper, we use more specialized methods for comparing the average ranks of the evolutionary classifiers (see Table 1), as suggested by Demsar [25] and Garcia [26].

Global Comparison of Evolutionary Classifiers. We use the two most widely used non-parametric tests for the comparison of multiple hypotheses among the classifiers: (1) the Friedman test [27], and (2) the Iman and Davenport test [28]. These tests utilize the χ² and F distributions, respectively, to check whether the distributions of observed and expected frequencies differ from each other. The Friedman and Iman–Davenport tests perform a global analysis to check whether the measured average ranks of all the classifiers are significantly different from the mean rank (3.5 in our case). The corresponding statistics χ²_F and F_F, calculated as explained by Friedman and by Iman and Davenport, are:

    χ²_F = 19.94,  F_F = 4.44

The critical values χ²_C and F_C, obtained from the χ² and F distribution tables at α = 0.05 with 5 and (5, 150) degrees of freedom respectively, are:

    χ²_C(5) = 11.07,  F_C(5, 150) = 2.27

Since the critical values are lower than the test statistics, the null hypothesis can be rejected and post-hoc tests can be applied to detect significant differences between the classifiers.

Comparison with the Control Classifier – GAssist. It can be seen from the results in Table 1 that GAssist provides the best overall classification accuracy of 77.33 and the smallest standard deviation of 16.63. Moreover, it also outperformed the other classifiers on 13 biomedical datasets. To compare the performance of GAssist with the other evolutionary algorithms, we now establish multiple hypotheses in which every other evolutionary classifier is statistically compared with GAssist. We use two post-hoc tests to determine the statistical significance of the results: (1) the Bonferroni-Dunn test [29], and (2) the Holm test [30]. In general, these post-hoc tests differ in how they adjust the significance level α across the multiple hypotheses. The Bonferroni-Dunn test controls the family-wise error rate in a single step by dividing α by the number of comparisons (k − 1). Holm's test is a step-down procedure in which the hypotheses are tested on the p-values arranged in ascending order: starting from the lowest p-value, every hypothesis with p_i ≤ α/(k − i) is rejected, while all the remaining hypotheses are retained. Holm's test is more powerful, as it makes no assumptions about the hypotheses and, in general, rejects more hypotheses than the Bonferroni-Dunn test. The probability (p-value) of the test statistic is obtained from the normal distribution table using the z-value for the comparison of the i-th and j-th classifiers. If this probability is less than the appropriate significance level, the null hypothesis is rejected. The results of the comparison with the control classifier GAssist are shown in Table 2.
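As a worked illustration of the two post-hoc procedures, the following sketch recomputes the p-values and rejection decisions of Table 2 from its z-values, using only the Python standard library; it reproduces the rejections for SLAVE and Ishibuchi discussed below.

import math

# z-values of each classifier compared with the control (GAssist), Table 2.
z_values = {"SLAVE": 3.462, "Ishibuchi": 3.292,
            "cAnt-Miner": 1.358, "XCS": 1.222, "UCS": 0.645}
alpha, k = 0.05, 6

def two_sided_p(z):
    """P(|Z| >= z) for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

ordered = sorted(z_values.items(), key=lambda kv: -abs(kv[1]))  # ascending p
holm_active = True
for i, (name, z) in enumerate(ordered, start=1):
    p = two_sided_p(z)
    bd_reject = p <= alpha / (k - 1)                      # Bonferroni-Dunn
    holm_reject = holm_active and p <= alpha / (k - i)    # Holm, step-down
    holm_active = holm_reject     # stop once a hypothesis is retained
    print(f"{name:10s} p = {p:.3g}  B-D: {bd_reject}  Holm: {holm_reject}")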
Table 2. Test statistics for the comparison with the control classifier GAssist (α = 0.05, k = 6, N = 31 and R_j = 2.71). The null hypothesis is rejected for bold entries in the p column. The z-value is computed as (R_i − R_j)/√(k(k+1)/6N); the last two columns give the critical values α/(k − 1) of the Bonferroni-Dunn (B-D) test and α/(k − i) of the Holm test.

i  Algorithm    z-value   p        B-D    Holm
1  SLAVE        3.462     5.36E-4  0.01   0.01
2  Ishibuchi    3.292     9.93E-4  0.01   0.0125
3  cAnt-Miner   1.358     0.174    0.01   0.017
4  XCS          1.222     0.222    0.01   0.025
5  UCS          0.645     0.519    0.01   0.05
Table 3. Test statistics for the pairwise comparisons (α = 0.05, k = 6, N = 31). The null hypothesis is rejected for bold entries in the p column. The z-value is computed as (R_i − R_j)/√(k(k+1)/6N); the last two columns give the critical values 2α/(k(k − 1)) of the Nemenyi test and the step-down critical values of the Holm test.

i   Algorithms                z-value   p        Nemenyi  Holm
1   GAssist vs SLAVE          3.462     5.36E-4  0.003    0.003
2   GAssist vs Ishibuchi      3.292     9.93E-4  0.003    0.004
3   UCS vs SLAVE              2.817     0.004    0.003    0.004
4   UCS vs Ishibuchi          2.647     0.008    0.003    0.004
5   XCS vs SLAVE              2.240     0.025    0.003    0.004
6   cAnt-Miner vs SLAVE       2.104     0.035    0.003    0.005
7   XCS vs Ishibuchi          2.070     0.038    0.003    0.0055
8   cAnt-Miner vs Ishibuchi   1.935     0.053    0.003    0.006
9   GAssist vs cAnt-Miner     1.358     0.174    0.003    0.007
10  XCS vs GAssist            1.222     0.222    0.003    0.008
11  UCS vs cAnt-Miner         0.713     0.476    0.003    0.01
12  UCS vs GAssist            0.645     0.519    0.003    0.0125
13  XCS vs UCS                0.577     0.564    0.003    0.017
14  SLAVE vs Ishibuchi        0.170     0.865    0.003    0.025
15  XCS vs cAnt-Miner         0.136     0.892    0.003    0.05
The last two columns give the critical values of the tests used. If the p-value is less than or equal to the critical value, the null hypothesis is rejected for the corresponding test. It can be seen that the results of GAssist are statistically significant compared with SLAVE and Ishibuchi – hence the null hypothesis can be rejected for these pairs – while nothing conclusive can be said about the other algorithms from the given results.

Pairwise Comparisons. As GAssist cannot be termed the best classifier against all the other classifiers on the basis of the last section, we now make pairwise comparisons to analyze the statistical differences between all the classifiers. Along with Holm's test, we use the pairwise counterpart of the Bonferroni-Dunn test, the Nemenyi test [31], for comparing all classifiers with each other. The Nemenyi test is more conservative than the Bonferroni-Dunn test, as it divides the significance level by the number of pairwise comparisons (k(k − 1)/2 instead of (k − 1)). The results in Table 3 show that the Nemenyi test rejects the hypothesis of GAssist against SLAVE and Ishibuchi, while Holm's method additionally allows us to reject the hypothesis for UCS vs SLAVE.

5.2 Effect of Evolutionary Algorithm

The statistical analysis provides deeper insight into the obtained results than simply averaging the classification accuracies, which is a raw measure of ranking the performance
of algorithms. We now present the role of the evolutionary learning paradigm in classifying biomedical datasets, based on the obtained results:

Pittsburgh-Style – GAssist. The results of our experiments show that GAssist – a Pittsburgh-style learning classifier – performs better than the other evolutionary rule-learning algorithms. The greater accuracy is a result of its superior fitness function, which combines the accuracy and complexity of an individual using the Minimum Description Length (MDL) principle to yield optimal rules [14].

Nature Inspired – cAnt-Miner. cAnt-Miner closely follows GAssist's policy of generating simpler rules. The ants generate rules by selecting attribute-value pairs on the basis of their entropy and pheromone values [32]. Consequently, cAnt-Miner uses only high-quality attributes (we model quality with information gain) in the formulation of its rules. Moreover, its pruning mechanism yields simpler and shorter rules, thereby achieving greater classification accuracy.

Michigan-Style – UCS and XCS. The Michigan-style learning classifiers – UCS and XCS – use online learning to evolve a set of condition-action rules from each training instance. Thus, they can be more useful in identifying hidden patterns and generating information-rich rules, compared with the simple and generic rules of GAssist and cAnt-Miner. We therefore suggest that, if medical experts are available to refine the rules, a Michigan-style classifier can prove useful for knowledge extraction.

Genetic Fuzzy – SLAVE and Ishibuchi. The results show that the genetic fuzzy rule learning classifiers are generally not suitable for the classification of biomedical datasets. The fuzzy rules so generated, however, can be particularly useful for evaluating the uncertainty associated with a prognosis.

5.3 Effect of Nature of Dataset

A careful inspection of the results in Table 1 enables the reader to draw an important conclusion: the variance in accuracy of different classifiers on a particular dataset is significantly smaller than the variance in accuracy of the same classifier on different datasets. The statement holds for more than 25 datasets, with the notable exceptions being Dermatology, Splice-Junction Gene Sequence and Promoters Gene Sequence. Consequently, we can say that accuracy is strongly dependent on the nature of the biomedical dataset. We now discuss the important factors that determine the net classification potential of a dataset.

Role of Multiple Classes. It can be inferred from Table 1 that, for multi-class problems, UCS gives significantly better accuracy than the other classifiers. The reason is that it evolves in the correct set only those highly-rewarded classifiers of the match set that predict the same class as the training example [33]. In comparison, GAssist has serious problems in dealing with multi-class problems – especially when the number of output classes is more than 5. On these datasets, the average accuracy of UCS is 83.49%, compared with 75.52% for GAssist.
Role of Instances. It is obvious from Table 1 that the evolutionary algorithms over-fit on datasets with a small number of instances. Consequently, the accuracy of the classifiers on the Lung Cancer, Post Operative Patient, Promoters Gene Sequence and Switzerland Heart datasets degrades severely. We argue that, during training, the classifiers create small disjuncts with rare cases [34]; as a result, their accuracy significantly degrades during testing.

Role of Attributes. The attributes of a dataset vary in three aspects: (1) number, (2) type (continuous, binary and nominal), and (3) quality. We see in Table 1 that the number and type of attributes have little role in defining the classification potential of a dataset. The very poor performance of XCS on the Splice-Junction Gene Sequence, Promoters Genes Sequence and Lung Cancer datasets came as a surprise to us. Our analysis reveals that the large number of nominal attributes in these datasets – 61, 58 and 56, respectively – is the main cause of their poor performance with XCS. Our conclusion is that XCS is unable to cater for a large number of nominal attributes in a dataset. Recall that we quantify the quality of attributes with information gain. The graph in Figure 1 clearly shows that classification accuracy increases with an increase in the information gain of a dataset's attributes.
Fig. 1. Average Information Gain vs Classification Accuracy
Role of Missing Values. Missing or incomplete data degrade the accuracy of learning algorithms. Therefore, a number of methods – Wild-to-Wild, the mean or mode method, random assignment, the InGrimputation model, listwise deletion, etc. – have been proposed for imputation to increase the accuracy of a classifier. Figure 2 reveals that GAssist is relatively more resilient to missing values than the other algorithms: GAssist replaces a missing value of a real-valued attribute with the mean of its class, and a missing value of a nominal attribute with its mode.

Role of Imbalanced Classes. A learning algorithm may develop a bias towards the majority class during classification. However, Figure 3 shows that the net accuracy of the evolutionary classifiers remains unaffected even on datasets with high imbalance ratios.
Fig. 2. Missing Values vs Classification Accuracy
Fig. 3. Class Imbalance (Log Scale) vs Classification Accuracy
Role of Noise. The results in Table 1 show that the classification potential of a dataset is inversely proportional to its level of noise. Consequently, the accuracy on noisy datasets is very low (see Figure 4). GAssist shows more resilience to noise because of its added generalization pressure with bloat control based on the MDL principle. The MDL principle forces GAssist to reduce the size and length of its individuals; in short, its 'simple' evolution policy makes it resilient to noise.

5.4 Combined Effect of Nature of Dataset

Our facet-wise study of the dataset parameters shows that noise, information gain and missing values play a significant role in defining the classification accuracy of an algorithm, while the imbalance ratio does not dominate the resulting accuracy. We now summarize our findings in Figure 5 for a better understanding of the combined effect of the complexity parameters. It is obvious from Figure 5 that the noise in a dataset effectively determines the classification accuracy. A high average information gain of a dataset yields better classification accuracy, while the percentage of missing values in a dataset has a minor impact on the accuracy.
Fig. 4. Noise vs Classification Accuracy
Fig. 5. Relationship between Classification Accuracy and Nature of Dataset: x-axis contains biomedical datasets in increasing order of their classification accuracies; y-axis contains normalized parameters of datasets, 1 - Average Information Gain, 2 - Missing Values, 3 - Noise, 4 - Classification Accuracy
Meta-Model for the Classification Potential of a Dataset. In this section, we apply our meta-model framework [35] to obtain a measure of the classification potential of a dataset based on the complexity parameters. We create a meta-dataset comprising three attributes for the complexity parameters: average information gain, missing values, and noise.
We categorize the output class – the classification potential – into three classes based on the classification accuracy: good (greater than 0.8), satisfactory (0.6–0.8) and bad (less than 0.6). The interesting patterns lying in this meta-dataset are extracted using two classifiers: (1) GAssist, which gives good classification results, and (2) Boosted J48 [22], to compare the results with a well-known non-evolutionary algorithm.

Classification Rules of GAssist
0: Noise is [>0.667] | bad
1: MissingValues is [>0.905] | bad
2: MissingValues is [<0.125] | Noise is [>0.145] | satisfactory
3: Noise is [>0.287] | satisfactory
4: AvgInfoGain is [<0.29] | MissingValues is [>0.6] | bad
5: Default rule -> good

Decision Tree of J48
Noise <= 0.26297
| MissingValues <= 0.016387
| | AvgInfoGain <= 0.65192
| | | AvgInfoGain <= 0.059957: good
| | | AvgInfoGain > 0.059957: satisfactory
| | AvgInfoGain > 0.65192: good
| MissingValues > 0.016387: good
Noise > 0.26297
| MissingValues <= 0.002235: satisfactory
| MissingValues > 0.002235: bad

The classification rules generated by both classifiers support our thesis that a noise level greater than about 0.25 severely degrades the classification potential of a dataset. As expected, GAssist is able to generate more generic and comprehensible rules; for example, if the noise level is above 0.667, the classification potential is bad irrespective of the other parameters. The knowledge extracted by the two algorithms provides the same generalization. Hence, our proposed meta-model can be effectively used to determine the true classification potential of a biomedical dataset. We believe this can prove to be a very effective tool for analyzing the inherent complexities of a dataset and its needs for pre-processing.
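A minimal sketch of how such a meta-dataset can be assembled is given below. The tuple layout and the example row are purely illustrative, and the handling of the exact 0.6 and 0.8 boundaries is our assumption, since the paper gives only the open ranges.

def classification_potential(accuracy):
    """Discretize a mean classification accuracy (as a fraction) into the
    three output classes of the meta-model."""
    if accuracy > 0.8:
        return "good"
    if accuracy >= 0.6:       # boundary handling assumed
        return "satisfactory"
    return "bad"

# One meta-instance per dataset: the three complexity attributes plus the
# class derived from the mean accuracy; the row below is illustrative only.
dataset_statistics = [
    # (avg_info_gain, missing_values, noise, mean_accuracy)
    (0.30, 0.02, 0.10, 0.85),
]
meta_dataset = [(g, m, n, classification_potential(acc))
                for (g, m, n, acc) in dataset_statistics]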
6 Conclusion

In this paper, we have quantified the complexity of biomedical datasets in terms of missing values, noise, imbalance ratio and information gain. The effect of this complexity on classification accuracy is evaluated using six well-known evolutionary rule learning algorithms. The results of our experiments show that GAssist – on most of the datasets – provides better classification accuracy than the other algorithms. Our analysis reveals that the classification accuracy on a biomedical dataset is, however, a function
of the nature of the biomedical dataset rather than the choice of a particular evolutionary learner. The major contribution of this paper is a unique methodology to determine the classification potential of a dataset using a meta-model framework. In the future, we would like to present the generated rules of the different classifiers to medical experts for their feedback.
Acknowledgements

The authors of this paper are supported, in part, by the National ICT R&D Fund, Ministry of Information Technology, Government of Pakistan. The information, data, comments, and views detailed herein may not necessarily reflect the endorsements or views of the National ICT R&D Fund.
References

1. Pena-Reyes, C.A., Sipper, M.: Evolutionary computation in medicine: an overview. Journal of Artificial Intelligence in Medicine 19(1), 1–23 (2000)
2. Wong, M.L., Lam, W., Leung, K.S., Ngan, P.S., Cheng, J.C.V.: Discovering knowledge from medical databases using evolutionary algorithms. IEEE Engineering in Medicine and Biology 19(4), 45–55 (2000)
3. Holmes, J.H.: Learning classifier systems applied to knowledge discovery in clinical research databases. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp. 243–261. Springer, Heidelberg (2001)
4. Bernadó-Mansilla, E.: Domain of competence of XCS classifier system in complexity measurement space. IEEE Transactions on Evolutionary Computation 9(1), 82–104 (2005)
5. Kharbat, F., Bull, L., Odeh, M.: Mining breast cancer data with XCS. In: Genetic and Evolutionary Computation Conference (GECCO), UK, pp. 2066–2073 (2007)
6. Orriols-Puig, A., Bernadó-Mansilla, E.: Evolutionary rule-based systems for imbalanced data sets. Soft Computing – A Fusion of Foundations, Methodologies and Applications 13(3), 213–225 (2009)
7. Bacardit, J., Butz, M.V.: Data mining in learning classifier systems: comparing XCS with GAssist. In: Kovacs, T., Llorà, X., Takadama, K., Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2003. LNCS (LNAI), vol. 4399, pp. 282–290. Springer, Heidelberg (2007)
8. Bernadó, E., Llorà, X., Garrell, J.M.: XCS and GALE: a comparative study of two learning classifier systems with six other learning algorithms on classification tasks. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2001. LNCS (LNAI), vol. 2321, pp. 115–132. Springer, Heidelberg (2002)
9. Parpinelli, R.S., Lopes, H.S., Freitas, A.A.: An ant colony based system for data mining: applications to medical data. In: Int. Conf. on Knowledge Discovery and Data Mining, Boston, pp. 55–62 (2000)
10. Galea, M., Shen, Q., Levine, J.: Evolutionary approaches to fuzzy modelling for classification. Knowledge Engineering Review 19(1), 27–59 (2004)
11. Tanwani, A.K., Afridi, J., Shafiq, M.Z., Farooq, M.: Guidelines to select machine learning scheme for classification of biomedical datasets. In: Pizzuti, C., Ritchie, M.D., Giacobini, M. (eds.) EvoBIO 2009. LNCS, vol. 5483, pp. 128–139. Springer, Heidelberg (2009)
12. Butz, M.V., Kovacs, T., Lanzi, P.L., Wilson, S.W.: Toward a theory of generalization and learning in XCS. IEEE Transactions on Evolutionary Computation 8(1), 28–46 (2004)
13. Bernadó-Mansilla, E., Garrell-Guiu, J.M.: Accuracy-based learning classifier systems: models, analysis and applications to classification tasks. Evolutionary Computation 11(3), 209–238 (2006)
14. Bacardit, J., Garrell, J.M.: Bloat control and generalization pressure using the minimum description length principle for a Pittsburgh approach learning classifier system. In: Kovacs, T., Llorà, X., Takadama, K., Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2003. LNCS (LNAI), vol. 4399, pp. 59–79. Springer, Heidelberg (2007)
15. Otero, F.E.B., Freitas, A.A., Johnson, C.J.: cAnt-Miner: an ant colony classification algorithm to cope with continuous attributes. In: Ant Colony Optimization and Swarm Intelligence, Belgium, pp. 48–59 (2008)
16. Gonzalez, A., Perez, R.: SLAVE: a genetic learning system based on an iterative approach. IEEE Transactions on Fuzzy Systems 7(2), 176–191 (1999)
17. Ishibuchi, H., Nakashima, T., Murata, T.: Performance evaluation of fuzzy classifier systems for multidimensional pattern classification problems. IEEE Transactions on Systems, Man, and Cybernetics 29(5), 601–618 (1999)
18. Fawcett, T.: ROC graphs: notes and practical considerations for researchers. TR HPL-2003-4, HP Labs, USA (2004)
19. UCI repository of machine learning databases, University of California-Irvine, Department of Information and Computer Science, www.ics.uci.edu/~mlearn/MLRepository.html (last accessed: June 25, 2010)
20. Zhu, X., Wu, X.: Class noise vs. attribute noise: a quantitative study of their impacts. Artificial Intelligence Review 22(3), 177–210 (2004)
21. Brodley, C.E., Friedl, M.A.: Identifying mislabeled training data. Journal of Artificial Intelligence Research 11, 131–167 (1999)
22. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
23. Otero, F.E.B.: Ant Colony Optimization Framework, MYRA, http://sourceforge.net/projects/myra/ (last accessed: June 27, 2010)
24. Alcala-Fdez, J., Sanchez, L., Garcia, S., del Jesus, M.J., Ventura, S., Garrell, J.M., Otero, J., Romero, C., Bacardit, J., Rivas, V.M., Fernandez, J.C., Herrera, F.: KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Computing 13, 307–318 (2008)
25. Demsar, J.: Statistical comparisons of classifiers over multiple datasets. Journal of Machine Learning Research 7, 1–30 (2006)
26. Garcia, S., Herrera, F.: An extension on "Statistical comparisons of classifiers over multiple datasets" for all pairwise comparisons. Journal of Machine Learning Research 9, 2677–2694 (2008)
27. Friedman, M.: A comparison of alternative tests of significance for the problem of m rankings. Annals of Mathematical Statistics 11, 86–92 (1940)
28. Iman, R.L., Davenport, J.M.: Approximations of the critical region of the Friedman statistic. Communications in Statistics, 571–595 (1980)
29. Dunn, O.J.: Multiple comparisons among means. Journal of the American Statistical Association 56, 52–64 (1961)
30. Holm, S.: A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6, 65–70 (1979)
31. Nemenyi, P.B.: Distribution-free multiple comparisons. PhD Thesis, Princeton University (1963)
32. Parpinelli, R.S., Lopes, H.S., Freitas, A.A.: Data mining with an ant colony optimization algorithm. IEEE Transactions on Evolutionary Computation 6(4), 321–332 (2002)
33. Orriols-Puig, A., Bernadó-Mansilla, E.: Revisiting UCS: description, fitness sharing and comparison with XCS. In: Bacardit, J., Bernadó-Mansilla, E., Butz, M.V., Kovacs, T., Llorà, X., Takadama, K. (eds.) IWLCS 2006 and IWLCS 2007. LNCS (LNAI), vol. 4998, pp. 96–116. Springer, Heidelberg (2008)
34. Jo, T., Japkowicz, N.: Class imbalances versus small disjuncts. SIGKDD Explorations Newsletter 6(1), 40–49 (2004)
35. Tanwani, A.K., Farooq, M.: The role of biomedical dataset in classification. In: Combi, C., Shahar, Y., Abu-Hanna, A. (eds.) Artificial Intelligence in Medicine. LNCS (LNAI), vol. 5651, pp. 370–374. Springer, Heidelberg (2009)
Supply Chain Management Sales Using XCSR

María Franco, Ivette Martínez, and Celso Gorrin

Departamento de Computación y Tecnología de la Información, Universidad Simón Bolívar, Caracas, Venezuela
[email protected], [email protected]
Abstract. The Trading Agent Competition in its Supply Chain Management category (TAC SCM) is an international forum where teams develop agents that control a computer assembly company in a simulated environment. TAC SCM involves the following problems: determining when to send offers, deciding the final sales prices of the goods offered, and planning the factory and delivery schedules. In this work, we developed a TAC SCM agent called TicTACtoe that uses Wilson's XCSR classifier system to decide the final sales prices. In addition, we developed an adaptation of this classifier system, which we call the blocking classifiers technique, that allows the use of XCSR in environments with single-step tasks and delayed rewards. Our results show that XCSR generates a set of rules that solves the TAC SCM sales problem in a satisfactory way. Moreover, we found that the blocking mechanism improves the performance of the agent in the TAC SCM scenario.
1 Introduction
Supply chain management embodies the management of all the processes and information that move along the supply chain, from the supplier to the manufacturer right through to the retailer and the final customer. Nowadays, supply chain management is one of the most important industrial activities, and planning the activities throughout the supply chain is vital to the competitiveness of manufacturing enterprises. According to [6], "while today's supply chains are essentially static, relying on long-term relationships among key trading partners, more flexible and dynamic practices offer the prospect of better matches between suppliers and customers as market conditions change". The Trading Agent Competition in Supply Chain Management (TAC SCM) [6] was designed to expose the participants to the typical challenges presented by a dynamic supply chain. These challenges include competing for the components provided by the suppliers, managing the inventory, transforming components into final products and competing for the customers. These problems can be classified into three main problems: purchases, production and sales. Pardoe and Stone experimented with applying different learning techniques to the sales decisions of TAC SCM agents [11]. One of their main conclusions was that winning offers in TAC SCM is a very complex problem because the winning prices
may vary very quickly. Therefore, that work affirms that taking decisions based on previous states of the current game is inaccurate, while using information taken from many previous games shows better results. The goal of this work is to present an approach to the TAC SCM problem using an evolutionary reinforcement learning system. We specifically use XCSR to solve one of the most important sales problems: pricing the products in order to compete in the market and, at the same time, maximize the profit.
2 TAC SCM
The TAC SCM competition [6] was designed by a team of researchers from the e-Supply Chain Management Lab at Carnegie Mellon University in collaboration with the Swedish Institute of Computer Science (SICS). In this contest, each team has to develop an intelligent agent capable of handling the main supply chain management problems (which orders to accept, deciding the sale prices of products, competing in the market, among others). Agents compete against each other in a simulation that lasts 220 days and includes customers and suppliers to deal with. The main goal of the competitors is to maximize the final profit by selling assembled computers to the customers. The profit of an agent is calculated by subtracting the production costs from the incomes. This profit is reflected in the amount of money the agents have at the end of the game, which determines which agent is the winner. Each TAC SCM simulation has three actors: customers who buy computers, manufacturers (agents) who produce and sell computers, and suppliers who provide the unassembled components to the manufacturers. A detailed description of these actors can be found in [6]. At the beginning of each day, the agent receives "requests for quotes" (also known as RFQs) from the customers. Afterwards, the agent decides which RFQs should be accepted and what the final offer prices should be. After sending the offers, the agent waits for the orders from the customers; only the best-priced offers are accepted and turn into orders. If the agent receives an order, it decides when to produce and deliver it and, even more important, how many components it should buy to meet the production schedules. In order to buy the components, the agent sends the suppliers RFQs for the spare parts. In response, the suppliers send offers to the agent, who has to decide whether or not to accept them. Each team competing in a TAC SCM game should develop a manufacturer agent that deals with the main decisions of the supply chain management: how many components to buy, when to produce an order and which RFQs to accept. Moreover, when accepting an RFQ, the agent should decide the final price for these goods. In this work, these three problems will be referred to as the purchase problem, the production problem and the sales problem. More than 30 agents participate in this competition each year. Among the most successful solutions to the TAC SCM problem are: TacTex-06,
PhantAgent and CMieux. TacTex-06 [12] is an agent that uses, within its purchase strategy, a prediction model trained with the Additive Regression with Decision Stumps algorithm [17]. In addition, this agent also uses another prediction model for the sales strategy, based on the idea that the winning prices follow a normal distribution. Another interesting approach to the problem is presented by PhantAgent [13], which uses heuristics to solve the purchase and sales problems. Furthermore, the CMieux agent [2] uses a forecasting module that predicts the sales prices of the components and products for the following days. However, the code for these solutions was not available at the time, and this situation encouraged us to create our solution to the problem from scratch. In our solution, we address the sales problem using an evolutionary reinforcement learning technique. The other problems are solved using simple static strategies in order to evaluate the impact of the learning system on the sales problem. Our approach to the problem is explained in Section 4.
3 XCSR
XCS is a Michigan-style learning classifier system first described by Wilson [14]. This system is based on the work proposed by Holland [9], but uses the accuracy instead of the payoff as the measure of "goodness" of a classifier. In our implementation we used XCSR [15], a version of XCS that accepts real numbers as inputs. To do this, each feature in the condition is represented by a lower and an upper bound, while the action remains discrete. We decided to use this approach because all the inputs of the decision we wanted to make were real-valued and the decisive thresholds needed to be found dynamically. We also chose XCSR because the rule system can constantly adapt to new environments using a fixed rate of exploration [3], and because the rules it generates are interpretable by human beings [10].
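The following sketch illustrates the interval-based representation that distinguishes XCSR from the ternary-alphabet XCS: each condition allele is a real-valued (lower, upper) pair, while the action stays discrete. The field names and default parameter values are ours, not those of the library described in Section 6.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class XCSRClassifier:
    condition: List[Tuple[float, float]]  # one (lower, upper) bound per feature
    action: int                           # discrete action index
    prediction: float = 0.0               # payoff prediction
    error: float = 0.0                    # prediction error
    fitness: float = 0.01                 # accuracy-based fitness

    def matches(self, state: List[float]) -> bool:
        """A classifier matches when every input lies inside its interval."""
        return all(lo <= x <= hi
                   for (lo, hi), x in zip(self.condition, state))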
4 TicTACtoe
TicTACtoe is our approach to the TAC SCM problem. TicTACtoe has three modules: Purchase, Production and Sales (see Figure 1). Each module manages one of the sub-problems of the supply chain management. Every module takes its own decisions using information taken from the environment and from the other modules. In the next subsections, we will focus on the details of these modules. In addition to these modules, we provided the agent with memory through an organizer structure. This structure keeps track of: orders scheduled for production, possible order commitments, actually produced orders (production schedules may vary due to the lack of components) and the possible future inventory (based on the component orders placed by the agent). This memory allows the agent to record the decisions taken each day
Fig. 1. TicTACtoe Architecture
and to consider events that will happen in the future, which are used to make further decisions.

4.1 Purchases
The purchase module is in charge of sending RFQs to the suppliers in order to buy the components necessary for production. This module has two tasks: (a) creating the RFQs to obtain the current component prices and (b) deciding which supplier offers to accept.

Suppliers RFQ creation. First, the agent calculates how many components are needed for production within the next ten days. These calculations are based on the current inventory, the orders scheduled for the next ten days and the component orders that have already been placed. The agent always sends the RFQ to its favourite supplier for that particular component, which is the one who has given the best prices lately; there is only one favourite supplier for each component. However, the agent also asks the other suppliers for their current prices in order to update the favourite supplier if necessary. The favourite supplier is preferred in order to get lower prices. This is based on the assumption that the state of a supplier does not change drastically: if a supplier gives an agent the best price, it will probably continue to give good prices for some time.

Accepting the offers. When the agent asks for components, the suppliers might not be able to comply with the agent's requirements. When a supplier is not able to deliver the products the agent asks for, it sends two types of adjusted offers instead: offers that vary the quantity and offers with a later due date. If this happens, the priority of TicTACtoe is to accept the complete offers first and then the ones that vary the quantity. Once an order is set, the agent adds a record of the components' arrival to calculate the future inventory.
Furthermore, the agent keeps a historical record of the base price of each component. The component base price is calculated every day as a weighted average, as shown in equation (1):

    P_d^c = S_d^c · w + P_{d−1}^c · (1 − w)    (1)

where P_d^c is the base price of component c on day d, S_d^c is the supplier's price for component c on day d, and w is a weighting constant.
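As a one-line sketch of the update of equation (1); the value of the weight w is not reported in the paper, so 0.5 below is only a placeholder.

def update_base_price(prev_base_price, supplier_price, w=0.5):
    """Daily weighted-average update of a component's base price, Eq. (1).
    w is the weighting constant; 0.5 is an illustrative placeholder."""
    return supplier_price * w + prev_base_price * (1.0 - w)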
4.2 Production
The production module is in charge of scheduling the production of the active orders (the orders that are waiting to be elaborated and delivered). This module prioritizes the active orders with earlier due dates and higher penalties (in case the orders are behind schedule). The agent loops over the orders, checking if there is enough inventory of products to deliver them; if so, the order is delivered. This strategy is used by PhantAgent [13] to avoid extra storage charges. In case there are not enough components to deliver an order, the agent verifies whether the order is beyond the latest possible delivery day (which is determined when the customer sends the RFQ). If it is already too late, the customer will not receive the order anymore, so the agent cancels it and frees all the associated components and products in order to be able to use them to fulfill other orders. If there are not enough products to fulfill the order but the customer can still wait for it, the agent tries to produce it. To produce an order scheduled for a specific day, the agent checks if there are enough components. When there are not enough components to produce the desired quantity, the agent produces the maximum quantity allowed. If the agent cannot produce an order completely, it continues producing it the next day. At the end of the day, the production module determines the number of late orders and the number of active orders. This information is used by the sales module to adjust the quantity of free cycles the agent can offer, which forces the agent to save cycles for the production of late orders.

4.3 Sales
The sales module is in charge of pricing the products and dealing with the customers. Every day, this module checks the customer RFQs and sends offers to the ones that meet the following characteristics: (a) a reserve price higher than the product's base price and (b) a due date earlier than the end of the simulation. The agent calculates the base price of a product as the sum of the estimated prices of all its spare parts; this estimates how profitable the order would be. Afterwards, the agent uses the set of rules generated by the XCSR to determine the discount factor over the reserve price of each RFQ. The reserve price is the maximum price a customer is willing to pay for an order, and the agent that offers the lowest price wins the bid. The implementation of the XCSR will be explained in greater detail in Section 5.
The final offer price is determined by equation (2), where BasePrice is the calculated cost of the product based on recent experiences, ReservePrice is the reference price determined by the customer and d is the discount factor determined by the XCSR:

    OfferPrice = BasePrice + Revenue · (1 − d)    (2)

    Revenue = ReservePrice − BasePrice    (3)
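A direct transcription of equations (2) and (3) as a sketch; the discount d is one of the ten XCSR actions described in Section 5.

def offer_price(base_price, reserve_price, discount):
    """Final offer price from Eqs. (2)-(3); discount is the factor d chosen
    by the XCSR, one of 0.0, 0.1, ..., 0.9."""
    revenue = reserve_price - base_price            # Eq. (3)
    return base_price + revenue * (1.0 - discount)  # Eq. (2)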
Once the agent calculates the offer price for each RFQ, a production schedule is generated that includes these possible orders; this helps the agent to calculate how many free cycles are left for the production of further orders. The orders that involve higher revenues have higher priority. In order to save production cycles for future orders that would need to be delivered earlier, the agent always tries to produce an order as late as possible according to its due date. This strategy is very similar to the one used in [12]. If there are no free cycles, the agent checks the inventory to see if there are enough products to deliver these orders the next day. If none of these options is possible, the less profitable RFQs are discarded. Moreover, the daily free cycles are multiplied by a factor between 0 and 1, inversely proportional to the quantity of late orders the agent has. This helps the agent to get back on schedule, by leaving some cycles for the production of late orders. Our agent remembers all the placed offers as possible commitments. However, customers only accept the best-priced offers. In case a customer rejects an offer, the commitment is removed and all the associated components and cycles are released.
5 XCSR Inside TicTACtoe
One of the most important decisions in supply chain management is the final price of the products. This price should be low enough to win the order and, at the same time, high enough to maximize the agent's profit. The decision taken by the XCSR is the final price discount the agent should offer to win the bid. This decision is taken inside the Sales Module by accessing the XCSR library through two methods. The first one introduces the current state of the environment, finds the match set and the action that should take effect; moreover, it associates the action set with the RFQ, in order to reward it later. The second one rewards the action set and saves the error information to compute further population statistics.

5.1 Classifiers Structure
In the following sections, we explain the structure used to represent the TAC SCM sales problem using real inputs and discrete actions.
Condition. There are simulation values known by the agent that provide important information for its future decisions. Including all these values in the classifier structure decreases the efficiency of the GA in terms of execution time. To avoid this, we selected the most important features for the decision we wanted to make. Preliminary experiments showed that the most suitable features for the classifier are:

x1 Rate of late orders (the active orders that are producing penalties because they are going to be delivered after their due date) over the total of active orders. This determines how much work is late and how convenient it is to make a good offer when the agent is already behind schedule:

    x1 = lateOrders / totalOrders    (4)

x2 Rate of the factory cycles that remained unused the day before. This helps the agent determine whether it should raise or lower the price discount. For example, if the factory is full, the agent should give low discounts in order to try to finish its active orders before getting new ones:

    x2 = freeFactoryCapacity / totalFactoryCapacity    (5)

x3 Rate of the base price over the reserve price indicated by the customer. This represents how profitable an order would be. The agent discards the cases in which the base price is higher than the reserve price:

    x3 = basePrice / reservePrice    (6)

x4 The number of days between the current date and the day the order should be delivered. This indicates how much time the agent has to produce and deliver an order. This value is scaled between 0 and 1, considering that the due dates are, at most, 12 days after the current date:

    x4 = (dueDate − day) / 12    (7)

x5 The current day of the simulation, normalized by the maximum number of days a game has. This value is very important because different situations arise as the days go by. For example, in the middle of a simulation components start to become scarce and their prices start rising. This feature helps the agent to identify the different stages of the simulation that require specific behaviours:

    x5 = day / 220    (8)

All the features are normalized between 0 and 1 so that these values can be used as upper and lower bounds. This aspect will be better explained in Section 6.1.
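Collecting equations (4)-(8), the agent's state can be encoded as the following feature vector; the argument names are ours.

def sales_features(late_orders, total_orders, free_cycles, total_cycles,
                   base_price, reserve_price, due_date, day):
    """Feature vector x1..x5 of Eqs. (4)-(8), each normalized to [0, 1]."""
    return [
        late_orders / total_orders,     # x1, Eq. (4)
        free_cycles / total_cycles,     # x2, Eq. (5)
        base_price / reserve_price,     # x3, Eq. (6)
        (due_date - day) / 12.0,        # x4, Eq. (7)
        day / 220.0,                    # x5, Eq. (8)
    ]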
Action. Our implementation of XCSR has 10 actions that represent the different discounts over the possible revenue. The revenue is computed as the difference between the base price and the reserve price determined by the customers. The different discounts go from 0% to 90% in 10% steps.

Reward. The reward is determined by the profit obtained through an RFQ, scaled by the amount of money that its fabrication implied when the agent sent the offer. There are three different scenarios in which the agent can reward an action set.

When the offer is not accepted by the customer. In this case the RFQ did not make the agent earn or lose any money, and the reward is zero.

When the order is delivered. In this case we consider the money earned by the sale and the money lost because of the penalty (if the order was delivered late). The profit and the loss are scaled by the investment of the agent, which is calculated based on the base price of the product. The reward in this case is calculated using equation (9):

    reward = (profit)² − (loss)²    (9)

where

    profit = offeredPrice / basePrice    (10)

    loss = max(day − duedate, 0) · penalty / (basePrice · quantity)    (11)

When we scale the profit and the loss using the base price, we obtain a percentage value of the money earned with the order. We could think that a good approximation for the reward function is to subtract the expenses from the profit; but the net earnings are not the same for the different products, since products produce different earnings depending on their production costs. If we used the earnings as the reward, the rules that obtained the highest rewards would be only the ones that sell the most expensive products. However, we want to learn how to sell the different types of products, not only the most expensive ones. Therefore, it is more appropriate to reward a classifier based on the profit margin.

When the order is canceled without being delivered. In this case the agent did not produce the order on time, so the order only produced losses for the agent, due to the penalties the agent had to pay to the customer. The money invested in the production is not considered an expense, because the products can be used to fulfill another order. In this situation, the reward is calculated by equation (12), which is very similar to equation (9) with the term corresponding to the profit eliminated:

    reward = −(loss)²    (12)

In equations (9) and (12), the terms corresponding to the profit and the loss are squared in order to give the classifier a stronger reward when these quantities are significantly greater.
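The three reward cases of equations (9)-(12) can be combined into a single sketch; the `outcome` labels are ours.

def reward(offered_price, base_price, quantity, penalty,
           day, due_date, outcome):
    """Reward of Eqs. (9)-(12) for the three possible outcomes of an offer."""
    if outcome == "rejected":
        return 0.0                                    # offer not accepted
    loss = max(day - due_date, 0) * penalty / (base_price * quantity)  # Eq. (11)
    if outcome == "cancelled":
        return -(loss ** 2)                           # Eq. (12)
    profit = offered_price / base_price               # Eq. (10)
    return profit ** 2 - loss ** 2                    # Eq. (9), delivered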
6 Implementation Details
The XCSR library implementation is based on Butz's XCS library (version 1.0) [4]. We adapted this library to XCSR using a lower and upper bound notation, as in XCSI [16], but allowing real values for the bounds. In the following subsections, we explain some characteristics of the system relevant to our implementation.

6.1 Don't Care
The don't care in our library was implemented as the absence of a lower or upper bound, depending on the allele we wanted to modify. To implement this don't care, we had to place a restriction on the data: all the features must be bounded between 0 and 1. Setting a don't care in an allele is equivalent to setting it to 0 or 1, depending on whether it is a lower or an upper bound. In this way, we open the range to its maximum limit so that the allele matches all states.
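As a minimal sketch (our own naming, not the library's API), the mechanism amounts to widening an interval to the full [0,1] range:

```python
# Sketch of the don't-care mechanism: a condition is a list of (lower, upper)
# intervals over features normalized to [0, 1]. Names are illustrative.
def set_dont_care(condition, i, bound):
    lower, upper = condition[i]
    if bound == "lower":
        condition[i] = (0.0, upper)   # absent lower bound ~ lower = 0
    else:
        condition[i] = (lower, 1.0)   # absent upper bound ~ upper = 1

def matches(condition, state):
    # the widened interval accepts every normalized feature value
    return all(l <= x <= u for (l, u), x in zip(condition, state))
```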
6.2 Classifier Subsumption
Since Butz's library was oriented to boolean features, we had to implement subsumption rules adapted to our classifier structure, where all the features are bounded between 0 and 1. The rules used were the same as those used by Wilson in [16], where a classifier is more general than another if all the ranges of the first classifier contain those of the second. For example, (li, ui) subsumes (lj, uj) if ui > uj ∧ li < lj. The actions of the two classifiers must be the same for subsumption to occur.
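A short sketch of this generality test follows; it uses the strict inequalities stated in the text (some XCSI-style variants use non-strict comparisons), and the classifier attributes are our own illustrative names:

```python
# Sketch of the interval-based generality test described above.
def is_more_general(cond_i, cond_j):
    # every range of cond_i strictly contains the corresponding range of cond_j
    return all(li < lj and ui > uj
               for (li, ui), (lj, uj) in zip(cond_i, cond_j))

def does_subsume(cl_i, cl_j):
    # actions must match, as required for subsumption to occur
    return cl_i.action == cl_j.action and is_more_general(cl_i.condition, cl_j.condition)
```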
6.3 Crossover Operators
We implemented a restricted two-point crossover operator between conditional ranges that generates new individuals with valid ranges. This means that only the points between an upper bound and the next lower bound can be chosen as cut points. This crossover operator is equivalent to the boolean two-point crossover operator, because it only crosses whole conditions. The ranges of the new individuals are always valid, because they are combinations of the parents' ranges.
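A sketch of this operator, under our assumption that a condition is a list of (lower, upper) pairs so that cut points can only fall on interval boundaries:

```python
import random

# Sketch of the restricted two-point crossover: cut points land only between
# whole (lower, upper) intervals, so every exchanged range stays valid.
def restricted_two_point_crossover(parent1, parent2):
    n = len(parent1)                      # number of (lower, upper) conditions
    a, b = sorted(random.sample(range(n + 1), 2))
    child1 = parent1[:a] + parent2[a:b] + parent1[b:]
    child2 = parent2[:a] + parent1[a:b] + parent2[b:]
    return child1, child2
```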
6.4 Additional Adaptations
Additional adaptations were necessary to include XCSR in our TAC SCM agent due to the characteristics of the problem.

Blocking classifiers. In our classifier system, the reward of an action set is given based on the amount of money the agent wins or loses when making the corresponding offer. This value depends only on the discount given by the agent in the offer. This is the reason why the problem is modelled as a single-step problem. Nevertheless, the agent only knows the reward a few days after making the decision. This differs from the classic benchmark problems (e.g., the boolean multiplexer), in which the reward arrives immediately after applying an action.
Considering the delayed reward, it is necessary to save the action set associated with the order, so that these classifiers are given a reward when the agent gets the final result. Since we are interested in continuing to learn while a classifier waits for its reward, classifiers are used in multiple learning iterations parallel to each other. This aspect of online learning, in addition to the delayed reward, presents a new problem. The problem occurs when a classifier that is waiting for a reward is selected for deletion or subsumption. Since these mechanisms can be executed by any learning iteration, they could erase this classifier based on information that is not up to date. Consequently, the knowledge represented by this classifier and its upcoming rewards would be lost. In order to avoid deleting classifiers that are expecting a reward based on incomplete information, we implemented a simple counting semaphore. Each classifier has a counter that indicates the number of rewards it is expecting. A single classifier participates in many decisions each day and needs to wait for a reward for each one of them. Therefore, we only consider for deletion the classifiers that are not blocked, i.e., the ones whose counter is at zero. We also had to add another important restriction to the subsumption mechanism: a classifier cannot be subsumed if it is blocked, because its information is not sufficiently up to date to become part of another classifier. Blocked classifiers may participate in all the other mechanisms, such as crossover and mutation.

Dynamic population generation. The version of XCS used has a dynamic population generation method, in which the population starts empty. Each time the algorithm generates a match set, it inserts new classifiers into the population until all the actions are covered. In other words, the algorithm guarantees that there is at least one classifier for each possible action. If there is no classifier in the population for a specific action, covering is performed and the new classifier is inserted into the population. The advantage of this technique is that the population grows dynamically as the states occur in the experiment, covering the whole search space. On the other hand, the population has a limited size, and covering all the actions leads to the loss of old classifiers when inserting the new ones. Moreover, different groups of classifiers activate themselves at different stages of the simulation. Considering that blocking classifiers places restrictions on the deletion of the active individuals, this increases the probability of deleting classifiers that activate themselves at other stages of the simulation. In order to avoid deleting good inactive classifiers in advanced execution stages, we only run the population generation method until the population reaches its size limit. After that, when covering is necessary, we only generate one rule with a random action. However, we continue inserting and deleting individuals when applying the genetic algorithm over the action set.

Variable epsilon-greedy action selection policy. The base library alternates between exploitation and exploration, rewarding the classifiers only during exploration and taking learning statistics only during exploitation. In this problem, the classifier system learns while the agent competes in a simulation.
Since all the decisions taken by the XCSR affect the final result, regardless of whether they were determined by exploration or exploitation, we changed the algorithm in order to reward the classifiers in both cases. Considering the dynamic characteristics of the simulation, we decided to use an ε-greedy action selection policy. This consists of selecting the best possible action with probability 1 − ε and exploring the rest of the time. However, we made a slight modification so that ε starts at 1 and decreases linearly until it reaches a threshold. This forces the system to explore more at the beginning of the simulation and less by the end of it. When ε reaches the threshold, its value remains constant, allowing the agent to perform some exploration that facilitates its adaptation to changes in the simulation.
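The two mechanisms above can be sketched together. This is our reading of the description, not the authors' code; the threshold value and all names are illustrative assumptions.

```python
# Sketch of the blocking counter (counting semaphore) and the decaying
# epsilon schedule described above; names and defaults are illustrative.
class Classifier:
    def __init__(self, condition, action):
        self.condition = condition
        self.action = action
        self.pending_rewards = 0        # counting semaphore: rewards still expected

    def blocked(self):
        return self.pending_rewards > 0

def on_offer_sent(action_set):
    for cl in action_set:               # block until the order's outcome is known
        cl.pending_rewards += 1

def on_outcome_known(action_set, reward):
    for cl in action_set:
        cl.pending_rewards -= 1
        # ...standard XCS prediction/error/fitness updates with `reward`...

def deletion_candidates(population):
    # blocked classifiers may still cross over and mutate, but are never
    # deleted or subsumed while a reward is pending
    return [cl for cl in population if not cl.blocked()]

def epsilon(day, total_days=220, floor=0.7):
    # linear decay from 1 to a constant floor; floor = 1 - exploitation rate
    # (e.g., 0.7 for a 30% exploitation rate)
    return max(floor, 1.0 - day * (1.0 - floor) / total_days)
```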
7 Experiments and Results
We designed three experiments to test the effectiveness of the proposed mechanisms. First, we tested the performance of the XCSR against other, static solutions. Afterwards, we analysed the impact of the blocking classifiers technique and the application of different exploration and exploitation rates. During the execution of the experiments, each agent plays separately against five dummy agents (these agents come along with the TAC SCM library; they are used for testing purposes and use simple but coherent strategies to handle the different problems). The source code for these dummy agents can be found in [1]. In each experiment, the agents ran for 40 games. In the first game, the XCSR population is empty. At the end of each game, the population is saved, and at the beginning of the next game it is recovered and established as the initial population. All data presented in the following figures correspond to the last 25 games; the first 15 games were taken as the training stage. However, during these 25 games, the agent still performs some explorative actions due to the dynamic nature of the simulation (see Section 6.4). The length of each game is 220 simulation days. Each simulation day lasts 5 seconds (the standard parameters are 220 simulation days with a duration of 15 seconds each), considering that none of the agents needs more time to complete its daily actions. The performance of the agents was evaluated using two main performance measures: (a) Final result: the final amount of money in the agent's bank account. This indicates how much money the agent earned and how profitable its investments were. (b) Received orders: the number of orders placed by the customers. This value indicates the percentage of the market the agent served. It is directly linked to the decisions taken by the XCSR, because if the agent gives a better price, it receives more orders.
Over these two performance measures, we applied non-parametric tests to determine whether the differences between the agents were significant. Since the variables under study take integer values, we cannot assume normality. Therefore, we applied the Kruskal-Wallis test [7] to determine whether there are significant differences between the agents. After that, we used the Wilcoxon test [7] to perform pair-wise comparisons between the agents. Additional performance measures are also considered in some experiments to gain more insight into the performance of the agents: (a) Factory usage: the percentage of usage of the factory capacity. This value indicates how many factory cycles are used on average. It represents the productive capacity of the agents, which should be used to the maximum. (b) Penalties: the amount of money paid to the customers for late deliveries. This indicates how many late orders the agent had. (c) Interests: the amount of money paid to the bank for having a negative balance in the bank account. (d) Total income: the total amount of money earned by the agents without considering the losses. (e) Component costs: the amount of money spent buying components. (f) Storage costs: the amount of money spent storing components to be used in future production. The component costs, storage costs, penalties and interests are represented as percentages of the total revenue, while the final result and the total income are represented in US dollars. The combination of these measures with the main ones shows how effective the learning was, considering that we wanted to learn a discount strategy that maximizes the revenue of the agent by winning profitable and manageable orders. However, these performance measures are shown only in support of the two main measures; therefore, no statistical tests were performed over them. The parameters used in our implementation of XCSR for the calculation of price discounts are α = 0.1, β = 0.2, δ = 0.1, ν = 5, θGA = 25, ε0 = 10, θdel = 20, χ = 0.8, μ = 0.04, p# = 0.1, pI = 10.0, εI = 0, FI = 0.01, θsub = 20, θmna = 1, s0 = 0.05 and N = 1000. The meaning of these parameters is explained in [5]. Moreover, the sources of TicTACtoe can be found at http://www.gia.usb.ve/~maria/tictactoe.
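The test procedure can be sketched with SciPy. This is a hedged illustration with placeholder data; since the agents' samples are independent, we read the cited Wilcoxon test as its rank-sum form, which is an assumption on our part.

```python
import numpy as np
from scipy import stats

# Placeholder samples: one final-result value per evaluation game (25 games).
rng = np.random.default_rng(0)
results = {
    "L-TicTACtoe": rng.normal(1.77e7, 3.8e6, 25),
    "random":      rng.normal(1.02e7, 1.0e7, 25),
    "static":      rng.normal(-1.7e7, 1.8e7, 25),
}

# Kruskal-Wallis: is there any significant difference among the agents?
_, p_kruskal = stats.kruskal(*results.values())

# Pair-wise Wilcoxon rank-sum test between two agents.
_, p_pair = stats.ranksums(results["L-TicTACtoe"], results["random"])
```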
7.1 Experiment 1: TicTACtoe Performance
The goal of this experiment is to compare three different strategies to determine the price discount. These strategies are: learning using XCSR (L-TicTACtoe), Random and Static. All these strategies were tested using the base version of TicTACtoe. In this experiment, we also compare L-TicTACtoe with the dummy agent provided by the server. The learning version of TicTACtoe, L-TicTACtoe, uses an exploitation rate (1 − ε) of 30% and a population size of 1000 classifiers with the blocking mechanism. This configuration was the most favourable according to Sections 7.2 and
7.3. Moreover, preliminary experiments [8] showed that a population size of 1000 produces the best results for this problem. The other versions of TicTACtoe involved are Random and Static. The first one decides the price discount randomly, while the second one gives a discount on day d as follows:

discount(d) = 80% if freeFactoryCapacity(d−1) > 80%
            = 10% if freeFactoryCapacity(d−1) < 5%
            = 30% in all other cases

where freeFactoryCapacity(d−1) is the percentage of free factory capacity on simulation day d − 1.
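For concreteness, a direct transcription of this rule (the function name is ours):

```python
def static_discount(free_factory_capacity_prev):
    """Static pricing rule: percent discount on day d, given the free
    factory capacity (in percent) observed on day d-1."""
    if free_factory_capacity_prev > 80:
        return 80
    if free_factory_capacity_prev < 5:
        return 10
    return 30
```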
These "naive" rules try to avoid factory saturation by raising prices every time the number of free factory cycles goes below 5%, and try to attract customers when this value goes over 80%. Figure 2(a) shows the global performance of the agents. These results clearly show that L-TicTACtoe outperforms Random and Static, with an average result twice as high as Random's and four times higher than the dummy agent's. Considering that these agents only differ in their pricing strategy, it is evident that a change in this strategy affects the global performance of the agent.
[Figure 2: box plots of (a) final result (US$) and (b) number of orders for the agents L-TicTACtoe, dummy, static and random.]
Fig. 2. Comparison of the dummy agent and the TicTACtoe agent using different pricing strategies in terms of final result and received orders
Table 1 shows the p-values of the statistical comparisons among the agents. This table shows that L-TicTACtoe is significantly better than the other solutions presented. Moreover, the learning agent performs better than the Random agent in 99.9% of the cases, supporting the statements above. Even though we could expect L-TicTACtoe to manage more orders than the other agents, Figure 2(b) reveals that Random and Static win more offers. However, Table 2 indicates that Random and Static are delivering more orders late and are therefore incurring more penalties. These results show that the pricing strategy of these agents is less advantageous, because they commit to orders which they cannot deliver on time and hence are penalized. We can also observe in this table that Static gets negative interest. In other words, this agent had to pay the bank for having a negative balance in its bank
Table 1. Statistical comparison of the TicTACtoe agent using different pricing strategies. Column Kt shows the p-value for the Kruskal-Wallis test and the Wilcox. test columns show the p-values for the Wilcoxon tests.

Final Result
Agent    Avg ± Std                      Kt       Wilcox. test (Random / Dummy / Static)
L-Tic    17684004.68 ± 3827810.34       0.0000   0.0013 / 0.0000 / 0.0000
Random   10157938.28 ± 10313748.01               –      / 0.0016 / 0.0000
Dummy    4800379.28 ± 3909834.59                 –      / –      / 0.0000
Static   −17317410.84 ± 18120669.01              –      / –      / –

Received Orders
Agent    Avg ± Std                      Kt       Wilcox. test (Random / Dummy / Static)
L-Tic    5453.96 ± 649.66               0.0000   0.0000 / 0.0000 / 0.0000
Random   6924.28 ± 593.06                        –      / 0.0000 / 0.0000
Dummy    2676.88 ± 344.13                        –      / –      / 0.0000
Static   7841.80 ± 223.25                        –      / –      / –
Table 2. Results in terms of penalties, interest and factory usage of the TicTACtoe agent using different pricing strategies and the dummy agent

Agent         Penalties (US$)              Interest (US$)            Fact. usage (%)
L-TicTACtoe   412209.08 ± 703824.06        241286.56 ± 93755.58      69.12 ± 8.05
Random        5596709.56 ± 6221754.37      66187.76 ± 248242.79      85.56 ± 6.90
Static        10535156.52 ± 10403092.31    −781983.80 ± 594421.05    93.44 ± 1.53
Dummy         882527.44 ± 1539109.47       37324.28 ± 94207.14       34.56 ± 4.28
account. This indicates that the strategy taken by Static is deficient, because it incurs negative balances on most of the simulation days. On the other hand, L-TicTACtoe is the agent that earns the most interest from the bank and presents the lowest variance. This shows that this agent has a more stable behavior in terms of bank account balances. Regarding factory utilization, we can see that the agents Random and Static achieve a higher factory utilization. High factory utilization suggests proficient management of the productive capacity. However, the penalties obtained by these agents demonstrate that they are surpassing their production capacity. L-TicTACtoe does not use the factory as much as these agents, but still presents a better solution to this problem because it efficiently served a considerable portion of the market. Finally, through this experiment we can confirm that the strategy used by L-TicTACtoe improves the global performance of our solution to the TAC SCM problem. Furthermore, the static and random strategies show poor results as a consequence of their incapacity to adapt to new situations. These
results indicate that we have accomplished the goal of applying an evolutionary rule learning technique inside the sales strategy of a TAC SCM agent.
7.2 Experiment 2: Classifiers Blocking
In this experiment, we compare the performance of the TicTACtoe agent with and without the blocking classifier technique described in Section 6.4. The agent that does not block classifiers is allowed to erase classifiers freely (ignoring whether they are waiting for a reward), while the other one preserves these classifiers. In order to keep the agents as similar as possible, both versions of TicTACtoe used an exploitation rate of 70% and a population size of 1000 classifiers. The goal of this comparison was to determine the impact of this technique. The results of this experiment show whether this simplification leads to information loss when we continue learning without waiting for the rewards.
[Figure 3: box plots of (a) final result (US$) and (b) number of orders for the agents t-block and t-noblock.]
Fig. 3. Comparison of the performance of TicTACtoe with and without the blocking classifiers technique in terms of final result and received orders
Table 3 shows that t-block (L-TicTACtoe with blocking) receives 312 more orders than t-noblock (L-TicTACtoe without blocking). This difference is small and is not strong enough to make any assumptions about the performance of the agents, as shown in Table 3. However, Figure 3(a) shows that t-block more frequently obtains a better final result than t-noblock. According to Table 3, this difference is not statistically significant at the 0.05 significance level. However, we could say that t-block behaves better than t-noblock in 94.4% of the cases. This difference in the final balance is explained by the high penalties obtained by t-noblock, as shown in Table 4. These penalties indicate that this agent does not develop an appropriate set of rules to determine the final sale price for an RFQ. Moreover, t-noblock makes offers at very low prices for orders that have a high penalty and are very difficult to produce because of the lack of the required components. When this agent offers products at low prices, it obtains plenty of orders, but most of them do not represent a profitable portion of the market considering its penalties. Moreover, we can observe in Table 4 that t-noblock gets negative interest from the bank, while t-block gets positive interest. This implies that, on
Table 3. Statistical results from the comparison of the agents using and not using the blocking classifier technique. The columns W (p-val) show the p-value of the Wilcoxon test between agents.

Agent      Final Result (Avg ± Std)      W (p-val)   Received Orders (Avg ± Std)   W (p-val)
t-block    15206827.28 ± 11654157.53     0.0567      7448.44 ± 278.09              0.3159
t-noblock  7860693.72 ± 14530705.27      –           7136.20 ± 726.64              –
Table 4. Results in terms of interest, penalties, component costs and storage costs of the TicTACtoe agent using and not using the blocking classifiers technique.

Agent      Interests (US$)     Penalties (US$)       Comp. costs (%)   Storage (%)
t-block    156981 ± 250208     4791314 ± 6357927     85.08 ± 3.26      1.13 ± 0.20
t-noblock  −35459 ± 408030     8083767 ± 7679835     85.07 ± 4.83      1.28 ± 0.28
average, the agent that does not block classifiers incurs debts, while the other agent maintains a positive balance in its bank account. This factor, in addition to the penalties, explains why in Figure 3(a) agent t-noblock ends with less money than agent t-block. To determine the impact of the blocking technique, it is also important to analyze the experience of the XCSR system in each agent. The experience is a measure of classifier usage; it indicates how many times a classifier has been used. In Figure 4, we can observe that the mean experience of the population of t-block is higher than that of t-noblock. This pattern occurs because t-noblock allows erasing classifiers at any time based on incomplete and inaccurate information. These rules are still waiting for a reward that will determine whether they performed well. Consequently, classifiers that could lead to good decisions are erased before the reward arrives, and their knowledge is completely lost. It is interesting to notice that the ratio between the average experience and the day of the simulation is approximately 0.05. This means that each classifier is used during at most 5% of the simulation. Considering that the simulation has 220 days, 5% corresponds to 11 days. Our explanation for this behaviour is that the generated rules are in fact detecting different stages of the simulation, and not all classifiers are used in the same stages. The blocking classifier technique increases the global experience of the population and the probability of survival of potentially good sub-solutions. Nevertheless, the trade-off of using this mechanism is that the system could also block bad solutions, and the probability of erasing good rules that have not been activated gets higher. The results of this experiment show that agents using the blocking classifier technique inside XCSR preserve important information in the classifiers. This might lead to better performance in environments with single-step tasks and
Fig. 4. Mean experience of the XCSR population during 8800 days (40 simulations)
delayed rewards. As further work, more experimentation will be carried out to validate these hypotheses and to determine further advantages and disadvantages this mechanism could have.
7.3 Experiment 3: Exploitation Rate
The aim of this experiment is to determine the best exploitation rate, or value of (1 − ε), for this particular problem. We tested the performance of the agent using different exploitation rates (0.9, 0.7, 0.5, 0.3) to determine which one is the most suitable for the problem we are trying to solve. We also included two extreme exploitation rates, 0 and 1, as controls. Afterwards, we analyse the two most interesting cases and compare them with the results of their dummy competitors (in this experiment we ran our base agent only against dummy competitors, using a different policy each time). The rest of the parameters of the algorithm and the agent remained the same. For this experiment we used a population of 1000 individuals and the blocking mechanism. The TicTACtoe and dummy agents involved in this experiment are referred to as tx and dx respectively, where x stands for the final threshold exploitation rate, or 1 − ε (see Section 6.4). Figure 5 shows the results according to the main performance measures: the final result and the number of received orders. In Figure 5(a) we can observe that the agents with the smallest final balance in the bank account at the end of the game are t0, followed by t100. The same behaviour can be observed in Figure 5(b).
[Figure 5: box plots of (a) final result (US$) and (b) number of orders for the agents t0, t10, t30, t50, t70, t90 and t100.]
Fig. 5. Comparison of the performance of TicTACtoe using different exploitation rates in terms of final result and received orders.
This means that constant exploration (t0), i.e., always giving the price discount in a random manner, produces the worst results. On the other hand, pure exploitation (t100) does not achieve good performance either, because it is incapable of adapting to new environments. Agents that combine exploitation and exploration during the whole learning process obtain the best results due to the dynamic characteristics of the environment. According to Table 5, the agents t0 and t100 are significantly worse than the rest of the agents in terms of final result and received orders. It is worth noticing the curve in these two figures. This suggests that the exploration rate does in fact affect the strategy developed, and that a balance between exploitation and exploration is necessary to achieve good performance. In Figure 5(b) we can see that the agent that serves the most orders is t70. According to Table 5, there are no significant differences between t30, t50 and t70 in terms of the final result, but there are differences in terms of the orders. Moreover, the agent with the highest final result on average turns out to be t30. This situation is clarified by Table 6, which compares the performance of these two agents against their dummy competitors. Despite the efforts of t70 to serve the largest portion of the market, this agent accumulates plenty of penalties for late deliveries. Moreover, although agent t30 does not win as many orders as t70, this situation helps the agent to fulfil the orders it already has. In the end, t30 does not incur as many penalties as t70, producing a steadier behaviour (lower variance). We could say that agent t30 is learning how to handle a number of orders that minimizes the obtained penalty and maximizes the final revenue. We can also notice in this table that the implementation of TicTACtoe, no matter the exploitation rate used (t30 or t70), obtains a higher final revenue and handles a larger portion of the market than the dummy competitors. Furthermore, there is also a difference in the behaviour of the two dummy agents, since the performance of the agents is relative to the competitors' behaviour. We can notice that agent t70 makes it more difficult for the competitor d70 to obtain customers. Regarding the factory usage, it is considered that a good agent uses its factory capacity as much as possible to complete orders [13]. This helps the agent to obtain higher revenues at the end of the game. Even though both configurations
Table 5. Statistical results from the comparisons of the agents using different values for the exploitation rate. Column Kt shows the p-value for the Kruskal-Wallis test and the Wilcox. test columns show the p-values for the Wilcoxon tests.

Final Result
Agent  Avg ± Std               Kt       Wilcox. test (t30 / t50 / t70)
t0     9235641 ± 3638978       0.0000   0.0000 / 0.0000 / 0.0084
t10    14543996 ± 2925682               0.0006 / 0.0021 / 0.0593
t30    17684005 ± 3827810               –      / 0.5768 / 0.6581
t50    17486431 ± 5252774               –      / –      / 0.7437
t70    15206827 ± 11654158              –      / –      / –
t90    13201893 ± 5666869               0.0000 / 0.1823 / 0.0000
t100   10945370 ± 3575705               0.0000 / 0.0000 / 0.0000

Received Orders
Agent  Avg ± Std               Kt       Wilcox. test (t30 / t50 / t70)
t0     4226.76 ± 574.72        0.0000   0.0000 / 0.0000 / 0.0000
t10    5301.56 ± 674.03                 0.4320 / 0.0004 / 0.0000
t30    5453.96 ± 649.66                 –      / 0.0012 / 0.0000
t50    6398.00 ± 1374.57                –      / –      / 0.0002
t70    7448.44 ± 278.09                 –      / –      / –
t90    6284.28 ± 362.91                 0.0000 / 0.0303 / 0.0000
t100   4586.12 ± 660.45                 0.0000 / 0.0000 / 0.0000
Table 6. Comparisons between the agents using 30% and 70% exploitation rates in terms of penalties, factory usage and total income

Agent   Penalties (US$)       Factory Usage (%)   Total income (US$)
t30     412209 ± 703824       69.12 ± 8.05        108718063 ± 12611823
d30     882527 ± 1539109      34.56 ± 4.28        59104351 ± 6999544
t70     4791314 ± 6357927     91.16 ± 2.95        142639167 ± 6966361
d70     807530 ± 1301921      29.52 ± 5.23        49645853 ± 8701695
t30 and t70 have the same production strategy, t70 makes more use of these resources than t30. This behaviour is explained by the fact that agent t70 has more orders to attend to. Consequently, considering this performance measure, agent t70 learns a better strategy. Nevertheless, the production and purchase strategies are still very simple, which makes it harder for this agent to deliver these orders on time. Regarding the total income, we can notice that both TicTACtoe agents have incomes proportional to the number of received orders. Also, both agents have higher incomes than their competitors. This shows that the developed strategies give competitive prices according to the cost of the products and do not offer the products below the production costs.
8 Conclusion
We designed and implemented a supply chain management agent for the TAC SCM problem. Our agent solves the production and purchase sub-problems using static strategies, while it solves the sales sub-problem using a dynamic strategy. The purchase strategy is based on the acquisition of components considering production commitments for the next simulation days. The production strategy is based on manufacturing goods prioritizing orders according to their expected profits and due dates. In addition, we implemented a dynamic sales strategy built on Wilson's XCSR classifier system. Through the XCSR mechanism, we obtained a suitable set of rules for the TAC SCM sales problem. This set of rules worked better than the strategies used for control. As our initial solution for the TAC SCM sales problem encountered an issue when handling delayed rewards in a single-step environment, we introduced a blocking classifier technique. We showed that the use of this technique yields more experienced populations and improves the quality of the generated strategies in this scenario. However, more experimentation needs to be carried out regarding this matter.
References
1. Trading agent competition - TAC SCM game description, http://www.sics.se/tac/page.php?id=13
2. Benisch, M., Sardinha, A., Andrews, J., Sadeh, N.: CMieux: adaptive strategies for competitive supply chain trading. In: ICEC 2006: Proceedings of the 8th International Conference on Electronic Commerce, pp. 47–58. ACM Press, New York (2006)
3. Bull, L.: Applications of Learning Classifier Systems. Springer, Heidelberg (2004)
4. Butz, M.: Illigal Java-XCS - LCS Web (2006)
5. Butz, M.V., Wilson, S.W.: An algorithmic description of XCS. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, p. 253. Springer, Heidelberg (2001)
6. Collins, J., Arunachalam, R., Sadeh, N., Eriksson, J., Finne, N., Janson, S.: The Supply Chain Management Game for the 2007 Trading Agent Competition, Pittsburgh, Pennsylvania (2006)
7. Conover, W.J.: Practical Nonparametric Statistics. John Wiley & Sons, Chichester (December 1998)
8. Franco, M., Gorrin, C.: Diseño e implementación de un agente de corretaje en una cadena de suministros en un ambiente simulado, Universidad Simón Bolívar (2007)
9. Holland, J.H.: Adaptation. In: Rosen, R., Snell, F.M. (eds.) Progress in Theoretical Biology IV, pp. 263–293. Academic Press, New York (1976)
10. Lanzi, P.: Learning classifier systems: then and now. Evolutionary Intelligence 1(1), 63–82 (2008)
11. Pardoe, D., Stone, P.: Bidding for customer orders in TAC SCM. In: Faratin, P., Rodríguez-Aguilar, J.-A. (eds.) AMEC 2004. LNCS (LNAI), vol. 3435, pp. 143–157. Springer, Heidelberg (2006)
12. Pardoe, D., Stone, P.: An autonomous agent for supply chain management. In: Adomavicius, G., Gupta, A. (eds.) Handbooks in Information Systems Series: Business Computing, vol. 3, pp. 141–172. Emerald Group (2009)
13. Stan, M., Stan, B., Florea, A.M.: A dynamic strategy agent for supply chain management. In: Proceedings of the Eighth International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, pp. 227–232. IEEE Computer Society, Los Alamitos (2006)
14. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2), 149–175 (1995)
15. Wilson, S.W.: Get real! XCS with continuous-valued inputs. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1813, p. 209. Springer, Heidelberg (2000)
16. Wilson, S.W.: Mining oblique data with XCS. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp. 158–174. Springer, Heidelberg (2001)
17. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco (2005)
Identifying Trade Entry and Exit Timing Using Mathematical Technical Indicators in XCS

Richard Preen
Department of Computer Science, University of the West of England, Bristol, BS16 1QY, UK
[email protected]
Abstract. This paper extends current LCS research into financial time series forecasting by analysing the performance of agents utilising mathematical technical indicators both for environment classification and for selecting the actions to be executed. It compares these agents with traditional models which only use such indicators to classify the environment and exit at the close of the next day. It is proposed that XCS agents utilising mathematical technical indicators for exit conditions will not only outperform similar agents which close the trade at the end of the next day, but will also result in fewer trades and consequently lower commissions paid. The results show that, in four of five assets, agents using indicator exit conditions outperformed those exiting at the close of the next day before commissions were factored in. Once commissions are factored in, the performance gap between the two agent classes widens further. Keywords: Computational Finance, Learning Classifier Systems, XCS.
1 Introduction
The primary objective of this paper is to extend the current research into the use of the XCS Learning Classifier System [28] within the domain of financial time series forecasting. Recent work (e.g., [9], [21], [13], and [24]) has demonstrated the successful application of XCS in this area. However, in each of these studies, agents are trained on daily price data to evolve trade entry rules composed of mathematical technical indicators in conjunction with a fixed rule to close the trade the following day, i.e., the exit timing is not evolved. It is posited that by utilising mathematical technical indicators to identify the timing of the market exit, as opposed to simply exiting on the next day, not only are the associated transaction costs reduced, but the excess returns are increased due to an inherent noise reduction by requiring less prediction accuracy. Initially, several XCS agents are produced to replicate the traditional model and demonstrate their application to financial time series forecasting. In extending this work, the agents additionally evolve mathematical technical indicators to identify appropriate exit conditions. These two models are then compared, and the agents are furthermore benchmarked against a buy-and-hold strategy to evaluate whether market-beating excess returns can be generated.
Brock, Lakonishok and LeBaron [4] investigated two of the most popular trading rules from technical analysis (moving averages and trading range breakout) on the Dow Jones Industrial Average over the period 1897-1986. They generated typical returns of 0.8% over a 10-day period, compared to a normal 10-day upward drift of 0.17%. After buy signals were generated, the market increased at a rate of 12% per year; following sell signals, a decrease of 7% per year was noted. Subsequently, Detry and Grégoire [6] successfully replicated the results for the moving average tests on a series of formally selected European indexes. Moreover, technical analysis has been shown to be useful in the foreign exchange markets by Dooley and Schaffer [8], Sweeney [25], Levich and Thomas [12], Neely et al. [15], Dewachter [7], Okunev and White [16], and Olson [17]. The primary benefit of using mathematical technical indicators in financial time series forecasting is that the algorithms are precisely defined. This means that the signals they produce are free from errors of subjective human judgement and emotion, are replicable, and can easily be tested over large amounts of data and varying assets to quantify performance. Learning Classifier Systems (LCS) [10] can easily co-evolve different combinations of these indicators to form entry/exit rules for financial trading, and even evolve the technical indicators themselves.
2 Related Work
There has been widespread research on Artificial Neural Networks (ANN) and Genetic Programming (GP) for financial time series forecasting. GP examples include Neely et al. [15], Allen and Karjalainen [1], and Chen [5]. Examples of ANN forecasting of financial time series include Tsibouris and Zeidenberg [25], Steiner and Wittkemper [23], Kalyvas [11], and Srinivasa, Venugopal and Patnaik [22]. In contrast to ANN and GP, comparatively little research has been conducted into the use of LCS for financial time series forecasting. Early examples of LCS research in this area include Beltrametti et al. [2] using LCS to predict currencies, and Mahfoud and Mani [14] and Schulenburg and Ross ([18], [19], and [20]) predicting stocks. More recently, Stone and Bull [24] created a single-step ZCS [27] agent to forecast long or short positions on the Foreign Exchange (FX) market, trading with the full amount of the balance each time. The architecture was modified by utilising the NewBoole update mechanism, tweaking the covering algorithm, and introducing a new specialize operator. The agent was required to always be in the market. Daily price and interest rate data was used, covering the period of January 1974 to October 1995 for the U.S. Dollar (USD), German Deutsche Mark (DEM), British Pound (GBP), Japanese Yen (JPY), and Swiss Franc (CHF). These were then used to create the currency pairs USDGBP, USDDEM, USDCHF, USDJPY, DEMJPY, and GBPCHF. The mathematical technical indicators used were based on four primitive functions of the time series, which could return the average price over a specified period, the minimum price over a specified period, the maximum price over a specified period, or the price at a specified day. ZCS was used to generate the indicators, where an indicator is the ratio of two of the primitive functions. For example, a log indicator:
lag(4)/max(10) with a range [0.032, 0.457] and an action of 1 translates to 'go long if the price 4 days ago is greater than 1.033 to 1.579 times the maximum price over the past 10 days'. The genetic search took place on the range and historical period parameters. Crossover was applied by switching the period parameters. For example, the two initial indicators lag(8)/max(22) and min(12)/avg(40) result in the two indicators lag(12)/max(40) and min(8)/avg(22). Mutation was then used to modify the range and period parameters in the normal way. An 8-bit encoding was used, which limited parameters to the range [0, 255]. The reward given was based on the additional return of the next day's price over any interest potentially accrued on the margin. Commission was set at 2.5 basis points, and therefore an action taken could be correct even though it produced a negative return; a fixed reward of 1000 was given only for actions resulting in positive returns. The ZCS agent produced Annual Percent Rate (APR) excess returns on 5 out of 6 currency pairs. Additionally, the number of ZCS runs with positive excess returns correlated well with the mean excess return achieved. However, while it was found that it is possible to achieve excess returns with ZCS, the performance (using the derived mathematical technical indicators) was not as good as a Genetic Programming benchmark. The most likely reason for this was the ZCS agent's high trading frequency and associated costs. It is suggested that the reason for this is the single-step model, and that using a multi-step model could reduce the frequency. However, it seems more likely that the reason is the requirement that the agent always have a presence in the market. The rationale for this is unclear, particularly since a major advantage private traders have over institutions is being able to stay out of the market until the exact moment a high-probability opportunity occurs. Moreover, the technical indicators used were extremely primitive. If the indicators had tougher constraints for providing an entry signal, this could easily have been used to reduce the trading frequency and perhaps provide superior performance.
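The four primitives and the quoted log indicator can be sketched as follows; the function names and window conventions are our assumptions about the description above:

```python
import math

# Sketch of the four primitive time-series functions from Stone and Bull's
# description; p is a list of daily prices and t the current index.
def avg(p, t, m):  return sum(p[t - m + 1 : t + 1]) / m   # average over m days
def low(p, t, m):  return min(p[t - m + 1 : t + 1])       # minimum over m days
def high(p, t, m): return max(p[t - m + 1 : t + 1])       # maximum over m days
def lag(p, t, m):  return p[t - m]                        # price m days ago

def log_indicator_fires(p, t, lo=0.032, hi=0.457):
    # log(lag(4)/max(10)) inside the evolved range [0.032, 0.457] => go long
    value = math.log(lag(p, t, 4) / high(p, t, 10))
    return lo <= value <= hi
```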
The payoff (or feedback) given to the agents for executing a particular action was decided based upon whether the following day's price closed above or below the current day's. A payoff of zero was awarded for executing a wrong action and a constant non-zero value was awarded for executing a correct action. The agents were assessed using daily price data for IBM, EXXON, Ford, CitiGroup, Coca-Cola, and Banco Santander Central Hispano. The training period ran from January 1990 to December 2003, and an evaluation phase then took place on data from January 2005 to June 2006. The results found that the Meta Agents usually outperformed the individual technical agents and that the Micro Agents could not outperform both the buy-and-hold and bank strategies. Further, the Meta Agent always outperformed the Random Agent. However, in terms of accuracy, the Meta Agents performed the same as or worse than the Micro Agents. In summary, the major finding of this model was that a hierarchical XCS using multiple agents can produce better results than a single-agent XCS. The fact that the Meta Agents always outperformed the Random Agent also illustrates that the system is capable of learning useful rules, even though in this case they were not able to outperform the relevant real-world benchmarks.

Schulenburg and Wong [21] explored portfolio allocation using a HXCS. Agents received inputs from technical indicators and attempted to learn profitable rules to trade the market data provided. In addition to a Technical Analysis (TA) Agent, a Market (Mkt) Agent and an Options Agent were created to provide further information to the decision-making process. The TA Agent incorporated rules based upon inputs from the following four mathematical technical indicators: Rate of Change (ROC), Relative Strength Index (RSI), Ultimate Oscillator (ULTOSC), and On Balance Volume (OBV). The Mkt Agent integrated rules from the following four general market indicators: the daily percent return of the S&P500 Index, the daily S&P500 Index volume, the daily 10-year T-note bond yield, and the daily 3-month T-bill bond yield. The Options Agent included rules from the following five Options market indicators: Delta (the sensitivity of an Option value to the underlying stock price), Gamma (the second-order sensitivity of the Option value to the underlying stock price), Vega (the sensitivity of the Option value to the stock price volatility), Theta (the sensitivity of the Option value to the passage of time), and implied volatility (the stock volatility estimate given by the Black-Scholes formula). The daily stock data tested was for CitiGroup, IBM, General Motors, Eastman Kodak, and Exxon Mobil over the period 4th January 1996 to 28th April 2006. A commission fee of 0.5% of the transaction value was set. In contrast to Gershoff's HXCS, the agents attempted to predict the price movement of tomorrow's stock price and the percentage of total wealth to invest, instead of just buy or sell signals. The agents were given the choice between investing in the risky stock and investing in safe treasury bills, which returned a variable interest rate based on real-world values. The input data from the indicators was first divided into nine discrete cut points using leave-one-out cross-validation. The target series then underwent two phases of discretization.
The first phase quantized the data using the unsupervised method of histogram equalisation in order to add class label information to the target series. Subsequently, the supervised method of entropy-based discretization was used to split
170
R. Preen
the series into intervals in order to maximise the information gain. Once quantization had been completed, a binary vector was mapped to the intervals so that it could be used by an XCS agent. Next, the cumulative performance of the Meta Agent was evaluated. If its prediction accuracy was less than the specified threshold value, all agents (including the Meta Agent itself) were destroyed and a new set of agents with a new discretization process was launched. The new set of cut points was based on the preceding ten days. All new agents then started their training phases by exploring the new training environment. After completing training, they were placed back into the real-world environment. The best results of the agents were compared against four benchmarks: buy and hold, bank, price trend, and a Random Agent. In the case of CitiGroup, all of the XCS agents outperformed all four of the benchmark agents. Moreover, in all five stocks, all XCS agents outperformed the Random Agents. The authors suggest that there is a mere 0.00003% probability that this occurred by chance and that it provides solid proof that stock prices have a rational component. Further, the XCS agents discovered a 'famous 1960s trading rule' (if the ultimate oscillator is greater than 70, and the previous stock price change is within 2 to 3%, then tomorrow's stock price change will be -2.5 to -3.5%). This last discovery highlights one of the major benefits of using XCS (as opposed to alternatives such as an ANN) to forecast financial time series. The ability to have the rules in an easily human-readable form enables the researcher to evaluate the logic of any discovered rule and decide whether it makes sense. This is important because, if the rule does not make any logical sense to a trader, it is quite possible that the rule has been derived from over-fitting the data and its use in the future is questionable. Interestingly, in contrast to Gershoff's findings, the Meta Agents here did not perform very well in comparison to the single agents. In 3 of the 5 stocks, the Meta Agents underperformed all three of the single agents. If we are to use the best results as indicative of performance (as suggested by [21]), this provides mixed information on the effectiveness of HXCS as opposed to standard XCS agents.

Liu and Nagao [13] conducted a further assay of the application of HXCS to financial time series forecasting. Here, performance was evaluated on the prediction accuracy of the direction of the next day. Two Meta Agents were used, with their binary perceptions set solely according to comparisons between various moving averages. The moving averages used were of the form MA(t, m), where the average is calculated from time t back to time t−m. Agent1 consisted of a bitstring of length 24, where each bit was set according to the evaluation of 24 pairs of successive moving averages with an interval length of 20. Agent2 consisted of a bitstring of length 18, where the first 6 moving averages used an interval length of 10 and a further 12 moving averages used an interval length of 5; e.g., bit18 is set to logical 1 if MA(t−4, 5) < MA(t, 5). Furthermore, a fuzzy matching mechanism was used, where a classifier is said to have matched the environment state even if up to 10% of the bits do not match. For each environment state received by the HXCS, each Meta Agent receives the input, constructs a match set, and then calculates an average prediction value for the set. The agent with the highest average match set prediction value is then chosen to advocate
an action in the normal XCS procedure and parameters are updated for the action set of that Meta Agent. Experiments were run on four indexes (NIKKEI, NASDAQ, TOPIX, HSI) and 11 other stocks selected from the NIKKEI using daily closing price data from January 2000 to December 2004. The direction hit-rate of both Meta Agents always provided superior performance to a trend-following strategy that predicted the direction of the next day based on the change from the previous day. In addition, HXCS outperformed the Meta Agents by 2-3%. For example, the trend following strategy correctly predicted the direction 56% of the time for the NASDAQ, whereas Agent1 was correct 66.9% of the time, Agent2 70.8%, and HXCS 73.8%.
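As a minimal sketch of the fuzzy matching rule described above (our own formulation of a ternary condition over a bitstring; the 10% tolerance is as stated in the text):

```python
# A ternary condition matches a bitstring state if at most 10% of its
# non-'#' literals disagree with the state.
def fuzzy_match(condition, state, tolerance=0.10):
    mismatches = sum(1 for c, s in zip(condition, state)
                     if c != '#' and c != s)
    return mismatches <= tolerance * len(condition)
```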
3 Learning Framework
Perhaps the biggest limitation that is consistent among [9], [21], [13], and [24] is that they all attempt to use daily data to forecast the next day's price. Since "the accuracy of agents' predictions depends largely on how well the problem is represented" [21], we should adopt an approach that mimics how real trading is conducted as closely as possible. Figure 1 shows the daily price chart of the EURUSD currency pair, with the vertical dotted line in the centre marking August 15th 2007. At the close of this day, the Relative Strength Index (RSI) indicator set to 14 periods (i.e., calculating the RSI over the previous fourteen daily open, high, low, close bars), RSI(14), produces a value of 31.2109. For Agent 1 in [9], this value would set bit6 (RSI(14)≤35) to '1'. On the following day the price closed lower (at 1.3426) than its open (of 1.3442). Supposing that the agent had identified that this rule was part of a buy signal, it would have resulted in a loss under the model and negative feedback would have been given.
Fig. 1. EURUSD Daily Price Chart 01.08.2007 – 31.08.2007
However, if we look at the bigger picture in Figure 2, we can see that this would in fact have been an excellent place to enter the market. In Figure 2, the vertical dotted line highlights the same day as in Figure 1, but illustrates that, in the bigger picture, the EUR continued to climb in value against the USD during the months following the RSI signal. Clearly, the method of evaluation and providing feedback to the model is far too short-sighted and asks for far too much accuracy. Real traders utilise Stop Losses (SL), which are triggers set a certain distance from the entry price that exit the market at a loss. This value is there in part because markets are infamous for swaying noisily whilst actually moving towards a logical target (as in the drunken man analogy). Furthermore, most real traders would never attempt to predict the closing price of the next bar (e.g., the next day when using daily data) because it asks for far too much accuracy within a widely acknowledged noisy system. They would simply exit the market at their SL, or attempt to exit the market in profit at some multiple of the initial risk (i.e., SL). Through such a method, successful traders can lose half, or more, of their trades whilst still finishing profitably.
Fig. 2. EURUSD Daily Price Chart 01.08.2007 – 30.11.2007
If the models are intended to replicate real traders, we must adopt a more real-world approach. Such an approach must seek to avoid pre-specifying the exact bar at which to exit the trade and provide feedback. One approach commonly used in real trading is to define the exit conditions in terms of fixed price levels. For example, if the agent discovers a buy signal, the SL is set $5 below the entry price, and a Take Profit (TP) (i.e., a price level at which a trade is considered a winner and profit is taken) is set $10 above the entry price. A more sophisticated technique would be to test combinations of SL and TP to find the optimal pair in addition to the entry signal. However, this might easily lead towards curve-fitting the model too specifically to the training set.
Perhaps the most widely used method to identify when to exit a trade is the same as that used to enter the trade in the first place: technical analysis. For example, if a rule to buy an asset is ‘if RSI(14)<30 then buy’, then a suitable exit rule might be ‘if RSI(14)>70 then exit’. This risks complicating the model and exponentially increasing the search space, but it is the only way to provide a real-world measurement of success.
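A sketch of the example entry/exit rules above, together with a simple RSI. This is the unsmoothed ("simple average") RSI; many platforms use Wilder's smoothing instead, so treat it as an illustration rather than a canonical definition:

```python
# Simple-average RSI over the last `period` daily closes.
def rsi(closes, period=14):
    gains, losses = 0.0, 0.0
    for prev, cur in zip(closes[-period - 1:-1], closes[-period:]):
        change = cur - prev
        gains += max(change, 0.0)
        losses += max(-change, 0.0)
    if losses == 0:
        return 100.0
    rs = gains / losses                 # the period cancels in the ratio
    return 100.0 - 100.0 / (1.0 + rs)

def entry_signal(closes):
    return rsi(closes) < 30             # example rule: oversold, enter long

def exit_signal(closes):
    return rsi(closes) > 70             # example rule: overbought, exit
```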
4 Implementation

4.1 Data
The data used is the daily price/volume information over the period of February 3rd 1992 to December 14th 2007 for Exxon Mobil Corp. (XOM) (Figure 3.a), the Dow Jones Industrial Average (DJI) (Figure 3.b), General Motors Corporation (GM) (Figure 3.c), and Intel Corp. (INTC) (Figure 3.d). In addition, data over the period of December 26th 1991 to December 14th 2007 is used for 30-Year Treasury Bonds (TYX) (Figure 3.e). They were chosen to include one index (DJI), two ranging assets (GM and INTEL), one falling asset (TYX), and one increasing asset (XOM). Moreover, the assets represent diverse market sectors: automobiles, technology, bonds, oil, and an index average. For DJI, the adjusted closing price is divided by 1000 to enable the agents to purchase shares with a balance of $10,000 or less. In all cases, 4000 data points (i.e., days) are used. The first 3000 data points form a training set used to evolve new rules and the most recent 1000 data points are used as a trading set to evaluate these rules.

4.2 XCS
The traditional ternary representation is used, where the environment inputs are discretized as outlined in the following sections. A fixed reward of 1000 is given to profitable actions and 0 to actions which result in no profit or a loss. The XCS parameters used are as follows (taken from [3] and not further optimised so as not to bias the results used to compare the models): α=1, β=0.2, δ=0.1, θGA=25, θdel=20, θsub=20, P#=0.6, v=5, χ=0.8, ε0=10, μ=0.04. Each agent is shown the training set only once before being evaluated on the trading set. The alternation between exploring and exploiting rules is modified as in [21] to:

(1)

Running the equation above over 1000 iterations (i.e., the length of the trading set) produced a range of 896 to 932 exploit steps being executed. Thus, over 1000 iterations, exploits are conducted approximately 89.6-93.2% of the time. This produces an increasing bias towards exploiting the knowledge acquired as the rules become more evolved, which is important since the system performs a single pass through the data.
[Figure 3: five daily adjusted closing price charts: (a) XOM, (b) DJI, (c) GM, (d) INTEL and (e) TYX.]
Fig. 3. Daily Adjusted Closing Price data used in experimentation
4.3 Agent 1 - Entries

Agent 1 utilises three stochastic indicators with the periods (8, 3, 3), (32, 12, 12), and (128, 48, 48). The (8, 3, 3) configuration was chosen simply because it is the most commonly used; the two subsequent combinations are each four times greater, thereby providing short-term, intermediate-term, and long-term trends. The direction of the stochastic indicators and their position (i.e., the value between 0 and 100) are used to classify the environment. The signal line was used for the (8,3,3) parameters to smooth the line and reduce noise, whereas the (32,12,12) and (128,48,48) main lines are already sufficiently smoothed. The real-numbered indicators are discretized through a simple mechanism. A 9-bit binary string is composed, where the first two bits classify the (8,3,3) signal line's position, the third and fourth bits classify the (32,12,12) main line's position, and the fifth and sixth bits classify the (128,48,48) main line's position. The indicator-to-binary encoding for each indicator's position is summarised in Figure 4.

Indicator   Binary
0 - 24      00
25 - 49     01
50 - 74     10
75 - 100    11

Fig. 4. Indicator Value to Binary Encoding
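A sketch of this two-bit discretization, directly following the bands of Figure 4:

def position_bits(value):
    """Map a stochastic value in [0, 100] to the two-bit code of Fig. 4."""
    if value < 25:
        return "00"
    elif value < 50:
        return "01"
    elif value < 75:
        return "10"
    return "11"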
Lastly, three bits are used to classify the direction of each of the stochastic lines, as in Figure 5.

Bit7 = '1' if Stochastic (8,3,3) current signal line > Stochastic (8,3,3) previous signal line, else '0'
Bit8 = '1' if Stochastic (32,12,12) current main line > Stochastic (32,12,12) previous main line, else '0'
Bit9 = '1' if Stochastic (128,48,48) current main line > Stochastic (128,48,48) previous main line, else '0'

Fig. 5. Agent 1 Encoding
4.4 Agent 2 - Entries

The second agent is a trend-following agent comprised mostly of Exponential Moving Averages (EMA). 20-, 50-, and 100-period EMAs are constructed. Each EMA's direction (i.e., rising or falling) and the position of the current price relative to the EMA (i.e., above or below) are used to classify the environment. In addition, the direction of the Moving Average Convergence Divergence (MACD) (12, 26, 9) main line and the direction of the Stochastic (32, 12, 12) main line are used to provide additional trend information. The encoding is summarised in Figure 6.
Bit1 = '1' if EMA (20) current > EMA (20) previous, else '0'
Bit2 = '1' if EMA (50) current > EMA (50) previous, else '0'
Bit3 = '1' if EMA (100) current > EMA (100) previous, else '0'
Bit4 = '1' if price current > EMA (20) current, else '0'
Bit5 = '1' if price current > EMA (50) current, else '0'
Bit6 = '1' if price current > EMA (100) current, else '0'
Bit7 = '1' if Stochastic (32,12,12) current main line > Stochastic (32,12,12) previous main line, else '0'
Bit8 = '1' if MACD (12,26,9) current main line > MACD (12,26,9) previous main line, else '0'

Fig. 6. Agent 2 Encoding
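The following sketch assembles Agent 2's 8-bit condition string of Figure 6; it assumes a plain list of closing prices and precomputed stochastic and MACD main lines, and seeds each EMA with the first close (a choice not specified in the text).

def ema_series(closes, n):
    """Exponential moving average with K = 2/(N+1), seeded with the first close."""
    k = 2.0 / (n + 1)
    out = [closes[0]]
    for c in closes[1:]:
        out.append(c * k + out[-1] * (1 - k))
    return out

def agent2_condition(closes, stoch_main, macd_main, t):
    """Build Agent 2's 8-bit environment string for day t (Fig. 6)."""
    bits = []
    for n in (20, 50, 100):
        ema = ema_series(closes, n)
        bits.append('1' if ema[t] > ema[t - 1] else '0')   # EMA direction
    for n in (20, 50, 100):
        ema = ema_series(closes, n)
        bits.append('1' if closes[t] > ema[t] else '0')    # price vs. EMA
    bits.append('1' if stoch_main[t] > stoch_main[t - 1] else '0')
    bits.append('1' if macd_main[t] > macd_main[t - 1] else '0')
    return ''.join(bits)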
4.5 Agent 3 - Entries

Agent 3 is the first agent (Tt1) from [18]. The agent consists of comparisons between the current price and the previous price, a series of Simple Moving Averages (SMA), and the highest and lowest prices observed. The environment bit string consists of 7 binary digits and is encoded as follows in Figure 7.

Bit1 = '1' if price current > price previous, else '0'
Bit2 = '1' if price current > 1.2 x SMA(5), else '0'
Bit3 = '1' if price current > 1.1 x SMA(10), else '0'
Bit4 = '1' if price current > 1.05 x SMA(20), else '0'
Bit5 = '1' if price current > 1.025 x SMA(30), else '0'
Bit6 = '1' if price current > highest price, else '0'
Bit7 = '1' if price current < lowest price, else '0'

Fig. 7. Agent 3 Encoding
4.6 Agent Exits

There are three sets of exit conditions for each agent. Firstly, there is the traditional model where the next day is used as the only exit condition, meaning that any trade entered today is exited at tomorrow's closing price. In addition, there are two sets of technical indicator exit conditions: a simple set with only 4 exit conditions (see Figure 8) and a more advanced set comprising 16 exit conditions (see Figure 9). To keep the current study simple, the agents were only allowed to buy or hold, with selling not permitted. In both the 4 and 16 exit sets, one of the actions causes the agent to move to the next day without trading (i.e., it holds for one day), where reward is given if the price remained unchanged or decreased. The executable actions in the set of four:

1. Do not enter any trades today (i.e., hold for one day).
2. Buy today and exit when MACD (12,26,9) decreases.
3. Buy today and exit when EMA (20) decreases.
4. Buy today and exit when both MACD (12,26,9) and EMA (20) decrease.

Fig. 8. Four Technical Exit Conditions
This is implemented by moving forward each day in the index and comparing the indicator's parameters with the exit conditions (as would happen in live trading); a sketch of this scan follows Figure 9. When a match is found, the result of the action is calculated, the balance updated, and reward given. The comparison of the indicator parameters was implemented by individually checking each rule. This was done for simplicity and to ensure that the rules were functioning correctly. However, with a bigger set of exit conditions to test (since every applicable combination is tested), one would assign bits to each condition in the same manner the environment conditions are constructed, and then any invalid actions (e.g., EMA (20) cannot be rising and falling simultaneously) would be removed by forcing XCS to choose another action. The executable actions in the set of sixteen:

1. Do not enter any trades today (i.e., hold for one day).
2. Buy today and exit when MACD (12,26,9) decreases.
3. Buy today and exit when EMA (20) decreases.
4. Buy today and exit when Stochastic (32,12,12) decreases.
5. Buy today and exit when EMA (50) decreases.
6. Buy today and exit when MACD (12,26,9) and EMA (20) decrease.
7. Buy today and exit when MACD (12,26,9) and Stochastic (32,12,12) decrease.
8. Buy today and exit when MACD (12,26,9) and EMA (50) decrease.
9. Buy today and exit when EMA (20) and Stochastic (32,12,12) decrease.
10. Buy today and exit when EMA (20) and EMA (50) decrease.
11. Buy today and exit when Stochastic (32,12,12) and EMA (50) decrease.
12. Buy today and exit when MACD (12,26,9), EMA (20), and Stochastic (32,12,12) decrease.
13. Buy today and exit when MACD (12,26,9), EMA (20), and EMA (50) decrease.
14. Buy today and exit when MACD (12,26,9), Stochastic (32,12,12), and EMA (50) decrease.
15. Buy today and exit when EMA (20), Stochastic (32,12,12), and EMA (50) decrease.
16. Buy today and exit when EMA (20), Stochastic (32,12,12), EMA (50), and MACD (12,26,9) decrease.

Fig. 9. Sixteen Technical Exit Conditions
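A minimal sketch of this forward scan, assuming precomputed indicator series and one predicate per exit action (the dictionary layout is an assumption):

def resolve_trade(closes, indicators, entry_day, exit_condition):
    """Walk forward one day at a time from the entry until the exit
    condition first matches, then return the round-trip profit
    (no commissions)."""
    t = entry_day + 1
    while t < len(closes) - 1 and not exit_condition(indicators, t):
        t += 1
    return closes[t] - closes[entry_day]

# Example: action 2 of Fig. 8 -- exit when the MACD main line decreases.
macd_exit = lambda ind, t: ind['macd'][t] < ind['macd'][t - 1]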
5 Experimentation

Tables 1 to 5 compare the agents using the next day as the exit condition with the same agents using the 4 and 16 technical indicator exit sets. Each agent starts with an initial balance of $10,000. The results presented are the best run and the average run of 100 experiments; the highest performing result in each category is highlighted in bold. The results from the experiments comparing the next-day-exit agents with the agents using technical indicator exit conditions, after being shown the training set
only once (Tables 1-5), show that for XOM, the agent with the highest balance ($25,648.75) and highest average balance ($15,899.56) was Agent 2 with 16 technical indicator exits. For DJI, Agent 1 with 4 technical indicator exits produced the highest balance ($15,120.46) and Agent 3 with 16 technical indicator exits achieved the highest average balance ($12,102.06). For INTEL, Agent 2 with 4 technical indicator exits produced the highest balance ($21,000.59) and the highest average balance ($10,522.50). In the case of GM, again Agent 2 with 4 technical indicator exits produced both the highest balance ($20,116.72) and the highest average balance ($9,645.54). Lastly, for TYX, Agent 1 with next-day-exit conditions produced both the highest balance ($15,671.20) and highest average balance ($11,389.56).

The results have shown that in all cases (except TYX), an agent using technical indicator exits was superior to exiting at the next day for both the highest achievable balance and the average balance over its experiments. Moreover, since commissions are not factored into the agents at this stage, it is highly likely that the gap between the two agent classes would widen further.

Table 1. XOM

Agent                         Best ($)    Average ($)
Agent 3: Next Day Exit        16,568.02   13,518.73
Agent 2: Next Day Exit        17,015.35   12,863.05
Agent 1: Next Day Exit        18,085.78   13,815.44
Agent 3: 16 Technical Exits   25,648.60   15,442.76
Agent 2: 16 Technical Exits   25,648.75   15,899.56
Agent 1: 16 Technical Exits   22,883.49   15,849.93
Agent 3: 4 Technical Exits    16,133.73   14,825.81
Agent 2: 4 Technical Exits    21,105.34   13,823.89
Agent 1: 4 Technical Exits    19,904.95   14,224.36
Buy and Hold                  24,634.00   24,634.00
Table 2. DJI

Agent                         Best ($)    Average ($)
Agent 3: Next Day Exit        13,180.21   11,314.48
Agent 2: Next Day Exit        13,664.05   11,338.99
Agent 1: Next Day Exit        12,782.90   11,280.55
Agent 3: 16 Technical Exits   14,589.01   12,102.06
Agent 2: 16 Technical Exits   14,068.26   11,835.86
Agent 1: 16 Technical Exits   14,443.68   12,027.56
Agent 3: 4 Technical Exits    13,701.04   11,975.34
Agent 2: 4 Technical Exits    14,664.57   11,868.51
Agent 1: 4 Technical Exits    15,120.46   12,033.45
Buy and Hold                  12,918.69   12,918.69
Table 3. INTEL

Agent                         Best ($)    Average ($)
Agent 3: Next Day Exit        12,672.98   9,512.07
Agent 2: Next Day Exit        14,240.27   9,727.86
Agent 1: Next Day Exit        13,476.69   9,731.87
Agent 3: 16 Technical Exits   12,889.49   8,391.51
Agent 2: 16 Technical Exits   13,736.25   8,860.61
Agent 1: 16 Technical Exits   15,759.57   8,481.99
Agent 3: 4 Technical Exits    16,511.56   9,504.32
Agent 2: 4 Technical Exits    21,000.59   10,522.50
Agent 1: 4 Technical Exits    16,568.16   9,924.76
Buy and Hold                  8,894.74    8,894.74
Table 4. GM
Agent                         Best ($)    Average ($)
Agent 3: Next Day Exit        13,505.11   8,251.02
Agent 2: Next Day Exit        14,324.42   7,927.37
Agent 1: Next Day Exit        16,789.67   8,579.46
Agent 3: 16 Technical Exits   15,605.10   8,827.06
Agent 2: 16 Technical Exits   18,114.27   9,254.52
Agent 1: 16 Technical Exits   17,338.24   9,153.40
Agent 3: 4 Technical Exits    15,804.40   9,226.62
Agent 2: 4 Technical Exits    20,116.72   9,645.54
Agent 1: 4 Technical Exits    14,565.23   8,362.22
Buy and Hold                  5,970.25    5,970.25
Table 5. TYX
Agent                         Best ($)    Average ($)
Agent 3: Next Day Exit        14,180.51   10,959.06
Agent 2: Next Day Exit        14,297.20   10,730.10
Agent 1: Next Day Exit        15,671.20   11,389.56
Agent 3: 16 Technical Exits   12,773.89   10,010.81
Agent 2: 16 Technical Exits   12,503.13   9,632.41
Agent 1: 16 Technical Exits   12,047.33   9,815.09
Agent 3: 4 Technical Exits    11,346.18   9,870.72
Agent 2: 4 Technical Exits    14,297.84   10,014.32
Agent 1: 4 Technical Exits    12,260.75   9,936.21
Buy and Hold                  9,227.80    9,227.80
Table 6. t-Stats of Tech Exits vs. Next Day (N.D.) exits. Two-Sample Assuming Unequal Variances. Results in bold are statistically significant at the 95% confidence level.

         Agent 1              Agent 2              Agent 3
Stock    4 Ex.     16 Ex.     4 Ex.     16 Ex.     4 Ex.     16 Ex.
         vs. N.D.  vs. N.D.   vs. N.D.  vs. N.D.   vs. N.D.  vs. N.D.
XOM      1.90      6.48       3.15      9.40       4.10      5.80
DJI      5.60      6.19       3.73      4.05       3.73      5.82
INTEL    0.86      -6.20      3.61      -4.06      -0.04     -5.72
GM       -0.69     1.93       5.13      4.09       2.73      1.87
TYX      -8.34     -9.60      -4.08     -6.90      -7.96     -6.30
In the case of TYX, however, the best performing agent was Agent 1 with next-day-exit conditions. Furthermore, all next-day-exit agents surpassed the technical indicator exit agents in terms of both highest balance and average balance, showing that for some assets next-day exits can be the best. Introducing commissions would likely reduce this gap and perhaps even allow the technical indicator exit agents to overtake the next-day-exit agents. Nevertheless, the fact that the next-day-exit agents beat the technical indicator exits is perhaps explainable by the split between the training and trading set, since the training set for TYX primarily decreases but the trading set moves in a sideways range. Table 6 presents the t-Stats for the three agent types, where exiting at the close of the next day is compared with both the 4 and 16 technical indicator exit sets. Almost all of the results are statistically significant at the 95% confidence level. In particular, for XOM and DJI, all agents utilising technical indicator exits surpassed the same agents when exiting at the close of the next day, and these results
were statistically significant. Additionally, Agent 2 with 4 indicator exits provided statistically significant and superior results compared to exiting at the close of the next day in all cases except TYX. Finally, when comparing the best performing agents with a buy-and-hold strategy, we observe that for INTEL, GM, and TYX, all of the agents using technical indicator exits beat this strategy. Further, the best performing agents on all assets were always able to beat the buy-and-hold balance; however, the average of the agents' balances did not. Furthermore, should commissions be introduced (the cost would vary from broker to broker), these results compared to a buy-and-hold strategy would deteriorate to some extent. The agents' average balances only outperformed a buy-and-hold strategy when the stocks declined. An explanation for this is that when the agent exits the market wrongfully, although there is no actual loss, there is an opportunity cost because the market increases and the agent underperforms its benchmark. Thus, stocks which generally decline over the period analysed are much easier to beat, because agents have the choice to be in or out of the market, while it is much harder to beat those that are generally rising.

Table 7 shows the average number of trades executed over 100 tests of each asset by Agent 2. Again, the agent is shown the training set only once before being assessed on the trading set. The table shows that when using 4 technical indicator exits, the agent always trades fewer times than with next-day-exit conditions, and this difference is statistically significant (as shown in Table 8). In some cases 40% fewer trades are executed, which would result in substantial transaction fee savings. When utilising 16 technical indicator exits, Agent 2 trades a similar number of times to the agents using next-day-exit conditions. This is a result of adding more exit conditions, which increases the probability of closing the trade after a short period of time. Thus, the 16 technical indicator exit agents tested do not offer any transaction fee savings in comparison to the traditional model.

Table 7. Average Number of Trades Executed by Agent 2

Agent 2:          XOM      DJI      INTEL    GM       TYX
Next-day-exit     243.25   267.20   266.83   154.37   160.89
4 Tech. Exits     164.84   170.74   168.30   136.14   105.82
16 Tech. Exits    241.17   255.23   255.55   144.69   158.54
Table 8. t-Stats of Number of Trades Executed by Agent 2 with Tech Exits vs. Next Day (N.D.) exits. Two-Sample Assuming Unequal Variances. Results in bold are statistically significant at the 95% confidence level.

Agent 2:                   XOM    DJI    INTEL   GM     TYX
4 Tech. Exits vs. N.D.     4.63   5.51   6.60    1.98   3.58
16 Tech. Exits vs. N.D.    0.13   0.51   0.60    1.36   0.13
6 Conclusions

Agents utilising mathematical technical indicators for the exit conditions outperformed similar agents using the next day as the exit condition in all cases except TYX (30-Year Treasury Bonds), even before taking commissions into account, which would penalise the most active agents (i.e., the agents using next-day exits). Moreover, these results were achieved with generic XCS parameters that were not tuned to improve performance. The anomalous TYX result is attributable either to the position of the cut-off point between the training and trading set, or to the TYX data being inherently noisier than the other assets, which were all stocks. The cut point in this asset is particularly important because it resulted in a training set which primarily declined and a trading set that ranged sideways. Thus, the agents would have adapted rules to trade within this downward environment but were not prepared for the environment within which they were assessed. An analysis of the number of trades executed by each agent showed that, on average, 31.73% fewer trades were executed when using 4 technical indicator exit conditions; this would result in substantial transaction savings and further boost the performance of these agents in comparison to the agents using next-day-exit conditions. However, the agents using 16 mathematical technical indicator exits traded with approximately the same frequency as the agents using next-day-exit conditions. This was a result of having more rules with different exit conditions that could be triggered, so the agents were closing trades with greater frequency.
References

1. Allen, F., Karjalainen, R.: Using Genetic Algorithms to find technical trading rules. Journal of Financial Economics 51(2), 245–271 (1999)
2. Beltrametti, L., Fiorentini, R., Marengo, L., Tamborini, R.: A learning-to-forecast experiment on the foreign exchange market with a Classifier System. Journal of Economic Dynamics and Control 21(8&9), 1543–1575 (1997)
3. Butz, M., Sastry, K., Goldberg, D.: Strong, Stable, and Reliable Fitness Pressure in XCS due to Tournament Selection. Genetic Programming and Evolvable Machines 6(1), 53–77 (2005)
4. Brock, W., Lakonishok, J., LeBaron, B.: Simple Technical Trading Rules and the Stochastic Properties of Stock Returns. Journal of Finance 47, 1731–1764 (1992)
5. Chen, S.-H.: Genetic Algorithms and Genetic Programming in Computational Finance. Kluwer Academic Publishers, Norwell (2002)
6. Detry, P.J., Grégoire, P.: Other evidences of the predictive power of technical analysis: the moving average rules on European indexes. CeReFiM, Belgium, pp. 1–25 (1999)
7. Dewachter, H.: Can Markov switching models replicate chartist profits in the foreign exchange market? Journal of International Money and Finance 20(1), 25–41 (2001)
8. Dooley, M., Schaffer, J.: Analysis of Short-Run Exchange Rate Behavior: March 1973 to November 1981. In: Bigman, D., Taya, T. (eds.) Floating Exchange Rates and State of World Trade and Payments, pp. 43–70. Ballinger Publishing Company, Cambridge (1983)
9. Gershoff, M.: An investigation of HXCS Traders. Master of Science thesis, School of Informatics, University of Edinburgh (2006)
10. Holland, J.: Adaptation in Natural and Artificial Systems. University of Michigan Press (1975)
11. Kalyvas, E.: Using Neural Networks and Genetic Algorithms to Predict Stock Market Returns. Master of Science thesis, University of Manchester (2001)
12. Levich, R., Thomas, L.: The Merits of Active Currency Management: Evidence from International Bond Portfolios. Financial Analysts Journal 49(5), 63–70 (1993)
13. Liu, S., Nagao, T.: HXCS and its Application to Financial Time Series Forecasting. IEEJ Transactions on Electrical and Electronic Engineering 1, 417–425 (2006)
14. Mahfoud, S., Mani, G.: Financial forecasting using Genetic Algorithms. Applied Artificial Intelligence 10(6), 543–565 (1996)
15. Neely, C., Weller, P., Dittmar, R.: Is Technical Analysis in the Foreign Exchange Market Profitable? A Genetic Programming Approach. Journal of Financial and Quantitative Analysis 32(4), 405–426 (1997)
16. Okunev, J., White, D.: Do momentum-based strategies still work in foreign currency markets? Journal of Financial and Quantitative Analysis 38, 425–447 (2003)
17. Olson, D.: Have trading rule profits in the currency market declined over time? Journal of Banking and Finance 28, 85–105 (2004)
18. Schulenburg, S., Ross, P.: An Adaptive Agent Based Economic Model. In: Lanzi, P.L., et al. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1996, pp. 265–284. Springer, Heidelberg (2001)
19. Schulenburg, S., Ross, P.: Strength and money: An LCS approach to increasing returns. In: Lanzi, P.L. (ed.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp. 114–137. Springer, Heidelberg (2001)
20. Schulenburg, S., Ross, P.: Explorations in LCS models of stock trading. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2001. LNCS (LNAI), vol. 2321, pp. 151–180. Springer, Heidelberg (2002)
21. Schulenburg, S., Wong, S.Y.: Portfolio allocation using XCS experts in technical analysis, market conditions and options market. In: Proceedings of the 2007 GECCO Conference Companion on Genetic and Evolutionary Computation, pp. 2965–2972. ACM, New York (2007)
22. Srinivasa, K.G., Venugopal, K.R., Patnaik, L.M.: An efficient fuzzy based neuro-genetic algorithm for stock market prediction. International Journal of Hybrid Intelligent Systems 3(2), 63–81 (2006)
23. Steiner, M., Wittkemper, H.G.: Neural networks as an alternative stock market model. In: Refenes, A.P. (ed.) Neural Networks in the Capital Markets, pp. 137–149. John Wiley and Sons, Chichester (1996)
24. Stone, C., Bull, L.: Foreign Exchange Trading using a Learning Classifier System. In: Bull, L., Bernadó-Mansilla, E., Holmes, J. (eds.) Learning Classifier Systems in Data Mining, pp. 169–190. Springer, Heidelberg (2008)
25. Sweeney, R.J.: Beating the foreign exchange market. Journal of Finance 41, 163–182 (1986)
26. Tsibouris, G., Zeidenberg, M.: Testing the Efficient Market Hypothesis with Gradient Descent Algorithms, pp. 127–136. John Wiley and Sons Ltd., Chichester (1996)
27. Wilson, S.W.: ZCS: A Zeroth Level Classifier System. Evolutionary Computation 2(1), 1–18 (1994)
28. Wilson, S.W.: Classifier Fitness Based on Accuracy. Evolutionary Computation 3(2), 149–175 (1995)
Appendix: Mathematical Technical Indicators

Simple Moving Average: SMA(N)

SMA_t = (Close_t + Close_{t-1} + ... + Close_{t-N+1}) / N

where Close is the closing price being averaged and N is the number of days in the moving average.

Exponential Moving Average: EMA(N)

EMA_t = Close_t · K + EMA_{t-1} · (1 − K)

where K = 2/(N+1), N is the number of days in the EMA, Close_t is today's closing price, and EMA_{t-1} is yesterday's EMA.

Moving Average Convergence Divergence: MACD(a,b,c)

MACD main line = EMA(a) − EMA(b)
MACD signal line = EMA(c)

where EMA(c) is an exponential moving average of the MACD main line.

Stochastic Oscillator: Stochastic(FastK, SlowK, SlowD)

Stochastic main line: Stoch_t = Stoch_{t-1} + (Fast − Stoch_{t-1}) / SlowK
Stochastic signal line: Sig_t = Sig_{t-1} + (Stoch_t − Sig_{t-1}) / SlowD

where Stoch_t is today's stochastic main line; Stoch_{t-1} is yesterday's; Fast = 100 · (Close_t − L) / (H − L); Close_t is today's closing price; L is the lowest low price over the last FastK days; and H is the highest high price over the last FastK days.
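For reference, the appendix formulas translate directly into the following sketch; the seed values for the recursive EMA and stochastic lines are assumptions, since the appendix does not specify initial conditions.

def sma(closes, n, t):
    """N-day simple moving average ending at day t."""
    return sum(closes[t - n + 1:t + 1]) / n

def ema(values, n):
    """Exponential moving average with K = 2/(N+1); seeded with the
    first value (an assumed initial condition)."""
    k = 2.0 / (n + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(v * k + out[-1] * (1 - k))
    return out

def macd(closes, a=12, b=26, c=9):
    """Main line EMA(a) - EMA(b); signal line is an EMA(c) of the main line."""
    main = [x - y for x, y in zip(ema(closes, a), ema(closes, b))]
    return main, ema(main, c)

def stochastic(closes, highs, lows, fast_k, slow_k, slow_d):
    """Smoothed stochastic oscillator following the appendix recursions;
    neutral seeds of 50 are assumed."""
    main, sig = [50.0], [50.0]
    for t in range(1, len(closes)):
        lo = min(lows[max(0, t - fast_k + 1):t + 1])
        hi = max(highs[max(0, t - fast_k + 1):t + 1])
        fast = 100.0 * (closes[t] - lo) / (hi - lo) if hi > lo else 50.0
        main.append(main[-1] + (fast - main[-1]) / slow_k)
        sig.append(sig[-1] + (main[-1] - sig[-1]) / slow_d)
    return main, sig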
On the Homogenization of Data from Two Laboratories Using Genetic Programming

Jose G. Moreno-Torres1, Xavier Llorà2, David E. Goldberg3, and Rohit Bhargava4
1 Department of Computer Science and Artificial Intelligence, Universidad de Granada, 18071 Granada, Spain. [email protected]
2 National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign, 1205 W. Clark Street, Urbana, Illinois, USA. [email protected]
3 Illinois Genetic Algorithms Laboratory (IlliGAL), University of Illinois at Urbana-Champaign, 104 S. Mathews Ave, Urbana, Illinois, USA. [email protected]
4 Department of Bioengineering, University of Illinois at Urbana-Champaign, 405 N. Mathews Ave, Urbana, Illinois, USA. [email protected]
Abstract. In experimental sciences, diversity tends to hinder predictive models' proper generalization across data provided by different laboratories. Thus, training on a data set produced by one lab and testing on data provided by another lab usually results in low classification accuracy. Even when the same protocols are followed, variability in measurements can introduce unforeseen variations that affect the quality of the model. This paper proposes a Genetic Programming based approach, in which a transformation of the data from the second lab is evolved, driven by classifier performance. A real-world problem, prostate cancer diagnosis, is presented as an example in which the proposed approach was capable of repairing the fracture between the data of two different laboratories.
1 Introduction

The assumption that a properly trained classifier will be able to predict the behavior of unseen data from the same problem is at the core of any automatic classification process. However, this hypothesis tends to prove unreliable when dealing with biological data (or other experimental sciences), especially when such data is provided by more than one laboratory, even if they are following the same protocols to obtain it. This paper presents an example of such a case: a prostate cancer diagnosis problem where a classifier built using the data of the first laboratory performs
very accurately on the test data from that same laboratory, but comparatively poorly on the data from the second one. It is assumed that this behavior is due to a fracture between the data of the two laboratories, and a Genetic Programming (GP) method is developed to homogenize the two datasets. We consider this method a form of feature extraction, because the new dataset is constructed with new features which are functional mappings of the old ones. The method presented in this paper attempts to optimize a transformation over the data from the second laboratory in terms of classifier performance. That is, the data from the second lab is transformed into a new dataset where the classifier, trained on the data from the first lab, performs as accurately as possible. If the performance achieved by the classifier on this new, transformed dataset is equivalent to that obtained on the data from the first lab, we understand the data has been homogenized. More formally, the classifier f is trained on data from one laboratory (dataset A), such that y = f(x_A) is the class prediction for one instance x_A of dataset A. For the data from the other lab (dataset B), it is assumed that there exists a transformation T such that f(T(x_B)) is a good classifier for instances x_B of dataset B. The 'goodness' of the classifier is measured by the loss function l(f(T(x_B)), y), where y is the class associated with x_B, and l(·,·) is a measure of distance between f(T(x_B)) and y. The aim is to find a transformation T such that the average loss over all instances in B is minimized. The remainder of this paper is organized as follows: In Section 2, some preliminaries about the techniques used and some approaches to similar problems in the literature are presented. Section 3 contains a description of the proposed algorithm. Section 4 details the real-world biological dataset that motivates this paper. Section 5 includes the experimental setup, along with the results obtained and an analysis. Finally, some concluding remarks are made in Section 6.
2 Preliminaries

This section is divided in the following way: In Section 2.1 we introduce the notation that has been used in this paper. Then we include a brief summary of what has been done in feature extraction in Section 2.2, and a short review of the different approaches we found in the specialized literature on the use of GP for feature extraction in Section 2.3.

2.1 Notation

When describing the problem, datasets A, B and S correspond to:
– A: The original dataset, provided by the first lab, that was used to build the classifier.
– B: The problem dataset, from the second lab. The classifier is not accurate on this dataset, and that is what the proposed algorithm attempts to solve.
– S: The solution dataset, result of applying the evolved transformation to the samples in dataset B. The goal is to have the classifier performance be as high as possible on this dataset.
2.2 Feature Extraction
Feature extraction is one form of pre-processing, which creates new features as functional mappings of the old ones. An early proposer of the term was probably Wyse in 1980 [1], in a paper about intrinsic dimensionality estimation. Multiple techniques have been applied to feature extraction throughout the years, ranging from principal component analysis (PCA) to support vector machines (SVMs) to GAs (see [2,3,4], respectively, for some examples). Among the foundational papers in the literature, Liu's book from 1998 [5] is one of the earlier compilations of the field. A workshop held in 2003 [6] led Guyon & Elisseeff to publish a book with an important treatment of the foundations of feature extraction [7].

2.3 Genetic Programming-Based Feature Extraction
Genetic Programming (GP) has been used extensively to optimize feature extraction and selection tasks. One of the first contributions in this line was the work published by Tackett in 1993 [8], who applied GP to feature discovery and image discrimination tasks. We can consider two main branches in the philosophy of GP-based feature extraction:

1. On one hand, there are proposals that focus only on the feature extraction procedure, of which there are multiple examples: Sherrah et al. [9] presented in 1997 the evolutionary pre-processor (EPrep), which searches for an optimal feature extractor by minimizing the misclassification error over three randomly selected classifiers. Kotani et al.'s work from 1999 [10] determined the optimal polynomial combinations of raw features to pass to a k-nearest neighbor classifier. In 2001, Bot [11] evolved transformed features, one at a time, again for a k-NN classifier, utilizing each new feature only if it improved the overall classification performance. Zhang & Rockett [12], in 2006, used multiobjective GP to learn optimal feature extraction in order to fold the high-dimensional pattern vector into a one-dimensional decision space where the classification would be trivial. Lastly, also in 2006, Guo & Nandi [13] optimized a modified Fisher discriminant using GP, and Zhang & Rockett [14] later extended their work by using a multiobjective approach to prevent tree bloat.

2. On the other hand, some authors have chosen to evolve a full classifier with an embedded feature extraction step. As an example, Harris [15] proposed in 1997 a co-evolutionary strategy involving the simultaneous evolution of the feature extraction procedure along with a classifier. More recently, Smith & Bull [16] developed a hybrid feature construction and selection method using GP together with a GA.

2.4 Finding and Repairing Fractures between Data
Among the proposals to quantify the fracture in the data, we would like to mention the one by Wang et al. [17], where the authors present the idea of
correspondence tracing. They propose an algorithm for discovering changes in classification characteristics, based on the comparison between two rule-based classifiers, one built from each dataset. Yang et al. [18] presented in 2008 the idea of conceptual equivalence as a method for contrast mining, which consists of the discovery of discrepancies between datasets. Lastly, it is important to mention the work by Cieslak and Chawla [19], which presents a statistical framework to analyze changes in data distribution resulting in fractures between the data. The fundamental difference between the mentioned works and ours is that we focus on repairing the fracture by modifying the data, using a general method that works with any kind of data fracture, whereas they propose methods to quantify said fracture that work only under certain conditions.
3 A Proposal for GP-Based Feature Extraction to Homogenize Data from Two Laboratories

The problem we are attempting to solve is the design of a method that can create a transformation from a dataset (dataset B), where a classification model built using the data from a different dataset (dataset A) is not accurate, into a new dataset (dataset S) where the classifier is more accurate. Said classifier is kept unchanged throughout the process. We decided to use GP to solve the problem for a number of reasons:

1. It is well suited to evolve arbitrary expressions because its chromosomes are trees. This is useful in our case because we want to have the maximum possible flexibility in terms of the functional expressions of these transformations.
2. GP provides highly interpretable solutions. This is an advantage because our goal is not only to have a new dataset where the classifier works, but also to analyze what was the problem in the first dataset.

Once GP was chosen, we needed to decide what terminals and operators to use, how to calculate the fitness of an individual, and which evolutionary parameters (population size, number of generations, selection and mutation rates, etc.) are appropriate for the problem at hand.

3.1 Solutions Representation: Context-Free Grammar
The representation of the solutions was achieved by extending GP to evolve more than one tree per solution. Each individual is composed of n trees, where n is the number of attributes present in the dataset. We are trying to develop a new dataset with the same number of attributes as the old one, since this new dataset needs to be fed to the existing model. In the tree structure, the leaves are either constants (we use the Ephemeral Random Constant approach [20]) or attributes from the original dataset. The intermediate nodes are functions from the function set, which is specific to each problem.
The attributes on the transformed dataset are represented by algebraic expressions. These expressions are generated according to the rules of a context-free grammar which allows the absence of some of the functions or terminals. The grammar corresponding to the example problem would look like this:

Start → Tree Tree
Tree → Node
Node → Node Operator Node
Node → Terminal
Operator → + | − | ∗ | ÷
Terminal → x0 | x1 | E
E → realNumber (represented by e)
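A sketch of how random expressions can be grown from this grammar; the terminal probability of 0.3 and the constant range are assumptions for illustration, not values from the paper.

import random

OPERATORS = ['+', '-', '*', '/']

def random_node(n_attrs, depth, max_depth=5):
    """Grow a random expression tree following the grammar above:
    Node -> Terminal | Node Operator Node."""
    if depth >= max_depth or random.random() < 0.3:
        if random.random() < 0.5:
            return 'x%d' % random.randrange(n_attrs)   # attribute terminal
        return random.uniform(-1.0, 1.0)               # ephemeral random constant
    return [random.choice(OPERATORS),
            random_node(n_attrs, depth + 1, max_depth),
            random_node(n_attrs, depth + 1, max_depth)]

def random_individual(n_attrs):
    """One tree per attribute, as required by the multi-tree representation."""
    return [random_node(n_attrs, 0) for _ in range(n_attrs)]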
3.2 Fitness Evaluation
The fitness evaluation procedure is probably the most discussed design aspect in the literature on GP-based feature extraction. As stated before, the idea is to have the provided classifier's performance drive the evolution. To achieve that, our method calculates fitness as the classifier's accuracy over the dataset obtained by applying the transformations encoded in the individual (training-set accuracy).
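A sketch of this fitness computation, assuming an sklearn-style classifier exposing predict() and the nested-list tree representation used in the grammar sketch above:

def evaluate_tree(tree, row):
    """Recursively evaluate an expression tree against one data row."""
    if isinstance(tree, list):                  # [op, left, right]
        op, left, right = tree
        a, b = evaluate_tree(left, row), evaluate_tree(right, row)
        if op == '+': return a + b
        if op == '-': return a - b
        if op == '*': return a * b
        return a / b if b != 0 else 1.0         # protected division (assumption)
    if isinstance(tree, str):                   # attribute terminal 'xi'
        return row[int(tree[1:])]
    return tree                                 # ephemeral constant

def fitness(individual, X_b, y_b, classifier):
    """Fitness = accuracy of the fixed classifier on dataset B after
    applying the transformation encoded by the individual."""
    transformed = [[evaluate_tree(t, row) for t in individual] for row in X_b]
    preds = classifier.predict(transformed)
    return sum(p == y for p, y in zip(preds, y_b)) / len(y_b)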
3.3 Genetic Operators
This section details the choices made for the selection, crossover and mutation operators. Since the objective of this work is not to squeeze the maximum possible performance from GP, but rather to show that it is an appropriate technique for the problem and that it can indeed solve it, we did not pay special attention to these choices and picked the most common ones in the specialized literature.

– Tournament selection without replacement. To perform this selection, s individuals are first randomly picked from the population (where s is the tournament size), while avoiding using any member of the population more than once. The selected individual is then chosen as the one with the best fitness among those picked in the first stage.
– One-point crossover: A subtree from one of the parents is substituted by one from the other parent. This procedure is carried out in the following way:
  1. Randomly select a non-root, non-leaf node on each of the two parents.
  2. The first child is the result of swapping the subtree below the selected node in the father for that of the mother.
  3. The second child is the result of swapping the subtree below the selected node in the mother for that of the father.
– Swap mutation: This is a conservative mutation operator that helps diversify the search within a close neighborhood of a given solution. It consists of exchanging the primitive associated with a node for one that has the same number of arguments.
– Replacement mutation: This is a more aggressive mutation operator that leads to diversification in a larger neighborhood. The procedure to perform this mutation is the following (a sketch of the crossover operator is given below):
  1. Randomly select a non-root, non-leaf node on the tree to mutate.
  2. Create a random tree of depth no more than a fixed maximum depth. In this work, the maximum depth allowed was 5.
  3. Swap the subtree below the selected node for the randomly generated one.
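A simplified sketch of the one-point crossover described above, on trees represented as nested lists [op, left, right]; the traversal helpers are illustrative assumptions, and the variant shown swaps the selected subtrees directly.

import copy, random

def internal_nodes(tree, path=()):
    """Paths to non-root, non-leaf nodes of a nested-list tree."""
    paths = []
    if isinstance(tree, list):
        for i in (1, 2):
            if isinstance(tree[i], list):
                paths.append(path + (i,))
                paths.extend(internal_nodes(tree[i], path + (i,)))
    return paths

def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def set_subtree(tree, path, new):
    for i in path[:-1]:
        tree = tree[i]
    tree[path[-1]] = new

def one_point_crossover(mother, father):
    """Swap a randomly chosen non-root subtree between the two parents."""
    child1, child2 = copy.deepcopy(father), copy.deepcopy(mother)
    p1, p2 = internal_nodes(child1), internal_nodes(child2)
    if not p1 or not p2:            # trees too small to recombine
        return child1, child2
    path1, path2 = random.choice(p1), random.choice(p2)
    s1, s2 = get_subtree(child1, path1), get_subtree(child2, path2)
    set_subtree(child1, path1, copy.deepcopy(s2))
    set_subtree(child2, path2, copy.deepcopy(s1))
    return child1, child2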
3.4 Function Set
The functions to include in the function set are usually dependent on the problem. Since one of our goals is to have an algorithm as universal and robust as possible, where the user does not need to fine-tune any parameters to achieve good performance, we decided not to study the effect of different function set choices. We chose the default functions most authors use in the literature: {+, −, ∗, ÷, exp, cos}.
3.5 Parameters
Table 1 summarizes the parameters used for the experiments.

Table 1. Evolutionary parameters for an nv-dimensional problem

Parameter                                  Value
Number of trees                            nv
Population size                            400 ∗ nv
Duration of the run                        100 generations
Selection operator                         Tournament without replacement
Tournament size                            log2(nv) + 1
Crossover operator                         One-point crossover
Crossover probability                      0.9
Mutation operator                          Replacement & Swap mutations
Replacement mutation probability           0.001
Swap mutation probability                  0.01
Maximum depth of the swapped-in subtree    5
Function set                               {+, −, ∗, ÷, cos, exp}
Terminal set                               {x0, x1, ..., x(nv−1), e}
3.6 Execution Flow
Algorithm 1 contains a summary of the execution flow of the GP procedure, which follows a classical evolutionary scheme. It stops after a user-defined number of generations.
Algorithm 1. Execution flow of the GP method

1. Randomly create the initial population by applying the context-free grammar in Section 3.1.
2. Repeat Ng times (where Ng is the number of generations):
   2.1 Evaluate the current population, using the procedure seen in Section 3.2.
   2.2 Apply selection and crossover to create a new population that will replace the old one.
   2.3 Apply the mutation operators to the new population.
3. Return the best individual ever seen.
4 Case Study: Prostate Cancer Diagnosis
Prostate cancer is the most common non-skin malignancy in the western world. The American Cancer Society estimated 192,280 new cases and 27,360 deaths related to prostate cancer in 2009 [21]. Recognizing the public health implications of this disease, men are actively screened through digital rectal examinations and/or serum prostate specific antigen (PSA) level testing. If these screening tests are suspicious, prostate tissue is extracted, or biopsied, from the patient and examined for structural alterations. Due to imperfect screening technologies and repeated examinations, it is estimated that more than one million people undergo biopsies in the US alone.

4.1 Diagnostic Procedure
Biopsy, followed by manual examination under a microscope, is the primary means to definitively diagnose prostate cancer as well as most internal cancers in the human body. Pathologists are trained to recognize patterns of disease in the architecture of tissue, local structural morphology and alterations in cell size and shape. Specific patterns of specific cell types distinguish cancerous and non-cancerous tissues. Hence, the primary task of the pathologist examining tissue for cancer is to locate foci of the cell of interest and examine them for alterations indicative of disease. A detailed explanation of the procedure is beyond the scope of this paper and can be found elsewhere [22,23,24,25]. Operator fatigue is well documented, and guidelines limit the workload and rate of examination of samples by a single operator (examination speed and throughput). Importantly, inter- and intra-pathologist variation complicates decision making. For this reason, it would be extremely interesting to have an accurate automatic classifier to help reduce the load on the pathologists. This was partially achieved in [24], but some issues remain open.

4.2 The Generalization Problem
Llorà et al. [24] successfully applied a genetics-based approach to the development of a classifier that obtained human-competitive results based on FTIR
data. However, the classifier built from the data obtained from one laboratory proved remarkably inaccurate when applied to classify data from a different hospital. Since the entire experimental procedure was identical, using the same machine, measuring and post-processing, and the exact same lab protocols both for tissue extraction and staining, there was no factor that could explain this discrepancy. What we attempt to do with this work is develop an algorithm that can evolve a transformation over the data from the second laboratory, creating a new dataset where the classifier built from the first lab is as accurate as possible.

4.3 Pre-processing of the Data
The biological data obtained from the laboratories has an enormous size (in the range of 14GB of storage per sample), and parallel computing was needed to achieve better-than-human results. For this reason, feature selection was performed on the dataset obtained by FTIR. It was done by applying an evaluation of pairwise error and incremental increase in classification accuracy for every class, resulting in a subset of 93 attributes. This reduced dataset provided enough information for classifier performance to be rather satisfactory: a simple C4.5 classifier achieved ∼95% accuracy on the data from the first lab, but only ∼80% on the second one. The dataset consists of 789 samples from one laboratory and 665 from the other, with a 60%-40% class distribution; these samples represent 0.01% of the total data available for each data set and were selected applying stratified sampling without replacement. A detailed description of the data pre-processing procedure can be found in [22]. The experiments reported in this paper were performed utilizing the reduced dataset of 93 real attributes and two classes (positive and negative diagnosis), since the associated computational costs make it unfeasible to work with the complete one.
5 Experimental Study
This section is organized in the following way: To begin with, a general description of the experimental procedure and the parameters used is presented in Section 5.1. The results obtained are presented in Section 5.2, a statistical analysis is shown in Section 5.3, and lastly some sample transformations are shown in Section 5.4.

5.1 Experimental Framework
The experimental methodology can be summarized as follows:

1. Consider each of the provided datasets (one from each lab) to be datasets A and B, respectively.
2. From dataset A, build a classifier. We chose C4.5 [26], but any other classifier would work exactly the same, due to the fact that the proposed method uses the learned classifier as a black box.
3. Apply our method to dataset B in order to evolve a transformation that will create a solution dataset S. Use 5-fold cross validation over dataset S, so that training and test set accuracy results can be obtained (see the sketch after this list).
4. Check the performance of the step 2 classifier on dataset S. Ideally, it should be close to the one on dataset A, meaning the proposed method has successfully discovered the hidden transformation and inverted it.
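A high-level sketch of these four steps; scikit-learn's DecisionTreeClassifier stands in for C4.5, and evolve_transformation is a hypothetical handle to the GP method of Section 3 (both are assumptions, not the authors' implementation).

from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def run_experiment(X_a, y_a, X_b, y_b, evolve_transformation):
    """Steps 1-4 above. X_*/y_* are assumed NumPy arrays;
    evolve_transformation is assumed to return a callable T."""
    clf = DecisionTreeClassifier().fit(X_a, y_a)               # step 2
    scores = []
    for train_idx, test_idx in KFold(n_splits=5).split(X_b):   # step 3
        T = evolve_transformation(clf, X_b[train_idx], y_b[train_idx])
        X_s = T(X_b[test_idx])                                 # solution set S
        scores.append(clf.score(X_s, y_b[test_idx]))           # step 4
    return sum(scores) / len(scores)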
5.2 Performance Results
This section presents the results for the Prostate Cancer problem, in terms of classifier accuracy. The results obtained can be seen in Table 2.

Table 2. Classifier performance results

Classifier performance in dataset...
A-training   A-test    B         S-training   S-test
0.95435      0.92015   0.83570   0.95191      0.92866
The performance results are promising. First and foremost, the proposed method was able to find a transformation over the data from the second laboratory that made the classifier work just as well as it did on the data from the first lab, effectively finding the fracture in the data (that is, the difference in data distribution between the data sets provided by the two labs) that prevented the classifier from working accurately.

5.3 Statistical Analysis
To complete the experimental study, we performed a statistical comparison between the classifier performance over datasets A, B and S. In [27,28,29,30], a set of simple, safe and robust non-parametric tests for statistical comparisons of classifiers is recommended. One of them is the Wilcoxon Signed-Ranks Test [31,32], which is the test we selected for the comparison. In order to perform the Wilcoxon test, we used the results from each partition in the 5-fold cross validation procedure. We ran the experiment four times, resulting in 4 ∗ 5 = 20 performance samples to carry out the statistical test. R+ corresponds to the first algorithm in the comparison winning, R− to the second one. We can conclude that our method has proved capable of fully homogenizing the data from both laboratories with regard to classifier performance, both on the training and the test set.
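The test itself reduces to a single library call; this sketch (using SciPy, an assumption about tooling) shows how the 20 paired samples would be compared.

from scipy.stats import wilcoxon

def compare(samples_x, samples_y, alpha=0.05):
    """Two-sided Wilcoxon signed-ranks test over the 20 paired accuracy
    samples (4 runs x 5 folds) of two configurations."""
    stat, p_value = wilcoxon(samples_x, samples_y)
    verdict = 'rejected' if p_value < alpha else 'accepted'
    return stat, p_value, verdict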
Table 3. Wilcoxon signed-ranks test results

Comparison                 R+    R−    p-value     null hypothesis of equality
A-test vs B                210   0     1.91E-007   rejected (A-test outperforms B)
B vs S-test                0     210   1.91E-007   rejected (S-test outperforms B)
A-training vs S-training   126   84    --          accepted
A-test vs S-test           84    126   --          accepted

5.4 Obtained Transformations
Figure 1 contains a sample of some of the evolved expressions for the best individual found by our method. Since the dataset has 93 attributes, the individual was composed of 93 trees, but for space concerns only the attributes relevant to the C4.5 classifier were included here.
Fig. 1. Tree representation of the expressions contained in a solution to the Prostate Cancer problem
6 Concluding Remarks
We have presented a new algorithm that approaches a common real-world problem for which few solutions have been proposed in evolutionary computing: the repair of fractures between data by adjusting the data itself, not the classifiers built from it.
We have developed a solution to the problem by means of a GP-based algorithm that performs feature extraction on the problem dataset, driven by the accuracy of the previously built classifier. We have applied our method to a real-world problem where data from two different laboratories regarding prostate cancer diagnosis was provided, and where the classifier learned from one did not perform well enough on the other. Our algorithm was capable of learning a transformation over the second dataset that made the classifier perform just as well as it did on the first one. The validation results with 5-fold cross validation also support the idea that the algorithm obtains good results and has strong generalization power. We have applied a statistical analysis methodology that supports the claim that the classifier performance obtained on the solution dataset significantly outperforms the one obtained on the problem dataset. Lastly, we have shown the learned transformations. Unfortunately, we have not been able to extract any useful information from them yet.
Acknowledgments

Jose García Moreno-Torres was supported by a scholarship from ‘Obra Social la Caixa’ and is currently supported by a FPU grant from the Ministerio de Educación y Ciencia of the Spanish Government and the KEEL project. Rohit Bhargava would like to acknowledge collaborators over the years, especially Dr. Stephen M. Hewitt and Dr. Ira W. Levin of the National Institutes of Health, for numerous useful discussions and guidance. Funding for this work was provided in part by the University of Illinois Research Board and by the Department of Defense Prostate Cancer Research Program. This work was also funded in part by the National Center for Supercomputing Applications and the University of Illinois, under the auspices of the NCSA/UIUC faculty fellows program.
References

1. Wyse, N., Dubes, R., Jain, A.: A critical evaluation of intrinsic dimensionality algorithms. In: Gelsema, E.S., Kanal, L.N. (eds.) Pattern Recognition in Practice, Amsterdam, pp. 415–425. Morgan Kaufmann Publishers, Inc., San Francisco (1980)
2. Kim, K.A., Oh, S.Y., Choi, H.C.: Facial feature extraction using PCA and wavelet multi-resolution images. In: Sixth IEEE International Conference on Automatic Face and Gesture Recognition, p. 439. IEEE Computer Society, Los Alamitos (2004)
3. Podolak, I.T.: Facial component extraction and face recognition with support vector machines. In: FGR 2002: Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, Washington, DC, USA, p. 83. IEEE Computer Society, Los Alamitos (2002)
4. Pei, M., Goodman, E.D., Punch, W.F.: Pattern discovery from data using genetic algorithms. In: Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery & Data Mining, PAKDD 1997 (1997)
5. Liu, H., Motoda, H.: Feature Extraction, Construction and Selection: A Data Mining Perspective. SECS, vol. 453. Kluwer Academic, Boston (1998)
6. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3, 1157–1182 (2003)
7. Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L. (eds.): Feature Extraction, Foundations and Applications. Springer, Heidelberg (2006)
8. Tackett, W.A.: Genetic programming for feature discovery and image discrimination. In: Proceedings of the 5th International Conference on Genetic Algorithms, pp. 303–311. Morgan Kaufmann Publishers Inc., San Francisco (1993)
9. Sherrah, J.R., Bogner, R.E., Bouzerdoum, A.: The evolutionary pre-processor: Automatic feature extraction for supervised classification using genetic programming. In: Proc. 2nd International Conference on Genetic Programming (GP 1997), pp. 304–312. Morgan Kaufmann, San Francisco (1997)
10. Kotani, M., Ozawa, S., Nakai, M., Akazawa, K.: Emergence of feature extraction function using genetic programming. In: KES, pp. 149–152 (1999)
11. Bot, M.C.J.: Feature extraction for the k-nearest neighbour classifier with genetic programming. In: Miller, J., Tomassini, M., Lanzi, P.L., Ryan, C., Tetamanzi, A.G.B., Langdon, W.B. (eds.) EuroGP 2001. LNCS, vol. 2038, pp. 256–267. Springer, Heidelberg (2001)
12. Zhang, Y., Rockett, P.I.: A generic optimal feature extraction method using multiobjective genetic programming. Technical Report VIE 2006/001, Department of Electronic and Electrical Engineering, University of Sheffield, UK (2006)
13. Guo, H., Nandi, A.K.: Breast cancer diagnosis using genetic programming generated feature. Pattern Recognition 39(5), 980–987 (2006)
14. Zhang, Y., Rockett, P.I.: A generic multi-dimensional feature extraction method using multiobjective genetic programming. Evolutionary Computation 17(1), 89–115 (2009)
15. Harris, C.: An Investigation into the Application of Genetic Programming Techniques to Signal Analysis and Feature Detection. University College London (September 1997)
16. Smith, M.G., Bull, L.: Genetic programming with a genetic algorithm for feature construction and selection. Genetic Programming and Evolvable Machines 6(3), 265–281 (2005)
17. Wang, K., Zhou, S., Fu, C.A., Yu, J.X.: Mining changes of classification by correspondence tracing. In: Proceedings of the 2003 SIAM International Conference on Data Mining, SDM 2003 (2003)
18. Yang, Y., Wu, X., Zhu, X.: Conceptual equivalence for contrast mining in classification learning. Data & Knowledge Engineering 67(3), 413–429 (2008)
19. Cieslak, D.A., Chawla, N.V.: A framework for monitoring classifiers’ performance: when and why failure occurs? Knowledge and Information Systems 18(1), 83–108 (2009)
20. Koza, J.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press, Cambridge (1992)
21. American Cancer Society: How many men get prostate cancer? http://www.cancer.org/docroot/CRI/content/CRI_2_2_1X_How_many_men_get_prostate_cancer_36.asp
22. Fernandez, D.C., Bhargava, R., Hewitt, S.M., Levin, I.W.: Infrared spectroscopic imaging for histopathologic recognition. Nature Biotechnology 23(4), 469–474 (2005)
23. Levin, I.W., Bhargava, R.: Fourier transform infrared vibrational spectroscopic imaging: integrating microscopy and molecular recognition. Annual Review of Physical Chemistry 56, 429–474 (2005)
24. Llorà, X., Reddy, R., Matesic, B., Bhargava, R.: Towards better than human capability in diagnosing prostate cancer using infrared spectroscopic imaging. In: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, GECCO 2007, pp. 2098–2105. ACM, New York (2007)
25. Llorà, X., Priya, A., Bhargava, R.: Observer-invariant histopathology using genetics-based machine learning. Natural Computing: An International Journal 8(1), 101–120 (2009)
26. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco (1993)
27. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006)
28. García, S., Herrera, F.: An extension on ‘statistical comparisons of classifiers over multiple data sets’ for all pairwise comparisons. Journal of Machine Learning Research 9, 2677–2694 (2008)
29. García, S., Fernández, A., Luengo, J., Herrera, F.: A study of statistical techniques and performance measures for genetics-based machine learning: Accuracy and interpretability. Soft Computing 13(10), 959–977 (2009)
30. García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences 180(10), 2044–2064 (2010)
31. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bulletin 1(6), 80–83 (1945)
32. Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures, 4th edn. Chapman & Hall/CRC (2007)
Author Index
Bhargava, Rohit 185
Bull, Larry 87
Butz, Martin V. 47, 57
Casillas, Jorge 21
Énée, Gilles 107
Farooq, Muddassar 127
Franco, María 145
Goldberg, David E. 185
Gorrin, Celso 145
Howard, Gerard David 87
Lanzi, Pier-Luca 1, 70, 87
Llorà, Xavier 185
Loiacono, Daniele 1, 70
Martínez, Ivette 145
Moreno-Torres, Jose G. 185
Orriols-Puig, Albert 21
Péroumalnaïk, Mathias 107
Preen, Richard 166
Stalph, Patrick O. 47, 57
Tanwani, Ajay Kumar 127
Wilson, Stewart W. 38